alpha-meta
alpha-meta OP t1_j72dpto wrote
Reply to comment by _Arsenie_Boca_ in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
Good point, so you mean they incorporate things like beam search, temperature scaling, top-k sampling, and nucleus sampling into the PPO-based RL optimization?
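(For concreteness, here is a minimal sketch of one of the sampling techniques mentioned above, nucleus (top-p) sampling, in plain numpy; the function name and the choice of `p` are illustrative, not from any particular implementation.)

```python
import numpy as np

def nucleus_sample(logits, p=0.9, rng=None):
    """Top-p (nucleus) sampling: keep the smallest set of tokens whose
    cumulative probability exceeds p, renormalize, and sample from it."""
    rng = rng or np.random.default_rng(0)
    # softmax with max-subtraction for numerical stability
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(probs)[::-1]        # token ids sorted by descending prob
    cum = np.cumsum(probs[order])
    cutoff = np.searchsorted(cum, p) + 1   # size of the nucleus
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```

The discrete `rng.choice` step here is exactly the part that has no gradient, which is relevant to the differentiability question discussed elsewhere in this thread.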
alpha-meta OP t1_j6yud7x wrote
Reply to comment by Jean-Porte in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
Ah yes, I see what you mean now, thanks!
alpha-meta OP t1_j6xylk8 wrote
Reply to comment by Jean-Porte in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
Could you help me understand what the far-away rewards represent in this context? Are the steps the generation of individual words, so that "far away" means words that occur early in the text? If so, couldn't a weighting scheme over the cross-entropy loss components be used instead?
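(The weighting scheme the question alludes to could look something like this minimal numpy sketch; the function and the choice of weights are hypothetical, just to make the idea concrete.)

```python
import numpy as np

def weighted_token_nll(log_probs, targets, weights):
    """Negative log-likelihood over a sequence with per-position weights,
    e.g. to up-weight early tokens whose choices constrain the rest.

    log_probs: (seq_len, vocab) array of log-probabilities
    targets:   (seq_len,) array of target token ids
    weights:   (seq_len,) array of position weights
    """
    nll = -log_probs[np.arange(len(targets)), targets]  # per-token NLL
    return float(np.sum(weights * nll) / np.sum(weights))
```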
alpha-meta OP t1_j6x1r2j wrote
Reply to comment by Jean-Porte in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
But isn't this only the case if you train it on the negative log-likelihood loss via next-word prediction, i.e., what they do during pretraining?
If you instead compute the loss from the ranks (obtained by having users rank the documents) rather than from the words as labels, would that still be the case?
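(A rank-based loss of the kind alluded to here is typically a pairwise comparison loss over scalar scores, e.g. a Bradley-Terry style objective; the following numpy sketch is an illustration under that assumption, not the exact loss from any specific paper.)

```python
import numpy as np

def pairwise_ranking_loss(r_preferred, r_rejected):
    """Pairwise comparison loss on scalar scores:
    mean of -log sigmoid(r_w - r_l), which pushes the score of the
    human-preferred completion above the rejected one."""
    diff = r_preferred - r_rejected
    # -log sigmoid(x) == softplus(-x); logaddexp keeps it numerically stable
    return float(np.mean(np.logaddexp(0.0, -diff)))
```

Note that this loss is differentiable in the scores, so it can train a reward model with plain gradient descent; the RL machinery only enters once you optimize the generator against that reward model.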
alpha-meta OP t1_j6wvgbr wrote
Reply to comment by koolaidman123 in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
Thanks for the response! I just double-checked the InstructGPT paper and you were right regarding the rankings -- they are pairwise, and I am not sure why I thought otherwise.
Regarding the updates at the sentence level, that makes sense. That would also be a discrete problem through which you probably can't backpropagate (otherwise, you would be back at the token level).
alpha-meta OP t1_j72dxx7 wrote
Reply to comment by bigabig in [D] Why do LLMs like InstructGPT and LLM use RL to instead of supervised learning to learn from the user-ranked examples? by alpha-meta
I think it's probably the non-differentiable nature of the sampling techniques. If it were only about limited training data, you could also use weakly supervised learning with that reward model instead.
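(The standard way RL gets a gradient despite the non-differentiable sampling step is the score-function, or REINFORCE, estimator: instead of differentiating through the sample, you weight the gradient of the log-probability of the sampled token by the reward. A minimal single-token numpy sketch, with hypothetical names:)

```python
import numpy as np

def reinforce_grad(logits, sampled_token, reward):
    """Score-function (REINFORCE) gradient estimate w.r.t. the logits
    for one sampled token: reward * d/dlogits log p(sampled_token).
    No gradient ever flows through the sampling operation itself."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_logp = -probs                 # d log softmax / d logits, part 1
    grad_logp[sampled_token] += 1.0    # part 2: one-hot of the sampled token
    return reward * grad_logp
```

PPO builds on the same principle but adds importance ratios and clipping for stable updates; this sketch only shows why sampling stops being an obstacle once you phrase the problem as RL.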