Here is the problem that RLHF addresses: a language model trained only to predict the next word on a huge pile of internet text learns the statistics of language, but nothing tells it that a helpful, honest, on-topic answer is preferable to a plausible-sounding but useless one. Next-word prediction is not the same objective as "do what the user asked." Reinforcement learning from human feedback (RLHF) is the step that bridges that gap. It works by no longer relying on the text corpus alone and instead training the model on human judgments about which of its own outputs are better.
The clearest public reference is the 2022 paper "Training language models to follow instructions with human feedback" by Long Ouyang and colleagues, which produced the InstructGPT models. It opens by naming the problem RLHF exists to solve.
"Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user." Source: Training language models to follow instructions with human feedback (arXiv:2203.02155)
The mechanism the paper describes comes in three stages, and it is worth walking through each because the phrase "reinforcement learning from human feedback" hides all of them. Stage one is supervised fine-tuning: human labelers write demonstrations of the desired behavior (good answers to prompts), and the base model (in the paper, GPT-3) is fine-tuned on that set with ordinary supervised learning. This gives the model a first approximation of the response style that is wanted.
Stage two is where human preference becomes a learnable signal. The model generates several candidate outputs for a prompt, and labelers rank them from best to worst. Those rankings are used to train a separate model, the reward model, whose only job is to predict how a human would score a given output. In effect, the messy, subjective notion of "a better answer" is distilled into a function the machine can compute.
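Stage two can be made concrete with a small sketch. The paper trains the reward model on pairwise comparisons derived from each ranking, with a loss of the form -log sigmoid(r_preferred - r_rejected). The snippet below only illustrates that idea: the function names, the answer labels, and the scores are hypothetical, and a real reward model would be a neural network producing the scores rather than a hand-written dictionary.

```python
import math
from itertools import combinations


def ranking_to_pairs(ranked_outputs):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    # combinations() keeps input order, so the first item of every pair
    # is always ranked above the second.
    return list(combinations(ranked_outputs, 2))


def pairwise_loss(reward_preferred, reward_rejected):
    """-log(sigmoid(gap)) for one comparison, written in a stable form."""
    gap = reward_preferred - reward_rejected
    # -log(sigmoid(gap)) == softplus(-gap)
    return max(-gap, 0.0) + math.log1p(math.exp(-abs(gap)))


# Hypothetical reward-model scores for three candidate answers to one prompt.
scores = {"answer_a": 2.0, "answer_b": 0.5, "answer_c": -1.0}
ranking = ["answer_a", "answer_b", "answer_c"]  # labeler's order, best first

pairs = ranking_to_pairs(ranking)
mean_loss = sum(pairwise_loss(scores[w], scores[l]) for w, l in pairs) / len(pairs)
print(f"{len(pairs)} comparisons, mean loss {mean_loss:.4f}")
```

The loss is small when the reward model already scores the preferred answer higher and grows when it gets the order wrong, which is what pushes the model's scores toward the labelers' rankings during training.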
Where the 'reinforcement learning' part comes in
Stage three is the reinforcement-learning step that gives the method its name. With a reward model now standing in for human judgment, the language model is further fine-tuned to produce outputs that the reward model scores highly. In the paper this optimization against the learned reward uses proximal policy optimization (PPO). The model is no longer just imitating demonstrations; it is being pushed toward whatever earns approval from the proxy for human preference. The paper's own summary of the pipeline is compact: "We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT."
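One detail of stage three is easier to see in code. In the InstructGPT paper, the reward being maximized is not the raw reward-model score: a KL penalty is subtracted so the fine-tuned model does not drift too far from the supervised model. The sketch below shows only that reward shaping for a single sampled response. The function names, the `beta` value, and the token log-probabilities are illustrative assumptions, and the optimizer update itself is omitted.

```python
def sequence_logprob(token_logprobs):
    """Log-probability of a whole response: the sum over its tokens."""
    return sum(token_logprobs)


def penalized_reward(reward_score, policy_token_logprobs,
                     reference_token_logprobs, beta=0.1):
    """Reward-model score minus a KL-style penalty against a reference model.

    beta is an illustrative coefficient, not a value taken from the paper.
    """
    log_ratio = (sequence_logprob(policy_token_logprobs)
                 - sequence_logprob(reference_token_logprobs))
    return reward_score - beta * log_ratio


# Hypothetical numbers for one sampled response of three tokens.
reward_score = 1.8                    # output of the reward model
policy_logprobs = [-0.2, -0.1, -0.3]  # model being fine-tuned
reference_logprobs = [-0.9, -0.7, -1.1]  # supervised (stage one) model

shaped = penalized_reward(reward_score, policy_logprobs, reference_logprobs)
print(f"raw reward {reward_score:.2f}, shaped reward {shaped:.2f}")
```

When the fine-tuned model assigns a response much higher probability than the supervised model would, the log-ratio is positive and the shaped reward drops, which discourages the policy from moving far from its starting point just to please the reward model.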
It helps to see why a separate reward model is needed at all, rather than just having humans score outputs forever. Human labeling is slow and expensive, and reinforcement learning needs a reward signal for vast numbers of generated outputs — far more than people could rate by hand. The reward model solves this by learning, from a finite set of human rankings, to predict the human judgment for any new output, so the reinforcement-learning loop can query it cheaply and continuously. That is the quiet engineering insight in the recipe: the human effort is concentrated into building demonstrations and rankings once, and the reward model then amortizes that effort across the whole optimization. The trade-off is that the model is now chasing a learned approximation of human preference rather than the real thing — which is why the field watches for the system gaming the reward model in ways a human would not actually endorse.
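A toy example shows what "gaming the reward model" can look like. Everything here is hypothetical: the `Candidate` fields, the human scores, and the assumption that the proxy has picked up a bias toward longer answers. It is a sketch of the failure mode, not a description of how any real reward model behaves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    human_score: float  # hypothetical rating a person would give
    word_count: int


def proxy_reward(candidate, length_bias=0.05):
    """Toy reward model: tracks the human score but also favors length."""
    return candidate.human_score + length_bias * candidate.word_count


candidates = [
    Candidate("short, correct answer", human_score=4.0, word_count=20),
    Candidate("padded, partly wrong answer", human_score=2.5, word_count=80),
]

picked_by_proxy = max(candidates, key=proxy_reward)
picked_by_human = max(candidates, key=lambda c: c.human_score)

print("proxy prefers:", picked_by_proxy.text)
print("human prefers:", picked_by_human.text)
```

An optimizer that only sees the proxy would learn to produce the padded answer, because that is what scores highest, even though a person would prefer the short one.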
The headline result is the one most worth internalizing, because it cuts against the "bigger is better" reflex. The authors report that outputs from the 1.3-billion-parameter InstructGPT model were preferred by human evaluators to outputs from the 175-billion-parameter GPT-3, a model more than 100 times larger (about 135 times by parameter count), and that InstructGPT showed improvements in truthfulness and reductions in toxic output. In other words, a small model aligned with human feedback beat a much larger model that was not, on the thing users actually care about: did it do what I asked, helpfully and without harm. Alignment, here, did more than scale.
What RLHF is and isn't
A few boundaries keep the concept precise. RLHF is a fine-tuning technique applied on top of a pretrained model; it shapes behavior, it does not create the underlying language ability, which comes from pretraining on text. The "human feedback" is concentrated in the labeling and ranking stages — a finite set of human judgments that the reward model then generalizes from — so the quality and coverage of that human data bound what the method can do. And the reward model is a learned approximation of human preference, not human preference itself; optimizing too hard against an imperfect proxy is a known failure mode the field works to manage. The paper itself is candid that "InstructGPT still makes simple mistakes."
Strip it down and RLHF is a way of teaching a model an objective that plain text never states: be the kind of answer a person would prefer. It does this by turning human rankings into a reward and then using reinforcement learning to chase that reward. The InstructGPT result, in which a model with over 100 times fewer parameters was preferred over GPT-3, is the durable lesson: for the property of following instructions, how a model is fine-tuned on human feedback can matter more than how many parameters it has. That is why some version of this preference-tuning step now sits between nearly every raw language model and the assistant a user actually talks to.