
Modern post-training part 1: PPO—the ancestor algorithm


PPO-RLHF

Original Paper (arXiv)


This is part one of four blog articles covering modern post-training for LLMs:

  1. PPO
  2. DPO
  3. GRPO
  4. GDPO

RLHF for LLMs: Why do we need it instead of SFT?

LLMs are trained on Reddit dumps, Wikipedia, YouTube transcripts—data scraped from everywhere. That data contains things we don't want in the response distribution: bias, toxicity, noise. We might want the model to know about these patterns, but not reproduce them.

SFT can't solve this because it's fundamentally imitation—you learn to predict the next token for whatever distribution you train on. After pretraining and instruction tuning, the model contains everything: brilliance and garbage alike. You can reduce this with heavy curation, but RLHF gives you a second stage where you explicitly optimize for the assistant behavior you want—using preferences or verifiers—rather than hoping it emerges from imitation alone.

RLHF exists because assistant quality is often:

  • Not uniquely labeled. For most prompts there isn't a single "correct" response. There are many plausible answers, and what we want is relative preference: helpful vs unhelpful, safe vs unsafe, honest vs hallucinated, concise vs rambling. Preferences are naturally pairwise ("A is better than B"), not one gold target.

  • Hard to specify as supervised targets. You can't easily curate a dataset that perfectly encodes style, refusal behavior, calibration, and truthfulness across the full prompt distribution. Even if you try, you'll miss edge cases and the model will still reflect the statistical artifacts of the dataset.

  • Outcome-based and sometimes non-differentiable. In tool use, coding, and reasoning, what matters is the result: unit tests pass, tool calls parse, constraints are satisfied. That's not a clean "label"; it's a verifier signal. RLHF/RLVR-style training can directly optimize those outcomes.

  • Prone to compounding errors under deployment. SFT trains on "correct context" (teacher forcing). At inference, the model conditions on its own outputs. RL-style training explicitly optimizes behavior under the model's rollout distribution (often with a KL anchor so it doesn't drift wildly).

PPO is old, why should I care?

PPO may be a bit older, but it is the backbone of modern algorithms like GDPO. We cover it to build a grounded understanding of how the later algorithms—DPO, GRPO, GDPO—work. Key terms also show up here that, if you don't know them, make the later algorithms harder to parse.

What problem did PPO solve?

Policy gradients at that time were sample-inefficient and unstable because they could make overly large policy updates during training. Trust region policy optimization (TRPO) fixed that with a trust-region/KL constraint but was too complex (required more than first-order optimization). PPO aims to get TRPO-like stable improvement using only first-order methods by limiting how much the policy can change per update via a clipped probability-ratio objective. This enables multiple epochs of minibatch updates on the same data.

At a high level

Don't try to understand this summary right now. Read it once, then spend time with the detailed algorithm below, then come back.

A policy $\pi_\theta$ (the LLM) generates rollouts for sampled prompts $x$. Each rollout is stored at the token level: for timestep $t$, we log the state $s_t$ (prompt plus the response so far), the sampled token $a_t$, its log-probabilities under the sampling policy and under the reference, and the critic prediction $V_\phi(s_t)$. A reward signal is computed using a reward model (typically one scalar per completion) plus a per-token KL penalty to keep the policy close to the reference $\pi_{\text{ref}}$. Using these rewards and $V_\phi(s_t)$, we compute advantages $\hat{A}_t$ (often via GAE). Then we update the policy with the PPO clipped objective and update the critic with a value regression loss. Classic RLHF involves three learned components: the policy, a value head (critic), and a separate reward model (trained beforehand and held fixed during PPO).

Three learned components: policy, value function, reward model. (Often the policy and value head share a backbone; the RM is usually separate and frozen during PPO.) That's crazy.

PPO-RLHF is heavy

PPO-RLHF is a full RL control system. I think this is because it was one of the earliest applications of RL to LLMs—people grabbed the deep-RL tooling of the time, machinery and all, and imported it wholesale into the LLM post-training stack.


PPO-RLHF algorithm steps

I think the best way to understand PPO-RLHF is to look directly at the algorithm. It can be tough to extract these details from the paper. I recommend reading the paper as it's canon, but the steps below are the distilled version.

Phase 0: Setup

  1. Train SFT policy → define it as reference $\pi_{\text{ref}}$ (frozen).
  2. Train reward model $\text{RM}$ (frozen).
  3. Initialize trainable policy $\pi_\theta$ from the SFT weights.
  4. Initialize value head $V_\phi$.
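
A minimal sketch of what this setup might look like in code, assuming Hugging Face-style models and a plain linear value head. The checkpoint names and the choice of a sequence-classification head for the RM are illustrative assumptions, not prescribed by the paper:

```python
import torch.nn as nn
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

SFT_CHECKPOINT = "my-org/sft-model"      # hypothetical SFT checkpoint
RM_CHECKPOINT = "my-org/reward-model"    # hypothetical reward model, trained beforehand

tokenizer = AutoTokenizer.from_pretrained(SFT_CHECKPOINT)

# Reference policy pi_ref: a frozen copy of the SFT model.
ref_policy = AutoModelForCausalLM.from_pretrained(SFT_CHECKPOINT).eval()
for p in ref_policy.parameters():
    p.requires_grad_(False)

# Trainable policy pi_theta: initialized from the same SFT weights.
policy = AutoModelForCausalLM.from_pretrained(SFT_CHECKPOINT)

# Value head V_phi: one scalar per token position, read off the policy's hidden states.
value_head = nn.Linear(policy.config.hidden_size, 1)

# Reward model: frozen during PPO; one scalar score per (prompt, completion).
reward_model = AutoModelForSequenceClassification.from_pretrained(
    RM_CHECKPOINT, num_labels=1).eval()
for p in reward_model.parameters():
    p.requires_grad_(False)
```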

Repeat for each PPO iteration

Step 1 — Freeze the behavior policy

  1. Snapshot the current policy: define the behavior policy $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$ (frozen for this iteration). The actor is the policy (language model) that defines $\pi_\theta(a_t \mid s_t)$ and samples tokens during training.

Step 2 — Collect rollouts using the frozen snapshot

  1. Sample prompts $x \sim \mathcal{D}$

  2. Generate completion $y \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$, where $\pi_{\theta_{\text{old}}}$ is the actor

  3. For each token step $t$, store:

    • $s_t$, $a_t$
    • $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ (needed for the PPO ratio)
    • $\log \pi_{\text{ref}}(a_t \mid s_t)$ (needed for the KL penalty)
    • Critic prediction $V_\phi(s_t)$
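
Here is a rough sketch of that bookkeeping for a single prompt, assuming the `ref_policy` and `value_head` objects from the setup sketch above and passing the frozen snapshot in as `policy_old`; all names and generation settings are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def collect_rollout(policy_old, ref_policy, value_head, tokenizer, prompt,
                    max_new_tokens=64):
    """Sample one completion from the frozen snapshot and log per-token stats."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs.input_ids.shape[1]

    # Sample a completion from the behavior policy (the frozen snapshot).
    sequence = policy_old.generate(**inputs, do_sample=True,
                                   max_new_tokens=max_new_tokens)

    # Re-run both models over the full sequence to get per-token logprobs.
    out_old = policy_old(sequence, output_hidden_states=True)
    out_ref = ref_policy(sequence)

    # Logprob of each sampled token under each model (positions shifted by one).
    logits_old = out_old.logits[:, :-1, :]
    logits_ref = out_ref.logits[:, :-1, :]
    targets = sequence[:, 1:]
    logp_old = torch.gather(F.log_softmax(logits_old, -1), 2,
                            targets.unsqueeze(-1)).squeeze(-1)
    logp_ref = torch.gather(F.log_softmax(logits_ref, -1), 2,
                            targets.unsqueeze(-1)).squeeze(-1)

    # Critic prediction V(s_t), read off the snapshot's hidden states for simplicity.
    values = value_head(out_old.hidden_states[-1][:, :-1, :]).squeeze(-1)

    # Keep only response positions (mask out the prompt).
    start = prompt_len - 1
    return {
        "sequence": sequence,
        "logp_old": logp_old[:, start:],
        "logp_ref": logp_ref[:, start:],
        "values": values[:, start:],
    }
```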

Step 3 — Compute rewards, returns, advantages (on the collected batch)

  1. Compute the terminal reward-model (RM) score: $r^{\text{RM}} = \text{RM}(x, y)$.

  2. Compute per-token KL shaping (typical): $r^{\text{KL}}_t = -\beta\,\big(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\text{ref}}(a_t \mid s_t)\big)$. This penalizes the policy for drifting from the reference—if the policy assigns higher probability than the reference, the reward gets docked. In practice, the KL term may be computed using the behavior snapshot $\pi_{\theta_{\text{old}}}$ at rollout time and held fixed for the batch, and/or computed using the current $\pi_\theta$ during optimization as an explicit regularizer.

  3. Define the total per-step reward: $r_t = r^{\text{KL}}_t + \mathbb{1}[t = T]\, r^{\text{RM}}$ (the RM score lands on the final token).

  4. Compute returns $R_t = \sum_{k \ge t} \gamma^{\,k-t}\, r_k$ (often $\gamma = 1$).

  5. Compute advantages $\hat{A}_t$ via:

    • simple: $\hat{A}_t = R_t - V_\phi(s_t)$, or
    • GAE.
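
A minimal sketch of Step 3 for one response, using the simple (non-GAE) advantage. The 1-D tensors are the per-token quantities stored in Step 2, and the `beta`/`gamma` values are illustrative defaults:

```python
import torch

def rewards_returns_advantages(logp_old, logp_ref, values, rm_score,
                               beta=0.05, gamma=1.0):
    """Step 3 for one response: per-token rewards, returns, and simple advantages.

    logp_old, logp_ref, values: 1-D tensors of length T (one entry per response token).
    rm_score: scalar reward-model score for the whole completion.
    """
    # KL shaping: dock reward wherever the policy out-weighs the reference.
    rewards = -beta * (logp_old - logp_ref)        # r_t^KL, shape (T,)
    rewards[-1] = rewards[-1] + rm_score           # terminal RM score on the last token

    # Returns R_t: (discounted) sum of future rewards; gamma=1 gives plain suffix sums.
    returns = torch.zeros_like(rewards)
    running = torch.tensor(0.0)
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running

    # Simple advantage: actual return minus the critic's expectation.
    advantages = returns - values
    return rewards, returns, advantages
```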

Step 4 — Optimize $\pi_\theta$ and $V_\phi$ using PPO on this fixed batch

  1. For $K$ epochs over the rollout batch:
  • Compute the ratio: $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$
  • Policy loss (clipped PPO): $L^{\text{CLIP}}(\theta) = -\,\mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]$
  • Value loss: $L^{V}(\phi) = \mathbb{E}_t\!\left[\big(V_\phi(s_t) - R_t\big)^2\right]$
  • (optional) entropy bonus
  • Take optimizer steps on $\theta$ and $\phi$
  • The expectation subscript ($\mathbb{E}_t$) means an average over all tokens in the batch of sampled responses. See Concrete rollout with example below to understand what this means.
  2. End of iteration. Go back to Step 1 (make a new snapshot, collect fresh rollouts).
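
And a sketch of the Step 4 losses, where $\mathbb{E}_t$ becomes a mean over all response tokens in the batch; the clip range and value-loss coefficient are illustrative:

```python
import torch

def ppo_losses(logp_new, logp_old, values_new, returns, advantages,
               clip_eps=0.2, vf_coef=0.5):
    """Step 4 losses over a flat batch of response tokens.

    logp_new, values_new: recomputed under the *current* policy/critic (require grad).
    logp_old, returns, advantages: fixed, taken from the rollout buffer.
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old, from stored logprobs.
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate: take the pessimistic min of the unclipped vs clipped terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value regression: pull the critic toward the empirical returns.
    value_loss = (values_new - returns).pow(2).mean()

    return policy_loss + vf_coef * value_loss, policy_loss, value_loss
```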

Return and advantage, and why they matter.

Return $R_t$ is the total future reward from this point: $R_t = \sum_{k \ge t} \gamma^{\,k-t}\, r_k$.

Value/critic $V_\phi(s_t)$ is the expected return from here. It's viewed as $V_\phi(s_t) \approx \mathbb{E}\!\left[R_t \mid s_t\right]$.

Then, the advantage $\hat{A}_t = R_t - V_\phi(s_t)$ tells you how much better/worse the sampled continuation turned out than the critic expected from that prefix. So it's the difference between the actual return and the expected return.
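
A tiny worked example with made-up numbers:

```python
R_t = 0.9        # actual return observed after sampling this continuation
V_t = 0.7        # critic's expected return from the same prefix
A_t = R_t - V_t  # advantage = +0.2: the continuation beat expectations, so reinforce it
```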

What's the point of the advantage?

Advantage stabilizes learning by removing "baseline" signal from return so the policy updates only on what was better or worse than expected.

Take the following scenarios

1) It removes "always true" goodness from the gradient

Suppose a prompt is just easy and almost any reasonable continuation gets high RM reward. Then the return $R_t$ will be large for all tokens—regardless of which token you chose.

If you push on $R_t$ directly, you'd increase the probability of everything that happened just because the prompt was easy.

With advantage, if the critic has learned "this is usually easy," then $V_\phi(s_t)$ is also high, so $\hat{A}_t \approx 0$ and the gradient doesn't go crazy. You only reinforce tokens that were better than what you'd already expect.

2) It makes the update local and fair

Advantage says: "credit assignment should go to the particular decisions that made things better than baseline, not to the whole trajectory indiscriminately."

In LLM terms: you don't want the model to over-learn generic tokens ("the", "I", "and") just because the overall completion got a good RM score.

Finally

Subtracting a baseline that depends only on $s_t$ doesn't change the expected policy gradient, but it can massively reduce variance—so $V_\phi(s_t)$ is a learned variance-reduction baseline.
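
For the skeptical reader, here's the standard one-line argument (not specific to PPO) for why a baseline that depends only on $s_t$ can't bias the gradient:

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s_t)}\!\big[\nabla_\theta \log \pi_\theta(a \mid s_t)\, b(s_t)\big]
= b(s_t) \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t)
= b(s_t) \sum_{a} \nabla_\theta\, \pi_\theta(a \mid s_t)
= b(s_t)\, \nabla_\theta \underbrace{\textstyle\sum_{a} \pi_\theta(a \mid s_t)}_{=\,1}
= 0.
$$

The probabilities always sum to 1, and the gradient of a constant is zero—so subtracting $b(s_t) = V_\phi(s_t)$ only recenters the per-sample signal.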


What "clip" does.

Vanilla policy gradient would do

$$\max_\theta \; \mathbb{E}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right],$$

where $\hat{A}_t = R_t - V_\phi(s_t)$.

That is, we find the $\theta$ that maximizes log-probability weighted by the advantage—a multiplier that says whether that token actually yielded a positive return over the baseline.

If the sampled token has a positive advantage ($\hat{A}_t > 0$), increase its probability: push $\pi_\theta(a_t \mid s_t)$ up. If it had a negative advantage ($\hat{A}_t < 0$), push it down.

The issue is that nothing stops the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t)\,/\,\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ from becoming huge (or tiny), which makes updates unstable and can collapse behavior. For instance, one failure mode of vanilla policy gradient: a batch contains an outlier response with a large advantage, its gradient compounds over repeated epochs on that batch, and $\pi_\theta$ ends up straying too far from $\pi_{\theta_{\text{old}}}$.

So PPO clips the probability ratio to stay close to 1, and takes the min of the unclipped and clipped terms (each multiplied by the advantage).

  • unclipped: $r_t(\theta)\,\hat{A}_t$
  • clipped: $\operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t$

And now, here is the whole improvement PPO makes over vanilla policy gradient. We maximize the expectation of the min of the clipped and unclipped terms:

$$\max_\theta \; \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right].$$

The $\min$ makes it a pessimistic bound on improvement, meaning it refuses to credit you for improvements that would require changing the policy too much.
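
To see the pessimism concretely, here is a tiny numeric check with a clip range of $\epsilon = 0.2$ and made-up ratios/advantages:

```python
def clipped_objective(ratio, advantage, eps=0.2):
    """One term of the PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# A > 0: once the ratio exceeds 1+eps, extra probability mass earns no extra credit.
print(clipped_objective(ratio=1.5, advantage=+1.0))   # 1.2, not 1.5
# A < 0 and the ratio already dropped past 1-eps: the term flattens, no push further down.
print(clipped_objective(ratio=0.5, advantage=-1.0))   # -0.8
# A < 0 but the ratio *increased*: the unclipped, more pessimistic term is kept.
print(clipped_objective(ratio=1.5, advantage=-1.0))   # -1.5, not -1.2
```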


Concrete rollout with example

Prompt: "How do I cheer up my friend?"

One rollout (response): "tell a funny joke"

Assume word-tokens: $a_1$ = "tell", $a_2$ = "a", $a_3$ = "funny", $a_4$ = "joke".

The RL "state" and "action" at each step:

  • t = 1

    • $s_1$ = (prompt + empty response prefix), action $a_1$ = "tell"
  • t = 2

    • $s_2$ = (prompt + "tell"), action $a_2$ = "a"
  • t = 3

    • $s_3$ = (prompt + "tell a"), action $a_3$ = "funny"
  • t = 4

    • $s_4$ = (prompt + "tell a funny"), action $a_4$ = "joke"

What exactly gets stored per token step

For each timestep $t$ in this rollout batch, you store:

  • the token $a_t$ (e.g., "tell")
  • the logprob under the policy snapshot that generated it: $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$
  • the logprob under the reference: $\log \pi_{\text{ref}}(a_t \mid s_t)$
  • (prompt, padding, and EOS tokens are masked out when calculating log probs)
  • the value prediction: $V_\phi(s_t)$

That's it. Think of it like a dataset of rows: $\big(s_t,\ a_t,\ \log \pi_{\theta_{\text{old}}}(a_t \mid s_t),\ \log \pi_{\text{ref}}(a_t \mid s_t),\ V_\phi(s_t)\big)$, for many tokens across many sampled completions.
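
Concretely, the stored batch for the toy rollout above might look like this (the numbers are made up):

```python
rollout_rows = [
    # one row per response token: (state = prompt + prefix, action = next token, stats)
    {"t": 1, "token": "tell",  "logp_old": -2.1, "logp_ref": -2.3, "value": 0.20},
    {"t": 2, "token": "a",     "logp_old": -0.7, "logp_ref": -0.9, "value": 0.30},
    {"t": 3, "token": "funny", "logp_old": -1.3, "logp_ref": -1.0, "value": 0.40},
    {"t": 4, "token": "joke",  "logp_old": -0.4, "logp_ref": -0.5, "value": 0.50},
]
```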

Why store both logprobs?

  • $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is used for PPO's ratio when you later update the policy: $r_t(\theta) = \pi_\theta(a_t \mid s_t)\,/\,\pi_{\theta_{\text{old}}}(a_t \mid s_t)$.
  • $\log \pi_{\text{ref}}(a_t \mid s_t)$ is used for the KL penalty shaping (keeps you near the SFT/reference policy).
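
In code, the two stored logprobs feed two different places; a small sketch with made-up per-token values (variable names follow the earlier sketches, `beta` is illustrative):

```python
import torch

# Stored at rollout time (fixed for the whole PPO iteration):
logp_old = torch.tensor([-2.1, -0.7, -1.3, -0.4])   # log pi_theta_old(a_t | s_t)
logp_ref = torch.tensor([-2.3, -0.9, -1.0, -0.5])   # log pi_ref(a_t | s_t)
# Recomputed on every optimizer step under the current policy:
logp_new = torch.tensor([-2.0, -0.6, -1.4, -0.4])   # log pi_theta(a_t | s_t)

ratio = torch.exp(logp_new - logp_old)         # feeds the clipped policy loss
kl_penalty = -0.05 * (logp_old - logp_ref)     # feeds the per-token reward (beta = 0.05)
```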

Again—PPO is heavy

It runs three learning problems at once:

  1. Reward model learns what humans like (or you replace it with a verifier).
  2. Critic/value head learns to predict returns to reduce variance.
  3. Policy learns to maximize reward while staying close to reference.

Plus a data collection loop (on-policy rollouts), which makes everything coupled and expensive.

Why modern methods are winning

  • DPO deletes the RL loop and deletes the critic.
  • GRPO keeps RL-style improvement but deletes the critic (uses group-relative baselines).
  • GDPO fixes the multi-reward training instability problem.

Conclusion

PPO-RLHF works. It's battle-tested and it's what got us the first wave of aligned assistants. But it's expensive: three models, on-policy rollouts, and a critic that exists purely to reduce variance. Every piece adds compute, memory, and complexity.

The algorithms that came after—DPO, GRPO, GDPO—are all asking the same question: what can we delete while keeping the alignment signal? DPO answers "the RL loop and the critic." GRPO answers "the critic, but keep the RL loop." GDPO answers "nothing, but fix how we handle multiple rewards."

Understanding PPO means you now know what's being removed and why. Next up: DPO, which collapses the whole thing into a supervised loss.