
Modern post-training part 1: PPO—the ancestor algorithm


PPO-RLHF

Original Paper (arXiv)


This is part one of four blog articles covering modern post-training for LLMs:

  1. PPO
  2. DPO
  3. GRPO
  4. GDPO

RLHF for LLMs: Why do we need it instead of SFT?

LLMs are trained on Reddit dumps, Wikipedia, YouTube transcripts—data scraped from everywhere. That data contains things we don't want in the response distribution: bias, toxicity, noise. We might want the model to know about these patterns, but not reproduce them.

SFT can't solve this because it's fundamentally imitation—you learn to predict the next token for whatever distribution you train on. After pretraining and instruction tuning, the model contains everything: brilliance and garbage alike. You can reduce this with heavy curation, but RLHF gives you a second stage where you explicitly optimize for the assistant behavior you want—using preferences or verifiers—rather than hoping it emerges from imitation alone.

RLHF exists because assistant quality is often:

  • Not uniquely labeled. For most prompts there isn't a single "correct" response. There are many plausible answers, and what we want is relative preference: helpful vs unhelpful, safe vs unsafe, honest vs hallucinated, concise vs rambling. Preferences are naturally pairwise ("A is better than B"), not one gold target.

  • Hard to specify as supervised targets. You can't easily curate a dataset that perfectly encodes style, refusal behavior, calibration, and truthfulness across the full prompt distribution. Even if you try, you'll miss edge cases and the model will still reflect the statistical artifacts of the dataset.

  • Outcome-based and sometimes non-differentiable. In tool use, coding, and reasoning, what matters is the result: unit tests pass, tool calls parse, constraints are satisfied. That's not a clean "label"; it's a verifier signal. RLHF/RLVR-style training can directly optimize those outcomes.

  • Prone to compounding errors under deployment. SFT trains on "correct context" (teacher forcing). At inference, the model conditions on its own outputs. RL-style training explicitly optimizes behavior under the model's rollout distribution (often with a KL anchor so it doesn't drift wildly).

PPO is old, why should I care?

PPO may be a bit older, but it is the backbone of modern algorithms like GDPO. We cover it to build a grounded understanding of how the later algorithms—DPO, GRPO, GDPO—work. Key terms also show up here that, if you don't know them, make the later algorithms harder to parse.

What problem did PPO solve?

Policy gradients at that time were sample-inefficient and unstable because they could make overly large policy updates during training. Trust region policy optimization (TRPO) fixed that with a trust-region/KL constraint but was too complex (required more than first-order optimization). PPO aims to get TRPO-like stable improvement using only first-order methods by limiting how much the policy can change per update via a clipped probability-ratio objective. This enables multiple epochs of minibatch updates on the same data.

At a high level

Don't try to understand this summary right now. Read it once, then spend time with the detailed algorithm below, then come back.

A policy $\pi_\theta$ (the LLM) generates rollouts for sampled prompts $x$. Each rollout is stored at the token level: for timestep $t$, we log the state $s_t$ (prompt plus the response so far), the sampled token $a_t$, its log-probabilities under the sampling policy and under the reference, and the critic prediction $V_\phi(s_t)$. A reward signal is computed using a reward model (typically one scalar per completion) plus a per-token KL penalty to keep the policy close to the reference $\pi_{\text{ref}}$. Using these rewards and $V_\phi(s_t)$, we compute advantages $\hat{A}_t$ (often via GAE). Then we update the policy with the PPO clipped objective and update the critic with a value regression loss. Classic RLHF involves three learned components: the policy, a value head (critic), and a separate reward model (trained beforehand and held fixed during PPO).

Three learned components: policy, value function, reward model. (Often the policy and value head share a backbone; the RM is usually separate and frozen during PPO.) That's crazy.

PPO-RLHF is heavy

PPO-RLHF is a full RL control system. I think this is because it was one of the earliest applications of RL to LLMs—people grabbed the deep-RL tooling of the time, machinery and all, and imported it wholesale into the LLM post-training stack.


PPO-RLHF algorithm steps

I think the best way to understand PPO-RLHF is to look directly at the algorithm. It can be tough to extract these details from the paper. I recommend reading the paper as it's canon, but the steps below are the distilled version.

Phase 0: Setup

  1. Train SFT policy → define it as reference $\pi_{\text{ref}}$ (frozen).
  2. Train reward model $\text{RM}$ (frozen).
  3. Initialize trainable policy $\pi_\theta$ from the SFT weights.
  4. Initialize value head $V_\phi$.
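
A minimal sketch of what this setup might look like in code, assuming Hugging Face-style models and a plain linear value head. The checkpoint names and the choice of a sequence-classification head for the RM are illustrative assumptions, not prescribed by the paper:

```python
import torch.nn as nn
from transformers import (AutoModelForCausalLM,
                          AutoModelForSequenceClassification, AutoTokenizer)

SFT_CHECKPOINT = "my-org/sft-model"      # hypothetical SFT checkpoint
RM_CHECKPOINT = "my-org/reward-model"    # hypothetical reward model, trained beforehand

tokenizer = AutoTokenizer.from_pretrained(SFT_CHECKPOINT)

# Reference policy pi_ref: a frozen copy of the SFT model.
ref_policy = AutoModelForCausalLM.from_pretrained(SFT_CHECKPOINT).eval()
for p in ref_policy.parameters():
    p.requires_grad_(False)

# Trainable policy pi_theta: initialized from the same SFT weights.
policy = AutoModelForCausalLM.from_pretrained(SFT_CHECKPOINT)

# Value head V_phi: one scalar per token position, read off the policy's hidden states.
value_head = nn.Linear(policy.config.hidden_size, 1)

# Reward model: frozen during PPO; one scalar score per (prompt, completion).
reward_model = AutoModelForSequenceClassification.from_pretrained(
    RM_CHECKPOINT, num_labels=1).eval()
for p in reward_model.parameters():
    p.requires_grad_(False)
```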

Repeat for each PPO iteration

Step 1 — Freeze the behavior policy

  1. Snapshot the current policy: define the behavior policy $\pi_{\theta_{\text{old}}} \leftarrow \pi_\theta$ (frozen for this iteration). The actor is the policy (language model) that defines $\pi_\theta(a_t \mid s_t)$ and samples tokens during training.

Step 2 — Collect rollouts using the frozen snapshot

  1. Sample prompts $x \sim \mathcal{D}$

  2. Generate completion $y \sim \pi_{\theta_{\text{old}}}(\cdot \mid x)$, where $\pi_{\theta_{\text{old}}}$ is the actor

  3. For each token step $t$, store:

    • $s_t$, $a_t$
    • $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ (needed for the PPO ratio)
    • $\log \pi_{\text{ref}}(a_t \mid s_t)$ (needed for the KL penalty)
    • Critic prediction $V_\phi(s_t)$
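
Here is a rough sketch of that bookkeeping for a single prompt, assuming the `ref_policy` and `value_head` objects from the setup sketch above and passing the frozen snapshot in as `policy_old`; all names and generation settings are illustrative:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def collect_rollout(policy_old, ref_policy, value_head, tokenizer, prompt,
                    max_new_tokens=64):
    """Sample one completion from the frozen snapshot and log per-token stats."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs.input_ids.shape[1]

    # Sample a completion from the behavior policy (the frozen snapshot).
    sequence = policy_old.generate(**inputs, do_sample=True,
                                   max_new_tokens=max_new_tokens)

    # Re-run both models over the full sequence to get per-token logprobs.
    out_old = policy_old(sequence, output_hidden_states=True)
    out_ref = ref_policy(sequence)

    # Logprob of each sampled token under each model (positions shifted by one).
    logits_old = out_old.logits[:, :-1, :]
    logits_ref = out_ref.logits[:, :-1, :]
    targets = sequence[:, 1:]
    logp_old = torch.gather(F.log_softmax(logits_old, -1), 2,
                            targets.unsqueeze(-1)).squeeze(-1)
    logp_ref = torch.gather(F.log_softmax(logits_ref, -1), 2,
                            targets.unsqueeze(-1)).squeeze(-1)

    # Critic prediction V(s_t), read off the snapshot's hidden states for simplicity.
    values = value_head(out_old.hidden_states[-1][:, :-1, :]).squeeze(-1)

    # Keep only response positions (mask out the prompt).
    start = prompt_len - 1
    return {
        "sequence": sequence,
        "logp_old": logp_old[:, start:],
        "logp_ref": logp_ref[:, start:],
        "values": values[:, start:],
    }
```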

Step 3 — Compute rewards, returns, advantages (on the collected batch)

  1. Compute the terminal reward-model (RM) score: $r^{\text{RM}} = \text{RM}(x, y)$.

  2. Compute per-token KL shaping (typical): $r^{\text{KL}}_t = -\beta\,\big(\log \pi_\theta(a_t \mid s_t) - \log \pi_{\text{ref}}(a_t \mid s_t)\big)$. This penalizes the policy for drifting from the reference—if the policy assigns higher probability than the reference, the reward gets docked. In practice, the KL term may be computed using the behavior snapshot $\pi_{\theta_{\text{old}}}$ at rollout time and held fixed for the batch, and/or computed using the current $\pi_\theta$ during optimization as an explicit regularizer.

  3. Define the total per-step reward: $r_t = r^{\text{KL}}_t + \mathbb{1}[t = T]\, r^{\text{RM}}$ (the RM score lands on the final token).

  4. Compute returns $R_t = \sum_{k \ge t} \gamma^{\,k-t}\, r_k$ (often $\gamma = 1$).

  5. Compute advantages $\hat{A}_t$ via:

    • simple: $\hat{A}_t = R_t - V_\phi(s_t)$, or
    • GAE.
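
A minimal sketch of Step 3 for one response, using the simple (non-GAE) advantage. The 1-D tensors are the per-token quantities stored in Step 2, and the `beta`/`gamma` values are illustrative defaults:

```python
import torch

def rewards_returns_advantages(logp_old, logp_ref, values, rm_score,
                               beta=0.05, gamma=1.0):
    """Step 3 for one response: per-token rewards, returns, and simple advantages.

    logp_old, logp_ref, values: 1-D tensors of length T (one entry per response token).
    rm_score: scalar reward-model score for the whole completion.
    """
    # KL shaping: dock reward wherever the policy out-weighs the reference.
    rewards = -beta * (logp_old - logp_ref)        # r_t^KL, shape (T,)
    rewards[-1] = rewards[-1] + rm_score           # terminal RM score on the last token

    # Returns R_t: (discounted) sum of future rewards; gamma=1 gives plain suffix sums.
    returns = torch.zeros_like(rewards)
    running = torch.tensor(0.0)
    for t in reversed(range(rewards.shape[0])):
        running = rewards[t] + gamma * running
        returns[t] = running

    # Simple advantage: actual return minus the critic's expectation.
    advantages = returns - values
    return rewards, returns, advantages
```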

Step 4 — Optimize $\pi_\theta$ and $V_\phi$ using PPO on this fixed batch

  1. For $K$ epochs over the rollout batch:
  • Compute the ratio: $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$
  • Policy loss (clipped PPO): $L^{\text{CLIP}}(\theta) = -\,\mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]$
  • Value loss: $L^{V}(\phi) = \mathbb{E}_t\!\left[\big(V_\phi(s_t) - R_t\big)^2\right]$
  • (optional) entropy bonus
  • Take optimizer steps on $\theta$ and $\phi$
  • The expectation subscript ($\mathbb{E}_t$) means an average over all tokens in the batch of sampled responses. See Concrete rollout with example below to understand what this means.
  2. End of iteration. Go back to Step 1 (make a new snapshot, collect fresh rollouts).
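
And a sketch of the Step 4 losses, where $\mathbb{E}_t$ becomes a mean over all response tokens in the batch; the clip range and value-loss coefficient are illustrative:

```python
import torch

def ppo_losses(logp_new, logp_old, values_new, returns, advantages,
               clip_eps=0.2, vf_coef=0.5):
    """Step 4 losses over a flat batch of response tokens.

    logp_new, values_new: recomputed under the *current* policy/critic (require grad).
    logp_old, returns, advantages: fixed, taken from the rollout buffer.
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old, from stored logprobs.
    ratio = torch.exp(logp_new - logp_old)

    # Clipped surrogate: take the pessimistic min of the unclipped vs clipped terms.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()

    # Value regression: pull the critic toward the empirical returns.
    value_loss = (values_new - returns).pow(2).mean()

    return policy_loss + vf_coef * value_loss, policy_loss, value_loss
```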

Return and advantage, and why they matter.

Return $R_t$ is the total future reward from this point: $R_t = \sum_{k \ge t} \gamma^{\,k-t}\, r_k$.

Value/critic $V_\phi(s_t)$ is the expected return from here. It's viewed as $V_\phi(s_t) \approx \mathbb{E}\!\left[R_t \mid s_t\right]$.

Then, the advantage $\hat{A}_t = R_t - V_\phi(s_t)$ tells you how much better/worse the sampled continuation turned out than the critic expected from that prefix. So it's the difference between the actual return and the expected return.
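
A tiny worked example with made-up numbers:

```python
R_t = 0.9        # actual return observed after sampling this continuation
V_t = 0.7        # critic's expected return from the same prefix
A_t = R_t - V_t  # advantage = +0.2: the continuation beat expectations, so reinforce it
```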

What's the point of the advantage?

Advantage stabilizes learning by removing "baseline" signal from return so the policy updates only on what was better or worse than expected.

Take the following scenarios

1) It removes "always true" goodness from the gradient

Suppose a prompt is just easy and almost any reasonable continuation gets high RM reward. Then the return $R_t$ will be large for all tokens—regardless of which token you chose.

If you push on $R_t$ directly, you'd increase the probability of everything that happened just because the prompt was easy.

With advantage, if the critic has learned "this is usually easy," then $V_\phi(s_t)$ is also high, so $\hat{A}_t \approx 0$ and the gradient doesn't go crazy. You only reinforce tokens that were better than what you'd already expect.

2) It makes the update local and fair

Advantage says: "credit assignment should go to the particular decisions that made things better than baseline, not to the whole trajectory indiscriminately."

In LLM terms: you don't want the model to over-learn generic tokens ("the", "I", "and") just because the overall completion got a good RM score.

Finally

Subtracting a baseline that depends only on $s_t$ doesn't change the expected policy gradient, but it can massively reduce variance—so $V_\phi(s_t)$ is a learned variance-reduction baseline.
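
For the skeptical reader, here's the standard one-line argument (not specific to PPO) for why a baseline that depends only on $s_t$ can't bias the gradient:

$$
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s_t)}\!\big[\nabla_\theta \log \pi_\theta(a \mid s_t)\, b(s_t)\big]
= b(s_t) \sum_{a} \pi_\theta(a \mid s_t)\, \nabla_\theta \log \pi_\theta(a \mid s_t)
= b(s_t) \sum_{a} \nabla_\theta\, \pi_\theta(a \mid s_t)
= b(s_t)\, \nabla_\theta \underbrace{\textstyle\sum_{a} \pi_\theta(a \mid s_t)}_{=\,1}
= 0.
$$

The probabilities always sum to 1, and the gradient of a constant is zero—so subtracting $b(s_t) = V_\phi(s_t)$ only recenters the per-sample signal.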


What "clip" does.

Vanilla policy gradient would do

$$\max_\theta \; \mathbb{E}_t\!\left[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\right],$$

where $\hat{A}_t = R_t - V_\phi(s_t)$.

That is, we find the $\theta$ that maximizes log-probability weighted by the advantage—a multiplier that says whether that token actually yielded a positive return over the baseline.

If the sampled token has a positive advantage ($\hat{A}_t > 0$), increase its probability: push $\pi_\theta(a_t \mid s_t)$ up. If it had a negative advantage ($\hat{A}_t < 0$), push it down.

The issue is that nothing stops the probability ratio $r_t(\theta) = \pi_\theta(a_t \mid s_t)\,/\,\pi_{\theta_{\text{old}}}(a_t \mid s_t)$ from becoming huge (or tiny), which makes updates unstable and can collapse behavior. For instance, one failure mode of vanilla policy gradient: a batch contains an outlier response with a large advantage, its gradient compounds over repeated epochs on that batch, and $\pi_\theta$ ends up straying too far from $\pi_{\theta_{\text{old}}}$.

So PPO clips the probability ratio to stay close to 1, and takes the min of the unclipped and clipped terms (each multiplied by the advantage).

  • unclipped: $r_t(\theta)\,\hat{A}_t$
  • clipped: $\operatorname{clip}\!\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t$

And now, here is the whole improvement PPO makes over vanilla policy gradient. We maximize the expectation of the min of the clipped and unclipped terms:

$$\max_\theta \; \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\right].$$

The $\min$ makes it a pessimistic bound on improvement, meaning it refuses to credit you for improvements that would require changing the policy too much.
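
To see the pessimism concretely, here is a tiny numeric check with a clip range of $\epsilon = 0.2$ and made-up ratios/advantages:

```python
def clipped_objective(ratio, advantage, eps=0.2):
    """One term of the PPO objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1 - eps, min(1 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# A > 0: once the ratio exceeds 1+eps, extra probability mass earns no extra credit.
print(clipped_objective(ratio=1.5, advantage=+1.0))   # 1.2, not 1.5
# A < 0 and the ratio already dropped past 1-eps: the term flattens, no push further down.
print(clipped_objective(ratio=0.5, advantage=-1.0))   # -0.8
# A < 0 but the ratio *increased*: the unclipped, more pessimistic term is kept.
print(clipped_objective(ratio=1.5, advantage=-1.0))   # -1.5, not -1.2
```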


Concrete rollout with example

Prompt: "How do I cheer up my friend?"

One rollout (response): "tell a funny joke"

Assume word-tokens: $a_1$ = "tell", $a_2$ = "a", $a_3$ = "funny", $a_4$ = "joke".

The RL "state" and "action" at each step:

  • t = 1

    • $s_1$ = (prompt + empty response prefix), action $a_1$ = "tell"
  • t = 2

    • $s_2$ = (prompt + "tell"), action $a_2$ = "a"
  • t = 3

    • $s_3$ = (prompt + "tell a"), action $a_3$ = "funny"
  • t = 4

    • $s_4$ = (prompt + "tell a funny"), action $a_4$ = "joke"

What exactly gets stored per token step

For each timestep $t$ in this rollout batch, you store:

  • the token $a_t$ (e.g., "tell")
  • the logprob under the policy snapshot that generated it: $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$
  • the logprob under the reference: $\log \pi_{\text{ref}}(a_t \mid s_t)$
  • (prompt, padding, and EOS tokens are masked out when calculating log probs)
  • the value prediction: $V_\phi(s_t)$

That's it. Think of it like a dataset of rows: $\big(s_t,\ a_t,\ \log \pi_{\theta_{\text{old}}}(a_t \mid s_t),\ \log \pi_{\text{ref}}(a_t \mid s_t),\ V_\phi(s_t)\big)$, for many tokens across many sampled completions.
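
Concretely, the stored batch for the toy rollout above might look like this (the numbers are made up):

```python
rollout_rows = [
    # one row per response token: (state = prompt + prefix, action = next token, stats)
    {"t": 1, "token": "tell",  "logp_old": -2.1, "logp_ref": -2.3, "value": 0.20},
    {"t": 2, "token": "a",     "logp_old": -0.7, "logp_ref": -0.9, "value": 0.30},
    {"t": 3, "token": "funny", "logp_old": -1.3, "logp_ref": -1.0, "value": 0.40},
    {"t": 4, "token": "joke",  "logp_old": -0.4, "logp_ref": -0.5, "value": 0.50},
]
```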

Why store both logprobs?

  • $\log \pi_{\theta_{\text{old}}}(a_t \mid s_t)$ is used for PPO's ratio when you later update the policy: $r_t(\theta) = \pi_\theta(a_t \mid s_t)\,/\,\pi_{\theta_{\text{old}}}(a_t \mid s_t)$.
  • $\log \pi_{\text{ref}}(a_t \mid s_t)$ is used for the KL penalty shaping (keeps you near the SFT/reference policy).
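
In code, the two stored logprobs feed two different places; a small sketch with made-up per-token values (variable names follow the earlier sketches, `beta` is illustrative):

```python
import torch

# Stored at rollout time (fixed for the whole PPO iteration):
logp_old = torch.tensor([-2.1, -0.7, -1.3, -0.4])   # log pi_theta_old(a_t | s_t)
logp_ref = torch.tensor([-2.3, -0.9, -1.0, -0.5])   # log pi_ref(a_t | s_t)
# Recomputed on every optimizer step under the current policy:
logp_new = torch.tensor([-2.0, -0.6, -1.4, -0.4])   # log pi_theta(a_t | s_t)

ratio = torch.exp(logp_new - logp_old)         # feeds the clipped policy loss
kl_penalty = -0.05 * (logp_old - logp_ref)     # feeds the per-token reward (beta = 0.05)
```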

Again—PPO is heavy

It runs three learning problems at once:

  1. Reward model learns what humans like (or you replace it with a verifier).
  2. Critic/value head learns to predict returns to reduce variance.
  3. Policy learns to maximize reward while staying close to reference.

Plus a data collection loop (on-policy rollouts), which makes everything coupled and expensive.

Why modern methods are winning

  • DPO deletes the RL loop and deletes the critic.
  • GRPO keeps RL-style improvement but deletes the critic (uses group-relative baselines).
  • GDPO fixes the multi-reward training instability problem.

Conclusion

PPO-RLHF works. It's battle-tested and it's what got us the first wave of aligned assistants. But it's expensive: three models, on-policy rollouts, and a critic that exists purely to reduce variance. Every piece adds compute, memory, and complexity.

The algorithms that came after—DPO, GRPO, GDPO—are all asking the same question: what can we delete while keeping the alignment signal? DPO answers "the RL loop and the critic." GRPO answers "the critic, but keep the RL loop." GDPO answers "nothing, but fix how we handle multiple rewards."

Understanding PPO means you now know what's being removed and why. Next up: DPO, which collapses the whole thing into a supervised loss.