Modern post-training part 2: Direct Preference Optimization (DPO)
This is part two of four blog articles covering modern post-training for LLMs:
- PPO
- DPO
- GRPO
- GDPO
Reader note
As usual, my deep dives on papers are less about summarizing the paper, and more about me diving into concepts that weren't clear when I read the paper. Papers typically assume audience expertise, leaving much to be desired for someone who's not up to speed. I'll try to fully explain the missing pieces and things that weren't obvious to me or that I got stuck on. "DPO? That's just supervised finetuning on preference pairs. That's easy!" Not so much. It might seem easy on the surface, but there are some subtle yet important details and gotchas.
Introduction & background
If you read my previous blog post on PPO-RLHF you know the following:
- RLHF for LLMs solves the human-alignment problem that the simple mimicry of SFT cannot
- PPO-RLHF is heavy and complex: it involves a lot of machinery.
PPO provides the justification for DPO. You should read that post to get up to speed on concepts and terminology as I will not be repeating things already covered.
What is DPO? (high-level)
DPO is derived from a canonical KL-regularized reinforcement learning (RL) objective, but trained with supervised learning on preference pairs (no rollouts/advantages in the loop).
In a nutshell, DPO is an algorithm to modify the distribution of the LLM post-training for increased alignment with human preferences. It typically does this in a mostly offline and off-policy way: collecting preference pairs of prompt-response rollouts, and using a supervised training objective to train the LLM by maximizing the contrast between the preferred and dispreferred response in each pair.
(See the later section for an explanation of what online/offline and on-policy/off-policy mean.)
Deriving the objective
I like to derive objectives; it's true. But I try to avoid that in my blog posts unless 1) it's the point of the blog post, or 2) it adds something. In this case, it's (2): I want to show you how DPO relies on KL tooling from PPO-RLHF to prevent policy drift.
What is Bradley-Terry?
First, we need to understand Bradley-Terry.
Bradley-Terry is the statistical model that lets us write preferences as a function of score differences, which is why the derivation works. It is the simplest statistical model for preferences:

$$P(i \succ j) = \frac{e^{s_i}}{e^{s_i} + e^{s_j}} = \sigma(s_i - s_j)$$

where $s$ is some scoring function that produces a real value.

In RLHF, we model a rater's preference using a latent reward model $r(x, y)$ as the score:

$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$

Preferences are noisy comparisons explained by a score difference.
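To make the score-difference view concrete, here is a tiny sketch with toy scores I made up (not anything from the derivation):

```python
# Toy Bradley-Terry example with made-up scores: the preference probability
# depends only on the score difference, not on the absolute scores.
import math

def bt_prob(s_i: float, s_j: float) -> float:
    """P(i preferred over j) under Bradley-Terry = sigmoid of the score gap."""
    return 1.0 / (1.0 + math.exp(-(s_i - s_j)))

print(bt_prob(2.0, 2.0))  # 0.5   -> equal scores, coin flip
print(bt_prob(3.0, 1.0))  # ~0.88 -> higher score wins most of the time
print(bt_prob(5.0, 3.0))  # ~0.88 -> same gap, same probability
```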
Where does KL-divergence come into play in DPO?
The canonical RLHF objective is

$$\max_{\pi} \; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi(\cdot \mid x)}\big[r(x, y)\big] \;-\; \beta\, \mathbb{D}_{\text{KL}}\big[\pi(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\big]$$
(This is not the PPO objective. This is the underlying objective that a lot of RL setups intend to optimize).
So we're choosing the policy that maximizes the reward and we're subtracting a KL penalty term, which enforces a soft trust region around $\pi_{\text{ref}}$, where $\beta$ controls how far the policy is allowed to drift.
In practice, $\pi_{\text{ref}}$ is typically an SFT'd model rather than raw pretrained weights.
The KL-regularized RL objective implies the optimal policy has the form

$$\pi^*(y \mid x) = \frac{1}{Z(x)}\, \pi_{\text{ref}}(y \mid x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right)$$

Therefore

$$\frac{r(x, y)}{\beta} = \log \frac{\pi^*(y \mid x)}{\pi_{\text{ref}}(y \mid x)} + \log Z(x)$$

Comparing two completions, the $\log Z(x)$ constants cancel:

$$\frac{r(x, y_w) - r(x, y_l)}{\beta} = \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}$$

Multiply both sides by $\beta$ and we now have the reward difference written purely in terms of policy log-ratios, ready to be dropped into Bradley-Terry so that preferences are modeled as a logistic of the reward difference.
Bradley-Terry says the probability the rater prefers $y_w$ over $y_l$ is

$$P(y_w \succ y_l \mid x) = \sigma\big(r(x, y_w) - r(x, y_l)\big)$$

- if $r(x, y_w) \gg r(x, y_l)$, the win probability approaches 1
- if $r(x, y_w) = r(x, y_l)$, the win probability is 0.5

Then we model the preferences using Bradley-Terry, and here's the "DPO" move: substitute the reward difference we just derived into the Bradley-Terry formula,

$$p^*(y_w \succ y_l \mid x) = \sigma\!\left(\beta \log \frac{\pi^*(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi^*(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right) \tag{1}$$
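If you want to sanity-check the algebra, here is a small numerical sketch under toy assumptions (four candidate completions, made-up rewards and reference probabilities): the Bradley-Terry probability computed from raw rewards matches the one computed from the $\beta$-scaled log-ratios of the closed-form optimal policy.

```python
# Sanity check of equation (1) on toy values: compute pi* from the closed form,
# then verify sigma(r_w - r_l) equals sigma(beta * difference of log-ratios).
import numpy as np

rng = np.random.default_rng(0)
beta = 0.1

r = rng.normal(size=4)              # made-up rewards r(x, y) for 4 candidate completions
pi_ref = rng.dirichlet(np.ones(4))  # made-up reference policy over the same completions

# Closed-form optimal policy: pi*(y|x) proportional to pi_ref(y|x) * exp(r(x,y)/beta)
unnorm = pi_ref * np.exp(r / beta)
pi_star = unnorm / unnorm.sum()     # the normalizer is Z(x)

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

w, l = 0, 1  # pick a "winner" and a "loser"
p_from_rewards = sigmoid(r[w] - r[l])
p_from_logratios = sigmoid(beta * (np.log(pi_star[w] / pi_ref[w])
                                   - np.log(pi_star[l] / pi_ref[l])))

assert np.isclose(p_from_rewards, p_from_logratios)  # identical up to floating point
```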
DPO Objective
Continuing from equation (1) above, we replace the unknown optimal policy $\pi^*$ with a parametric policy $\pi_\theta$ and maximize the log-likelihood of observed preferences. That gives us our objective.
Objective function

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

We can use the properties of $\log$ with fractions and rearrange terms to get a more intuitive form of the loss function

$$\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}\left[\log \sigma\big(\beta\,(\Delta_\theta - \Delta_{\text{ref}})\big)\right]$$

where

$$\Delta_\theta = \log \pi_\theta(y_w \mid x) - \log \pi_\theta(y_l \mid x), \qquad \Delta_{\text{ref}} = \log \pi_{\text{ref}}(y_w \mid x) - \log \pi_{\text{ref}}(y_l \mid x)$$
Minimizing $\mathcal{L}_{\text{DPO}}$ is the same as maximizing $\Delta_\theta - \Delta_{\text{ref}}$, the difference of differences. So we want to maximize the contrast of $y_w$ to $y_l$ under the new policy relative to $\pi_{\text{ref}}$'s existing gap between $y_w$ and $y_l$. (See the short code sketch after the bullets below.)
- If the reference already strongly prefers $y_w$ over $y_l$ (big $\Delta_{\text{ref}}$), DPO demands your model at least match that preference, and ideally exceed it.
- If the reference is indifferent (small $\Delta_{\text{ref}}$), DPO is free to create a larger gap; this is where you "learn" the preference.
- If the reference actually prefers the loser (negative $\Delta_{\text{ref}}$), DPO has to overcome that and flip the sign.
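Here is a minimal sketch of that loss in PyTorch. It assumes you already have per-sequence summed log-probs (how to get them is covered below); the variable names are my own convention, not from any particular library.

```python
# Minimal DPO loss sketch in PyTorch. Inputs are per-sequence summed log-probs
# over the response tokens (see the masking section later).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape [B]
             policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape [B]
             ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x),   shape [B]
             ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x),   shape [B]
             beta: float = 0.1) -> torch.Tensor:
    delta_theta = policy_chosen_logps - policy_rejected_logps  # model's winner-loser gap
    delta_ref = ref_chosen_logps - ref_rejected_logps          # reference's winner-loser gap
    logits = beta * (delta_theta - delta_ref)                  # the "difference of differences"
    return -F.logsigmoid(logits).mean()                        # -log sigma(.), averaged over the batch
```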
What does $\beta$ do?
$\beta$ is the weight on the KL term in the underlying objective, so it acts like an inverse temperature on how tightly the policy is tied to the reference. Lower $\beta$ loosens that tie and pushes the model to separate winners from losers more aggressively (relative to the reference), which often increases drift risk; higher $\beta$ keeps updates conservative, close to $\pi_{\text{ref}}$, and may learn little.
Benefit over PPO-RLHF: DPO bypasses fitting an explicit reward model and performing RL to learn the policy, while still maintaining alignment to a reference policy.
Intuition: extra thoughts.
DPO increases the winner–loser contrast under the new policy relative to the winner–loser contrast under the reference. That “relative to the reference” part is the whole point. It’s not merely “make $y_w$ more likely than $y_l$.” It’s “make $y_w$ more likely than $y_l$ by more than the reference already does.”
That structure comes directly from the KL-regularized objective: the KL term defines a soft trust region around $\pi_{\text{ref}}$. In DPO, the explicit KL penalty disappears, but its effect survives because the learning signal is a log-ratio vs the reference.
DPO says: increase the preference gap only insofar as you improve on the reference’s gap.
So DPO behaves like it has a leash without enforcing a hard constraint—drift is controlled implicitly through those reference-relative log ratios.
How is the DPO loss calculated in practice?
It always helps me to spend time understanding exactly how to iterate over the dataset and compute the loss in practice. It would be something like this:
- x cat $y_w$: `<prompt>here is the question<response>here is the preferred (+) response<eos>`
- x cat $y_l$: `<prompt>here is the question<response>here is the dispreferred (-) response<eos>`

For each response $y \in \{y_w, y_l\}$, compute the summed log-probability under both the policy and the reference:

$$\log \pi_\theta(y \mid x) = \sum_{t} \log \pi_\theta(y_t \mid x, y_{<t}), \qquad \log \pi_{\text{ref}}(y \mid x) = \sum_{t} \log \pi_{\text{ref}}(y_t \mid x, y_{<t})$$
You run the model on the whole sequence prompt + response (so the response tokens are conditioned on the prompt), but when you compute the scalar you plug into DPO, you typically sum log-probs over the response tokens only and mask out the prompt tokens.
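Concretely, here is a sketch of that per-sequence computation, assuming a HuggingFace-style causal LM whose forward pass returns `.logits` of shape `[batch, seq_len, vocab]`. The helper name and the `response_mask` convention are mine; the masking itself is explained in the next subsection.

```python
# Sketch of the per-sequence scalar log pi(y | x). response_mask is 1 on
# response tokens and 0 on prompt/padding tokens.
import torch
import torch.nn.functional as F

def sequence_logprob(model, input_ids: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """input_ids: [B, T] prompt+response tokens; response_mask: [B, T]."""
    logits = model(input_ids).logits                        # [B, T, V]
    # Causal shift: the position t logits predict the token at position t+1.
    logprobs = F.log_softmax(logits[:, :-1, :], dim=-1)     # [B, T-1, V]
    labels = input_ids[:, 1:]                               # [B, T-1]
    token_logprobs = logprobs.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # [B, T-1]
    mask = response_mask[:, 1:].to(token_logprobs.dtype)    # align mask with shifted labels
    return (token_logprobs * mask).sum(dim=-1)              # [B]: sum over response tokens only
```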
Why loss masking matters
This is not attention masking, it's loss masking: we compute log probabilities for the full sequence (for causal conditioning), but sum the loss only over response tokens.
If you included prompt tokens in the sum, you'd be "rewarding" the model for predicting the user's prompt—which isn't what you want, and it can skew lengths/scale across examples.
What DPO removes from PPO-RLHF (while keeping alignment)
Now, our main thesis for this modern RL for post-training series is that DPO, GRPO, and GDPO remove portions of the heavy machinery in PPO-RLHF while still maintaining alignment, and also contribute novel improvements that solve specific problems. This is why we first grounded our understanding in the original PPO technique (read the blog post here); otherwise this article might be hard to understand.
So what does DPO remove from PPO that makes it lighter weight, but still able to align with a reference policy?
Removes the whole RL/actor-critic machinery:
- No critic / value function (so no value loss)
- No advantage/return estimation (no rewards → returns → GAE bookkeeping)
- No on-policy RL loop (no "sample rollouts → score → PPO update" inner loop)
In PPO-style RLHF, the policy gradient term explicitly uses an advantage estimator (actor-critic framing), and you're updating $\pi_\theta$ relative to $\pi_{\theta_{\text{old}}}$.
Also removes the "explicit reward model + RL optimization" two-stage pipeline as a requirement: DPO does not require fitting a standalone reward model and then running PPO against it. Instead, it folds the reward-modeling idea into the policy update itself ("your LM is secretly a reward model").
What stays (this is why alignment is still maintained)
- Preference data still drives learning.
- A reference policy still anchors the model.
- KL-control still exists, but it's "baked into" the objective via the log-ratio terms vs $\pi_{\text{ref}}$ (with temperature $\beta$). The DPO loss is literally a logistic loss on $\beta(\Delta_\theta - \Delta_{\text{ref}})$, which is the closed-form rewrite of the same KL-regularized reward-maximization objective that PPO-RLHF is trying to optimize. (See the short sketch of the implicit reward below.)
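To see the "baked-in" reward concretely, here is a small sketch of the implicit reward $\hat{r}_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{ref}}(y \mid x)}$. The variable names reuse the ones from the loss sketch above and are my own convention.

```python
# The implicit reward view: beta * log-ratio vs the reference acts as the reward,
# and the chosen-vs-rejected margin is a useful metric to track during training.
import torch

def implicit_reward(policy_logps: torch.Tensor, ref_logps: torch.Tensor, beta: float = 0.1) -> torch.Tensor:
    # r_hat(x, y) = beta * log( pi_theta(y|x) / pi_ref(y|x) )
    return beta * (policy_logps - ref_logps)

# chosen_rewards   = implicit_reward(policy_chosen_logps, ref_chosen_logps)
# rejected_rewards = implicit_reward(policy_rejected_logps, ref_rejected_logps)
# reward_margin    = (chosen_rewards - rejected_rewards).mean()  # should trend upward as training progresses
```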
In DPO training, you typically aren't collecting rollouts online—you're training offline on a preference dataset. You can collect new preference pairs as a separate data-collection step, but it's not part of the optimization loop the way it is in PPO-RLHF.
PPO-RLHF vs DPO: training regimen
- PPO-RLHF is online and on-policy
- DPO can be offline and off-policy, and mostly is in practice
Online
Online = data collection happens during training and is driven by the current policy's behavior.
On-policy
On-policy = the data used for updates was generated by the same policy you're updating. This also means update steps on a batch of data from the current policy must be conducted in such a way that things don't drift too far during training. Hence trust regions, policy ratios, limited numbers of update steps per batch, early stopping, and clipped objectives: basically, constrain updates so the mismatch doesn't get too large.
Offline
Offline = training uses a fixed logged dataset; no interactive data collection is part of the training loop. Offline refers to the optimization loop (no sampling inside training). Preference data may have been collected however you want.
Off-policy
Off-policy = you update using data generated by a different behavior policy (often a mixture of older policies).
Brief summary
- Online: collect new experience during training (policy influences data).
- Offline: train only on a static logged dataset.
- On-policy: update using data from the current/recent policy.
- Off-policy: update using data from other/older policies.
Where DPO breaks down
DPO trades engineering complexity for data and hyperparameter sensitivity.
(1) Sensitivity to data quality (garbage preferences in, garbage policy out)
It seems self-explanatory on the surface, but the subtlety is that DPO has no intermediate checkpoint. In PPO-RLHF, you fit a reward model first: you can inspect it, see if it learned something sensible, catch problems before they propagate. DPO folds everything together. Bad preferences go straight into the policy with no buffer. You find out something was wrong when the model misbehaves, not before.
(2) The assumption that the reference policy is reasonable
The non-obvious part is about support. If $\pi_{\text{ref}}$ assigns near-zero probability to completions that appear in your preference data, the log-ratio explodes and training becomes unstable. The reference needs to "cover" the space your preference pairs live in. This is why you want an SFT model that's already seen similar prompts and response styles, not a raw base model that might put negligible mass on the kind of completions humans preferred.
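To see why near-zero reference support matters, here is a toy numeric sketch (the probabilities are made up for illustration):

```python
# Toy illustration of the support problem: when pi_ref puts negligible mass on a
# completion from the preference data, the log-ratio is dominated by the support
# mismatch rather than by any preference signal.
import math

p_theta = 1e-4     # policy's sequence probability for the preferred completion
p_ref_ok = 1e-4    # a reference that "covers" the completion
p_ref_bad = 1e-40  # a reference that has essentially never produced anything like it

print(math.log(p_theta / p_ref_ok))   # 0.0   -> well-behaved log-ratio
print(math.log(p_theta / p_ref_bad))  # ~82.9 -> huge logit from coverage alone
```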
(3) The $\beta$ hyperparameter and what happens when you get it wrong (plus length-exploitation issues)
As mentioned in the section "What does $\beta$ do?", you have to choose $\beta$ carefully so as not to encourage policy drift or mode collapse, or to prevent learning.
The annoying part is that there's no principled way to set it. Optimal $\beta$ varies by dataset, model size, how clean your preferences are, and how far the SFT model already is from the behavior you want. People do hyperparameter sweeps, which is expensive. Some papers report sensitivity analyses showing performance cliffs on either side of a narrow good region.
Interaction with noise: If your preference labels are noisy, a low $\beta$ amplifies that noise; you're aggressively chasing a signal that's partly garbage. A high $\beta$ is more robust to label noise but also limits how much you can learn from a clean signal.