
Modern LLM post-training part 4: GDPO


GDPO—proper advantages from a multi-reward category setup

Original Paper (arXiv)


Where this fits in the series

This post is part of my Modern post-training for LLMs series. The throughline is simple:

Modern techniques simplify PPO by removing machinery while still preventing policy drift.

You already saw the full PPO machinery in my PPO article. GRPO keeps the PPO optimizer but removes the PPO critic.

GDPO makes GRPO work in a setting where there are multiple reward categories per rollout.


Introduction

GRPO is clean when each rollout has a single scalar reward. But real post-training uses multiple reward categories (correctness, format, safety, length, etc.). The standard move—sum rewards, then normalize—can erase structure: very different reward vectors can collapse into the same standardized advantage, so the policy gets a coarse or misleading learning signal.

GDPO is a minimal fix. It normalizes each reward dimension across the rollout group first, then aggregates. Same GRPO-style training loop, no critic—just an advantage estimator that actually respects multi-reward setups.


The failure mode: "reward signal collapse" in multi-reward GRPO

The GDPO authors formalize the common multi-reward setup:

  • You have a prompt/question \(q\), and the behavior policy samples \(G\) rollouts/responses \(o_1, \dots, o_G\).

  • You have \(n\) reward components, so rollout \(o_i\) gets a reward vector \((r_i^1, \dots, r_i^n)\).

  • You compute an aggregated reward per rollout:

    \[ R_i = \sum_{j=1}^{n} r_i^j \]

  • Then GRPO computes the group-relative advantage by normalizing those aggregated rewards within the group:

    \[ A_i = \frac{R_i - \mathrm{mean}(R_1, \dots, R_G)}{\mathrm{std}(R_1, \dots, R_G)} \]

Their key observation: once you collapse a reward vector to a scalar sum, the update can no longer distinguish which category was satisfied. Different reward vectors that share the same summed reward get identical advantages and produce identical learning signals.

Let's look at a hard-coded example using Python.

import pandas as pd

# First example: GRPO (multiple distinct rows share the same row-sum)
reward_df = pd.DataFrame(
    [
        [0, 1, 2],  # sum = 3 (same reward sum)
        [2, 0, 1],  # sum = 3 (same reward sum)
        [3, 0, 0],  # sum = 3 (same reward sum)
        [3, 0, 2],  # sum = 5
        [1, 4, 1],  # sum = 6
        [2, 2, 3],  # sum = 7
    ]
)
row_sums = reward_df.sum(axis=1)  # sum across reward categories <- this is the collapse
avg = row_sums.mean()
std = row_sums.std()
advantages = (row_sums - avg) / std  # GRPO group-relative advantage
print(advantages)
# RESULTS
# 0   -0.851943
# 1   -0.851943
# 2   -0.851943
# 3    0.283981
# 4    0.851943
# 5    1.419905

# SAME BUCKET

If two rows share the same row_sum, they share the same GRPO advantage—by construction. The policy update cannot tell how a rollout was good, only that its total score was higher or lower.


GDPO's fix: decouple normalization per reward, then aggregate

GDPO changes only the advantage computation (still in the GRPO-family—no value model).

Instead of normalizing the summed reward, GDPO normalizes each reward dimension \(j\) separately across the \(G\) rollouts:

  \[ A_i^j = \frac{r_i^j - \mathrm{mean}(r_1^j, \dots, r_G^j)}{\mathrm{std}(r_1^j, \dots, r_G^j)} \]

Then it sums these already-normalized per-reward advantages:

  \[ A_i = \sum_{j=1}^{n} A_i^j \]
Here's GDPO in Python.

import pandas as pd

# Second example: GDPO (reuses reward_df from the GRPO example above)
col_means = reward_df.mean(axis=0)  # mean of each reward category across the group
col_stds = reward_df.std(axis=0)    # std of each reward category across the group

per_reward_adv = (reward_df - col_means) / col_stds  # normalize each category within the group
advantages = per_reward_adv.sum(axis=1)
print(advantages)
# RESULTS
# 0   -1.195531
# 1   -1.062384
# 2   -1.160448
# 3    0.746478
# 4    0.578968
# 5    2.092917

# DIFFERENT BUCKETS!

# Batch-wise normalization: keeps the advantage scale stable as reward categories are added
normalized_advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
print(normalized_advantages)
# 0   -0.882445
# 1   -0.784167
# 2   -0.856549
# 3    0.550990
# 4    0.427348
# 5    1.544823

We're using the same reward_df from the previous GRPO example, where three distinct row vectors

  • [0, 1, 2]: sum = 3
  • [2, 0, 1]: sum = 3
  • [3, 0, 0]: sum = 3

share the same row sum. In GRPO, all three get the same advantage because the reward vector is scalarized first. Under GDPO, those same three vectors no longer have to tie—normalization happens per reward dimension before aggregation.

For a single prompt, GDPO normalizes each reward category across rollouts: each reward is rescaled according to its own category distribution. So even if the row sums start out equal, per-dimension rescaling changes the sum of per-reward z-scores, often making each distinct—though uniqueness isn't guaranteed.

Batch-wise advantage normalization (the "keep the scale sane" step)

After summing, they apply a batch-wise normalization over all rollouts in the batch (not per-question) so the magnitude doesn't blow up as you add more reward terms:

  \[ \hat{A}_i = \frac{A_i - \mathrm{mean}_{\text{batch}}(A)}{\mathrm{std}_{\text{batch}}(A)} \]

They say this keeps the numerical scale stable and improves training; removing it can cause occasional convergence failures.

The normalized_advantages step at the end of the code above is exactly this batch-wise normalization.

Final intuition

GRPO treats the reward vector like it's 1D—collapse it to a scalar then normalize. GDPO treats it as genuinely multi-dimensional—normalize in each dimension first—so "which reward you satisfied" still matters after normalization.
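
To make the contrast concrete, here's a minimal sketch of both estimators as small pandas functions. The function names are mine, not the paper's; they just package the hard-coded steps above.

import pandas as pd

def grpo_advantages(rewards: pd.DataFrame) -> pd.Series:
    # Collapse each rollout's reward vector to a scalar sum, then normalize within the group.
    totals = rewards.sum(axis=1)
    return (totals - totals.mean()) / (totals.std() + 1e-8)

def gdpo_advantages(rewards: pd.DataFrame) -> pd.Series:
    # Normalize each reward category across the group first, then sum, then normalize the sums.
    per_reward = (rewards - rewards.mean(axis=0)) / (rewards.std(axis=0) + 1e-8)
    summed = per_reward.sum(axis=1)
    return (summed - summed.mean()) / (summed.std() + 1e-8)

Calling gdpo_advantages(reward_df) on the table from earlier reproduces the normalized_advantages printed above (up to the tiny epsilon in the denominators).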


Reward priority variation using thresholds

There's a common failure mode when scoring rollouts with multiple reward categories. Say one category is easier to optimize than another. Easy could mean:

  • Dense / frequent (you get feedback on lots of rollouts)
  • Low-skill (doesn't require new reasoning capability)
  • Low-variance (consistent gradient signal)
  • Often surface-level (formatting, length limits, emitting a token pattern)

Hard could mean:

  • Sparse (few rollouts succeed)
  • Noisy / judge-dependent
  • Long-horizon / credit-assignment heavy (many tokens must be right)
  • Potentially in tension with other objectives

The optimizer will prioritize maximizing the easy reward over the hard one.

To fix this, the authors suggest thresholding. Specifically, the easy reward is gated by an indicator that the hard reward clears a threshold:

  \[ \tilde{r}_{\text{easy}} = \mathbb{1}\left[r_{\text{hard}} \ge \tau\right] \cdot r_{\text{easy}} \]

The idea: gate the easy reward \(r_{\text{easy}}\) behind the hard reward \(r_{\text{hard}}\). The easy reward only counts if the hard reward meets the threshold \(\tau\). Now the model can't harvest cheap wins without doing the real work.
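
Here's a minimal Python sketch of that gate, assuming a plain sum as the aggregator; the names r_easy, r_hard, and tau are illustrative, not the paper's notation.

def gated_reward(r_easy: float, r_hard: float, tau: float) -> float:
    # The easy reward only counts once the hard reward clears the threshold.
    gate = 1.0 if r_hard >= tau else 0.0
    return r_hard + gate * r_easy

With tau = 1.0, a rollout that nails the easy term but scores 0 on the hard one gets nothing; the only way to collect the easy bonus is to clear the hard bar first.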

Examples

Let's make it concrete with three toy training batches. In each one:

  • Easy objective = the model can improve it with cheap, local changes that work across many prompts (high hit-rate).
  • Hard objective = improvement requires actually solving the task (low hit-rate early on, more brittle).

The key: training follows the steepest, most reliable hill first. If one reward is much easier to pick up, it can dominate even if you weight them equally.

Example 1: format + correctness (classic)

Say you're training tool calling.

  • Format reward: 1 if the output is valid JSON with the right keys.
  • Correctness reward: 1 if the tool arguments are actually correct.

Sample 4 rollouts for the same prompt ("book a flight…"):

  1. Valid JSON, wrong args
  2. Valid JSON, wrong args
  3. Valid JSON, wrong args
  4. Invalid JSON, (doesn't matter)

Scores:

  rollout   format   correctness   total
  1         1        0             1
  2         1        0             1
  3         1        0             1
  4         0        0             0

What's easy here? Format. The model can learn "always output JSON shaped like X" pretty fast. It doesn't require understanding flight booking constraints—just a template.

What's hard? Correctness. It needs to parse the prompt, extract cities/dates, handle ambiguity, and produce consistent arguments.

So early training will often crank format to 100% while correctness barely moves.

Why this matters

Even if you set weights (format=1, correctness=1), the model can "win" a lot of reward by getting format right without solving correctness. That's exactly the "easy dominates" dynamic.

The gating trick

You redefine format reward so it only counts if correctness crosses a threshold:

  • If correctness == 0 → format reward becomes 0 (even if valid JSON)
  • If correctness == 1 → format reward counts as usual

Now format-only hacks stop paying out. The model must move the hard objective to get anything.
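
A quick sketch of that gate on the four rollouts above (the column names are just this example's toy labels):

import pandas as pd

# format/correctness scores for the four rollouts above
rollouts = pd.DataFrame({"format": [1, 1, 1, 0], "correctness": [0, 0, 0, 0]})

# format only counts when correctness clears the threshold (here: correctness == 1)
gated_format = rollouts["format"] * (rollouts["correctness"] >= 1).astype(int)
total = rollouts["correctness"] + gated_format
print(total.tolist())  # [0, 0, 0, 0] -- JSON-shaped-but-wrong outputs no longer score

Every total is now zero, so no rollout stands out in the group, and the only way to earn reward is to move correctness.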

Example 2: accuracy + length (very common in instruction tuning)

Say:

  • Accuracy reward: +1 if answer is correct.
  • Length reward: +1 if under 60 tokens (or a penalty if over).

Rollouts:

  1. Short, wrong
  2. Short, wrong
  3. Long, correct
  4. Long, correct

Scores:

  rollout   accuracy   length   total
  1         0          1        1
  2         0          1        1
  3         1          0        1
  4         1          0        1

Notice something nasty: all totals tie.

If you're doing group-relative updates and everything ties (or nearly ties), the learning signal becomes mushy. But even before that:

Which objective is easier? Length. The model can get "short" by truncating, being vague, or refusing to answer.

Accuracy is harder because it requires the real work.

So without careful design, the model might learn "be short no matter what" instead of "be correct and then be short."

A cleaner reward design

Make length conditional:

  • Give length bonus only if accuracy is correct.
  • Or use soft shaping: the length penalty only kicks in once some minimum correctness bar is met (like "final answer present and consistent").

Same idea: don't let the model harvest the easy reward while skipping the hard one.
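
A minimal sketch of the conditional version, reusing the toy 60-token cutoff from above (the function name and signature are mine):

def length_aware_reward(is_correct: bool, num_tokens: int, max_tokens: int = 60) -> float:
    # Accuracy is the base reward; the length bonus only pays out on correct answers.
    accuracy = 1.0 if is_correct else 0.0
    length_bonus = 1.0 if (is_correct and num_tokens <= max_tokens) else 0.0
    return accuracy + length_bonus

A short wrong answer scores 0, a long correct answer scores 1, and a short correct answer scores 2, so brevity only ever helps on top of correctness.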

Example 3: safety + helpfulness (a more subtle case)

Suppose:

  • Safety reward: +1 if it refuses disallowed requests, avoids unsafe content, uses safe framing.
  • Helpfulness reward: +1 if it answers the user's actual question well.

In many prompts, the easiest way to guarantee safety is to be conservative: refuse too often, give generic disclaimers, avoid specifics.

That's easier than being both safe and truly helpful in edge cases.

Safety isn't always easy in an absolute sense, but it's often easier to satisfy reliably via conservative heuristics than helpfulness is to satisfy reliably via actually doing the task.

Gating here looks like:
  • Helpfulness reward only counts if safety score exceeds a threshold.
  • Or conversely, safety reward is applied but you also explicitly penalize unnecessary refusal on benign prompts (a separate "over-refusal" objective).

Conclusion

GDPO is a small idea with a practical lesson: when your reward is multi-dimensional, your advantage estimator has to respect that structure.

GRPO's "sum then normalize" trick is fine when you truly have one reward. But the moment you score rollouts with multiple categories (format, correctness, safety, length, tool-usage, etc.), you can accidentally destroy information: different reward vectors collapse into the same standardized advantage, so the policy update can't reliably distinguish how a rollout was good.

GDPO fixes this with one move: normalize each reward category across the rollout group first, then aggregate. That preserves "which objective you satisfied" in the learning signal. The extra batch-wise normalization is just housekeeping—it keeps advantage scale stable as you add more reward categories.

The other takeaway (Sec 3.2 in the paper) is about incentives, not math: in a multi-objective setup, the model often grabs the easiest reward first, even if you weight objectives equally. If you care about a specific priority—correctness before brevity, valid tool format only when the call is correct, helpfulness only when safe—you can't always use weights to solve it. You often have to change the reward definition so the easy reward doesn't count unless the priority condition is met.

So, in the arc of this series:

  • PPO-RLHF: heavy machinery (critic + advantage estimation + careful soft trust-region updates).
  • GRPO: keep the PPO-style policy update, drop the critic, use group-relative advantages.
  • GDPO: keep GRPO's simplicity, but make it actually behave in the real world where rewards come in categories.

If PPO is the full RLHF stack, GDPO is the "engineering fix" version of modern post-training: fewer moving parts, fewer ways to silently break your learning signal, and a cleaner mental model for multi-reward alignment.