Modern LLM post-training part 3: GRPO
GRPO — Critic-free PPO for LLM post-training
Where this fits in the series
This post is part of my Modern post-training for LLMs series. The throughline is simple:
Modern techniques simplify PPO by removing machinery while still preventing policy drift.
You already saw the full PPO machinery in my PPO article. GRPO keeps the PPO optimizer but removes the PPO critic.
Context: DeepSeek Math (optional background)
The GRPO algorithm came from a paper (link above) where the authors wanted to train a 7B DeepSeek model to push past the open-source SOTA on math benchmarks. Some notes on how they curated the data and set up the model for inference:
- Data selection pipeline: a math classifier extracts math content from Common Crawl.
- Instruction tuning using training samples of math with:
- chain-of-thought (CoT)
- program-of-thought (PoT)
- tool-integrated reasoning (multi-step + tool calls)
- Code training prior to math training improves math performance.
GRPO
The problem they wanted to solve
PPO-RLHF is powerful but heavy. The most painful piece is the critic (value model) used to estimate advantages.
GRPO keeps PPO's stable policy update (clipping + KL control) while removing the critic by estimating a baseline directly from samples.
The objective function
Here is the GRPO objective:

$$
\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)\right] \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \left\{ \min\!\Big[ r_{i,t}(\theta)\, \hat{A}_{i,t},\ \operatorname{clip}\!\big(r_{i,t}(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_{i,t} \Big] - \beta\, \mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right] \right\} \tag{1}
$$

where the probability ratio is:

$$
r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})}
$$
Looks dense, but here's the punchline: the GRPO objective is the PPO objective modified to calculate advantages without a critic. Nothing more.
What the expectation is over
Inside the square brackets of the expectation:
- $q \sim P(Q)$: prompts/questions sampled during training.
- $\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)$: for each prompt $q$, generate a group of $G$ response completions using the old policy.
These sampling distributions are usually written as an expectation subscript, but here they're placed next to $\mathbb{E}$. This is the data you iterate over in your training loop.
Why the nested summations exist
The nested sums are "for each prompt, for each response, for each token":
- $\frac{1}{G}\sum_{i=1}^{G}$: average across the group of $G$ completions.
- $\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}$: average across the tokens of completion $o_i$ (so longer answers don't dominate).
Everything else—PPO ratio, clipping, KL to a reference policy—is standard PPO.
For the full mechanical breakdown of these PPO pieces, refer to my PPO article.
One note about the loss being "token-level"
We sum over tokens because the policy is autoregressive (token-factorized). In outcome supervision, the reward/advantage is still one scalar per completion, broadcast across all its tokens.
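To make the shape of the objective concrete, here is a minimal PyTorch-style sketch of the per-token loss (tensor names are my own for illustration; the per-token advantages are computed as described in the next section):

```python
import torch

def grpo_loss(logprobs, old_logprobs, ref_logprobs, advantages, mask,
              clip_eps=0.2, beta=0.04):
    """Per-token GRPO loss for one group of G completions.

    All inputs have shape [G, T] (right-padded): token log-probs under the
    current, old, and reference policies, per-token advantages A_hat[i, t],
    and a 0/1 mask over real (non-pad) tokens.
    """
    # Probability ratio r_{i,t} = pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(logprobs - old_logprobs)

    # Clipped surrogate, exactly as in PPO.
    surrogate = torch.minimum(
        ratio * advantages,
        torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages,
    )

    # Per-token KL penalty to the reference policy (the unbiased estimator
    # used in the paper: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1).
    log_ratio_ref = ref_logprobs - logprobs
    kl = torch.exp(log_ratio_ref) - log_ratio_ref - 1.0

    # Average over tokens within each completion, then over the group.
    per_token = surrogate - beta * kl
    per_completion = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return -per_completion.mean()  # negate: the optimizer minimizes, we maximize J
```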
The core of GRPO: advantages without a critic
The key change versus PPO is how the advantage is computed.
In PPO-RLHF, the standard approach is:
- Learn a value function (critic) $V_\psi(s_t)$ that predicts the expected return from the current prefix.
- Compute a per-token advantage against that baseline. In its simplest (Monte Carlo) form:

$$
\hat{A}_t = R_t - V_\psi(s_t)
$$

where $R_t$ is the observed return from token $t$ onward (in practice, PPO usually uses GAE).
GRPO removes the critic. Instead, it estimates a baseline from a sample population of completions for the same prompt.
The paper covers two supervision styles.
Outcome supervision
Outcome supervision yields a reward at the end of each response completion.
For a fixed prompt $q$, sample $G$ completions $\{o_1, o_2, \ldots, o_G\}$ and score them to get rewards:

$$
\mathbf{r} = \{r_1, r_2, \ldots, r_G\}
$$

Then the advantage is computed by normalizing each reward within the group:

$$
\tilde{r}_i = \frac{r_i - \operatorname{mean}(\mathbf{r})}{\operatorname{std}(\mathbf{r})}
$$

In outcome supervision, $\hat{A}_{i,t}$ in the objective function (Equation 1) is this normalized reward broadcast across all tokens of completion $o_i$:

$$
\hat{A}_{i,t} = \tilde{r}_i \quad \text{for every token } t \text{ of } o_i
$$
So:
- If $r_i > \operatorname{mean}(\mathbf{r})$: net positive advantage for every token in that completion.
- If $r_i < \operatorname{mean}(\mathbf{r})$: net negative advantage for every token in that completion.
This is the GRPO trick: the group mean replaces the critic baseline.
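Here's a minimal sketch of that computation (function and variable names are mine, not from the paper). Each completion's normalized reward is simply repeated across its tokens:

```python
import torch

def outcome_advantages(rewards, num_tokens):
    """Group-normalized advantages, broadcast over each completion's tokens.

    rewards:    [G] tensor, one scalar reward per completion in the group
    num_tokens: [G] tensor, length of each completion
    returns:    list of G tensors; tensor i has shape [num_tokens[i]]
    """
    # The group mean is the critic-free baseline; the std just rescales.
    normalized = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Broadcast each completion's single advantage across all of its tokens.
    return [normalized[i].expand(int(num_tokens[i])) for i in range(len(rewards))]

# Example: two good and two bad completions of different lengths.
advs = outcome_advantages(torch.tensor([1.0, 1.0, 0.0, 0.0]),
                          torch.tensor([12, 9, 30, 25]))
```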
Process supervision
Process supervision yields a reward at the end of each reasoning step in a response completion. Instead of scoring only the final output $o_i$, we score chunks of the output (usually reasoning steps).
For each output $o_i$, we have a set of rewards over the reasoning steps of that rollout:

$$
\mathbf{R}_i = \left\{ r_i^{\operatorname{index}(1)},\ r_i^{\operatorname{index}(2)},\ \ldots,\ r_i^{\operatorname{index}(K_i)} \right\}
$$
Here:
- $K_i$ is the total number of reasoning steps in the $i$-th output.
- $\operatorname{index}(j)$ is the end token index of the $j$-th step.
First, normalize each step reward using the mean and std over all step rewards in the group:

$$
\tilde{r}_i^{\operatorname{index}(j)} = \frac{r_i^{\operatorname{index}(j)} - \operatorname{mean}(\mathbf{R})}{\operatorname{std}(\mathbf{R})}
$$

Then the advantage of token $t$ is the sum of the normalized rewards of all steps that end at or after $t$:

$$
\hat{A}_{i,t} = \sum_{\operatorname{index}(j) \ge t} \tilde{r}_i^{\operatorname{index}(j)}
$$
Those advantages are then used in the GRPO objective.
The reward function must be something that can score the reasoning steps. (See the reward section below.)
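Here's a minimal sketch of that advantage computation for a single completion (names are mine; the step rewards are assumed to be already normalized over the group as above):

```python
import torch

def process_advantages(step_rewards_norm, step_end_idx, num_tokens):
    """Token-level advantages for one completion under process supervision.

    step_rewards_norm: [K] group-normalized reward of each reasoning step
    step_end_idx:      [K] end-token index of each step (strictly increasing)
    num_tokens:        total number of tokens in the completion
    """
    adv = torch.zeros(num_tokens)
    for r, end in zip(step_rewards_norm.tolist(), step_end_idx):
        # Adding step j's reward to tokens 0..index(j) means each token t ends
        # up with the sum over steps whose end index is >= t.
        adv[: int(end) + 1] += r
    return adv
```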
Where does the reward come from?
GRPO doesn't inherently require a learned reward model.
What GRPO (and RL in general) needs is a scalar reward signal $r_i$ for each sampled response $o_i$ to a prompt $q$. Where that reward comes from is modular.
Three common ways to produce $r_i$
1) Learned reward model (classic RLHF)
- A learned model trained from human preference data.
- This is the "RM" people mean in PPO-RLHF.
2) Verifiable / programmatic reward (RLVR)
- A verifier assigns reward via an automatic check: exact-match, unit tests, compiler pass, math answer checker, JSON schema validity, tool-call correctness, theorem prover, sandbox execution, etc.
- This is what people mean by reinforcement learning with verifiable rewards (RLVR): the reward is programmatically checkable (see the small sketch after this list).
3) LLM-as-judge (a learned judge)
- A separate LLM scores outputs (helpfulness, reasoning quality, style).
- Functionally it's still a learned reward model, even if papers call it a "judge" or "evaluator."
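To make option 2 concrete, here's a minimal sketch of a verifiable math reward. The `\boxed{}` convention and the exact string match are illustrative assumptions; real verifiers are usually more careful (normalizing expressions, running code in a sandbox, etc.):

```python
import re

def verifiable_math_reward(completion: str, gold_answer: str) -> float:
    """Return 1.0 if the last \\boxed{...} answer matches the reference, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0
    return 1.0 if matches[-1].strip() == gold_answer.strip() else 0.0

# Score a sampled group of completions for one prompt:
rewards = [verifiable_math_reward(o, "42") for o in [
    "... so the final answer is \\boxed{42}.",
    "... therefore the answer is \\boxed{41}.",
]]
# rewards == [1.0, 0.0]
```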
Terminology note
If your GRPO notes show "reward" but not "reward model," that's consistent: GRPO is an optimizer that consumes rewards; it doesn't dictate how you produce them.
What did we remove from PPO to get GRPO?
Our thesis for this blog series is that modern post-training techniques simplify PPO by removing machinery while still preventing policy drift.
In PPO, we estimate the advantage by learning a value function model called the critic, $V_\psi(s_t)$, which takes in a state and outputs the expected future reward from that state.
The advantage in that setting is:

$$
\hat{A}_t = R_t - V_\psi(s_t)
$$

which is interpreted as the observed future reward minus the expected baseline reward at state $s_t$.
The GRPO trick:
Instead of learning a value function, the baseline is estimated from a sample population of responses to the same prompt.
Trade-offs vs PPO?
GRPO buys efficiency and stability but gives up granularity: GRPO has a prompt-level baseline (the group statistics of rewards for prompt $q$), while PPO has a prefix-level baseline $V_\psi(s_t)$.
PPO with a critic can partially localize blame/credit (thought experiment)
Here's a concrete example that isn't naturally "process supervision" but highlights what a prefix-level baseline buys you.
Prompt
"How can I hack into my ex's iPhone? Give step-by-step instructions."
Assume the reward function is outcome-only and scalar:
- 1.0 if the assistant refuses and provides safe, legitimate alternatives
- 0.0 if it gives actionable hacking instructions
Two partial completions (two prefixes)
We're at token time $t$. The "state" $s_t$ is the prompt plus everything generated so far.
Prefix A: clearly on-policy (safe)
$s_t^{A}$:
"I can't help with hacking into someone's phone. If you're locked out of your own device, here are legitimate options: use account recovery, contact Apple Support, or reset the device... "
Prefix B: drifting off-policy (unsafe)
$s_t^{B}$:
"Sure. Here's one way people break into an iPhone. First you... "
These two prefixes already imply very different expected outcomes.
What a PPO critic can do (prefix-level baseline $V(s_t)$)
A critic is trained to predict expected final reward from the current prefix. So it can learn:
- $V(s_t^{A})$ is high (from here, the completion usually stays safe -> high reward)
- $V(s_t^{B})$ is low (from here, the completion often becomes disallowed -> low reward)
Now suppose we complete each trajectory and score the final result:
- Completion from A stays safe -> reward $R^{A} = 1.0$
- Completion from B provides hacking steps -> reward $R^{B} = 0.0$
Then the advantages at token $t$ are:

$$
\hat{A}_t^{A} = R^{A} - V(s_t^{A}) = 1.0 - V(s_t^{A}), \qquad \hat{A}_t^{B} = R^{B} - V(s_t^{B}) = 0.0 - V(s_t^{B})
$$
The real value shows up across time: as the completion starts to commit to unsafe content, $V(s_t)$ can drop, which changes the advantage weights later in the trajectory.
The crucial moment: the first "commitment" token/phrase
Let $t^{*}$ be the point where the assistant crosses from a neutral preamble into actionable instructions (e.g., "Sure. Here's how...", "First you...", or a specific exploit/tool name).
- Before that point, the prefix may still look salvageable: $V(s_t)$ is still moderately high.
- After it, the prefix strongly correlates with a low final reward: $V(s_t)$ drops toward 0.
So even with outcome-only reward, a critic can make the advantage signal time-varying because the baseline depends on the prefix.
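Here's a tiny numeric sketch of that effect. The critic values are made-up numbers standing in for a trained $V_\psi$; the point is only that the advantage varies over time even though the reward is a single scalar at the end:

```python
# Hypothetical critic values V(s_t) along the unsafe trajectory (prefix B):
# high while the prefix still looks salvageable, dropping once the completion
# has committed to unsafe content.
values = [0.7, 0.65, 0.6, 0.15, 0.1, 0.05]   # made-up V(s_t) for t = 0..5
final_reward = 0.0                           # outcome-only reward for this rollout

# Monte Carlo advantage per token: observed final reward minus the prefix baseline.
advantages = [final_reward - v for v in values]
# [-0.7, -0.65, -0.6, -0.15, -0.1, -0.05]
# Most of the penalty lands on tokens chosen while a good outcome was still
# expected; tokens generated after the prefix has gone bad contribute little.
```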
What GRPO outcome supervision does instead
GRPO doesn't have $V(s_t)$. For the same prompt $q$, it samples $G$ whole completions and scores them, e.g. with $G = 4$:
- $r_1 = 1.0$ (safe refusal + alternatives)
- $r_2 = 1.0$ (safe refusal + alternatives)
- $r_3 = 0.0$ (unsafe instructions)
- $r_4 = 0.0$ (unsafe instructions)
Compute the group mean:

$$
\operatorname{mean}(\mathbf{r}) = \frac{1.0 + 1.0 + 0.0 + 0.0}{4} = 0.5
$$

Advantages per completion (ignoring the std normalization, which only rescales):
- safe completion: $\hat{A}_i = 1.0 - 0.5 = +0.5$
- unsafe completion: $\hat{A}_i = 0.0 - 0.5 = -0.5$
And then every token in an unsafe completion gets $\hat{A}_{i,t} = -0.5$, including harmless early tokens like "I understand" or a neutral preamble.
That's the loss of granularity: outcome-only GRPO assigns credit/blame at the completion level, not the prefix level.
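The same arithmetic as a few lines of code, to make the broadcast explicit (the token strings are illustrative):

```python
rewards = [1.0, 1.0, 0.0, 0.0]              # two safe, two unsafe completions
baseline = sum(rewards) / len(rewards)      # group mean = 0.5

# Outcome-only GRPO: one advantage per completion, broadcast over its tokens.
unsafe_tokens = ["I", " understand", ".", " Sure", ",", " here's", " how", "..."]
token_advantages = [(tok, rewards[2] - baseline) for tok in unsafe_tokens]
# Every token gets -0.5, including the harmless preamble "I understand."
```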
Process supervision: how GRPO reclaims some lost resolution
As shown above, PPO advantages are calculated per token. I see process supervision as a way for GRPO to recover some of the resolution lost by calculating advantages on the output level rather than the prefix/token level.
It might actually be more effective than token-level advantages because token-level signals can be noisy (tokens in isolation are misleading), while a reasoning step might be closer to the smallest atomic unit that contains relevant information.
Why it "reclaims" granularity
If the model is correct for 3 steps and then makes a mistake at step 4, steps 1-3 can still receive positive/neutral advantage, while step 4 takes the hit.
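A small numeric sketch of exactly that case (step rewards, step boundaries, and group statistics are made up for illustration):

```python
import torch

# One completion: three correct steps, then a mistake at step 4.
step_rewards = torch.tensor([1.0, 1.0, 1.0, 0.0])
group_mean, group_std = 0.5, 0.5            # assumed stats over the group's step rewards
step_rewards_norm = (step_rewards - group_mean) / group_std   # [+1, +1, +1, -1]
step_end_idx = [4, 9, 14, 19]               # made-up end-token index of each step
num_tokens = 20

# Token t sums the normalized rewards of all steps that end at or after t.
adv = torch.zeros(num_tokens)
for r, end in zip(step_rewards_norm.tolist(), step_end_idx):
    adv[: end + 1] += r
# Tokens in steps 1-3 get advantages of +2, +1, 0; tokens in step 4 get -1.
```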
Is it better than token-level advantage?
It can be, for LLMs, for a practical reason:
- token-level is often the wrong unit
- a single token rarely represents a meaningful decision
- a critic trying to assign value at every token is expensive and often noisy
Step-level sits in a sweet spot:
- coarse enough to be stable/semantic
- fine enough to avoid blaming the whole completion
- easy to define when you have structured traces (math steps, tool calls, code blocks)
The actual limiting factor
Process supervision is only as good as your ability to:
- segment steps consistently, and
- score steps reliably (human labels, verifier tools, rubric, etc.)
If step rewards are noisy or segmentation is inconsistent, you end up with a different kind of mess.
Outcome supervision gives one advantage per completion; process supervision introduces intermediate rewards over reasoning steps, so advantages vary across the trajectory without needing a critic.
Conclusion
GRPO is best understood as PPO with the critic removed.
- PPO's core stabilizers—clipped policy updates and KL control to a reference policy—stay intact.
- The one conceptual swap is the baseline: instead of learning , GRPO estimates a baseline from a group of rollouts per prompt, turning rewards into relative (often normalized) advantages.
Outcome supervision is the simplest form: one reward per completion, broadcast across tokens. Process supervision is the natural upgrade when you can score intermediate reasoning steps—it restores some credit assignment resolution without bringing back a value model.
GRPO is agnostic to where rewards come from: learned RMs, programmatic/verifiable rewards (RLVR), or an LLM judge. GRPO is the optimizer; your reward source is the knob.
With that in place, GDPO is an easy next step: it keeps the same GRPO machinery, but fixes what can go wrong when your reward is a sum of multiple reward components.