Modern LLM Post-Training
A 5-part series covering the evolution of post-training algorithms: from PPO-RLHF to DPO, GRPO, GDPO, and practical SFT experiments. Each post builds on the previous, showing what each method removes or modifies and why.
Part 1
Modern post-training part 1: PPO, the ancestor algorithm
• 15 min read
Part 1 of a 5-part series on modern post-training. PPO-RLHF is the ancestor algorithm: three models, on-policy rollouts, and a critic for variance reduction. Understanding PPO explains what DPO, GRPO, and GDPO delete and why.
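To ground what the later posts strip away, here is a minimal sketch of PPO's clipped surrogate loss with critic-based advantages; the tensor names and clip value are illustrative, not taken from the post:

```python
import torch

def ppo_clip_loss(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped PPO policy loss on sampled tokens.

    logp_new:   log-probs under the policy being updated
    logp_old:   log-probs under the policy that generated the rollouts
    advantages: critic-based advantage estimates (e.g. GAE), same shape
    """
    ratio = torch.exp(logp_new - logp_old)                 # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    # Pessimistic minimum of the two surrogates, negated for gradient descent.
    return -torch.min(unclipped, clipped).mean()
```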
Part 2
Modern post-training part 2: Direct Preference Optimization (DPO)
• 12 min read
Part 2 of the modern post-training series. DPO removes the RL machinery from PPO-RLHF (no critic, no advantage estimation, no rollouts) while preserving KL-regularized alignment through a closed-form supervised objective on preference pairs.
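As a rough illustration of that closed-form objective, a sketch of the standard DPO loss on summed response log-probs; the variable names and beta value are mine, not the post's:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: logistic loss on the gap between implicit rewards.

    Each argument is the summed log-prob of a full response (chosen or
    rejected) under the trained policy or the frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    # beta plays the role of the KL coefficient from the RLHF objective.
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```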
Part 3
Modern LLM post-training part 3: GRPO
• 15 min read
Part 3 of the modern post-training series. GRPO keeps PPO's optimizer but removes the critic: advantages are estimated from a group of rollouts per prompt instead of from a learned value function.
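A sketch of the group-relative advantage that replaces the learned value function (group size and epsilon are illustrative):

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for one prompt.

    rewards: 1-D tensor of scalar rewards for G rollouts sampled
             from the same prompt.
    """
    # Each rollout is scored relative to its own group -- no critic needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)
```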
Part 4
Modern LLM post-training part 4: GDPO
• 12 min read
Part 4 of the modern post-training series. GDPO fixes GRPO's reward signal collapse in multi-reward setups by normalizing each reward dimension separately before aggregation.
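A sketch of the per-dimension normalization idea as described in the summary; the final aggregation by simple sum is my assumption:

```python
import torch

def per_dimension_advantages(rewards, eps=1e-6):
    """Normalize each reward dimension across the group, then aggregate.

    rewards: tensor of shape (G, K) -- G rollouts for one prompt,
             K separate reward signals (e.g. correctness, format, length).
    """
    # Per-column normalization keeps a low-variance reward dimension from
    # being drowned out once the dimensions are combined into one scalar.
    normalized = (rewards - rewards.mean(dim=0)) / (rewards.std(dim=0) + eps)
    return normalized.sum(dim=1)  # sum aggregation is an illustrative choice
```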
Part 5
Bug-fixing agent: JSON tool-calling SFT
• 12 min read
An SFT plan to instruction-tune Qwen2.5-7B for JSON tool-protocol compliance. Covers step-level dataset extraction from teacher traces, data-quality filtering, QuixBugs train/test splits, and end-to-end pipeline commands.
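For a sense of the target format, a hypothetical step-level training record for JSON tool-protocol compliance; the schema, tool name, and file path below are illustrative, not the post's actual format:

```python
import json

# One supervised example: the assistant turn must be a well-formed tool call.
example = {
    "messages": [
        {"role": "system",
         "content": "You are a bug-fixing agent. Respond only with JSON tool calls."},
        {"role": "user",
         "content": "levenshtein returns wrong distances for empty strings."},
        {"role": "assistant",
         "content": json.dumps({
             "tool": "read_file",                      # hypothetical tool name
             "arguments": {"path": "python_programs/levenshtein.py"},
         })},
    ]
}
print(json.dumps(example, indent=2))
```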