Research Blog

Deep dives into machine learning research papers, mathematical derivations, and implementation details.

Essential PyTorch: the 5% that shows up in almost every serious model build
Research • 15 min read
PyTorch looks huge from the outside. In practice, I build Transformers, CLIP, diffusion, flows, and VAEs using a small set of tensor and module patterns over and over. This post is that core.
Hierarchical Reasoning Model (HRM): recursion, deep supervision, and why ACT is challenging
Research • 12 min read
A deep dive into HRM architecture—recursive refinement, deep supervision, DEQ-style gradients, and a mechanical breakdown of ACT's halt/continue logic.
Tiny Recursive Model (TRM)
Research • 10 min read
TRM simplifies HRM by removing fixed-point theory, eliminating ACT's extra forward pass, and backpropagating through full unrolled recursion with no-grad refinement passes.
Designing a Deterministic LLM Agent
Research • 12 min read
A coding agent is a stochastic model wrapped in a deterministic scaffold. This post walks through how to build that scaffold with a ledger-style History, derived summaries, allowlisted tools, driver-owned tests, and reflection triggers.
JanusFlow
Research • 10 min read
JanusFlow unifies image understanding and generation in a single LLM backbone using rectified flow. This post explains the architecture, training stages, representation alignment, and why the shared backbone matters.
LLM Repo Agent v2: What's changed since v1
Research • 8 min read
V2 brings Chat Completions adapters, JSON tool protocol, multi-turn message lists, prompt hardening with invariants, sandboxing, multithreading, Monte Carlo evaluation, and SFT/DPO fine-tuning pipelines.
Bug-fixing agent: JSON tool-calling SFT
Research • 12 min read
SFT plan to instruction-tune Qwen2.5-7B for JSON tool-protocol compliance. Covers step-level dataset extraction from teacher traces, data quality filtering, QuixBugs train/test splits, and end-to-end pipeline commands.
Modern post-training part 1: PPO—the ancestor algorithm
Research • 15 min read
Part 1 of a 4-part series on modern post-training. PPO-RLHF is the ancestor algorithm—three models, on-policy rollouts, and a critic for variance reduction. Understanding PPO explains what DPO, GRPO, and GDPO delete and why.
Modern post-training part 2: Direct Preference Optimization (DPO)
Research • 12 min read
Part 2 of the modern post-training series. DPO removes the RL machinery from PPO-RLHF—no critic, no advantage estimation, no rollouts—while preserving KL-regularized alignment through a closed-form supervised objective on preference pairs.
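As a taste of what part 2 covers, the closed-form DPO objective for a single preference pair fits in a few lines. This is a minimal sketch (function name and scalar sequence log-probs are illustrative, not from the post):

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log sigmoid(beta * margin),
    where the margin compares policy-vs-reference log-ratios for the
    chosen and rejected responses."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(margin); beta plays the role of the KL strength.
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When policy and reference agree, the margin is zero and the loss sits at log 2; the loss falls as the policy favors the chosen response more than the reference does.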
Modern LLM post-training part 3: GRPO
Research • 15 min read
Part 3 of the modern post-training series. GRPO keeps PPO's optimizer but removes the critic—advantages are estimated from a group of rollouts per prompt instead of learning a value function.
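The group-baseline trick at the heart of part 3 can be sketched in plain Python (names illustrative): z-score each rollout's reward within its prompt group, so the group mean replaces the learned value function.

```python
def group_advantages(rewards):
    """GRPO-style advantages: normalize each reward within its prompt
    group (subtract the group mean, divide by the group std), so no
    critic is needed as a baseline."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    if std == 0.0:
        # All rollouts scored the same: no learning signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]
```

Advantages in a group always sum to zero, which is exactly the degenerate case the GDPO post (part 4) picks up when several reward dimensions are mixed.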
Modern LLM post-training part 4: GDPO
Research • 12 min read
Part 4 of the modern post-training series. GDPO fixes GRPO's reward signal collapse in multi-reward setups by normalizing each reward dimension separately before aggregation.
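A minimal sketch of the per-dimension normalization idea the part 4 summary describes (the post has GDPO's exact formulation; names, the z-score choice, and the sum aggregation here are illustrative assumptions): normalize each reward dimension across the group first, then aggregate per rollout, so no single large-scale reward dimension drowns out the others.

```python
def normalize_then_aggregate(reward_matrix):
    """reward_matrix[i][j] is reward dimension j for rollout i.
    Z-score each dimension across the group, then sum per rollout,
    so every reward dimension contributes on a comparable scale."""
    n = len(reward_matrix)
    dims = len(reward_matrix[0])
    normed = [[0.0] * dims for _ in range(n)]
    for j in range(dims):
        col = [row[j] for row in reward_matrix]
        mean = sum(col) / n
        std = (sum((c - mean) ** 2 for c in col) / n) ** 0.5
        if std == 0.0:
            std = 1.0  # constant dimension: contributes nothing
        for i in range(n):
            normed[i][j] = (reward_matrix[i][j] - mean) / std
    return [sum(row) for row in normed]
```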