Bug-fixing agent: json tool calling SFT
TLDR
We’re using step-level supervised fine-tuning to teach a small model (Qwen2.5-7B) how to reliably act as a JSON tool-using repo agent, not to make it smart at bug fixing yet. The focus is on behavior: emitting strictly valid JSON tool calls, using the right tools with the right arguments, avoiding shallow loops, and reliably reaching write_file. We generate strong teacher traces on QuixBugs, break each rollout into individual agent steps, and train the model to predict the next tool call from the full conversation context, turning every step into a high-leverage training example. We filter aggressively for clean, successful tool executions and mostly train on passing trajectories to avoid teaching bad habits, then evaluate using reliability metrics like parse errors, loop rates, average steps, and write-file frequency, with pass rate as a secondary signal. This is a foundation-building phase: once the model can consistently survive and behave inside the tool loop, we can meaningfully optimize for intelligence and patch quality later.
Scope
This document is SFT-only. The objective is to instruction-tune a small model (target: Qwen2.5-7B, optionally 14B) so it can reliably operate inside this repo-agent under --tool-protocol=json.
Primary focus:
- Tool protocol compliance (valid JSON, correct tool names/args).
- Basic next-action policy (productive tool sequencing; reach `write_file` when appropriate).
Non-goal (for this phase):
- Maximizing patch correctness on harder corpora (that's a later phase once the model is reliably tool-using).
Problem Statement
Smaller models often fail before they even start:
- Invalid JSON tool calls → `parse_errors` → the driver can't execute tools.
- Hallucinated tool names/args or prose+JSON mixtures → tool execution fails.
- Shallow loops (e.g., repeated `read_file` with tiny `max_chars`) → never reaches `write_file`.
If the model can't reliably participate in the tool loop, "patch quality" is hard to measure because it never reaches patching.
Proposed Solution: Step-Level SFT From Teacher Traces
Generate teacher tool-using traces on QuixBugs, then extract step-level supervised examples where the target is the next JSON tool call.
Key design choice:
- We do not need 1K–3K unique tasks.
- We treat each agent step as one training example: `(conversation_so_far) → (next JSON tool_call)`.
Hypotheses / Success Criteria
Hypotheses
- H1 (Protocol): SFT reduces `parse_errors` by teaching strict JSON tool-call format.
- H2 (Trajectory): SFT improves the rate of "productive" sequences (more useful `grep`/`read_file`, more runs that reach `write_file`).
- H3 (Downstream): On held-out QuixBugs tasks, reliability metrics improve (fewer loops, fewer wasted steps). Pass rate may improve, but is not guaranteed.
Primary metrics (cheap + reliable)
- `parse_errors` ↓
- `loop_detections` ↓
- `avg_steps` ↓
- % runs that call `write_file` at least once ↑
- `tool_breakdown` becomes more "agent-like" (less thrash, more purposeful reads/greps)
Secondary metric (optional)
- `success` / pass rate on held-out tasks (requires running tests during eval)
Dataset: Exact Format + Examples
Dataset format (Together SFT chat JSONL)
One JSON object per line:
{"messages":[{"role":"system","content":"..."},{"role":"user","content":"..."},{"role":"assistant","content":"..."}]}
Under --tool-protocol=json:
- Tool calls are represented as assistant messages whose `content` is exactly one JSON object: `{"type":"tool_call","name":"read_file","args":{"rel_path":"python_programs/quicksort.py","max_chars":2000}}`
- Tool results are represented as user messages prefixed with `[tool_result]`: `[tool_result] ...tool output...`
Important constraint (current implementation):
`repo-agent sft-extract` currently emits tool_call steps only (it does not emit `final` examples).
We considered whether to emit final examples. For v1 SFT, the answer is no: as long as the main goal is tool-protocol compliance + next-action behavior, the highest-leverage improvement is reducing parse_errors and producing consistent tool_call JSON. Watch for a specific failure mode after SFT: tests pass (or the model should stop), yet the model never emits `{"type":"final",...}` and runs hit `max_iters`. If that shows up, the fix is straightforward:
- add a small number of final examples (e.g., 1 per successful rollout),
- extend `repo-agent sft-extract` to optionally emit final steps from trace `final` events, or
- add a driver policy like "stop when tests pass" (so termination doesn't depend on the model).
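The "1 final example per successful rollout" option can be sketched as follows. This is illustrative only: field names like `last_test_ok`, `messages`, and `summary` are assumptions, not the actual trace schema.

```python
import json

def emit_final_examples(runs):
    """Sketch: build one final-step SFT sample per successful rollout.

    Assumes each run is a dict with 'last_test_ok' (bool), 'messages'
    (the conversation the model saw at the end), and a 'summary' string.
    These field names are hypothetical, not the real trace schema.
    """
    samples = []
    for run in runs:
        if not run.get("last_test_ok"):
            continue  # only teach termination on passing rollouts
        final_action = {"type": "final", "summary": run.get("summary", "Tests pass.")}
        samples.append({
            "messages": run["messages"] + [
                {"role": "assistant", "content": json.dumps(final_action)}
            ]
        })
    return samples
```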
Example: first action
{
"messages": [
{"role": "system", "content": "You are a repo-fixing agent...\nTOOLS:\n[...]"},
{"role": "user", "content": "GOAL:\nFix the quicksort implementation in python_programs/quicksort.py so that the tests pass"},
{"role": "assistant", "content": "{\"type\":\"tool_call\",\"name\":\"list_files\",\"args\":{\"rel_dir\":\".\",\"max_files\":200}}"}
]
}
Example: later action (with tool result in context)
{
"messages": [
{"role": "system", "content": "You are a repo-fixing agent...\nTOOLS:\n[...]"},
{"role": "user", "content": "GOAL:\nFix the quicksort implementation in python_programs/quicksort.py so that the tests pass"},
{"role": "assistant", "content": "{\"type\":\"tool_call\",\"name\":\"list_files\",\"args\":{\"rel_dir\":\".\",\"max_files\":200}}"},
{"role": "user", "content": "[tool_result]\npython_programs/quicksort.py\npython_testcases/test_quicksort.py\n..."},
{"role": "assistant", "content": "{\"type\":\"tool_call\",\"name\":\"read_file\",\"args\":{\"rel_path\":\"python_programs/quicksort.py\",\"max_chars\":2000}}"}
]
}
How we construct examples (from traces)
repo-agent sft-extract builds one step sample by pairing:
- the most recent `llm_request.payload.messages` (the prompt the model saw),
- the subsequent `llm_action` tool call (must be a valid `{"type":"tool_call",...}`),
- and a corresponding `tool_result` event (so the action actually executed).
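The pairing logic above can be sketched like this; event field names (`kind`, `payload`, `action`) are assumptions mirroring the description, not the real trace schema in `extract.py`.

```python
import json

def extract_step_samples(events):
    """Sketch of the pairing logic: walk trace events in order, remember
    the latest llm_request, and emit a sample only when a valid llm_action
    tool call is followed by a tool_result (i.e., it actually executed)."""
    samples = []
    last_prompt = None
    pending = None  # (prompt_messages, action_dict) awaiting a tool_result
    for ev in events:
        kind = ev.get("kind")
        if kind == "llm_request":
            last_prompt = ev["payload"]["messages"]
        elif kind == "llm_action":
            action = ev.get("action", {})
            if last_prompt is not None and action.get("type") == "tool_call":
                pending = (last_prompt, action)
        elif kind == "tool_result" and pending is not None:
            prompt, action = pending
            samples.append({"messages": prompt + [
                {"role": "assistant", "content": json.dumps(action)}]})
            pending = None
    return samples
```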
Clarifications on filtering:
- "Only successful tool calls" means `tool_result.ok == true` at the step level via `--require-valid-tool-ok`; "only PASS rollouts" means `last_test.ok == true` at the run level via `--require-success` (the default starting point for cleaner targets).
- For v1 we don't need perfect optimality labels: the core objective is tool-call validity + basic agent behavior, so some "valid but slightly suboptimal" steps are acceptable noise in SFT.
- What we do want to avoid teaching are systematic bad habits (thrash/loops). The practical strategy: keep strict schema + `tool_result.ok` filters, drop steps after loop detection (or after obvious repeated calls), and, if needed, keep only the best successful rollouts per task (e.g., shortest PASS) instead of all PASS rollouts.
- To be more aggressive about removing "valid but suboptimal" tool calls without adding a judge model, the best heuristic is: per task, select the top-K successful rollouts by step count (or by "no loop detections") and extract SFT samples only from those.
- Bottom line: for this phase, minor suboptimality is fine as long as we filter out thrash and measure post-extract dataset quality; DPO (or more sophisticated filtering) is where you really optimize policy preferences.
Implementation reference: src/llm_repo_agent/sft/extract.py.
Data Quality Decisions (From QA)
Step-level invariants (what we keep)
Hard requirements (default):
- Target action is a valid JSON tool call:
  - `type == "tool_call"`
  - `name ∈ {list_files, read_file, grep, write_file}`
  - required arg keys present (per tool schema)
- Tool execution succeeded: `tool_result.ok == true`
  - This is enforced by `repo-agent sft-extract --require-valid-tool-ok` (default True).
- Tool outputs are truncated to a bounded size (`--max-context-chars`) to control sequence length.
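A minimal validator for the first invariant might look like this. The required-arg sets for `grep` and `write_file` are assumptions for illustration; the real schemas live in the repo's tool definitions.

```python
import json

# Required argument keys per tool. list_files/read_file keys are taken
# from the example calls in this document; grep and write_file keys are
# assumed and should be checked against the actual tool schemas.
REQUIRED_ARGS = {
    "list_files": {"rel_dir"},
    "read_file": {"rel_path"},
    "grep": {"pattern"},                     # assumption
    "write_file": {"rel_path", "content"},   # assumption
}

def is_valid_tool_call(content: str) -> bool:
    """Step-level invariant check: exactly one bare JSON object with
    type == "tool_call", an allowed tool name, and required arg keys."""
    try:
        obj = json.loads(content)
    except (json.JSONDecodeError, TypeError):
        return False  # prose, markdown fences, or truncated JSON all fail here
    if not isinstance(obj, dict) or obj.get("type") != "tool_call":
        return False
    name = obj.get("name")
    if name not in REQUIRED_ARGS:
        return False  # hallucinated tool name
    args = obj.get("args")
    return isinstance(args, dict) and REQUIRED_ARGS[name] <= set(args)
```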
Anti-thrash filter (decision):
- Do not train on obviously looping steps.
- Practical rule for v1: drop steps after the first "Loop detected…" driver note (and optionally drop exact repeats).
- This is not implemented in the current extractor yet, but it is a clear next improvement if needed.
Rollout-level invariants (what we keep)
We distinguish two training intents:
- Protocol compliance SFT (cheapest, highest leverage early)
- We do not require the whole rollout to pass tests.
- We keep "good prefix" steps as long as they satisfy step-level invariants above.
A prefix is the beginning of a rollout trajectory: iterations `0..t` (the first few steps), before later steps happen. A "good prefix step" is a step early in the rollout where the action is still clean and useful, even if the rollout later degrades (loops, thrash, bad patch, etc.). Concretely in this plan, a "good prefix step" is one that satisfies the step-level invariants: a valid JSON tool call (`type == "tool_call"`) with an allowed tool name and required args, successful tool execution (`tool_result.ok == true`), not an obvious repeat/thrash step, and (once implemented) occurring before the first "Loop detected…" driver note. This is why we can sometimes keep early steps from a rollout that eventually fails: early evidence-gathering calls can still be the right behavior to imitate.
The current `repo-agent sft-extract` implementation does not look at `driver_note` events (including "Loop detected…"), so it can't automatically cut off samples after loop detection. Today it only filters on rollout success (`--require-success`, default True) and per-step tool success (`--require-valid-tool-ok`, default True). "Drop postfix after loop detection" is the anti-thrash filter we would add if the initial dataset contains too much thrash. Implementation-wise, it's straightforward: set a `loop_detected = True` flag when a trace event has `kind == "driver_note"` and the note contains "Loop detected", then stop emitting further step samples for that run (or at least stop after the first occurrence).
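That cutoff is a few lines; here is a sketch under the same assumptions as the description (event fields `kind` and `note` are hypothetical names):

```python
def drop_after_loop_detection(events):
    """Anti-thrash filter sketch: keep a run's events only up to the first
    driver_note containing "Loop detected"; everything after is likely
    thrash and should not become training targets."""
    kept = []
    for ev in events:
        if ev.get("kind") == "driver_note" and "Loop detected" in ev.get("note", ""):
            break  # stop emitting step samples for this run
        kept.append(ev)
    return kept
```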
- Policy imitation SFT (closer to "solve QuixBugs")
- Prefer to train on successful rollouts (tests pass) to avoid learning teacher mistakes.
- This is enforced by `repo-agent sft-extract --require-success` (default True).
Decision for the first iteration:
- Start with `--require-success` enabled (cleaner targets) and `--test-policy on_write` (see below).
- If dataset volume is too low, relax to `--no-require-success` but keep the step-level invariants and anti-thrash filtering.
QuixBugs Task Split (Train/Test)
Why we split by task:
- Splitting by step leaks file paths and tool outputs across train/test and makes evaluation meaningless.
QuixBugs in your checkout:
- `python_programs/*.py`: 50 buggy programs
- `python_testcases/test_*.py`: 42 test files (best subset for pass/fail eval)
Decision: split on the ~42 tasks that have testcases so the eval metric "pass/fail" is available on held-out tasks.
Recommended split (pick one and stick to it):
- 70/30 for more stable eval, or
- 80/20 for more train data.
Make it deterministic (so comparisons are fair):
- hash-based split on `task_id`, or
- seeded shuffle with a fixed seed.
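A hash-based split is the simplest way to make the assignment deterministic across machines and runs (unlike Python's built-in `hash`, which is salted per process). A minimal sketch:

```python
import hashlib

def split_task(task_id: str, train_fraction: float = 0.7) -> str:
    """Deterministic hash-based split: the same task_id always lands in
    the same bucket, regardless of process, machine, or hash seed."""
    digest = hashlib.sha256(task_id.encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # map hash prefix to [0, 1]
    return "train" if bucket < train_fraction else "test"
```

Changing `train_fraction` to 0.8 gives the 80/20 variant without reshuffling which tasks are held out below the old threshold.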
Artifacts:
- `eval/suites/quixbugs_train_suite.json`
- `eval/suites/quixbugs_test_suite.json`
Teacher Trace Generation (How + Why)
Why --test-policy on_write helps (even though it doesn't guarantee pass)
When --test-policy on_write is enabled:
- After `write_file`, the driver runs tests and injects `[TESTS PASS/FAIL]` output into the next prompt context.
- A strong teacher is then more likely to produce "test-driven" corrective trajectories:
- smaller, targeted edits,
- grounded follow-up tool calls based on failure output,
- iterative fixes instead of stopping after a single guess.
This tends to produce better traces to imitate (especially if you later filter to successful rollouts).
Variance / temperature decisions
Concern: many rollouts may look similar on easy tasks (e.g., fix_gcd).
Decisions:
- We do not rely on high variance within a single easy task; diversity comes primarily from many tasks and different tool outputs.
- Use a small non-zero teacher temperature to allow alternate but still-correct routes:
- Teacher temperature: ~0.2 (safe range: 0.1–0.3).
- At `temperature=0.0`, changing seeds may have little effect; with `temperature>0`, seeds matter and variance increases.
- If you still need diversity, prefer:
- adding more tasks (expand from 30 → 42 tasks-with-tests),
- allocating more rollouts only to harder tasks,
- optionally using a second teacher model,
- and/or deduplicating the final dataset.
Pipeline: End-to-End Commands
1) Generate teacher rollouts + traces (train suite)
We can use repo-agent prefs as a rollout runner (it already runs multiple seeds and writes traces).
poetry run repo-agent prefs \
--suite eval/suites/quixbugs_train_suite.json \
--rollouts 8 \
--trace-dir runs/quixbugs_teacher_traces \
--out runs/quixbugs_teacher_traces/_prefs.jsonl \
--llm-provider together \
--model Qwen/Qwen2.5-72B-Instruct-Turbo \
--temperature 0.2 \
--seed 42 \
--max-workers 5 \
--tool-protocol json \
--test-policy on_write
Cheaper alternative (protocol-only traces):
- Use `--test-policy never`, then extract with `--no-require-success`.
2) Extract step-level SFT dataset from traces
Default (cleaner targets, success-only):
poetry run repo-agent sft-extract \
--trace-dir runs/quixbugs_teacher_traces \
--output runs/instruction_tuning/quixbugs_tool_sft_train.jsonl \
--format json
If you need more volume (allow failed rollouts, still requires tool_result.ok by default):
poetry run repo-agent sft-extract \
--trace-dir runs/quixbugs_teacher_traces \
--output runs/instruction_tuning/quixbugs_tool_sft_train.jsonl \
--format json \
--no-require-success
3) Dataset volume checkpoint (decision)
After extraction:
- Count lines in `runs/instruction_tuning/quixbugs_tool_sft_train.jsonl`.
- Optionally deduplicate by hashing `messages` and re-count.
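Both checks fit in one small script. A sketch (the JSONL schema is the `{"messages": [...]}` format defined earlier):

```python
import hashlib
import json

def count_and_dedupe(path: str):
    """Count JSONL samples and deduplicate by hashing the 'messages'
    field. Returns (total, unique) line counts."""
    seen = set()
    total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue  # ignore blank lines
            total += 1
            msgs = json.loads(line)["messages"]
            # canonical serialization so key order doesn't affect the hash
            key = hashlib.sha256(
                json.dumps(msgs, sort_keys=True).encode("utf-8")
            ).hexdigest()
            seen.add(key)
    return total, len(seen)
```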
If under target (2k–3k examples):
- add tasks (expand toward all 42 tasks-with-tests),
- increase rollouts only for harder tasks,
- add a second teacher model,
- or slightly increase teacher temperature within 0.1–0.3.
4) Fine-tune (Together SFT + LoRA)
poetry run python sft_finetune_quick.py \
--dataset runs/instruction_tuning/quixbugs_tool_sft_train.jsonl \
--model Qwen/Qwen2.5-7B-Instruct-Turbo \
--suffix qwen25-7b-json-tools-sft \
--epochs 1 \
--batch-size max \
--learning-rate 1e-5 \
--lora \
--watch
5) Evaluate on held-out QuixBugs tasks
poetry run repo-agent eval \
--suite eval/suites/quixbugs_test_suite.json \
--trace-dir runs/quixbugs_eval_ft \
--report runs/quixbugs_eval_ft/report.json \
--llm-provider together \
--model <YOUR_FINETUNED_MODEL_NAME> \
--tool-protocol json \
--num-workers 4 \
--test-policy on_write
Expected Dataset Size (Step-Level)
Approximation:
num_samples ≈ (#train_tasks) × (rollouts_per_task) × (avg_valid_tool_calls_per_rollout)
Example:
- 29 train tasks (70% of 42)
- 8 rollouts each
- ~10 valid tool calls per rollout → ~2320 samples
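The approximation as a quick sanity check:

```python
def expected_samples(train_tasks: int, rollouts: int, avg_valid_calls: int) -> int:
    """Dataset-size estimate: tasks x rollouts x valid tool calls per rollout."""
    return train_tasks * rollouts * avg_valid_calls

print(expected_samples(29, 8, 10))  # 2320
```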
Known Limitations / Risks (SFT-Only)
- The extractor emits tool-call steps only; if `final` behavior matters, we may need to add final-step SFT examples.
- `write_file` requires outputting full file content; it works for QuixBugs-sized files but doesn't scale as well as patch-based tools.
- SFT may make the model more reliable as an agent without making it much "smarter" at patching; that's expected and acceptable for phase 1.
Results and Discussion
Dataset Volume
We extracted ~1,600 training samples from teacher traces, significantly below the expected ~2,300. The shortfall occurred because fewer rollouts reached successful completion than anticipated, and current budget constraints prevented generating the additional teacher data needed to close the gap.
Evaluation Setup
We evaluated on `my_suite.json`, containing 3 tasks (quicksort, mergesort, gcd) that were present in the training data; this was a deliberate choice to test whether the model could at least fit the seen tasks. Each task ran 5 rollouts (15 total).
Commands used:
# Baseline
poetry run repo-agent eval \
--suite eval/suites/my_suite.json \
--trace-dir runs/qwen25_7b_it \
--report runs/qwen25_7b_it/report.json \
--llm-provider together \
--model Qwen/Qwen2.5-7B-Instruct-Turbo \
--tool-protocol json \
--rollouts 5 \
--num-workers 5 \
--print-mode standard
# SFT Model
poetry run repo-agent eval \
--suite eval/suites/my_suite.json \
--trace-dir runs/qwen25_7b_it_sft \
--report runs/qwen25_7b_it_sft/report.json \
--llm-provider together \
--model justinbarrye_c241/Qwen2.5-7B-Instruct-qwen25-7b-instruct-sft-pilot-0078c2e9-7ed87e84 \
--tool-protocol json \
--rollouts 5 \
--num-workers 5 \
--print-mode standard
Quantitative Results
| Metric | Baseline | SFT | Change |
|---|---|---|---|
| Success Rate | 20% (1/15) | 0% (0/15) | ↓ worse |
| Runs That Started | 7/15 (47%) | 5/15 (33%) | ↓ worse |
| Errored Before Starting | 8/15 (53%) | 10/15 (67%) | ↑ worse |
| Mid-Run Parse Errors | 8 | 10 | ↑ worse |
| Avg Steps per Run | 8.8 | 4.5 | ↓ (fails earlier) |
| Total Valid Tool Actions | 129 | 62 | ↓ 52% |
Metric definitions:
- Success Rate: Percentage of rollouts where the agent produced a patch that passed all tests.
- Runs That Started: Rollouts where the model produced a valid first action and entered the tool loop. This is a critical measure of basic JSON compliance.
- Errored Before Starting: Rollouts where the model failed to produce a valid first action—the LLM response couldn't be parsed as JSON at all. These runs never entered the tool loop.
- Mid-Run Parse Errors: Invalid JSON responses that occurred after the run successfully started. These are less severe than initial failures.
- Avg Steps per Run: Average number of agent iterations before termination. Lower values here indicate the model fails or gives up earlier, not efficiency.
- Total Valid Tool Actions: Sum of all successfully parsed and executed tool calls across all rollouts.
Key observation: The baseline model already had a significant JSON compliance problem—53% of runs couldn't even produce a valid first action. This is exactly why we attempted SFT in the first place. However, SFT made this worse: 67% of runs now fail before starting (up from 53%).
Failure Mode Analysis
Examining the llm_parse_error events in the trace logs reveals why both models fail, and why SFT made it worse.
Both models fail the same way: they output prose before the JSON tool call, which the parser rejects. The parser expects pure JSON with no preamble.
Baseline failures start with reasoning like:
"Based on the provided code, the current implementation of the gcd function seems to be correct and should pass the tests. However, to ensure that the tests are correctly set up..."
...followed by the JSON tool call inline.
SFT failures are more verbose and use markdown formatting:
"It seems that the shortest_paths_test.py file does not contain any tests for the gcd function. Given that, we should focus on the gcd function in gcd.py. The provided implementation looks correct, but we need to ensure that the tests for the gcd function are properly set up and run. Let's create a test case..."
...followed by a markdown code block (triple backticks with json language tag) containing the tool call. This markdown formatting breaks the parser entirely.
Key differences in SFT failure mode:
- More verbose: Multi-paragraph explanations before attempting any tool call
- Markdown code blocks: Wraps JSON in fenced code blocks, which the parser cannot handle
- Hallucinated actions: Sometimes writes Python code directly in the response instead of using the `write_file` tool.
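The failure mode is easy to reproduce with a strict parser. The sketch below shows why prose-prefixed output fails, plus an illustrative lenient fallback that salvages the embedded JSON object; this fallback is not the current driver behavior, just one possible mitigation.

```python
import json
import re

def strict_parse(response: str) -> dict:
    """The driver's parser (as described above) accepts only a bare JSON object."""
    return json.loads(response)

def lenient_extract(response: str):
    """Illustrative mitigation, not current behavior: salvage a JSON object
    embedded in prose or a markdown-fenced block before giving up."""
    match = re.search(r"\{.*\}", response, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None

# The observed failure shape: reasoning preamble followed by inline JSON.
prose = ('Let\'s inspect the file first. '
         '{"type":"tool_call","name":"read_file","args":{"rel_path":"gcd.py"}}')
```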
Discussion
Why did SFT make things worse?
The baseline model clearly had a JSON compliance problem (53% of runs failed before starting), which is exactly why we pursued SFT. Yet the finetuned model performed even worse. Here's why:
- Insufficient training data (due to budget): ~1,600 samples is far below the 6,000–10,000 range typically needed for reliable behavior change. With so few examples, the model likely overfit to spurious patterns without learning the core constraint ("output ONLY valid JSON, no prose").
- No negative examples: The training data only contained valid tool calls from successful rollouts. The model never learned what not to do: it never saw the "prose before JSON" failure mode and learned to avoid it. SFT on positive examples alone doesn't teach boundaries.
- Verbosity inheritance: The teacher model (72B) may have included reasoning in its outputs. Even if the JSON was successfully parsed (and thus included in training), surrounding context or patterns could have taught the student to be more verbose.
- Catastrophic forgetting: The narrow SFT task may have disrupted the base model's existing instruction-following capabilities. Qwen2.5-7B-Instruct has general chat abilities; aggressive finetuning on a small dataset can overwrite useful behaviors.
Hypothesis Evaluation
- H1 (Protocol compliance): ❌ REJECTED — Parse errors increased, not decreased
- H2 (Productive trajectories): ❌ REJECTED — Fewer valid tool actions, more early failures
- H3 (Downstream reliability): ❌ REJECTED — Pass rate dropped from 20% to 0%
Next Steps
- Scale up training data: This is almost certainly the primary culprit. Target 6,000–10,000 samples by expanding to all 42 QuixBugs tasks, increasing rollouts per task, and relaxing `--require-success` to include good-prefix steps from failed rollouts.
- Add negative examples: Include examples of the failure mode (prose before JSON) with corrected outputs, or use DPO to contrast good vs. bad outputs.
- Audit teacher outputs: Ensure the training data contains only pure JSON responses, no reasoning preambles.
- Reduce learning rate / epochs: Try more conservative hyperparameters to reduce catastrophic forgetting.
- Consider RLHF/DPO: Direct preference optimization may be better suited to teaching "don't do X" constraints than pure SFT.