Scaling Long-Horizon Coding Agents (28 minute read)
Meta researchers developed a framework that improves coding agents by summarizing their extended work sessions into reusable structured knowledge, achieving multi-point gains on SWE-Bench Verified and Terminal-Bench v2.0.
Deep dive
- The framework introduces two complementary scaling approaches: Recursive Tournament Voting (RTV) for parallel scaling and an adapted Parallel-Distill-Refine (PDR) for sequential scaling
- RTV recursively narrows a population of rollout summaries through small-group comparisons, similar to a tournament bracket (see the selection sketch after this list)
- PDR conditions new agent attempts on distilled summaries from previous rollouts, enabling knowledge transfer between sequential attempts (see the refinement loop sketched after this list)
- Structured summaries preserve salient hypotheses, progress tracking, and failure modes while discarding low-signal trace details (an illustrative schema follows this list)
- Claude-4.5-Opus achieved 77.6% on SWE-Bench Verified (up from 70.9%) using the mini-SWE-agent implementation
- On Terminal-Bench v2.0 with Terminus 1, performance jumped from 46.9% to 59.1%
- The research reframes test-time scaling for agents as fundamentally about representation, selection, and reuse rather than raw generation volume
- The 70-page paper includes extensive evaluation across multiple frontier coding agents and benchmark datasets
- Results suggest that effective knowledge representation between attempts is more valuable than simply running more parallel attempts
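To make the summary representation concrete, here is a minimal Python sketch of what such a structured summary might look like. The schema and field names (`hypotheses`, `progress`, `failure_modes`) are assumptions inferred from the paper's description, not its actual data format, and `llm_complete` is a hypothetical completion client.

```python
import json
from dataclasses import dataclass, field

@dataclass
class RolloutSummary:
    """Compact record of one agent rollout. Field names are illustrative;
    the paper's exact schema may differ."""
    task_id: str
    hypotheses: list[str] = field(default_factory=list)     # ideas the agent pursued
    progress: list[str] = field(default_factory=list)       # verified steps completed
    failure_modes: list[str] = field(default_factory=list)  # where attempts went wrong
    final_patch: str | None = None                          # candidate fix, if any

def summarize_rollout(task_id: str, trajectory: str, llm_complete) -> RolloutSummary:
    """Distill a long action/observation trace into a structured summary,
    dropping low-signal details such as raw tool output and retries.
    `llm_complete` is assumed to be a callable str -> str returning JSON."""
    prompt = (
        "Summarize this coding-agent trajectory as JSON with keys "
        "'hypotheses', 'progress', 'failure_modes', 'final_patch'.\n"
        + trajectory
    )
    data = json.loads(llm_complete(prompt))
    return RolloutSummary(
        task_id=task_id,
        hypotheses=data.get("hypotheses", []),
        progress=data.get("progress", []),
        failure_modes=data.get("failure_modes", []),
        final_patch=data.get("final_patch"),
    )
```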
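The tournament selection itself can be sketched in a few lines. This is an illustrative reconstruction rather than the paper's code: `judge` stands in for an LLM comparison call over compact summaries, and the group size of four is an arbitrary choice.

```python
import random

def rtv_select(candidates: list, judge, group_size: int = 4):
    """Recursive Tournament Voting (illustrative): compare candidates in
    small groups and advance each group's winner until one remains.
    `judge(group)` is assumed to return the index of the group's best
    candidate, e.g. by prompting an LLM with the compact summaries."""
    assert candidates, "need at least one candidate"
    pool = list(candidates)
    while len(pool) > 1:
        random.shuffle(pool)  # avoid ordering bias across rounds
        winners = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            # Singleton groups advance automatically; otherwise ask the judge.
            winners.append(group[0] if len(group) == 1 else group[judge(group)])
        pool = winners
    return pool[0]
```

With 16 rollouts and groups of four, selection costs five judge calls (four first-round groups, then a final), so comparison cost grows roughly linearly with the number of rollouts rather than quadratically as in all-pairs ranking.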
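Sequential scaling with PDR can likewise be sketched as a simple loop. Again a hedged reconstruction: `run_agent` and `distill` are hypothetical stand-ins for the agent harness and the summary-distillation step, and the round/width parameters are arbitrary.

```python
def pdr_rounds(task: str, run_agent, distill, rounds: int = 3, width: int = 4):
    """Parallel-Distill-Refine adapted to agents (illustrative). Each round
    launches `width` independent rollouts, all conditioned on a distilled
    brief of what earlier attempts learned. `run_agent(task, context)` is
    assumed to return a rollout summary; `distill(summaries)` compresses
    the accumulated summaries into a short brief."""
    context = ""  # first round starts with no prior knowledge
    summaries = []
    for _ in range(rounds):
        # Parallel phase: independent attempts seeded with the shared brief.
        summaries.extend(run_agent(task, context) for _ in range(width))
        # Distill phase: fold hypotheses, progress, and failure modes into
        # a compact brief that conditions the next round's attempts.
        context = distill(summaries)
    return summaries, context
```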
Decoder
- Test-time scaling: Improving model performance by using more computation during inference (when answering queries) rather than during training
- Rollout trajectories: The complete sequence of actions, observations, errors, and states an agent goes through while attempting to solve a problem
- SWE-Bench Verified: A benchmark dataset for evaluating coding agents on real-world software engineering tasks from GitHub issues
- Terminal-Bench: A benchmark for testing coding agents on terminal-based development tasks
- Recursive Tournament Voting (RTV): A selection method that repeatedly compares solutions in small groups, advancing each group's winner until the best candidate remains
- Parallel-Distill-Refine (PDR): A technique that generates multiple attempts in parallel, extracts key insights, and uses them to improve subsequent attempts
Original article
Scaling Test-Time Compute for Agentic Coding
Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked, or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of agent actions, observations, errors, and partial progress. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents across SWE-Bench Verified and Terminal-Bench v2.0. For example, with our method, Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and from 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.