Devoured - April 28, 2026
Batch API is terrible for one agent. It might be great for a fleet (6 minute read)

Anthropic's Batch API offers 50% cost savings but adds 90-120 second latency per turn, making it impractical for single agents but potentially powerful when pooling requests across agent fleets.

What: An experimental Python harness that routes every AI agent turn through Anthropic's asynchronous Batch API instead of synchronous endpoints, revealing the practical tradeoffs between cost and latency at different scales of agent deployment.
Why it matters: The findings challenge conventional wisdom about model routing: expensive slow models should use batch paths while cheap fast models stay synchronous, and the real economic value emerges when a smart proxy pools requests from multiple agents rather than optimizing for individual users.
Takeaway: Consider implementing fleet-level batching infrastructure if you're running multiple agents concurrently, or experiment with the open-source batching-harness on GitHub to understand the latency characteristics firsthand.
Deep dive
  • Single-agent batch usage adds 90-120 seconds per turn, turning a five-turn interaction into a ten-minute wait that makes interactive agents unusable
  • Haiku batches counterintuitively take longer than Sonnet or Opus batches, possibly because Haiku's fast synchronous execution leaves fewer idle scheduler windows for batch work
  • The conventional "cheap models for offline work" strategy inverts under batching: since latency is already high, route expensive models (Opus) through batch for maximum absolute savings while keeping fast models (Haiku) on synchronous paths
  • Real economic value emerges at three scale points: latency-insensitive workloads (overnight evals), parallel agent execution (20+ concurrent subagents), and fleet-level request pooling across independent harnesses
  • Batch and prompt caching discounts stack, and the 1-hour cache window (versus 5-minute default) creates opportunities for fleet-level optimizers to shape request timing for predictable cache hits
  • The optimal architecture is likely a smart local proxy (like the author's LunaRoute project) that transparently routes requests between sync and batch endpoints based on per-request latency tolerance
  • Existing agent harnesses could gain automatic 50% discounts without code changes by pointing their API base URL at an intelligent proxy that handles batching decisions invisibly
  • The fundamental insight is that the batching unit should be a fleet's worth of turns pooled by infrastructure, not individual user interactions
  • The batching-harness implementation uses rich for terminal UI and sandbox-runtime (bubblewrap/Seatbelt) for basic execution safety, distinct from the author's main AgentSH security project
  • Building fleet-level batching infrastructure moves beyond experimental 800-line scripts into production-grade routing and scheduling problems
Decoder
  • Batch API: Anthropic's asynchronous API endpoint that processes requests with up to 24-hour delay in exchange for 50% cost reduction compared to synchronous responses
  • Haiku/Sonnet/Opus: Anthropic's Claude model tiers, ranging from fast and cheap (Haiku) to slower, more capable, and more expensive (Opus)
  • Agent harness: The orchestration layer that manages the request-response loop between a user and an AI model, including tool execution and multi-turn conversations
  • Prompt caching: A feature that reuses previously processed prompt prefixes to reduce token processing costs and latency for requests with shared context
  • Tool loop: The iterative cycle where an AI model requests to execute functions (tools), the harness runs them, and returns results until the model signals completion
  • LunaRoute: The author's localhost proxy project that routes requests across multiple LLM providers, being extended to add intelligent batch-aware routing
Original article

Batch API is terrible for one agent. It might be great for a fleet.

What does an agent harness feel like when every model turn goes through Anthropic's Batch API instead of the synchronous endpoint?

Batches are 50% off. For anyone burning real money on agents (eval suites, background subagents, anything that runs unattended), half-price tokens are the kind of number that makes you stop and squint. The trade is latency: batches are asynchronous, with up to a 24-hour processing window.

So I built a tiny harness to find out what that actually feels like. The result is batching-harness, a single-file Python REPL that wraps every turn in a one-entry batch, polls until it ends, and runs the tool loop on top. About 800 lines. rich for the terminal UI, sandbox-runtime (bubblewrap on Linux, Seatbelt on macOS) to keep the bash tool from nuking my home directory, and a /stats panel that compares what I paid via batch against what I would have paid via the synchronous endpoint. The sandbox setup here is intentionally minimal: just enough to keep an experiment from going sideways. For real execution-layer security for AI agents across models, harnesses, and frameworks, that's AgentSH, my main project.

What I actually wanted to know

The experiment isn't whether the Batch API works. Anthropic's docs cover that fine. The interesting question is what the agent loop looks like when every turn is async.

So you sit at the prompt. You type something. The harness submits a one-entry batch and shows you a spinner with an elapsed counter. A minute or two later (usually 90 to 120 seconds), the batch ends. The model returns either text or a tool_use block. If it's a tool call, the harness runs it locally and submits another batch. Repeat until end_turn.
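That loop can be sketched in a few lines. This is a minimal sketch assuming the `anthropic` Python SDK's Message Batches interface; the model id, polling cadence, and helper names are illustrative, not lifted from the harness, and `client` is assumed to be an `anthropic.Anthropic()` instance created elsewhere.

```python
import time

def build_batch_request(messages, custom_id="turn-0",
                        model="claude-sonnet-4-5", max_tokens=1024):
    """Wrap a single Messages request as a one-entry batch item."""
    return {
        "custom_id": custom_id,
        "params": {"model": model, "max_tokens": max_tokens,
                   "messages": messages},
    }

def run_turn(client, messages, poll_interval=5.0):
    """Submit a one-entry batch, poll until it ends, return the message."""
    batch = client.messages.batches.create(
        requests=[build_batch_request(messages)])
    while batch.processing_status != "ended":  # the 90-120 second spinner
        time.sleep(poll_interval)
        batch = client.messages.batches.retrieve(batch.id)
    # One entry in, one result out.
    result = next(iter(client.messages.batches.results(batch.id)))
    return result.result.message  # text and/or tool_use content blocks
```

The harness then inspects the returned content blocks: if one is a tool_use, it runs the tool locally and loops with another single-entry batch until the model signals end_turn.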

That's it. The entire experience is "agent, but with a two-minute polling spinner between every turn."

Which is the wrong way to use batch. And that was the point.

What I observed

With parallel=1 (one request in flight at a time, like this harness), you lose most of the actual benefit of batching. You get the 50% discount, sure, but you're paying for it in wall-clock time on every single turn. Ninety to 120 seconds per turn turns a five-turn agent loop into a ten-minute exercise. For an interactive agent, that's terrible: nobody wants to wait two minutes to be told "I need to run ls."

There's also a counterintuitive thing I noticed and didn't expect: Haiku batches tend to take longer than Sonnet or Opus batches. One possibility (and it's just a guess) is that Haiku runs so fast on the synchronous path that there are simply fewer idle windows where the batch scheduler can slot work in. The cheaper, faster model ends up being the worse fit for batching, at least at the single-request volumes I was throwing at it. I haven't benchmarked this rigorously; it's a vibe from a few hours of poking. But if you were building routing logic on top of this, it's the kind of thing that would matter. You'd probably want to avoid batching Haiku and reserve the async path for the bigger, slower models where the queue wait is a smaller fraction of total turn time.

Which actually flips the usual intuition. If you're already eating the latency, you should point the async path at the smart models. The 50% discount has much more absolute leverage on Opus than on Haiku, and since speed isn't the binding constraint anymore, the case for picking the cheaper, dumber model evaporates. You take the better answer instead. The conventional "use cheap models for offline work" strategy gets inverted: cheap fast models stay on the sync path; expensive slow models go to batch.
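The inverted rule fits in a few lines. The tier heuristic and the substring matching below are illustrative stand-ins, not routing logic from the harness:

```python
# Slow, expensive tiers absorb the batch queue wait well; fast, cheap
# tiers are the ones where batching only adds dead time.
BATCH_TIERS = ("opus", "sonnet")
SYNC_TIERS = ("haiku",)

def pick_path(model: str) -> str:
    """Return 'batch' or 'sync' for a given model id."""
    name = model.lower()
    if any(tier in name for tier in SYNC_TIERS):
        return "sync"
    if any(tier in name for tier in BATCH_TIERS):
        return "batch"
    return "sync"  # unknown models default to the low-latency path
```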

When batching actually pays

The 50% discount is only worth the wait when something else is going on:

  • You don't care about latency. Overnight evals, scheduled audits, anything where "done in an hour" is fine.
  • You're running many agents in parallel. If you have 20 subagents working concurrently, batching them together (real batches, not single-entry ones) is where the throughput-per-dollar curve actually bends.
  • You're amortizing across multiple harnesses. Same idea, scaled out: pool requests from many independent agents into shared batch submissions and the economics start looking very different.
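The pooling idea in the last two bullets can be sketched as a small accumulator that collects turns from many agents and flushes them as one N-wide batch submission. This is a hypothetical shape, not code from the harness; in practice the flushed list would be handed to something like `client.messages.batches.create(requests=...)`.

```python
from dataclasses import dataclass, field

@dataclass
class BatchPool:
    """Pools per-agent requests into N-wide batch submissions."""
    max_size: int = 20
    pending: list = field(default_factory=list)

    def add(self, agent_id: str, params: dict):
        """Queue one agent's turn; flush when the pool fills."""
        self.pending.append({"custom_id": agent_id, "params": params})
        if len(self.pending) >= self.max_size:
            return self.flush()
        return None  # caller waits; a real pool would also flush on a timer

    def flush(self) -> list:
        """Hand back everything queued so far as one batch."""
        batch, self.pending = self.pending, []
        return batch
```

A production version would also flush on a deadline so a half-full pool doesn't stall agents indefinitely, which is exactly the kind of scheduling tradeoff the article's "infrastructure" framing points at.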

The third one is the part I find genuinely interesting. A single user at a single REPL is the worst case for batching. But a fleet of agents (your CI runs, your background research subagents, your team's automated workflows) could be pooled by a smart proxy and submitted as actual N-wide batches. That's a real cost lever, not a curiosity.

There's also a compounding effect with prompt caching that gets sharper at fleet scale. Agents in a fleet often share a lot of prompt structure (system prompts, tool definitions, common context). Batch and cache discounts already stack, and the 1-hour cache duration is worth considering for async workloads where related requests may land outside the default 5-minute window. The interesting question isn't whether the discounts compose. They do. It's whether a fleet-level batcher can shape request timing and shared prefixes well enough to make cache hits predictable. That's an operational problem, and it's the kind of thing a smart proxy could actually solve.
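Concretely, a fleet-level batcher would pin the shared prefix with a cache_control marker so every pooled request hits the same cached prefix. A sketch of what one batch entry might look like; the `"ttl": "1h"` field assumes Anthropic's extended cache TTL (a beta feature at the time of writing, gated behind a beta header), so verify against current docs before relying on it:

```python
# Shared across the whole fleet: identical prefix -> predictable cache hits.
FLEET_SYSTEM = [
    {
        "type": "text",
        "text": "Shared fleet system prompt, tool definitions, common context...",
        "cache_control": {"type": "ephemeral", "ttl": "1h"},  # vs 5-min default
    }
]

def fleet_request(agent_id: str, user_text: str,
                  model: str = "claude-opus-4-1") -> dict:
    """One batch entry whose cached prefix is identical fleet-wide."""
    return {
        "custom_id": agent_id,
        "params": {
            "model": model,
            "max_tokens": 1024,
            "system": FLEET_SYSTEM,
            "messages": [{"role": "user", "content": user_text}],
        },
    }
```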

What's next

I don't know if this turns into anything bigger. The version where it gets interesting is the multi-harness, multi-subagent fanout: pooling requests across independent agents and submitting them as real batches, with a router that decides which path to take per request based on latency tolerance. That's no longer an 800-line REPL. That's infrastructure.

The natural home for that routing logic is a local proxy. I've been hacking on LunaRoute (a localhost LLM proxy that sits in front of multiple model providers), and adding batch awareness to it is on the list. The shape of it: existing harnesses like Claude Code or Codex point their ANTHROPIC_BASE_URL at LunaRoute and never have to know batching exists. The proxy decides per request whether to pass through to the synchronous endpoint or quietly submit as a batch, then returns the completed response through the same client-facing interface when it lands. Harnesses that don't know about batching get the discount anyway. That's the version of this experiment I actually want to ship, but it's enough work that it deserves its own post. (More on that soon.)
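The per-request decision itself is simple; the hard part is everything around it. As a sketch of the proxy's choice, assuming a client hint header (the `x-latency-tolerance-s` name is invented for illustration, not a real LunaRoute or Anthropic header):

```python
def route_request(headers: dict, batch_overhead_s: float = 120.0) -> str:
    """Decide 'sync' vs 'batch' from a hypothetical client latency hint.

    Clients that declare no tolerance get the synchronous path, so
    harnesses that don't know batching exists are never slowed down.
    """
    tolerance = float(headers.get("x-latency-tolerance-s", 0))
    return "batch" if tolerance >= batch_overhead_s else "sync"
```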

For now, batching-harness is on GitHub under MIT. Clone it, set an Anthropic API key, and try it if you want to see this firsthand.

The most useful thing I learned wasn't about the Batch API itself. It was that the unit of "what to batch" probably isn't a single user's turn. It's a fleet's worth of turns, batched together by a layer the user never sees.