Devoured - April 28, 2026
Batch API is terrible for one agent. It might be great for a fleet (6 minute read)

Anthropic's Batch API offers 50% cost savings but adds 90-120 second latency per turn, making it impractical for single agents but potentially powerful when pooling requests across agent fleets.

What: An experimental Python harness that routes every AI agent turn through Anthropic's asynchronous Batch API instead of synchronous endpoints, revealing the practical tradeoffs between cost and latency at different scales of agent deployment.
Why it matters: The findings challenge conventional wisdom about model routing: expensive slow models should use batch paths while cheap fast models stay synchronous, and the real economic value emerges when a smart proxy pools requests from multiple agents rather than optimizing for individual users.
Takeaway: Consider implementing fleet-level batching infrastructure if you're running multiple agents concurrently, or experiment with the open-source batching-harness on GitHub to understand the latency characteristics firsthand.
Deep dive
  • Single-agent batch usage adds 90-120 seconds per turn, turning a five-turn interaction into a ten-minute wait that makes interactive agents unusable
  • Haiku batches counterintuitively take longer than Sonnet or Opus batches, possibly because Haiku's fast synchronous execution leaves fewer idle scheduler windows for batch work
  • The conventional "cheap models for offline work" strategy inverts under batching: since latency is already high, route expensive models (Opus) through batch for maximum absolute savings while keeping fast models (Haiku) on synchronous paths
  • Real economic value emerges at three scale points: latency-insensitive workloads (overnight evals), parallel agent execution (20+ concurrent subagents), and fleet-level request pooling across independent harnesses
  • Batch and prompt caching discounts stack, and the 1-hour cache window (versus 5-minute default) creates opportunities for fleet-level optimizers to shape request timing for predictable cache hits
  • The optimal architecture is likely a smart local proxy (like the author's LunaRoute project) that transparently routes requests between sync and batch endpoints based on per-request latency tolerance
  • Existing agent harnesses could gain automatic 50% discounts without code changes by pointing their API base URL at an intelligent proxy that handles batching decisions invisibly
  • The fundamental insight is that the batching unit should be a fleet's worth of turns pooled by infrastructure, not individual user interactions
  • The batching-harness implementation uses rich for terminal UI and sandbox-runtime (bubblewrap/Seatbelt) for basic execution safety, distinct from the author's main AgentSH security project
  • Building fleet-level batching infrastructure moves beyond experimental 800-line scripts into production-grade routing and scheduling problems
Decoder
  • Batch API: Anthropic's asynchronous API endpoint that processes requests with up to 24-hour delay in exchange for 50% cost reduction compared to synchronous responses
  • Haiku/Sonnet/Opus: Anthropic's Claude model tiers, ranging from fast and cheap (Haiku) to slower, more capable, and more expensive (Opus)
  • Agent harness: The orchestration layer that manages the request-response loop between a user and an AI model, including tool execution and multi-turn conversations
  • Prompt caching: A feature that reuses previously processed prompt prefixes to reduce token processing costs and latency for requests with shared context
  • Tool loop: The iterative cycle where an AI model requests to execute functions (tools), the harness runs them, and returns results until the model signals completion
  • LunaRoute: The author's localhost proxy project that routes requests across multiple LLM providers, being extended to add intelligent batch-aware routing
Original article

Batch API is terrible for one agent. It might be great for a fleet.

What does an agent harness feel like when every model turn goes through Anthropic's Batch API instead of the synchronous endpoint?

Batches are 50% off. For anyone burning real money on agents (eval suites, background subagents, anything that runs unattended), half-price tokens are the kind of number that makes you stop and squint. The trade is latency: batches are asynchronous, with up to a 24-hour processing window.

So I built a tiny harness to find out what that actually feels like. The result is batching-harness, a single-file Python REPL that wraps every turn in a one-entry batch, polls until it ends, and runs the tool loop on top. About 800 lines. rich for the terminal UI, sandbox-runtime (bubblewrap on Linux, Seatbelt on macOS) to keep the bash tool from nuking my home directory, and a /stats panel that compares what I paid via batch against what I would have paid via the synchronous endpoint. The sandbox setup here is intentionally minimal: just enough to keep an experiment from going sideways. For real execution-layer security for AI agents across models, harnesses, and frameworks, that's AgentSH, my main project.

What I actually wanted to know

The experiment isn't whether the Batch API works. Anthropic's docs cover that fine. The interesting question is what the agent loop looks like when every turn is async.

So you sit at the prompt. You type something. The harness submits a one-entry batch and shows you a spinner with an elapsed counter. A minute or two later (usually 90 to 120 seconds), the batch ends. The model returns either text or a tool_use block. If it's a tool call, the harness runs it locally and submits another batch. Repeat until end_turn.
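That loop can be sketched in a few lines. This is a minimal sketch assuming the `anthropic` Python SDK's Message Batches interface; the model id, polling cadence, and helper names are illustrative, not lifted from the harness, and `client` is assumed to be an `anthropic.Anthropic()` instance created elsewhere.

```python
import time

def build_batch_request(messages, custom_id="turn-0",
                        model="claude-sonnet-4-5", max_tokens=1024):
    """Wrap a single Messages request as a one-entry batch item."""
    return {
        "custom_id": custom_id,
        "params": {"model": model, "max_tokens": max_tokens,
                   "messages": messages},
    }

def run_turn(client, messages, poll_interval=5.0):
    """Submit a one-entry batch, poll until it ends, return the message."""
    batch = client.messages.batches.create(
        requests=[build_batch_request(messages)])
    while batch.processing_status != "ended":  # the 90-120 second spinner
        time.sleep(poll_interval)
        batch = client.messages.batches.retrieve(batch.id)
    # One entry in, one result out.
    result = next(iter(client.messages.batches.results(batch.id)))
    return result.result.message  # text and/or tool_use content blocks
```

The harness then inspects the returned content blocks: if one is a tool_use, it runs the tool locally and loops with another single-entry batch until the model signals end_turn.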

That's it. The entire experience is "agent, but with a two-minute polling spinner between every turn."

Which is the wrong way to use batch. And that was the point.

What I observed

With parallel=1 (one request in flight at a time, like this harness), you lose most of the actual benefit of batching. You get the 50% discount, sure, but you're paying for it in wall-clock time on every single turn. Ninety to 120 seconds per turn turns a five-turn agent loop into a ten-minute exercise. For an interactive agent, that's terrible: nobody wants to wait two minutes to be told "I need to run ls."

There's also a counterintuitive thing I noticed and didn't expect: Haiku batches tend to take longer than Sonnet or Opus batches. One possibility (and it's just a guess) is that Haiku runs so fast on the synchronous path that there are simply fewer idle windows where the batch scheduler can slot work in. The cheaper, faster model ends up being the worse fit for batching, at least at the single-request volumes I was throwing at it. I haven't benchmarked this rigorously; it's a vibe from a few hours of poking. But if you were building routing logic on top of this, it's the kind of thing that would matter. You'd probably want to avoid batching Haiku and reserve the async path for the bigger, slower models where the queue wait is a smaller fraction of total turn time.

Which actually flips the usual intuition. If you're already eating the latency, you should point the async path at the smart models. The 50% discount has much more absolute leverage on Opus than on Haiku, and since speed isn't the binding constraint anymore, the case for picking the cheaper, dumber model evaporates. You take the better answer instead. The conventional "use cheap models for offline work" strategy gets inverted: cheap fast models stay on the sync path; expensive slow models go to batch.
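The inverted rule fits in a few lines. The tier heuristic and the substring matching below are illustrative stand-ins, not routing logic from the harness:

```python
# Slow, expensive tiers absorb the batch queue wait well; fast, cheap
# tiers are the ones where batching only adds dead time.
BATCH_TIERS = ("opus", "sonnet")
SYNC_TIERS = ("haiku",)

def pick_path(model: str) -> str:
    """Return 'batch' or 'sync' for a given model id."""
    name = model.lower()
    if any(tier in name for tier in SYNC_TIERS):
        return "sync"
    if any(tier in name for tier in BATCH_TIERS):
        return "batch"
    return "sync"  # unknown models default to the low-latency path
```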

When batching actually pays

The 50% discount is only worth the wait when something else is going on:

  • You don't care about latency. Overnight evals, scheduled audits, anything where "done in an hour" is fine.
  • You're running many agents in parallel. If you have 20 subagents working concurrently, batching them together (real batches, not single-entry ones) is where the throughput-per-dollar curve actually bends.
  • You're amortizing across multiple harnesses. Same idea, scaled out: pool requests from many independent agents into shared batch submissions and the economics start looking very different.
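The pooling idea in the last two bullets can be sketched as a small accumulator that collects turns from many agents and flushes them as one N-wide batch submission. This is a hypothetical shape, not code from the harness; in practice the flushed list would be handed to something like `client.messages.batches.create(requests=...)`.

```python
from dataclasses import dataclass, field

@dataclass
class BatchPool:
    """Pools per-agent requests into N-wide batch submissions."""
    max_size: int = 20
    pending: list = field(default_factory=list)

    def add(self, agent_id: str, params: dict):
        """Queue one agent's turn; flush when the pool fills."""
        self.pending.append({"custom_id": agent_id, "params": params})
        if len(self.pending) >= self.max_size:
            return self.flush()
        return None  # caller waits; a real pool would also flush on a timer

    def flush(self) -> list:
        """Hand back everything queued so far as one batch."""
        batch, self.pending = self.pending, []
        return batch
```

A production version would also flush on a deadline so a half-full pool doesn't stall agents indefinitely, which is exactly the kind of scheduling tradeoff the article's "infrastructure" framing points at.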

The third one is the part I find genuinely interesting. A single user at a single REPL is the worst case for batching. But a fleet of agents (your CI runs, your background research subagents, your team's automated workflows) could be pooled by a smart proxy and submitted as actual N-wide batches. That's a real cost lever, not a curiosity.

There's also a compounding effect with prompt caching that gets sharper at fleet scale. Agents in a fleet often share a lot of prompt structure (system prompts, tool definitions, common context). Batch and cache discounts already stack, and the 1-hour cache duration is worth considering for async workloads where related requests may land outside the default 5-minute window. The interesting question isn't whether the discounts compose. They do. It's whether a fleet-level batcher can shape request timing and shared prefixes well enough to make cache hits predictable. That's an operational problem, and it's the kind of thing a smart proxy could actually solve.
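Concretely, a fleet-level batcher would pin the shared prefix with a cache_control marker so every pooled request hits the same cached prefix. A sketch of what one batch entry might look like; the `"ttl": "1h"` field assumes Anthropic's extended cache TTL (a beta feature at the time of writing, gated behind a beta header), so verify against current docs before relying on it:

```python
# Shared across the whole fleet: identical prefix -> predictable cache hits.
FLEET_SYSTEM = [
    {
        "type": "text",
        "text": "Shared fleet system prompt, tool definitions, common context...",
        "cache_control": {"type": "ephemeral", "ttl": "1h"},  # vs 5-min default
    }
]

def fleet_request(agent_id: str, user_text: str,
                  model: str = "claude-opus-4-1") -> dict:
    """One batch entry whose cached prefix is identical fleet-wide."""
    return {
        "custom_id": agent_id,
        "params": {
            "model": model,
            "max_tokens": 1024,
            "system": FLEET_SYSTEM,
            "messages": [{"role": "user", "content": user_text}],
        },
    }
```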

What's next

I don't know if this turns into anything bigger. The version where it gets interesting is the multi-harness, multi-subagent fanout: pooling requests across independent agents and submitting them as real batches, with a router that decides which path to take per request based on latency tolerance. That's no longer an 800-line REPL. That's infrastructure.

The natural home for that routing logic is a local proxy. I've been hacking on LunaRoute (a localhost LLM proxy that sits in front of multiple model providers), and adding batch awareness to it is on the list. The shape of it: existing harnesses like Claude Code or Codex point their ANTHROPIC_BASE_URL at LunaRoute and never have to know batching exists. The proxy decides per request whether to pass through to the synchronous endpoint or quietly submit as a batch, then returns the completed response through the same client-facing interface when it lands. Harnesses that don't know about batching get the discount anyway. That's the version of this experiment I actually want to ship, but it's enough work that it deserves its own post. (More on that soon.)
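The per-request decision itself is simple; the hard part is everything around it. As a sketch of the proxy's choice, assuming a client hint header (the `x-latency-tolerance-s` name is invented for illustration, not a real LunaRoute or Anthropic header):

```python
def route_request(headers: dict, batch_overhead_s: float = 120.0) -> str:
    """Decide 'sync' vs 'batch' from a hypothetical client latency hint.

    Clients that declare no tolerance get the synchronous path, so
    harnesses that don't know batching exists are never slowed down.
    """
    tolerance = float(headers.get("x-latency-tolerance-s", 0))
    return "batch" if tolerance >= batch_overhead_s else "sync"
```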

For now, batching-harness is on GitHub under MIT. Clone it, set an Anthropic API key, and try it if you want to see this firsthand.

The most useful thing I learned wasn't about the Batch API itself. It was that the unit of "what to batch" probably isn't a single user's turn. It's a fleet's worth of turns, batched together by a layer the user never sees.