Software Is Eating the World (But Actually This Time) (32 minute read)
AI agents are turning work itself into software loops that can read, reason, call tools, and verify autonomously, fundamentally changing which tasks consume inference and how much.
What: A deep analysis arguing that previous software automation only replaced interfaces while humans still did the work, but AI agents can now execute complete workflows through autonomous loops—customer support calls, insurance claims, code debugging—all running as multi-step inference processes consuming orders of magnitude more tokens than simple chat.
Why it matters: The piece explains why inference demand is exploding exponentially (Google saw 50x token growth year-over-year) and provides a framework for identifying which industries will be automated next: workflows that are "coding-shaped" with structured inputs, deterministic logic, and digital verification can sustain deep agent loops. As models commoditize, the defensible position shifts to apps that capture messy operational data from real-world agent executions.
Takeaway: Evaluate any workflow by asking how many autonomous steps an agent can take before needing human intervention and whether verification can happen digitally—these determine position on the "token ladder" and automation potential.
Deep dive
- The "software ate the world" narrative from 2011 was really about software eating interfaces and distribution (apps, websites, routing systems), while humans continued doing the actual work like analyzing documents, making decisions, and handling exceptions
- AI agents now execute complete workflows as code: a customer service call becomes speech recognition → account lookup via API → policy retrieval → reasoning about eligibility → refund trigger → text-to-speech response, all in an autonomous loop
- The "token ladder" shows how agentic tasks consume vastly more inference than simple chat: basic Q&A uses ~900 tokens, retrieval uses ~7,500 tokens, agentic support uses tens of thousands, and coding agents use hundreds of thousands to millions per task
- An 8-minute support call might have only 3,000 tokens of transcript but consume 40,000+ tokens when accounting for continuous orchestration, context replay, tool outputs, and parallel models for sentiment/compliance monitoring
- A coding agent fixing a race condition might produce only 500 tokens of visible code but burn ~900,000 tokens across 30 iterations of reading context, forming hypotheses, editing, running tests, and revising—three orders of magnitude more than the output
- Workloads get "eaten" when they're essentially state transitions plus exception handling, when inputs can be captured as text/voice/documents, and when verification can happen digitally rather than requiring weeks of physical validation
- METR data shows autonomous task horizons doubling every 131 days since 2023: GPT-4 handled 4-minute tasks, Claude 3.5 Sonnet reached 11 minutes, Claude 3.7 Sonnet hit 1 hour, o3 reached 2 hours, GPT-5 hit 3.5 hours, and Claude Opus 4.6 pushed toward 12 hours
- Longer task horizons directly multiply inference demand because models can stay in loops longer—each additional step means more context replay, tool output processing, and reasoning, often growing faster than linearly
- This creates a version of Jevons paradox: prices per million tokens are rising for frontier models, but value per million tokens rises faster because models can complete in one session what previously required dozens of brittle attempts or was impossible
- Market growth reflects three compounding curves: more users, more tasks per user being routed through models, and more tokens per task as models sustain deeper workflows—OpenAI processes 15B tokens/minute (up from 6B six months prior), Google went from 9.7T to 480T tokens/month in a year
- Industries most ready for automation sit where workflows are "coding-shaped" (structured inputs, deterministic logic, digital verification) and high-volume (healthcare admin, customer support, insurance claims)
- As models commoditize, defensible applications will be those that capture operational data invisible to benchmarks: tool calls, retries, escalations, corrections, and edge cases that reveal how specific workflows actually run in production
- The strategic advantage shifts from model access to accumulated knowledge of how this specific insurer handles claims, how this hospital processes denials, how this codebase breaks—proprietary operational context that improves agent performance over time
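The token-ladder arithmetic above can be sketched as a toy model: each loop iteration re-feeds the accumulated context, so totals grow superlinearly with step count. All per-step numbers below are illustrative assumptions chosen to land near the article's figures (~900 tokens for basic chat, ~900K for a 30-iteration coding loop), not measured values.

```python
# Toy model of why agentic loops burn far more tokens than their visible
# output: every iteration replays the accumulated context (prompt, prior
# tool outputs, prior reasoning) before adding new work to it.
# All parameter values are hypothetical, chosen only for illustration.

def loop_tokens(iterations, base_context, tool_output, reasoning):
    total = 0
    context = base_context
    for _ in range(iterations):
        total += context + reasoning          # full context replayed each step
        context += tool_output + reasoning    # state grows as the loop runs
    return total

# Bottom rung: one pass over a small prompt, no tools.
chat = loop_tokens(1, base_context=700, tool_output=0, reasoning=200)

# Top rung: a 30-iteration debugging loop over a large codebase context.
coding = loop_tokens(30, base_context=5_000, tool_output=1_000, reasoning=700)

print(chat)    # → 900
print(coding)  # → 910500
```

Even with modest per-step numbers, replay dominates: the ~500 tokens of visible code sit three orders of magnitude below the total, matching the ratio the piece describes.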
Decoder
- METR: AI safety research org that measures how long frontier models can autonomously handle multi-step tasks, calibrated against human expert time
- Token ladder: Framework ranking tasks by inference consumption, from ~900 tokens for basic chat up to millions for deep coding workflows
- Coding-shaped workflow: Tasks with structured inputs, deterministic decision logic, and digital verification that allow agents to loop autonomously for many steps
- Task horizon: How long a model can work autonomously on a task before needing human intervention, measured in equivalent human expert time
- Agentic loop: Autonomous execution cycle where an AI agent reads context, reasons, calls tools, verifies results, and revises iteratively until completing a task
- Context replay: The process of re-feeding accumulated conversation history and state to the model at each step, which multiplies token consumption in long-running tasks
- Jevons paradox: Economic principle where efficiency gains increase total consumption—here, better models use more tokens per task but deliver more value per token
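The METR doubling trend can be written as a simple formula, h(t) = h0 · 2^(t/131) with t in days. The starting point (GPT-4 at ~4 minutes) comes from the piece; the numbers below are a straight extrapolation of that trend, not a forecast.

```python
import math

# Task horizon under a fixed doubling period. The 4-minute baseline and
# 131-day doubling time are the figures cited in the article.
def horizon_minutes(days_elapsed, h0_minutes=4.0, doubling_days=131):
    return h0_minutes * 2 ** (days_elapsed / doubling_days)

# How many doublings separate a 4-minute model from a 12-hour one?
doublings = math.log2((12 * 60) / 4)      # ≈ 7.5
years_of_trend = doublings * 131 / 365    # ≈ 2.7 years

print(round(horizon_minutes(365), 1))     # one year out: → 27.6
```

At a 131-day doubling period, the jump from 4-minute tasks to ~12-hour tasks takes roughly 2.7 years of sustained trend, which is consistent with the 2023-onward timeline the article lays out.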
Original article
Software ate distribution, but most of the work was still done by humans. AI changes that - the work is now becoming software. Agents can read, reason, call tools, verify, revise, and perform long-running tasks. As models commoditize, the apps that capture messy operational data will be the ones to improve fastest and defend their position longest.