Software Is Eating the World (But Actually This Time) (32 minute read)
AI agents are turning work itself into software loops that can read, reason, call tools, and verify autonomously, fundamentally changing which tasks consume inference and how much.
What: A deep analysis arguing that previous software automation only replaced interfaces while humans still did the work, but AI agents can now execute complete workflows through autonomous loops—customer support calls, insurance claims, code debugging—all running as multi-step inference processes consuming orders of magnitude more tokens than simple chat.
Why it matters: The piece explains why inference demand is exploding exponentially (Google saw 50x token growth year-over-year) and provides a framework for identifying which industries will be automated next: workflows that are "coding-shaped" with structured inputs, deterministic logic, and digital verification can sustain deep agent loops. As models commoditize, the defensible position shifts to apps that capture messy operational data from real-world agent executions.
Takeaway: Evaluate any workflow by asking how many autonomous steps an agent can take before needing human intervention and whether verification can happen digitally—these determine position on the "token ladder" and automation potential.
Deep dive
- The "software ate the world" narrative from 2011 was really about software eating interfaces and distribution (apps, websites, routing systems), while humans continued doing the actual work like analyzing documents, making decisions, and handling exceptions
- AI agents now execute complete workflows as code: a customer service call becomes speech recognition → account lookup via API → policy retrieval → reasoning about eligibility → refund trigger → text-to-speech response, all in an autonomous loop
- The "token ladder" shows how agentic tasks consume vastly more inference than simple chat: basic Q&A uses ~900 tokens, retrieval uses ~7,500 tokens, agentic support uses tens of thousands, and coding agents use hundreds of thousands to millions per task
- An 8-minute support call might have only 3,000 tokens of transcript but consume 40,000+ tokens when accounting for continuous orchestration, context replay, tool outputs, and parallel models for sentiment/compliance monitoring
- A coding agent fixing a race condition might produce only 500 tokens of visible code but burn ~900,000 tokens across 30 iterations of reading context, forming hypotheses, editing, running tests, and revising—three orders of magnitude more than the output
- Workloads get "eaten" when they're essentially state transitions plus exception handling, when inputs can be captured as text/voice/documents, and when verification can happen digitally rather than requiring weeks of physical validation
- METR data shows autonomous task horizons doubling every 131 days since 2023: GPT-4 handled 4-minute tasks, Claude 3.5 Sonnet reached 11 minutes, Claude 3.7 Sonnet hit 1 hour, o3 reached 2 hours, GPT-5 hit 3.5 hours, and Claude Opus 4.6 pushed toward 12 hours
- Longer task horizons directly multiply inference demand because models can stay in loops longer—each additional step means more context replay, tool output processing, and reasoning, often growing faster than linearly
- This creates a version of Jevons paradox: prices per million tokens are rising for frontier models, but value per million tokens rises faster because models can complete in one session what previously required dozens of brittle attempts or was impossible
- Market growth reflects three compounding curves: more users, more tasks per user being routed through models, and more tokens per task as models sustain deeper workflows—OpenAI processes 15B tokens/minute (up from 6B six months prior), Google went from 9.7T to 480T tokens/month in a year
- Industries most ready for automation sit where workflows are "coding-shaped" (structured inputs, deterministic logic, digital verification) and high-volume (healthcare admin, customer support, insurance claims)
- As models commoditize, defensible applications will be those that capture operational data invisible to benchmarks: tool calls, retries, escalations, corrections, and edge cases that reveal how specific workflows actually run in production
- The strategic advantage shifts from model access to accumulated knowledge of how this specific insurer handles claims, how this hospital processes denials, how this codebase breaks—proprietary operational context that improves agent performance over time
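The token-ladder arithmetic above can be sketched as a toy model: each loop iteration re-feeds the accumulated context, so totals grow superlinearly with step count. All per-step numbers below are illustrative assumptions chosen to land near the article's figures (~900 tokens for basic chat, ~900K for a 30-iteration coding loop), not measured values.

```python
# Toy model of why agentic loops burn far more tokens than their visible
# output: every iteration replays the accumulated context (prompt, prior
# tool outputs, prior reasoning) before adding new work to it.
# All parameter values are hypothetical, chosen only for illustration.

def loop_tokens(iterations, base_context, tool_output, reasoning):
    total = 0
    context = base_context
    for _ in range(iterations):
        total += context + reasoning          # full context replayed each step
        context += tool_output + reasoning    # state grows as the loop runs
    return total

# Bottom rung: one pass over a small prompt, no tools.
chat = loop_tokens(1, base_context=700, tool_output=0, reasoning=200)

# Top rung: a 30-iteration debugging loop over a large codebase context.
coding = loop_tokens(30, base_context=5_000, tool_output=1_000, reasoning=700)

print(chat)    # → 900
print(coding)  # → 910500
```

Even with modest per-step numbers, replay dominates: the ~500 tokens of visible code sit three orders of magnitude below the total, matching the ratio the piece describes.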
Decoder
- METR: AI safety research org that measures how long frontier models can autonomously handle multi-step tasks, calibrated against human expert time
- Token ladder: Framework ranking tasks by inference consumption, from ~900 tokens for basic chat up to millions for deep coding workflows
- Coding-shaped workflow: Tasks with structured inputs, deterministic decision logic, and digital verification that allow agents to loop autonomously for many steps
- Task horizon: How long a model can work autonomously on a task before needing human intervention, measured in equivalent human expert time
- Agentic loop: Autonomous execution cycle where an AI agent reads context, reasons, calls tools, verifies results, and revises iteratively until completing a task
- Context replay: The process of re-feeding accumulated conversation history and state to the model at each step, which multiplies token consumption in long-running tasks
- Jevons paradox: Economic principle where efficiency gains increase total consumption—here, better models use more tokens per task but deliver more value per token
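The METR doubling trend can be written as a simple formula, h(t) = h0 · 2^(t/131) with t in days. The starting point (GPT-4 at ~4 minutes) comes from the piece; the numbers below are a straight extrapolation of that trend, not a forecast.

```python
import math

# Task horizon under a fixed doubling period. The 4-minute baseline and
# 131-day doubling time are the figures cited in the article.
def horizon_minutes(days_elapsed, h0_minutes=4.0, doubling_days=131):
    return h0_minutes * 2 ** (days_elapsed / doubling_days)

# How many doublings separate a 4-minute model from a 12-hour one?
doublings = math.log2((12 * 60) / 4)      # ≈ 7.5
years_of_trend = doublings * 131 / 365    # ≈ 2.7 years

print(round(horizon_minutes(365), 1))     # one year out: → 27.6
```

At a 131-day doubling period, the jump from 4-minute tasks to ~12-hour tasks takes roughly 2.7 years of sustained trend, which is consistent with the 2023-onward timeline the article lays out.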
Original article
Software ate distribution, but most of the work was still done by humans. AI changes that - the work is now becoming software. Agents can read, reason, call tools, verify, revise, and perform long-running tasks. As models commoditize, the apps that capture messy operational data will be the ones to improve fastest and defend their position longest.