Darwinian Specialization in AI (3 minute read)
The AI inference market is fragmenting into specialized segments for different workloads, creating opportunities for multiple infrastructure winners rather than a single dominant player.
Deep dive
- NVIDIA's data center revenue grew 17x in the three years following ChatGPT's launch, from $3.6B to $62.3B per quarter, demonstrating explosive inference market growth
- The fragmentation mirrors the database market evolution, where different workload requirements (real-time transactions vs batch analytics, ACID vs eventual consistency) created distinct product categories
- Real-time inference (sub-100ms) for voice assistants and autonomous vehicles requires geographically distributed infrastructure with dedicated capacity and no tolerance for batching delays
- Near-real-time (100ms-2s) serves most current LLM applications like chatbots and code completion, where batching and queuing can optimize throughput without degrading user experience
- Batch processing (seconds to hours) prioritizes cost efficiency over speed, running document processing and content generation on spot instances during off-peak hours
- Multimodal workloads face different bottlenecks: text models are memory-constrained by KV cache growth, while image/video generation is compute-bound (50 sequential passes per image)
- Edge inference has unique constraints including privacy requirements, connectivity limitations, and power budgets (Tesla's FSD chips draw 72 watts; Apple runs a 3B-parameter model on-device)
- The model ecosystem reflects this fragmentation: a few dominant LLMs with long half-lives coexist with 90,000+ image generation models on Hugging Face, each with different serving requirements
- No single architecture can simultaneously optimize for compute-heavy video generation, memory-intensive long-context windows, and power-constrained edge devices
- The $100B inference market fragmenting along these lines creates room for multiple specialized winners, each optimizing for specific workload characteristics
Decoder
- Inference: Running a trained AI model to generate predictions or outputs, as opposed to training the model initially
- KV cache: Key-value cache that stores previous context in language models so it does not have to be recomputed for each new token; it grows with conversation length
- Latency: The delay between sending a request and receiving a response, critical for user experience in real-time applications
- Batching: Processing multiple inference requests together to improve throughput and hardware utilization
- Quantized models: Models with reduced numerical precision (e.g., 8-bit instead of 32-bit) to decrease memory usage and increase speed on edge devices
- Modality: The type of data being processed (text, image, video, audio), each with different computational characteristics
- Spot instances: Cloud computing capacity sold at steep discounts when spare capacity is available, suitable for non-time-sensitive workloads
Original article
The inference market is the fastest-growing market in the world & it's splitting up. Each modality is developing its own inference stack.
NVIDIA's data center revenue was flat through 2022. Then ChatGPT launched. Three years later: 17x growth.
Databases did the same thing. What started as one market fragmented into relational, document, key-value, graph, time series, vector, & others. Each category reflects different workload requirements: real-time transactions vs batch analytics, ACID compliance vs eventual consistency.
The inference market is fragmenting for the same reason: workloads are different. Images & video are compute-heavy. Longer context windows demand more memory for KV cache. Edge devices have power constraints. A single architecture can't optimize for all of them.
The model ecosystem reflects this. A few dominant LLMs with long half-lives sit alongside 90,000+ image generation models on Hugging Face, with new variants appearing daily. Each model type has different serving requirements, which fragments the infrastructure. Today, we see these segments:
Latency Tiers: Real-Time, Near-Real-Time, & Batch
Latency defines three distinct segments. Real-time (sub-100ms) serves voice assistants, live translation, & autonomous vehicles. Users won't wait, so infrastructure must be geographically distributed with dedicated capacity.
Near-real-time (100ms-2s) covers chatbots, code completion, & search augmentation. Most LLM applications today operate here, where batching & queuing optimize throughput without degrading experience.
Batch (seconds to hours) handles document processing & content generation at scale. Cost efficiency matters more than speed, so workloads run during off-peak hours on spot instances.
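To make the batching & queuing idea concrete, here is a minimal sketch of a micro-batcher: requests queue up and are flushed either when the batch is full or when a small latency budget expires. The names and numbers (MAX_BATCH_SIZE, MAX_WAIT_MS, run_model) are illustrative assumptions, not taken from the article or any particular serving framework.

```python
import time
from queue import Queue, Empty

MAX_BATCH_SIZE = 8   # flush once this many requests are queued...
MAX_WAIT_MS = 200    # ...or once we've waited this long for more to arrive

def run_model(batch):
    # Stand-in for a real batched forward pass on the accelerator.
    return [f"response to: {prompt}" for prompt in batch]

def next_batch(requests: Queue) -> list:
    """Collect up to MAX_BATCH_SIZE requests, waiting at most MAX_WAIT_MS total."""
    batch = []
    deadline = time.monotonic() + MAX_WAIT_MS / 1000
    while len(batch) < MAX_BATCH_SIZE:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(requests.get(timeout=remaining))
        except Empty:
            break
    return batch

if __name__ == "__main__":
    q = Queue()
    for i in range(20):
        q.put(f"prompt {i}")
    while not q.empty():
        # Each micro-batch trades up to MAX_WAIT_MS of extra latency
        # for better hardware utilization.
        print(run_model(next_batch(q)))
```

Real-time workloads can't absorb that extra wait, which is why the sub-100ms tier needs dedicated capacity instead.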
Multimodal (Image, Video, Audio)
The bottleneck shifts. For chatbots, the problem is memory. The model holds the entire conversation in its head, & that memory grows with every turn. For image & video generation, the problem is raw compute. A single image requires 50 sequential passes through the model. Different architectures, different constraints, different infrastructure.
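A rough back-of-the-envelope calculation shows why the chatbot side is memory-bound. The configuration below (32 layers, 32 KV heads of dimension 128, 16-bit values) is an illustrative assumption for a mid-sized open model, not a figure from the article; the point is that KV-cache memory grows linearly with conversation length.

```python
# Each token in the conversation stores one key and one value vector per layer,
# so KV-cache memory grows linearly with context length.
# Illustrative configuration for a roughly 7B-parameter transformer:
num_layers = 32
num_kv_heads = 32
head_dim = 128
bytes_per_value = 2  # fp16 / bf16

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
print(f"{bytes_per_token / 1024:.0f} KiB of KV cache per token")  # ~512 KiB

for context_len in (1_000, 8_000, 32_000, 128_000):
    gib = bytes_per_token * context_len / 2**30
    print(f"{context_len:>7} tokens -> {gib:5.1f} GiB")
```

Image generation has the opposite profile: each of the 50 sequential passes runs the full model regardless of how long the prompt is, so the cost is dominated by compute rather than by a growing cache.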
Edge (On-Device & On-Premise)
Privacy requirements, connectivity constraints, & latency sensitivity push inference to edge devices. Mobile phones, industrial sensors, medical devices. Apple runs a 3-billion-parameter model on-device for Apple Intelligence. Tesla runs vision models on FSD chips drawing 72 watts. Quantized models, specialized chips, & limited memory create different optimization challenges than cloud inference.
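As a rough illustration of why quantization matters here, the arithmetic below estimates the memory needed just to hold the weights of a 3-billion-parameter model at different precisions; the precision choices are generic assumptions, not details of Apple's or Tesla's deployments.

```python
# Memory required just to store the weights of a 3B-parameter model
# at different numeric precisions (activations and KV cache are extra).
params = 3_000_000_000

for label, bits in (("fp32", 32), ("fp16", 16), ("int8", 8), ("int4", 4)):
    gib = params * bits / 8 / 2**30
    print(f"{label}: {gib:5.2f} GiB")
# -> fp32 ~11.2 GiB, fp16 ~5.6 GiB, int8 ~2.8 GiB, int4 ~1.4 GiB
```

On a device that shares a few gigabytes of RAM with everything else, only the aggressively quantized variants are practical, which is part of why edge inference optimizes so differently from cloud inference.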
The database market produced Oracle, MongoDB, Databricks, & Snowflake. A $100B inference market fragmenting the same way creates room for similar winners.