Building a fault-tolerant metrics storage system at Airbnb (9 minute read)
Airbnb describes how it built a metrics storage system that handles roughly 50 million samples per second, using multi-tenant isolation and resource guardrails so that no single service can degrade performance for the others.
What: Airbnb developed an internal metrics storage system that ingests approximately 50 million samples per second across 1.3 billion time series, using per-service tenancy and shuffle sharding for strict tenant isolation along with read/write guardrails.
Why it matters: The architecture demonstrates practical solutions to "noisy neighbor" problems in multi-tenant systems, where one service's metrics workload could otherwise degrade performance for all other services sharing the infrastructure.
Takeaway: If you build or operate shared infrastructure at scale, consider shuffle sharding and per-tenant resource limits to keep one tenant's overload from spreading to the rest.
Decoder
- Time series: A sequence of data points indexed by timestamp, commonly used for metrics like CPU usage or request counts over time
- Shuffle sharding: A technique that assigns each tenant its own pseudo-random subset of the shared resources, so two tenants rarely share the same full subset and a failure or overload caused by one tenant reaches only a small part of the fleet
- Multi-tenant isolation: An architectural pattern that keeps multiple services sharing the same infrastructure from interfering with each other's performance
- Guardrails: Automatic limits or controls that prevent individual tenants from over-consuming shared resources
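To make the shuffle sharding entry above concrete, here is a minimal Python sketch. It is not Airbnb's implementation: the function name `shuffle_shard`, the tenant names, and the node counts are all hypothetical, and seeding a random generator from a hash of the tenant ID is just one simple way to get a stable per-tenant subset.

```python
import hashlib
import random
from typing import List


def shuffle_shard(tenant_id: str, num_nodes: int, shard_size: int) -> List[int]:
    """Pick a stable pseudo-random subset of node indices for one tenant.

    Hypothetical helper for illustration only; not taken from the article.
    """
    if not 0 < shard_size <= num_nodes:
        raise ValueError("shard_size must be between 1 and num_nodes")

    # Derive a deterministic seed from the tenant ID so the same tenant
    # always maps to the same nodes, independent of process or host.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:8], "big")

    # Sample shard_size distinct nodes out of num_nodes.
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_nodes), shard_size))


if __name__ == "__main__":
    # Assumed fleet of 64 nodes with 4 nodes per tenant.
    for tenant in ("payments", "search", "listings"):
        print(tenant, shuffle_shard(tenant, num_nodes=64, shard_size=4))
```

With 8 nodes and a shard size of 2 there are 28 possible node pairs, so a tenant that overloads both of its nodes fully overlaps only with tenants that drew the same pair; larger fleets and shard sizes make a full overlap rarer still.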
Original article
Airbnb built an internal metrics storage system capable of ingesting ~50 million samples/sec across ~1.3 billion time series by introducing strict multi-tenant isolation (per-service tenancy, shuffle sharding) and guardrails on reads/writes to prevent any single workload from overwhelming the system.
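The summary does not say how the read/write guardrails are implemented. As one illustration of a per-tenant write limit, the Python sketch below uses a token bucket per tenant. The token bucket itself is an assumption swapped in for illustration, and the class names, rates, and tenant names are hypothetical.

```python
from typing import Dict


class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.last = 0.0         # timestamp of the previous call, in seconds

    def allow(self, cost: float, now: float) -> bool:
        # Refill according to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False


class WriteGuardrail:
    """Hypothetical per-tenant write limiter: one bucket per tenant."""

    def __init__(self, rate: float, burst: float) -> None:
        self.rate = rate
        self.burst = burst
        self._buckets: Dict[str, TokenBucket] = {}

    def admit(self, tenant: str, samples: int, now: float) -> bool:
        # Each tenant draws from its own bucket, so one tenant exhausting
        # its budget does not consume another tenant's allowance.
        bucket = self._buckets.setdefault(
            tenant, TokenBucket(self.rate, self.burst)
        )
        return bucket.allow(samples, now)


if __name__ == "__main__":
    # Assumed limits: 100 samples/sec sustained, bursts of up to 200.
    guard = WriteGuardrail(rate=100.0, burst=200.0)
    print(guard.admit("noisy-service", 200, now=0.0))  # uses the whole burst
    print(guard.admit("noisy-service", 1, now=0.0))    # over budget
    print(guard.admit("quiet-service", 50, now=0.0))   # separate budget
```

Passing `now` explicitly keeps the sketch deterministic; a real limiter would more likely read a monotonic clock and would need to handle concurrent callers.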