Building a fault-tolerant metrics storage system at Airbnb (9 minute read)

Data infrastructureobservabilitydevops Read original

Airbnb shares how they built a metrics storage system handling 50 million samples per second by implementing multi-tenant isolation and resource guardrails to prevent any single service from degrading system performance.

What: Airbnb developed an internal metrics storage system that ingests approximately 50 million samples per second across 1.3 billion time series, using per-service tenancy and shuffle sharding for strict tenant isolation along with read/write guardrails.

Why it matters: The architecture demonstrates practical solutions to "noisy neighbor" problems in multi-tenant systems, where one service's metrics workload could otherwise degrade performance for all other services sharing the infrastructure.

Takeaway: Consider implementing shuffle sharding and per-tenant resource limits if you're building or operating shared infrastructure systems at scale to prevent cascading failures.

Decoder

Time series: A sequence of data points indexed by timestamp, commonly used for metrics like CPU usage or request counts over time
Shuffle sharding: A technique that assigns each tenant to a random subset of resources, limiting the blast radius when failures occur
Multi-tenant isolation: Architectural pattern ensuring multiple services sharing infrastructure don't interfere with each other's performance
Guardrails: Automatic limits or controls that prevent individual tenants from over-consuming shared resources

Original article

Airbnb built an internal metrics storage system capable of ingesting ~50 million samples/sec across ~1.3 billion time series by introducing strict multi-tenant isolation (per-service tenancy, shuffle sharding) and guardrails on reads/writes to prevent any single workload from overwhelming the system.