Expedia's Service Telemetry Analyzer (6 minute read)
Expedia built a Service Telemetry Analyzer that uses LLMs to parse Datadog monitoring data and accelerate incident investigation workflows.
What: A tool developed by Expedia that combines large language models with Datadog telemetry to help teams diagnose and resolve production incidents faster, reducing both mean time to know and mean time to recover (MTTR).
Decoder
- LLM: Large Language Model, an AI system trained on vast amounts of text that can interpret and generate human-like text
- Telemetry data: Automated measurements and diagnostic information collected from systems (metrics, logs, traces) to monitor health and performance
- MTTR/Time to recover: Mean Time To Recover, the average time it takes to restore service after an incident
- Datadog: A popular cloud monitoring and observability platform that collects and analyzes application and infrastructure metrics
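To make the MTTR definition above concrete, here is a minimal sketch of the arithmetic: the average of recovery durations across incidents. The function name `mean_time_to_recover` and the sample timestamps are hypothetical illustrations, not taken from Expedia's tool or from Datadog.

```python
from datetime import datetime, timedelta

def mean_time_to_recover(incidents):
    """Return the average (resolved - started) duration over all incidents.

    `incidents` is a list of (started, resolved) datetime pairs.
    This shape is an assumption made for the example.
    """
    if not incidents:
        raise ValueError("need at least one incident")
    # Sum the individual recovery durations, starting from a zero timedelta.
    total = sum((resolved - started for started, resolved in incidents), timedelta())
    return total / len(incidents)

# Hypothetical incidents: one took 30 minutes to resolve, the other 90 minutes.
incidents = [
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 30)),
]

mttr = mean_time_to_recover(incidents)
print(mttr)  # 1:00:00
```

Mean time to know could be computed the same way by swapping the resolution timestamp for the moment the cause was identified.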
Original article
Expedia's Service Telemetry Analyzer uses LLMs plus Datadog's telemetry data to speed incident investigation and reduce time to know/recover.