How Linux 7.0 Broke PostgreSQL (9 minute read)
Linux 7.0's switch from PREEMPT_NONE to PREEMPT_LAZY scheduling cut PostgreSQL throughput in half by causing backends to spin on locks during memory page faults.
What: An AWS engineer discovered that PostgreSQL throughput dropped by roughly 50% on Linux 7.0 because a kernel scheduling change lengthened spinlock hold times when memory page faults occurred. Huge pages offer a workaround.
Why it matters: This reveals a real conflict between kernel optimization goals and database workload patterns, and affects production PostgreSQL deployments upgrading to newer Linux versions.
Takeaway: PostgreSQL administrators should enable huge pages in production (set huge_pages=on and configure huge pages at the OS level) to avoid this regression on Linux 7.0+.
Deep dive
- A benchmark on a 96-vCPU Graviton4 system showed PostgreSQL throughput dropping from 98,565 to 50,751 transactions per second between Linux 6.x and 7.0, with profiling revealing 55% of CPU time spent spinning inside a single lock function
- Linux 7.0 removed the PREEMPT_NONE scheduling option on modern architectures, leaving only PREEMPT_FULL and PREEMPT_LAZY, with most distributions defaulting to PREEMPT_LAZY as a supposed drop-in replacement for server workloads
- PostgreSQL's StrategyGetBuffer function uses a global spinlock to coordinate buffer pool access across hundreds of concurrent backends, with the assumption that lock holders will finish in nanoseconds
- The root cause is minor page faults occurring while a backend holds the spinlock: with a 120GB shared buffer pool and default 4KB memory pages, there are roughly 31 million potential first-touch page faults during a benchmark run
- Under PREEMPT_NONE, a backend triggering a page fault while holding the lock would handle it without being rescheduled, keeping the delay minimal; under PREEMPT_LAZY, the scheduler may preempt the lock holder mid-fault, extending hold time from microseconds to milliseconds
- The preemption delay is multiplied across all spinning backends: if the lock holder is delayed by t milliseconds, each of the hundreds of waiting backends burns t milliseconds of CPU time spinning, creating massive waste on high-concurrency workloads
- Switching to 2MB huge pages reduces potential page faults from 31 million to ~61,000, while 1GB huge pages reduce it to just 120, effectively eliminating the problem and restoring performance
- Huge pages also reduce TLB pressure since far fewer translation entries are needed to cover the same memory region, avoiding expensive page table walks on hot paths
- The tradeoff is that huge pages must be pre-allocated and reserved upfront, making that memory unavailable to other processes even if unused, plus potential waste if only a fraction of each huge page is utilized
- An Intel kernel engineer proposed that PostgreSQL adopt Restartable Sequences (rseq) to detect and retry preempted critical sections, but the PostgreSQL community is pushing back on changing its code to work around a kernel regression
- The debate centers on Linux's "don't break userspace" principle: software that worked correctly before a kernel upgrade should continue working after, rather than requiring application-level workarounds
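
The page counts and the "multiplied delay" argument in the list above are simple arithmetic, and a short sketch makes them easy to check. The function names, the 200-backend figure, and the 1 ms delay below are illustrative assumptions, not values from the benchmark; only the 120GB pool and the three page sizes come from the summary.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def pages_needed(region_bytes, page_bytes):
    """Number of pages covering a region, i.e. the upper bound on first-touch faults."""
    # Ceiling division without floating point.
    return -(-region_bytes // page_bytes)


def wasted_cpu_ms(holder_delay_ms, spinning_backends):
    """CPU time burned by waiters while the lock holder is stalled.

    Simplified model (an assumption): every waiter spins for the full delay.
    """
    return holder_delay_ms * spinning_backends


shared_buffers = 120 * GIB  # pool size quoted in the summary

for label, page_size in [("4KB", 4 * KIB), ("2MB", 2 * MIB), ("1GB", GIB)]:
    count = pages_needed(shared_buffers, page_size)
    print(f"{label} pages: {count:,} potential first-touch faults")

# Hypothetical numbers: one holder stalled for 1 ms with 200 backends spinning.
print(f"wasted CPU: {wasted_cpu_ms(1, 200)} ms")
```

The three page counts reproduce the figures quoted above (about 31 million, about 61,000, and 120). The waste model is deliberately crude: it ignores backoff and sleeping inside the spinlock implementation, so treat it as an upper-bound intuition rather than a measurement.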
Decoder
- PREEMPT_NONE: kernel scheduling mode where threads run until they voluntarily give up CPU (via syscall, I/O, or sleep), minimizing context switches for maximum throughput
- PREEMPT_LAZY: kernel scheduling mode that can interrupt threads but tries to wait for natural boundaries, intended as a throughput-friendly replacement for PREEMPT_NONE
- Spinlock: locking mechanism where waiting threads actively loop checking for lock availability rather than sleeping, efficient only when lock holders finish in nanoseconds
- StrategyGetBuffer: PostgreSQL function responsible for finding a buffer slot to store a data page, protected by a single global spinlock that becomes a contention point under high parallelism
- Minor page fault: occurs when a process accesses virtual memory that's allocated but not yet mapped to physical memory, requiring the kernel to allocate and map a physical page (takes microseconds)
- TLB (Translation Lookaside Buffer): hardware cache that stores recent virtual-to-physical address translations, avoiding expensive page table walks; misses require walking multi-level page tables in memory
- Huge pages: larger-than-default memory pages (2MB or 1GB vs 4KB) that reduce the number of page table entries and TLB pressure, pre-allocated and reserved by the kernel
- pgbench: PostgreSQL's standard benchmarking tool for measuring transaction throughput under various workloads
- Restartable Sequences (rseq): Linux kernel facility allowing userspace code to detect if it was preempted during a critical section and restart the operation
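
Because huge pages must be reserved ahead of time, it helps to confirm the reservation is large enough before relying on it. On Linux, the kernel reports the pool in /proc/meminfo through fields such as HugePages_Total, HugePages_Free, and Hugepagesize. The sketch below parses text in that format; the sample values and the helper names are made up for illustration, and the check deliberately ignores the extra shared memory PostgreSQL needs beyond the buffer pool itself.

```python
import re

GIB = 1024 ** 3

# Hypothetical excerpt in /proc/meminfo format; the numbers are invented.
SAMPLE_MEMINFO = """\
MemTotal:       197951532 kB
HugePages_Total:   62000
HugePages_Free:    61800
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
"""


def parse_hugepage_info(meminfo_text):
    """Extract the huge page counters and page size (in kB) from meminfo text."""
    fields = {}
    for line in meminfo_text.splitlines():
        match = re.match(r"^(HugePages_\w+|Hugepagesize):\s+(\d+)", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def free_pool_covers(fields, region_bytes):
    """True if the free huge pages alone could back a region of the given size."""
    page_bytes = fields["Hugepagesize"] * 1024  # field is reported in kB
    return fields["HugePages_Free"] * page_bytes >= region_bytes


info = parse_hugepage_info(SAMPLE_MEMINFO)
print(info)
print("covers 120 GiB:", free_pool_covers(info, 120 * GIB))
```

To run the same check on a live machine, pass `open("/proc/meminfo").read()` instead of the sample string. A real deployment would want headroom above the buffer pool size, since the server's shared memory segment is somewhat larger than the buffer pool alone.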
Original article
Linux 7.0 accidentally cut PostgreSQL performance in half because a scheduling change increased how long spinlocks were held during memory page faults, causing massive CPU waste, and switching to huge memory pages fixes the issue.