How Linux 7.0 Broke PostgreSQL (9 minute read)

Data infrastructurelinuxdatabase Read original

Linux 7.0's switch from PREEMPT_NONE to PREEMPT_LAZY scheduling cut PostgreSQL throughput in half by causing backends to spin on locks during memory page faults.

What: An AWS engineer discovered that PostgreSQL performance dropped 50% on Linux 7.0 due to a kernel scheduling change that increased spinlock hold times when memory page faults occurred, with huge pages offering a workaround.

Why it matters: This reveals a real conflict between kernel optimization goals and database workload patterns, and affects production PostgreSQL deployments upgrading to newer Linux versions.

Takeaway: PostgreSQL administrators should enable huge pages in production (set huge_pages=on and configure huge pages at the OS level) to avoid this regression on Linux 7.0+.

Deep dive

Benchmark on 96-vCPU Graviton4 showed PostgreSQL throughput dropped from 98,565 to 50,751 transactions per second between Linux 6.x and 7.0, with profiling revealing 55% of CPU time spent spinning inside a single lock function
Linux 7.0 removed PREEMPT_NONE scheduling option on modern architectures, leaving only PREEMPT_FULL and PREEMPT_LAZY, with most distributions defaulting to PREEMPT_LAZY as a supposed drop-in replacement for server workloads
PostgreSQL's StrategyGetBuffer function uses a global spinlock to coordinate buffer pool access across hundreds of concurrent backends, with the assumption that lock holders will finish in nanoseconds
The root cause is minor page faults occurring while a backend holds the spinlock: with a 120GB shared buffer pool and default 4KB memory pages, there are roughly 31 million potential first-touch page faults during a benchmark run
Under PREEMPT_NONE, a backend triggering a page fault while holding the lock would handle it without being rescheduled, keeping the delay minimal; under PREEMPT_LAZY, the scheduler may preempt the lock holder mid-fault, extending hold time from microseconds to milliseconds
The preemption delay is multiplied across all spinning backends, so if one backend is delayed by t milliseconds, hundreds of other backends each burn t CPU cycles waiting, creating massive waste on high-concurrency workloads
Switching to 2MB huge pages reduces potential page faults from 31 million to ~61,000, while 1GB huge pages reduce it to just 120, effectively eliminating the problem and restoring performance
Huge pages also reduce TLB pressure since far fewer translation entries are needed to cover the same memory region, avoiding expensive page table walks on hot paths
The tradeoff is that huge pages must be pre-allocated and reserved upfront, making that memory unavailable to other processes even if unused, plus potential waste if only a fraction of each huge page is utilized
Intel kernel engineer proposed PostgreSQL adopt Restartable Sequences (rseq) to detect and retry preempted critical sections, but the PostgreSQL community pushes back on changing their code to work around a kernel regression
The debate centers on Linux's "don't break userspace" principle: software that worked correctly before a kernel upgrade should continue working after, rather than requiring application-level workarounds

Decoder

PREEMPT_NONE: kernel scheduling mode where threads run until they voluntarily give up CPU (via syscall, I/O, or sleep), minimizing context switches for maximum throughput
PREEMPT_LAZY: kernel scheduling mode that can interrupt threads but tries to wait for natural boundaries, intended as a throughput-friendly replacement for PREEMPT_NONE
Spinlock: locking mechanism where waiting threads actively loop checking for lock availability rather than sleeping, efficient only when lock holders finish in nanoseconds
StrategyGetBuffer: PostgreSQL function responsible for finding a buffer slot to store a data page, protected by a single global spinlock that becomes a contention point under high parallelism
Minor page fault: occurs when a process accesses virtual memory that's allocated but not yet mapped to physical memory, requiring the kernel to allocate and map a physical page (takes microseconds)
TLB (Translation Lookaside Buffer): hardware cache that stores recent virtual-to-physical address translations, avoiding expensive page table walks; misses require walking multi-level page tables in memory
Huge pages: larger-than-default memory pages (2MB or 1GB vs 4KB) that reduce the number of page table entries and TLB pressure, pre-allocated and reserved by the kernel
pgbench: PostgreSQL's standard benchmarking tool for measuring transaction throughput under various workloads
Restartable Sequences (rseq): Linux kernel facility allowing userspace code to detect if it was preempted during a critical section and restart the operation

Original article

Linux 7.0 accidentally cut PostgreSQL performance in half because a scheduling change increased how long spinlocks were held during memory page faults, causing massive CPU waste, and switching to huge memory pages fixes the issue.