Devoured - April 30, 2026
How Linux 7.0 Broke PostgreSQL (9 minute read)

Linux 7.0's switch from PREEMPT_NONE to PREEMPT_LAZY scheduling roughly halved PostgreSQL throughput: backends now spin far longer on a lock whose holder can be preempted in the middle of a memory page fault.

What: An AWS engineer discovered that PostgreSQL throughput dropped by roughly half on Linux 7.0 because a kernel scheduling change lengthened spinlock hold times whenever the lock holder took a memory page fault; huge pages offer a workaround.
Why it matters: This reveals a real conflict between kernel optimization goals and database workload patterns, and affects production PostgreSQL deployments upgrading to newer Linux versions.
Takeaway: PostgreSQL administrators should enable huge pages in production (set huge_pages=on and configure huge pages at the OS level) to avoid this regression on Linux 7.0+.
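
As a rough sizing aid for the OS-level step, the sketch below estimates how many huge pages to reserve (for example through the `vm.nr_hugepages` sysctl) for a given amount of shared memory. The function name and the `headroom_percent` parameter are hypothetical, and PostgreSQL's full shared memory segment is somewhat larger than `shared_buffers` alone, so treat the result as a starting point rather than an exact figure.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3


def huge_pages_to_reserve(shared_memory_bytes, huge_page_bytes=2 * MIB,
                          headroom_percent=0):
    """Estimate a huge page count for a shared memory region.

    headroom_percent is a hypothetical safety margin (an assumption, not a
    PostgreSQL setting) for shared memory beyond shared_buffers.
    """
    # Integer arithmetic throughout, so the rounding is exact.
    needed = shared_memory_bytes * (100 + headroom_percent)
    per_page = huge_page_bytes * 100
    # Ceiling division: a partially used huge page still has to be reserved.
    return -(-needed // per_page)


if __name__ == "__main__":
    # The 120GB buffer pool from the benchmark, with 2MB huge pages.
    print(huge_pages_to_reserve(120 * GIB))
    # The same pool with a 5% margin (hypothetical value).
    print(huge_pages_to_reserve(120 * GIB, headroom_percent=5))
```

The margin is only an illustration; measuring the real shared memory size of the running server gives a more reliable number than any fixed percentage.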
Deep dive
  • A benchmark on a 96-vCPU Graviton4 machine showed PostgreSQL throughput dropping from 98,565 to 50,751 transactions per second between Linux 6.x and 7.0, with profiling revealing 55% of CPU time spent spinning inside a single lock function
  • Linux 7.0 removed the PREEMPT_NONE scheduling option on modern architectures, leaving only PREEMPT_FULL and PREEMPT_LAZY, with most distributions defaulting to PREEMPT_LAZY as a supposed drop-in replacement for server workloads
  • PostgreSQL's StrategyGetBuffer function uses a global spinlock to coordinate buffer pool access across hundreds of concurrent backends, with the assumption that lock holders will finish in nanoseconds
  • The root cause is minor page faults occurring while a backend holds the spinlock: with a 120GB shared buffer pool and default 4KB memory pages, there are roughly 31 million potential first-touch page faults during a benchmark run
  • Under PREEMPT_NONE, a backend triggering a page fault while holding the lock would handle it without being rescheduled, keeping the delay minimal; under PREEMPT_LAZY, the scheduler may preempt the lock holder mid-fault, extending hold time from microseconds to milliseconds
  • The preemption delay is multiplied across all spinning backends: if the lock holder is delayed by t milliseconds, hundreds of other backends each burn t milliseconds of CPU time waiting, creating massive waste on high-concurrency workloads
  • Switching to 2MB huge pages reduces potential page faults from 31 million to ~61,000, while 1GB huge pages reduce them to just 120, effectively eliminating the problem and restoring performance
  • Huge pages also reduce TLB pressure since far fewer translation entries are needed to cover the same memory region, avoiding expensive page table walks on hot paths
  • The tradeoff is that huge pages must be pre-allocated and reserved upfront, making that memory unavailable to other processes even if unused, plus potential waste if only a fraction of each huge page is utilized
  • An Intel kernel engineer proposed that PostgreSQL adopt Restartable Sequences (rseq) to detect and retry preempted critical sections, but the PostgreSQL community has pushed back on changing its code to work around a kernel regression
  • The debate centers on Linux's "don't break userspace" principle: software that worked correctly before a kernel upgrade should continue working after, rather than requiring application-level workarounds
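
The page-count figures and the "multiplied delay" argument above can be checked with a few lines of arithmetic. This is a minimal sketch, not the article's own code: the function names are hypothetical, the fault count is an upper bound of one first-touch fault per page, and the 2 ms delay and 200 spinning backends in the last line are made-up illustrative inputs.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3

PAGE_SIZES = {"4KB": 4 * KIB, "2MB": 2 * MIB, "1GB": GIB}


def first_touch_faults(pool_bytes, page_bytes):
    """Upper bound on first-touch page faults: one per page in the pool."""
    return -(-pool_bytes // page_bytes)  # ceiling division


def wasted_cpu_ms(holder_delay_ms, spinning_backends):
    """CPU time burned by spinners while the lock holder is delayed.

    Simplified model (an assumption): every spinner spins for the whole delay.
    """
    return holder_delay_ms * spinning_backends


if __name__ == "__main__":
    pool = 120 * GIB  # shared buffer pool size from the benchmark
    for name, size in PAGE_SIZES.items():
        print(f"{name} pages: {first_touch_faults(pool, size):,} potential faults")
    # Hypothetical inputs: a 2 ms holder delay with 200 backends spinning.
    print(f"wasted CPU: {wasted_cpu_ms(2.0, 200)} ms")
```

Under this model the waste grows linearly with the number of spinning backends, which is why the regression shows up most sharply on high-core-count machines.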
Decoder
  • PREEMPT_NONE: kernel preemption mode in which code running inside the kernel (such as a page fault handler) is generally not preempted until it reaches an explicit scheduling point or returns to userspace, minimizing context switches for maximum throughput
  • PREEMPT_LAZY: kernel scheduling mode that can interrupt threads but tries to wait for natural boundaries, intended as a throughput-friendly replacement for PREEMPT_NONE
  • Spinlock: locking mechanism where waiting threads actively loop checking for lock availability rather than sleeping, efficient only when lock holders finish in nanoseconds
  • StrategyGetBuffer: PostgreSQL function responsible for finding a buffer slot to store a data page, protected by a single global spinlock that becomes a contention point under high parallelism
  • Minor page fault: occurs when a process accesses virtual memory that's allocated but not yet mapped to physical memory, requiring the kernel to allocate and map a physical page (takes microseconds)
  • TLB (Translation Lookaside Buffer): hardware cache that stores recent virtual-to-physical address translations, avoiding expensive page table walks; misses require walking multi-level page tables in memory
  • Huge pages: larger-than-default memory pages (2MB or 1GB vs 4KB) that reduce the number of page table entries and TLB pressure, pre-allocated and reserved by the kernel
  • pgbench: PostgreSQL's standard benchmarking tool for measuring transaction throughput under various workloads
  • Restartable Sequences (rseq): Linux kernel facility allowing userspace code to detect if it was preempted during a critical section and restart the operation
Original article

Linux 7.0 inadvertently cut PostgreSQL performance in half: a scheduling change increased how long spinlocks were held during memory page faults, causing massive CPU waste. Switching to huge memory pages fixes the issue.