You can beat the binary search (10 minute read)
A specialized search algorithm for sorted 16-bit integer arrays beats binary search by combining quaternary interpolation with SIMD parallel comparisons.
What: The SIMD Quad algorithm divides sorted arrays of 16-bit integers into 16-element blocks, uses quaternary interpolation (splitting into quarters instead of halves) to find the likely block containing the target value, then checks all 16 elements in that block simultaneously using SIMD instructions.
Why it matters: Demonstrates that classic textbook algorithms can be substantially improved by leveraging modern processor features like data parallelism and memory-level parallelism that weren't considerations when these algorithms were originally designed.
Takeaway: Review the open-source implementation to understand how SIMD optimization techniques can be applied to other search-heavy data structures in your own projects.
Deep dive
- Developed specifically for Roaring Bitmap format which uses sorted arrays of 16-bit integers ranging from 1 to 4096 elements
- Two key insights drove the design: modern processors can compare eight 16-bit integers with a single SIMD instruction, and they have excellent memory-level parallelism that suggests quaternary over binary search
- Algorithm divides arrays into fixed 16-element blocks and uses the last element of each block as interpolation keys for the coarse search phase
- Quaternary search splits the range into quarters rather than halves, generating more instructions but better exploiting memory-level parallelism since instruction count isn't the limiting factor
- Benchmarks on Intel Emerald Rapids (GCC) and Apple M4 (LLVM) both show SIMD Quad consistently beating binary search across all scenarios
- Intel results: more than 2x faster on warm cache, lesser benefits on cold cache
- Apple results: more than 2x faster on cold cache, more marginal benefits on warm cache
- Comparison with a binary version (same SIMD but without quaternary search) reveals the quad approach has little effect on Apple but provides decent optimization on Intel for large arrays
- The quaternary search better exploits memory-level parallelism on Intel server processors by allowing multiple memory operations to proceed in parallel
- Author suggests even better algorithms are likely possible by getting creative with modern processor parallelism features
Decoder
- SIMD: Single Instruction, Multiple Data - processor instructions that perform the same operation on multiple data points simultaneously
- Quaternary search: Search strategy that divides the search space into quarters instead of halves at each step
- Interpolation search: Search algorithm that estimates where a target value should be based on its value relative to the range endpoints
- Memory-level parallelism: A processor's ability to handle multiple memory operations simultaneously rather than waiting for each to complete
- Warm/cold cache: Warm cache means data is already in fast processor cache; cold cache means data must be fetched from slower main memory
- NEON: ARM processor's SIMD instruction set
- SSE2: Intel/AMD x86 processor's SIMD instruction set
- Roaring Bitmap: Compressed bitmap index format used for efficient set operations on large datasets
Original article
The SIMD Quad algorithm is an efficient search algorithm for sorted arrays of 16-bit unsigned integers that leverages the strengths of both algorithmic optimization and hardware acceleration to achieve faster speeds than binary search.