Devoured - April 23, 2026
Entropy-Guided KV Cache Summarization via Low-Rank Attention Reconstruction (9 minute read)


A new KV cache compression technique uses entropy-based selection and low-rank reconstruction to preserve LLM attention patterns more accurately than pruning, achieving both lower error and lower memory usage.

What: The SRC (Selection-Reconstruction-Compression) pipeline addresses KV cache memory limits in LLMs by identifying low-entropy tokens, mathematically reconstructing their contribution via ordinary least squares, and compressing the result using singular value decomposition, rather than simply deleting tokens.
Why it matters: This challenges the dominant pruning approach by showing that compression can outperform deletion in both accuracy and memory footprint, potentially enabling longer context windows without distributed GPU setups or aggressive token eviction strategies.
Takeaway: The full research notebook with implementation is available on Zenodo for developers working on LLM inference optimization.
Deep dive
  • The fundamental problem: KV cache scales linearly O(N) with sequence length, quickly exhausting GPU VRAM for long contexts and forcing either distributed sharding or aggressive pruning
  • Existing Top-K and sliding window methods assume irrelevant tokens stay irrelevant, but attention patterns in natural language are context-dependent and non-stationary
  • Token-level analysis reveals that pruning fails selectively and unpredictably: some tokens show catastrophic reconstruction errors despite low local importance, with error spikes marking tokens that are structurally important but not immediately dominant
  • The SRC pipeline's Selection stage uses Shannon entropy H(P) to identify high-entropy (diffuse attention) tokens for a "recycle bin" while keeping low-entropy anchor tokens in active cache
  • Reconstruction stage solves an OLS problem to find weights that minimize reconstruction error: W* minimizes the difference between reconstructed and original attention outputs using the Moore-Penrose pseudoinverse
  • Compression stage applies SVD for low-rank approximation, representing the weight matrix W as U_k Σ_k V_k^T, filtering noise while preserving principal signal directions
  • Evaluation uses FAIR (equal effective capacity with adjusted budgets) and REAL (actual memory footprint) settings to account for summarization overhead versus true deployment conditions
  • Results show HAE (Hierarchical Attention Entropy, the article's implementation of the SRC pipeline) achieves 3× lower reconstruction error at 30% keep ratio than Top-K in FAIR evaluation, with gains most significant at aggressive compression levels
  • Counterintuitively, HAE uses less actual memory than Top-K across all keep ratios in REAL evaluation while maintaining better accuracy, because low-rank representations remove redundancy rather than explicitly storing tokens
  • Ablation studies show that selection alone (entropy without OLS) and compression alone (OLS without entropy) both perform poorly—the full pipeline is necessary
  • The key conceptual shift: memory efficiency should be measured by information density rather than token count, and compression can outperform pruning on both dimensions
  • Main limitation is increased computation time for OLS and SVD steps, with planned mitigation through custom Triton kernel implementations to amortize latency
Decoder
  • KV Cache: The keys and values stored for each token in transformer attention; they are required for generating subsequent tokens, and their storage scales linearly with sequence length
  • VRAM: Video RAM, the memory on GPUs used to store model weights and activations during inference
  • Top-K pruning: A strategy that keeps only the K tokens with highest attention scores and deletes the rest
  • Sliding window: A caching strategy that keeps only the most recent N tokens, ignoring global dependencies
  • Shannon Entropy: A measure of uncertainty or diffuseness in a probability distribution; high entropy means attention is spread out rather than focused
  • OLS (Ordinary Least Squares): A method for finding the best-fit linear approximation by minimizing squared reconstruction errors
  • SVD (Singular Value Decomposition): A matrix factorization technique that decomposes a matrix into principal components ranked by importance
  • Low-rank approximation: Representing a matrix with fewer dimensions by keeping only the most significant singular values and their vectors
  • Moore-Penrose pseudoinverse: A generalization of matrix inversion used when the matrix isn't square or isn't invertible
  • Triton kernels: Custom GPU kernels written in Triton, a Python-based language for high-performance GPU programming
Original article

Entropy-Guided KV Cache Summarization via Low-Rank Attention Reconstruction

As Large Language Models (LLMs) move toward million-token context windows, we are hitting a physical limit: the KV Cache. Storing the Keys and Values for every single token in a sequence causes VRAM usage to scale linearly.

In the standard Transformer architecture, each new token requires access to the keys and values of all preceding tokens. As a result, the KV cache grows linearly with sequence length: $\mathcal{O}(N)$. For long-context models, this quickly exceeds the VRAM capacity of a single GPU (e.g., an H100), necessitating either distributed sharding or aggressive pruning strategies.
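To make the linear scaling concrete, here is a back-of-envelope estimate; the configuration below (32 layers, 8 KV heads, head dimension 128, 2-byte fp16 elements) is a hypothetical illustration, not a setup taken from the article:

```python
# Estimate KV cache size for a hypothetical decoder configuration.
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2):  # fp16/bf16 storage
    # Factor of 2: one key and one value vector per layer, head, and token
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

for n in (8_192, 131_072, 1_048_576):
    print(f"{n:>9} tokens -> {kv_cache_bytes(n) / 1024**3:>6.1f} GiB")
```

Under these assumed sizes the cache alone reaches 128 GiB at a million tokens, well past the 80 GB of a single H100.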

Existing Heavy Hitter or Top-K eviction strategies rely on a simple premise: if a token isn't being looked at now, it won't be looked at later. However, information in natural language is inherently context-dependent and non-stationary. A token that is irrelevant in one segment may become the primary anchor in another.

In this post, I'll walk through an alternative paradigm: The SRC (Selection-Reconstruction-Compression) Pipeline. Instead of deleting tokens, we mathematically summarize them using Information Theory and Linear Algebra.

To understand why this assumption fails, we first need to examine the structure of attention itself.

Why Pruning Fails: Attention is Structured

The attention mechanism does not operate on tokens independently. Instead, it produces a dense interaction pattern between queries and keys, where each token participates in a global computation.

Each row corresponds to a query token, and each column corresponds to a key token. The intensity reflects how strongly a query attends to a given key.

Several observations emerge:

  • Attention is highly structured, not sparse
  • Multiple tokens contribute jointly to outputs
  • Dependencies are distributed across the sequence

Token-wise Error Under Pruning

While pruning appears effective on average, a finer-grained analysis reveals a critical failure mode.

This plot shows the reconstruction error for each token after applying Top-K pruning.

Most tokens exhibit low error, suggesting that pruning works well locally. However, a few tokens show sharp spikes in error, indicating catastrophic failures.

These spikes correspond to tokens whose contribution is:

  • not immediately dominant
  • but structurally important for downstream attention

Notably, these failures are not predictable from local importance alone.

Pruning fails not uniformly, but selectively and unpredictably.

The SRC Paradigm: A Three-Stage Evolution

To address these limitations, we shift from a heuristic, decision-based paradigm to a structured transformation pipeline.

Instead of deciding which tokens to remove, we aim to approximate their contribution through a sequence of operations:

Selection → Reconstruction → Compression

Selection: The Entropy "Recycle Bin"

How do we decide which tokens to summarize? Instead of magnitude, we look at Information Uncertainty.

We calculate the Shannon Entropy H(P) of the attention weights for each head. If a token has high entropy, it means the model is "diffusing" its focus; the information is less specific and potentially more redundant.

$$H(P) = -\sum_{i} p_i \log(p_i)$$

Tokens with high entropy and low cumulative importance are moved to a "Recycle Bin." Low-entropy "anchor" tokens are kept in the high-fidelity Active Cache.
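As a sketch, one way this split might look in PyTorch. Two assumptions of mine: per-key entropy is computed over the normalized attention each key receives, and a simple `keep_ratio` stands in for the article's combined entropy/cumulative-importance criterion:

```python
import torch

def entropy_split(attn, keep_ratio=0.5, eps=1e-9):
    # attn: [num_queries, num_keys], rows are softmax distributions.
    # Normalize the attention each key receives into a distribution,
    # then score it with Shannon entropy H(P) = -sum p_i log p_i.
    col = attn / (attn.sum(dim=0, keepdim=True) + eps)
    H = -(col * (col + eps).log()).sum(dim=0)
    k = max(1, int(keep_ratio * attn.shape[1]))
    anchors = torch.topk(-H, k).indices            # lowest-entropy anchors
    bin_mask = torch.ones(attn.shape[1], dtype=torch.bool)
    bin_mask[anchors] = False
    recycle_bin = bin_mask.nonzero(as_tuple=True)[0]  # high-entropy tokens
    return anchors, recycle_bin

attn = torch.softmax(torch.randn(16, 32), dim=-1)
anchors, recycle_bin = entropy_split(attn, keep_ratio=0.25)
```

Anchors stay in the high-fidelity Active Cache; the rest move to the Recycle Bin for summarization.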

Entropy Landscape of Tokens

Each point represents the entropy of a token's attention distribution.

We observe that:

  • Some tokens exhibit low entropy (sharp, focused attention)
  • Others exhibit high entropy (diffuse, uncertain attention)

Low-entropy tokens tend to act as anchors, concentrating attention and carrying specific semantic meaning. High-entropy tokens, in contrast, distribute attention broadly and often encode less precise information.

Reconstruction: Solving for the Semantic Essence

Once tokens are in the Recycle Bin, we want to represent them as a single centroid token. Simply averaging them is insufficient, as it ignores the specific queries that might interact with them.

We instead frame this as an Ordinary Least Squares (OLS) problem. Our goal is to find a weight matrix $W$ that minimizes the reconstruction error of the attention output produced by the binned tokens.

Given a set of reference queries $Q_{\text{ref}}$ and the binned keys and values $K_{\text{bin}}$, $V_{\text{bin}}$, we solve:

$$W^* = \arg\min_{W} \left\| Q_{\text{ref}} W - \text{Attn}(Q_{\text{ref}}, K_{\text{bin}}, V_{\text{bin}}) \right\|_2^2$$

This formulation ensures that the reconstructed representation preserves the functional contribution of the discarded tokens.

In practice, we solve this analytically using the Moore-Penrose pseudoinverse:

$$W = Q_{\text{ref}}^{\dagger} \, \text{Attn}(Q_{\text{ref}}, K_{\text{bin}}, V_{\text{bin}})$$

This allows the compressed representation to accurately approximate the original attention outputs while using significantly fewer tokens.

The core OLS logic from the implementation

import torch

def summarize_bin_ols(Q_ref, bin_v):
    # Analytic OLS solve via the Moore-Penrose pseudoinverse:
    # W = pinv(Q_ref) @ bin_v gives the least-squares weights that
    # reconstruct the binned tokens' attention output from Q_ref
    pinv_Q = torch.linalg.pinv(Q_ref)
    W_reconstruction = pinv_Q @ bin_v
    return W_reconstruction
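A self-contained usage sketch of the same solve. The shapes are hypothetical, and `attn_out` stands in for the precomputed output $\text{Attn}(Q_{\text{ref}}, K_{\text{bin}}, V_{\text{bin}})$:

```python
import torch

torch.manual_seed(0)
Q_ref = torch.randn(64, 32)      # 64 hypothetical reference queries, dim 32
attn_out = torch.randn(64, 32)   # stands in for Attn(Q_ref, K_bin, V_bin)

# Analytic OLS solve: W = pinv(Q_ref) @ attn_out
W = torch.linalg.pinv(Q_ref) @ attn_out

# For a full-column-rank Q_ref this matches the least-squares solver
W_lstsq = torch.linalg.lstsq(Q_ref, attn_out).solution
```

Because the system is overdetermined (more queries than dimensions), the residual is generally nonzero; $W$ is simply the best linear summary of the binned tokens' contribution.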

Compression: Low-Rank Approximation (SVD)

The reconstruction weight matrix $W$ is still memory-intensive. To achieve actual VRAM savings, we compress $W$ using Singular Value Decomposition (SVD).

By performing a rank-$k$ approximation, we retain only the most significant singular values and their corresponding vectors. This yields a compact representation that captures the dominant structure of the original matrix.

$$W \approx U_k \Sigma_k V_k^T$$

This low-rank factorization allows us to interpret each retained component as a synthetic key-value pair, effectively producing a centroid token that acts as a proxy for the entire Recycle Bin.

Importantly, this process filters out low-energy components (noise) while preserving the principal directions (signal) that are most relevant for reconstructing attention outputs.

As a result, we achieve substantial compression without significantly degrading the functional behavior of the attention mechanism.
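A sketch of the truncation step using `torch.linalg.svd`; the rank `k` and matrix sizes are illustrative, and the mapping of retained components back to synthetic key-value pairs is omitted:

```python
import torch

def compress_rank_k(W, k):
    # Thin SVD, then keep the k largest singular values and their vectors
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    return U[:, :k], S[:k], Vh[:k, :]   # store these factors instead of W

torch.manual_seed(0)
W = torch.randn(64, 32)
U_k, S_k, Vh_k = compress_rank_k(W, k=4)
W_approx = U_k @ torch.diag(S_k) @ Vh_k

# Storage for this example: 64*32 = 2048 values -> 64*4 + 4 + 4*32 = 388
```

Since `torch.linalg.svd` returns singular values in descending order, truncation keeps the highest-energy directions by construction.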

Evaluation Protocol: FAIR vs REAL

Before presenting results, it is important to distinguish between two evaluation settings.

A naive comparison between methods can be misleading, as different approaches utilize memory differently. In particular, summarization-based methods introduce additional tokens, making direct comparisons with pruning methods non-trivial.

To address this, we evaluate under two complementary regimes:

FAIR: Equal Effective Capacity

In the FAIR setting, we ensure that all methods operate under the same effective KV budget.

For HAE, summarization introduces additional tokens (e.g., centroid tokens). To account for this, we adjust the usable budget:

effective_budget = k_budget − (bin_size − k_rank)

REAL: Actual Memory Usage

In the REAL setting, we measure the true memory footprint of each method without any adjustments.

def kv_memory(K, V):
    # True footprint in MiB, assuming fp32 storage (4 bytes per element)
    return (K.numel() + V.numel()) * 4 / (1024 * 1024)

Unlike the FAIR setting, we do not compensate for summarization overhead. This reflects real-world deployment conditions, where every stored token contributes to memory usage.

Results

Memory Evolution Over Time

This plot shows the number of KV tokens retained as the sequence progresses.

Two distinct behaviors emerge:

  • Top-K (blue) maintains a flat cap, enforcing a strict upper bound on memory
  • HAE (orange) exhibits a step-like pattern, where memory grows and is periodically compressed

Interpretation

Top-K follows a static retention policy:

  • Once the budget is reached, tokens are continuously evicted
  • No information from evicted tokens is preserved

In contrast, HAE follows a dynamic compression policy:

  • Tokens are temporarily accumulated
  • When the recycle bin fills, they are summarized and compressed
  • The cache size drops before growing again

Each downward step in HAE corresponds to:

Recycle Bin Full → OLS Reconstruction → SVD Compression → Reinsertion
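The cycle above can be traced with a toy simulation. The `bin_capacity` and `k_rank` values are illustrative; selection, OLS, and SVD are elided, and every arriving token is routed to the recycle bin:

```python
def hae_cache_trace(n_tokens, bin_capacity=64, k_rank=4):
    # Track cache size as tokens accumulate; each time the recycle bin
    # fills, it is summarized into k_rank synthetic tokens and reinserted.
    active, recycled, sizes = 0, 0, []
    for _ in range(n_tokens):
        recycled += 1
        if recycled == bin_capacity:
            active += k_rank     # OLS + SVD would produce these centroids
            recycled = 0
        sizes.append(active + recycled)
    return sizes

sizes = hae_cache_trace(200)
# Step pattern: grows to 63, drops to 4, grows to 67, drops to 8, ...
```

This reproduces the sawtooth shape: unlike Top-K's flat cap, memory grows while the bin fills and drops each time a summary replaces it.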

KV Compression Strategies (FAIR)

We extend our evaluation to compare multiple KV cache compression strategies under the FAIR setting.

Methods evaluated:

  • Top-K: Retains tokens with highest attention importance
  • Sliding Window: Keeps only the most recent tokens
  • OLS (rank-k): Low-rank approximation without entropy-based selection
  • HAE (Vanilla): Entropy-based selection without reconstruction
  • HAE (Entropy + OLS): Full SRC pipeline (Selection + Reconstruction + Compression)

1. Full SRC Pipeline Performs Best

Across all memory budgets, HAE (Entropy + OLS) consistently achieves the lowest reconstruction error.

  • Significant gains at low keep ratios (≤ 30%)
  • Maintains strong performance even under aggressive compression

2. Selection Alone is Not Enough

Comparing:

  • HAE (Vanilla) vs
  • HAE (Entropy + OLS)

We observe that:

Entropy-based selection improves over naive methods, but reconstruction is critical.

Without OLS, performance degrades significantly, especially at lower budgets.

3. Compression Alone is Insufficient

The OLS (rank-k) baseline remains nearly flat across ratios:

  • It does not adapt to token importance
  • Lacks query-aware selection

This highlights that:

Compression without selection fails to capture meaningful structure.

4. Sliding Window Performs Poorly

Sliding window exhibits the highest reconstruction error:

  • Ignores global dependencies
  • Retains tokens purely based on recency

5. Top-K is Strong but Fragile

Top-K performs reasonably well at higher budgets, but:

  • Degrades rapidly at low ratios
  • Suffers from the selective failure modes observed earlier

FAIR vs REAL: Accuracy and Memory Tradeoff

We now evaluate the SRC pipeline across both FAIR and REAL settings, capturing the tradeoff between reconstruction fidelity and memory usage.

FAIR: Reconstruction Error

The left plot shows reconstruction error (MSE) under equal effective capacity.

Across all keep ratios, HAE consistently outperforms Top-K:

  • At 30% keep ratio, HAE achieves nearly 3× lower error
  • At 50% and 70%, the gap remains significant
  • At higher ratios, both methods converge as more tokens are retained

This demonstrates that:

HAE preserves the attention function more effectively under constrained budgets.

REAL: Memory Footprint

The right plot shows the actual memory consumption of each method.

Interestingly, HAE uses:

Less memory than Top-K across all ratios

This occurs because:

  • HAE compresses multiple tokens into low-rank representations
  • Redundant information is removed rather than explicitly stored

Combined Interpretation

Taken together, the results reveal a key distinction:

  • Top-K:

    • Optimizes for token retention
    • Suffers from high reconstruction error at low budgets
  • HAE:

    • Optimizes for functional preservation
    • Achieves both lower error and lower memory usage

Rethinking the Objective

These findings challenge a common assumption:

Fewer tokens do not necessarily imply better efficiency.

Instead:

  • Memory should be evaluated in terms of information density, not token count
  • Compression can outperform pruning in both accuracy and footprint

Conclusion

This research established that the linear scaling of the KV cache can be mitigated through mathematically guided summarization. The synergy between Hierarchical Attention Entropy and Low-Rank Reconstruction allows the model to "compact" diffused information into a representative centroid rather than simply evicting it.

Our benchmarks on a synthetic Transformer show that this method maintains the semantic "shape" of attention patterns that pruning strategies typically degrade. The primary trade-off observed is the increased calculation time for OLS and SVD steps. Consequently, the next phase of this work involves implementing these operations within custom Triton kernels to amortize latency. By viewing the cache through the lens of reconstruction fidelity rather than just memory capacity, we can develop more sustainable architectures for long-context inference.

The full research notebook is open for review on Zenodo.

Citation

@misc{hae_kv_cache_2026,
  title     = {Entropy-Guided KV Cache Summarization via Low-Rank Attention Reconstruction},
  author    = {Chandra, Jayanth},
  year      = {2026},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.19657329},
  url       = {https://doi.org/10.5281/zenodo.19657329}
}