Recursive Language Models, clearly explained (3 minute read)
MIT researchers developed Recursive Language Models to solve "context rot," where large language models get worse at reasoning over massive documents even when they can retrieve specific facts.
What: RLMs separate the query from the context by loading large documents into a Python REPL runtime memory slot, then give the model tools to explore that context programmatically (peek, grep, partition, and recursive calls) rather than forcing it to process everything at once (see the sketch below).
Why it matters: This addresses a real problem developers experience: long Claude or ChatGPT sessions get sluggish and require more repetition because models degrade at reasoning tasks (counting, classification) even when they ace simple retrieval benchmarks on the same content.
Takeaway: A research paper and starter GitHub code are available for developers who want to experiment with building RLM-based applications.
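To make the query/context separation concrete, here is a minimal sketch of the setup, assuming a plain Python REPL; the variable names and prompt wording are illustrative, not the paper's implementation:

```python
# The document lives in the REPL's memory as an ordinary variable,
# like a dataframe in Jupyter. Here it is synthetic, for illustration.
ctx = "\n".join(f"ticket {i}: ..." for i in range(5_000))

# The root model's prompt carries only the question plus tool
# descriptions; it never contains `ctx` itself, so the prompt stays
# this small regardless of how large the document grows.
ROOT_PROMPT = f"""You can explore a variable `ctx` inside a Python REPL.
Tools: peek(ctx), grep(ctx, pattern), partition(ctx, n), recurse(query, chunk).
The variable holds {len(ctx):,} characters; you never see it directly.
Question: how many billing questions did these 3 users file?"""

print(ROOT_PROMPT)  # everything the root model sees
```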
Deep dive
- Context rot is a reasoning failure, not a window-size failure: models advertise 1M-token windows but produce garbage on 50K-token documents because reasoning collapses under massive context loads
- Standard needle-in-a-haystack benchmarks only measure retrieval against token blobs, not reasoning across those tokens, which is why they miss this degradation
- RLMs use context-centric decomposition where the model itself decides how to break down context, unlike agent frameworks where humans pre-design the decomposition steps
- The architecture separates query from context: the document lives in a runtime memory slot (like a dataframe in Jupyter) while the root model only sees the question and available tools
- Four core tools enable exploration: peek (view first 2K chars), grep (regex filter), partition (chunk into pieces), and recursive self-calls on those chunks
- Example workflow: for "count billing questions from these 3 users in 5,000 tickets," the model peeks at the structure, greps to reduce 5,000 lines to 50, spawns recursive classification calls, and returns the result (see the sketch after this list)
- The root model's context stays small throughout the entire process, preventing context rot regardless of input document size
- Benefits include unlimited effective context (10M tokens just means more partitions), full interpretability of model decisions, cost efficiency from smaller API calls, and automatic improvements as base LLMs improve
- The approach combines code execution with language reasoning; it is neither summarization nor a rigid agent workflow
- Strategy emerges dynamically from what the model discovers rather than following human-scripted steps
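Below is a sketch of the four tools and the ticket-counting workflow described in the bullets above. The tool behaviors follow the descriptions (peek returns the first 2K characters, grep is a regex line filter, and so on), but `call_llm` is a hypothetical stand-in for a real model API, so treat this as the shape of the pattern rather than the MIT implementation:

```python
import re

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError("wire up a real provider here")

def peek(ctx: str, n: int = 2000) -> str:
    """View the first n characters to learn the document's structure."""
    return ctx[:n]

def grep(ctx: str, pattern: str) -> list[str]:
    """Regex-filter lines, e.g. shrinking 5,000 tickets to the relevant ~50."""
    return [line for line in ctx.splitlines() if re.search(pattern, line)]

def partition(ctx: str, n_chunks: int) -> list[str]:
    """Split the context into roughly equal chunks for recursive calls."""
    size = max(1, len(ctx) // n_chunks)
    return [ctx[i : i + size] for i in range(0, len(ctx), size)]

def recurse(query: str, chunk: str) -> str:
    """Spawn a fresh model call whose entire context is one small chunk."""
    return call_llm(f"{query}\n\nContext:\n{chunk}")

def count_billing_questions(ctx: str, users: list[str]) -> str:
    """The workflow from the example bullet, written out step by step."""
    structure = peek(ctx)  # the model would read this to plan its next move
    hits = grep(ctx, "|".join(map(re.escape, users)))  # 5,000 lines -> ~50
    chunks = partition("\n".join(hits), n_chunks=5)    # small pieces
    counts = [recurse("Count the billing questions below.", c) for c in chunks]
    return call_llm("Sum these partial counts:\n" + "\n".join(counts))
```

Because every recursive call sees only one small chunk, the root model's own context stays bounded: a 10M-token input just means more partitions and more recursive calls, not a larger prompt.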
Decoder
- Context rot: The phenomenon where LLMs experience reasoning degradation when processing very large context windows, even though they can still retrieve individual facts
- REPL: Read-Eval-Print Loop, an interactive programming environment where code is executed and variables persist across commands (like a Jupyter notebook; see the short transcript below)
- Needle-in-a-haystack benchmark: A test where a specific sentence is hidden in filler text to see if a model can retrieve it; measures retrieval but not reasoning ability
- Context-centric decomposition: Letting the model decide how to break down and process context, rather than having humans pre-design the task decomposition steps
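For readers who have not used a REPL, here is a short illustrative transcript; note how `doc` persists from one command to the next, which is what lets an RLM keep a huge document resident between tool calls:

```python
>>> doc = "line one\nline two\nline three"   # evaluated immediately
>>> doc.count("\n")                          # `doc` persists across commands
2
>>> doc.splitlines()[0]                      # and is still available here
'line one'
```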
Original article
MIT researchers have introduced Recursive Language Models (RLMs) to solve "context rot," a phenomenon where large language models experience reasoning degradation when processing massive context windows, even if they excel at basic retrieval tasks. Instead of forcing a model to ingest an entire document at once, an RLM loads the context into a Python REPL runtime memory slot.