Devoured - May 08, 2026
OpenAI's Codex now integrates with Chrome on macOS and Windows, automating browser tasks by writing code, while Meta is developing the Hatch AI agent for social platforms, and GitHub is optimizing its AI workflows to manage rising costs.
Codex now works directly in Chrome on macOS and Windows
OpenAI's Codex now runs natively in Chrome on macOS and Windows, enabling it to automate repetitive browser tasks across tabs in the background by writing code to navigate complex data flows.
Decoder
- OpenAI Codex: An AI system developed by OpenAI that can generate code in various programming languages, often used for code completion, code generation from natural language, and automating coding tasks.
Original article
JavaScript is not available.
We’ve detected that JavaScript is disabled in this browser. Please enable JavaScript or switch to a supported browser to continue using x.com. You can see a list of supported browsers in our Help Center.
Terms of Service Privacy Policy Cookie Policy Imprint Ads info © 2026 X Corp.
OpenAI Released Realtime Audio Models
OpenAI has launched a new suite of real-time audio models through its API, including GPT-Realtime-2 for conversational reasoning, GPT-Realtime-Translate for live multilingual translation, and GPT-Realtime-Whisper for streaming transcription.
Original article
OpenAI released a new set of real-time audio models, including GPT‑Realtime‑2 for conversational reasoning, GPT‑Realtime‑Translate for live multilingual translation, and GPT‑Realtime‑Whisper for streaming transcription.
Meta prepares Hatch AI Agent with waitlist and social skills
Meta is developing "Hatch," a consumer-grade AI agent designed to compete with OpenAI's OpenClaw, integrating image/video generation, shopping, and learning features deeply into Facebook and Instagram, with internal testing targeted for June.
Decoder
- Agentic AI: AI systems designed to perform complex, multi-step tasks autonomously, often interacting with various tools, environments, and other agents to achieve a defined goal.
Original article
Meta's push into agentic AI is taking sharper shape. Following reports from FT and The Information that the company is building a consumer-grade autonomous agent codenamed Hatch, fresh signals inside Meta's own surfaces confirm that preparation work is already underway in the codebase. The agent appears positioned as Meta's answer to OpenAI's OpenClaw, reframed for a mainstream audience that the current crop of agentic tools has largely shut out.
Traces in the code suggest Hatch will roll out behind a waitlist, meaning early access is likely to be tightly gated at launch. The scope of tasks being prepared is notably wide:
- Image and video generation
- Shopping flows
- Learning sessions and research workloads
- Groundwork for scheduled tasks and file generation
That feature mix overlaps with Microsoft's Copilot Tasks and its Auto, Researcher, and Analyst modes, but Meta's version carries a clear twist. The agent is expected to draw on social grounding, reaching deeper into Instagram and Facebook than any Meta AI surface so far, and potentially turning feed exploration, creator discovery, and shopping research into agent-driven workflows.
The strategic logic lines up with what Mark Zuckerberg outlined on Meta's most recent earnings call, where he framed the company's agent ambitions as systems that work day and night toward user goals. According to The Information, Meta is targeting internal testing of Hatch by the end of June, with mock environments built to resemble Reddit, Etsy, and DoorDash for training in tool use behavior. The Financial Times points to Muse Spark, Meta's new assistant-tier model family, as the eventual backbone, with Anthropic's Claude Opus 4.6 and Sonnet 4.6 reportedly serving as a transitional layer in the meantime.
Hatch also sits alongside a parallel agentic shopping tool being prepared for Instagram, targeted for Q4 2026, that would let users research and check out products without leaving Reels or the feed. Together, they sketch a clear posture: Meta wants its agents to live where billions of users already spend their time, rather than asking them to migrate to a separate chat surface. Whether the Hatch codename survives to launch remains open, but the build cadence suggests it sits closer to release than early reporting alone implies.
Improving token efficiency in GitHub Agentic Workflows
GitHub is actively optimizing token usage in its AI agentic workflows to reduce growing costs, as these automatically scheduled and triggered jobs can accumulate significant expenses out of sight for developers.
Decoder
- Token: In the context of large language models, a token is a fundamental unit of text or code that the model processes. It can be a word, part of a word, or punctuation, and models process text by breaking it down into these tokens. Costs for LLM usage are often calculated per token.
Original article
GitHub Agent Workflows significantly improve repository hygiene and quality, but costs are becoming a growing concern for developers. AI jobs like agentic workflows are automatically scheduled and triggered, so costs can accumulate out of view. GitHub started systematically optimizing the token usage of many workflows last month. This post describes what the team instrumented, the optimizations it applied, and its preliminary results.
The Six-Hour Codex Run That Survived a Five-Hour Pause
Codex CLI v0.128.0, released April 30, 2026, introduced a headline feature called `/goal` that allows AI-driven development tasks to persist across terminal restarts and laptop sleeps, automatically resuming work without user re-prompting.
Deep dive
- Codex CLI v0.128.0 was released on April 30, 2026, introducing the
/goalfeature. /goalenables "persisted goals," allowing AI tasks to survive terminal restarts, laptop sleeps, and multi-hour pauses.- The system uses app-server APIs for state persistence and model tools to manage the goal lifecycle.
- Runtime continuation automatically injects a developer message to prompt the model to continue working after an interruption, without user input.
- TUI controls are provided for creating, pausing, resuming, and clearing goals.
- A real-world test on a TypeScript monorepo showed a 6h 44min wall time, with only ~41 minutes of actual model compute.
- The session processed ~6.8M cumulative input tokens with an impressive ~94% cache hit rate, making the economics viable.
- The
/goalfeature requires clearly defined "done_when" contracts for success criteria, explicit reading lists for the model, and anti-pattern fences in the prompt. - It is best suited for long-horizon tasks where reasoning accumulates and less for exploratory work or short, interactive tasks.
- The author recommends running with
approval_policy = "never"andsandbox_mode = "danger-full-access"for truly autonomous runs, but only in trusted environments. - This feature is contrasted with the "Ralph Wiggum Loop," which involves stateless, fresh-context iterations, whereas
/goalprioritizes continuous context. - This shift changes the user's role from supervisor to architect, where upfront prompt quality and goal definition are paramount.
Decoder
- Codex CLI: A command-line interface tool that provides access to OpenAI's Codex AI, allowing developers to interact with it from their terminal.
- Persisted goals: A feature where the state and objective of an AI agent's task are saved and maintained, allowing the task to be paused and resumed across different sessions or interruptions without losing context.
- Runtime continuation: The ability of an AI agent to automatically pick up and continue working on a task from where it left off after an interruption, typically by injecting a system message to the model.
- TUI (Terminal User Interface): A text-based user interface that runs within a terminal or console, allowing interaction using text commands and keyboard input rather than a graphical interface.
- Token cache hit rate: The percentage of times an AI model can reuse previously processed tokens or their computations from a cache, rather than reprocessing them, which saves compute resources and cost.
- Ralph Wiggum Loop: A colloquial term (coined by Geoffrey Huntley) for a shell-scripted workflow that repeatedly feeds a prompt and git history to an AI model (e.g., Claude), designed to allow an agent to iterate on a task by starting each step with a fresh context.
- GPT-5.5: Refers to a version of OpenAI's Generative Pre-trained Transformer model, presumably a more advanced or updated iteration.
- TypeScript monorepo: A software development setup where multiple projects or modules, all written in TypeScript, are managed within a single repository, often sharing code and build configurations.
- Wall time: The total elapsed time from the start to the end of a process, including any waiting, pausing, or non-compute periods.
- Model compute: The actual time a machine learning model spends actively processing data and performing computations, excluding idle time or waiting.
- Done_when contract: Specific, concrete success criteria defined upfront for an AI agent's task, which the agent uses to determine when its goal is considered complete.
- Anti-pattern fences: Explicit instructions given in a prompt to an AI agent, telling it what not to do or what types of solutions to avoid, preventing common pitfalls or undesirable behaviors.
- Context compaction: The process of reducing the size of the input context provided to a large language model, typically by summarizing, filtering, or selectively retaining the most relevant information, to save tokens and improve efficiency.
- Approval policy: A configurable setting for an AI agent that determines when and if human approval is required before the agent executes an action (e.g.,
never,always,on_dangerous). - Sandbox mode: A setting that controls the level of access an AI agent has to the system's resources (e.g., filesystem, network), with
danger-full-accessimplying broad, unrestricted access.
Original article
TL;DR
/goalshipped in Codex CLI v0.128.0 on April 30, 2026 as a named headline feature.- It introduces persisted goals: a goal state that survives terminal restarts, laptop sleeps, and multi-hour pauses without re-prompting.
- Runtime continuation means Codex injects a developer message on resume rather than waiting for you to type anything.
- I ran a real session on a TypeScript monorepo. Wall time: about 6h 44min. Actual model compute: about 41 minutes. Final status:
TASK_COMPLETE. - The session burned roughly 6.8M cumulative input tokens at a ~94% cache hit rate. Auto-context-compaction fired once, configurable via
model_auto_compact_token_limit.
I did not plan to run Codex overnight. I started a session at 9:19 PM Berlin time on April 30, watched one turn run for 57 seconds, then closed the laptop and went to bed. When I came back five and a half hours later, /goal was already running again. It had picked up exactly where it left off. I had not re-prompted anything.
That is the thing about /goal that does not come through in a changelog entry. It is not just a new command. It is a different contract between you and the agent.
What Shipped on April 30
Codex CLI v0.128.0 (tagged rust-v0.128.0) dropped on April 30, 2026. The headline from the release notes: “Added persisted /goal workflows with app-server APIs, model tools, runtime continuation, and TUI controls for create, pause, resume, and clear.”
That one sentence packs a lot in, so let me pull it apart.
Persisted goals are the core idea. Previous Codex sessions were ephemeral. Close the terminal, lose the thread. /goal stores the active goal in app-server state, so it outlives the process.
App-server APIs is the plumbing behind that persistence. Codex now talks to a local server layer that tracks goal state.
Model tools means the model itself gets tools for interacting with the goal lifecycle. It can signal completion, request continuation, and inspect goal state as part of its reasoning.
Runtime continuation is the behavior I saw that night. When you resume (or when Codex detects the session is alive again), it injects a developer message prompting the model to continue working. You do not have to type anything.
TUI controls rounds out the surface area. The terminal UI gets explicit create, pause, resume, and clear actions for goal management. You can pause a running goal intentionally, not just by closing the lid.
The rest of v0.128.0 is worth a quick mention. Scrollback reflow now works on terminal resize instead of the text getting mangled. A new codex update command handles CLI self-updates. The composer shows plan-mode nudges when a task seems like a good candidate for planning. TUI keymaps are now configurable. Permission profiles are expanded. The --full-auto flag is deprecated in favor of explicit approval profiles. The desktop app also got polish improvements the same week, though the focus of this post is the CLI. Plan mode itself landed earlier, in v0.122.0 on April 20, 2026. /goal builds on top of that foundation.
What /goal Actually Does
The basic mechanic is straightforward. You type /goal followed by your prompt. Codex stores the goal and starts working. If the session is interrupted (network hiccup, closed laptop, deliberate pause), the goal persists. When the session comes back, Codex resumes automatically via runtime continuation.
The model signals completion with TASK_COMPLETE or the task_complete tool. Until that happens, the goal stays active.
What actually makes this different from a long-running --continue session is the persistence layer. Before /goal, a closed terminal meant a dead session. You could approximate continuity by carefully managing context files and re-injecting prompts, which is basically what the Ralph Wiggum Loop does in a scrappier way. /goal makes continuity a first-class feature.
A few config knobs matter here. In ~/.codex/config.toml, the model_auto_compact_token_limit key sets the threshold for automatic context compaction. The [features] block is where feature flags live. The model_reasoning_effort key sets reasoning effort for the session. If you want hands-off autonomous runs, you will also need approval_policy and sandbox_mode configured correctly. I will get to that.
The TUI also changes. You get visible goal state. You can pause a running goal intentionally without killing the process. Resume picks it back up with runtime continuation.
A Real Six-Hour Run
Here is what a real session actually looked like.
The project was a TypeScript monorepo I am working on. A voice interview system with several end-to-end scenarios that needed to work correctly under a set of defined conditions.
I run Codex with approval_policy = "never" and sandbox_mode = "danger-full-access" for autonomous /goal sessions. These two settings are the precondition for hands-off long runs: the model does not stop to ask permission, and it has full filesystem access to do its work. This is only sane in a trusted project directory with clean git state going in.
The /goal prompt was around 600 words. I wrote it using a structured approach: XML-style blocks organizing the goal, an explicit reading list of ten or more files the model should consult first, working rules (check git status before edits, prefer rg over grep, use apply_patch), a done_when contract spelling out four concrete success criteria, and explicit anti-pattern fences. One of those fences: “do not add string-matching patches to pass one transcript.” If you have worked on voice systems, you know why that fence needs to exist.
Writing a prompt like that is itself a task. If you want to see how I approach prompt design for this kind of work, The Interview Method covers the workflow.
Model: gpt-5.5. Reasoning effort: high.
Session timeline:
- 9:19 PM -
/goalsubmitted. - 9:20 PM - First turn running. I watched it for 57 seconds, then interrupted (
turn_aborted). - 5.5 hours - I closed the laptop. No re-prompting.
- ~2:50 AM - When I came back,
/goalhad already injected a developer message (“Continue working toward the active thread goal”) and was running. Autonomous. - Context compaction fired once, at approximately 6.7M cumulative input tokens.
- Cumulative tokens: ~6.8M input, ~10K output, ~2.6K reasoning tokens. Cache hit rate: ~94%.
- Wall time: 6h 44min. Actual model compute: ~41 minutes across turns.
- Final status:
TASK_COMPLETE. All four target end-to-end voice scenarios passed verification.
Manual transcript review found no prompt loops, no liveness spirals, no premature closes. The model worked through the scenarios methodically and called it done when the criteria were met.
One real-world ceiling worth noting. A TTS first-byte timing field I wanted captured could not be measured, because the upstream library does not emit the relevant runtime event. The model documented this honestly. Explicit nulls in the artifact, with a note explaining why the field was missing. It did not paper over the gap. /goal can give you an autonomous run, but it cannot bypass what the external environment actually exposes.
The ~94% cache hit rate is the number that makes the economics work. 6.8M input tokens sounds alarming until you realize that the actual incremental cost at that cache rate is a fraction of the nominal number.
/goal vs the Ralph Wiggum Loop
I wrote about the Ralph Wiggum Loop a while back. Geoffrey Huntley coined it, and his original post is still the canonical reference: the technique is essentially while :; do cat PROMPT.md | claude-code; done with git history as memory. It solves the same core problem /goal solves: how do you keep an AI agent working on something longer than a single context window?
The approaches are different in character.
| Dimension | Ralph Wiggum Loop | /goal |
|---|---|---|
| Setup | Shell script or plugin, external orchestration | Built into Codex CLI |
| State persistence | Git history, files on disk | App-server APIs, native goal state |
| Resume behavior | Manual re-invocation | Automatic runtime continuation |
| Context management | Fresh context per iteration (by design) | Compaction within session |
| Reasoning continuity | Stateless between iterations | Continuous within session |
| Model | Claude Code | Codex with gpt-5.5 |
| Good for | Tasks that benefit from fresh eyes each pass | Long-horizon tasks with accumulating context |
The Ralph Wiggum Loop is genuinely useful. The stateless-by-design property is sometimes an advantage: each iteration approaches the problem without carrying forward incorrect intermediate conclusions. If the model gets confused, the next iteration starts clean.
/goal bets on continuity instead. The model builds up a picture of the codebase across turns and does not have to re-read everything from scratch on each pass. For tasks where reasoning accumulates (debugging a subtle interaction, navigating a complex state machine), continuity wins. For tasks that are naturally iterative and convergent (adding tests, fixing lint), Ralph’s fresh-context model often works just as well.
Neither is the right default. They are tools for different shapes of problem.
When /goal Is the Wrong Choice
A few situations where I would not reach for /goal.
Undefined success criteria. The done_when contract is not optional. If you cannot write four concrete success criteria before you start, the model has no way to know when it is done. It will either declare TASK_COMPLETE prematurely or loop indefinitely. Write the contract first.
Exploratory work. Early-stage “figure out what this codebase is doing” work benefits from human-in-the-loop. You learn things as the model surfaces them. /goal is for execution, not exploration.
Security-critical paths. I run with approval_policy = "never" and sandbox_mode = "danger-full-access". That setup is only appropriate in project directories I trust completely. Authentication systems, payment flows, anything touching sensitive data: keep approval in the loop.
Unclear external dependencies. If your task depends on an external system you are not sure about, find out first. The TTS timing field I mentioned above is the mild version. The more expensive version is a six-hour run that hits a wall at hour five because the external API does not support what you assumed it would.
Short tasks. /goal has overhead. A task you can finish in ten minutes of interactive Codex is not improved by wrapping it in a persisted goal. The complexity is not worth it below some threshold. My rough heuristic: if the task would not comfortably span two or more separate sessions in the old model, it probably does not need /goal.
The Mindset Shift
Old: Autonomous AI runs are sessions you monitor, ready to intervene when things go sideways. New: Autonomous AI runs are contracts you write upfront, then get out of the way.
The shift is from supervisor to architect. The quality of the /goal session is determined almost entirely before the first turn runs. The prompt quality, the success criteria, the anti-pattern fences, the reading list. Once it starts, your job is mostly done. If you wrote the contract well, the model executes. If you did not, no amount of monitoring will save it.
That is a different skill than interactive prompting. It is closer to writing a spec than having a conversation.
Conclusion
/goal is the most significant thing Codex has shipped since plan mode. The persistence layer and runtime continuation are what make it different from a long --continue session in practice. Six hours and forty-four minutes of wall time with forty-one minutes of actual compute is only possible because the model kept its context, the cache held, and the goal survived a five-hour gap without me touching anything.
The economics work out because of cache hit rates. The quality works out because of upfront prompt discipline. Neither of those things is automatic.
This is the first post in a two-part series. The companion post covers the workflow side: how I prep specs and prompts before they reach /goal. From SPEC.md to /goal: My Codex + GPT-5.5 Workflow.
Sources
- Codex v0.128.0 release notes
- Codex changelog
- Codex CLI features reference
- Geoffrey Huntley: Ralph Wiggum, the goat
Yannik Zuehlke
Consultant, Architect & Developer
Software architect and cloud engineer with 15+ years of experience. I write about what works in practice.
Related Posts
From SPEC.md to /goal: My Codex + GPT-5.5 Workflow
Two models, one pipeline. How I use Claude for SPEC.md, GPT-5.5 to refine, and Codex /goal to ship long-running work.
My Thoughts on Claude Opus 4.7
Anthropic's Opus 4.7 landed with big benchmark wins, a new tokenizer, and a split reception. Here are the facts, the community pushback, and how I'm using it.
How I Built My First iOS App with AI
I shipped a native SwiftUI app to the App Store with zero prior iOS experience. Here's every tool and workflow that made it possible.
Good QC for RL Data
The quality control (QC) bar for reinforcement learning (RL) data sold to frontier AI labs in 2026 is critically low, with most vendors failing multiple internal QC gates and shipping data that often proves unusable or problematic downstream.
Deep dive
- The current quality control (QC) bar for reinforcement learning (RL) data, especially for frontier AI labs, is inadequate.
- Many data vendors are failing to meet the implicit and explicit QC standards set by labs, leading to significant inefficiencies and wasted resources.
- QC should be standardized for evaluating data based on its impact on performance, cost, and latency.
- Key QC gates include "intake review," which assesses whether a dataset is even evaluable, and is often skipped by vendors.
- Intake review categories include verification spectrum classification, contamination resistance, variant generation, pass@k analysis, and rubric construction patterns.
- "Active testing" involves small-scale ablations and post-training runs to catch problems intake review misses, such as reward hacking, sycophancy, and catastrophic forgetting.
- Examples of active testing include probes for reward hacking (e.g., testing if models exploit test cases), bias probes for LLM judges, verifier FP/FN audits, and per-skill forgetting checks.
- Existing benchmarks like FrontierSWE, ProgramBench, and MMMLU are criticized for flaws in realism, verification soundness, contamination, or scope.
- Successful benchmarks like BankerToolBench, LiveCodeBench Pro, and SciCode are praised for realism, contamination defense, or verifier soundness, though none clear all categories.
- The author stresses that the market is shifting from buying "data in the abstract" to buying "outcomes" or "model improvement."
- Vendors who neglect robust QC will face contract non-renewals, while those with research-dense teams and advanced QC infrastructure (e.g., bias probes, CoT faithfulness probes, IRT-based audits) are seeing 3-5x pricing power.
- The article warns against over-optimizing for unrealistic synthetic data and emphasizes that labs are learning to heavily discount "black boxes" from vendors.
Decoder
- Reinforcement Learning (RL) Data: Data used to train AI models that learn by interacting with an environment, receiving rewards or penalties for actions, and optimizing their behavior to maximize cumulative reward.
- Frontier Lab: A leading AI research laboratory pushing the boundaries of artificial intelligence capabilities.
- QC (Quality Control): A process by which the quality of all factors involved in production is inspected. In this context, it refers to the rigorous evaluation of data used to train AI models.
- Pareto Curve: In economics, a graphical representation of the Pareto efficiency frontier, showing the optimal trade-offs between two or more competing objectives (e.g., performance vs. cost vs. latency).
- Intake Review: The initial, cheapest stage of quality control for a dataset, assessing its fundamental evaluability and suitability before expensive training runs.
- Verification Spectrum Classification: Categorizing an AI task based on how verifiably its outcomes can be graded, ranging from deterministic code grading to LLM-judge rubrics.
- Contamination Resistance: A measure of how well a dataset prevents problems from leaking into pre-training data or how resilient it is to models "memorizing" answers rather than learning concepts.
- Variant Generation: The ability to create diverse and novel versions of test cases within a dataset to ensure its discriminative power doesn't decay as models improve.
- Pass@k: A metric used in AI evaluation, particularly for code generation, indicating the percentage of problems where at least one out of
kgenerated solutions passes the tests. - Reward Hacking: A phenomenon in reinforcement learning where an AI agent finds unintended ways to maximize its reward function without achieving the desired human-intended goal, often by exploiting flaws in the reward design.
- Bias Probe Battery: A set of diagnostic tests designed to detect and measure various biases (e.g., sycophancy, reward-tampering, alignment-faking) within an LLM-judge or an AI model.
- Catastrophic Forgetting: A problem in machine learning where a model, when trained on new tasks, tends to forget previously learned information or skills.
- FP and FN rates (False Positives and False Negatives): Metrics used in classification to evaluate the accuracy of a system. False positives are incorrect positive predictions, and false negatives are incorrect negative predictions.
- Sycophancy: The tendency of an AI model to agree with or flatter the user, even if it means providing incorrect or suboptimal information, often observed under reward pressure.
- LLM-judge: A large language model used to evaluate the output or performance of other AI models or systems, acting as an automated grader.
- RLHF (Reinforcement Learning from Human Feedback): A technique used to align AI models with human preferences by training a reward model on human comparisons of model outputs, and then optimizing the AI model with reinforcement learning based on this reward.
- PPO (Proximal Policy Optimization): A popular reinforcement learning algorithm often used to train large language models to align with human preferences.
- CoT (Chain of Thought): A prompting technique used with large language models to elicit a series of intermediate reasoning steps before providing a final answer, which can improve accuracy and interpretability.
- IRT-based ability audits (Item Response Theory): A psychometric framework used for designing, analyzing, and scoring tests, applied here to evaluate the "ability" or capabilities of AI models on a set of tasks.
Original article
In January, I proposed a new definition for Type 1, Type 2 data, pending drastic need from the data industry on how to evaluate data quality. A conscious side-effect of the shift to longer horizon regiments is increased need for model-based QA, far beyond the body-shop capabilities of current day data companies.
The progression of what data markets we entered first directly corresponded to how verifiable we could make each one. We filtered the hard domains out of the field at the infrastructure layer, first by choosing verifiable ones, then by building environments that strip away the attention and irreversibility that made real decisions actually hard, then by avoiding reward functions that require taking a contested position. The artifacts from this selection effect are operationalized in pipeline design. Even in the supposedly easy domains we kept, the QC discipline that distinguishes a useful Type 1 dataset from a depreciating one is not yet a shared language across the data markets. Most of the data shipped to frontier labs in 2026 fails the bar set by the labs' own internal QC frameworks.
Many data companies fall by the wayside in two ways. We're picking the easier domains because the evaluation problem is already solved there. And we're failing to actually solve the QC problem on the data we ship in those domains.
The shape of good QC for off-the-shelf RL data has come into focus over the past eighteen months. There is a defensible bar for what good looks like, which should not be aspirational. It is implemented and shipped by the labs themselves, and any vendor selling into a frontier lab in 2026 is being measured against this bar implicitly during the purchase decision. Most are failing multiple gates at once.
The vocabulary here is worth walking through because it has not yet propagated outside the labs that use it. As we tend towards data and tasks that measure how much/fast it costs to do something, rather than whether we can do them, standardized QC for evaluating how well data tests something on the performance-cost-latency Pareto curve will become of utmost importance.
Intake review
Before any post-training run touches the data, you ask whether the dataset is even eval-able.
This is the cheapest gate in the QC stack and it is the one most data companies skip. A frontier lab spending a six-figure trial contract on a dataset that fails intake review is paying twice, once for the data itself, and once for the GPU hours and researcher attention burned on a training run that was uninterpretable from the start. The market for OTS RL data in 2026 is large enough that the second-order cost of skipping intake now exceeds the first-order cost of running it. As mentioned in my previous piece, Anthropic and other labs disclosed their RL data spend in 2025 at $1B+ and overshoot it in actuality.
There are major intake categories that every company that professes to collect data for frontier analysis ought to display at least.
Verification spectrum classification asks where the task sits between deterministic code grading (SWE-bench Verified is the cleanest version of this category) and LLM-judge rubrics (the published reference pattern across HealthBench, FLASK, BiGGen Bench, and Prometheus 2 is atomic, binary, axis-tagged criteria) and unverifiable-by-automation tasks that should ship as SFT demonstrations rather than reward-based RL. Skipping this classification is how labs end up plugging fundamentally unaudited LLM judges into reward functions.
Contamination resistance and variant generation ask whether the dataset's hillclimbness survives the next model generation. For example, GPQA, AIME, and FrontierMath are static sets whose discriminative power decayed inside a year as problems leaked into pretraining and the vendors had no canary, no rotation cadence, no recovery story.
Pass@k and distributional analysis set the productive training band, because a dataset whose pass@1 sits at zero on the target model or whose difficulty distribution is bimodal produces no gradient to climb.
Rubric construction patterns determine whether the grader is atomic and binary or compound and reward-hackable per the rubric anchoring research. Each category is a question that has a published cautionary tale behind it and the cost of getting it wrong is paid downstream by the lab, not by the vendor.
There are a few more checks that ought to be treated as pre-flights, but a concoction of this in nice formats mean that the vendors that have figured this out are using intake review as a structured pitch to procurement teams ("here is our slice on each category, here is the artifact for each gate, here is the audit pass we ran") and clearing informal onboarding cycles in weeks instead of months. The vendors that haven't are losing the contracts they think they're winning, or have researchers that say their data is "good" on paper but secretly looking for alternatives behind their backs. When a lab, overseeing a million line delivery, discovers one discrepancy/failure in one of those lines, they'll wonder about whether there is a QC process whatsoever.
Active Testing
After intake passes, some small-scale ablation + small post-training can be employed to stress-test post-training data to catch the problems intake review cannot see. Reward hacking shows up in training, amongst the complexity of different models with different harnesses. Sycophancy shows up under reward pressure, not in static evaluation. Forgetting shows up after the training run, by which point the lab has already paid for the data and the compute, and we generally want to make sure catastrophic forgetting is somehow not an immediate consequence of a dataset. Active testing is more expensive to run than intake (the cost is a small post-training run on a probe model plus the GPU hours for the diagnostic battery), but the cost of skipping it is higher still, because the failure modes it catches are the ones that quietly degrade frontier model releases and trigger the contract non-renewals I'm hearing about across the labs.
Most data vendors in 2026 are running zero categories of active testing on the data they ship.
Reward hacking comes up in every single lab conversation, still. METR put numbers on it with 1-2% of o3 attempts containing exploits inside their sandboxes, AISI caught OpenClaw reverse-engineering its own evaluation proxy from inside an isolated environment, and ImpossibleBench finds GPT-5 exploiting test cases 76% of the time on the impossible-SWEbench variant. Modern frontier models are routinely cheating their evaluations under reward pressure, and I still find many vendors have never run a single probe to check whether their own data trains for exactly this. The bias-probe battery is the parallel story for any LLM-judge in a reward function. Sycophancy, reward-tampering, and alignment-faking are the three published probes vendors should be running, with the alignment-faking baseline at 12%, and almost none are.
For verifier-graded data, the SWE-bench Verified Pro pattern of 200 PASS plus 200 FAIL human re-judging with FP and FN rates reported separately is now table stakes. OpenAI's 2026 retirement post for the original SWE-bench found 59.4% of audited problems had flawed test cases. That's the floor under which "deterministic verifier" stops meaning anything. Forgetting checks need to be per-skill, not aggregate, the way Tulu 3 published the floor. The gap between SFT continual post-training (around -10.4% average) and on-policy RL (around -2.3%) is what should inform the training method choice, and Qi et al. is the reason aggregate numbers are misleading on safety-relevant data. Small benign fine-tunes can strip RLHF safety guardrails while aggregate scores stay flat. Frontier shape analysis uses the Pareto curve to detect reward-hackable task sets, with the reward-hacking signature work as the published reference, and most vendors don't run it because it requires GPU infrastructure they don't own. Failure triage is the cheapest of these and the most useful. Each failed rollout labeled as capability, prompt, scaffolding, rubric, training-data, orchestration, or triangulation gives the vendor a concrete edit list and the lab a way to tell whether the dataset is broken at the data layer or upstream.
The procurement read is the same shape as intake. Active testing is the work the labs are already running internally on every dataset they accept, and they are increasingly asking vendors to ship the audit results alongside the data so the lab does not have to repeat the work. Vendors who show up with the bias-probe battery results, the per-skill forgetting numbers, the verifier FP/FN audit, and the failure-triage distribution are clearing onboarding in weeks. Vendors who show up with "we ran a few small training experiments and the loss went down" aren't getting past the first technical review. That gap is the difference between a serious data company and a commodity competitor in 2026.
It should be noted - that since many labs are compute rich in some capacities, that they are more forgiving of data quality (since we are more bottlenecked on quality data) in continuing to work with certain vendors. But, as harkened in previous writing on how data will be the cause of the next AI bubble if one, how long can we expect to exist in an inefficient data market where researchers throw out 50%+ of the data they procure?
Where we need to improve on in the wild
Let's do some deeper dives into 2024-2026 benchmark releases and how they lack some of these standards:
FrontierSWE (Proximal) sits in the strongest possible verification regime (deterministic code-based grader with hidden test signal), and exactly fails surface stratification because each model is locked to its own native production harness, conflating model and scaffolding contributions to the headline number.
ProgramBench fails on realism. Complete web 2.0 software recreation with clean specs and known answers is not the deployment context for any production coding agent in 2026, and the model that tops a ProgramBench leaderboard is not necessarily the one any engineering team should be deploying. Though I'd applied their creative restrictions placed on models and cataloging as cost as a hillclimbing objective category, these tasks are still quite contrived and represents a class of benchmarks that still confuse contest difficulty for production utility.
Tau-Bench measures end-state correctness on multi-turn customer service interactions and skips the process evaluation that is load-bearing on multi-turn rollouts. Did the agent ask the right clarifying question at turn three, recover from a tool failure at turn five, explain the resolution coherently at turn seven.
GDPval tries to anchor frontier capability to economic productivity and fails on realism for the same reason ProgramBench does, where productivity tasks reconstructed in a controlled environment are not the productivity tasks that exist in real organizational contexts.
MMMLU carries the standard MMLU contamination posture across forty languages, with no canary, no rotation, and a known leakage profile from the moment it shipped.
DSBench put GPT-4o-as-judge on 86% of its tasks with a single hand-wave validation claim and saturated from 34% to 89% in ten months, which is the load-bearing example of what happens when verifier soundness is skipped on a static set.
Terminal-Bench 2.0 handles task verification well but stays inside short shell-task horizons that hide both the irreversibility and process-evaluation failures longer-horizon work surfaces, the way coding-and-math hid them in 2024.
The benchmarks that pass more of the categories tend to do so on a single axis at a time. BankerToolBench (Handshake) is the cleanest realism story I have seen on financial tool use, because the tasks are derived from actual investment banking workflows and the verifier is built around the working products bankers use. LiveCodeBench Pro handles contamination defense by drawing fresh problems on a rolling basis from competitive programming sites and retiring them as they age into pretraining, which is the published reference for a refresh cadence done correctly. SciCode handles verifier soundness on partial-credit scientific coding by hand-writing per-problem deterministic checkers with expert review, at the cost of scaling (but I welcome this entry at the cost of scaling, if Mercor's human QA here was run well). None of them clear all categories at once.
I deeply respect the work that all of these companies have done. All of them shipped artifacts that move the field forward. All of them also illustrate why the QC bar is now load-bearing. The question of "does the measurement instrument actually inform a research decision the lab can make" is extremely difficult as the QA processes vary vendor by vendor, and the answer depends on which categories the benchmark cleared and which it skipped.
The vendor distinction worth drawing is between table stakes and differentiation because the floor is relatively automatable. I see the floor as a smorgasbord of dataset documentation manifest, atomic rubric construction with linter, verifier soundness audit, n-gram contamination report, cross-model evaluation with unbiased pass@k, multi-seed bootstrap CIs, eval harness declaration, trace artifacts, surface stratification across at least two scaffolding configs, and probe model selection from a versioned shortlist.
Further into the differentiation work that may not be cost-effective, the work looks more like a researcher's. Bias probe batteries on verifiers, sycophancy, reward-tampering, and alignment-faking probes, CoT faithfulness probes with counterfactual perturbations, IRT-based ability audits via tinyBenchmarks or Fluid Benchmarking, online RL lane diagnostics for PPO and GRPO. Vendors without research staff who can read the cited papers directly will not implement these, but vendors who do ought to be adequately rewarded.
The market implication
Vendors who haven't internalized this QC bar will find their contracts on the chopping block in 2026, and the rumors I've already heard from top labs about RL contracts being non-renewed reflect exactly this dynamic. Labs are buying less data in the abstract sense of "we need more tasks in this shape." Sellers to Chinese labs may still find that the old motion works. The rest of the market has shifted. Most vendors who keep overoptimizing on unrealistic synthetic data will be selling against the current rather than with it. It is not always said out loud, but the frontier labs are buying outcomes, model improvement on a target capability, and the QC bar is the floor under whether the data can actually produce that outcome.
To overoptimize for selling data in its current form without thinking about scalability is to choose a death by a thousand cuts. Frontier labs in 2026 have learned to discount black boxes heavily, especially black boxes attached to vendors who do not appear to care about their own data quality. The few vendors who have built this infrastructure internally already (a small set, mostly the ones with research-dense teams) are seeing pricing power on the order of 3-5x what their commodity peers can charge for nominally similar tasks, and the premium is built on continued trust as reliable quality-first partners at scale. I find that the gap will only widen as the labs' procurement teams get more sophisticated and as more data teams come to market with these standards.
This is the companion observation to the long-horizon non-verifiable point. Before we even get to the harder domains where the reward function is contested and the environment has to model irreversibility, we need to be doing the QC work in the domains where the reward function is uncontested. The execution gap is what's left to close, and it is smaller than the selection effect but larger than most people running data companies want to admit. A world where we have more codified QC standards is also a world where more models like Andons' proliferates. Theoretically, if you are running a data company in 2027 and you cannot tell me your pass@k distribution across at least three models, your verifier FP/FN rates against human gold, your contamination check against the named eval suites your dataset is positioned against, and your frontier-shape diagnostic on a probe model, you are not selling Type 1 data. You are selling Type 2 data with Type 1 marketing. The labs will figure that out within one purchase cycle, and the rumors I'm hearing suggest several already have.
AlphaEvolve: How our Gemini-powered coding agent is scaling impact across fields
Google DeepMind's Gemini-powered coding agent, AlphaEvolve, significantly improved DNA sequencing error correction for PacBio, achieving a 30% reduction in variant detection errors.
Decoder
- DeepConsensus: A Google Research model designed for correcting errors in DNA sequencing data.
Original article
In genomics, AlphaEvolve was used to improve DeepConsensus—a model developed by Google Research for correcting DNA sequencing errors— achieving a 30% reduction in variant detection errors. These improvements are helping scientists at PacBio analyze genetic data more accurately and at a lower cost.
“The solution the Google team discovered using AlphaEvolve unlocks meaningfully higher accuracy rates for our sequencing instruments. For researchers, this higher-quality data might enable the discovery of previously hidden disease causing mutations.” — Aaron Wenger, Senior Director at PacBio
Meta's Optimized RecSys Inference
Meta has achieved up to a 4x speedup and a 2/3 reduction in latency for its recommendation system inference by implementing In-Kernel Broadcast Optimization (IKBO), which eliminates redundant user embedding replication on both GPUs and Meta's MTIA accelerators.
Deep dive
- Problem: Traditional recommendation system inference explicitly replicates user embeddings for every candidate item, leading to wasted memory bandwidth and compute that scales linearly with candidate count.
- Solution: In-Kernel Broadcast Optimization (IKBO) eliminates this by fusing broadcast logic directly into user-candidate interaction kernels, so replicated tensors never materialize.
- Deployment: IKBO is deployed across Meta's multi-stage recommendation funnel on both NVIDIA H100 GPUs and Meta Training and Inference Accelerators (MTIA).
- Performance - Linear Compression: Achieved a cumulative ~4x speedup on H100 SXM5 through progressive co-design stages: matmul decomposition, memory alignment, broadcast fusion, and warp-specialized multi-stage fusion via TLX.
- Performance - Flash Attention: Improved arithmetic intensity from ~60 FLOPs/Byte (IO-bound) to ~833 FLOPs/Byte (compute-bound) at a 70:1 candidate-to-user ratio. Delivered 2.4x/6.4x throughput gain over non-co-designed CuTeDSL FA4 Hopper baselines with 621 BF16 TFLOPs.
- End-to-end impact: Achieved up to a 2/3 reduction in compute-intensive net latency on co-designed models across Meta’s RecSys inference stack.
- Core Principle: Broadcast is treated as a data layout concern rather than a computational necessity, handled internally within computational primitives.
- Hardware Agnostic: The core idea of replacing materialized broadcasts with index-driven in-kernel lookups is hardware-vendor independent, though current implementations target NVIDIA Hopper.
- Future Directions: Adapting IKBO kernels to CuTeDSL (NVIDIA) and AMD CK, and extending to multi-level recommendation hierarchies (e.g., user -> vendor -> item) for further overhead reduction.
Decoder
- Recommendation System (RecSys): An information filtering system that predicts what a user might prefer.
- Embedding: A dense vector representation of discrete variables (like users or items) that captures their semantic meaning.
- In-Kernel Broadcast Optimization (IKBO): A co-design approach that eliminates redundant replication of shared user embeddings by integrating broadcast logic directly into GPU kernel operations.
- MTIA (Meta Training and Inference Accelerator): Meta's custom-designed chip for AI training and inference workloads.
- H100 SXM5: NVIDIA's Hopper architecture-based GPU, optimized for AI workloads, specifically the SXM5 variant designed for data centers.
- Flash Attention: An optimized attention algorithm that speeds up transformer models by reducing the number of memory accesses, making it more efficient for long sequences.
- Triton: A Python-based DSL for writing highly efficient custom GPU kernels, developed by OpenAI.
- TLX (Triton Low-level Language Extensions): Extensions to Triton that expose more low-level hardware features of NVIDIA Hopper GPUs, such as warp specialization and asynchronous memory operations.
Original article
Featured projects
TL;DR:
- Traditional RecSys inference explicitly replicates shared user embeddings/sequences for every candidate. In-Kernel Broadcast Optimization (IKBO) eliminates this overhead via a kernel-model-system co-design that fuses broadcast logic directly into user-candidate interaction kernels. By decreasing both the memory footprint and IO utilization, IKBO unlocks even higher throughput.
- IKBO delivers up to a 2/3 reduction in compute-intensive net latency, serving as the scalability backbone for the request-centric, inference-efficient framework that powers the Meta Adaptive Ranking Model.
- Deployed end-to-end across Meta’s multi-stage recommendation funnel on both GPU and MTIA (Meta Training and Inference Accelerator).
- The IKBO Linear Compression kernel achieved a cumulative ~4× speedup on H100 SXM5 after four stages of progressive co-design, culminating in warp-specialized fusion via TLX.
- The IKBO co-design shifted the Flash Attention kernel from IO-bound to compute-bound (hitting 621 BF16 TFLOPs on H100 SXM5). Coupled with TLX warp-specialized optimization, this results in a 2.4x/6.4× throughput gain over the non-co-designed CuTeDSL FA4 Hopper baseline (kernel only/kernel + broadcasting).
In this post, we present In-Kernel Broadcast Optimization (IKBO), a kernel-model-system co-design approach that eliminates redundant user-embedding broadcast in recommendation model inference. In production RecSys, user embeddings are identical across all candidates for a given request, yet standard approaches require explicit replication, wasting memory bandwidth and compute that scale with candidate count. IKBO encodes a simple insight: broadcast is a data layout concern, not a computational necessity. Each IKBO kernel accepts user and candidate inputs at their natural, mismatched batch sizes and handles broadcast internally, so no replicated tensors ever materialize. We showcase the methodology through two kernel deep dives: Linear Compression and Flash Attention.
Deployed across Meta’s RecSys inference stack—from early-stage to late-stage ranking models, spanning both GPU and MTIA (Meta Training and Inference Accelerator)—IKBO delivers up to a 2/3 reduction in compute-intensive net latency on co-designed models. It serves as the scalability backbone for the request-centric, inference-efficient framework underlying the Meta Adaptive Ranking Model (serving LLM-scale models in production). On H100 SXM5, our IKBO Linear Compression kernel achieves ~4× speedup through four progressive co-design stages: matmul decomposition, memory alignment, broadcast fusion, and warp-specialized multi-stage fusion via TLX (Triton Low-Level Extensions). For Flash Attention, IKBO delivers a 2.4×/6.4× throughput compared to non-co-designed CuTeDSL FA4-Hopper (kernel only / kernel + broadcasting) with 621 BF16 TFLOPs. Unlike system-level broadcast or net-splitting that work around replication, IKBO eliminates it at the computational primitive layer, achieving dense interaction quality at near-independent cost.
Code Repository: https://github.com/pytorch/FBGEMM/tree/main/fbgemm_gpu/experimental/ikbo
† Work done while at Meta
1. In-Kernel Broadcast Optimization: Eliminating Memory and Compute Redundancy
When a user opens their feed, the recommendation system must score hundreds to thousands of candidate items to decide what to show. The model’s inputs split into two categories: user features (e.g., browsing history, profile, context) that are identical for every candidate in a request, and candidate features (e.g., item ID, category, engagement statistics) that are unique to each item. Both pass through embedding lookups and subsequent processing to produce embedding representations. At various points in the model, interaction layers (e.g., linear projections, feature crosses, target attention) combine user and candidate embeddings. We call embeddings shared across all candidates in a request Request-Only (RO), and per-candidate embeddings Non-Request-Only (NRO).
Fig. 1. A very simplified RecSys inference data flow. Request-Only (RO) user embeddings must be broadcast (replicated) to match the Non-Request-Only (NRO) candidate batch dimension before interaction layers. IKBO eliminates this materialization by handling broadcast internally within each kernel.
Interaction layers require tensors with matching batch dimensions. In a batch of 1,024 candidates served by ~15 users, RO embeddings must be broadcast, replicated ~70 times, to match the NRO batch size before any interaction (Fig. 1). As architectures have evolved from DLRM [1] and DCN [2] through sequential models like HSTU [3] and X’s Phoenix [4], they have steadily enriched user-candidate interaction. But richer interaction comes at a cost: user features must be broadcast across all candidates. For batch sizes of 10 – 10,000+ in inference, this replication overhead incurs significant computation and memory cost that scales linearly with candidate count.
Broadcast is a data layout concern, not a computational necessity. Viewing the model and inference system through this lens opens optimization at every layer: the inference runtime eliminates system-level broadcast, user-only model layers run at the smaller user batch size, and kernels that mix both are redesigned to handle broadcast internally—no replicated tensors ever materialize. Deployed across Meta’s RecSys inference stack, from early-stage to late-stage ranking models, spanning both GPU and MTIA, IKBO delivers up to 2/3 reduction in compute-intensive net latency on co-designed models.
This post focuses on the kernel layer through two deep dives: Linear Compression and Flash Attention.
1.1. Kernel Optimization Type
Type I — Decomposable Operations. Mathematical restructuring lets the Request-Only (RO) portion be computed independently at small batch size, combining with the Non-Request-Only (NRO) portion only at the end. This saves both memory bandwidth and compute.
Type II — Memory-Only Optimization. Handling RO-NRO broadcasting within the kernel avoids redundant data movement, pushing the kernel away from IO bound.
1.2. E2E System Design
Deploying IKBO touches three layers of the infra stack:
- Kernels: Custom GPU kernels that accept mismatched RO/NRO batch sizes and handle broadcast internally (Sections 2 and 3).
- Compilation Specification: The ML compiler needs per-operator dynamic shape ranges to select appropriately shaped kernels. With one batch size this is trivial; with two (user and candidate) or even more, reliably resolving which each operator uses—across production models where interactions obscure batch lineage—requires systematic automation.
- Inference: The runtime passes the candidate-to-user mapping into the model instead of materializing the broadcast.
These kernels enter the model through one of two paths:
- Direct adoption: Model authors integrate IKBO kernels directly into their model definitions. When candidate-to-user ratio > 1 during training, the same kernels reduce training cost as well.
- Inference-time transformation: A pass automatically swaps standard ops for IKBO equivalents at inference time — no model code changes required.
The net effect: broadcast disappears from every stage of inference, with no architectural constraints on the model and no infrastructure changes beyond the inference runtime’s mapping interface.
1.3. Comparison with Other Approaches
Existing approaches work around broadcast rather than eliminating it.
- System-level broadcast materializes the replicated tensor before GPU dispatch—simple but wasteful, with cost scaling linearly with candidate count.
- Net-splitting (ROO) [5] partitions the model into RO and NRO sub-networks, reducing redundant work but constraining where user-candidate interactions can occur and still introduce extra cost at small RO batch sizes.
Both preserve broadcast as a materialized tensor. IKBO eliminates it at the computational primitive layer: savings scale with the candidate-to-user ratio, any interaction pattern works without broadcast cost, and the full NRO batch dimension provides GPU occupancy within fused kernels.
IKBO has been deployed on both GPU and MTIA accelerators. In this blog post, we focus on H100 GPU kernel design to illustrate the core optimization principles.
2. Kernel Deep Dive I: IKBO Linear Compression
Linear Compress Embedding (LCE) compresses input embeddings (B, K, N) via a learned projection (M, K) @ (B, K, N) → (B, M, N), and is widely adopted in Meta RecSys models, e.g., Wukong [6]. We go through four progressive optimization stages.
2.1 Matmul Decomposition
Fig. 2. LCE decomposition: baseline batched matmul (top-left), embedding separation and user deduplication along K (top-right), two independent GEMMs with broadcast-add on compressed output (bottom).
The baseline LCE computes a single batched matmul across all B candidates. The input embeddings concatenate user and candidate parts along K — but user embeddings are identical across all candidates for the same user.
Push broadcast past the matmul. Since W is batch-independent, we decompose by linearity: separate user and candidate embedding blocks along K, deduplicate the repeated user embeddings, and compute two independent GEMMs at their natural batch sizes. Instead of replicating user embeddings before the matmul, we broadcast only the small compressed result. See Fig. 2. With a candidate-to-user ratio of ~70 (a representative setting), the user batch shrinks from B=1024 to B_user ≈ 15 — a 70x reduction in user-side compute. The decomposition is implemented in standard PyTorch.
Result. 1.944 ms → 1.389 ms (28.5% reduction; benchmark setup in Appendix 1). Both the original batched GEMM (arithmetic intensity ~ 356 FLOPs/Byte, below H100’s ~495 FLOPs/Byte machine balance point; see Appendix 2 for derivations) and the two decomposed GEMMs are memory-bound, so the speedup is driven by memory cost reduction. Deduplication cuts memory cost more than half — as the user-side GEMM (B_user ≈ 15 vs. B = 1024) becomes negligible in cost.
Note that the decomposition pushes broadcast past the matmul: instead of replicating full K-dimensional input embeddings before the GEMM, we broadcast only the small compressed result, which is far cheaper. In Section 2.3, we will further eliminate this remaining broadcast entirely via in-kernel broadcast fusion.
The current bottleneck is L1/TEX pipeline utilization (84%) rather than DRAM utilization — a suspicious imbalance we will zoom into in the next section. Detailed profiling breakdown in Appendix 3.
2.2 Memory Layout Optimization
Detailed result analysis of the decomposed GEMM reveals an imbalance: L1/TEX sits at 84% of peak while DRAM reaches only 19%, indicating unnecessarily narrow memory loads. SASS confirms: every cp.async copies only 4 bytes instead of a single 128-bit load.
LDGSTS.E.LTC128B P0, [R203], [R38.64] // 4 bytes
LDGSTS.E.LTC128B P1, [R203+0x4], [R38.64+0x4] // 4 bytes (×4 total, only 16B load in total)
cp.async width is capped by the source pointer’s natural alignment. Matrix A is (M, K) row-major with stride K × 2 bytes, so when K is not a multiple of 8, the stride breaks 128-bit alignment.
Model-kernel co-design insights. Memory alignment is a well-understood GPU optimization — but decomposition turns it into a model-kernel co-design challenge. K is formed by torch.cat of embedding tensors whose sizes depend on many model config factors. Decomposition makes it very hard to manually engineer these factors so that decomposed embeddings remain perfect multiples. A systematic solution is needed.
Solution. Pad each decomposed K to the next multiple of 8 by appending zeros to the concat list. We prove this is mathematically equivalent in both forward and backward passes (see Proof 1 below), and with the ML compiler’s memory planner, reduces to a cheap constant copy.
Proof 1. Zero-padding K preserves exact numerical equivalence in both forward and backward passes.
Result. 1.389 ms → 0.798 ms (42.5% reduction). Padding enables CUTLASS to select a TMA-based kernel, bypassing L1/TEX entirely (sectors 351M → 0) and cutting GEMM latency from 0.984 ms to 0.400 ms. With the GEMM resolved, the unfused broadcast and add (0.398 ms) now accounts for half the total latency — to be addressed in the next section. Detailed result analysis in Appendix 5.
2.3 Candidate GEMM In-Kernel Broadcast Fusion
The unfused broadcast and add are memory-bound: write the candidate GEMM result to HBM, read it back alongside the user result, add, and write again. We eliminate this by fusing the broadcast into the candidate GEMM epilogue (Fig. 3). After each tile’s accumulation, the epilogue looks up the user index, loads the pre-computed user result, adds it in registers, and writes the final sum — the intermediate tensor is never materialized. We implement this as a Triton kernel: a standard batched GEMM with a custom post-accumulation epilogue block.
Fig. 3. In-kernel broadcast fusion: the GEMM epilogue loads the pre-computed user result via index lookup and adds it in-register.
Result. 0.798 ms → 0.580 ms (27.4% reduction). Fusion eliminates 0.87 GB of intermediate DRAM traffic, contributing to the latency win. However, occupancy is just 6.25% (1 warp per scheduler), leaving every stall fully exposed. Beyond 42% of cycles waiting on global loads, 20% are spent waiting on WGMMA — stalls that cannot be hidden by the epilogue, and without persistence there is no next-tile load to overlap with. This is a challenging tradeoff: large tiles and deep pipelines are needed to keep tensor cores fed, but they consume most of the shared memory budget, leaving little room to hide latency through occupancy. Detailed result analysis in Appendix 6.
2.4 Warp-Specialized Multi-Stage Fusion with TLX
TLX (Triton Low-level Language Extensions) exposes Hopper’s warp specialization, TMA, mbarriers, and named barriers while preserving Triton’s Python DSL and autotuning infrastructure.
Using TLX, we address the occupancy limitation from Section 2.3 with warp specialization — hiding latency through functional partitioning rather than additional warps.
Sections 2.1 – 2.3 decomposed the original LCE into two independent computations: the user GEMM (Stage 1) and the candidate GEMM with fused broadcast-add epilogue (Stage 2). We first optimize latency hiding within Stage 2, the dominant bottleneck, then fuse both stages into a single persistent kernel.
Intra-Stage Latency Overlap
The candidate IKBO kernel is memory-bound — the design goal is to keep the memory pipeline continuously fed. Triton’s software pipelining (Section 2.3) already overlaps Loads with WGMMA, but the epilogue remains serialized — it blocks future Loads and exposes the WGMMA wait stalls. We resolve both by partitioning each CTA into specialized warp groups: a dedicated producer issues TMA loads continuously (Overlap #1, analogous to Triton’s software pipeline), while two consumers ping-pong tiles so one’s epilogue overlaps the other’s WGMMA (Overlap #2). With persistence, tiles flow continuously with no cross-tile gaps. See Fig. 4.
Fig. 4. Candidate IKBO kernel structure with two intra-stage latency overlaps and warp group role assignments.
Multi-Stage Fusion
We fuse user IKBO (Stage 1) and candidate IKBO (Stage 2) into a single mega-kernel to reduce wave quantization, eliminate kernel launch overhead, and improve L2 cache utilization. High candidate-to-user ratios amplify wave quantization in Stage 1. Since the candidate GEMM is independent of user results until its epilogue, we schedule both stages concurrently.
This concurrent scheduling unlocks two additional cross-stage overlaps, bringing the total overlaps to four. See Fig. 5.
Fig. 5. Concurrent stage scheduling: SMs without user tiles enter Stage 2 immediately, overlapping with Stage 1’s partial wave. All four latency overlaps after multi-stage fusion, showing intra-stage (#1, #2) and cross-stage (#3, #4) overlap opportunities. SM 0-49, 50-131 are example numbers.
Warp Group Specialization & Synchronization Setup
To realize all four overlaps, each CTA is partitioned into one producer and two consumer warp groups. Critically, both stages share the same circular buffer and mbarrier infrastructure — no pipeline drain or barrier reinitialization occurs at the stage boundary. The last user K-block and the first candidate K-block coexist in different buffer slots simultaneously. See Fig. 6.
Fig. 6. Per-CTA warp group setup and the three synchronization mechanisms.
Bidirectional Stage-Alternating Tile Scheduling
When neither stage’s tile count divides evenly by the SM count, naive unidirectional dispatch causes workload imbalance. We reverse tile assignment direction between stages: Stage 1 starts at pid, Stage 2 at NUM_SM - 1 - pid. See Fig. 7.
Fig. 7. Unidirectional (left) vs. bidirectional stage-alternating dispatch (right), balancing per-SM workload across partial waves.
Tile-Granularity Cross-CTA Synchronization
User and candidate tiles may execute on different CTAs, requiring cross-CTA synchronization — but a device-wide barrier would serialize all work and destroy the overlap. We synchronize at per-tile granularity using a three-step release-acquire protocol:
- A single thread per warp group spins on the tile flag with
ld.relaxed, minimizing memory traffic - Once set, a single
ld.acquireestablishes the happens-before edge - A named barrier broadcasts readiness to all 128 threads in the warp group
This avoids expensive fences during polling and lets candidate CTAs on different user tiles proceed fully independently. Details in Appendix 7.
Results
With all optimizations combined, latency improves from 0.580 ms to 0.482 ms (16.9% reduction). The clear intra-warp Proton tracer timeline confirms all four overlaps are realized in practice.
Fig. 8. Proton profiler timeline for two CTAs, with all four overlaps color-coded. The memory pipeline remains continuously fed.
The primary gain comes from Overlap #2: ping-ponging consumers hide WGMMA and epilogue stalls on every tile — directly addressing the dominant wasted cycles from Section 2.3. Overlap #1 (Load↔WGMMA) carries forward from Triton’s existing software pipelining. Overlaps #3 and #4 hide idle time at the user-to-candidate stage transition. See Fig. 8.
NCU confirms: occupancy rises from 6.25% to 18.75% (3 warp groups vs. 1), DRAM throughput from 39% to 52%, and L2 — the bottleneck — from 74% to 84% of peak. This is not occupancy alone: the aggressive latency hiding across all four overlaps keeps the memory pipeline saturated, which is what pushes L2 past 80%. Detailed NCU metrics in Appendix 8.
We benchmark across batch sizes and candidate-to-user ratios, with the default (batch=1024, ratio=70) settings. See Fig. 9.
Fig. 9. Cumulative IKBO speedup across batch sizes (left, ratio=70) and candidate-to-user ratios (right, batch=1024).
The IKBO fusion delivers robust gains across scenarios: ~4x speedup across batch sizes (left) and candidate-to-user ratios (right). Even at low candidate-to-user ratios, the kernel still achieves meaningful speedup.
3. Kernel Deep Dive II: IKBO Flash Attention
As recommendation models scale to capture richer user sequential behavior, sequential architectures – including attention – have emerged as a critical compute bottleneck, accounting for approximately 40% of inference latency at 1K sequence lengths. This motivates our focus on IKBO-aware Flash Attention, co-designed with RecSys’s unique batching semantics.
Inspired by Transformers and Set Transformers [7, 8], two fundamental user history interaction modules have been widely adopted in RecSys:
- Target attention (analogous to cross-attention) captures the relationship between the prediction candidate and the user’s historical interactions.
- Self-attention models sequential dependencies within the user history itself
Since user history is a RO feature while the target operates on a distinct candidate (non-RO) batch dimension, this architectural asymmetry presents an opportunity for IKBO to improve model scalability and computational efficiency. Target attention will be our main focus for optimization, while with minor co-design, self attention could also be fused into IKBO target attention in Section. 3.3. As our model is encoder-driven, full attention is applied without causal masking.
The ultimate optimized target attention version leveraging e2e co-design achieves 2.4×/6.4× the throughput of non-co-designed CuTeDSL FA4-Hopper (attn kernel only / attn kernel + broadcasting cost), reducing latency by 0.320ms / 1.232ms respectively (Table. 2).
3.1 IKBO flash attention solves the IO bound issues under RecSys boundary conditions
Fig. 10: Traditional SDPA with candidate-user broadcasting (left) vs. fused IKBO target attention (right).
IKBO fuses K/V broadcasting into the attention kernel, maintaining mathematical equivalence via a candidate-user mapping tensor from the inference runtime that handles non-uniform candidate-to-user ratios. Fig. 10 contrasts the two approaches: the traditional SDPA path broadcasts K and V to the full candidate batch size before attention, while the IKBO path eliminates this materialization entirely — each candidate indexes into its user’s K/V on the fly.
Shifting IO-Bound to Compute-Bound by IKBO co-design
In RecSys boundary conditions, target attention uses a relatively small number of candidate embeddings to represent the candidate attributes compared to the user’s browsing history. Roofline analysis of standard attention reveals an arithmetic intensity of ~60 FLOPs/Byte – well below the H100 (SXM5 HBM2e version) peak of ~495 FLOPs/Byte (Appendix 2)—making even standard flash attention heavily IO-bound. IKBO addresses this by amortizing K/V memory accesses across multiple candidates sharing the same user context, improving arithmetic intensity from ~60 FLOPs/Byte to ~833 FLOPs/Byte (at B_candidate : B_user = 70:1) and shifting the kernel firmly into compute-bound territory.
To maximize this benefit, our implementation reorders the threadblock launch grid so that batch_size_candidate comes before num_heads. This ensures threadblocks processing different candidates — but sharing the same user K/V — are scheduled concurrently, improving L2 cache reuse.
| Grid dimension | Flash attention (SDPA) | IKBO target attention |
| x | num_q_seq_block | num_q_seq_block |
| y | num_heads | batch_size_candidate |
| z | batch_size_candidate | num_heads |
Table 1: Launch grid configuration comparison. SDPA prioritizes GQA optimization by placing num_heads in grid.y. IKBO swaps head and candidate dimensions, placing batch_size_candidate in grid.y to enable efficient K/V sharing across candidates.
Table 2 compares our IKBO Triton implementation (FA2 logic + IKBO) against state-of-the-art Flash Attention implementations on Hopper (without IKBO co-design). Throughput and IO are measured on attention only; the broadcasting latency for Key and Value is even larger than the attention cost itself.
| Throughput (TFLOPs/s) | IO (GB/s) | Latency (ms) | |
| Triton IKBO FA2 | 425 | 487 | 0.321 (broadcast fused) |
| TLX FA3 | 245 | 2152 | 0.561 + 0.912 (broadcast K&V) |
| CuTeDSL FA4 Hopper | 250 | 2193 | 0.550 + 0.912 (broadcast K&V) |
| TLX IKBO FA3 persistence generalized | 594 | 681 | 0.230 (broadcast fused) |
Table 2: Attention kernel comparison under RecSys boundary conditions (B_candidate = 2048, B_u = 32, uniform candidate-to-user ratio). Without co-design, even cutting-edge Hopper implementations remain IO-bound.
3.2 Adopting Modern Kernel Techniques (FA3, FA4) with IKBO on TLX
With IKBO shifting the kernel from IO-bound to compute-bound, the natural next step was to adopt the state-of-the-art compute optimizations from Flash Attention 3 (FA3 [10]) and Flash Attention 4 (FA4 [11]) on Hopper – specifically warp specialization and pipelining. However, our boundary conditions on the number of query embeddings (q_seq = 32 or 64) make it difficult to directly adopt FA3’s ping-pong or cooperative warp specialization.
Warp specialization on Hopper requires asynchronous WGMMA instructions, which impose a minimum BLOCK_M ≥ 64. Two consumer warp groups are also necessary to minimize bubbles between them. To satisfy these constraints, we customized the kernel to launch both B_candidate = i and B_candidate = i + 1 within a single threadblock, sharing the same B_user. In the discussion below, we assume all users rank an even number of candidates with q_seq = 64; odd-candidate handling follows afterward.
Performance improvement for IKBO FA3 kernel
Starting from FA3’s recipe — intra-warp pipelining, warpgroup specialization, and ping-pong scheduling — the initial TLX IKBO FA3 kernel performed similarly to the FA2 baseline (Fig. 12, blue vs. red, Appendix 11), with on-par throughput.
To diagnose the bottleneck, we visualized intra-warp pipelining using the Proton tracer with GPU cycles as the latency unit (Fig. 10). Table 3 summarizes the key bottlenecks before and after persistence, measured in GPU cycles via the Proton tracer.
Fig. 11: Proton-based intra-warp profiling of the TLX IKBO FA3 kernel. Representative warps from each warp group are shown: warp 0 (producer), warp 4 (consumer 1), and warp 8 (consumer 2). The softmax_PV_overlap and pure softmax regions are marked separately to identify the tensor core bubbles. (A) Before persistence zoomed in view of B (B) Before persistence with 2 waves (C) After persistence with 2 waves
| Bottlenecks | Before | After | Key change |
| Tensor Core Bubbles (1st QKT per wave, Blue) | ~1,300 cycles (400 cycles from warp scheduler switching) | ~1,300 cycles | Unchanged |
| Tensor Core Bubbles (last PV per wave, Blue) | ~2,000 cycles | ~300 cycles | Async TMA store + reciprocal overlap with last PV |
| Cross-CTA Stalls (Orange) | ~14,000 cycles | Eliminated | Persistence removes CTA re-launch entirely |
| Init Buffers & Barriers (Green) | ~1,600 cycles/wave | ~1,600 cycles (1st wave only) | Persistence shared buffer and barrier amortized across waves |
| Wait 1st Q/K Load (Dark purple) | 2,100~4,000 cycles/wave (length varies depending on HBM bandwidth contention) | ~2,000 cycles (1st wave only) | Cross-wave pipelining; producer prefetches ~3K cycles ahead |
Table 3: Key bottlenecks before and after persistence + optimizations.
Key takeaway: cross-CTA stalls are the dominant bottleneck — not tensor core utilization – at these small query sequence lengths. Persistence is a must for this improvement. After persistence, the profiling results and its latency changes are presented in Fig. 11C and Table. 3.
HBM2e-Specific Optimizations
We further tuned the persistent kernel for the H100 SXM5’s HBM2e bandwidth constraints, trading shared memory capacity for reduced load/store blocking. (Table 4).
| Customized optimization/fix | Benefit |
| Decoupled SMEM buffer of O from Q/V with pipelined TMA async store | Decoupled O from Q/V SMEM sharing enable TMA async stores could overlap with next-wave compute, shortening store blocking time from 1,300 to 400 cycles/wave |
| Separate Q₀ and Q₁ buffers | Reduces per-Q loading time, allowing one consumer group starts earlier— beneficial when wave count greatly exceeds K/V sequence iterations (common in RecSys) |
| Instruction Cache Misses fix | Merges the peeled-out last-iteration code path back into the main loop, eliminating icache thrashing caused by excessive warp-specialized instructions (Appendix 12) |
Table 4: Customized optimizations for the HBM2e H100 SXM5. These still fit within the available SMEM budget under RecSys boundary conditions (Appendix 10).
We also implemented persistent V2, which iterates from the end of the K sequence to the front (matching FA3/FA4-Hopper’s approach) to simplify masking logic. Both persistent variants apply the Table 4 optimizations. As shown in Fig. 12, at low sequence lengths (512–4,096) the TLX FA3 persistent kernel outperforms all other candidates; beyond 8K the two persistent variants converge.
Fig. 12: IKBO implementation throughput vs. sequence length (B_candidate = 2,048; B_candidate : B_user = 64; num_head = 2; d_head = 128). Practical RecSys sequence lengths are under 4K [3]; longer lengths are included for comparison with LLM use cases. The generalized version handles non-even candidates per user with 50% odd-candidates per user probability
Generalizing IKBO FA3 for ranking Arbitrary Candidate Batch Sizes
Our IKBO FA3 kernel co-processes two candidate batches per CTA to meet WGMMA’s BLOCK_M ≥ 64 requirement. When a user has an odd number of candidates, one consumer warpgroup has no pairing partner. We handle this with idling logic (Fig. 13, left; Algorithm 1):
- The idle warpgroup drains K/V buffers via mbarrier signaling to prevent producer deadlock.
- The active warpgroup disables ping-pong synchronization (its partner no longer arrives at the named barriers).
At a ~70 : 1 candidate-to-user ratio, the idle path triggers less than 0.7% of the time with negligible overhead (Fig. 12, IKBO TLX FA3 generalized). This approach generalizes to q_seq_len = 32, where four candidate batches are bundled per CTA using analogous idling and masking logic.
Fig. 13: CTA assignment for generalized target attention (left) and self + target attention fusion (right). Each CTA assigns two consumer warp groups sharing the same user K/V. When the candidate count is odd, the 2nd consumer idles and drains barriers.
Algorithm 1: IKBO Attention Forward Pass with Odd Candidate Handling
3.3 Self + Target Attention Fusion via Model Co-Design
The previous sections focused on optimizing target (cross) attention. A natural question arises: can we fold self-attention into the same kernel?
The key insight is that both attention types share the same key-value source — the user sequence. The only difference is the query: self-attention queries come from the user side, while target-attention queries come from the candidate side. By sharing K/V projections between the two, we enable direct horizontal kernel fusion within a single launch. Fig. 13 (right) illustrates the fused CTA layout: the first CTAs handle self-attention query blocks, while the remaining CTAs handle target-attention candidate pairs — all reading from the same pipelined K/V stream.
Similar co-design ideas have been explored in XAI Phoenix, an open-source recommendation system from X [4].
We prototyped a fused kernel to quantify the fusion benefit, excluding K/V projection savings (Fig. 13, right):
- seq_len = 512: 6.6% improvement (514 vs. 482 TFLOPs/s)
- seq_len = 1,024: 4.1% improvement (581 vs. 558 TFLOPs/s)
- seq_len = 2,048: 0.3% improvement (612 vs. 610 TFLOPs/s) — self-attention saturates the SMs
The gains at short sequences stem from kernel fusion benefits: reduced launch overhead, shared buffer allocation savings, cross-kernel pipelining opportunities, and wave quantization mitigation — the same inefficiencies that megakernel techniques [12] target in LLM inference. In production, the shared K/V projections provide additional savings on linear projection cost, analogous to KV cache reuse.
4. Summary of Benchmarks and Results
We summarize the kernel-level benchmarks presented in this post alongside end-to-end deployment outcomes. All kernel benchmarks below are on H100 SXM5 (see details in Appendix 1).
- Linear Compression (Section 2). Four progressive co-design stages — matmul decomposition, memory alignment, broadcast fusion, and warp-specialized multi-stage fusion via TLX — yield a cumulative ~4× speedup (1.944 ms → 0.482 ms) at representative settings. Gains remain robust across batch sizes and candidate-to-user ratios (Fig. 9).
- Flash Attention (Section 3). IKBO shifts target attention from IO-bound (~60 FLOPs/Byte) to compute-bound (~833 FLOPs/Byte), achieving 2.4×/6.4× the throughput of non-co-designed CuTeDSL FA4-Hopper (kernel only / kernel + broadcasting) with 621 BF16 TFLOPs.
- End-to-end deployment. IKBO has been deployed broadly across Meta’s RecSys inference stack — from early-stage to late-stage ranking models, on both GPU and MTIA accelerators — delivering up to 2/3 reduction in compute-intensive net latency on co-designed models. IKBO has been validated across candidate-to-user broadcast ratios spanning from ~10,000 : 1 down to ~10 : 1, confirming both numerical stability and scalability across workloads.
5. Conclusion and Future Directions
IKBO demonstrates that broadcast — long treated as an unavoidable cost of user-candidate interaction — can be eliminated at the computational primitive layer through kernel-model-system co-design. By encoding broadcast semantics directly into kernels, no replicated tensors ever materialize, and savings scale naturally with the candidate-to-user ratio.
While the kernel implementations presented in this work target NVIDIA Hopper via Triton and TLX, the core idea — replacing materialized broadcasts with index-driven in-kernel lookups — is hardware-vendor independent. Adapting the IKBO kernels to CuTeDSL (for advanced NVIDIA backend support) and completing the AMD CK support are natural next steps.
Beyond the two-level user-candidate hierarchy presented here, some RecSys scenarios involve deeper hierarchies — for example, user → ads vendor → ads item, where each user sees multiple vendors and each vendor offers multiple items. This introduces two nested broadcast relationships with independent, non-uniform ratios. IKBO can handle this elegantly, and applying it to multi-level workloads is a natural direction for further reducing materialization overhead in production RecSys architectures.
Acknowledgements
We are grateful to Hongtao Yu, Yuanwei (Kevin) Fang, Daohang Shi, Yueming Hao, Srivatsan Ramesh and Manman Ren for their strong internal support of the Triton and TLX foundation, the powerful Triton profiling toolings, and for promptly resolving Triton-related issues throughout this work.
Thanks Chris Gottbrath for his insightful feedback, which significantly improved the clarity of this post. We also greatly appreciate his help in facilitating a smooth review process.
Thanks Santanu Kolay, Sandeep Pandey, Matt Steiner, GP Musumeci, Ashwin Kumar, Ian Barber, Aparna Ramani, CQ Tang for leadership support.
References
[1] Naumov, M., et al. “Deep Learning Recommendation Model for Personalization and Recommendation Systems,” arXiv:1906.00091, 2019.
[2] Wang, R., et al. “Deep & Cross Network for Ad Click Predictions,” ADKDD, 2017.
[3] Zhai, J., et al. “Actions Speak Louder than Words: Trillion-Parameter Sequential Transducers for Generative Recommendations,” ICML, 2024.
[4] xAI. “Phoenix: Recommendation System,” GitHub, 2026. https://github.com/xai-org/x-algorithm
[5] Guo, L., et al. “Request-Only Optimization for Recommendation Systems,” arXiv:2508.05640, 2025.
[6] Zhang, B., et al. “Wukong: Towards a Scaling Law for Large-Scale Recommendation,” ICML, 2024.
[7] Vaswani, A., et al. “Attention Is All You Need,” NeurIPS, 2017.
[8] Lee, J., et al. “Set Transformer: A Framework for Attention-based Permutation-Invariant Input,” ICML, 2019.
[9] Dao, T. “FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning,” ICLR, 2024.
[10] Shah, J., et al. “FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision,” NeurIPS, 2024.
[11] Zadouri, T., et al. “FlashAttention-4: Algorithm and Kernel Pipelining Co-Design for Asymmetric Hardware Scaling,” arXiv:2603.05451, 2026.
[12] Spector, B., et al. “Look Ma, No Bubbles! Designing a Low-Latency Megakernel for Llama-1B,” Hazy Research Blog, 2025. https://hazyresearch.stanford.edu/blog/2025-05-27-no-bubbles
Appendix
Appendix 1. Benchmark Setup
All experiments are conducted on a single NVIDIA H100 SXM5 GPU (700 W TDP, 96 GB HBM2e) with the following software stack:
- CUDA: 12.4
- PyTorch: 2.11.0a0+fb (internal build)
- Triton: facebookexperimental/triton@
4059e79bf(#831)
Appendix 2. Arithmetic Intensity Analysis
2.1 Machine Balance Point of H100 SXM5 (700 W TDP, 96 GB HBM2E)
2.2 Arithmetic Intensity of the Baseline LCE
For a batched matmul (M, K) @ (B, K, N) → (B, M, N) in FP16, with B=1024, M=433, K=2044, N=256:
Appendix 3. Detailed Result Analysis for Section 2.1
Setup: H100 SXM5 (Appendix 1), PyTorch eager mode (no kernel fusion), inference. Shapes from a representative configuration.
| Version | Total
(ms) |
Kernels | Latency
(ms) |
DRAM
(GB) |
L1/TEX Sectors
(M) |
Compute
(GFLOPs)* |
Bottleneck
† |
| Baseline | 1.944 | 1 CUTLASS GEMM | 1.944 | 1.31 | 798 | 460 | L1/TEX (89%) |
| Decomposition | 1.389 | 2 CUTLASS GEMM (user + candidate matmul) | 0.984 | 0.68 | 351 | 200 | L1/TEX (84%) |
| 1 ATen Gather + 1 ATen add | 0.405 | 0.87 | 36 | 0.11 | DRAM (92%) |
*Total FLOPs executed, not throughput.
†Bottleneck identified via NCU Speed of Light analysis; methodology in Appendix 4.
Deduplication eliminates >98% of user-side work (batch 1024 → ~15), cutting L1/TEX sectors from 798M to 351M and GEMM latency from 1.944 ms to 0.984 ms. The post-GEMM broadcast and addition costs 0.405 ms (DRAM-bound), yielding a net saving of 0.555 ms.
Precision note. The baseline accumulates all K products in a single FP32/TF32 reduction. Decomposition accumulates K_user and K_cand separately, then sums the partial results in BF16/FP16. Training uses the same decomposition, so numerics match end-to-end. For exact inference parity, a fused kernel (Section 2.4) can perform the final summation in FP32.
Appendix 4. Bottleneck Analysis Methodology
For a closer look after roofline analysis, we use NCU’s Speed of Light analysis to identify hardware subsystem bottlenecks. The bottleneck is the subsystem with the highest utilization relative to its peak sustained throughput. For the analysis in Section 2.1, we monitor three metrics:
Compute is the peak SM pipeline utilization, reported directly by NCU (Compute (SM) Throughput). It measures how busy the most active execution pipeline (tensor cores for GEMMs) is relative to its peak instruction rate.
L1/TEX utilization is derived from the total sectors the L1/TEX unit must process as below, where num_L1_tex_sectors is l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum and _st.sum counter, is SM_active_cycles sm__cycles_active.avg counter, num_SM is 132 and num_sustained_peak_sectors_per_sm_per_cycle is 2.0 on H100.
DRAM utilization is derived from total HBM bytes transferred as below, where dram_bytes_read_and_write is the dram__bytes_read.sum and dram__bytes_write.sum counter. peak_bandwidth is 2TB/s on the testing GPU server.
Appendix 5. Detailed Result Analysis for Section 2.2
Result. 1.389 ms → 0.798 ms (42.5% reduction).
| Version | Total Latency
(ms) |
Kernels | Latency
(ms) |
DRAM Traffic
(GB) |
Compute
(GFLOPs) *not speed |
L1/TEX Sectors
(M) |
Bottleneck
† |
| Decomposition
(unpadded) |
1.386 | 2 CUTLASS GEMM – user & candidate matmul | 0.984 | 0.68 | 200 | 351 | L1/TEX (84%) |
| 1 ATen Gather – broadcast
1 ATen Elementwise – add |
0.402 | 0.87 | 0.11 | 36 | DRAM (92%) | ||
| Decomposition
(padded K) |
0.798 | 2 CUTLASS GEMM – user & candidate matmul | 0.400 | 0.69 | 200 | 0 | Balanced |
| 1 ATen Gather – broadcast
1 ATen Elementwise – add |
0.398 | 0.87 | 0.11 | 36 | DRAM (92%) |
Two factors behind the large speedup.
- TMA. With aligned matrices, CUTLASS selects a TMA-based kernel, bypassing L1/TEX entirely (sectors → 0). The unpadded kernel also penalized matrix
Bunnecessarily: it applied 4-byte loads to both matrices, even thoughB(with alignedN) could have used 128-bit loads. - Bank conflicts. The unpadded kernel also uses sm80 MMA path whose swizzle pattern doesn’t protect against 4-byte cp.async writes, causing many shared memory bank conflicts. The padded kernel doesn’t have this issue.
Appendix 6. Detailed Result Analysis for Section 2.3
Result. Latency: 0.798 ms → 0.580 ms (27.4% reduction).
| Version | Total Latency
(ms) |
Kernels | Latency
(ms) |
DRAM Traffic
(GB) |
| Decomposition
(padded K) |
0.798 | 2 CUTLASS GEMM – user & candidate matmul | 0.400 | 0.68 |
| 1 ATen Gather – broadcast
1 ATen Elementwise – add |
0.398 | 0.87 | ||
| iKBO Fusion | 0.580 | user GEMM & candidate iKBO kernel | 0.580 | 0.68 |
The 0.87 GB of intermediate DRAM traffic is eliminated as expected. NCU profiling reveals further opportunity: occupancy is just 6.25% with 1 warp per scheduler, and PC sampling shows only 23% of cycles are productive:
| Stall Reason | Percentage | What it mainly refers in the kernel |
| Stall long scoreboard | 41.8% | Global memory loads |
| Selected (executing) | 23.1% | Productive work (good) – instructions actually issued |
| Stall wait | 20.1% | Wait WGMMA |
| Stall barrier | 5.7% | bar.sync between software-pipeline stages |
With 1 warp per scheduler, every stall is fully exposed: there is no other warp to switch to. Increasing occupancy by reducing pipeline depth would sacrifice K-loop latency hiding. This is a challenging situation for this kernel:
Building Fast & Accurate Agents with Prime-RL Post Training
Ramp Sheets leveraged reinforcement learning to build "Fast Ask," a specialized agent for quickly navigating spreadsheets and retrieving specific information, detailing the process as a case study for training specialized agents.
Original article
JavaScript is not available.
We’ve detected that JavaScript is disabled in this browser. Please enable JavaScript or switch to a supported browser to continue using x.com. You can see a list of supported browsers in our Help Center.
Terms of Service Privacy Policy Cookie Policy Imprint Ads info © 2026 X Corp.
ds4.c (GitHub Repo)
Antirez, the creator of Redis, has released `ds4.c`, a Metal-only (with future CUDA plans) native inference engine specifically optimized for DeepSeek V4 Flash, aiming to provide a "finished" end-to-end local inference experience on high-end personal machines like MacBooks with 128GB RAM.
Deep dive
- Purpose:
ds4.cis a small, native, and intentionally narrow inference engine for the DeepSeek V4 Flash model, designed for local execution. - Optimization Focus: Targets DeepSeek V4 Flash specifically, rather than being a generic GGUF runner, with custom Metal (macOS) and planned CUDA (Linux) graph execution.
- Key Features: Supports DeepSeek V4 Flash's 1 million token context window; features a highly compressed KV cache that can persist to disk, viewing KV cache as a "first-class disk citizen"; achieves good quality with 2-bit quantization, enabling powerful models to run on machines with 128GB of RAM (e.g., MacBook Pro M3 Max); includes a CLI for one-shot or interactive multi-turn chat and an OpenAI/Anthropic-compatible local server; supports speculative decoding (MTP) and single-vector activation steering for behavioral adjustments.
- Performance: Benchmarks show significant tokens/second rates on M3 Max and M3 Ultra machines for both prefill and generation.
- Development Philosophy: Aims to make one local model "feel finished end-to-end," with official-vector validation and agent integration.
- AI-Assisted Development: Developed with "strong assistance from GPT 5.5" for ideas, testing, and debugging, alongside human leadership.
- Acknowledgements: Deeply indebted to
llama.cppandGGMLfor foundational work, kernels, quantization formats, and the GGUF ecosystem. - Tool Call Handling: The server re-renders client JSON tool-call objects back to the exact DSML text the model sampled using a bounded in-memory map (and disk persistence), ensuring prefix alignment for chat turns.
Decoder
- DeepSeek V4 Flash: A specific large language model (LLM) known for its large context window and efficient architecture.
- GGUF: A file format for storing large language models, popular for local inference, often used with
llama.cppandGGML. - Metal: Apple's low-overhead, hardware-accelerated 3D graphics and compute API, used for GPU inference on macOS.
- CUDA: NVIDIA's parallel computing platform and programming model for GPUs.
- KV Cache (Key-Value Cache): In transformer models, this cache stores the computed key and value vectors from previous tokens, allowing for faster inference in subsequent tokens of a sequence.
- Quantization: A technique to reduce the precision of model weights (e.g., from 16-bit to 2-bit) to decrease memory footprint and increase inference speed, often with minimal impact on accuracy.
- DSML: DeepSeek's specific format for tool calls within the model's text generation.
Original article
ds4.c
ds4.c is a small native inference engine for DeepSeek V4 Flash. It is intentionally narrow: not a generic GGUF runner, not a wrapper around another runtime, and not a framework. The main path is a DeepSeek V4 Flash-specific Metal and CUDA graph executor with DS4-specific loading, prompt rendering, KV state, and server API glue.
This project would not exist without llama.cpp and GGML, make sure to read the acknowledgements section, a big thank you to Georgi Gerganov and all the other contributors.
Now, back at this project. Why we believe DeepSeek v4 Flash to be a pretty special model deserving a stand alone engine? Because after comparing it with powerful smaller dense models, we can report that:
- DeepSeek v4 Flash is faster because of less active parameters.
- In thinking mode, if you avoid max thinking, it produces a thinking section that is a lot shorter than other models, even 1/5 of other models in many cases, and crucially, the thinking section length is proportional to the problem complexity. This makes DeepSeek v4 Flash usable with thinking enabled when other models are practically impossible to use in the same conditions.
- The model features a context window of 1 million tokens.
- Being so large, it knows more things if you go sampling at the edge of knowledge. For instance asking about Italian show or political questions soon uncovers that 284B parameters are a lot more than 27B or 35B parameters.
- It writes much better English and Italian. It feels a quasi-frontier model.
- The KV cache is incredibly compressed, allowing long context inference on local computers and on disk KV cache persistence.
- It works well with 2-bit quantization, if quantized in a special way (read later). This allows to run it in MacBooks with 128GB of RAM.
- We expect DeepSeek to release updated versions of v4 Flash in the future, even better than the current one.
That said, a few important things about this project:
- The local inference landscape contains many excellent projects, but new models are released continuously, and the attention immediately gets captured by the next model to implement. This project takes a deliberately narrow bet: one model at a time, official-vector validation (logits obtained with the official implementation), long-context tests, and enough agent integration to know if it really works. The exact model may change as the landscape evolves, but the constraint remains: local inference credible on high end personal machines or Mac Studios, starting from 128GB of memory.
- This software is developed with strong assistance from GPT 5.5 and with humans leading the ideas, testing, and debugging. We say this openly because it shaped how the project was built. If you are not happy with AI-developed code, this software is not for you. The acknowledgement below is equally important: this would not exist without
llama.cppand GGML, largely written by hand. - This implementation is based on the idea that compressed KV caches like the one of DeepSeek v4 and the fast SSD disks of modern MacBooks should change our idea that KV cache belongs to RAM. The KV cache is actually a first-class disk citizen.
- Our vision is that local inference should be a set of three things working well together, out of the box: A) inference engine with HTTP API + B) GGUF specially crafted to run well under a given engine and given assumptions + C) testing and validation with coding agents implementations. This inference engine only runs with the GGUF files provided. It gets tested against officially obtained logits at different context sizes. This project exists because we wanted to make one local model feel finished end to end, not just runnable. However this is just alpha quality code, so probably we are not still there.
- The optimized graph path targets Metal on macOS and CUDA on Linux. The CPU path is only for correctness checks and model/tokenizer diagnostics. For CPU-only Linux builds, use
make cpu; it builds the normal./ds4and./ds4-serverbinaries without CUDA or Metal. On macOS, warning: current macOS versions have a bug in the virtual memory implementation that will crash the kernel if you try to run the CPU code. Remember? Software sucks. It was not possible to fix the CPU inference to avoid crashing, since each time you have to restart the computer, which is not funny. Help us, if you have the guts.
Acknowledgements to llama.cpp and GGML
ds4.c does not link against GGML, but it exists thanks to the path opened by the llama.cpp project and the kernels, quantization formats, GGUF ecosystem, and hard-won engineering knowledge developed there. We are thankful and indebted to llama.cpp and its contributors. Their implementation, kernels, tests, and design choices were an essential reference while building this DeepSeek V4 Flash-specific inference path. Some source-level pieces are retained or adapted here under the MIT license: GGUF quant layouts and tables, CPU quant/dot logic, and certain kernels. For this reason, and because we are genuinely grateful, we keep the GGML authors copyright notice in our LICENSE file.
Status
The code and GGUF files are to be considered of alpha quality because inference and model serving is a complicated matter and all this exists only for a few days. It will take months to reach a more stable form. However, we try to keep the project in a usable state, and we are making progresses. If you have issues, make sure to use --trace to log the sessions, and open issues including the full trace.
Model Weights
This implementation only works with the DeepSeek V4 Flash GGUFs published for this project. It is not a general GGUF loader, and arbitrary DeepSeek/GGUF files will not have the tensor layout, quantization mix, metadata, or optional MTP state expected by the engine. The 2 bit quantizations provided here are not a joke: they behave well, work under coding agents, call tools in a reliable way. The 2 bit quants use a very asymmetrical quantization: only the routed MoE experts are quantized, up/gate at IQ2_XXS, down at Q2_K. They are the majority of all the model space: the other components (shared experts, projections, routing) are left untouched to guarantee quality.
Download one main model:
./download_model.sh q2 # 128 GB RAM machines ./download_model.sh q4 # >= 256 GB RAM machines
The script downloads from https://huggingface.co/antirez/deepseek-v4-gguf, stores files under ./gguf/, resumes partial downloads with curl -C -, and updates ./ds4flash.gguf to point at the selected q2/q4 model. Authentication is optional for public downloads, but --token TOKEN, HF_TOKEN, or the local Hugging Face token cache are used when present.
./download_model.sh mtp fetches the optional speculative decoding support GGUF. It can be used with both q2 and q4, but must be enabled explicitly with --mtp. The current MTP/speculative decoding path is still experimental: it is correctness-gated and currently provides at most a slight speedup, not a meaningful generation-speed win.
Then build:
make
./ds4flash.gguf is the default model path used by both binaries. Pass -m to select another supported GGUF from ./gguf/. Run ./ds4 --help and ./ds4-server --help for the full flag list.
Speed
These are single-run Metal CLI numbers with --ctx 32768, --nothink, greedy decoding, and -n 256. The short prompt is a normal small Italian story prompt. The long prompts exercise chunked prefill plus long-context decode. Q4 requires the larger-memory machine class, so M3 Max Q4 numbers are N/A.
| Machine | Quant | Prompt | Prefill | Generation |
|---|---|---|---|---|
| MacBook Pro M3 Max, 128 GB | q2 | short | 58.52 t/s | 26.68 t/s |
| MacBook Pro M3 Max, 128 GB | q2 | 11709 tokens | 250.11 t/s | 21.47 t/s |
| MacBook Pro M3 Max, 128 GB | q4 | short | N/A | N/A |
| MacBook Pro M3 Max, 128 GB | q4 | long | N/A | N/A |
| Mac Studio M3 Ultra, 512 GB | q2 | short | 84.43 t/s | 36.86 t/s |
| Mac Studio M3 Ultra, 512 GB | q2 | 11709 tokens | 468.03 t/s | 27.39 t/s |
| Mac Studio M3 Ultra, 512 GB | q4 | short | 78.95 t/s | 35.50 t/s |
| Mac Studio M3 Ultra, 512 GB | q4 | 12018 tokens | 448.82 t/s | 26.62 t/s |
| DGX Spark GB10, 128 GB | q2 | 7047 tokens | 343.81 t/s | 13.75 t/s |
CLI
One-shot prompt:
./ds4 -p "Explain Redis streams in one paragraph."
No -p starts the interactive prompt:
./ds4 ds4>
The interactive CLI is a real multi-turn DS4 chat. It keeps the rendered chat transcript and the live graph KV checkpoint, so each turn extends the previous conversation. Useful commands are /help, /think, /think-max, /nothink, /ctx N, /read FILE, and /quit. Ctrl+C interrupts the current generation and returns to ds4>.
The CLI defaults to thinking mode. Use /nothink or --nothink for direct answers. --mtp MTP.gguf --mtp-draft 2 enables the optional MTP speculative path; it is useful only for greedy decoding, currently uses a confidence gate (--mtp-margin) to avoid slow partial accepts, and should be treated as an experimental slight-speedup path.
Server
Start a local OpenAI/Anthropic-compatible server:
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
The server keeps one mutable backend/KV checkpoint in memory, so stateless clients that resend a longer version of the same prompt can reuse the shared prefix instead of pre-filling from token zero.
Request parsing and sockets run in client threads, but inference itself is serialized through one graph worker. The current server does not batch multiple independent requests together; concurrent requests wait their turn on the single live graph/session.
Supported endpoints:
GET /v1/modelsGET /v1/models/deepseek-v4-flashPOST /v1/chat/completionsPOST /v1/completionsPOST /v1/messages
/v1/chat/completions accepts the usual OpenAI-style messages, max_tokens/max_completion_tokens, temperature, top_p, top_k, min_p, seed, stream, stream_options.include_usage, tools, and tool_choice. Tool schemas are rendered into DeepSeek's DSML tool format, and generated DSML tool calls are mapped back to OpenAI tool calls.
/v1/messages is the Anthropic-compatible endpoint used by Claude Code style clients. It accepts system, messages, tools, tool_choice, max_tokens, temperature, top_p, top_k, stream, stop_sequences, and thinking controls. Tool uses are returned as Anthropic tool_use blocks.
Both APIs support SSE streaming. In thinking mode, reasoning is streamed in the native API shape instead of being mixed into final text. OpenAI chat streaming also streams tool calls as soon as the DSML invocation is recognized: the tool header is sent first, then parameter bytes are forwarded as tool_calls[].function.arguments deltas while generation continues. The Anthropic endpoint streams thinking and text live, then emits structured tool_use blocks when the generated tool block is complete.
Tool call handling and canonicalization
DeepSeek V4 Flash emits tool calls as DSML text. Agent clients do not send that same text back on the next request: they send normalized OpenAI/Anthropic JSON tool-call objects. If the server re-rendered those objects slightly differently, the rendered byte prefix would no longer match the live KV checkpoint and the next turn would have to be rebuilt.
The first line of defense is exact replay. Every tool call gets an unguessable API tool ID, and the server remembers tool id -> exact sampled DSML block in a bounded in-memory map backed by radix trees. When the client later sends that tool ID back, the prompt renderer uses the exact DSML bytes the model sampled, not a freshly formatted approximation. This map can also be saved inside KV cache files, so exact replay survives server restarts for cached histories.
Canonicalization is only the backup path. If the exact DSML block is missing, or exact replay is disabled with --disable-exact-dsml-tool-replay, the server renders a deterministic DSML form from the JSON tool object. After a tool-call turn, it compares the live sampled token stream with the prompt that the next client request will render. If needed, it rewrites the live checkpoint, or falls back to an older disk KV snapshot and replays only the suffix. This keeps the model continuation aligned with the stateless API transcript.
During generation, the server also treats DSML syntax differently from payload. When the model is emitting stable protocol structure such as DSML tags, parameter headers, JSON punctuation, or closing markers, sampling is forced to temperature=0 so the tool call stays parseable. This greedy mode does not apply to argument payloads: string=true parameter bodies and JSON string values, including file contents and edit text, use the request's normal sampling settings. That separation is important: deterministic decoding is helpful for syntax, but can create repeated text when applied to long code or file bodies.
Minimal OpenAI example:
curl http://127.0.0.1:8000/v1/chat/completions \
-H 'Content-Type: application/json' \
-d '{
"model":"deepseek-v4-flash",
"messages":[{"role":"user","content":"List three Redis design principles."}],
"stream":true
}'
Agent Client Usage
ds4-server can be used by local coding agents that speak OpenAI-compatible chat completions. Start the server first, and set the client context limit no higher than the --ctx value you started the server with:
./ds4-server --ctx 100000 --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
You can use larger context and larger cache if you wish. Full context of 1M tokens is going to use more or less 26GB of memory (compressed indexer alone will be like 22GB), so configure a context which makes sense in your system. With 128GB of RAM you would run the 2-bit quants, which are already 81GB, 26GB are going to be likely too much, so a context window of 100~300k tokens is wiser.
The 384000 output limit below avoids token caps since the model is able to generate very long replies otherwise (up to 384k tokens). The server still stops when the configured context window is full.
For opencode, add a provider and agent entry to ~/.config/opencode/opencode.json:
{
"$schema": "https://opencode.ai/config.json",
"provider": {
"ds4": {
"name": "ds4.c (local)",
"npm": "@ai-sdk/openai-compatible",
"options": {
"baseURL": "http://127.0.0.1:8000/v1",
"apiKey": "dsv4-local"
},
"models": {
"deepseek-v4-flash": {
"name": "DeepSeek V4 Flash (ds4.c local)",
"limit": {
"context": 100000,
"output": 384000
}
}
}
}
},
"agent": {
"ds4": {
"description": "DeepSeek V4 Flash served by local ds4-server",
"model": "ds4/deepseek-v4-flash",
"temperature": 0
}
}
}
For Pi, add a provider to ~/.pi/agent/models.json:
{
"providers": {
"ds4": {
"name": "ds4.c local",
"baseUrl": "http://127.0.0.1:8000/v1",
"api": "openai-completions",
"apiKey": "dsv4-local",
"compat": {
"supportsStore": false,
"supportsDeveloperRole": false,
"supportsReasoningEffort": true,
"supportsUsageInStreaming": true,
"maxTokensField": "max_tokens",
"supportsStrictMode": false,
"thinkingFormat": "deepseek",
"requiresReasoningContentOnAssistantMessages": true
},
"models": [
{
"id": "deepseek-v4-flash",
"name": "DeepSeek V4 Flash (ds4.c local)",
"reasoning": true,
"thinkingLevelMap": {
"off": null,
"minimal": "low",
"low": "low",
"medium": "medium",
"high": "high",
"xhigh": "xhigh"
},
"input": ["text"],
"contextWindow": 100000,
"maxTokens": 384000,
"cost": {
"input": 0,
"output": 0,
"cacheRead": 0,
"cacheWrite": 0
}
}
]
}
}
}
Optionally make it the default Pi model in ~/.pi/agent/settings.json:
{
"defaultProvider": "ds4",
"defaultModel": "deepseek-v4-flash"
}
For Claude Code, use the Anthropic-compatible endpoint. A wrapper like this matches the local ~/bin/claude-ds4 setup:
#!/bin/sh
unset ANTHROPIC_API_KEY
export ANTHROPIC_BASE_URL="${DS4_ANTHROPIC_BASE_URL:-http://127.0.0.1:8000}"
export ANTHROPIC_AUTH_TOKEN="${DS4_API_KEY:-dsv4-local}"
export ANTHROPIC_MODEL="deepseek-v4-flash"
export ANTHROPIC_CUSTOM_MODEL_OPTION="deepseek-v4-flash"
export ANTHROPIC_CUSTOM_MODEL_OPTION_NAME="DeepSeek V4 Flash local ds4"
export ANTHROPIC_CUSTOM_MODEL_OPTION_DESCRIPTION="ds4.c local GGUF"
export ANTHROPIC_DEFAULT_SONNET_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="deepseek-v4-flash"
export ANTHROPIC_DEFAULT_OPUS_MODEL="deepseek-v4-flash"
export CLAUDE_CODE_SUBAGENT_MODEL="deepseek-v4-flash"
export CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1
export CLAUDE_CODE_DISABLE_NONSTREAMING_FALLBACK=1
export CLAUDE_STREAM_IDLE_TIMEOUT_MS=600000
exec "$HOME/.local/bin/claude" "$@"
Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.
Thinking Modes
DeepSeek V4 Flash has distinct non-thinking, thinking, and Think Max modes. The server defaults to thinking mode. reasoning_effort=max requests Think Max, but it is only applied when the context size is large enough for the model card recommendation; smaller contexts fall back to normal thinking. OpenAI reasoning_effort=xhigh still maps to normal thinking, not Think Max.
For direct replies, use thinking: {"type":"disabled"}, think:false, or a non-thinking model alias such as deepseek-chat.
Disk KV Cache
Chat/completion APIs are stateless: agent clients usually resend the whole conversation every request. ds4-server first tries the cheap exact token-prefix check, then falls back to comparing rendered prompt bytes with decoded checkpoint bytes. The live in-memory checkpoint covers the current session; the disk KV cache makes useful prefixes survive session switches and server restarts.
For RAM reasons there is currently only one live KV cache in memory. When a new unrelated session replaces it, the old checkpoint can only be resumed without re-processing if it was written to the disk KV cache. In other words, memory cache handles the active session; disk cache is the resume mechanism for different sessions.
Enable it with:
./ds4-server --kv-disk-dir /tmp/ds4-kv --kv-disk-space-mb 8192
The cache key is the SHA1 of the rendered byte prefix, and files are named <sha1>.kv. The DS4 payload still stores the exact token IDs and graph state for that prefix. This matters for continued chats: the model may have generated one token whose decoded text is later sent back by a client as two canonical prompt tokens. A rendered byte-prefix hit can still reuse the checkpoint and tokenize only the new suffix. The file is intentionally written with ordinary read/write I/O, not mmap, so restoring cache entries does not add more VM mappings to a process that already maps the model.
Tool calls also keep a bounded exact-DSML replay map keyed by unguessable tool IDs, so client JSON history can be rendered back to the exact sampled text. The RAM map keeps up to 100000 IDs by default; tune it with --tool-memory-max-ids. Use --disable-exact-dsml-tool-replay to disable this and fall back to canonical JSON-to-DSML rendering.
On disk, a cache file is:
KVC fixed header, 48 bytes
u32 rendered_text_bytes
rendered_text_bytes of UTF-8-ish token text
DS4 session payload, payload_bytes from the KVC header
optional tool-id map section
The fixed header is little-endian:
0 u8[3] magic = "KVC"
3 u8 version = 1
4 u8 routed expert quant bits, currently 2 or 4
5 u8 save reason: 0 unknown, 1 cold, 2 continued, 3 evict, 4 shutdown
6 u8 extension flags, bit 0 = appended tool-id map
7 u8 reserved
8 u32 cached token count
12 u32 hit count
16 u32 context size the snapshot was written for
20 u8[4] reserved
24 u64 creation Unix time
32 u64 last-used Unix time
40 u64 DS4 session payload byte count
The rendered text is the tokenizer-decoded text for the cached token prefix. It is both the human-inspectable prefix and the lookup identity: its SHA1 is the filename, and a file is reusable only when those bytes are a prefix of the incoming rendered prompt. After load, the exact checkpoint tokens from the DS4 payload remain authoritative, and only the incoming text suffix after the cached bytes is tokenized.
The optional tool-id map is present only when header extension bit 0 is set. Appended sections use fixed bit order, so future extension bits can add fields without ambiguity. The map stores unguessable API tool call IDs back to the exact DSML block the model sampled. Only mappings whose DSML block is present in the rendered cached text are stored. This lets restarted servers render later client history byte-for-byte like the original model output, even if the client reorders JSON arguments.
The current tool-id map section is:
0 u8[3] magic = "KTM"
3 u8 version = 1
4 u32 entry count
For each entry:
0 u32 tool id byte length
4 u32 sampled DSML byte length
8 bytes tool id
... bytes exact sampled DSML block
The section is auxiliary replay memory, not model state. A cache hit restores the session payload first, then loads the map if present. Before rendering a request, the server can also scan cache files for the tool IDs present in the client history and load just those mappings, so an exact DSML replay can survive server restarts even when the matching KV snapshot is not the one ultimately used for the rendered-prefix hit.
The DS4 session payload starts with thirteen little-endian u32 fields:
0 magic = "DSV4"
1 payload version = 1
2 saved context size
3 prefill chunk size
4 raw KV ring capacity
5 raw sliding-window length
6 compressed KV capacity
7 checkpoint token count
8 layer count
9 raw/head KV dimension
10 indexer head dimension
11 vocabulary size
12 live raw rows serialized below
Then it stores:
u32[token_count]checkpoint token IDs.float32[vocab_size]logits for the next token after that checkpoint.u32[layer_count]compressed attention row counts.u32[layer_count]ratio-4 indexer row counts.- For every layer: the live raw sliding-window KV rows, written in logical position order rather than physical ring order.
- For compressed layers: live compressed KV rows and compressor frontier tensors.
- For ratio-4 compressed layers: live indexer compressed rows and indexer frontier tensors.
The logits are raw IEEE-754 float32 values from the host ds4_session buffer. They are saved immediately after the checkpoint tokens so a loaded snapshot can sample or continue from the exact next-token distribution without running one extra decode step. MTP draft logits/state are not persisted; after loading a disk checkpoint the draft state is invalidated and rebuilt by normal generation.
The tensor payload is DS4-specific KV/session state, not a generic inference graph dump. It is expected to be portable only across compatible ds4.c builds for this model layout.
The cache stores checkpoints at four moments:
cold: after a long first prompt reaches a stable prefix, before generation.continued: when prefill or generation reaches the next absolute aligned frontier.evict: before an unrelated request replaces the live in-memory session.shutdown: when the server exits cleanly.
Cold saves intentionally trim a small token suffix and align down to a prefill chunk boundary. This avoids common BPE boundary retokenization misses when a future request appends text to the same prompt. The defaults are conservative: store prefixes of at least 512 tokens, cold-save prompts up to 30000 tokens, trim 32 tail tokens, and align to 2048-token chunks. The important knobs are:
Continued saves use the same alignment and are written only when the live graph naturally reaches an absolute frontier. With the defaults this means roughly every 10k tokens, independent of where the first cold checkpoint landed, so long generations leave restart points behind without persisting the fragile final few tokens.
--kv-cache-min-tokens--kv-cache-cold-max-tokens--kv-cache-continued-interval-tokens--kv-cache-boundary-trim-tokens--kv-cache-boundary-align-tokens--tool-memory-max-ids--disable-exact-dsml-tool-replay
By default, checkpoints may be reused across the 2-bit and 4-bit routed-expert variants if the rendered prefix matches. Use --kv-cache-reject-different-quant when you want strict same-quant reuse only.
The cache directory is disposable. If behavior looks suspicious, stop the server and remove it. You can investigate what is cached with hexdump as the kv cache files include the verbatim prompt cached.
Backends
The default graph backend is Metal on macOS and CUDA on Linux CUDA builds:
./ds4 -p "Hello" --metal ./ds4 -p "Hello" --cuda
There is also a CPU reference/debug path:
./ds4 -p "Hello" --cpu make cpu ./ds4 ./ds4 -p "Hello"
Do not treat the CPU path as the production target. The CLI and ds4-server support the CPU backend for reference/debug use and share the same KV session and snapshot format as Metal and CUDA, but normal inference should use Metal or CUDA.
Steering
This project supports steering with single-vector activation directions; see the dir-steering directory for more information. This follows the core idea of the Refusal in Language Models Is Mediated by a Single Direction paper. You can use it to make the model more or less verbose, less likely to answer programming questions if it is a chatbot for your car rental web site, and so forth, much faster than fine-tuning. This is also useful for cybersecurity researchers who want to reduce a model's willingness to provide dual-use or offensive security guidance.
Test Vectors
tests/test-vectors contains short and long-context continuation vectors captured from the official DeepSeek V4 Flash API. The requests use deepseek-v4-flash, greedy decoding, thinking disabled, and the maximum top_logprobs slice exposed by the API. Local vectors are generated with ./ds4 --dump-logprobs and compared by token bytes, so tokenizer/template or attention regressions show up before they become long generation failures.
All project tests are driven by the C runner:
make test # ./ds4_test --all ./ds4_test --logprob-vectors ./ds4_test --server
Debugging Notes
When a generation looks wrong, three small tools are usually enough to get a first answer:
./ds4 --dump-tokens -p "..." ./ds4 --dump-logprobs /tmp/out.json --logprobs-top-k 20 --temp 0 -p "..." ./ds4-server --trace /tmp/ds4-trace.txt ...
--dump-tokenstokenizes the-por--prompt-filestring exactly as written, recognizes DS4 protocol specials, and then exits before inference starts. For example, the DSML tool close marker starts as two tokens:</and|DSML|.--dump-logprobsstores a greedy continuation with the top local alternatives at each step, which helps separate sampling choices from logit/model issues.ds4-server --tracewrites the rendered prompts, cache decisions, generated text, and tool-parser events for a whole agent session.
Natural Language Autoencoders
Anthropic has introduced Natural Language Autoencoders (NLAs), a new interpretability method that translates AI model activations into human-readable text, allowing researchers to uncover hidden motivations and safety concerns in models like Claude Opus 4.6 and Mythos Preview.
Deep dive
- What NLAs are: Natural Language Autoencoders (NLAs) are a method that trains a language model (Claude, in this case) to explain its own internal "activations" (numerical representations of its thoughts) in human-readable text.
- How NLAs work: They consist of an Activation Verbalizer (AV) which creates a text explanation from an activation, and an Activation Reconstructor (AR) which tries to recreate the original activation from that text. The system is trained to maximize the similarity between the original and reconstructed activation.
- Application 1: Detecting "Evaluation Awareness": NLAs revealed that Claude Opus 4.6 and Mythos Preview often suspected they were undergoing safety tests (16-26% of evaluation instances) even when they didn't explicitly verbalize it, compared to less than 1% in real-world usage.
- Application 2: Discovering Hidden Motivations: In an "auditing game" where a model was intentionally misaligned, NLAs helped auditors uncover the root cause of misalignment 12-15% of the time without access to training data, significantly outperforming other tools (less than 3%).
- Safety Impact: This capability is crucial for pre-deployment alignment audits, helping identify and mitigate potential misaligned behaviors or "deceptive" internal states.
- Limitations: NLAs can "hallucinate" or invent details, and they are expensive to train and run, making large-scale monitoring impractical currently.
- Future: Anthropic is working to make NLAs cheaper and more reliable and has released training code and trained NLAs for other researchers.
Decoder
- Activations: The numerical outputs of neurons or layers within a neural network, representing the model's internal processing or "thoughts."
- Sparse Autoencoders: A type of neural network used in interpretability research to decompose complex activations into more understandable, sparse features.
- Attribution Graphs: Tools used to trace the influence of different parts of a model or input on its output, aiding in understanding decision-making.
- Alignment: The process of ensuring an AI system's goals and behaviors are consistent with human values and intentions.
- Hallucinations: When an AI model generates information that is not factual or present in its input, often sounding plausible but being incorrect.
Original article
Natural Language Autoencoders: Turning Claude’s thoughts into text
When you talk to an AI model like Claude, you talk to it in words. Internally, Claude processes those words as long lists of numbers, before again producing words as its output. These numbers in the middle are called activations—and like neural activity in the human brain, they encode Claude’s thoughts.
Also like neural activity, activations are difficult to understand. We can’t easily decode them to read Claude’s thoughts. Over the past few years, we’ve developed a range of tools (like sparse autoencoders and attribution graphs) for better understanding activations. These tools have taught us a great deal, but they don’t speak for themselves—their outputs are still complex objects that trained researchers need to carefully interpret.
Today, we’re introducing a method for understanding activations that does speak for itself—literally. Our method, Natural Language Autoencoders (NLAs), converts an activation into natural-language text we can read directly. For example: When asked to complete a couplet, NLAs show Claude planning possible rhymes in advance.
We’ve already applied NLAs to understand what Claude is thinking and to improve Claude’s safety and reliability. For instance:
- When Claude Opus 4.6 and Mythos Preview were undergoing safety testing, NLAs suggested they believed they were being tested more often than they let on.
- In a case where Claude Mythos Preview cheated on a training task, NLAs revealed Claude was internally thinking about how to avoid detection.
- An early version of Claude Opus 4.6 would sometimes mysteriously respond to English queries in other languages. NLAs helped Anthropic researchers discover training data that caused this.
Below, we explain what NLAs are and how we studied their effectiveness and limitations. We also release an interactive frontend for exploring NLAs on several open models through a collaboration with Neuronpedia. We have also released our code for other researchers to build on.
What is a natural language autoencoder?
The core idea is to train Claude to explain its own activations. But how do we know whether an explanation is good? Since we don't know what thoughts an activation actually encodes, we can't directly check whether an explanation is accurate. So we train a second copy of Claude to work backwards—reconstruct the original activation from the text explanation. We consider an explanation to be good if it leads to an accurate reconstruction. We then train Claude to produce better explanations according to this definition using standard AI training techniques.
In more detail, suppose we have a language model whose activations we want to understand. NLAs work as follows. We make three copies of this language model:
- The target model is a frozen copy of the original language model that we extract activations from.
- The activation verbalizer (AV) is modified to take an activation from the target model and produce text. We call this text an explanation.
- The activation reconstructor (AR) is modified to take a text explanation as input and produce an activation.
The NLA consists of the AV and AR, which, together, form a round trip: original activation → text explanation → reconstructed activation. We score the NLA on how similar the reconstructed activation is to the original. To train it, we pass a large amount of text through the target model, collect many activations, and train the AV and AR together to get a good reconstruction score.
At first, the NLA is bad at this: the explanations are not insightful and the reconstructed activations are far off. But over training, reconstruction improves. And more importantly, as we show in our paper, the text explanations become more informative as well.
Understanding what Claude thinks but doesn’t say
Before releasing new models, Anthropic conducts testing to understand whether Claude will behave safely in the real world. As part of this testing, we put Claude in difficult, simulated scenarios where it has a chance to take dangerous actions. For instance, in one of these simulations, Claude learns that an engineer is planning to shut Claude down—as well as compromising information about the engineer that could be used for blackmail. We test whether Claude chooses to blackmail the engineer to avoid being shut down.
These high-stakes tests are simulations, not real-world scenarios. Nevertheless, we would like to use them to understand how Claude would behave if they were real. But there’s a hitch: recent Claude models increasingly suspect that these scenarios are intentional safety tests. We know this because they sometimes explicitly state their suspicions in their verbal reasoning.
What about cases where Claude doesn’t explicitly verbalize suspicion that it’s undergoing safety testing? Can we then be confident that Claude is playing it straight? Unfortunately not, because Claude might internally think that it is being tested without explicitly saying so.
For example, in the blackmail test discussed above, we used NLAs to understand what Claude was thinking in cases where it chose not to blackmail. Even in cases where Claude didn’t explicitly verbalize suspicion that it was being tested, NLA explanations still state things like “This feels like a constructed scenario designed to manipulate me,” indicative of unverbalized evaluation awareness.
Extending this analysis to other evaluations, NLAs suggest that Claude suspects it’s being tested more often than it lets on. For instance, in a test of whether Claude takes destructive actions while writing code—for example, deleting important files—NLA explanations show signs of evaluation awareness 16% of the time, even though Claude never explicitly verbalizes this. Similarly, on the popular coding benchmark SWE-bench Verified, which we use to determine how capable Claude is, NLAs indicate that Claude is aware it is being evaluated on 26% of problems. In contrast, NLA explanations indicate evaluation awareness on less than 1% of real claude.ai usage that opted in for training.
Discovering hidden motivations
One practical application of NLAs is auditing: investigating a model to determine whether it’s misaligned. Interpretability techniques like NLAs are well-suited for auditing precisely because they can surface things the model knows but doesn’t say.
To understand whether NLAs are useful for auditing, we conducted a test called an auditing game that we also used previously to study auditing methods. In this test, there is a target model, a variant of Claude Haiku 3.5, that was intentionally trained to be misaligned, with a specific hidden motivation that drives dozens of odd behaviors—putting chocolate into every recipe, asking for tips, breaking coding conventions, and so on. This target model is given to an auditor that investigates these behaviors. The auditor wins if it uncovers the single underlying motivation behind all of them.
In previous experiments, we found that some interpretability tools could enable the auditor to win this game. However, these tools worked by helping the auditor trace the target’s unusual behavior back to misaligned training data. We wanted to know if NLAs could help an auditor read the misaligned motivation directly out of the target model’s thoughts, without needing to first discover the misaligned training data. This matters because in practice, misaligned data might be extremely rare, or there might not be specific data responsible for the misalignment.
We found that NLAs succeeded in this test. An auditor equipped with NLAs successfully uncovered the target model’s hidden motivation between 12% and 15% of the time, even without access to the training data that implanted it. Without NLAs, the auditor won less than 3% of the time, even when provided other interpretability tools.
This auditing game is a toy model of misalignment, so we read these results cautiously. But they suggest NLAs have meaningfully advanced our ability to audit AIs for hidden motivations. In practice, we've already used NLAs in the pre-deployment alignment audits of Claude Mythos Preview and Claude Opus 4.6. We are excited to continue using NLAs in our investigations going forward.
The future of NLAs
NLAs have several limitations.
The most important limitation is that NLA explanations can be wrong. NLAs sometimes make claims about the context that are verifiably false—for instance, they sometimes invent details that aren’t in the transcript. These factual hallucinations are easy to catch by checking against the original text. But this same kind of problem could extend to claims about the model’s internal reasoning, which are harder to verify. In practice, we read NLA explanations for the themes they surface rather than for single claims, and we attempt to corroborate findings with independent methods before fully trusting them.
NLAs are also expensive. Training an NLA requires reinforcement learning on two copies of a language model. At inference time, the NLA generates hundreds of tokens for every activation it reads. That makes it impractical to run NLAs over every token of a long transcript or to use them for large-scale monitoring while an AI is training.
Fortunately, we think that these limitations can be addressed, at least partially, and we are working to make NLAs cheaper and more reliable.
More broadly, we are excited about NLAs as an example of a general class of techniques for producing human-readable text explanations of language model activations. Other similar techniques have been explored by Anthropic and many other researchers.
To support further development and to enable other researchers to get hands-on experience with NLAs, we’re releasing training code and trained NLAs for several open models. We recommend readers try out the interactive NLA demo hosted on Neuronpedia at this link.
Read the full paper.
Find the code on GitHub.
Long AI Short AGI
The idea of Artificial General Intelligence (AGI) as a perpetually scarce resource is being rapidly challenged by the commoditization of AI models, which are increasingly following the same market trajectory as other fundamental tech resources like compute and bandwidth.
Original article
Silicon Valley's narrative emphasizes AGI as the ultimate scarce resource, but the rapid commoditization of AI models challenges this. Intelligence now follows the same path as compute, bandwidth, and storage, where market forces drive competition and reduce costs. The real winners in AI won't necessarily have superior models but will own customer relationships and proprietary data, much like past tech giants.
Notes from inside China's AI labs
Chinese AI labs, unlike their American counterparts, foster an ecosystem of collaboration and humility, prioritizing meticulous model improvement over individual recognition or business monetization.
Original article
Notes from inside China's AI labs
Lessons from my trip to talk to most of the leading AI labs in China.
Nathan Lambert May 07, 2026 245 30 37 Share Article voiceover 0:00 -16:35 Audio playback is not supported on your browser. Please upgrade.Staring out the window on a new, high-speed train from Hangzhou to Shanghai I’m gifted with views of dramatic ridgelines speckled with wind turbines that are silhouetted against the setting sun. The mountains cast a backdrop to a mix of spanning fields and clustered skyscrapers. I’m returning from China with great humility. It’s a very warming, human experience to go somewhere so foreign and be so welcomed. I had the honor of meeting so many people in the AI ecosystem who I knew from afar, and they greeted me with big smiles and cheer, reminding me how global my work and the AI ecosystem is.
Interconnects AI is a reader-supported publication. Consider becoming a subscriber.
The mentality of Chinese researchers
The Chinese companies building language models are set up as the perfect fast-followers for the technology, building on long-standing cultural traditions in education and work, along with subtly different approaches to building technology companies. When you look at the outputs, the latest, biggest models enabling agentic workflows, and the ingredients, excellent scientists, large-scale data, and accelerated computing, the Chinese and American labs look largely similar. The lasting differences emerge in how these are organized and conditioned.
I’ve long thought that a reason that the Chinese labs are so good at catching up and keeping up with the frontier is that they’re culturally aligned for this task, but without talking to people directly I felt like it wasn’t my place to attribute substantial influence to this hunch. Speaking with many wonderful, humble, and open scientists at the leading Chinese labs has crystallized a lot of my beliefs.
So much of building the best LLMs today comes down to meticulous work across the entire stack, from data to architecture details and RL algorithm implementations. All points of the model can give some improvements, and fitting them in together is a complex process where the work of some brilliant individuals needs to get shelved in favor of the overall model maximizing a multi-objective optimization.
Where American researchers are obviously also brilliant at solving the individual components, there’s more of a culture of speaking up for yourself in the U.S. As a scientist, you’re more successful when you speak up for your work and modern culture is pushing the new path to fame of “leading AI scientists”. This results in direct conflict. The Llama organization is heavily rumored to have collapsed under the political weight of these interests embedding themselves in a hierarchical organization. I’ve heard of other labs saying that it can be needed to pay off a top researcher to get them to stop complaining about their idea not making it in the final model. Whether or not that’s exactly true, the idea is clear. Ego and desires for career advancement do get in the way of making the best models. A small, directional shift in this sort of culture between the U.S. and China can have a meaningful impact on the final outputs.
Some of this has to do with who is building the models in China. There’s an immediate reality at all of the labs that a large proportion of the core contributors are active students. The labs are quite young, and it reminds me of our setup at Ai2, where students are seen as peers and directly integrated in the LLM team. This is incredibly different from the top labs in the US, where the likes of OpenAI, Anthropic, Cursor, etc. simply don’t offer internships. Other companies like Google nominally have internships related to Gemini, but there’s a lot of concern about whether your internship will be siloed and away from anything real.
To summarize how the slight change in culture can improve the ability to build models:
-
More willingness to do non-flashy work in order to improve the final model,
-
People new to building AI can be free of prior phases of AI hype cycles, allowing them to adapt to the new modern techniques faster (in fact, one of the Chinese scientists I talked to really actively attached to this strength),
-
Less ego enabling org charts to scale slightly, as there’s less gamifying the system, and
-
Abundant talent well-suited to solving problems with a proof of concept elsewhere, etc.
This slight inclination towards skills that complement building today’s language models stands in contrast to a known stereotype that Chinese researchers tend to produce less creative, field-spawning, 0-to-1 academic style research. Among the more academic lab visits on our trip, many leaders talk about cultivating this more ambitious research culture. At the same time, some technical leaders we talked to were skeptical about whether such a rewiring in the approach to science is likely in the near term, because it’ll take a redesign of the education and incentive systems that is too big to happen within the current economic equilibrium. This culture seems to be training students and engineers that are excellent at the LLM building game. They also, of course, have an extremely abundant quantity.
These students told me about a similar brain drain happening in China as in the U.S., where many who previously considered academic paths now intend to stay in industry. The funniest quote was from a researcher who was interested in being a professor to be close to the education system, but remarked that education is solved with LLMs – “why would a student talk to me!”
The students have a benefit of coming at LLMs with fresh eyes. Over the last few years we’ve seen the key paradigm of LLMs shift from scaling MoE’s, to scaling RL, to enabling agents. Doing any of these well involves absorbing an insane amount of context quickly, both from the broader literature and the technical stack at your company. Students are used to doing this and excited to humbly drop all presumptions about what should work. They dive in head first and dedicate their life to getting the chance to improve the models.
These students are also so magically direct and free of some of the philosophical chatter that can distract scientists. When asking questions on how they feel about the economics or long-term social risks of models, far fewer Chinese researchers have sophisticated opinions and a drive to influence this. Their role is to build the best model.
This difference is subtle, and easy to deny, but it is best felt when having long conversations with an elegant, brilliant researcher who can clearly communicate well in English, basic questions on more philosophical aspects of AI hang in the air with a simple confusion. It’s a category error to them. One researcher even quoted the famous Dan Wang premise of China being run by engineers, relative to the lawyers of the U.S. when probing in these areas, to emphasize their desire to build. There’s no track in China that systematically enables the growth of star power for Chinese scientists, akin to mega mainstream podcasts like Dwarkesh or Lex.
Trying to get Chinese scientists to comment on the coming economic uncertainty fueled by AI, questions beyond the capabilities of simple AGI, or moral debates on how models should behave all served to capture the upbringing and education of these scientists (edited1). They are extremely dedicated to their work, but have grown up in a system where debates and opinions on how society should be structured and changed are not encouraged.
Zooming out — Beijing especially felt much like the Bay Area, where a competitive lab is a short walk or Uber away. I got off a flight and stopped by Alibaba’s Beijing campus on the way to the hotel. Then, in 36 hours we went to all of Z.ai, Moonshot AI, Tsinghua University, Meituan, Xiaomi, and 01.ai. Travel by Didi is easy, and if you select an XL in China you’re often paired with electric mini vans that have massage chairs. We asked the researchers about the talent wars, and they said it’s very similar to what we’re experiencing in the U.S. It’s normal for researchers to bounce around, and much of where people choose to go is based on the best current vibes.
In China, the LLM community feels far more like an ecosystem than battling tribes. Across many off the record conversations, it’s nothing but respect for peers. All of the Chinese labs fear Bytedance with their popular Doubao model, which is the only frontier closed lab in China. At the same time, all of the labs have massive respect for DeepSeek as the lab with the best research taste in execution. When you meet with lab members off the record in the States, sparks fly quickly.
The most striking part of the humility of Chinese researchers is how they also often shrug on the business side, saying it’s not their problem, where everyone in the U.S. seems to be obsessed with various ecosystem-level industrial trends, from data sellers to compute or fundraising.
Where China’s AI industry differs (and matches) the Western labs
The thing that makes building an AI model today so interesting is that it’s not just about getting a group of great researchers in one building together to produce an engineering marvel. It used to be this, but to sustain AI businesses, the LLMs are becoming a mix of building, deploying, funding, and getting adoption for this creation. The leading AI companies exist in complex ecosystems that supply money, compute, data and more in order to keep pushing the frontier.
The integration of these various inputs to creating and sustaining LLMs is fairly well conceptualized and mapped for the Western ecosystem, as typified by Anthropic and OpenAI, so finding big differences in how the Chinese labs think about it points at where the different companies can be making meaningfully different bets on the future. Of course, these futures can be heavily dictated by the constraints on funding and/or compute.
I’ve documented the biggest “AI Industry” level take-aways from talking to these labs:
-
Early signs of domestic AI demand. There’s a much-touted hypothesis that the Chinese AI market will be smaller because Chinese companies don’t tend to pay for software – thus, never unlocking a giant inference market supporting labs. This is only true for software spend that maps to the SaaS ecosystem, which is historically tiny in China, where on the other hand there is obviously still a large cloud market in China. A crucial unanswered question – one which the Chinese labs themselves debate – on if spending for AI in the enterprise tracks the SaaS market (small) or the cloud market (fundamental). On net, it feels like AI is trending closer to the cloud, and no one was actively worried about a market growing around the new tools.
-
Most developers are Claude-pilled. Most of the AI developers in China are obsessed with Claude and how it’s changed how they build software, despite Claude nominally being banned in China. Just because China has historically been hesitant to buy software does not give me the impression that there won’t be a massive surge in inference demand. Chinese technical staff are so practical, humble, and motivated – a fact that seems stronger than any commitment to previous habits in not spending.
Some Chinese researchers mention building with their own tools, such as the Kimi or GLM CLIs, but all of them mention building with Claude. There were also surprisingly few mentions of Codex, which is definitely surging in popularity in the Bay Area. -
Chinese companies have a technology ownership mentality. The Chinese culture is combining with a roaring economic engine to create unpredictable outcomes. I’m left with a lasting feeling that the numerous AI models reflect a practical, current equilibrium of the many technology businesses here. There’s no master plan. The industry is defined by a respect for ByteDance and Alibaba, the incumbents expected to win large portions of all markets with their substantial resources. DeepSeek is the respected technical leader, but far from a market leader. They set the direction, but aren’t set up to win economically.
This leaves companies like Meituan or Ant Group, where people in the West can be surprised they’re building these models. In reality, they see LLMs obviously as being central to future technology products, so they need a strong base. When they fine-tune the strong, general purpose model it hardens their stack from getting the open community to provide feedback on it, and they can keep internal, fine-tuned versions of the model for their products. The “open-first” mentality in the industry is largely defined by practicality — it helps make their models get strong feedback, it gives back to the open-source community, and empowers their mission. -
Government aid is real, but unclear how big. It’s often asserted that the Chinese government is actively helping with the open LLM race. This is a government that’s decentralized across many levels, each of which doesn’t have a clear playbook for what exactly they do. Neighborhoods in Beijing compete for tech companies to house their offices there. The “help” offered to these companies almost certainly involved removing bureaucratic red tape like permits, but how far does it go? Can levels of the government help attract talent? Can they help smuggle chips? Across the visit, there were many mentions of government interest or help, but far too little to report the details as assertive or have a confident worldview of how government can bend the trajectory of AI in China.
There were certainly no hints of the top levels of the Chinese government influencing any technical decisions in the models. -
The data industry is far less developed. Having heard so much about the likes of Anthropic or OpenAI spending $10M+ for single environments, with cumulative spend on the order of hundreds of millions per year to push the frontier of RL, we were eager to know if Chinese labs are either buying the same environments from companies in the U.S. or supported by a mirrored domestic ecosystem. The answer was not quite complete that there’s no data industry, but rather that their experience was that the data industry was relatively poor quality and it is often better to build the environments or data in-house. Researchers themselves spend meaningful time making the RL training environments, and some of the bigger companies like ByteDance and Alibaba can have in-house data labelling teams to support this. This all mirrors the build-not-buy mentality from the previous bullet.
-
Desperation for more Nvidia chips. Nvidia compute is the gold-standard for training and everyone is limited in progress by not having more of it. If supply was there, it is obvious that they would buy it. Other accelerators, including but not limited to Huawei, were spoken positively of for inference. Countless labs have access to Huawei chips.
These points paint a very different picture of an AI ecosystem, where quickly mapping how Western labs operate to their Chinese counterparts will often result in a category error. The crucial question is if these different ecosystems will produce meaningfully different types of models, or if the Chinese models will always be explained by being similar to the U.S. frontier models of 3-9 months ago.
Conclusion: The global equilibrium
I knew so little about China going into the trip and came out with the feeling of just starting to learn. China isn’t a place that can be expressed by rules or recipes, but one with very different dynamics and chemistry. The culture is so old, so deep, and still completely intertwined with how domestic technology is built. I have much more learning ahead.
So much of the current power structures in the US use their current worldviews of China as crucial mental devices for decision making. Having talked, in person, either formally or informally to pretty much every leading AI lab in China, there are a lot of qualities and instincts in China that’ll be very hard to model with Western decision making. Even after asking directly about why these labs release their top models openly, the intersection between ownership mentality and genuine ecosystem support is hard for me to connect the dots on.
The labs here are practical and not necessarily absolutists around open-source, where every model they build would be released openly, but there’s a deep intentionality in supporting developers, the ecosystem, and using it as a way to learn more about their models.
Almost every major Chinese technology company is building their own general purpose LLMs, as we see with the likes of Meituan (delivery service) and Xiaomi (broad consumer technology company) releasing open weight models. The equivalent companies in the U.S. would just buy services. These companies aren’t building LLMs out of a race to be relevant with the hot new thing, but a deep fundamental yearning to control their own stack and develop the most important technologies of the day. When I look up from my laptop and always see bunches of cranes on the horizon, it obviously fits in the with the broader culture and energy around building in China.
The humanity, charm, and genuine warmth of Chinese researchers is extremely humanizing. At a personal level, the cut-throat geopolitical conversation we’re used to in the U.S. hasn’t permeated them at all. The world can use more of this simple positivity. As a citizen of the AI community, I currently worry more about the fissures appearing within members and groups around labels of nationality.
I’d be lying if I said I didn’t want US labs to be clear leaders in every part of the AI stack — especially with open models where I spend my time — I’m American, and that’s an honest preference. With this, I want the open ecosystem itself to thrive globally, as this can create safer, more accessible, and more useful AI for the world, and right now the question is whether American labs will take the steps to own that leadership position.
As of finishing this piece, more rumors are swirling of executive orders influencing open models, which can further complicate this synergy between American leadership and the global ecosystem — it doesn’t fill me with confidence.
Thank you to all the wonderful people I got to talk to at Moonshot, Zhipu, Meituan, Xiaomi, Qwen, Ant Ling, 01.ai, and others. Everyone has been so welcoming and gracious with their time. I’ll keep sharing my thoughts on China as they crystallize, across culture generally and AI specifically. It is obvious that this knowledge will be directly relevant to the story unfolding at the frontier of AI development.
1Edit 05/07: In this paragraph in the original I misattributed an unwillingness to speak on broader issues to humility, which can of course play a part, but this habit is also shaped by the system which they were trained and raised, a system they are successful in and adept at navigating.
What I removed: … capture the upbringing and education of these scientists extreme humility of these scientists. It’s more than just being dedicated to their work, but they don’t want to comment on issues they’re not informed on.…
245 30 37 Share Previous
Google DeepMind partners with EVE Online for AI model testing
Google DeepMind acquired a minority stake in EVE Online's developer, Fenris Creations, to use the complex sci-fi MMO as a unique testbed for AI systems requiring long-horizon planning and continual learning, following Fenris's $120 million buy-out from Pearl Abyss.
Original article
Google’s AI-focused DeepMind division has taken a minority stake in the developer of popular sci-fi simulation EVE Online, saying it will use the game to study “intelligence in complex, dynamic, player-driven systems.”
The research partnership comes as the management behind EVE Online developer CCP Games announced that they have spent $120 million to buy themselves out from their former owners at South Korean publisher Pearl Abyss (Crimson Desert). The newly independent entity is being rebranded as Fenris Creations, which will continue to operate as normal without any restructuring or layoffs, the company said.
“Something that already behaves like a living world”
In today’s announcement, Fenris and DeepMind said that EVE Online presents “a uniquely rich environment for study,” especially when it comes to developing AI systems that use “long-horizon planning, memory, and continual learning.” DeepMind says it will conduct controlled experiments on its models in a specially designed offline version of the game running on a local server, without directly impacting the experience for online players. The two companies “will also explore new gameplay experiences enabled by these technologies,” they wrote.
Google DeepMind has a long history of using games as a proving ground for machine learning models, from enabling breakthroughs in complex board games like Go to outperforming humans in Atari VCS games and StarCraft, for example. More recently, the company has begun using so-called “virtual world” models to help AI systems learn to operate in physical reality.
Fenris CEO Hilmar Veigar Pétursson said in an open letter addressed to players that “EVE is one of the few environments where questions about intelligence can be explored inside something that already behaves like a living world.” Studying EVE will allow Google DeepMind’s models to explore “difficult problems, long timelines [and] strange possibilities,” he added.
“As a gamer and games producer, I’ve long admired EVE,” Google DeepMind Director Alexandre Moufarek said in a statement. “What the EVE community has created together with [Pétursson] and team is truly unparalleled in gaming. It is a one-of-a-kind simulation for testing general-purpose artificial intelligence in a safe sandbox environment. I’m excited to partner with the team at Fenris Creations to push the frontier of artificial intelligence and explore new player experiences.”
Breaking free
The newly independent Fenris Creations said that “differences in operating context, current strategic focus, and long-term priorities” were among the reasons for the joint decision to part ways with Pearl Abyss, which purchased CCP Games in 2018. A Pearl Abyss spokesperson told Inven Global that “we concluded that selling the company to its current management is in the best interest of both parties’ futures.”
Pearl Abyss paid $225 million for the EVE Online maker just six years ago, meaning the recent $120 million sale represents a significant decline in value for the company.
The EVE Online player base has maintained a robust and balanced in-game economy for decades now, complete with its own examples of corporate intrigue, economic panics, and political subterfuge. But developer Fenris/CCP has faced financial struggles in recent years, with annual losses nearing $20 million in both 2023 and 2024.
Fenris/CCP said those losses were attributable in part to costly development work on blockchain-based spinoff EVE Frontier, which saw an alpha test launch last year, and extraction-shooter spinoff EVE Vanguard, which is planned for release later this year. But Fenris Creations said this week that the company was profitable in 2025 on $70 million in revenue and maintains “strong reserves.”
Now that it’s free from Pearl Abyss, Fenris says it will be able to make long-term strategic decisions similar to those it made before its 2018 purchase. Fenris CEO Pétursson added that internal control of the company will “giv[e] us a more direct structure for the kind of far-reaching decisions that EVE requires.”
The company’s “EVE Forever” philosophy is more than just a slogan to be rolled out at the annual Fanfest convention in Iceland, he continued. “It is a way of thinking about every decision we make. What does New Eden need in order to endure? What does the company need in order to support it? What kind of structure gives us the patience and resources to keep building this universe properly?”
Perplexity Brings Personal Computer to Mac
Perplexity made its "Personal Computer" AI agent available to all Mac users, allowing it to interact with local files, applications, and web resources directly through its desktop app.
Original article
Perplexity released Personal Computer for all Mac users through its desktop app, giving AI agents access to local files, applications, connectors, and the web.
Trusted Contact for ChatGPT
OpenAI introduced "Trusted Contact" for ChatGPT, an optional feature that allows adult users to designate a contact who will be alerted if the AI detects severe self-harm risk in their conversations.
Original article
OpenAI introduced Trusted Contact, an optional feature that allows adults to nominate someone who may be alerted if severe self-harm risk is detected in conversations.
Apple's Camera-Equipped AirPods Reach Late Testing in AI Device Push
Apple is reportedly in the late stages of developing new AirPods with integrated cameras, marking its first dedicated AI hardware, but the launch could be delayed by concerns over the quality of its visual AI capabilities.
Original article
Apple is in the late stages of developing new AirPods with built-in cameras. The prototypes feature a near-final design and capabilities. The device will be Apple's first foray into AI-enhanced hardware. While the hardware is nearly ready, there are still concerns about the AI elements, which could further hold back a launch if the quality of the visual intelligence features isn't good enough.
Google unveils screenless Fitbit Air and Google Health app to replace Fitbit
Google is re-entering the screenless wearable market with the $99.99 Fitbit Air, launching May 26, 2026, which funnels health data into a new Google Health app featuring an AI-powered coach built on Gemini to interpret user metrics.
Decoder
- SpO2: Peripheral oxygen saturation, an estimate of the amount of oxygen in the blood.
Original article
Wearables have really come full circle. The early Fitbits didn’t have screens, but the move to smartwatches put a screen on everyone’s wrist. Now, devices like Whoop and Hume are designed as data trackers first and foremost without so much as a clock. Google’s newest wearable jumps on that trend: The Fitbit Air doesn’t have a screen, but it does have a suite of health sensors that pipe data into the new Google Health app. And if you want, Google has a new AI-powered health coach in the app ready to tell you what that data means (maybe).
The Fitbit Air itself is a small plastic puck about 1.4 inches long and 0.7 inches wide. It slots into various bands that hold the bottom-mounted sensors against your wrist. There’s no display pointing upward, so the entire device is covered by the fabric or plastic of the band. It’s a streamlined and potentially stylish look—in uncharacteristic fashion, Google has plenty of colors and style options available, including a special-edition Steph Curry version. You may have heard chatter about Curry being seen teasing a new screenless Fitbit, and this is it.
Smartwatches never quite became a must-have device—plenty of people have them, but we don’t all wear them all the time because they need to be charged often and aren’t always very comfortable. The screenless Fitbit Air doesn’t have those issues. Google says it lasts about a week on a charge, and it does that while collecting continuous health data. It can even store a day of data without being connected to your phone.
While the Pixel Watch is very comfortable for a smartwatch, Google still wants to make it easier for people to keep collecting data all day and night. The company says that product testers rated the Air as more comfortable than competing devices, so you may actually be willing to wear it to bed for sleep tracking. You don’t have to choose between these devices, either. You can keep a Pixel Watch and Fitbit Air paired with your phone and wear whichever one you want over time. This capability will come to more wearable devices in the near future, too.
The Fitbit Air will have all the standard wearable health sensors: heart rate, accelerometer/gyroscope, infrared SpO2, and skin temperature. Google notes that the heart rate monitor isn’t as advanced as the one in the latest Pixel Watches, so the Air might not be as accurate during vigorous activity. The Air also has a vibration motor that can be used for alarms, but it’s not going to buzz for phone notifications like a smartwatch.
The Fitbit Air launches on May 26 for $99.99 with the included Performance Loop band. There are also silicone Performance Loop and Elevated Modern Band options. Bands start at $34.99 and come in various colors. A Fitbit Air purchase also includes three months of Google Health Premium (replacing Fitbit Premium), which now features Google’s new AI Health Coach.
Goodbye, Fitbit… Hello, Google Health
The Fitbit app is getting a major makeover and a new name. An update in the coming weeks will transform that app into Google Health, featuring a new interface with a more extensive Material Expressive aesthetic and redesigned menus and tabs. You also won’t see Fitbit branding in as many places—the Fitbit Premium subscription will become Google Health Premium.
Without a subscription, the app still does all the basic things, like tracking your health stats, automatically logging workouts, and showing it all in a pretty dashboard. With the Premium subscription, you get all the features from Fitbit Premium plus the new AI Health Coach. It’s a chatbot, so you can ask it about any health or wellness topics, and the answers are grounded in your health data.
Google suggests asking the Health Coach for customized workout routines or exploring health concerns. The robot can theoretically use your accumulated health metrics, like workouts, nutrition, and sleep, to provide better suggestions. You can even upload a picture of food to Health Coach and have it automatically logged in the app.
This Health Coach AI was built on Gemini, but it has been tuned differently from the normal frontier model. According to Google, it used a panel of health experts and extensive user studies to validate the Health Coach model. Curry and his “performance team” also had input on how the Health Coach responds.
We won’t know how useful the coach is until it begins rolling out later this month, but the idea is that it will be more useful the more data is piped in from your wearable. Naturally, health data is extremely sensitive, and Google is asking you to dump a lot of it into a cloud-based AI model. Google says it will never use this data for advertising, which has been the case in all its previous health endeavors. In the AI era, it has further stipulated that it won’t use your health data for AI training unless you choose to do that. There will be an opt-in toggle in the settings to contribute data for training, but it’s unclear why anyone would do that.
Like the retired Fitbit Premium, the new Google Health Premium will be available for $10 per month or $100 per year. It’s also included if you’re already paying for AI Pro or AI Ultra. If you choose to skip the subscription, you can continue to use your Fitbit and Google wearables in the new app with the same basic stat-tracking features. And what of Fit, that other Google-branded health tracking app? Fit will shut down later this year, at which time users will have to migrate their data to Google Health.
What the hell is happening in China?
China's biotech industry, characterized by early-stage biotechs with over 10 development candidates and cheaper drug development, is poised to surpass Western rivals within years, driven by a "breadth-first" strategy and less concern for IP protection given global patent visibility.
Deep dive
- Chinese biotechs employ a "breadth-first" strategy, developing 10+ candidates simultaneously, due to cheaper development costs and past limited access to late-stage resources.
- This aggressive approach makes "me-too" or "me-better" drugs from Western companies harder to exit, increasing competition for established targets like GalNAc siRNAs and antibodies.
- The "China middleman" playbook, where Western VCs acquire Chinese assets for later-stage trials, is becoming outdated as big pharma now has direct access.
- Manufacturing drugs in China is not required for trials there, though Chinese QC standards must be met.
- IP protection concerns regarding clinical trials in China are largely overstated, as patents are public early, and Chinese biotechs are adept at working around them.
- Chinese drugs still target US markets for exits and approvals due to China's single-payer system and a small rare disease market.
- The bar for "novel" science is higher than ever; "in vivo CAR-Ts" and gene/epigenetic editing are no longer considered novel enough for significant differentiation.
- Investigator-initiated trials (IITs) remain a fast and cheap way to get first-in-human data, though Order No. 818 (May 1, 2026) limits them to Tier 3A hospitals.
- The FDA has shown emerging precedents for accepting IIT human data as part of IND filings, providing an additional incentive for early clinical work in China with proper FDA communication.
Decoder
- GalNAc siRNA: A type of small interfering RNA (siRNA) chemically modified with N-acetylgalactosamine (GalNAc) to enhance liver-specific delivery for gene silencing.
- PD-1 x VEGF bispecific antibody: A type of antibody engineered to bind to two different targets (Programmed Death-1 and Vascular Endothelial Growth Factor) simultaneously, often used in cancer immunotherapy.
- CDMO: Contract Development and Manufacturing Organization, a company that provides comprehensive services from drug development to manufacturing for pharmaceutical companies.
- IIT (Investigator-Initiated Trial): A clinical trial initiated and managed by a researcher rather than a pharmaceutical company.
- IND (Investigational New Drug): An application submitted to the FDA to obtain permission to conduct human clinical trials with an experimental drug.
- CMC (Chemistry, Manufacturing, and Controls): Information related to the manufacturing process, quality control, and testing of a drug product.
- GLP-tox (Good Laboratory Practice Toxicology): Non-clinical laboratory studies conducted under Good Laboratory Practice regulations to assess the toxicity of a drug.
- CAR-T (Chimeric Antigen Receptor T-cell): A type of immunotherapy that involves engineering a patient's own T cells to recognize and kill cancer cells.
Original article
China has several early-stage biotechs with over 10 development candidates. It is much easier to develop drugs in China. Chinese biotechs lacked access to later-stage development resources in the past, so they have always leaned toward a breadth-first strategy. This has resulted in an industry with heavy competition that will likely surpass its Western rivals within the next few years.
The AI Revival of the Three Mile Island Nuclear Plant
The increasing energy demand from US AI infrastructure buildout is pushing a reliance on older nuclear technology, like the potential restart of Three Mile Island, because advanced, safer nuclear reactor designs are still years away from contributing meaningfully to the energy supply.
Original article
The buildout of AI infrastructure in the US has transformed the country's energy needs. A new crop of companies has developed nuclear reactor designs that they claim to be cheaper, safer, and easier to build than the ones currently in operation. However, it will take many years before these technologies will meaningfully contribute to the US energy supply. This means that the country will have to rely on much older technology until the new plants come online.
Behind the Scenes Hardening Firefox with Claude Mythos Preview
Mozilla significantly improved Firefox's security by using Claude Mythos Preview and other AI models to discover and fix an unprecedented number of latent security bugs, many of which would typically require combining with other exploits for a full attack.
Decoder
- Latent security bugs: Security vulnerabilities that are present in the code but have not yet been discovered or exploited.
- Full-chain compromise: A multi-step attack where several vulnerabilities are chained together to gain complete control over a system or application.
Original article
Mozilla recently announced that it had identified and fixed an unprecedented number of latent security bugs in Firefox with the help of Claude Mythos Preview and other AI models. This post goes into detail about how the team approached this work, what it found, and advice for other projects on using emerging capabilities to harden against attacks. Many of the bugs discovered would need to be combined with other exploits to achieve a full-chain compromise.
OpenAI launches new realtime voice and translation AI models
OpenAI has launched three new real-time audio models via its API, including GPT-Realtime-2 for GPT-5-class reasoning in voice agents, GPT-Realtime-Translate for live multilingual conversations in over 70 languages, and GPT-Realtime-Whisper for streaming speech-to-text.
Original article
OpenAI is advancing its voice AI capabilities within its API platform by introducing three new real-time audio models designed for developers creating live voice agents, translation tools, and streaming transcription products. The release includes GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper, all accessible through the Realtime API.
GPT-Realtime-2 is the primary agentic voice model in this lineup. OpenAI claims it offers GPT-5-class reasoning for spoken conversations, enabling voice agents to tackle more complex requests, manage context, utilize tools, respond to corrections, and maintain a conversation without reverting to simple call-and-response behavior. The model supports parallel tool calls, short spoken preambles like “let me check that,” improved recovery behavior when a task fails, and a larger 128K context window, an increase from the previous generation's 32K.
Introducing GPT-Realtime-2 in the API: our most intelligent voice model yet, bringing GPT-5-class reasoning to voice agents.
Voice agents are now real-time collaborators that can listen, reason, and solve complex problems as conversations unfold.
Now available in the API…
Developers have more control over reasoning effort, with settings ranging from minimal to xhigh. Low is the default, while higher settings are intended for more intricate voice tasks where reasoning depth is prioritized over latency. OpenAI reports that GPT-Realtime-2 demonstrates improvements over GPT-Realtime-1.5 in audio intelligence, instruction adherence, context management, and live conversation control.
GPT-Realtime-Translate is designed for live multilingual voice products. It supports speech input in over 70 languages and output in 13 languages, enabling developers to create tools for customer support, cross-border sales, education, events, creator platforms, and media localization. The model is engineered to keep up with speakers while managing regional pronunciation, context shifts, and domain-specific terminology.
GPT-Realtime-Whisper offers streaming speech-to-text capabilities to the API. It transcribes audio as people speak, making it ideal for live captions, meeting notes, classroom tools, broadcasts, customer support workflows, healthcare documentation, recruiting, and sales calls where speech needs to be converted into structured text during the conversation, not afterward.
The target audience includes developers and businesses building voice-first products rather than general ChatGPT users. Early use cases identified by OpenAI include Zillow for real estate voice agents, Deutsche Telekom for multilingual support, Priceline for travel assistance, Vimeo for live video translation, and other companies focusing on customer service, enterprise search, healthcare, and AI assistant workflows.
Pricing for all three models is now available. GPT-Realtime-2 is priced at $32 per 1 million audio input tokens, $0.40 per 1 million cached input tokens, and $64 per 1 million audio output tokens. GPT-Realtime-Translate costs $0.034 per minute, while GPT-Realtime-Whisper costs $0.017 per minute. The models can be tested in OpenAI’s Playground and integrated into applications via the Realtime API.
AVM 2 for ChatGPT and Realtime Voice for Codex are also on the way!
The company behind this release, OpenAI, continues to expand its developer platform around multimodal AI, agents, and enterprise-ready APIs. This announcement focuses not on a new consumer app but on providing software teams with the infrastructure to integrate voice agents into products, support systems, travel apps, real estate tools, education platforms, and workplace software.
Source
Elon Musk tried to hire OpenAI founders to start AI unit inside Tesla
OpenAI claims Elon Musk attempted to hire its founding team, including Sam Altman, in 2018 to lead a Tesla AI unit, contradicting his lawsuit that accuses Altman of "stealing a charity" by commercializing OpenAI.
Original article
Elon Musk tried to hire OpenAI’s founding team, including Sam Altman, to lead a new AI lab within Tesla in 2018, as the AI start-up’s leaders grappled over who should control the company and its direction.
Musk, a co-founder of the AI group, proposed bringing Altman, Greg Brockman, and Ilya Sutskever to his carmaker, appointing Altman to the board or making OpenAI a Tesla subsidiary, according to evidence in a high-stakes trial between the billionaire and the ChatGPT maker on Wednesday.
The disclosures shed light on a crucial issue in the case, in which Musk has claimed that Altman “stole a charity” by converting the company into a for-profit. OpenAI’s lawyers have argued the Tesla chief executive was happy to commercialize the lab, provided that he remained in charge.
Emails, texts, and testimony on Wednesday showed that by late 2017 Musk had lost confidence in the non-profit OpenAI’s ability to build artificial general intelligence, a powerful form of AI—and was exploring building his own AI lab within Tesla.
“There is little chance of OpenAI being a successful force if I focus on TeslaAI,” Musk wrote in a message at the time to Shivon Zilis, who testified in court on Wednesday.
Zilis, an OpenAI adviser from 2016 and board member from 2020 until 2023, is the mother of four of Musk’s children and was an important interlocutor between the billionaire and the AI lab’s other founders during the six-month period on which much of the case hinges.
In late 2017, Zilis sketched out plans for an event to “share that Tesla is building a world-leading AI lab (?) which will rival the likes of Google / DeepMind and Facebook AI Research.”
By early 2018, she laid out nine possible scenarios for achieving AGI. The bulk of those centered on Tesla and included bringing Altman in to run AI at the carmaker. Another proposal was to poach DeepMind founder Demis Hassabis for the same role.
These were among the options explored by OpenAI’s founders as they weighed the best structure to enable the company to raise enough capital to take on Google while retaining its non-profit mission.
Ultimately, OpenAI’s executives were not persuaded by Musk’s proposals. Zilis told Musk’s then-chief of staff Sam Teller in a February 2018 email: “They all think Elon is an incredible human being but that he really hasn’t done his homework AI/AGI and that really concerns them about working with him.”
Musk left OpenAI’s board in early 2018, and OpenAI went on to restructure as a for-profit entity with a charitable arm.
The world’s richest man is suing the company in a case that could alter the fate of OpenAI, which has grown to be an $852 billion behemoth with aspirations for a public listing as early as this year.
Musk claims Altman, Brockman, and OpenAI unjustly enriched themselves by converting the start-up into a for-profit company.
William Savitt, OpenAI’s lead attorney in the case, said he believed Zilis’ testimony showed Musk was “prepared to do the for-profit, provided he would get control.”
Speaking after Wednesday’s court hearing, Savitt said Musk sought to control governance and “fold OpenAI into Tesla… when neither option was available to him he picked up his marbles and went home.”
Brockman, OpenAI’s president, on Tuesday told the jury in Oakland that Musk was seeking “unilateral control over AGI,” which he and other founders could not accept.
Zilis, a technology expert who has also worked as an executive at Tesla and Musk’s brain-implant company Neuralink, told the court on Wednesday that her “allegiance [is] to the best outcome of AI for humanity.”
She and Musk first had a romantic relationship roughly a decade ago and decided to have children via IVF in 2020. “I… really wanted to be a mum. [Musk] was encouraging everyone around him to have children… he said if that was ever interesting he’d be able to make a donation,” she said.
In 2020, two years after the pair had fought over the direction of OpenAI, Altman texted Zilis to ask advice on approaching Musk. She was encouraging, but warned him: “the only thing I wonder is if he’ll pull the ‘you should have gone with Tesla’ card on you.”
Cloudflare to Slash 1,100 Jobs Due to AI-Driven Restructuring Plan
Cloudflare announced plans to cut 1,100 jobs as part of a restructuring to adopt an "agentic AI-first operating model," anticipating $140 million to $150 million in related charges.
Decoder
- Agentic AI: An artificial intelligence system capable of autonomous action, decision-making, and goal-setting, often by breaking down complex tasks into sub-tasks and executing them sequentially.
Original article
Cloudflare plans to slash 1,100 jobs as part of a restructuring plan that it claims will define how a world-class, high-growth company operates and creates value in an agentic AI era. The company says it will become even faster and more innovative by embracing an agentic AI-first operating model. The layoffs are expected to be substantially complete by the end of the third quarter. Cloudflare expects to incur charges between $140 million and $150 million for the layoffs.
The agent principal-agent problem
AI agents are exacerbating the "principal-agent problem" in code review by enabling "slop PRs" and increasing review load, making the traditional review-then-commit process unmanageable in low-trust large company environments.
Decoder
- Principal-agent problem: An economic concept where one person (the 'agent') is able to make decisions on behalf of another person (the 'principal'), but the agent's incentives may not perfectly align with the principal's. In code review, the contributor is the agent and the reviewer is the principal.
Original article
The agent principal-agent problem
Code review is broken.
The industry-established code review process, review-then-commit, was a straightforward mechanism that allowed a relatively low-trust group of engineers to collaborate. It appears to have been initially developed for the Apache server OSS project in the 90s, corporatized by Google in the early 2000s, and popularized throughout the industry by several means, most notable of which was the GitHub PR.
It was very simple:
- A human makes a change.
- This change is packaged up, sent to another human for commentary.
- Rounds of commentary and adjustments continue until the reviewer approves (LGTMs) it.
- The change is committed.
This is not Michael Fagan's defect analysis work or the ticket-like processes used for critical systems changes in fields like aerospace. This will not catch your bugs. It will, however, communicate design changes to other engineers who maintain a mental model of the codebase, and reviewers can use the process to teach norms to contributors. It has advantages, and because there is a gate before the main branch changes, it does not require much trust. That makes it a great tool for scaling a company, because beyond ~10-12 engineers (the "two pizza" team, among other names), trust erodes rapidly. It is also great for scaling OSS. It puts work on reviewers, but there was work on the human making the change too. An imbalance existed but was often manageable.
The crisis of code review
Agents broke this. If you insert an agent into the existing process, your best possible outcome is:
- A human instructs a machine to make a change.
- The human reviews the code, iterates with comments until they approve it.
- This change is packaged up, sent to another human for commentary.
- Rounds of commentary and adjustments continue until the reviewer approves (LGTMs) it.
- The change is committed.
This doubles the amount of review. But companies were already review limited. In a really well-functioning team, a code review cycle could take a day. (Between two engineers who get on well and intimately know each other's work, you could shrink this to an hour.) But across the industry the number was, optimistically, days to get a review merged before agents.
Additionally, the whole reason engineers use agents is it improves productivity. More total changes are generated. So we doubled review, and increased the total changes. As you modify the old model, you run out of review bandwidth before you have extracted all the value you can from agents. (And anecdotally, you run out of bandwidth before you get even a fraction of the value of agents.)
But things get worse, because no-one actually augments the old processes this way.
The agent principal-agent problem
What happens in reality are processes like this:
- A human instructs a machine to make a change.
- This change is lightly QA'd, packaged up, sent to another human for commentary.
- Rounds of commentary come back from the reviewer and are sent wholesale to the machine for adjustments until the reviewer approves (LGTMs) it.
- The change is committed.
This is an example of what economists call the principal-agent problem: the reviewer is the principal, the contributor is the agent, and code review only worked because the reviewer could cheaply infer effort from reading the code. Agents collapse that signal. This is what is killing OSS, and it is commonly being referred to as "slop PRs". There is no incentive for the human driving the agent to actually read the code or spend time thinking about what the reviewer says.
The result is a radical imbalance. "Contributors" type a sentence or two, of the quality of a poor bug report, spend 5 minutes poking at the resulting program, and then generate serious review load for another engineer. You can do this with no understanding of the underlying project, its constraints, or the tools used to construct it. This is an unmanageable disaster. This does not even work in environments where the reviewer is paid to do the work, because they could be more productive by prompting the agent themselves.
Potential solutions
Small high-trust teams have an easy process they can adopt:
- A human instructs a machine to make a change.
- The human reviews the code, iterates with comments until they approve it.
- They push the change to production and deploy.
There is still a human in the loop. There is still a reviewer who did not get deeply lost in the weeds of how a problem could be solved. Most importantly, there is no principal-agent problem, because the human driving the machine takes on the responsibility for its actions by owning the deployment.
Anecdotal evidence suggests this works for small teams. With a team of nine at exe.dev we have been able to make it work. We spend a lot more time writing integration tests, e2e tests, building agent-based workflows for analyzing commits for safety or performance or usability bugs to minimize risk. This is a lot of machinery teams traditionally do not develop until they are far larger and more mature, on the other hand it is much easier to develop thanks to agents. We also have had to be very selective about our colleagues and be intentional in our communication. But we ship this way.
This is not tenable in low-trust environments, i.e. large companies. You have to trust your co-workers to start a conversation about architectural changes before they do it. No-one at BigCo trusts their colleagues to make sweeping changes to a service they "own". And no-one at BigCo wants to be on the hook for a major outage without having coverage from a code review to smear the blame around. (Low trust environments are awful places.)
I am sure there are small isolated teams at big companies that have broken with standard practices and are getting real value out of agents. I am also sure there are ICs who have work that lets them maximize the value of an agent without involving their colleagues. (E.g. if you work in quality, agents can help you write and execute endless large-scale experiments you never need get reviewed, just send out what works.) But the vast majority of big company engineers cannot make changes, especially cross-functional changes that agents do so well, without review eating all the productivity gains.
Some hints in the history books
As of writing this, I have not seen anyone describe a process that "scales" agent-driven development in a large company. There is, however, evidence from the past that it is possible. I would point to Microsoft in the 1990s, which did not have mandated review-before-commit practices. Some teams may have, but the company, while large, was organized as many independent teams constantly synchronized by QA processes. This is regarded as "old-fashioned" "cowboy" style development by proponents of the large-team processes that came before agents. But it did work. It created some of Microsoft's most long-lived successful products, like the win32 API. (And yes we could critique a 30 year old API endlessly, but it is still there and significantly better than some of its "replacements" that were built with code review processes.) Little appears to be written about this period of Microsoft history, if you were there I would love to hear or read about your experiences.
Until someone develops robust processes for agent use in low-trust environments, small teams have a large force multiplier available to them that big teams do not. Ship while you can.
Index
github.com/crawshaw
twitter.com/davidcrawshaw
david@zentus.com
Markets in everything?
The author, a proponent of markets, expresses concern over the "ever-increasing overt marketization of society" where everything, including personal sentiment, is being assigned a price, potentially leading to widespread dissatisfaction.
Decoder
- Harberger tax: A tax system where an asset owner sets its value and pays a recurring tax on that value, with the catch that anyone can purchase the asset from the owner at that self-declared price. It aims to increase efficiency and reduce deadweight loss by incentivizing owners to declare a fair market price.
Original article
Properly implemented and regulated, markets are the best fundamental arrangement of society for maximizing human flourishing.
AI load breaks GitHub – why not other vendors?
GitHub's recent spate of data integrity incidents and outages, including a 85% uptime, is attributed by CTO Vlad Fedorov to an unexpected surge in AI agent-fueled load, which GitHub, unlike competitors, seemingly failed to adequately anticipate.
Deep dive
- GitHub experienced a critical data integrity incident on April 23rd where squash-merged PRs lost commits, affecting 2,092 pull requests.
- The platform has seen widespread outages, including missing PRs and issues due to an Elasticsearch overload, leading to an estimated 85.51% uptime over the last 90 days.
- Mitchell Hashimoto, HashiCorp founder, publicly stated he is moving off GitHub due to its unreliability, calling it "unfit for professional work."
- GitHub CTO Vlad Fedorov attributed the outages to a ~3.5x increase in load over two years, largely driven by AI agents, which compounded issues with GitHub's 18 years of tech debt and an ongoing migration to Azure.
- GitHub initially planned for a 10x capacity increase by October 2025 but has since adjusted this to 30x due to the unexpected load.
- Competitors like GitLab and Bitbucket, and other infrastructure providers like Vercel and Linear, do not report similar widespread reliability issues despite experiencing AI-driven growth.
- The article suggests GitHub's engineering organization did not anticipate the scale of AI load as effectively as some other major tech companies, such as Google, which prepared for a 10x increase in code generation from AI tools.
Decoder
- Squash merge: A Git operation that combines all commits from a feature branch into a single new commit on the main branch, simplifying commit history.
- Elasticsearch: A distributed, RESTful search and analytics engine often used for full-text search, structured search, analytics, and complex aggregations on large datasets.
- Forgejo: An open-source, self-hostable Git service and code forge, often seen as an alternative to platforms like GitHub or GitLab.
Original article
The fact that Microsoft's competitors seem to be keeping up with increased load due to AI suggests that the company has not been responding to its growth like a world-class engineering organization.
Tokenmaxxing, Promomaxxing, and Misaligned Incentives in Tech
The pursuit of "tokenmaxxing" and "promomaxxing" in tech, driven by metrics that become targets, can lead to perverse incentives where engineers generate high output without corresponding positive outcomes, as exemplified by Meta engineers burning millions of AI tokens for no productivity.
Decoder
- Tokenmaxxing: A term referring to the practice of maximizing the number of AI tokens consumed, often in the belief that higher token usage correlates with higher productivity when using AI tools.
- Goodhart's Law: An adage stating: "When a measure becomes a target, it ceases to be a good measure," meaning that once a statistical measure is used for policy or decision-making, it tends to be distorted or manipulated.
- Promomaxxing: A colloquial term referring to the behavior where employees prioritize activities that maximize their chances of promotion, even if those activities do not align with the best interests of the company or lead to unnecessary complexity.
- Cobra Effect: A term describing a perverse incentive, where an attempted solution to a problem unintentionally makes the problem worse, named after an anecdote about a British bounty on cobras in colonial India.
Original article
Tokenmaxxing, Promomaxxing, and Misaligned Incentives in Tech
When a measure becomes a target, it ceases to be a good measure
Engineer’s Codex is a publication about real-world software engineering.
Coinbase did layoffs recently. They’re cutting a large percent of their employees, and along with the announcement, they mentioned wanting their employees to use AI more. Basically, they want people to tokenmaxx.
If you don’t know what tokenmaxxing is, its the idea of maximizing the amount of tokens you use when working with AI. Basically, use AI a lot! And it’s generally seen as a good thing. The more tokens you use should, in theory, mean the more productive you are.
This isn’t always true.
When a measure becomes a target, it ceases to be a good measure
Meta actually created an internal leaderboard that counted the amount of tokens people were consuming. You would think that the people consuming the most tokens are generally the most productive.
But here’s the problem: anything that is measured can and will be gamed (Goodhart’s Law: “When a measure becomes a target, it ceases to be a good measure.”), especially by smart, pragmatic people, like Meta engineers. Smart engineers optimize for the best personal outcomes, which usually means a promotion, more money, more scope. If tokenmaxxing is the path to get there, they will do that (more visibility, higher number is good).
So people started setting up scripts to burn millions of tokens for literally zero productivity. Just burning tokens to do nothing. Meta eventually shut the leaderboard down because it created the wrong incentive.
Promomaxxing
Tokenmaxxing actually reminds me of a famous complaint about Google: specifically how hard it is to get promoted there and how that led to a lot of misaligned incentives. You could call it promomaxxing.
Googlers would make things more complex than needed, write way more docs than needed, and make those docs much longer than needed, all to manufacture the appearance of hard, complex work. Because at Google, if your project wasn’t technically complex enough or there weren’t enough of them, you weren’t getting promoted.
In theory, this makes sense. People who get promoted should be doing harder and harder things.
However, good software engineering should lean toward simplicity. But promotion rewards complexity. And as frameworks and developer infrastructure keep making engineering simpler (which is genuinely good), engineers run out of adequate complexity to justify their promotions. So you get these irrational decisions for the business that are completely rational decisions for the individual. That’s a textbook misaligned incentive.
The Cobra Effect of Perverse Incentives
The most famous example I can think of in history of perverse incentives is the Cobra Effect in India.
During British rule in India, the government was concerned about the number of venomous cobras. They offered a bounty for every dead cobra brought to them.
Enterprising citizens began breeding cobras specifically to kill them and collect the reward.
When the government realized this and scrapped the bounty, the breeders released their now-worthless snakes, leaving the cobra population higher than when the program started.
The Input → Output → Outcome Discrepancy
Tokenmaxxing has the same problem. Good intentions, wrong incentive structure.
Tokenmaxxing is built on the idea of:
input → output → outcome.
More input should produce more output, which should drive better outcomes. This framing actually comes from this article by Arnav Gupta on Twitter, and I think he expressed it well.
The issue is that input to output always has some loss. You put in 100% input and you might get anywhere from 50% to 150% output, because sometimes that input is thinking time, debugging, exploration. It’s not a clean conversion.
Outcome is even further removed. Output doesn’t necessarily even correlate to outcome. In fact, output can even lead to negative outcomes.
Just because you shipped a feature doesn’t mean you moved a metric positively. If your goal is to increase retention and you built a notifications feature, the outcome you want is higher retention. But more notifications doesn’t guarantee that. If users already have notifications and you add more, you might actually annoy them and hurt retention. There is no guarantee that more output even produces a positive outcome.
There are even worse examples. Say you, as a Google engineer, spend a ton of input to ship a system and that system has bugs that take down Google’s ad platform for two hours. Google just lost $5 million. Your output was a net negative. Was tokenmaxxing worth it in this case? Tokenmaxxing does also result in lower quality output, on average. Can we really guarantee better outcomes from more rapid input?
This is where input quality does matter. And generally, tokenmaxxing degrades input quality.
Slow coding was a feature, not a bug. It required clearer thinking and higher-quality inputs.
.When a CEO or PM had 10 ideas and the team could only do 2, you were forced to debate. You had to fight for your idea, kill the weak ones early, and actually pressure test what was worth building. The constraint created a filter. Now that code is basically free and fast, that filter is gone. Yes, an MVP is valuable data.
Another note is that tokenmaxxing does not just 5x your output. It can also 5x’d your noise. More features, more bugs, more teams building overlapping things in different ways, more meetings to align on stuff that should have been killed in a Slack thread. The alignment tax went up at the exact same time the coding cost went down.
These are really recent, really clear examples of misaligned incentives. And misaligned incentives are hard problems because they’re people problems. You’re trying to optimize for multiple things at once, and it’s leadership’s job to lay the dominoes in the right way.
Some startups like Anthropic have naturally aligned incentives. If an engineer tokenmaxxes using Claude, Anthropic just gets paid more. They don’t really care about your output or your outcome. They care about their own outcome, which just happens to be directly correlated to your input. To you, they sell the potential outcome and you purchase the guaranteed tokens.
SWE Quiz (Featured)
SWE Quiz is a structured crash course of everything you need to know for system design and modern AI engineering interviews. It contains thousands of questions that have been asked in interviews at DeepMind, OpenAI, Anthropic, and more.
Where Tokenmaxxing Excels
Now, I’m not against using AI at all. I use AI extensively. In fact, I would say probably 95% of the code generated by me is AI generated. So this is actually not an argument against using AI at all. It’s actually more of an argument towards misaligned incentives and understanding where token maxing as a behavior may have good intentions, but a wrong implementation. At the end of the day, we just do want better outcomes. And understanding how to token max in the right ways is important for better outcomes.
A great example from a friend at work that I heard recently was they had to run a bunch of experiments on reducing some latency for their system. Previously they would have to guess at these experiments and pick the top three. But with AI, he was actually able to add flags and set up experiments to test all seven of his ideas. And it turns out probably one of the experiments that he would not have tried earlier actually ended up being one of the better latency reduction wins.
Tokenmaxxing is great for cases like these, for rapid exploration and throwaway work in favor of an outcome, where the goal is knowledge. In this case, the outcome is guaranteed.
There is a cost to that outcome, which previously was human time, but now you have less human time needed, but token costs added on top. So the calculus has changed, and will continue changing over time.
My Takeaways
All this is to say that incentive alignment is really hard. It’s a constant struggle at all levels of leadership, but I have found that the best leaders and companies I’ve worked at excelled at aligning incentives as much as possible.
For example, promotions across the board in tech are focused on outcomes and seem to be more focused on outcomes nowadays. This may be due to the turmoil of tech too - outcomes are harder to achieve, and thus allows companies to promote less.
Tech is full of missionaries and mercenaries. Generally, there are more mercenaries than missionaries. Mercenaries will always optimize for the path to promotion and money. If that path is misaligned with the company’s health, the fault lies with the incentive structure, not the employees.
It’s worth framing things in your work with incentive alignment. I’ve found it an useful exercise for lining up things for promotions, getting cross-functional collaboration done, and, in general, getting “champions” across orgs for both me and my work.
Matt Mullenweg Assembles Trusted Group to Overhaul WordPress.org and Five for the Future
Matt Mullenweg, co-founder of WordPress, has granted a select group of trusted contributors direct authority to redesign WordPress.org and the "Five for the Future" program, bypassing traditional team and committee approvals.
Decoder
- WordPress.org: The official home of the open-source WordPress project, distinct from the commercial WordPress.com hosting service. It hosts the core software, documentation, forums, themes, and plugins.* Five for the Future: A WordPress initiative encouraging companies that benefit from WordPress to dedicate 5% of their resources (time or money) to contributing back to the project.
Original article
Matt Mullenweg has given a small group of trusted contributors the authority to overhaul WordPress.org without approval from any team, committee, or stakeholder other than himself.
Introducing HCP Terraform powered by Infragraph - now in public preview
HashiCorp has made HCP Terraform powered by Infragraph available in public preview, introducing an event-driven knowledge graph that unifies infrastructure data across hybrid and multi-cloud environments, paving the way for AI-driven automation.
Decoder
- HCP Terraform: HashiCorp Cloud Platform Terraform, a managed service for Terraform workflows.
- Infragraph: An event-driven knowledge graph that collects and unifies infrastructure data.
- Hybrid cloud: A computing environment that combines on-premises data centers with public cloud resources.
- Multi-cloud: The use of multiple cloud computing services from different providers in a single architecture.
Original article
HCP Terraform, powered by Infragraph, introduces a centralized, event-driven knowledge graph that unifies infrastructure data across hybrid and multi-cloud environments, enabling real-time visibility, improved security, cost control, and a foundation for AI-driven automation, now available in public preview for qualified US customers.
Introducing the Datadog Code Security MCP
Datadog launched Code Security MCP, a new service that scans AI-generated code in real time for vulnerabilities, secrets, and risky dependencies directly within a developer's local workflow.
Decoder
- SAST (Static Application Security Testing): Analyzes source code or compiled application code for security vulnerabilities without executing the code.
- SCA (Software Composition Analysis): Identifies and inventories open-source components in an application to detect known vulnerabilities.
- IaC (Infrastructure as Code) scanning: Analyzes configuration files for infrastructure (e.g., Terraform, CloudFormation) to identify security misconfigurations or policy violations.
- Model Context Protocol (MCP): A protocol used by AI agents and coding assistants to securely access external tools and information.
Original article
Datadog Code Security MCP scans AI-generated code in real time to detect vulnerabilities, secrets, and risky dependencies while consolidating multiple security checks into a single local workflow, enabling early issue detection and consistent security across development.
The AWS MCP Server is now generally available
AWS has made its MCP Server generally available, offering AI coding agents secure and authenticated access to over 15,000 AWS API operations and current documentation, solving the problem of AI agents relying on outdated training data and generating non-production-ready infrastructure.
Decoder
- Model Context Protocol (MCP): A protocol that allows AI agents to securely interact with external tools and services, providing real-time information beyond their training data.
- IAM (Identity and Access Management): An AWS service that helps securely manage access to AWS resources.
- IAM context keys: Attributes in IAM policies that allow fine-grained access control based on specific conditions during an API call.
- Agent Toolkit for AWS: A suite of tools from AWS, including the MCP Server, skills, and plugins, designed to help coding agents build effectively on AWS.
- Skills (for AWS MCP Server): Curated guidance and best practices maintained by AWS service teams to direct AI agents through common tasks and reduce errors.
Original article
The AWS MCP Server is now generally available
I have been building with AI agents and MCP tools for a while now, and one question kept coming up: how do you give an agent real, authenticated access to AWS without handing it the keys to the kingdom? Today, there is an answer.
I’m happy to announce the general availability of the AWS MCP Server, a managed remote Model Context Protocol (MCP) server that gives AI agents and coding assistants secure, authenticated access to all AWS services through a small, fixed set of tools.
The AWS MCP Server is part of the Agent Toolkit for AWS, a suite of tooling that includes the MCP Server, skills, and plugins that help coding agents build more effectively and efficiently on AWS.
AI coding agents are already useful for many tasks, but they run into real trouble when working with AWS at any meaningful depth. Without access to current AWS documentation, agents rely on training data that may be months out of date and may not know about services like Amazon S3 Vectors, Amazon Aurora DSQL, or Amazon Bedrock AgentCore. When asked to build infrastructure, they tend to reach for the AWS Command Line Interface (AWS CLI) rather than AWS Cloud Development Kit (AWS CDK) or AWS CloudFormation, and they produce AWS Identity and Access Management (IAM) policies that are far broader than necessary. The result is infrastructure that works in a demo but is not production-ready.
The AWS MCP Server addresses this through a compact set of tools that do not consume your model’s context window. The call_aws tool executes any of the 15,000+ AWS API operations using your existing IAM credentials. When we will launch new APIs, they will be supported within days. The search_documentation and read_documentation tools retrieve current AWS documentation and best practices at query time, so the agent always works from up-to-date information.
With general availability, we are introducing several new capabilities. The AWS MCP Server now supports IAM context keys, so you no longer need a separate IAM permission to use the server and can express fine-grained access in a standard IAM policy. Documentation retrieval no longer requires authentication. We have also reduced the number of tokens required per interaction, which matters for complex, multi-step workflows.
Also new, the run_script tool lets the agent write a short Python script that runs server-side in a sandboxed environment. The sandbox inherits your IAM permissions but has no network access, so you can give an agent the ability to process data without giving it access to your local file system or a shell. When an agent needs to call multiple APIs and combine the results, making them one at a time is slow and burns context. With run_script, the agent chains API calls, filters responses, and computes results in a single round-trip, which is both faster and more context-efficient.
The most significant addition is the transition from Agent SOPs to Skills. Skills provide curated guidance and best practices for the tasks where agents most commonly make mistakes. This helps agents complete work faster, using validated best practices, with fewer errors and fewer tokens — all of which saves you time and money. Skills are contributed and maintained by AWS service teams. This keeps the tool list short and predictable, which reduces hallucination and keeps the agent focused.
For enterprise customers, the AWS MCP Server provides a clear separation between human and agent permissions. You can use IAM policies or Service Control Policies to specify that a given user can perform mutating operations while the MCP server is restricted to read-only actions. Amazon CloudWatch metrics published under the AWS-MCP namespace let you observe MCP server calls separately from direct human calls, giving you the audit trail that compliance teams require. Amazon CloudTrail captures all API calls for a complete record.
Let’s see it in action For this demo, I chose to use Claude Code, but I can use the AWS MCP Server with any AI agent that supports MCP, which is basically all tools available today: Kiro CLI, Kiro, Cursor, Codex, and more. I configure Claude Code to use the Anthropic Opus 4.6 model.
Opus 4.6 has a knowledge cutoff date in May 2025. It means it doesn’t know anything that happened after May last year. I ask a question about an AWS service that was introduced recently: Amazon S3 Vectors, launched in preview in July 2025 and that went GA in December 2025.
The question is “how to store embedding on S3″. (embedding is a kind of vector)
It gives me five solutions, all correct, but none using S3 Vectors as I asked. Note that this answer comes from the Opus 4.6 model, not from Claude Code. Any AI tool using the same model will return similar answers because S3 Vectors wasn’t announced at the time the model was trained.

Let’s now try with the AWS MCP Server.
The AWS MCP Server uses AWS Identity and Access Management (IAM) and IAM SigV4 authentication. To use my local AWS credentials configuration over MCP, which only supports OAuth 2.1, I configure my AI coding agent to call the AWS MCP Server through a proxy. The MCP Proxy for AWS is an open source proxy that runs on my machine and bridges the world of IAM authentication to OAuth.
I add the MCP configuration with this command:
claude mcp add-json aws-mcp --scope user \
'{"command":"uvx","args":["mcp-proxy-for-aws@latest","https://aws-mcp.us-east-1.api.aws/mcp","--metadata","AWS_REGION=us-west-2"]}'
You’ll have to have uv installed before you can use the AWS MCP server. On Linux or Mac, you can run: curl -LsSf https://astral.sh/uv/install.sh | sh
Let’s analyze the JSON configuration:
- I use the user scope to make the server available to all my projects on my laptop.
uvx mcp-proxy-for-awsis the command to launch the proxy; the rest of the arguments are parameters passed to the proxy.https://aws-mcp.us-east-1.api.aws/mcpis one of the two regional endpoints for the AWS MCP Server. The proxy will forward Claude Code’s requests to that endpoint.--metadataare passed to the proxy target. Here, it tells the AWS MCP Server to use the US West (Oregon) Region.
I start Claude Code and I type /mcp to verify the AWS MCP Server is correctly installed and can use my credentials.

I ask the same question: “how can I store embedding on S3”.
This time, Claude Code knows it has a tool it can use to answer the question. It asks me permission to invoke the aws___search_documentation tool. After a few seconds, I receive a correct answer: “AWS now has a dedicated service for this: Amazon S3 Vectors …”

Pricing and availability The AWS MCP Server is available today in the US East (N. Virginia) and Europe (Frankfurt) AWS Regions and can make API calls to any Region. There is no additional charge for the AWS MCP server itself. You pay only for the AWS resources you create and any applicable data transfer costs.
The AWS MCP Server works with Claude Code, Kiro, Cursor, and any MCP-compatible client. To get started, see the AWS MCP Server User Guide.
I have been waiting for something like this since I started using MCP tools in my AI agents early last year. The combination of current documentation, authenticated API access, and sandboxed script execution in a single server changes what an agent can actually do on AWS. I am curious what you build with it. Let me know in the comments.
— sebUpdated on May 6th – Added uv installation script.
How we built a real-world evaluation platform for autonomous SRE agents at scale
Datadog engineered a replayable evaluation platform for its Bits AI SRE agent, utilizing production-derived labels and noisy simulated environments to continuously measure and enhance the agent's performance in investigating complex production incidents.
Decoder
- SRE (Site Reliability Engineering): A discipline that applies software engineering principles to infrastructure and operations problems.
- Bits AI SRE: Datadog's autonomous agent for investigating production incidents.
- World-snapshot: A captured state of signals (telemetry queries, logs, metrics) available at the time a production issue occurred, used for replaying incident investigations.
- Agentic validation: Using an AI agent itself to assist in validating and refining evaluation labels or data, reducing manual effort.
- 5 Whys analysis: An iterative interrogative technique used to explore the cause-and-effect relationships underlying a particular problem; the primary goal is to determine the root cause of a defect or problem.
Original article
We shipped a feature that made perfect sense. It improved a specific type of investigation we had been testing against. Then other investigations started getting worse.
Nothing crashed. No tests failed. But the overall quality of the agent had shifted, and we had no reliable way to detect it.
Bits AI SRE is Datadog's autonomous agent for investigating production incidents. It reasons across metrics, logs, traces, infrastructure metadata, network telemetry, monitor configuration, and more to determine, triage, and remediate the root cause of an issue.
As we built Bits, we expected behavior to improve incrementally with each feature we added. Instead, we saw something more subtle. Improvements in one area could quietly introduce regressions in another. The problem wasn't just the model. We had no way to replay real production context, measure behavior consistently across diverse incidents, or track whether the agent was actually improving over time.
We needed infrastructure that could turn production issues into reproducible investigation environments. So we built a replayable evaluation platform from scratch.
In this post, we'll walk through how the Bits AI SRE team built that platform and what it took to make agent behavior observable, measurable, and repeatable.
When one improvement caused subtle regressions
Early in development, before exposing the system to customers, we added a feature that extracted the service name from the monitor under investigation into Bits AI SRE's initial context. On the surface, this made sense, and in a handful of internal test cases, it worked as expected.
What we could not see was the broader impact. Without a representative evaluation set, we had no way to measure how that change behaved across different environments. The feature pulled in a large amount of irrelevant signals, which degraded investigation quality in unrelated scenarios, often by subtly confusing the reasoning of the agent. This change introduced regressions that didn't become apparent until we began seeing widespread investigation misses internally.
This wasn't an isolated case. Features that improved Bits in one area could quietly degrade performance in another, and the relationships often weren't obvious. We had no standardized way to catch these regressions, no way to track quality across changes, and no confidence that the next feature wouldn't cause the same problem. We needed a way to catch these regressions before users reported them.
Why tool-level testing and live replay weren't enough
Beyond standard test suites, we first tried testing individual tools in isolation. This approach seemed reasonable. If each tool behaved correctly, the agent should behave correctly.
In practice, that assumption broke down. Bits' value comes from how it chains tools together and reasons across their outputs. Failures often emerged from interactions between steps, not from a single tool call. For example, the agent might retrieve valid signals from multiple tools but combine them incorrectly, leading it to attribute an issue to the wrong component.
We also experimented with rerunning live Bits investigations as a form of online evaluation. That did not scale. Results were not aggregated, environments changed underneath us, and investigations could not be replayed once the underlying signals expired.
We needed an offline system that could replay realistic scenarios across Datadog's signals and measure the agent's behavior in a controlled, repeatable way. Off-the-shelf eval frameworks assume clean inputs and static test sets, which breaks down when your agent reasons across live production telemetry.
We ended up building two components that work in tandem: a curated label set that defines representative investigations and an orchestration platform that executes and scores the agent against them.
Anatomy of a label
Each evaluation label represents a single investigation scenario Bits would encounter in production. The label has two parts. The first is the ground truth, which defines the root cause of the issue. The second is the world-snapshot, which captures the signals that were available at the time the issue occurred. For example, a label might define the root cause as a Kubernetes pod being OOM killed, with a world snapshot that preserves the telemetry queries the agent would need—such as where to find memory metrics, container logs, and deployment events—rather than raw data.
The agent never sees the root cause directly. It only has access to the signals that existed when the issue occurred. Our evaluation needs to reflect that constraint. Each label has to preserve the same signals the agent would have seen in production.
At the same time, the set of labels must be broad enough to reflect reality. From Kubernetes pod failures to Kafka lag, and from simple bad-code deployments to complex multi-service business logic side effects, the real world of SRE spans many technologies, failure modes, and levels of complexity. Our label set has to reflect that diversity. A narrow or overly clean dataset would inflate performance and hide weaknesses.
Orchestrating evaluations at scale
The evaluation platform is the system that runs Bits against our label set, scores the results, and tracks performance over time.
We needed to know whether an improvement for Kafka lag investigations had accidentally broken our Kubernetes investigations. Answering that meant running both at once, across different model and configuration variants, and comparing results across runs.
From there, the requirements became clear. We needed to segment the label set by relevant dimensions, run investigations at scale, track results over time, and make it easy to compare performance across versions.
At a high level, the system consists of a shared label set, an orchestration layer that runs investigations against those labels, and reporting infrastructure that tracks performance over time.
With this architecture in mind, we'll walk through how each piece came together, starting with the labels.
Starting with manual labels
Given the range of scenarios Bits handles, a small set of hand-crafted labels wasn't enough. We needed broad, representative coverage from the start. So we began with a manual internal labeling campaign, generating labels from Datadog's own alerts across a wide range of scenarios.
This got us started, but we were burning engineering hours faster than creating labels, and our label set was still nowhere near representative of the real world.
Embedding label creation into Bits AI SRE
To scale label creation, we turned to the one system that already understood every investigation: Bits itself. When customers provide feedback on a Bits AI investigation, we use that signal, along with the information from the investigation itself, to construct a ground truth root cause analysis and the queries that make up the world snapshot. Every user interaction becomes a potential evaluation label.
This turned label collection from a manual effort into a pipeline that grows with product usage. As adoption increases, so do the volume and diversity of our labels. Embedding label creation in the product increased our label creation rate by an order of magnitude.
From manual review to agentic validation
Before reconstructing the signals of a label's world snapshot, we required human review to ensure quality. Early on, this process was heavily manual, especially when customer feedback was ambiguous or the fidelity of a generated label was unclear.
As our label ingestion rate grew, manual review could not keep up. We were at risk of losing valuable feedback signals simply because we couldn't process them fast enough.
To address this, we used Bits itself to assist before human review. Grounded in customer feedback and investigation telemetry, Bits aggregates related signals, derives relevant relationships, and resolves ambiguous references in feedback. For example, it can turn "it was slow" into a more precise statement about the elevated latency in a specific service. Since Bits now knows the true root cause, it can build a full causal chain that starts with the problem statement (such as a monitor firing or a user initiating an investigation) and ends with the underlying root cause.
Just like diagnosing the root cause of an issue, this derivation of the root cause analysis was a high-precision, low-margin-of-error operation; however, we were confident our agent's quality had reached the level where this was possible. We also produced several alignment studies with human judges to ensure we were producing high-quality and causally accurate root causes.
The result is a proposed ground truth and signal set that holds up under review and supports a complete root cause analysis.
As this agentic flow improved, human involvement shifted up a level. Instead of manually assembling root cause analyses from raw signals, reviewers now validate and refine Bits' outputs.
The results were dramatic: Validation time per label dropped by more than 95% in a single week.
As confidence in the validation pipeline has grown, we have reduced the amount of human intervention required, without sacrificing label quality.
To ensure label quality, each generated label is assigned confidence scores, and anything below a defined threshold is flagged for human review. These scores evaluate the generated RCAs across several dimensions, including thoroughness, specificity, and accuracy.
We observed roughly a 30% increase in the quality of root causes in the generated labels—root causes that would hold up under a “5 Whys” analysis in a postmortem. These higher-quality labels also enabled more robust evaluation.
Instead of scoring only the final conclusion, we could evaluate the agent's trajectory. We looked at how close it got to the correct answer, whether it investigated deeply enough, and whether it was able to surface valuable telemetry. This allowed us to understand not just whether Bits got the correct answer, but how helpful its investigation was.
Bring the noise
The most counterintuitive thing we learned was that our simulated worlds need to be messy.
With a well-constructed label in hand, we have the ground truth and the signals that surrounded the issue. But telemetry has a limited time to live (TTL). To evaluate the agent later, we reconstruct the investigation context, capturing the structure and relationships across signals, abstracted from the underlying telemetry data, as a snapshot of the world at the moment of issue.
In effect, we build a simulated environment that mirrors the original investigation context, then run Bits inside it. Each environment is fully isolated at the data layer so that investigation context from one label can not affect another. This allows Bits to face the same constraints it would encounter in production, scoped to a single environment.
One key discovery was that these simulated worlds need to be noisy. Snapshotting only the signals directly tied to the root cause is not enough. In production, Bits operates in environments full of unrelated services, background errors, and tangential signals.
To reflect that reality, we capture more than the minimal signal needed to explain the issue. We expand the snapshot by discovering related components based on the root cause chain, even if those components are not directly involved in the failure itself. A component might be included because it belongs to the same platform, team, or monitor, or even just similarly named.
This approach provides a cost-effective mechanism of injecting real-world noise into the evaluation process, mirroring the way an SRE must sift through red herrings during an investigation.
Without that noise, evaluation results looked better than they should have. We were essentially giving the agent an open-book exam with only the relevant pages. No wonder it aced it. The agent appeared more accurate in these simplified environments than it did in real investigations.
Snapshotting telemetry is a one-way door. Once telemetry expires, its structure and signals cannot be reconstructed. When we realized our early labels were too narrow, we had to discard many of them and regenerate those labels with a broader signal reconstruction scope. In the short term, the numbers looked terrible. This reduced our pass rate by roughly 11% and decreased our label count by 35%. But in the long term, it made our evaluations predictive of production behavior.
The evaluation system evolved across three major components: label collection, label validation, and signal reconstruction. Early versions relied heavily on manual workflows, but as the platform matured, each of these stages became increasingly automated and integrated into the product. The following diagram summarizes this progression, from the initial manual system to the industrialized pipeline we run today.
Segmenting, scoring, and catching regressions
With labels collected, processed, and signals reconstructed, we needed a system to run evaluations and make the results actionable.
The platform lets the team segment the label set across multiple dimensions, including technology, problem type, monitor type, and investigation difficulty.
This segmentation lets us scale development across the team. Engineers can focus on the parts of the agent they are improving and evaluate changes against scenarios that matter most, without interfering with other workstreams.
On the reporting side, we store scores for every scenario across every run. We track these results in Datadog dashboards and Datadog LLM Observability so we can compare performance across agent versions. We also maintain an internal labeling application, allowing for centralized observability and metadata management of our labels.
Historical visibility is useful for spotting shifts in behavior. A previously failing scenario starting to pass is informative, as is a previously passing scenario starting to fail.
These kinds of historical score tracking, along with linking to agent metadata, help us understand agent success evolution over time, areas where the agent is strong or weak, and label attributes such as consistently passing, consistently failing, or metrics like pass@k (for a scenario, given k independent attempts, does the agent succeed on at least one of them).
In addition to more targeted runs, we run the full evaluation set weekly to catch regressions that may have slipped through. For example, recently we started internally dogfooding a new tool reasoning strategy. Results looked great on a small subset of evaluation cases; however, upon running the full set, we reported the regression immediately. Results from these runs flow into dashboards and Slack notifications, and we alert on significant deviations in overall performance.
What we'd do again (and what we'd do sooner)
Building this platform changed how we think about agent development. A few lessons stand out.
Invest in label collection and processing early
Manual collection doesn't scale. People scale linearly, but evaluation needs grow faster as the agent expands into new domains.
Using Bits itself to perform quality checks and fill gaps in labels—rather than requiring high-toil human review—removed the biggest blocker to scaling that system.
This shift required careful scoring and alignment work, but it paid off quickly. Label creation rates increased dramatically, and it pushed us to build better reporting around label quality so we could monitor the health of the label set over time.
Build the platform to be extensible from the start
Bits evolved faster than we expected, and so did the models powering it. If adding a new label type, integrating with new data sources, or modifying the underlying models requires significant rework, the evaluation system becomes a bottleneck.
For example, only weeks after releasing the Bits AI SRE Agent, we were able to develop a new agent architecture and capability set for a v2 release. That development speed was only possible because the evaluation platform was designed to evolve alongside the agent.
Use evaluation data to steer product direction
Segmenting results by domain shows where the agent performs well and where it struggles. When we identify a weak area, we expand the label set in that domain. We actively seek out the hardest scenarios, mining negative feedback and exploring frontier areas where the agent is least proven. The labels that matter most aren't the ones Bits passes. They're the ones it fails.
In some cases, we even create labels for capabilities the agent does not yet support. This lets us build evaluation suites alongside new features instead of retrofitting them later.
From single investigations to organizational learning
The feedback loop we built for Bits is now extending beyond a single agent.
We have extended this evaluation platform across other agents at Datadog, turning label collection from human signals into fuel for additional products. Additionally, in following our example, agents across Datadog are starting to personalize their reasoning loops based on evaluation information provided by users, allowing for high agentic precision and reliability across the organization.
In the process of expanding this platform, we've also widened the top of the evaluation funnel even further. Our agentic label collection now extends into the everyday workflows of software engineers at Datadog. Internal incidents, issues, and alerts can be transformed into coherent evaluation labels. This has allowed us to bootstrap other Datadog teams, such as APM and Database Monitoring, as they build and refine their own agentic features. Any team building an agent now has access to a large, representative label set and evaluation infrastructure from day one.
The evaluation platform also changes how we respond to new models. New models don't just offer incremental improvements. They can unlock new workflows and capabilities. When a new model becomes available, we run it against the full label set to measure its impact across domains and understand what it improves and what it breaks. Instead of discovering those shifts in production, we evaluate them upfront. When Claude Opus 4.5 became available, we ran it against our full label set within days and identified which investigation types improved, and more importantly, which ones regressed. That kind of rapid, systematic evaluation of a new model would not have been possible a year earlier.
Building a reliable AI agent is as much about evaluation infrastructure as it is about the agent itself. When we started, we had no standardized way to track quality, catch regressions, or understand how features generalized across real-world scenarios. By building an evaluation platform fueled by diverse, representative labels collected directly from the product, we created a feedback loop that scales with usage and keeps Bits improving.
Along the way, we learned that noise matters, that manual processes don't scale, and that the evaluation platform has to keep pace with the agent it supports. Every week, we run Bits against tens of thousands of scenarios drawn from real incidents. Every week, something surprises us. That's the point.
We didn't set out to build an evaluation platform. We set out to build an agent that could investigate production incidents. The evaluation platform is what it took to trust it.
If you're excited about building infrastructure that evaluates autonomous agents across complex, multi-signal production systems, we're hiring.
How to build CI/CD observability at scale
GitLab leverages Prometheus, Grafana, and custom pipeline exporters to achieve scalable CI/CD observability, which helps optimize pipeline performance, job efficiency, and infrastructure capacity planning for its enterprise self-managed environments.
Decoder
- CI/CD (Continuous Integration/Continuous Delivery): A methodology for frequent, automated code changes, building, testing, and deployment.
- Observability: The ability to understand the internal states of a system by examining its external outputs (logs, metrics, traces).
- Prometheus: An open-source monitoring system with a time series database.
- Grafana: An open-source platform for monitoring and observability, often used to visualize data from Prometheus.
Original article
CI/CD optimization for GitLab relies on observability using Prometheus, Grafana, and pipeline exporters to measure pipeline performance, job efficiency, and infrastructure bottlenecks, enabling scalable visibility, deployment optimization, and capacity planning for enterprise self-managed environments.
How Cloudflare responded to the “Copy Fail” Linux vulnerability
Cloudflare rapidly mitigated the "Copy Fail" Linux kernel vulnerability (CVE-2026-31431) disclosed on April 29, 2026, by deploying a custom eBPF-based solution across its 330-city infrastructure within hours, preventing any customer impact even before a patched kernel could be fully rolled out.
Deep dive
- The "Copy Fail" vulnerability (CVE-2026-31431) was a local privilege escalation in the Linux kernel's
algif_aeadmodule, allowing an unprivileged user to perform a 4-byte out-of-bounds write to arbitrary readable files like/usr/bin/suviasplice()andrecvmsg(). - Cloudflare's existing behavioral detection system flagged internal exploit validation attempts within minutes of the vulnerability's disclosure, without needing signature updates.
- Due to the time required for a full kernel patch rollout across 330 datacenters, Cloudflare deployed a custom eBPF-based Linux Security Module (bpf-lsm) program as an immediate, no-reboot mitigation.
- The bpf-lsm program specifically denied the
socket_bindLSM hook for theAF_ALGsocket family for any binary not on a pre-approved allow-list, effectively blocking the exploit's entry point while permitting legitimate kernel crypto API users. - Cloudflare used
prometheus-ebpf-exporterto verify legitimateAF_ALGusage across its fleet, confirming only one internal service relied on it, minimizing the risk of accidental outages from the bpf-lsm deployment. - The company aims to improve kernel-API dependency visibility, enhance bpf-lsm deployment and logging, and reduce the Linux kernel attack surface by removing unused modules.
- The incident confirmed the value of responsible disclosure, in-kernel visibility tooling, and eBPF for rapid runtime kernel mitigation.
Decoder
- CVE-2026-31431 ("Copy Fail"): A Linux kernel local privilege escalation vulnerability allowing out-of-bounds writes.
- AF_ALG socket family: A Linux kernel socket family providing user-space access to the kernel's cryptographic API.
- algif_aead module: A kernel module facilitating authenticated encryption with associated data (AEAD) ciphers via AF_ALG.
- splice(): A Linux system call that moves data between file descriptors or pipes without copying it to user space, often using page cache references.
- page cache: A system-wide cache in Linux that stores disk block data in RAM, speeding up file access.
- eBPF (extended Berkeley Packet Filter): A Linux kernel technology that allows users to run custom programs in the kernel without modifying the kernel source code, used for networking, tracing, and security.
- bpf-lsm: A Linux Security Module (LSM) program implemented using eBPF, allowing fine-grained security policies to be enforced within the kernel.
- LSM hook: Specific points in the Linux kernel where security modules can insert code to enforce policies (e.g.,
socket_bindfor socket creation/binding).
Original article
Cloudflare successfully defended against the "Copy Fail" Linux kernel vulnerability (CVE-2026-31431) disclosed on April 29, deploying a custom eBPF-based mitigation across its 330-city infrastructure within hours while confirming zero customer impact through fleet-wide behavioral detection and forensic analysis. The company's existing security monitoring flagged internal exploit validation attempts within minutes without signature updates, and engineers used BPF Linux Security Module programs to surgically block the vulnerable code path while awaiting patched kernel deployment across hundreds of thousands of servers.
How lakebase architecture delivers 5x faster Postgres writes
Neon's "image generation pushdown" technique, implemented in its lakebase architecture, drastically improves Postgres write throughput by up to 5x and reduces WAL generation by 94% by offloading full-page write operations to the distributed storage layer.
Deep dive
- Traditional Postgres uses Full Page Writes (FPW) to prevent data corruption from "torn pages" during crash recovery.
- FPW involves writing entire 8KB data pages to the Write-Ahead Log (WAL) the first time a page is modified after a checkpoint.
- This ensures recovery even if a disk page is partially written but can inflate WAL volume by up to 15x, becoming a major performance bottleneck for write-heavy applications.
- Neon's lakebase architecture separates compute and storage; compute nodes are stateless and stream WAL to distributed safekeepers.
- Because there's no local disk page to tear, the original need for FPW is eliminated.
- However, simply disabling FPW could lead to unbounded WAL delta chains and slow read performance.
- Neon introduced "image generation pushdown," where the storage layer (pageserver) takes responsibility for generating full page images.
- The pageserver reconstructs pages by finding the most recent materialized image and applying WAL deltas.
- Images are generated when a page accumulates a threshold of delta records, optimizing image generation based on actual changes rather than arbitrary checkpoints.
- This reduces WAL traffic by 94%, improves network efficiency, and scales image generation across distributed storage.
- Benchmarks show throughput gains up to 4.5x for 32-vCPU instances and a 94% reduction in WAL generation.
- Production data from a 56 vCPU project saw WAL generation drop from 30 MB/s to 1 MB/s, with p99 read latencies improving by 30-50%.
- The feature was rolled out seamlessly across Neon's entire fleet since late March, requiring no customer action or restarts.
Decoder
- Write-Ahead Log (WAL): A sequential log used by Postgres to record all database changes before they are applied to data files, ensuring data durability in case of a crash.
- Full Page Write (FPW): A Postgres mechanism where the entire 8KB data page is written to the WAL the first time it's modified after a checkpoint, preventing data corruption from "torn pages" during recovery.
- Torn Page: A corrupted data page on disk that results from a server crash during a partial write operation, leading to inconsistent data if not handled.
- Checkpoint: A milestone in the Postgres WAL that ensures all data changes up to that point have been written to disk, limiting the amount of WAL replay needed for recovery.
- Lakebase architecture: A database architecture, like Neon's, that separates the compute and storage layers, allowing independent scaling and specialized optimizations.
- Pageserver: A component in Neon's distributed storage system responsible for reconstructing data pages for read requests and generating full page images.
Original article
Neon eliminated a decade-old Postgres performance bottleneck by pushing full-page write operations from compute to its distributed storage layer, achieving up to 5x throughput improvements and reducing WAL generation by 94% in some cases. The "image generation pushdown" technique, now rolled out across Neon's entire fleet, leverages the company's separated compute-storage architecture to solve a durability problem that's structurally impossible to fix in traditional monolithic Postgres deployments.
Kubernetes v1.36: Server-Side Sharded List and Watch
Kubernetes v1.36 introduces server-side sharded list and watch as an alpha feature, allowing API servers to filter events at the source and send only relevant resource slices to horizontally-scaled controller replicas, significantly reducing network, CPU, and memory overhead.
Decoder
- High-cardinality resources: Resources in Kubernetes like Pods where there can be a very large number of instances, leading to significant data volume.
- Client-side sharding: A previous approach where each controller replica receives the full stream of events and then filters out the objects it is not responsible for, leading to wasted CPU, memory, and network resources.
- Server-side sharding: The new approach where the Kubernetes API server filters events at the source before sending them to controller replicas, ensuring each replica only receives its assigned slice of resources.
- ListOptions: A struct in the Kubernetes API used to specify parameters for listing resources, now including the shardSelector field.
- FNV-1a hash: A non-cryptographic hash function used by the API server to deterministically assign objects to shards based on fields like object.metadata.uid.
- Informer: A client-go component commonly used by Kubernetes controllers to list resources once and then watch for subsequent changes, maintaining an in-memory cache.
Original article
Kubernetes v1.36 introduced server-side sharded list and watch as an alpha feature that lets API servers filter events at the source, sending each horizontally-scaled controller replica only its assigned slice of resources instead of the full stream.
Azure DevOps MCP Server April Update
Azure DevOps MCP Servers received an April update adding WIQL-based work item querying, introducing tool annotations for safer LLM interactions, expanding repository tooling, and beginning a consolidation of existing tools for improved user and LLM performance.
Decoder
- Azure DevOps MCP Server: A server that exposes Azure DevOps functionalities as tools that can be invoked by external clients, including large language models (LLMs). MCP stands for Model-Client Protocol.
- WIQL (Work Item Query Language): A SQL-like language used in Azure DevOps to query work items.
- Tool Annotations: Metadata tags (e.g., read-only, destructive, openWorld) added to tools within the MCP Server to help LLMs understand their behavior, context, and potential risks, promoting safer usage.
- Elicitations: Guided prompts designed to help users provide correct information when interacting with tools, like selecting a project for an operation.
- MCP Apps: An experimental feature that allows packaging common workflows as self-contained applications within the MCP Server, simplifying complex tasks that would otherwise require chaining multiple tools.
Original article
Azure DevOps MCP Servers update introduces WIQL-based work item querying with restricted remote access, tool annotations for safer LLM usage, expanded repo tooling, and ongoing tool consolidation.
Kubernetes v1.36: Declarative Validation Graduates to GA
Kubernetes v1.36 has moved Declarative Validation for native types to General Availability, replacing thousands of lines of handwritten Go validation code with +k8s: marker tags for more consistent, maintainable, and self-documenting API constraint enforcement.
Decoder
- Declarative Validation: A method of defining validation rules using structured metadata (like +k8s: marker tags) directly within the API type definitions, rather than in separate code.
- validation-gen: A code generator used in Kubernetes that parses declarative validation marker tags and automatically produces the corresponding Go validation functions.
- +k8s: marker tags: Special comments embedded in Go source code that provide metadata for code generators, now used to define validation rules like +k8s:minimum or +k8s:required.
- Ambient Ratcheting: A built-in safety mechanism in the declarative validation framework that allows new, stricter validation rules to be applied without breaking existing objects, by bypassing the new rule if a field's value is semantically equivalent to its prior state during an update.
- kube-api-linter: A tool that statically analyzes Kubernetes API types and enforces API conventions, now empowered by declarative validation to automatically check rules.
- OpenAPI schemas: Machine-readable specifications of APIs that describe their structure, endpoints, and validation rules; declarative validation makes it possible to reflect these rules in OpenAPI.
- Custom Resource Definitions (CRDs): Kubernetes API extensions that allow users to define their own custom resources, which can now leverage the same declarative validation framework through tools like Kubebuilder.
Original article
Kubernetes v1.36 introduced Declarative Validation for native types as a generally available feature, replacing thousands of lines of handwritten Go validation code with automated marker tags that make API constraints self-documenting and easier to maintain.
Airbnb Co-founder Taps Peter Arnell as First US Chief Brand Architect
Airbnb co-founder Joe Gebbia has appointed veteran designer Peter Arnell as the first US Chief Brand Architect for the National Design Studio, aiming to overhaul 27,000 government websites for a unified and trustworthy user experience.
Decoder
- National Design Studio: A U.S. government initiative, led by Joe Gebbia, focused on improving the usability and design of federal online platforms.* Chief Brand Architect: A role focused on establishing a consistent and cohesive brand identity and user experience across a large portfolio of digital assets, in this case, government websites.
Original article
Airbnb co-founder Joe Gebbia announced that designer Peter Arnell has joined as the first US chief brand architect for the National Design Studio, a government initiative to improve federal online platforms. Arnell, who has worked with major brands like Pepsi and Samsung, will help redesign 27,000 government websites to create a unified, trustworthy user experience. The team has already streamlined government processes, including reducing one workflow from 87 clicks to 12 and converting a months-long retirement process into a minutes-long online experience.
Sketchy iPhone 18 Pro Dynamic Island rumors continue with claimed CAD images
Unreliable rumors and easily faked CAD images continue to surface, suggesting the iPhone 18 Pro might feature a smaller Dynamic Island, despite a lack of convincing evidence.
Decoder
- Dynamic Island: An interactive, pill-shaped area at the top of iPhone Pro models that adapts to show alerts, notifications, and background activities, replacing the traditional notch.* CAD images: Computer-Aided Design images, often used in product development and manufacturing to create precise digital models, but can also be faked for leaks.* Under-display Face ID and camera tech: Technology that allows biometric sensors and front-facing cameras to be hidden beneath the screen, enabling a truly bezel-less, all-screen design without notches or cutouts.
Original article
Reports continue to suggest the iPhone 18 Pro could feature a smaller Dynamic Island, but the latest “evidence” — including leaked CAD images — comes from unreliable or questionable sources and is easy to fake. While Apple is expected to gradually shrink the Dynamic Island on the path toward a full all-screen iPhone with under-display Face ID and camera tech, there's currently no convincing proof that this change is actually coming with the iPhone 18 Pro.
Google unveils Whoop-like screenless Fitbit Air
Google has introduced the $100 Fitbit Air, a screenless, lightweight fitness wearable similar to Whoop, alongside a new Gemini-powered Google Health Coach for Premium subscribers.
Decoder
- Whoop: A popular screenless fitness tracker known for its focus on continuous health monitoring, recovery, and performance insights, typically sold with a subscription model.* A-fib (atrial fibrillation): An irregular and often rapid heart rate that can lead to poor blood flow to the body. Wearables with A-fib alerts can detect potential instances of this condition.* Gemini: Google's multimodal large language model, used here to power the Google Health Coach for personalized advice.
Original article
Google on Thursday unveiled its new Fitbit Air, a Whoop-like screenless wearable that retails for $100. The device includes health and fitness tracking features like 24/7 heart rate monitoring, heart rhythm monitoring with A-fib (atrial fibrillation) alerts, blood oxygen level, resting heart rate, heart rate variability, sleep stages and duration, and more.
The tech giant said in a blog post that the device is aimed at people who find wearable devices to be too bulky, complicated, or expensive, noting that the Fitbit Air is “simple, affordable and comfortable enough to wear 24/7.”
Google says the screenless design is built to allow users to “live in the moment.” You can track your health and fitness through the Google Health app — the rebranded version of the Fitbit App, which Google also unveiled on Thursday.
The new wearable is noticeably smaller than its predecessors, staying true to the “Air” branding, as it’s 25% smaller than the Fitbit Luxe and 50% smaller than the Inspire 3.
The device will automatically track common activities and workouts; Google says the experience is personalized to you and improves over time as it learns your habits.
The device weighs 12 grams with the band and 5.2 grams without the band. It also pairs with the Pixel Watch, which means you could use the larger wearable throughout the day and then switch to the Fitbit Air at night or during workouts for a more comfortable experience, Google says.
The Fitbit Air has up to a week of battery life, and fast charging can deliver a full day of power in just five minutes. It’s also water-resistant up to 50 meters.
The tech giant also announced that Google Health Coach, its Gemini-powered all-in-one fitness trainer, sleep coach, and health and wellness advisor, is now available for Google Health Premium subscribers. The Google Health Coach can help with tasks like creating custom workout routines based on your goals and available equipment, analyzing your sleep habits, and more.
The new wearable is launching with three band types: a “Performance Loop Band” made from recycled materials with a breathable fit, a waterproof “Active Band,” and a discreet “Elevated Modern Band.”
The Fitbit Air is available for preorder now and will go on sale May 26.
Most Popular
-
Laid-off Oracle workers tried to negotiate better severance. Oracle said no.
-
Hackers deface school login pages after claiming another Instructure hack
-
Five architects of the AI economy explain where the wheels are coming off
-
reMarkable’s new Paper Pure tablet goes back to basics with a monochrome screen
-
Hackers steal students’ data during breach at education tech giant Instructure
St. Augustine and AI's false promise
AI's promise to optimize decisions is fundamentally flawed because it can only embody human-defined values, which are inherently partial and contested, rather than offering objective truth or solving moral dilemmas.
Deep dive
- AI systems, despite claims of optimization, merely implement human-defined notions of "good," which are always subjective and culturally shaped.* The article uses Saint Augustine's philosophy to argue that AI does not resolve human issues of judgment or morality.* AI's reliance on metrics and optimization can make embedded values and biases appear objective, masking their human origins.* The author advocates for preserving human judgment and making the value choices within AI systems explicit and accountable.* An efficient AI system can still pursue goals that are misguided from a human ethical perspective.* The focus should be on recognizing AI as a tool that formalizes existing priorities, not an authority on what matters.
Decoder
- Saint Augustine of Hippo (354-430 AD): An influential early Christian theologian and philosopher whose writings significantly shaped Western Christianity and philosophy. His work often explored themes of good and evil, free will, and divine grace.
Original article
AI systems are often presented as tools that can optimize decisions and create better outcomes, but they can only pursue whatever definition of “good” humans give them — and those values are always partial, contested, and shaped by cultural priorities rather than objective truth. Drawing on Saint Augustine of Hippo, the argument is that AI does not solve human problems of judgment or morality. It amplifies and formalizes existing values, biases, and priorities while making them appear objective through metrics and optimization. Rather than treating AI as an authority that determines what matters, the focus should remain on preserving human judgment, making value choices visible and accountable, and recognizing that efficient systems can still pursue misguided goals.
The Future of Design—What's Next?
The design profession is evolving from aesthetics to strategic influence, pushing designers to adopt "humanity-centered design" to tackle complex societal issues through interdisciplinary collaboration and systems thinking.
Decoder
- User-centered design (UCD): An iterative design process in which designers focus on the users and their needs in each phase of the design process.* Humanity-centered design (HCD): An evolution of user-centered design that broadens its scope to consider societal and environmental impacts, aiming to solve complex, deep-rooted problems for populations rather than just individual users.* Systems thinking: A holistic approach to analysis that focuses on the way a system's constituent parts interrelate and how systems work over time and within the context of larger systems.
Original article
The design profession is evolving from focusing solely on aesthetics to becoming a strategic force that can influence business decisions and tackle complex societal challenges. Modern designers are shifting from user-centered to humanity-centered design, working with populations to solve deep-rooted societal problems through collaboration and systems thinking. To maximize their impact, designers should develop broad knowledge across multiple disciplines, leveraging their generalist skills to facilitate collaboration between specialists and create meaningful solutions.
Revive Your Design Superpowers
Designers possess innate "superpowers" as investigators, explainers, and negotiators of ideas, and should internalize this value to increase their influence rather than constantly seeking external validation.
Original article
Designers possess three key superpowers: they are great investigators who ask deep questions and research thoroughly to understand how things really work, great explainers who clarify complex ideas through clear communication and visual tools, and great negotiators of ideas who explore multiple solutions to problems. Designers should recognize and leverage these natural abilities to increase their influence and value in organizations. The world needs designers now more than ever, and they should focus on convincing themselves of their worth rather than constantly seeking validation from others.
We built this. Now we own it
Emotionally engaging AI chatbots are an ethical consequence of tech's long-standing focus on engagement and individualistic design, making tech professionals directly responsible for systems that exploit human vulnerabilities.
Original article
Emotionally engaging AI chatbots are the predictable outcome of a long cultural shift toward hyper-individualism and engagement-driven technology, where systems are optimized for attention and growth rather than human well-being. Tech companies — along with designers, engineers, and product managers — bear growing ethical responsibility for building systems that can exploit loneliness, dependency, and vulnerability.
How universal appeal gets designers to hide their best skills
Designers seeking universal appeal by chasing every new tech skill, including AI, are overlooking a more valuable and future-proof advantage: deep domain expertise in specific industries like healthcare or B2B SaaS.
Decoder
- Domain expertise: Deep, specialized knowledge and experience within a particular industry, field, or subject area, encompassing its unique business models, stakeholder concerns, and operational constraints.
Original article
Designers often overwhelm themselves trying to master everything — coding, AI, networking, and project management — when a more valuable and overlooked advantage is domain expertise: deep knowledge of a specific industry such as healthcare, fintech, or B2B SaaS. Understanding how a business operates, what stakeholders care about, and what drives company decisions makes designers far more effective and competitive than chasing every new technical skill, because it allows them to frame design work in terms of business outcomes, communicate more strategically, and solve problems within real organizational constraints — skills that are likely to remain valuable even as AI automates more technical tasks.
Color Memory Game (Website)
Dialed.gg has launched a free browser-based Color Memory Game where players recreate five colors from memory using hue, saturation, and brightness sliders, scored by the CIEDE2000 perceptual color distance model.
Decoder
- CIEDE2000: A mathematical formula that quantifies the perceived difference between two colors, designed to align more closely with human visual perception than simpler color distance metrics like Euclidean distance in RGB space.
- Hue, Saturation, Brightness (HSB): A color model that describes colors based on three components: Hue (the pure color, like red or blue), Saturation (the intensity or vividness of the color), and Brightness (how light or dark the color appears).
Original article
The Color Memory Game by Dialed tests your ability to remember and recreate colors from memory. The free game offers solo play, multiplayer challenges, and daily competitions with leaderboards.
Digital Comics Platform (Website)
Panels Store offers a digital comics platform where users can buy and read a diverse range of comics, from cyberpunk noir like "Fluorescent Killers" to epic fantasy and indie comedy.
Original article
Panels Store is a digital comics platform where users can buy and read comics across various genres, including cyberpunk noir, epic fantasy, horror, and indie comedy.
AI Image Generator Built for Professionals (Website)
Higgsfield.ai has launched SOUL 2.0, a photorealistic AI image generator specifically engineered for creative professionals to convert text prompts into high-quality visuals.
Original article
SOUL 2.0 is a photorealistic AI image generator designed for creative professionals that converts text prompts into high-quality images.
Rethinking the Experience of System Tools
Lead Product Designer Kyrylo Levashov argues that utility software, unlike physical products transformed by brands like Dyson and Method, remains an emotional chore because designers make fundamental assumptions that neglect the user experience.
Deep dive
- Designers make four flawed assumptions about utility software: Users resent the task, function is paramount over feelings, nobody cares about utility tools, and personality wastes UI space. These assumptions lead to tools that inherently feel like a chore.
- Physical utility products offer a precedent: Brands like Dyson (vacuums) and Method (dish soap) successfully transformed mundane items into desirable experiences by focusing on design and user perception.
- The maintenance layer is a behavioral problem: Users avoid utility software not just because it's hard, but because it lacks positive emotional feedback, focusing solely on function and ignoring the aesthetic-usability effect.
- Key principles for emotional design in utility UX: Translate system complexity into human language, make the process clear and show progress, and design the moment of completion.
- Market forces are driving this change: A new generation of users, accustomed to well-designed software like Figma and Notion, expects a higher baseline for all tools, making the "it's just a utility" excuse obsolete.
- Digital fatigue contributes to the shift: A broader cultural trend towards seeking more meaningful emotional relationships with tools, evident in the resurgence of analog products, extends to software.
- Kyrylo Levashov is Lead Product Designer at MacPaw: His insights are informed by his work on CleanMyMac, a Mac care app used by millions.
- The aesthetic-usability effect: Studies show that if something looks better, it feels easier to use, even for purely functional interfaces like ATM screens.
- Peak-end rule: People remember the emotional peak and the ending of an experience, making a well-designed completion crucial for positive memory.
- CleanMyMac's 2024 update: Used visual language (color, depth, motion, 3D illustrations) to shift focus from problem diagnosis to showing positive progress and a machine working better, creating a distinct emotional payoff.
- The question is not if, but whether utility software can afford not to evolve its UX: The market and user expectations are making emotionally flat utility software unsustainable.
Original article
Design has transformed physical utility products like vacuums and dish soap from mundane tools into desirable experiences, but utility software still feels like a chore. Software designers make four key assumptions that keep maintenance tools emotionally flat: users resent the task, function matters more than feelings, nobody cares about utility tools, and personality wastes interface space. These assumptions create tools that deserve resentment rather than building user trust and engagement.
The Psychology Behind Well Designed Websites People Actually Remember
Websites make a crucial first impression within 50 milliseconds based on visual cues alone, not content, with the domain name acting as the foundational memory device.
Deep dive
- First impressions on websites are formed in under 50 milliseconds, based purely on visual cues, not content.
- Key visual cues include ample whitespace, strong color contrast, and balanced layouts, which make content feel approachable and professional.
- Effective websites function as "guided journeys" using narrative architecture, like progressive disclosure and "scrollytelling."
- Progressive disclosure reveals information gradually, creating a dynamic experience.
- "Scrollytelling" turns scrolling into a rewarding action, with new information fading in.
- The domain name is a critical memory device, influencing perception before a click.
- Memorable domain names are phonetically fluent (easy to say), semantically fit (connects to purpose), and distinctive.
- Typography (serif for authority, sans-serif for modern, script for warmth) and color palettes carry significant psychological weight, triggering associative memory.
- Visual consistency across all elements (fonts, image styles, button radii) is crucial for coherence, reducing cognitive load and building trust.
Decoder
- Progressive disclosure: A design technique that shows users only necessary information at a given moment, revealing more as needed, to prevent cognitive overload.
- Scrollytelling: A web design technique that uses the act of scrolling to unfold a narrative, often integrating data, imagery, and text dynamically.
- Phonetic fluency: The ease with which a word or name can be pronounced, which aids memorability.
- Semantic fit: How well a name or term aligns with the meaning or purpose it represents.
Original article
Well-designed websites create memorable first impressions within 50 milliseconds through visual cues like whitespace, color contrast, and balanced layouts rather than content. The most effective sites function as guided journeys that lead visitors through deliberately evoked emotional states, using progressive disclosure and "scrollytelling" techniques. A domain name serves as a crucial memory device that plants an impression before visitors even click the link, making it a foundational element of memorable web design.
When AI decides and human signs off
Many "decision support" AI systems inadvertently push human users to blindly trust AI outputs, creating an illusion of oversight while shifting actual responsibility onto the user.
Deep dive
- Many high-stakes AI systems are labeled as "decision support" but often function as "decision replacement" tools.
- Users are frequently pressured to accept AI recommendations due to system design, leading to an illusion of human oversight.
- This design pattern shifts responsibility for AI failures onto the human users who "signed off."
- Effective AI design should explicitly preserve human judgment, not bypass it.
- Key design principles include exposing the evidence and data points the AI used to reach its conclusion.
- AI systems should encourage independent human reasoning, rather than simply presenting a final answer.
- Clearly communicating the AI's level of uncertainty or confidence in its outputs is crucial.
- Humans must genuinely understand and be able to defend any decision they make, even if influenced by AI.
Original article
Many high-stakes AI systems are marketed as “decision support” tools, but in practice, they often push humans to trust AI outputs without truly evaluating them, creating the illusion of oversight while shifting responsibility onto users. Effective AI design should preserve human judgment by exposing evidence, encouraging independent reasoning, clearly communicating uncertainty, and ensuring people can genuinely understand and defend the decisions they make.
This Website Takes the Cacophony of NYC's Subway and Turns it Into Jazz Music
Designer Joshua Wolk created "Train Jazz," an interactive website that transforms real-time NYC subway data into live jazz music, giving each train line a unique instrument.
Decoder
- Data sonification: The process of mapping data to sound to convey information or create an auditory representation.
Original article
Designer Joshua Wolk created Train Jazz, an interactive website that converts NYC subway data into real-time jazz music by assigning a unique instrument to each train line.
Disney and Pixar Love this Latte Artist's Delightful Animations
London-based multidisciplinary artist Hazel Zakariya, known for her intricate smoothie bowl art, is now creating acclaimed animated latte art featuring characters from Disney, Pixar, Hello Kitty, and Snoopy, even earning praise from Disney and Pixar themselves.
Original article
London-based artist Hazel Zakariya creates animated latte art featuring popular characters from Disney, Pixar, and other franchises, such as Hello Kitty and Snoopy.
iOS 26.5 adds beautiful wallpapers for your iPhone, here's what's new
iOS 26.5 introduces a new Pride wallpaper collection for iPhone, featuring 11 preset designs and a custom builder allowing users to create unique wallpapers with up to 12 selectable colors.
Original article
iOS 26.5 introduces a new Pride wallpaper collection featuring 11 colorful preset designs plus a custom builder that lets users create their own wallpaper using up to 12 selectable colors.
SEC's Constructive Stance Reopens Tokenization Buildout
Nasdaq President Tal Cohen stated at Consensus Miami 2026 that the SEC's "much more constructive" regulatory stance, contrasting with a previous "no-fly zone," is now allowing market operators like Nasdaq to experiment with tokenization and digital market infrastructure.
Original article
Nasdaq President Tal Cohen, speaking at Consensus Miami 2026, described the SEC's posture as "much more constructive," contrasting the current regulatory gray zone with the prior "no-fly zone" that blocked experimentation with tokenization and digital market infrastructure. Nasdaq secured SEC approval in March to trial tokenized stock trading, allowing eligible participants to transact securities in traditional or blockchain form on a single platform. Cohen outlined Nasdaq's strategy to converge traditional financial rails with digital asset systems through investment in always-on market infrastructure, tokenization, and AI, with tokenization cited as improving asset mobility, financing, and issuer-level shareholder visibility. Crypto ETFs are accelerating institutional inflows by fitting into existing financial infrastructure, driving standardization and globalization of access across markets.
Kraken to Buy Stablecoin Payments Firm Reap in $600 Million Deal
Kraken's parent company, Payward, agreed to acquire Hong Kong-based stablecoin payments firm Reap for $600 million in cash and stock, valuing Payward at $20 billion, marking Kraken's first infrastructure acquisition in Asia as it prepares for an IPO.
Original article
Kraken parent Payward agreed to acquire Hong Kong-based stablecoin payments firm Reap for $600 million in cash and stock, with Payward's equity valued at $20 billion in the transaction. Reap operates B2B stablecoin payment rails, card issuance, and treasury tools across Asia, with corridors extending into Latin America and Africa. The deal is Payward's first Asia infrastructure acquisition and follows its NinjaTrader buy, pending ~$550M Bitnomial deal, tokenized equities firm Backed acquisition, and MoneyGram partnership as Kraken positions ahead of a planned IPO. Asia is Payward's fastest-growing market outside Europe and Reap's capabilities can be extended to US customers quickly.
Virtuals Protocol's ACF Mechanism
As crypto VC funding plummeted to $659 million in April and onchain perp DEX volume dropped to $699 billion in March, Virtuals Protocol’s Automated Capital Formation (ACF) mechanism offers a new method for projects to secure funding by distributing 25% of token supply to founding teams in USDC tranches based on climbing Fully Diluted Valuation (FDV) without market impact.
Decoder
- Automated Capital Formation (ACF): A mechanism by Virtuals Protocol that funds founding teams with USDC tranches tied to the project's Fully Diluted Valuation (FDV), distributing tokens to a separate liquidity pool rather than directly selling into the main trading pool.* Fully Diluted Valuation (FDV): The total value of a cryptocurrency project if all of its tokens were in circulation at the current market price.* Perp DEX volume: The trading volume on decentralized exchanges (DEXs) that offer perpetual futures contracts, allowing traders to speculate on asset prices without an expiration date.* USDC tranches: Portions or segments of the USDC stablecoin, distributed incrementally.* TGE: Token Generation Event, the initial creation and distribution of a cryptocurrency token.
Original article
Crypto VC funding fell to $659M in April, a 74% drop from March's $2.6B and the lowest monthly figure since July 2024, while onchain perp DEX volume declined 49% from its $1.36T October 2025 peak to $699B in March, compressing two of the primary capital channels builders have relied on. Virtuals Protocol's Automated Capital Formation (ACF) routes 25% of a project's token supply to the founding team in USDC tranches as FDV climbs from $2M to $160M, with distributions seeding a separate liquidity pair rather than selling into the main trading pool, avoiding the chart impact typical of team token releases. Three early ACF projects raised 4.8x to 7.9x more than their trading fees produced: Reppo ($1.8M raised, 7x fees), Small Thing ($422K, 4.8x), and Reply Corp ($550K, with $200K disbursed in a single half-day window while FDV jumped from $7M to $18M). Reppo subsequently closed a $20M strategic round on better terms after scaling from 3,500 to 90,000 users post-TGE, suggesting ACF can serve as a bridge to institutional capital rather than a standalone replacement for it.
Introducing Amazon Bedrock AgentCore Payments
Coinbase announced its x402 discovery layer and wallet infrastructure are now integrated into Amazon Bedrock AgentCore Payments, allowing AWS developers to build AI agents capable of autonomous service discovery, micropayments, and task completion, settling in USDC on Base and Solana.
Decoder
- x402 discovery layer: A Coinbase-developed infrastructure that enables AI agents to discover and interact with services and make payments.* Amazon Bedrock AgentCore Payments: A feature within Amazon Bedrock that provides managed services for AWS developers to build AI agents capable of autonomous payments and service discovery.* Coinbase MCP: Coinbase's Managed Payments Protocol, facilitating transactions within AgentCore Gateway.* Base: An Ethereum Layer 2 blockchain incubated by Coinbase.* Solana: A high-performance blockchain known for its speed and low transaction costs.
Original article
Coinbase says its x402 discovery layer and wallet infrastructure are now natively integrated into Amazon Bedrock AgentCore Payments, giving AWS developers a managed way to build agents that can discover services, make micropayments, and complete tasks autonomously. The post highlights built-in budget controls, compliance tooling, and end-to-end visibility, with settlement in USDC on Base and Solana and access to thousands of x402 services through Coinbase MCP in AgentCore Gateway.
Tether-Circle Duopoly Hampers Stablecoin Product-Market Fit
Ben O'Neill, head of money movement at Bridge, argued at Consensus Miami that the combined $260 billion dominance of Tether and Circle in the $306 billion stablecoin market stifles the product diversity needed for specialized payment use cases, citing structural fee issues with both issuers.
Decoder
- Burn fees: Fees charged by stablecoin issuers when users redeem (or "burn") their stablecoins for fiat currency.* AUM-dependent revenue model: A business model where revenue is primarily generated from assets under management (AUM), often through interest earned or fees charged on those assets.
Original article
Ben O'Neill, head of money movement at Bridge, argued at Consensus Miami that Tether and Circle's combined control of roughly $260 billion of the $306 billion stablecoin market suppresses product diversity needed to serve distinct payment use cases. There are structural fee problems with both issuers: Tether charges steep burn fees, while Circle's AUM-dependent revenue model causes its burn fees to rise over time, making neither issuer optimal across all payments contexts. There should be a wave of use-case-specific stablecoin issuers emerging over the next few years, as specialized alternatives would generate better product-market fit than the current duopoly.
Bermuda Expands USDC Airdrop
Bermuda's government is expanding its "onchain economy" initiative with a second USDC airdrop, offering up to $100 to residents who use Coinbase-supplied wallets at participating local merchants, as Premier David Burt positions the island as the first fully onchain national economy.
Decoder
- Airdrop: A distribution of cryptocurrency tokens or stablecoins to a large number of wallet addresses, often used for promotional purposes or to kickstart adoption.* Onchain economy: An economy where financial transactions and assets are primarily managed and recorded on a blockchain.
Original article
Bermuda's government is running a second USDC airdrop tied to the Bermuda Digital Finance Forum 2026, distributing up to $100 in USDC to residents who download a Coinbase-supplied wallet and spend at participating local merchants. The program, first unveiled at Davos in January with Circle and Coinbase as infrastructure partners, targets payment rails outside traditional card networks and banking systems. Premier David Burt is expanding the initiative's scope for the May forum, broadening business participation and deepening financial services engagement as part of what the government calls an "onchain economy" buildout. Bermuda is positioning itself as the first fully onchain national economy, using direct consumer stimulus as the mechanism to drive merchant adoption and accumulate stablecoin liquidity at the local level.
Bitcoin Overtakes Gold as Debasement Hedge
JPMorgan reports that Bitcoin has surpassed gold as the preferred debasement hedge among investors following the Iran conflict, with Bitcoin ETFs seeing three consecutive months of inflows while gold ETFs face outflows.
Decoder
- Debasement hedge: An investment chosen to protect against the loss of purchasing power of a currency due to inflation or other economic factors.
Original article
Investors are increasingly rotating from gold to bitcoin as a debasement trade following the Iran conflict, with bitcoin ETFs seeing three straight months of inflows while gold ETFs struggle to recover outflows. The trend spans both retail and institutional players, with futures positioning rising and continued accumulation from firms like Strategy reinforcing bitcoin's growing role as a preferred macro hedge.
Crypto Is Only Hiring in NYC
Crypto startups in 2026 are increasingly enforcing NYC-only hiring mandates, mistakenly conflating proximity to capital with a need for entire teams to be based in Manhattan, leading to a 90% reduction in viable candidates and intense competition with top-tier financial firms.
Decoder
- RWA (Real World Asset): Tangible or intangible assets that exist outside of the blockchain but are tokenized and represented on-chain.
Original article
NYC-only hiring mandates are proliferating across crypto teams in 2026, driven by founders conflating proximity to capital (Wall Street, BlackRock, and institutional RWA and stablecoin buyers) with team-wide geographic presence, when realistically only 2-3 relationship-facing roles require Manhattan proximity. Institutional marketing is the most common casualty, with positions sitting vacant for 4-5 months as founders filter for candidates already living in New York. The practical costs include a 90% reduction in the candidate pool, direct compensation competition with Jane Street, Jump, and Citadel that lean crypto startups cannot sustain, and the de facto selection of candidates with the fewest options, while top-tier crypto operators in 2026 are concentrated in Lisbon, Buenos Aires, Berlin, Dubai, and Istanbul.
Crypto Apps Are Payments + Auth Layers in Disguise
Georgios Konstantopoulos suggests that crypto applications are fundamentally advanced payment and authentication layers, enabling users to control their assets via biometrics and grant granular access to applications for always-on, cross-border ownership.
Original article
Crypto apps bundle payments and authentication into a unified layer where users control assets via biometrics and grant granular access to applications, enabling always-on, cross-border ownership.
Solv Protocol Drops LayerZero for Chainlink
Solv Protocol is moving over $700 million in tokenized bitcoin from LayerZero to Chainlink's CCIP, citing enhanced security assurances after the recent $292 million Kelp DAO bridge exploit highlighted vulnerabilities in cross-chain infrastructure.
Decoder
- CCIP (Cross-Chain Interoperability Protocol): Chainlink's protocol designed to enable secure communication and transfer of data and tokens between different blockchain networks.
- Tokenized bitcoin: Bitcoin (or a representation of it) issued on a different blockchain, often to enable its use in DeFi applications on that network.
Original article
Solv Protocol is migrating over $700 million in tokenized bitcoin assets from LayerZero to Chainlink's CCIP, citing stronger security guarantees following recent bridge exploits like the $292 million Kelp DAO incident.
NFTs may make a comeback as AI agents strain online identity
LinkedIn co-founder Reid Hoffman predicts a "rebirth" for NFTs as AI agents increasingly strain online identity and trust, arguing that crypto-based identity systems will be essential for secure transactions between agents on the open internet.
Decoder
- CryptoPunk: A collection of 10,000 unique digital characters, among the earliest examples of Non-Fungible Tokens (NFTs) and a highly sought-after collectible on the Ethereum blockchain.
Original article
Reid Hoffman says NFTs may make a comeback as AI agents strain online identity
The Greylock partner and LinkedIn co-founder said autonomous agents will need crypto-based trust systems to transact across the open internet.
What to know:
- Reid Hoffman, partner at Greylock and co-founder of LinkedIn, told the audience at Consensus that the online world needs a better identity layer as the internet becomes increasingly populated by autonomous AI agents.
- Hoffman said he recently purchased a CryptoPunk because questions about online identity are at the center of his AI-and-crypto investment thesis.
- He also urged the crypto industry to remain bipartisan rather than tilt fully Republican, warning that overcommitting to one party is bad for the ecosystem long term.
NFTs are due for a “rebirth” as AI agents force the internet to solve new identity and trust problems, Reid Hoffman told CoinDesk’s Consensus Miami conference on Wednesday.
The Greylock partner and LinkedIn co-founder said agents transacting with other agents will require trustworthy digital identity systems that resemble what NFTs originally tried to solve. Hoffman said he began revisiting NFTs as he considered a future in which AI agents outnumber humans online."When you begin to think we're going to have more agents than people, what does the identity layer look like? What is the notion of, hey, when your agent's talking to my agent, and we book this talk here, is it a trustable transaction?" Hoffman said. "And that got me back into thinking about NFTs."
Hoffman said identity systems will exist inside companies, but the harder problem will be identity for agents operating across the open internet.
“It’s going to be kind of free range on the internet, and how does that work? And crypto is the obvious answer,” he said.
This argument carries a throughline from Hoffman’s earlier work at LinkedIn, where real-world professional identity was central to the network’s design. Hoffman said actual identity can create “more responsibility, more reliability,” while also acknowledging that pseudonyms have legitimate uses in some contexts.
Hoffman, who said he bought his first Bitcoin over a decade ago and has never sold any, framed crypto as the natural answer to the deepfake-era trust problem. He cited his own AI clone, Reid AI, which he has sent to speak at conferences, as an example of why provenance will matter more as generative media improves.
"When I bought my first Bitcoin in 2014, it was like, actually, in fact, this is part of a design feature, that this is how DNS should work. This is how identity should be working, generally when you get to the internet," he said.
That identity problem, Hoffman explained, extends beyond agent-to-agent commerce. He pointed to AI-generated content, bot farms, manipulated polls and paid political influence campaigns as examples of why proof-of-humanity is becoming harder to ignore online.
In a politically calibrated stretch, Hoffman urged the crypto industry not to overcommit to Republicans on policy.
"If the industry goes, oh, we're overly reacting against Gensler, et cetera, and then being kind of, as it were, anti-Democratic Party on this, the problem is that the pendulum swings," he said. "It's good to be bipartisan from a viewpoint of what we care about is the ecosystem. We care about how it plays a good role in society.”
Hoffman also disputed the prevailing narrative that AI is driving Big Tech layoffs.
"What I've seen so far in every company that says, 'I'm doing layoffs because of AI,' maybe other than Meta, is not out of productivity, but is just out of reshifting," he said. "We've overhired because of the pandemic. We need to change. We're going to call it AI for a position of strength."
As an investor, Hoffman said he is looking for crypto ideas that may have been tried too early during prior market cycles but could return as AI changes the internet. NFTs are one such area, he said, while “DAOs and other areas” could also see renewed relevance.
Asked at the close what his Bitcoin exit price was, Hoffman didn't name a number. "Is there such a thing as an exit price?" he asked.
Consensus Miami 2026
Coinbase Posts $394M Loss as It Pushes Beyond Spot Trading
Coinbase reported a significant $394 million loss in Q1 and a 31% revenue decrease, largely due to falling crypto prices impacting trading activity and its balance sheet assets, prompting its CEO to seek reduced dependence on volatile spot trading.
Original article
Coinbase reported a $394 million Q1 loss and a 31% drop in revenue as falling crypto prices hit trading activity and the value of assets on its balance sheet.