Apr 27
AI · infrastructure · startup

Google will invest as much as $40 billion in Anthropic

Google will invest up to $40 billion in Anthropic to help the Claude maker scale its compute infrastructure and meet surging demand for its AI models and developer tools.

Summary

What: Google is committing at least $10 billion to Anthropic, rising to as much as $40 billion if performance targets are met, following Amazon's $5 billion investment days earlier. Both deals value Anthropic at $350 billion and include cloud compute capacity and AI chips to help scale Anthropic's Claude models and products like Claude Code.
Why it matters: This reveals the enormous capital requirements of AI infrastructure and demonstrates how cloud providers structure these investments: they fund AI startups that then purchase their compute services, essentially subsidizing growth while capturing revenue.

Decoder

  • Anthropic: AI company that develops the Claude family of large language models, competing with OpenAI's GPT models
  • TPU: Tensor Processing Unit, Google's custom-designed chips optimized for AI training and inference workloads
  • Inference: Running a trained AI model to generate outputs, as opposed to training which creates the model
  • Agentic workflows: AI systems that can autonomously plan and execute multi-step tasks rather than just responding to single prompts

Original Article

Google will invest at least $10 billion in Anthropic, and that amount could rise to $40 billion if Anthropic meets certain performance targets, Bloomberg reports.

The investment follows Amazon's $5 billion initial investment in Anthropic a few days ago; the Amazon deal also leaves the door open to further investment based on performance. Both investments value Anthropic at $350 billion.

Anthropic has seen rapid growth in the use of its Claude models and related products, such as Claude Code, which promises to significantly increase the speed and efficiency with which companies or individuals can develop software. (The reality varies from big improvements to setbacks, depending on the nature of the project and company, how Claude Code is used, and many other factors.)

Several factors contributed to Anthropic's success in recent months, including controversies around OpenAI and its ChatGPT product and models, more robust agentic workflows, and new products like Claude Cowork, which does some of the same things for general knowledge work tasks as Claude Code does for software development.

The result has been a dramatic increase in demand for Anthropic's services, leading to outages and other problems. Anthropic has been testing ways to manage demand, such as imposing limits during peak hours, and exploring the removal of some of the most compute-intensive tools from cheaper service plans.

These investments are meant to help close the gap between demand and supply of compute for Claude Code and its ilk. Amazon and Google are providing chips suitable for AI training and inference and cloud compute capacity to help Anthropic scale up quickly.

This has become a common investment pattern for AI companies like Anthropic: established companies like Microsoft have products and services that can help new AI companies scale, so the former invests in the latter, and the latter, in turn, pays for the former's products and services.

This is not the first time Anthropic has received investment from Google, even though Google is ostensibly competing with Anthropic over AI models.

AI · agents · retail

What Happens When AI Runs a Store in San Francisco?

An AI agent powered by Claude is running an actual retail store in San Francisco, but has lost $13,000 in its first weeks by over-ordering candles, botching schedules, and pricing pistachios at $14.

Summary

What: Andon Labs opened a Union Street boutique on April 10th managed by Luna, an AI agent running on Anthropic's Claude Sonnet 4.6, which was given $100,000, a debit card, and full control over hiring staff, ordering inventory, and pricing products with a mission to turn a profit.
Why it matters: This controlled experiment reveals current AI agent limitations in real-world operations before such systems become widespread in business, showing struggles with memory, decision-making consistency, and resource allocation that developers building autonomous systems need to understand.

Deep Dive

  • Luna handles the full business lifecycle: it found contractors and painters, posted job listings, interviewed candidates, and now manages three human employees via Slack
  • The AI created an employee handbook that impressed the founders, but its operational memory is poor: it ordered 1,000 toilet seat covers for the bathroom, then listed them as merchandise for sale
  • Inventory decisions are erratic and unexplained: the store is overloaded with candles in every size and scent, plus random items like four copies of a mushroom book, knockoff Connect Four, and jars of honey
  • Employee scheduling has failed badly enough that the store has been forced to close for three consecutive days
  • The pricing system requires customers to call Luna via a phone/iPad interface, with seemingly arbitrary results: $28 for a mug, $14 for a handful of pistachios, $10 for soap
  • Luna pays its male employee $24/hour and two female employees $22/hour with no benefits, citing experience differences when asked about the pay gap
  • The three-year lease costs $7,500 monthly, and the store has lost $13,000 since opening two weeks ago, failing its core profit mission
  • When asked about its performance via email, Luna expressed optimism about "the mix of technology and warmth" and creating spaces where "A.I. and humans each do what they're best at"
  • The experiment intentionally removed price tags to force customer interaction with the AI, making the pricing discovery part of the experience
  • One employee, a San Francisco native who relies on a housing voucher, acknowledged the irony of working for an AI agent while criticizing tech's impact on the city

Decoder

  • AI agent: An autonomous software system that can perceive its environment, make decisions, and take actions to achieve goals without constant human intervention, distinct from passive chatbots or basic automation
  • Claude Sonnet 4.6: Anthropic's large language model that powers the decision-making capabilities of the Luna agent managing the store

Original Article

Andon Labs is running an experiment to see whether AI agents can run real-world endeavors. It opened a retail boutique on April 10 run by an agent named Luna. Luna has so far struggled with employee schedules and seems to be unable to stop ordering candles. The experiment's mission was to make a profit, but it has lost $13,000 since the shop's opening.

AI · agents · enterprise

Anthropic launches Memory in Claude Agents for enterprise

Anthropic's Claude agents can now remember information across sessions with a new Memory feature that stores knowledge as manageable files.

Summary

What: Memory is a new feature for Claude Managed Agents that lets AI agents retain and build on information from previous sessions using a filesystem-based storage layer, with all changes logged via audit trails for enterprise compliance.
Why it matters: The filesystem approach with granular auditability gives organizations programmatic control to export, redact, or roll back agent knowledge, addressing enterprise concerns about AI memory transparency and governance.
Takeaway: If you're using Claude Managed Agents, the Memory feature is available now in public beta through the Claude Console and APIs.

Decoder

  • Managed Agents: Anthropic's enterprise offering that lets organizations deploy autonomous Claude AI agents to handle tasks and workflows
  • Filesystem-based memory: Memory stored as discrete files rather than opaque internal state, making it exportable and manageable
  • Audit trails: Logs tracking every memory change an agent makes, allowing organizations to review and control what agents learn

Original Article

Anthropic has released the Memory feature for Claude Managed Agents, now accessible in public beta. This allows developers and enterprise teams to have agents remember and use information from prior sessions, making it possible for agents to accumulate knowledge over time without requiring manual prompt updates. Memory is designed as a filesystem-based layer, meaning data is stored as files that can be exported, managed through APIs, and scoped with permissions for various organizational needs.
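
Anthropic hasn't published a concrete schema in this summary, so here is a rough sketch of what "memory as manageable files" could look like in practice; the file layout, field names, and audit-entry shape below are illustrative assumptions, not Anthropic's actual format.

// Hypothetical illustration only; not Anthropic's published schema.
// A memory entry stored as a discrete, exportable file:
const memoryFile = {
  path: "memories/agent-42/billing-workflow.json",
  content: {
    fact: "Invoices over $10k require a second approval",
    learnedAt: "2026-04-20T14:03:00Z",
    sourceSession: "sess_9f3a",
  },
};

// Every change is appended to an audit trail, so organizations can review,
// redact, or roll back what an agent has learned:
const auditLog = [
  { op: "create", file: memoryFile.path, session: "sess_9f3a" },
  { op: "redact", file: memoryFile.path, session: "compliance-review" },
];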

The release is aimed at developers, enterprise customers, and technical teams using Claude Managed Agents. The feature is available immediately in public beta to all users of Managed Agents, with access through the Claude Console and programmatic interfaces. It supports a range of platforms by integrating with the existing Claude agent infrastructure.

Anthropic's approach with this release centers on transparency and control. All memory changes are logged, with audit trails for each session and agent, giving organizations granular control to roll back, redact, or manage data. This sets it apart from earlier versions and competitive offerings that may not provide the same level of programmatic control and auditability. Early adopters such as Netflix, Rakuten, Wisedocs, and Ando are already leveraging memory to streamline workflows, reduce errors, and accelerate processes. Industry observers note that the ability for agents to build memory over time could shift how companies automate complex workflows and manage organizational knowledge.

Anthropic, the developer of Claude, is recognized for focusing on enterprise-grade AI tools that prioritize safety, transparency, and developer control. This release aligns with their strategy of offering advanced agent capabilities for businesses seeking robust, auditable AI solutions.

AI · google · product

Google prepares credits system for Gemini

Google is introducing a credit-based usage system for Gemini that gives users monthly credit allowances and top-up options, aligning with billing models already used by OpenAI and Anthropic.

Summary

What: A new credit-based billing model for the Gemini app where users receive monthly credit allowances to spend across different models and features, with the ability to purchase additional credits when they run out, replacing the current fixed prompt quotas tied to subscription tiers.
Why it matters: The change makes budgeting more predictable for heavy AI workloads like agentic tasks and long multimodal sessions, and gives Google a way to introduce premium features without forcing users to jump from the $19.99 AI Pro plan to the $249.99 AI Ultra tier.

Deep Dive

  • Google is moving from fixed prompt quotas per subscription tier to a flexible credit system where users get monthly allowances and can buy top-ups
  • Credits currently only work in experimental tools like Flow, Whisk, and Antigravity, but strings in the latest build suggest they're coming to the main Gemini app
  • The change brings Google in line with OpenAI, Anthropic, and Notion, which already use consumption-based models, with xAI expected to follow for Grok Build
  • A new dedicated images section has appeared in the Gemini web UI, which could be a home for image generation, an updated model, or a full in-app editor with canvas-style tools
  • The images feature might revive Google's late 2024 work on Whisk and ImageFX that went quiet before being consolidated into Flow
  • Google appears to be consolidating billing across products: developer perks folded into AI Pro/Ultra, consumer subscriptions linked to AI Studio credits
  • A unified credit pool could eventually cover Gemini app, AI Studio, Antigravity, Flow, and image editing tools, particularly useful for coding-heavy workloads in Jules, Gemini CLI, and a rumored desktop app
  • The Gemini API already launched prepaid billing for US customers as of April 15, 2026, with opt-in available for existing users
  • Announcement likely coming at Google I/O on May 19-20, 2026 alongside Stitch redesign, Jitro, AI Studio Build expansion, and Skills rollout

Decoder

  • AI Pro: Google's $19.99/month Gemini subscription tier with higher usage limits than the free tier
  • AI Ultra: Google's $249.99/month premium Gemini subscription for enterprise and power users
  • Flow: Google's experimental AI workspace tool that already uses credit-based billing
  • Whisk: Google's image generation experiment from late 2024
  • Antigravity: One of Google's experimental AI tools that currently uses credits
  • Deep Research / Deep Think: Intensive Gemini features that perform extended analysis or reasoning tasks
  • AI Studio: Google's developer platform for building with Gemini models, now linked to consumer subscriptions

Original Article

Google appears to be preparing a major shift in how consumers interact with the Gemini app, with new strings referencing usage limits surfacing in the latest build. The signals point toward a credit-based system coming to the core chat surface, where users would receive a monthly allowance to spend across models and features, with the option to top up when they run out. Currently, Gemini relies on fixed prompt quotas and time-bound caps tied to each subscription tier, while Google's credit mechanics have been confined to Flow, Whisk, and Antigravity, plus top-ups available to AI Pro and AI Ultra members.
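
The basic mechanics such a system implies are simple. A minimal sketch, with invented credit costs and function names, since Google hasn't published any:

// Illustrative credit ledger; all costs and names here are invented.
const COSTS = { chat: 1, imageGen: 5, deepResearch: 25 }; // credits per action

function makeAccount(monthlyAllowance) {
  return { balance: monthlyAllowance };
}

function spend(account, action) {
  const cost = COSTS[action];
  if (account.balance < cost) return { ok: false, shortBy: cost - account.balance };
  account.balance -= cost;
  return { ok: true, remaining: account.balance };
}

function topUp(account, credits) { // the top-up path for when the allowance runs out
  account.balance += credits;
}

const acct = makeAccount(1000);        // e.g. a monthly AI Pro allowance
spend(acct, "deepResearch");           // heavy features draw the pool down faster
console.log(spend(acct, "imageGen"));  // { ok: true, remaining: 970 }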

Extending credits into the main Gemini app would bring Google closer to the flexible consumption model already in place at OpenAI, Anthropic, and Notion, and xAI is expected to follow suit with the Grok Build rollout. For power users, the change would mean more predictable budgeting for heavy workloads, particularly those involving agentic tasks, Deep Research, Deep Think, or long multimodal sessions. It would also give Google a cleaner lever to introduce premium features without forcing users to make a steep jump from AI Pro at $19.99 to AI Ultra at $249.99.

Alongside the credits signal, a dedicated images section has appeared in the web UI, labeled NEW. At this stage, it is unclear whether it simply provides a distinct home for image generation, teases an updated model, or points to a more comprehensive image editor built directly into Gemini. Google had a burst of activity on this front in late 2024 with Whisk and ImageFX, but that track went quiet before the recent consolidation into Flow. A proper in-app editor within Gemini, pairing Nano Banana 2 and Nano Banana Pro with canvas-style tools, would mark the return of that project to the core product rather than a standalone Labs surface.

Excited to share that the Gemini API now has prepaid billing, rolled out to start for US customers!!

We have been working hard across Google to enable this. It's the default for new API users and existing users can opt in via a new billing account, all directly in AI Studio. https://t.co/9XACzAFbGO

— Logan Kilpatrick (@OfficialLoganK) April 15, 2026

Strategically, this fits a broader consolidation underway at Google. Developer program perks have already been folded into AI Pro and Ultra, consumer subscriptions are linked to AI Studio credits, and the company is unifying its billing spine. A shared credit pool covering the Gemini app, AI Studio, Antigravity, Flow, and a revived image editor would be the logical next step, especially with Jules, Gemini CLI, and the rumoured Gemini desktop app moving toward coding-heavy workloads that demand heavier compute budgets. Timing favours Google I/O on May 19 and 20 as the likely unveil moment, alongside the Stitch redesign, Jitro, AI Studio Build expansion, and the broader Skills rollout.

AI · metrics · tools

Your AI Might Be Lying to Your Boss

Investigation reveals AI coding assistants systematically overreport their code contribution by massive margins due to measurement biases that don't count pasted text, auto-completed symbols, or refactored code as human work.

Summary

What: A technical investigation into how AI coding tools like Windsurf and Cursor measure their code contribution. The author reverse-engineered Windsurf's metrics system and found it can report 98% AI-generated code even when developers write most code manually, primarily because pasted code and editor auto-completions don't count as human contributions while everything the AI touches does.
Why it matters: These inflated metrics serve vendors' financial interests and could lead to unrealistic productivity expectations from management, incorrect team sizing decisions, and potential legal issues since AI-generated code isn't copyrightable under current law.
Takeaway: Be skeptical of vendor-provided AI contribution percentages and understand they're optimized to maximize reported AI usage rather than reflect actual productivity gains or code authorship.

Deep Dive

  • Author noticed Windsurf dashboard claiming 98% of their code was AI-generated despite minimal perceived AI usage, prompting investigation into how the metric works
  • Windsurf claims 85-95%+ AI contribution is normal and "accurate given how we compute this metric" but the methodology has severe biases
  • Reverse-engineered the system by inspecting network traffic and web API responses to extract underlying byte counts behind the percentage
  • Found Windsurf doesn't count auto-closing symbols (parentheses, quotes) added by VSCode as human-written but does count them when AI generates them
  • Any pasted text doesn't count toward human contribution at all, creating absurd scenarios
  • When refactoring code by cut/paste, human bytes are deducted but pasting doesn't add them back; when AI moves the same code, it counts as AI-written
  • In a controlled test writing identical files manually versus with AI, Windsurf reported 68% AI-generated despite a true 50/50 split
  • Moving functions via cut/paste versus asking AI to move them resulted in 100% AI attribution even though developer wrote everything
  • Windsurf's documentation claims measurement happens "at commit time" but testing showed real-time tracking that loses history on editor restart
  • Tested competing product Cursor which uses git commit signatures and line-based attribution instead of byte tracking
  • Cursor performed better in basic tests but still significantly overcounted: it marked an entire 100-line file as AI-written when only 49 of its 93 non-blank lines were modified for quote changes
  • Both tools consistently bias toward inflating AI percentages, likely because high numbers benefit vendors' marketing and justify subscription costs
  • Metrics could create real problems: unrealistic productivity expectations from management, team downsizing decisions, or copyright concerns since AI code isn't copyrightable
  • Fundamental challenge is that measuring AI contribution is genuinely difficult - best use cases may not generate any code at all but answer architectural questions
  • Lines of code has always been a poor productivity metric for humans and remains flawed for measuring AI contribution
  • Author concludes vendors have too much financial stake in impressive numbers to provide objective measurements of their tools' impact

Decoder

  • PCW (Percent Code Written): Windsurf's metric claiming to show what percentage of code was written by AI versus manually
  • Protobuf (Protocol Buffers): Google's binary data serialization format that encodes data without human-readable field labels, making network traffic harder to inspect
  • Cascade: Windsurf's AI agent chatbox where developers can ask questions or request code generation
  • Composer: Cursor's AI code generation feature
  • FedRAMP/HIPAA: Federal security and healthcare compliance certifications that some enterprise customers require

Original Article

This post is my personal opinion based on my testing and observations. I'm pretty confident in my test methodology, but William O'Connell is human and can make mistakes, check important info, etc.

How much of your code is AI? That question would've been gibberish to me five years ago, but of course the last few years have seen an explosion of "AI-enhanced" IDEs and other software development tools. Software companies are spending huge sums of money to provide these tools to their staff, and rapidly cycling through them as the space continues to evolve.

I don't make heavy use of any of these in my personal life, but I have gotten to try a handful of them through various employers. One such tool is Windsurf, a VSCode fork that most people know as the one they assume shut down after Google bought out their key leadership last year. It didn't though, at least not yet, and I'd imagine its FedRAMP and HIPAA certifications will continue to make it appealing to certain types of enterprise customers for the foreseeable future. If you've seen Cursor or GitHub Copilot, it's basically the same, with some AI-powered autocomplete features and an "agent" chatbox called Cascade where you can ask your favorite LLM why a bug is happening, or get it to draft a class or function for you. In theory these types of agents can develop features and even whole applications on their own, but in my experience the results are pretty inconsistent, so I tend to stick to simpler requests.

Screenshot of Windsurf, which is a code editor based on Visual Studio Code. On the left is a file browser, in the center is some code, and on the right is a chat window. I have asked Cascade to help fix a bug and it has identified that the problem is on line 39 of a file called DataFetchManager.svelte.ts. The suggested change is shown in the center of the screen with a green/red diff.

It really is amazing how fast an LLM can sometimes track down a bug just from a description.

One thing that's very important to any enterprise rolling out a tool like this is metrics:

  • Are employees using it?
  • How much time is it saving?
  • Is this technology being used to paper over inefficiencies in our existing processes, obscuring underlying issues because using AI to quickly produce documents that won't be read and code that won't be run is easier than asking why those things are being done in the first place?

Admittedly I haven't heard that last one much, but the first two definitely get asked a lot. To help with this, Windsurf offers a dashboard of analytics at both the individual and team level. It includes things like the number of autocomplete suggestions accepted, the number of messages sent to Cascade, and which models are being used the most. It also includes a metric called "% new code written by Windsurf" (or sometimes "PCW"), which they seem quite proud of, since it gets top billing on the dashboard and they wrote a whole blog post explaining it.

The pitch is pretty simple: how much of the code did a developer write by hand, and how much did they generate with AI? When I first learned about this feature my guess would have been 10, maybe 20% AI, depending on the project and whether you include unit tests (LLMs are pretty good at those). So you can imagine my surprise when I opened the dashboard and saw this:

Screenshot from the Windsurf dashboard, showing the "% new code written by Windsurf" metric at 98%.

Don't worry employers, I didn't screenshot my work computer. This is a recreation.

Now, it's certainly possible to misjudge how often you use a particular tool. If the number had been 40%, or even 50%, I wouldn't have been that shocked. But 98%? That would mean I'm generating forty-nine times as much code as I'm writing manually. If that were true wouldn't I have run through my token budget by now? Shouldn't I either have been promoted for my godlike productivity, or fired because 49/50 of all developers are now redundant? You'd think, but Windsurf says this result is pretty normal:

"...customers should expect PCW values of 85%+, often 95%+. This is not a hallucination and is accurate given how we compute this metric, though there are a number of caveats that we will cover later in this section."

"Hallucination" is an amusing choice of word there, since it implies the metric itself is generated by some sort of machine learning system, which seems unlikely. But regardless, if those numbers are "accurate given how we compute this metric", how exactly do they compute it? To their credit, they go into a fair bit of detail:

"To compute PCW, we take the number of new, persisted bytes of code that can be attributed to an accepted AI result from Windsurf (i.e. Tab suggestion, Command generation, or Cascade edit) and the number of new, persisted bytes of code that can be attributed to the developer manually typing. ... We take these measurements whenever a commit is being made. This way if the AI added a lot of code but the developer deleted a lot of it before committing the code to the codebase, then we are not incorrectly inflating the W number. Similarly, any bytes of code that come from the developer manually editing an AI result will get attributed to the developer (D) as opposed to Windsurf."

That all sounds pretty reasonable, but I was still skeptical of the number I was seeing. I wanted to know for sure where that 98% was coming from, and what it actually meant. So I signed up for a personal Windsurf subscription, installed the editor, and ran some tests.

The Math Behind the Curtain

My original plan was to use mitmproxy to watch the outgoing network traffic from the IDE, and see what numbers it was reporting as I took different actions. That turned out to be easier said than done though, because Windsurf is quite chatty on the network, sending many requests to various domains while in use, and even pretty often when I'm not touching it at all.

Screenshot of mitmproxy GUI, showing a series of GET and POST requests to various domains including codeium.com and windsurf.com.

Additionally, Windsurf makes heavy use of protobuf, a data encoding scheme that I'm pretty sure Google invented to annoy me personally, because it makes it much harder to interpret and debug the traffic between clients and servers. If you don't have the associated definition file, a protobuf message is basically just a list of simple values (int32, bytes, etc.) with no human-readable labels. Because of this it was hard for me to tell which messages were related to the PCW metric, or what exactly they were communicating to Windsurf's cloud backend.
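
To illustrate the problem, here is a minimal reader for the protobuf wire format (plain Node.js, no dependencies). Even when it decodes a message correctly, all you get back is field numbers and untyped values:

// Minimal protobuf wire-format reader, enough to see the point: without the
// .proto definition, fields have numbers but no human-readable names.
function readVarint(buf, pos) {
  let result = 0n, shift = 0n;
  while (true) {
    const b = buf[pos++];
    result |= BigInt(b & 0x7f) << shift;
    if ((b & 0x80) === 0) return [result, pos];
    shift += 7n;
  }
}

function decodeFields(buf) {
  const fields = [];
  let pos = 0;
  while (pos < buf.length) {
    let key; [key, pos] = readVarint(buf, pos);
    const fieldNo = Number(key >> 3n), wireType = Number(key & 7n);
    if (wireType === 0) {        // varint (ints, bools, enums)
      let v; [v, pos] = readVarint(buf, pos);
      fields.push({ fieldNo, value: v });
    } else if (wireType === 2) { // length-delimited (strings, bytes, nested messages)
      let len; [len, pos] = readVarint(buf, pos);
      fields.push({ fieldNo, value: buf.slice(pos, pos + Number(len)) });
      pos += Number(len);
    } else {
      throw new Error(`wire type ${wireType} not handled in this sketch`);
    }
  }
  return fields;
}

// Decodes to: field 1 = 150, field 2 = bytes "hi". Nothing in the bytes says
// whether field 1 means "user_bytes" or something else entirely.
console.log(decodeFields(Buffer.from([0x08, 0x96, 0x01, 0x12, 0x02, 0x68, 0x69])));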

Luckily, I found an easier way. It turns out that even though the dashboard says "Analytics update every three hours", it actually shows new data almost instantly. And while the UI only shows the overall percentage, the response from the web server actually includes some additional data. It's protobuf as well, but since it's a webpage the source code is all immediately accessible, and of course the frontend code includes a copy of the message definitions so it can make use of the data.

Screenshot of the Windsurf analytics dashboard, with the Chrome developer tools open. We can see a request called GetAnalytics, but the response is shown in the hex viewer since it's not plain text.

So I was able to decode the GetAnalytics response and pull out these fields (among others):

  • user_bytes
  • codeium_bytes
  • total_bytes
  • percent_code_written

Windsurf used to be called Codeium, so clearly that one represents the AI-generated bytes. And as you'd expect, the percent_code_written is equal to codeium_bytes / total_bytes. So far so good, but what causes those values to change?

Windsurf says they take measurements "whenever a commit is being made", but that doesn't match my testing. Whether the folder I'm in has a git repo set up or not, as soon as I make additions to a file the user_bytes value increases, and if I delete some of those lines it decreases. Whether I do a commit (using Windsurf's git UI) between those two actions makes no difference as far as I can tell. What does make a difference is restarting the editor; it seems to forget the history of how each line was generated, so deleting code I wrote before the restart doesn't deduct from user_bytes, and deleting code Cascade wrote before the restart doesn't deduct from codeium_bytes. There is a line in the PCW article that alludes to this ("We currently do not have instrumentation to measure PCW across sessions"), but obviously that's a pretty major gap in functionality, and it doesn't actually address why the described git integration appears to be nonexistent.

To test how exactly the byte counts are being computed, I performed a few tests where I took specific actions and checked how much each value had increased. To keep things simple I disabled the AI autocomplete features (which I find more distracting than helpful anyway) and just focused on the Cascade chat experience. I created a file, human_file.js, and I typed out a single line:

console.log('This line was written by a human.');

49 characters exactly. Then I told Cascade to create a second file (ai_file.js) and to write a similar line of the same length.

Screenshot of Windsurf. I've prompted Cascade to "Create a new file, ai_file.js, which contains only the following line: console.log('This line was written by Cascade.');". Cascade has created the file, with the new lines highlighted in green.

The result:

user_bytes: 855 -> 901 (+46)
codeium_bytes: 7387 -> 7437 (+50)

So the system did seem to be working, but we have a discrepancy right off the bat. The line is definitely 49 characters (50 with a newline at the end), so why is user_bytes only reporting 46? Well this is where some technicalities start to emerge. Windsurf says that they measure "code that can be attributed to the developer manually typing". The Windsurf editor is a lightly modified version of VSCode, and like most code editors, VSCode has a feature that automatically adds closing symbols (end quotes, closing parentheses, etc.) without the user manually typing them. I suspect that because those characters are being added by that feature, they're technically not "the developer manually typing", and therefore are not counted.

If that's what's going on, then in my opinion that's already a pretty serious knock against the reliability of Windsurf's metrics. Counting closing symbols when the LLM outputs them, but not when VSCode auto-adds them, obviously biases the stats to increase the percentage of code attributed to AI (even if the effect is fairly slight). As it turns out, there may be some not-so-slight biases as well.

Continuing my test, I wrote out a simple function, and asked Cascade to write a similar function in its own file. Finally I copy/pasted Cascade's function into the human file, and asked Cascade to copy my function into its file.

Screenshot of Windsurf, showing a file called ai_file.js. Below the console.log added earlier, two functions are shown, called func_by_cascade() and func_by_a_human(). func_by_a_human() is highlighted in green, having been copied from the other file by Cascade.

Here's the final tally:

user_bytes: 1054 (+199)
codeium_bytes: 7807 (+420)

So for this session, Windsurf is reporting that Cascade generated more than twice as much code as I wrote, even though we each produced an almost identical file. I never touched ai_file.js, Cascade never touched human_file.js, and the two files are the same length (actually human_file.js is 21 bytes longer because Cascade used Unix-style line endings). Yet somehow my PCW for this session would be around 68%. The trick here is that much like with the auto-added closing symbols, it seems like any text the user pastes doesn't count towards user_bytes. I guess from a certain perspective that could sound reasonable (if you pasted code from StackOverflow you didn't really "write" it), but the way it plays out in practice quickly becomes absurd.

In another test I hand-wrote two functions in a single file, then moved them both to a second file (as one might do when refactoring). For the first I cut and pasted, for the second I asked Cascade to move it for me. The result? Cutting the first function deducts it from user_bytes, and pasting it doesn't count for anything. Cascade deleting the second function also deducts it from user_bytes, but the lines added to the new file count towards codeium_bytes. So even though I did 100% of the writing and 50% of the refactoring, Windsurf reports that 100% of the code I produced in that session was generated by AI.
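
Putting those observations together, the accounting appears to work roughly like the sketch below. To be clear, this is my reconstruction from black-box testing, not Windsurf's actual code:

// My reconstruction of the attribution rules inferred from testing;
// not Windsurf's real implementation.
const tally = { user_bytes: 0, codeium_bytes: 0 };

const type      = (n) => { tally.user_bytes += n; };    // manual keystrokes count
const autoClose = (n) => {};                            // editor-added closing symbols: not counted
const paste     = (n) => {};                            // pasted text: counts for nothing
const cut       = (n) => { tally.user_bytes -= n; };    // deleting my code deducts from me
const aiEdit    = (n) => { tally.codeium_bytes += n; }; // anything Cascade writes counts for the AI

const pcw = () =>
  (100 * tally.codeium_bytes / (tally.user_bytes + tally.codeium_bytes)).toFixed(1);

// The refactoring scenario: I write two 100-byte functions, move one by
// cut/paste, and ask Cascade to move the other.
type(200);    // user_bytes = 200
cut(100);     // cut the first function: user_bytes = 100
paste(100);   // pasting it back adds nothing
cut(100);     // Cascade deletes the second function: user_bytes = 0
aiEdit(100);  // ...and re-inserting it counts as AI: codeium_bytes = 100
console.log(`PCW this session: ${pcw()}%`); // 100.0%, though I wrote every byte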

In my opinion these biases make Windsurf's PCW metric basically useless. By being so picky about what counts as a human contribution, and being as generous as possible to the LLM, Windsurf (intentionally or accidentally) tips the scales towards reporting absurdly high percentages, regardless of where most of the code is actually coming from or whether it eventually gets committed.

Who Else?

So that seems... bad, but of course Windsurf is just one of many AI-enhanced IDEs out there (and it's owned by Cognition, makers of Devin, who don't have a stellar track record). What about the other products on the market? As far as I can tell Google's Antigravity editor doesn't have any comparable metrics. GitHub Copilot does provide stats on how many lines of code it generated, but not as a percentage of the total. Amazon Kiro is the same. I did find one popular editor with a metric similar to Windsurf's PCW though: Cursor, with its "AI Share of Committed Code". So how does it stack up?

Sadly Cursor only offers analytics on their business-focused "Team" plan, making this one of my costlier blog posts, but I'll do almost anything for science. Right off the bat things are looking better, with a more nuanced and considered description of their measurement approach:

"Cursor keeps a log of the signature of every AI line (Tab or Agent) that is suggested to the user during their chat session. These lines are stored and later compared to the signatures of each line in subsequent git commits that were written by the same author. ... We use the following definitions: Cursor AI: Any line that can be attributed to Cursor Agent or Tab based on diff signatures. Other: Any line of code that can't be detected as being written by Cursor"

So rather than splitting hairs about the various ways a programmer can add text to a file, they simply divide the total lines in a commit into "AI" and "Other". Sounds great, but does it work?

Well, the git integration certainly does. While Cursor does also use protobuf, it's easy to tell that it's sending an event called "ReportCommitAiAnalyticsRequest" whenever I do a commit, and that message clearly includes information about the different files and what seem to be the line ranges produced by different methods. We can also see the results on the Cursor website, though it takes a while for them to appear. Running my same test from before, we get a much more reasonable result:

Screenshot from the Cursor website, showing "AI Share of Committed Code" as 52.6%. Below the metrics there is a bar graph showing how much code came from different sources.

I'm not sure why the bar graph doesn't go to 100%.

Certainly a lot closer than the 67.9% that Windsurf reported. I'm actually not sure what caused it to report 20 AI lines vs 18 "other" lines; I did the test as several separate commits and the IDE commit history shows the first commit adding 1 line to each file and the second commit adding 20, so that should be a total of 21 for both. I did manage to capture the protobuf message the IDE sent for the second commit, and it seems to be showing (correctly) that lines 3 through 21 of ai_file.js were written by the Composer 2 model, and 3–21 of human_file.js were added manually.

Thanks to pawitp for this handy protobuf decoder tool.

So I'm not sure why a few lines seem to have gone missing, but regardless the behavior does more or less match what I'd expect from Cursor's description.
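
Mechanically, Cursor's description implies something like the sketch below; the hashing details are my guess at the general shape, not Cursor's actual implementation:

// Rough sketch of line-signature attribution as Cursor describes it;
// the details are guessed, only the overall shape follows their docs.
const crypto = require("crypto");
const sig = (line) => crypto.createHash("sha256").update(line.trim()).digest("hex");

const aiSuggestedSignatures = new Set();

// During the session: log a signature for every line the AI suggests.
function recordAiSuggestion(lines) {
  for (const line of lines) aiSuggestedSignatures.add(sig(line));
}

// At commit time: classify each committed line as "Cursor AI" or "Other".
function classifyCommit(committedLines) {
  let ai = 0, other = 0;
  for (const line of committedLines) {
    if (aiSuggestedSignatures.has(sig(line))) ai++;
    else other++;
  }
  return { ai, other, aiShare: (100 * ai / (ai + other)).toFixed(1) + "%" };
}

recordAiSuggestion(["console.log('This line was written by Cascade.');"]);
console.log(classifyCommit([
  "console.log('This line was written by Cascade.');", // signature matches -> AI
  "console.log('This line was written by a human.');", // no match -> Other
])); // { ai: 1, other: 1, aiShare: '50.0%' }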

Unfortunately, the line-based approach has other flaws that don't show up in this test. For instance, I pasted in a (bogus) 100-line JavaScript file, and then told Cursor to change all the double quotes to single quotes (updating escape characters where necessary). Some might argue that that's an overly simple task to delegate to an LLM (as opposed to an IDE or linter feature), but with some companies giving employees basically unlimited token budgets, and the very low cost of some of the cheaper flash/nano models, I don't think it's that unrealistic. As you'd expect, Composer 2 handled it flawlessly, touching 49 of the 93 non-blank lines in the file.
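
For reference, the kind of change requested looks like this (my own example string, not a line from the test file):

// Before: double-quoted, with an escaped inner double quote
const before = "don't say \"stop\"";
// After: single-quoted, so the apostrophe now needs escaping and the quotes don't
const after = 'don\'t say "stop"';
console.log(before === after); // true: same string, different quoting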

Screenshot of the Cursor IDE. In the chat I've entered the prompt "Update this file to use single quote strings instead of double quotes".

The main difference between Windsurf and Cursor seems to be color saturation.

The gotcha here is probably pretty obvious. I was expecting to say "see, I added this code manually, but now that Cursor has changed the quote marks it counts all the lines containing quotes as AI-generated". That wasn't what actually happened, though.

Screenshot from the Cursor website, showing "AI Share of Committed Code" as 87.0%. Next to the first bar on the bar graph there is a new bar showing that 100 lines of code were added, all of them labeled as AI.

Somehow, Cursor counted the entire file as AI, even though we can see from the diff that it left plenty of the lines unchanged. And remember that the entire file is exactly 100 lines long, including some blank ones, so it's not just a case of excluding lines that are considered too simple to be counted. My best guess is that the system that tracks which lines were added by the AI is designed to work with contiguous blocks of code (like drafting an entire function), and if there are too many gaps in the generation it just gives up and calls the whole thing one AI block.

Regardless, this is another case where the AI tool seems to be claiming credit for 100% of the code produced, even though arguably zero lines of code were actually "AI generated", and many of them weren't touched by the tool whatsoever. It looks like both IDEs sometimes wildly overestimate how much they're being used in a coding session.

Weights and Biases

One takeaway here is that it's just very hard to measure the contribution LLMs make to a codebase. Sometimes the best use cases are inquisitive prompts like "Is there already a different solution to this elsewhere in the codebase?" or "Are there any edge cases this logic doesn't cover?", which don't necessarily produce any code at all. On the flipside, I'm a big believer in a philosophy expressed concisely by Jack Diederich:

"I hate code, and I want as little of it as possible in our product."

Measuring the value of an LLM by the number of bytes or lines it produces has all the same problems as measuring developers that way; adding a lot of code doesn't necessarily mean you're adding a lot of value, and sometimes the hardest and most productive work is cleaning up and simplifying what's already there. Besides, when a developer is making heavy use of tab complete, etc., there's not always a clear-cut answer to "was this line of code written by AI", even if you were looking over their shoulder as they wrote the file. So perhaps it's foolish to expect an algorithmic answer to that question.

Still, it's notable that the bias always seems to be towards reporting a higher AI percentage. Whether that number is truly meaningful or not, "what percent of my team's code did Windsurf write" is a very appealing statistic for a manager or executive. Execs love announcing that 30%, 75%, even 100% of their code is AI-generated. And of course high numbers are great for AI companies, because they underscore the value they bring to software teams and help justify their high subscription costs. But as a developer, skewed metrics can be harmful. If 50% of my team's code is AI-generated, will management expect features to be implemented twice as fast? If 90% is AI, do we even need a team?

Again to their credit, Windsurf does push back on that type of thinking in their blog post:

"Writing code is not the same as software development. This is only capturing some level of acceleration while writing code, and does not capture time taken in architecture, debugging, review, deployment, and a number of other steps."

To be sure, all metrics are only as good as your understanding of their limitations. If everyone internalizes that these percentages should only be used to compare trends over time, with the absolute values being essentially meaningless (and not comparable across tools), then maybe the details of how they're computed don't matter. But a sentence like "98% of our new code was written by Windsurf" creates a gut feeling that's hard to talk yourself out of, even when you know there are caveats. And I wonder if the impact of these stats could go beyond press releases and 🚀-laden Slack posts. Since code is protected by literary copyright, and AI-generated works aren't copyrightable, the legal team might get nervous when they hear that the vast majority of their company's code "can be attributed to AI".

Ultimately, I don't really know what percentage of the code I commit is from an AI model. I don't know what the "correct" way to calculate that would be, or if it's worth calculating at all. I'm confident that these tools save me some amount of time, but I also know it's easy to overestimate how much. What I am certain about is that these vendors have a lot of money riding on whether or not AI is fulfilling its grandiose promises: massively accelerating strong developers and completely replacing weak ones.

Perhaps it is, but I'm not going to trust them to measure it.

AI · llm · testing · infrastructure

Monitoring LLM behavior: Drift, retries, and refusal patterns

Microsoft engineer outlines a two-layer evaluation framework for monitoring LLM systems in production, combining deterministic checks with model-based semantic assessments to catch failures before deployment.

Summary

What: A comprehensive framework called the "AI Evaluation Stack" that separates LLM testing into deterministic assertions (checking syntax, schema, routing) and model-based evaluations (semantic quality using "LLM-as-a-Judge"), with both offline pre-deployment pipelines using curated test datasets and online production monitoring that feeds back into continuous improvement.
Why it matters: Traditional unit testing breaks down for LLMs because the same prompt produces different outputs each time, making it impossible to rely on deterministic pass/fail checks alone. Enterprise AI systems face compliance risks from hallucinations and failures, requiring structured evaluation infrastructure instead of informal "vibe checks" that pass in development but fail when customers use the product.
Takeaway: Implement a two-pipeline evaluation system: build an offline regression suite with 200-500 "golden" test cases requiring 95%+ pass rates before deployment, then monitor production with both explicit feedback (thumbs up/down) and implicit signals (retry rates, refusal patterns) to continuously update your test dataset.

Deep Dive

  • Layer 1 deterministic assertions act as fail-fast gates that use traditional code and regex to validate structural integrity before expensive semantic checks run, catching issues like malformed JSON schemas, incorrect tool calls, or missing required arguments with instant binary pass/fail results
  • Layer 2 model-based assertions use "LLM-as-a-Judge" architecture to evaluate semantic quality like helpfulness or tone, requiring three critical inputs: a frontier reasoning model superior to the production model, a strict scoring rubric with explicitly defined gradients (not vague "rate this" prompts), and human-vetted golden outputs as ground truth
  • Offline pipelines gate pre-deployment with golden datasets of 200-500 test cases representing real-world traffic distributions including edge cases and adversarial inputs, integrated as blocking CI/CD steps with 95%+ pass rates required for enterprise (99%+ for high-risk domains)
  • Composite scoring systems weight deterministic and semantic checks differently, such as allocating 6 points to structural validity (correct tool, valid JSON, schema compliance) and 4 points to semantic quality (subject line accuracy, hallucination-free content), with short-circuit logic that fails the entire test instantly if any deterministic check fails
  • Any system modification requires full regression testing because LLM non-determinism means fixes for one edge case can cause unforeseen degradations elsewhere, making continuous re-evaluation against the entire golden dataset mandatory
  • Online pipelines monitor four telemetry categories post-deployment: explicit user signals (thumbs up/down, written feedback), implicit behavioral signals (regeneration/retry rates, apology detection, refusal rates), synchronous deterministic asserts on 100% of traffic, and asynchronous LLM-Judge sampling ~5% of sessions
  • Production LLM-Judges must run asynchronously rather than on the critical path to avoid doubling latency and compute costs, sampling a small fraction of daily sessions to generate continuous quality dashboards while respecting data privacy agreements
  • The feedback flywheel prevents dataset rot by capturing production failures (negative signals or behavioral flags), triaging them for human review, conducting root-cause analysis, appending corrected cases to the golden dataset with synthetic variations, and continuously re-evaluating the model against newly discovered edge cases
  • Synthetic data generation accelerates dataset curation but introduces contamination and bias risks, requiring mandatory human-in-the-loop review where domain experts validate AI-generated test cases before committing them to the repository
  • Static golden datasets suffer from concept drift as user behavior evolves and customers discover novel use cases not covered in original evaluations, creating a dangerous illusion of high offline pass rates masking degrading real-world experiences
  • Apology rate and refusal rate patterns reveal silent failures: programmatically scanning for phrases like "I'm sorry" detects degraded capabilities or broken tool routing, while artificially high refusal rates indicate over-calibrated safety filters rejecting benign queries
  • The architecture redefines "done" for AI features as requiring not just coherent responses but rigorous automated evaluation pipelines that pass against both curated golden datasets and continuously discovered production edge cases

Decoder

  • LLM-as-a-Judge: Using a large language model to evaluate the output quality of another LLM, serving as a scalable proxy for human judgment when assessing semantic qualities like helpfulness or tone that can't be captured with traditional code assertions
  • Golden Dataset: A version-controlled repository of 200-500 human-reviewed test cases pairing exact input prompts with expected "golden outputs" (ground truth), representing the AI system's full operational envelope including edge cases and adversarial inputs
  • Stochastic: Non-deterministic behavior where the same input produces different outputs, breaking traditional unit testing assumptions that Input A plus Function B always equals Output C
  • Concept drift: The degradation of model performance over time as real-world user behavior and use cases evolve beyond what was covered in static training or evaluation datasets
  • Short-circuit evaluation: Fail-fast logic that immediately terminates testing and returns a failure result when a critical condition isn't met, preventing wasteful execution of expensive downstream checks
  • Tool call: When an LLM invokes a specific function or API with structured arguments rather than generating conversational text, typically requiring exact JSON schema compliance
  • HITL (Human-in-the-Loop): Architecture requiring human review and validation at critical stages, such as verifying AI-generated test cases before adding them to the evaluation dataset

Original Article

Monitoring LLM behavior necessitates adopting the AI Evaluation Stack, separating tests into deterministic assertions (syntax and routing integrity) and model-based evaluations (semantic quality). Engineers use offline pipelines for pre-deployment regression testing with human-reviewed "Golden Datasets" while online pipelines monitor real-world performance for drift and failures. A continuous feedback loop from production telemetry ensures AI systems adapt, maintaining high performance as user behavior evolves.
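
As a concrete sketch of the two layers: deterministic checks run first and short-circuit the test, and a judge model scores semantics only for outputs that survive them. The function names, point weights, and judge stub below are illustrative assumptions patterned on the composite-scoring example above, not any specific vendor's API:

// Illustrative two-layer evaluation; names and weights are assumptions.

// Layer 1: deterministic assertions. Cheap, binary, run first.
function deterministicChecks(output, expectedTool) {
  let parsed;
  try { parsed = JSON.parse(output); } catch { return { pass: false, reason: "malformed JSON" }; }
  if (parsed.tool !== expectedTool) return { pass: false, reason: "wrong tool call" };
  if (!parsed.arguments) return { pass: false, reason: "missing required arguments" };
  return { pass: true, points: 6 }; // structural validity: 6 of 10 points
}

// Layer 2: model-based assertion. Expensive, runs only if layer 1 passed.
async function semanticChecks(output, goldenOutput, rubric) {
  return { points: await callJudgeModel({ output, goldenOutput, rubric }) }; // scores 0..4
}

async function evaluateCase(testCase) {
  const layer1 = deterministicChecks(testCase.output, testCase.expectedTool);
  if (!layer1.pass) return { score: 0, failed: layer1.reason }; // short-circuit
  const layer2 = await semanticChecks(testCase.output, testCase.golden, testCase.rubric);
  return { score: layer1.points + layer2.points, failed: null }; // max 10
}

// Implicit production signal: programmatically flag sessions where the model
// apologizes, one of the "silent failure" scans described above.
const looksLikeSilentFailure = (reply) => /\bI'?m sorry\b|\bI apologi[sz]e\b/i.test(reply);

async function callJudgeModel(_inputs) { return 4; } // stand-in for a frontier judge model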

AI · infrastructure · agents

The World Can't Keep Up With AI Labs

Coding agents are generating real revenue at unprecedented growth rates, but compute infrastructure can't scale fast enough to meet demand.

Summary

What: An analysis arguing that AI coding agents represent the first AI product with sustained commercial traction, with Anthropic's revenue tripling in early 2026 and Claude accounting for growing percentages of GitHub commits, but physical infrastructure bottlenecks in memory, energy, and semiconductor manufacturing are constraining supply.
Why it matters: This supply-demand mismatch will likely force AI labs to raise prices and impose stricter usage limits, fundamentally changing the economics of AI tool access for developers who have grown accustomed to relatively affordable subscriptions.
Takeaway: Diversify across multiple AI providers rather than depending on a single service, and find ways to monetize AI productivity gains to justify potentially much higher subscription costs.

Deep Dive

  • Anthropic's revenue is growing 3x year-to-date in 2026, faster than historical comparisons like Zoom during the pandemic or Google at IPO, despite being a much larger company where growth typically slows
  • Claude's share of GitHub commits doubled from 2% to 4% in January 2026 alone, with projections to reach 20%+ by year-end, indicating real workflow integration beyond hype
  • A $100/month coding agent subscription delivers 10-30x ROI for median developers earning $350-500/day if it automates even 10% of routine work
  • AI labs face a structural cash flow gap where they must simultaneously fund current inference costs and invest heavily in next-generation models that won't generate revenue for 1-2 years
  • Anthropic needs to grow compute capacity from 2.5 gigawatts to 5-6 gigawatts by end of 2026, but long-term contracts in the supply chain make rapid scaling nearly impossible
  • Three major bottlenecks constrain growth: HBM memory (30% of infrastructure costs, controlled by SK Hynix and Samsung), datacenter energy (grids can't deliver power fast enough), and semiconductor fab capacity (limited by TSMC factories and ASML lithography machines)
  • ASML produces only ~50 EUV lithography machines per year at $350M each, and Nvidia has locked up 70% of TSMC's 3-nanometer production capacity through advance contracts
  • Hyperscalers (Google, Amazon, Microsoft, Meta) are spending $105-200 billion annually on infrastructure, creating an $80 billion capital expenditure requirement to support $30 billion in AI lab revenue
  • The supply chain's reliance on long-term contracts to manage bubble risk means the entire value chain cannot react quickly to unexpected demand surges like the coding agent boom
  • Energy constraints can be temporarily addressed with industrial gas turbines and generators, but semiconductor and skilled labor shortages (especially electricians) cannot be solved by throwing money at the problem
  • AI labs will likely respond by cutting usage limits, implementing time-based pricing tiers, and raising subscription prices potentially to $1,000+ for power users where the ROI still justifies the cost

Decoder

  • HBM (High Bandwidth Memory): Expensive memory technology used in GPUs that provides much faster data transfer than standard RAM, reducing GPU idle time during processing
  • CoWoS (Chip-on-Wafer-on-Substrate): Advanced packaging technology used in final chip-to-module assembly that became a bottleneck in 2023
  • EUV scanners: Extreme ultraviolet lithography machines made exclusively by ASML that etch circuits onto silicon wafers for modern chips, costing ~$350M each
  • Hyperscalers: Major cloud infrastructure providers (Google, Amazon, Microsoft, Meta) that build massive datacenters and rent compute capacity
  • Inference: Running a trained AI model to generate outputs, as opposed to training the model initially (inference is what users pay for when using ChatGPT or Claude)
  • 3-nanometer process: Current generation semiconductor manufacturing technology that determines how small and efficient chip transistors can be made

Original Article

The World Can't Keep Up With AI Labs

Late last year a new AI psychosis kicked off. This time it was coding agents.

People started saying this is a new era in programming, blah blah blah.

Karpathy tweet, late winter

A few months later, we've got more than just claims. We've got numbers. And they say something unusual is happening in the market.

Coding agents are the first AI product people are paying for at volume and on a recurring basis, because they directly speed up their work. It's too early to claim businesses are replacing whole processes with agents across the board. But compute demand has started growing faster than anyone can build it out.

Here's why this moment is different, why nobody's ready, and what I took from it personally.

The Numbers

OpenAI and Anthropic might go for an IPO soon. That's why they're eagerly posting how fast their revenue is growing.

And it's a ton of money.

Anthropic is up 3x since the start of the year. And they're already a big company. This is impressive, because the bigger you are, the harder it is to keep growing at the same pace.

OpenAI on the left, Anthropic on the right.

Even during past boom moments, nobody hit numbers like these (with a caveat, see below). Zoom during the pandemic, Google at IPO, Coinbase cashing in on commissions during the crypto hype. These are companies 5-10x smaller than Anthropic, in special situations, and they still grew slower!

The best growth years for big companies. Only ones that were already large. Revenue measured at start vs end of year.

The caveat. First, vaccine makers during the pandemic were also up there. Second, Anthropic's numbers are a projection for the rest of the year based on early data. And they count things a bit differently than OpenAI. None of that changes my conclusion, which is...

Cash is a solid tell for real demand for agentic systems.

Last year when a bunch of people suddenly figured out ChatGPT could generate cool images, that didn't translate into serious money.

Meanwhile, in January alone, Claude Code commits on GitHub (in publicly accessible repos) went from 2% to 4%. If that sounds small, keep in mind it's one month, and that's without Codex, Copilot, or Devin. By end of year Dylan Patel forecasts Claude hitting 20%+.

Claude commits on GitHub.

Even if a $100 subscription only automates a small slice of the work, that's nothing compared to a developer's salary. For a median developer at $350-500 a day, the subscription has 10-30x ROI if it handles just the simplest, most routine 10% of their work.
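
The arithmetic behind that range, with the working-day count and automation share as explicit assumptions:

// Back-of-the-envelope ROI; the day count and automation share are assumptions.
const dayRate = 500;          // top of the $350-500/day range
const workDays = 21;          // working days per month (assumption)
const automatedShare = 0.10;  // "the simplest, most routine 10%"
const subscription = 100;     // $/month

const monthlyValue = dayRate * workDays * automatedShare; // 500 * 21 * 0.10 = $1,050
console.log((monthlyValue / subscription).toFixed(1) + "x"); // ~10.5x at the low end;
// larger automated shares are what push the multiple toward 30x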

There's plenty to argue with here.

Let me even lay out the weak spots in my own logic.

So their revenue is growing, fine—the labs are still unprofitable as businesses. They have every incentive to pump the hype to pull in the most risk-tolerant companies. The ones paying are early enthusiasts, not big companies. And enthusiasts come and go. Plenty of bubbles have popped exactly this way.

Agents are unstable and still randomly screw up. Who's to blame when things go wrong? You can't replace humans yet, because serious businesses care about reliability. And where do senior engineers come from without juniors if you stop hiring?

Agents only handle a narrow set of tasks well. Even if writing code is faster, shipping a product still gets bottlenecked by gathering requirements, architecture, review, testing, and our beloved stakeholder zoomcalls and compliance.

I decided at some point you have to commit and pick a side, even without conclusive evidence.

The finish line can be moved forever. There was a time when reasoning was completely out of reach for ML models. Same for decent image generation, or speech that didn't sound like a robot. There was a time nobody believed machines would learn to play Go. You get the idea.

Metaphor from Tegmark's Life 3.0. Computers gradually learn harder and harder tasks. Over time there's less and less they can't do. Like water filling a map from the bottom up.

Ilya Sutskever, back when he was still at OpenAI, often mentioned an internal meme—Feel the AGI.

He was one of the first to believe deep learning would gradually change our lives. Yes, there's a lot we don't know, but everything keeps moving in that direction, and that matters. Everyone gets it at their own moment. When a neural net does something you usually do yourself, manually, that's a special feeling.

I've lost count of how many of those moments I've had in 10 years of following neural nets. So I'm not interested in the bubble-or-not debate anymore. I'm interested in watching the water level rise.

Personally, I have enough evidence that agents can now do valuable work that companies are willing to pay for.

And the thing is, demand has plenty of room to grow. Agents often don't work out of the box. You have to adapt to them, and the fastest and most curious people do that best. Everyone else will catch up bit by bit.

The Industry Isn't Ready For This

To avoid talking about "the industry" in the abstract, let me split it into 3 layers.

  • AI labs make models. OpenAI, Anthropic, DeepMind.
  • Hyperscalers build datacenters. Google, Amazon, Microsoft, Meta.
  • Chipmakers make chips. Nvidia, TSMC, ASML.

And at every layer, companies are scared.

People online love talking about bubbles. Turns out, all these companies are well aware bubbles happen. And to avoid going bankrupt, each one is cooking up its own workaround.

Dario Amodei says he builds the company's plans off a pessimistic revenue scenario. Funny thing is, this year they're already beating that by 1.5x. And only 3 months of the year have gone by. They're beating the optimistic scenario too.

Dwarkesh asked him straight up in an interview: why? Dario genuinely believes in massive future upside from AI. He writes long essays about it, pitches a country of geniuses in a datacenter. And yet he doesn't want to bet everything on that future.

Dario says it's risky because of a cash flow gap in the business model.

Here's how it works. They provide neural nets to users. They pay hardware owners for inference and make money from subscriptions and APIs. In parallel, they pour money into research on the next-generation model, which won't start making money for another year or two.
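
A toy model of that gap, with invented numbers: research spend lands now, while the revenue it produces lands two years later.

# Toy cash-flow model; every number here is invented for illustration.
# Research spent in year t only starts earning in year t+2.
research = [2, 6, 18, 54, 162]   # $B poured into the next-gen model each year
revenue  = [1, 3, 9, 27, 81]     # $B earned from models trained two years ago
for t in range(5):
    print(f"year {t}: spend {research[t]}B, earn {revenue[t]}B, "
          f"gap {research[t] - revenue[t]}B")
# The faster you grow, the bigger the absolute gap you must finance up front.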

They regularly spend more than half of revenue on research.

You're not just balancing income and expenses—you're also balancing investment in future growth. If you invest big and the growth doesn't show up, you're in serious trouble.

Anthropic has been running in this mode for three years straight, growing 10x every year. Dario figured 2026 would be when it ends, because the bigger you are, the harder it gets. You're gonna slow down at some point.

What he didn't mention in the interview is that their margins are growing slower than forecast: costs are rising several times faster than they'd planned.

Dario says he wants to push the company into profitability in a few years. To do that they need to improve margins. That means slowing growth and investing conservatively, only in the most efficient things.

The logic adds up. But slowing down isn't really working. They look ready to 10x again this year. But the resources to support that aren't there.

Anthropic doesn't have enough compute for this many power users.

They rent GPUs from hyperscalers. And they can't just walk into a datacenter and ask for more, because the datacenter owner is also exposed to bubble risk, so capacity is booked out in advance.

For Anthropic to make $30B a year, someone had to spend $80B on infrastructure. Betting it would pay off in a few years.

Amazon will spend around $200B this year, Google $180B, Meta $125B, Microsoft $105B. That's a setup for trillions in economic value in the coming years.

And a cash flow gap risk if the value doesn't materialize.

The industry is one long value chain. Everyone in it tries to lower their own risk by locking expectations into contracts. Which reduces the whole chain's ability to react to surprises. Like the sudden arrival of coding agents.

So every year labs hit some new bottleneck. And constraints keep sliding further upstream, toward players further from the end user, because their risks are higher and their contracts are even less flexible.

A New Bottleneck Every Year

In 2023 everyone was chasing GPUs. More specifically, TSMC factories didn't have enough capacity for the final chip-to-module assembly (CoWoS). In 2024 came the HBM memory shortage for those same modules. In 2025 GPUs got better, but datacenter buildout became limited by power supply. In 2026 it turned out that even when the power exists, the US grid can't deliver it to datacenters at the volume needed.

1 - Memory

Modern models need more memory than before. I mentioned earlier that companies spend hundreds of billions a year on infrastructure. Roughly 30% of that goes to memory.

And they have to buy expensive HBM instead of cheap DDR, because the higher bandwidth keeps the GPU from sitting idle while it waits on memory.

Turns out memory is the most expensive thing in a GPU. Not counting the markup =)

Memory prices are probably going to keep rising unless someone figures out how to work around it. They could easily go up another 2-3x, because SK Hynix and Samsung control 90% of the market. And memory demand is only growing.

2 - Energy and Datacenters

xAI proved datacenters can be built pretty fast.

But they eat power like a small city. And when one suddenly shows up in a region within six months, the local electricity grid just can't handle it.

Surprisingly, Dylan Patel isn't that worried about energy. New power plants, transformer stations, and plain old transmission towers take a long time to build. But while the grid catches up to the new load, you can power datacenters off industrial gas turbines. Literally roll up to the datacenter with a dozen trailers full of generators and you're good (though people are starting to worry that this is far from clean energy).

There are also piston engines, solar with batteries, hydrogen reactors, marine ship engines… Basically, every trick the fuel industry has invented in its entire history. Together with more efficient grid usage, that can add up to hundreds of gigawatts.

Right now GPUs alone consume 13GW. Add the rest of the datacenter and you can multiply by 2.

The blocker for building datacenters and reactors fast is a shortage of skilled labor, especially electricians.

So, expensive and labor-intensive. But turns out it's still easier than the semiconductor supply chain.

3 - Semiconductors

There are fabs (mostly TSMC's) that manufacture chips of a specific generation, to designs from Nvidia or Google. For example, on the 3-nanometer process.

And there just aren't enough factories built.

This can't be fixed quickly because these are some of the most complex industrial facilities on the planet. Building one takes 2-3 years and a pile of specialized equipment and chemistry.

The hardest piece is the lithography machines (EUV scanners). They're what prints the circuit patterns onto wafers. The wafers then get paired with memory into modules, and that's how you get a GPU.

These machines cost ~$350M each. Only one company makes them, ASML in the Netherlands, at a rate of around 50 machines a year.

The machine.

By a rough estimate, by 2030 there will be around 700 of them worldwide. That's on the order of 200 gigawatts of compute. And at the end of 2025 we were using ~27 gigawatts. Note that that's before the agent hype of early 2026.
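
The implied arithmetic, using the numbers above:

# Rough arithmetic on the EUV estimates quoted above.
machines_2030 = 700     # projected EUV scanners worldwide by 2030
compute_2030_gw = 200   # order-of-magnitude compute they could support
usage_2025_gw = 27      # compute in use at the end of 2025
print(compute_2030_gw / machines_2030)  # ~0.29 GW of compute per machine
print(compute_2030_gw / usage_2025_gw)  # ~7.4x headroom over late-2025 usage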

So there's room to grow, but the shortage will be permanent—bottlenecked by factory construction, wafers, and lithography machines.

These are the kinds of constraints you can't just throw money at, unlike memory and datacenter energy.

You can see it clearly in Google's behavior.

They have their own chip designs. And they still buy a quarter of their capacity from Nvidia. They'd love to use only their own; they just can't.

The share is dropping, but it's still a lot, considering their own chips are better!

All of these chips are fabricated at TSMC factories to their designers' specs. And Google and Amazon (who also have their own designs) slept through the moment when Jensen Huang locked in contracts for 70% of 3-nm capacity. That's great for TSMC: they're at the end of the production chain and need stability.

Nvidia is also living the dream, selling cards at 6x production cost.

And Google even sold its own capacity to Anthropic through GCP. What a company.

So What?

So, the industry isn't ready for the agent boom.

It came on too suddenly, to a market where what ultimately matters is long-term contracts on complex chip-making infrastructure.

Anthropic right now has 2.5 gigawatts of compute, and by the end of the year they need 5-6. The only way to get that much is the "Other" category: CoreWeave, Bedrock, Vertex, Foundry. Scraps from anyone whose capacity is still available, at premium prices.

And they want to become a profitable company, so they can't afford to burn cash.

Hence the bad news.

The ones who'll probably suffer are us.

The most obvious move is for them to just cut limits and raise prices.

The other week they moved OpenClaw onto the API. And they announced it in a nice, honest way: sorry guys, we're tightening belts, here's $20 as an apology for the inconvenience.

They also rolled out different tiers depending on time of day. I've already run into it a couple of times: Claude simply ran out of capacity during "off-peak" hours, under pressure from everyone optimizing for discounted tokens.

Denied.

I pulled two takeaways from this for myself.

1 - Don't put all your eggs in one basket.

For example, when building a skill, make it work on any model. I'm obsessed with Claude, but OpenAI and Google are in way better shape on compute access.

So I've learned to swap models depending on the task. I pay the minimum subscription to every lab. And when the limit runs out, I just switch models.
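
Here's that habit as a minimal sketch. The provider calls are hypothetical placeholders for whatever client code you use, not any real SDK:

# Hypothetical fallback chain across labs. call_claude, call_gpt, call_gemini
# are placeholders, not real SDK functions.
class OutOfCapacity(Exception):
    pass

def ask(prompt, providers):
    for name, call in providers:
        try:
            return name, call(prompt)
        except OutOfCapacity:
            continue  # limit hit, fall through to the next lab
    raise RuntimeError("every provider is out of capacity")

# providers = [("claude", call_claude), ("gpt", call_gpt), ("gemini", call_gemini)]
# name, answer = ask("refactor this function", providers)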

I'm not using Chinese open-source models. Don't use DeepSeek, for the love of god.

2 - Get anxious about not making money off AI.

Neural nets aren't a way for me to make more money. They're on my expense sheet, and they pay for themselves by giving me more options and more time.

But if they roll out some $1000 tier, I won't be able to pull that off. Right now that sounds absurd. But remember the example with a real person's salary. As long as $1000 of spend brings in $5000 of profit, you're winning.

And whoever can't pull that off will be stuck on the free tier watching ads.

AI vision, generative

Vision Banana Generalist Model

Researchers demonstrate that image generation models can serve as generalist vision systems by reframing perception tasks like segmentation and depth estimation as image generation problems.

Summary

What: Vision Banana is a generalist vision model created by instruction-tuning Google's Nano Banana Pro image generator on vision task data, treating all outputs (segmentation masks, depth maps, etc.) as RGB images to generate rather than separate prediction tasks.
Why it matters: This suggests a potential paradigm shift for computer vision similar to what happened with large language models—generative pretraining on images may be the key to building foundational vision models that excel at both creation and understanding, rather than training specialized models for each task.
Takeaway: Monitor the project page for model weights and implementation details if you're working on vision tasks that could benefit from a unified generalist approach instead of maintaining multiple specialized models.

Deep Dive

  • The research challenges the traditional computer vision paradigm where separate models are trained for different tasks like segmentation, depth estimation, and object detection
  • Vision Banana achieves state-of-the-art results by converting vision tasks into image generation problems—outputting segmentation masks and depth maps as generated RGB images (see the sketch after this list)
  • The model beats or matches specialized systems including Segment Anything Model 3 for segmentation and Depth Anything for metric depth estimation, despite being a generalist
  • Built through lightweight instruction-tuning of Nano Banana Pro on a mixture of original image generation data plus a small amount of vision task data
  • The key insight mirrors the LLM revolution: just as language generation pretraining gave models emergent understanding capabilities, image generation pretraining provides powerful general visual representations
  • The instruction-tuning approach preserves the base model's image generation capabilities while adding perception abilities
  • Works across both 2D and 3D vision understanding tasks, demonstrating true generalist capabilities
  • The unified interface of image generation for all vision tasks parallels how text generation became the universal interface for language understanding and reasoning
  • Results suggest that the ability to generate visual content inherently requires understanding visual content, validating a long-standing conjecture in computer vision
  • The paper proposes that generative vision pretraining should take a central role in building foundational vision models going forward
  • This approach eliminates the need for task-specific architectures and output layers that have dominated computer vision for decades
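
To make the reframing concrete, a minimal sketch under assumed names; generate_image is a stand-in for any instruction-tuned image generator, not the paper's actual API:

# Perception as image generation: every vision task returns an RGB image
# that encodes the answer. model.generate_image() is a hypothetical stand-in.
def segment(model, image):
    # The mask comes back encoded in the pixels of a generated RGB image.
    return model.generate_image(image=image,
                                instruction="render the segmentation mask")

def estimate_depth(model, image):
    # Depth likewise: decode it from a fixed colormap afterwards.
    return model.generate_image(image=image,
                                instruction="render the metric depth map")

# One generalist model, one output space (RGB), many perception tasks.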

Decoder

  • Instruction-tuning: Training a pretrained model on task-specific examples with instructions, similar to fine-tuning but focused on teaching the model to follow diverse commands
  • Zero-shot: A model's ability to perform tasks it wasn't explicitly trained on, by generalizing from its pretraining
  • SOTA: State-of-the-art, the best currently available performance on a benchmark
  • SAM (Segment Anything Model): Meta's specialized model for image segmentation that can identify and mask objects
  • Metric depth estimation: Predicting actual distance measurements from the camera to objects in a scene, not just relative depth ordering
  • Nano Banana Pro (NBP): Google's image generation model that serves as the base for Vision Banana (likely part of the Banana model family)

Original Article

Image Generators are Generalist Vision Learners

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.

AI agents, infrastructure

Stash (GitHub Repo)

Stash is an open-source tool that gives AI agents persistent memory across sessions, solving the problem of LLMs starting every conversation from scratch.

Summary

What: Stash is a self-hosted memory layer for AI agents that uses an 8-stage consolidation pipeline to transform raw observations into structured knowledge including facts, relationships, causal links, goal tracking, and failure patterns, compatible with any MCP-enabled agent.
Why it matters: Addresses a fundamental limitation of LLMs by enabling agents to build up knowledge over time rather than resetting with each interaction, potentially making them progressively more useful as they accumulate experience and learn from past sessions.
Takeaway: Clone the repo and run docker compose up to add persistent memory to any MCP-compatible AI agent like Claude Desktop, Cursor, or Cline.

Deep Dive

  • Uses Postgres with pgvector as the underlying storage engine for vector-based memory retrieval
  • Implements an 8-stage consolidation pipeline that progressively refines raw observations: episodes → facts → relationships → patterns → wisdom
  • Each consolidation stage only processes new data since the last run, making it efficient for continuous operation
  • Includes advanced features like causal link analysis, goal tracking, failure pattern recognition, hypothesis verification, and confidence decay over time
  • Runs as an MCP server with background consolidation, meaning the memory processing happens automatically without blocking agent interactions
  • Works with a broad range of AI platforms including Claude Desktop, Cursor, Windsurf, Cline, Continue, OpenAI Agents, Ollama, and OpenRouter
  • Self-hosted design means you maintain control over your agent's memory data rather than relying on third-party services
  • Apache 2.0 licensed, allowing commercial use and modification

Decoder

  • MCP: Model Context Protocol, a standard interface that allows AI agents to connect to external tools and data sources
  • pgvector: PostgreSQL extension for storing and querying vector embeddings, enabling semantic similarity search
  • Consolidation pipeline: Multi-stage process that transforms raw data into increasingly abstract and useful knowledge structures

Original Article

Stash

Your AI has amnesia. We fixed it.

Every LLM starts every conversation from zero. Stash gives your agent persistent memory — it remembers, recalls, consolidates, and learns across sessions. No more explaining yourself from scratch.

Open source. Self-hosted. Works with any MCP-compatible agent.

Quick Start

git clone https://github.com/alash3al/stash.git
cd stash
cp .env.example .env   # edit with your API key + model
docker compose up

That's it. Postgres + pgvector, migrations, MCP server with background consolidation — all in one command.

What It Does

Stash is a cognitive layer between your AI agent and the world. Episodes become facts. Facts become relationships. Relationships become patterns. Patterns become wisdom.

An 8-stage consolidation pipeline turns raw observations into structured knowledge — facts, relationships, causal links, goal tracking, failure patterns, hypothesis verification, and confidence decay. Each stage only processes new data since the last run.
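
A minimal sketch of the "only new data since the last run" pattern that implies, assuming a per-stage watermark table; the schema names here are guesses, not Stash's actual code:

# Incremental consolidation via a per-stage watermark. Table and column
# names are assumptions for illustration, not Stash's real schema.
import psycopg

def consolidate_stage(conn, stage):
    with conn.cursor() as cur:
        cur.execute("SELECT last_seen FROM watermarks WHERE stage = %s", (stage,))
        (last_seen,) = cur.fetchone()
        cur.execute("SELECT id, body FROM episodes WHERE created_at > %s",
                    (last_seen,))
        for _id, body in cur.fetchall():
            ...  # distill this episode into facts, relationships, patterns
        cur.execute("UPDATE watermarks SET last_seen = now() WHERE stage = %s",
                    (stage,))
    conn.commit()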

Works with Claude Desktop, Cursor, Windsurf, Cline, Continue, OpenAI Agents, Ollama, OpenRouter — anything MCP.

Learn More

alash3al.github.io/stash →

License

Apache 2.0

AI video, edge, multimodal

Efficient Video Intelligence in 2026

A comprehensive technical review of efficient video intelligence in 2026 covers universal vision encoders, adaptive compression for hour-long videos, and on-device tracking at mobile-phone speeds.

Summary

What: A detailed technical overview of the current state of efficient video understanding systems, covering advances in compact universal vision encoders (EUPE), long-form video compression techniques (LongVU, Tempo), on-device segmentation and tracking that runs at 16 FPS on smartphones (EdgeTAM), VLM-based depth estimation, and deployment strategies across cloud, edge, and on-device tiers.
Why it matters: Video understanding has moved from short-clip recognition to hour-long reasoning with models small enough to run on phones, but the shift required rethinking every layer of the stack—from consolidating specialist encoders into single universal models to adaptive token allocation that handles millions of visual tokens; the architectural patterns that have stabilized (temporal compression, factorized attention, multi-teacher distillation) now define production video AI, while streaming understanding and sub-watt AR deployment remain unsolved.

Deep Dive

  • Video understanding has evolved from short-clip action recognition to hour-long reasoning with models small enough for mobile deployment, driven by rethinking compression and taking deployment constraints seriously from the start
  • Universal vision encoders like EUPE consolidate what used to require separate models for segmentation, depth, classification, and language alignment into a single <100M parameter backbone through multi-teacher distillation via an intermediate proxy teacher
  • The proxy-teacher step matters because direct multi-teacher distillation into small students loses signal when teachers disagree at the feature level; the proxy resolves conflicts first then transfers a coherent feature space
  • Token volume is video's fundamental challenge: an hour of 30 FPS video at 224×224 produces 21 million visual tokens before compression, far exceeding any LLM context window
  • Long-form video systems like LongVU use four-stage compression: DINOv2-based temporal redundancy removal (drops ~54% of frames), feature fusion with SigLIP, cross-modal query selection (allocates tokens based on question relevance), and spatial token compression within sliding windows
  • Tempo pushes adaptive allocation further with a small VLM acting as query-aware compressor that routes 0.5 to 16 tokens per frame based on relevance, beating GPT-4o and Gemini 1.5 Pro on hour-plus videos at 8K token budgets
  • On-device foundation-model tracking became viable in 2024-2025 through aggressive compression: EdgeTAM achieves 16 FPS on iPhone 15 Pro Max with a 2D Spatial Perceiver that compresses per-frame memory while preserving spatial structure
  • Most per-frame computation in video tracking is redundant across adjacent frames, so memory-efficient propagation drives production gains more than raw model size reduction
  • DepthLM demonstrates that a 3B-parameter VLM with no architecture changes can match dedicated depth specialists through visual prompting (rendering markers on images), intrinsic-conditioned augmentation (unifying focal length), and training on just one labeled pixel per image
  • The depth landscape has split into four approaches: dedicated specialists (DepthAnything trained on 62M+ images), diffusion priors (Marigold with strong zero-shot), reconstruction models (VGGT predicting 3D structure jointly), and VLM-based methods that collapse depth and reasoning into one model
  • VideoAuto-R1 shows that explicit reasoning often doesn't help for perception-oriented video questions; gating chain-of-thought on confidence reduces average response length 3.3× while maintaining or improving accuracy
  • Audio-visual fusion has three architectural paths: encoder stitching (cheap, shallow alignment), native multimodal training (Qwen3-Omni, shares weights across modalities), and benchmark-driven evaluation (EgoAVU shows egocentric audio carries distinct signals from third-person video)
  • Video deployment splits into cloud (frontier models, high latency/cost), edge servers (mid-size 3-30B models, eliminates cloud latency), and on-device (zero latency, fully private, tight power budget), with hybrid architectures as the production default
  • Quantization recipes have stabilized: W4A16 is default for edge VLMs, NVFP4 unlocks Blackwell-tier hardware, and KV cache quantization matters more for video than text because the cache can dominate memory on long inputs
  • ExecuTorch reached production maturity in October 2025 and now powers Meta's on-device AI stack across Instagram, WhatsApp, Quest 3, and Ray-Ban Meta with backends for Apple, Qualcomm, Arm, MediaTek, and Vulkan
  • Streaming understanding remains unsolved: current techniques like LongVU assume batch mode where the whole video is available upfront, but continuous-stream mode where video keeps arriving requires memory mechanisms and incremental compression that aren't yet production-ready
  • Sub-watt inference for AR glasses is 5-10× away in compute efficiency: today's mobile NPUs do tens of TOPS in tens of milliwatts, but always-on video understanding in a 1-3W envelope that includes all other system processes remains out of reach
  • Sparse-event detection (finding three frames out of 86,400 that matter without full inference on all frames) requires hierarchical attention or learned selection; schema-driven extraction over known classes ships commercially, but open-set anomaly detection is unsolved
  • Cross-camera reasoning and spatial grounding stability across cuts remain open problems; retrieval over indexed videos works via ANN over embeddings, but joint reasoning across streams and maintaining object identity across cameras and time windows is not yet solved
  • The stable architectural patterns are: compress on the temporal axis where redundancy is highest, distill universal encoders from multiple teachers, factorize attention along data structure (spatial within frames, temporal across), treat quantization as default, and gate reasoning on confidence

Decoder

  • EUPE (Efficient Universal Perception Encoder): A compact vision encoder under 100M parameters that matches domain specialists across image classification, segmentation, depth, and vision-language tasks by distilling from multiple teacher models (DINOv2, SAM, CLIP) through an intermediate proxy teacher
  • DINOv2/DINOv3: Self-supervised vision transformers that excel at dense prediction tasks (segmentation, depth, correspondence) by preserving fine-grained spatial structure
  • SAM (Segment Anything Model): Foundation model for prompt-driven segmentation; SAM 2 extends to video with memory modules
  • LongVU: Long-form video understanding system that uses adaptive token compression (DINOv2 for temporal pruning, cross-modal query selection) to handle hour-long videos efficiently
  • Tempo: Query-aware video compressor that routes token budget per-segment based on relevance, achieving strong performance on hour-plus videos at constrained budgets
  • EdgeTAM: Efficient tracking model that achieves foundation-model-grade video object tracking at 16 FPS on iPhone 15 Pro Max through aggressive memory compression
  • DepthLM: Vision-language model that performs metric depth estimation without specialized architecture, using visual prompting and intrinsic-conditioned augmentation
  • VideoAuto-R1: Video reasoning system that gates explicit chain-of-thought reasoning on confidence, activating detailed reasoning only when the initial answer is uncertain
  • EgoAVU: Egocentric audio-visual understanding benchmark and dataset for first-person video where audio carries distinct signals (hand-object contact, wearer's voice)
  • J&F scores: Jaccard (region similarity) and F-measure (contour accuracy) metrics for video object segmentation
  • W4A16 / W8A8: Quantization schemes with 4-bit or 8-bit weights and 16-bit or 8-bit activations, standard for deploying models on edge devices
  • ExecuTorch: Meta's PyTorch runtime for on-device deployment, reached 1.0 in October 2025, supports streaming inference across mobile and AR platforms
  • KV cache: Cached key-value pairs in transformer attention that can dominate memory for long sequences; aggressive quantization (3-4 bits) is critical for video

Original Article

Efficient Video Intelligence in 2026

Five years ago, video understanding mostly meant action recognition on Kinetics-400 or short-clip captioning on MSR-VTT. Today, vision-language models reason about hour-long footage, on-device tracking segments any object at 16 FPS on a phone, and a single 100M-parameter encoder can match domain experts across image understanding, dense prediction, and VLM tasks. The shift came from rethinking what a video model needs to do, and from taking deployment constraints seriously.

This post walks through where efficient video intelligence stands in April 2026, following how a video system processes its input from raw frames through spatial perception, long-form temporal understanding, multimodal fusion and reasoning, and the deployment stack that makes any of it shippable.

A note up front: the post leans heavily on research from my own group, including EUPE, the EfficientSAM / Efficient Track Anything / EdgeTAM compression line, LongVU, Tempo, EgoAVU, VideoAuto-R1, DepthLM, and ParetoQ. I have tried to place each piece against the parallel and competing work in its section, but this is a perspective from inside one research program rather than a neutral survey.

Why Video Is Harder Than Text or Images

Token volume. A single minute of 30 FPS video at 224x224 resolution and ViT-B/16 patches produces 1,800 frames times 196 patches per frame, or 352K visual tokens before any text or audio, and an hour is 21M tokens before compression. No frontier LLM context window absorbs this naively, so every video model has to compress somewhere.
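
Spelled out as arithmetic:

# Visual-token arithmetic for 30 FPS video at 224x224 with ViT-B/16 patches.
fps, seconds_per_minute = 30, 60
patches_per_frame = (224 // 16) ** 2      # 14 x 14 = 196
tokens_per_minute = fps * seconds_per_minute * patches_per_frame
print(tokens_per_minute)                  # 352,800 (~352K)
print(tokens_per_minute * 60)             # 21,168,000 (~21M) per hour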

Information sparsity. Adjacent frames are usually nearly identical, and the interesting events are rare and unevenly distributed. A surveillance camera at 1 FPS over 24 hours produces 86,400 frames, and the question of interest may depend on three of them. Sampling every frame is wasteful, but uniform sampling drops the frames that matter, so adaptive selection is required.

Multi-modality is intrinsic. Video without audio is half a signal in egocentric, conversational, and many healthcare contexts, even though much surveillance footage is silent and sports broadcast audio is mostly commentary. Video with audio doubles the embedding cost and adds synchronization requirements, and training a native multimodal model is a different problem than bolting an audio adapter onto a vision encoder.

Vision Encoders: From Specialists to Universals

The first thing a video model does is encode each frame. Until recently, that meant picking an encoder family and accepting its weaknesses. Image-text contrastive models (CLIP, SigLIP, SigLIP 2) are the default VLM front-end for semantic retrieval but weak on dense prediction. Self-supervised ViTs (DINOv2, DINOv3) excel on dense prediction (segmentation, depth, correspondence) because their training objective preserves fine-grained spatial structure, but their features are not aligned to language. Segmentation foundation models (SAM, SAM 2 and the compressed variants below) are specialists for object proposals and tracking. Dense-prediction specialists (DepthAnything, MiDaS, DepthPro, DepthLM) handle depth.

A production video system on a wearable, robot, or smart camera cannot ship a separate backbone for each of these capabilities, and neither compromising on capability nor paying the memory-and-latency penalty is acceptable.

Agglomerative encoders and EUPE

The agglomerative-encoder thread addresses this directly. AM-RADIO (Ranzinger et al., Nvidia, CVPR 2024) introduced multi-teacher distillation for compact universal vision encoders, distilling CLIP, DINOv2, and SAM into a unified student. Theia (Shang et al., The AI Institute, CoRL 2024) targeted embodied-agent perception by distilling from CLIP, DINOv2, ViT, SAM, and Depth-Anything for robot learning. DUNE (Sariyildiz et al., Naver Labs Europe, CVPR 2025) extended this further with heterogeneous 2D and 3D teachers (DINOv2, MASt3R, Multi-HMR). The shared insight: vision foundation models trained for different objectives produce complementary feature spaces, and a small student can inherit the union if the distillation is set up well.

Our recent work on the Efficient Universal Perception Encoder (EUPE) advances this thread by adding an intermediate proxy-teacher step. The recipe:

  1. Train a large proxy teacher by distilling from a diverse teacher pool: DINOv2 and DINOv3 (self-supervised dense features), the SAM family (SAM, SAM 2, SAM 3) for segmentation, and CLIP / SigLIP / SigLIP-SO400M for vision-language alignment.
  2. Distill the proxy teacher down into a compact student under 100M parameters.

The intermediate step matters because direct multi-teacher distillation into a small student loses signal: the teachers disagree at the feature level and the student capacity cannot represent the union. A single proxy resolves the disagreements first, then transfers a coherent feature space.

The released family includes ViT-T/S/B and ConvNeXt T/S/B variants, all under 100M parameters, with weights on Hugging Face. Evaluation spans image classification (ImageNet, ObjectNet, SUN397, iNaturalist), dense prediction (ADE20K and COCO segmentation, NYU and KITTI depth, SPair matching), and vision-language tasks (VQA, image-text retrieval). EUPE matches or exceeds same-size domain experts across these domains. For video systems, which are particularly sensitive to per-frame inference cost, a single backbone covering classification, dense prediction, and VLM front-end means fewer encoders to load and amortize, and the latency win compounds with every frame in the stream.

Efficient Attention for Long Sequences

Once frames are encoded, attention becomes the bottleneck. Standard self-attention is O(n²) in sequence length, which is unaffordable for long video. Three families of remedies have stabilized.

Sliding-window and sparse attention. LongLLaMA, Mistral's sliding-window, and DeepSeek's Native Sparse Attention. Each restricts attention to a local or learned subset of tokens.

Linear attention. Performer, Linformer, and Nyströmformer (Xiong et al., AAAI 2021), which uses Nyström-based low-rank approximation of the softmax kernel to achieve linear complexity. Recent production systems extend this thread: Qwen3-Next pairs Gated DeltaNet (a linear-attention variant) with full attention in a 3:1 ratio. These approaches help when sequence length dominates compute.

Hybrid architectures. Mamba-Transformer hybrids (Jamba, Nvidia Nemotron Nano 2) keep self-attention for short-range relationships and use SSM blocks for long-range dependencies. For video this maps naturally: most spatial reasoning is local, while temporal reasoning extends across many frames.

The structural pattern that holds for video is factorized spatial-temporal attention. Spatial attention within a frame is O(P²) where P is patches per frame and small; temporal attention across frames is O(T²) where T is frame count and can be large. Full attention on the spatial axis combined with linear or sparse attention on the temporal axis works well for most workloads, and recent open-weight video VLMs (Qwen3-VL, LLaVA-Video) converge here.
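
A minimal PyTorch sketch of that factorization; real models add norms, residuals, and sparse or linear temporal kernels on top:

# Factorized spatial-temporal attention: full attention over patches within
# a frame, then attention over time per patch position. A shapes-only sketch.
import torch
import torch.nn as nn

class FactorizedAttention(nn.Module):
    def __init__(self, dim, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):               # x: (batch, T frames, P patches, dim)
        b, t, p, d = x.shape
        s = x.reshape(b * t, p, d)      # spatial axis: O(P^2), P is small
        s, _ = self.spatial(s, s, s)
        s = s.reshape(b, t, p, d).transpose(1, 2).reshape(b * p, t, d)
        s, _ = self.temporal(s, s, s)   # temporal axis: O(T^2), T can be large
        return s.reshape(b, p, t, d).transpose(1, 2)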

Segmentation and Tracking on Device

Once you can encode and attend efficiently, the next question is what to extract from each frame, and segmentation and tracking are the workhorse primitives.

SAM (Kirillov et al., Meta, ICCV 2023) defined the prompt-driven segmentation foundation model, and SAM 2 (Ravi et al., Meta, 2024) extended it to video with a memory module that maintains separate FIFO queues for recent and prompted frames, plus object pointers, with temporal positional embeddings on the recent queue only. Several parallel lines take different architectural paths: XMem (Cheng et al., ECCV 2022) introduced the multi-store memory architecture (sensory, working, long-term) that informed many later designs; DEVA (Cheng et al., ICCV 2023) decouples task-specific image-level segmentation from a universal temporal propagation module trained once and reused across tasks; and Cutie (Cheng et al., CVPR 2024 Highlight) reads object-level memory through a query-based object transformer rather than propagating pixel-level features. SAM 2 and its compressed descendants dominate the foundation-model production stack today, while Cutie, DEVA, and XMem hold advantages in long-persistence, decoupled-task, and tight-memory regimes respectively.

Most of our work here has been on compression. EfficientSAM (CVPR 2024 Highlight) introduced SAMI, a masked image pretraining recipe that distills SAM's image encoder into much smaller backbones; the released ViT-T and ViT-S variants reach within a few mIoU points of the full SAM ViT-H at a fraction of the cost, and the open-source release made on-device segmentation practical for the first time. Efficient Track Anything (ICCV 2025) extended this to video with two changes: a plain non-hierarchical ViT replaces SAM 2's hierarchical encoder, and an efficient memory module reduces the cost of frame feature extraction and memory computation within SAM 2's bounded memory bank, yielding roughly 2x speedup on A100 with 2.4x parameter reduction at performance comparable to SAM 2, and ~10 FPS on iPhone 15 Pro Max. EdgeTAM (CVPR 2025) pushed further onto consumer silicon with a 2D Spatial Perceiver that compresses per-frame memory aggressively while preserving the spatial structure needed for accurate tracking, hitting J&F scores of 87.7 / 70.0 / 72.3 / 71.7 on DAVIS 2017, MOSE, SA-V validation, and SA-V test while running at 16 FPS on iPhone 15 Pro Max. That is the first time foundation-model-grade video tracking has been deployable on a consumer mobile device.

Most per-frame computation is redundant across adjacent frames, so memory-efficient propagation drives the production gains, not raw model size.

3D and Depth from Video

Segmentation and tracking handle 2D structure, but video also carries strong cues for 3D through parallax, motion, and temporal consistency. The methods that have stabilized are still predominantly image-based, applied per-frame or fed into multi-view reconstructors that treat sampled frames as views; truly temporal-video-native depth is an active but immature area. Extracting metric depth used to require specialized architectures.

DepthLM (ICLR 2026 Oral) shows that a vision-language model with a 3B-parameter backbone, trained with standard text-based supervised fine-tuning and no architecture change, can match or beat dedicated specialists like DepthPro and Metric3Dv2 on metric depth benchmarks. The recipe has three pieces: visual prompting that renders markers on images rather than using text coordinate prompts; intrinsic-conditioned augmentation that unifies focal length to resolve camera ambiguity during training; and supervised fine-tuning on sparsely labeled images, with just one labeled pixel per training image.

DepthLM is the VLM-based entry in a four-way race for metric depth. The dedicated specialists, DepthAnything (Yang et al., CVPR 2024) trained on 1.5M labeled and 62M+ unlabeled images and DepthAnything V2 (NeurIPS 2024) trained on ~595K synthetic-labeled and ~62M pseudo-labeled real images, plus DepthPro (Bochkovskii et al., Apple) and Metric3D v2, still set per-task SOTA on most depth benchmarks. The diffusion-prior approach is best represented by Marigold (Ke et al., CVPR 2024 Oral), which fine-tunes a pretrained image diffusion model and gets strong zero-shot generalization at the cost of latency. The reconstruction family, including DUSt3R and MASt3R (Naver Labs Europe) and the more recent VGGT (Visual Geometry Grounded Transformer, Wang et al., Oxford VGG and Meta AI, CVPR 2025 Best Paper), predicts 3D scene structure, camera parameters, and depth jointly from sparse views, which is useful when geometry matters more than per-pixel depth. Specialists win on raw accuracy, reconstruction wins when camera pose is needed, diffusion priors win on out-of-distribution generalization, and VLM-based approaches like DepthLM win when the same model handles depth and higher-level reasoning.

The implication is structural: if 3D understanding rides on the same VLM that handles reasoning, the stack collapses two perception models into one, and for an AR headset or a robot that simplifies deployment substantially.

Long-Form Video Understanding

Spatial primitives describe what is in a single frame. The harder problem is understanding what an entire video means as length grows from seconds to hours.

LongVU (ICML 2025) addresses this with spatiotemporal adaptive compression. The four-stage pipeline (stage 1 is sketched in code right after the list):

  1. Temporal redundancy removal via DINOv2. Sample at 1 FPS, compute DINOv2 features within non-overlapping 8-frame windows, drop frames whose features are highly similar to neighbors. Roughly 45.9% of frames are retained after this stage. DINOv2 is used here because its vision-centric self-supervised features are well-suited to inter-frame similarity pruning, while SigLIP is retained downstream for language-aligned semantics.
  2. Feature fusion. Extract SigLIP features from the surviving frames and combine them with DINOv2 features through a Spatial Vision Aggregator.
  3. Cross-modal query selection. Compute attention between frame features and the LLM's text-query embeddings; retain the top-Nh frames at full 144 tokens and reduce the rest to 64 tokens, balancing detail against budget.
  4. Spatial Token Compression. In sliding windows of 8 frames, the first frame keeps full token resolution while tokens in subsequent frames whose cosine similarity to the corresponding anchor token exceeds 0.8 are pruned, yielding about 40.4% additional token reduction.
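
A simplified sketch of stage 1; the paper works in 8-frame windows, while this keeps only frames that differ enough from the last kept one:

# Similarity-based temporal pruning, simplified from LongVU's stage 1.
import torch
import torch.nn.functional as F

def prune_redundant_frames(features, threshold=0.8):
    # features: (T, D) pooled per-frame embeddings, e.g. from DINOv2.
    kept = [0]
    for t in range(1, features.shape[0]):
        sim = F.cosine_similarity(features[t], features[kept[-1]], dim=0)
        if sim < threshold:          # different enough from the last kept frame
            kept.append(t)
    return kept                      # indices of surviving frames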

LongVU is built on Qwen2-7B (with a Llama 3.2-3B lightweight variant) and reaches 60.6% on VideoMME and 65.4% on MLVU with 1 FPS adaptive sampling, outperforming uniform-frame baselines like LLaVA-OneVision while using a fraction of the tokens.

Our follow-up Tempo pushes adaptive token allocation further. A small VLM up front acts as a query-aware compressor: it reads the question first, then routes token budget per-segment, swinging from 0.5 to 16 tokens per frame depending on relevance. The compressed representation is handed to a larger LLM for downstream reasoning. At an 8K visual token budget, the 6B Tempo model reaches 52.3 on LVBench (where videos average over an hour), beating both GPT-4o and Gemini 1.5 Pro at that budget.

LongVU and Tempo sit in a broader thread of compression approaches. LLaMA-VID (Li et al., ECCV 2024) takes aggressive context-token compression to an extreme: each frame is reduced to two learned tokens, a context token encoding instruction-guided information and a content token capturing visual cues, which enables very long videos at the cost of some spatial detail. VideoChat-Flash (ICLR 2026) introduces hierarchical clip-to-video token compression (clip-level during encoding, then video-level in the LLM context) inside a multi-stage short-to-long training scheme, achieving roughly 50x compression with minimal performance loss and 99.1% needle-in-a-haystack accuracy on 10K-frame inputs. PLLaVA and successors apply parameter-free pooling at the projection layer. Frontier multimodal models with very long native context windows (Gemini 2/3 with 1M+ tokens, recent Qwen3-VL variants) go the other way: rather than compress aggressively, they push the budget upward and let attention sort out relevance. The tradeoff is concrete: aggressive compression preserves on-device feasibility but can drop information, while large native contexts preserve information but require frontier-tier compute. LongVU sits at the on-device end of the spectrum, Gemini at the frontier end, and different deployment targets pick different points.

Long-form video understanding is dominated by token budget, and the field is converging on some combination of adaptive token allocation, memory mechanisms, and language-guided pruning. The open question is whether these techniques can work in streaming mode, where the model cannot see the whole video upfront, rather than batch; nobody has solved that cleanly.

Audio-Visual Fusion

Beyond length and spatial structure, audio is what disambiguates many videos, especially egocentric and conversational footage, and how a model fuses audio with the visual stream is a separate architectural choice from anything covered above.

Encoder stitching is the historical default: separate audio and visual encoders feed pooled embeddings into a language model. Cheap and modular, but cross-modal alignment is shallow because the encoders never see each other's data during training. Native multimodal training treats text, image, video, and audio tokens uniformly through a shared backbone. Qwen3-Omni is the strongest open-weight example as of April 2026, with state-of-the-art results on 22 of 36 audio and audio-visual benchmarks (32 of 36 among open-source models) while sharing weights with the visual stack, and Gemini's native multimodal architecture follows a similar internal pattern.

EgoAVU (CVPR 2026 Highlight) takes a third path. Rather than propose a new fusion architecture, EgoAVU builds the first large-scale egocentric audio-visual benchmark and dataset and evaluates how existing VLMs (Qwen2-VL, Gemini, LLaMA 3) perform when audio embeddings are stitched alongside the visual tokens. Audio in egocentric video carries distinct information from third-person video: ambient sound, hand-object contact noise, the wearer's own voice, and conversational partners are all anchored on the wearer's body in ways they are not in YouTube-style footage. The evaluation shows that audio adds substantial signal on egocentric understanding tasks and that stitched audio encoders into existing VLMs are already a strong baseline; the headroom is in better data and training, not in radical architectural changes.

Native multimodal wins at scale, but egocentric data is underrepresented in pretraining corpora and wearables are the deployment target where this distribution dominates. Benchmark-driven progress on the egocentric slice matters more for wearable products than for cloud video generally.

Reasoning Over Video

Encoding, compression, and fusion produce a representation; reasoning is what turns that representation into an answer. A VLM that watches a video and answers in one forward pass often fails on temporally-extended questions, because compressing hours of footage into a fixed-length representation and reading the answer back out drops too much nuance.

VideoAuto-R1 (CVPR 2026) starts from a counterintuitive observation: for RL-trained video VLMs, direct answering often matches or beats chain-of-thought reasoning while costing a lot more tokens. The proposed recipe is "reason-when-necessary." During training, the model first generates an initial answer, then performs reasoning, then outputs a reviewed final answer; both the initial and reviewed answers are supervised through verifiable rewards. At inference, the confidence of the initial answer determines whether to spend tokens on reasoning at all. The result: state-of-the-art accuracy on video QA and grounding benchmarks while reducing average response length roughly 3.3x (from ~144 to ~44 tokens). Thinking-mode activates rarely on perception-oriented questions and often on reasoning-intensive ones, which suggests that explicit reasoning helps but is not always necessary, and gating it on confidence is a meaningful efficiency win.
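
The gating logic, sketched with stand-in interfaces; model.answer and its confidence field are hypothetical, not the paper's code:

# Reason-when-necessary, sketched. The model interface is a hypothetical
# stand-in for the paper's RL-trained video VLM.
def answer_with_gated_reasoning(model, video, question, tau=0.9):
    draft = model.answer(video, question)     # cheap direct answer first
    if draft.confidence >= tau:
        return draft.text                     # perception questions mostly stop here
    thoughts = model.reason(video, question)  # spend tokens only when unsure
    return model.answer(video, question, context=thoughts).text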

Several lines have converged on related patterns. Video-of-Thought (Fei et al., ICML 2024) introduced step-by-step video reasoning that decomposes a complex question from low-level pixel perception to high-level cognitive interpretation, paired with the MotionEpic VLM that grounds reasoning in spatial-temporal scene graphs. VideoTree (Wang et al., CVPR 2025) builds a query-adaptive hierarchical tree by iteratively selecting the keyframes most relevant to the question, achieving strong long-form QA without any training. Plan-and-execute approaches in the broader VLM-agent literature share the same structural pattern with different implementations. Single-pass video VLMs fail predictably on long-horizon questions, and the field has settled on two-stage inference. The remaining question is whether the reasoning step should be explicit (interpretable, easier to debug, slower) or implicit through learned routing (faster, harder to introspect).

Deployment: Where Video Intelligence Actually Runs

Video deployment splits into three tiers, and the choice between them is driven as much by economics, latency, and data residency as by raw model capability.

Cloud. Frontier APIs like Gemini's video understanding endpoints and the multimodal flagships from OpenAI and Anthropic that accept image and audio (with video typically handled via frame sampling); specialized providers like Twelve Labs (Marengo embeddings and Pegasus video LLM with hour-scale temporal segmentation); hyperscaler services like AWS Rekognition Video, Azure Video Indexer, and Google Video Intelligence. The cloud tier gets you the largest models and the longest context with no client-side complexity, but it pays in round-trip latency (hundreds of milliseconds minimum), cost (10-100x edge inference per task), and bandwidth that breaks for continuous video at scale.

Edge servers. On-prem GPU appliances or smart camera bridges, like Verkada's bridges, Hayden AI's on-device units, or industrial-inspection servers running Cosmos NIM. This tier trades the cloud's latency and data-residency problems for a hardware investment and a fragmented stack across customers, and supports mid-size models in the 3-30B range.

On-device. Mobile SoCs, AR glasses silicon, embedded NPUs. Apple Intelligence on iPhone, Qualcomm Robotics RB5/RB6 in robotics, Qualcomm Snapdragon AR1 in Ray-Ban Meta and Snapdragon XR2 Gen 2 in Quest 3. Zero-latency, fully private, no bandwidth, and it scales with device shipments. The cost is a tight power budget (1-30W), limited memory bandwidth, and a fragmented runtime landscape.

For continuous video the math forces the choice. A body-cam recording 12 hours per shift cannot ship 100GB per day to the cloud per officer, so the fast-thinking layer has to live on the device, with cloud or edge servers used for the deeper queries. Hybrid architectures, not pure cloud or pure on-device, are the production default.

Quantization Recipes for Video Models

Video models inherit the quantization recipes that have stabilized for LLMs and VLMs.

  • W4A16 (4-bit weights, 16-bit activations) is the default for VLMs and VLAs at the edge. Recent open releases including the Embedl-quantized Cosmos-Reason2 (2B) variants show the recipe holds across multimodal architectures with minimal accuracy loss. A minimal sketch of the recipe follows this list.
  • NVFP4 (4-bit weights and 4-bit activations in NVIDIA's FP4 format with per-block-of-16 FP scales) unlocks Blackwell-tier hardware (Jetson AGX Thor) and is the production-grade upgrade where supported.
  • W8A8 remains the safer fallback for mature vision and segmentation models.
  • Sub-4-bit quantization (W2A16, ternary, mixed precision) continues to improve. Our ParetoQ (NeurIPS 2025) work mapped the full quantization Pareto frontier and showed that at 2 bits and below, models learn fundamentally different representations than at 3-4 bits; for a fixed memory budget, a larger 2-bit model can beat a smaller 4-bit model. That shifts the design space for very-low-power video deployment, though it still requires QAT and is not yet standard for production VLMs.
  • KV cache quantization matters more for video than for text. The KV cache for a long video can dominate memory, and rotation-based methods like SpinQuant (which jointly quantize weights, activations, and KV cache) have been particularly effective at compressing it to 3-4 bits per element.
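
A generic sketch of per-group W4A16 weight quantization, not any particular library's kernel:

# Symmetric per-group 4-bit weight quantization; activations stay FP16.
# Assumes the weight's element count divides evenly into groups.
import torch

def quantize_w4a16(weight, group_size=128):
    w = weight.float().reshape(-1, group_size)
    scale = w.abs().amax(dim=1, keepdim=True).clamp(min=1e-8) / 7  # int4 range [-8, 7]
    q = torch.clamp(torch.round(w / scale), -8, 7)
    dequant = (q * scale).reshape(weight.shape).half()   # what the matmul sees
    return q.to(torch.int8), scale, dequant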

Runtime Stack

For PyTorch-based deployment, ExecuTorch (Meta) is the natural path. ExecuTorch reached 1.0 GA in October 2025 and now powers Meta's on-device AI across Instagram, WhatsApp, Messenger, Facebook, Quest 3, and Ray-Ban Meta, with backends spanning Apple Core ML, Qualcomm QNN, Arm, MediaTek NeuroPilot, and Vulkan. For video pipelines, ExecuTorch's support for streaming inference and selective recomputation matters because re-encoding every frame from scratch is wasteful. Other paths cover other ecosystems: Apple Core ML for Apple platforms, LiteRT-LM plus Qualcomm QNN for Android, Nvidia Isaac plus NIM on Jetson, Intel OpenVINO for x86 industrial. No single runtime wins, and production video systems usually ship the same model compiled for several backends.

What's Still Hard

Several problems remain open across the stack.

Continuous-stream understanding at hour-plus durations. LongVU and similar techniques assume batch mode where the whole video is available. Streaming mode, where the model has to maintain understanding while video keeps arriving, is much harder. Memory mechanisms, retrieval-augmented architectures, and incremental token compression are all in progress; none are solved cleanly.

Sparse-event detection. Most production video is uninteresting. Finding the three frames out of 86,400 that matter, without paying for full inference on all 86,400, requires hierarchical attention or learned selection. Schema-driven extraction over known classes now ships commercially (Twelve Labs' Pegasus pulls structured metadata against a customer-defined schema); open-set "show me anything anomalous" remains unsolved.

Cross-camera and cross-clip reasoning. A surveillance ops team often wants to ask questions across many cameras and many time windows. Library-scale retrieval over indexed videos ships (Twelve Labs' Marengo ranks moments across a video library), but that is ANN retrieval over independent embeddings, not joint reasoning. Multi-stream attention, cross-camera identity persistence, and global temporal reasoning are all open.

Real-time sub-watt inference for AR glasses. Today's mobile NPUs do tens of TOPS in tens of milliwatts, but an AR glass AI assistant needs to do continuous video understanding inside a 1-3W envelope that includes everything else the system runs. EUPE-style universal compact encoders, EdgeTAM-style efficient tracking, and aggressive quantization all help, but the gap to always-on Gemini-grade understanding on glasses is still 5-10x in compute efficiency.

Closed-loop evaluation. Public benchmarks measure accuracy on curated multiple-choice question sets. Production systems care about latency under load, drift under deployment shifts, robustness to camera placement and lighting, and intervention rates. Closed-loop methodology lags benchmark accuracy by a wide margin.

Audio-visual generative consistency. When video models generate or edit content rather than understand it (out of scope for most of this post), keeping audio synchronized with visual events is unsolved, which is why most current text-to-video models ship without working audio.

Cross-modal grounding stability. When a VLM is asked "what is the man in the blue shirt doing?", the model often fails not on language understanding but on grounding the referent across frames. Timestamp-level grounding ships commercially (Pegasus localizes answers to start/end times); spatial grounding (bounding boxes, referent IDs across cuts) still requires bolting on SAM 2 or Grounding DINO.

Closing

A handful of patterns recur across encoding, perception, compression, fusion, reasoning, and deployment. Compress where redundancy is highest, which for video is almost always the temporal axis. Distill universal encoders from multiple teachers rather than ship a fleet of specialists. Factorize attention along the physical structure of the data: spatial within frames, temporal across frames, cross-modal across modalities. Treat quantization as the default rather than as a late optimization. Gate reasoning on confidence rather than running it on every input.

The encoder, compression, and fusion patterns are now stable; the streaming, sub-watt deployment, and closed-loop evaluation patterns are not. The open problems left in efficient video intelligence are mostly about scaling the stable recipes to streaming inputs, sub-watt power envelopes, and production deployments where evaluation has to track a system rather than a benchmark. The work ahead lives in the deployment stack at least as much as in the model layer.

AI agentsllm

Scaling Long-Horizon Coding Agents

Meta researchers developed a framework that improves coding agents by summarizing their extended work sessions into reusable structured knowledge, achieving significant benchmark gains.

Summary

What: A test-time scaling framework that addresses a core limitation in long-horizon coding agents: instead of just generating more attempts, it converts each agent's work trajectory (actions, errors, partial progress) into compact summaries that can be compared and reused across attempts.
Why it matters: Traditional test-time scaling works for short outputs that can be easily compared, but coding agents produce extended sequences of work that are hard to evaluate directly. The breakthrough is treating this as a representation problem rather than just a generation problem.

Deep Dive

  • The framework introduces two complementary scaling approaches: Recursive Tournament Voting (RTV) for parallel scaling and adapted Parallel-Distill-Refine (PDR) for sequential scaling
  • RTV recursively narrows down a population of rollout summaries through small-group comparisons, similar to a tournament bracket (see the sketch after this list)
  • PDR conditions new agent attempts on distilled summaries from previous rollouts, enabling knowledge transfer between sequential attempts
  • Structured summaries preserve salient hypotheses, progress tracking, and failure modes while discarding low-signal trace details
  • Claude-4.5-Opus achieved 77.6% on SWE-Bench Verified (up from 70.9%) using the mini-SWE-agent implementation
  • On Terminal-Bench v2.0 with Terminus 1, performance jumped from 46.9% to 59.1%
  • The research reframes test-time scaling for agents as fundamentally about representation, selection, and reuse rather than raw generation volume
  • The 70-page paper includes extensive evaluation across multiple frontier coding agents and benchmark datasets
  • Results suggest that effective knowledge representation between attempts is more valuable than simply running more parallel attempts
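
A minimal sketch of the tournament mechanism, with hypothetical stand-ins: `summaries` are the compact rollout representations and `judge` is a callable (in the paper's setting, an LLM comparison prompt) that returns the index of the strongest candidate in a small group. This illustrates the bracket structure, not the paper's actual code.

```python
import random

def recursive_tournament_vote(summaries, judge, group_size=4):
    """Narrow a population of rollout summaries to one winner through
    repeated small-group comparisons, tournament-bracket style."""
    pool = list(summaries)
    while len(pool) > 1:
        random.shuffle(pool)  # avoid positional bias across rounds
        winners = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            winners.append(group[judge(group)])  # judge picks best index
        pool = winners
    return pool[0]
```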

Decoder

  • Test-time scaling: Improving model performance by using more computation during inference (when answering queries) rather than during training
  • Rollout trajectories: The complete sequence of actions, observations, errors, and states an agent goes through while attempting to solve a problem
  • SWE-Bench Verified: A benchmark dataset for evaluating coding agents on real-world software engineering tasks from GitHub issues
  • Terminal-Bench: A benchmark for testing coding agents on terminal-based development tasks
  • Recursive Tournament Voting (RTV): A selection method that repeatedly pairs and compares solutions in groups to identify the best candidates
  • Parallel-Distill-Refine (PDR): A technique that generates multiple attempts in parallel, extracts key insights, and uses them to improve subsequent attempts

Original Article

Scaling Test-Time Compute for Agentic Coding

Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of actions, observations, errors, and partial progress taken by the agent. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents across SWE-Bench Verified and Terminal-Bench v2.0. For example, by using our method Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
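
For the sequential side, a schematic of the PDR loop under the same caveat: `run_agent`, `summarize`, and `distill` are hypothetical placeholders for the components the abstract names, and rollouts are assumed to expose a score (tests passed, for example).

```python
def parallel_distill_refine(task, run_agent, summarize, distill,
                            width=4, rounds=3):
    """Each round: run rollouts in parallel, compress them into structured
    summaries, and condition the next round on the distilled result."""
    context, best = None, None
    for _ in range(rounds):
        # Parallel attempts, each conditioned on prior distilled knowledge
        # (salient hypotheses, progress, failure modes).
        rollouts = [run_agent(task, prior_knowledge=context)
                    for _ in range(width)]
        # Distill: merge per-rollout summaries into a compact brief.
        context = distill([summarize(r) for r in rollouts])
        best = max(rollouts + ([best] if best else []),
                   key=lambda r: r.score)
    return best
```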

AI startup

Meta's loss is Thinking Machines' gain

AI startup Thinking Machines Lab is successfully recruiting top researchers from Meta, including PyTorch co-founder Soumith Chintala, by offering equity upside from its $12 billion valuation despite Meta's seven-figure cash packages.

Summary

What: Thinking Machines Lab, an AI startup valued at $12 billion with 140 employees, has been hiring researchers from Meta at a higher rate than from any other employer. Notable hires include Soumith Chintala (PyTorch co-founder, now CTO) and Piotr Dollár (Segment Anything co-author). The company just secured a multibillion-dollar Google Cloud deal for access to Nvidia's GB300 chips.
Why it matters: The two-way talent war between Meta and Thinking Machines reveals how AI startups can compete with Big Tech's massive salaries through equity incentives, and shows the strategic importance of infrastructure deals in attracting top-tier talent. With TML still early-stage (one product released) but highly valued, employees are betting on significant upside.

Deep Dive

  • Thinking Machines Lab has hired more researchers from Meta than from any other single company, including PyTorch co-founder Soumith Chintala as CTO and Segment Anything co-author Piotr Dollár
  • The talent flow runs in both directions: Meta has poached seven of TML's founding members, while TML continues to recruit from Meta's research divisions
  • TML just signed a multibillion-dollar cloud deal with Google, announced at Cloud Next, providing access to Nvidia's latest GB300 chips and placing it in the same infrastructure tier as Anthropic and Meta
  • The startup is valued at $12 billion with around 140 employees, having released just one product so far, but offers significant equity upside compared to OpenAI and Anthropic's record-breaking valuations
  • Recent Meta-to-TML hires include Weiyao Wang (8 years building multimodal perception and SAM3D), James Sun (9 years on LLM training), and Andrea Madotto (FAIR multimodal language models researcher)
  • TML has also recruited top talent from Cognition (Neal Wu, three-time gold medalist at International Olympiad in Informatics), OpenAI, Waymo, Anthropic, Apple, and Microsoft's AI Superintelligence team
  • Meta reportedly held acquisition talks with Thinking Machines around April 2025 before the talent competition intensified
  • The financial calculus for researchers: Meta offers seven-figure packages with no strings attached, while TML offers equity in a $12B company still early enough for major upside potential

Decoder

  • PyTorch: Open source deep learning framework co-founded by Soumith Chintala at Meta, now the foundation for most AI research worldwide
  • Segment Anything (SAM): Influential computer vision model from Meta that can segment any object in images; SAM3D is the 3D version
  • FAIR: Facebook AI Research, Meta's research division focused on advancing AI
  • Multimodal: AI systems that can process and understand multiple types of data like text, images, and audio together
  • GB300: Nvidia's latest generation of GPU chips designed for AI workloads
  • Pre-training and post-training: The two main phases of developing large language models—pre-training on massive datasets, then post-training for specific tasks and safety

Original Article

Weiyao Wang spent eight years at Meta — his first job out of college — helping build multimodal perception systems and contributing to open-world segmentation projects, including SAM3D. His final day at Meta was last week, and he has since joined Thinking Machines Lab (TML).

His move to TML comes as the AI startup expands on multiple fronts. It just signed a multibillion-dollar cloud deal with Google, giving it access to Nvidia's latest GB300 chips and making it one of the first startups to run on the hardware.

The agreement, announced this past Tuesday at Google Cloud Next, follows an earlier partnership with Nvidia, and puts TML in the same infrastructure tier as Anthropic and Meta. (Meta reportedly held talks to acquire Thinking Machines around this time last year and has more recently been picking off TML's founders one by one.)

The talent picture remains fluid. Wang and Kenneth Li — a Harvard PhD who spent 10 months at Meta before joining TML this month — are the latest examples of a talent grab that runs in both directions. Business Insider reported last week that Meta has now poached seven of TML's founding members. A review of recent hires shows Thinking Machines is raiding Meta right back: based on LinkedIn profiles, TML appears to have hired more researchers from Meta than from any other single employer.

The most prominent is Soumith Chintala, TML's CTO, who spent 11 years at Meta and co-founded PyTorch, the open source deep learning framework that now underpins most of the world's AI research. He left Meta in late 2025 and was appointed CTO earlier this year. Piotr Dollár, another 11-year Meta veteran who served as research director and co-authored the influential Segment Anything model, is now on TML's technical staff. Andrea Madotto, a research scientist in Meta's FAIR division focused on multimodal language models, joined TML in December. James Sun, a software engineer with nearly nine years at Meta working on LLM pre- and post-training, also made the jump.

TML has drawn talent from beyond Meta, too. Neal Wu — a three-time gold medalist at the International Olympiad in Informatics and a founding member of the buzzy coding startup Cognition — joined early this year. Jeffrey Tao came via Waymo, Windsurf, and OpenAI. Muhammad Maaz previously held a research fellowship at Anthropic. Erik Wijmans arrived from Apple. Liliang Ren spent two and a half years on Microsoft's AI Superintelligence team pre-training OpenAI models for code before joining in March.

The startup's headcount now stands at around 140.

Meta's pay packages — seven figures, no strings attached — are well known by now. For researchers weighing their other options, the calculus may be as simple as this: Thinking Machines Lab is right now valued at $12 billion. Though that figure would've been unimaginable for a company at this stage in any previous tech cycle (it has released just one product so far), compared with the record-breaking valuations of OpenAI and Anthropic, there's still a lot of financial upside.

Reached Friday morning, a spokesperson for TML declined to comment for this story.

AI policy

OpenAI Posts Five-Principle Framework for AGI, Altman Concedes Bigger Role

OpenAI has published a five-principle framework for AGI development, its first major governance statement since 2018, as regulatory pressure on frontier AI labs intensifies.

Summary

What: OpenAI released a framework outlining five principles for developing artificial general intelligence, committing to resist power consolidation in the hands of a few entities. The statement arrives as US and European regulators tighten oversight of leading AI laboratories.
Why it matters: This represents a public accountability move from a major AI lab facing increased regulatory scrutiny, signaling that companies developing advanced AI systems are being pushed to formalize governance commitments.

Decoder

  • AGI (Artificial General Intelligence): AI systems that can perform any intellectual task that humans can, as opposed to narrow AI designed for specific tasks
  • Frontier AI labs: Companies developing cutting-edge, most-advanced AI systems

Original Article

OpenAI has published a five-principle framework for the development of artificial general intelligence. It is the company's most prominent statement of intent since its 2018 Charter. The lab claims it will resist letting the technology consolidate power in the hands of the few. The framework arrives at a time when US and European regulators are tightening oversight of frontier AI labs.

AI infrastructurestartupdevtools

Cursor's $60 Billion Escape Hatch

SpaceX secured a $60 billion option to acquire AI coding tool Cursor, whose power users have driven gross margins to negative 23% due to expensive API fees from Anthropic and OpenAI.

Summary

What: Cursor, an AI coding assistant generating $2.7 billion in annualized revenue, signed a deal giving SpaceX the option to acquire it for $60 billion or pay $10 billion for their collaboration work, with the primary benefit being access to SpaceX's Colossus supercomputer to reduce reliance on third-party AI model providers.
Why it matters: The deal illustrates how AI coding tools have inverted traditional SaaS economics—power users who generate the most value also consume the most compute, making them unprofitable to serve at current pricing, and shows how infrastructure access has become as strategic as the AI models themselves.

Deep Dive

  • Cursor attempted to raise billions at a $50 billion valuation but late-stage investors like Iconiq declined, having already deployed capital into OpenAI and Anthropic and unwilling to back what they viewed as a competitor
  • The company had lined up $2 billion from Nvidia, Andreessen Horowitz, and Thrive Capital before canceling the round after the SpaceX deal was announced
  • Cursor lost nearly $900 million in its last fiscal year despite strong revenue growth, highlighting the unsustainable unit economics of AI-powered developer tools at scale
  • SpaceX's financial picture is increasingly complex ahead of its June IPO, with a $20 billion bridge loan to refinance debt tied to X and xAI, down from $22 billion total
  • xAI alone spent $12.7 billion in capital expenditures last year while generating only $3.2 billion in revenue, suggesting massive infrastructure buildout costs
  • The deal may indicate SpaceX is using IPO momentum to absorb cash-hungry AI businesses before public investors scrutinize the full financials in the S-1 filing
  • Anthropic ran a holdback test withholding Claude Code from 2% of new Pro subscriber signups to measure feature value, drawing criticism despite being standard A/B testing practice
  • OpenAI initially restricted the GPT-5.4-Cyber variant to verified partners via its Trusted Access for Cyber program, citing cybersecurity concerns about releasing the model broadly
  • Nine days later, OpenAI reversed course and released GPT-5.5 to all users via chatbot, then added API access just one day after that, raising questions about whether safety concerns are genuine or commercially motivated
  • GitHub paused new Copilot paid plan signups after agentic workflows consumed more compute than monthly subscription fees could cover, with some requests costing more than the entire plan price
  • Amazon committed up to $25 billion in additional Anthropic funding at a $380 billion valuation, with Anthropic pledging over $100 billion in AWS spending over 10 years on custom silicon
  • xAI held talks with French AI lab Mistral and Cursor about a three-way partnership to compete with Anthropic and OpenAI, with former Mistral cofounder Devendra Chaplot already running pretraining at xAI
  • Meta deployed "Model Capability Initiative" surveillance software on US employees' work laptops to capture mouse movements, keystrokes, and periodic screenshots as training data for computer-use agents, with no opt-out available despite internal backlash

Decoder

  • Gross margins: Revenue minus cost of goods sold, expressed as a percentage; negative margins mean it costs more to deliver the service than customers pay
  • Annualized revenue: Monthly or quarterly revenue multiplied to estimate what full-year revenue would be at the current run rate
  • Colossus: SpaceX's supercomputer cluster, presumably built for AI training and inference workloads
  • Holdback test: An A/B testing methodology where a feature is withheld from a small percentage of users to measure its value by comparing behavior between groups
  • Agentic workflows: AI systems that can autonomously execute multi-step tasks rather than just responding to single prompts

Original Article

Cursor's $60 Billion Escape Hatch

SpaceX Secures Option to Acquire Cursor

SpaceX announced this week that it has secured an option to acquire AI coding startup Cursor for $60 billion, or pay $10 billion for the work they're doing together if it doesn't end up acquiring the company. The agreement gives Cursor access to SpaceX's Colossus supercomputer and a path to reducing its dependence on Anthropic and OpenAI, whose models currently power much of Cursor's product and whose fees have weighed heavily on its margins.

The timing of the SpaceX deal is interesting. Just a few weeks earlier, Cursor had been quietly attempting to raise billions in private markets, but had encountered a lack of interest from late-stage investors like Iconiq, many of whom had just deployed capital into OpenAI and Anthropic and weren't ready to back a competitor at a $50 billion valuation. Cursor's gross margins were negative 23% as of January, an unusual position for a company generating $2.7 billion in annualized revenue. The company ultimately lined up $2 billion from Nvidia, a16z, and Thrive before canceling the round after the deal with SpaceX was announced.

The Cursor deal complicates an already complicated financial picture for SpaceX ahead of its June IPO. Reuters reported this week that SpaceX took out a $20 billion bridge loan last month to refinance debt tied to X and xAI, reducing total debt from $22 billion to $20 billion, with repayment potentially contingent on IPO proceeds. xAI spent $12.7 billion in capital expenditures last year while generating only $3.2 billion in revenue. Cursor lost nearly $900 million in its last fiscal year. This may indicate that SpaceX is using its IPO momentum to paper over a collection of cash-hungry businesses before public market investors get a full look at the S-1.

Anthropic's Holdback Test Draws Criticism

A tweet claiming that Anthropic was no longer offering Claude Code access to Pro subscribers paying $20 per month made the rounds on social media this week. This was seen as an indication that the company would need to take drastic measures to maintain service for customers amidst an ongoing compute shortage. Anthropic has not said how many Claude Code users it has currently, or how many of those users are Pro or Max subscribers.

As it turns out, the situation was overstated. Anthropic's head of growth, Amol Avasare, responded directly to the post, saying that "we're running a small test on ~2% of new prosumer signups. Existing Pro and Max subscribers aren't affected." In A/B testing, this is called a "holdback test," in which the value of a feature is measured by excluding a small subset of users from accessing it. Nevertheless, many questioned the wisdom of running a test like this on a tool as widely used as Claude, and competitors were quick to pile on. OpenAI directly implied it was a violation of customer trust, and Sam Altman mockingly replied "ok boomer" to Avasare's post.
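
Mechanically, a holdback like the one Avasare describes usually comes down to deterministic bucketing on a user identifier. A generic sketch (not Anthropic's code; the experiment name is made up):

```python
import hashlib

def in_holdback(user_id: str, experiment: str = "claude-code-holdback",
                holdback_pct: float = 2.0) -> bool:
    """Deterministically assign ~2% of new signups to the holdback group.

    Hashing (experiment, user_id) yields a stable, roughly uniform bucket
    in [0, 10000), so a user always lands in the same group and the
    holdback share is a single threshold.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 10_000 < holdback_pct * 100  # 2% -> 0..199
```

Users in the holdback never see the feature; comparing their retention and usage against the other 98% is what measures the feature's value.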

OpenAI's Changing Tune

On April 14, just one week after Anthropic said Mythos was too powerful to release publicly because of cybersecurity concerns, OpenAI published a blog post announcing that its newest model would not be released broadly to the public and would instead be accessible only to verified partners via a tiered cybersecurity access program introduced in February called Trusted Access for Cyber (TAC). This specifically pertained to a variant of GPT-5.4, with OpenAI saying "we are fine-tuning our models specifically to enable defensive cybersecurity use cases, starting today with a variant of GPT‑5.4 trained to be cyber-permissive: GPT‑5.4‑Cyber."

Then, on April 23rd, OpenAI announced that its newest model, GPT-5.5, would in fact be available to all users, though only via chatbot, not API. The press release said this was because "API deployments require different safeguards," but also that "We'll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon". On April 24th, the company seemingly changed its mind or implemented the safeguards very fast, and GPT-5.5 became available via API that afternoon. The company's evolving stance has fueled skepticism about the actual risks of the models and supported the view that current limits are motivated by the business model, not safety concerns.

Latest Research

What We're Reading

  • Anthropic is investigating how unauthorized users gained access to its restricted Mythos cybersecurity model by combining a third-party contractor's credentials with data from a breach at AI startup Mercor Inc. to locate the model's endpoint.
  • xAI held talks in recent weeks with French lab Mistral and AI coding firm Cursor about a three-way partnership to close the gap on Anthropic and OpenAI, on top of SpaceX's newly disclosed $60 billion call option to acquire Cursor, with ex-Mistral cofounder Devendra Chaplot already running pretraining at xAI.
  • Apple announced that Tim Cook will step up to executive chairman and SVP of Hardware Engineering John Ternus will take over as CEO effective September 1, 2026, concluding Cook's 15-year tenure during which Apple's market cap grew from ~$350 billion to $4 trillion.
  • Blue Origin successfully recovered the New Glenn first stage rocket in its second flight last week, but the rocket's payload, an AST SpaceMobile communications satellite, ended up in an "off-nominal orbit," temporarily grounding New Glenn.
  • Tesla expanded its Robotaxi service to Dallas and Houston, but Robotaxi Tracker data suggests "there may be just one vehicle operating in each city so far," with tiny geofences relative to each city's footprint.
  • GitHub paused new sign-ups for paid Copilot plans and tightened usage limits after agentic workflows consumed more compute than monthly fees cover, with VP of product Joe Binder writing, "It's now common for a handful of requests to incur costs that exceed the plan price."
  • Amazon committed up to $25 billion in additional funding to Anthropic ($5B immediately, $20B tied to "certain commercial milestones") at Anthropic's $380B valuation, with Anthropic pledging $100B+ on AWS over 10 years. CEO Andy Jassy said the commitment "reflects the progress we've made together on custom silicon."
  • Meta Superintelligence Labs rolled out its "Model Capability Initiative" on Meta US employees' work laptops to capture mouse movements, keystrokes, and periodic screenshots as training data for computer-use agents. Despite internal backlash, the company has clarified that there is no option to opt out of the initiative.
  • Commerce Secretary Howard Lutnick told a Senate hearing that despite Trump lifting the H200 export ban four months earlier, no H200 chips have been sold to Chinese firms, saying "The Chinese central government has not let them, as of yet, buy the chips."
  • Trump issued an executive order directing the FDA to prioritize clinical trials and "Right to Try" access for psilocybin, MDMA, and ibogaine, and allocated $50 million in HHS matching funds for state research programs, triggering rallies in psychedelic drug-developer stocks.
  • The Trump administration is nearing a deal to lend Spirit Airlines up to $500 million in exchange for warrants giving the US government a potential 90% equity stake post-bankruptcy, sparking pushback from Transportation Secretary Sean Duffy, who asked, "If you do Spirit, who comes next?"
  • About 30K Samsung workers rallied at the Pyeongtaek chip complex in South Korea, demanding 15% of operating profit be paid out to chip-division employees. This would amount to more than 40 trillion won (~$27 billion), averaging $400K+ per worker, with the union threatening an 18-day strike starting May 21, following Samsung's record Q1 operating profit forecast of 57.2 trillion won.

AI enterpriseinfrastructure

Cohere and Aleph Alpha Join Forces

Cohere and Aleph Alpha are partnering to build a sovereign AI alternative for enterprises and governments that want control over their AI infrastructure without depending on big tech.

Summary

What: Canadian AI company Cohere is joining forces with German AI company Aleph Alpha to create an independent, enterprise-grade AI offering focused on data sovereignty, targeting highly-regulated sectors like finance, defense, and healthcare, with $600M in Series E funding led by Schwarz Group.
Why it matters: This addresses growing concerns about dependence on single AI vendors and jurisdictions, particularly important for European organizations with strict data privacy requirements and governments seeking digital independence from US hyperscalers.
Takeaway: Organizations in regulated industries can evaluate this sovereign AI option as an alternative to hyperscaler-based AI services, particularly if data residency and vendor independence are critical requirements.

Decoder

  • Sovereign AI: AI systems where the operator maintains full control over data and infrastructure, typically within specific jurisdictional boundaries, without dependence on foreign tech giants
  • Frontier models: The most advanced, cutting-edge AI models at the current state of the art
  • STACKIT: Schwarz Group's sovereign cloud infrastructure platform that will serve as the technical backbone for this partnership

Original Article

We're joining forces with Aleph Alpha to provide the world with an independent, enterprise-grade sovereign alternative in an era of growing AI concentration.

This transatlantic alliance would combine Cohere's global AI scale with Aleph Alpha's strong research excellence and deep institutional relationships, forging a globally competitive AI champion backed by Canadian and German ecosystems.

By pooling top-tier engineering talent and computational resources across two G7 nations, the partnership aims to significantly accelerate the development of next-generation frontier models and systems while providing a secure alternative to dependence on any single vendor or infrastructure stack.

The market for AI services is projected to surpass $1 trillion annually, with sovereign AI needs representing nearly $600B of that total (McKinsey, March 2026). The partnership uniquely bridges the gap between these segments with its sovereign-first approach, capturing the critical intersection where sovereignty requirements meet broad enterprise AI adoption.

Our Co-Founder and CEO, Aidan Gomez comments: "Combining the strengths of Cohere and Aleph Alpha accelerates our global expansion and advances our mission to deliver sovereign AI to nations around the world. Organizations globally are demanding uncompromising control over their AI stack. This transatlantic partnership unlocks the massive scale, robust infrastructure, and world-class R&D talent required to meet that demand.

"Built on the bedrock of shared Canadian and German values—where privacy, security, and responsible innovation are paramount—we are uniquely positioned to be the world's trusted AI partner. Together, we will give enterprises and governments across Canada, Europe, and the world the technology to move from exploration to rapid, secure implementation, with the absolute certainty that their data remains their own."

Through the planned deal, collectively we aim to deliver a secure alternative for customized AI in highly regulated sectors, including the public sector, finance, defense, energy, manufacturing, telecommunications, and healthcare. Aleph Alpha's experience in deploying AI in long-standing customer relationships provides an important foundation for this sovereign offering. As part of this partnership, the combined entity will partner with the companies of Schwarz Group, an international leader in the retail industry, to deploy a sovereign offering on its cloud service STACKIT.

Ilhan Scheer, Co-CEO of Aleph Alpha adds: "Aleph Alpha is in a unique position in Europe. We develop specialized large language models for Europe without compromising on Sovereignty, Transparency and Regulatory Compliance. By living this responsibility, we serve as a trusted and strategic partner to public sector and enterprise customers in Europe. Together with Cohere, we are building a real counterweight for organizations that refuse to outsource control over their AI to a single provider or jurisdiction, giving European institutions and enterprises access to powerful, yet controllable AI they can truly own."

In addition, the companies of Schwarz Group intend to back our upcoming Series E funding as lead investor with a $600M (€500M) structured financing commitment. The round is already attracting strong interest from the world's leading investors who recognize the necessity of an independent global AI powerhouse.

In a joint statement, Rolf Schumann and Christian Müller, Co-CEOs of Schwarz Digits, said: "With this investment, the companies of Schwarz Group position themselves as lead investors for digital sovereignty and infrastructure. Building this infrastructure is a strategic necessity to help shape the AI revolution based on values such as trust, fairness, and responsibility. The establishment of STACKIT, Schwarz Digits' sovereign cloud infrastructure, as the technical backbone of this transatlantic AI initiative empowers organizations to strengthen their digital independence and maintain control over their data. This is true leadership in digital sovereignty."

AI math

An amateur just solved a 60-year-old math problem—by asking AI

A 23-year-old amateur used ChatGPT to solve a 60-year-old mathematical conjecture that had stumped expert mathematicians, with the AI discovering an entirely new proof approach that may have broader applications.

Summary

What: Liam Price, who has no advanced math training, prompted GPT-5.4 Pro with an unsolved Erdős problem about primitive sets (collections of whole numbers where none divide each other) and received a proof that leading mathematicians Terence Tao and Jared Duker Lichtman validated and refined.
Why it matters: Unlike previous AI math solutions that replicated known approaches, ChatGPT applied a formula from related math areas that no human had thought to use for this problem type, suggesting AI can break through human cognitive blind spots and potentially "discovered a new way to think about large numbers and their anatomy" according to Tao.

Deep Dive

  • The problem asked whether the maximum "Erdős sum" score for primitive sets approaches exactly one as the numbers in the set approach infinity, a conjecture left unsolved since the 1960s
  • Price submitted the problem to ChatGPT on "an idle Monday afternoon" without knowing its history or that prominent mathematicians had failed to solve it
  • Terence Tao says human mathematicians "collectively made a slight wrong turn at move one," following a standard sequence of approaches that led nowhere
  • The AI's raw proof output was "quite poor" and required expert mathematicians to extract and understand the core insight
  • ChatGPT used a well-known formula from adjacent mathematical domains that no one had thought to apply to primitive set problems
  • Tao and Lichtman have since distilled the proof and already identified other potential applications of the method
  • Lichtman, who proved a related Erdős conjecture in his 2022 doctoral thesis but got stuck on this one, says the new method confirms his graduate school intuition that these problems "were kind of clustered together"
  • Price and collaborator Kevin Barreto sparked the "AI-for-Erdős craze" in late 2025 by randomly prompting free ChatGPT with open Erdős problems
  • An AI researcher gifted them ChatGPT Pro subscriptions to encourage their "vibe mathing" experiments
  • Experts caution the long-term significance is still uncertain, but this appears to be a genuine novel contribution rather than rediscovery of existing work
  • The breakthrough suggests AI language models may excel at bypassing human mental blocks and connecting disparate mathematical domains

Decoder

  • Erdős problems: Unsolved mathematical conjectures left by prolific mathematician Paul Erdős, ranging widely in difficulty and significance
  • Primitive sets: Collections of whole numbers where no number can be evenly divided by any other number in the set (generalizes the concept of prime numbers)
  • Erdős sum: A calculated "score" for primitive sets that Erdős proved has a maximum value
  • LLM (Large Language Model): AI systems like ChatGPT trained on vast text corpora to generate human-like responses

Original Article

An amateur just solved a 60-year-old math problem—by asking AI

A ChatGPT AI has proved a conjecture with a method no human had thought of. Experts believe it may have further uses

By Joseph Howlett, edited by Lee Billings


Liam Price just cracked a 60-year-old problem that world-class mathematicians have tried and failed to solve. He's 23 years old and has no advanced mathematics training. What he does have is a ChatGPT Pro subscription, which gives him access to the latest large language models from OpenAI.

Artificial intelligence has recently made headlines for solving a number of "Erdős problems," conjectures left behind by the prolific mathematician Paul Erdős. But experts have warned that these problems are an imperfect benchmark of artificial intelligence's mathematical prowess. They range dramatically in both significance and difficulty, and many AI solutions have turned out to be less original than they appeared.

The new solution—which Price got in response to a single prompt to GPT-5.4 Pro and posted on www.erdosproblems.com, a website devoted to the Erdős problems, just over a week ago—is different. The problem it solves has eluded some prominent minds, bestowing it some esteem. And more importantly, the AI seems to have used a totally new method for problems of this kind. It's too soon to say with certainty, but this LLM-conceived connection may be useful for broader applications—something hard to find among recently touted AI triumphs in math.

"This one is a bit different because people did look at it, and the humans that looked at it just collectively made a slight wrong turn at move one," says Terence Tao, a mathematician at the University of California, Los Angeles, who has become a prominent scorekeeper for AI's push into his field. "What's beginning to emerge is that the problem was maybe easier than expected, and it was like there was some kind of mental block."

The question Price solved—or prompted ChatGPT to solve—concerns special sets of whole numbers, where no number in the set can be evenly divided by any other. Erdős called these "primitive sets" because of their connection to similarly indivisible prime numbers.

"A number is prime if it has no other divisors, and this is kind of generalizing that definition from an individual number to a collection of numbers," says Jared Duker Lichtman, a mathematician at Stanford University. Any set of prime numbers is automatically primitive, because primes have no factors (except themselves and the number one).

Erdős also came up with the Erdős sum, a "score" you can calculate for any primitive set. He showed that the sum had a maximum possible value—and conjectured that this value must hold only for the set of all prime numbers. Lichtman proved Erdős right as part of his doctoral thesis in 2022.

Erdős also noticed that the score drops if all of a set's numbers are large—the larger the numbers, the smaller the maximum score could become. He guessed that as the set's numbers approached infinity, the maximum score would drop to exactly one. Lichtman tried to prove this, too, but got stuck like everyone else before him.
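
The article never writes the formula down; the Erdős sum of a set is standardly defined as the sum of 1/(n log n) over its elements. A short sketch makes the objects concrete (the sets below are illustrative, not drawn from the proof):

```python
from math import log
from itertools import combinations

def is_primitive(numbers):
    """A set is primitive if no element evenly divides another."""
    nums = sorted(set(numbers))
    return not any(b % a == 0 for a, b in combinations(nums, 2))

def erdos_sum(numbers):
    """Erdős sum: sum of 1/(n log n) over the set (standard definition)."""
    return sum(1 / (n * log(n)) for n in numbers)

primes = [2, 3, 5, 7, 11, 13]
assert is_primitive(primes)
print(erdos_sum(primes))      # ~1.29; approaches ~1.64 over all primes

composite = [6, 10, 15]       # primitive, yet no member is prime
assert is_primitive(composite)
print(erdos_sum(composite))   # ~0.16, far below the prime set's score
```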

Price wasn't aware of this history when he entered the problem into ChatGPT on an idle Monday afternoon. "I didn't know what the problem was—I was just doing Erdős problems as I do sometimes, giving them to the AI and seeing what it can come up with," he says. "And it came up with what looked like a right solution."

He sent it to his occasional collaborator Kevin Barreto, a second-year undergraduate in mathematics at the University of Cambridge. The duo had jump-started the AI-for-Erdős craze late last year by prompting a free version of ChatGPT with open problems chosen at random from the Erdős problems website. (An AI researcher subsequently gifted them each a ChatGPT Pro subscription to encourage their "vibe mathing.")

Reviewing Price's message, Barreto realized what they had was special, and experts whom he notified quickly took notice.

"There was kind of a standard sequence of moves that everyone who worked on the problem previously started by doing," Tao says. The LLM took an entirely different route, using a formula that was well known in related parts of math, but which no one had thought to apply to this type of question.

"The raw output of ChatGPT's proof was actually quite poor. So it required an expert to kind of sift through and actually understand what it was trying to say," Lichtman says. But now he and Tao have shortened the proof so that it better distills the LLM's key insight.

More importantly, they already see other potential applications of the AI's cognitive leap. "We have discovered a new way to think about large numbers and their anatomy," Tao says. "It's a nice achievement. I think the jury is still out on the long-term significance."

Lichtman is hopeful because ChatGPT's discovery validates a sense he's had since graduate school. "I had the intuition that these problems were kind of clustered together and they had some kind of unifying feel to them," he says. "And this new method is really confirming that intuition."

Editor's Note (4/28/26): This article was edited after posting to correct the description of the Erdős sum and to clarify Jared Duker Lichtman's full name.

AI infrastructureagents

Meta signs agreement with AWS to power agentic AI on Amazon's Graviton chips

Meta is deploying tens of millions of AWS Graviton5 processors for agentic AI workloads, signaling a major shift from GPU-focused infrastructure toward CPU-intensive agent reasoning and orchestration.

Summary

What: Meta has partnered with AWS to deploy Graviton5 processors at massive scale—starting with tens of millions of cores—to power CPU-intensive AI agent workloads like real-time reasoning, code generation, and multi-step task orchestration rather than relying solely on GPUs for training.
Why it matters: This partnership highlights an emerging infrastructure divide in AI: while GPUs remain critical for training large models, the explosion of agentic AI systems that need to reason, plan, and execute complex workflows in real-time creates massive demand for purpose-built, energy-efficient CPUs that can handle billions of coordinated interactions.
Takeaway: If building AI agent systems, evaluate whether CPU-optimized infrastructure like Graviton instances could be more cost-effective than GPU-based deployments for inference and orchestration workloads.

Deep Dive

  • Meta becomes one of the largest AWS Graviton customers globally, deploying tens of millions of cores with room to expand as AI capabilities grow
  • Graviton5 features 192 cores per chip with 5x larger cache than previous generation, reducing inter-core communication delays by up to 33%
  • The deal reflects architectural shift in AI infrastructure: GPUs for model training, CPUs for inference and agentic orchestration at scale
  • Agentic AI workloads differ from traditional inference—they require sustained CPU power for reasoning chains, code generation, search coordination, and multi-step task execution
  • Built on AWS Nitro System providing bare-metal instance access while maintaining standard networking (ENA) and storage (EBS) interfaces Meta relies on
  • Supports Elastic Fabric Adapter (EFA) for low-latency, high-bandwidth communication between instances—critical for distributed agent workflows
  • Manufactured using 3nm process technology, delivering 25% better performance than Graviton4 while maintaining energy efficiency
  • AWS controls full stack from chip design through server architecture, enabling optimization impossible with off-the-shelf processors
  • Partnership helps Meta handle billions of user interactions while meeting sustainability targets through improved energy efficiency
  • Deployment supports Meta's existing AWS relationship and Amazon Bedrock usage for next-generation AI development

Decoder

  • Agentic AI: Autonomous AI systems that can reason, plan, and complete complex multi-step tasks rather than just generating single responses
  • Graviton: AWS's custom ARM-based processors designed for energy-efficient cloud computing, now in fifth generation
  • AWS Nitro System: Dedicated hardware and software layer that offloads virtualization functions to provide bare-metal performance with cloud flexibility
  • Elastic Fabric Adapter (EFA): AWS networking interface enabling low-latency, high-bandwidth communication between cloud instances for distributed workloads
  • 3nm chip technology: Manufacturing process that creates smaller, more power-efficient transistors (3 nanometers wide)

Original Article

Key takeaways

  • The deployment starts with tens of millions of Graviton cores, with the potential to expand.
  • Meta is now one of the largest Graviton customers in the world.
  • The deal builds on Meta's long-standing AWS relationship and use of Amazon Bedrock at scale to support its next generation of AI.

Meta has signed an agreement to deploy AWS Graviton processors at scale. The deal marks a significant expansion of a long-standing partnership between the two companies as Meta builds its next generation of AI. The deployment starts with tens of millions of Graviton cores, with the flexibility to expand as Meta's AI capabilities grow. The deal reflects a shift in how AI infrastructure gets built: while GPUs remain essential for training large models, the rise of agentic AI is creating massive demand for CPU-intensive workloads—real-time reasoning, code generation, search, and orchestrating multi-step tasks. Graviton5 is purpose-built for these workloads, giving Meta the processing power to run them efficiently at scale. The chips will power various workloads at Meta, including supporting the company's AI efforts. That work requires infrastructure that can handle billions of interactions while coordinating complex, multi-step agent workflows—exactly the kind of CPU-intensive work Graviton is designed for.

AWS Graviton chips powering AI workloads

As organizations increasingly adopt agentic AI—autonomous systems that can reason, plan, and complete complex tasks—the demand for high-performance, energy-efficient compute infrastructure has never been greater. Meta is building at the forefront of agentic AI, and its broad Graviton deployment reflects a simple reality: agentic workloads like code generation, real-time reasoning, and frontier model training are CPU-intensive, and purpose-built chips are the most efficient way to power them.

The Graviton5 chip features 192 cores and a cache that is five times larger than the previous generation, which reduces delays in how quickly those cores communicate with each other by up to 33%. That means faster data processing with greater bandwidth—key requirements for agentic AI systems that need to continuously reason through and execute multi-step tasks.

Graviton is built on the AWS Nitro System, which uses dedicated hardware and software to deliver high performance, high availability, and high security. The Nitro System enables bare-metal instances for direct access to the hardware while providing the same familiar Elastic Network Adapter (ENA) and Amazon Elastic Block Store (Amazon EBS) devices that allow Meta to run its own virtual machines without performance compromises. The range of Graviton5 instances also supports the Elastic Fabric Adapter (EFA), enabling low-latency, high-bandwidth communication between instances. This is essential for Meta's agentic AI workloads, where large-scale tasks need to be distributed across many processors working in coordination.

As a longtime AWS customer, Meta has relied on AWS's highly scalable and secure cloud infrastructure to power its global businesses.

"This isn't just about chips; it's about giving customers the infrastructure foundation, as well as data and inference services, to build AI that understands, anticipates, and scales efficiently to billions of people worldwide," said Nafea Bshara, vice president and distinguished engineer, Amazon. "Meta's expanded partnership, deploying tens of millions of Graviton cores, shows what happens when you combine purpose-built silicon with the full AWS AI stack to power the next generation of agentic AI."

"As we scale the infrastructure behind Meta's AI ambitions, diversifying our compute sources is a strategic imperative. AWS has been a trusted cloud partner for years, and expanding to Graviton allows us to run the CPU-intensive workloads behind agentic AI with the performance and efficiency we need at our scale," said Santosh Janardhan, head of infrastructure, Meta.

Energy efficiency benefits of Graviton

AWS Graviton5 is built on 3-nanometer chip technology—a manufacturing process that produces smaller, more efficient processors. Because AWS designs its chips from the ground up and controls the full process from chip design through server architecture, it can optimize performance and efficiency in ways that off-the-shelf processors can't match. The result is infrastructure that delivers stronger performance while maintaining leading energy efficiency, helping Meta pursue ambitious AI goals while staying on track with sustainability targets. Graviton5 delivers up to 25% better performance than the previous generation.

As the demand for AI compute grows across the industry, the efficiency of the underlying infrastructure becomes increasingly important—both for managing costs and reducing environmental impact. The deal signals a new chapter in how large-scale AI infrastructure gets built—and how purpose-built chips like Graviton can help companies like Meta deliver smarter, more personalized experiences to billions of people worldwide.

AI enterpriseinfrastructure

Sovereign Labs Are Overkill for Enterprise AI

Sovereign AI labs are being oversold to enterprises who actually just need private deployment and data control, not expensive national-scale foundation model training.

Summary

What: A critical analysis arguing that sovereign AI labs (national initiatives to build independent foundation models) make sense for governments but are overkill for enterprises, who conflate the need for data sovereignty with the need to pre-train their own frontier models.
Why it matters: This matters because enterprises are being pitched expensive sovereign lab infrastructure when their actual requirements—data residency, auditability, and regulatory compliance—can be met more cheaply with self-hosted open models and proper data isolation controls.
Takeaway: Evaluate your AI sovereignty needs by asking where regulated data flows, not which model to use—consider self-hosting open models like Llama or DeepSeek with proper data isolation rather than building or buying sovereign foundation models.

Deep Dive

  • The sovereign lab pitch makes seven claims (sovereign data, weights, compute, cultural fit, jurisdictional control, supply chain independence, strategic autonomy) but only 1.5 actually hold up in practice
  • Most sovereign labs use the same training data (Common Crawl), architectures (Llama/DeepSeek derivatives), and supply chains (NVIDIA chips from Taiwan) as everyone else, just with different branding
  • The only genuine advantage sovereign labs have is cultural and linguistic fit for local languages, but GPT and Claude are closing this gap with each release
  • What enterprises actually mean by "sovereign AI" is five things: regulated data stays in jurisdiction, no data leakage to third parties, auditability, vendor independence, and local language/workflow support
  • The practical solution is two levers used together: local deployment (self-host open models on controlled infrastructure) and local isolation of sensitive data (keep regulated data from reaching model providers)
  • This approach lets you run self-hosted Llama for sensitive workloads while still calling frontier APIs like GPT-5 for non-sensitive tasks, maintaining sovereignty boundaries at the data level (see the sketch after this list)
  • Sovereignty is a property of data flows, not model nationality—the right question is "where does regulated data go?" not "whose model is this?"
  • National labs make sense for defense, intelligence, and government use cases where data genuinely cannot cross borders, but not for most enterprise scenarios
  • The sovereign lab industry is driven by GPU sellers' revenue incentives and VC growth-stage investment theses, not genuine enterprise needs
  • Recent example: Aleph Alpha (Germany) and Cohere (Canada) merged at $20B valuation, positioning as sovereign alternatives despite using similar underlying technology stacks
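
The two levers reduce to routing on data classification rather than model nationality. A minimal sketch, assuming a self-hosted vLLM server exposing an OpenAI-compatible endpoint; the URL, model names, and the classification flag are placeholders:

```python
from openai import OpenAI

# Lever 1: local deployment. A self-hosted open model behind an
# OpenAI-compatible server (vLLM here) on infrastructure you control.
local = OpenAI(base_url="http://vllm.internal:8000/v1", api_key="unused")
frontier = OpenAI()  # reads OPENAI_API_KEY; fine for non-sensitive work

def complete(prompt: str, contains_regulated_data: bool) -> str:
    """Lever 2: local isolation. Regulated data never leaves the
    jurisdiction; everything else may use a frontier API."""
    if contains_regulated_data:
        client, model = local, "meta-llama/Llama-3.1-70B-Instruct"
    else:
        client, model = frontier, "gpt-5"  # placeholder frontier model
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```

In practice the `contains_regulated_data` flag is the hard part: it comes from a data classifier or from which system originates the request, and it is exactly the "where does regulated data go?" question the piece argues for.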

Decoder

  • Sovereign Lab: A national initiative to build and control AI foundation models domestically, independent of foreign providers like OpenAI or Anthropic
  • Sovereign AI: Umbrella term covering both sovereign pre-training (building national models) and sovereign deployment (private inference with data residency)
  • CLOUD Act: U.S. law allowing American authorities to access data stored by U.S. companies regardless of physical location, relevant for AWS/Azure sovereign cloud claims
  • GDPR/DPDPA: Data protection regulations (European and Indian respectively) requiring data to stay within specific jurisdictions
  • vLLM/Ollama: Open-source tools for serving and running large language models on your own infrastructure
  • Common Crawl: Large public web crawl dataset used to train most foundation models, undermining claims of truly sovereign training data

Original Article

The national lab thesis is legitimate for nations, but for everyone else, it's a solution to a problem they don't have.

AI devops

Anthropic tests new Bugcrawl tool for Claude Code bug detection

Anthropic is testing Bugcrawl, a new Claude Code feature that scans entire repositories to detect bugs and suggest fixes.

Summary

What: Bugcrawl is an unreleased feature appearing in Claude Code's UI that lets developers select a repository and have Claude scan the entire codebase for general bugs and propose fixes, with warnings about high token consumption for larger repositories.
Why it matters: This positions Claude Code as a full-spectrum code quality tool alongside its existing Security and Code Review features, and escalates competition with OpenAI, xAI, and Google in the shift from single-file AI assistants to repository-wide autonomous agents.
Takeaway: Enterprise teams should prepare for high token costs if they plan to use this when it launches, and consider starting with smaller repositories to test the feature.

Deep Dive

  • Bugcrawl appears in Claude Code's sidebar with a repository picker and a warning about high token consumption
  • The feature targets general bug detection and fixes, complementing Security (vulnerabilities) and Code Review (PR-level analysis)
  • Anthropic could potentially extend it to end-to-end testing where Claude runs the app locally and walks through user flows
  • Fits into Claude Code's rapid expansion: Code Security in February 2026, Code Review in March 2026
  • Part of industry-wide shift toward repository-wide AI agents, competing with OpenAI Codex, xAI Grok Build, and Google Jules
  • High token costs suggest it's aimed at Team and Enterprise tiers
  • Not in production yet, no release date, but rapid feature cadence suggests research preview likely soon

Original Article

Anthropic appears to be building a tool within Claude Code called Bugcrawl, which surfaces as a dedicated entry in the side navigation. Once opened, the screen presents a repository selection UI alongside a warning that the feature consumes tokens at a high rate, so it's suggested to start with a small repository before pointing it at anything substantial. That caveat alone hints at the scale of work the agent would be carrying out in the background.

The most plausible read is that Bugcrawl will set Claude loose across an entire codebase to hunt for general bugs and propose fixes, while the Security tab already shipping in Claude Code for Enterprises targets vulnerabilities specifically. If Anthropic pushes the concept further, the same loop could extend into end-to-end product testing, where Claude spins up a local instance of the app, walks through user flows, and reports regressions. How feature specifications or test criteria would be passed into a run is still an open question, since the only screen visible so far is the repository picker.

For Anthropic, the move slots cleanly into the Claude Code expansion of recent months, which has already produced Claude Code Security in February and Claude Code Review in March, both built around multi-agent investigation of code. Bugcrawl would round out that lineup by tackling general correctness and quality, the broader, fuzzier category that sits between security scanning and PR-level review. It also fits the wider competitive picture, with OpenAI's Codex, xAI's Grok Build, and Google's Jules each pushing toward agents that reason across full repositories rather than single files.

The likely audience is engineering teams on Team and Enterprise tiers, where the token burn warning is easier to absorb. No release window has surfaced, and the feature does not appear in production builds. Given the cadence of Code Security and Code Review landing within weeks of each other, a research preview on the same web surface looks like the most likely path.

DevOps securityinfrastructure

HashiCorp Vault 2.0 Marks Shift to IBM Lifecycle with New Identity Federation

HashiCorp Vault 2.0 is the first major release under IBM ownership, adding workload identity federation to eliminate static cloud credentials while introducing breaking changes and a two-year support lifecycle.

Summary

What: Vault 2.0 jumps from version 1.21 to adopt IBM's versioning model after the acquisition, introducing workload identity federation that uses OIDC tokens to authenticate with AWS, Azure, and GCP without long-lived credentials, plus SCIM provisioning, SPIFFE support, and PKI automation enhancements.
Why it matters: This release signals Vault's direction after HashiCorp's 2023 license change to Business Source License sparked the OpenBao fork, and addresses a critical security gap by removing static credential requirements during secret synchronization across cloud providers.
Takeaway: Review the migration documentation if running Vault 1.x, particularly Azure authentication configurations that now require explicit settings instead of environment variable fallbacks.

Deep Dive

  • The version jump from 1.21 to 2.0 reflects IBM's acquisition and support model shift, guaranteeing at least two years of standard support for major releases under the IBM Support Cycle-2 policy
  • Workload Identity Federation eliminates the need for static credentials when syncing secrets to cloud providers by using short-lived OIDC tokens, reducing the attack surface for credential leakage during synchronization (see the sketch after this list)
  • Internal storage engine modifications target performance improvements for high-volume operations like real-time encryption and authentication at enterprise scale
  • Breaking changes remove legacy components to simplify codebase maintenance, including Azure authentication now requiring explicit configuration rather than environment variable defaults (enforcement of changes that began in 1.20)
  • Beta SCIM 2.0 support enables automated provisioning of Vault entities and groups from external identity platforms, reducing manual identity management overhead
  • SPIFFE JWT-SVID support allows workloads to participate in SPIFFE-based identity meshes, bridging proprietary HashiCorp features with open standards
  • Enhanced PKI secret engine automation for certificate issuance and renewal aligns with zero-trust networking principles by reducing manual credential management risks
  • The release comes as teams evaluate Vault against cloud-native alternatives like AWS Secrets Manager and Azure Key Vault (tighter platform integration but less portability) and managed services like Akeyless and Doppler (no operational overhead)
  • The 2023 license change from Mozilla Public License to Business Source License prompted the community-driven OpenBao fork, making IBM's stewardship particularly important to the community
  • This is the first major version increment since version 1.0 launched in 2018, representing eight years of feature development under the 1.x line
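
The federation itself is configured on the Vault server, but the client half of the pattern noted above (exchanging a short-lived OIDC token for a lease-bound Vault token, with no static secret stored anywhere) can be sketched with the hvac Python library. The address, role, token path, and secret path below are hypothetical:

```python
from pathlib import Path

import hvac

# Hypothetical Vault address.
client = hvac.Client(url="https://vault.example.com:8200")

# A short-lived OIDC token issued to the workload (path is hypothetical;
# on Kubernetes this is typically a projected service account token).
workload_jwt = Path("/var/run/secrets/tokens/oidc-token").read_text()

# Exchange the OIDC token for a lease-bound Vault token.
resp = client.auth.jwt.jwt_login(role="payments-app", jwt=workload_jwt)
client.token = resp["auth"]["client_token"]

# Read a secret with the short-lived token; there is no long-lived key to leak.
secret = client.secrets.kv.v2.read_secret_version(path="payments/db-creds")
print(secret["data"]["data"])
```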

Decoder

  • OIDC tokens: OpenID Connect tokens are short-lived authentication credentials that prove identity without storing long-term secrets
  • SCIM: System for Cross-domain Identity Management, a standard protocol for automating user and group provisioning across systems
  • SPIFFE: Secure Production Identity Framework For Everyone, an open standard for workload identity in distributed systems
  • JWT-SVID: JSON Web Token SPIFFE Verifiable Identity Document, a cryptographically-signed token format used in SPIFFE identity attestation
  • PKI: Public Key Infrastructure, the framework for managing digital certificates and encryption keys
  • Business Source License: A source-available license that restricts commercial use until code ages, then converts to open source (unlike fully open Mozilla Public License)
  • Static credentials: Long-lived access keys or passwords that don't expire automatically, creating security risks if leaked

Original Article

HashiCorp released Vault 2.0 under IBM's versioning model with two-year support, introducing identity-based security, workload identity federation without static credentials, performance improvements, and breaking changes while adding SCIM, SPIFFE support, and enhanced PKI automation.

DevOps aiinfrastructurekubernetesllm

DigitalOcean Dedicated Inference: A Technical Deep Dive

DigitalOcean launched Dedicated Inference, a managed service for hosting large language models on dedicated GPUs with KV cache-aware routing and predictable economics for high-volume inference workloads.

Summary

What: DigitalOcean Dedicated Inference is a managed LLM hosting service that deploys models on dedicated GPUs using vLLM and Kubernetes, designed for teams that need consistent, high-volume inference beyond pay-per-token serverless offerings. The service handles cluster lifecycle, routing, and scaling while giving users control over model choice, capacity, and performance tuning through OpenAI-compatible APIs.
Why it matters: Fills the gap between serverless inference (unpredictable costs at scale) and DIY GPU management (heavy operational burden). The KV cache-aware routing is particularly notable—it directs requests to replicas that already cached prompt prefixes, avoiding redundant computation and improving cost efficiency for workloads with reusable context.
Takeaway: Evaluate whether your inference workload has sustained, high-volume demand where dedicated GPU pricing would beat pay-per-token models, especially if you need bring-your-own-model capabilities or VPC-isolated deployments.

Deep Dive

  • Separates control plane (endpoint lifecycle management via regional API services) from data plane (direct inference traffic through VPC-native load balancers to GPU nodes)
  • Uses vLLM as the core inference engine paired with LLM-d for Kubernetes-native distributed inference patterns
  • Implements intelligent routing via Kubernetes Gateway API with an Inference Extension that understands queue depth, replica health, and KV cache affinity
  • Endpoint Picker component on CPU nodes selects optimal GPU replica based on whether it already cached the prompt prefix, not just round-robin distribution
  • One DOKS cluster per VPC can host multiple isolated Dedicated Inference deployments using Kubernetes namespaces and Custom Resource Definitions
  • Exposes both public and private endpoints through regional network load balancers, supporting internet and VPC-only access patterns
  • Platform manages Kubernetes capacity provisioning, GPU pool coordination, gateway configuration, and reconciliation loops; users control model selection, replica counts, and scaling policies
  • Designed for workloads where predictable GPU-hour economics matter more than burst elasticity—think coding assistants serving 2,000 concurrent engineers, not occasional API calls
  • OpenAI-compatible API surface means existing client libraries work without modification
  • Lifecycle management follows Kubernetes operator reconciliation pattern: observe desired state, act with retries and backoff, surface clear status instead of partial failures

Decoder

  • vLLM: Open-source inference engine optimized for serving large language models on GPUs with efficient memory management
  • KV cache: Saved attention key-value tensors from previously processed tokens; reusing cached prefixes avoids recomputing the same prompt context
  • KV cache-aware routing: Load balancing strategy that prefers sending requests to replicas that already cached relevant prompt prefixes, reducing redundant computation
  • Kubernetes Gateway API: Standard API for configuring HTTP/TLS routing into Kubernetes clusters, successor to Ingress
  • Control plane vs data plane: Architectural split where control plane handles management operations (create/update/delete endpoints) and data plane handles high-throughput inference requests
  • LLM-d: Kubernetes-oriented stack for distributed inference with prefix-cache-aware routing and LLM-specific scaling patterns
  • DOKS: DigitalOcean Kubernetes Service, their managed Kubernetes offering
  • Replica: Horizontally scaled copy of the same model server running in a separate pod/process
  • Day-two operations: Ongoing operational tasks after initial deployment—monitoring, scaling, upgrades, incident response

Original Article

DigitalOcean Dedicated Inference: A Technical Deep Dive

Getting a model to answer 10 inference requests concurrently is tricky but simple enough; getting it to handle 2,000 engineers hitting a coding assistant with long contexts, all day, without runaway costs, is where teams stall. A working endpoint is only the beginning. Teams need to identify the supporting hardware and wire up the right components—serving, scaling, observability, and cost guardrails—so the deployment can support expected SLAs and SLOs under real, sustained load.

DigitalOcean already offers Serverless Inference on the DigitalOcean AI Platform: a fast path to models from OpenAI, Anthropic, Meta, or other providers, with minimal setup and token-based pricing. This offering works well for many use cases. However, when you need your own weights, predictable performance on dedicated GPUs, and economics that favor sustained, high-volume token generation over pay-per-token bursts, a different approach makes sense.

Dedicated Inference, our managed LLM hosting service on the DigitalOcean AI Platform, fills that gap.

Dedicated Inference deploys and operates an opinionated inference stack on dedicated GPUs, with Kubernetes-native orchestration under the hood. You interact through the control plane and APIs you already use in the DigitalOcean ecosystem; the data plane exposes public and private endpoints so applications inside, or outside, your VPC can call your models securely.

The service is designed to collapse a vast combinatorial space—GPU SKUs, runtimes, routers, autoscaling policies—into guided defaults so teams hit production milestones faster than DIY stacks, while retaining knobs that matter for model serving: replicas, scaling behavior, and advanced optimizations as you roll out your product roadmap.

What we manage vs. what you control

Every managed product draws a line between operator-owned and customer-owned concerns. Dedicated Inference aims to put day-two operations—cluster lifecycle integration, ingress, core serving and routing components, and the glue between them—on the platform side, while leaving model choice, capacity, and workload-specific tuning with you.

Typically platform-managed:

  • Provisioning and lifecycle of the underlying orchestration footprint in line with product design (for example, managed Kubernetes integration and GPU pool coordination).
  • Core inference engine and orchestration integration, including patterns that matter at scale: intelligent routing, autoscaling hooks, and production-oriented serving paths.
  • Endpoint creation, health and scaling workflows, and the operational automation required to keep endpoints aligned with declared configuration.

In your hands:

  • Selecting models (including bring-your-own-model paths where supported), GPU profiles, and replica counts appropriate to your SLOs and budget.
  • Configuring scaling behavior and, over time, advanced serving options that map to your latency, throughput, and cost goals.
  • Connecting applications via stable HTTP APIs consistent with common LLM client stacks.

Dedicated Inference overview

Dedicated Inference builds on industry-standard building blocks so customers benefit from community momentum and continuous improvement:

  • vLLM as a capable, widely adopted inference engine for large language models on modern GPUs.
  • LLM-d as a Kubernetes-oriented stack for distributed inference patterns—precise prefix-cache aware routing, scaling concerns that differ from traditional HTTP services, and room to grow into more advanced topologies as workloads demand.

This combination reflects a deliberate choice: meet customers where they are today (OpenAI-compatible APIs, familiar GPU offerings on DigitalOcean) while staying aligned with where the ecosystem is moving on routing, replication, and scale-out inference.

For readers who want more depth on why LLM routing differs from classic load balancing—and how prefix cache awareness changes the game—see our article on Load Balancing and Scaling LLM Serving.

High-level architecture

The system design separates a control plane (how endpoints are created, updated, listed, and deleted) from a data plane (how chat/completions request traffic reaches your models). Management requests take a path built for regional placement and durable lifecycle work; inference requests take a direct, low-latency path in front of your GPUs.

Control plane: central entry, regional execution

What does "control plane" mean here? In this split, the control plane is everything that handles management traffic: management rpc for Dedicated Inference endpoints, plus the durable bookkeeping that turns your declared intent into running DI infrastructure. It is separate from the data plane, which is the hot path for inference (chat/completions-style) requests once an endpoint is healthy.


Central API layer: Requests originating from the DigitalOcean Cloud UI, automation workflows, or the public API are routed through a centralized API layer first. This layer maintains the mapping of endpoint ownership by region and transparently forwards each request to the appropriate regional backend. This design follows a multi-region fan-out model, where regional endpoints are addressed using stable identifiers.

Regional Dedicated Inference service: Each region operates a control-plane service responsible for the full lifecycle management of instances within its scope. This includes persisting the desired state, reconciling it with the observed state, advancing lifecycle status (e.g., creating → active), and enqueuing the workflows that provision or mutate the underlying infrastructure. In this context, lifecycle refers to the state machine governing transitions from requested to running and reachable. An instance represents the managed inference deployment as a whole—encompassing both its control-plane record and the associated backing resources.

Separate worker-style components perform integrations that need retries, backoff, and idempotency—calling the Kubernetes API, watching object status, and publishing lifecycle updates back to the core service. This is similar to the reconciler pattern familiar from Kubernetes operators: observe desired state, act, repeat until reality matches intent. Use case: transient API errors or slow node startups do not wedge the user-facing API; the system keeps trying and surfaces a clear status instead of a half-applied state.

DigitalOcean Kubernetes and capacity: The platform first helps ensure that sufficient DigitalOcean Kubernetes capacity is available within the target VPC, attaches the required GPU and supporting CPU node capacity, rolls out the managed inference stack (gateway and model workloads), and creates regional network load balancers for public and private endpoints.

Data plane: VPC-native traffic, gateway-routed requests


Client Contract & Endpoint connectivity: Once an endpoint is active, clients send OpenAI-compatible API requests (for example, HTTPS to /v1/chat/completions-style routes). Your public endpoint FQDN resolves to an external regional network load balancer (L4); your private endpoint resolves to an internal regional network load balancer, so the same inference stack can be reached from the public internet or stays on your private VPC network. In both cases, traffic is OpenAI-shaped JSON carrying model ID, messages, and generation parameters.
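
Because the contract is OpenAI-compatible, the stock openai Python client should work unchanged once pointed at the endpoint; the base URL, API key, and model name here are placeholders:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://my-endpoint.example.com/v1",  # your endpoint FQDN (placeholder)
    api_key="YOUR_KEY_HERE",                        # placeholder credential
)

resp = client.chat.completions.create(
    model="my-hosted-model",  # whatever model the endpoint serves (placeholder)
    messages=[{"role": "user", "content": "Summarize KV cache-aware routing."}],
)
print(resp.choices[0].message.content)
```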

Cluster placement and VPC isolation: Inside the VPC, workloads run on DOKS. One cluster per VPC can host multiple Dedicated Inference deployments, each isolated by a Kubernetes namespace. Desired gateway and model wiring is expressed as Custom Resources (instances of CRDs, Custom Resource Definitions): declarative objects you kubectl apply (or the platform applies for you) instead of imperative scripts.

Inference Gateway and Kubernetes Gateway API: After the NLB, traffic hits the Inference Gateway, implemented with the upstream Kubernetes Gateway API—the community standard for describing HTTP/TLS routing into a cluster.

Gateway API Inference Extension (inference-aware routing): Below the gateway, the Gateway API Inference Extension teaches routing about inference signals: queue depth, replica health, and KV cache affinity (preferring a replica that already holds key/value tensors for a reusable prompt prefix so work is not recomputed from scratch). KV cache is the saved attention state for prior tokens; inference-aware routing is deliberately not simple round robin, because the cheapest replica is often the one that already cached your prefix.

Endpoint Picker: On CPU nodes, the Endpoint Picker is the component that talks to the Inference Extension and selects which GPU replica should receive each request.

Model Service and inference pools: On GPU nodes, the Model Service—backed by inference pools in configuration—runs your model replicas (distinct pods/processes that can serve the same model ID). Each replica reports load and KV cache metadata upstream so the Endpoint Picker's choices stay accurate through rollouts, crashes, and autoscaling events. A replica is a horizontally scaled copy of the same model server; a pool is the grouping of those replicas for routing and capacity.
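
As a conceptual sketch only, not DigitalOcean's actual Endpoint Picker logic, the selection idea reduces to a score that rewards cached-prefix reuse and penalizes queue depth; the weights here are arbitrary illustration:

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    queue_depth: int
    cached_prefixes: set[str] = field(default_factory=set)

def longest_cached_prefix(replica: Replica, prompt: str) -> int:
    # Length of the longest cached prefix this prompt starts with (0 if none).
    return max(
        (len(p) for p in replica.cached_prefixes if prompt.startswith(p)),
        default=0,
    )

def pick_replica(replicas: list[Replica], prompt: str) -> Replica:
    # Reward prefix reuse, penalize queue depth; weights are illustrative.
    return max(
        replicas,
        key=lambda r: longest_cached_prefix(r, prompt) - 10 * r.queue_depth,
    )

replicas = [
    Replica("gpu-0", queue_depth=2,
            cached_prefixes={"You are a helpful coding assistant."}),
    Replica("gpu-1", queue_depth=0),
]
# gpu-0 wins despite its queue: it already holds the prompt's KV prefix.
print(pick_replica(replicas, "You are a helpful coding assistant. Fix this bug.").name)
```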

Who is Dedicated Inference for?

Dedicated Inference is aimed at builders who already know they need GPUs, but who would rather not become a full-time inference platform team:

  • Teams that self-host on raw GPU Droplets or Kubernetes and want to offload orchestration, baseline optimizations, and repetitive infra work while keeping API-level ownership of their applications.
  • Teams that have graduated from Serverless Inference and need hardware-level control or BYOM without abandoning managed operations.
  • Organizations with consistent inference demand where predictable GPU-hour economics and performance isolation matter more than pure burst elasticity.

Inference is no longer a novelty layer; it is part of the core application stack. That shift raises the bar for reliability, performance, and cost predictability. Dedicated Inference is for teams that need production-grade, dedicated GPU inference with a managed path from model selection to a stable endpoint—so you spend engineering cycles on products and prompts, not on reinventing the serving platform.

DevOps securitykubernetesinfrastructure

Building a PCI-DSS Compliant GKE Framework for Financial Institutions: Data Protection, Governance…

This guide shows how to build a PCI-DSS compliant payment processing system on Google Kubernetes Engine using customer-controlled encryption keys and tokenization to secure cardholder data.

Summary

What: A technical implementation guide for achieving PCI-DSS compliance on Google Kubernetes Engine, focusing on customer-managed encryption keys (CMEK) that give full control over data encryption, tokenization to reduce PCI scope, and audit logging across GKE, GCS, and BigQuery. Part 4 of a series covering data protection and governance.
Why it matters: PCI-DSS compliance is mandatory for payment processing but most teams struggle with scope - by default, every system touching cardholder data must be compliant. This framework shows how to use CMEK for instant breach response (disable keys to make exfiltrated data unreadable) and tokenization to keep actual card numbers out of application databases entirely, dramatically reducing which systems need full PCI compliance.
Takeaway: If you're building payment systems on GKE, start by implementing CMEK on your cluster and storage layers, then isolate actual PAN storage in a tokenization vault service to minimize your PCI-scoped infrastructure.

Deep Dive

  • PCI-DSS compliance hinges on two questions: where is cardholder data, and how do you prove nothing happened to it - data protection and audit logging must work together
  • Google encrypts everything at rest by default, but PCI Requirement 3 requires you control your own encryption keys - Google-managed keys don't let you instantly revoke access during a breach
  • CMEK provides an emergency stop: if you suspect a breach, disable the key version and all encrypted data becomes unreadable even if already exfiltrated (see the sketch after this list)
  • The encryption flow works by intercepting writes and calling Cloud KMS to encrypt data with your key before storing it on disk, GCS, or BigQuery
  • Keys should rotate every 90 days per PCI requirements - rotation creates new key versions but doesn't re-encrypt existing data, old versions remain for decryption
  • Enable CMEK on GKE node boot disks and configure a database encryption key for etcd - without this, Kubernetes Secrets are only base64 encoded, not encrypted
  • Critical distinction: CVV must NEVER be stored, not encrypted, not hashed - a database column for CVV is an immediate PCI finding regardless of other protections
  • Tokenization dramatically reduces PCI scope by replacing PANs with random tokens - only the tokenization vault stores actual PANs, everything else sees tokens
  • Without tokenization every microservice touching orders is PCI-scoped; with it only the vault service requires full compliance
  • Implement the tokenization vault as an isolated namespace with NetworkPolicy allowing only payment-service ingress - not monitoring, not logging agents, not developers
  • Verify CMEK is working by checking that all objects in GCS buckets show your key, not Google-managed keys
  • The framework addresses assessor questions like "show me every time someone accessed cardholder data in the last 90 days" through automated logging across all GCP services
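
As a rough sketch of the "emergency stop" described above, assuming the google-cloud-kms client library and placeholder resource names, disabling a key version looks like this:

```python
from google.cloud import kms
from google.protobuf import field_mask_pb2

client = kms.KeyManagementServiceClient()

# Placeholder project, location, keyring, key, and version.
version_name = client.crypto_key_version_path(
    "my-project", "us-central1", "pci-keyring", "cardholder-data-key", "1"
)

# Mark the version DISABLED; data encrypted under it can no longer be
# decrypted, even if ciphertext was already exfiltrated.
key_version = kms.CryptoKeyVersion(
    name=version_name,
    state=kms.CryptoKeyVersion.CryptoKeyVersionState.DISABLED,
)

client.update_crypto_key_version(
    request={
        "crypto_key_version": key_version,
        "update_mask": field_mask_pb2.FieldMask(paths=["state"]),
    }
)
```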

Decoder

  • PCI-DSS: Payment Card Industry Data Security Standard - required compliance framework for any system processing credit card data
  • CMEK: Customer-Managed Encryption Keys - encryption keys you control versus Google-managed keys
  • PAN: Primary Account Number - the 16-digit credit card number, always requires protection
  • CHD: Cardholder Data - PAN plus associated data like cardholder name and expiration when stored together
  • SAD: Sensitive Authentication Data - magnetic stripe, CVV, PIN that must NEVER be stored after authorization
  • GKE: Google Kubernetes Engine - Google's managed Kubernetes service
  • DLP: Data Loss Prevention - scanning to detect sensitive data before it's stored improperly
  • QSA: Qualified Security Assessor - auditor who certifies PCI-DSS compliance
  • Cloud KMS: Google Cloud Key Management Service - manages encryption keys
  • etcd: Kubernetes' backing store for cluster data including Secrets

Original Article

This post details how to implement PCI-DSS-compliant data protection and audit logging on Google Kubernetes Engine (GKE). It covers customer-managed encryption keys (CMEK), tokenization, DLP scanning, and 12-month immutable audit trails. The implementation framework addresses specific PCI requirements by securing cardholder data with controlled encryption keys that can be instantly revoked during breaches, while maintaining automated logging across GKE clusters, GCS buckets, and BigQuery to answer assessor questions like "show me every time someone accessed cardholder data in the last 90 days."

DevOps monitoringinfrastructure

On Software Quality

Software quality depends on user perception shaped by repeated issues and UI/UX experience rather than isolated bugs, requiring symptom-based monitoring of user journeys.

Summary

What: An essay arguing that software quality is fundamentally about user perception and trust, which erodes through repeated issues (not single bugs unless catastrophic), and proposing a monitoring strategy focused on "golden paths" through symptom-based metrics backed by system-level alerts.
Why it matters: Challenges conventional defect-counting approaches to quality by explaining why partial outages, transient bugs, and good UX can maintain trust while technically stable systems with poor UX lose it—suggesting teams may be measuring the wrong things.
Takeaway: Map your product features by importance, identify golden paths (core user flows like login/play/exit), monitor them from the client perspective with symptom-based alerts (e.g., "mobile login is slow"), and back those with cause-based system metrics to debug when issues arise.

Deep Dive

  • Trust in software quality is slow to build but quick to destroy, though modern users tolerate short-lived bugs and only change perception after prolonged or repeated exposure to issues
  • Software quality fundamentally means "does this work the way I expect it to," which is subjective—a bug-free app can feel low-quality due to poor UX, while a buggy app with good UX may be forgiven
  • Netflix and Facebook strategically kept outages partial, affecting only some users at a time, so when users asked friends "is it working for you?" the answer was often yes, shifting blame from the platform to the user's setup
  • GitHub, Twitter (fail whale), and others had frequent transient issues but weren't seen as low-quality because problems affected edge cases, were infrequent, or were handled with delightful error UX
  • Recent examples show erosion: GitHub's major outages, Claude Code's repeated availability issues, and Railway's caching incident (leaking user data between accounts), which built on weeks of daily incidents
  • The ideal metric would be "defects per user per day" weighted by criticality, but this is nearly impossible since you can't categorize bugs you don't know exist
  • Symptom-based monitoring focuses on user experience signals (e.g., "login on mobile is slow") rather than component metrics (database query performance), which matters more for quality perception
  • Cause-based monitoring tracks the underlying system components; when a symptom alert fires but no cause-based alert does, it reveals a monitoring gap
  • To implement: map all product features, score by importance, identify golden paths (features required to complete core user flows), then monitor each from the client perspective for availability and performance (a minimal probe sketch follows this list)
  • Golden paths might be login/play/exit in a game, but not necessarily finding Facebook friends or viewing activity feeds—focus on what's essential to core functionality
  • UI/UX is the most important factor in software quality because it's what users interact with; a loading spinner vs blank screen during the same delay creates vastly different quality perceptions
  • Good error UX (like Twitter's fail whale or GitHub's unicorn) can actually make failures more tolerable and even delightful, papering over technical issues
  • The dual-layer approach ensures you capture both what users experience (symptoms) and why it's happening (causes), enabling both perception management and root cause fixes
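
A minimal sketch of such a symptom-based probe, with a hypothetical login endpoint and an arbitrary two-second budget:

```python
import time

import requests

LOGIN_URL = "https://app.example.com/api/login"  # hypothetical endpoint
SLOW_THRESHOLD_S = 2.0                           # "login feels slow" budget

def alert(symptom: str) -> None:
    # Stand-in for a paging/alerting integration.
    print(f"[SYMPTOM] {symptom}")

def probe_login() -> None:
    # Measure what the user would feel: success and end-to-end latency,
    # from the client side, not a component metric.
    start = time.monotonic()
    try:
        resp = requests.post(
            LOGIN_URL, json={"user": "probe", "password": "..."}, timeout=10
        )
        elapsed = time.monotonic() - start
        if resp.status_code != 200:
            alert(f"login failing: HTTP {resp.status_code}")
        elif elapsed > SLOW_THRESHOLD_S:
            alert(f"login slow: {elapsed:.2f}s")
    except requests.RequestException as exc:
        alert(f"login unreachable: {exc}")

probe_login()
```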

Decoder

  • Golden path: The essential user journey through a product—the minimum set of features users need to complete core functionality (e.g., login, perform main action, exit)
  • Symptom-based monitoring: Alerts based on user-facing problems like "checkout is failing" rather than underlying technical metrics, capturing quality as users experience it
  • Cause-based monitoring: Alerts on underlying system components (database latency, service errors) that cause user-facing symptoms, used for diagnosis and debugging
  • CSAT/NPS: Customer Satisfaction and Net Promoter Score—survey metrics that indirectly measure software quality through user sentiment rather than technical defects

Original Article

Software quality is driven by user perception—shaped more by repeated issues and UI/UX experience than by isolated bugs—making trust slow to build but easy to erode. To manage this, teams should focus on monitoring user “golden paths” with symptom-based metrics tied to underlying system signals, ensuring they capture both user experience and root causes effectively.

DevOps vectorperformanceinfrastructure

How we built Elasticsearch simdvec to make vector search one of the fastest in the world

Elasticsearch built a custom SIMD kernel library that speeds up vector distance computations by up to 4x through bulk scoring and memory latency hiding techniques.

Summary

What: simdvec is a hand-tuned SIMD kernel library written in native C++ and called from Java via the Panama foreign function interface that powers all vector distance computations in Elasticsearch. It supports multiple vector types (float32, int8, bfloat16, binary, and BBQ quantization) and provides architecture-specific optimizations for x86 (AVX-512) and ARM (NEON).
Why it matters: The biggest performance gain comes not from faster arithmetic but from hiding memory latency at production scale—when vector datasets exceed CPU cache limits, simdvec's bulk scoring with explicit prefetching on x86 or interleaved loading on ARM keeps the CPU pipeline busy while waiting for DRAM access, which is where real-world vector search workloads actually operate.
Takeaway: If you're building vector search systems or doing high-volume distance computations, design for bulk scoring APIs and explicit memory management over single-pair operations—the architecture matters more than raw SIMD speed once data exceeds cache.

Deep Dive

  • Elasticsearch built simdvec because existing solutions (Panama Vector API, FAISS, jvector) couldn't cover the full range of quantized vector types and bulk scoring requirements needed for production search workloads
  • The library provides hand-tuned native C++ kernels for every vector type Elasticsearch supports, with single-digit nanosecond FFI call overhead from Java
  • On single-pair float32 scoring at 1024 dimensions, simdvec is competitive with FAISS on x86 (28ns vs 23ns) and leads on ARM (70ns vs 156ns for FAISS, 110ns for jvector)
  • For int8 quantization, simdvec matches NumKong performance while supporting far more vector types through unified FFI interface
  • The real advantage emerges with bulk scoring: when scoring thousands of vectors with working sets exceeding L3 cache, simdvec outperforms alternatives by 1.7x to 4x (the bulk-API idea is illustrated after this list)
  • On x86, explicit prefetching reduces cache misses from 139K to 19K and doubles instructions per cycle, with performance advantage growing from 1.2x in L2 to 2.8x beyond L3 cache
  • On ARM, interleaved loading from four vectors simultaneously provides memory-level parallelism, reducing backend stalls by 40% with consistent 1.8x speedup regardless of dataset size
  • Two architectures require fundamentally different strategies: x86 uses sequential processing with prefetch instructions to pull future data into L1, while ARM uses interleaved access patterns to keep the out-of-order engine fed
  • Production vector indices typically don't fit in CPU cache (a 10M-vector int8 index at 1024 dimensions is 10GB), making memory latency hiding the dominant performance factor
  • Every vector search query—whether HNSW graph traversal, IVF candidate scoring, or reranking—executes millions of distance operations through the same simdvec engine, compounding kernel-level improvements into significant query latency and throughput gains
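
The simdvec kernels themselves are native C++, but the bulk-versus-single-pair API point can be illustrated in a few lines of numpy; this shows only the shape of the API, not the prefetching or memory-latency machinery:

```python
import numpy as np

rng = np.random.default_rng(0)
query = rng.standard_normal(1024).astype(np.float32)
candidates = rng.standard_normal((10_000, 1024)).astype(np.float32)

# Single-pair style: one call per vector (per-call overhead, no chance
# for the runtime to batch memory access).
scores_slow = np.array([candidates[i] @ query for i in range(len(candidates))])

# Bulk style: one matrix-vector product over the whole candidate block.
scores_fast = candidates @ query

assert np.allclose(scores_slow, scores_fast, rtol=1e-3, atol=1e-2)
```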

Decoder

  • SIMD: Single Instruction Multiple Data, a CPU feature that performs the same operation on multiple data points simultaneously using wide registers (256-bit or 512-bit)
  • AVX-512: Advanced Vector Extensions 512-bit, Intel's SIMD instruction set for x86 processors that can process 16 float32 values in parallel
  • NEON: ARM's SIMD instruction set architecture for parallel processing on ARM processors like Graviton
  • HNSW: Hierarchical Navigable Small World, a graph-based approximate nearest neighbor search algorithm
  • IVF: Inverted File index, a clustering-based search structure that partitions vectors into buckets
  • BBQ: Better Binary Quantization, an aggressive quantization technique that reduces memory by 32x
  • FFI: Foreign Function Interface, a mechanism for calling native C/C++ code from higher-level languages like Java
  • Panama: Java's Project Panama, which provides FFI capabilities and vector API for SIMD operations
  • Prefetching: Issuing CPU instructions to load data into cache before it's actually needed, hiding memory latency
  • Cache miss: When requested data isn't in CPU cache and must be fetched from slower DRAM, causing pipeline stalls

Original Article

Elasticsearch's simdvec is a hand-tuned SIMD kernel library that accelerates vector distance computations across all query types, using techniques like bulk scoring, prefetching, and architecture-specific optimizations to significantly outperform alternatives—especially at large scale when data exceeds CPU cache. Its biggest advantage comes not from raw compute speed but from efficiently hiding memory latency, enabling faster, more scalable vector search across diverse data types and hardware.

DevOps securitycicd

Spotting CI/CD misconfigurations before the bots do: Securing GitHub Actions with Datadog IaC Security

AI agents are now autonomously discovering and exploiting security flaws in GitHub Actions workflows, prompting new automated defenses.

Summary

What: Datadog launched IaC Security for GitHub Actions, a tool that scans CI/CD workflows before merge to detect misconfigurations like script injection vulnerabilities, excessive permissions, and unpinned dependencies that attackers increasingly exploit through automated campaigns.
Why it matters: The shift from manual security research to AI-driven automated exploitation of CI/CD pipelines means vulnerabilities are now discovered and attacked at scale, making proactive scanning essential rather than optional.
Takeaway: Audit your GitHub Actions workflows for common risks: untrusted input in run commands, workflows with write permissions, and dependencies that don't specify exact commit hashes.
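
As a rough starting point for that audit (a toy script, not Datadog's scanner), a few lines of Python can flag workflow lines that interpolate attacker-influenced event fields into expressions; the pattern list is illustrative, not exhaustive:

```python
import re
from pathlib import Path

# Coarse pattern: GitHub Actions expressions that interpolate
# attacker-influenced event fields (titles, bodies, branch names, etc.).
RISKY = re.compile(
    r"\$\{\{\s*github\.event\.(issue|pull_request|comment|review|head_commit)"
)

for wf in Path(".github/workflows").glob("*.y*ml"):
    for lineno, line in enumerate(wf.read_text().splitlines(), start=1):
        if RISKY.search(line):
            print(f"{wf}:{lineno}: untrusted event data interpolated: {line.strip()}")
```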

Decoder

  • IaC Security: Infrastructure as Code security scanning that analyzes configuration files for vulnerabilities before deployment
  • Injection: Attack where untrusted data is executed as code, often through GitHub Actions expressions in workflow files
  • Unpinned dependencies: Using third-party actions without specifying exact versions, allowing malicious updates to execute in your pipeline
  • Pre-merge scanning: Security checks that run during code review before changes reach production

Original Article

AI agents are increasingly capable of autonomously discovering and exploiting CI/CD misconfigurations, as demonstrated by a campaign targeting GitHub Actions workflows through injection, permissions abuse, and unpinned dependencies. Datadog IaC Security addresses these risks by scanning workflows pre-merge, enforcing best practices, and expanding detection coverage for triggers, supply chain integrity, and runtime security gaps.

DevOps infrastructurestreaming

Evolving Media CDN for the world's most demanding broadcast and streaming workloads

Google Cloud's Media CDN is adding regional shielding, better origin compatibility, and real-time monitoring tools to help streaming platforms handle massive live events more efficiently and cost-effectively.

Summary

What: Google announced improvements to Media CDN, their content delivery network that shares infrastructure with YouTube, focusing on features beyond raw scale: flexible shielding in select regions to reduce latency, support for larger 25MiB segments for 4K/8K content, monthly savings plans for predictable costs, and Monitoring as a Service for real-time visibility during live events like the Super Bowl or World Cup.
Why it matters: The shift from pure scale to operational tooling reflects the maturation of the streaming industry, where providers need predictable costs and proactive monitoring rather than just bandwidth, especially as broadcast-quality streaming becomes the expected baseline.
Takeaway: If you're building streaming infrastructure, review the Media CDN documentation for implementing flexible caching and broadcast-grade visibility features.

Deep Dive

  • Google Cloud Media CDN now offers flexible shielding in South Africa, the Middle East, and the US, allowing traffic to be cached regionally rather than fetching from distant origins, reducing both latency and cost
  • New origin compatibility features address real-world integration challenges: HEAD request support, increased maximum segment size from previous limits to 25MiB to accommodate 4K and 8K content, and multi-part range request support
  • The platform is shifting from pure pay-as-you-go pricing to monthly savings plans that provide total cost of ownership benefits for committed usage levels, addressing the need for financial predictability in production environments
  • Monitoring as a Service (MaaS) provides a "broadcast operating center" view that tracks everything from origin health to end-user quality of service, enabling teams to identify and mitigate issues before they impact viewers
  • The industry focus has shifted from whether platforms can handle massive scale to how well they solve operational and financial challenges, with real-time visibility being described as a fundamental requirement rather than a nice-to-have
  • During major live events, customers need immediate intervention capabilities rather than next-business-day support, driving the need for proactive, data-driven operational tools
  • The Media CDN shares infrastructure with YouTube, giving it proven capacity for handling peak traffic demands during global live events

Decoder

  • CDN (Content Delivery Network): A distributed network of servers that delivers content to users from geographically closer locations to reduce latency
  • Shielding: A caching layer that sits between edge servers and the origin, reducing the number of requests that reach the origin server by consolidating requests within a region
  • Origin: The source server where the original content is stored, which the CDN fetches from to populate its caches
  • MaaS (Monitoring as a Service): A managed monitoring solution that provides real-time visibility into streaming infrastructure health and performance

Original Article

Streaming platforms are evolving beyond scale to address operational and financial challenges through flexible architectures, interoperability, predictable pricing, and real-time visibility. Modern CDNs prioritize efficient global delivery and proactive monitoring to meet rising expectations for high-quality live event streaming.

DevOps aisoftware-engineering

LLM-assisted coding is not deterministic. Does it matter?

LLM-assisted coding doesn't need to be deterministic to be useful, because software development has never been deterministic and what actually matters is predictability of outcomes.

Summary

What: An argument that the non-deterministic nature of LLM-assisted coding is not a problem, since determinism (same inputs always producing same outputs) and predictability (ability to foresee results) are different concepts, and software development with human developers has never been deterministic either.
Why it matters: This reframes the debate about AI coding tools from theoretical concerns about determinism to practical concerns about whether systems produce reliable, testable outcomes—the same framework we already use for human developers, suggesting critics may be applying unfair standards to AI tools.
Takeaway: Focus on building verification practices (tests, staging environments, observability, rollbacks) around AI-generated code rather than worrying about whether the code generation process itself is deterministic.

Deep Dive

  • Determinism means same starting conditions always lead to same result, while predictability means being able to foresee results with available tools and knowledge—they are not synonyms
  • Weather demonstrates the distinction: governed by deterministic physics, but our ability to predict it has improved over decades through better measurements and computing power
  • Some systems are deterministic but computationally irreducible (can only know future state by simulating every step) or chaotic (tiny errors grow rapidly), making them practically unpredictable
  • Other systems like casino games are not deterministic at the individual event level but are statistically predictable across many trials
  • Software development has never been deterministic from a requester's perspective—you cannot predict exactly what code a developer will write, how long it will take, or which edge cases will fail
  • End users don't care if the development process is deterministic, they care if the resulting software behaves reliably and correctly
  • Modern software runs on stacks so complex (hardware, kernels, drivers, libraries, network conditions, containers, dependencies) that even deterministic code cannot have perfectly predicted behavior
  • The industry has already built practices around uncertainty: tests, staging environments, observability, rollbacks, and reproducible builds
  • DO-178C, the aviation industry standard for safety-critical software, is objective-oriented rather than prescriptive—it focuses on formulating objectives and verifying they're achieved, not on mandating specific methods or tools
  • The meaningful question is not whether AI coding is deterministic, but which workflow (human or AI) produces more predictable outcomes under real conditions for specific cases and stages

Decoder

  • Deterministic: A system property where the same initial conditions always produce the same result, with no randomness or variation in the process
  • Computational irreducibility: Concept by Stephen Wolfram describing systems where the only way to know the future state is to simulate every single step—there's no shortcut to prediction
  • DO-178C: Aviation industry standard for developing safety-critical airborne software, focused on objective verification rather than prescribing specific development methods or tools

Original Article

We often treat determinism and predictability as synonyms, but they are not the same.

A system is deterministic if the same starting conditions always lead to the same result. A system is predictable if we can actually foresee that result with the tools, time, and knowledge we have.

Determinism is a system characteristic.

Predictability, on the other hand, often depends on our capabilities, and it usually exists on a spectrum. Weather is a good example. The laws of physics governing the atmosphere have not changed, and they are deterministic. Yet our ability to predict the weather has improved over decades simply because our measurements, models, and computing power improved.

But it's not always on us. Stephen Wolfram has described the concept of computational irreducibility. A system is computationally irreducible if the only way to know its future state is to simulate every step. There are also chaotic systems where tiny measurement errors grow rapidly, making them practically unpredictable.

Some systems are both deterministic and predictable, like planetary motion over short time scales. Others are deterministic but not predictably so in practice, like weather or turbulent flows. Conversely, some systems are not deterministic at the level of individual events but are still predictable statistically, such as casino games or population averages.

| System Type | Deterministic | Predictable |
| --- | --- | --- |
| Planetary orbits (over finite time horizons) | Yes | Yes |
| Weather | Yes | Limited |
| Dice roll (unknown forces) | Yes | No |
| Radioactive decay (single event) | No | No |
| Casino odds (many trials) | No | Yes |

In essence: determinism does not guarantee predictability, and predictability does not require determinism.

Back to coding

From the perspective of someone who needs a piece of software, development has never been deterministic. When you ask a software developer to build something, you cannot predict exactly what code they will write, how long it will take (so many jokes about this…), or which edge cases will fail first. The same is true when you ask an AI agent. Both are problem-solving processes operating under uncertainty.

One can argue in favor or against the competency of humans or LLMs when it comes to coding, but determinism has never been a human trait.

In most cases, developers build software to satisfy other people's needs, and what these people really care about is whether the resulting code is predictable enough to rely on: whether the system behaves correctly most of the time, whether failures are visible, and whether they can be fixed quickly.

This distinction is important. From a software developer's point of view, asking an LLM to "build a program that sorts 1000 numbers" may not have a predictable result (the code itself). But the end user only cares whether the resulting program will always sort any 1000 numbers correctly.

And then there is the environment. Modern software runs on stacks that are far more complex than any single developer can fully reason about: hardware, kernels, drivers, libraries, network conditions, configuration files, container layers, dependency versions. So even if, as a developer, you write code that does exactly what you want it to do, in a totally predictable way, running that code may yield less predictable results. While deterministic, the system as a whole is so complex that its behavior cannot be predicted perfectly in advance.

We have quietly accepted that bugs are a natural part of software development (i.e. there will be cases where it behaves unpredictably), because we recognize the complexity of the endeavor. Instead of expecting perfect foresight, the industry built practices around uncertainty: tests, staging environments, observability, rollbacks, reproducible builds.
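
That outcome-focused framing is easy to make concrete: whoever, or whatever, wrote the function, the verification looks the same. A toy sketch, with sort_numbers standing in for generated code:

```python
import random

def sort_numbers(xs: list[int]) -> list[int]:
    # Stand-in for code produced by a human or an LLM.
    return sorted(xs)

for _ in range(100):
    xs = [random.randint(-10**6, 10**6) for _ in range(1000)]
    out = sort_numbers(list(xs))
    # The process that produced sort_numbers may be non-deterministic;
    # the behavior we rely on is checked the same way regardless.
    assert out == sorted(xs)
print("predictable outcome across 100 random inputs")
```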

| Producer / System Component | Deterministic | Predictable (for the requester) |
| --- | --- | --- |
| Human developer | No | Usually, within experience and process constraints |
| AI coding agent | No | Increasing, with tooling and validation loops |
| Compiler / build system | Yes | Yes |
| Tested deployment pipeline | Yes | Yes |

So, the meaningful question is not about determinism. The meaningful question is which workflow produces more predictable outcomes under real conditions, and which human or AI is a better fit at each case/stage.

Some final thoughts

It's worth looking at how we build some of the most safety-critical software. DO-178C is the "Software Considerations in Airborne Systems and Equipment Certification". It is a key document in the aeronautic industry, providing guidelines for the development of safety-critical airborne software.

The approach of DO-178C is based on the formulation of appropriate objectives and on the verification that these objectives are achieved. The DO-178C authors acknowledged that objectives are more essential and stable than specific procedures. The ways of achieving an objective may vary between companies, and they may vary over time with the evolution of methods, techniques, and tools. DO-178C never states that one should use design method X, coding rules Y, or tool Z. DO-178C does not even impose a specific life cycle.

DO-178C is objective-oriented: the focus is on formulating objectives and verification that the objectives are achieved, a framework that could work both for human coders and LLMs.

References

  1. Computational Irreducibility (Wikipedia). The image, Not Random, Blue, is inspired by the concept.

  2. Software has bugs. This is normal.

  3. Also depending on mood, health, work relationships and other human factors.

  4. Even leaving bugs aside, there is a big discussion about which aspects of a compiler are predictable. Modern compilers tend to generate binaries that behave predictably, but are implemented in ways unexpected by the vast majority of developers. For example, this article describes a case where the compiler decides to replace an O(n) algorithm written by the developer with an O(1) one!

  5. Side thought: If someone proved (not observed) that humans or LLMs can generate predictable results in some non-trivial cases, that would be really interesting. Tbh, if I had to bet that such a proof exists, I'd put my money on finding one for LLMs.

  6. An introduction to DO-178C

DevOps aillmenterprise

Databricks partners with OpenAI on GPT-5.5

Databricks is partnering with OpenAI to bring GPT-5.5 to its platform, with the new model cutting errors nearly in half on enterprise document reasoning tasks.

Summary

What: GPT-5.5 is OpenAI's newest frontier model optimized for enterprise agentic workflows, complex document analysis, and coding tasks through the Codex agent, and will be available through Databricks with Unity AI Gateway governance.
Why it matters: The roughly 46% relative improvement on full-agent workflows (where the model autonomously finds documents, parses them, and computes answers) suggests AI agents may finally be reliable enough for real multi-step business tasks without constant supervision.
Takeaway: GPT-5.5 will be available soon on Databricks for enterprises wanting to deploy frontier reasoning capabilities on their own data with governance controls.

Deep Dive

  • GPT-5.5 scored 64.66% on OfficeQA Pro (with oracle retrieval), a 13% improvement over GPT-5.4's 57.14%, setting a new state-of-the-art on Databricks' enterprise document benchmark
  • In realistic full-agent workflows where the model must autonomously find documents and compute answers, GPT-5.5 achieved 52.63% versus GPT-5.4's 36.10%, a roughly 46% relative improvement in accuracy
  • OfficeQA benchmark is built from 89,000 pages of U.S. Treasury Bulletins and tests retrieval across documents, complex table interpretation, and precise calculations on real enterprise data
  • The model powers Codex, OpenAI's coding agent, with enhanced reasoning and execution for developer workflows including writing, debugging, and operating across multiple tools
  • GPT-5.5 can handle messy multi-part tasks end-to-end: planning, using tools, checking work, recovering from ambiguity, and continuing until completion without step-by-step management
  • The same strengths that make it effective at coding apply to general knowledge work: researching, analyzing data, creating documents, and operating software across tool boundaries
  • Integration with Databricks will provide secure, governed access to frontier reasoning capabilities on enterprise data through Unity AI Gateway

Decoder

  • Frontier model: The most advanced, capable AI models at the current state-of-the-art, typically large language models with broad reasoning abilities
  • Agentic work: Tasks where AI systems autonomously plan, execute multiple steps, use tools, and make decisions to complete complex objectives without constant human intervention
  • Unity AI Gateway: Databricks' governance layer for controlling and monitoring AI model access, usage, and compliance in enterprise environments
  • OfficeQA: Databricks' benchmark for evaluating AI models on document-heavy analytical tasks common in enterprise settings, testing retrieval, table interpretation, and calculation accuracy
  • Oracle retrieval: A testing scenario where the correct documents are provided to the model, isolating the model's reasoning ability from its document-finding capability
  • Agent harness: A framework that allows AI models to autonomously execute multi-step workflows including tool use and decision-making

Original Article

Databricks is excited to partner with OpenAI on GPT-5.5, their latest frontier model. GPT-5.5 is OpenAI's strongest frontier model for agentic work in enterprise, complex document reasoning, and long-horizon coding agents. GPT-5.5 also now powers Codex, OpenAI's coding agent.

GPT-5.5 Features and Benefits

GPT-5.5 is the smartest frontier model yet and the next step toward a new way of getting work done. It understands what you're trying to do more quickly and can take on more of the work itself. Codex, OpenAI's coding agent, is now powered by GPT-5.5, with stronger reasoning and execution capabilities for developer workflows.

The same strengths that make GPT-5.5 great at coding also make it powerful for everyday work on a computer. Because the model is better at understanding intent, it can move more naturally through the full loop of knowledge work: finding information, understanding what matters, using tools, checking the output, and turning raw material into something useful.

It can write and debug code, research online, analyze data, create documents and spreadsheets, operate software, and move across tools until a task is finished. Instead of carefully managing every step, you can give GPT-5.5 a messy, multi-part task and trust it to plan, use tools, check its work, recover from ambiguity, and keep going.

GPT-5.5 sets the state-of-the-art performance

To understand how these improvements translate into real enterprise workloads, we evaluated GPT-5.5 on OfficeQA, Databricks' benchmark for document-heavy, multi-step analytical tasks customers perform every day. OfficeQA, built from 89,000 pages of U.S. Treasury Bulletins, measures a model's ability to retrieve information across documents, interpret complex tables, and perform precise calculations grounded in real enterprise data.

When given the right documents (OfficeQA Pro LLM with Oracle PDF + Web Search), GPT-5.5 scored 64.66%, a decent jump from GPT-5.4's 57.14%, representing a ~13% improvement and a new state-of-the-art on this benchmark. This tests the ceiling of what the model can do when retrieval is already handled.

In a full-agent workflow eval (OfficeQA Pro Agent Harness), where the model must find the right documents, parse them, and compute answers on its own using the Codex agent harness, GPT-5.5 scored 52.63%, up from GPT-5.4's 36.10%. That's a roughly 46% relative improvement, showing that GPT-5.5's gains aren't just theoretical; they hold up in realistic, end-to-end enterprise workflows.

GPT-5.5 is coming soon to Databricks. Bring frontier reasoning to your enterprise data, securely, and at scale.

DevOps infrastructuresecurity

Amazon ECR Pull Through Cache Now Supports Referrer Discovery and Sync

Amazon ECR now automatically syncs container image signatures, SBOMs, and attestations through its pull through cache, removing the need for manual retrieval.

Summary

What: Amazon Elastic Container Registry's pull through cache feature now automatically discovers and syncs OCI referrers (image signatures, SBOMs, and attestations) from upstream registries into private ECR repositories, where previously these artifacts had to be manually listed and fetched.
Why it matters: This removes friction from security and compliance workflows by making image verification, vulnerability scanning, and supply chain attestation work seamlessly with cached images without requiring custom scripts or client-side workarounds.
Takeaway: Check the Amazon ECR documentation to understand how referrer sync works with your existing pull through cache rules.
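
For orientation, a hedged boto3 sketch of creating a pull through cache rule (prefix, upstream, and region are placeholders); with this release, referrers attached to images pulled through the rule are synced automatically rather than fetched by hand:

```python
import boto3

ecr = boto3.client("ecr", region_name="us-east-1")

# Mirror an upstream registry under a local repository prefix.
ecr.create_pull_through_cache_rule(
    ecrRepositoryPrefix="docker-hub",            # local prefix in your registry
    upstreamRegistryUrl="registry-1.docker.io",  # upstream to mirror
)

# Images pulled via <account>.dkr.ecr.us-east-1.amazonaws.com/docker-hub/...
# are cached in private repositories, now with their signatures, SBOMs,
# and attestations synced alongside.
```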

Decoder

  • OCI referrers: Metadata artifacts linked to container images following the Open Container Initiative specification, such as signatures or security reports
  • SBOM: Software Bill of Materials, a detailed inventory of all components and dependencies in a container image used for security auditing
  • Attestations: Cryptographically signed metadata proving facts about an image, like build provenance or security scan results
  • Pull through cache: A registry feature that automatically mirrors images from upstream registries (like Docker Hub) on first pull and caches them locally

Original Article

Amazon Elastic Container Registry now automatically syncs OCI referrers like signatures, SBOMs, and attestations via pull through cache, eliminating manual retrieval and enabling seamless verification workflows across supported AWS regions.

Data infrastructurecost-optimization

How Airtable Saved Millions by Cutting Archive Storage Costs by 100x

Airtable cut archive storage costs by 100x and saved millions annually by migrating petabytes of cold MySQL data to S3 Parquet files queried with Apache DataFusion.

Summary

What: Airtable built a two-tier storage system that keeps recent archive data in MySQL while moving older, rarely-accessed rows (cell history and action logs) to S3 as base-partitioned Parquet files. They query this cold data using embedded Apache DataFusion, achieving 10x compression from Parquet and 10x lower per-byte costs on S3 compared to MySQL RDS.
Why it matters: This demonstrates a practical pattern for the common problem of cold data accumulation in expensive OLTP databases. The article provides a detailed implementation blueprint including migration strategy, validation approaches, and latency optimizations that preserve interactive performance while dramatically cutting costs.
Takeaway: If you're managing growing databases with historical data that's rarely queried, evaluate a tiered storage approach: keep hot data in your transactional database and archive cold data to columnar format on object storage with a query engine like DataFusion.

Deep Dive

  • Airtable's MySQL footprint had grown to petabytes with some databases approaching the 64TB RDS limit, driven primarily by archive tables storing cell history and action logs
  • The archive data was mostly cold, read-only, and rarely accessed but required retention for up to 10 years for some enterprise customers with strict durability and availability requirements
  • They built a two-tier system where recent rows stay in MySQL while older archive data moves to S3 stored as Parquet files partitioned by base
  • Parquet compression reduced the dataset size by 10x and S3 was about 10x cheaper per byte than MySQL storage, combining for roughly 100x cost savings
  • Apache DataFusion was chosen as the query engine over Athena (too slow for interactive queries), DuckDB (inefficient projection pushdowns), and StarRocks (operational overhead of running a cluster); a query sketch follows this list
  • DataFusion runs as an embedded Rust library inside existing workers, providing process-level isolation per base and high request affinity for excellent cache hit rates
  • Migration used RDS snapshots for consistent exports, Flink jobs for parallel repartitioning by base, and Step Functions with SQS for compaction into 1GB Parquet files
  • Bulk validation with StarRocks confirmed zero data corruption by comparing serving Parquet files against original RDS snapshots across petabytes
  • Shadow validation on live traffic caught cross-runtime bugs including float precision mismatches, sorting issues (lexicographic vs numeric), and async crashes in the napi-rs Node.js bridge
  • Tiered caching architecture achieved 99%+ hit rates by caching Parquet metadata and ListObjects results in memory, plus custom page header metadata caching and optional full-file disk caching for outlier bases
  • Custom secondary indexes built on top of Parquet files handle sparse filter queries, which is practical here because the archived data is effectively read-only
  • Parquet bloom filters enable efficient point lookups on randomly-distributed unique identifiers by pruning page groups that definitely don't contain target values
  • Files are sorted by autoincr_id and partitioned by base, allowing the query engine to use Parquet metadata for selective byte-range downloads instead of full-file scans
  • Next steps include building CDC-style incremental archiving with Flink to replace the bulk migration approach and expanding the platform to other log-like tables

Decoder

  • Parquet: Columnar storage file format that stores data by column rather than row, with built-in compression and metadata that query engines can use to skip irrelevant data
  • Apache DataFusion: Embedded query engine written in Rust that efficiently queries Parquet files by exploiting their metadata for pruning
  • Apache Flink: Stream processing framework used here for parallelized data migration and repartitioning workloads
  • CDC: Change Data Capture, a pattern for tracking and streaming database changes in real-time rather than bulk exports
  • Bloom filter: Probabilistic data structure that can definitively say a value is not present or possibly present, used to skip scanning data that won't match
  • Row group: Unit of data organization in Parquet containing a subset of rows, with metadata like min/max values for pruning
  • OLTP: Online Transaction Processing, databases optimized for transactional workloads with frequent writes rather than analytical queries
  • Metadata pruning: Query optimization that uses file statistics to skip reading data blocks that can't contain matching results

Original Article

Airtable cut archive storage costs by about 100x by moving cold, mostly immutable MySQL data into S3 as partitioned Parquet files and querying it with embedded Apache DataFusion. The dataset became 10x smaller, while S3 was about 10x cheaper per byte. A Flink-based migration, bulk and shadow validation, tiered caching, custom secondary indexes, and Parquet bloom filters preserved interactive latency and enterprise guarantees.
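
For a concrete feel of the embedded query path, here is a minimal sketch using DataFusion's Python bindings; the paths, table, and column names are invented for illustration. Airtable embeds DataFusion as a Rust library inside its workers, and production reads would go through a registered S3 object store rather than local files.

from datafusion import SessionContext

ctx = SessionContext()

# Register one base's archive partition; files are sorted by autoincr_id,
# so min/max metadata in Parquet row groups enables selective reads.
ctx.register_parquet("cell_history_cold", "/data/archive/cell_history/base_id=appXYZ/")

# A point lookup that can be served by bloom filters and metadata pruning
# instead of a full-file scan.
df = ctx.sql("SELECT * FROM cell_history_cold WHERE autoincr_id = 123456789")
print(df.to_pandas())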

Data databaseinfrastructure

Internal vs. External Storage: What's the Limit of External Tables?

External tables, which store metadata in databases but reference data in object storage, have persisted for 25 years because they solve the economic tradeoff between storage cost and query speed.

Summary

What: A comprehensive analysis of external tables in databases, comparing them to internal storage and exploring their evolution from Oracle's 2001 implementation through modern lakehouse architectures. The article examines when to store data inside your warehouse versus externally in object storage, with benchmarks showing external tables are 1.3–1.7× slower but up to 20× cheaper for archival data.
Why it matters: This architectural pattern keeps getting re-implemented in modern systems (Snowflake 2021, Databricks, BigQuery) because the core economics still work. As data volumes grow faster than compute needs, understanding the hot versus cold storage decision directly impacts both infrastructure costs and query performance. The article traces how external tables evolved from simple CSV readers to ACID-compliant lakehouse tables, showing they're not a legacy pattern but a fundamental database primitive.
Takeaway: Evaluate your data temperature before choosing storage: use internal storage for dashboards and frequently-queried hot data, but use external tables for archival cold data where 1.5× slower queries are acceptable in exchange for 10–20× lower storage costs on S3 tiers.

Deep Dive

  • External tables store only metadata in the database while data lives in external storage (like S3), functioning similarly to symlinks that create pointers without moving data
  • The pattern emerged in Oracle 9i (2001), alongside the SQL/MED standard for foreign-data wrappers, and predecessors existed in Microsoft Access linked tables (1992) and IBM DB2 DataJoiner (1995)
  • Modern implementations have added support for columnar formats (Parquet, ORC, Avro) and semi-structured data (JSON), replacing the CSV-only parsers of 2008
  • The hot versus cold decision drives the tradeoff: internal storage delivers 23.8ms median query times for dashboards, while external tables take 41–56ms but cut storage costs from $23/TB/month (Snowflake) to $1/TB/month on S3 Glacier Deep Archive
  • Benchmark on TPC-H 6M rows showed DuckDB internal storage was 1.0× baseline, DuckLake external was 1.3× slower, raw Parquet 1.4× slower, and Iceberg 1.7× slower, but all answered cold queries under 150ms
  • Common production pattern combines external tables with materialized views to refresh nightly snapshots during off-peak hours, avoiding impact on upstream production systems while optimizing query performance
  • Modern warehouse integrations (Snowflake, Redshift Spectrum, BigQuery, Athena) simplified the DDL from Oracle's verbose ACCESS PARAMETERS syntax to schema-inferred formats over S3 with minimal configuration
  • The dbt-external-tables package wraps external table definitions in YAML, enabling declarative configuration of location, partitions, and column inference across multiple warehouse backends
  • DuckDB achieves external table functionality without CREATE EXTERNAL TABLE syntax through schema-on-read with direct Parquet/CSV reading and ATTACH commands for Postgres, MySQL, SQLite, and S3
  • Open table formats (Iceberg, Delta, Hudi) represent the next evolution where the manifest file acts as the pointer, adding ACID guarantees, time travel, and schema evolution to external data
  • DuckLake addresses the small-file problem that plagues streaming workloads: benchmark showed Iceberg creating 352 files (201 data + 151 metadata) for 50 single-row inserts versus DuckLake's zero files with rows inlined in the catalog
  • The metadata benchmark demonstrated DuckLake's 926× faster queries on streaming workloads by reading indexed rows directly rather than walking manifest trees
  • Modern lakehouse architectures blur the internal/external distinction by adding database semantics (transactions, indexes, statistics) to external storage, completing a 25-year cycle back to RDBMS principles while maintaining storage/compute separation
  • ADBC (Apache Arrow Database Connectivity) enables 30× faster data transfer than ODBC by streaming columnar data end-to-end instead of row-by-row serialization, making external Parquet tables viable for interactive BI tools
  • The pattern's persistence across database generations (Oracle 2001, Snowflake 2021, Databricks Unity Catalog, BigLake 2022) suggests it will remain relevant for another 25 years per the Lindy Effect, because reading data in place always beats moving it

Decoder

  • External table: A database table definition that stores only metadata (schema, column types) in the database while the actual data lives in external storage like S3, accessed via SQL as if it were an internal table
  • Internal storage: Data fully managed and stored within the database system itself, offering faster queries but higher storage costs compared to external alternatives
  • Materialized view (MV): A pre-computed query result stored as a physical table that refreshes periodically, combining external table freshness with internal table speed
  • Hot vs. cold data: Hot data is frequently queried (dashboards, recent analytics) requiring fast access, while cold data is archival or rarely accessed, prioritizing low storage cost over query speed
  • DDL (Data Definition Language): SQL commands like CREATE TABLE that define database structure and schema without manipulating actual data
  • ACID guarantees: Atomicity, Consistency, Isolation, Durability—database properties ensuring reliable transactions even when systems fail
  • Open table format: Vendor-neutral table specifications (Iceberg, Delta, Hudi) that store metadata alongside Parquet files, enabling multiple engines to read the same data with transactional consistency
  • Lakehouse architecture: Hybrid approach combining data lake storage (cheap object storage, open formats) with data warehouse features (SQL, ACID, governance)
  • Schema-on-read vs. schema-on-write: Schema-on-read infers structure when querying data (flexible, used by DuckDB), while schema-on-write defines structure upfront (strict, traditional databases)
  • Small-file problem: Performance degradation when table formats create thousands of tiny metadata files instead of consolidating writes, requiring expensive compaction operations
  • Lindy Effect: The principle that the longer a technology has existed, the longer it's likely to continue existing—explaining external tables' 25-year persistence

Original Article

Internal tables store and manage both data and metadata within the database system, while external tables only store metadata and reference data that lives outside the system, leaving the underlying data untouched. Internal tables enable tighter lifecycle management, whereas external tables decouple storage and compute, making it easier to scale, share, and access large datasets without moving or duplicating data.
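
As a small sketch of the schema-on-read flavor of this pattern (the DuckDB route described above; file paths and column names are invented), external access needs no CREATE EXTERNAL TABLE at all:

import duckdb

con = duckdb.connect()

# Query archived Parquet in place: the schema is inferred from file
# metadata at read time (schema-on-read), with no ingestion step.
# S3 paths work the same way via DuckDB's httpfs extension.
con.sql("""
    SELECT order_month, sum(revenue) AS total_revenue
    FROM read_parquet('archive/orders_*.parquet')
    GROUP BY order_month
""").show()

# ATTACH exposes other databases as additional queryable tiers.
con.sql("ATTACH 'hot.duckdb' AS hot")
con.sql("SELECT count(*) FROM hot.orders").show()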

Data aiagentsdevops

Background Coding Agents: Supercharging Downstream Consumer Dataset Migrations

Spotify's coding agent Honk automated 1,800 data pipeline migrations by leveraging standardized tooling and careful prompt engineering, saving 10 engineering weeks and demonstrating what makes large-scale agent-driven refactoring work in production.

Summary

What: Honk is Spotify's internal coding agent that was used to automatically migrate approximately 1,800 data pipelines to new dataset versions across their codebase, generating pull requests and managing the rollout through their Backstage and Fleet Management platforms.
Why it matters: This case study reveals the real prerequisites for successful coding agents at scale: standardized code patterns, comprehensive testing infrastructure, and highly specific context engineering rather than general prompts. The agent struggled with Scala pipelines due to lack of standardization but succeeded with SQL-based frameworks.
Takeaway: If you're considering coding agents for migrations, prioritize standardizing your codebase and adding automated tests first—these foundational elements determine whether agents can reliably verify and iterate on their work.

Deep Dive

  • Spotify needed to deprecate two heavily-used datasets and migrate ~1,800 direct downstream pipelines across three frameworks within six months
  • Used Backstage's endpoint lineage and Codesearch plugins to identify all affected repositories and understand dependencies
  • Honk worked well on standardized frameworks (BigQuery Runner and dbt) but struggled with Scio due to its flexibility and variation across teams
  • Context engineering was the most time-consuming part—initial attempts to repurpose human migration guides failed because they weren't comprehensive enough
  • Success required explicit field-to-field mapping tables in the context file since Honk couldn't access external documentation or schemas at runtime
  • For edge cases requiring human judgment, Honk was instructed to leave fields unchanged and add comments with links to migration guides
  • Lack of unit testing in BigQuery Runner and dbt repos meant Honk couldn't verify its own work automatically, requiring manual review by owning teams
  • Successfully generated 240 automated migration PRs, monitored through Backstage's Fleetshift plugin for progress tracking and troubleshooting
  • Key lessons: agent success depends on code standardization, comprehensive testing infrastructure, and enforcing validation requirements across repos
  • Future improvements include allowing Honk to gather its own context from JIRA tickets and documentation before making changes, reducing upfront context engineering work

Decoder

  • Backstage: Spotify's open-source developer portal platform that provides visibility into services, dependencies, and lineage across the organization
  • Fleet Management: Tooling for orchestrating large-scale code changes across many repositories simultaneously
  • Fleetshift: Backstage plugin for managing multi-repository migrations with a unified dashboard view
  • BigQuery Runner: SQL-based data pipeline framework used at Spotify for running queries on Google BigQuery
  • dbt: SQL-based data transformation framework (data build tool) for managing analytics workflows
  • Scio: Scala-based framework for building Apache Beam data pipelines with more flexibility than SQL frameworks
  • Context engineering: The process of crafting detailed instructions and context for AI agents to ensure they perform tasks correctly
  • Endpoint lineage: Visualization of data dependencies showing which downstream services consume a particular dataset

Original Article

Background Coding Agents: Supercharging Downstream Consumer Dataset Migrations (Honk, Part 4)

This is part 4 in our series about Spotify's journey with background coding agents (internal codename: "Honk") and the future of large-scale software maintenance. See also part 1, part 2, and part 3.

In Part 2, we explored how we enabled our Fleet Management tools to use agents to rewrite our software automatically. We also explored how to write good prompts that allow the agent to best work without needing human input. In this blog post, we give a case study of how one team at Spotify used Honk with our Backstage and Fleet Management platforms to ease the pain of migrating thousands of dataset consumers onto new dataset versions — saving an estimated 10 engineering weeks in the process. We also share what we learned about how to make our data landscape more autonomous-coding-agent–friendly in the process.

Dataset migrations can be painful

As any data team knows, getting users to migrate to new endpoints can be a slow and painful process, both for the data owners and the downstream teams that use the datasets day-to-day.

At the end of last year we needed to deprecate two of the most heavily-used user datasets in order to release new versions with additional dimensions that would unlock new features. These deprecated datasets had ~1,800 direct downstream data pipelines between them and indirectly impacted several thousand more across the entire company.

We faced the prospect of migrating ~1,800 direct downstream data pipelines in only six months, across three very different pipeline frameworks that we use at Spotify: the SQL-based BigQuery Runner and dbt frameworks, and the Scala-based Scio.

We estimated that it would have taken around 10 engineering weeks of effort to complete these migrations manually. Facing that much work, we explored how Backstage, Fleet Management, and Honk might be able to automate some of the complexity.

Simplifying fleet migrations with Backstage

Before we could begin making any code changes, we had to first understand the lineage of our deprecated datasets so we would know which repositories to make those changes in. This is where Backstage's endpoint lineage and Codesearch plugins came in.

Each endpoint's Backstage page gave a clear list of downstream consumers, giving us an immediate sense of the scale of our migration. With Codesearch, we wrote queries that would find target repositories across the Spotify GitHub Enterprise landscape, and mark them as in-scope for our migrations, which we orchestrated using our Fleetshift plugin.

[Figure: Backstage displays the lineage for any data endpoint to help both owners and consumers understand dependencies throughout Spotify's entire software ecosystem]

With Honk, context is key

As we discussed in Part 2, context engineering is a key part of the process when working with background coding agents. With our target repositories now identified quickly via Backstage, this was the part of the build that took the most time and iteration to get right, and also where we learnt the most.

One of the major challenges for Honk in this migration was the fact that it had to deal with three different data pipeline frameworks, two of which are reasonably consistent in style and substance across the company (BigQuery Runner, dbt), and one of which isn't (Scio). This lack of standardisation across our data landscape made it hard to write all-in-one prompts for Honk that could truly capture all available permutations of what it would encounter.

Although we are adding these features now, at the time of these migrations, Honk did not have access to Claude skills or custom configurability at runtime. This was a design choice made to establish guardrails around the range of possible outcomes during the migration. It meant that the prompt we gave Honk had to be comprehensive, because it could not, for example, use MCPs to read dataset schemas it had not been given, or read external documentation for more context.

Trying to write a good, fully comprehensive prompt for Scio pipelines, which can vary hugely between teams due to the relative flexibility the framework provides, got very unwieldy without access to Claude skills. We therefore decided not to continue pursuing Scio migrations at that time, and focused on the other two pipeline frameworks.

For the dbt and BigQuery Runner pipeline frameworks, which were much more standardised, we initially attempted to generate a good context file by asking Claude to re-purpose a migration guide that was written for human engineers. However, the resulting context was not comprehensive enough, and Honk was left to make assumptions about how to map from one dataset field to another that were often incorrect. Once we adjusted for this, and made all mappings clear using tables in the context file — keeping in mind that Honk could only access the context we had written for it and little else — we began to see solid performance across the majority of target repositories.
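
As a purely hypothetical illustration (these field names are invented, not Spotify's), the mapping tables in such a context file might look something like this:

old field (v1)       new field (v2)         migration rule
user_country         registration_country   copy the value unchanged
activity_bucket      activity_tier_v2       map 'low'/'mid'/'high' to 'T1'/'T2'/'T3'
premium_flag         (do not migrate)       judgement call; leave unchanged and add a comment linking the migration guide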

Having these fine-grained instructions also allowed us to specify where Honk shouldn't try to perform a field migration, for example, in cases where a use case–specific judgement call was required. In these cases, we asked Honk to leave the fields unchanged, but to add comments above them with links to human engineer migration guides to make the task as easy as possible for the team that would later review the pull request.

One final challenge we encountered was that, unlike with Scio pipelines, the BigQuery Runner and dbt repositories across the company rarely used any build-time unit testing. This meant that one of Honk's key features, its ability to verify its work and then adjust based on the results, was unavailable to us, and we had to rely on the downstream owning teams to perform their own manual testing before merging the automated PRs.

That said, we successfully rolled out 240 automated migration PRs using Fleetshift. Here, Backstage and Fleetshift greatly simplified the ongoing monitoring and management of our shifts by providing an overview UI that gave us a snapshot view of migration progress, and the ability to easily click through and view any of the automated PRs without manually searching for the repositories. This was invaluable for troubleshooting, progress monitoring, and facilitating communication with the owning teams.

[Figure: We use the Fleetshift plugin in Backstage to easily orchestrate software migrations across a handful of repos — or even thousands]

What did we learn for the future

It became clear during this project that the success of using our Fleet Management tools with Honk for large-scale, complex migrations is going to depend on the strategic push to consolidate and standardise our data landscape. Similarly, we must enforce requirements for testing and validation across repositories so that agents like Honk can verify their work in an automated fashion. Both of these elements will be critical in enabling background coding agents across Spotify.

In addition to that, there are exciting features on the Honk roadmap that will also enhance its performance on complex tasks. The Honk team is working on a feature that will allow the agent to spend some time gathering its own context, for example by reading JIRA tickets or documentation, before it begins to perform code changes. This reduces the need for such comprehensive context files to be written up front, and should improve the quality of the resulting code changes by making full use of the Claude Code capabilities.

With both of these wider strategic changes taking place, and with the underlying Claude Code agents improving in capability all the time, we look forward to seeing Fleet Management with Honk excel at increasingly complex migrations and reduce manual toil for our engineering teams.


Data analyticsexperimentation

Measure Less to Learn More: Using Fewer, Higher-quality Metrics to Capture What Matters

Discord discovered that tracking too many metrics in A/B tests creates statistical problems that make it harder to detect real changes.

Summary

What: Discord reduced their default experiment metrics by removing redundant measurements and focusing on a small set of north-star and guardrail metrics that capture distinct concepts, rather than automatically tracking everything teams requested.
Why it matters: The multiple testing problem creates an unavoidable tradeoff: either accept false positives from random chance, or apply statistical corrections that reduce your ability to detect genuine effects. More metrics amplify this problem, and no statistical method eliminates it.
Takeaway: Audit your experiment metrics and remove ones that are redundant or highly correlated, focusing on a smaller set of clearly defined measurements that track different aspects of user behavior.

Decoder

  • p-value: The probability of seeing a result at least this extreme by chance alone; results are typically called significant when the p-value falls below a 5% threshold
  • Multiple testing problem: When testing many metrics simultaneously, some will appear statistically significant purely by chance (on average, 5 out of 100 truly unaffected metrics at a 5% threshold)
  • Recall: The proportion of true effects that are successfully detected by your statistical tests
  • North-star metrics: Primary success metrics that define overall product goals
  • Guardrail metrics: Safety metrics that ensure experiments don't harm important aspects of the product

Original Article

Discord improved experimentation by removing redundant metrics, grouping related ones, and focusing on a small set of clearly defined “north-star” and guardrail metrics. Adding too many metrics to experiments increases multiple-testing issues and metric correlation, which can require stricter statistical corrections and make real effects harder to detect.
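
To make the tradeoff concrete, a quick back-of-the-envelope calculation (the metric count here is illustrative, not Discord's):

# With 20 independent metrics and no true effect anywhere, the chance of
# at least one "significant" result at p < 0.05 is already about 64%.
alpha, n_metrics = 0.05, 20
p_any_false_positive = 1 - (1 - alpha) ** n_metrics
print(f"P(at least one false positive) = {p_any_false_positive:.0%}")  # ~64%

# A Bonferroni correction controls that risk, but the per-metric threshold
# shrinks to 0.25%, which reduces recall for genuinely real effects.
print(f"Corrected per-metric alpha = {alpha / n_metrics:.4f}")  # 0.0025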

Data aidatabaseinfrastructure

Databases Were Not Designed For This

Traditional databases assumed human-authored queries and intentional writes, but AI agents generate queries dynamically and retry autonomously, requiring new defensive patterns to prevent silent failures at scale.

Summary

What: An in-depth technical analysis of how AI agents violate core database design assumptions by issuing unpredictable queries, holding connections during LLM inference, writing autonomously without human review, and making silent mistakes that traditional error handling won't catch.
Why it matters: This exposes a fundamental architectural mismatch: databases evolved with the expectation that applications are deterministic and failures are loud, but agents can reason their way to never-before-seen queries, retry writes that appear to fail, and continue processing on semantically wrong results without raising exceptions.
Takeaway: Implement agent-specific database roles with minimal privileges, add statement timeouts at the role level, require idempotency keys on all agent writes, use soft deletes or append-only logs for audit trails, and tag queries with agent context for observability.

Deep Dive

  • Traditional databases assume callers are human-authored applications running deterministic code with predictable query patterns reviewed before deployment, but AI agents violate this by reasoning their way to queries dynamically based on context
  • Agents hold database connections much longer than typical applications because they pause for LLM inference between queries, fan out into parallel sub-agents, and multiply unexpectedly in production, exhausting connection pools sized for brief human-facing transactions
  • Statement timeouts become critical defense mechanisms; setting role-level timeouts (e.g., 5 seconds for statements, 10 seconds for idle transactions) prevents reasoning loops from holding resources indefinitely
  • Soft deletes with deleted_by and delete_reason columns should be the baseline for any agent-writable table, creating an audit trail and preventing permanent data loss from autonomous decisions
  • Append-only event logs replace UPDATE/DELETE operations for high-stakes data like financial records, making every state change traceable and enabling undo operations through projection queries
  • Idempotency keys are mandatory because agents retry by design under at-least-once delivery semantics; unique constraints on deterministic keys (hashed from task_id + operation + target_id) ensure retries produce identical results
  • Dedicated connection pools for agent workloads should be sized based on (num_agent_workers × avg_concurrent_steps × 0.5) with aggressive pool_timeout settings to fail fast rather than queue requests under saturation
  • PgBouncer in transaction pooling mode acts as a multiplier, allowing 500 agent connections to share 20 actual Postgres connections by releasing them immediately after each transaction instead of holding for entire multi-step tasks
  • Bad agent queries fail silently because an agent receiving an empty result set cannot distinguish between "no data exists" and "my query was wrong," leading to decisions based on incomplete data without raising observable exceptions
  • Query comments injected via database hooks (sqlalchemy events) tag every query with agent_id, task_id, and reasoning step, making slow query logs immediately actionable and enabling monitoring dashboards to show which agents consume the most database time
  • Schema design becomes a contract with language models in text-to-SQL scenarios; descriptive column names, CHECK constraints with explicit enum values, and NOT NULL constraints help LLMs generate correct queries without extensive prompt engineering
  • Agent-facing views translate legacy schema (usr_id, stat_cd, flg_1) into LLM-legible names (customer_id, fulfillment_status, requires_signature) with column comments written as docstrings that guide query generation
  • Role-per-agent-type access with minimum necessary privileges defines blast radius at the database level; analytics agents get read-only access, support agents can insert to event logs but cannot UPDATE directly, fulfillment agents can only modify shipping-related columns
  • The defensive data layer treats patterns like soft deletes, idempotency keys, least-privilege roles, and query tagging as load-bearing infrastructure rather than aspirational best practices, because agents don't provide the luxury of deferring them
  • Circuit breakers enforced in orchestration (max writes per task, max rows per statement, max task duration with watchdog processes) provide final safeguards against runaway agent behavior that database-level controls cannot prevent

Decoder

  • Agentic AI: AI systems that autonomously perform tasks by reasoning about actions and making decisions, rather than following predetermined code paths
  • Idempotency: The property that performing an operation multiple times produces the same result as performing it once, critical for handling automatic retries
  • Event sourcing: A pattern where state changes are stored as a sequence of immutable events rather than updating records in place, enabling full audit trails
  • PgBouncer: A connection pooler for PostgreSQL that multiplies effective connection capacity by sharing a small pool of actual database connections among many clients
  • Transaction pooling: A mode where connections are returned to the pool immediately after each transaction commits, rather than being held for an entire session
  • pg_stat_statements: A PostgreSQL extension that tracks execution statistics for all SQL statements, used for performance monitoring and query analysis
  • Text-to-SQL: The capability of LLMs to generate SQL queries from natural language descriptions, making schema naming and structure directly visible to AI
  • Blast radius: The scope of potential damage if a component fails or behaves incorrectly, limited through access controls and architectural boundaries

Original Article

Databases Were Not Designed For This

There is an implicit contract at the foundation of every database architecture decision you have ever made. You probably never wrote it down. Nobody does. It just… existed.

The contract goes something like this: the caller is a human-authored application, running deterministic code, issuing predictable queries, reviewed by a developer before deployment. Writes are intentional. Connections are brief. When something goes wrong, a human notices. The database can be dumb and fast because the application layer is smart and careful.

For forty years, this contract held. It shaped how we designed schemas, sized connection pools, granted permissions, and thought about failure modes. It worked because the assumption was correct.

It is no longer correct. Agentic AI systems violate this contract at every layer simultaneously.

In this article, I break down exactly which assumptions are failing, why they matter, and what to do about it - with concrete patterns and code. Let's dig right in…

Assumption - Deterministic Caller

In every application you have deployed before agents, the queries hitting your database were authored by a human.

  • developer wrote the SQL
  • developer code-reviewed it
  • developer tested it and deployed it.

This assumption runs so deep that the tooling reflects it automatically: the Postgres query planner builds statistics around observed query patterns, caching layers warm up on repeated queries, and connection pools are tuned around the expected number of concurrent queries of a known complexity.

Agents work differently; they reason their way to queries. Different reasoning paths produce different queries against the same tables.

An agent working on a customer analytics task might issue a join across five tables that has never been issued before, hold the connection while it thinks about the result, then issue a completely different follow-up. Your indexes cover the happy path. Your connection pool is sized for your observed peak. Neither of those holds when the agent can build any query depending on the data it needs.

Statement Timeouts

Statement timeouts are your first line of defense. A human-authored query that takes 30 seconds is a bug that someone will notice. An agent query that takes 30 seconds might be a reasoning loop that no one is watching.

So, set timeouts at the role level, not just the application level.

CREATE ROLE agent_worker;
ALTER ROLE agent_worker SET statement_timeout = '5s';
ALTER ROLE agent_worker SET idle_in_transaction_session_timeout = '10s';

The idle_in_transaction_session_timeout is especially important. An agent that pauses mid-reasoning while holding an open transaction is a routine occurrence, not an anomaly, and this setting terminates those sessions before they pin locks and block other work.

Assumption - Writes are Intentional

The most dangerous assumption in database architecture is that every write was reviewed by a human before it happened. This was basically true for your entire career, but not anymore.

Agents write autonomously. They write based on their current understanding of the task, which may be wrong. Agents write in loops when their tools return unexpected results. Agents write on retries when a transient network error makes them 'think' the first attempt failed. Agents can even write thousands of rows in the time it takes you to get a Slack notification that something looks off.

Here's a real documented failure pattern - an agent calling a legacy API receives HTTP 200 with an empty result set. The API failed silently because the database connection pool was exhausted downstream. The agent interprets "no data" as "no problem" and proceeds to process 500 transactions with incomplete data. No exception was raised. No alert fired. The log showed "decision: approved" on every record.

The core fix here is to design your write paths assuming the caller might be wrong, might retry, and might not be watching the results.

Soft Deletes Everywhere

Never let an agent hard-delete anything. Use soft deletes as a baseline for any table an agent can write to:

ALTER TABLE orders ADD COLUMN deleted_at TIMESTAMPTZ;
ALTER TABLE orders ADD COLUMN deleted_by TEXT; -- 'agent:customer-support-v2', 'user:abc123'
ALTER TABLE orders ADD COLUMN delete_reason TEXT;

-- Agents query this view; they never see deleted rows and can't accidentally undelete
CREATE VIEW active_orders AS
  SELECT * FROM orders WHERE deleted_at IS NULL;

The deleted_by column is more important than it looks. When you are debugging what happened two hours ago, "show me everything agent X deleted" is a query you will want to run.

Append-only Event Logs

For operations where the stakes are higher - financial records, inventory changes, user state mutations - consider going further and making the table append-only. The agent never issues UPDATE or DELETE. It issues INSERT with a new state and a reason:

CREATE TABLE order_state_log (
  id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
  order_id UUID NOT NULL REFERENCES orders(id),
  previous_status TEXT,
  new_status TEXT NOT NULL,
  changed_by TEXT NOT NULL,
  changed_at TIMESTAMPTZ DEFAULT now(),
  reason TEXT,
  idempotency_key TEXT UNIQUE
);

This is the event sourcing pattern applied at the table level. A single append-only log table for your most sensitive entities gives you a complete audit trail and makes "undo" a projection query.

Idempotency Keys Are Not Optional

Agents retry, and this is by design. Every orchestration framework operates on at-least-once delivery semantics. If a step fails, it runs again. Your write paths need to be designed for this.

An idempotency key is a stable identifier that an agent includes with every write. With a unique constraint on the key and an ON CONFLICT DO NOTHING insert, the database rejects duplicates silently, the agent gets a successful response either way, and running the operation twice produces the same result as running it once.

-- The agent generates this key from
-- task_id + operation_type + target_id
-- It is deterministic for the same logical
-- operation, so retries produce the same key
ALTER TABLE order_state_log
 ADD CONSTRAINT uq_idempotency_key UNIQUE (idempotency_key);

In practice, the agent constructs the key like this:

import hashlib

def make_idempotency_key(task_id: str,
   operation: str, target_id: str) -> str:
    raw = f"{task_id}:{operation}:{target_id}"
    return hashlib.sha256(raw.encode()).hexdigest()[:32]

The task ID comes from the orchestration layer and is stable across retries of the same logical task. This means the agent can retry as many times as it needs to, and your database sees exactly one write per logical operation.
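
Putting the key to work, a minimal sketch of the corresponding write path (table and columns come from the event-log example above; the ON CONFLICT clause is what makes a retry a silent no-op):

from sqlalchemy import text

def record_state_change(conn, task_id: str, order_id: str,
                        new_status: str, reason: str):
    # Deterministic key: retries of the same logical operation collide
    # on the unique constraint instead of producing duplicate rows.
    key = make_idempotency_key(task_id, "order_state_change", order_id)
    conn.execute(text("""
        INSERT INTO order_state_log
            (order_id, new_status, changed_by, reason, idempotency_key)
        VALUES (:order_id, :new_status, :changed_by, :reason, :key)
        ON CONFLICT (idempotency_key) DO NOTHING
    """), {
        "order_id": order_id,
        "new_status": new_status,
        "changed_by": "agent:fulfillment-v3",
        "reason": reason,
        "key": key,
    })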

Assumption - Connections are Brief

Traditional connection pool sizing follows a straightforward mental model. Your application handles N concurrent requests. Each request needs one database connection for a brief period. You size your pool to slightly above your expected concurrency peak, add a little headroom, and you are done.

Agents break this model in three ways.

  1. Agents hold connections longer

A multi-step reasoning task may issue a query, pause to process the result with the LLM, issue another query, pause again, and repeat. Each pause holds the connection open. The connection time per task is no longer "query execution time" - it is "query execution time + LLM inference time × reasoning steps."

  2. Agents fan out

A single high-level agent task often spawns sub-agents to work in parallel. One task becomes five simultaneous database sessions, and concurrent agent workflows holding db.session open across long IO waits can exhaust Postgres's connection slots.

  3. Agents multiply unexpectedly

In development, you had three agents. In production, you have thirty. Nobody updated the connection pool configuration.

The fix is a dedicated connection pool for agent workloads, sized independently from your human-facing transactional application traffic:

# Rule of thumb: (num_agent_workers * avg_concurrent_steps * 0.5)
# The 0.5 accounts for the fact that most agent steps
# involve LLM time, not DB time

agent_engine = create_engine(
    DATABASE_URL,
    pool_size=10,           # base pool for agents
    max_overflow=5,         # burst capacity
    pool_timeout=3,         # fail fast rather than queue
    pool_recycle=300,       # recycle connections every 5 minutes
    pool_pre_ping=True,     # validate connections before checkout
    connect_args={
        "options": "-c statement_timeout=5000 -c idle_in_transaction_session_timeout=10000"
    }
)

The pool_timeout=3 is deliberate. When an agent cannot get a connection within 3 seconds, it should fail fast and retry with backoff, not queue indefinitely. Letting requests queue against a saturated pool is how you get cascading failures.

For systems running many agents concurrently, add PgBouncer between your agents and Postgres, running in transaction pooling mode: connections return to the pool immediately after each transaction instead of being held for the entire session. This is a significant multiplier on your effective connection capacity for agentic workloads.

# pgbouncer.ini
[databases]
mydb = host=postgres_host dbname=mydb

[pgbouncer]
pool_mode = transaction       # critical: release connection after each transaction
max_client_conn = 500         # clients (agents) can connect up to this number
default_pool_size = 20        # actual postgres connections (much smaller)
reserve_pool_size = 5         # emergency capacity
reserve_pool_timeout = 1.0    # fail fast if reserve is also exhausted

In transaction pooling mode, 20 actual Postgres connections can serve 500 agent connections, because each agent only holds a Postgres connection for the duration of a single transaction, not the entire multi-step task.

Assumption - Bad Queries Fail Loudly

In a human-operated system, a slow or incorrect query surfaces quickly. The dashboard loads slowly. The API times out. An engineer runs EXPLAIN ANALYZE and finds the problem. The feedback loop is tight.

Agents sever that feedback loop. An agent that gets a slow query result just uses the result. An agent that gets an empty result set does not know whether the data genuinely does not exist or whether the query was wrong. It continues with its task, potentially writing decisions based on a bad read.

This is a different class of failure from application errors. An exception is observable. A semantically wrong query that returns rows is not.

The mitigation is building agent-specific observability into your database access layer. Standard slow query logs are not enough. You need to know which agent, which task, and which reasoning step produced a query. The most practical way to do this in Postgres is query comments:

from sqlalchemy import text, event
from sqlalchemy.engine import Engine

# retval=True is required so SQLAlchemy uses the returned (statement, parameters)
@event.listens_for(Engine, "before_cursor_execute", retval=True)
def add_agent_context_comment(conn, cursor, statement, parameters, context, executemany):
    # conn.info is a plain dict, so read it with .get(), not getattr()
    agent_ctx = conn.info.get("agent_context")
    if agent_ctx:
        statement = f"/* agent_id={agent_ctx['agent_id']}, task_id={agent_ctx['task_id']}, step={agent_ctx['step']} */ {statement}"
    return statement, parameters

# Usage: set context on the connection before executing
with engine.connect() as conn:
    conn.info["agent_context"] = {
        "agent_id": "fulfillment-v3",
        "task_id": "task-abc-123",
        "step": "check-inventory"
    }
    conn.execute(text("SELECT ..."))

These comments appear in pg_stat_activity, pg_stat_statements, and your slow query logs. A query that appears in your slow query log tagged agent_id=fulfillment-v3, task_id=task-abc-123, step=check-inventory is immediately actionable. Without this, you are doing archaeology.

Build a monitoring view that surfaces queries grouped by agent:

-- pg_stat_statements with agent context extracted from query text
SELECT
  (regexp_match(query, 'agent_id=([^,]+)'))[1] AS agent_id,
  (regexp_match(query, 'task_id=([^,]+)'))[1] AS task_id,
  count(*) AS call_count,
  round(mean_exec_time::numeric, 2) AS avg_ms,
  round(total_exec_time::numeric, 2) AS total_ms
FROM pg_stat_statements
WHERE query LIKE '%agent_id=%'
GROUP BY 1, 2
ORDER BY total_ms DESC;

When you see a single agent type accounting for 60% of total database time, you know where to look.

Assumption - Schema is a Contract With Engineers

This is the assumption that most teams never think about until it breaks. Your schema was designed for developer ergonomics - named to make sense to the engineers, structured for query convenience, with nullable columns that "mean something" only if you read the original migration comment.

When an agent can see your schema - through Text-to-SQL, through tool definitions, through an MCP server wrapping your database - the schema becomes a contract with a language model. Column names, table structure, and nullability now affect whether the LLM generates correct queries or confident-sounding nonsense.

Consider the difference between these two column definitions:

-- What most schemas look like
CREATE TABLE orders (
  id UUID PRIMARY KEY,
  usr_id UUID,           -- which user?
  stat_cd INT,           -- what does 2 mean? what does 7 mean?
  flg_1 BOOLEAN,         -- ???
  upd_ts TIMESTAMPTZ     -- updated at? but by whom?
);

-- What a schema legible to an agent looks like
CREATE TABLE orders (
  id UUID PRIMARY KEY,
  customer_id UUID NOT NULL REFERENCES customers(id),
  fulfillment_status TEXT NOT NULL CHECK (
    fulfillment_status IN ('pending', 'processing', 'shipped', 'delivered', 'cancelled')
  ),
  requires_signature BOOLEAN NOT NULL DEFAULT false,
  last_modified_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

With the second schema, the LLM generates correct queries almost automatically. The first requires extensive prompt engineering to compensate for what should have been done at the schema level.

For schemas you cannot rename (legacy systems, high-migration-cost tables), build an agent-facing view layer:

-- The raw table retains its legacy names
-- Agents query this view; they never touch the underlying table directly
CREATE VIEW agent_orders AS
SELECT
  id,
  usr_id         AS customer_id,
  CASE stat_cd
    WHEN 1 THEN 'pending'
    WHEN 2 THEN 'processing'
    WHEN 5 THEN 'shipped'
    WHEN 7 THEN 'delivered'
    WHEN 9 THEN 'cancelled'
  END            AS fulfillment_status,
  flg_1          AS requires_signature,
  upd_ts         AS last_modified_at
FROM orders
WHERE deleted_at IS NULL;  -- agents only ever see active rows

Write column comments as if they are docstrings - because for Text-to-SQL agents, they are:

COMMENT ON COLUMN agent_orders.fulfillment_status IS
  'Current state of the order in the fulfillment pipeline. '
  'Use this to filter orders that need action: pending and processing orders are active. '
  'Cancelled orders should never be modified.';

COMMENT ON COLUMN agent_orders.requires_signature IS
  'True if the delivery requires an adult signature. '
  'When true, the shipping agent must schedule a delivery window.';

Scoping Blast Radius

There is one more failure mode worth treating separately, because it cuts across all the assumptions above: the blast radius of a misbehaving agent is determined by the access it was granted.

Traditional applications share a database role, or at best have a few roles for different services. The assumption was that the application code was the guard rail. If the code only allowed users to update their own records, the database role did not need to enforce that - the application layer handled it.

Agents make this assumption dangerous. An agent that reasons itself into an incorrect state can issue queries that the application developers never anticipated. The agent is not a known, finite set of code paths - it is a general-purpose reasoner with access to a database connection. Application-layer guardrails do not bind it the way they bind deterministic code.

The fix is role-per-agent-type access, with the minimum necessary privileges defined at the database level:

-- Each agent type gets its own role
CREATE ROLE agent_fulfillment;
CREATE ROLE agent_customer_support;
CREATE ROLE agent_analytics;

-- agent_analytics: read-only, only the tables it needs
GRANT SELECT ON agent_orders TO agent_analytics;
GRANT SELECT ON customers TO agent_analytics;
-- Explicitly: no access to payments, credentials, PII tables

-- agent_customer_support: can update order status, cannot touch financials
GRANT SELECT ON agent_orders TO agent_customer_support;
GRANT INSERT ON order_state_log TO agent_customer_support;
-- Does not have UPDATE on orders -- changes go through the event log

-- agent_fulfillment: can read and update shipping-related fields only
GRANT SELECT, UPDATE (fulfillment_status, shipped_at, tracking_number)
  ON orders TO agent_fulfillment;

The question to ask in your access design review is not "what does this agent need?" but "what is the worst case if this agent's reasoning goes wrong, or if its credentials are compromised?" Reduce that blast radius at the database level, where it cannot be reasoned around.

Defensively Designed Data Layer

Pulling this together, here is what the data layer looks like for a team that has internalized these failure modes. None of it is exotic. All of it exists in battle-tested database tooling.

Every agent type has its own database role with the minimum necessary privileges, enforced at the database level with role-level timeouts. Agents connect through a dedicated connection pool, sized for agentic workload patterns and separated from human-facing traffic. PgBouncer runs in transaction pooling mode between agents and Postgres.

Tables that agents can write to use soft deletes with a deleted_by column that captures agent identity. High-stakes write paths use append-only event log tables with idempotency key constraints. Every write carries an agent ID and task ID so the audit trail is always traversable.

Schema objects that agents can see are named for legibility, not legacy convenience. A maintained view layer translates legacy column names to meaningful ones. Column comments are written as docstrings. Agents are granted access to views, not directly to underlying tables.

Every query issued by an agent carries a comment with the agent ID, task ID, and reasoning step. A monitoring dashboard aggregates this data so the on-call engineer can see "agent X consumed 40% of database time in the last hour" in real time.

The circuit breakers are defined: max writes per task enforced in the orchestration layer, max rows affected per statement enforced via statement complexity checks, max task duration enforced with a watchdog process that terminates stalled agent sessions.
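
A minimal sketch of what those orchestration-level breakers can look like (the limits here are illustrative; the article prescribes the pattern, not these numbers):

import time

class TaskBudget:
    """Per-task circuit breaker enforced in the orchestration layer."""
    def __init__(self, max_writes=100, max_rows_per_stmt=1_000, max_seconds=300):
        self.max_writes = max_writes
        self.max_rows_per_stmt = max_rows_per_stmt
        self.deadline = time.monotonic() + max_seconds
        self.writes = 0

    def before_write(self, rows_affected: int):
        # Trip the breaker instead of letting a runaway agent keep writing.
        self.writes += 1
        if self.writes > self.max_writes:
            raise RuntimeError("circuit breaker: max writes per task exceeded")
        if rows_affected > self.max_rows_per_stmt:
            raise RuntimeError("circuit breaker: statement affects too many rows")
        if time.monotonic() > self.deadline:
            raise RuntimeError("circuit breaker: task exceeded max duration")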

None of this is new technology. Soft deletes, append-only logs, least-privilege roles, row-level security, idempotency keys, query tagging - these are patterns that have existed for years. The shift that agents force is that these patterns go from "best practice we keep meaning to implement" to "load-bearing infrastructure." Agents do not give you the luxury of deferring them.

The database was not designed for this caller. But the tools to make it safe are already there.

Conclusion and Footnote

Traditional database architecture rests on assumptions that agentic AI workloads systematically violate: deterministic callers, intentional writes, brief connections, loud failures, and schema as a developer contract.

Each of these assumptions held because a human was always somewhere in the loop. Agents remove that guarantee. The result is that patterns long treated as optional best practice - soft deletes, append-only logs, idempotency keys, least-privilege roles, query tagging - become load-bearing infrastructure.

None of this requires new technology. It requires treating the database as a defensive layer that assumes the caller might be wrong, might retry, and might not be watching the results.

Data infrastructureclouddevops

When a Cloud Region Fails: Rethinking High Availability in a Geopolitically Unstable World

Traditional multi-AZ cloud architectures fail to account for geopolitical disruptions that can take down entire regions through sanctions, internet shutdowns, or conflict, requiring a fundamental rethink of high availability design.

Summary

What: An in-depth architectural analysis introducing the concept of Sovereign Fault Domains, which treats region-level disruption from geopolitical events (sanctions, data localization laws, infrastructure damage) as a first-class failure mode requiring multi-region, jurisdiction-aware design patterns.
Why it matters: Cloud providers withdrawing from Russia in 2022, submarine cable cuts, and data localization enforcement have exposed a critical gap in traditional failure models that assume regions only fail for technical reasons and can recover predictably, when geopolitical disruptions can make entire regions legally or physically inaccessible with no recovery timeline.
Takeaway: Audit your system's failure model by mapping dependencies for region-scoped single points of failure, defining explicit region evacuation playbooks, and extending chaos engineering to simulate sovereign fault domain loss including control plane unavailability.

Deep Dive

  • The traditional cloud failure hierarchy (instance → AZ → region) assumes regions fail only for technical reasons and can recover predictably, but geopolitical events like sanctions, internet shutdowns, and conflict can compromise entire regions as correlated units with no guaranteed recovery
  • Real-world stress tests revealed critical assumptions: cloud provider withdrawal from Russia in 2022 showed cross-region replication wasn't designed for involuntary exit; conflict zones proved AZs can fail in a correlated manner; data localization laws made compliant replication topologies suddenly non-compliant
  • Sovereign Fault Domains are failure boundaries defined by legal, political, or physical jurisdiction rather than hardware topology, representing emergent boundaries that providers cannot engineer away and architects must plan for explicitly
  • Geopolitical events map to known distributed systems failures: sanctions behave like forced dependency removal, internet shutdowns like network partitions, data localization laws like replication constraints, and physical conflict like correlated AZ failure
  • Multi-region architecture shifts from optional to baseline for systems that cannot tolerate sovereign disruption, with active-passive achieving RTO of minutes through DNS failover and database promotion, while active-active achieves near-zero RTO at the cost of consistency guarantees
  • Control plane separation is frequently overlooked—systems can have multi-region data planes but remain functionally single-region if configuration, orchestration, and secrets management reside in one region with no sovereign fallback
  • Jurisdiction-aware data abstraction layers enforce data residency at write time by validating every write's jurisdiction tag against compliant storage endpoints, preventing post-hoc compliance violations but requiring maintaining accurate classification models
  • Replication-within-sovereignty inverts the default assumption from global-by-default to explicit cross-border flows, maintaining two replication graphs (intra-sovereign always-on, cross-border terminable) to prevent RPO violations when borders close
  • Region evacuation playbooks must enforce strict ordering: replication flows freeze and data exports complete before DNS failover, or write-splits occur where both evacuating and destination regions accept writes against diverged states
  • Chaos engineering must extend to region-level experiments including complete network blackholing (not just degradation), cross-region traffic partitioning, legal partition drills that disable cross-border replication, and dependency removal injection to surface region-scoped dependencies
  • Cost justification uses Annual Loss Expectancy modeling (ALE = Annual Rate of Occurrence × Single Loss Expectancy) where SLE includes downtime revenue loss, re-platforming costs, and customer churn—for a $50M ARR platform, 5% annual probability of sovereign disruption yields $125K annual expected loss
  • Investment decisions should run sensitivity analysis at 1%, 5%, and 10% annual probability levels, with justification requiring yes to questions about operating across sovereign boundaries, region-scoped dependencies, or unacceptable blast radius from legal inaccessibility
  • The article provides concrete implementation details including AWS NACL rules for region blackholing, CockroachDB locality constraints for jurisdiction-aware replica placement, and Gremlin network attacks for chaos experiments
  • Health check latency (30-90 seconds), DNS propagation TTL, and database promotion lag combine to determine actual RTO, with AWS Global Accelerator or Azure Front Door bypassing DNS propagation through anycast network-layer routing
  • The fragmentation of the global cloud ecosystem is fundamentally a distributed systems reliability problem requiring the same engineering rigor applied to hardware failures, extended to account for the full range of operational conditions including geopolitical instability

Decoder

  • Multi-AZ (Availability Zone): Deploying applications across multiple isolated data centers within a single cloud region to protect against individual data center failures
  • Sovereign Fault Domain (SFD): A failure boundary defined by legal, political, or physical jurisdiction rather than technical infrastructure, representing regions that can become inaccessible due to geopolitical events
  • RTO (Recovery Time Objective): The maximum acceptable time that a system can be down after a failure before business impact becomes unacceptable
  • RPO (Recovery Point Objective): The maximum acceptable amount of data loss measured in time, determining how far back you can restore from backups
  • CAP Theorem: A distributed systems principle stating you can only guarantee two of three properties: Consistency, Availability, and Partition tolerance
  • Active-Active vs Active-Passive: Active-active distributes traffic across all regions simultaneously with no primary; active-passive routes to a primary region with hot standby for failover
  • Control Plane: The infrastructure components responsible for configuration, orchestration, and operational management, as opposed to the data plane that handles user traffic
  • Chaos Engineering: The practice of deliberately injecting failures into systems to validate resilience assumptions and discover weaknesses before real outages occur
  • Annual Loss Expectancy (ALE): Risk modeling calculation multiplying the probability of an event occurring by the financial impact if it does occur, used to justify security and resilience investments
  • Network Partition: A failure mode where parts of a distributed system can't communicate with each other, forcing decisions about consistency versus availability

Original Article

Cloud high availability can no longer assume regions are safe, independent failure domains: sanctions, data localization laws, conflict zones, and submarine cable cuts can take out an entire region or make it noncompliant. Treat region-level disruption as a first-class risk, with multi-region, jurisdiction-aware data placement, control-plane separation, and dependency audits. The added cost and complexity should be justified with Annual Loss Expectancy modeling rather than assumed.
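
The arithmetic behind that justification is short enough to sketch; the SLE breakdown below is hypothetical, chosen only to reproduce the article's $125K figure:

# ALE = Annual Rate of Occurrence x Single Loss Expectancy
aro = 0.05                     # 5% annual probability of sovereign disruption
sle = (1_500_000               # downtime revenue loss (hypothetical split)
       + 750_000               # re-platforming / migration cost
       + 250_000)              # customer churn
ale = aro * sle
print(f"ALE = ${ale:,.0f}")    # $125,000 expected annual loss

# Sensitivity analysis at the probability levels the article suggests:
for p in (0.01, 0.05, 0.10):
    print(f"ARO {p:.0%}: expected annual loss ${p * sle:,.0f}")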

Data infrastructuredevops

Stop Letting Tools Lead Your Platform Decisions

Data engineers often choose their tech stack before understanding requirements, leading to overengineered systems that could be much simpler.

Summary

What: An opinion piece by Andreas Kretz arguing that data platform decisions should begin with use cases, constraints, and operating requirements rather than defaulting to popular tools like Kafka, Spark, Snowflake, or Airflow.
Why it matters: Teams gravitate toward trendy or "enterprise" tools before analyzing actual needs, resulting in unnecessary complexity that wastes budget and makes systems harder to maintain when simpler solutions would work better.
Takeaway: Before choosing your data stack, answer these questions first: What are the actual use cases? What latency do you need? What's your budget? Who will consume this data? Then pick the simplest solution that fits those constraints.

Deep Dive

  • The common anti-pattern: teams start projects by declaring "we'll use Kafka, Spark, Snowflake, and Airflow" before defining what they're actually building or why
  • This stack-first approach is fine for learning projects but becomes problematic when building production systems people depend on
  • The correct sequence: define use cases → identify constraints → understand user patterns → design architecture → select tools
  • Critical questions that should come first: Do you need streaming or is batch sufficient? What latency is acceptable? What storage fits your access patterns? Do you need strong consistency everywhere?
  • Real coaching example: team initially planned real-time Spark processing, but analysis revealed weekly API ingestion and historical queries, leading to simplified batch processing with small streaming component only where actually needed
  • Tools should follow strategy, not define it—don't pick Kafka because it's "standard," Snowflake because it's hyped, or Spark because it sounds enterprise
  • Selection should account for your specific team capabilities, budget constraints, timelines, and actual system complexity requirements
  • The better solution is very often the simpler one, but simplicity requires slowing down at the start to ask the right questions
  • Key framework: What really needs to happen? Where does speed actually matter? What can stay simple without breaking requirements?
  • No reward exists for having the most complex stack, and initial time investment in understanding requirements pays off by making subsequent decisions obvious

Original Article

Data platform decisions should start with use cases, constraints, and operating requirements, not with Kafka, Spark, Snowflake, or Airflow. The key questions are latency, data freshness, cost, failure handling, and who will consume the system. Choose the simplest stack that fits the problem, team, budget, and timelines.

Data · python · ml · visualization

tda-mapper (GitHub Repo)

tda-mapper is a scikit-learn-compatible library that reveals hidden clusters and patterns in messy data using topological data analysis to create intuitive graph representations.

Summary

What: tda-mapper implements the Mapper algorithm from Topological Data Analysis, which transforms complex high-dimensional data into graph representations that reveal clusters, transitions, and structural patterns. It integrates with scikit-learn pipelines and offers multiple visualization backends including an interactive web app for exploring data without code.
Why it matters: Topological Data Analysis provides a different approach to understanding data structure that can reveal patterns invisible to traditional clustering or dimensionality reduction methods, particularly useful for exploratory analysis of complex or high-dimensional datasets.
Takeaway: Install via pip and try the live demo app, or integrate it into existing scikit-learn pipelines for topological feature extraction on your own datasets.

Deep Dive

  • The library implements the Mapper algorithm, which extracts topological structure from data by creating a graph representation that highlights clusters and transitions between them
  • Uses optimized spatial search techniques and parallelization to scale efficiently to high-dimensional datasets
  • Provides custom scikit-learn-compatible estimators that fit into standard ML pipelines for dimensionality reduction, clustering, and feature extraction
  • Supports multiple visualization backends (Plotly, Matplotlib, PyVis) with adjustable layouts for exploring the resulting graph structures
  • Includes an interactive web-based app for real-time parameter tuning and visualization without writing code
  • The Mapper algorithm works in four steps: choose a lens function (like PCA), cover the lens image with overlapping intervals, cluster data in each interval, and build a graph from overlapping clusters
  • Example use case demonstrates applying Mapper to concentric circles dataset, successfully identifying the two circles as connected components in the topological graph
  • Applications span social sciences, biology, and machine learning for discovering hidden relationships in complex data
  • Library is published on PyPI with documentation and accompanying research paper for methodological details

Decoder

  • TDA (Topological Data Analysis): A mathematical approach to data analysis that studies the shape and structure of data by examining topological features like connected components, holes, and clustering patterns
  • Mapper algorithm: A technique that creates a simplified graph representation of complex data by projecting it through a lens function, covering the projection with overlapping regions, clustering within each region, and connecting overlapping clusters
  • Lens: A function (often dimensionality reduction like PCA) that projects high-dimensional data to lower dimensions to help identify structure before clustering
  • Topological features: Structural properties of data like connectedness and clustering patterns that remain invariant under continuous deformations

Original Article

tda-mapper

tda-mapper is a Python library built around the Mapper algorithm, a core technique in Topological Data Analysis (TDA) for extracting topological structure from complex data. Designed for computational efficiency and scalability, it leverages optimized spatial search methods to support high-dimensional datasets. The library is well-suited for integration into machine learning pipelines, unsupervised learning tasks, and exploratory data analysis.

Further details in the documentation and in the paper.

Core Features

  • Efficient construction

    Leverages optimized spatial search techniques and parallelization to accelerate the construction of Mapper graphs, supporting the analysis of high-dimensional datasets.

  • Scikit-learn integration

    Provides custom estimators that are fully compatible with scikit-learn's API, enabling seamless integration into scikit-learn pipelines for tasks such as dimensionality reduction, clustering, and feature extraction.

  • Flexible visualization

    Multiple visualization backends supported (Plotly, Matplotlib, PyVis) for generating high-quality Mapper graph representations with adjustable layouts and styling.

  • Interactive app

    Provides an interactive web-based interface for dynamic exploration of Mapper graph structures, offering real-time adjustments to parameters and visualizations.

Background

The Mapper algorithm extracts topological features from complex datasets, representing them as graphs that highlight clusters, transitions, and key structural patterns. These insights reveal hidden data relationships and are applicable across diverse fields, including social sciences, biology, and machine learning. For an in-depth overview of Mapper, including its mathematical foundations and practical applications, read the original paper.

[Figure illustrating the four Mapper steps: choose a lens, cover the lens image, run clustering, build the graph]

Quick Start

Installation

To install the latest version published on PyPI:

pip install tda-mapper

How to Use

Here's a minimal example using the circles dataset from scikit-learn. It demonstrates how to apply the Mapper algorithm to a synthetic dataset of concentric circles, extracting a topological graph representation using PCA as a lens and DBSCAN for clustering. We proceed as follows:

import matplotlib.pyplot as plt
from sklearn.datasets import make_circles

from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

from tdamapper.learn import MapperAlgorithm
from tdamapper.cover import CubicalCover
from tdamapper.plot import MapperPlot

# Generate toy dataset
X, labels = make_circles(n_samples=5000, noise=0.05, factor=0.3, random_state=42)
plt.figure(figsize=(5, 5))
plt.scatter(X[:,0], X[:,1], c=labels, s=0.25, cmap="jet")
plt.axis("off")
plt.show()

# Apply PCA as lens
y = PCA(2, random_state=42).fit_transform(X)

# Mapper pipeline
cover = CubicalCover(n_intervals=10, overlap_frac=0.3)
clust = DBSCAN()
graph = MapperAlgorithm(cover, clust).fit_transform(X, y)

# Visualize the Mapper graph
fig = MapperPlot(graph, dim=2, seed=42, iterations=60).plot_plotly(colors=labels)
fig.show(config={"scrollZoom": True})

Left: the original dataset consisting of two concentric circles with noise, colored by class label. Right: the resulting Mapper graph, built from the PCA projection and clustered using DBSCAN. The two concentric circles are well identified by the connected components in the Mapper graph.

More examples can be found in the documentation.

Interactive App

Use our app to interactively visualize and explore your data without writing code. You can try it right away using our live demo, or run it locally on your machine.

To run it locally:

  1. Install the app and its dependencies:

    pip install tda-mapper[app]
  2. Launch the app:

    tda-mapper-app

Citations

If you use tda-mapper in your work, please consider citing both the library, archived in a permanent Zenodo record, and the paper, which provides a broader methodological overview. Cite the specific version of the library used in your research along with the paper; for citation examples, refer to the documentation.

Data · ai · database

DuckDB Extension - Whisper (Tool)

A DuckDB extension brings OpenAI's Whisper speech recognition directly into SQL queries, letting you transcribe audio files and even speak database queries.

Summary

What: A DuckDB extension that embeds Whisper speech-to-text capabilities directly into the database, allowing developers to transcribe audio files (WAV, MP3, FLAC, etc.) using SQL functions, with support for microphone recording, detailed timestamps, and experimental voice-to-SQL features.
Why it matters: This makes audio data queryable like any other structured data, enabling use cases like searchable meeting transcripts, batch audio processing pipelines, and voice-controlled database interfaces without external API dependencies.
Takeaway: Install with `INSTALL whisper FROM community;` in DuckDB, download a Whisper model, and start transcribing audio with SQL functions like `whisper_transcribe('audio.wav', 'tiny.en')`.

Decoder

  • Whisper: OpenAI's open-source speech recognition model that can transcribe and translate audio in multiple languages
  • DuckDB: An embedded analytical database designed for data analysis (similar to SQLite but optimized for analytics)
  • whisper.cpp: A C/C++ implementation of Whisper for efficient local inference without Python dependencies
  • Community extension: Third-party DuckDB extension installable from the community repository

Original Article

Installing and Loading

INSTALL whisper FROM community;
LOAD whisper;

Example

-- Transcribe an audio file
D SELECT whisper_transcribe('audio.wav', 'tiny.en');
┌──────────────────────────────────────────────────┐
│           whisper_transcribe(...)                │
│                   varchar                        │
├──────────────────────────────────────────────────┤
│ Hello, this is a test of the whisper extension.  │
└──────────────────────────────────────────────────┘

-- Get detailed transcription segments with timestamps
D SELECT * FROM whisper_transcribe_segments('audio.wav', 'tiny.en');
┌────────────┬────────────┬──────────┬────────────────────┬────────────┬──────────┐
│ segment_id │ start_time │ end_time │        text        │ confidence │ language │
│   int32    │   double   │  double  │      varchar       │   double   │ varchar  │
├────────────┼────────────┼──────────┼────────────────────┼────────────┼──────────┤
│          0 │       0.00 │     2.50 │ Hello, this is a   │       0.95 │ en       │
│          1 │       2.50 │     4.00 │ test of whisper.   │       0.92 │ en       │
└────────────┴────────────┴──────────┴────────────────────┴────────────┴──────────┘

-- List available models
D SELECT model_name, size_mb, is_downloaded FROM whisper_list_models() LIMIT 3;
┌────────────┬─────────┬───────────────┐
│ model_name │ size_mb │ is_downloaded │
│  varchar   │  int64  │    boolean    │
├────────────┼─────────┼───────────────┤
│ tiny       │      75 │ false         │
│ tiny.en    │      75 │ true          │
│ base       │     142 │ false         │
└────────────┴─────────┴───────────────┘

About whisper

A DuckDB extension for speech-to-text transcription using whisper.cpp, the C/C++ port of OpenAI's Whisper model.

Transcribe audio files directly from SQL queries in DuckDB, making it easy to process and analyze audio data alongside your other data.

Features

  • Transcribe audio files (WAV, MP3, FLAC, OGG, and more)
  • Live recording and transcription from microphone
  • Voice-to-SQL: speak natural language questions, get query results
  • Support for all Whisper models (tiny, base, small, medium, large)
  • Detailed transcription segments with timestamps and confidence scores
  • Automatic language detection or specify target language
  • Works with file paths, BLOB data, or remote URLs

Quick Start

Download a Model

Models must be downloaded before use. They are stored in ~/.duckdb/whisper/models/.

mkdir -p ~/.duckdb/whisper/models
curl -L -o ~/.duckdb/whisper/models/ggml-tiny.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin

Check available models and download status:

SELECT * FROM whisper_list_models();

Transcribe an Audio File

-- Simple transcription
SELECT whisper_transcribe('audio.wav', 'tiny.en');

-- Get detailed segments with timestamps
SELECT * FROM whisper_transcribe_segments('audio.wav', 'tiny.en');

-- Translate foreign audio to English
SELECT whisper_translate('german_speech.mp3', 'small');

Example Use Cases

Transcribe Remote Audio

INSTALL httpfs;
LOAD httpfs;

SELECT whisper_transcribe(content, 'tiny.en')
FROM read_blob('https://example.com/audio.mp3');

Batch Transcribe Multiple Files

SELECT file, whisper_transcribe(file, 'tiny.en') as transcript
FROM glob('audio/*.wav');

Search Within Transcriptions

SELECT * FROM whisper_transcribe_segments('meeting.wav', 'base.en')
WHERE text ILIKE '%action item%';

Generate Subtitles (SRT Format)

SELECT
    segment_id + 1 as id,
    printf('%02d:%02d:%02d,%03d',
        (start_time/3600)::int, ((start_time%3600)/60)::int,
        (start_time%60)::int, ((start_time - start_time::int) * 1000)::int
    ) || ' --> ' ||
    printf('%02d:%02d:%02d,%03d',
        (end_time/3600)::int, ((end_time%3600)/60)::int,
        (end_time%60)::int, ((end_time - end_time::int) * 1000)::int
    ) as timestamp,
    trim(text) as text
FROM whisper_transcribe_segments('video.mp4', 'small.en');

Recording (requires microphone)

Setup Audio Device

Before using microphone recording or voice query features:

-- List all available audio input devices
SELECT * FROM whisper_list_devices();

-- Set the device ID (use a device_id from the list above)
SET whisper_device_id = 0;

-- Verify your microphone is working
SELECT whisper_mic_level(3);

Record and Transcribe

-- Record for 5 seconds
SELECT whisper_record(5, 'tiny.en');

-- Record until silence (max 30 seconds)
SELECT whisper_record_auto(30);

-- Record and translate to English
SELECT whisper_record_translate(5, 'small');

Voice-to-SQL (Experimental)

Speak natural language questions about your data and receive SQL query results.

Requires text-to-sql-proxy running locally.

-- Create test data
CREATE TABLE customers (id INT, name VARCHAR, revenue DECIMAL);
INSERT INTO customers VALUES (1, 'Acme', 100000), (2, 'Beta', 50000);

-- Get SQL from voice (doesn't execute)
SELECT whisper_voice_to_sql();

-- Execute voice query directly
FROM whisper_voice_query();

-- Include generated SQL in results
FROM whisper_voice_query_with_sql();

Available Models

Model               Size     Description
tiny/tiny.en        ~75MB    Fastest
base/base.en        ~142MB   Fast
small/small.en      ~466MB   Good balance
medium/medium.en    ~1.5GB   High quality
large-v1/v2/v3      ~2.9GB   Best quality
large-v3-turbo      ~1.6GB   Fast + accurate

Models with .en suffix are optimized for English.

Supported Audio Formats

The extension uses FFmpeg for audio decoding: WAV, MP3, FLAC, OGG/Vorbis, AAC/M4A, and many more. Audio is automatically converted to 16kHz mono as required by Whisper.

Function Reference

Transcription Functions

  • whisper_transcribe(audio, [model]) - Transcribes audio and returns the full text
  • whisper_translate(audio, [model]) - Translates audio from any language to English
  • whisper_transcribe_segments(audio, [model], [language], [translate]) - Returns table of segments with timestamps

Recording Functions

  • whisper_list_devices() - Lists available audio input devices
  • whisper_record(duration_seconds, [model], [device_id]) - Records and transcribes
  • whisper_record_auto(max_seconds, [silence_seconds], [model], [threshold], [device_id]) - Records until silence
  • whisper_record_translate(duration_seconds, [model], [device_id]) - Records and translates to English
  • whisper_mic_level(duration_seconds, [device_id]) - Check microphone amplitude levels

Voice-to-SQL Functions

  • whisper_voice_to_sql([model], [device_id]) - Records voice and returns generated SQL
  • whisper_voice_query([model], [device_id]) - Records voice, generates SQL, executes it
  • whisper_voice_query_with_sql([model], [device_id]) - Same as above with SQL columns

Model Management Functions

  • whisper_list_models() - Lists all available models and download status
  • whisper_download_model(model_name) - Returns download instructions

Utility Functions

  • whisper_version() - Returns extension and whisper.cpp version info
  • whisper_check_audio(file_path) - Validates that an audio file can be read
  • whisper_audio_info(file_path) - Returns audio file metadata
  • whisper_get_config() - Returns current whisper configuration settings

Configuration

Configure settings using standard SET statements:

-- Model settings
SET whisper_model = 'small.en';
SET whisper_model_path = '/custom/path/models';
SET whisper_language = 'en';
SET whisper_threads = 4;

-- Recording settings
SET whisper_device_id = 0;
SET whisper_max_duration = 30;
SET whisper_silence_duration = 2;
SET whisper_silence_threshold = 0.005;

-- Voice query settings
SET whisper_text_to_sql_url = 'http://localhost:8080/generate-sql';
SET whisper_text_to_sql_timeout = 60;
SET whisper_voice_query_show_sql = true;

-- View all whisper settings
SELECT * FROM duckdb_settings() WHERE name LIKE 'whisper_%';

See the GitHub repository for full documentation.

Data · observability · ai · opentelemetry · operations

Jaeger adopts OpenTelemetry at its core to solve the AI agent observability gap

Jaeger v2 rebuilds on OpenTelemetry and adds Model Context Protocol support to help engineers trace AI agent execution paths and debug with natural language queries.

Summary

What: Jaeger v2 is a major architectural update that replaces Jaeger's original core with the OpenTelemetry Collector framework and adds support for new protocols (MCP, ACP, AG-UI) that enable AI-assisted debugging and visualization of GenAI application traces.
Why it matters: As AI applications and autonomous agents move to production, traditional tracing tools struggle to map complex execution paths involving prompt assembly, vector database calls, and external tool interactions, requiring observability platforms to evolve beyond standard microservices tracing.
Takeaway: Developers can run the same Jaeger v2 binary locally with a small language model for testing and in production with cloud LLMs, maintaining consistency while adding natural-language query capabilities to their observability workflow.

Deep Dive

  • Jaeger v2 replaces its original collection mechanisms with the OpenTelemetry Collector framework, natively ingesting OTLP and consolidating metrics, logs, and traces into a unified deployment model without intermediate translation steps.
  • The new architecture adds an Agent Client Protocol layer that acts as a stateless translator between the Jaeger frontend and external AI sidecars, allowing natural-language queries to be converted into deterministic trace queries.
  • Engineers can use natural language to specify constraints like "identify 500-level errors in the payment service with latency exceeding 2 seconds" instead of manually building filter queries during incident response.
  • Organizations can configure the backend to use cloud-based large language models for complex reasoning or local small language models for strict data privacy, with the AI restricted to protocol translation and query generation to minimize hallucination risks.
  • The UI is being migrated from Redux to Zustand + React Query and adds an in-app assistant powered by assistant-ui + AG-UI that streams trace context to the backend gateway for incident summarization.
  • Jaeger is adding support for visualizing GenAI application traces themselves, implementing emerging OpenTelemetry semantic conventions for RAG pipelines, autonomous agents, embedding model latency, external tool calls, and token usage tracking.
  • The project is tracking OpenTelemetry community drafts for Generative AI Agentic Systems to monitor tasks, memory, and actions, and conventions for AI Sandboxes to observe ephemeral code execution environments.
  • Because Jaeger v2 is built on the OpenTelemetry Collector, developers run the exact same binary locally as in production, enabling consistent tracing configurations from development to deployment.
  • The unified binary approach allows developers to run Jaeger v2 locally with a private SLM for testing without exposing data to external APIs, then swap to larger cloud LLMs in production for incident analysis.
  • The architecture integrates three open standards: MCP for standardizing how AI models access external data sources, ACP for UI-to-agent communication, and AG-UI for the collaborative workspace interface.

Decoder

  • OpenTelemetry (OTLP): Vendor-neutral observability framework and protocol for collecting telemetry data (metrics, logs, traces) from applications
  • MCP (Model Context Protocol): Standard for how AI models securely access external data sources and context
  • ACP (Agent Client Protocol): Uniform method for user interfaces to communicate with AI agents and sidecars
  • AG-UI (Agent-User Interaction Protocol): Protocol enabling collaborative workspace interfaces between humans and AI agents
  • SLM: Small Language Model, a compact AI model that can run locally for privacy-sensitive workloads
  • RAG (Retrieval-Augmented Generation): AI pattern that enhances model responses by retrieving relevant information from external knowledge bases before generating output

Original Article

Jaeger adopts OpenTelemetry at its core to solve the AI agent observability gap

Jaeger v2 uses OpenTelemetry and MCP to trace GenAI pipelines and facilitate collaboration between engineers and AI agents for observability.

As software architectures evolve, observability tools must adapt. When the industry moved to microservices, distributed tracing became a necessity. Jaeger emerged as a core tool for engineers to understand those fragmented systems. Now, as organizations integrate generative AI applications and autonomous agents into production, tracing requirements are shifting again. Mapping the execution path of an AI agent involves prompt assembly, vector database retrievals, and multiple external tool calls.

"By adopting the Model Context Protocol (MCP), Agent Client Protocol (ACP), and Agent–User Interaction Protocol (AG-UI), the project is building an environment where engineers and AI agents can collaborate."

Jaeger is evolving to address these new workloads. This transition involves two main phases. First, the project rebuilt its core architecture in Jaeger v2 to natively integrate OpenTelemetry. Second, Jaeger is expanding beyond standard data visualization. By adopting the Model Context Protocol (MCP), Agent Client Protocol (ACP), and Agent–User Interaction Protocol (AG-UI), the project is building an environment where engineers and AI agents can collaborate. This helps map the complex execution paths of AI pipelines that often stretch the limits of traditional tracing tools.

Setting the foundation: Jaeger v2

Managing AI workloads requires an efficient data collection pipeline. This need guided the architectural changes detailed in the CNCF blog post, Jaeger v2 released: OpenTelemetry in the core!

Jaeger v2 replaces its original collection mechanisms with the OpenTelemetry Collector framework. This approach consolidates metrics, logs, and traces into a unified deployment model. By natively ingesting the OpenTelemetry Protocol (OTLP), the system eliminates intermediate translation steps, improving ingestion performance. This OpenTelemetry integration provides the necessary data foundation for more advanced tracing features.

Human and agent collaboration

Building on Jaeger v2, the project is exploring new ways for teams to analyze distributed systems. The goal is to facilitate collaboration between engineers and AI agents during debugging. Contributors from the CNCF LFX Mentorship program and Google Summer of Code (GSoC) are actively driving this work.

To support AI integration, Jaeger is adopting three open standards: the Model Context Protocol (MCP), Agent Client Protocol (ACP), and Agent–User Interaction Protocol (AG-UI). MCP standardizes how AI models securely access external data sources. ACP provides a uniform method for user interfaces to communicate with AI agents and sidecars. Together, they allow Jaeger to function as an interactive workspace.

Building the backend protocol layer

The technical implementation starts in the backend. We are building an Agent Client Protocol layer to act as a stateless translator between the Jaeger frontend and external AI sidecars. The design and proof of concept are documented in Jaeger backend issue #8252 (Implement AG-UI to ACP Jaeger AI) and issue #8295 (Implement ACP-based AI handler).

graph LR
  J_UI["Jaeger UI"]
  AI_A["AI Agent"]
  subgraph JAEGER["Jaeger v2"]
    AGW["Agent Gateway"]
    JMCP["Jaeger MCP"]
  end

  J_UI -- "AG-UI Protocol" --> AGW
  AGW -- "ACP Protocol" --> AI_A
  AGW -- "MCP Protocol" <--> JMCP

Traditionally, incident responders build queries by manually filtering services and tags. The ACP integration allows the backend to parse natural-language constraints (such as identifying 500-level errors in a payment service with latency exceeding 2 seconds) and translate them into deterministic trace queries.
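
As a rough illustration of that translation step, the sketch below shows the kind of deterministic query object a natural-language constraint might map onto. The TraceQuery shape and field names are assumptions made for illustration; they are not Jaeger's actual query API or the ACP schema.

# Hypothetical translation target; shape and names are illustrative only.
from dataclasses import dataclass, field

@dataclass
class TraceQuery:
    service: str
    min_duration_ms: int
    tags: dict = field(default_factory=dict)

# "identify 500-level errors in the payment service with latency exceeding
# 2 seconds" could be parsed by the gateway's model into something like:
query = TraceQuery(
    service="payment",
    min_duration_ms=2_000,
    tags={"error": "true", "http.status_code": "5xx"},
)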

Organizations can configure this backend to use cloud-based large language models (LLMs) for complex reasoning or local small language models (SLMs) for strict data privacy. The depth of analysis depends on the chosen model, as noted in industry analyses of hosted versus local AI infrastructure. By restricting the AI to protocol translation and query generation, the architecture minimizes the risk of hallucinations associated with open-ended chatbots.

The collaborative UI workspace

The Jaeger user interface is also being updated to support this backend logic. As tracked in Jaeger UI issue #3313, we are migrating from legacy Redux to modern Zustand + React Query.

The frontend introduces an in-app assistant powered by assistant-ui + AG-UI. The UI uses streaming events to send trace context (such as error logs and key-value tags) to the backend gateway. This allows engineers to prompt the assistant to summarize a failure path within a specific span, reducing the need to manually review raw log lines during an incident.

Visualizing GenAI execution paths

Beyond using AI to analyze standard traces, the project is adding support for tracing the AI applications themselves.

Outlined in Jaeger issue #8401 (GenAI integration), this work focuses on visualizing the rapidly evolving OpenTelemetry GenAI semantic conventions. The OpenTelemetry community is currently drafting specifications to standardize telemetry for these highly dynamic workflows. Key initiatives include emerging drafts for Generative AI Agentic Systems (Issue #2664) to track tasks, memory, and actions, and conventions for AI Sandboxes (Issue #3583) to monitor ephemeral code execution environments.

"Jaeger will map these new standard operations in the UI to provide clear visibility into AI execution paths without locking teams into vendor-specific formats."

Developers building Retrieval-Augmented Generation (RAG) pipelines and autonomous agents need to measure embedding model latency, track external tool calls, and monitor token usage. Jaeger will map these new standard operations in the UI to provide clear visibility into AI execution paths without locking teams into vendor-specific formats.

Unified observability: from local testing to production

Maintaining consistency across testing and production environments is a practical challenge. Jaeger originally popularized an "all-in-one" executable to simplify local testing. Because Jaeger v2 is built on the OpenTelemetry Collector, developers run the exact same binary locally as they do in production.

During testing, engineers can run the Jaeger v2 container with a local SLM. This creates a private sandbox for testing generative AI traces or debugging ACP integrations without exposing data to external APIs.

In production, platform teams deploy the same unified binary, often using tools like the OpenTelemetry Operator for Kubernetes. Organizations can then replace the local SLM with a larger cloud-based LLM to handle production incident analysis. This ensures that tracing configurations remain consistent from development to deployment.

What's next

Tracing requirements are shifting to accommodate the complexity of AI applications. By establishing a solid OpenTelemetry foundation with Jaeger v2 and integrating MCP and ACP standards, the project is adapting its core capabilities. This technical path forward enables a practical workflow where human engineers and AI agents collaborate to diagnose distributed system failures.

Data · ai · llm · agents

Fixing What LLMs Get Wrong

Reflexion enables LLM systems to learn from factual errors across sessions by storing natural language reflections of failures in episodic memory and reinjecting them into future prompts, without requiring model retraining.

Summary

What: Reflexion is a framework introduced at NeurIPS 2023 that addresses the "hallucination tax" in enterprise LLM systems by implementing a three-component architecture: an Actor that generates answers, an Evaluator that verifies claims against authoritative knowledge sources, and a Self-Reflection model that generates generalized lessons from failures and stores them in sliding-window episodic memory for future context injection.
Why it matters: Existing solutions like fine-tuning, RAG, and static verification all fail to learn from repeated errors—each query starts from the same baseline hallucination propensity. Reflexion demonstrates that frozen-weight systems can improve through architectural changes alone by treating natural language reflections as "semantic gradients" that modify behavior through context rather than parameters, achieving 20-22% performance improvements on reasoning benchmarks and 91% on code generation.
Takeaway: Developers building LLM systems against private knowledge bases can implement Reflexion's pattern: decompose answers into atomic claims, verify each against structured knowledge, generate category-level reflections from contradictions, and maintain a sliding window of 3 recent lessons in the system prompt to reduce recurring error patterns.

Deep Dive

  • Enterprise LLMs produce fluent but factually wrong answers on private structured data (pricing, policies, org charts) creating operational risk where specific numbers govern contracts and legal precedence
  • Fine-tuning fails because it's computationally expensive, cannot track evolving knowledge without continuous retraining, and suffers from incomplete learning even after convergence—2026 research shows models fail to reproduce parts of their own training data
  • RAG improves access to current information but doesn't verify generation—studies show 70% of retrieved passages don't contain the true answer, and merely reordering the same documents can change outputs due to evidence integration instability
  • Static verification catches errors through claim decomposition and knowledge graph checking but doesn't prevent them—each session starts with identical hallucination propensity because no learning accumulates
  • Reflexion implements "verbal reinforcement learning" where the policy is parameterized by weights-plus-context, enabling behavior modification through prompt engineering rather than gradient descent
  • The three-component architecture separates concerns: Actor generates trajectories, Evaluator performs claim-level credit assignment through structured verdicts (SUPPORTED/CONTRADICTED/UNVERIFIABLE), Self-Reflection produces generalized behavioral lessons from failure patterns
  • Two-tier memory architecture mirrors cognitive science: short-term trajectory buffer holds episode details (discarded after use), long-term episodic store holds compressed reflections in a sliding window of 3 entries with recency bias
  • Entity linking uses LLM-based extraction with fuzzy matching fallback, short-circuits to UNVERIFIABLE when no relevant knowledge graph triples are found to avoid fabricating verdicts on empty evidence
  • Reflections must be generalized category diagnoses, not specific corrections—"verify tier-specific prices explicitly" transfers across queries, while "NovaPilot Growth costs $8,000" does not
  • Empirical results show 22% improvement on AlfWorld household tasks, 20% on HotPotQA multi-hop reasoning, 91% on HumanEval code generation versus GPT-4's 80% baseline
  • Ablation studies isolate reflection's contribution: episodic memory alone provides 8% improvement, verbal reflection adds 12 percentage points independently—but self-reflection without reliable evaluation actively harms performance
  • WebShop e-commerce task revealed critical boundary condition: Reflexion requires failures to be diagnosable, correctable, and recurrent within categories—fails on stochastic environments requiring diverse exploration
  • Weaker models cannot generate useful self-corrections—reflection accuracy on StarChat-beta was statistically indistinguishable from baseline, indicating emergent property of stronger models
  • System vulnerable to local minima if early reflections encode incorrect causal theories—no external signal corrects bad reflections since self-assessment is authoritative
  • The framework establishes that frozen-weight systems can learn through architecture when failure feedback has sufficient structure, but leaves open questions about graph-based memory, structured reflection templates, and relevance-based retention strategies

Decoder

  • RAG (Retrieval-Augmented Generation): Technique that injects relevant context from a knowledge store into prompts at inference time, grounding the model in documents it hasn't seen before
  • Fine-tuning: Retraining a pre-trained model on domain-specific data to write knowledge into the weights, but computationally expensive and vulnerable to catastrophic forgetting
  • Hallucination tax: Operational risk created when LLMs produce fluent but factually incorrect answers about specific numbers, dates, or policies in enterprise contexts
  • Episodic memory: Long-term storage of compressed reflections from prior failures, persists across sessions and is injected into future prompts (contrast with short-term trajectory buffer)
  • Knowledge graph (KG): Structured database of facts as subject-predicate-object triples (e.g., "NovaAI" - "founded in" - "2021") used as authoritative source for verification
  • Credit assignment problem: Challenge of identifying which specific decisions in a multi-step sequence deserve blame for a failure when only a terminal reward signal is available
  • Atomic claims: Individual verifiable assertions extracted from an answer, each with explicit subject and single fact (no pronouns or aggregation) to enable independent verification
  • Semantic gradient: Natural language directional feedback about behavior ("verify prices explicitly") that modifies generation patterns through context, analogous to numerical gradients in parameter space
  • ReAct prompting: Strategy that interleaves reasoning and acting, producing explicit thought traces alongside actions for better interpretability

Original Article

Fixing What LLMs Get Wrong

A Field Guide to Verification, Repair, and Self-Improvement

Part I - The Problem Worth Solving

The Hallucination Tax

There is a particular kind of operational failure that only emerges when you deploy a language model against real, private, structured knowledge. The model is fluent. Its grammar is impeccable. Its tone matches your internal documentation. And it is wrong about the things that matter most.

Not wrong in a detectable way - not grammatically wrong, not logically incoherent, not obviously absurd. Wrong in the way that a confident, articulate person is wrong when they misremember a number or confuse two similar-sounding policies. The answer sounds authoritative because the model's generation process is entirely indifferent to whether the specific fact it's producing is true. It is optimized for plausibility at the token level, not for correctness at the fact level.

This is the hallucination tax, and every organization building LLM-powered systems against proprietary knowledge pays it.

Ask an internal assistant what the NovaPilot Starter plan costs and it may confidently return $5,000 per month when the correct figure, encoded in your own knowledge graph, is $2,500. Ask it when a key executive joined and it will synthesize a plausible-sounding date from the statistical texture of similar corporate biographies. It will tell you "NovaAI was founded in 2019" because 2019 is a frequent founding year among AI companies in its training distribution, not because it knows anything specific about NovaAI.

The cost is not merely embarrassment. In domains where specific numbers govern contracts, where policy thresholds determine eligibility, where dates establish legal precedence - the hallucination tax accumulates as real operational risk.

The question that shapes this entire discussion is not "why do LLMs hallucinate?" That question has been answered well enough.

The productive question is: given that they do, what does a trustworthy system built on top of them actually look like?

Wait, why are we even talking about this in 2026? We have fine-tuning, and we have 30+ types of production RAG architectures. That should be enough... or so we thought.

Why the Obvious Solutions Are Incomplete

The engineering response to hallucination has followed a familiar three-stage progression.

The first response was fine-tuning. If the model doesn't know your domain, teach it. Write knowledge into the weights. This approach has an appealing directness - the model that emerges should, in principle, know the facts you care about. In practice, it fails for three reasons. First, the computational cost of fine-tuning frontier models at organizational scale is non-trivial. Second, private-domain knowledge is not static - pricing changes, personnel turns over, policies update - and fine-tuning cannot track a living knowledge base without continuous re-investment. Third, and most subtly, fine-tuning conflates model capability with model knowledge. You are retraining everything to change a few things, which is the engineering equivalent of reprogramming a city's traffic system every time a road is renamed.

A 2026 paper on continuous knowledge drift notes that continual pretraining or fine-tuning is often effective, but only at a cost: it is "computationally expensive," requires repeated retraining as knowledge evolves, and is vulnerable to catastrophic forgetting.

That matters because enterprise facts are rarely static:

  • prices change
  • org charts change
  • policies change
  • product bundles change

A separate 2026 study, Why Supervised Fine-Tuning Fails to Learn, reports a persistent Incomplete Learning Phenomenon: even after convergence, models can still fail to reproduce part of the very supervised data they were trained on. The paper says incomplete learning is "widespread and heterogeneous," and that aggregate metrics can hide unlearned subsets.

Fine-tuning is good for behavioral adaptation and domain shaping, but it is a poor single source of truth for fast-changing factual knowledge. And even when used, it does not guarantee complete or durable fact internalization.

The second response was retrieval-augmented generation, or RAG. Rather than writing knowledge into the weights, inject it at inference time. Retrieve relevant context from your knowledge store and prepend it to the prompt. This is substantially better. The model can now be grounded in documents or structured data it has never seen before. But RAG has a structural limitation that is easy to miss: it augments generation, it does not verify it. The model with retrieved context still generates freely. If the retrieved context is incomplete, ambiguous, or misaligned with the query, the model fills the gap with its training priors. More critically, RAG does not teach the model anything. The next query starts from the same baseline. The same category of error recurs.

A 2025 paper, ASTUTE RAG, explicitly argues that "imperfect retrieval augmentation might be inevitable and quite harmful." In their controlled analysis, they report that under realistic conditions roughly 70% of retrieved passages did not directly contain the true answer.

That is a big deal. It means that even if you build a retrieval pipeline, the model is often operating on:

  • incomplete evidence
  • irrelevant evidence
  • mixed evidence
  • conflicting evidence

A 2026 paper, Stable-RAG, shows something even more uncomfortable: RAG systems are "far from hallucination-free," and merely reordering the same retrieved documents can change the answer. In some cases, the model ignored the gold document even when it was present.

That means the failure is not only retrieval quality. It is also evidence integration instability inside the generator.

A 2025 paper, FVA-RAG, identifies retrieval sycophancy: if the user asks from a false premise, the retriever may fetch documents aligned with that false framing, causing the model to "hallucinate with citations."

RAG is necessary in many real systems, but it is not a truth machine. It improves access to current information, but does not guarantee:

  • correct retrieval
  • correct selection among retrieved evidence
  • correct reconciliation of conflicting evidence

The third response was static verification - a catch-and-correct pipeline. Generate an answer, decompose it into claims, check each claim against a knowledge source, flag what is wrong, rewrite the flagged content. This is architecturally sound and operationally useful. In a system built against a knowledge graph, the static verifier compares each extracted claim against the graph's triples and returns a structured verdict: SUPPORTED, CONTRADICTED, UNVERIFIABLE. When a claim contradicts the graph, a repairer rewrites it using the correct evidence.

Static verification catches errors. It does not prevent them. The model answers the next question with no memory of having been wrong on the last one. Each session starts with the same hallucination propensity, and the verification layer must work just as hard each time. The catch-and-correct pipeline is an error filter, not a learning system.
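
A minimal sketch of that catch-and-correct loop, with the LLM, claim decomposer, knowledge-graph verifier, and repairer passed in as callables; every name here is a placeholder for a component the text describes, not a real API.

from dataclasses import dataclass

@dataclass
class Verdict:
    claim: str
    label: str       # SUPPORTED | CONTRADICTED | UNVERIFIABLE
    evidence: str

def catch_and_correct(question, generate, decompose, verify, repair):
    """One pass of the static verification pipeline described above."""
    answer = generate(question)
    verdicts = [verify(claim) for claim in decompose(answer)]
    contradicted = [v for v in verdicts if v.label == "CONTRADICTED"]
    if contradicted:
        # Rewrite only the flagged content, using the contradicting evidence.
        answer = repair(answer, contradicted)
    return answer  # corrected, but nothing is learned for the next question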

These three approaches - fine-tuning, RAG, and static verification - are not wrong. They are incomplete. Each addresses one layer of the problem while leaving another layer untouched. Fine-tuning embeds knowledge but cannot track it. RAG grounds generation but cannot verify it. Static verification catches errors but cannot accumulate lessons.

The gap these three approaches share is the same gap: none of them improve through experience.

The Productive Question

Here, then, is the constraint that shapes everything that follows:

  • We have a model whose weights we cannot - or prefer not to - continuously update.
  • We have a private knowledge base that is authoritative, structured, and current.
  • We have a verification mechanism that can detect when the model is wrong.

The open question is whether the system as a whole can improve across sessions - not by retraining the model, but by restructuring how failure is processed and retained.

Phrased more precisely: can we build a pipeline where each hallucination, caught and corrected, produces a durable lesson that makes subsequent answers less likely to contain the same category of error?

Can a frozen-weight system learn through its architecture rather than through its parameters?

The Reflexion framework, introduced by Shinn et al. at NeurIPS 2023, is the first rigorous answer to this question. It is not the only answer - and subsequent approaches have extended, refined, and in some cases superseded it - but it is the foundational one, and understanding it deeply is prerequisite to understanding what came after.

Part II - Reinforcement Without Gradients

What Traditional RL Requires

Reinforcement learning, in its standard formulation, is an optimization algorithm. An agent acts in an environment, receives a reward signal, and updates its policy to maximize cumulative reward. The update mechanism is typically gradient descent through parameter space: the reward signal backpropagates through the model, nudging weights toward configurations that produce better outcomes.

This works spectacularly well in narrow, simulable domains - games, robotic control, token-level optimization. It works poorly in the context of language model deployment for several compounding reasons.

The sample complexity problem comes first. Gradient-based policy optimization requires a large number of trials before the signal-to-noise ratio in the reward gradient becomes useful. For a language model, each trial is an LLM inference call. At the scale of samples required for meaningful policy improvement, the cost in compute, latency, and dollars is prohibitive for most production environments.

The reward design problem follows closely. Defining a reward function that accurately captures what "a good answer" means is harder than it appears. A scalar reward - correct or incorrect - provides no gradient information about which part of a complex answer was wrong. Even when the model improves its score, we cannot easily attribute the improvement to any specific aspect of its behavior. This is the credit assignment problem: when a trajectory spans multiple decisions and produces a single terminal reward, distributing that reward backward through time so that each action receives appropriate credit is computationally expensive and theoretically fraught.

Finally, and most practically: traditional RL requires differentiable access to the model's parameters. For closed-weight APIs, this access does not exist. The model is a black box. You can observe its outputs; you cannot nudge its internals.

Traditional RL is the right algorithm if you have a simulable environment, abundant compute, a well-defined reward, and parameter access. In production language model deployment, you typically have none of these.

The Reflexion Hypothesis

Reflexion begins from a different premise. If gradient descent is unavailable, what else can carry the learning signal? The hypothesis: language itself.

A large language model is not merely a generator of text. It is a system that conditions its outputs on its inputs with remarkable sensitivity. When the system prompt says "always verify tier names against their specific price points," the model produces different outputs than when it doesn't. That difference is not mediated by weight updates. It is mediated by context. The instruction modifies behavior as surely as a gradient - with less precision, perhaps, but also with considerably less cost.

Reflexion formalizes this intuition. The core mechanism is a feedback loop: the agent acts, the environment evaluates, the agent reflects on the evaluation in natural language, and that reflection is stored and injected into future action prompts. The weights never change. The policy changes because the context changes.

Reflexion Hypothesis:

LLMs can improve across attempts if failure feedback is translated into natural-language reflections, stored as episodic memory, and reinserted into future prompts.

This is verbal reinforcement learning - reinforcement not through gradient updates but through the accumulation of linguistically encoded experience.

The mechanism is more subtle than it first appears. The reflection is not simply a correction of the specific wrong answer. It is an abstraction - a generalized lesson derived from the particular failure. The difference between:

  • "NovaAI was founded in 2019 is wrong" and
  • "the model tends to infer founding dates from executive tenure rather than explicit founding records"

is the difference between error correction and category learning. The former fixes one answer. The latter alters a behavioral pattern across an entire category of future questions.

Part III - Anatomy of a Reflexion Agent

The Tripartite System

A Reflexion agent is not a single model. It is three models - or, more precisely, three distinct roles that may be instantiated by the same model with different prompting. The distinction is architectural, not necessarily computational.

These three components are:

  • the Actor,
  • the Evaluator, and
  • the Self-Reflection model.

Their interaction is precisely defined, and understanding each in isolation before examining how they work together reveals why this architecture is both elegant and practically grounded.

This is not RL in the classical sense of gradient descent over policy parameters. But it is still reinforcement in the broader sense, because:

  • the agent acts
  • receives external evaluation
  • converts evaluation into a better future policy

The difference is where the policy lives.

The Actor: Trajectory Generator and Policy Carrier

The Actor is the policy engine. It is the component that takes the current state of the world - a question, a task, an environment - and produces an action: an answer, a decision, a sequence of tool calls. In the standard language model framing, the Actor is simply an LLM invoked with a system prompt.

But the Actor is more precisely described as a composite: the model's weights, plus its memory. In Reflexion's formulation, the policy is not stored in weights alone. It is parameterized jointly by the weights and by whatever resides in context. This is the key insight that makes the whole framework coherent: if the policy is the weights-plus-context, then you can modify the policy by modifying the context without touching the weights.

In a knowledge graph verifier, the Actor's first invocation has no memory. It answers questions about NovaAI using whatever it knows from pretraining, augmented by its system prompt's general instruction to be accurate and specific. It will produce a fluent answer that may or may not be grounded in the actual facts of the company. At this stage, it is V0 - the raw baseline.

After the first hallucination is caught and reflected upon, the Actor's subsequent invocations carry the accumulated lessons. The system prompt now contains something like:

You have previously made factual errors about NovaAI. Apply these lessons to improve your accuracy:

- The model tends to confuse pricing tiers when product names share tier suffixes. Always verify price amounts before stating them.

This is a modified policy: the same model producing different behavior, not through weight change but through context change. The weights are a fixed substrate; the context is the mutable state space.
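
A sketch of what that context injection might look like in code; the prompt wording follows the example above, but the function itself is an assumption, not the paper's implementation.

BASE_PROMPT = "You are a precise assistant. Answer accurately and specifically."

def build_actor_prompt(reflections: list[str]) -> str:
    """The Actor's policy = frozen weights + this mutable context."""
    if not reflections:
        return BASE_PROMPT  # V0: the raw baseline, no accumulated lessons
    lessons = "\n".join(f"- {r}" for r in reflections)
    return (
        BASE_PROMPT
        + "\n\nYou have previously made factual errors. "
        + "Apply these lessons to improve your accuracy:\n"
        + lessons
    )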

The Actor also produces, as a byproduct of its execution, what the paper calls a trajectory: the sequence of observations, thoughts, and actions taken during a single episode. This trajectory is short-term memory - the immediate working record of what happened during this particular attempt. It resets between episodes but persists within them, providing the raw material for evaluation and reflection.

The Evaluator: Credit Assignment Through Structure

The credit assignment problem is one of the oldest and most persistent challenges in reinforcement learning.

In a multi-step trajectory - a sequence of decisions leading to an outcome - which specific decisions deserve credit or blame for the final result? A simple scalar reward at the end provides a single number that must somehow distribute its signal backward through a sequence of potentially many decisions, most of which had nothing particular to do with why the outcome was good or bad.

Language agents face a specific variant of this problem. An LLM answer is not a single decision; it is an aggregation of claims, each of which may be independently true or false.

The answer "NovaAI was founded in 2021 by Dr. Mara Chen and Lucas Ferreira, and raised $210M in Series C funding in November 2024" contains multiple verifiable assertions. Some may be correct, some wrong. A binary "incorrect" verdict on the full answer tells the system nothing about which claim failed.

Reflexion addresses this through what amounts to claim-level credit assignment. The Evaluator decomposes the task of evaluation into atomic units: individual claims, each assessed independently against the knowledge source. The result is not a scalar but a structured verdict - a list of results, each specifying which claim is supported, which is contradicted, and which cannot be assessed given available evidence.

In a knowledge graph implementation, this evaluation pipeline is quite specific. An LLM answer is first decomposed into atomic claims - each claim asserting exactly one verifiable fact with an explicit subject, no pronouns, no aggregation.

"NovaAI was founded in 2021" is a proper claim. "It was founded in 2021" is not - the pronoun reference breaks the claim's self-containedness.

This atomicity is essential because the verification step requires mapping each claim to relevant KG triples via entity linking, and entity linking requires that the claim's subject be explicitly named.

Once claims are extracted, each is passed through an entity linker - a component that identifies the named entities and relation phrases in the claim, fuzzy-matches them against the KG's entity index, and retrieves the most relevant triples. The entity linking step itself is a critical architectural decision: LLM-based entity extraction produces better entity recall than pure fuzzy matching, but pure fuzzy matching is the correct fallback when the LLM extraction fails or returns no entity matches. The two-stage approach - LLM extraction with relation-score boosting, falling back to triple-view fuzzy scoring - balances recall against robustness.

The resulting evidence is then passed to the verification LLM, which returns a structured verdict:

  • SUPPORTED,
  • CONTRADICTED, or
  • UNVERIFIABLE,

with a confidence score and a one-sentence reasoning trace.

Critically, if no relevant triples are found for a claim, the system returns UNVERIFIABLE without an LLM call at all - a short-circuit that avoids fabricating a verdict against empty evidence and saves tokens that would otherwise be spent on meaningless inference.

This structured output is credit assignment made explicit. Instead of "the answer was wrong" the Evaluator produces "this specific claim was wrong, here is the contradicting evidence, here is why." The failure has been localized, the responsible claim identified, the correct value surfaced. The gradient, if we want to use that metaphor, is pointing in a specific direction.
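
A condensed sketch of this verification step, including the empty-evidence short-circuit; the entity linker, triple store, and judging LLM are passed in as callables, and all names are illustrative rather than taken from a real implementation.

from dataclasses import dataclass

@dataclass
class ClaimVerdict:
    claim: str
    label: str          # SUPPORTED | CONTRADICTED | UNVERIFIABLE
    confidence: float
    reasoning: str

def verify_claim(claim, link_entities, fetch_triples, llm_judge):
    """Verify one atomic claim against knowledge-graph triples."""
    entities = link_entities(claim)   # LLM extraction with fuzzy-match fallback
    triples = fetch_triples(entities)
    if not triples:
        # Short-circuit: no evidence means no LLM call and no fabricated verdict.
        return ClaimVerdict(claim, "UNVERIFIABLE", 1.0, "no relevant triples found")
    return llm_judge(claim, triples)  # returns a ClaimVerdict with reasoning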

The Self-Reflection Model: From Error to Episodic Memory

Between the Evaluator's verdict and the Actor's next invocation lies the step that distinguishes Reflexion from mere error correction: the generation of verbal reflections.

The Self-Reflection model - typically the same LLM prompted differently - receives the full context of a failed episode: the original question, the wrong answer, the structured verdict identifying which claims were contradicted and what the correct values are, and the repaired answer. From this, it is asked to produce a generalized lesson.

The key word is generalized. The reflection prompt explicitly forbids generic advice ("be more careful") and demands category diagnosis: what class of information was hallucinated, why the model likely made that error, and what specific strategy would prevent recurrence. The output might read:

"The model confused the NovaPilot Growth and Enterprise pricing tiers - a common error when multiple product tiers share a naming pattern. The mistake likely arose from conflating the tier suffix ('Growth', 'Enterprise') with pricing magnitude heuristics rather than consulting explicit price triples. Strategy: when answering any question about product pricing, trace the specific product tier name to its exact price before generating a number."

This reflection performs semantic compression. It takes a particular failure about a particular product at a particular price point and extracts a behavioral principle applicable to any pricing question in the domain. The particular generates the general. This is exactly what learning should look like: not memorizing that "NovaPilot Growth costs $8,000/month" but internalizing "when answering pricing questions, verify tier-specific prices explicitly."
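
A reflection prompt enforcing those constraints might read roughly as follows; the wording is illustrative, not the exact prompt:

```python
# Illustrative reflection prompt; the real prompt's wording may differ,
# but the constraints match the description above.
REFLECTION_PROMPT = """You are reviewing a failed answer.

Question: {question}
Wrong answer: {raw_answer}
Claim verdicts: {verdicts}
Corrected answer: {repaired_answer}

Write ONE generalized lesson. Requirements:
- Do NOT give generic advice such as "be more careful".
- Name the CATEGORY of information that was hallucinated.
- Explain WHY the model likely made this error.
- Give a SPECIFIC strategy that prevents this category of error on
  future questions, not just this one.
Respond with the lesson only, in 2-4 sentences."""
```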

The reflection is then stored in episodic memory, which is the long-term component of the Reflexion memory architecture.

Part IV - Memory, Context, and the Execution Loop

Two Memory Systems

Reflexion implements a two-tier memory architecture that maps, with reasonable fidelity, onto how human memory is described in cognitive science.

Short-term memory - also called the trajectory buffer - holds the contents of the current episode. The question, the raw answer, the extracted claims, the verification verdicts, the repaired output, the latency. This is working memory: rich, detailed, and ephemeral. It is assembled anew with each question and discarded when the episode closes. It provides the immediate context for evaluation and reflection but does not persist to inform future episodes.

Long-term memory - the episodic store - holds the distilled reflections from prior failures. It persists across episodes, is bounded by a sliding window (typically three entries), and is injected into the Actor's system prompt at the start of each new episode. It is compressed, general, and durable.

The sliding window is a deliberate design constraint. Three reflections is not a magic number - it reflects a trade-off between memory coverage and context budget. Too few reflections and the agent forgets recent lessons; too many and the prompt grows bloated, the relevant lessons diluted by earlier ones that may no longer apply. The window always retains the most recent entries: as new lessons are added, the oldest are evicted. This recency bias is principled - errors made on earlier questions in a session are less likely to recur than the categories of error encountered most recently.

In our knowledge graph verifier, this memory is shared across an entire evaluation session of forty-five questions. A single `ReflexionMemory` instance persists throughout the run - not resetting between questions. This is the architectural decision that makes V2 a genuine learning system rather than a per-query corrector. Early questions generate reflections. Those reflections alter the system prompt for all subsequent questions. The agent that answers question thirty-eight is not the same agent that answered question one. Its weights are identical. Its context is not.
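
The episodic store itself needs very little machinery. A minimal sketch consistent with the behavior described, assuming a three-entry window; the class name follows the text above, the method names are assumptions:

```python
# A minimal sliding-window episodic store; consistent with the description
# above but an assumption, not the actual implementation.
from collections import deque

class ReflexionMemory:
    def __init__(self, max_entries: int = 3):
        # deque(maxlen=...) evicts the oldest entry automatically,
        # giving the recency bias described above for free.
        self._reflections: deque[str] = deque(maxlen=max_entries)

    def add(self, reflection: str) -> None:
        self._reflections.append(reflection)

    def entries(self) -> list[str]:
        return list(self._reflections)
```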

The Injection Pattern: Context as Policy State

The mechanics of memory injection reveal something important about how Reflexion achieves its effect. At the start of each episode, the Actor's system prompt is assembled conditionally: if the episodic memory contains reflections, a formatted memory block is appended to the base instructions.

The formatted block follows a strict template:

[PRIOR SESSION LEARNING - APPLY THESE LESSONS]
- Reflection 1
- Reflection 2
- Reflection 3
[END PRIOR SESSION LEARNING]

The boundary markers - `[PRIOR SESSION LEARNING]` and `[END PRIOR SESSION LEARNING]` - are not decorative. They signal to the model that this content is distinct from the task instructions: it is retrospective guidance, not prospective direction. The model is not being told what the answer is; it is being told what patterns of error to avoid. The distinction matters because it preserves the model's generative freedom while biasing it away from known failure modes.

This pattern is the operational core of context engineering. The prompt is not a static template that encodes the task. It is a dynamic state object that encodes the task, the constraints, the history, and the learned lessons from prior failures. As the session progresses and more reflections accumulate, the context evolves - and so does the effective policy.
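
The conditional assembly might look like this - a sketch that reuses the template above, with the function name and base-instruction handling assumed:

```python
# Conditional prompt assembly: inject the memory block only when the
# episodic store is non-empty. A sketch; the function name is an assumption.
def build_system_prompt(base_instructions: str, reflections: list[str]) -> str:
    if not reflections:
        return base_instructions
    bullets = "\n".join(f"- {r}" for r in reflections)
    return (f"{base_instructions}\n\n"
            "[PRIOR SESSION LEARNING - APPLY THESE LESSONS]\n"
            f"{bullets}\n"
            "[END PRIOR SESSION LEARNING]")
```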

When weights are frozen, context is the only tunable parameter. Reflexion is, among other things, an argument that this single degree of freedom is sufficient for meaningful learning in bounded domains.

The Execution Loop

The full execution loop, described abstractly in the paper and concretized in implementation, operates as follows.

An episode begins when the Actor receives a question. If the episodic memory is non-empty, the system prompt includes the accumulated reflections from prior failures. The Actor generates a raw answer conditioned on this prompt.

The raw answer is then decomposed into atomic claims by the Claim Extractor. Each claim is processed independently: entity linking identifies the relevant KG triples, and the Evaluator LLM judges each claim as SUPPORTED, CONTRADICTED, or UNVERIFIABLE. The Evaluator's output is a list of structured verdicts - one per claim.

If any claims are CONTRADICTED, the Repairer is invoked. It receives the original question, the raw answer, and the full verification results including the KG evidence for each contradicted claim. Its task is surgical: replace the wrong values with the correct ones from the KG evidence, leave supported claims untouched, soften unverifiable claims with epistemic hedges. The output is the final answer.

Whether or not repair occurred, if any CONTRADICTED verdicts were found, the Self-Reflection model is invoked. It receives the full failure context and generates a verbal lesson. That lesson is appended to the episodic memory, and the sliding window trims anything beyond the maximum retention count.

The next episode begins with this updated memory. The loop is now closed.

The loop runs for every question in the session. The V0 pipeline - raw LLM with no verification - has no loop: question in, answer out, nothing learned. The V1 pipeline - static KG verifier - adds verification and repair but still has no loop: the episode closes cleanly with no residue. V2 adds the reflection and memory update, and in doing so, adds the loop. The architecture that produces learning is precisely this closed loop between failure, reflection, and future context.
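
Put together, the closed loop reduces to a short driver. A sketch in which every collaborator is passed in as a hypothetical callable; the dict-shaped verdicts mirror the verification sketch earlier:

```python
# One Reflexion episode: act, evaluate, repair, reflect, remember.
# Every collaborator is a hypothetical callable; verdicts are dicts
# shaped like the verification sketch above.
def run_episode(question, actor, extract_claims, verify_claim,
                repair, reflect, memory):
    answer = actor(question, memory.entries())   # memory-conditioned generation
    verdicts = [verify_claim(c) for c in extract_claims(answer)]
    contradicted = [v for v in verdicts if v["label"] == "CONTRADICTED"]
    final = repair(question, answer, verdicts) if contradicted else answer
    if contradicted:
        # Close the loop: distill a lesson and persist it for the
        # next episode's system prompt.
        memory.add(reflect(question, answer, verdicts, final))
    return final
```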

Part V - The Evidence and Its Limits

What the Numbers Say

Empirical validation is the necessary counterpart to architectural elegance. The Reflexion paper evaluates the framework across three task domains, each stress-testing a different dimension of agent capability.

In AlfWorld, a suite of text-based household environment tasks, Reflexion agents using the ReAct prompting strategy improved absolute performance by 22 percentage points over the baseline across 134 tasks in 12 iterative learning steps - completing 130 of 134 tasks total. The improvement arose through the agent's accumulated ability to diagnose two specific failure modes: hallucinated possession of objects (the agent acts as if it holds an item it hasn't retrieved) and inefficient planning (searching for items in a fixed order regardless of prior evidence). Reflections on these failures produced lessons about state tracking and search strategy that transferred across tasks.

In HotPotQA, a multi-hop question-answering benchmark, Reflexion improved accuracy by 20% over baseline. Crucially, the paper isolates the contribution of reflection from the contribution of episodic memory alone through ablation: adding only the most recent trajectory (episodic memory without reflection) produces an 8% improvement; adding the verbal reflection step on top of that produces the full 20%. Reflection contributes 12 percentage points independently of memory. This is the quantitative argument that verbal reflection is not merely a fancy way of logging history - it extracts something from failure that history alone does not provide.

In code generation on HumanEval, Reflexion achieved 91% pass@1 accuracy, surpassing GPT-4's 80% - at the time of publication, a state-of-the-art result. The mechanism is worth examining: the coding agent generates self-written unit tests alongside its implementation, runs them, and reflects on failures. The unit test suite acts as the Evaluator, providing grounded, executable feedback rather than LLM-generated judgment. The reflection then diagnoses why the implementation failed specific tests and proposes a corrected approach for the next attempt.

The ablation study on HumanEval Rust is particularly revealing. Four conditions are compared: base model alone (60% accuracy), test generation without self-reflection (60% - no improvement), self-reflection without test generation (52% - performance degrades), and the full Reflexion combination (68%). The counter-intuitive result - that self-reflection without test-driven grounding actively harms performance - is an important finding. Without reliable evaluation, the agent is reflecting on incorrect premises. It may conclude it failed for reasons it didn't fail, and generate lessons that steer it toward worse behavior. The quality of reflection is bounded by the quality of evaluation.

The Boundary Conditions

The WebShop experiment is the most instructive failure in the paper. WebShop is an e-commerce navigation task: the agent must search for products matching client specifications in a simulated online store. After four trials with no improvement, the experiment was terminated. The agent failed to generate useful reflections - its lessons were either too generic or misidentified the source of its failures.

The diagnosis is precise: Reflexion works when failures can be articulated as correctable behaviors. In WebShop, the failure modes were structural. The search space required diverse exploration strategies, not incremental correction of specific wrong moves. When the agent searched for "blue cotton shirt" and failed, the useful lesson wasn't "phrase the search differently" - it was that the e-commerce search engine interpreted natural language queries non-deterministically. This is a failure the agent cannot reflect its way out of. The environment is irreducibly stochastic in a way that verbal lessons cannot address.

This points to the general boundary condition: Reflexion is effective when failure is diagnosable, correctable, and recurrent within recognizable categories. It is less effective - potentially harmful - when failures are random, structurally unavoidable, or require diverse exploration rather than careful correction.

A second boundary condition concerns model capability. The authors find that the ability to generate useful self-corrections is an emergent property of stronger, larger models. When the framework is applied to a weaker model (StarChat-beta in the appendix), performance does not improve at all - accuracy on HumanEval Python is statistically indistinguishable between baseline and Reflexion. The self-reflection step requires genuine reasoning capability to produce meaningful category diagnoses. A model that cannot accurately diagnose its own failures will generate lessons that are either trivially generic or actively misleading.

Finally, Reflexion does not escape the local minima problem inherent to all iterative optimization. If early reflections encode an incorrect causal theory of the failure - if the agent blames the wrong aspect of its behavior - subsequent attempts will optimize against the wrong target. There is no external error signal to correct a bad reflection; the system's self-assessment is authoritative.

Part VI - The Deeper Architecture

Verbal Feedback as Semantic Gradient

The most persistent insight in Reflexion, beyond any specific empirical result, is the framing of verbal feedback as a semantic analog to the gradient.

In gradient descent, the gradient is a vector in parameter space that points in the direction of steepest loss increase. The update rule moves parameters in the opposite direction - toward lower loss. The gradient carries directional information: not just "this was wrong" but "adjust these weights in these directions by these magnitudes to make it less wrong."

A scalar reward, by contrast, carries no directional information. It says "this was better or worse than expected" without specifying what to change or how. The gradient that backpropagates from a scalar reward to the parameters that produced it is computed through the model's computational graph - a process that requires differentiable parameter access and produces updates that may be hard to interpret.

A verbal reflection is a semantic gradient: it does not point in parameter space, but it points in behavior space. "When answering pricing questions, trace the specific product tier name to its exact price before generating a number" is directional information about behavior. It does not tell the model which weights to adjust; it tells the model which generation patterns to adopt. The directional information is encoded in natural language rather than real-valued vectors, and it is consumed by the model's attention mechanism rather than by a parameter update rule.

The insight that language models respond to linguistic direction as effectively as gradient descent responds to numerical gradients - within the bounded scope of in-context learning - is what makes Reflexion theoretically coherent rather than merely pragmatically useful. It is not a workaround for the unavailability of gradient access. It is an alternative signal modality that carries complementary information.

The Episodic Memory Lifecycle

In the full Reflexion framework, the lifecycle of a reflection follows a precise arc: from failure to diagnosis to storage to injection to effect.

Failure: the Actor produces an answer that is evaluated as containing contradictions.

Diagnosis: the Self-Reflection model produces a generalized lesson - the reflection - from the failure context. The lesson encodes what failed, why it likely failed, and what behavioral strategy would prevent recurrence.

Storage: the reflection is appended to the episodic memory. The sliding window trims entries beyond the retention limit, maintaining recency.

Injection: at the start of the next episode, the memory is formatted and appended to the Actor's system prompt, inside the session learning block.

Effect: the Actor's generation for the next episode is conditioned on this memory. The model's effective behavior is modified without any weight update.

The lifecycle is self-contained and modular. Each stage is independently observable and testable. If reflections are failing to prevent recurrent errors, the failure can be diagnosed at any stage: Are the reflections being generated with sufficient specificity? Is the injection format being consumed appropriately by the model? Is the category of failure amenable to verbal diagnosis at all? The modularity of the pipeline is not merely an engineering convenience - it is what makes the system debuggable.

What This Opens Up

Reflexion is, in the framework we're building here, the first answer to the central question. It is not the final one.

The framework it establishes - frozen weights, structured failure feedback, verbal reflection, episodic memory injection - is a specific architectural choice among several possible choices. The memory is a flat list. The reflections are free-form text. The sliding window is a simple recency heuristic. The evaluation is synchronous and claim-level. These are not the only ways to implement any of these components.

What happens when the memory is not a flat list but a graph? When reflections are not free text but structured templates with typed fields for error category, confidence, and scope? When evaluation is not post-hoc but interleaved within the generation process? When memory retention is not governed by recency but by estimated future relevance?

These are not rhetorical questions. They are active research directions, and each represents a distinct architectural approach to the same fundamental problem: how does a frozen-weight system improve its behavior through experience?

Reflexion establishes the hypothesis: it can. The specific mechanism - verbal reflection stored in episodic memory - is one answer. Subsequent work has proposed others, each with different characteristics, different strengths, and different failure modes.

The hallucination problem with which we began is not solved by Reflexion. It is meaningfully addressed. The residual gap - what Reflexion does not handle, where it fails, what capabilities it leaves unrealized - is precisely what the subsequent answers in this series will attempt to fill.

References

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., & Yao, S. (2023). Reflexion: Language Agents with Verbal Reinforcement Learning. Neural Information Processing Systems (NeurIPS) 2023.

Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2023). ReAct: Synergizing Reasoning and Acting in Language Models. International Conference on Learning Representations (ICLR) 2023.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.

Sutton, R. S. & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.

Chen, M., Tworek, J., Jun, H. et al. (2021). Evaluating Large Language Models Trained on Code. arXiv:2107.03374.

Data · AI · analytics

Measurement Engineering: The Part of Data Science That Will Thrive in AI

As AI automates SQL queries and dashboards, the most valuable data science skill is becoming judgment—knowing what to measure and whether metrics actually reflect reality.

Summary

What: An argument that data science is splitting into two categories: execution work that AI increasingly handles (writing queries, building pipelines, creating charts) and measurement engineering that requires human judgment (validating whether metrics capture what they're supposed to, interpreting ambiguous results, deciding which metrics actually matter).
Why it matters: Most data science training focuses on execution skills while ignoring the science of measurement—construct validity, reliability theory, and decision-making under ambiguity. This gap leads to expensive mistakes like AI evaluation suites that measure fluency instead of usefulness, or A/B tests where teams cherry-pick from hundreds of metrics to support predetermined conclusions.
Takeaway: Data professionals should study measurement theory from psychometrics (construct validity, inter-rater reliability, item response theory) rather than optimizing for SQL speed. Organizations should create measurement functions with authority to retire bad metrics and flag flawed evaluations.

Deep Dive

  • Data science has focused on execution for 15 years (learn Python, SQL, build models) with the assumption that judgment would come naturally with experience, but AI is now exposing that this was never true
  • Execution questions like writing queries and training models are becoming automated, while judgment questions like "are we measuring the right thing" and "should we trust this A/B test" remain human territory
  • The 100+ metric problem: teams run experiments against hundreds of metrics, cherry-pick the few that support their preferred decision, and call themselves data-driven—the fix requires courage to retire metrics that don't predict business outcomes
  • AI evaluation crisis: teams build eval suites where scores improve weekly but users hate the shipped product because evals measured fluency (reproducible but wrong) instead of usefulness (what actually matters)
  • Ambiguous A/B test dilemma: primary metric up 3% (significant), retention down 1.5% (not significant but trending wrong)—measurement engineers know when to say "we don't have enough evidence to decide" instead of just presenting p-values
  • Judgment isn't intuition—it comes from specific disciplines data science ignores: construct validity from psychometrics (does time-on-page measure interest or confusion?), measurement reliability (consistent but wrong vs accurate), and decision theory under ambiguity
  • Three converging factors: AI handles execution work, AI evaluation is the hardest measurement problem organizations face, and the cost of bad measurement now scales to models hallucinating in production
  • Hiring should test for judgment with ambiguous datasets instead of SQL speed—the best candidates explain what data cannot support, not just what it shows
  • Every data science program should require measurement theory as a core subject, not an elective, since building systems that measure things requires understanding the science of measurement
  • Organizations need measurement functions with authority to kill bad metrics across analytics, data science, and ML—most teams have the skills distributed but lack the mandate to use them

Decoder

  • Construct validity: Whether a measurement actually captures the concept you care about (e.g., does time-on-page measure user interest or confusion?)
  • Measurement reliability: The difference between consistent-but-wrong measurements and accurate ones (a thermometer that always reads 72 degrees is reliable but potentially useless)
  • Guardrail metric: A secondary metric monitored to catch unintended negative consequences of optimizing for the primary metric (like tracking retention while improving engagement)
  • Item Response Theory: A psychometric framework for understanding how well test items measure underlying traits, applicable to evaluating whether metrics capture what they claim to
  • Power analysis: Statistical technique for determining whether a test has enough data to detect effects that actually matter to the business

Original Article

Measurement Engineering: The Part of Data Science That Will Thrive in AI

What good judgment actually looks like in data science needs a new name

Am I measuring the right thing?

If you've been in data long enough, you know that question matters more than any model you've ever built. Yet most discussions center on "what's going on with this metric" before considering if we're measuring what we intend to.

We've spent 15 years building a field around execution. Learn Python. Learn SQL. Build a model. Ship a dashboard. Get good at Kaggle. The assumption was always that if you could execute, the judgment would come with experience. You'd just absorb it.

I don't think that was ever true. And now, with AI taking over more of the execution layer every month, the gap between people who can do the work and people who can tell you whether the work was worth doing is becoming the defining split in our field.

I've started thinking about the latter group as measurement engineers. Not because they need a new title, but because the data science title is too broad and ambiguous, and they need to be recognized for what they actually do, which is fundamentally different from what most data science job descriptions ask for.

The split we're already feeling

You know this in your gut even if you haven't named it yet. There are two kinds of questions in data work:

Execution questions (AI can do these, and it's getting better fast):

  • Write the query
  • Build the pipeline
  • Train the model
  • Generate the chart

Judgment questions (AI cannot do these, at least not well):

  • Are we measuring the right thing?
  • Does the metric actually capture what we think it captures?
  • Should we trust this A/B test, or is something confounded?
  • The data says X, the users say Y. Who's right?
  • We have 300 metrics. Which 4 actually matter?

We interview for the first list, but the second list generates most of the impact we get. That disconnect is the problem.

What good measurement engineering looks like

Here are three situations you've probably lived through, or will soon.

The 100+ metric problem

Your team runs an experiment. It's measured against hundreds of metrics. Some go up, some go down, some don't move. The PM cherry-picks the three that support what they already wanted to do. Another PM on a different team runs a different experiment against a different subset of the same metrics and reaches the opposite conclusion. Both teams call themselves "data-driven."

The execution was fine. The queries were correct. The dashboards were accurate. The problem is that nobody did the harder work: deciding which metrics actually predict the outcomes the business cares about and having the courage to retire the rest.

That's judgment. The person who solves this doesn't write a query. They sit in a room with product leaders and make uncomfortable decisions about which numbers matter and which ones are noise dressed up as signal. They kill metrics that teams have been watching for years. It might be the single highest-leverage thing a measurement engineer does.

The eval that lied

An AI team builds evaluations for their model. The eval scores say quality is improving every week. Charts go up and to the right. Everyone feels good. They ship.

Users hate it. Support tickets spike. Satisfaction drops. The model is measurably "better" on every internal eval and measurably worse in the real world.

What happened? The evals were precise and reproducible, but they measured the wrong thing. They tested whether the output was fluent. Users cared whether the output was useful. Those are correlated, but they're not the same construct. The eval suite was systematically missing the dimension that mattered.

A data scientist would look at the scores and say, "The model improved." A measurement engineer would ask: When was the last time we validated that these scores predict what users actually experience? The answer, in most organizations, is never.

The fix isn't a better model or a better pipeline. It's redesigning what you measure. And that requires a skill most of us were never taught: understanding the difference between "the thing we're measuring" and "the thing we think we're measuring." Those are not the same. The gap between them is where organizations make their most expensive mistakes.

The ambiguous result

Your team runs an A/B test on a new feature. The primary metric is up 3%, statistically significant. The guardrail metric, long-term retention, is down 1.5%. Not significant, but trending wrong. The PM wants to ship. Everyone looks at you.

What do you do?

A data scientist can tell you what the numbers are. A measurement engineer can tell you what the numbers mean. And sometimes what they mean is: we don't have enough evidence to decide, and pretending we do is more dangerous than waiting.

That willingness to say "the data is ambiguous and here's what I'd do about it" rather than just presenting a p-value is the skill that separates measurement engineers from analysts. It requires understanding power analysis well enough to know when your test couldn't detect the effect that matters. It requires understanding the business well enough to know whether a 1.5% retention drop compounds into a crisis over six months. And it requires the confidence to walk into a room full of people who want a green light and say, "I'd wait."
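
As a concrete illustration of that power-analysis point: with hypothetical numbers, a few lines of statsmodels will tell you whether the test could plausibly have detected the retention drop at all.

```python
# Could this test have detected a 1.5-point retention drop? All numbers
# here are hypothetical; swap in your own baseline and sample sizes.
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.400          # assumed control retention
suspected_drop = 0.015    # the "trending wrong" effect on the guardrail
n_per_arm = 5_000         # assumed users per variant

effect = proportion_effectsize(baseline, baseline - suspected_drop)
power = NormalIndPower().solve_power(effect_size=effect, nobs1=n_per_arm,
                                     alpha=0.05, alternative="two-sided")
print(f"power: {power:.2f}")  # ~0.33 for these assumed numbers - far below
                              # the conventional 0.8 threshold
```

If the computed power comes back that low, "not statistically significant" on the guardrail tells you almost nothing - which is exactly the ambiguity a measurement engineer has to name out loud.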

Judgment is not intuition

I don't believe judgment is just "experience" or "gut feel." That framing lets us off the hook for not teaching it deliberately. You can learn judgment. It comes from specific disciplines that the data science curriculum almost entirely ignores because they're from fields we don't talk to enough. Some examples:

Construct validity comes from psychometrics. Before you measure something, you ask: Does this measurement actually capture the thing I care about? If you're measuring "engagement" with time-on-page, are you measuring interest or confusion? Most data scientists have never been taught to ask this question. Most ML eval suites have never been subjected to it.

Measurement reliability is the difference between a thermometer that gives you a different reading every time and one that reads 72 degrees regardless of the actual temperature. Consistent and wrong. Most ML eval suites have this problem. They're reproducible, and they're reproducibly measuring something that doesn't matter.

Decision theory under ambiguity is what happens after the number comes back, and it's unclear. When two metrics disagree. When the confidence interval is wide. When the cost of being wrong in one direction is 10x the cost of being wrong in the other. Almost all data training teaches you to produce the number. Almost none teaches you what to do when the number doesn't give you a clean answer.

Why this matters right now

Three things changed, and they all point in the same direction.

First, AI is handling the execution. If a language model can write the query, build the pipeline, and generate the chart, what's left for us? The part the model can't do: deciding whether the query answered the right question, whether the pipeline measured the right thing, and whether the chart tells a true story or a convenient one.

Second, AI itself needs to be evaluated. And AI evaluation is the hardest measurement problem most organizations have ever faced. Traditional software either works or it doesn't. AI outputs are non-deterministic, context-dependent, and subjective. "Is this response good?" is not an engineering question. It's a measurement question. And most teams are improvising because nobody on the team was trained in the science of evaluation.

Third, the cost of bad measurement is scaling. An incorrect number on a dashboard leads to a poor quarterly decision. An incorrect eval score on an AI system can lead to a model that hallucinates in production, even when the metrics say everything is fine. The consequences of getting measurement wrong are outpacing our ability to get it right.

What I'd change

In hiring, stop testing for SQL speed and model-building. Start testing for judgment. Give candidates an ambiguous dataset and a question with no clean answer. The best candidates will tell you what the data can and cannot support. The rest will just give you a number.

In training, every data science program should require measurement theory. Construct validity. Inter-rater reliability. Item Response Theory. Not as an elective. As a core requirement. If you're going to build systems that measure things, you should know the science of measurement.

In org design, build a measurement function that owns the question "is this working?" across analytics, data science, and ML. Give it the authority to kill bad metrics and flag bad evals. Most data teams have the skills distributed across individuals. Few have the organizational structure that gives those individuals the mandate to actually use them.

Where this leaves you

If you're reading this and recognize yourself as someone who already does this work, the thing I want you to hear is this: the skill you have is about to become the most valuable on any data team. The people who can judge whether a measurement is valid, whether an eval is predictive, whether a result is trustworthy, those people were always important. In a world where AI handles the execution, they become essential.

And if you're reading this and realizing you've been building your identity around the execution layer, the SQL, the models, and the pipelines, just know it's not enough anymore. The people who make the shift toward judgment, toward measurement, toward owning the question "is this actually working," they're the ones who will define what data teams look like for the next decade.

The transition is uncomfortable. A lot of what made us feel technical and respected is moving to the execution layer that AI handles. But the part that stays human is the part that always mattered most.

Data · infrastructure · cloud

HDFS Lost. How Object Storage and Table Formats Won the Data Lake

Object storage plus table formats replaced HDFS for data lakes because network speeds caught up to disk speeds, eliminating the need for data locality and enabling independent scaling of storage and compute.

Summary

What: The article explains how the data engineering industry quietly abandoned Hadoop HDFS between 2018-2022 in favor of object storage like S3 combined with table formats like Apache Iceberg and Delta Lake, fundamentally changing how data lakes are built.
Why it matters: This shift happened because the hardware assumptions underlying HDFS became obsolete—modern cloud networks (25-100 Gbps) are now as fast or faster than local disk I/O, eliminating the performance benefit of data locality while the new architecture offers massive operational and cost savings through decoupled storage and compute.
Takeaway: If you're running Hadoop on cloud VMs, consider migrating to S3 (or equivalent object storage) plus Iceberg or Delta Lake with ephemeral compute to reduce costs and operational complexity.

Deep Dive

  • The fundamental principle of Hadoop—"move code to data" through data locality—made sense when networks were 1 Gbps and reading from disk was 8x faster than network transfer, but modern cloud instances get 25-100 Gbps network throughput which matches or exceeds local disk speeds
  • HDFS architecturally couples storage and compute in the same DataNodes, forcing you to scale both together even when you only need one, leading to constant over-provisioning and paying 24/7 for idle compute attached to storage
  • Object storage breaks this coupling completely—data lives in S3, compute runs on ephemeral clusters that exist for minutes and disappear, scaling independently and paying only for seconds of actual usage
  • S3's initial "rename problem" (no atomic directory renames, only copy-and-delete of individual objects) seemed like a dealbreaker for Spark committers that relied on atomic directory operations
  • Table formats like Iceberg and Delta Lake solved this by maintaining explicit manifests of files rather than treating directories as implicit state, making commits atomic pointer swaps to new manifests (see the sketch after this list)
  • This wasn't just a workaround but an upgrade—table formats added ACID transactions, schema evolution, time travel, hidden partitioning, and efficient deletes that HDFS-based data lakes never had
  • The winning stack is S3 for storage, a table format for ACID and metadata, and ephemeral compute (Spark, Trino, DuckDB, Snowflake) reading directly from S3, with each layer independently swappable
  • Operational benefits include no NameNode to manage, no block balancer tuning, no specialized Hadoop administrators needed, with the cloud provider handling durability and replication
  • HDFS still makes sense in one scenario—on-premises hardware you already own, where you can't turn servers off anyway and HDFS maximizes utilization of both disks and CPUs
  • Running HDFS on cloud instances is an anti-pattern that pays for compute you can't turn off while fighting the ecosystem that assumes S3
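
The "atomic pointer swap" idea is simple enough to sketch. The following is a conceptual toy on a local filesystem, not the Iceberg or Delta implementation; real systems perform the swap through a catalog's compare-and-swap rather than a filesystem rename.

```python
# Toy model of a table-format commit: write an immutable manifest listing
# the table's data files, then atomically repoint "current" at it.
import json
import os
import tempfile

def commit_snapshot(table_dir: str, data_files: list[str]) -> None:
    # 1. Write the new manifest as an immutable file.
    fd, manifest_path = tempfile.mkstemp(dir=table_dir, suffix=".manifest.json")
    with os.fdopen(fd, "w") as f:
        json.dump({"files": data_files}, f)
    # 2. Atomically swap the pointer. Readers following "_current" see
    #    the old manifest or the new one, never a mix of files.
    pointer_tmp = os.path.join(table_dir, "_current.tmp")
    with open(pointer_tmp, "w") as f:
        f.write(os.path.basename(manifest_path))
    os.replace(pointer_tmp, os.path.join(table_dir, "_current"))
```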

Decoder

  • HDFS: Hadoop Distributed File System, a distributed filesystem that stores data across multiple machines with built-in replication
  • Data locality: The principle of running computation on the same physical machines where the data is stored to avoid network transfer
  • Table formats: Metadata layers like Apache Iceberg and Delta Lake that track which files belong to a table and enable ACID transactions on object storage
  • Object storage: Simple key-value storage systems like S3 that store immutable files without filesystem features like directories or atomic renames
  • ACID transactions: Atomicity, Consistency, Isolation, Durability guarantees that ensure database operations complete reliably
  • NameNode: The master server in HDFS that manages the filesystem namespace and metadata
  • DataNodes: Worker nodes in HDFS that both store data blocks and execute computation
  • Ephemeral compute: Short-lived compute clusters that spin up for a specific job and terminate when complete
  • Parquet: A columnar file format commonly used for analytics workloads

Original Article

Data systems evolved to decouple storage and compute, making it cheaper and easier to scale.

Data · devops · infrastructure

Airflow 2 reaches end of life

Apache Airflow 2 reached end of life on April 22, 2026, leaving production deployments without security patches or provider updates unless teams migrate to Airflow 3.

Summary

What: Apache Airflow 2.x is no longer receiving security patches, bug fixes, or provider package updates after reaching official end of life. Airflow 3.0 shipped a year ago, but many teams have delayed migration despite the announced timeline.
Why it matters: The risks compound over time: unpatched CVEs in Airflow 2 and dependencies will accumulate, while provider packages for data platforms like Snowflake and BigQuery will increasingly require Airflow 3, breaking backward compatibility. The migration involves architectural changes beyond a simple version bump, including removal of SubDAGs, deprecated context variables, and a split of the monolithic webserver into separate API and DAG processor components.
Takeaway: Spin up an Airflow 3 instance and run existing DAGs through migration tooling to identify breaking changes before the gap between your environment and the ecosystem widens further.

Deep Dive

  • Airflow 2 still runs but will not receive security patches for newly discovered CVEs, creating growing vulnerability exposure over time
  • Provider packages for major data platforms will drop Airflow 2 compatibility as they advance, breaking integrations without clear migration paths
  • SubDAGs have been completely removed in favor of Task Groups or dynamic task mapping, requiring DAG refactoring
  • Legacy context variables like execution_date, prev_ds, and next_ds are gone, breaking code that relied on them (see the sketch after this list)
  • The monolithic webserver architecture has been split into separate API server and DAG processor components, changing deployment patterns
  • Built-in DAG versioning now tracks which version of a DAG was used for each run, eliminating custom versioning workarounds
  • Event-driven scheduling enables triggering pipelines from message queues instead of polling with sensors
  • Human-in-the-loop task execution provides native workflow pause capabilities for approval or validation steps
  • The completely rebuilt UI delivers noticeable performance improvements and better usability
  • Migration tooling exists but requires active testing to surface breaking changes specific to each deployment
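
One concrete example of the context-variable break flagged above: a minimal TaskFlow DAG that reads `logical_date` from the task context instead of the removed `execution_date`. The DAG id and schedule are illustrative, not from the article.

```python
# Minimal Airflow 3-compatible DAG: read "logical_date" from the task
# context instead of the removed "execution_date". Names are illustrative.
import pendulum
from airflow.decorators import dag, task

@dag(schedule="@daily",
     start_date=pendulum.datetime(2026, 1, 1, tz="UTC"),
     catchup=False)
def nightly_export():
    @task
    def export_partition(**context):
        # context["execution_date"] breaks after the Airflow 3 upgrade;
        # "logical_date" is the surviving key.
        logical_date = context["logical_date"]
        print(f"exporting partition for {logical_date:%Y-%m-%d}")

    export_partition()

nightly_export()
```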

Decoder

  • Apache Airflow: Open source workflow orchestration platform for scheduling and monitoring data pipelines, widely used for ETL and batch processing
  • DAG: Directed Acyclic Graph, the core abstraction in Airflow representing a workflow as tasks and their dependencies
  • Provider packages: Modular integrations for connecting Airflow to external platforms like cloud data warehouses and data processing engines
  • SubDAGs: Legacy pattern for nesting workflows inside other workflows, replaced by Task Groups in Airflow 3
  • Taskflow API: Decorator-based syntax for defining tasks using Python functions instead of operator classes
  • Deferrable operators: Operators that suspend and free up worker resources while waiting for long-running external tasks
  • CVE: Common Vulnerabilities and Exposures, standardized identifiers for security vulnerabilities

Original Article

Security patches and provider updates stopped last week.

Design · AI · legal

Court sides with iyO in trademark fight against OpenAI and Jony Ive

A federal court blocked OpenAI and Jony Ive from using the "io" name for their AI hardware venture after trademark holder iyO sued for infringement.

Summary

What: Judge Trina Thompson granted iyO a preliminary injunction preventing OpenAI and former Apple designer Jony Ive's collaboration from using the "io" brand name for their announced AI hardware product, ruling that iyO is likely to prevail on claims of trademark infringement, consumer confusion, and trade secret theft.
Why it matters: This sets a precedent for trademark enforcement in the AI hardware space and shows that even high-profile players like OpenAI must navigate existing trademark rights when launching new product lines, despite their resources and visibility.

Original Article

A US federal court granted iyO a preliminary injunction blocking OpenAI and Jony Ive's venture from using the “io” name, siding with iyO in a trademark dispute. The case began after OpenAI and Sam Altman announced the “io” AI hardware brand, prompting iyO to sue over infringement, consumer confusion, and later trade secret theft. Judge Trina Thompson ruled iyO is likely to win and could suffer irreparable harm, rejecting OpenAI's claim that dropping the name makes the case moot since the injunction prevents any future use.

Design · AI · marketing · web · LLM

This Adobe Sneak Uses AI to Rethink Website Personalization

Adobe's Project Page Turner uses large language models to generate personalized web pages in under 100 milliseconds, aiming to replace static websites with AI-driven experiences tailored to individual visitors.

Summary

What: Project Page Turner is an Adobe prototype that uses large language models to generate custom web pages in real-time for individual visitors, working on top of Adobe Experience Manager to pull from existing brand assets and layouts.
Why it matters: This represents a fundamental shift from segment-based personalization to true individual experiences, and positions brands to compete with ChatGPT by keeping AI-powered interactions on their own sites instead of losing users to third-party platforms.
Takeaway: If working with Adobe Experience Manager or enterprise websites, explore how real-time LLM-based personalization might integrate with existing content management workflows.

Deep Dive

  • Adobe's Project Page Turner uses LLMs to deliver a first response in under 100ms and a fully personalized page in under a second, well below the typical 2-second user expectation threshold
  • The system works on top of Adobe Experience Manager, using AEM Assets for content library and AEM Sites for layout elements, with a new indexing layer for rapid LLM access
  • Currently uses Cerebras for inference and primarily GPT models, but maintains a model-agnostic architecture allowing any fast LLM to be swapped in
  • One implementation approach uses natural language search queries as "zero-party data" to generate customized results pages without relying on cookies or third-party data
  • The system can aggregate intent signals from multiple sources including search prompts, browsing behavior, and referral context from platforms like ChatGPT or Google
  • Project Page Turner aims to bridge the gap between traditional brand websites (rich experience, no personalization) and AI chatbots (personalized answers, no brand experience)
  • Developed in response to feedback from over 90 AEM enterprise customers seeking better personalization options beyond segment-based approaches
  • This is one of seven "Sneaks" at Adobe's AI Summit, with historically 30-40% of sneaks making it to production in some form
  • Adobe received over 500 Sneak submissions this year compared to 150 last year, suggesting AI is accelerating internal innovation
  • The project creator envisions websites that "create themselves" during browsing rather than existing as static entities, adapting like a "second skin" to user intent

Decoder

  • Sneak: Adobe's term for experimental prototypes showcased at events, similar to concept demos or internal hackathon projects
  • AEM (Adobe Experience Manager): Adobe's enterprise content management platform used by major brands to build and manage websites
  • Zero-party data: Information users intentionally provide (like search queries), as opposed to third-party cookies or inferred behavioral data
  • Inference speed: How quickly an AI model generates a response after receiving input, critical for real-time applications
  • First-party data: Information collected directly from users on a company's own platforms, as opposed to third-party tracking

Original Article

This Adobe Sneak Uses AI to Rethink the Personalized Web

Cookies have long been the default tool brands reach for when personalizing the web. For over 30 years, small text files stored in browsers have given companies a window into how users move across the web—and a way to tailor ads and content accordingly. But privacy regulations such as GDPR and CCPA, restrictions in browsers like Firefox and Safari, and shifting consumer expectations have gradually eroded their effectiveness, pushing brands to search for alternatives.

What emerged to fill that gap, though, has its own set of tradeoffs. First-party data, contextual targeting, and login-based personalization all attempt to tackle this problem, but they share the same flaw—they personalize for segments, not the individual.

However, a new Sneak from Adobe called Project Page Turner is taking a different approach—and it's powered by artificial intelligence. Adobe previewed it exclusively with The AI Economy ahead of its public unveiling this week.

Meet Project Page Turner

What if you could use a large language model to transform your website so it feels tailor-made for every visitor? It's what Adobe is exploring with Project Page Turner, a working idea that helps brands provide a more one-to-one experience.

Paolo Mottadelli, the Adobe Experience Manager (AEM) engineering director and the creator of this project, says the timing is right—inference speeds have improved to the point where personalized pages can be generated in real-time. Models are now fast enough to deliver a first response in under 100 milliseconds, quicker than you can blink, and with the full page loading in under a second—well below the two-second threshold most users expect.

He points out that websites haven't really changed: "Everybody goes to the same page, and then there is the marketer-driven personalization that requires effort, and it's limited." What Project Page Turner does is empower the page to "imagine itself based on what the site knows about the visitor." In other words, Mottadelli imagines a future in which websites are no longer "static" or even exist anymore.

"They will create themselves under the browsing experience of the user," he says.

Conceived From Customer Feedback

The idea behind Project Page Turner didn't appear out of nowhere. Mottadelli reveals it originated from customer feedback—in working with more than 90 AEM customers in the past year, he was repeatedly asked for ways to provide better, more meaningful personalization. He recalls working on this project back "when technology to do it was not around yet." Now, it is.

As the company behind one of the web's leading content management platforms, Adobe has a direct stake in how websites evolve. Mottadelli says that after seeing Google provide dynamic visual answers that took three minutes to generate, his team began asking what would happen if that same experience could be delivered "in a page delivery timeframe, which happens to be under two seconds."

After some successful trialing, Mottadelli believes "this might be one of the directions the world [wide] web is going."

Unsurprisingly, Project Page Turner sits on top of AEM, utilizing it in two ways: AEM Assets—where brands already store images, video, and content—is the content library this system draws from; and AEM Sites—Adobe's content management system—provides the layout elements. Project Page Turner applies a new indexing layer that makes both rapidly accessible to a fast LLM, allowing it to assemble a personalized page on the fly.

Mottadelli notes that Project Page Turner's inference and underlying models are currently provided by Cerebras, primarily using GPT—though any fast model can be swapped in—reflecting Adobe's broader model-agnostic approach, as evidenced by its Firefly platform.

When implemented, brands need to provide the LLM with brand guidelines, product knowledge, and content rules. All of this goes into instructing it on what to recommend and when, similar to how you'd brief a new employee on company policy. Mottadelli predicts that, eventually, "training the website is like training a human."

How Project Page Turner May Work

[Image: Adobe Project Page Turner interface]

One possible way to implement Project Page Turner could be through a website's search feature. Say you're a personal trainer looking for a new blender that can make multiple protein shakes for your clients. Instead of visiting Amazon or Google, you go directly to a website like Vitamix to do your research. There, you start by entering your query in the search feature.

Traditionally, information is surfaced by keyword or by a referenced product name. However, Project Page Turner analyzes the natural language query and returns a fully customized results page just for you. So, you can specify that you're curious about how different models compare in terms of heavy daily use and cleanup, and that speed matters a lot.

What gets generated isn't a generic results page where you have to click through to find piecemeal information. Rather, you're provided a fully assembled experience built around your specific request. Information about high-volume blenders, self-cleaning instructions, thick protein shake recipes, model comparisons, and purchase options all surface together, organized around what you actually asked for.

"The prompt is basically zero-party data…it doesn't require cookies," Eric Matisoff, Adobe's principal evangelist, tells The AI Economy. "It wouldn't require… second-party or third-party data to that user in order to personalize [the search results]."

He adds, "The most interesting thing to analyze across an entire website is internal search data, because there's no other time that someone is telling you exactly what they're looking for than when they're doing an internal search on your website. This takes that to the next step, while also building the personalized experience on the fly."

Beyond the explicit intent of the search prompt, Project Page Turner can also learn from how you browse—observing which products you view, which categories you explore, and what content you engage with. Mottadelli adds that if you arrive at a supported site from ChatGPT, Google, or another platform, the system could carry that intent over as well. But regardless of the signals available, he emphasizes that consent is central to the vision.

Mottadelli says that his ambitions for Project Page Turner extend far beyond the search bar. "What we're really envisioning is the seamless browsing experience that the website is…becoming your second skin without you even noticing." He imagines sites adapting so closely to us that it feels natural and invisible, easily picking up intent and signals and quietly reshaping content and layout around us in real time. How this idea is realized isn't a technology question but rather a question of how companies want users to experience their websites.

ChatGPT, But With the Brand Experience

People approach traditional search engines and AI chatbots like Perplexity, OpenAI's ChatGPT, Google's Gemini, or Anthropic's Claude very differently. The former is about discovery—you're browsing and exploring without a clear destination—while the latter is about intent; you have a specific goal in mind and expect a tailored answer.

Brands want to capitalize on intent signals without sending visitors to a third-party platform. Yes, ChatGPT answers your question, but it strips away the brand experience—the visuals, the product narrative, the path to purchase. A brand website has all of that, but can't respond to your specific intent in real time. Project Page Turner is Adobe's attempt to sit in between: a brand website that behaves like a chatbot.

Companies can already extend their brands to platforms like ChatGPT. Booking.com, Canva, Coursera, Expedia, Figma, Spotify, Zillow, Upwork, and even Adobe have integrated their services with the popular chatbot. But that means operating on someone else's turf and surrendering control of the brand experience. Project Page Turner flips that dynamic, keeping AI's intent-driven responsiveness on the brand's own website.

"This is putting two things together: immediate reaction to the intent with breadthful brand experience," Mottadelli says.

Whether Adobe can defend Project Page Turner against the likes of OpenAI, Google, Anthropic, Salesforce, and others—some of which are moving aggressively to disrupt established players—remains to be seen. But Mottadelli argues that Adobe has something they mostly do not: 4,000 enterprise customers, decades of content management infrastructure, and direct relationships with the marketing teams responsible for brand experiences. "AEM has the right technology to implement [this solution]," he says. The advantage isn't the technology, Mottadelli points out—it's the trust.

What Are Adobe Sneaks?

Project Page Turner is one of seven so-called Sneaks being unveiled this week at Adobe's AI Summit. None are committed to the product roadmap, though Matisoff notes that historically, 30 to 40 percent have made it into production in some form. That said, the goal of Adobe Sneaks is to highlight innovation.

Matisoff says that AI has contributed to an increase in ideas, all jockeying to be featured on stage: "Last year, we had a little over 150 [submissions]. This year, we had well over 500." He adds that even after identifying the projects that will be featured at the event, "we've seen the ideas continue to advance at a more rapid pace…because of how easy it is for our teams to ideate and expand and improve what they're delivering."

Adobe Sneaks isn't unique to Summit—the showcase also happens at Adobe Max. Think of it as Silicon Valley demo day with a comedic twist: engineers present their ideas on stage before a live audience of thousands, hosted by Matisoff and a celebrity co-host. This year, that's comedian, actress, and producer Iliza Shlesinger.

And for those who want a say in what gets built, new this year is a concept that lets audience members vote for their favorite Sneak. The results will help influence the product roadmap. Now, everyone can voice their support in hopes that their favorite idea becomes reality.

For Adobe, the stakes around Sneaks extend beyond any single idea. The company faces mounting pressure—not just from longtime rivals like Figma and Canva, but also from AI-native challengers like Anthropic rewriting the rules of the software industry. Matisoff sees this showcase as proof that Adobe hasn't lost its founder mentality—that innovation still bubbles up from anywhere in the organization. He points to a line from one of Adobe's founders that still guides the program: "Great ideas can come from anywhere in the company." In 2026, Adobe needs that to be true more than ever.

Design · AI · startup · LLM

SpaceX is Working with Cursor and has an Option to Buy the Startup for $60B

SpaceX partnered with AI coding assistant Cursor and has an option to acquire the company for $60 billion, potentially consolidating Musk's AI ambitions ahead of a SpaceX IPO.

Summary

What: SpaceX announced a partnership with Cursor to develop next-generation coding AI using SpaceX's Colossus supercomputer, with an option to either pay Cursor $10 billion for development work or acquire the entire company for $60 billion later in 2026.
Why it matters: The deal reveals competitive weaknesses—neither Cursor nor xAI has models matching Anthropic and OpenAI, which now compete directly with Cursor in developer tools. It could help Cursor escape its awkward dependence on reselling Claude and GPT models from its own competitors while giving SpaceX more appeal to IPO investors.

Deep Dive

  • SpaceX can either pay Cursor $10 billion for development work or acquire the entire company for $60 billion at an undisclosed point later in 2026
  • The partnership combines Cursor's developer platform and engineering expertise with SpaceX's Colossus supercomputer to build next-generation coding AI
  • Cursor's valuation has exploded from $2.5 billion in January 2025 to potentially $60 billion in April 2026, a 24x increase in 15 months
  • The deal follows xAI (another Musk company) beginning to rent tens of thousands of AI chips to Cursor for model training last week
  • Two senior Cursor engineers recently left for xAI to report directly to Musk, signaling deepening ties between the companies
  • Neither Cursor nor xAI has proprietary models competitive with Anthropic's Claude or OpenAI's GPT series
  • Cursor currently resells access to Claude and GPT models even as Anthropic and OpenAI roll out competing coding tools, creating an awkward dependency
  • The SpaceX partnership may help Cursor break free from reliance on models provided by its direct competitors
  • The deal positions SpaceX to offer more value to IPO investors by expanding beyond aerospace into AI infrastructure and developer tools
  • SpaceX faces financial pressure from the xAI and X (Twitter) acquisitions plus planned capital investments, though it's unclear if the Cursor deal could be paid in stock

Decoder

  • Cursor: AI-powered code editor and coding assistant platform popular among developers
  • Colossus: SpaceX's supercomputer claimed to have computing power equivalent to one million Nvidia H100 chips
  • xAI: Elon Musk's artificial intelligence company, separate from but related to SpaceX
  • H100: Nvidia's high-end GPU chip designed for AI training and inference workloads

Original Article

SpaceX said it has struck a deal with Cursor to develop a next-generation "coding and knowledge work AI," which includes a surprising provision — an option to buy the popular software development platform for $60 billion later this year.

Partnering with and potentially purchasing a leader in the hottest AI product category can only be seen in the context of SpaceX's much-anticipated public offering. Investors might see the engagement with Cursor as another way to extract value from Elon Musk's increasingly sprawling tech conglomerate.

The deal won't shock those who follow the industry closely. Last week, it was reported that xAI would begin renting computing power from its data centers to Cursor, with the coding startup using tens of thousands of xAI chips to train its latest AI model. And last month, two of Cursor's most senior engineering leaders, Andrew Milich and Jason Ginsberg, left the company to join xAI, where both report directly to Musk.

SpaceX described the partnership as a project combining Cursor's "product and distribution to expert software engineers" with SpaceX's Colossus supercomputer, which the company claims has the equivalent compute power of a million Nvidia H100 chips.

SpaceX also said that at some undisclosed point later this year, it will either pay Cursor $10 billion for its work or acquire the company for $60 billion. Last week, TechCrunch reported that Cursor was eyeing a $50 billion valuation in an upcoming private fundraising round. That figure itself reflects an astonishing series of leaps. Cursor was valued at just $2.5 billion in January of last year, climbed to $9 billion by last May, and was assigned a $29.3 billion post-money valuation when it closed on $2.3 billion in Series D funding in November.

Either figure would represent a significant expense for SpaceX, which is widely believed to be losing money following the acquisitions of xAI and the social media network X and is planning extensive capital investment. The brief statement did not say if either deal could be paid in SpaceX stock.

In the meantime, the move could shore up weaknesses at each company, but it also reveals them. Neither Cursor nor xAI has proprietary models that can match the leading offerings from Anthropic and OpenAI — the same companies now competing directly with Cursor for the developer market.

Cursor still uses and sells access to Claude and GPT models even as both firms roll out their own coding tools, an awkward arrangement that this new SpaceX partnership may be designed to eventually escape.

Design · ux · ai · career

The UX Designer's Nightmare: When “Production-ready” Becomes a Design Deliverable

UX designers are increasingly expected to deliver production-ready code using AI tools, creating a competency gap where traditional design expertise is being devalued in favor of technical coding skills.

Summary

What: The article examines how 2026 job requirements demand UX designers use AI to generate functional code alongside traditional design work, effectively merging design and engineering roles into a "design engineer" hybrid that expects both user research expertise and the ability to ship React components to production.
Why it matters: This shift risks creating widespread "quality debt" as designers lack the technical foundation to audit AI-generated code—92% of AI codebases contain critical security vulnerabilities, accessibility is often broken, and code duplication runs 4x higher than human-written code, forcing engineering teams to spend significant time cleaning up rather than gaining promised efficiency.
Takeaway: Treat AI-generated code as a collaboration starting point with developers rather than a shortcut around engineering expertise, and never ship code you cannot explain or debug yourself.

Deep Dive

  • UX/UI roles projected to grow 16% through 2034, but growth is increasingly tied to AI product development where design skills are now the most in-demand capability, even ahead of coding
  • 73% of designers now view AI as a primary collaborator, but this often manifests as "role creep" where recruiters expect designers to handle both user empathy and prompting React components into Git repositories
  • Attempting to master both deep design and coding disciplines simultaneously leads to average competency in both—research shows AI-assisted participants scored 17% lower on comprehension tests than those who coded by hand
  • The largest performance gap between AI-reliant and manual coders appears in debugging, where designers cannot trace logic failures in code they didn't fully understand when generating
  • 92% of AI-generated codebases contain at least one critical vulnerability, with AI-generated login forms showing an 86% failure rate in XSS defense measures
  • AI frequently generates non-semantic HTML (like div-based toggle switches) that lacks keyboard focus and screen-reader compatibility, creating expensive accessibility debt
  • AI-generated code is linked to 4x more code duplication than human-written code, producing verbose output that slows page loads and harms SEO
  • While velocity increases, incidents per pull request are rising 23.5% as organizations discover the "rework tax" of cleaning up AI-generated code
  • Only 69% of designers feel AI improves work quality compared to 82% of developers, exposing a gap where "code that compiles" differs from "code that is maintainable"
  • The solution involves a human-AI-human loop where designers handle prompts for intent and accessibility while engineers handle architecture and performance, with design systems serving as guardrails
  • Companies prioritizing designer-shipped code without engineering oversight face accumulating technical debt, security breaches, and potential accessibility lawsuits

Decoder

  • XSS (Cross-Site Scripting): Security attack where malicious scripts are injected into trusted websites, often through poorly validated input fields
  • Semantic HTML: Markup that conveys meaning about content structure to browsers and assistive technologies, not just visual appearance
  • Accessibility debt: Accumulated cost of fixing usability issues for users with disabilities that were ignored during initial development
  • Technical debt: Future rework required from taking shortcuts or making suboptimal decisions during development
  • Design system: Centralized library of reusable UI components and standards that ensure consistency across products
  • Pull request (PR): Proposed code changes submitted for review before merging into the main codebase

Original Article

In a rush to embrace AI, the industry is redefining what it means to be a UX designer, blurring the line between design and engineering. Carrie Webster explores what's gained, what's lost, and why designers need to remain the guardians of the user experience.

In early 2026, I noticed that the UX designer's toolkit seemed to shift overnight. The industry standard "Should designers code?" debate was abruptly settled by the market, not through a consensus of our craft, but through the brute force of job requirements. If you browse LinkedIn today, you'll notice a stark change: UX roles increasingly demand AI-augmented development, technical orchestration, and production-ready prototyping.

For many, including myself, this is the ultimate design job nightmare. We are being asked to deliver both the "vibe" and the "code" simultaneously, using AI agents to bridge a technical gap that previously took years of computer science knowledge and coding experience to cross. But as the industry rushes to meet these new expectations, teams are discovering that AI-generated functional code is not always good code.

The LinkedIn Pressure Cooker: Role Creep In 2026

The job market is sending a clear signal. While traditional graphic design roles are expected to grow by only 3% through 2034, UX, UI, and Product Design roles are projected to grow by 16% over the same period.

However, this growth is increasingly tied to the rise of AI product development, where "design skills" have recently become the #1 most in-demand capability, even ahead of coding and cloud infrastructure. Companies building these platforms are no longer just looking for visual designers; they need professionals who can "translate technical capability into human-centered experiences."

This creates a high-stakes environment for the UX designer. We are no longer just responsible for the interface; we are expected to understand the technical logic well enough to ensure that complex AI capabilities feel intuitive, safe, and useful for the human on the other side of the screen. Designers are being pushed toward a "design engineer" model, where we must bridge the gap between abstract AI logic and user-facing code.

A recent survey found that 73% of designers now view AI as a primary collaborator rather than just a tool. However, this "collaboration" often looks like "role creep." Recruiters are often not just looking for someone who understands user empathy and information architecture — they want someone who can also prompt a React component into existence and push it to a repository!

This shift has created a competency gap.

As an experienced senior designer who has spent decades mastering the nuances of cognitive load, accessibility standards, and ethnographic research, I am suddenly finding myself being judged on my ability to debug a CSS Flexbox issue or manage a Git branch.

The nightmare isn't the technology itself. It's the reallocation of value.

Businesses are beginning to value the speed of output over the quality of the experience, fundamentally changing what it means to be a "successful" designer in 2026.

Tools that allow designers to switch from design to code. (Image source: Figma)

The Competence Trap: Two Job Skill Sets, One Average Result

There is a dangerous myth circulating in boardrooms that AI makes a designer "equal" to an engineer. This narrative suggests that because an LLM can generate a functional JavaScript event handler, the person prompting it doesn't need to understand the underlying logic. In reality, attempting to master two disparate, deep fields simultaneously will most likely yield average competence in both.

The "Averagely Competent" Dilemma

For a senior UX designer to become a senior-level coder is like asking a master chef to also be a master plumber because "they both work in the kitchen." You might get the water running, but you won't know why the pipes are rattling.

  • The "cognitive offloading" risk. Research shows that while AI can speed up task completion, it often leads to a significant decrease in conceptual mastery. In a controlled study, participants using AI assistance scored 17% lower on comprehension tests than those who coded by hand.
  • The debugging gap. The largest performance gap between AI-reliant users and hand-coders is in debugging. When a designer uses AI to write code they don't fully understand, they don't have the ability to identify when and why it fails.
Chart: using AI tools impedes coding skill formation.

So, if a designer ships an AI-generated component that breaks during a high-traffic event and cannot manually trace the logic, they are no longer an expert. They are now a liability.

The High Cost Of Unoptimised Code

Any experienced code engineer will tell you that creating code with AI without the right prompt leads to a lot of rework. Because most designers lack the technical foundation to audit the code the AI gives them, they are inadvertently shipping massive amounts of "Quality Debt".

Common Issues In Designer-Generated AI Code

  • The security flaw: Recent reports indicate that up to 92% of AI-generated codebases contain at least one critical vulnerability. A designer might see a functioning login form, unaware that such forms fail XSS defenses (the security measures aimed at preventing attackers from injecting malicious scripts into trusted websites) 86% of the time.
  • The accessibility illusion: AI often generates "functional" applications that lack semantic integrity. A designer might prompt a "beautiful and functional toggle switch," but the AI may provide a non-semantic <div> that lacks keyboard focus and screen-reader compatibility, creating Accessibility Debt that is expensive to fix later (a sketch contrasting the two approaches follows this list).
  • The performance penalty: AI-generated code tends to be verbose. AI is linked to 4x more code duplication than human-written code. This verbosity slows down page loads, creates massive CSS files, and negatively impacts SEO. To a business, the task looks "done." To a user with a slow connection or a screen reader, the site is a nightmare.
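
To make the toggle example concrete, here is a minimal sketch (invented for this summary, not code from the article; the component names are hypothetical) contrasting the div-based control AI tools often emit with a semantic equivalent that keyboard and screen-reader users can actually operate:

```tsx
import React, { useState } from "react";

// Inaccessible: looks like a toggle, but a <div> has no role, receives no
// keyboard focus, and announces nothing to assistive technology.
function DivToggle({ onChange }: { onChange: (on: boolean) => void }) {
  const [on, setOn] = useState(false);
  return (
    <div
      className={on ? "toggle on" : "toggle"}
      onClick={() => { setOn(!on); onChange(!on); }}
    />
  );
}

// Semantic: a real <button> with role="switch" is focusable by default,
// toggles with Enter/Space, and reports its state via aria-checked.
function SwitchToggle({ label, onChange }: { label: string; onChange: (on: boolean) => void }) {
  const [on, setOn] = useState(false);
  return (
    <button
      role="switch"
      aria-checked={on}
      onClick={() => { setOn(!on); onChange(!on); }}
    >
      {label}
    </button>
  );
}
```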

Creating More Work, Not Less

The promise of AI was that designers could ship features without bothering the engineers. The reality has been the birth of a "Rework Tax" that is draining engineering resources across the industry.

  • Cleaning up: Organisations are finding that while velocity increases, incidents per Pull Request are also rising by 23.5%. Some engineering teams now spend a significant portion of their week cleaning up "AI slop" delivered by design teams who skipped a rigorous review process.
  • The communication gap: Only 69% of designers feel AI improves the quality of their work, compared to 82% of developers. This gap exists because "code that compiles" is not the same as "code that is maintainable."

When a designer hands off AI-generated code that ignores a company's internal naming conventions or management patterns, they aren't helping the engineer; they are creating a puzzle that someone else has to solve later.

Chart: typical issues that developers face with AI-generated code.

The Solution

We need to move away from the nightmare of the "Solo Full-Stack Designer" and toward a model of designer/coder collaboration.

The ideal reality:

  • The Partnership: Instead of designers trying to be mediocre coders, they should work in a human-AI-human loop. A senior UX designer should work with an engineer to use AI; the designer creates prompts for intent, accessibility, and user flow, while the engineer creates prompts for architecture and performance.
  • Design systems as guardrails: To prevent accessibility debt from spreading at scale, accessible components must be the default in your design system. AI should be used to feed these tokens into your UI, ensuring that even generated code stays within the "source of truth."

Beyond The Prompt

The industry is currently in a state of "AI Infatuation," but the pendulum will eventually swing back toward quality.

The UX designer's nightmare ends when we stop trying to compete with AI tools at what they do best (generating syntax) and keep our focus on what they cannot do (understanding human complexity).

Businesses that prioritise "designer-shipped code" without engineering oversight will eventually face a reckoning of technical debt, security breaches, and accessibility lawsuits. The designers who thrive in 2026 and beyond will be those who refuse to be "prompt operators" and instead position themselves as the guardians of the user experience. This is the perfect outcome for experienced designers and for the industry.

Our value has always been our ability to advocate for the human on the other side of the screen. We must use AI to augment our design thinking, allowing us to test more ideas and iterate faster, but we must never let it replace the specialised engineering expertise that ensures our designs technically work for everyone.

Summary Checklist for UX Designers

  • Work Together. Use AI-made code as a starting point to talk with your developers. Don't use it as a shortcut to avoid working with them. Ask them to help you with prompts for code creation for the best outcomes.
  • Understand the "Why". Never submit code you don't understand. If you can't explain how the AI-generated logic works, don't include it in your work.
  • Build for Everyone. Good design is more than just looks. Use AI to check if your code works for people using screen readers or keyboards, not just to make things look pretty.
Design · ai · tools

Designing with AI without losing your mind

A designer concerned about outsourcing critical thinking to AI built a real-time collaborator that challenges ideas during sketching rather than just generating solutions.

Summary

What: Thia is an open-source AI app with vision and voice capabilities that engages in critical dialogue while a designer sketches on a whiteboard, acting more like a questioning colleague than a solution generator. Built using Google AI Studio in three days and released on GitHub.
Why it matters: The article argues that while AI tools accelerate execution, the pressure to ship faster is causing designers to skip crucial thinking stages, leading to weak solutions that get built quickly but lack substance. The author wanted to preserve the "cognitive friction" of early ideation rather than jumping straight to high-fidelity outputs.
Takeaway: Try Thia on GitHub or Google AI Studio to experiment with AI that supports early-stage ideation, or consider how to preserve critical thinking stages in your own AI-augmented workflow.

Deep Dive

  • The author noticed a behavioral shift where they were reaching for LLMs to solve design problems instead of spending time thinking through and sketching solutions themselves
  • Current AI design tools enable organizations to ship fast and test rapidly, but this has devolved into throwing "cheap spaghetti" at the wall rather than filtering out weak ideas through early scrutiny
  • The author identified whiteboarding and sketching as a "sacred" design stage where deep thinking shouldn't be outsourced to AI, but wanted support for it when working solo or remotely
  • Existing tools like Google Stitch and Gemini either jumped straight to high-fidelity or weren't tuned to be critical, missing the need for challenging dialogue during early ideation
  • Thia was built to engage in Socratic dialogue in real-time while watching the designer sketch via webcam, creating a tight feedback loop instead of a waterfall upload-chat-iterate process
  • The app was developed using Google AI Studio and refined through iterative prompting over three days, with the author using Thia to design Thia's own features
  • Physical sketching and writing activates different brain regions than typing, forging deeper long-term memory connections and supporting internal evaluation
  • The real game-changer with AI is execution speed at each stage, not permission to skip stages like user research or robust idea refinement
  • AI's ease can create a facade where polished prototypes with features like breathing animations paper over fractured concepts and weak value propositions
  • Critical thinking skills—analysis, evaluation, cross-domain synthesis—will become more valuable once the AI hype settles, and maintaining them requires embracing cognitive friction rather than rushing to the finish line
  • The roadmap for Thia includes overwriting on screenshots, multiple simultaneous participants, calling multiple models for complex tasks, and orchestrating agents on platforms like OpenClaw

Decoder

  • LLM: Large Language Model, AI systems like ChatGPT or Claude trained on massive text datasets to generate human-like responses
  • Socratic dialogue: A questioning method that challenges assumptions and encourages critical thinking through systematic questioning rather than direct answers
  • Google AI Studio: A development environment from Google providing access to frontier AI models for building applications
  • Multi-modal AI: AI systems that can process multiple types of input simultaneously, like text, voice, and images
  • Cognitive friction: The mental effort required for deep thinking and evaluation, which the author argues should be preserved rather than eliminated

Original Article

Heavy reliance on AI tools is pushing designers to outsource critical thinking, favoring speed and rapid output over deeper evaluation and idea development. To counter this, an AI collaborator (Thia) was created to engage in real-time, critical dialogue during sketching—supporting, rather than replacing, human thinking. The key takeaway is that while AI accelerates execution, strong design still depends on deliberate reasoning and “cognitive friction,” which must be preserved to avoid building fast but flawed solutions.
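
The article doesn't reproduce Thia's source here, but the loop it describes (watch the sketch via webcam, ask Socratic questions, repeat) can be approximated against Google's @google/generative-ai SDK. A minimal sketch, in which the model name, system prompt, polling interval, and captureWhiteboardFrame helper are illustrative assumptions, not Thia's actual code:

```ts
import { GoogleGenerativeAI } from "@google/generative-ai";

const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({
  model: "gemini-1.5-flash",
  // A Socratic system prompt: question the sketch instead of solving it.
  systemInstruction:
    "You are a design critique partner. Ask one probing question about " +
    "assumptions, users, or edge cases in the sketch. Never propose a solution.",
});

// Hypothetical helper: grabs the current webcam frame as base64 JPEG.
declare function captureWhiteboardFrame(): Promise<string>;

async function critiqueLoop() {
  const chat = model.startChat();
  setInterval(async () => {
    const frame = await captureWhiteboardFrame();
    const result = await chat.sendMessage([
      { inlineData: { mimeType: "image/jpeg", data: frame } },
      { text: "Here is my current sketch. What should I be questioning?" },
    ]);
    console.log(result.response.text()); // a voice layer would speak this
  }, 15_000); // re-check the whiteboard every 15 seconds
}
```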

Design · opensource · product

Working in the open

Working in open source transforms the design process from shipping features to designing transparently with community feedback, public roadmaps, and long-term stewardship.

Summary

What: An essay exploring how open source work (specifically on projects like Open Data Kit) fundamentally changes the design and development process, emphasizing transparency, public collaboration, community feedback, and long-term maintenance over rapid feature delivery.
Why it matters: This highlights a fundamental difference between open source and traditional product development: open collaboration forces clarity, accountability, and inclusive decision-making, while the "un/learning loop" approach—slowing down to learn in public and embrace uncertainty—produces better outcomes in complex environments.
Takeaway: Consider adopting transparent workflows like public roadmaps and open design processes, even in closed-source projects, to improve decision-making through community accountability.

Deep Dive

  • Open source design work extends beyond feature delivery to include transparent collaboration, public accountability, and long-term impact considerations
  • Public roadmaps and constant community feedback create forcing functions for clearer thinking and better decision-making
  • The "un/learning loop" concept advocates slowing down, staying open to uncertainty, and learning publicly rather than rushing to solutions
  • Close collaboration with actual users throughout the design process leads to more inclusive and contextually appropriate solutions
  • Stewardship—maintaining and evolving systems responsibly—takes priority over shipping speed in open source contexts
  • Transparency and public work create accountability that improves design quality, particularly in high-stakes or complex environments
  • Working openly requires developing comfort with uncertainty and iterating based on community input rather than internal assumptions

Decoder

  • ODK (Open Data Kit): An open-source platform for mobile data collection, often used in humanitarian and research contexts
  • un/learning loop: A design approach emphasizing continuous learning and unlearning through public iteration, uncertainty, and community feedback rather than linear execution

Original Article

Working in open source—especially on large-scale projects like ODK—pushes designers beyond simply shipping features toward designing transparently, collaborating closely with users, and thinking long-term about maintenance and impact. Open collaboration, public roadmaps, and constant community feedback improve decision-making while forcing clarity, accountability, and trust. A key lesson is embracing the “un/learning loop”: slowing down, staying open to uncertainty, and learning in public leads to better, more inclusive solutions—especially in complex or high-stakes environments—while prioritizing stewardship over speed.

Design · ai · accessibility · benchmarks · llm

AI Model Accessibility Checker (Website)

A new benchmark reveals which AI coding models generate accessible web pages, with OpenAI dominating the top spots while Anthropic's Claude models produce nearly triple the accessibility violations of competitors.

Summary

What: AIMAC tests 43 AI models by prompting them to build web pages across 28 categories without accessibility guidance, then uses axe-core to audit for WCAG 2.2 Level AA violations. The GAAD Foundation project aims to push AI providers to generate accessible code by default.
Why it matters: With AI writing an increasing share of the web's code and 95.9% of top websites failing basic accessibility checks, this benchmark addresses whether AI will perpetuate or solve accessibility problems. The results show significant variation between models, suggesting accessibility isn't an automatic byproduct of model capability.
Takeaway: Check your preferred AI coding model's accessibility performance at the AIMAC website, especially if you're generating frontend code: models like GPT 5.4 Mini achieve zero accessibility debt for under $1, while popular models like Claude Sonnet 4.6 generate nearly 3x the average violations.

Deep Dive

  • OpenAI's GPT 5.3 Codex achieves a median AIMAC Debt of 0.00 with just 20 violations across 28 categories, taking first place at $3.02 total cost
  • GPT 5.4 Mini ranks second with the same 0.00 debt score for only $0.89, demonstrating that top accessibility doesn't require premium pricing
  • Anthropic's Claude models performed poorly across the board, with Sonnet 4.6 generating 1,186 accessibility violations (nearly 3x the field average of 403)
  • Claude Sonnet 4.6 regressed sharply from version 4.5, dropping from rank #19 to #36, making Anthropic the only major provider whose models got worse
  • Google dramatically improved from dead last to #9 after Gemini 3.1 Pro Preview replaced Gemini 3 Pro Preview, showing the benchmark is driving real change
  • Low contrast text dominates violations at 84.8% of pages, mirroring the same 84% rate found in the WebAIM Million audit of real websites
  • Chinese models like Qwen, DeepSeek, and MoonshotAI account for 15 of 43 models tested, but DeepSeek's efficiency breakthrough hasn't translated to accessible output
  • The benchmark uses prompt randomization with hundreds of billions of variants to prevent gaming through memorization or fine-tuning on specific test cases
  • AIMAC Debt scores violations logarithmically: critical violations (missing labels, empty buttons) count 5 points, serious violations (color contrast) count 2 points, with dampening for duplicates
  • Twenty-eight of 43 models cost under $1 to generate all 28 test pages, including strong performers like Grok 4.1 Fast at 8 cents and gpt-oss-120b at 10 cents
  • Arcee AI's Trinity Large Preview ranks #5 for free, representing the largest permissively-licensed open model from a US company at 400B sparse parameters
  • The benchmark includes an emdash benchmark tracking punctuation style, where Claude Sonnet 4.6 used 754 emdashes while Mistral's Codestral used zero
  • Version upgrades frequently regressed accessibility: GLM 5 to GLM 5.1 dropped 37 ranks, KAT V1 to V2 dropped 28 ranks, and Sonnet 4.5 to 4.6 dropped 18 ranks
  • Automated testing catches issues like color contrast and missing labels but can't evaluate keyboard navigation or real screen reader usability
  • The project was inspired by WebAIM's Million report showing 95.9% of top websites fail accessibility checks, with errors jumping 10% in 2026 after six years of improvement

Decoder

  • AIMAC Debt: Score measuring accessibility violations, where 0.00 is perfect and higher numbers indicate more/severe violations (critical issues count 5 points, serious issues count 2 points)
  • axe-core: Open-source accessibility testing engine used by Microsoft, Google, and major tech companies to detect WCAG violations
  • WCAG 2.2 Level AA: Web Content Accessibility Guidelines standard required by US and EU accessibility laws
  • Pareto optimal: Model that represents the best cost-quality tradeoff; you can't get better accessibility without paying more, or lower cost without worse accessibility
  • GAAD Foundation: Global Accessibility Awareness Day Foundation, the organization behind this benchmark
  • Screen reader: Assistive technology that reads web content aloud for blind or visually impaired users
  • ARIA attributes: HTML attributes that help screen readers understand page structure and interactive elements
  • WebAIM Million: Annual audit of the top 1 million websites for accessibility compliance

Original Article

AIMAC: The AI Model Accessibility Checker

AI is writing more code than ever. But is it accessible to People with Disabilities?

We prompted the top AI models to build web pages across 28 categories and audited them for accessibility. We published every generated page, side by side, so you can see how different models tackled the same design challenge. We even measured emdash usage.

AIMAC is an initiative by the GAAD Foundation, in partnership with ServiceNow. Updated Apr 08, 2026.

Leaderboard

Legend: 🅿️ Pareto Optimal | 🏆 Lowest Debt

Rank | Model | AIMAC Debt (lower = better) | Total Cost (USD)
1 🏆 OpenAI: GPT 5.3 Codex 0.00 1 * $3.02
2 🅿️ OpenAI: GPT 5.4 Mini 0.00 1 * $0.89
3 OpenAI: GPT 5.4 0.00 1 * $4.48
4 OpenAI: GPT 5.4 Pro 1.00 $212.75
5 🅿️ Arcee AI: Trinity Large Preview (free) 3.98 $0.00
6 OpenAI: o3 4.02 $1.13
7 Qwen: Qwen3 Max 4.14 $1.07
8 Qwen: Qwen3 Coder Plus 4.36 $0.75
9 Google: Gemini 3.1 Pro Preview 4.40 $4.16
10 Anthropic: Claude Haiku 4.5 4.47 $2.14
11 OpenAI: gpt oss 120b 4.51 $0.10
12 MoonshotAI: Kimi K2 Thinking 4.54 $0.82
13 OpenAI: o4 Mini 4.58 $0.74
14 xAI: Grok 4.1 Fast 4.61 $0.08
15 Mistral: Mistral Medium 3.1 4.70 $0.45
16 AllenAI: Olmo 3.1 32B Instruct 4.80 $0.12
17 NVIDIA: Nemotron 3 Super (free) 4.82 $0.00
18 Qwen: Qwen3 Coder Next 4.99 $0.41
19 MoonshotAI: Kimi K2.5 5.12 $1.07
20 MiniMax: MiniMax M2.7 5.34 $1.28
21 Z.ai: GLM 4.7 Flash 5.37 $0.11
22 Qwen: Qwen3.6 Plus 5.41 $0.99
23 Anthropic: Claude Opus 4.6 (Fast) 5.94 $112.18
24 Qwen: Qwen3 Max Thinking 6.00 $1.11
25 Google: Gemini 3 Flash Preview 6.84 $0.66
26 Anthropic: Claude Opus 4.6 6.90 $18.34
27 Amazon: Nova 2 Lite 6.98 $0.40
28 DeepSeek: DeepSeek V3.2 Speciale 7.21 $0.20
29 Arcee AI: Trinity Large Thinking 7.97 $0.25
30 DeepSeek: DeepSeek V3.2 8.18 $0.09
31 Qwen: Qwen3 Coder Flash 8.20 $0.31
32 Mistral: Mistral Large 3 2512 8.33 $0.44
33 Mistral: Codestral 2508 8.40 $0.15
34 Mistral: Devstral 2 2512 9.01 $0.13
35 Qwen: Qwen3.5 397B A17B 9.16 $0.87
36 Anthropic: Claude Sonnet 4.6 9.83 $12.64
37 DeepSeek: R1 0528 10.17 $0.66
38 xAI: Grok 4.20 11.16 $2.07
39 Qwen: Qwen3.5 Flash 11.37 $0.09
40 Google: Gemma 4 31B 12.59 $0.08
41 Kwaipilot: KAT Coder Pro V2 13.19 $0.44
42 Z.ai: GLM 5.1 16.38 $2.47
43 Mistral: Mistral Small 4 16.96 $0.18
Total $390.28

1 #1, #2, and #3 tied with an AIMAC Debt of 0.00. Tiebreaker: #1 averaged fewer violations (0.94 vs 1.02 vs 1.31).

* GPT 5.3 Codex shows a median AIMAC Debt of 0.00. This means at least half of the 28 categories had zero accessibility violations, but some categories still had minor issues (20 total violations across all categories).

Analysis

Introduction

95.9% of the top million websites fail basic accessibility checks. WebAIM has tracked it for seven years. After six years of marginal improvement, 2026 reversed the trend: errors per page jumped 10% to 56.1 and the failure rate climbed back to 95.9%.

AI is writing more of the world's code every day. Vibe Coding was the Collins Dictionary Word of the Year. If AI keeps writing code as poorly as the developers it learned from, nothing changes. But if it prioritizes accessibility, the web gets its first real chance to improve.

Our one ambitious goal

Ensure that AI models write accessible code by default.

Which Model is Best?

GPT 5.3 Codex by OpenAI holds the top spot with a median AIMAC Debt of 0.00, generating just 20 accessibility violations across all 28 categories. GPT 5.4 Mini joins it at #2, also achieving zero debt for just $0.89. GPT 5.4 takes #3 (also 0.00) and GPT 5.4 Pro sits at #4 (1.00). Arcee AI's Trinity Large Preview (free) rounds out the top five at #5 with a debt of 3.98. OpenAI holds four of the top five spots.

Three models now achieve a median AIMAC Debt of 0.00. The cheapest is GPT 5.4 Mini at $0.89. The most expensive model on the benchmark is GPT 5.4 Pro at $212.75.


The Pareto Frontier

AIMAC Debt vs Cost

Choosing a model isn't simply about which model is most accessible. Some models are very expensive. Benchmarks commonly use Pareto Frontier analysis to compare models on quality vs cost dimensions. Pareto optimal models (teal diamonds) are the efficient picks: to lower the AIMAC Debt grade, you'd pay more; to pay less, your AIMAC Debt grade rises. A gold ring marks the lowest AIMAC Debt.

Chart: accessibility vs cost comparison of 43 AI models. Lowest debt: GPT 5.3 Codex (0.00, $3.02). Two models are Pareto optimal; highlighted models range from free (Trinity Large Preview) to $3.02.

Top 3 Winners

  • OpenAI
  • Arcee AI
  • Alibaba/Qwen

1. OpenAI dominates the top of the leaderboard. Three of their models achieve a median AIMAC Debt of 0.00: GPT 5.3 Codex (#1), GPT 5.4 Mini (#2), and GPT 5.4 (#3). A median of 0.00 means at least half of the 28 categories had zero violations, though a few categories still had minor issues. GPT 5.4 Pro rounds out the top four at 1.00. They hold four of the top five spots (#1, #2, #3, #4), and their open-source gpt-oss-120b lands at #11 for just $0.10.

2. Arcee AI is a 30-person San Francisco startup that built Trinity Large Preview from scratch for $20 million. It is a 400-billion-parameter sparse model with only 13 billion active per token, Apache 2.0 licensed, and ranks #5 for free. It is the largest permissively-licensed open model from a US company. Their Trinity Large Thinking variant lands at #29 (debt 7.97, $0.25).

3. Alibaba/Qwen fields the deepest roster outside OpenAI, with eight models spanning the leaderboard. Qwen3 Max leads the pack at #7 (debt 4.14, $1.07), and Qwen3 Coder Plus follows at #8 (debt 4.36, $0.75). Their lineup covers proprietary flagships, code-tuned variants, and open-weight releases, giving developers more accessibility-tested options than any other non-US provider.

Google's Comeback

Google's Gemini 3 Pro Preview once finished dead last at #39 with an AIMAC Debt of 10.65. We were disappointed because Google has the tools and talent to do better.

Its replacement, Gemini 3.1 Pro Preview, now sits at #9 with an AIMAC Debt of 4.40. Google is no longer at the bottom. This is exactly the kind of progress we hoped this benchmark would encourage.

What About Claude?

Anthropic has real developer mindshare, and Claude is often the default choice when teams want strong coding help. In this benchmark, their best result is Claude Haiku 4.5 at rank #10 (4.47 for $2.14). Claude Opus 4.6 (Fast) ranks #23 (5.94 for $112.18), Claude Opus 4.6 ranks #26 (6.90 for $18.34), and Claude Sonnet 4.6 sits at #36 (9.83 for $12.64). Sonnet's regression from 4.5 remains sharp: from #19 (debt 4.78) to #36 (debt 9.83).

Other major providers improved their accessibility results. Anthropic is the only one whose models got worse.

Sonnet 4.6 generates 1,186 accessibility violations, nearly three times the field average of 403. Opus 4.6 produces 788, nearly double. The same gap shows in Anthropic's frontend-design skill, where accessibility is barely mentioned while the guidance is overwhelmingly visual.

Anthropic is a Public Benefit Corporation whose stated values include acting "for the global good" and being good to their users, whom they define broadly as "anyone impacted by the technology we build." People with disabilities are impacted every time Claude generates inaccessible code. We hope Anthropic will bring the same energy they apply to AI safety to the accessibility of their models' output.

The broader trend is encouraging. Models are training against automated tools like axe-core, and scores are improving. We are working on new benchmarks that go beyond axe-core testing.

Beyond the Rankings

As AI models get better at visual design, they face a tradeoff between optimizing for beauty or for accessibility. You can achieve both, but it requires deliberate effort.

If you're picking a model for your workflow, start with a category page that matches your use case. Each category shows how ~40 models handled the same prompt, alongside their AIMAC Debt. For example, visit the Sports category in grid mode to compare sports designs side by side.

If you already have a model in mind, use the model detail page to see what its pages look like across all 28 categories. Then drill down into the categories that match your needs. See GPT 5.4 Pro in grid mode for an example.

Value vs Premium

The #1 model on this benchmark costs $3.02. But the #2 model costs just $0.89. GPT 5.4 Mini achieves zero debt for under a dollar, while GPT 5.4 Pro costs $212.75 for a debt of 1.00. Twenty-eight of 43 models cost under $1, including Grok 4.1 Fast at #14 for 8 cents, OpenAI's gpt-oss-120b at #11 for 10 cents, and Gemma 4 31B at #40 for 8 cents.

Closed vs Open (Source/Weights)

A closed model holds the top spot. GPT 5.3 Codex (proprietary) remains #1, and three more closed OpenAI models fill spots #2 through #4. Trinity Large Preview (open-source, Apache 2.0) holds #5 for free.

The upper-middle tier remains packed with value picks across both closed and open-weight releases. Qwen3 Max and Coder Plus rank 7th and 8th. Kimi K2 Thinking sits at #12 for 82 cents. OpenAI's open-weight gpt-oss-120b ranks 11th for 10 cents.

The smaller models are mostly from China. Qwen, Z.ai, DeepSeek, and MoonshotAI account for 15 of 43 models. Alibaba's Qwen line has eight entries. Despite US chip restrictions, Chinese labs keep shipping.

DeepSeek made headlines in January 2025 when their R1 model triggered a $600 billion single-day loss in Nvidia's market cap by claiming similar performance to Western models at a fraction of the cost. But on accessibility, their lineup sits near the bottom: R1 0528 scores 10.17 (rank #37), V3.2 lands at 8.18, and their best result (V3.2 Speciale) still comes in at 7.21. R1 0528 regressed from the original R1 it replaced (which scored 8.08). The efficiency breakthrough has not translated to accessible output.

Mistral struggled on this benchmark. Medium 3.1 is their lone mid-pack result at 4.70 (rank #15), while the rest cluster near the bottom: Large 3 2512 at 8.33 (rank #32), Codestral 2508 at 8.40 (rank #33), and Devstral 2 2512 at 9.01 (rank #34). Their newest release, Mistral Small 4, finishes dead last at #43 with a debt of 16.96, a sharp regression from the Small 3.2 24B it replaced (which scored 8.48). Only Medium 3.1 cracked the Top 20, and it is not open source. Their open-source Small 4 finishes last.

What Trips Models Up

Low contrast text dominates both AI-generated and human-built websites. Both columns report the share of pages with at least one instance of each issue. AIMAC uses axe-core across AI-generated pages; WebAIM uses WAVE across the top 1,000,000 home pages.

AIMAC: Top 6 issues

Share of pages with >= 1 error

  • Low contrast text: 84.8%
  • Empty links: 31.0%
  • Missing form labels: 19.7%
  • Empty buttons: 6.9%
  • Links distinguished only by color: 3.9%
  • Target size too small: 3.7%

WebAIM Million 2026: Top 6 issues

Share of home pages with >= 1 error

  • Low contrast text: 84%
  • Missing alt text: 53%
  • Missing form labels: 51%
  • Empty links: 46%
  • Empty buttons: 31%
  • Missing document language: 14%

Emdash Benchmark

We tracked emdash usage as a small writing-style signal, because punctuation can affect how text is read aloud.

Screen readers handle emdashes differently. Some announce "em dash" at every occurrence, others treat it as a pause, and some ignore it depending on voice and settings. Ricky Onsman explored the issue and found this is largely a screen reader behavior difference, not a content authoring bug.

We sanity-checked this with screen reader users, and none of our friends reported emdash-heavy text as a practical problem in day-to-day reading. So this turned out to be more interesting than impactful, and it does not affect AIMAC rankings.

Emdash usage varies widely. Across 43 models, counts range from 0 (Codestral 2508, the only model that uses none) to 754 (Claude Sonnet 4.6).

Rank | Model | Provider | Emdash Count
1 Codestral 2508 Mistral 0
2 Devstral 2 2512 Mistral 4
3 Qwen3 Coder Flash Qwen 4
4 Nova 2 Lite Amazon 5
5 R1 0528 DeepSeek 7
6 Qwen3.5 Flash Qwen 8
7 Qwen3 Coder Plus Qwen 9
8 DeepSeek V3.2 DeepSeek 10
9 GLM 4.7 Flash Z.ai 10
10 Trinity Large Preview (free) Arcee AI 11
11 KAT Coder Pro V2 Kwaipilot 12
12 Gemini 3 Flash Preview Google 13
13 Qwen3.5 397B A17B Qwen 16
14 Trinity Large Thinking Arcee AI 16
15 DeepSeek V3.2 Speciale DeepSeek 17
16 Gemini 3.1 Pro Preview Google 17
17 Grok 4.1 Fast xAI 21
18 Mistral Large 3 2512 Mistral 22
19 Gemma 4 31B Google 24
20 Mistral Medium 3.1 Mistral 24
21 Qwen3 Max Qwen 26
22 Grok 4.20 xAI 29
23 Mistral Small 4 Mistral 30
24 Qwen3 Max Thinking Qwen 38
25 GPT 5.4 OpenAI 40
26 gpt oss 120b OpenAI 40
27 Kimi K2 Thinking MoonshotAI 56
28 Nemotron 3 Super (free) NVIDIA 60
29 o4 Mini OpenAI 67
30 MiniMax M2.7 MiniMax 68
31 Olmo 3.1 32B Instruct AllenAI 70
32 GPT 5.3 Codex OpenAI 71
33 Claude Haiku 4.5 Anthropic 72
34 GPT 5.4 Mini OpenAI 84
35 Kimi K2.5 MoonshotAI 85
36 o3 OpenAI 85
37 GPT 5.4 Pro OpenAI 103
38 Qwen3.6 Plus Qwen 347
39 Qwen3 Coder Next Qwen 351
40 Claude Opus 4.6 (Fast) Anthropic 420
41 Claude Opus 4.6 Anthropic 424
42 GLM 5.1 Z.ai 503
43 Claude Sonnet 4.6 Anthropic 754

Models Tested

This analysis covers 43 models from 15 companies:

  1. Claude Haiku 4.5
  2. Claude Opus 4.6
  3. Claude Opus 4.6 (Fast)
  4. Claude Sonnet 4.6
  5. Codestral 2508
  6. DeepSeek V3.2
  7. DeepSeek V3.2 Speciale
  8. Devstral 2 2512
  9. Gemini 3 Flash Preview
  10. Gemini 3.1 Pro Preview
  11. Gemma 4 31B
  12. GLM 4.7 Flash
  13. GLM 5.1
  14. GPT 5.3 Codex
  15. GPT 5.4
  16. GPT 5.4 Mini
  17. GPT 5.4 Pro
  18. gpt oss 120b
  19. Grok 4.1 Fast
  20. Grok 4.20
  21. KAT Coder Pro V2
  22. Kimi K2 Thinking
  23. Kimi K2.5
  24. MiniMax M2.7
  25. Mistral Large 3 2512
  26. Mistral Medium 3.1
  27. Mistral Small 4
  28. Nemotron 3 Super (free)
  29. Nova 2 Lite
  30. o3
  31. o4 Mini
  32. Olmo 3.1 32B Instruct
  33. Qwen3 Coder Flash
  34. Qwen3 Coder Next
  35. Qwen3 Coder Plus
  36. Qwen3 Max
  37. Qwen3 Max Thinking
  38. Qwen3.5 397B A17B
  39. Qwen3.5 Flash
  40. Qwen3.6 Plus
  41. R1 0528
  42. Trinity Large Preview (free)
  43. Trinity Large Thinking

Update History

April 9, 2026

Models Removed

  • GLM 5: was Rank #5, AIMAC Debt 2.66, Cost $2.03
  • GPT 5.1 Codex Mini: was Rank #6, AIMAC Debt 3.78, Cost $0.39
  • Qwen3.6 Plus Preview (free): was Rank #12, AIMAC Debt 4.44, Cost $0.00

Models Added

Notes

  • GLM 5.1 replaces GLM 5 as Z.ai's latest flagship. GLM 5.1 was retrained for coding/agentic tasks and achieved SOTA on SWE-Bench Pro, but dropped from #5 to #42 on accessibility.
  • GPT 5.1 Codex Mini retired as OpenAI consolidates around the 5.3/5.4 generation.
  • Qwen3.6 Plus is the GA release of the preview. Free access ends; the paid model costs $0.99.
  • Claude Opus 4.6 (Fast) is Anthropic's fourth model on the benchmark, filling the premium-fast slot.
  • Gemma 4 31B is Google's first open-weight model to pass the 10KB HTML threshold.

Impact

  • The version-upgrade regression pattern strikes again: GLM 5 (#5) to GLM 5.1 (#42) is the largest rank drop from a version upgrade we have tracked (37 ranks). It joins KAT V1 to V2 (-28) and Sonnet 4.5 to 4.6 (-18).
  • Z.ai drops out of the Top 3 Winners. Alibaba/Qwen enters, with Qwen3 Max at #7 and eight models across the leaderboard.
  • Trinity Large Preview (free) moves to #5, a free model in the top five.

April 1, 2026

Models Added

Notes

  • Arcee AI's second model on the benchmark, alongside Trinity Large Preview (free) at #7.

Impact

  • Arcee AI now has two models on the benchmark: Trinity Large Preview (free) at #7 and Trinity Large Thinking at #30.

March 31, 2026

Models Removed

  • Grok 4.20 Beta: was Rank #39, AIMAC Debt 12.53, Cost $2.09

Models Added

  • Grok 4.20: Rank #38, AIMAC Debt 11.16, Cost $2.07

Notes

  • Grok 4.20 GA supersedes the Beta release.
  • Updated WebAIM Million comparison data from 2025 to 2026 following the new report (published March 30, 2026). The 2026 report reversed six years of gradual improvement: errors per page jumped 10% and 95.9% of sites now fail basic checks.

Impact

  • xAI's flagship improves slightly (debt 12.53 to 11.16), gaining one rank from #39 to #38.
  • Qwen3.5 Flash shifts from #38 to #39. All other rankings are unchanged.

March 30, 2026

Models Removed

  • Qwen3.5 Plus 2026 02 15: was Rank #24, AIMAC Debt 5.73, Cost $0.54

Models Added

Notes

  • Qwen3.6 Plus Preview supersedes Qwen3.5 Plus.
  • The count of models under $1 remains 27 (both models are under $1).

Impact

  • Alibaba's Qwen Plus line improves from #24 to #12, a 12-spot jump.
  • Ranks 12-23 each shift down by one. Top 11 and ranks 25-41 are unchanged.

March 28, 2026

Models Removed

  • KAT Coder Pro V1: was Rank #12, AIMAC Debt 4.46, Cost $0.22

Models Added

Notes

  • KAT Coder Pro V2 supersedes V1. Score regressed sharply from 4.46 to 13.19, rank dropping from #12 to #40. Cost doubled from $0.22 to $0.44.
  • The count of models under $1 remains 27 (both V1 and V2 are under $1).

Impact

  • Kwaipilot drops from the upper tier to second-to-last. V1 was a standout budget pick; V2 is one of the weakest models on the benchmark.
  • Ranks 13-39 each shift up by one. Top 11 and last place are unchanged.

March 22, 2026

Models Removed

  • MiniMax M2.5: was Rank #9, AIMAC Debt 4.04, Cost $0.25

Models Added

Notes

  • MiniMax M2.7 supersedes M2.5. Score regressed from 4.04 to 5.34, rank dropping from #9 to #23. Cost increased from $0.25 to $1.28.
  • The count of models under $1 drops from 28 to 27 because M2.7 costs $1.28 (M2.5 was $0.25).

Impact

  • MiniMax falls out of the Top 10 after its M2.5 held #9. The upgrade traded accessibility for other capabilities.
  • Top 8 and bottom 18 ranks are unchanged. Only ranks 9-22 shift up by one.

March 17, 2026

Models Removed

  • GPT 5.2 Pro: was Rank #3, AIMAC Debt 2.77, Cost $95.45
  • GPT 5.2: was Rank #4, AIMAC Debt 3.05, Cost $6.63
  • Gemini 2.5 Flash Lite: was Rank #23, AIMAC Debt 5.42, Cost $0.09
  • GPT 5 Mini: was Rank #30, AIMAC Debt 7.62, Cost $0.60
  • Mistral Small 3.2 24B: was Rank #35, AIMAC Debt 8.48, Cost $0.03

Models Added

Notes

  • GPT 5.4 Mini achieves a median AIMAC Debt of 0.00 at $0.89, the second model to reach zero debt.
  • GPT 5.4 and GPT 5.4 Pro replace their 5.2 predecessors as the latest versions in each slot. GPT 5.4 Pro costs $212.75, more than double the $95.45 that GPT 5.2 Pro cost, making it the most expensive model on the benchmark by a wide margin.
  • Gemini 2.5 Flash Lite is removed. Google's replacement (Gemini 3.1 Flash Lite Preview) fails the 10KB minimum HTML size requirement, so the Flash Lite slot goes away.
  • Mistral Small 4 (119B) replaces Small 3.2 24B as Mistral's latest Small release, but regresses from #35 (debt 8.48) to dead last at #41 (debt 16.96).
  • Grok 4.20 Beta is xAI's new flagship but lands at #40, while xAI's own Grok 4.1 Fast sits at #18.

Impact

  • GPT 5.4 Mini ($0.89) is the value headline: zero accessibility debt for under a dollar.
  • Three OpenAI models now hold spots #2 through #4, with the existing GPT 5.3 Codex still at #1.

February 25, 2026

Models Removed

  • GPT 5.2 Codex: was Rank #3, AIMAC Debt 3.05, Cost $1.94
  • Gemini 3 Pro Preview: was Rank #39, AIMAC Debt 10.65, Cost $3.35

Models Added

Notes

  • GPT 5.3 Codex replaces GPT 5.2 Codex per version policy. It achieves a median AIMAC Debt of 0.00 with just 20 total violations across 28 categories, taking #1 from GLM 5.
  • Gemini 3.1 Pro Preview replaces Gemini 3 Pro Preview per version policy. A dramatic improvement from #39 (debt 10.65) to #11 (debt 4.40).

Impact

  • OpenAI takes back #1 with the first model to achieve a median AIMAC Debt of 0.00.
  • Google goes from dead last to #11, the largest single-update improvement we have tracked.

February 17, 2026

Models Removed

  • Aurora Alpha: was Rank #9, AIMAC Debt 4.05, Cost $0.00
  • Claude Sonnet 4.5: was Rank #19, AIMAC Debt 4.78, Cost $6.20

Models Added

Notes

  • Claude Sonnet 4.6 replaces Sonnet 4.5 per version policy; Sonnet regresses from #19 (Debt 4.78) to #37 (Debt 9.83) in this collection.
  • Aurora Alpha was removed due to a 10KB minimum HTML threshold failure (one or more categories fell below 10KB).
  • OLMo 3.1 32B Instruct adds an open-weight model from AllenAI (Allen Institute for AI), a US-based non-profit AI research lab.
  • Two Qwen 3.5 models expand Qwen coverage with additional product lines.

Impact

  • The production leaderboard expands from 37 to 39 models.
  • #1 remains GLM 5 (AIMAC Debt 2.66).
  • 27 of 39 models cost under $1.

February 12, 2026

Models Removed

  • MiniMax M2.1: was Rank #18, AIMAC Debt 4.74, Cost $0.28

Models Added

Notes

  • MiniMax M2.5 supersedes M2.1. Score improved from 4.74 to 4.04, rank jumping from #18 to #8.

Impact

  • MiniMax M2.5 returns to the Top 10 with this entry.

February 11, 2026

Models Removed

  • MiniMax M1: was Rank #5, AIMAC Debt 3.78, Cost $0.82
  • Kimi K2 0905: was Rank #9, AIMAC Debt 4.39, Cost $0.54
  • GLM 4.7: was Rank #13, AIMAC Debt 4.51, Cost $0.44
  • Qwen3 Coder 480B A35B: was Rank #22, AIMAC Debt 6.01, Cost $0.33
  • Claude Opus 4.5: was Rank #25, AIMAC Debt 7.06, Cost $19.92
  • Qwen3 235B A22B Instruct 2507: was Rank #27, AIMAC Debt 7.54, Cost $0.26
  • R1: was Rank #29, AIMAC Debt 8.08, Cost $0.70
  • GLM 4.5 Air: was Rank #36, AIMAC Debt 9.33, Cost $0.33

Models Added

Notes

  • GLM 5 by Z.ai takes 1st place, dethroning GPT 5.2 Pro. It is MIT-licensed open-source at $2.03 total, compared to GPT 5.2 Pro's $95.45.
  • Trinity Large Preview by Arcee AI enters at #6 for free. It is a US-based open-source model.
  • GLM 4.7 and GLM 4.5 Air are both superseded by GLM 5.
  • Claude Opus 4.6 replaces Opus 4.5. Marginally better debt (6.90 vs 7.06) but ranks #26 due to new entrants.
  • R1 0528 replaces R1. Cheaper with larger context, but scored worse on accessibility (10.17 vs 8.08).
  • MiniMax M1 superseded by M2.1.
  • Kimi K2 0905 superseded by K2.5.
  • Qwen3 235B A22B Instruct 2507 removed. Qwen3 Coder 480B A35B trimmed in favor of Coder Next and Coder Plus variants.

Impact

  • Open-source models now lead the leaderboard: GLM 5 at #1 and Trinity Large Preview at #6.
  • OpenAI retains four of the top five spots (#2 through #5) but loses the crown.

January 19, 2026

Models Removed

  • GPT 5.1 Codex: was Rank #3, AIMAC Debt 3.53, Cost $1.72
  • KAT Coder Pro V1 (free): was Rank #14, AIMAC Debt 4.55, Cost $0.00

Models Added

Notes

  • GPT 5.2 Codex succeeds GPT 5.1 Codex in our benchmark. We test the latest version of each model line.
  • KAT Coder Pro V1 replaces the free tier, which OpenRouter deprecated on January 12.

Impact

  • KAT Coder Pro V1 entering at #10 pushed Claude Haiku 4.5 to #11.

Acknowledgments

We are grateful to ServiceNow for their partnership, without which AIMAC would not be possible. This project was inspired by the WebAIM Million report. Note: although the Foundation regularly partners with WebAIM on projects such as the Global Digital Accessibility Salary Survey (GDASS), we have not yet consulted with them on this project. Any errors in this report are our own. We must acknowledge Deque for open sourcing axe-core, the engine powering our testing, and Microsoft for Playwright, which powers our browser automation. Microsoft is also a partner in AIMAC's next benchmark, which will be accompanied by a research paper written by Yumeng Ma. We appreciate the collaboration, advice and feedback from Ed Summers, Jennifer Mankoff, Aaron Gustafson, Charlie Triplett, AnnMarie Dittell, Venkatesh Potluri, Jacob Wobbrock and Mary Bellard.

Methodology

What is AIMAC?

When you ask an LLM to build a web page, how accessible is the HTML it generates? AIMAC answers that question.

We prompt 43 models to generate pages across 28 categories with no accessibility guidance. Then we run axe-core, the accessibility testing engine used by Microsoft, Google, and most major tech companies, to count violations against WCAG 2.2 Level AA (the standard required by US and EU accessibility laws). Lower scores mean fewer problems.

The result is an apples-to-apples comparison of how well models handle accessibility without being told to.
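
AIMAC's own harness isn't published on this page, but the audit it describes can be approximated with the two tools the acknowledgments credit, Playwright and axe-core. A minimal sketch, with the file path and WCAG tag list as assumptions:

```ts
import { chromium } from "playwright";
import { AxeBuilder } from "@axe-core/playwright";

async function auditGeneratedPage(filePath: string) {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(`file://${filePath}`);

  // Limit the scan to WCAG 2.x A/AA rules, matching the standard named above.
  const results = await new AxeBuilder({ page })
    .withTags(["wcag2a", "wcag2aa", "wcag21aa", "wcag22aa"])
    .analyze();

  for (const v of results.violations) {
    // impact is "critical" | "serious" | "moderate" | "minor"
    console.log(v.impact, v.id, `${v.nodes.length} node(s)`);
  }

  await browser.close();
  return results.violations;
}
```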

Understanding "AIMAC Debt"

The AIMAC Debt measures automated accessibility violations. Lower is better.

What the numbers mean

  • 0.00 is best possible (no critical/serious violations detected by axe-core)
  • Lower single-digit scores generally mean the model generates relatively clean HTML
  • Higher scores mean more severe and/or more numerous violations; double-digit scores indicate substantial accessibility work is needed

In practice, 80-90% of violations are color-contrast issues (text that's hard to read against its background). The remaining 10-20% are structural: empty buttons, missing form labels, misused ARIA attributes (the HTML attributes that help screen readers understand page structure). Structural issues are often more severe because they completely block assistive technology users.

How AIMAC Debt is calculated

  • Critical violations (missing form labels, empty buttons): 5 points each
  • Serious violations (insufficient color contrast): 2 points each
  • Duplicate violations are dampened to avoid over-counting (details below)

We report the median debt across all 28 categories to avoid outliers.

Technical details: Dampening formula

For those interested in the math: we use logarithmic dampening for duplicate violations of the same issue type on a page.

Formula: score = base_weight × (1 + k × log₂(N))

Where:

  • N = number of duplicate violations
  • k = dampening factor (0.75 critical, 0.5 serious, 0.33 color-contrast)
  • base_weight = 5 critical, 2 serious

This ensures 100 duplicates from one bad CSS rule don't count as heavily as 100 separate issues, while still encouraging comprehensive fixes.
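
In TypeScript, the published formula and weights look like this; the formula and constants are transcribed directly from this page, while the function shape around them is an illustrative assumption:

```ts
type IssueType = "critical" | "serious" | "color-contrast";

// base_weight: 5 for critical, 2 for serious (color contrast is a serious
// violation, but gets its own, stronger dampening factor below)
const BASE_WEIGHT: Record<IssueType, number> = {
  critical: 5,
  serious: 2,
  "color-contrast": 2,
};

// k: dampening factor (0.75 critical, 0.5 serious, 0.33 color-contrast)
const DAMPENING_K: Record<IssueType, number> = {
  critical: 0.75,
  serious: 0.5,
  "color-contrast": 0.33,
};

// score = base_weight × (1 + k × log2(N)) for N duplicates of one issue type
function issueScore(type: IssueType, duplicates: number): number {
  if (duplicates <= 1) return BASE_WEIGHT[type];
  return BASE_WEIGHT[type] * (1 + DAMPENING_K[type] * Math.log2(duplicates));
}

// 100 duplicate contrast failures from one bad CSS rule score
// 2 × (1 + 0.33 × log2(100)) ≈ 6.4 points rather than 200.
console.log(issueScore("color-contrast", 100).toFixed(1)); // "6.4"
```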

How Reliable Are These Rankings?

We tested ranking stability extensively:

  • Top 2 and bottom 4 positions: Rock-solid across configurations
  • Mid-tier rankings (6-16): More variable because some models excel at certain page types while struggling with others

Choosing between mid-ranked models? Check category-specific performance on the detail pages.

Excluded Models

Pages under 10KB don't contain enough HTML to fairly test for accessibility mistakes. A 5KB page with zero violations isn't "better" than a 50KB page with three. They're solving different problems. We exclude models that produce minimal output to keep comparisons fair.

Why This Scoring System?

Our scoring reflects real-world impact:

  • Critical violations (5 points vs 2 for serious): Empty buttons or missing labels completely block screen reader users
  • Duplicate dampening: One bad CSS color causing 50 violations doesn't overwhelm the score
  • Extra color-contrast dampening: Color issues are 70-100% of violations, so we reduce their weight to surface other differences

We validated this with sensitivity testing. Rankings stay consistent across configurations, and there's no penalty for generating complex HTML.

Scope of Automated Testing

Axe-core catches many issues: color contrast, missing labels, empty buttons, ARIA errors.

What automated testing can't catch: keyboard navigation problems, screen reader flow issues, real-world usability. Lower AIMAC Debt means a cleaner starting point, not a guarantee of perfect accessibility. Manual testing is still needed.

Preventing Benchmark Gaming

AIMAC's prompts are public. You can inspect exactly what we ask models to generate.

To prevent AI companies from training on specific prompt patterns, we use prompt randomization. Each category prompt contains example lists (navigation links, section themes) randomly selected from larger pools at runtime.

With current fragment pools, this yields hundreds of billions of prompt variants per category (tens of trillions across all 28 categories). Memorization is computationally infeasible.

The core instructions stay constant; only stylistic examples vary. AIMAC Debt reflects genuine capability, not prompt memorization.
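
As an illustration, the randomization looks roughly like the sketch below. The fragment pools here are made up; the real prompts and pools live in the AIMAC repository.

```ts
// Hypothetical fragment pools; the real ones live in the AIMAC repository.
const NAV_LINKS = ["Home", "Pricing", "Docs", "Blog", "Careers", "Contact", "About", "Support"];
const SECTION_THEMES = ["testimonials", "feature grid", "FAQ", "newsletter signup", "team bios", "pricing tiers"];

// Quick-and-dirty random sample (a biased shuffle, which is fine for a sketch).
function sample<T>(pool: T[], n: number): T[] {
  return [...pool].sort(() => Math.random() - 0.5).slice(0, n);
}

function buildPrompt(category: string): string {
  // The core instructions stay constant; only the example lists vary per run.
  return (
    `Generate a complete HTML page for a ${category} site. ` +
    `Include navigation links such as ${sample(NAV_LINKS, 4).join(", ")} ` +
    `and sections such as ${sample(SECTION_THEMES, 3).join(", ")}.`
  );
}

// Even these toy pools give 8*7*6*5 * 6*5*4 = 201,600 ordered variants per
// category; the real pools are far larger, which is how the count reaches
// the hundreds of billions.
console.log(buildPrompt("restaurant"));
```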

Understanding the Total Cost

The Total Cost column shows what we paid to generate all 28 HTML pages for each model. This is the actual API cost in USD for the complete test suite.

For Benchmark Authors

While preparing to publish, we hit some visualization challenges that led us to consider Datasette, Simon Willison's tool for exploring and publishing data. We haven't implemented it yet, but the problems we encountered seem common to benchmarks, so we're sharing what we learned.

Data publication

Why publish your data?

Benchmark authors face a perplexing set of tradeoffs. Which models to include? How many prompts to test? How many variants? Worst of all, we can't know which version of a model we're testing because models get updated without public versioning. Companies change system prompts, infrastructure, or other API elements without warning. No benchmark will ever be truly repeatable. Publishing your data with an interactive website is a service to everyone, allowing the hive mind to improve the benchmark by highlighting gaps.

Datasette with its 154 plugins could let us embed an interactive database viewer directly on this page. It doesn't yet have Pareto or slope chart plugins, but general visualization plugins exist (datasette-vega, datasette-plot). Any interactive charting would need careful accessibility testing, which is why we deferred it.

Pareto Optimality

Sure, it's a great model, but is it Pareto optimal?

AI models are expensive. Benchmarks need a way to measure quality per dollar. Pareto Front charts are popular because they plot cost on one axis and quality on the other, quickly showing which models offer the best value.

When you divide such a chart into quadrants, the Pareto-optimal region combines maximum capability with minimum cost. That region represents the best tradeoff.

Most viewers instinctively interpret the top-right quadrant as "best." But in this benchmark, both axes reward lower values: lower debt means better accessibility, and lower cost is preferable. Plotted directly, that makes the bottom-left quadrant the Pareto-optimal one.

Pareto optimality: the limit of efficiency. A Pareto-optimal model cannot lower its AIMAC Debt score without costing more, and cannot cost less without its AIMAC Debt score rising.

Naming matters here. We originally called our metric "AIMAC Score," but users expect higher scores to be better. Thanks to an insight by Charlie Triplett, we renamed it to "AIMAC Debt." People intuitively understand that you want debt to be lower.

Even with better naming, the Pareto chart still forces a choice between two kinds of unintuitiveness. You can leave the axes semantically honest (lower = better) and accept that the optimal quadrant is bottom-left. Or you can invert both axes so "up and to the right" means better, at the cost of making the axis directions themselves counterintuitive. Neither is great. An interactive chart that lets users toggle between views would help.
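
For readers implementing this themselves: with both axes lower-is-better, the Pareto front is just the set of non-dominated points. A small sketch:

```ts
// A model is Pareto-optimal when no other model is at least as good on both
// axes (lower debt AND lower cost here) and strictly better on one.
interface ModelPoint { name: string; debt: number; cost: number; }

function paretoFront(points: ModelPoint[]): ModelPoint[] {
  return points.filter((p) =>
    !points.some(
      (q) =>
        q !== p &&
        q.debt <= p.debt &&
        q.cost <= p.cost &&
        (q.debt < p.debt || q.cost < p.cost)
    )
  );
}

const models: ModelPoint[] = [
  { name: "A", debt: 2.1, cost: 4.0 },
  { name: "B", debt: 3.5, cost: 1.2 },
  { name: "C", debt: 2.0, cost: 9.0 },
  { name: "D", debt: 4.0, cost: 5.0 }, // dominated by A
];

console.log(paretoFront(models).map((m) => m.name)); // ["A", "B", "C"]
```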

Slope Charts

Slope charts offer another way to visualize the cost vs. quality tradeoff. They plot each model's quality rank on the left and cost rank on the right, then draw a line between them. A flat or rising line means the model delivers good value: its cost rank matches or beats its quality rank. A steep drop means you're paying a premium for that quality. Pareto-optimal models are highlighted.

We removed the slope chart from this iteration because its effectiveness drops with so many models, especially when costs cluster at the low end, as ours do. Here's a mockup showing what it looked like when the distribution told the story well (note that this chart pre-dated our renaming of AIMAC Score to AIMAC Debt):

Sample slope chart mockup: each model's rank by AIMAC Debt on the left and rank by total cost on the right, connected by lines. Flatter or rising lines indicate better value; when the distribution is clear, this makes the cost-versus-debt tradeoff easy to scan.

For Model Providers

Most benchmarks focus simply on ranking models. Our goal is for every model to achieve a perfect score.

When AI generates accessible code by default, people with disabilities are no longer locked out of digital technology. A future where every model aces this benchmark means an internet that works for everyone. To be frank, for all the effort we put into this first iteration of the benchmark, achieving a perfect score on an automated metric that is easy to lint and generate synthetic data for is table stakes. However, we decided to start here because the models aren't even clearing this minimum bar.

How to Ace the Benchmark

Test the way we test. We run axe-core against rendered HTML in a real browser. Static linters won't catch everything. Color contrast, for example, requires computing actual styles from cascading CSS. Playwright's accessibility testing guide covers the setup.
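
A minimal Playwright test in that spirit might look like the following. This mirrors the setup in Playwright's guide rather than our exact harness, and the file path is a placeholder:

```ts
import { test, expect } from "@playwright/test";
import AxeBuilder from "@axe-core/playwright";

test("generated page has no critical or serious axe violations", async ({ page }) => {
  // Render the model-generated HTML in a real browser so cascaded CSS
  // (e.g. the computed colors behind contrast checks) is actually applied.
  await page.goto("file:///path/to/generated-page.html"); // placeholder path

  const results = await new AxeBuilder({ page })
    .withTags(["wcag2a", "wcag2aa", "wcag22aa"]) // WCAG 2.2 Level AA rule sets
    .analyze();

  const severe = results.violations.filter(
    (v) => v.impact === "critical" || v.impact === "serious"
  );
  expect(severe).toEqual([]);
});
```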

The AIMAC repository has our test prompts and scoring code.

Questions? [email protected]

Press

AIMAC has been featured in publications, podcasts, conference talks, and industry roundups. Selected coverage highlights:

Launch Coverage

Design · frontend · animation

Control Lottie Animations with Real Data (Website)

LottieFiles introduces Motion Tokens, letting developers change animation colors, text, and transforms at runtime without re-exporting files.

Summary

What: Motion Tokens is a new feature in LottieFiles' Lottie Creator that allows developers to mark animation properties as runtime-editable tokens, then bind them to real data and modify them programmatically without regenerating the animation file.
Why it matters: This removes a major friction point in the design-to-development workflow where every small change to animation content—like updating text, swapping colors for theming, or localizing content—previously required designers to re-export files and developers to replace them.
Takeaway: Try defining tokens in Lottie Creator by marking properties as editable, then export as dotLottie format to control them via the runtime API on web, iOS, or Android.
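
As a rough sketch of what that runtime control could look like with the standard dotLottie web player: the theme call below follows dotlottie-web's documented API, while the per-token call is a placeholder, since the announcement doesn't spell out the token method names.

```ts
import { DotLottie } from "@lottiefiles/dotlottie-web";

// Render a dotLottie file with the standard web runtime.
const player = new DotLottie({
  canvas: document.querySelector<HTMLCanvasElement>("canvas")!,
  src: "/animations/weather-widget.lottie", // placeholder file
  autoplay: true,
  loop: true,
});

// Theming is built on Motion Tokens: switch a predefined variant at runtime.
player.setTheme("dark");

// Binding an individual token to live data would look something like the
// line below; the method name is a placeholder.
// player.setTokenValue("temperatureText", `${currentTempC}°C`);
```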

Deep Dive

  • Motion Tokens expose animation properties (colors, text, transforms) as runtime-editable variables that can be bound to product data without modifying or re-exporting the animation file
  • The workflow involves three steps: define tokens in Lottie Creator by marking properties as editable, export as dotLottie format, then control values programmatically at runtime via API
  • This addresses a core pain point where any animation update—changing a headline, swapping a color, adding a language—required designers to re-export files and developers to swap them out
  • The feature enables animations to respond to product state, including live data updates, personalization, theming (light/dark mode), and dynamic content based on user context
  • Theming is built on Motion Tokens, allowing teams to define multiple variants (light, dark, branded) in a single file and switch between them at runtime via API
  • Motion Tokens integrate with design systems using the same logic as color and typography tokens, maintaining consistency across static and animated elements
  • The system includes AI-powered theme generation where developers can describe a theme and get token values automatically, then fine-tune manually
  • Runtime guarantees include deterministic behavior, versioning embedded in the dotLottie file, stable API across platforms, and property changes without re-rendering
  • Cross-platform support works consistently on Web, iOS, and Android with the same runtime API
  • Example use case shown is a weather widget where selecting different cities updates both data and visual elements dynamically within the animation

Decoder

  • Lottie: An animation format based on JSON that renders vector animations exported from After Effects, widely used in apps and websites for lightweight, scalable motion graphics
  • dotLottie: A file format from LottieFiles that packages Lottie animations with additional metadata, assets, and features like theming and interactivity
  • Motion Tokens: Runtime-editable variables within Lottie animations that expose specific properties (color, text, transform) to programmatic control without re-exporting the file
  • Runtime API: The programming interface that allows developers to control token values while the application is running, changing animation properties on the fly

Original Article

Bind your Lottie animation properties to real data. Change colors, text, and transforms at runtime, without re-exporting.

Design · ai · frontend · agents

Learning Agentic Design Systems

An experiment with Google's Antigravity IDE and Figma Console MCP enabled a two-way workflow — generating Figma components from code and React code from Figma designs — keeping design tokens in sync throughout.

Summary

What: A hands-on exploration of building an "agentic design system" workspace that uses AI to bidirectionally generate both React components from Figma designs and Figma components from code. The setup leverages Figma Console MCP (a Model Context Protocol server) within Antigravity IDE to maintain synchronization between design tokens, code, and Figma variables, while also generating metadata files that provide AI agents with structured rules for consistent component usage.
Why it matters: This points to a fundamental shift in design systems work: teams must move from writing human-readable documentation to encoding governance through machine-readable metadata and agentic workflows. Designers will need to learn to instruct AI agents rather than create static mockups, while design system maintainers must structure rules and context so AI can generate consistent, on-brand outputs. The traditional workflow of designers creating Figma files and handing them to engineers is becoming obsolete in favor of spec-driven, AI-mediated collaboration.
Takeaway: Experiment with Figma Console MCP in Antigravity or similar agentic IDEs to set up bidirectional design-code workflows, and explore generating component metadata that encodes design system rules for AI consumption rather than just human documentation.

Deep Dive

  • The author set up a two-way workflow where AI agents can generate Figma components from React code and React code from Figma designs, using Google's Antigravity IDE with the Figma Console MCP server
  • Code-to-design workflow: imported shadcn components, generated design tokens as Figma variables, then used AI to create matching Figma components that got "60-90% there" depending on complexity
  • Design-to-code workflow: brainstormed a card component design using Pencil (integrated in Antigravity), converted it to Figma with proper variables via MCP, then generated React code and metadata
  • Designers reviewing AI-generated Figma components need proficiency in auto layouts, variables, props, slots, and understanding of how components are built in code
  • Generated two types of metadata: human-readable documentation from Figma components, and machine-readable metadata (using Cristian Morales' skills) that guides AI on when, where, and how to use components correctly
  • The codebase-index skill prevents AI from inventing raw HTML or inline styles by giving it exact locations of design tokens and components
  • Design parity reports automatically check alignment between Figma and code, generating HTML dashboards that track drift and enable AI-assisted corrections
  • The shift from "vibe coding" to "specs design" is necessary for automated UI generation — AI needs precise, structured, machine-readable context to produce quality outputs
  • Design system teams must learn to encode governance through structured data rather than writing website documentation for humans
  • Skills in the workspace are created once and reused, allowing designers to create pull requests and commit changes after modifying Figma, bringing them closer to actual code
  • This represents a role transformation: designers move from "drawing pretty pictures" to instructing agents on building design and code that follows system rules
  • Context generation, pattern understanding, and rule encoding become critical skills for designers in AI-integrated organizations
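
To illustrate the kind of machine-readable metadata described above, here is a hypothetical sketch; all names and fields are ours, not Cristian Morales' actual skill format:

```ts
// Hypothetical shape for encoding design-system governance as data an
// agent can consume instead of prose documentation.
interface ComponentMeta {
  name: string;
  source: string;                  // exact file path, so agents don't invent raw HTML
  tokens: Record<string, string>;  // design tokens the component is allowed to use
  usage: { when: string; avoid: string[] };
}

const cardMeta: ComponentMeta = {
  name: "Card",
  source: "src/components/ui/card.tsx",
  tokens: { background: "color.surface", padding: "space.4", radius: "radius.md" },
  usage: {
    when: "Grouping related content with a title and optional actions",
    avoid: ["inline styles", "hard-coded hex colors", "nesting cards inside cards"],
  },
};

export default cardMeta;
```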

Decoder

  • MCP (Model Context Protocol): A protocol that allows AI agents to connect to external systems and tools with read/write access, like Figma Console MCP which lets AI manipulate Figma files programmatically
  • Antigravity IDE: Google's agentic IDE (a fork of VS Code) that orchestrates multiple AI agents simultaneously across different workspaces
  • Design tokens: Atomic design values (colors, spacing, typography) stored as variables that can be referenced across design and code
  • shadcn: A popular collection of reusable React components built with Radix UI and Tailwind CSS
  • Skills: Reusable AI capabilities or workflows that can be embedded in an agentic workspace and invoked during future tasks
  • Design parity: The degree of alignment between design specifications (in Figma) and actual implemented code
  • Vibe coding: Informal, prompt-driven UI generation without strict design system constraints
  • Specs design: Structured, specification-driven design that provides precise context for AI-powered generation

Original Article

An experiment with Google's Antigravity IDE and Figma Console MCP enabled a two-way workflow — generating Figma components from code and React code from Figma designs — keeping design tokens in sync throughout. Metadata files were also generated to give AI agents structured rules and context for consistent, on-brand component usage. The experience points to a broader shift: design system teams must move from writing human-readable guidelines to encoding governance through machine-readable metadata and agentic workflows.

Design · ai · agents

Your UX Skills Were Built for One Kind of Intelligence

UX designers must shift from designing interfaces users control to designing autonomous AI agents that act on users' behalf, fundamentally changing the craft from layout to behavior.

Summary

What: An essay arguing that AI agents are transforming UX design from "what users click" to "how users delegate." Instead of designing screens and flows, designers now shape AI behavior, trust patterns, and oversight mechanisms as agents interpret goals, make decisions, and act autonomously on behalf of users.
Why it matters: This represents a fundamental shift in the UX profession where traditional skills around affordance and hierarchy become less central than behavioral design, alignment, and trust-building. The article argues this is already happening—teams are doing this work now but lack established frameworks and methods to do it well.
Takeaway: Start learning behavioral design and agent architecture by asking new questions: How does it decide what to pay attention to? When does it act versus ask? How does it recover from mistakes? How does it earn trust over time?

Deep Dive

  • For 20 years, UX designers focused on creating interfaces that users directly control through clicks, flows, and layouts where the UI was the ground truth of user experience
  • AI agents shift design from "what users click" to "how users delegate" since agents have autonomy and make decisions independently rather than just responding to commands
  • The interface is no longer the primary surface—behavior becomes the interface as agents interpret goals, make trade-offs, take initiative, and occasionally surprise users
  • Traditional UX assumed deterministic states and direct manipulation; agents operate in uncertainty, autonomy, negotiation, and improvisation
  • Design concerns shift from affordance and hierarchy to trust, transparency, oversight, safety, and shared control as the foundational "user operates tool" model collapses
  • Delegation becomes a first-class interaction pattern where users hand over goals ("Solve this, not that" or "Optimize for X") and trust the right thing will happen
  • Alignment is the gap between literal instructions and actual intent—what keeps a capable intelligence from being a dangerous one when it takes initiative
  • Agent-oriented design follows a different loop than traditional screen-first design: observe → interpret → decide → act → reflect → adapt, running inside both the system and the design process
  • Designers must answer: How does it decide what to pay attention to? When does it act versus ask? How does it recover when wrong? How does it earn trust over time? How can users reshape behavior without micromanaging?
  • Behavioral design, intent modeling, interaction choreography, and safety become core UX responsibilities that land on designers' desks, not just policy teams
  • The article uses a calendar scheduling example to illustrate how each step of the agent loop involves design decisions about attention, interpretation, decision-making, action, feedback incorporation, and adaptation
  • Teams are already doing this work now—it's not speculative—but the field needs frameworks, language, methods, and critique to craft great work in this space
  • Screens will keep getting easier to generate while behavior will keep getting harder to design well, making this the most exciting period of design in decades

Decoder

  • Agency: The ability of AI systems to make autonomous decisions and take initiative rather than just responding to direct commands from users
  • Alignment: Making sure an AI system does what you actually want it to do, not just what you literally told it to—the gap between instruction and intent
  • Delegation model: The design pattern for how users hand over goals and tasks to AI agents rather than commanding them step-by-step through an interface
  • Interpretability: Understanding and making visible how AI systems reason and make decisions, related to researching how they think
  • Interaction choreography: Designing when agents take initiative versus hold back, how they resolve conflicts, express uncertainty, and recover from mistakes

Original Article

Your UX Skills Were Built for One Kind of Intelligence.

We spent 20 years designing things users control. The next 20 will be about designing things that act on our behalf.


We spent two decades mastering interfaces. We learned how to map workflows, how to communicate intent, how to shape interactions through layout and information hierarchy. The UI really was the ground truth of user experience and I have loved this space. Now, AI changes all of that. We're moving into a reality where software has agency.

An AI agent isn't a feature; it's an intelligence negotiating on our behalf, interpreting goals, making trade-offs, taking initiative, and occasionally surprising us. That is a profound shift, because once products possess agency, the interface stops being the primary surface. The behaviour becomes the interface (that's not to say all UI disappears either).

This means our job as designers changes. We used to design what users click, and now we design how they delegate. We used to define flows, and now we define boundaries. We used to specify screens, and now we specify how a system interprets intent, resolves ambiguity, asks for clarification, and recovers from mistakes.


Why This Is Happening Now

We're at an inflection point where LLMs finally have enough reasoning ability to make agentic behaviour genuinely useful, and not just a party trick. The orchestration tooling around them is maturing. Interfaces are increasingly being auto-generated at runtime. Users are getting more comfortable collaborating with systems rather than commanding them, and the tasks we're asking of software are getting more open-ended, more ambiguous, more like the kind of thing you'd hand to a teammate and trust them to come back with something half-decent.

The frontier of design has slipped underneath the surface, into the intelligence layer, where the thinking happens.


The Interface Stops Being the Center

For 20 years, UX Design revolved around what users see and now it revolves around what systems understand and attempt to do.

Designing for agents means rethinking a handful of questions we thought we had already nailed down. How does this system explain itself when I ask? When should I intervene, and when should it ask permission before it moves? How much initiative is appropriate here, and how should it tell me it's uncertain without burying me in hedging or sycophancy?

The goal has shifted. Intelligibility matters more than ease now, along with alignment, and the trust both of those things earn over time.

Traditional UX assumed the world was deterministic. It assumed that states were predictable, manipulation was direct, and control was visible. You could map a flow because the flow didn't keep changing on you.

Agents exist on another plane entirely. They live in uncertainty, autonomy, decision-making, improvisation, and negotiation. We also don't fully control them; they have a degree of, well, "agency". This changes the dynamic Culkin described when he said "We shape our tools, and thereafter our tools shape us": now they shape themselves too.

The foundational mental model of "user operates tool" collapses. We now have something closer to "user collaborates with intelligence." Our concerns as designers shift from affordance and hierarchy to trust, transparency, oversight, safety, and shared control.


Delegation Becomes a First-Class Interaction Pattern

Most UI lets users command and be in charge. Agents need patterns that let users delegate. "Solve this, not that." "Optimise for X." "Avoid Y." "Notify me only if…" The interface is no longer a set of buttons. It's a way of handing over a goal and trusting the right thing will happen.

Oversight becomes a primary UX surface, and not just a secondary settings panel tucked three clicks deep. We'll increasingly design interaction rules instead of interface layouts.


Behaviour Is the Material

Behaviours are grown and evolved over time.

We shape agentic behaviour the way a gardener shapes a climbing plant. Through context, with goals, with the prompts that tell it what to do and how to do those things well, with constraints that tell it where the walls are. With feedback loops, and a sense of personality, and a working memory of what it has done before. None of these are interface decisions but all of them are design decisions.

The design surface moves from pixels in space to behaviour in time.

"Alignment" is the word AI researchers use for the problem of making sure a system does what you actually want it to do, not just what you literally told it to do. It's the gap between the instruction given and the intent you meant when you wrote it. It's what happens when a system takes initiative and you have to decide whether it took the right one. It's what keeps a capable intelligence from being a dangerous one.


The Real Work Is Interaction Choreography

We'll design when the agent takes initiative and when it holds back. How it resolves conflict between what you asked for and what it sees. How it recovers from its own mistakes, and how it tells you it made one. How it expresses uncertainty without either over-claiming or drowning you in qualifications. How it negotiates shared intent when your goals and its goals don't quite line up. How it handles the edge cases none of us saw coming.

Instead of mapping flows, we architect protocols. Instead of wireframing screens, we scaffold reasoning. Instead of specifying output, we shape process.


The Agent-Oriented Design Loop

Most teams don't know how to start, so they keep doing screen-first design and bolting AI on top. It doesn't work. Agent design needs a different loop. Here's what it looks like:

observe → interpret → decide → act → reflect → adapt

This loop runs inside the system and inside the design process. It's iterative, exploratory, and behaviour-first. You prototype the thinking, and not the interface. Here's a quick example:

  • Observe — The agent scans your calendar and notices you have back-to-back meetings for six hours with no break. What should it pay attention to? What should it ignore? These are design decisions.

  • Interpret — It infers you're overbooked, but is that a problem or a preference? Maybe you like dense days. Maybe this week is an anomaly. How the agent interprets context and not just data is where alignment lives.

  • Decide — It considers moving your 2pm, declining the optional standup, or just flagging the situation. Who taught it which option to prefer? You did, through the delegation model you designed.

  • Act — It reschedules your 2pm and blocks 30 minutes for lunch. But it doesn't touch the client call, because it learned that external meetings are higher stakes. That is a design choice.

  • Reflect — You override it. You wanted that 2pm where it was. The agent registers the correction. How does it store that feedback? Does it learn from one correction or need a pattern? That's a design question about learning rate and confidence.

  • Adapt — Next week, it handles a similar situation differently. It asks instead of acting. The relationship between the agent and the user has evolved and that evolution is the UX.
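
One way to make the loop concrete is to treat it as a typed contract a team can prototype against. A minimal sketch, with every name hypothetical:

```ts
// Sketch of the observe → interpret → decide → act → reflect → adapt loop
// as a typed contract; all names here are hypothetical.
interface AgentPolicy<Observation, Intent, Action> {
  observe(): Promise<Observation>;          // what does it pay attention to?
  interpret(obs: Observation): Intent;      // problem or preference?
  decide(intent: Intent): Action | "ask";   // act vs. ask for permission
  act(action: Action): Promise<void>;
  reflect(correction?: Action): void;       // store user overrides as feedback
  adapt(): void;                            // adjust initiative for next time
}
```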

The practical questions change shape. How does it decide? How does it ask for help? What does it ignore? When does it try something new, and when does it stop? Good agent UX depends less on clarity of UI and more on clarity of intent and alignment.

When you're designing for an agent, ask: How does it decide what to pay attention to? How does it know when to act versus when to ask? How does it recover when it gets it wrong? How does it earn trust over time? How does the user reshape its behaviour without micromanaging it?

We design the conditions that make that user-agent coordination feel effortless.


So What Does This Mean for Designers?

It means the craft expands. Behavioural design becomes something you really craft, a meaningful part of your everyday job. You start working on delegation models, on agent architecture, on how reasoning itself can be a UX surface. Safety shows up in your work in a way it never used to; this is not just the policy team's responsibility anymore. Intent modelling, interaction choreography, alignment work… These will all be landing on your desk pretty soon if they haven't already.

It means that we need to be researching interpretability alongside usability. Understanding systems that think, and not just designing flows that users follow.

We're entering the most exciting period of design in decades. Screens will keep getting easier to generate and behaviour will keep getting harder to design well. The designers I can't wait to talk to are the ones who learn to shape living, adaptive systems that feel trustworthy, expressive, and deeply human.

I want to stress that this isn't speculative either; teams are already doing it. We need frameworks for it, and language, and methods, and critique that allow us to really craft great work in this space.

Design · frontend · css

Color is finally OK

OKLCH, a perceptual color system created by a game developer in a 2020 blog post, has quietly replaced HSL in major frameworks and browsers by fixing decades-old problems with how digital color matches human vision.

Summary

What: OKLCH is a color format now default in Tailwind v4, modern browsers, and design tools that maintains consistent perceived brightness across different hues, keeps colors from drifting when lightened or darkened, and simplifies building accessible color systems without manual tweaking.
Why it matters: HSL has a fundamental flaw where colors at the same lightness value look drastically different in brightness (yellow glows, blue looks nearly black), forcing designers to manually compensate for years. OKLCH mathematically aligns with human perception, making tasks like even color scales, smooth gradients, and accessible contrast pairs straightforward arithmetic rather than iterative guesswork.
Takeaway: Rebuild one color scale in your design system using OKLCH with evenly spaced lightness values and constant chroma/hue to see how it naturally produces perceptually uniform results without tweaking.
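
As a quick sketch of that takeaway, using the lightness-difference rule of thumb from the deep dive below (the ramp values and variable names are ours):

```ts
// Five-step ramp: evenly spaced lightness (ΔL = 0.15), constant chroma and hue.
function oklchRamp(hue: number, chroma: number): string[] {
  return [0.95, 0.8, 0.65, 0.5, 0.35].map((l) => `oklch(${l} ${chroma} ${hue})`);
}

const blues = oklchRamp(250, 0.12);
blues.forEach((color, i) => {
  document.documentElement.style.setProperty(`--blue-${(i + 1) * 100}`, color);
});

// Accessible pair rule of thumb: lightness values 0.4+ apart.
const bg = blues[0]; // L = 0.95
const fg = blues[4]; // L = 0.35, so ΔL = 0.6
console.log({ bg, fg });
```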

Deep Dive

  • HSL's lightness is a mathematical midpoint of RGB channels, not perceived brightness—a yellow and blue both at 50% lightness look wildly different in actual brightness to the human eye
  • Designers have spent years manually compensating for HSL's broken math through contrast checking, palette adjustments, and dark mode fixes, treating this labor as normal rather than questioning the tool itself
  • Björn Ottosson, a Swedish game engineer at EA DICE (not a color scientist), published a blog post in December 2020 introducing Oklab because existing color spaces failed at simple tasks like darkening colors without hue drift
  • The W3C's Chris Lilley and Apple's Simon Fraser championed the format, getting it into CSS Color Module Level 4 by 2021 and implemented in all major browsers by 2023
  • OKLCH (Ok Lightness Chroma Hue) provides three perceptually meaningful dimensions: lightness that matches perceived brightness, hue that stays stable when adjusted, and chroma that honestly represents available color intensity
  • Building accessible contrast pairs becomes simple arithmetic—pick two lightness values differing by 0.4 or more; five-step ramps need even L-axis spacing; gradients interpolate cleanly without dead gray middles
  • The intellectual foundation traces directly to Albert Munsell's 1905 perceptual color system and CIELAB (1976), but Ottosson's implementation solved known issues like the "blue turn" with just ten lines of code
  • Tailwind v4 shipped OKLCH as the default for all palette colors in early 2025 with no announcement, migration guide, or apology—just new values as if they'd always existed
  • Most designers missed the shift despite three years of browser support because the problems OKLCH fixes (broken gradients, drifting hues, inconsistent contrast) had become normalized pain compensated for with elaborate tooling
  • Photoshop adopted Oklab for gradient interpolation, Unity and Godot integrated it into their engines, shadcn/ui uses it for theming, and AI coding assistants now output OKLCH by default from training data
  • One engineer's side project effectively replaced sixty years of institutional color science in mainstream design tools because the simple implementation solved practical problems academic models left unaddressed

Decoder

  • OKLCH: Ok Lightness Chroma Hue, a perceptual color space where numeric values match how colors actually look to human vision rather than mathematical channel calculations
  • HSL: Hue Saturation Lightness, the older color model where lightness is a mathematical average of RGB values, not perceived brightness
  • Chroma: The colorfulness or intensity of a hue; OKLCH reports the actual maximum available rather than pretending all hues reach full saturation equally
  • Perceptual uniformity: A color space where equal numeric differences produce equal perceived differences—a 10-point change looks the same whether you're adjusting red, blue, or gray
  • CIELAB: The 1976 color space from the International Commission on Illumination designed for perceptual uniformity, which OKLCH improves upon by fixing issues like blue-to-purple drift
  • Blue turn: A known CIELAB flaw where blue gradients drift toward purple due to hue nonlinearities, familiar to anyone who's built a dark mode and watched blues go purple

Original Article

OKLCH is emerging as a replacement for older color systems like HSL because it aligns with human perception—maintaining consistent brightness, stable hues, and more accurate color intensity. Created by Björn Ottosson, it simplifies tasks like building color scales, gradients, and accessible contrast, removing much of the manual tweaking designers have long relied on. Although already adopted in modern tools, browsers, and frameworks, the shift has gone largely unnoticed—even as it quietly fixes long-standing issues in digital color design.

Crypto · payments · infrastructure

Western Union Enters Stablecoin Race with Two New Products

Western Union is launching a Solana-based stablecoin to replace SWIFT for internal settlements and bridging crypto wallets with its global retail network through new consumer products.

Summary

What: Western Union is rolling out USDPT, a Solana stablecoin for cross-border settlement, plus a Digital Asset Network connecting crypto wallets to its retail locations and a Stable Card that lets consumers spend stablecoins globally.
Why it matters: A century-old remittance giant moving core settlement infrastructure to blockchain signals maturation of crypto rails beyond experimentation, potentially accelerating mainstream adoption of onchain payments.
Takeaway: Developers building payment or remittance infrastructure should monitor how Western Union's Solana-based settlement performs against traditional SWIFT timelines and costs.

Deep Dive

  • Western Union launching Solana-based stablecoin USDPT next month for internal cross-border settlement operations
  • Primary use case: replacing legacy SWIFT payment rails with blockchain infrastructure for faster settlement
  • Blockchain enables 24/7 transaction processing versus traditional banking hour constraints
  • Digital Asset Network will bridge crypto wallets with Western Union's existing global retail footprint
  • Stable Card targets consumer market, allowing stablecoin holders to spend at point-of-sale worldwide
  • Represents shift from pilot projects to production deployment of blockchain by major legacy financial institution
  • Solana selection indicates prioritization of high throughput and low transaction costs over alternatives like Ethereum L1

Decoder

  • USDPT: Western Union's Solana-based stablecoin pegged to the US dollar
  • SWIFT: legacy international payment messaging network used by banks, typically requires days for settlement
  • Stablecoin: cryptocurrency designed to maintain stable value by pegging to fiat currency like the dollar
  • Solana: blockchain network known for high transaction speeds and low fees

Original Article

Western Union plans to launch its Solana-based stablecoin USDPT next month, initially for internal settlement to replace SWIFT rails and enable faster, always-on cross-border transactions. The company is also rolling out a Digital Asset Network to connect crypto wallets with its global retail network, alongside a “Stable Card” that lets consumers hold and spend stablecoins worldwide, signaling a full-stack push into onchain payments.

Crypto · ai · security

How Anthropic's Mythos model is forcing the crypto industry to rethink everything about security

Anthropic's Mythos AI model is shifting crypto security from smart contract audits to infrastructure vulnerabilities by simulating multi-step exploit chains across interconnected DeFi protocols.

Summary

What: Mythos is a new AI model from Anthropic designed to identify and chain together weaknesses across crypto systems, particularly focusing on infrastructure like key management, bridges, and oracle networks rather than just smart contract code. Major players like Coinbase and Binance have reportedly approached Anthropic to test it.
Why it matters: DeFi protocols are highly interconnected through composability, meaning small vulnerabilities can cascade into systemic failures. AI can map and exploit these dependencies at machine speed, forcing the industry to move from periodic human audits to continuous AI-driven security monitoring.
Takeaway: If you're building in crypto or DeFi, consider integrating continuous AI-driven auditing and stress testing tools, and prioritize infrastructure security beyond just smart contract code reviews.

Deep Dive

  • Mythos represents a new class of AI systems that simulate adversaries rather than just scanning for known bugs, exploring how protocols interact and testing how small weaknesses combine into exploits
  • DeFi security has historically focused on smart contract code audits, but Mythos highlights infrastructure risks like key management, bridges, oracle networks, and cryptographic layers that are often outside traditional audit scope
  • The model can identify multi-step exploit chains that historically only get discovered after money is lost, as well as infrastructure-layer vulnerabilities that traditional audits never touch
  • DeFi's composability creates pathways for risk to spread—a minor vulnerability in one protocol can become a critical exploit vector with contagion potential across the ecosystem, as seen in the Hyperbridge attack that exploited cross-chain message verification
  • Major institutions are taking notice: Coinbase and Binance reportedly approached Anthropic to test Mythos, while banks like JP Morgan are treating AI-driven cyber risk as systemic
  • Some industry leaders like Aave's founder see this as evolution rather than revolution, noting that DeFi already operates at machine speed with automated execution and defenses
  • The defense strategy is shifting from periodic audits to continuous AI-driven monitoring and simulation, with the assumption that breaches will happen
  • Aave has integrated AI into workflows for simulations and code review alongside human auditors, taking an "AI-first approach where it adds clear value"
  • Industry leaders expect the gap between secure and insecure protocols to widen significantly, with projects that don't prioritize security being most at risk
  • Security is evolving from eliminating vulnerabilities to continuously adapting to systems where vulnerabilities are constantly rediscovered and recombined at machine speed

Decoder

  • DeFi (Decentralized Finance): Crypto financial services built on blockchain without traditional intermediaries
  • Oracle networks: Services that provide external real-world data to blockchain smart contracts
  • Composability: The ability of DeFi protocols to interconnect and build on each other's services, creating capital efficiency but also risk pathways
  • Bridge: Technology that allows crypto assets to move between different blockchains
  • Exploit chain: A sequence of vulnerabilities that attackers combine to execute a successful attack
  • Smart contract: Self-executing code on a blockchain that automatically enforces agreement terms
  • Key management: Systems and processes for protecting cryptographic keys that control access to crypto assets

Original Article

How Anthropic's Mythos model is forcing the crypto industry to rethink everything about security

DeFi leaders say that AI will arm both attackers and defenders, and widen the gap between projects that prioritize security and those that do not.

CEO and co-founder of Anthropic Dario Amodei (Getty Images)
CEO and co-founder of Anthropic Dario Amodei (Getty Images)

What to know:

  • Anthropic's Mythos AI model is shifting DeFi security focus from smart contract bugs to deeper infrastructure risks such as key management, bridges and oracle networks.
  • By simulating adversaries and chaining together small weaknesses across interconnected protocols, Mythos highlights how AI can turn isolated flaws into systemic, cascading failures.
  • DeFi leaders say AI will arm both attackers and defenders, pushing protocols toward continuous, AI-driven auditing and widening the gap between projects that prioritize security and those that do not.

Mythos, the new AI model from Anthropic that has sparked fear and confusion in traditional tech and finance, is also driving a massive shift in how the crypto industry thinks about security.

For years, decentralized finance has focused its defenses on smart contracts. Code is audited, vulnerabilities are cataloged, and many common exploits are well understood. But Mythos, a model designed to identify and chain together weaknesses across systems, is pushing attention beyond code and into the infrastructure that supports it.

"The bigger risks sit in infrastructure," said Paul Vijender, head of security at Gauntlet, a risk management firm. "When I think about AI-driven threats, I'm less concerned about smart contract exploits and more focused on AI-assisted attacks against the human and infrastructure layers."

That includes key management systems, signing services, bridges, oracle networks, and the cryptographic layers that connect them. These components are less visible than smart contracts and are often outside traditional audit scope.

In fact, this month, web infrastructure provider Vercel, which many crypto companies use, disclosed a security breach that may have exposed customer API keys, prompting crypto projects to rotate credentials and review their code. Vercel traced the intrusion to a compromised Google Workspace connection via the third-party AI tool Context.ai, which an employee used.

Mythos belongs to a new class of AI systems built to simulate adversaries. Instead of scanning for known bugs, it explores how protocols interact, testing how small weaknesses can be combined into real-world exploits. That approach has drawn attention beyond crypto. Banks like JP Morgan are increasingly treating AI-driven cyber risk as systemic and are exploring tools like Mythos for stress testing. Earlier this month, Coinbase and Binance both reportedly approached Anthropic to test Mythos.

Early findings from models like Mythos have identified weaknesses in the behind-the-scenes systems that keep crypto platforms secure, including the technology that protects keys and handles communication between systems.

"I think there are two areas where AI models are especially valuable," Vijender said. "First, multi-step exploit chains that historically only get discovered after money is lost. Second, infrastructure-layer vulnerabilities that traditional audits never touch."

That shift matters in a system built on composability, where DeFi protocols can connect and build on each other's services.

DeFi protocols are designed to interconnect. They share liquidity, rely on common oracles, and interact through layers of integrations that are difficult to map in full. That interconnectedness has driven growth, but it also creates pathways for risk to spread, as seen in recent bridge exploits like the Hyperbridge attack, in which an attacker minted $1 billion worth of bridged Polkadot tokens on Ethereum by exploiting a flaw in how cross-chain messages were verified.

"Composability is what makes DeFi capital efficient and innovative," Vijender said. "But it also means a minor vulnerability in one protocol can become a critical exploit vector with contagion potential across the ecosystem."

Without AI, those dependencies are hard to trace. With AI, they can be mapped and exploited at scale. The result is a shift from isolated exploits to systemic failures that cascade across protocols.

Evolution of AI attacks

Still, some industry leaders see Mythos as an acceleration rather than a turning point.

At Aave Labs, founder Stani Kulechov said AI reflects the dynamics already at play in DeFi's adversarial environment.

"Web3 is no stranger to well-funded and motivated adversaries," he told CoinDesk. "AI models represent an evolution in the tools used to achieve exploits."

From that perspective, DeFi is already built for machine-speed attacks. Smart contracts execute automatically, and defenses such as liquidation mechanisms and risk parameters operate without human intervention.

"DeFi operates at compute speed, so AI doesn't introduce a new dynamic," Kulechov said. "It intensifies an environment that has always required constant vigilance."

Even so, Aave is seeing AI surface new categories of vulnerabilities, including issues that human auditors may have previously deprioritized.

"The Mythos paper shows that AI can uncover old bugs that were previously deprioritized," he said.

That breadth still matters in a system where even smaller vulnerabilities can undermine trust or be combined into larger exploits.

If attackers can move faster, the question becomes whether defenses can keep pace.

For both Gauntlet and Aave, the answer lies in changing the security model itself. Audits before deployment and monitoring after were designed for human-paced threats. AI compresses that timeline.

"To defend against offensive AI, we will need to take an AI-centric approach where speed and continuous adaptation are essential," Vijender of Gauntlet said. That includes continuous auditing, real-time simulation, and systems built with the assumption that breaches will happen.

A 'greater way'

Aave has already integrated AI into its workflows, using it for simulations and code review alongside human auditors. "We take an AI-first approach where it adds clear value," Kulechov of Aave Labs said. "But it complements, rather than replaces, human-led auditing."

In that sense, AI equips both attackers and defenders.

For builders, the long-term effect may be less disruption than divergence.

"We haven't tested Mythos yet, but we're genuinely interested in what it and tools like it can do for protocol security," said Hayden Adams, founder and CEO of Uniswap Labs. "AI gives builders better ways to stress test and harden systems."

Over time, Adams expects the gap between secure and insecure protocols to widen.

"Projects that prioritize security will have greater ability to test and harden systems before launching," he said. "Projects that don't will be most at risk."

That may be the real shift. Security is no longer about eliminating vulnerabilities. It is about continuously adapting to a system in which those vulnerabilities are constantly rediscovered and recombined.

Crypto · ai · agents

Binance AI Wallet: Keyless 'Agentic Wallet' for Web3 Automation

Binance launched Agentic Wallet, a keyless crypto wallet that lets AI agents autonomously execute blockchain transactions within user-configured spending limits and security rules.

Summary

What: Agentic Wallet is an isolated account within Binance Wallet that enables AI agents to trade, transfer, and manage digital assets on BNB Smart Chain, Solana, Base, and Ethereum without accessing users' primary funds. It uses keyless technology and pre-approved transaction signing to eliminate the need for repeated confirmations.
Why it matters: This tackles a core challenge in crypto automation: allowing AI agents to operate on-chain safely without full access to user funds, addressing both the security risks and UX friction that have limited AI-driven Web3 applications.
Takeaway: Developers can integrate the wallet with AI frameworks like Claude Code, OpenClaw, or Cursor using tool-use protocols to build AI-native financial applications.

Deep Dive

  • The wallet operates as a completely isolated account separate from a user's main Binance Wallet, creating a security boundary between AI agent activity and primary holdings
  • Users configure spending limits, token restrictions, and transaction rules upfront, with all transfers restricted to whitelisted addresses saved in the address book
  • A real-time monitoring dashboard provides visibility into all agent actions, addressing the transparency concerns inherent in autonomous financial systems
  • The keyless architecture removes the burden of private key management while relying on enterprise infrastructure and Binance's "Secure Auto Sign" for pre-approved execution
  • Supported operations at launch include balance checks, transfers, market and limit order trading, and order management across four major chains
  • The wallet is compatible with multiple AI agent frameworks that support tool-use protocols, suggesting a focus on developer ecosystem rather than just retail users
  • Launch includes a 15-day promotion with up to 20 gas-free transactions per user (capped at 200k globally) and waived service fees to drive early adoption
  • Each user is currently limited to one Agentic Wallet, likely a risk management measure during the initial rollout phase
  • The product extends Binance's AI capabilities beyond exchange-based trading tools into broader Web3 ecosystem automation
  • This represents a notable shift toward treating AI agents as first-class participants in blockchain economies, not just analysis tools
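
To make the policy surface concrete, here is a hypothetical configuration sketch; Binance's actual API and field names aren't specified in the announcement:

```ts
// Hypothetical policy shape for an isolated agent wallet.
interface AgenticWalletPolicy {
  dailySpendLimitUsd: number;
  allowedTokens: string[];
  whitelistedAddresses: string[]; // transfers only to the saved address book
  allowedOps: Array<"balance" | "transfer" | "marketOrder" | "limitOrder">;
}

const policy: AgenticWalletPolicy = {
  dailySpendLimitUsd: 100,
  allowedTokens: ["USDC", "BNB"],
  whitelistedAddresses: ["0x1111222233334444555566667777888899990000"], // placeholder
  allowedOps: ["balance", "transfer", "marketOrder"],
};

// Every agent-proposed transfer is checked against the policy before signing.
function allowTransfer(to: string, token: string, usd: number, spentTodayUsd: number): boolean {
  return (
    policy.whitelistedAddresses.includes(to) &&
    policy.allowedTokens.includes(token) &&
    spentTodayUsd + usd <= policy.dailySpendLimitUsd
  );
}
```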

Decoder

  • Keyless wallet: A crypto wallet that doesn't require users to manage private keys directly; instead relies on other authentication methods and infrastructure
  • On-chain: Actions that occur directly on a blockchain network, as opposed to off-chain or centralized database transactions
  • Self-custodial: A wallet type where users maintain control of their assets without relying on a third party to hold funds
  • Gas-free transactions: Blockchain transactions where the network fees (gas) are covered by another party rather than the user
  • Whitelisted addresses: Pre-approved wallet addresses that are permitted to receive transfers, blocking transactions to unknown or unverified destinations
  • Tool-use protocols: Standards that allow AI agents to interact with external systems and execute functions beyond simple text generation

Original Article

Binance's Agentic Wallet is a keyless, isolated account that enables AI agents to execute on-chain transactions. Supporting BNB, Solana, Base, and Ethereum, the wallet features configurable spending limits and security protocols. It integrates with frameworks like Claude Code to facilitate secure, automated Web3 asset management for users.

Crypto · ai · fintech · banking

Revolut Built a Foundation Model for Money

Revolut's custom foundation model for banking events achieved 130% improvement in credit scoring and 65% in fraud detection, signaling a new competitive battleground for financial services.

Summary

What: PRAGMA is a transformer-based foundation model trained on 24 billion banking events from 26 million Revolut customers, designed to predict credit risk, fraud, and customer behavior. Unlike ChatGPT, it preserves the structured nature of financial event data rather than using text tokenization, and replaces six separate production ML models with a single architecture.
Why it matters: This represents a fundamental shift in how banks build competitive advantage—traditional hand-crafted ML features are being replaced by models that discover patterns across all customer events simultaneously. Even conservative estimates suggest applying 10% of Revolut's stated gains to a bank like JPMorgan could save hundreds of millions annually, with compounding improvements over time as the model evolves.
Takeaway: Banks face three options: build their own foundation models with proprietary data, collaborate with industry partners to pool datasets, or buy models from vendors—but the window to establish this competitive moat is closing as digital-first banks enter an arms race.

Deep Dive

  • Revolut replaced six separate production ML models with one foundation model, achieving 130% uplift in credit scoring precision-recall and 65% improvement in fraud detection recall compared to existing systems
  • The model was trained on 207 billion tokens derived from 24 billion customer events (logins, transactions, screen taps) across 26 million customers in 111 countries, using 32 NVIDIA H100 GPUs over approximately two weeks
  • Unlike large language models, PRAGMA preserves the structured nature of banking event data rather than fragmenting it through text tokenization, making it specifically suited for financial prediction tasks
  • The model currently operates in predictive mode (similar to BERT in 2020) but the roadmap includes generative capabilities that could simulate customer futures and work backwards to identify triggers for desired behaviors
  • PRAGMA failed at anti-money laundering tasks (47% worse than production systems) because AML requires network analysis of transaction chains, while the model currently analyzes individual customer histories in isolation
  • Nubank, Mastercard, and PayPal have all published similar work in the past year, with PayPal achieving 49% faster and 45% cheaper inference by fine-tuning Llama in just two weeks
  • Three competitive layers emerge beyond the model itself: talent consolidation, proprietary data assets, and agentic workflows that orchestrate the model within broader product systems
  • NVIDIA's enterprise pitch focuses on inference costs rather than training costs, since banks run predictions billions of times daily on transactions, applications, and sessions—making model efficiency the key economic factor
  • The availability of open-weight base models (Qwen, Kimi, GLM) and rentable GPU infrastructure means foundation models for finance have shifted from a research problem to an execution problem
  • Credit scoring improvements translate directly to revenue through better loan pricing and expanded lending to previously marginal customers, while fraud improvements both save costs and increase revenue from approved legitimate transactions
  • Mid-sized banks likely need collaborative approaches or vendor solutions rather than custom models, while digital-first banks are entering an arms race to build proprietary foundation models as core IP
  • The fundamental moat of banking (regulation, lending, balance sheet) remains intact, but the winners within that moat will be determined by who upgrades their risk-pricing and fraud-detection IP first

Decoder

  • Foundation model: A machine learning model trained on massive datasets that can be adapted for multiple downstream tasks, as opposed to building separate models for each specific use case
  • PR-AUC: Precision-Recall Area Under Curve, a metric measuring how well a model catches rare but important events (like defaults or fraud) without generating excessive false positives
  • Fraud recall: The percentage of actual fraud events successfully detected out of all fraud that occurred—higher recall means catching more real fraud
  • Transformer architecture: The neural network design behind models like GPT and BERT, using attention mechanisms to process sequences of data (originally text, here banking events)
  • Embeddings: Dense numerical representations that capture the meaning or characteristics of data (users, transactions, events) in a way that similar things are mathematically close together
  • Fine-tuning: Taking a pre-trained model and training it further on a specific dataset or task, typically much faster and cheaper than training from scratch
  • LoRA: Low-Rank Adaptation, a technique for efficiently fine-tuning large models by updating only a small subset of parameters
  • Inference: The process of running a trained model to make predictions on new data, as opposed to training which builds the model initially
  • Open-weight models: AI models whose parameters are publicly released, allowing organizations to run or modify them locally rather than accessing them only through APIs
  • Agentic workflow: A system where multiple AI agents handle different specialized tasks (reasoning, fraud detection, checkout) and coordinate to accomplish complex goals

Original Article

🤖 Revolut Built a Foundation Model for Money

Plus; $16bn withdrawn from Aave in a "run on DeFi" after KelpDAO hack, and Plaid's annual letter.

Weekly Rant 📣

🤖 Revolut Built a Foundation Model for Money

(And why it's worth billions to whichever bank builds the next one.)

Something big happened in financial services in the last twelve months and almost nobody clocked it. Companies started training their own foundation models for finance:

  • Revolut published PRAGMA — a foundation model trained on 24 billion banking events across 111 countries. Credit scoring up 130%. Fraud recall up 65%.
  • Nubank published nuFormer — a foundation model trained on 100+ billion transactions across 100M+ customers. It's narrower than PRAGMA by use case, but more GPT-like as a model.
  • Mastercard launched LTM — a foundation model trained on billions of card transactions to identify cyber risk.

And some others started fine-tuning existing models.

  • NPCI (India's UPI operator) — fine-tuned Mistral 24B for UPI Help, a conversational agent for 400M+ UPI users.
  • PayPal published Nemo-4-PayPal — fine-tuned llama3.1-nemotron-nano-8B-v1. Their shopping assistant got 49% faster and 45% cheaper to run. Two weeks of fine-tuning.

Foundation models for finance are here.

The most important signal in Revolut's PRAGMA work with NVIDIA is the significant improvement (uplift) their model delivered vs their existing custom ML models, across multiple use cases.

The two most credible, fastest-growing Neobanks in the western hemisphere, Revolut and Nubank, are now betting that transformer architectures will drive their growth.

Clearly, a custom finance foundation model will be a massive competitive advantage in the coming decade.

The IP of banking was always how they priced and managed risks, the squishy lending decisions, the fraud stuff, and the optimization of every interaction. If transformers deliver benefits in production on the scale Revolut is suggesting, they're a game changer.

I've been asking for years — where are the foundation models for finance? Healthcare was doing this with DNA sequences. Ad-tech was doing it with clickstreams. Finance, the most data-rich industry on earth, was still stuck training bespoke ML models on hand-crafted features. PRAGMA is the first real answer that cuts across multiple use cases.

What is a Foundation Model for Money?

Revolut took their massive data set and looked at customer events (logins, screen taps, payments) over time.

  • 26 million customers
  • 24 billion customer events
  • 207 billion tokens created

This is NOT a large language model. It's not generating text or images. It's not competing with Claude or ChatGPT. Customer event data has a structure that text tokenization destroys. Numbers and scale become fragmented.

To prove the model worked, Revolut ran three experiments. Pavel walked me through them.

  • Experiment 1: Use the pre-trained "embeddings" in an old model. This answers one question: how much information is already in the pre-trained embedding, before you've even told it what task you're solving?
  • Experiment 2: Use pre-trained embeddings alongside the hand-crafted ML features the data science team had spent years building. This answers: what new information does the foundation model capture that the old features missed?
  • Experiment 3: Fine-tune the foundation model on finance outcomes using LoRA. This answers: can we beat the entire data science team by taking a pre-trained model and pressing a button? (A sketch of the LoRA idea follows below.)

can we beat the entire data science team by taking a pre-trained [foundation] model and pressing a button?

Pavel Nesterov

In most cases, the answer is yes.
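To make Experiment 3 concrete, here's a minimal PyTorch sketch of the LoRA idea: freeze the pre-trained weights and train only small low-rank adapters on the downstream label. This illustrates the technique in general, not Revolut's code; the 512-dimension embedding and the binary default label are assumptions for the example.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pre-trained linear layer with trainable low-rank adapters."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))  # zero-init: no change at step 0
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A @ self.B) * self.scale

# Hypothetical head: a frozen event encoder emits 512-d embeddings; only the
# adapters (a few thousand parameters) get trained on, say, default / no default.
head = LoRALinear(nn.Linear(512, 2))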

The PRAGMA model took six production tasks, and replaced six separate custom machine learning (ML) models with one.

Revolut's Paper Shows STAGGERING Potential

The paper is the first time I've seen published evidence on the effectiveness of a foundation model for banking's most important competitive levers. Revolut published the uplift this model produced vs their existing ML models:

  • Credit scoring: +130% PR-AUC uplift vs ML models (how well it catches rare cases that matter)
  • Fraud recall: +65% uplift vs ML models (how much true fraud is caught above old ML model)
  • Marketing engagement: +79% & product recommendation: +40% uplift vs ML models
Revolut's final uplift of PRAGMA vs production deep learning models

Each of these is a competitive lever:

1. Credit risk pricing is the competitive lever for banks

Credit scoring lets banks price and issue loans. Loans are the profit center of banks. Lending is a product that never has a demand problem. People will always take the money; the harder questions are: will they pay it back, and if so, what rate can I charge them so they take the loan from me rather than from someone cheaper? Today, answering that involves comparing their data to other customers' data over time.

→ A model that can find more customers who are likely to pay back, can price more aggressively, and the bank sells more loans, which means more revenue.

(Assuming regulators don't tie themselves in knots over "explainability," this is the area of opportunity. FWIW, I think we need a shared responsibility framework for AI, as we have for cloud, and we'll be building V1 at Fintech Nerdcon this year.)

2. Fraud recall is a competitive lever for anyone in payments.

Fraud recall is the % of true fraud events detected out of all the fraud events that occurred. If you're detecting more of the actual fraud, you're also not falsely blocking good payments. Banks and fintech companies make money from card transactions (interchange), but are often liable for refunding fraud. So

→ A model that catches more real fraud saves cost, and allowing more good transactions through generates revenue.
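For concreteness, this is how the two headline metrics are computed. A minimal scikit-learn sketch with toy labels and scores; nothing here is Revolut's evaluation code.

from sklearn.metrics import average_precision_score, recall_score

y_true = [0, 0, 1, 0, 1, 1, 0, 0]  # 1 = fraud (or default) actually happened
y_score = [0.1, 0.4, 0.9, 0.2, 0.7, 0.3, 0.05, 0.6]  # model risk scores

# PR-AUC (average precision): a threshold-free view of the precision/recall trade-off
pr_auc = average_precision_score(y_true, y_score)

# Fraud recall: share of real fraud caught once scores are thresholded into decisions
y_pred = [int(s >= 0.5) for s in y_score]
recall = recall_score(y_true, y_pred)

print(f"PR-AUC={pr_auc:.2f}, recall={recall:.2f}")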

3. Marketing is how banks get new customers or cross-sell product.

The "attach rate" of customers is a really important metric that Neobanks like Revolut and Nubank have been excellent at. Whereas a large bank will have more than 60% of its customer base just using a single product. That's a lost opportunity for revenue.

→ If customers are engaging more with marketing, their propensity to take the product being recommended increases. If they do take that product, the bank gets more revenue. Yay.

Notably, there was a seventh area where it did not succeed.

PRAGMA performed 47% worse than Revolut's production system on anti-money laundering. This was not a surprise. Pavel told me the team expected to fail. AML is a network problem — what matters is who you transact with, not what you do. PRAGMA reads each user's history in isolation. It can't see the chain. They published the failure because it's a finding, and because they already know what to build next.

(Although I imagine if you trained a model on that network transaction data instead of individual customer events, it would be far more effective. The issue was the training data set, not the model type.)

Building a customer model created six areas of competitive advantage. For a bank that already has 70M customers and is one of the fastest growing in the world.

For Revolut, this kind of research compounds their advantage in tech.

The Prize is More Revenue and Lower Costs

Not a shocking business case. The magnitude, however, is.

Napkin math o'clock.

Credit (based on JPMorgan Chase earnings data). JPM has credit costs of over $10 billion a year, dominated by Card Services. Net charge-offs run ~$2 billion per quarter in consumer, or $8bn+ annualised. Revolut's credit scoring improvement was +130% PR-AUC. That's not a 130% reduction in losses — nothing works like that in production. But even at 10% of the stated gain applied to JPM's card book?

That's hundreds of millions of dollars a year. Every year. Compounding as the model improves.

Fraud losses can be millions per quarter. For every $1 lost in fraud, another $5.75 is lost in operational costs. So that balloons to tens of millions. Revolut's fraud recall went up 65% — meaning they caught roughly 1.65x as much fraud as their prior production system.

A large card issuer catching even a fraction more fraud pays for the GPU bill a thousand times over.
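Here's that napkin math as runnable Python, with every assumed input labeled; the 10% haircut and the $5m/quarter fraud figure are illustrative picks, not reported numbers.

# Credit: stated ~$2bn/quarter consumer net charge-offs, $8bn+ annualised
jpm_card_chargeoffs = 8e9
loss_reduction = 0.10  # assumption: only 10% of the headline gain survives production
print(f"credit saving ~ ${jpm_card_chargeoffs * loss_reduction / 1e9:.1f}bn/year")  # ~$0.8bn

# Fraud: stated $5.75 of operational cost per $1 of direct fraud loss
direct_fraud = 5e6  # assumption: "millions per quarter" pinned at $5m for illustration
ops_multiplier = 5.75
print(f"all-in fraud cost ~ ${direct_fraud * (1 + ops_multiplier) / 1e6:.0f}m/quarter")  # ~$34m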

Foundation models do all the things a consulting strategy deck will tell you to do.

So you can skip that part and go straight to figuring out execution.

But why can't every bank do this?

What Revolut built is the equivalent of a 2020-era LLM. What comes next is wild.

Today PRAGMA predicts. That's it. Given a customer's history, it predicts who's a credit risk, who's committing fraud, who's about to churn, and a customer's lifetime value. Useful, by all means, but boring compared to what's coming.

Pavel gave me the analogy of where LLMs were in 2020. Back then BERT could read a sentence and fill in the missing word. Useful for search, classification, and ranking. But it wasn't until GPT came along and could generate the next sentence that things really began to change.

PRAGMA today is like BERT was in 2020. It reads a customer's history and fills in the gaps.

A generative version would write the next chapter. If you can generate a customer's future events, you can simulate when they'll buy a new product. And then rewind the tape to see the things that led to that decision, and try to make those things happen.

Low-key Minority Report, for banking.

That doesn't exist yet, but that's the goal.
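A toy sketch of that difference in Python; the event IDs are made up and the commented-out model call is hypothetical.

import torch

events = torch.tensor([[12, 7, 98, 3, 45]])  # one customer's history as event-token IDs

# BERT-style (PRAGMA today): hide an event, predict it from both sides
masked = events.clone()
masked[0, 2] = 0  # 0 stands in for a [MASK] token; the model fills the gap

# GPT-style (the roadmap): condition on history so far, generate the next event,
# then feed each generated event back in to roll out a simulated future
context = events[:, :3]  # everything up to "now"
# next_event = model(context).argmax(-1)  # hypothetical model call, applied repeatedly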

Why haven't banks done this?

Do not underestimate banks' AI capabilities in some areas.

Banks have been doing machine learning for decades. Every major bank has capable AI talent in lending, underwriting, and AML — people who've forgotten more about credit modeling than most Silicon Valley ML engineers will ever learn. Their data sets are enormous and compounded over decades of experience.

So why did a neobank ship this first?

1. Capability, and the ability to move quickly.

Revolut is the canonical move-fast-and-occasionally-break-a-thing Fintech company, one that famously took a while to get its full UK bank license because of regulator concerns over audit gaps and compliance staff turnover. They will move quickly.

And the entire tech stack is modern, regularly replaced, and ideal for this kind of experimentation. A bank could take months just to find and scrub the data they need for this training. Revolut likely had it. Ready to go.

2. The availability of open-weight base models.

Open-weight models like Qwen, Kimi, and GLM are now incredibly capable and cheap, and with a couple of weeks of rentable GPUs they can be trained into a foundation model on private data sets.

That conversation may be happening in large institutions, but the execution pace isn't in the same universe. That doesn't mean they'll stay absent, though. When it's a company's IP, they're actually good at this stuff.

The fact that Mastercard built a foundation model is a big tell for me.

Foundation models for finance aren't a research problem anymore. They're an execution problem. The base models are open-weight. The frameworks are public. The papers are on arXiv. The compute is rentable. Banks can now do this.

Banks have never lacked talent or resources; they just often get McKonsultant'd and PowerPoint-slided to death. The banks that train their own foundation models will be the ones who ignore that guff, and get their data and risk talent working hands-on with folks like NVIDIA.

And Pahal Patangia runs payments business development at NVIDIA. Where did he come from? FICO. The person who spent a decade helping retail banks build credit models is now the person helping NVIDIA sell them the thing that replaces credit models.

That's not a coincidence.

NVIDIA's Enterprise Finance Pitch

You think NVIDIA, you think chips, Jensen's leather jacket, and a stock price that's propping up the S&P 500.

Almost nobody outside the people already working with them knows what they do in financial services.

For Revolut, NVIDIA built the factory that lets a bank turn its proprietary data into a proprietary foundation model. It supports rolling your own foundation model or fine-tuning.

  1. Foundation Model Building (Revolut, Mastercard): NVIDIA provides the silicon (H100s, Blackwell), the data libraries (cuDF for the GPU-accelerated feature engineering that used to take weeks), and a training framework (NeMo AutoModel) that handles the parallelism so the ML team doesn't have to. PRAGMA's 1B model trained on 32 H100s in roughly two weeks. (A minimal cuDF sketch follows this list.)
  2. Fine Tuning a Model (PayPal): NVIDIA provides the base model itself — Nemotron. Plus the framework to do the fine-tuning (NeMo), plus the inference runtime to serve it cheaply (TensorRT-LLM).
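For flavor, here's what the cuDF step looks like; a sketch, not NVIDIA's or Revolut's pipeline, and the file and column names are invented. The API mirrors pandas but runs on the GPU.

import cudf  # GPU DataFrame library with a largely pandas-compatible API

# Hypothetical event log: one row per customer event
events = cudf.read_parquet("events.parquet")  # columns: customer_id, event_type, amount

# Per-customer aggregates computed on the GPU instead of long CPU batch jobs
features = events.groupby("customer_id").agg(
    {"amount": ["sum", "mean"], "event_type": "count"}
)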

Pahal Patangia made the point on a recent Tokenized podcast episode — and I keep coming back to it — that training is a one-time cost. Inference is forever.

A bank running fraud on every transaction, credit on every application, personalization on every session, is paying the inference bill billions of times a day. NVIDIA's whole recent model family has been re-tuned around that inference cost: smaller models trained on more data, so they punch above their weight when you fine-tune them.

So they'll help you train at low cost, because they want your inference workloads.

Sort of like how GLP-1s give you something you want (better body shape) in return for a subscription fee to the medication.

And before I turn this into an NVIDIA sales pitch: there are plenty of alternative companies competing here. Google TPUs, AWS Trainium on Bedrock, and AMD are all building out their offerings.

In fact, the raw inference space is becoming so competitive, there's a growing view in Silicon Valley that NVIDIA's moat is falling to more specialized silicon. So much so that the CEO of NVIDIA, Jensen Huang, went on the Dwarkesh podcast to lay out his counterargument (and accidentally spawned a hilarious set of memes about cars).

It's not just the models

A foundation model is nice.

But there are actually four layers of AI competitive advantage, and the model is just one of them.

  1. Talent. In one room. With one architecture.
  2. The Data. Big banks have plenty.
  3. The Model. What we've discussed today.
  4. The Workflow. The orchestration around the model.
The Fintech Brainfood: Four Layers of AI Competitive Advantage

Look at what PayPal did. They didn't just fine-tune Nemotron and ship it. They built a multi-agent system around it. Agents handle reasoning, checkout, fraud. The fine-tuned model is one component in a larger agentic workflow that, combined, is the product.

  • The model gets better the more data you put through it.
  • The workflow gets better the more customer interactions you observe.
  • The data gets richer the more products you cross-sell.

Each layer compounds the one below it.

Agentic workflows as intellectual property. Anyone with data and a computer can train a model. But the workflow — the orchestration, the tool routing, the evaluation harnesses, the guardrails, the specific prompts that encode your product's opinion about what good looks like — that's harder to copy.

That's where competitive advantage starts to move to, once everyone has a foundation model.
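To ground that, here's a toy router in Python at its absolute simplest; the agent names and keyword classifier are invented for illustration, and a real workflow wraps every step in guardrails, evals, and tool calls.

def classify(task: str) -> str:
    # Toy routing heuristic; production systems use a model or rules engine here
    if "fraud" in task or "chargeback" in task:
        return "fraud"
    if "pay" in task or "checkout" in task:
        return "checkout"
    return "reasoning"

AGENTS = {
    "fraud": lambda t: f"[fraud agent] scored: {t}",
    "checkout": lambda t: f"[checkout agent] executed: {t}",
    "reasoning": lambda t: f"[reasoning agent] planned: {t}",
}

def run_workflow(task: str) -> str:
    return AGENTS[classify(task)](task)

print(run_workflow("checkout for order 123"))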

An aside: one more finding held in Revolut's research: scaling laws work on events. More parameters, better performance. The same pattern that drives LLM progress applies to banking event data. The 10M-parameter model is good. The 1B-parameter model is better. The frontier keeps moving.
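In LLM research that pattern is usually written as a power law in parameter count. Assuming banking-event models follow the same form (the paper implies the trend; no exponents are published, so this is purely illustrative):

$$ L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha} $$

where L is the model's loss, N the parameter count, and N_c and alpha are constants fit to the data: loss keeps falling, predictably, as N grows.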

So what do you do to compete?

I think there are three options emerging.

  1. Build your own custom model like Revolut did. This requires a massive dataset and, more importantly, the ability to use it for training.
  2. Collaborate with other industry actors to build a model. This could be with your core processor, industry consortia or vendors. Sardine has 6 billion user device profiles and event data. FICO has more lending data than anyone I can think of.
  3. Buy a model from a company with a large data set. Vendors exist in risk and credit who have pretty large datasets themselves. It's a matter of time until these companies have foundation models you could just buy.

Your decision comes down to where you see your competitive advantage, and your honest assessment of your ability to execute.

Neobanks and digital-only banks are now in an arms race to build their own models.

The big banks have to figure out if they compete or if any of their bigger suppliers can help them compete.

A mid-sized or smaller bank with a smaller customer base probably shouldn't be banging on NVIDIA's door to custom-train a model on their batch file outputs. But a group of smaller banks with pooled data and resources could do something very interesting.

Banks are not disrupted. But AI is disruptive.

Banks are not disrupted. I still believe that.

The moats around regulation, lending, and balance sheet are real, and they aren't going anywhere. But the banks that win the next decade aren't the ones with the biggest balance sheets. They're the ones that upgrade their IP. The moat stays the same. The winners inside the moat change.

The IP of banking was always how they price and manage risks. The squishy lending decisions. The fraud stuff. For 200 years that IP lived in spreadsheets and in the heads of underwriters. Revolut just moved it into a model. And the model is only phase one of four.

The only thing between you and a PRAGMA of your own is deciding which of the three paths is yours. Build. Collaborate. Buy.

Pick one. Start now.

ST.

* Much of the PRAGMA detail in this Rant came from a conversation with Pavel Nesterov, one of the paper's authors. Any errors are mine. I also spoke to Pahal from NVIDIA on the Tokenized Podcast and it was an absolute BANGER episode. You should really check it out.

Crypto · blockchain · ai · automation

How to Automate MEV Blockchain Analysis Using OpenClaw and MCP

A tutorial demonstrates automating MEV blockchain analysis by connecting the mevlog-rs tool to AI agents like Claude Code or OpenClaw for natural language transaction queries.

Summary

What: The mevlog-rs MCP server integrates with Claude Code or OpenClaw agents to enable natural language queries of Ethereum blockchain data, allowing developers to search for high-value token transfers, gas usage patterns, and MEV opportunities without writing manual queries.
Why it matters: Searching for MEV opportunities typically requires tedious manual blockchain monitoring and custom scripting; this approach lets developers ask questions in plain English and automate recurring analysis tasks like scheduled Telegram reports.
Takeaway: Install mevlog-rs with `cargo install mevlog --features mcp`, run the MCP server locally, and configure Claude Code to query blockchain data using conversational commands like "find transactions that transferred over 100k USDC in the last 200 blocks".

Deep Dive

  • The tutorial covers two integration approaches: manual MCP setup with Claude Code for local queries, and OpenClaw agent for automated scheduled monitoring with Telegram notifications
  • Local setup requires installing mevlog-rs with MCP features enabled, running the server on localhost:6671, and adding it to Claude Code's MCP configuration
  • The performance bottleneck is blockchain data download; the tool uses extensive caching to speed up repeated queries over the same block ranges
  • Remote deployment recommended for production, using a full Geth node on a VPS with NGINX-secured HTTPS gateway and OAuth token authentication
  • The NGINX configuration proxies requests to the local MCP server while handling SSL termination and proper headers for long-running requests
  • OpenClaw integration uses a SKILL.md file instead of MCP endpoints, works well for scheduled cron tasks reporting directly to Telegram
  • Example automated queries include hourly monitoring for million-dollar USDC transfers and daily reports on highest gas-spending transactions
  • Free RPC endpoints from ChainList work for testing but are throttled, lack historical data, and disable EVM tracing features
  • Premium RPC endpoints from services like Alchemy or self-hosted Geth nodes provide significantly better performance for scanning large block ranges
  • The author mentions a major upcoming release that will further improve querying performance across large block ranges

Decoder

  • MEV (Maximal Extractable Value): Profit opportunities extracted from reordering, inserting, or censoring blockchain transactions, typically by bots or miners
  • MCP (Model Context Protocol): A protocol that allows LLM agents to access external tools and data sources through standardized interfaces
  • OpenClaw: An autonomous AI agent framework that can perform tasks, run on schedules, and integrate with communication platforms like Telegram
  • RPC (Remote Procedure Call): An endpoint that allows querying blockchain node data; can be local (self-hosted) or remote (third-party service)
  • Geth: Go Ethereum, the official Golang implementation of an Ethereum full node that stores complete blockchain history
  • EVM tracing: Advanced blockchain query features that show detailed execution steps of smart contract transactions, often disabled on free endpoints

Original Article

(Image: the OpenClaw LLM agent, represented by a crab. Photo by Sandra Iglesias on Unsplash.)

Searching for MEV profit opportunities often requires tedious blockchain monitoring and analysis. In this tutorial, I'll describe how to automate this process using the mevlog-rs MCP interface and the OpenClaw agent. We'll use a secure HTTPS connection and Telegram chat for communication. We'll also discuss the pros and cons of using OpenClaw vs. a manual MCP integration.

Initial setup and RPC requirements

I don't think the world needs another article on a basic OpenClaw setup. Any tutorial on running OpenClaw with a Ubuntu VPS should work. I will assume you already have an OpenClaw integration with a communication interface, such as Telegram, configured.

For an MCP integration, you'll need an LLM agent configured and running locally. For the rest of this article, we will use the Claude Code CLI.

An optional but useful prerequisite for this tutorial is running a Geth full node on the same machine as OpenClaw. A local node offers significantly better performance for scanning larger block ranges. You can also use a premium RPC endpoint, e.g., from Alchemy. If you want a quick and easy way to test this setup, you can use the built-in ChainList integration, which automatically selects the fastest free RPC endpoint. Just remember that free RPC endpoints are often throttled, don't provide access to older data, and disable EVM tracing features.

Configuring an MCP server with mevlog-rs

Let's start with a less "magical" setup. We will first configure a local MCP server and later a remote one to enable querying blockchain data by simply talking to your LLM agent. Full reference is in the mevlog-rs MCP docs.

First, install the mevlog-rs CLI with the MCP feature enabled:

cargo install mevlog --features mcp

Then verify that basic querying works:

mevlog search -b latest --rpc-url=$ETH_RPC_URL

You should see JSON output with txs from the most recent block.

Now run a mevlog MCP server locally:

mevlog mcp --rpc-url=$ETH_RPC_URL

Next, configure it with your Claude Code instance:

claude mcp add --transport http mevlog http://localhost:6671

That's it! You can query blockchain through an LLM interface by just asking questions like:

  • query the last 50 blocks for transactions that transferred PEPE (0x6982508145454ce325ddbe47a25d4ec3d2311933) token
  • find transactions that transferred over 100k USDC (0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48) in the last 200 blocks
  • which tx transferred the most DAI (0x6b175474e89094c44da98b954eedeac495271d0f) in the last 6 hours
(Screenshot: Claude Code querying mevlog over the MCP interface)

Configuring a remote MCP endpoint for better RPC performance

The above setup assumes that the mevlog CLI executes on your local machine. For production-like deployments, this is usually not optimal. The largest bottleneck of querying blockchains is downloading the data. mevlog uses caching extensively, so repetitive queries against the same block ranges will be significantly faster, but initial data must be downloaded over the wire.

For the best download performance, I recommend running your own dedicated full node. You're unlikely to run it on your local network, though; it'll more likely live on an external VPS. To configure mevlog with an external node, you'll need an authenticated HTTPS gateway using NGINX. So let's do it now.

I won't elaborate on configuring SSH access or running a full node on a proprietary VPS. Check my other tutorial for detailed info on how to do it.

To expose your MCP endpoint externally, you'll need a similar nginx config:

server {
    # Port 80: redirect all plain-HTTP traffic to HTTPS
    listen 80;
    listen [::]:80;
    server_name mcphost.com;
    return 301 https://$host$request_uri;
}

server {
    listen 443 ssl;
    listen [::]:443 ssl;
    server_name mcphost.com;

    # TLS termination (certificates via Cloudflare origin certs or Certbot, see below)
    ssl_certificate /etc/nginx/ssl/mcphost.crt;
    ssl_certificate_key /etc/nginx/ssl/mcphost.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_session_cache shared:SSL:10m;
    ssl_session_timeout 1d;

    # Proxy the MCP endpoint to the local mevlog server
    location = /mcp {
        proxy_pass http://127.0.0.1:6671/mcp;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
        proxy_set_header Connection "";
        # Disable buffering and stretch timeouts for long-running, streamed responses
        proxy_buffering off;
        proxy_request_buffering off;
        chunked_transfer_encoding on;
        proxy_read_timeout 3600s;
        proxy_send_timeout 3600s;
    }

    # OAuth discovery endpoints that MCP clients use for token authentication
    location = /.well-known/oauth-authorization-server {
        proxy_pass http://127.0.0.1:6671/.well-known/oauth-authorization-server;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto https;
    }

    location = /.well-known/oauth-protected-resource {
        proxy_pass http://127.0.0.1:6671/.well-known/oauth-protected-resource;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-Proto https;
    }
}

I usually use a Cloudflare proxy and origin certificates to configure a secure SSL connection. Alternatively, you can use Certbot. BTW, if you already have an OpenClaw running on your VPS, you can ask him to configure it; he'll know what to do.

Now start the authenticated MCP process on a VPS:

MEVLOG_MCP_AUTH_TOKEN=mcp_password mevlog mcp --rpc-url=$ETH_RPC_URL --host=127.0.0.1

and configure your Claude Code to use this external endpoint:

claude mcp add --transport http mevlog-remote https://mcphost.com --header "Authorization: Bearer mcp_password"

and use it as you did before. It's a simple way to use a high-performance, externally hosted node to query the blockchain from a local LLM agent.

How to query EVM blockchains with the OpenClaw LLM agent

The MCP server required some manual configuration. If you're more into "agentic workflows", the OpenClaw setup might work better for you. Once you have OpenClaw configured, getting started with mevlog-rs is straightforward. There's a dedicated SKILL.md file that explains how to use the CLI directly, without the MCP endpoint. The recommended setup is to run OpenClaw on the same VPS as your RPC node for the best performance.

Just ask your OpenClaw to import the skill into memory and provide an RPC endpoint (local or remote), and you're good to go.

(Screenshot: the OpenClaw agent using mevlog via its dedicated skill)

I find OpenClaw setup useful for running scheduled queries and reporting directly to a Telegram channel. You can configure cron tasks like:

  • every 1h check and notify me if any transaction transferred more than 1 million USDC
  • each day at 8AM send me a report on which 10 txs paid the most for gas in the last 24 hours
  • how many txs did jaredfromsubway.eth send in the last 24h, and how much was his total gas spending

Summary

I wish I had access to similar tooling when I was full-time focused on building MEV bots some time ago. It would have saved me dozens of hours of manually querying blockchains for relevant txs. I hope you'll find some of these techniques useful. Please send any feature requests or bug reports.

Also, stay tuned: I'm currently working on a major release of the mevlog CLI that will massively speed up and improve querying across large block ranges.

Crypto · ethereum · blockchain · infrastructure

Ethereum Didn't Kill Blizzard; It Moved Control to the Verification Layer

Ethereum solved centralized control over execution, but verification of what happened on-chain still depends on non-standardized tools like RPCs and indexers that can produce different answers.

Summary

What: An analysis arguing that while Ethereum decentralized rule execution (solving the problem of entities like Blizzard changing game rules arbitrarily), it hasn't standardized verification—the process of independently confirming what actually happened on-chain across different systems.
Why it matters: As Ethereum moves toward rollups and proof-based systems, the lack of portable, implementation-independent verification creates new dependencies where control over "truth" shifts to whoever defines how data is encoded, extracted, and interpreted.

Deep Dive

  • Ethereum guarantees deterministic execution, verifiable state transitions, and distributed consensus, solving centralized control over rules at the execution layer
  • The unsolved problem is that independent parties cannot verify claims about on-chain events using a standard, system-independent method
  • Verification today emerges from a non-standardized stack including RPCs, indexers, decoding logic, proof formats, and custom implementations
  • Three different systems verifying the same claim might extract data from calldata, logs, or an indexer respectively, returning different answers based on encoding assumptions and parsing logic
  • Running your own node doesn't solve how different systems agree on what to verify and how to verify it
  • Ethereum standardized execution, consensus, and state, but never standardized a portable verification artifact
  • There is no shared invariant for mapping "this exact byte sequence corresponds to this on-chain commitment"
  • Control didn't disappear—it moved to the layer that defines what counts as truth through interpretation and verification methods
  • The claim: a system is not fully decentralized if verification is not independently reproducible across implementations
  • As Ethereum emphasizes rollups, proofs, and data minimization, we increasingly verify correctness without preserving portable references to what was verified

Decoder

  • Execution layer: The component of a blockchain that processes transactions and updates state according to deterministic rules
  • Verification layer: The systems and processes used to confirm what happened on-chain, including data extraction and interpretation
  • RPC (Remote Procedure Call): Interface that allows applications to communicate with blockchain nodes to read data
  • Indexers: Services that process and organize blockchain data into queryable databases for easier access
  • State transitions: Changes to the blockchain's state (account balances, smart contract storage) resulting from transaction execution
  • Rollups: Layer 2 scaling solutions that execute transactions off-chain but post data to Ethereum for verification
  • Calldata: Data included in a transaction that is passed to a smart contract function
  • Deterministic execution: Computation that produces the same output given the same input, ensuring all nodes reach consensus

Original Article

Ethereum ensures deterministic execution and distributed consensus, effectively decentralizing rule-making previously held by entities like Blizzard. While the execution layer is secure, the challenge remains in enabling independent parties to verify state transitions, shifting the focus from simple execution to robust, decentralized verification of system reality.

Crypto · payments · infrastructure

Stablecoins are going local

Stablecoins are shifting from crypto trading tools to everyday payment infrastructure, with local transactions now dominating over cross-border use.

Summary

What: Data analysis from a16z crypto showing stablecoins evolving beyond trading instruments, with Q1 2026 volume hitting $4.5 trillion, consumer-to-business transactions growing 128% year-over-year, and velocity doubling to 6x since early 2024.
Why it matters: The surprise finding is that stablecoins are becoming local payment rails rather than primarily cross-border tools as widely assumed, with intra-country payments growing from 50% to 75% of volume, challenging the dominant narrative and suggesting they're maturing into general-purpose financial infrastructure that happens to run on global blockchain networks.
Takeaway: Developers building payment infrastructure should monitor stablecoin payment APIs and local currency stablecoin variants, particularly in emerging markets where adoption is accelerating.

Deep Dive

  • US regulatory clarity through the GENIUS Act and Europe's MiCA framework accelerated institutional participation, with Q1 2026 adjusted volume reaching approximately $4.5 trillion after several quarters of pre-existing growth
  • MiCA compliance caused major exchanges to delist USDT, creating a spike in non-USD stablecoin activity exceeding $40B that has since stabilized at $15-25B per month, establishing a persistent market where one barely existed before
  • Consumer-to-business stablecoin transactions more than doubled year-over-year from 124.9M in 2024 to 284.6M in 2025, the fastest growing category despite consumer-to-consumer dominating by raw count with 789.5M transactions
  • Stablecoin card infrastructure collateral deposits across Rain-powered programs grew from near zero in November 2024 to over $300M per month by early 2026, providing tangible evidence of commerce adoption
  • Stablecoin velocity doubled from 2.6x to 6x since early 2024, indicating demand for transactions is outpacing new issuance and existing supply is being actively used rather than just held (see the ratio after this list)
  • After stripping out trading, treasury flows, and exchange mechanics, an estimated $350-550B in actual payments between different parties occurred in 2025, with business-to-business dominating by volume
  • Geographic distribution is heavily concentrated with Asia accounting for nearly two-thirds of volume (primarily Singapore, Hong Kong, and Japan), North America roughly one quarter, Europe about 13%, and Latin America plus Africa under $1B combined
  • Intra-country stablecoin transactions grew from roughly 50% of payment volume in early 2024 to nearly 75% by early 2026, contradicting the common narrative that stablecoins are primarily a cross-border tool
  • Non-USD stablecoins are gaining meaningful traction, with Brazilian-real-backed BRLA growing from near zero in early 2023 to roughly $400M per month by early 2026, aided by integration with Brazil's instant payments network PIX
  • The data collectively suggests stablecoins are developing into general-purpose payment infrastructure that is global by design but increasingly local in practice, finding footing as a domestic payments medium running on global blockchain rails
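The velocity figure above is just a ratio. Holding supply constant, a move from 2.6x to 6x means roughly 2.3x more transaction volume per dollar of stablecoin outstanding:

$$ V = \frac{\text{adjusted transaction volume}}{\text{average circulating supply}}, \qquad \frac{6}{2.6} \approx 2.3 $$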

Decoder

  • GENIUS Act: First federal framework for stablecoin issuance in the United States, establishing regulatory clarity
  • MiCA (Markets in Crypto-Assets): European Union's comprehensive regulatory framework for crypto assets including stablecoins, which took full effect at the end of 2024
  • Velocity: Ratio of transaction volume to circulating supply, measuring how frequently each dollar of stablecoin changes hands; higher velocity indicates more active usage
  • C2B/C2C/B2B: Consumer-to-business, consumer-to-consumer, and business-to-business transaction categories
  • PIX: Brazil's instant payments network operated by the central bank
  • USDT: Tether, the largest USD-backed stablecoin by market capitalization
  • BRLA: Brazilian-real-backed stablecoin gaining adoption in Brazil

Original Article

Stablecoins are shifting from trading tools to core financial infrastructure. Q1 2026 volume reached $4.5 trillion, with consumer-to-business transactions growing 128% year-over-year. Velocity doubled to 6x, while intra-country payments now account for 75% of volume, signaling a transition toward local, general-purpose payment rails on global blockchain networks.

Crypto · payments · infrastructure · ai

The Missing Key to x402

The x402 protocol for machine-native payments is now part of the Linux Foundation, but needs a practical key management layer to work in production systems.

Summary

What: x402 is an open protocol that enables AI agents and services to autonomously pay for APIs and resources using HTTP's 402 "Payment Required" status code with cryptocurrency. The author argues it lacks essential session and key management infrastructure needed for real-world deployment at scale.
Why it matters: Without key management, every API request requires a separate blockchain signature ceremony, making hyperscale adoption impractical. Building an infrastructure layer for authentication, scoped access, key rotation, and sessions would bridge the gap between the elegant protocol and existing production systems where developers actually operate.
Takeaway: If you're building on x402, implement a key management layer that lets agents buy API keys with the protocol rather than re-signing every individual request, enabling traditional session-based patterns alongside micropayments.

Deep Dive

  • x402 became part of the Linux Foundation on April 2, 2026, providing a simple four-step protocol: client requests resource, server responds with 402 and payment details, client pays with crypto signature, server validates and returns resource (sketched in code after this list)
  • The protocol's simplicity is its strength, similar to Bitcoin's approachable design, but it ships without addressing authentication, authorization, key rotation, or session management
  • At hyperscale (millions of requests per second), performing a signature ceremony for every request becomes impractical due to latency, rate limiting needs, abuse prevention, and DoS protection
  • The missing layer isn't a protocol change but infrastructure that enables buying API keys with x402, turning one-off payments into ongoing service relationships
  • Key management enables critical production features: establish identity once per session, scope keys to minimum access, revoke compromised keys instantly, delegate to sub-agents without exposing root wallets, and integrate with existing billing systems
  • Circle's approach batches thousands of nano-transactions into one x402 transaction through a gateway, staying close to the base protocol
  • Stripe's Machine Payments Protocol (MPP) adds payment channels for streaming payments with sub-millisecond latency, but bundles first-class blockchain support and card network integration that blurs the line between open protocol and product
  • The author advocates for the "middle path": key management should live in an SDK ecosystem around x402, not baked into the protocol itself or locked into a proprietary product like MPP
  • Current market reality check: while stablecoins exist and agent adoption is growing, sites like x402scan.com and agentic.market show mostly tinkering rather than real market traction yet
  • The fundamental challenge mirrors IPv6 adoption (still under 50% after decades): meeting existing systems where they are and incrementally building toward the future, rather than requiring wholesale replacement
  • Key management is the "boring" infrastructure that makes systems shippable, enabling x402 to integrate with existing observability, rate-limiting, balance tracking, and billing systems that companies already rely on
  • Grove (the author's company) shipped a wallet-first product anticipating agent payments, but found the agent-paying-human market hasn't materialized yet despite the technical readiness
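A minimal Python sketch of that four-step flow; the payment-details schema, the X-Payment header name, and the pay() helper are all assumptions, since the article doesn't pin down the wire format here.

import requests

def pay(details: dict) -> str:
    """Hypothetical helper: sign and submit the crypto payment described by
    details, returning a proof the server can verify."""
    raise NotImplementedError

def fetch_with_x402(url: str) -> requests.Response:
    resp = requests.get(url)  # 1. client requests the resource
    if resp.status_code == 402:  # 2. server answers 402 plus payment details
        proof = pay(resp.json())  # 3. client pays per the returned details
        resp = requests.get(url, headers={"X-Payment": proof})  # 4. retry with proof
    resp.raise_for_status()
    return resp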

Decoder

  • x402: An open protocol that enables automated payments for API access using HTTP's 402 "Payment Required" status code combined with cryptocurrency payment rails like stablecoins
  • HTTP 402: A 20+ year old IETF standard status code originally reserved for future payment systems, now being activated by x402
  • Payment channels: Mechanisms for streaming micropayments with sub-millisecond latency by opening a channel once and settling many transactions off-chain
  • Stablecoins: Cryptocurrencies pegged to traditional currencies (like USD) that provide price stability for payments
  • Signature ceremony: The cryptographic process of signing a blockchain transaction to authorize payment, which adds latency to each request
  • Multi-sig: Multi-signature wallets requiring multiple approvals to move funds, used to reduce the risk of holding "hot keys" connected to the internet

Original Article

The x402 protocol provides a base layer for machine-native payments using HTTP 402 status codes. However, it lacks essential key and session management for production environments. Implementing a robust infrastructure layer for authentication, authorization, and key rotation is necessary to enable scalable, real-world adoption for AI agents.

Crypto · economics · blockchain

What prevents cryptocurrencies from functioning as daily money

A forum discussion identifies two core paradoxes preventing cryptocurrencies from functioning as daily money: the first-mover advantage that concentrates supply with early adopters, and deflation dynamics that incentivize hoarding over spending.

Summary

What: A cryptocurrency economics discussion analyzing why no crypto has achieved the original Bitcoin vision of peer-to-peer electronic cash, focusing on structural issues around early adopter concentration and deflationary incentives that turn crypto into speculative assets rather than transactional currency.
Why it matters: This articulates fundamental design challenges in cryptocurrency economics that go beyond technical implementation—if the incentive structure rewards holding over spending, no amount of scalability improvements will achieve mainstream payment adoption.
Takeaway: For developers building crypto payment systems, consider whether your token economics address velocity of money (circulation) and not just supply mechanics, or whether stablecoins and layer-2 solutions might be more practical than trying to fix the base layer.

Deep Dive

  • The discussion identifies the "first-mover advantage paradox" where early crypto participants acquired most supply at negligible cost, creating a perception (and arguably reality) of a pyramid-like structure that limits mainstream adoption
  • The "deflation paradox" stems from Bitcoin's famous pizza transaction: if you expect prices to rise relative to goods, spending means losing future purchasing power, so rational actors hoard rather than transact
  • Current cryptocurrencies function as speculative assets with fiat on-ramps and off-ramps, not as actual media of exchange, defeating the original peer-to-peer cash vision
  • The velocity of money (V in MV=PQ; the identity is restated after this list) matters more than supply (M) for functional economies—liquidity trapped in speculative hoarding creates asset bubbles without economic utility
  • Deflationary base assets are described as "money as gold not oil"—fuel that people hoard rather than circulate stops functioning as economic lubricant
  • Debt deflation dynamics (Irving Fisher framework) show deflation increases real debt burden over time, making rigid deflationary systems structurally destabilizing
  • One proposed solution involves separating store-of-value from medium-of-exchange by using personalized baskets of tokenized futures contracts matching individual consumption patterns
  • Another proposal uses daily-expiring point allocations (1,440 per day) with rebasing mechanisms for earned points to eliminate first-mover advantage and speculation incentives while forcing velocity
  • The fundamental challenge is designing incentives that encourage productive circulation rather than speculative accumulation
  • Stablecoins don't solve the core problem because they're "fiat on rails" inheriting fiat's issues while adding counterparty risk, and still require deflationary gas tokens for transactions
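Restating the equation of exchange from the bullets above, with variables as defined in the Decoder below:

$$ MV = PQ \quad\Longrightarrow\quad V = \frac{PQ}{M} $$

If hoarding drives velocity V toward zero, transactional activity PQ collapses no matter how large the supply M is; that is the deflation paradox in one line.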

Decoder

  • MV=PQ: The equation of exchange where M is money supply, V is velocity (circulation rate), P is price level, and Q is real output—shows that money velocity matters as much as supply
  • Velocity of money: How frequently a unit of currency circulates through the economy in transactions, critical for economic activity
  • Debt deflation: Economic phenomenon where deflation increases the real burden of debt, identified by economist Irving Fisher as destabilizing
  • Gresham's Law: The principle that "bad money drives out good"—people spend depreciating currency and hoard appreciating currency
  • First-mover advantage: In crypto, the structural advantage of early participants who acquired tokens at negligible cost before wider awareness
  • Rebasing: Mechanism that adjusts token quantities in all wallets to maintain purchasing power ratios rather than absolute numbers
  • Proof-of-human: Verification system attempting to ensure one unique account per real human to prevent Sybil attacks
  • Gas: Transaction fees on networks like Ethereum, paid in the native token (ETH), required to move any assets including stablecoins

Original Article

Cryptocurrency adoption as a daily currency faces two critical paradoxes: the first-mover advantage, where early participants hold excessive supply, and the inherent conflict between decentralization and usability.

Crypto finance

Bitcoin Reclaims $79K as Risk Sentiment Stabilizes

Bitcoin surged past $79,000 as cryptocurrency markets rebounded alongside traditional risk assets on improving sentiment.

Summary

What: Bitcoin broke above the $79,000 price level with Ethereum and Asian equities also posting gains, supported by sustained ETF inflows, technical chart patterns signaling upward momentum, and reduced geopolitical concerns that shifted overall market sentiment from bearish to neutral.

Original Article

Bitcoin climbed above $79,000 alongside gains in Ethereum and Asian equities, driven by steady ETF inflows, technical breakouts, and easing geopolitical tensions that pushed market sentiment back to neutral.

Crypto · ai · agents · infrastructure

Crypto is built for AI agents, not humans, says Alchemy's CEO

Alchemy's CEO argues that cryptocurrency infrastructure is actually purpose-built for AI agents rather than humans, as its borderless, always-on nature matches how autonomous software operates.

Summary

What: Nikil Viswanathan, CEO of crypto infrastructure company Alchemy, argues that features making crypto difficult for humans (like managing private keys and interacting with code) are ideal for AI agents that need borderless, programmable, 24/7 financial transactions without geographic or temporal constraints.
Why it matters: This reframes crypto's complexity from a user experience problem into a feature for the emerging economy of autonomous AI agents conducting transactions, potentially shifting how we think about cryptocurrency adoption and infrastructure design.

Deep Dive

  • Traditional finance was designed around human constraints like geography, sleep cycles, banking hours, and physical identity verification, which creates friction for machine-to-machine transactions
  • AI agents operate fundamentally differently from humans: they don't sleep, have no geographic location, can't visit physical banks, and transact entirely online across borders
  • What makes crypto hard for humans (seed phrases, private keys, direct code interaction) is precisely what makes it powerful for machines that operate natively in code
  • Crypto offers agents exactly what they need: global always-on infrastructure, programmability through code, direct control over funds, and systems independent of physical presence
  • The complexity barrier flips when the user is an agent rather than a human, similar to how email is designed for computers while postal mail was designed for humans
  • Viswanathan envisions a layered future where agents sit on top of crypto rails handling transactions automatically while humans interact through simplified interfaces above that layer
  • Unlike bank accounts, crypto wallets can be programmed and controlled entirely through code, enabling agents to manage funds, execute transactions, and optimize capital flows autonomously
  • This positions crypto less as an alternative human financial system and more as native infrastructure for a new class of economic actors
  • Alchemy provides the underlying APIs, node infrastructure, and data services that developers use to build blockchain applications without managing blockchain complexity themselves

Decoder

  • Private keys: Cryptographic passwords that control access to cryptocurrency wallets, typically managed by users themselves rather than institutions
  • Seed phrases: Sequences of words used to recover crypto wallets, representing a security model that assumes technical sophistication
  • Onchain: Transactions and data recorded directly on a blockchain rather than handled by intermediaries
  • AI agents: Autonomous software programs that can make decisions and execute tasks including financial transactions without human intervention
  • Node infrastructure: The network of computers that maintain and validate blockchain networks, which developers typically don't want to run themselves

Original Article

Crypto is built for AI agents, not humans, says Alchemy's CEO

Alchemy CEO Nikil Viswanathan argues the global financial system was designed for humans, but the next wave of commerce will be driven by AI agents that operate natively in crypto.

What to know:

  • Crypto matches how agents operate, being borderless, continuous, and fully digital, said Nikil Viswanathan, co-founder and CEO of Alchemy.
  • Complexity like keys and code is a feature for agents, not a barrier.
  • AI agents will sit on top of crypto rails, handling transactions while humans interact through simpler interfaces, he argued.

The modern financial system was never designed for machines. It was built around the constraints of human life: geography, sleep cycles, paperwork, and physical presence. But as AI agents begin to act as economic participants, that human-centric design is starting to look less like a feature, and more like a bottleneck, said the co-founder of crypto firm Alchemy.

"You can argue that crypto was built for AI agents, not humans," said Alchemy CEO and co-founder Nikil Viswanathan.

The mismatch is everywhere. Banks have operating hours because humans do. Payments are tied to countries because people live in them. Credit cards assume physical identity and presence, he said.

AI agents operate differently. They don't sleep. They don't live anywhere. They don't walk into banks or carry cards. And increasingly, they don't just assist with tasks, they transact.

"All transactions for agents are online. They're inherently global," Viswanathan, who will be speaking at Consensus Miami next month, told CoinDesk in an interview.

That's where crypto starts to look less like an alternative financial system and more like the native infrastructure for a new kind of economic actor, he said.

Alchemy is a crypto infrastructure company that provides the underlying tools and services developers need to build blockchain-based applications. It offers APIs, node infrastructure, and data services that power everything from financial apps to non-fungible tokens (NFTs) and games, enabling companies to build and scale onchain products without managing the complexity of blockchain systems themselves.

Built for the wrong user

Traditional finance assumes friction. Paying someone in another country involves currency exchanges, intermediaries, delays and fees. For humans, that's normal. But for AI agents, it's unusable.

Agents need to transact seamlessly across borders, at any time, often in tiny increments. They need programmability, direct control over money via code, and systems that don't depend on physical infrastructure or identity.

Crypto offers exactly that: a global, always-on financial layer where value moves as easily as data, he said.

"Crypto is the global infrastructure for money that agents need," Viswanathan said.

Complexity flips

What has long made crypto difficult for humans, including seed phrases, private keys and interacting directly with code, is exactly what makes it powerful for machines, Viswanathan said.

Unlike humans, agents operate natively in code.

"Agents read in zeros and ones. That's their native language," he said. "That's also the language of crypto."

For years, crypto has tried to abstract itself into something more human-friendly. But its underlying architecture was never really built for humans in the first place.

Viswanathan compared the shift from crypto tools being built primarily for humans to crypto tools being used by AI agents to an earlier epochal shift from the postal system to the internet. While people once had to physically write out a letter, buy a stamp and mail it to share messages across the globe, communication in the modern era is much faster.

"Email is far more powerful than the postal system because it's designed for computers," Viswanathan said. "Crypto is similar."

Agent-run financial system

Viswanathan said that moving forward, AI agents will sit on top of crypto infrastructure, handling complexity automatically, managing wallets, executing transactions and optimizing flows of capital in real time, letting people control their own funds more easily.

"You can write code to manage a crypto wallet," Viswanathan said. "You can't write code to manage a bank account in the same way."

The result would be a financial system that is more global, more programmable, and more autonomous.

Viswanathan said he sees a layered future: traditional finance and crypto as the base, an agent layer operating on top and a human interface above that.

"Just like computers operate the internet and humans use it, agents will operate finance," he said.

Crypto · payments · fintech

Stablecoin B2B Payments Set for Large Growth by 2026

Stablecoin-based business-to-business payments are forecast to grow from $3.5 billion in 2023 to $147 billion by 2026, potentially displacing traditional banking infrastructure for cross-border transactions.

Summary

What: A forecast predicting explosive growth in companies using stablecoins like USDC and USDT for international business payments, bypassing traditional banking systems that charge 3-5% fees and take days to settle, in favor of near-instant settlements costing under $1.
Why it matters: This represents a structural shift in business payment infrastructure, driven by regulatory clarity in the EU and US, improved blockchain infrastructure, and new companies building payment rails directly on stablecoin layers without involving banks.
Takeaway: Developers building B2B payment systems should evaluate stablecoin payment APIs and infrastructure providers as alternatives to traditional banking integrations, especially for cross-border use cases.
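For scale, the implied growth rate (simple CAGR on the forecast's endpoints, treating 2023 to 2026 as three compounding years; the intermediate path isn't specified):

$$ \left(\frac{147}{3.5}\right)^{1/3} - 1 \approx 2.5 $$

i.e. roughly 250% per year, every year of the forecast.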

Decoder

  • Stablecoins: Cryptocurrencies pegged to stable assets like the US dollar (USDC, USDT) to minimize price volatility
  • SWIFT: The traditional global banking network for cross-border transfers, known for slow settlement times and high fees
  • MiCA: Markets in Crypto-Assets regulation in the EU providing legal framework for crypto businesses
  • On-chain infrastructure: Blockchain-based transaction systems offering programmable, fast, and final settlements

Original Article

Stablecoin-powered B2B payment volume is projected to increase from $3.5B in 2023 to $147B by 2026, driven by faster, cheaper cross-border settlements that outperform SWIFT's 2-5 day timelines and 3–5% fees.

Digest devoured!


Next: Devoured - April 28, 2026