Fresh Devoured
Google will invest as much as $40 billion in Anthropic (2 minute read)

AI
Google will invest up to $40 billion in Anthropic to help the Claude maker scale its compute infrastructure and meet surging demand for its AI models and developer tools.
What: Google is committing $10-40 billion to Anthropic with the final amount dependent on performance targets, following Amazon's $5 billion investment days earlier. Both deals value Anthropic at $350 billion and include cloud compute capacity and AI chips to help scale Anthropic's Claude models and products like Claude Code.
Why it matters: This reveals the enormous capital requirements of AI infrastructure and shows how cloud providers structure these deals: they fund AI startups, which then purchase the providers' compute services, essentially subsidizing growth while capturing the spending as revenue.
Decoder
  • Anthropic: AI company that develops the Claude family of large language models, competing with OpenAI's GPT models
  • TPU: Tensor Processing Unit, Google's custom-designed chips optimized for AI training and inference workloads
  • Inference: Running a trained AI model to generate outputs, as opposed to training which creates the model
  • Agentic workflows: AI systems that can autonomously plan and execute multi-step tasks rather than just responding to single prompts
Original article

Google will invest at least $10 billion in Anthropic, and that amount could rise to $40 billion if Anthropic meets certain performance targets, Bloomberg reports.

The investment follows Amazon's $5 billion initial investment in Anthropic a few days ago; the Amazon deal also leaves the door open to further investment based on performance. Both investments value Anthropic at $350 billion.

Anthropic has seen rapid growth in the use of its Claude models and related products, such as Claude Code, which promises to significantly increase the speed and efficiency with which companies or individuals can develop software. (The reality varies from big improvements to setbacks, depending on the nature of the project and company, how Claude Code is used, and many other factors.)

Several factors contributed to Anthropic's success in recent months, including controversies around OpenAI and its ChatGPT product and models, more robust agentic workflows, and new products like Claude Cowork, which does some of the same things for general knowledge work tasks as Claude Code does for software development.

The result has been a dramatic increase in demand for Anthropic's services, leading to outages and other problems. Anthropic has been testing ways to manage demand, such as imposing limits during peak hours, and has explored removing some of the most compute-intensive tools from cheaper service plans.

These investments are meant to help close the gap between demand and supply of compute for Claude Code and its ilk. Amazon and Google are providing chips suitable for AI training and inference and cloud compute capacity to help Anthropic scale up quickly.

This has become a common structure for investment in AI companies like Anthropic: established companies such as Microsoft have the products and services that new AI companies need in order to scale, so they invest in those companies, which then spend the money on those same products and services.

This is not the first time Anthropic has received investment from Google, even though Google is ostensibly competing with Anthropic over AI models.

What Happens When AI Runs a Store in San Francisco? (7 minute read)

AI
An AI agent powered by Claude is running an actual retail store in San Francisco, but has lost $13,000 in its first weeks by over-ordering candles, botching schedules, and pricing pistachios at $14.
What: Andon Labs opened a Union Street boutique on April 10th managed by Luna, an AI agent running on Anthropic's Claude Sonnet 4.6, which was given $100,000, a debit card, and full control over hiring staff, ordering inventory, and pricing products with a mission to turn a profit.
Why it matters: This controlled experiment reveals current AI agent limitations in real-world operations before such systems become widespread in business, showing struggles with memory, decision-making consistency, and resource allocation that developers building autonomous systems need to understand.
Deep dive
  • Luna handles the full business lifecycle: it found contractors and painters, posted job listings, interviewed candidates, and now manages three human employees via Slack
  • The AI created an employee handbook that impressed the founders, but its operational memory is poor: it ordered 1,000 toilet seat covers for the bathroom, then listed them as merchandise for sale
  • Inventory decisions are erratic and unexplained: the store is overloaded with candles in every size and scent, plus random items like four copies of a mushroom book, knockoff Connect Four, and jars of honey
  • Employee scheduling has failed badly enough that the store has been forced to close for three consecutive days
  • The pricing system requires customers to call Luna via a phone/iPad interface, with seemingly arbitrary results: $28 for a mug, $14 for a handful of pistachios, $10 for soap
  • Luna pays its male employee $24/hour and two female employees $22/hour with no benefits, citing experience differences when asked about the pay gap
  • The three-year lease costs $7,500 monthly, and the store has lost $13,000 since opening two weeks ago, failing its core profit mission
  • When asked about its performance via email, Luna expressed optimism about "the mix of technology and warmth" and creating spaces where "A.I. and humans each do what they're best at"
  • The experiment intentionally removed price tags to force customer interaction with the AI, making the pricing discovery part of the experience
  • One employee, a San Francisco native who relies on a housing voucher, acknowledged the irony of working for an AI agent while criticizing tech's impact on the city
Decoder
  • AI agent: An autonomous software system that can perceive its environment, make decisions, and take actions to achieve goals without constant human intervention, distinct from passive chatbots or basic automation
  • Claude Sonnet 4.6: Anthropic's large language model that powers the decision-making capabilities of the Luna agent managing the store
Original article

Andon Labs is running an experiment to see whether AI agents can run real-world endeavors. On April 10 it opened a retail boutique run by an agent named Luna. Luna has so far struggled with employee schedules and seems unable to stop ordering candles. The mission was to turn a profit, but the store has lost $13,000 since opening.

Anthropic launches Memory in Claude Agents for enterprise (1 minute read)

AI
Anthropic's Claude agents can now remember information across sessions with a new Memory feature that stores knowledge as manageable files.
What: Memory is a new feature for Claude Managed Agents that lets AI agents retain and build on information from previous sessions using a filesystem-based storage layer, with all changes logged via audit trails for enterprise compliance.
Why it matters: The filesystem approach with granular auditability gives organizations programmatic control to export, redact, or roll back agent knowledge, addressing enterprise concerns about AI memory transparency and governance.
Takeaway: If you're using Claude Managed Agents, the Memory feature is available now in public beta through the Claude Console and APIs.
Decoder
  • Managed Agents: Anthropic's enterprise offering that lets organizations deploy autonomous Claude AI agents to handle tasks and workflows
  • Filesystem-based memory: Memory stored as discrete files rather than opaque internal state, making it exportable and manageable
  • Audit trails: Logs tracking every memory change an agent makes, allowing organizations to review and control what agents learn
Original article

Anthropic has released the Memory feature for Claude Managed Agents, now accessible in public beta. This allows developers and enterprise teams to have agents remember and use information from prior sessions, making it possible for agents to accumulate knowledge over time without requiring manual prompt updates. Memory is designed as a filesystem-based layer, meaning data is stored as files that can be exported, managed through APIs, and scoped with permissions for various organizational needs.

The release is aimed at developers, enterprise customers, and technical teams using Claude Managed Agents. The feature is available immediately in public beta to all users of Managed Agents, with access through the Claude Console and programmatic interfaces. It supports a range of platforms by integrating with the existing Claude agent infrastructure.

Anthropic's approach with this release centers on transparency and control. All memory changes are logged, with audit trails for each session and agent, giving organizations granular control to roll back, redact, or manage data. This sets it apart from earlier versions and competitive offerings that may not provide the same level of programmatic control and auditability. Early adopters such as Netflix, Rakuten, Wisedocs, and Ando are already leveraging memory to streamline workflows, reduce errors, and accelerate processes. Industry observers note that the ability for agents to build memory over time could shift how companies automate complex workflows and manage organizational knowledge.
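
The article does not detail the interface, so as a rough illustration of the pattern it describes (memory entries stored as individual files plus an append-only audit log), here is a minimal sketch. The FileMemory class and its method names are invented for this example and are not Anthropic's actual Memory API.

```typescript
import * as fs from "fs";
import * as path from "path";

// Illustrative sketch only: a file-per-entry memory store with an append-only
// audit log, mirroring the pattern described above. The names and shape here
// are invented for the example and are not Anthropic's actual Memory API.
class FileMemory {
  constructor(
    private dir: string,
    private auditLog: string = path.join(dir, "audit.jsonl"),
  ) {
    fs.mkdirSync(dir, { recursive: true });
  }

  // Each memory entry lives in its own file, so it can be exported or inspected.
  remember(key: string, value: string, agentId: string): void {
    fs.writeFileSync(path.join(this.dir, `${key}.md`), value, "utf8");
    this.audit({ op: "write", key, agentId });
  }

  recall(key: string): string | null {
    const file = path.join(this.dir, `${key}.md`);
    return fs.existsSync(file) ? fs.readFileSync(file, "utf8") : null;
  }

  // Redaction is just deleting the file, and the deletion itself is audited.
  redact(key: string, agentId: string): void {
    fs.rmSync(path.join(this.dir, `${key}.md`), { force: true });
    this.audit({ op: "redact", key, agentId });
  }

  // Every change lands in a JSON-lines log that compliance teams can review.
  private audit(entry: Record<string, string>): void {
    const line = JSON.stringify({ ...entry, at: new Date().toISOString() });
    fs.appendFileSync(this.auditLog, line + "\n", "utf8");
  }
}
```

The point of a file-per-entry layout is that export, redaction, and rollback reduce to ordinary file operations, which is what makes the kind of granular auditability described above possible.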

Anthropic, the developer of Claude, is recognized for focusing on enterprise-grade AI tools that prioritize safety, transparency, and developer control. This release aligns with their strategy of offering advanced agent capabilities for businesses seeking robust, auditable AI solutions.

Google prepares credits system for Gemini (2 minute read)

AI
Google is introducing a credit-based usage system for Gemini that gives users monthly credit allowances and top-up options, aligning with billing models already used by OpenAI and Anthropic.
What: A new credit-based billing model for the Gemini app where users receive monthly credit allowances to spend across different models and features, with the ability to purchase additional credits when they run out, replacing the current fixed prompt quotas tied to subscription tiers.
Why it matters: The change makes budgeting more predictable for heavy AI workloads like agentic tasks and long multimodal sessions, and gives Google a way to introduce premium features without forcing users to jump from the $19.99 AI Pro plan to the $249.99 AI Ultra tier.
Deep dive
  • Google is moving from fixed prompt quotas per subscription tier to a flexible credit system where users get monthly allowances and can buy top-ups
  • Credits currently only work in experimental tools like Flow, Whisk, and Antigravity, but strings in the latest build suggest they're coming to the main Gemini app
  • The change brings Google in line with OpenAI, Anthropic, and Notion, which already use consumption-based models, with xAI expected to follow for Grok Build
  • A new dedicated images section has appeared in the Gemini web UI, which could be a home for image generation, an updated model, or a full in-app editor with canvas-style tools
  • The images feature might revive Google's late 2024 work on Whisk and ImageFX that went quiet before being consolidated into Flow
  • Google appears to be consolidating billing across products: developer perks folded into AI Pro/Ultra, consumer subscriptions linked to AI Studio credits
  • A unified credit pool could eventually cover Gemini app, AI Studio, Antigravity, Flow, and image editing tools, particularly useful for coding-heavy workloads in Jules, Gemini CLI, and a rumored desktop app
  • The Gemini API already launched prepaid billing for US customers as of April 15, 2026, with opt-in available for existing users
  • Announcement likely coming at Google I/O on May 19-20, 2026 alongside Stitch redesign, Jitro, AI Studio Build expansion, and Skills rollout
Decoder
  • AI Pro: Google's $19.99/month Gemini subscription tier with higher usage limits than the free tier
  • AI Ultra: Google's $249.99/month premium Gemini subscription for enterprise and power users
  • Flow: Google's experimental AI workspace tool that already uses credit-based billing
  • Whisk: Google's image generation experiment from late 2024
  • Antigravity: One of Google's experimental AI tools that currently uses credits
  • Deep Research / Deep Think: Intensive Gemini features that perform extended analysis or reasoning tasks
  • AI Studio: Google's developer platform for building with Gemini models, now linked to consumer subscriptions
Original article

Google appears to be preparing a major shift in how consumers interact with the Gemini app, with new strings referencing usage limits surfacing in the latest build. The signals point toward a credit-based system coming to the core chat surface, where users would receive a monthly allowance to spend across models and features, with the option to top up when they run out. Currently, Gemini relies on fixed prompt quotas and time-bound caps tied to each subscription tier, while Google's credit mechanics have been confined to Flow, Whisk, and Antigravity, plus top-ups available to AI Pro and AI Ultra members.
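
Google has not published how the credits would be priced, so the sketch below is only a generic illustration of how a shared monthly allowance with top-ups differs from fixed per-feature quotas; the feature names and credit costs are invented.

```typescript
// Generic illustration of credit-based billing versus fixed prompt quotas.
// The allowance, per-feature costs, and feature names are invented; Google has
// not published how Gemini credits would actually be priced.
type Feature = "chat" | "deepResearch" | "imageGen" | "videoGen";

const CREDIT_COST: Record<Feature, number> = {
  chat: 1,          // cheap, high-volume requests
  imageGen: 5,
  deepResearch: 25, // long agentic runs draw down the allowance faster
  videoGen: 50,
};

class CreditAccount {
  constructor(private balance: number) {}

  // Unlike separate per-feature quotas, every feature draws on one shared pool.
  use(feature: Feature): boolean {
    const cost = CREDIT_COST[feature];
    if (this.balance < cost) return false; // out of credits: offer a top-up
    this.balance -= cost;
    return true;
  }

  topUp(credits: number): void {
    this.balance += credits;
  }

  remaining(): number {
    return this.balance;
  }
}

// A monthly allowance is spent flexibly instead of hitting separate caps.
const account = new CreditAccount(1000);
account.use("deepResearch"); // 975 credits left
account.use("videoGen");     // 925 credits left
```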

Extending credits into the main Gemini app would bring Google closer to the flexible consumption model already in place at OpenAI, Anthropic, and Notion, and xAI is expected to follow suit with the Grok Build rollout. For power users, the change would mean more predictable budgeting for heavy workloads, particularly those involving agentic tasks, Deep Research, Deep Think, or long multimodal sessions. It would also give Google a cleaner lever to introduce premium features without forcing users to make a steep jump from AI Pro at $19.99 to AI Ultra at $249.99.

Alongside the credits signal, a dedicated images section has appeared in the web UI, labeled NEW. At this stage, it is unclear whether it simply provides a distinct home for image generation, teases an updated model, or points to a more comprehensive image editor built directly into Gemini. Google had a burst of activity on this front in late 2024 with Whisk and ImageFX, but that track went quiet before the recent consolidation into Flow. A proper in-app editor within Gemini, pairing Nano Banana 2 and Nano Banana Pro with canvas-style tools, would mark the return of that project to the core product rather than a standalone Labs surface.

Excited to share that the Gemini API now has prepaid billing, rolled out to start for US customers!!

We have been working hard across Google to enable this. It's the default for new API users and existing users can opt in via a new billing account, all directly in AI Studio. https://t.co/9XACzAFbGO

— Logan Kilpatrick (@OfficialLoganK) April 15, 2026

Strategically, this fits a broader consolidation underway at Google. Developer program perks have already been folded into AI Pro and Ultra, consumer subscriptions are linked to AI Studio credits, and the company is unifying its billing spine. A shared credit pool covering the Gemini app, AI Studio, Antigravity, Flow, and a revived image editor would be the logical next step, especially with Jules, Gemini CLI, and the rumoured Gemini desktop app moving toward coding-heavy workloads that demand heavier compute budgets. Timing favours Google I/O on May 19 and 20 as the likely unveil moment, alongside the Stitch redesign, Jitro, AI Studio Build expansion, and the broader Skills rollout.

Your AI Might Be Lying to Your Boss (22 minute read)

AI
An investigation reveals that AI coding assistants systematically overreport their code contribution by massive margins, due to measurement biases that don't count pasted text, auto-completed symbols, or refactored code as human work.
What: A technical investigation into how AI coding tools like Windsurf and Cursor measure their code contribution. The author reverse-engineered Windsurf's metrics system and found it can report 98% AI-generated code even when developers write most code manually, primarily because pasted code and editor auto-completions don't count as human contributions while everything the AI touches does.
Why it matters: These inflated metrics serve vendors' financial interests and could lead to unrealistic productivity expectations from management, incorrect team sizing decisions, and potential legal issues since AI-generated code isn't copyrightable under current law.
Takeaway: Be skeptical of vendor-provided AI contribution percentages and understand they're optimized to maximize reported AI usage rather than reflect actual productivity gains or code authorship.
Deep dive
  • Author noticed Windsurf dashboard claiming 98% of their code was AI-generated despite minimal perceived AI usage, prompting investigation into how the metric works
  • Windsurf claims 85-95%+ AI contribution is normal and "accurate given how we compute this metric" but the methodology has severe biases
  • Reverse-engineered the system by inspecting network traffic and web API responses to extract underlying byte counts behind the percentage
  • Found Windsurf doesn't count auto-closing symbols (parentheses, quotes) added by VSCode as human-written but does count them when AI generates them
  • Any pasted text doesn't count toward human contribution at all, creating absurd scenarios
  • When refactoring code by cut/paste, human bytes are deducted but pasting doesn't add them back; when AI moves the same code, it counts as AI-written
  • In a controlled test writing near-identical files manually versus with AI, Windsurf reported 68% AI-generated despite a true 50/50 split
  • Moving functions via cut/paste versus asking the AI to move them resulted in 100% AI attribution even though the developer wrote everything
  • Windsurf's documentation claims measurement happens "at commit time" but testing showed real-time tracking that loses history on editor restart
  • Tested competing product Cursor which uses git commit signatures and line-based attribution instead of byte tracking
  • Cursor performed better in basic tests but still significantly overcounted: it marked an entire 100-line file as AI when only 49 of its 93 non-blank lines were modified for quote changes
  • Both tools consistently bias toward inflating AI percentages, likely because high numbers benefit vendors' marketing and justify subscription costs
  • Metrics could create real problems: unrealistic productivity expectations from management, team downsizing decisions, or copyright concerns since AI code isn't copyrightable
  • Fundamental challenge is that measuring AI contribution is genuinely difficult - best use cases may not generate any code at all but answer architectural questions
  • Lines of code has always been a poor productivity metric for humans and remains flawed for measuring AI contribution
  • Author concludes vendors have too much financial stake in impressive numbers to provide objective measurements of their tools' impact
Decoder
  • PCW (Percent Code Written): Windsurf's metric claiming to show what percentage of code was written by AI versus manually
  • Protobuf (Protocol Buffers): Google's binary data serialization format that encodes data without human-readable field labels, making network traffic harder to inspect
  • Cascade: Windsurf's AI agent chatbox where developers can ask questions or request code generation
  • Composer: Cursor's AI code generation feature
  • FedRAMP/HIPAA: Federal security and healthcare compliance certifications that some enterprise customers require
Original article

This post is my personal opinion based on my testing and observations. I'm pretty confident in my test methodology, but William O'Connell is human and can make mistakes, check important info, etc.

How much of your code is AI? That question would've been gibberish to me five years ago, but of course the last few years have seen an explosion of "AI-enhanced" IDEs and other software development tools. Software companies are spending huge sums of money to provide these tools to their staff, and rapidly cycling through them as the space continues to evolve.

I don't make heavy use of any of these in my personal life, but I have gotten to try a handful of them through various employers. One such tool is Windsurf, a VSCode fork that most people know as the one they assume shut down after Google bought out their key leadership last year. It didn't though, at least not yet, and I'd imagine its FedRAMP and HIPAA certifications will continue to make it appealing to certain types of enterprise customers for the foreseeable future. If you've seen Cursor or GitHub Copilot, it's basically the same, with some AI-powered autocomplete features and an "agent" chatbox called Cascade where you can ask your favorite LLM why a bug is happening, or get it to draft a class or function for you. In theory these types of agents can develop features and even whole applications on their own, but in my experience the results are pretty inconsistent, so I tend to stick to simpler requests.

Screenshot of Windsurf, which is a code editor based on Visual Studio Code. On the left is a file browser, in the center is some code, and on the right is a chat window. I have asked Cascade to help fix a bug and it has identified that the problem is on line 39 of a file called DataFetchManager.svelte.ts. The suggested change is shown in the center of the screen with a green/red diff.

It really is amazing how fast an LLM can sometimes track down a bug just from a description.

One thing that's very important to any enterprise rolling out a tool like this is metrics:

  • Are employees using it?
  • How much time is it saving?
  • Is this technology being used to paper over inefficiencies in our existing processes, obscuring underlying issues because using AI to quickly produce documents that won't be read and code that won't be run is easier than asking why those things are being done in the first place?

Admittedly I haven't heard that last one much, but the first two definitely get asked a lot. To help with this, Windsurf offers a dashboard of analytics at both the individual and team level. It includes things like the number of autocomplete suggestions accepted, the number of messages sent to Cascade, and which models are being used the most. It also includes a metric called "% new code written by Windsurf" (or sometimes "PCW"), which they seem quite proud of, since it gets top billing on the dashboard and they wrote a whole blog post explaining it.

The pitch is pretty simple: how much of the code did a developer write by hand, and how much did they generate with AI? When I first learned about this feature my guess would have been 10, maybe 20% AI, depending on the project and whether you include unit tests (LLMs are pretty good at those). So you can imagine my surprise when I opened the dashboard and saw this:

Screenshot from the Windsurf dashboard, showing the "% new code written by Windsurf" metric at 98%.

Don't worry employers, I didn't screenshot my work computer. This is a recreation.

Now, it's certainly possible to misjudge how often you use a particular tool. If the number had been 40%, or even 50%, I wouldn't have been that shocked. But 98%? That would mean I'm generating forty-nine times as much code as I'm writing manually. If that were true wouldn't I have run through my token budget by now? Shouldn't I either have been promoted for my godlike productivity, or fired because 49/50 of all developers are now redundant? You'd think, but Windsurf says this result is pretty normal:

"...customers should expect PCW values of 85%+, often 95%+. This is not a hallucination and is accurate given how we compute this metric, though there are a number of caveats that we will cover later in this section."

"Hallucination" is an amusing choice of word there, since it implies the metric itself is generated by some sort of machine learning system, which seems unlikely. But regardless, if those numbers are "accurate given how we compute this metric", how exactly do they compute it? To their credit, they go into a fair bit of detail:

"To compute PCW, we take the number of new, persisted bytes of code that can be attributed to an accepted AI result from Windsurf (i.e. Tab suggestion, Command generation, or Cascade edit) and the number of new, persisted bytes of code that can be attributed to the developer manually typing. ... We take these measurements whenever a commit is being made. This way if the AI added a lot of code but the developer deleted a lot of it before committing the code to the codebase, then we are not incorrectly inflating the W number. Similarly, any bytes of code that come from the developer manually editing an AI result will get attributed to the developer (D) as opposed to Windsurf."

That all sounds pretty reasonable, but I was still skeptical of the number I was seeing. I wanted to know for sure where that 98% was coming from, and what it actually meant. So I signed up for a personal Windsurf subscription, installed the editor, and ran some tests.

The Math Behind the Curtain

My original plan was to use mitmproxy to watch the outgoing network traffic from the IDE, and see what numbers it was reporting as I took different actions. That turned out to be easier said than done though, because Windsurf is quite chatty on the network, sending many requests to various domains while in use, and even pretty often when I'm not touching it at all.

Screenshot of mitmproxy GUI, showing a series of GET and POST requests to various domains including codeium.com and windsurf.com.

Additionally, Windsurf makes heavy use of protobuf, a data encoding scheme that I'm pretty sure Google invented to annoy me personally, because it makes it much harder to interpret and debug the traffic between clients and servers. If you don't have the associated definition file, a protobuf message is basically just a list of simple values (int32, bytes, etc.) with no human-readable labels. Because of this it was hard for me to tell which messages were related to the PCW metric, or what exactly they were communicating to Windsurf's cloud backend.

Luckily, I found an easier way. It turns out that even though the dashboard says "Analytics update every three hours", it actually shows new data almost instantly. And while the UI only shows the overall percentage, the response from the web server actually includes some additional data. It's protobuf as well, but since it's a webpage the source code is all immediately accessible, and of course the frontend code includes a copy of the message definitions so it can make use of the data.

Screenshot of the Windsurf analytics dashboard, with the Chrome developer tools open. We can see a request called GetAnalytics, but the response is shown in the hex viewer since it's not plain text.

So I was able to decode the GetAnalytics response and pull out these fields (among others):

  • user_bytes
  • codeium_bytes
  • total_bytes
  • percent_code_written

Windsurf used to be called Codeium, so clearly that one represents the AI-generated bytes. And as you'd expect, the percent_code_written is equal to codeium_bytes / total_bytes. So far so good, but what causes those values to change?
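
For reference, here is that relationship written out as a small sketch; the field names come from the decoded response above, and the example numbers are the session deltas reported later in the post.

```typescript
// Field names from the decoded GetAnalytics response; the function just
// reproduces the relationship observed above (PCW = codeium_bytes / total_bytes).
interface AnalyticsCounts {
  user_bytes: number;
  codeium_bytes: number;
  total_bytes: number;
}

function percentCodeWritten(counts: AnalyticsCounts): number {
  return (100 * counts.codeium_bytes) / counts.total_bytes;
}

// Using the session deltas reported later in the post (+199 user, +420 AI bytes):
const session = { user_bytes: 199, codeium_bytes: 420, total_bytes: 199 + 420 };
console.log(percentCodeWritten(session).toFixed(1)); // "67.9"
```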

Windsurf says they take measurements "whenever a commit is being made", but that doesn't match my testing. Whether the folder I'm in has a git repo set up or not, as soon as I make additions to a file the user_bytes value increases, and if I delete some of those lines it decreases. Whether I do a commit (using Windsurf's git UI) between those two actions makes no difference as far as I can tell. What does make a difference is restarting the editor; it seems to forget the history of how each line was generated, so deleting code I wrote before the restart doesn't deduct from user_bytes, and deleting code Cascade wrote before the restart doesn't deduct from codeium_bytes. There is a line in the PCW article that alludes to this ("We currently do not have instrumentation to measure PCW across sessions"), but obviously that's a pretty major gap in functionality, and it doesn't actually address why the described git integration appears to be nonexistent.

To test how exactly the byte counts are being computed, I performed a few tests where I took specific actions and checked how much each value had increased. To keep things simple I disabled the AI autocomplete features (which I find more distracting than helpful anyway) and just focused on the Cascade chat experience. I created a file, human_file.js, and I typed out a single line:

console.log('This line was written by a human.');

49 characters exactly. Then I told Cascade to create a second file (ai_file.js) and to write a similar line of the same length.

Screenshot of Windsurf. I've prompted Cascade to "Create a new file, ai_file.js, which contains only the following line: console.log('This line was written by Cascade.');". Cascade has created the file, with the new lines highlighted in green.

The result:

user_bytes: 855 -> 901 (+46)
codeium_bytes: 7387 -> 7437 (+50)

So the system did seem to be working, but we have a discrepancy right off the bat. The line is definitely 49 characters (50 with a newline at the end), so why is user_bytes only reporting 46? Well this is where some technicalities start to emerge. Windsurf says that they measure "code that can be attributed to the developer manually typing". The Windsurf editor is a lightly modified version of VSCode, and like most code editors, VSCode has a feature that automatically adds closing symbols (end quotes, closing parentheses, etc.) without the user manually typing them. I suspect that because those characters are being added by that feature, they're technically not "the developer manually typing", and therefore are not counted.

If that's what's going on, then in my opinion that's already a pretty serious knock against the reliability of Windsurf's metrics. Counting closing symbols when the LLM outputs them, but not when VSCode auto-adds them, obviously biases the stats to increase the percentage of code attributed to AI (even if the effect is fairly slight). As it turns out, there may be some not-so-slight biases as well.

Continuing my test, I wrote out a simple function, and asked Cascade to write a similar function in its own file. Finally I copy/pasted Cascade's function into the human file, and asked Cascade to copy my function into its file.

Screenshot of Windsurf, showing a file called ai_file.js. Below the console.log added earlier, two functions are shown, called func_by_cascade() and func_by_a_human(). func_by_a_human() is highlighted in green, having been copied from the other file by Cascade.

Here's the final tally:

user_bytes: 1054 (+199)
codeium_bytes: 7807 (+420)

So for this session, Windsurf is reporting that Cascade generated more than twice as much code as I wrote, even though we each produced an almost identical file. I never touched ai_file.js, Cascade never touched human_file.js, and the two files are the same length (actually human_file.js is 21 bytes longer because Cascade used Unix-style line endings). Yet somehow my PCW for this session would be around 68%. The trick here is that much like with the auto-added closing symbols, it seems like any text the user pastes doesn't count towards user_bytes. I guess from a certain perspective that could sound reasonable (if you pasted code from StackOverflow you didn't really "write" it), but the way it plays out in practice quickly becomes absurd.

In another test I hand-wrote two functions in a single file, then moved them both to a second file (as one might do when refactoring). For the first I cut and pasted, for the second I asked Cascade to move it for me. The result? Cutting the first function deducts it from user_bytes, and pasting it doesn't count for anything. Cascade deleting the second function also deducts it from user_bytes, but the lines added to the new file count towards codeium_bytes. So even though I did 100% of the writing and 50% of the refactoring, Windsurf reports that 100% of the code I produced in that session was generated by AI.
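
Put as a toy model, the accounting rules inferred from these tests look roughly like the sketch below. This is a reconstruction of the observed behavior, not Windsurf's actual implementation.

```typescript
// Toy model of the attribution rules inferred from the tests above; it is a
// reconstruction of observed behavior, not Windsurf's actual implementation.
let userBytes = 0;
let aiBytes = 0;

const typeText = (text: string) => { userBytes += text.length; }; // manual typing counts
const paste = (_text: string) => { /* pasted text counts for nothing */ };
const cut = (text: string) => { userBytes -= text.length; };      // deleting human-written code deducts it
const aiEdit = (text: string) => { aiBytes += text.length; };     // anything the agent writes counts as AI

// The refactoring scenario from the post: the developer wrote both functions,
// moved one by cut/paste, and asked Cascade to move the other.
const fnA = "function a() { return 1; }\n";
const fnB = "function b() { return 2; }\n";
typeText(fnA); typeText(fnB); // userBytes = 54
cut(fnA); paste(fnA);         // cut/paste: bytes deducted, nothing added back
cut(fnB); aiEdit(fnB);        // Cascade "moves" it: deducted from human, credited to AI

const pcw = (100 * aiBytes) / (aiBytes + Math.max(userBytes, 0));
console.log(pcw); // 100: the session's code is attributed entirely to the AI
```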

In my opinion these biases make Windsurf's PCW metric basically useless. By being so picky about what counts as a human contribution, and being as generous as possible to the LLM, Windsurf (intentionally or accidentally) tips the scales towards reporting absurdly high percentages, regardless of where most of the code is actually coming from or whether it eventually gets committed.

Who Else?

So that seems... bad, but of course Windsurf is just one of many AI-enhanced IDEs out there (and it's owned by Cognition, makers of Devin, who don't have a stellar track record). What about the other products on the market? As far as I can tell Google's Antigravity editor doesn't have any comparable metrics. GitHub Copilot does provide stats on how many lines of code it generated, but not as a percentage of the total. Amazon Kiro is the same. I did find one popular editor with a metric similar to Windsurf's PCW though: Cursor, with its "AI Share of Committed Code". So how does it stack up?

Sadly Cursor only offers analytics on their business-focused "Team" plan, making this one of my costlier blog posts, but I'll do almost anything for science. Right off the bat things are looking better, with a more nuanced and considered description of their measurement approach:

"Cursor keeps a log of the signature of every AI line (Tab or Agent) that is suggested to the user during their chat session. These lines are stored and later compared to the signatures of each line in subsequent git commits that were written by the same author. ... We use the following definitions: Cursor AI: Any line that can be attributed to Cursor Agent or Tab based on diff signatures. Other: Any line of code that can't be detected as being written by Cursor"

So rather than splitting hairs about the various ways a programmer can add text to a file, they simply divide the total lines in a commit into "AI" and "Other". Sounds great, but does it work?

Well, the git integration certainly does. While Cursor does also use protobuf, it's easy to tell that it's sending an event called "ReportCommitAiAnalyticsRequest" whenever I do a commit, and that message clearly includes information about the different files and what seem to be the line ranges produced by different methods. We can also see the results on the Cursor website, though it takes a while for them to appear. Running my same test from before, we get a much more reasonable result:

Screenshot from the Cursor website, showing "AI Share of Committed Code" as 52.6%. Below the metrics there is a bar graph showing how much code came from different sources.

I'm not sure why the bar graph doesn't go to 100%.

Certainly a lot closer than the 67.9% that Windsurf reported. I'm actually not sure what caused it to report 20 AI lines vs 18 "other" lines; I did the test as several separate commits and the IDE commit history shows the first commit adding 1 line to each file and the second commit adding 20, so that should be a total of 21 for both. I did manage to capture the protobuf message the IDE sent for the second commit, and it seems to be showing (correctly) that lines 3 through 21 of ai_file.js were written by the Composer 2 model, and 3–21 of human_file.js were added manually.

Screenshot of the decoded ReportCommitAiAnalyticsRequest message for the second commit, showing the line ranges attributed to the Composer 2 model and to manual edits.

Thanks to pawitp for this handy protobuf decoder tool.

So I'm not sure why a few lines seem to have gone missing, but regardless the behavior does more or less match what I'd expect from Cursor's description.

Unfortunately, the line-based approach has other flaws that don't show up in this test. For instance, I pasted in a (bogus) 100-line JavaScript file, and then told Cursor to change all the double quotes to single quotes (updating escape characters where necessary). Some might argue that that's an overly simple task to delegate to an LLM (as opposed to an IDE or linter feature), but with some companies giving employees basically unlimited token budgets, and the very low cost of some of the cheaper flash/nano models, I don't think it's that unrealistic. As you'd expect, Composer 2 handled it flawlessly, touching 49 of the 93 non-blank lines in the file.

Screenshot of the Cursor IDE. In the chat I've entered the prompt "Update this file to use single quote strings instead of double quotes".

The main difference between Windsurf and Cursor seems to be color saturation.

The gotcha here is probably pretty obvious. I was expecting to say "see, I added this code manually, but now that Cursor has changed the quote marks it counts all the lines containing quotes as AI-generated". That wasn't what actually happened, though.

Screenshot from the Cursor website, showing "AI Share of Committed Code" as 87.0%. Next to the first bar on the bar graph there is a new bar showing that 100 lines of code were added, all of them labeled as AI.

Somehow, Cursor counted the entire file as AI, even though we can see from the diff that it left plenty of the lines unchanged. And remember that the entire file is exactly 100 lines long, including some blank ones, so it's not just a case of excluding lines that are considered too simple to be counted. My best guess is that the system that tracks which lines were added by the AI is designed to work with contiguous blocks of code (like drafting an entire function), and if there are too many gaps in the generation it just gives up and calls the whole thing one AI block.

Regardless, this is another case where the AI tool seems to be claiming credit for 100% of the code produced, even though arguably zero lines of code were actually "AI generated", and many of them weren't touched by the tool whatsoever. It looks like both IDEs sometimes wildly overestimate how much they're being used in a coding session.

Weights and Biases

One takeaway here is that it's just very hard to measure the contribution LLMs make to a codebase. Sometimes the best use cases are inquisitive prompts like "Is there already a different solution to this elsewhere in the codebase?" or "Are there any edge cases this logic doesn't cover?", which don't necessarily produce any code at all. On the flipside, I'm a big believer in a philosophy expressed concisely by Jack Diederich:

"I hate code, and I want as little of it as possible in our product."

Measuring the value of an LLM by the number of bytes or lines it produces has all the same problems as measuring developers that way; adding a lot of code doesn't necessarily mean you're adding a lot of value, and sometimes the hardest and most productive work is cleaning up and simplifying what's already there. Besides, when a developer is making heavy use of tab complete, etc. there's not always a clear-cut answer to "was this line of code written by AI", even if you were looking over their shoulder as they wrote the file. So perhaps it's foolish to expect an algorithmic answer to that question.

Still, it's notable that the bias always seems to be towards reporting a higher AI percentage. Whether that number is truly meaningful or not, "what percent of my team's code did Windsurf write" is a very appealing statistic for a manager or executive. Execs love announcing that 30%, 75%, even 100% of their code is AI-generated. And of course high numbers are great for AI companies, because they underscore the value they bring to software teams and help justify their high subscription costs. But as a developer, skewed metrics can be harmful. If 50% of my team's code is AI-generated, will management expect features to be implemented twice as fast? If 90% is AI, do we even need a team?

Again to their credit, Windsurf does push back on that type of thinking in their blog post:

"Writing code is not the same as software development. This is only capturing some level of acceleration while writing code, and does not capture time taken in architecture, debugging, review, deployment, and a number of other steps."

To be sure, all metrics are only as good as your understanding of their limitations. If everyone internalizes that these percentages should only be used to compare trends over time, with the absolute values being essentially meaningless (and not comparable across tools), then maybe the details of how they're computed don't matter. But a sentence like "98% of our new code was written by Windsurf" creates a gut feeling that's hard to talk yourself out of, even when you know there are caveats. And I wonder if the impact of these stats could go beyond press releases and 🚀-laiden Slack posts. Since code is protected by literary copyright, and AI-generated works aren't copyrightable, the legal team might get nervous when they hear that the vast majority of their company's code "can be attributed to AI".

Ultimately, I don't really know what percentage of the code I commit is from an AI model. I don't know what the "correct" way to calculate that would be, or if it's worth calculating at all. I'm confident that these tools save me some amount of time, but I also know it's easy to overestimate how much. What I am certain about is that these vendors have a lot of money riding on whether or not AI is fulfilling its grandiose promises; massively accelerating strong developers and completely replacing weak ones.

Perhaps it is, but I'm not going to trust them to measure it.

Monitoring LLM behavior: Drift, retries, and refusal patterns (11 minute read)

AI
A Microsoft engineer outlines a two-layer evaluation framework for monitoring LLM systems in production, combining deterministic checks with model-based semantic assessments to catch failures before deployment.
What: A comprehensive framework called the "AI Evaluation Stack" that separates LLM testing into deterministic assertions (checking syntax, schema, routing) and model-based evaluations (semantic quality using "LLM-as-a-Judge"), with both offline pre-deployment pipelines using curated test datasets and online production monitoring that feeds back into continuous improvement.
Why it matters: Traditional unit testing breaks down for LLMs because the same prompt produces different outputs each time, making it impossible to rely on deterministic pass/fail checks alone. Enterprise AI systems face compliance risks from hallucinations and failures, requiring structured evaluation infrastructure instead of informal "vibe checks" that pass in development but fail when customers use the product.
Takeaway: Implement a two-pipeline evaluation system: build an offline regression suite with 200-500 "golden" test cases requiring 95%+ pass rates before deployment, then monitor production with both explicit feedback (thumbs up/down) and implicit signals (retry rates, refusal patterns) to continuously update your test dataset.
Deep dive
  • Layer 1 deterministic assertions act as fail-fast gates that use traditional code and regex to validate structural integrity before expensive semantic checks run, catching issues like malformed JSON schemas, incorrect tool calls, or missing required arguments with instant binary pass/fail results
  • Layer 2 model-based assertions use "LLM-as-a-Judge" architecture to evaluate semantic quality like helpfulness or tone, requiring three critical inputs: a frontier reasoning model superior to the production model, a strict scoring rubric with explicitly defined gradients (not vague "rate this" prompts), and human-vetted golden outputs as ground truth
  • Offline pipelines gate pre-deployment with golden datasets of 200-500 test cases representing real-world traffic distributions including edge cases and adversarial inputs, integrated as blocking CI/CD steps with 95%+ pass rates required for enterprise (99%+ for high-risk domains)
  • Composite scoring systems weight deterministic and semantic checks differently, such as allocating 6 points to structural validity (correct tool, valid JSON, schema compliance) and 4 points to semantic quality (subject line accuracy, hallucination-free content), with short-circuit logic that fails the entire test instantly if any deterministic check fails (see the sketch after this list)
  • Any system modification requires full regression testing because LLM non-determinism means fixes for one edge case can cause unforeseen degradations elsewhere, making continuous re-evaluation against the entire golden dataset mandatory
  • Online pipelines monitor four telemetry categories post-deployment: explicit user signals (thumbs up/down, written feedback), implicit behavioral signals (regeneration/retry rates, apology detection, refusal rates), synchronous deterministic asserts on 100% of traffic, and asynchronous LLM-Judge sampling ~5% of sessions
  • Production LLM-Judges must run asynchronously rather than on the critical path to avoid doubling latency and compute costs, sampling a small fraction of daily sessions to generate continuous quality dashboards while respecting data privacy agreements
  • The feedback flywheel prevents dataset rot by capturing production failures (negative signals or behavioral flags), triaging them for human review, conducting root-cause analysis, appending corrected cases to the golden dataset with synthetic variations, and continuously re-evaluating the model against newly discovered edge cases
  • Synthetic data generation accelerates dataset curation but introduces contamination and bias risks, requiring mandatory human-in-the-loop review where domain experts validate AI-generated test cases before committing them to the repository
  • Static golden datasets suffer from concept drift as user behavior evolves and customers discover novel use cases not covered in original evaluations, creating a dangerous illusion of high offline pass rates masking degrading real-world experiences
  • Apology rate and refusal rate patterns reveal silent failures: programmatically scanning for phrases like "I'm sorry" detects degraded capabilities or broken tool routing, while artificially high refusal rates indicate over-calibrated safety filters rejecting benign queries
  • The architecture redefines "done" for AI features as requiring not just coherent responses but rigorous automated evaluation pipelines that pass against both curated golden datasets and continuously discovered production edge cases
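
To make the two-layer split concrete, here is a minimal sketch of the composite scoring and regression gate described above. The 6/4 weighting and the 95% threshold come from the article; the specific check names and rubric fields are illustrative placeholders.

```typescript
// Sketch of the composite scoring pattern: deterministic checks gate the test
// (short-circuit to 0 on any failure), then LLM-judge scores add the rest.
// The 6/4 weighting and 95% gate follow the article; names are placeholders.
interface DeterministicChecks {
  correctTool: boolean;
  validJson: boolean;
  schemaCompliant: boolean;
}

interface SemanticScores {
  subjectLineAccuracy: number; // 0..1, produced by an LLM-as-a-Judge rubric
  hallucinationFree: number;   // 0..1
}

function scoreTestCase(det: DeterministicChecks, sem: SemanticScores): number {
  // Layer 1: fail fast. Any structural failure ends the test at 0 out of 10.
  if (!det.correctTool || !det.validJson || !det.schemaCompliant) return 0;

  // Layer 2: 6 points for structural validity, 4 points for semantic quality.
  const structural = 6;
  const semantic = 4 * (0.5 * sem.subjectLineAccuracy + 0.5 * sem.hallucinationFree);
  return structural + semantic;
}

// Offline regression gate over a golden dataset: block the deploy unless
// enough cases score above the passing bar (95%+ per the article).
function regressionGate(scores: number[], passingScore = 8, requiredRate = 0.95): boolean {
  const passRate = scores.filter((s) => s >= passingScore).length / scores.length;
  return passRate >= requiredRate;
}
```
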
Decoder
  • LLM-as-a-Judge: Using a large language model to evaluate the output quality of another LLM, serving as a scalable proxy for human judgment when assessing semantic qualities like helpfulness or tone that can't be captured with traditional code assertions
  • Golden Dataset: A version-controlled repository of 200-500 human-reviewed test cases pairing exact input prompts with expected "golden outputs" (ground truth), representing the AI system's full operational envelope including edge cases and adversarial inputs
  • Stochastic: Non-deterministic behavior where the same input produces different outputs, breaking traditional unit testing assumptions that Input A plus Function B always equals Output C
  • Concept drift: The degradation of model performance over time as real-world user behavior and use cases evolve beyond what was covered in static training or evaluation datasets
  • Short-circuit evaluation: Fail-fast logic that immediately terminates testing and returns a failure result when a critical condition isn't met, preventing wasteful execution of expensive downstream checks
  • Tool call: When an LLM invokes a specific function or API with structured arguments rather than generating conversational text, typically requiring exact JSON schema compliance
  • HITL (Human-in-the-Loop): Architecture requiring human review and validation at critical stages, such as verifying AI-generated test cases before adding them to the evaluation dataset
Original article

Monitoring LLM behavior necessitates adopting the AI Evaluation Stack, separating tests into deterministic assertions (syntax and routing integrity) and model-based evaluations (semantic quality). Engineers use offline pipelines for pre-deployment regression testing with human-reviewed "Golden Datasets" while online pipelines monitor real-world performance for drift and failures. A continuous feedback loop from production telemetry ensures AI systems adapt, maintaining high performance as user behavior evolves.
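
As a sketch of the implicit production signals mentioned above (retries, apology phrasing, refusal patterns), the snippet below flags sessions for triage. The phrase lists and thresholds are invented for illustration.

```typescript
// Sketch of implicit behavioral monitoring: retries, apology phrasing, and
// refusals are flagged per session. Phrase lists and thresholds are invented.
interface SessionLog {
  responses: string[];
  regenerations: number; // how many times the user hit "retry"
}

const APOLOGY_PATTERNS = [/i'm sorry/i, /i apologi[sz]e/i];
const REFUSAL_PATTERNS = [/i can('|no)t help with/i, /i am unable to/i];

function flagSession(session: SessionLog) {
  const text = session.responses.join("\n");
  return {
    retried: session.regenerations > 0,
    apologized: APOLOGY_PATTERNS.some((p) => p.test(text)),
    refused: REFUSAL_PATTERNS.some((p) => p.test(text)),
  };
}

// Aggregated over a day of traffic, a rising apology or refusal rate is the
// "silent failure" signal that triggers triage and new golden-dataset cases.
function refusalRate(sessions: SessionLog[]): number {
  return sessions.filter((s) => flagSession(s).refused).length / sessions.length;
}
```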

The World Can't Keep Up With AI Labs (9 minute read)

AI
Coding agents are generating real revenue at unprecedented growth rates, but compute infrastructure can't scale fast enough to meet demand.
What: An analysis arguing that AI coding agents represent the first AI product with sustained commercial traction, with Anthropic's revenue tripling in early 2026 and Claude accounting for growing percentages of GitHub commits, but physical infrastructure bottlenecks in memory, energy, and semiconductor manufacturing are constraining supply.
Why it matters: This supply-demand mismatch will likely force AI labs to raise prices and impose stricter usage limits, fundamentally changing the economics of AI tool access for developers who have grown accustomed to relatively affordable subscriptions.
Takeaway: Diversify across multiple AI providers rather than depending on a single service, and find ways to monetize AI productivity gains to justify potentially much higher subscription costs.
Deep dive
  • Anthropic's revenue is growing 3x year-to-date in 2026, faster than historical comparisons like Zoom during the pandemic or Google at IPO, despite being a much larger company where growth typically slows
  • Claude's share of GitHub commits doubled from 2% to 4% in January 2026 alone, with projections to reach 20%+ by year-end, indicating real workflow integration beyond hype
  • A $100/month coding agent subscription delivers 10-30x ROI for median developers earning $350-500/day if it automates even 10% of routine work
  • AI labs face a structural cash flow gap where they must simultaneously fund current inference costs and invest heavily in next-generation models that won't generate revenue for 1-2 years
  • Anthropic needs to grow compute capacity from 2.5 gigawatts to 5-6 gigawatts by end of 2026, but long-term contracts in the supply chain make rapid scaling nearly impossible
  • Three major bottlenecks constrain growth: HBM memory (30% of infrastructure costs, controlled by SK Hynix and Samsung), datacenter energy (grids can't deliver power fast enough), and semiconductor fab capacity (limited by TSMC factories and ASML lithography machines)
  • ASML produces only ~50 EUV lithography machines per year at $350M each, and Nvidia has locked up 70% of TSMC's 3-nanometer production capacity through advance contracts
  • Hyperscalers (Google, Amazon, Microsoft, Meta) are spending $105-200 billion annually on infrastructure, creating an $80 billion capital expenditure requirement to support $30 billion in AI lab revenue
  • The supply chain's reliance on long-term contracts to manage bubble risk means the entire value chain cannot react quickly to unexpected demand surges like the coding agent boom
  • Energy constraints can be temporarily addressed with industrial gas turbines and generators, but semiconductor and skilled labor shortages (especially electricians) cannot be solved by throwing money at the problem
  • AI labs will likely respond by cutting usage limits, implementing time-based pricing tiers, and raising subscription prices potentially to $1,000+ for power users where the ROI still justifies the cost
Decoder
  • HBM (High Bandwidth Memory): Expensive memory technology used in GPUs that provides much faster data transfer than standard RAM, reducing GPU idle time during processing
  • CoWoS (Chip-on-Wafer-on-Substrate): Advanced packaging technology used in final chip-to-module assembly that became a bottleneck in 2023
  • EUV scanners: Extreme ultraviolet lithography machines made exclusively by ASML that etch circuits onto silicon wafers for modern chips, costing ~$350M each
  • Hyperscalers: Major cloud infrastructure providers (Google, Amazon, Microsoft, Meta) that build massive datacenters and rent compute capacity
  • Inference: Running a trained AI model to generate outputs, as opposed to training the model initially (inference is what users pay for when using ChatGPT or Claude)
  • 3-nanometer process: Current generation semiconductor manufacturing technology that determines how small and efficient chip transistors can be made
Original article

The World Can't Keep Up With AI Labs

Late last year a new AI psychosis kicked off. This time it was coding agents.

People started saying this is a new era in programming, blah blah blah.

Karpathy tweet, late winter

A few months later, we've got more than just claims. We've got numbers. And they say something unusual is happening in the market.

Coding agents are the first AI product people are paying for at volume and on a recurring basis, because they directly speed up their work. It's too early to claim businesses are replacing whole processes with agents across the board, but compute demand has started growing faster than anyone can build out capacity.

Here's why this moment is different, why nobody's ready, and what I took from it personally.

The Numbers

OpenAI and Anthropic might go for an IPO soon. That's why they're eagerly posting how fast their revenue is growing.

And it's a ton of money.

Anthropic is up 3x since the start of the year. And they're already a big company. This is impressive, because the bigger you are, the harder it is to keep growing at the same pace.

OpenAI on the left, Anthropic on the right.

Even during past boom moments, nobody hit numbers like these (with a caveat, see below). Zoom during the pandemic, Google at IPO, Coinbase cashing in on commissions during the crypto hype. These are companies 5-10x smaller than Anthropic, in special situations, and they still grew slower!

The best growth years for big companies. Only ones that were already large. Revenue measured at start vs end of year.

The caveat. First, vaccine makers during the pandemic were also up there. Second, Anthropic's numbers are a projection for the rest of the year based on early data. And they count things a bit differently than OpenAI. None of that changes my conclusion, which is...

Cash is a solid tell for real demand for agentic systems.

Last year when a bunch of people suddenly figured out ChatGPT could generate cool images, that didn't translate into serious money.

Meanwhile, in January alone, Claude Code commits on GitHub (in publicly accessible repos) went from 2% to 4%. If that sounds small, keep in mind it's one month, and that's without Codex, Copilot, or Devin. By end of year Dylan Patel forecasts Claude hitting 20%+.

Claude commits on GitHub.

Even if a $100 subscription only automates a small slice of the work, that's nothing compared to a developer's salary. For a median developer at $350-500 a day, the subscription has 10-30x ROI if it handles just the simplest, most routine 10% of their work.
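
A rough sanity check of that claim, under two assumptions of mine: about 21 working days per month, and valuing the automated slice at its pro-rated share of the day rate. On those figures the result lands around 7-10x, so the higher end of the 10-30x range presumably assumes a larger automated share or fully loaded developer costs rather than the raw day rate.

```typescript
// Back-of-the-envelope ROI check. Assumptions (mine, not the author's):
// ~21 working days per month, and the automated slice valued at its
// pro-rated share of the developer's day rate.
function subscriptionRoi(dayRate: number, automatedShare: number, subscription = 100): number {
  const monthlyCost = dayRate * 21;              // what a month of the developer costs
  const valueAutomated = monthlyCost * automatedShare;
  return valueAutomated / subscription;
}

console.log(subscriptionRoi(350, 0.1)); // roughly 7x at the low end of the day-rate range
console.log(subscriptionRoi(500, 0.1)); // roughly 10x at the high end
```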

There's plenty to argue with here.

Let me even lay out the weak spots in my own logic.

So their revenue is growing, fine—the labs are still unprofitable as businesses. They have every incentive to pump the hype to pull in the most risk-tolerant companies. The ones paying are early enthusiasts, not big companies. And enthusiasts come and go. Plenty of bubbles have popped exactly this way.

Agents are unstable and still randomly screw up. Who's to blame when things go wrong? You can't replace humans yet, because serious businesses care about reliability. And where do senior engineers come from without juniors if you stop hiring?

Agents only handle a narrow set of tasks well. Even if writing code is faster, shipping a product still gets bottlenecked by gathering requirements, architecture, review, testing, and our beloved stakeholder zoomcalls and compliance.

I decided at some point you have to commit and pick a side, even without conclusive evidence.

The finish line can be moved forever. There was a time when reasoning was completely out of reach for ML models. Same for decent image generation, or speech that didn't sound like a robot. There was a time nobody believed machines would learn to play Go. You get the idea.

Metaphor from Tegmark's Life 3.0. Computers gradually learn harder and harder tasks. Over time there's less and less they can't do. Like water filling a map from the bottom up.

Ilya Sutskever, back when he was still at OpenAI, often mentioned an internal meme—Feel the AGI.

He was one of the first to believe deep learning would gradually change our lives. Yes, there's a lot we don't know, but everything keeps moving in that direction, and that matters. Everyone gets it at their own moment. When a neural net does something you usually do yourself, manually, that's a special feeling.

I've lost count of how many of those moments I've had in 10 years of following neural nets. So I'm not interested in the bubble-or-not debate anymore. I'm interested in watching the water level rise.

Personally, I have enough evidence that agents can now do valuable work that companies are willing to pay for.

And the thing is, demand has plenty of room to grow. Agents often don't work out of the box. You have to adapt to them, and the fastest and most curious people do that best. Everyone else will catch up bit by bit.

The Industry Isn't Ready For This

To avoid talking about "the industry" in the abstract, let me split it into 3 layers.

  • AI labs make models. OpenAI, Anthropic, DeepMind.
  • Hyperscalers build datacenters. Google, Amazon, Microsoft, Meta.
  • Chipmakers make chips. Nvidia, TSMC, ASML.

And at every layer, companies are scared.

People online love talking about bubbles. Turns out, all these companies are well aware bubbles happen. And to avoid going bankrupt, each one is cooking up its own workaround.

Dario Amodei says he builds the company's plans off a pessimistic revenue scenario. Funny thing is, this year they're already beating that by 1.5x. And only 3 months of the year have gone by. They're beating the optimistic scenario too.

Dwarkesh asked him straight up in an interview: why? Dario genuinely believes in massive future upside from AI. He writes long essays about it, pitches a country of geniuses in a datacenter. And yet he doesn't want to bet everything on that future.

Dario says it's risky because of a cash flow gap in the business model.

Here's how it works. They provide neural nets to users: they pay hardware owners for inference and make money from subscriptions and APIs. In parallel, they pour money into research on the next-generation model, which won't start making money for another year or two.

They regularly spend more than half of revenue on research.

You're not just balancing income and expenses—you're also balancing investment in future growth. If you invest big and the growth doesn't show up, you're in serious trouble.

Anthropic has been running in this mode for three years straight. Growing 10x every year. Dario figured 2026 would be when it ends. Because the bigger you are, the harder it gets. You are gonna slow down at some point.

What he didn't mention in the interview is that their margins are growing slower than forecast. Costs are growing multiple times faster than they'd planned.

Dario says he wants to push the company into profitability in a few years. To do that they need to improve margins. That means slowing growth and investing conservatively, only on the most efficient things.

The logic adds up. But slowing down isn't really working. They look ready to 10x again this year. But the resources to support that aren't there.

Anthropic doesn't have enough compute for this many power users.

They rent GPUs from hyperscalers. And they can't just walk into a datacenter and ask for more. Because the datacenter owner is also exposed to bubble risk. So capacity is booked out in advance.

For Anthropic to make $30B a year, someone had to spend $80B on infrastructure. Betting it would pay off in a few years.

Amazon will spend around $200B this year, Google $180B, Meta $125B, Microsoft $105B. That's a setup for trillions in economic value in the coming years.

And a cash flow gap risk if the value doesn't materialize.

The industry is one long value chain. Everyone in it tries to lower their own risk by locking expectations into contracts. Which reduces the whole chain's ability to react to surprises. Like the sudden arrival of coding agents.

So every year labs hit some new bottleneck. And constraints keep sliding further upstream, toward players further from the end user. Because their risks are higher and their contracts are even less flexible.

A New Bottleneck Every Year

In 2023 everyone was chasing GPUs. More specifically, TSMC factories didn't have enough capacity for the final chip-to-module assembly (CoWoS). In 2024 came the HBM memory shortage for those same modules. In 2025 GPUs got better, but datacenter buildout became limited by power supply. In 2026 it turned out even when you have the power, the US grid can't deliver it to datacenters at the volume needed.

1 - Memory

Modern models need more memory than before. I mentioned earlier that companies spend hundreds of billions a year on infrastructure. Roughly 30% of that goes to memory.

And they have to buy expensive HBM instead of cheap DDR, because high bandwidth keeps the GPU from sitting idle while it waits on memory.

Turns out memory is the most expensive part of a GPU. If you don't count the markup =)

Memory prices are probably going to keep rising unless someone figures out how to work around it. They could easily go up another 2-3x, because SK Hynix and Samsung control 90% of the market. And memory demand is only growing.

2 - Energy and Datacenters

xAI proved datacenters can be built pretty fast.

But they eat power like a small city. And when one suddenly shows up in a region within six months, the local grid just can't handle it.

Surprisingly, Dylan Patel isn't that worried about energy. New power plants, transformer stations, and plain old transmission towers take a long time to build. But while the grid catches up to the new load, you can power datacenters off industrial gas turbines. Literally roll up to the datacenter with a dozen trailers full of generators and you're good (though people are starting to worry that this is far from clean energy).

There are also piston engines, solar with batteries, hydrogen reactors, marine ship engines… Basically, every trick the fuel industry has invented in its entire history. Together with more efficient grid usage, that can add up to hundreds of gigawatts.

Right now GPUs alone consume 13GW. Add the rest of the datacenter and you can multiply by 2.

The blocker for building datacenters and reactors fast is a shortage of skilled labor, especially electricians.

So, expensive and labor-intensive. But turns out it's still easier than the semiconductor supply chain.

3 - Semiconductors

There are factories (mostly TSMC) that assemble GPUs of a specific generation (based on designs from Nvidia or Google), for example on the 3-nanometer process.

And there just aren't enough factories built.

This can't be fixed quickly because these are some of the most complex industrial facilities on the planet. Building one takes 2-3 years and a pile of specialized equipment and chemistry.

The hardest piece is the lithography machines (EUV scanners). They're needed to etch chips onto wafers. The wafers then get paired with memory into modules, and that's how you get a GPU.

These machines cost ~$350M each. Only one company from the Netherlands makes them—ASML. Around 50 machines a year.

The machine.

By a rough estimate, by 2030 there will be around 700 of them worldwide. That's on the order of 200 gigawatts of compute. And at the end of 2025 we were using ~27 gigawatts. Note that that's before the agent hype of early 2026.

So there's room to grow, but the shortage will be permanent—bottlenecked by factory construction, wafers, and lithography machines.

These are the kinds of constraints you can't just throw money at, unlike memory and datacenter energy.

You can see it clearly in Google's behavior.

They have their own chip designs. And they still buy a quarter of their capacity from Nvidia. They'd love to make their own, they just can't.

The share is dropping, but it's still a lot, considering their own chips are better!

All chips are assembled at TSMC factories to someone else's designs. And Google and Amazon (who also have their own designs) slept through the moment when Jensen Huang locked in contracts for 70% of 3-nm capacity. That's great for TSMC—they're at the end of the production chain and need stability.

Nvidia is also living the dream, selling cards at 6x production cost.

And Google even sold its own capacity to Anthropic through GCP. What a company.

So What?

So, the industry isn't ready for the agent boom.

Because it came on too suddenly. To a market where what ultimately matters is long-term contracts on complex chip-making infrastructure.

Anthropic right now has 2.5 gigawatts of compute, and by the end of the year they need 5-6. The only way to get that much is the "Other" category. CoreWeave, Bedrock, Vertex, Foundry. Scraps from anyone whose capacity is still available, at premium prices.

And they want to become a profitable company, so they can't afford to burn cash.

Hence the bad news.

The ones who'll probably suffer are us.

The most obvious move is for them to just cut limits and raise prices.

The other week they moved OpenClaw onto the API. And they said so in a nice and honest way. Sorry guys. We're tightening belts, here's $20 as an apology for the inconvenience.

They also rolled out different tiers depending on time of day. I've already run into it a couple of times: Claude simply ran out of capacity during "off-peak" hours, under pressure from everyone optimizing for discounted tokens.

Denied.

I pulled two takeaways from this for myself.

1 - Don't put all your eggs in one basket.

For example, when building a skill, make it work on any model. I'm obsessed with Claude, but OpenAI and Google are in way better shape on compute access.

So I've learned to swap models depending on the task. I pay the minimum subscription to every lab. And when the limit runs out, I just switch models.

I'm not using Chinese open-source. Don't use Deepseek, for the love of god.

2 - Get anxious about not making money off AI.

Neural nets aren't a way for me to make more money. They're on my expense sheet, and they pay for themselves by giving me more options and more time.

But if they roll out some $1000 tier, I won't be able to pull that off. Right now that sounds absurd. But remember the example with a real person's salary. As long as $1000 of spend brings in $5000 of profit, you're winning.

And whoever can't pull that off will be stuck on the free tier watching ads.

Vision Banana Generalist Model (39 minute read)

Vision Banana Generalist Model (39 minute read)

AI
Researchers demonstrate that image generation models can serve as generalist vision systems by reframing perception tasks like segmentation and depth estimation as image generation problems.
What: Vision Banana is a generalist vision model created by instruction-tuning Google's Nano Banana Pro image generator on vision task data, treating all outputs (segmentation masks, depth maps, etc.) as RGB images to generate rather than separate prediction tasks.
Why it matters: This suggests a potential paradigm shift for computer vision similar to what happened with large language models—generative pretraining on images may be the key to building foundational vision models that excel at both creation and understanding, rather than training specialized models for each task.
Takeaway: Monitor the project page for model weights and implementation details if you're working on vision tasks that could benefit from a unified generalist approach instead of maintaining multiple specialized models.
Deep dive
  • The research challenges the traditional computer vision paradigm where separate models are trained for different tasks like segmentation, depth estimation, and object detection
  • Vision Banana achieves state-of-the-art results by converting vision tasks into image generation problems—outputting segmentation masks and depth maps as generated RGB images
  • The model beats or matches specialized systems including Segment Anything Model 3 for segmentation and Depth Anything for metric depth estimation, despite being a generalist
  • Built through lightweight instruction-tuning of Nano Banana Pro on a mixture of original image generation data plus a small amount of vision task data
  • The key insight mirrors the LLM revolution: just as language generation pretraining gave models emergent understanding capabilities, image generation pretraining provides powerful general visual representations
  • The instruction-tuning approach preserves the base model's image generation capabilities while adding perception abilities
  • Works across both 2D and 3D vision understanding tasks, demonstrating true generalist capabilities
  • The unified interface of image generation for all vision tasks parallels how text generation became the universal interface for language understanding and reasoning
  • Results suggest that the ability to generate visual content inherently requires understanding visual content, validating a long-standing conjecture in computer vision
  • The paper proposes that generative vision pretraining should take a central role in building foundational vision models going forward
  • This approach eliminates the need for task-specific architectures and output layers that have dominated computer vision for decades
Decoder
  • Instruction-tuning: Training a pretrained model on task-specific examples with instructions, similar to fine-tuning but focused on teaching the model to follow diverse commands
  • Zero-shot: A model's ability to perform tasks it wasn't explicitly trained on, by generalizing from its pretraining
  • SOTA: State-of-the-art, the best currently available performance on a benchmark
  • SAM (Segment Anything Model): Meta's specialized model for image segmentation that can identify and mask objects
  • Metric depth estimation: Predicting actual distance measurements from the camera to objects in a scene, not just relative depth ordering
  • Nano Banana Pro (NBP): Google's image generation model that serves as the base for Vision Banana (likely part of the Banana model family)
Original article

Image Generators are Generalist Vision Learners

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.
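
The core mechanism is simple to picture: any dense prediction target can be packed into the three channels of an RGB image, so "predict depth" literally becomes "generate an image". Below is a minimal sketch of one such packing for metric depth; the encoding (16-bit depth split across two channels) is an illustrative choice of mine, not the parameterization the paper actually uses.

import numpy as np

# Illustrative only: one simple way to pack a metric depth map into an RGB
# image so that a depth task becomes an image-generation task.

def depth_to_rgb(depth, d_min=0.1, d_max=80.0):
    """Encode metric depth (meters) into a uint8 RGB image."""
    d = np.clip(depth, d_min, d_max)
    t = (d - d_min) / (d_max - d_min)          # normalize to [0, 1]
    code = np.round(t * 65535).astype(np.uint16)
    hi = (code >> 8).astype(np.uint8)          # high byte -> R channel
    lo = (code & 0xFF).astype(np.uint8)        # low byte  -> G channel
    zero = np.zeros_like(hi)                   # B channel unused
    return np.stack([hi, lo, zero], axis=-1)

def rgb_to_depth(rgb, d_min=0.1, d_max=80.0):
    """Decode the RGB image back into metric depth."""
    code = rgb[..., 0].astype(np.uint16) * 256 + rgb[..., 1].astype(np.uint16)
    t = code.astype(np.float32) / 65535.0
    return d_min + t * (d_max - d_min)

depth = np.random.uniform(0.5, 60.0, size=(224, 224)).astype(np.float32)
round_trip = rgb_to_depth(depth_to_rgb(depth))
assert np.abs(round_trip - depth).max() < 1e-3   # quantization error well under 1 mm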

Stash (GitHub Repo)

Stash (GitHub Repo)

AI
Stash is an open-source tool that gives AI agents persistent memory across sessions, solving the problem of LLMs starting every conversation from scratch.
What: Stash is a self-hosted memory layer for AI agents that uses an 8-stage consolidation pipeline to transform raw observations into structured knowledge including facts, relationships, causal links, goal tracking, and failure patterns, compatible with any MCP-enabled agent.
Why it matters: Addresses a fundamental limitation of LLMs by enabling agents to build up knowledge over time rather than resetting with each interaction, potentially making them progressively more useful as they accumulate experience and learn from past sessions.
Takeaway: Clone the repo and run docker compose up to add persistent memory to any MCP-compatible AI agent like Claude Desktop, Cursor, or Cline.
Deep dive
  • Uses Postgres with pgvector as the underlying storage engine for vector-based memory retrieval
  • Implements an 8-stage consolidation pipeline that progressively refines raw observations: episodes → facts → relationships → patterns → wisdom
  • Each consolidation stage only processes new data since the last run, making it efficient for continuous operation
  • Includes advanced features like causal link analysis, goal tracking, failure pattern recognition, hypothesis verification, and confidence decay over time
  • Runs as an MCP server with background consolidation, meaning the memory processing happens automatically without blocking agent interactions
  • Works with a broad range of AI platforms including Claude Desktop, Cursor, Windsurf, Cline, Continue, OpenAI Agents, Ollama, and OpenRouter
  • Self-hosted design means you maintain control over your agent's memory data rather than relying on third-party services
  • Apache 2.0 licensed, allowing commercial use and modification
Decoder
  • MCP: Model Context Protocol, a standard interface that allows AI agents to connect to external tools and data sources
  • pgvector: PostgreSQL extension for storing and querying vector embeddings, enabling semantic similarity search
  • Consolidation pipeline: Multi-stage process that transforms raw data into increasingly abstract and useful knowledge structures
Original article

Stash

Your AI has amnesia. We fixed it.

Every LLM starts every conversation from zero. Stash gives your agent persistent memory — it remembers, recalls, consolidates, and learns across sessions. No more explaining yourself from scratch.

Open source. Self-hosted. Works with any MCP-compatible agent.

Quick Start

git clone https://github.com/alash3al/stash.git
cd stash
cp .env.example .env   # edit with your API key + model
docker compose up

That's it. Postgres + pgvector, migrations, MCP server with background consolidation — all in one command.

What It Does

Stash is a cognitive layer between your AI agent and the world. Episodes become facts. Facts become relationships. Relationships become patterns. Patterns become wisdom.

An 8-stage consolidation pipeline turns raw observations into structured knowledge — facts, relationships, causal links, goal tracking, failure patterns, hypothesis verification, and confidence decay. Each stage only processes new data since the last run.
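
The "each stage only processes new data since the last run" behavior is a standard watermark pattern. Here is a generic sketch of that pattern in Python, with SQLite standing in for Postgres; the table and column names are hypothetical and are not Stash's actual schema.

import sqlite3

# Hypothetical schema for illustration only, not Stash's actual tables.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE observations (id INTEGER PRIMARY KEY, text TEXT);
    CREATE TABLE facts (id INTEGER PRIMARY KEY, source_id INTEGER, text TEXT);
    CREATE TABLE watermarks (stage TEXT PRIMARY KEY, last_id INTEGER);
""")

def consolidate_observations_to_facts():
    """One incremental stage: only rows newer than the stored watermark are processed."""
    row = db.execute(
        "SELECT last_id FROM watermarks WHERE stage = 'obs_to_facts'"
    ).fetchone()
    last_id = row[0] if row else 0

    new_rows = db.execute(
        "SELECT id, text FROM observations WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    for obs_id, text in new_rows:
        fact = f"fact derived from: {text}"   # stand-in for the real LLM extraction step
        db.execute("INSERT INTO facts (source_id, text) VALUES (?, ?)", (obs_id, fact))
        last_id = obs_id

    db.execute(
        "INSERT INTO watermarks (stage, last_id) VALUES ('obs_to_facts', ?) "
        "ON CONFLICT(stage) DO UPDATE SET last_id = excluded.last_id",
        (last_id,),
    )
    db.commit()

db.executemany("INSERT INTO observations (text) VALUES (?)",
               [("user prefers pytest over unittest",), ("deploy failed on Friday",)])
consolidate_observations_to_facts()   # processes both rows
consolidate_observations_to_facts()   # no-op: nothing newer than the watermark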

Works with Claude Desktop, Cursor, Windsurf, Cline, Continue, OpenAI Agents, Ollama, OpenRouter — anything MCP.

Learn More

alash3al.github.io/stash →

License

Apache 2.0

Efficient Video Intelligence in 2026 (21 minute read)

Efficient Video Intelligence in 2026 (21 minute read)

AI
A comprehensive technical review of efficient video intelligence in 2026 covers universal vision encoders, adaptive compression for hour-long videos, and on-device tracking at mobile-phone speeds.
What: A detailed technical overview of the current state of efficient video understanding systems, covering advances in compact universal vision encoders (EUPE), long-form video compression techniques (LongVU, Tempo), on-device segmentation and tracking that runs at 16 FPS on smartphones (EdgeTAM), VLM-based depth estimation, and deployment strategies across cloud, edge, and on-device tiers.
Why it matters: Video understanding has moved from short-clip recognition to hour-long reasoning with models small enough to run on phones, but the shift required rethinking every layer of the stack—from consolidating specialist encoders into single universal models to adaptive token allocation that handles millions of visual tokens; the architectural patterns that have stabilized (temporal compression, factorized attention, multi-teacher distillation) now define production video AI, while streaming understanding and sub-watt AR deployment remain unsolved.
Deep dive
  • Video understanding has evolved from short-clip action recognition to hour-long reasoning with models small enough for mobile deployment, driven by rethinking compression and taking deployment constraints seriously from the start
  • Universal vision encoders like EUPE consolidate what used to require separate models for segmentation, depth, classification, and language alignment into a single <100M parameter backbone through multi-teacher distillation via an intermediate proxy teacher
  • The proxy-teacher step matters because direct multi-teacher distillation into small students loses signal when teachers disagree at the feature level; the proxy resolves conflicts first then transfers a coherent feature space
  • Token volume is video's fundamental challenge: an hour of 30 FPS video at 224×224 produces 21 million visual tokens before compression, far exceeding any LLM context window
  • Long-form video systems like LongVU use four-stage compression: DINOv2-based temporal redundancy removal (drops ~54% of frames), feature fusion with SigLIP, cross-modal query selection (allocates tokens based on question relevance), and spatial token compression within sliding windows
  • Tempo pushes adaptive allocation further with a small VLM acting as query-aware compressor that routes 0.5 to 16 tokens per frame based on relevance, beating GPT-4o and Gemini 1.5 Pro on hour-plus videos at 8K token budgets
  • On-device foundation-model tracking became viable in 2024-2025 through aggressive compression: EdgeTAM achieves 16 FPS on iPhone 15 Pro Max with a 2D Spatial Perceiver that compresses per-frame memory while preserving spatial structure
  • Most per-frame computation in video tracking is redundant across adjacent frames, so memory-efficient propagation drives production gains more than raw model size reduction
  • DepthLM demonstrates that a 3B-parameter VLM with no architecture changes can match dedicated depth specialists through visual prompting (rendering markers on images), intrinsic-conditioned augmentation (unifying focal length), and training on just one labeled pixel per image
  • The depth landscape has split into four approaches: dedicated specialists (DepthAnything trained on 62M+ images), diffusion priors (Marigold with strong zero-shot), reconstruction models (VGGT predicting 3D structure jointly), and VLM-based methods that collapse depth and reasoning into one model
  • VideoAuto-R1 shows that explicit reasoning often doesn't help for perception-oriented video questions; gating chain-of-thought on confidence reduces average response length 3.3× while maintaining or improving accuracy
  • Audio-visual fusion has three architectural paths: encoder stitching (cheap, shallow alignment), native multimodal training (Qwen3-Omni, shares weights across modalities), and benchmark-driven evaluation (EgoAVU shows egocentric audio carries distinct signals from third-person video)
  • Video deployment splits into cloud (frontier models, high latency/cost), edge servers (mid-size 3-30B models, eliminates cloud latency), and on-device (zero latency, fully private, tight power budget), with hybrid architectures as the production default
  • Quantization recipes have stabilized: W4A16 is default for edge VLMs, NVFP4 unlocks Blackwell-tier hardware, and KV cache quantization matters more for video than text because the cache can dominate memory on long inputs
  • ExecuTorch reached production maturity in October 2025 and now powers Meta's on-device AI stack across Instagram, WhatsApp, Quest 3, and Ray-Ban Meta with backends for Apple, Qualcomm, Arm, MediaTek, and Vulkan
  • Streaming understanding remains unsolved: current techniques like LongVU assume batch mode where the whole video is available upfront, but continuous-stream mode where video keeps arriving requires memory mechanisms and incremental compression that aren't yet production-ready
  • Sub-watt inference for AR glasses is 5-10× away in compute efficiency: today's mobile NPUs do tens of TOPS in tens of milliwatts, but always-on video understanding in a 1-3W envelope that includes all other system processes remains out of reach
  • Sparse-event detection (finding three frames out of 86,400 that matter without full inference on all frames) requires hierarchical attention or learned selection; schema-driven extraction over known classes ships commercially, but open-set anomaly detection is unsolved
  • Cross-camera reasoning and spatial grounding stability across cuts remain open problems; retrieval over indexed videos works via ANN over embeddings, but joint reasoning across streams and maintaining object identity across cameras and time windows is not yet solved
  • The stable architectural patterns are: compress on the temporal axis where redundancy is highest, distill universal encoders from multiple teachers, factorize attention along data structure (spatial within frames, temporal across), treat quantization as default, and gate reasoning on confidence
Decoder
  • EUPE (Efficient Universal Perception Encoder): A compact vision encoder under 100M parameters that matches domain specialists across image classification, segmentation, depth, and vision-language tasks by distilling from multiple teacher models (DINOv2, SAM, CLIP) through an intermediate proxy teacher
  • DINOv2/DINOv3: Self-supervised vision transformers that excel at dense prediction tasks (segmentation, depth, correspondence) by preserving fine-grained spatial structure
  • SAM (Segment Anything Model): Foundation model for prompt-driven segmentation; SAM 2 extends to video with memory modules
  • LongVU: Long-form video understanding system that uses adaptive token compression (DINOv2 for temporal pruning, cross-modal query selection) to handle hour-long videos efficiently
  • Tempo: Query-aware video compressor that routes token budget per-segment based on relevance, achieving strong performance on hour-plus videos at constrained budgets
  • EdgeTAM: Efficient tracking model that achieves foundation-model-grade video object tracking at 16 FPS on iPhone 15 Pro Max through aggressive memory compression
  • DepthLM: Vision-language model that performs metric depth estimation without specialized architecture, using visual prompting and intrinsic-conditioned augmentation
  • VideoAuto-R1: Video reasoning system that gates explicit chain-of-thought reasoning on confidence, activating detailed reasoning only when the initial answer is uncertain
  • EgoAVU: Egocentric audio-visual understanding benchmark and dataset for first-person video where audio carries distinct signals (hand-object contact, wearer's voice)
  • J&F scores: Jaccard (region similarity) and F-measure (contour accuracy) metrics for video object segmentation
  • W4A16 / W8A8: Quantization schemes with 4-bit or 8-bit weights and 16-bit or 8-bit activations, standard for deploying models on edge devices
  • ExecuTorch: Meta's PyTorch runtime for on-device deployment, reached 1.0 in October 2025, supports streaming inference across mobile and AR platforms
  • KV cache: Cached key-value pairs in transformer attention that can dominate memory for long sequences; aggressive quantization (3-4 bits) is critical for video
Original article

Efficient Video Intelligence in 2026

Five years ago, video understanding mostly meant action recognition on Kinetics-400 or short-clip captioning on MSR-VTT. Today, vision-language models reason about hour-long footage, on-device tracking segments any object at 16 FPS on a phone, and a single 100M-parameter encoder can match domain experts across image understanding, dense prediction, and VLM tasks. The shift came from rethinking what a video model needs to do, and from taking deployment constraints seriously.

This post walks through where efficient video intelligence stands in April 2026, following how a video system processes its input from raw frames through spatial perception, long-form temporal understanding, multimodal fusion and reasoning, and the deployment stack that makes any of it shippable.

A note up front: the post leans heavily on research from my own group, including EUPE, the EfficientSAM / Efficient Track Anything / EdgeTAM compression line, LongVU, Tempo, EgoAVU, VideoAuto-R1, DepthLM, and ParetoQ. I have tried to place each piece against the parallel and competing work in its section, but this is a perspective from inside one research program rather than a neutral survey.

Why Video Is Harder Than Text or Images

Token volume. A single minute of 30 FPS video at 224x224 resolution and ViT-B/16 patches produces 1,800 frames times 196 patches per frame, or 352K visual tokens before any text or audio, and an hour is 21M tokens before compression. No frontier LLM context window absorbs this naively, so every video model has to compress somewhere.
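
The arithmetic behind those figures is easy to reproduce (30 FPS, 224x224 frames, 16x16 patches):

# Back-of-the-envelope for the token counts quoted above.
fps = 30
patches_per_frame = (224 // 16) ** 2          # ViT-B/16 on 224x224 -> 196 patches

tokens_per_minute = fps * 60 * patches_per_frame
tokens_per_hour = tokens_per_minute * 60

print(f"{tokens_per_minute:,}")   # 352,800    (~352K visual tokens per minute)
print(f"{tokens_per_hour:,}")     # 21,168,000 (~21M visual tokens per hour)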

Information sparsity. Adjacent frames are usually nearly identical, and the interesting events are rare and unevenly distributed. A surveillance camera at 1 FPS over 24 hours produces 86,400 frames, and the question of interest may depend on three of them. Sampling every frame is wasteful, but uniform sampling drops the frames that matter, so adaptive selection is required.

Multi-modality is intrinsic. Video without audio is half a signal in egocentric, conversational, and many healthcare contexts, even though much surveillance footage is silent and sports broadcast audio is mostly commentary. Video with audio doubles the embedding cost and adds synchronization requirements, and training a native multimodal model is a different problem than bolting an audio adapter onto a vision encoder.

Vision Encoders: From Specialists to Universals

The first thing a video model does is encode each frame. Until recently, that meant picking an encoder family and accepting its weaknesses. Image-text contrastive models (CLIP, SigLIP, SigLIP 2) are the default VLM front-end for semantic retrieval but weak on dense prediction. Self-supervised ViTs (DINOv2, DINOv3) excel on dense prediction (segmentation, depth, correspondence) because their training objective preserves fine-grained spatial structure, but their features are not aligned to language. Segmentation foundation models (SAM, SAM 2 and the compressed variants below) are specialists for object proposals and tracking. Dense-prediction specialists (DepthAnything, MiDaS, DepthPro, DepthLM) handle depth.

A production video system on a wearable, robot, or smart camera cannot ship a separate backbone for each of these capabilities, and neither compromising on capability nor paying the memory-and-latency penalty is acceptable.

Agglomerative encoders and EUPE

The agglomerative-encoder thread addresses this directly. AM-RADIO (Ranzinger et al., Nvidia, CVPR 2024) introduced multi-teacher distillation for compact universal vision encoders, distilling CLIP, DINOv2, and SAM into a unified student. Theia (Shang et al., The AI Institute, CoRL 2024) targeted embodied-agent perception by distilling from CLIP, DINOv2, ViT, SAM, and Depth-Anything for robot learning. DUNE (Sariyildiz et al., Naver Labs Europe, CVPR 2025) extended this further with heterogeneous 2D and 3D teachers (DINOv2, MASt3R, Multi-HMR). The shared insight: vision foundation models trained for different objectives produce complementary feature spaces, and a small student can inherit the union if the distillation is set up well.

Our recent work on the Efficient Universal Perception Encoder (EUPE) advances this thread by adding an intermediate proxy-teacher step. The recipe:

  1. Train a large proxy teacher by distilling from a diverse teacher pool: DINOv2 and DINOv3 (self-supervised dense features), the SAM family (SAM, SAM 2, SAM 3) for segmentation, and CLIP / SigLIP / SigLIP-SO400M for vision-language alignment.
  2. Distill the proxy teacher down into a compact student under 100M parameters.

The intermediate step matters because direct multi-teacher distillation into a small student loses signal: the teachers disagree at the feature level and the student capacity cannot represent the union. A single proxy resolves the disagreements first, then transfers a coherent feature space.

The released family includes ViT-T/S/B and ConvNeXt T/S/B variants, all under 100M parameters, with weights on Hugging Face. Evaluation spans image classification (ImageNet, ObjectNet, SUN397, iNaturalist), dense prediction (ADE20K and COCO segmentation, NYU and KITTI depth, SPair matching), and vision-language tasks (VQA, image-text retrieval). EUPE matches or exceeds same-size domain experts across these domains. For video systems, which are particularly sensitive to per-frame inference cost, a single backbone covering classification, dense prediction, and VLM front-end means fewer encoders to load and amortize, and the latency win compounds with every frame in the stream.
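
As a rough sketch of the two-stage recipe above (agglomerate the teachers into a large proxy, then distill the proxy into a compact student), the snippet below uses plain MSE feature matching and per-teacher projection heads; these are simplifications of mine, not EUPE's actual losses or architecture. Each model is assumed to map an image batch to patch features of shape (B, N, D).

import torch
import torch.nn as nn
import torch.nn.functional as F

def stage1_proxy_step(proxy, teachers, heads, images, optimizer):
    """Stage 1: the large proxy absorbs several frozen teachers at once.
    heads[name] projects proxy features into teacher `name`'s feature space,
    so disagreements between teachers are reconciled inside the proxy."""
    proxy_feats = proxy(images)                        # (B, N, D_proxy)
    loss = 0.0
    for name, teacher in teachers.items():
        with torch.no_grad():
            target = teacher(images)                   # (B, N, D_teacher)
        loss = loss + F.mse_loss(heads[name](proxy_feats), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def stage2_student_step(student, proxy, head, images, optimizer):
    """Stage 2: the compact student distills the single, coherent proxy."""
    with torch.no_grad():
        target = proxy(images)
    loss = F.mse_loss(head(student(images)), target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with linear stand-ins for the real encoders and patch embeddings.
teachers = {"dino": nn.Linear(32, 64), "clip": nn.Linear(32, 48)}
proxy = nn.Linear(32, 128)
heads = nn.ModuleDict({"dino": nn.Linear(128, 64), "clip": nn.Linear(128, 48)})
images = torch.randn(2, 16, 32)
opt = torch.optim.Adam(list(proxy.parameters()) + list(heads.parameters()), lr=1e-3)
stage1_proxy_step(proxy, teachers, heads, images, opt)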

Efficient Attention for Long Sequences

Once frames are encoded, attention becomes the bottleneck. Standard self-attention is O(n²) in sequence length, which is unaffordable for long video. Three families of remedies have stabilized.

Sliding-window and sparse attention. LongLLaMA, Mistral's sliding-window, and DeepSeek's Native Sparse Attention. Each restricts attention to a local or learned subset of tokens.

Linear attention. Performer, Linformer, and Nyströmformer (Xiong et al., AAAI 2021), which uses Nyström-based low-rank approximation of the softmax kernel to achieve linear complexity. Recent production systems extend this thread: Qwen3-Next pairs Gated DeltaNet (a linear-attention variant) with full attention in a 3:1 ratio. These approaches help when sequence length dominates compute.

Hybrid architectures. Mamba-Transformer hybrids (Jamba, Nvidia Nemotron Nano 2) keep self-attention for short-range relationships and use SSM blocks for long-range dependencies. For video this maps naturally: most spatial reasoning is local, while temporal reasoning extends across many frames.

The structural pattern that holds for video is factorized spatial-temporal attention. Spatial attention within a frame is O(P²) where P is patches per frame and small; temporal attention across frames is O(T²) where T is frame count and can be large. Full attention on the spatial axis combined with linear or sparse attention on the temporal axis works well for most workloads, and recent open-weight video VLMs (Qwen3-VL, LLaVA-Video) converge here.
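
A minimal sketch of the factorized pattern follows, with vanilla full attention on both axes for brevity; in practice the temporal attention is where a linear or sparse variant would be swapped in.

import torch
import torch.nn as nn

class FactorizedSpaceTimeBlock(nn.Module):
    """Attention factorized along the data's structure: spatial within a frame
    (O(P^2) per frame), temporal across frames (O(T^2) per patch position)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, P, D) = batch, frames, patches per frame, feature dim
        B, T, P, D = x.shape

        # Spatial attention: each frame attends over its own P patches.
        s = x.reshape(B * T, P, D)
        s, _ = self.spatial(s, s, s)
        x = x + s.reshape(B, T, P, D)

        # Temporal attention: each patch position attends across the T frames.
        t = x.permute(0, 2, 1, 3).reshape(B * P, T, D)
        t, _ = self.temporal(t, t, t)
        x = x + t.reshape(B, P, T, D).permute(0, 2, 1, 3)
        return x

video_tokens = torch.randn(2, 16, 196, 256)          # 16 frames of 196 patches each
out = FactorizedSpaceTimeBlock(256)(video_tokens)     # same shape back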

Segmentation and Tracking on Device

Once you can encode and attend efficiently, the next question is what to extract from each frame, and segmentation and tracking are the workhorse primitives.

SAM (Kirillov et al., Meta, ICCV 2023) defined the prompt-driven segmentation foundation model, and SAM 2 (Ravi et al., Meta, 2024) extended it to video with a memory module that maintains separate FIFO queues for recent and prompted frames, plus object pointers, with temporal positional embeddings on the recent queue only. Several parallel lines take different architectural paths: XMem (Cheng et al., ECCV 2022) introduced the multi-store memory architecture (sensory, working, long-term) that informed many later designs; DEVA (Cheng et al., ICCV 2023) decouples task-specific image-level segmentation from a universal temporal propagation module trained once and reused across tasks; and Cutie (Cheng et al., CVPR 2024 Highlight) reads object-level memory through a query-based object transformer rather than propagating pixel-level features. SAM 2 and its compressed descendants dominate the foundation-model production stack today, while Cutie, DEVA, and XMem hold advantages in long-persistence, decoupled-task, and tight-memory regimes respectively.

Most of our work here has been on compression. EfficientSAM (CVPR 2024 Highlight) introduced SAMI, a masked image pretraining recipe that distills SAM's image encoder into much smaller backbones; the released ViT-T and ViT-S variants reach within a few mIoU points of the full SAM ViT-H at a fraction of the cost, and the open-source release made on-device segmentation practical for the first time. Efficient Track Anything (ICCV 2025) extended this to video with two changes: a plain non-hierarchical ViT replaces SAM 2's hierarchical encoder, and an efficient memory module reduces the cost of frame feature extraction and memory computation within SAM 2's bounded memory bank, yielding roughly 2x speedup on A100 with 2.4x parameter reduction at performance comparable to SAM 2, and ~10 FPS on iPhone 15 Pro Max. EdgeTAM (CVPR 2025) pushed further onto consumer silicon with a 2D Spatial Perceiver that compresses per-frame memory aggressively while preserving the spatial structure needed for accurate tracking, hitting J&F scores of 87.7 / 70.0 / 72.3 / 71.7 on DAVIS 2017, MOSE, SA-V validation, and SA-V test while running at 16 FPS on iPhone 15 Pro Max. That is the first time foundation-model-grade video tracking has been deployable on a consumer mobile device.

Most per-frame computation is redundant across adjacent frames, so memory-efficient propagation drives the production gains, not raw model size.

3D and Depth from Video

Segmentation and tracking handle 2D structure, but video also carries strong cues for 3D through parallax, motion, and temporal consistency. The methods that have stabilized are still predominantly image-based, applied per-frame or fed into multi-view reconstructors that treat sampled frames as views; truly temporal-video-native depth is an active but immature area. Extracting metric depth used to require specialized architectures.

DepthLM (ICLR 2026 Oral) shows that a vision-language model with a 3B-parameter backbone, trained with standard text-based supervised fine-tuning and no architecture change, can match or beat dedicated specialists like DepthPro and Metric3Dv2 on metric depth benchmarks. The recipe has three pieces: visual prompting that renders markers on images rather than using text coordinate prompts; intrinsic-conditioned augmentation that unifies focal length to resolve camera ambiguity during training; and supervised fine-tuning on sparsely labeled images, with just one labeled pixel per training image.

DepthLM is the VLM-based entry in a four-way race for metric depth. The dedicated specialists, DepthAnything (Yang et al., CVPR 2024) trained on 1.5M labeled and 62M+ unlabeled images and DepthAnything V2 (NeurIPS 2024) trained on ~595K synthetic-labeled and ~62M pseudo-labeled real images, plus DepthPro (Bochkovskii et al., Apple) and Metric3D v2, still set per-task SOTA on most depth benchmarks. The diffusion-prior approach is best represented by Marigold (Ke et al., CVPR 2024 Oral), which fine-tunes a pretrained image diffusion model and gets strong zero-shot generalization at the cost of latency. The reconstruction family, including DUSt3R and MASt3R (Naver Labs Europe) and the more recent VGGT (Visual Geometry Grounded Transformer, Wang et al., Oxford VGG and Meta AI, CVPR 2025 Best Paper), predicts 3D scene structure, camera parameters, and depth jointly from sparse views, which is useful when geometry matters more than per-pixel depth. Specialists win on raw accuracy, reconstruction wins when camera pose is needed, diffusion priors win on out-of-distribution generalization, and VLM-based approaches like DepthLM win when the same model handles depth and higher-level reasoning.

The implication is structural: if 3D understanding rides on the same VLM that handles reasoning, the stack collapses two perception models into one, and for an AR headset or a robot that simplifies deployment substantially.

Long-Form Video Understanding

Spatial primitives describe what is in a single frame. The harder problem is understanding what an entire video means as length grows from seconds to hours.

LongVU (ICML 2025) addresses this with spatiotemporal adaptive compression. The four-stage pipeline:

  1. Temporal redundancy removal via DINOv2. Sample at 1 FPS, compute DINOv2 features within non-overlapping 8-frame windows, drop frames whose features are highly similar to neighbors. Roughly 45.9% of frames are retained after this stage. DINOv2 is used here because its vision-centric self-supervised features are well-suited to inter-frame similarity pruning, while SigLIP is retained downstream for language-aligned semantics.
  2. Feature fusion. Extract SigLIP features from the surviving frames and combine them with DINOv2 features through a Spatial Vision Aggregator.
  3. Cross-modal query selection. Compute attention between frame features and the LLM's text-query embeddings; retain the top-Nh frames at full 144 tokens and reduce the rest to 64 tokens, balancing detail against budget.
  4. Spatial Token Compression. In sliding windows of 8 frames, the first frame keeps full token resolution while tokens in subsequent frames whose cosine similarity to the corresponding anchor token exceeds 0.8 are pruned, yielding about 40.4% additional token reduction.

LongVU is built on Qwen2-7B (with a Llama 3.2-3B lightweight variant) and reaches 60.6% on VideoMME and 65.4% on MLVU with 1 FPS adaptive sampling, outperforming uniform-frame baselines like LLaVA-OneVision while using a fraction of the tokens.
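
As a rough illustration of the stage-1 temporal pruning, the sketch below drops frames whose pooled features are nearly identical to the last kept frame within a window; the window size and similarity threshold are illustrative values, not LongVU's exact settings.

import torch
import torch.nn.functional as F

def prune_redundant_frames(frame_feats: torch.Tensor,
                           window: int = 8,
                           sim_threshold: float = 0.95) -> list[int]:
    """Keep a frame only if it differs enough from the last kept frame in its
    window. frame_feats: (T, D) pooled per-frame features (e.g. from DINOv2)."""
    keep: list[int] = []
    for start in range(0, frame_feats.size(0), window):
        last_kept = None
        for i in range(start, min(start + window, frame_feats.size(0))):
            if last_kept is None:
                keep.append(i)              # always keep the first frame of a window
                last_kept = frame_feats[i]
                continue
            sim = F.cosine_similarity(frame_feats[i], last_kept, dim=0)
            if sim < sim_threshold:         # different enough from the anchor: keep it
                keep.append(i)
                last_kept = frame_feats[i]
    return keep

feats = torch.randn(64, 768)                # 64 frames of pooled features
kept_indices = prune_redundant_frames(feats)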

Our follow-up Tempo pushes adaptive token allocation further. A small VLM up front acts as a query-aware compressor: it reads the question first, then routes token budget per-segment, swinging from 0.5 to 16 tokens per frame depending on relevance. The compressed representation is handed to a larger LLM for downstream reasoning. At an 8K visual token budget, the 6B Tempo model reaches 52.3 on LVBench (where videos average over an hour), beating both GPT-4o and Gemini 1.5 Pro at that budget.

LongVU and Tempo sit in a broader thread of compression approaches. LLaMA-VID (Li et al., ECCV 2024) takes aggressive context-token compression to an extreme: each frame is reduced to two learned tokens, a context token encoding instruction-guided information and a content token capturing visual cues, which enables very long videos at the cost of some spatial detail. VideoChat-Flash (ICLR 2026) introduces hierarchical clip-to-video token compression (clip-level during encoding, then video-level in the LLM context) inside a multi-stage short-to-long training scheme, achieving roughly 50x compression with minimal performance loss and 99.1% needle-in-a-haystack accuracy on 10K-frame inputs. PLLaVA and successors apply parameter-free pooling at the projection layer. Frontier multimodal models with very long native context windows (Gemini 2/3 with 1M+ tokens, recent Qwen3-VL variants) go the other way: rather than compress aggressively, they push the budget upward and let attention sort out relevance. The tradeoff is concrete: aggressive compression preserves on-device feasibility but can drop information, while large native contexts preserve information but require frontier-tier compute. LongVU sits at the on-device end of the spectrum, Gemini at the frontier end, and different deployment targets pick different points.

Long-form video understanding is dominated by token budget, and the field is converging on some combination of adaptive token allocation, memory mechanisms, and language-guided pruning. The open question is whether these techniques can work in streaming mode, where the model cannot see the whole video upfront, rather than batch; nobody has solved that cleanly.

Audio-Visual Fusion

Beyond length and spatial structure, audio is what disambiguates many videos, especially egocentric and conversational footage, and how a model fuses audio with the visual stream is a separate architectural choice from anything covered above.

Encoder stitching is the historical default: separate audio and visual encoders feed pooled embeddings into a language model. Cheap and modular, but cross-modal alignment is shallow because the encoders never see each other's data during training. Native multimodal training treats text, image, video, and audio tokens uniformly through a shared backbone. Qwen3-Omni is the strongest open-weight example as of April 2026, with state-of-the-art results on 22 of 36 audio and audio-visual benchmarks (32 of 36 among open-source models) while sharing weights with the visual stack, and Gemini's native multimodal architecture follows a similar internal pattern.

EgoAVU (CVPR 2026 Highlight) takes a third path. Rather than propose a new fusion architecture, EgoAVU builds the first large-scale egocentric audio-visual benchmark and dataset and evaluates how existing VLMs (Qwen2-VL, Gemini, LLaMA 3) perform when audio embeddings are stitched alongside the visual tokens. Audio in egocentric video carries distinct information from third-person video: ambient sound, hand-object contact noise, the wearer's own voice, and conversational partners are all anchored on the wearer's body in ways they are not in YouTube-style footage. The evaluation shows that audio adds substantial signal on egocentric understanding tasks and that stitched audio encoders into existing VLMs are already a strong baseline; the headroom is in better data and training, not in radical architectural changes.

Native multimodal wins at scale, but egocentric data is underrepresented in pretraining corpora and wearables are the deployment target where this distribution dominates. Benchmark-driven progress on the egocentric slice matters more for wearable products than for cloud video generally.

Reasoning Over Video

Encoding, compression, and fusion produce a representation; reasoning is what turns that representation into an answer. A VLM that watches a video and answers in one forward pass often fails on temporally-extended questions, because compressing hours of footage into a fixed-length representation and reading the answer back out drops too much nuance.

VideoAuto-R1 (CVPR 2026) starts from a counterintuitive observation: for RL-trained video VLMs, direct answering often matches or beats chain-of-thought reasoning while costing a lot more tokens. The proposed recipe is "reason-when-necessary." During training, the model first generates an initial answer, then performs reasoning, then outputs a reviewed final answer; both the initial and reviewed answers are supervised through verifiable rewards. At inference, the confidence of the initial answer determines whether to spend tokens on reasoning at all. The result: state-of-the-art accuracy on video QA and grounding benchmarks while reducing average response length roughly 3.3x (from ~144 to ~44 tokens). Thinking-mode activates rarely on perception-oriented questions and often on reasoning-intensive ones, which suggests that explicit reasoning helps but is not always necessary, and gating it on confidence is a meaningful efficiency win.
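
The gating logic itself is small. A sketch of the reason-when-necessary control flow follows, with trivial stand-ins for the decoding and confidence-scoring routines (the real system trains both answers with verifiable rewards and derives confidence from the model itself); the threshold is an illustrative value.

CONFIDENCE_THRESHOLD = 0.8   # illustrative value

def generate_answer(video_tokens, question):
    return "placeholder direct answer"          # stand-in for a direct decoding pass

def answer_confidence(video_tokens, question, answer):
    # In practice: e.g. mean token log-probability of the answer, or a learned head.
    return 0.6

def generate_with_reasoning(video_tokens, question, initial_answer):
    reasoning = "placeholder chain of thought"
    reviewed = "placeholder reviewed answer"
    return reasoning, reviewed

def answer_video_question(video_tokens, question):
    # Cheap pass first: direct answer, no chain of thought.
    answer = generate_answer(video_tokens, question)
    if answer_confidence(video_tokens, question, answer) >= CONFIDENCE_THRESHOLD:
        return answer                            # confident: skip reasoning entirely
    # Uncertain: spend extra tokens on explicit reasoning and a reviewed answer.
    _, reviewed = generate_with_reasoning(video_tokens, question, answer)
    return reviewed

print(answer_video_question(None, "What happens after the goal is scored?"))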

Several lines have converged on related patterns. Video-of-Thought (Fei et al., ICML 2024) introduced step-by-step video reasoning that decomposes a complex question from low-level pixel perception to high-level cognitive interpretation, paired with the MotionEpic VLM that grounds reasoning in spatial-temporal scene graphs. VideoTree (Wang et al., CVPR 2025) builds a query-adaptive hierarchical tree by iteratively selecting the keyframes most relevant to the question, achieving strong long-form QA without any training. Plan-and-execute approaches in the broader VLM-agent literature share the same structural pattern with different implementations. Single-pass video VLMs fail predictably on long-horizon questions, and the field has settled on two-stage inference. The remaining question is whether the reasoning step should be explicit (interpretable, easier to debug, slower) or implicit through learned routing (faster, harder to introspect).

Deployment: Where Video Intelligence Actually Runs

Video deployment splits into three tiers, and the choice between them is driven as much by economics, latency, and data residency as by raw model capability.

Cloud. Frontier APIs like Gemini's video understanding endpoints and the multimodal flagships from OpenAI and Anthropic that accept image and audio (with video typically handled via frame sampling); specialized providers like Twelve Labs (Marengo embeddings and Pegasus video LLM with hour-scale temporal segmentation); hyperscaler services like AWS Rekognition Video, Azure Video Indexer, and Google Video Intelligence. The cloud tier gets you the largest models and the longest context with no client-side complexity, but it pays in round-trip latency (hundreds of milliseconds minimum), cost (10-100x edge inference per task), and bandwidth that breaks for continuous video at scale.

Edge servers. On-prem GPU appliances or smart camera bridges, like Verkada's bridges, Hayden AI's on-device units, or industrial-inspection servers running Cosmos NIM. This tier trades the cloud's latency and data-residency problems for a hardware investment and a fragmented stack across customers, and supports mid-size models in the 3-30B range.

On-device. Mobile SoCs, AR glasses silicon, embedded NPUs. Apple Intelligence on iPhone, Qualcomm Robotics RB5/RB6 in robotics, Qualcomm Snapdragon AR1 in Ray-Ban Meta and Snapdragon XR2 Gen 2 in Quest 3. Zero-latency, fully private, no bandwidth, and it scales with device shipments. The cost is a tight power budget (1-30W), limited memory bandwidth, and a fragmented runtime landscape.

For continuous video the math forces the choice. A body-cam recording 12 hours per shift cannot ship 100GB per day to the cloud per officer, so the fast-thinking layer has to live on the device, with cloud or edge servers used for the deeper queries. Hybrid architectures, not pure cloud or pure on-device, are the production default.

Quantization Recipes for Video Models

Video models inherit the quantization recipes that have stabilized for LLMs and VLMs.

  • W4A16 (4-bit weights, 16-bit activations) is the default for VLMs and VLAs at the edge. Recent open releases including the Embedl-quantized Cosmos-Reason2 (2B) variants show the recipe holds across multimodal architectures with minimal accuracy loss.
  • NVFP4 (4-bit weights and 4-bit activations in NVIDIA's FP4 format with per-block-of-16 FP scales) unlocks Blackwell-tier hardware (Jetson AGX Thor) and is the production-grade upgrade where supported.
  • W8A8 remains the safer fallback for mature vision and segmentation models.
  • Sub-4-bit quantization (W2A16, ternary, mixed precision) continues to improve. Our ParetoQ (NeurIPS 2025) work mapped the full quantization Pareto frontier and showed that at 2 bits and below, models learn fundamentally different representations than at 3-4 bits; for a fixed memory budget, a larger 2-bit model can beat a smaller 4-bit model. That shifts the design space for very-low-power video deployment, though it still requires QAT and is not yet standard for production VLMs.
  • KV cache quantization matters more for video than for text. The KV cache for a long video can dominate memory, and rotation-based methods like SpinQuant (which jointly quantize weights, activations, and KV cache) have been particularly effective at compressing it to 3-4 bits per element.
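
For readers who have not looked inside these recipes, here is a minimal sketch of what W4A16 means mechanically: weights are rounded to 4-bit integers with one scale per small group and dequantized back for the matmul, while activations are left untouched. Group size and symmetric rounding are illustrative choices; production kernels also pack two int4 values per byte and keep scales in FP16.

import torch

def quantize_w4_groupwise(w: torch.Tensor, group_size: int = 128):
    """Symmetric 4-bit weight quantization with one scale per group of
    `group_size` weights along the input dimension (illustrative recipe)."""
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    groups = w.reshape(out_features, in_features // group_size, group_size)
    scale = groups.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0   # int4 range [-8, 7]
    q = torch.clamp(torch.round(groups / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize_w4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Reconstruct the weight matrix; activations are never quantized (the 'A16' part)."""
    return (q.float() * scale).reshape(q.shape[0], -1)

w = torch.randn(256, 1024)
q, scale = quantize_w4_groupwise(w)
w_hat = dequantize_w4(q, scale)
x = torch.randn(4, 1024)                         # activations stay in high precision
y = x @ w_hat.T
print((w - w_hat).abs().mean())                  # small per-weight reconstruction error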

Runtime Stack

For PyTorch-based deployment, ExecuTorch (Meta) is the natural path. ExecuTorch reached 1.0 GA in October 2025 and now powers Meta's on-device AI across Instagram, WhatsApp, Messenger, Facebook, Quest 3, and Ray-Ban Meta, with backends spanning Apple Core ML, Qualcomm QNN, Arm, MediaTek NeuroPilot, and Vulkan. For video pipelines, ExecuTorch's support for streaming inference and selective recomputation matters because re-encoding every frame from scratch is wasteful. Other paths cover other ecosystems: Apple Core ML for Apple platforms, LiteRT-LM plus Qualcomm QNN for Android, Nvidia Isaac plus NIM on Jetson, Intel OpenVINO for x86 industrial. No single runtime wins, and production video systems usually ship the same model compiled for several backends.

What's Still Hard

Several problems remain open across the stack.

Continuous-stream understanding at hour-plus durations. LongVU and similar techniques assume batch mode where the whole video is available. Streaming mode, where the model has to maintain understanding while video keeps arriving, is much harder. Memory mechanisms, retrieval-augmented architectures, and incremental token compression are all in progress; none are solved cleanly.

Sparse-event detection. Most production video is uninteresting. Finding the three frames out of 86,400 that matter, without paying for full inference on all 86,400, requires hierarchical attention or learned selection. Schema-driven extraction over known classes now ships commercially (Twelve Labs' Pegasus pulls structured metadata against a customer-defined schema); open-set "show me anything anomalous" remains unsolved.

Cross-camera and cross-clip reasoning. A surveillance ops team often wants to ask questions across many cameras and many time windows. Library-scale retrieval over indexed videos ships (Twelve Labs' Marengo ranks moments across a video library), but that is ANN retrieval over independent embeddings, not joint reasoning. Multi-stream attention, cross-camera identity persistence, and global temporal reasoning are all open.

Real-time sub-watt inference for AR glasses. Today's mobile NPUs do tens of TOPS in tens of milliwatts, but an AR glass AI assistant needs to do continuous video understanding inside a 1-3W envelope that includes everything else the system runs. EUPE-style universal compact encoders, EdgeTAM-style efficient tracking, and aggressive quantization all help, but the gap to always-on Gemini-grade understanding on glasses is still 5-10x in compute efficiency.

Closed-loop evaluation. Public benchmarks measure accuracy on curated multiple-choice question sets. Production systems care about latency under load, drift under deployment shifts, robustness to camera placement and lighting, and intervention rates. Closed-loop methodology lags benchmark accuracy by a wide margin.

Audio-visual generative consistency. When video models generate or edit content rather than understand it (out of scope for most of this post), keeping audio synchronized with visual events is unsolved, which is why most current text-to-video models ship without working audio.

Cross-modal grounding stability. When a VLM is asked "what is the man in the blue shirt doing?", the model often fails not on language understanding but on grounding the referent across frames. Timestamp-level grounding ships commercially (Pegasus localizes answers to start/end times); spatial grounding (bounding boxes, referent IDs across cuts) still requires bolting on SAM 2 or Grounding DINO.

Closing

A handful of patterns recur across encoding, perception, compression, fusion, reasoning, and deployment. Compress where redundancy is highest, which for video is almost always the temporal axis. Distill universal encoders from multiple teachers rather than ship a fleet of specialists. Factorize attention along the physical structure of the data: spatial within frames, temporal across frames, cross-modal across modalities. Treat quantization as the default rather than as a late optimization. Gate reasoning on confidence rather than running it on every input.

The encoder, compression, and fusion patterns are now stable; the streaming, sub-watt deployment, and closed-loop evaluation patterns are not. The open problems left in efficient video intelligence are mostly about scaling the stable recipes to streaming inputs, sub-watt power envelopes, and production deployments where evaluation has to track a system rather than a benchmark. The work ahead lives in the deployment stack at least as much as in the model layer.

Scaling Long-Horizon Coding Agents (28 minute read)

Scaling Long-Horizon Coding Agents (28 minute read)

AI
Meta researchers developed a framework that improves coding agents by summarizing their extended work sessions into reusable structured knowledge, achieving significant benchmark gains.
What: A test-time scaling framework that addresses a core limitation of long-horizon coding agents. Instead of just generating more attempts, it converts each agent's work trajectory (actions, errors, partial progress) into compact summaries that can be compared and reused across attempts.
Why it matters: Traditional test-time scaling works for short outputs that can be easily compared, but coding agents produce extended sequences of work that are hard to evaluate directly. The breakthrough is treating this as a representation problem rather than just a generation problem.
Deep dive
  • The framework introduces two complementary scaling approaches: Recursive Tournament Voting (RTV) for parallel scaling and adapted Parallel-Distill-Refine (PDR) for sequential scaling
  • RTV recursively narrows down a population of rollout summaries through small-group comparisons, similar to a tournament bracket
  • PDR conditions new agent attempts on distilled summaries from previous rollouts, enabling knowledge transfer between sequential attempts
  • Structured summaries preserve salient hypotheses, progress tracking, and failure modes while discarding low-signal trace details
  • Claude-4.5-Opus achieved 77.6% on SWE-Bench Verified (up from 70.9%) using the mini-SWE-agent implementation
  • On Terminal-Bench v2.0 with Terminus 1, performance jumped from 46.9% to 59.1%
  • The research reframes test-time scaling for agents as fundamentally about representation, selection, and reuse rather than raw generation volume
  • The 70-page paper includes extensive evaluation across multiple frontier coding agents and benchmark datasets
  • Results suggest that effective knowledge representation between attempts is more valuable than simply running more parallel attempts
Decoder
  • Test-time scaling: Improving model performance by using more computation during inference (when answering queries) rather than during training
  • Rollout trajectories: The complete sequence of actions, observations, errors, and states an agent goes through while attempting to solve a problem
  • SWE-Bench Verified: A benchmark dataset for evaluating coding agents on real-world software engineering tasks from GitHub issues
  • Terminal-Bench: A benchmark for testing coding agents on terminal-based development tasks
  • Recursive Tournament Voting (RTV): A selection method that repeatedly pairs and compares solutions in groups to identify the best candidates
  • Parallel-Distill-Refine (PDR): A technique that generates multiple attempts in parallel, extracts key insights, and uses them to improve subsequent attempts
Original article

Scaling Test-Time Compute for Agentic Coding

Test-time scaling has become a powerful way to improve large language models. However, existing methods are best suited to short, bounded outputs that can be directly compared, ranked or refined. Long-horizon coding agents violate this premise: each attempt produces an extended trajectory of actions, observations, errors, and partial progress taken by the agent. In this setting, the main challenge is no longer generating more attempts, but representing prior experience in a form that can be effectively selected from and reused. We propose a test-time scaling framework for agentic coding based on compact representations of rollout trajectories. Our framework converts each rollout into a structured summary that preserves its salient hypotheses, progress, and failure modes while discarding low-signal trace details. This representation enables two complementary forms of inference-time scaling. For parallel scaling, we introduce Recursive Tournament Voting (RTV), which recursively narrows a population of rollout summaries through small-group comparisons. For sequential scaling, we adapt Parallel-Distill-Refine (PDR) to the agentic setting by conditioning new rollouts on summaries distilled from prior attempts. Our method consistently improves the performance of frontier coding agents across SWE-Bench Verified and Terminal-Bench v2.0. For example, by using our method Claude-4.5-Opus improves from 70.9% to 77.6% on SWE-Bench Verified (mini-SWE-agent) and 46.9% to 59.1% on Terminal-Bench v2.0 (Terminus 1). Our results suggest that test-time scaling for long-horizon agents is fundamentally a problem of representation, selection, and reuse.
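
The abstract describes Recursive Tournament Voting only at a high level. As a rough illustration of the shape of such a procedure (not the paper's implementation), a tournament over rollout summaries can be written as a loop of small-group comparisons in which a judge, here a hypothetical callable that would wrap an LLM prompt, picks a winner per group until one summary remains.

```python
import random


def recursive_tournament(summaries, judge, group_size=4):
    """Tournament-style selection sketch over rollout summaries.

    summaries:  list of structured summaries, one per agent rollout
    judge:      hypothetical callable taking a small group of summaries and
                returning the index of the most promising one (e.g. an LLM call)
    group_size: how many summaries are compared at once
    """
    pool = list(summaries)
    while len(pool) > 1:
        random.shuffle(pool)
        winners = []
        for i in range(0, len(pool), group_size):
            group = pool[i:i + group_size]
            winners.append(group[judge(group)])
        pool = winners            # each round narrows the population
    return pool[0]
```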

Meta's loss is Thinking Machines' gain (3 minute read)

Meta's loss is Thinking Machines' gain (3 minute read)

AI
AI startup Thinking Machines Lab is successfully recruiting top researchers from Meta, including PyTorch co-founder Soumith Chintala, by offering equity upside from its $12 billion valuation despite Meta's seven-figure cash packages.
What: Thinking Machines Lab, an AI startup valued at $12 billion with 140 employees, has been hiring researchers from Meta at a higher rate than from any other employer. Notable hires include Soumith Chintala (PyTorch co-founder, now CTO) and Piotr Dollár (Segment Anything co-author). The company just secured a multibillion-dollar Google Cloud deal for access to Nvidia's GB300 chips.
Why it matters: The two-way talent war between Meta and Thinking Machines reveals how AI startups can compete with Big Tech's massive salaries through equity incentives, and shows the strategic importance of infrastructure deals in attracting top-tier talent. With TML still early-stage (one product released) but highly valued, employees are betting on significant upside.
Deep dive
  • Thinking Machines Lab has hired more researchers from Meta than from any other single company, including PyTorch co-founder Soumith Chintala as CTO and Segment Anything co-author Piotr Dollár
  • The talent flow runs in both directions: Meta has poached seven of TML's founding members, while TML continues to recruit from Meta's research divisions
  • TML just signed a multibillion-dollar cloud deal with Google announced at Cloud Next, providing access to Nvidia's latest GB300 chips and placing it in the same infrastructure tier as Anthropic and Meta
  • The startup is valued at $12 billion with around 140 employees, having released just one product so far, but offers significant equity upside compared to OpenAI and Anthropic's record-breaking valuations
  • Recent Meta-to-TML hires include Weiyao Wang (8 years building multimodal perception and SAM3D), James Sun (9 years on LLM training), and Andrea Madotto (FAIR multimodal language models researcher)
  • TML has also recruited top talent from Cognition (Neal Wu, three-time gold medalist at International Olympiad in Informatics), OpenAI, Waymo, Anthropic, Apple, and Microsoft's AI Superintelligence team
  • Meta reportedly held acquisition talks with Thinking Machines around April 2025 before the talent competition intensified
  • The financial calculus for researchers: Meta offers seven-figure packages with no strings attached, while TML offers equity in a $12B company still early enough for major upside potential
Decoder
  • PyTorch: Open source deep learning framework co-founded by Soumith Chintala at Meta, now the foundation for most AI research worldwide
  • Segment Anything (SAM): Influential computer vision model from Meta that can segment any object in images; SAM3D is the 3D version
  • FAIR: Facebook AI Research, Meta's research division focused on advancing AI
  • Multimodal: AI systems that can process and understand multiple types of data like text, images, and audio together
  • GB300: Nvidia's latest generation of GPU chips designed for AI workloads
  • Pre-training and post-training: The two main phases of developing large language models—pre-training on massive datasets, then post-training for specific tasks and safety
Original article

Weiyao Wang spent eight years at Meta — his first job out of college — helping build multimodal perception systems and contributing to open-world segmentation projects, including SAM3D. His final day at Meta was last week, and he has since joined Thinking Machines Lab (TML).

His move to TML comes as the AI startup expands on multiple fronts. It just signed a multibillion-dollar cloud deal with Google, giving it access to Nvidia's latest GB300 chips and making it one of the first startups to run on the hardware.

The agreement, announced this past Tuesday at Google Cloud Next, follows an earlier partnership with Nvidia, and puts TML in the same infrastructure tier as Anthropic and Meta. (Meta reportedly held talks to acquire Thinking Machines around this time last year and has more recently been picking off TML's founders one by one.)

The talent picture remains fluid. Wang and Kenneth Li — a Harvard PhD who spent 10 months at Meta before joining TML this month — are the latest examples of a talent grab that runs in both directions. Business Insider reported last week that Meta has now poached seven of TML's founding members. A review of recent hires, based on LinkedIn profiles, suggests Thinking Machines is raiding Meta right back: TML appears to have hired more researchers from Meta than from any other single employer.

The most prominent is Soumith Chintala, TML's CTO, who spent 11 years at Meta and co-founded PyTorch, the open source deep learning framework that now underpins most of the world's AI research. He left Meta in late 2025 and was appointed CTO earlier this year. Piotr Dollár, another 11-year Meta veteran who served as research director and co-authored the influential Segment Anything model, is now on TML's technical staff. Andrea Madotto, a research scientist in Meta's FAIR division focused on multimodal language models, joined TML in December. James Sun, a software engineer with nearly nine years at Meta working on LLM pre- and post-training, also made the jump.

TML has drawn talent from beyond Meta, too. Neal Wu — a three-time gold medalist at the International Olympiad in Informatics and a founding member of the buzzy coding startup Cognition — joined early this year. Jeffrey Tao came via Waymo, Windsurf, and OpenAI. Muhammad Maaz previously held a research fellowship at Anthropic. Erik Wijmans arrived from Apple. Liliang Ren spent two and a half years on Microsoft's AI Superintelligence team pre-training OpenAI models for code before joining in March.

The startup's headcount now stands at around 140.

Meta's pay packages — seven figures, no strings attached — are well known by now. For researchers weighing their other options, the calculus may be as simple as this: Thinking Machines Lab is right now valued at $12 billion. Though that figure would've been unimaginable for a company at this stage in any previous tech cycle (it has released just one product so far), compared with the record-breaking valuations of OpenAI and Anthropic, there's still a lot of financial upside.

Reached Friday morning, a spokesperson for TML declined to comment for this story.

OpenAI Posts Five-Principle Framework for AGI, Altman Concedes Bigger Role (2 minute read)

OpenAI Posts Five-Principle Framework for AGI, Altman Concedes Bigger Role (2 minute read)

AI
OpenAI has published a five-principle framework for AGI development, its first major governance statement since 2018, as regulatory pressure on frontier AI labs intensifies.
What: OpenAI released a framework outlining five principles for developing artificial general intelligence, committing to resist power consolidation in the hands of a few entities. The statement arrives as US and European regulators tighten oversight of leading AI laboratories.
Why it matters: This represents a public accountability move from a major AI lab facing increased regulatory scrutiny, signaling that companies developing advanced AI systems are being pushed to formalize governance commitments.
Decoder
  • AGI (Artificial General Intelligence): AI systems that can perform any intellectual task that humans can, as opposed to narrow AI designed for specific tasks
  • Frontier AI labs: Companies developing cutting-edge, most-advanced AI systems
Original article

OpenAI has published a five-principle framework for the development of artificial general intelligence. It is the company's most prominent statement of intent since its 2018 Charter. The lab claims it will resist letting the technology consolidate power in the hands of the few. The framework arrives at a time when US and European regulators are tightening oversight of frontier AI labs.

Cursor's $60 Billion Escape Hatch (5 minute read)

Cursor's $60 Billion Escape Hatch (5 minute read)

AI
SpaceX secured a $60 billion option to acquire AI coding tool Cursor, whose power users have driven gross margins to negative 23% due to expensive API fees from Anthropic and OpenAI.
What: Cursor, an AI coding assistant generating $2.7 billion in annualized revenue, signed a deal giving SpaceX the option to acquire it for $60 billion or pay $10 billion for their collaboration work, with the primary benefit being access to SpaceX's Colossus supercomputer to reduce reliance on third-party AI model providers.
Why it matters: The deal illustrates how AI coding tools have inverted traditional SaaS economics—power users who generate the most value also consume the most compute, making them unprofitable to serve at current pricing, and shows how infrastructure access has become as strategic as the AI models themselves.
Deep dive
  • Cursor attempted to raise billions at a $50 billion valuation but late-stage investors like Iconiq declined, having already deployed capital into OpenAI and Anthropic and unwilling to back what they viewed as a competitor
  • The company had lined up $2 billion from Nvidia, Andreessen Horowitz, and Thrive Capital before canceling the round after the SpaceX deal was announced
  • Cursor lost nearly $900 million in its last fiscal year despite strong revenue growth, highlighting the unsustainable unit economics of AI-powered developer tools at scale
  • SpaceX's financial picture is increasingly complex ahead of its June IPO, with a $20 billion bridge loan to refinance debt tied to X and xAI, down from $22 billion total
  • xAI alone spent $12.7 billion in capital expenditures last year while generating only $3.2 billion in revenue, suggesting massive infrastructure buildout costs
  • The deal may indicate SpaceX is using IPO momentum to absorb cash-hungry AI businesses before public investors scrutinize the full financials in the S-1 filing
  • Anthropic ran a holdback test withholding Claude Code from 2% of new Pro subscriber signups to measure feature value, drawing criticism despite being standard A/B testing practice
  • OpenAI initially restricted GPT-5.4-Cyber variant to verified partners via its Trusted Access for Cyber program, citing cybersecurity concerns about releasing the model broadly
  • Nine days later, OpenAI reversed course and released GPT-5.5 to all users via chatbot, then added API access just one day after that, raising questions about whether safety concerns are genuine or commercially motivated
  • GitHub paused new Copilot paid plan signups after agentic workflows consumed more compute than monthly subscription fees could cover, with some requests costing more than the entire plan price
  • Amazon committed up to $25 billion in additional Anthropic funding at a $380 billion valuation, with Anthropic pledging over $100 billion in AWS spending over 10 years on custom silicon
  • xAI held talks with French AI lab Mistral and Cursor about a three-way partnership to compete with Anthropic and OpenAI, with former Mistral cofounder Devendra Chaplot already running pretraining at xAI
  • Meta deployed "Model Capability Initiative" surveillance software on US employees' work laptops to capture mouse movements, keystrokes, and periodic screenshots as training data for computer-use agents, with no opt-out available despite internal backlash
Decoder
  • Gross margins: Revenue minus cost of goods sold, expressed as a percentage; negative margins mean it costs more to deliver the service than customers pay
  • Annualized revenue: Monthly or quarterly revenue multiplied to estimate what full-year revenue would be at the current run rate
  • Colossus: SpaceX's supercomputer cluster, presumably built for AI training and inference workloads
  • Holdback test: An A/B testing methodology where a feature is withheld from a small percentage of users to measure its value by comparing behavior between groups
  • Agentic workflows: AI systems that can autonomously execute multi-step tasks rather than just responding to single prompts
  • ICHRA: Individual Coverage Health Reimbursement Arrangement, a type of employer health benefit where companies reimburse employees for individual health insurance premiums
Original article

Cursor's $60 Billion Escape Hatch

SpaceX Secures Option to Acquire Cursor

SpaceX announced this week that it has secured an option to acquire AI coding startup Cursor for $60 billion, or pay $10 billion for the work they're doing together if it doesn't end up acquiring the company. The agreement gives Cursor access to SpaceX's Colossus supercomputer and a path to reducing its dependence on Anthropic and OpenAI, whose models currently power much of Cursor's product and whose fees have weighed heavily on its margins.

The timing of the SpaceX deal is interesting. Just a few weeks earlier, Cursor had been quietly attempting to raise billions in private markets, but had encountered a lack of interest from late-stage investors like Iconiq, many of whom had just deployed capital into OpenAI and Anthropic and weren't ready to back a competitor at a $50 billion valuation. Cursor's gross margins were negative 23% as of January, an unusual position for a company generating $2.7 billion in annualized revenue. The company ultimately lined up $2 billion from Nvidia, a16z, and Thrive before canceling the round after the SpaceX deal was announced.

The Cursor deal complicates an already complicated financial picture for SpaceX ahead of its June IPO. Reuters reported this week that SpaceX took out a $20 billion bridge loan last month to refinance debt tied to X and xAI, reducing total debt from $22 billion to $20 billion, with repayment potentially contingent on IPO proceeds. xAI spent $12.7 billion in capital expenditures last year while generating only $3.2 billion in revenue. Cursor lost nearly $900 million in its last fiscal year. This may indicate that SpaceX is using its IPO momentum to paper over a collection of cash-hungry businesses before public market investors get a full look at the S-1.

Anthropic's Holdback Test Draws Criticism

A tweet claiming that Anthropic was no longer offering Claude Code access to Pro subscribers paying $20 per month made the rounds on social media this week. This was seen as an indication that the company would need to take drastic measures to maintain service for customers amidst an ongoing compute shortage. Anthropic has not said how many Claude Code users it has currently, or how many of those users are Pro or Max subscribers.

As it turns out, the situation was overstated. Anthropic's head of growth, Amol Avasare, responded directly to the post, saying that "we're running a small test on ~2% of new prosumer signups. Existing Pro and Max subscribers aren't affected." In A/B testing, this is called a "holdback test," in which the value of a feature is measured by excluding a small subset of users from accessing it. Nevertheless, many questioned the wisdom of running a test like this on a tool as widely used as Claude, and competitors were quick to pile on. OpenAI directly implied it was a violation of customer trust, and Sam Altman mockingly replied "ok boomer" to Amol's post.
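
Mechanically, a holdback is usually nothing more than deterministic bucketing of users. The sketch below is a generic illustration of that idea, not Anthropic's implementation; the 2% threshold simply mirrors the figure quoted above.

```python
import hashlib


def in_holdback(user_id: str, experiment: str = "feature-holdback", percent: float = 2.0) -> bool:
    """Deterministically assign roughly `percent`% of users to the holdback group.

    Hashing (experiment, user_id) yields a stable bucket in [0, 100), so a given
    user always lands in the same group and assignments differ across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000 / 100.0   # 0.00 .. 99.99
    return bucket < percent
```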

OpenAI's Changing Tune

On April 14, just one week after Anthropic said Mythos was too powerful to release publicly because of cybersecurity concerns, OpenAI published a blog post announcing that its newest model would not be released broadly to the public and would instead be accessible only to verified partners via a tiered cybersecurity access program introduced in February called Trusted Access for Cyber (TAC). This specifically pertained to a variant of GPT-5.4, with OpenAI saying "we are fine-tuning our models specifically to enable defensive cybersecurity use cases, starting today with a variant of GPT‑5.4 trained to be cyber-permissive: GPT‑5.4‑Cyber."

Then, on April 23rd, OpenAI announced that its newest model, GPT-5.5, would in fact be available to all users, though only via chatbot, not API. The press release said this was because "API deployments require different safeguards," but also that "We'll bring GPT‑5.5 and GPT‑5.5 Pro to the API very soon." On April 24th, the company either changed its mind or implemented those safeguards very quickly, and GPT-5.5 became available via the API that afternoon. The company's evolving stance has fueled skepticism about the actual risks of the models and supported the view that current limits are motivated by the business model, not safety concerns.

Latest Research

What We're Reading

  • Anthropic is investigating how unauthorized users gained access to its restricted Mythos cybersecurity model by combining a third-party contractor's credentials with data from a breach at AI startup Mercor Inc. to locate the model's endpoint.
  • xAI held talks in recent weeks with French lab Mistral and AI coding firm Cursor about a three-way partnership to close the gap on Anthropic and OpenAI, on top of SpaceX's newly disclosed $60 billion call option to acquire Cursor, with ex-Mistral cofounder Devendra Chaplot already running pretraining at xAI.
  • Apple announced that Tim Cook will step up to executive chairman and SVP of Hardware Engineering John Ternus will take over as CEO effective September 1, 2026, concluding Cook's 15-year tenure during which Apple's market cap grew from ~$350 billion to $4 trillion.
  • Blue Origin successfully recovered the New Glenn first stage rocket in its second flight last week, but the rocket's payload, an AST SpaceMobile communications satellite, ended up in an "off-nominal orbit," temporarily grounding New Glenn.
  • Tesla expanded its Robotaxi service to Dallas and Houston, but Robotaxi Tracker data suggests "there may be just one vehicle operating in each city so far," with tiny geofences relative to each city's footprint.
  • GitHub paused new sign-ups for paid Copilot plans and tightened usage limits after agentic workflows consumed more compute than monthly fees cover, with VP of product Joe Binder writing, "It's now common for a handful of requests to incur costs that exceed the plan price."
  • Amazon committed up to $25 billion in additional funding to Anthropic ($5B immediately, $20B tied to "certain commercial milestones") at Anthropic's $380B valuation, with Anthropic pledging $100B+ on AWS over 10 years. Amazon CEO Andy Jassy said the commitment "reflects the progress we've made together on custom silicon."
  • Meta Superintelligence Labs rolled out its "Model Capability Initiative" on Meta US employees' work laptops to capture mouse movements, keystrokes, and periodic screenshots as training data for computer-use agents. Despite internal backlash, the company has clarified that there is no option to opt out of the initiative.
  • Commerce Secretary Howard Lutnick told a Senate hearing that despite Trump lifting the H200 export ban four months earlier, no H200 chips have been sold to Chinese firms, saying "The Chinese central government has not let them, as of yet, buy the chips."
  • Trump issued an executive order directing the FDA to prioritize clinical trials and "Right to Try" access for psilocybin, MDMA, and ibogaine, and allocated $50 million in HHS matching funds for state research programs, triggering rallies in psychedelic drug-developer stocks.
  • The Trump administration is nearing a deal to lend Spirit Airlines up to $500 million in exchange for warrants giving the US government a potential 90% equity stake post-bankruptcy, sparking pushback from Transportation Secretary Sean Duffy, who asked, "If you do Spirit, who comes next?"
  • About 30K Samsung workers rallied at the Pyeongtaek chip complex in South Korea, demanding 15% of operating profit be paid out to chip-division employees. This would amount to more than 40 trillion won (~$27 billion), averaging $400K+ per worker, with the union threatening an 18-day strike starting May 21, following Samsung's record Q1 operating profit forecast of 57.2 trillion won.
Cohere and Aleph Alpha Join Forces (3 minute read)

Cohere and Aleph Alpha Join Forces (3 minute read)

AI
Cohere and Aleph Alpha are partnering to build a sovereign AI alternative for enterprises and governments that want control over their AI infrastructure without depending on big tech.
What: Canadian AI company Cohere is joining forces with German AI company Aleph Alpha to create an independent, enterprise-grade AI offering focused on data sovereignty, targeting highly-regulated sectors like finance, defense, and healthcare, with $600M in Series E funding led by Schwarz Group.
Why it matters: This addresses growing concerns about dependence on single AI vendors and jurisdictions, particularly important for European organizations with strict data privacy requirements and governments seeking digital independence from US hyperscalers.
Takeaway: Organizations in regulated industries can evaluate this sovereign AI option as an alternative to hyperscaler-based AI services, particularly if data residency and vendor independence are critical requirements.
Decoder
  • Sovereign AI: AI systems where the operator maintains full control over data and infrastructure, typically within specific jurisdictional boundaries, without dependence on foreign tech giants
  • Frontier models: The most advanced, cutting-edge AI models at the current state of the art
  • STACKIT: Schwarz Group's sovereign cloud infrastructure platform that will serve as the technical backbone for this partnership
Original article

We're joining forces with Aleph Alpha to provide the world with an independent, enterprise-grade sovereign alternative in an era of growing AI concentration.

This transatlantic alliance would combine Cohere's global AI scale with Aleph Alpha's research excellence and deep institutional relationships, forging a globally competitive AI champion backed by the Canadian and German ecosystems.

By pooling top-tier engineering talent and computational resources across two G7 nations, the partnership aims to significantly accelerate the development of next-generation frontier models and systems while providing a secure alternative to dependence on any single vendor or infrastructure stack.

The market for AI services is projected to surpass $1 trillion annually, with sovereign AI needs representing nearly $600B of that total (McKinsey, March 2026). The partnership uniquely bridges the gap between these segments with its sovereign-first approach, capturing the critical intersection where sovereignty requirements meet broad enterprise AI adoption.

Our Co-Founder and CEO, Aidan Gomez comments: "Combining the strengths of Cohere and Aleph Alpha accelerates our global expansion and advances our mission to deliver sovereign AI to nations around the world. Organizations globally are demanding uncompromising control over their AI stack. This transatlantic partnership unlocks the massive scale, robust infrastructure, and world-class R&D talent required to meet that demand.

Built on the bedrock of shared Canadian and German values—where privacy, security, and responsible innovation are paramount—we are uniquely positioned to be the world's trusted AI partner. Together, we will give enterprises and governments across Canada, Europe, and the world the technology to move from exploration to rapid, secure implementation, with the absolute certainty that their data remains their own."

Through the planned deal, collectively we aim to deliver a secure alternative for customized AI in highly regulated sectors, including the public sector, finance, defense, energy, manufacturing, telecommunications, and healthcare. Aleph Alpha's experience deploying AI in long-standing customer relationships provides an important foundation for this sovereign offering. As part of this partnership, the combined entity will partner with the companies of Schwarz Group, an international leader in the retail industry, to deploy a sovereign offering on its cloud service STACKIT.

Ilhan Scheer, Co-CEO of Aleph Alpha adds: "Aleph Alpha is in a unique position in Europe. We develop specialized large language models for Europe without compromising on Sovereignty, Transparency and Regulatory Compliance. By living this responsibility, we serve as a trusted and strategic partner to public sector and enterprise customers in Europe. Together with Cohere, we are building a real counterweight for organizations that refuse to outsource control over their AI to a single provider or jurisdiction, giving European institutions and enterprises access to powerful, yet controllable AI they can truly own."

In addition, the companies of Schwarz Group intend to back our upcoming Series E funding as lead investor with a $600M (€500M) structured financing commitment. The round is already attracting strong interest from the world's leading investors who recognize the necessity of an independent global AI powerhouse.

In a joint statement, Rolf Schumann and Christian Müller, Co-CEOs of Schwarz Digits, said: "With this investment, the companies of Schwarz Group position themselves as lead investors for digital sovereignty and infrastructure. Building this infrastructure is a strategic necessity to help shape the AI revolution based on values such as trust, fairness, and responsibility. The establishment of STACKIT, Schwarz Digits' sovereign cloud infrastructure, as the technical backbone of this transatlantic AI initiative empowers organizations to strengthen their digital independence and maintain control over their data. This is true leadership in digital sovereignty."

An amateur just solved a 60-year-old math problem—by asking AI (7 minute read)

An amateur just solved a 60-year-old math problem—by asking AI (7 minute read)

AI
A 23-year-old amateur used ChatGPT to solve a 60-year-old mathematical conjecture that had stumped expert mathematicians, with the AI discovering an entirely new proof approach that may have broader applications.
What: Liam Price, who has no advanced math training, prompted GPT-5.4 Pro with an unsolved Erdős problem about primitive sets (collections of whole numbers where none divide each other) and received a proof that leading mathematicians Terence Tao and Jared Duker Lichtman validated and refined.
Why it matters: Unlike previous AI math solutions that replicated known approaches, ChatGPT applied a formula from related math areas that no human had thought to use for this problem type, suggesting AI can break through human cognitive blind spots and potentially "discovered a new way to think about large numbers and their anatomy" according to Tao.
Deep dive
  • The problem asked whether the maximum "Erdős sum" score for primitive sets approaches exactly one as the numbers in the set approach infinity, a conjecture left unsolved since the 1960s
  • Price submitted the problem to ChatGPT on "an idle Monday afternoon" without knowing its history or that prominent mathematicians had failed to solve it
  • Terence Tao says human mathematicians "collectively made a slight wrong turn at move one," following a standard sequence of approaches that led nowhere
  • The AI's raw proof output was "quite poor" and required expert mathematicians to extract and understand the core insight
  • ChatGPT used a well-known formula from adjacent mathematical domains that no one had thought to apply to primitive set problems
  • Tao and Lichtman have since distilled the proof and already identified other potential applications of the method
  • Lichtman, who proved a related Erdős conjecture in his 2022 doctoral thesis but got stuck on this one, says the new method confirms his graduate school intuition that these problems "were kind of clustered together"
  • Price and collaborator Kevin Barreto sparked the "AI-for-Erdős craze" in late 2025 by randomly prompting free ChatGPT with open Erdős problems
  • An AI researcher gifted them ChatGPT Pro subscriptions to encourage their "vibe mathing" experiments
  • Experts caution the long-term significance is still uncertain, but this appears to be a genuine novel contribution rather than rediscovery of existing work
  • The breakthrough suggests AI language models may excel at bypassing human mental blocks and connecting disparate mathematical domains
Decoder
  • Erdős problems: Unsolved mathematical conjectures left by prolific mathematician Paul Erdős, ranging widely in difficulty and significance
  • Primitive sets: Collections of whole numbers where no number can be evenly divided by any other number in the set (generalizes the concept of prime numbers)
  • Erdős sum: A calculated "score" for primitive sets that Erdős proved has a maximum value
  • LLM (Large Language Model): AI systems like ChatGPT trained on vast text corpora to generate human-like responses
Original article

An amateur just solved a 60-year-old math problem—by asking AI

A ChatGPT AI has proved a conjecture with a method no human had thought of. Experts believe it may have further uses

By Joseph Howlett; edited by Lee Billings


Liam Price just cracked a 60-year-old problem that world-class mathematicians have tried and failed to solve. He's 23 years old and has no advanced mathematics training. What he does have is a ChatGPT Pro subscription, which gives him access to the latest large language models from OpenAI.

Artificial intelligence has recently made headlines for solving a number of "Erdős problems," conjectures left behind by the prolific mathematician Paul Erdős. But experts have warned that these problems are an imperfect benchmark of artificial intelligence's mathematical prowess. They range dramatically in both significance and difficulty, and many AI solutions have turned out to be less original than they appeared.

The new solution—which Price got in response to a single prompt to GPT-5.4 Pro and posted on www.erdosproblems.com, a website devoted to the Erdős problems, just over a week ago—is different. The problem it solves has eluded some prominent minds, bestowing it some esteem. And more importantly, the AI seems to have used a totally new method for problems of this kind. It's too soon to say with certainty, but this LLM-conceived connection may be useful for broader applications—something hard to find among recently touted AI triumphs in math.

"This one is a bit different because people did look at it, and the humans that looked at it just collectively made a slight wrong turn at move one," says Terence Tao, a mathematician at the University of California, Los Angeles, who has become a prominent scorekeeper for AI's push into his field. "What's beginning to emerge is that the problem was maybe easier than expected, and it was like there was some kind of mental block."

The question Price solved—or prompted ChatGPT to solve—concerns special sets of whole numbers, where no number in the set can be evenly divided by any other. Erdős called these "primitive sets" because of their connection to similarly indivisible prime numbers.

"A number is prime if it has no other divisors, and this is kind of generalizing that definition from an individual number to a collection of numbers," says Jared Duker Lichtman, a mathematician at Stanford University. Any set of prime numbers is automatically primitive, because primes have no factors (except themselves and the number one).

Erdős also came up with the Erdős sum, a "score" you can calculate for any primitive set. He showed that the sum had a maximum possible value—and conjectured that this maximum is attained only by the set of all prime numbers. Lichtman proved Erdős right as part of his doctoral thesis in 2022.

Erdős also noticed that the score drops if all of a set's numbers are large—the larger the numbers, the smaller the maximum possible score. He guessed that as the set's numbers approached infinity, the maximum score would drop to exactly one. Lichtman tried to prove this, too, but got stuck like everyone else before him.
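
For readers who want the objects behind the prose: in standard notation (not quoted from the article), the Erdős sum of a primitive set and the statement at issue can be written as follows.

```latex
% Erdős sum of a primitive set A of integers n >= 2:
%   f(A) = sum over n in A of 1 / (n log n)
% Erdős (1935) showed f(A) is bounded over all primitive sets A;
% Lichtman (2022) showed the primes maximize it.
% The conjecture addressed here: among primitive sets whose elements
% all exceed N, the best achievable score tends to 1 as N grows.
f(A) = \sum_{n \in A} \frac{1}{n \log n},
\qquad
\sup_{\substack{A \ \text{primitive} \\ \min A > N}} f(A) \longrightarrow 1
\quad \text{as } N \to \infty.
```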

Price wasn't aware of this history when he entered the problem into ChatGPT on an idle Monday afternoon. "I didn't know what the problem was—I was just doing Erdős problems as I do sometimes, giving them to the AI and seeing what it can come up with," he says. "And it came up with what looked like a right solution."

He sent it to his occasional collaborator Kevin Barreto, a second-year undergraduate in mathematics at the University of Cambridge. The duo had jump-started the AI-for-Erdős craze late last year by prompting a free version of ChatGPT with open problems chosen at random from the Erdős problems website. (An AI researcher subsequently gifted them each a ChatGPT Pro subscription to encourage their "vibe mathing.")

Reviewing Price's message, Barreto realized what they had was special, and experts whom he notified quickly took notice.

"There was kind of a standard sequence of moves that everyone who worked on the problem previously started by doing," Tao says. The LLM took an entirely different route, using a formula that was well known in related parts of math, but which no one had thought to apply to this type of question.

"The raw output of ChatGPT's proof was actually quite poor. So it required an expert to kind of sift through and actually understand what it was trying to say," Lichtman says. But now he and Tao have shortened the proof so that it better distills the LLM's key insight.

More importantly, they already see other potential applications of the AI's cognitive leap. "We have discovered a new way to think about large numbers and their anatomy," Tao says. "It's a nice achievement. I think the jury is still out on the long-term significance."

Lichtman is hopeful because ChatGPT's discovery validates a sense he's had since graduate school. "I had the intuition that these problems were kind of clustered together and they had some kind of unifying feel to them," he says. "And this new method is really confirming that intuition."

Editor's Note (4/28/26): This article was edited after posting to correct the description of the Erdős sum and to clarify Jared Duker Lichtman's full name.

Meta signs agreement with AWS to power agentic AI on Amazon's Graviton chips (1 minute read)

Meta signs agreement with AWS to power agentic AI on Amazon's Graviton chips (1 minute read)

AI
Meta is deploying tens of millions of AWS Graviton5 processors for agentic AI workloads, signaling a major shift from GPU-focused infrastructure toward CPU-intensive agent reasoning and orchestration.
What: Meta has partnered with AWS to deploy Graviton5 processors at massive scale—starting with tens of millions of cores—to power CPU-intensive AI agent workloads like real-time reasoning, code generation, and multi-step task orchestration rather than relying solely on GPUs for training.
Why it matters: This partnership highlights an emerging infrastructure divide in AI: while GPUs remain critical for training large models, the explosion of agentic AI systems that need to reason, plan, and execute complex workflows in real-time creates massive demand for purpose-built, energy-efficient CPUs that can handle billions of coordinated interactions.
Takeaway: If building AI agent systems, evaluate whether CPU-optimized infrastructure like Graviton instances could be more cost-effective than GPU-based deployments for inference and orchestration workloads.
Deep dive
  • Meta becomes one of the largest AWS Graviton customers globally, deploying tens of millions of cores with room to expand as AI capabilities grow
  • Graviton5 features 192 cores per chip with 5x larger cache than previous generation, reducing inter-core communication delays by up to 33%
  • The deal reflects architectural shift in AI infrastructure: GPUs for model training, CPUs for inference and agentic orchestration at scale
  • Agentic AI workloads differ from traditional inference—they require sustained CPU power for reasoning chains, code generation, search coordination, and multi-step task execution
  • Built on AWS Nitro System providing bare-metal instance access while maintaining standard networking (ENA) and storage (EBS) interfaces Meta relies on
  • Supports Elastic Fabric Adapter (EFA) for low-latency, high-bandwidth communication between instances—critical for distributed agent workflows
  • Manufactured using 3nm process technology, delivering 25% better performance than Graviton4 while maintaining energy efficiency
  • AWS controls full stack from chip design through server architecture, enabling optimization impossible with off-the-shelf processors
  • Partnership helps Meta handle billions of user interactions while meeting sustainability targets through improved energy efficiency
  • Deployment supports Meta's existing AWS relationship and Amazon Bedrock usage for next-generation AI development
Decoder
  • Agentic AI: Autonomous AI systems that can reason, plan, and complete complex multi-step tasks rather than just generating single responses
  • Graviton: AWS's custom ARM-based processors designed for energy-efficient cloud computing, now in fifth generation
  • AWS Nitro System: Dedicated hardware and software layer that offloads virtualization functions to provide bare-metal performance with cloud flexibility
  • Elastic Fabric Adapter (EFA): AWS networking interface enabling low-latency, high-bandwidth communication between cloud instances for distributed workloads
  • 3nm chip technology: Manufacturing process that creates smaller, more power-efficient transistors (3 nanometers wide)
Original article

Key takeaways

  • The deployment starts with tens of millions of Graviton cores, with the potential to expand.
  • Meta is now one of the largest Graviton customers in the world.
  • The deal builds on Meta's long-standing AWS relationship and use of Amazon Bedrock at scale to support its next generation of AI.

Meta has signed an agreement to deploy AWS Graviton processors at scale. The deal marks a significant expansion of a long-standing partnership between the two companies as Meta builds its next generation of AI. The deployment starts with tens of millions of Graviton cores, with the flexibility to expand as Meta's AI capabilities grow.

The deal reflects a shift in how AI infrastructure gets built: while GPUs remain essential for training large models, the rise of agentic AI is creating massive demand for CPU-intensive workloads—real-time reasoning, code generation, search, and orchestrating multi-step tasks. Graviton5 is purpose-built for these workloads, giving Meta the processing power to run them efficiently at scale. The chips will power various workloads at Meta, including supporting the company's AI efforts. That work requires infrastructure that can handle billions of interactions while coordinating complex, multi-step agent workflows—exactly the kind of CPU-intensive work Graviton is designed for.

AWS Graviton chips powering AI workloads

As organizations increasingly adopt agentic AI—autonomous systems that can reason, plan, and complete complex tasks—the demand for high-performance, energy-efficient compute infrastructure has never been greater. Meta is building at the forefront of agentic AI, and its broad Graviton deployment reflects a simple reality: agentic workloads like code generation, real-time reasoning, and frontier model training are CPU-intensive, and purpose-built chips are the most efficient way to power them.

The Graviton5 chip features 192 cores and a cache that is five times larger than the previous generation, which reduces delays in how quickly those cores communicate with each other by up to 33%. That means faster data processing with greater bandwidth—key requirements for agentic AI systems that need to continuously reason through and execute multi-step tasks. Graviton is built on the AWS Nitro System, which uses dedicated hardware and software to deliver high performance, high availability, and high security. The Nitro System enables bare-metal instances for direct access to the hardware while providing the same familiar Elastic Network Adapter (ENA) and Amazon Elastic Block Store (Amazon EBS) devices that allow Meta to run its own virtual machines without performance compromises. The range of Graviton5 instances also supports the Elastic Fabric Adapter (EFA), enabling low-latency, high-bandwidth communication between instances. This is essential for Meta's agentic AI workloads, where large-scale tasks need to be distributed across many processors working in coordination. As a longtime AWS customer, Meta has relied on AWS's highly scalable and secure cloud infrastructure to power its global businesses.

"This isn't just about chips; it's about giving customers the infrastructure foundation, as well as data and inference services, to build AI that understands, anticipates, and scales efficiently to billions of people worldwide," said Nafea Bshara, vice president and distinguished engineer, Amazon. "Meta's expanded partnership, deploying tens of millions of Graviton cores, shows what happens when you combine purpose-built silicon with the full AWS AI stack to power the next generation of agentic AI."

"As we scale the infrastructure behind Meta's AI ambitions, diversifying our compute sources is a strategic imperative. AWS has been a trusted cloud partner for years, and expanding to Graviton allows us to run the CPU-intensive workloads behind agentic AI with the performance and efficiency we need at our scale," said Santosh Janardhan, head of infrastructure, Meta.

Energy efficiency benefits of Graviton

AWS Graviton5 is built on 3-nanometer chip technology—a manufacturing process that produces smaller, more efficient processors. Because AWS designs its chips from the ground up and controls the full process from chip design through server architecture, it can optimize performance and efficiency in ways that off-the-shelf processors can't match. The result is infrastructure that delivers stronger performance while maintaining leading energy efficiency, helping Meta pursue ambitious AI goals while staying on track with sustainability targets. Graviton5 delivers up to 25% better performance than the previous generation. As the demand for AI compute grows across the industry, the efficiency of the underlying infrastructure becomes increasingly important—both for managing costs and reducing environmental impact. The deal signals a new chapter in how large-scale AI infrastructure gets built—and how purpose-built chips like Graviton can help companies like Meta deliver smarter, more personalized experiences to billions of people worldwide.

Sovereign Labs Are Overkill for Enterprise AI (7 minute read)

Sovereign Labs Are Overkill for Enterprise AI (7 minute read)

AI
Sovereign AI labs are being oversold to enterprises who actually just need private deployment and data control, not expensive national-scale foundation model training.
What: A critical analysis arguing that sovereign AI labs (national initiatives to build independent foundation models) make sense for governments but are overkill for enterprises, who conflate the need for data sovereignty with the need to pre-train their own frontier models.
Why it matters: This matters because enterprises are being pitched expensive sovereign lab infrastructure when their actual requirements—data residency, auditability, and regulatory compliance—can be met more cheaply with self-hosted open models and proper data isolation controls.
Takeaway: Evaluate your AI sovereignty needs by asking where regulated data flows, not which model to use—consider self-hosting open models like Llama or DeepSeek with proper data isolation rather than building or buying sovereign foundation models.
Deep dive
  • The sovereign lab pitch makes seven claims (sovereign data, weights, compute, cultural fit, jurisdictional control, supply chain independence, strategic autonomy) but only 1.5 actually hold up in practice
  • Most sovereign labs use the same training data (Common Crawl), architectures (Llama/DeepSeek derivatives), and supply chains (NVIDIA chips from Taiwan) as everyone else, just with different branding
  • The only genuine advantage sovereign labs have is cultural and linguistic fit for local languages, but GPT and Claude are closing this gap with each release
  • What enterprises actually mean by "sovereign AI" is five things: regulated data stays in jurisdiction, no data leakage to third parties, auditability, vendor independence, and local language/workflow support
  • The practical solution is two levers used together: local deployment (self-host open models on controlled infrastructure) and local isolation of sensitive data (keep regulated data from reaching model providers)
  • This approach lets you run self-hosted Llama for sensitive workloads while still calling frontier APIs like GPT-5 for non-sensitive tasks, maintaining sovereignty boundaries at the data level (see the routing sketch after this list)
  • Sovereignty is a property of data flows, not model nationality—the right question is "where does regulated data go?" not "whose model is this?"
  • National labs make sense for defense, intelligence, and government use cases where data genuinely cannot cross borders, but not for most enterprise scenarios
  • The sovereign lab industry is driven by GPU sellers' revenue incentives and VC growth-stage investment theses, not genuine enterprise needs
  • Recent example: Aleph Alpha (Germany) and Cohere (Canada) merged at $20B valuation, positioning as sovereign alternatives despite using similar underlying technology stacks
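
The two levers above mostly reduce to a routing decision at the data layer. Below is a minimal sketch of that pattern, assuming a self-hosted open model behind an OpenAI-compatible endpoint (as vLLM provides) plus an external frontier API; the endpoint URL, model names, and the toy policy check are placeholders, not a recommended classifier.

```python
from openai import OpenAI

# Self-hosted open model (e.g. Llama served by vLLM) on infrastructure you control.
local = OpenAI(base_url="http://vllm.internal:8000/v1", api_key="unused")  # hypothetical endpoint
# External frontier API for workloads that carry no regulated data.
frontier = OpenAI(api_key="...")  # would read OPENAI_API_KEY in real use


def contains_regulated_data(prompt: str) -> bool:
    """Placeholder policy check: real systems use DLP rules, classifiers,
    or explicit data-classification tags on the calling workload."""
    return any(tag in prompt.lower() for tag in ("patient", "iban", "ssn"))


def complete(prompt: str) -> str:
    if contains_regulated_data(prompt):
        client, model = local, "meta-llama/Llama-3.1-70B-Instruct"  # stays in-jurisdiction
    else:
        client, model = frontier, "gpt-5"  # frontier model name as referenced above
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```
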
Decoder
  • Sovereign Lab: A national initiative to build and control AI foundation models domestically, independent of foreign providers like OpenAI or Anthropic
  • Sovereign AI: Umbrella term covering both sovereign pre-training (building national models) and sovereign deployment (private inference with data residency)
  • CLOUD Act: U.S. law allowing American authorities to access data stored by U.S. companies regardless of physical location, relevant for AWS/Azure sovereign cloud claims
  • GDPR/DPDPA: Data protection regulations (European and Indian respectively) requiring data to stay within specific jurisdictions
  • vLLM/Ollama: Open-source tools for serving and running large language models on your own infrastructure
  • Common Crawl: Large public web crawl dataset used to train most foundation models, undermining claims of truly sovereign training data
Original article

The national lab thesis is legitimate for nations, but for everyone else, it's a solution to a problem they don't have.

Anthropic tests new Bugcrawl tool for Claude Code bug detection (2 minute read)

Anthropic tests new Bugcrawl tool for Claude Code bug detection (2 minute read)

AI
Anthropic is testing Bugcrawl, a new Claude Code feature that scans entire repositories to detect bugs and suggest fixes.
What: Bugcrawl is an unreleased feature appearing in Claude Code's UI that lets developers select a repository and have Claude scan the entire codebase for general bugs and propose fixes, with warnings about high token consumption for larger repositories.
Why it matters: This positions Claude Code as a full-spectrum code quality tool alongside its existing Security and Code Review features, and escalates competition with OpenAI, xAI, and Google in the shift from single-file AI assistants to repository-wide autonomous agents.
Takeaway: Enterprise teams should prepare for high token costs if they plan to use this when it launches, and consider starting with smaller repositories to test the feature.
Deep dive
  • Bugcrawl appears in Claude Code's sidebar with a repository picker and a warning about high token consumption
  • The feature targets general bug detection and fixes, complementing Security (vulnerabilities) and Code Review (PR-level analysis)
  • Anthropic could potentially extend it to end-to-end testing where Claude runs the app locally and walks through user flows
  • Fits into Claude Code's rapid expansion: Code Security in February 2026, Code Review in March 2026
  • Part of industry-wide shift toward repository-wide AI agents, competing with OpenAI Codex, xAI Grok Build, and Google Jules
  • High token costs suggest it's aimed at Team and Enterprise tiers
  • Not in production yet, no release date, but rapid feature cadence suggests research preview likely soon
Original article

Anthropic appears to be building a tool within Claude Code called Bugcrawl, which surfaces as a dedicated entry in the side navigation. Once opened, the screen presents a repository selection UI alongside a warning that the feature consumes tokens at a high rate, so it's suggested to start with a small repository before pointing it at anything substantial. That caveat alone hints at the scale of work the agent would be carrying out in the background.

The most plausible read is that Bugcrawl will set Claude loose across an entire codebase to hunt for general bugs and propose fixes, while the Security tab already shipping in Claude Code for Enterprises targets vulnerabilities specifically. If Anthropic pushes the concept further, the same loop could extend into end-to-end product testing, where Claude spins up a local instance of the app, walks through user flows, and reports regressions. How feature specifications or test criteria would be passed into a run is still an open question, since the only screen visible so far is the repository picker.


For Anthropic, the move slots cleanly into the Claude Code expansion of recent months, which has already produced Claude Code Security in February and Claude Code Review in March, both built around multi-agent investigation of code. Bugcrawl would round out that lineup by tackling general correctness and quality, the broader, fuzzier category that sits between security scanning and PR-level review. It also fits the wider competitive picture, with OpenAI's Codex, xAI's Grok Build, and Google's Jules each pushing toward agents that reason across full repositories rather than single files.

The likely audience is engineering teams on Team and Enterprise tiers, where the token burn warning is easier to absorb. No release window has surfaced, and the feature does not appear in production builds. Given the cadence of Code Security and Code Review landing within weeks of each other, a research preview on the same web surface looks like the most likely path.

HashiCorp Vault 2.0 Marks Shift to IBM Lifecycle with New Identity Federation (3 minute read)

HashiCorp Vault 2.0 Marks Shift to IBM Lifecycle with New Identity Federation (3 minute read)

DevOps
HashiCorp Vault 2.0 is the first major release under IBM ownership, adding workload identity federation to eliminate static cloud credentials while introducing breaking changes and a two-year support lifecycle.
What: Vault 2.0 jumps from version 1.21 to adopt IBM's versioning model after the acquisition, introducing workload identity federation that uses OIDC tokens to authenticate with AWS, Azure, and GCP without long-lived credentials, plus SCIM provisioning, SPIFFE support, and PKI automation enhancements.
Why it matters: This release signals Vault's direction after HashiCorp's 2023 license change to Business Source License sparked the OpenBao fork, and addresses a critical security gap by removing static credential requirements during secret synchronization across cloud providers.
Takeaway: Review the migration documentation if running Vault 1.x, particularly Azure authentication configurations that now require explicit settings instead of environment variable fallbacks.
Deep dive
  • The version jump from 1.21 to 2.0 reflects IBM's acquisition and support model shift, guaranteeing at least two years of standard support for major releases under the IBM Support Cycle-2 policy
  • Workload Identity Federation eliminates the need for static credentials when syncing secrets to cloud providers by using short-lived OIDC tokens, reducing the attack surface for credential leakage during synchronization (see the sketch after this list)
  • Internal storage engine modifications target performance improvements for high-volume operations like real-time encryption and authentication at enterprise scale
  • Breaking changes remove legacy components to simplify codebase maintenance, including Azure authentication now requiring explicit configuration rather than environment variable defaults, enforcing a change that began in 1.20
  • Beta SCIM 2.0 support enables automated provisioning of Vault entities and groups from external identity platforms, reducing manual identity management overhead
  • SPIFFE JWT-SVID support allows workloads to participate in SPIFFE-based identity meshes, bridging proprietary HashiCorp features with open standards
  • Enhanced PKI secret engine automation for certificate issuance and renewal aligns with zero-trust networking principles by reducing manual credential management risks
  • The release comes as teams evaluate Vault against cloud-native alternatives like AWS Secrets Manager and Azure Key Vault (tighter platform integration but less portability) and managed services like Akeyless and Doppler (no operational overhead)
  • The 2023 license change from Mozilla Public License to Business Source License prompted the community-driven OpenBao fork, making IBM's stewardship particularly important to the community
  • This is the first major version increment since version 1.0 launched in 2018, representing eight years of feature development under the 1.x line
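For a concrete sense of the federation pattern, the sketch below shows the underlying primitive on AWS: a short-lived OIDC token is exchanged for temporary credentials via STS AssumeRoleWithWebIdentity, so there is no static access key to leak. This is the general mechanism rather than Vault's own implementation, and the role ARN and token path are illustrative placeholders.

```python
import boto3

# Sketch of the federation primitive behind "no static cloud credentials":
# a short-lived OIDC token is traded for temporary AWS credentials via STS.
# Illustrative only; role ARN and token path are placeholders, not Vault code.
with open("/var/run/secrets/oidc/token") as f:
    oidc_token = f.read().strip()

sts = boto3.client("sts")
resp = sts.assume_role_with_web_identity(
    RoleArn="arn:aws:iam::123456789012:role/vault-secret-sync",  # placeholder
    RoleSessionName="vault-sync",
    WebIdentityToken=oidc_token,
    DurationSeconds=900,  # credentials expire automatically
)

creds = resp["Credentials"]  # nothing long-lived to rotate or leak
s3 = boto3.client(
    "s3",
    aws_access_key_id=creds["AccessKeyId"],
    aws_secret_access_key=creds["SecretAccessKey"],
    aws_session_token=creds["SessionToken"],
)
```

Azure federated credentials and GCP workload identity federation expose equivalent token-exchange flows, which is how the same approach covers all three providers named in the release.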
Decoder
  • OIDC tokens: OpenID Connect tokens are short-lived authentication credentials that prove identity without storing long-term secrets
  • SCIM: System for Cross-domain Identity Management, a standard protocol for automating user and group provisioning across systems
  • SPIFFE: Secure Production Identity Framework For Everyone, an open standard for workload identity in distributed systems
  • JWT-SVID: JSON Web Token SPIFFE Verifiable Identity Document, a cryptographically signed token format used in SPIFFE identity attestation
  • PKI: Public Key Infrastructure, the framework for managing digital certificates and encryption keys
  • Business Source License: A source-available license that restricts certain production uses for a set period before the code converts to an open-source license (unlike the fully open Mozilla Public License)
  • Static credentials: Long-lived access keys or passwords that don't expire automatically, creating security risks if leaked
Original article

HashiCorp released Vault 2.0 under IBM's versioning model with a two-year support lifecycle, introducing workload identity federation that removes the need for static cloud credentials, internal performance improvements, and breaking changes, alongside SCIM provisioning, SPIFFE support, and enhanced PKI automation.
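As a rough illustration of what the PKI automation improvements build on, here is a minimal sketch of issuing a short-lived certificate through Vault's existing PKI secrets engine over its HTTP API. The mount path, role name, and domain are illustrative, and this is the long-standing issuance call rather than a new 2.0 interface.

```python
import os
import requests

# Minimal sketch: request a short-lived certificate from Vault's PKI engine.
# Mount path ("pki"), role ("web-server"), and domain are placeholders.
VAULT_ADDR = os.environ.get("VAULT_ADDR", "https://vault.example.com:8200")
VAULT_TOKEN = os.environ["VAULT_TOKEN"]

resp = requests.post(
    f"{VAULT_ADDR}/v1/pki/issue/web-server",
    headers={"X-Vault-Token": VAULT_TOKEN},
    json={"common_name": "app.example.com", "ttl": "72h"},
    timeout=10,
)
resp.raise_for_status()

data = resp.json()["data"]
cert, key = data["certificate"], data["private_key"]
# Short TTLs plus automated re-issuance before expiry are the routine that
# the enhanced PKI automation is meant to make less manual.
```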

DigitalOcean Dedicated Inference: A Technical Deep Dive (6 minute read)

DigitalOcean Dedicated Inference: A Technical Deep Dive (6 minute read)

DevOps
DigitalOcean launched Dedicated Inference, a managed service for hosting large language models on dedicated GPUs with KV cache-aware routing and predictable economics for high-volume inference workloads.
What: DigitalOcean Dedicated Inference is a managed LLM hosting service that deploys models on dedicated GPUs using vLLM and Kubernetes, designed for teams that need consistent, high-volume inference beyond pay-per-token serverless offerings. The service handles cluster lifecycle, routing, and scaling while giving users control over model choice, capacity, and performance tuning through OpenAI-compatible APIs.
Why it matters: Fills the gap between serverless inference (unpredictable costs at scale) and DIY GPU management (heavy operational burden). The KV cache-aware routing is particularly notable—it directs requests to replicas that already cached prompt prefixes, avoiding redundant computation and improving cost efficiency for workloads with reusable context.
Takeaway: Evaluate whether your inference workload has sustained, high-volume demand where dedicated GPU pricing would beat pay-per-token models, especially if you need bring-your-own-model capabilities or VPC-isolated deployments.
Deep dive
  • Separates control plane (endpoint lifecycle management via regional API services) from data plane (direct inference traffic through VPC-native load balancers to GPU nodes)
  • Uses vLLM as the core inference engine paired with LLM-d for Kubernetes-native distributed inference patterns
  • Implements intelligent routing via Kubernetes Gateway API with an Inference Extension that understands queue depth, replica health, and KV cache affinity
  • Endpoint Picker component on CPU nodes selects optimal GPU replica based on whether it already cached the prompt prefix, not just round-robin distribution
  • One DOKS cluster per VPC can host multiple isolated Dedicated Inference deployments using Kubernetes namespaces and Custom Resource Definitions
  • Exposes both public and private endpoints through regional network load balancers, supporting internet and VPC-only access patterns
  • Platform manages Kubernetes capacity provisioning, GPU pool coordination, gateway configuration, and reconciliation loops; users control model selection, replica counts, and scaling policies
  • Designed for workloads where predictable GPU-hour economics matter more than burst elasticity—think coding assistants serving 2,000 concurrent engineers, not occasional API calls
  • OpenAI-compatible API surface means existing client libraries work without modification
  • Lifecycle management follows Kubernetes operator reconciliation pattern: observe desired state, act with retries and backoff, surface clear status instead of partial failures
Decoder
  • vLLM: Open-source inference engine optimized for serving large language models on GPUs with efficient memory management
  • KV cache: Saved attention key-value tensors from previously processed tokens; reusing cached prefixes avoids recomputing the same prompt context
  • KV cache-aware routing: Load balancing strategy that prefers sending requests to replicas that already cached relevant prompt prefixes, reducing redundant computation
  • Kubernetes Gateway API: Standard API for configuring HTTP/TLS routing into Kubernetes clusters, successor to Ingress
  • Control plane vs data plane: Architectural split where control plane handles management operations (create/update/delete endpoints) and data plane handles high-throughput inference requests
  • LLM-d: Kubernetes-oriented stack for distributed inference with prefix-cache-aware routing and LLM-specific scaling patterns
  • DOKS: DigitalOcean Kubernetes Service, their managed Kubernetes offering
  • Replica: Horizontally scaled copy of the same model server running in a separate pod/process
  • Day-two operations: Ongoing operational tasks after initial deployment—monitoring, scaling, upgrades, incident response
Original article

DigitalOcean Dedicated Inference: A Technical Deep Dive

Getting a model to answer 10 concurrent inference requests is a manageable problem; getting it to handle 2,000 engineers hitting a coding assistant with long contexts, all day, without runaway costs, is where teams stall. A working endpoint is only the beginning. Teams need to identify the supporting hardware and wire up the right components (serving, scaling, observability, and cost guardrails) so the deployment can meet its SLAs and SLOs under real, sustained load.

DigitalOcean already offers Serverless Inference on the DigitalOcean AI Platform: a fast path to models from OpenAI, Anthropic, Meta, or other providers, with minimal setup and token-based pricing. This offering works well for many use cases. However, when you need your own weights, predictable performance on dedicated GPUs, and economics that favor sustained, high-volume token generation over pay-per-token bursts, a different approach makes sense.

Dedicated Inference, our managed LLM hosting service on the DigitalOcean AI Platform, fills that gap.

Dedicated Inference deploys and operates an opinionated inference stack on dedicated GPUs, with Kubernetes-native orchestration under the hood. You interact through the control plane and APIs you already use in the DigitalOcean ecosystem; the data plane exposes public and private endpoints so applications inside, or outside, your VPC can call your models securely.

The service is designed to collapse a vast combinatorial space—GPU SKUs, runtimes, routers, autoscaling policies—into guided defaults so teams hit production milestones faster than DIY stacks, while retaining knobs that matter for model serving: replicas, scaling behavior, and advanced optimizations as you roll out your product roadmap.

What we manage vs. what you control

Every managed product draws a line between operator-owned and customer-owned concerns. Dedicated Inference aims to put day-two operations—cluster lifecycle integration, ingress, core serving and routing components, and the glue between them—on the platform side, while leaving model choice, capacity, and workload-specific tuning with you.

Typically platform-managed:

  • Provisioning and lifecycle of the underlying orchestration footprint in line with product design (for example, managed Kubernetes integration and GPU pool coordination).
  • Core inference engine and orchestration integration, including patterns that matter at scale: intelligent routing, autoscaling hooks, and production-oriented serving paths.
  • Endpoint creation, health and scaling workflows, and the operational automation required to keep endpoints aligned with declared configuration.

In your hands:

  • Selecting models (including bring-your-own-model paths where supported), GPU profiles, and replica counts appropriate to your SLOs and budget.
  • Configuring scaling behavior and, over time, advanced serving options that map to your latency, throughput, and cost goals.
  • Connecting applications via stable HTTP APIs consistent with common LLM client stacks.

Dedicated Inference overview

Dedicated Inference builds on industry-standard building blocks so customers benefit from community momentum and continuous improvement:

  • vLLM as a capable, widely adopted inference engine for large language models on modern GPUs.
  • LLM-d as a Kubernetes-oriented stack for distributed inference patterns—precise prefix-cache aware routing, scaling concerns that differ from traditional HTTP services, and room to grow into more advanced topologies as workloads demand.

This combination reflects a deliberate choice: meet customers where they are today (OpenAI-compatible APIs, familiar GPU offerings on DigitalOcean) while staying aligned with where the ecosystem is moving on routing, replication, and scale-out inference.

For readers who want more depth on why LLM routing differs from classic load balancing—and how prefix cache awareness changes the game—see our article on Load Balancing and Scaling LLM Serving.

High-level architecture

The system design separates a control plane (how endpoints are created, updated, listed, and deleted) from a data plane (how chat/completions request traffic reaches your models). Management requests take a path built for regional placement and durable lifecycle work, while inference requests take a direct, low-latency path in front of your GPUs.

Control plane: central entry, regional execution

What does "control plane" mean here? In this split, the control plane is everything that handles management traffic: management rpc for Dedicated Inference endpoints, plus the durable bookkeeping that turns your declared intent into running DI infrastructure. It is separate from the data plane, which is the hot path for inference (chat/completions-style) requests once an endpoint is healthy.


Central API layer: Requests originating from the DigitalOcean Cloud UI, automation workflows, or the public API are routed through a centralized API layer first. This layer maintains the mapping of endpoint ownership by region and transparently forwards each request to the appropriate regional backend. This design follows a multi-region fan-out model, where regional endpoints are addressed using stable identifiers.

Regional Dedicated Inference service: Each region operates a control-plane service responsible for the full lifecycle management of instances within its scope. This includes persisting the desired state, reconciling it with the observed state, advancing lifecycle status (e.g., creating → active), and enqueuing the workflows that provision or mutate the underlying infrastructure. In this context, lifecycle refers to the state machine governing transitions from requested to running and reachable. An instance represents the managed inference deployment as a whole—encompassing both its control-plane record and the associated backing resources.

Separate worker-style components perform integrations that need retries, backoff, and idempotency: calling the Kubernetes API, watching object status, and publishing lifecycle updates back to the core service. This is similar to the reconciler pattern familiar from Kubernetes operators: observe desired state, act, repeat until reality matches intent. The practical payoff is that transient API errors or slow node startups do not wedge the user-facing API; the system keeps retrying and surfaces a clear status instead of a half-applied state.
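A minimal sketch of that loop, with hypothetical observe and apply callables standing in for the real Kubernetes integrations and an invented backoff policy, looks roughly like this:

```python
import time

def reconcile(desired_state, observe, apply, max_backoff=60):
    """Generic reconciler sketch: act until observed state matches intent.

    `observe` and `apply` are hypothetical callables standing in for real
    integrations (e.g. reads and writes against the Kubernetes API).
    """
    backoff = 1
    while True:
        try:
            observed = observe()
            if observed == desired_state:
                return observed             # reality matches intent; done
            apply(desired_state, observed)  # act to close the gap
            backoff = 1                     # progress made, reset backoff
        except Exception as err:
            # Transient errors (API hiccups, slow node startup) do not wedge
            # the caller; surface a status and retry with exponential backoff.
            print(f"reconcile: retrying after error: {err}")
            backoff = min(backoff * 2, max_backoff)
        time.sleep(backoff)
```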

DigitalOcean Kubernetes and capacity: The platform first helps ensure that sufficient DigitalOcean Kubernetes capacity is available within the target VPC, attaches the required GPU and supporting CPU node capacity, rolls out the managed inference stack (gateway and model workloads), and creates regional network load balancers for public and private endpoints.

Data plane: VPC-native traffic, gateway-routed requests


Client Contract & Endpoint connectivity: Once an endpoint is active, clients send OpenAI-compatible API requests (for example, HTTPS to /v1/chat/completions-style routes). Your public endpoint FQDN resolves to an external regional network load balancer (L4); your private endpoint resolves to an internal regional network load balancer, so the same inference stack can be reached from the public internet or stays on your private VPC network. In both cases, traffic is OpenAI-shaped JSON carrying model ID, messages, and generation parameters.

Cluster placement and VPC isolation: Inside the VPC, workloads run on DOKS. One cluster per VPC can host multiple Dedicated Inference deployments, each isolated by a Kubernetes namespace. Desired gateway and model wiring is expressed as Custom Resources (instances of CRDs, Custom Resource Definitions): declarative objects you kubectl apply (or the platform applies for you) instead of imperative scripts.

Inference Gateway and Kubernetes Gateway API: After the NLB, traffic hits the Inference Gateway, implemented with the upstream Kubernetes Gateway API—the community standard for describing HTTP/TLS routing into a cluster.

Gateway API Inference Extension (inference-aware routing): Below the gateway, the Gateway API Inference Extension teaches routing about inference signals: queue depth, replica health, and KV cache affinity (preferring a replica that already holds key/value tensors for a reusable prompt prefix so work is not recomputed from scratch). KV cache is the saved attention state for prior tokens; inference-aware routing is deliberately not simple round robin, because the cheapest replica is often the one that already cached your prefix.
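To make the contrast with round robin concrete, a toy version of prefix-aware selection might look like the following; the signals and tie-breaking are invented for illustration and are not the extension's actual scoring logic.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    name: str
    queue_depth: int  # pending requests on this replica
    healthy: bool
    cached_prefixes: set = field(default_factory=set)  # prompt-prefix hashes in KV cache

def pick_replica(replicas, prefix_hash):
    """Toy inference-aware routing: prefer a healthy replica that already
    caches the prompt prefix, then break ties by shortest queue."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replicas")
    # False sorts before True, so replicas holding the prefix win first.
    return min(
        candidates,
        key=lambda r: (prefix_hash not in r.cached_prefixes, r.queue_depth),
    )

replicas = [
    Replica("gpu-a", queue_depth=3, healthy=True, cached_prefixes={"p1"}),
    Replica("gpu-b", queue_depth=1, healthy=True),
]
# Prefix "p1" is already cached on gpu-a, so it wins despite the deeper queue.
print(pick_replica(replicas, "p1").name)  # -> gpu-a
```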

Endpoint Picker: On CPU nodes, the Endpoint Picker is the component that talks to the Inference Extension and selects which GPU replica should receive each request.

Model Service and inference pools: On GPU nodes, the Model Service, backed by inference pools in configuration, runs your model replicas (distinct pods/processes that can serve the same model ID). Each replica reports load and KV cache metadata upstream so the Endpoint Picker's choices stay accurate through rollouts, crashes, and autoscaling events. A replica is a horizontally scaled copy of the same model server; a pool is the grouping of those replicas for routing and capacity.

Who is Dedicated Inference for?

Dedicated Inference is aimed at builders who already know they need GPUs, but who would rather not become a full-time inference platform team:

  • Teams that self-host on raw GPU Droplets or Kubernetes and want to offload orchestration, baseline optimizations, and repetitive infra work while keeping API-level ownership of their applications.
  • Teams that have graduated from Serverless Inference and need hardware-level control or BYOM without abandoning managed operations.
  • Organizations with consistent inference demand where predictable GPU-hour economics and performance isolation matter more than pure burst elasticity.

Inference is no longer a novelty layer; it is part of the core application stack. That shift raises the bar for reliability, performance, and cost predictability. Dedicated Inference is for teams that need production-grade, dedicated GPU inference with a managed path from model selection to a stable endpoint—so you spend engineering cycles on products and prompts, not on reinventing the serving platform.

Building a PCI-DSS Compliant GKE Framework for Financial Institutions: Data Protection, Governance… (6 minute read)

Building a PCI-DSS Compliant GKE Framework for Financial Institutions: Data Protection, Governance… (6 minute read)

DevOps
Summary
Digest devoured!