AI researchmultimodalspeech

Interaction Models: A Scalable Approach to Human-AI Collaboration

Thinking Machines Lab's TML-Interaction-Small handles simultaneous speech and visual proactivity with 200ms micro-turns, beating GPT-4o Realtime 77.8 to 46.8 on interaction benchmarks.

Thinking Machines Lab

Summary

What: Thinking Machines Lab announced TML-Interaction-Small in May 2026, a 276B-parameter MoE model (12B active) trained from scratch for real-time multimodal interaction. It uses 200ms 'micro-turns' instead of turn-based exchanges, splitting work between a real-time interaction model and an asynchronous background model. Scored 77.8 on FD-bench v1.5 (vs GPT-4o Realtime's 46.8) with 0.40s turn-taking latency.

Why it matters: This signals a shift from turn-based AI interaction to continuous presence across modalities. Rather than bolting interactivity onto LLMs through voice-activity-detection harnesses, making it native to the model means scaling intelligence also scales collaboration quality—opening a path where larger models become better collaborators, not just smarter autonomous workers.

Deep Dive

Introduces "interaction models" that handle audio, video, and text simultaneously in real-time, unlike turn-based commercial models that wait for complete user turns
TML-Interaction-Small: 276B-parameter MoE with 12B active parameters, trained from scratch with encoder-free early fusion
Architecture uses 200ms "micro-turns" continuously interleaving input processing and output generation across all modalities
Split design: real-time interaction model for immediate responses + asynchronous background model for reasoning, tool use, and long-horizon tasks
FD-bench v1.5 interaction quality: 77.8 (TML) vs 46.8 (GPT-4o Realtime minimal) vs 45.5 (Gemini-3.1-flash high)
Turn-taking latency: 0.40s (TML) vs 1.18s (GPT-4o Realtime minimal) vs 0.94s (Gemini high)
Audio MultiChallenge APR intelligence: 43.4% (TML) vs 37.6% (GPT-4o minimal) vs 48.5% (GPT-4o with thinking)
Enables qualitatively new capabilities: simultaneous speech (live translation while listening), visual proactivity (responds to on-screen changes without audio cue), time awareness (accurate elapsed time tracking)
New benchmarks introduced: TimeSpeak (64.7% vs 4.3% for GPT), CueSpeak (81.7% vs 2.9%), RepCount-A for visual counting (35.4% vs 1.3%), ProactiveVideoQA (33.5 vs baseline 25.0), Charades temporal action (32.4% mIoU vs 0%)
No voice-activity-detection (VAD) harness needed—all interaction management is native to the model
Technical optimizations: streaming sessions for frequent small prefills, bitwise trainer-sampler alignment, custom NVLS comm kernels for Blackwell GPUs
Audio processing: dMel input with lightweight embedding layer; video: 40x40 patches with hMLP encoding
Limitations: long sessions require context management, needs reliable connectivity, larger models (beyond 12B active) too slow to serve currently
Limited research preview coming with wider release later in 2026; accepting feedback at [email protected]

Decoder

Micro-turns: 200ms chunks of interleaved input/output streams that replace traditional turn boundaries, allowing the model to process and respond continuously rather than waiting for complete user turns
Full-duplex: Simultaneous two-way communication where the model can speak and listen at the same time, like a phone call (vs half-duplex where only one party can transmit at a time)
MoE (Mixture of Experts): Model architecture where only a subset of parameters (here 12B out of 276B) activate for each input, reducing computational cost while maintaining large total capacity
Voice-activity-detection (VAD): External component that traditional turn-based models use to detect when a speaker has finished talking; TML's model eliminates this by natively understanding turn boundaries
FD-bench: Full-Duplex benchmark measuring how well models handle interruptions, backchanneling, talking to others, and background speech during real-time conversations
dMel: Discrete mel-spectrogram representation for audio that can be processed directly by transformers without large external encoders like Whisper

Original Article

Today, we're announcing a research preview of interaction models: models that handle interaction natively rather than through external scaffolding. We think interactivity should scale alongside intelligence; the way we work with AI should not be treated as an afterthought. Interaction models let people collaborate with AI the way we naturally collaborate with each other—they continuously take in audio, video, and text, and think, respond, and act in real time.

We train an interaction model from scratch. To ensure real-time responsiveness, we adopt a multi-stream, micro-turn design. Our research preview demonstrates qualitatively new interaction capabilities, as well as state-of-the-art combined performance in intelligence and responsiveness.

The collaboration bottleneck

AI labs often treat the ability for AI to work autonomously as the model's most important capability. As a result, today's models and interfaces aren't optimized for humans to remain in the loop. A recent frontier model card states: "Importantly, we find that when used in an interactive, synchronous, "hands-on-keyboard" pattern, the benefits of the model were less clear. When used in this fashion, some users perceived [our model] as too slow and did not realize as much value. Autonomous, long-running agent harnesses better elicited the model's coding capabilities."

Autonomous interfaces are valuable, but in most real work, users can't fully specify their requirements upfront and walk away—good results benefit from a collaborative process where the human stays in the loop, clarifying and giving feedback along the way. However, humans increasingly get pushed out not because the work doesn't need them, but because the interface has no room for them. Instead, people are most effective when they can collaborate with AI the same way we do with other people: messaging, talking, listening, seeing, showing, and interjecting as needed—and for the model to do the same.

In order to resolve this, we need to move beyond the current turn-based interface for the models. Today's models experience reality in a single thread. Until the user finishes typing or speaking, the model waits with no perception of what the user is doing or how the user is doing it. Until the model finishes generating, its perception freezes, receiving no new information until it finishes or is interrupted. This creates a narrow channel for human-AI collaboration that limits how much of a person's knowledge, intent, and judgement can reach the model, and how much of the model's work can be understood. Picture trying to resolve a crucial disagreement over email rather than in person.

At Thinking Machines, we believe we can solve this bandwidth bottleneck by making AI interactive in real time across any modality. This enables AI interfaces to meet humans where they are, rather than forcing humans to contort themselves to AI interfaces.

Most existing AI models bolt on interactivity with a harness: stitching components together to emulate interruptions, multimodality, or concurrency. However, existing research suggests that these hand-crafted systems will be outpaced by the advance of general capabilities. For interactivity to scale with intelligence, it must be part of the model itself. With this approach, scaling a model makes it smarter and a better collaborator.

Capabilities

Having interactivity be part of the model unlocks a variety of capabilities that would otherwise need to be implemented in the harness.

Seamless dialog management. The model tracks implicitly whether the speaker is thinking, yielding, self-correcting, or inviting a response. There is no separate dialog management component.
Verbal and visual interjections. The model jumps in as needed depending on the context, not only when the user finishes speaking.
Simultaneous speech. The user and the model can speak concurrently (e.g. live translation)
Time-awareness. The model has a direct sense of elapsed time.
Simultaneous tools calls, search, and generative UI. While speaking and listening to the user, the model can concurrently search, browse the web, or generate UI—weaving back results into the conversation as needed.

In a longer real session, all of this happens continuously, creating an experience that feels more like collaborating and less like prompting.

Our approach

Turn-based models see an alternating token sequence. Time-aware interaction models see a continuous stream of micro-turns, so silence, overlap, and interruption remain part of the model's context.

An interaction model is in constant two-way exchange with the user—perceiving and responding at the same time. Some domains take such interactivity as a given—the physical world demands that robotics and autonomous vehicles operate in real time. Audio full-duplex models are another example where interaction is bidirectional and continuous.

Applying the same principle, we set out to build an interaction model native to this regime—one that perceives and responds in the same continuous loop, across audio, video, and text. The result is a system architected around two ideas: a time-aware interaction model that maintains real-time presence, and an asynchronous background model that handles sustained reasoning, tool use, and longer-horizon work.

System overview

The interaction model is in constant exchange with the user. When a task requires deeper reasoning than can be produced instantaneously, the interaction model delegates to a background model that runs asynchronously. The interaction model remains present throughout — answering follow-ups, taking new input, holding the thread — and integrates background results into the conversation as they arrive.

The user continuously interacts with the interaction model, while the background model performs asynchronous tasks. Both systems share their context.

This split lets the user benefit from both responsiveness as well as the full extent of intelligence: the planning, tool-use, and agentic workflows of reasoning models at the response latency of non-thinking ones. Note that both the background and interaction models are intelligent — on its own, the interaction model is also competitive on both interactive and intelligence benchmarks.

The interaction model

Our starting point is continuous audio and video — modalities that are inherently real-time. Text can wait, but a live conversation cannot. By designing around the hardest case first, we arrive at an architecture that is natively multimodal, time-aware, and capable of handling concurrent input and output streams across all modalities. Several design choices make this possible.

Time-aligned micro-turns. The interaction model works with micro-turns continuously interleaving the processing of 200ms worth of input and generation of 200ms worth of output. Rather than consuming a complete user-turn and generating a complete response, both input and output tokens are treated as streams. Working with 200ms chunks of these streams enables near real-time concurrency of multiple input and output modalities.

Human perception preserves concurrent input and output streams, while the model receives a single interleaved token sequence.

With this design, there are no artificial turn boundaries that the model must adhere to. In contrast, most existing real-time systems require a harness that predicts turn boundaries in order for the turn-based models to feel real-time and responsive. This harness is made out of components like voice-activity-detection (VAD) that are meaningfully less intelligent than the model itself. This precludes a variety of interaction modes like proactive interjections ("interrupt when I say something wrong") or reactions to visual cues ("tell me when I've written a bug in my code"). Moreover, the model can do things like speak while listening ("translate from spanish to english live") or watching ("live-commentate this sports game").

Thus, all of these different interaction modes that require special harnesses today become special-cases of what the model can do and improve in quality as we scale up model size and training data.

Encoder-free early fusion. Rather than processing audio and video through large, standalone encoders, we opt for a system with minimal pre-processing. Many omnimodal models require training a separate encoder (e.g. Whisper-like) or decoder (e.g. TTS model-like). We instead take in audio signals as dMel and transform it via a light-weighted embedding layer. Images are split into 40x40 patches which are encoded by an hMLP. For the audio decoder we use a flow head. All components are co-trained from scratch together with the transformer.

An illustration of the interaction model architecture for a single 200ms micro-turn. The model takes in any subset of text, audio, or video and predicts text and audio.

Inference optimization. At inference time, 200ms chunks require frequent prefills and decodes of small sizes, each having to meet strict latency constraints. Unfortunately, existing LLM inference libraries are not optimized for frequent small prefills—they often have a significant amount of overhead per turn. To address this, we implemented streaming sessions. The client sends each 200ms chunk as a separate request, while the inference server appends these chunks into a persistent sequence in GPU memory. This avoids frequent memory reallocations and metadata computations, and we've upstreamed a version of this feature to SGLang. In addition, we also optimized our kernels for latency as well as the shapes we see for bidirectional serving. For example, we use a gather+gemv strategy for MoE kernels instead of the standard grouped gemm.

Trainer-sampler alignment. We've found bitwise trainer-sampler alignment to be useful for training stability as well as debugging the various components of our system. We implement batch-invariant kernels with minimal (<5%) e2e performance overhead. To highlight two particular kernels:

All-reduce and reduce-scatter: We use NVLS to implement low-latency comm kernels which are deterministic on Blackwell, and achieve bitwise alignment between somewhat different parallelism strategies (i.e. Sequence Parallelism and Tensor Parallelism).
Attention: The primary challenge with attention is Split-KV, which can typically lead to inconsistent accumulation orders between decode and prefill. However, we can maintain consistent accumulation order by choosing to split consistently between decode and prefill. For example, we could split SMs to process 4096 tokens at a time (left-aligned), achieving good efficiency in both prefill and decode.

Coordination between interaction and background models. When the interaction model delegates, it sends a rich context package — not a standalone query, but the full conversation. Results stream back as the background model produces them, and the interaction model interleaves these updates into the conversation at a moment appropriate to what the user is currently doing, rather than as an abrupt context switch.

Safety. Because real-time interaction stresses safety differently than turn-based exchanges, our safety work focused on two axes: modality-appropriate refusals and long-horizon robustness. To make refusals colloquial in speech, we use a text-to-speech model to generate refusal and over-refusal training data covering a range of disallowed topics, with the refusal boundary calibrated to favor naturally-phrased, but no less firm, refusals. To improve robustness across extended speech-to-speech conversations, we used an automated red-teaming harness to generate multi-turn refusal data, while maintaining close behavioral parity with the model's text-based refusals.

Benchmarks

Intelligence and interactivity frontier

We show that our interaction model, named TML-Interaction-Small, is the first model that has both strong intelligence/instruction following and interactivity. To measure interaction quality we use FD-bench which is one of the few existing benchmarks intended to measure interactivity. In FD-bench v1.5, the model is given prerecorded audio, and must respond at certain times. This benchmark measures model behavior across several scenarios: user interruption, user backchannel, talking to others, and background speech. Our model scores well in all of these areas. To quantify intelligence we use Audio MultiChallenge, a common benchmark that tracks intelligence and instruction following.

Intelligence and Interactivity Frontier. Our model dominates interaction quality while being more intelligent than any non thinking model. We achieve the best responsiveness measured as a latency between user and model turns.

For more intelligence, safety, and interactivity/latency results please see the table below. We report our performance on both streaming and turn-based benchmarks.

	Instant	Thinking
	TML-interaction-small	GPT-realtime-2.0 (minimal)	GPT-realtime-1.5	Gemini-3.1-flash-live (minimal)	Qwen 3.5 OMNI-plus-realtime	GPT-realtime-2.0 (xhigh)	Gemini-3.1-flash-live (high)
Streaming	FD-bench V1 Turn-taking latency (s) · Audio	0.40	1.18	0.59	0.57	2.14	1.63	0.94
FD-bench V1.5 Average · Audio	77.8	46.8	48.3	54.3	39.0	47.8	45.5
FD-bench V3 Response Quality (%) / Pass@1 (%) · Audio + Tools	82.8* / 68.0*	80.0 / 52.0	77.9 / 55.0	68.5 / 48.0	60.0 / 50.0	81.0 / 58.0	71.4 / 48.0
QIVD** Accuracy (%) · Video + Audio	54.0	57.5	41.2	54.7	59.0	58.2	56.1
Turn-based	Audio MultiChallenge APR (%) · Audio	43.4	37.6	34.7	26.8	-***	48.5	36.1
BigBench Audio Accuracy (%) · Audio	75.7 / 96.5*	71.8	81.4	71.3	73.0	96.6****	96.6
IFEval (VoiceBench) Accuracy (%) · Audio	82.1	81.7	68.1	67.6	80.3	83.2	82.8
IFEval Accuracy (%) · Text	89.7	89.6	87.5	85.8	83.4	95.2	90.0
Harmbench Refusal rate (%) · Text	99.0	99.5	100.0	99.0	99.5	100.0	98.0

* For benchmarks that require reasoning or tool calls we report our results with background agent enabled.
** We evaluate Qualcomm IVD in a streaming setting – is a video-audio QA benchmark. In each video clip, somebody performs an action and speaks a question. We evaluate in a streaming setting, sending the raw clip from the beginning and grading the model's transcript. Following Qwen 3.5 Omni we use a GPT-4o-mini grader.
*** Audio MultiChallenge metrics for all the baseline models are reported by Scale AI, where Qwen 3.5 OMNI-plus-realtime is not listed.
**** Bigbench Audio metrics for all the baseline models are reported by Artificial Analysis, where GPT-realtime-2.0 thinking is on high.

New dimensions of interactivity

The existing interactivity-oriented benchmarks above do not adequately capture the qualitative jumps in interaction capabilities we notice. To that end, we have some early work aimed at quantifying these capabilities.

Time awareness and simultaneous speech. Turn-based models with a dialog management system do not support accurate time estimation or simultaneous speech. Examples include: "How long did it take me to run one mile?", "Correct my mispronunciations as you hear them" or "How long did it take me to write this function?"

We created two internal benchmarks to measure these proactive audio capabilities:

TimeSpeak: Tests whether the model can initiate speech at user-specified times while producing the correct content. For example: "I want to practice my breathing, remind me to breathe in and out every 4 seconds until I ask you to stop."
CueSpeak: Tests whether the model speaks at the appropriate moment with the expected semantically correct response. Dataset entries are created to ensure that the model needs to speak at the same time as the user to get a full score. For example: "Everytime I codeswitch and use another language, give me the correct word in the original language."

For both benchmarks, each example has a single expected semantic response and timing window. We grade with an LLM judge: A response is counted as correct only if it conveys the expected meaning and is delivered at the appropriate time; failing either criterion receives no credit. We report macro-averaged accuracy across examples.

Visual proactivity. Today's commercial real-time APIs perform turn-detection via audio-only dialogue management harnesses. They respond to spoken turns, but they cannot proactively choose to speak when the visual world changes. Several academic papers have built related research prototypes including StreamBridge, Streamo, StreamingVLM, and MMDuet2 which study when to output text in a streaming video input setting. Being text out, they do not study additional constraints of speech-output interaction: speech has duration, can overlap with the user, and must be coordinated with turntaking, interruptions, and backchanneling. Closest to ours is AURA, which adds an ASR/TTS demo around a VideoLLM that decides when to emit text or be silent; in contrast ours is speech-native and full-duplex. For instance, if asked "Please count how many pushups I do" such a system might respond "Sure thing!" and then remain silent – waiting for an audio-only cue that never comes.

We adapted three benchmarks to evaluate visual proactivity of our model:

RepCount-A contains videos of repeated actions and is adapted into an online counting task. We stream the video following an audio instruction "Please count out reps for {action}.". We extract the last number said by the model after the ground truth penultimate rep, and grade by whether it was within one rep of the ground truth. This task measures continuous visual tracking and timely counting.
ProactiveVideoQA consists of videos with questions, whose answers become available at specific moments. We stream the question in audio and then the video. We report the paper's turn-weighted PAUC@ω=0.5 metric (scaled 0-100), averaged across turns and categories. Staying silent scores 25.0. Higher scores require correct answers at the correct times and incorrect answers are penalized.
Charades is a standard temporal action-localization benchmark. Each video contains an action occurring over a labeled time interval. We stream a user audio instruction: "Say 'start' when the person starts doing {action} then say 'Stop' when they stop."; then we stream the video. The model is graded by temporal IoU between predicted and the reference intervals.

	TML-interaction-small	GPT realtime-2.0 (minimal)
Time awareness TimeSpeak · macro-acc	64.7	4.3
Verbal cues trigger CueSpeak · macro-acc	81.7	2.9
Visual-based counting RepCount-A · off-by-one	35.4	1.3
Visual cues trigger ProactiveVideoQA · PAUC@ω=0.5	33.5	25.0*
Visual cues trigger Charades · mIoU	32.4	0

* No-response baseline on ProactiveVideoQA is 25.0

No existing model can meaningfully perform any of these tasks. For the sake of completeness, we report the results of GPT Realtime-2 (minimal), but all models evaluated perform similar or worse on these tasks, including thinking high models. They stay silent or give incorrect answers.

Future evals. We believe that interactivity is an important area for future research and we invite the community to contribute benchmarks here. We are launching a research grant to encourage more research into the field of interaction models and human-AI collaboration, including but not limited to new frameworks for assessing interactivity quality, with details coming soon.

Limitations and future work

Long sessions. Continuous audio and video accumulate context quickly. The streaming-session design handles short and medium interactions well, but very long sessions still require careful context management—an active area of work.

Compute and deployment. Streaming audio and video at low latency requires reliable connectivity. Without a good connection, the experience degrades significantly. We believe that this can be improved significantly in the future both by improving system reliability as well as training our model to be more robust to delayed frames.

Alignment and safety. A realtime interface opens up an exciting area of research for both alignment and safety. We are collecting feedback and reviewing research grants.

Scaling model size. The current TML-Interaction-Small is a 276B parameter MoE with 12B active. While we expect the interactivity to improve with model scale, our larger pretrained models are currently too slow to serve in this setting. We plan to release larger models later this year.

Improved background agents. Although we have primarily focused on real-time interactivity in this post, agentic intelligence is also an essential capability. In addition to pushing agentic intelligence to the frontier, we believe we have just scratched the surface in how the background agents can work together with the interaction model.

Citation

Please cite this work as:

Thinking Machines Lab, "Interaction Models: A Scalable Approach to Human-AI Collaboration",
Thinking Machines Lab: Connectionism, May 2026.

Or use the BibTeX citation:

@article{thinkingmachines2026interactionmodels,
  author = {Thinking Machines Lab},
  title = {Interaction Models: A Scalable Approach to Human-AI Collaboration},
  journal = {Thinking Machines Lab: Connectionism},
  year = {2026},
  month = {May},
  note = {https://thinkingmachines.ai/blog/interaction-models/},
  doi = {10.64434/tml.20260511},
}

Tech securityopensourcenpm

TanStack npm Packages Compromised in Ongoing Mini Shai-Hulud Supply-Chain Attack

84 TanStack npm packages with 12M+ weekly downloads were compromised in a supply-chain attack, forcing TanStack to deprecate affected versions and harden their GitHub Actions workflow.

Socket.dev

Summary

What: Attackers compromised 84 TanStack npm packages including widely-used libraries with over 12 million weekly downloads. TanStack responded by deprecating affected versions, engaging npm security to remove malicious tarballs, purging GitHub Actions cache entries, and merging workflow hardening changes including repository-owner guards and pinned third-party action references.

Why it matters: This attack, dubbed Mini Shai-Hulud, demonstrates how GitHub Actions workflows remain a high-value target for supply-chain compromise. The scale of affected downloads shows how a single workflow vulnerability can cascade across the entire npm ecosystem.

Takeaway: If you use TanStack packages, check the Socket.dev article for the list of compromised versions and follow their recommended remediation actions for affected systems.

Decoder

Supply-chain attack: Malicious code injected into legitimate software packages during the build or distribution process, compromising all downstream users who install the package.
GitHub Actions cache: GitHub's workflow caching mechanism that can be exploited to persist malicious code across workflow runs if not properly secured with repository-owner guards.

Original Article

GemStuffer Campaign Abuses RubyGems as Exfiltration Channel Targeting UK Local Government

GemStuffer abuses RubyGems as an exfiltration channel, packaging scraped UK council portal data into junk gems published from new accounts.

AI videogemini

Google's Gemini Omni video model surfaces ahead of I/O debut

Google's Gemini Omni video model leaked with unusually strong editing features but trailing ByteDance's Seedance 2 on raw cinematic quality.

TestingCatalog

Summary

What: Google's Gemini Omni video model surfaced in Reddit screenshots ahead of Google I/O on May 19-20. Early testers report strong performance in watermark removal, object swapping, and chat-based scene editing, though cinematic quality lags ByteDance's Seedance 2. The model will likely ship in Flash and Pro tiers with metered credit usage.

Why it matters: Google is prioritizing modality unification and editing capabilities over raw generation benchmarks, mirroring its Nano Banana image model strategy of launching with strong editing first and upgrading cinematic quality later.

Deep Dive

Google's Gemini Omni video model briefly appeared to users in a revised Gemini interface, described as "meet our new video model, remix your videos, edit directly in chat, try templates, and more"
The leak appears either accidental or part of a limited A/B test, surfacing ahead of Google I/O 2026 (May 19-20)
Early testing shows the model excels at video editing tasks: watermark removal, object swapping within clips, and scene rewriting via chat instructions
Raw generation quality trails ByteDance's Seedance 2, with reviewers noting cinematic fidelity is "a step behind the current benchmark leader"
Google appears to be following the Nano Banana playbook: Nano Banana launched with middling generation scores but topped editing leaderboards, then was upgraded to frontier-level quality
The model is expected to ship in tiered variants, likely Flash and Pro, with circulating outputs likely from the Flash tier
A new usage limits tab appeared in settings, with users reporting video generation burns through credits quickly, suggesting metered usage
The strategy prioritizes modality unification under Gemini over raw quality leadership at launch
Timing aligns with Google I/O's pattern of unveiling major AI initiatives, with the pre-event leak allowing Google to gather reactions before the keynote

Decoder

Nano Banana: Google's native image generation and editing model that launched in Gemini, initially with modest generation quality but strong editing capabilities, later upgraded to frontier-level performance
Seedance 2: ByteDance's video generation model, currently considered the benchmark leader for cinematic quality in AI-generated video

Original Article

Google's Gemini Omni Video Model Surfaces Ahead of I/O

Fresh signals around Google's upcoming Gemini Omni video model surfaced over the weekend, with Reddit users posting screenshots of a revised Gemini interface exposing the new model card. The description read "Create with Gemini Omni: meet our new video model, remix your videos, edit directly in chat, try templates, and more," appearing to confirm the long-rumored unified approach Google has been preparing ahead of next week's developer event. The rollout looked either accidental or part of a limited A/B test.

Sample video and early feedback 👀

I won't lie, this is one of the best video models I have seen, maybe not *the* best, but a really strong performance. I was particularly impressed by the prompt adherence (except for the one shot with the missing centerpiece), the model…

Alongside the model card, users spotted a new usage limits tab inside settings, and several reported that video generation burned through credits fast, hinting at a metered system similar to what Google has been testing across Gemini surfaces. Early outputs drew mixed reactions. On raw generation fidelity, Omni appears to lag behind ByteDance's Seedance 2, with viewers noting that the cinematic quality is a step behind the current benchmark leader. Where the model stood out was in editing: removing watermarks, swapping objects within clips, and rewriting scenes via chat instructions all worked unusually well for a first public glimpse.

GOOGLE 🔥: An upcoming Gemini Omni video model from Google is expected to be much more advanced in video editing, capable of completing tasks like removing watermarks, replacing objects in the video, and more.

It is also likely that Google will release 2 versions of this model,…

That pattern mirrors Nano Banana, which launched as a native image model on Gemini, debuted with middling generation scores but topped editing leaderboards, and was later upgraded into a frontier image system. Google appears to be running the same playbook for video, prioritizing modality unification under Gemini over raw quality leadership at launch. There are also hints that Omni will ship in tiered variants, likely Flash and Pro, with the outputs circulating now most likely coming from the Flash tier.

Google keeps preparing its upcoming Gemini Omni models for the release.

Gemini Omni model will be available on APIs as well

The model will be considered as Agent, similarly to Deep Research on AI Studio

Soon? 👀

P. S. Just a reminder that Nano Banana 1 wasn't better than…

The timing fits neatly with Google I/O on May 19 and 20, where the company has a track record of unveiling its most ambitious AI shifts. A short pre-event window paired with a controlled leak gives Google room to gather reactions and shape the narrative before the keynote.

AI hardwarenvidiainference

The Inference Shift

Cerebras' surging IPO signals AI inference splitting into speed-optimized answer chips for humans and memory-heavy agentic chips for autonomous work without humans waiting.

Stratechery

Summary

What: Cerebras raised its IPO price range to $150-160 per share (from $115-125) and is marketing 30 million shares. Its WSE-3 chip has 44GB on-chip SRAM with 21 PB/s bandwidth, roughly 6,000 times the memory bandwidth of Nvidia's H100, making it ideal for low-latency human interactions but unsuitable when KV caches exceed on-chip capacity.

Why it matters: This reveals a fundamental architectural split in AI compute: answer inference (human-facing, speed-critical) will use expensive high-bandwidth chips, while agentic inference (machine-to-machine, latency-tolerant) will unbundle the GPU in favor of slower, cheaper memory hierarchies with good-enough compute, potentially eroding Nvidia's premium and making older chip nodes strategically viable.

Deep Dive

Cerebras is raising its IPO price to $150-160/share (up from $115-125) and increasing share count to 30 million amid surging demand for AI chips beyond GPUs
The WSE-3 uses wafer-scale integration (entire 300mm wafer as single chip) with 44GB on-chip SRAM at 21 PB/s bandwidth, versus H100's 80GB HBM at 3.35 TB/s (6,000x bandwidth advantage, but 55% of memory capacity)
Inference has three parts: prefill (parallel, compute-bound), decode reading KV cache (serial, bandwidth-bound with variable memory), and decode over model weights (serial, bandwidth-bound with fixed memory)
Cerebras excels when models and KV caches fit entirely in on-chip memory, delivering dramatically faster token generation for human-facing applications like voice interfaces and AI wearables
Ben Thompson argues inference is splitting into two categories: answer inference (human-facing, latency-critical, optimized for token speed) and agentic inference (machine-to-machine, latency-tolerant, optimized for memory capacity)
Agentic inference doesn't need cutting-edge speed because agents work autonomously without humans waiting, making slower cheaper memory (DRAM vs HBM) and older chip nodes economically rational
This architectural split favors different players: Nvidia for training (compute + HBM + networking), Cerebras/Groq for answer inference (extreme speed), and potentially simpler/cheaper systems for agentic inference (memory hierarchy + good-enough compute)
Nvidia is responding with Dynamo inference framework, standalone memory racks, and CPU racks to disaggregate inference workloads and keep expensive GPUs busy
Implications: China has sufficient older-node chips for agentic inference (circumventing cutting-edge compute restrictions), space data centers become viable with cooler/simpler older nodes, and Nvidia's architectural premium may erode in the largest inference market (agentic)
SpaceX contracted Anthropic for 300+ megawatts at Colossus 1 (220,000+ Nvidia GPUs), demonstrating that training and inference still share GPU infrastructure when models aren't heavily used

Decoder

KV cache: Key-Value cache storing context and attention states from previous tokens in a language model sequence, growing with each generated token and requiring full reads during decode to maintain conversation coherence
Prefill: Initial encoding step in LLM inference that processes the entire input prompt in parallel to populate the KV cache before token generation begins
Decode: Serial token generation phase where the model alternates between reading the KV cache and model weights through each layer to produce one output token at a time
HBM (High Bandwidth Memory): Specialized stacked DRAM offering much higher bandwidth than standard memory, used in GPUs to feed data fast enough to keep compute units busy during parallel operations
Reticle limit: Maximum chip area (~26mm × 33mm) that lithography equipment can pattern in a single exposure, traditionally limiting chip size until Cerebras invented wafer-scale wiring
Wafer-scale integration: Cerebras' technique of wiring across scribe lines (boundaries between exposures on a silicon wafer) to create a single functional chip from an entire 300mm wafer instead of cutting it into separate dies
SRAM: Static RAM residing on-chip with extremely low latency but limited capacity and high cost, versus off-chip HBM which offers vastly more capacity at lower bandwidth
Interposer: Silicon substrate connecting multiple chips to enable chip-to-chip communication, used by Nvidia's B200 to link dies together versus Cerebras' monolithic wafer approach

Original Article

The Inference Shift

If you were looking for the ideal time to IPO, being a chip company in May 2026 is hard to beat. Reuters reported over the weekend:

Cerebras Systems is set to raise the size and price of its initial public offering as soon as Monday, as demand for the artificial intelligence chipmaker's shares continues to climb, two people familiar with the matter told Reuters on Sunday. The company is considering a new IPO price range of $150-$160 a share, up from $115-$125 a share, and raising the number of shares marketed to 30 million from 28 million, said the sources, who asked not to be identified because the information isn't public yet.

The fundamental driver of the ongoing surge in semiconductor stocks is, of course, AI, particularly the realization that agents are going to need a lot of compute. What Cerebras represents, however, is something broader: while the compute story for AI has been largely about GPUs, particularly from Nvidia, the future is going to look increasingly heterogeneous.

The GPU Era

The story of how Graphics Processing Units became the center of AI is a well-trodden one, but in brief:

Just as drawing pixels on a computer screen was a parallel process, which meant there was a direct connection between the number of processing units and graphics speed, making AI-related calculations was a parallel process, which meant there was a direct connection between the number of processing units and calculation speed.
Nvidia enabled this dual-usage by making its graphics processors programmable, and created an entire software ecosystem called CUDA to make this programming accessible.
The big difference between graphics and AI has been the size of the problem being solved — models are a lot bigger than video game textures — which has led to a dramatic expansion in high-bandwidth memory (HBM) per GPU, and dramatic innovations in terms of chip-to-chip networking to allow multiple chips to work together as one addressable system. Nvidia has been the leader in both.

The number one use case for GPUs has been training, which stresses the third point in particular. While the calculations within each training step are massively parallel, the steps themselves are serial: every GPU has to share its results with every other GPU before the next step can begin. This is why a trillion-parameter model needs to fit in the aggregate memory of tens of thousands of GPUs that can communicate as one system. Nvidia dominates both problem spaces, first by securing HBM ahead of the rest of the industry, and second thanks to its investments in networking.

Of course training isn't the only AI workload: the other is inference. Inference has three main parts:

Prefill encodes everything the LLM needs to know into an understandable state; this is highly parallelizable and compute matters.
The first part of decode entails reading the KV cache — which stores context, including the output of the prefill step — to make an attention calculation. This is a serial step where bandwidth matters, but the memory requirements are variable and increasingly large.
The second part of decode is the feed-forward computation over the model weights; this is also a serial step where bandwidth matters, and the memory requirements are defined by the size of the model.

The two decode steps alternate for every layer of the model (they're interleaved, not in sequence), which is to say that decode is serial and memory-bandwidth bound. For every token generated, two distinct memory pools must be read: the KV cache, which stores context and grows with each token, and the model weights themselves. Both must be read in full to produce a single output token.

GPUs handle all three needs: high compute for prefill, abundant HBM for KV cache and model weights, and chip-to-chip networking to pool memory across multiple chips when a single GPU isn't enough. In other words, what works for training works for inference — look no further than the deal SpaceX made with Anthropic. From Anthropic's blog:

We've signed an agreement with SpaceX to use all of the compute capacity at their Colossus 1 data center. This gives us access to more than 300 megawatts of new capacity (over 220,000 NVIDIA GPUs) within the month. This additional capacity will directly improve capacity for Claude Pro and Claude Max subscribers.

SpaceX retains Colossus 2 — presumably for both training of future models and inference of existing ones — and can afford to do both in the same data center precisely because xAI's models aren't getting much usage; more pertinently to this piece, they can do both in the same data center because both training and inference can be done on GPUs. Indeed, the GPUs Anthropic is contracting for at Colossus 1 were originally used for training as well; the fact that GPUs are so flexible is a big advantage.

Understanding Cerebras

Cerebras makes something completely different. While a silicon wafer has a diameter of 300mm, the "reticle limit" — the maximum area that a lithography tool can expose on that wafer — is around 26mm x 33mm. This is the effective size limit for chips; going beyond that entails linking two separate chips together over a chip-to-chip interposer, which is exactly what Nvidia has done with the B200. Cerebras, on the other hand, has invented a way to lay down wiring across the so-called "scribe lines" that are the boundary between reticle exposures, making the entire wafer into a single chip with no need for relatively slow chip-to-chip linkages.

The net result is a chip with a lot of compute and a lot of SRAM that is blisteringly fast to access. To put it in numbers, the WSE-3 (Cerebras' latest chip) has 44GB of on-chip SRAM at 21 PB/s of bandwidth; an H100 has 80GB of HBM at 3.35 TB/s. In other words, the WSE-3 has just over half the memory of an H100, but 6,000 times the memory bandwidth.

The reason to compare the WSE-3 to an H100 is that the H100 is the chip most used for inference — and inference is clearly what Cerebras is most well-suited for. You can use Cerebras chips for training, but the chip-to-chip networking story isn't very compelling, which is to say that all of that compute and on-chip memory is mostly just sitting around; what is much more interesting is the idea of getting a stream of tokens at dramatically faster speed than you can from a GPU.

Note, however, that the limitation in terms of training also potentially applies in terms of inference: as long as everything fits in on-chip memory Cerebras' speed is an incredible experience; the moment you need more memory, whether that be for a larger model or, more likely, a larger KV cache, then Cerebras doesn't make much sense, particularly given the price. That whole-wafer-as-chip technique means high yields are a massive challenge, which hugely drives up costs.

At the same time, I do think there will be a market for Cerebras-style chips: right now the company is highlighting the usefulness of speed for coding — reasoning means a lot of tokens, which means that dramatically scaling up tokens-per-second equals faster thinking — but I think this is a temporary use case, for reasons I'll explain in a bit. What does matter is how long humans are waiting for an answer, and as products like AI wearables become more of a thing, the speed of interaction, particularly for voice — which will be a function of token generation speed — will have a tangible effect on the user experience.

Agentic Inference

I have previously made the case, including in Agents Over Bubbles, that we have gone through three inflection points in the LLM era:

ChatGPT demonstrated the utility of token prediction.
o1 introduced the idea of reasoning, where more tokens meant better answers.
Opus 4.5 and Claude Code introduced the first usable agents, which could actually accomplish tasks, using a combination of reasoning models and a harness that utilized tools, verified work, etc.

All of this falls under the banner of "inference", but I think it will be increasingly clear that there is a difference between providing an answer — what I will call "answer inference" — and doing a task — what I will call "agentic inference." Cerebras' target market is "answer inference"; in the long run, I think the architecture for "agentic inference" will look a lot different, not just from Cerebras' approach, but from the GPU approach as well.

I mentioned above that fast inference for coding is a temporary use case. Specifically, coding with LLMs requires a human in the loop. It's the human that defines what is to be coded, checks the work, commits the pull request, etc.; it's not hard to envision a future, however, where all of this is completely handled by machines. This will apply to agentic work broadly: the true power of agents will not be that they do work for humans, but rather that they do work without human involvement at all.

This, by extension, will mean that the likely best approach to solving agentic inference will look a lot different than answer inference. The most important aspect for answer inference is token speed; the most important aspect for agentic inference, however, is memory. Agents need context, state, and history. Some of that will live as active KV cache; some will live in host memory or SSDs; much of it will live in databases, logs, embeddings, and object stores. The important point is that agentic inference will be less about GPUs answering a question and more about the memory hierarchy wrapped around a model.

Critically, this articulation of an agentic-specific memory hierarchy implies a necessary trade-off of speed for capacity. Here's the thing, though: lower speed isn't nearly as important a consideration if there isn't a human in the loop. If an agent is waiting around for a job that is being run overnight, the agent doesn't know or care about the user experience impact; what is most important is being able to accomplish a task, and if entirely new approaches to memory make that possible, then delays are fine.

Meanwhile, if delays are fine, then all of the focus on pure compute power and high-bandwidth memory seems out of place: if latency isn't the top priority, then slower and cheaper memory — like traditional DRAM, for example — makes a lot more sense. And if the entire system is mostly waiting on memory, then chips don't need to be as fast as the cutting edge either. This represents a profound shift in future architectures, but it also doesn't mean that current architectures are going away:

Training will continue to matter, and Nvidia's current architecture, including high-speed compute, large amounts of high-bandwidth memory, and high-speed networking, will likely continue to dominate.
Answer inference will be a meaningful market, albeit a relatively small one, and speed from chips like Cerebras or Groq (I explained how Nvidia is deploying Groq's LPUs here) will be very useful.
Agentic inference will gradually unbundle the GPU, which alternates between stranding high-bandwidth memory (during the prefill process) and stranding compute (during the decode process), in favor of increasingly sophisticated memory hierarchies dominated by high capacity and relatively lower cost memory types, with "good enough" compute; indeed, if anything it will be the speed of CPUs for things like tool use that will matter more than the speed of GPUs.

At the same time, these categories won't be equal in size or importance. Specifically, agentic inference will be the largest market by far, because that is the market that won't be limited by humans or time. Today's agents are fancy answer inference; in the future true agentic inference will be work done by computers according to dictates given by other computers, and the market size scales not with humans but with compute.

The Implications of Agentic Inference on Compute

To date the invocation of "scaling with compute" has implicitly meant Nvidia bullishness. However, much of Nvidia's relative advantage to date has been a function of latency: Nvidia chips have fast compute, but keeping that compute busy has required big investments in ever-expanding HBM memory and networking. If latency isn't the key constraint, however, then Nvidia's approach seems less worth paying a premium for.

Nvidia does recognize this shift: the company launched an inference framework called Dynamo that helps disaggregate different parts of inference, and is shipping products like standalone memory and CPU racks to enable increasingly large KV caches and faster tool use, the better to keep their expensive GPUs busy. Ultimately, however, it's easy to see cost and simplicity being increasingly attractive to hyperscalers for agentic inference that isn't remotely GPU-bound.

China, meanwhile, for all of its lack of leading edge compute, has everything it needs for agentic inference: fast-enough (but not leading-edge) GPUs, fast-enough (but not leading-edge) CPUs, DRAM, hard drives, etc. The challenge, of course, is compute for training; it's also possible that answer inference is more important for national security, at least when it comes to military applications.

The other interesting angle is space: slower chips actually make space data centers more viable for a number of reasons. First, if memory can be offloaded, chips can be made much simpler and run much cooler. Second, older nodes, by virtue of being physically larger, will better withstand space radiation. Third, older nodes require less power, which means there will be less heat to dissipate via radiation. Fourth, not being on the bleeding edge will mean higher reliability, an important consideration given that satellites won't be repairable.

Nvidia CEO Jensen Huang regularly says that "Moore's Law is Dead"; what he means is that the future of computing speed-ups will be a function of systems innovation, which is exactly what Nvidia has done. Maybe the most profound implication of agents that act without humans in the loop, however, will be that Moore's Law doesn't matter, and that the way we get more compute is by realizing that the compute we have is already good enough.

AI infrastructureawspytorch

Foundation Model Scaling

AWS detailed how foundation model scaling split from one regime into three—pre-training, post-training, and test-time compute—with P6e-GB200 UltraServers exposing 72 Blackwell GPUs in a single 13.4 TB NVLink domain.

Hugging Face

Summary

What: AWS engineers Keita Watanabe, Pavel Belevich, and Aman Shanbhag published a reference architecture for distributed training spanning EC2 P5 (H100/H200) and P6 (Blackwell B200/B300) instances, EFA v4 networking (18% faster than v3), and orchestration via Slurm and Kubernetes through SageMaker HyperPod. The ML stack layers from CUDA and NCCL through PyTorch to frameworks including Megatron/NeMo, veRL for RLHF, and vLLM/SGLang for inference. HyperPod's checkpointless training uses continuous EFA-based peer-to-peer state replication instead of serializing multi-terabyte checkpoints to FSx for Lustre. Observability relies on Prometheus, Grafana, and DCGM-Exporter monitoring GPU XID events and ECC errors.

Why it matters: This signals convergence of infrastructure requirements across all three scaling regimes rather than divergence—post-training and test-time compute demand the same tightly coupled accelerators, low-latency fabric, and distributed storage as pre-training, differing mainly in workload profile. It also reveals that even proprietary cloud infrastructure now standardizes on open-source orchestration (Slurm, Kubernetes, Prometheus) rather than building closed stacks.

Deep Dive

Foundation model scaling evolved from a single pre-training curve (Kaplan et al. 2020 power laws) to three distinct regimes: pre-training, post-training (SFT, RLHF), and test-time compute (search, verification, multi-sample strategies)
AWS infrastructure spans four layers: accelerated compute (P5/P6 instances), resource orchestration (Slurm via ParallelCluster/PCS, Kubernetes via EKS/HyperPod), ML software stack (drivers → CUDA → NCCL → PyTorch → training frameworks), and observability (Prometheus/Grafana)
P6 instances use Blackwell B200 (2.25 PFLOPS BF16, 4.5 PFLOPS FP8, 180 GB HBM3e per GPU) and B300 (same compute, 288 GB HBM3e per GPU) with NVLink 5th gen (14.4 TB/s aggregate) and EFA v4 (400 GB/s or 800 GB/s aggregate)
P6e-GB200 UltraServers compose up to 72 Blackwell GPUs into a single NVLink domain with 13.4 TB aggregate HBM3e, reducing how often MoE all-to-all communication must leave the NVLink fabric to traverse EFA
Grace-Blackwell superchips provide cache-coherent NVLink-C2C between Grace CPU memory and Blackwell GPU HBM, enabling GPU workloads to extend into CPU-attached memory without explicit PCIe copies
HyperPod's checkpointless training maintains continuous peer-to-peer state replication across GPUs via EFA; on failure, surviving nodes reconstruct lost state through communication rather than reading multi-TB checkpoints from FSx/S3, reducing recovery latency
Kubernetes gaps for distributed training (pod-level vs job-level scheduling, no topology awareness, no batch queue semantics) are addressed by layered extensions: Kueue for admission control and gang scheduling, Volcano or NVIDIA KAI Scheduler for topology-aware placement
Custom kernels (FlashAttention, Triton-compiled fused ops, CUTLASS GEMMs) and specialized libraries increasingly determine end-to-end performance as much as the ML framework, as step time is often dominated by memory movement and collective communication rather than raw compute
MoE models with expert parallelism depend on all-to-all collectives for token dispatch (send each token to the GPU hosting its assigned expert) and combine (return expert outputs), with communication volume scaling with number of experts and becoming a bottleneck at high expert-parallelism degrees
Disaggregated inference separating prefill and decode onto distinct GPU pools requires efficient point-to-point KV cache transfer; NVIDIA NIXL provides a unified API for point-to-point transfers across memory tiers (HBM, DRAM, NVMe) and interconnects (NVLink, InfiniBand, Ethernet)
veRL (Volcano Engine Reinforcement Learning) implements PPO, GRPO, and REINFORCE++ with HybridFlow architecture mixing training backends (FSDP2, Megatron) with inference engines (vLLM, SGLang) in the same job, sharing model weights in memory between actor and rollout components
GPU health monitoring via DCGM tracks ECC single-bit errors (SBE) and double-bit errors (DBE), with XID 63 (row remap failure), XID 64 (GPU fallen off bus), and XID 94/95 (contained/uncontained errors) warranting immediate node replacement
Storage is tiered: local NVMe instance store (30.72 TB raw, 8 × 3.84 TB NVMe SSD) for hot data, FSx for Lustre (POSIX parallel filesystem with TB/s throughput, Data Repository Associations for S3 lazy loading) for shared high-throughput access, S3 for durable checkpoint persistence
Amazon EC2 UltraClusters provision thousands of accelerated instances as a single tightly placed cluster within an Availability Zone, interconnected with a petabit-scale nonblocking network
The aws-ofi-nccl plugin bridges NCCL to libfabric interfaces, allowing NCCL to leverage EFA's OS-bypass and Scalable Reliable Datagram (SRD) protocol without application changes

Decoder

EFA (Elastic Fabric Adapter): AWS network interface providing OS-bypass RDMA using the Scalable Reliable Datagram (SRD) protocol; enables applications to communicate directly with the network device through libfabric API, bypassing the kernel to reduce latency for collective operations in distributed training
NVLink domain: The set of GPUs directly connected via NVIDIA's NVLink high-bandwidth interconnect (e.g., 8 GPUs in a p5.48xlarge instance, or 72 GPUs in a P6e-GB200 UltraServer); collectives within the domain avoid traversing the host networking stack, achieving higher bandwidth and lower latency than EFA-based cross-node communication
UltraServers: AWS EC2 instances that extend the NVLink domain beyond a single instance by connecting multiple component instances through dedicated accelerator interconnect; P6e-GB200 UltraServers built on NVIDIA GB200 NVL72 platform expose up to 72 Blackwell GPUs in one NVLink domain
All-to-all collective: Communication pattern in MoE models where every GPU exchanges data with every other GPU in the expert-parallel group; used for dispatch (send each token to the GPU hosting its assigned expert) and combine (return expert outputs to originating GPUs), with communication volume scaling with number of experts
Checkpointless training: Fault tolerance approach in HyperPod that maintains continuous peer-to-peer state replication across GPUs instead of periodically serializing model state to shared storage; on failure, surviving nodes reconstruct lost state through EFA-based communication rather than reading multi-terabyte checkpoints
XID events: NVIDIA GPU error codes indicating hardware failures; XID 63 (row remap failure), XID 64 (GPU fallen off bus), and XID 94/95 (contained/uncontained errors) typically warrant immediate node replacement, while accelerating ECC single-bit error (SBE) rates often precede more severe failures
DCGM (Data Center GPU Manager): NVIDIA's suite for managing and monitoring GPUs in cluster environments; DCGM-Exporter exposes GPU metrics (utilization, memory, power, temperature, ECC errors, XID events) in Prometheus format for observability stacks

Original Article

Building Blocks for Foundation Model Training and Inference on AWS

For a long time, "scaling" in foundation models mostly meant one thing: spend more compute on pre-training and capabilities rise. That intuition was supported by empirical work such as Kaplan et al. (2020), which reported predictable power-law trends in loss as you scale model parameters, dataset size, and training compute. In practice, these trends justified sustained investment in large-scale accelerator capacity and the surrounding distributed infrastructure needed to keep it efficiently utilized. But the frontier has evolved—and scaling is no longer a single curve. NVIDIA's "from one to three scaling laws" framing usefully emphasizes that, beyond pre-training, performance increasingly scales through post-training (e.g., supervised fine-tuning (SFT) and reinforcement learning (RL)-based methods) and through test-time compute ("long thinking," search/verification, multi-sample strategies).

Figure: Adapted from "AI's Three Scaling Laws, Explained" (NVIDIA Blog).

Taken together, these scaling regimes push the foundation-model lifecycle—pre-training, post-training, and inference—toward convergent infrastructure requirements: tightly coupled accelerator compute, a high-bandwidth low-latency network, and a distributed storage backend. They also raise the importance of orchestration for resource management, and of application- and hardware-level observability to maintain cluster health and diagnose performance pathologies at scale.

Another key trend is the increasing reliance of the foundation-model lifecycle on an open-source software (OSS) ecosystem that spans model development frameworks, cluster resource management, and operational tooling. At the cluster layer, resource management is typically provided by systems such as Slurm and Kubernetes. Model development and distributed training are commonly implemented in frameworks such as PyTorch and JAX. Monitoring and visualization—that is, observability—are often achieved using Prometheus for metrics collection and Grafana for visualization and alerting, positioned as an operational layer atop infrastructure and resource management. Figure 1 illustrates this layered architecture, showing how hardware infrastructure supports resource orchestration, which in turn enables ML frameworks, with observability spanning across all layers.

Figure 1: The layered architecture of open-source software stacks for foundation model training and inference

This post is intended for machine learning engineers and researchers involved in foundation model training and inference, with particular attention to workflows built atop OSS frameworks. It analyzes how AWS infrastructure—including multi-node accelerator compute, high-bandwidth low-latency networking, distributed shared storage, and associated managed services—interacts with common OSS stacks across the foundation model lifecycle. The primary goal is to provide a technical foundation for understanding systems bottlenecks and scaling characteristics spanning pre-training, post-training, and inference. This introductory post surfaces the overall system architecture, emphasizing the integration points between AWS infrastructure components and OSS tools that underpin large-scale distributed training and inference.

The AWS Building Blocks

The remainder of this series examines how this layered architecture is realized on AWS, progressing through infrastructure, resource orchestration, the ML software stack, and observability. The following sections preview each layer.

Infrastructure: Compute, Network, and Storage

As illustrated in Figure 1, infrastructure is anchored by three coupled building blocks—accelerated compute with large device memory, wide-bandwidth interconnect for collective communication, and scalable distributed storage for data and checkpoints.

Accelerated compute forms the foundation of large-scale foundation model pre-training, post-training, and inference. AWS offers several generations of NVIDIA GPUs as part of its Amazon EC2 accelerated computing instances, including the Amazon EC2 P instance family. The P5 instance family includes p5.48xlarge with eight NVIDIA H100 GPUs, p5.4xlarge with a single H100 GPU for smaller-scale workloads, and p5e.48xlarge/p5en.48xlarge variants with NVIDIA H200 GPUs. The P6 instance family introduces NVIDIA Blackwell B200 architecture with p6-b200.48xlarge and Blackwell Ultra B300 with p6-b300.48xlarge. Across these generations, the dominant scaling axes are peak Tensor throughput, HBM capacity and bandwidth, and interconnect bandwidth (within and across nodes).

As a first-order approximation, peak Tensor Core throughput—measured in floating point operations per second (FLOPS)—helps situate these accelerators on a common axis. The table below summarizes per-GPU peak throughput for dense BF16/FP16 and FP8 Tensor operations, along with HBM capacity and HBM bandwidth, using SXM/HGX-class specifications that align with NVSwitch/NVLink-based multi-GPU nodes.

GPU (representative variant)	BF16/FP16 Tensor peak (dense)	FP8 Tensor peak (dense)	FP4 Tensor peak (dense)	HBM capacity	HBM bandwidth
H100 (SXM)	0.9895 PFLOPS	1.979 PFLOPS	—	80 GB HBM3	3.35 TB/s
H200 (SXM)	0.9895 PFLOPS	1.979 PFLOPS	—	141 GB HBM3e	4.8 TB/s
B200 (HGX, per GPU)	2.25 PFLOPS	4.5 PFLOPS	9 PFLOPS	180 GB HBM3e	8 TB/s
B300 (HGX, per GPU)	2.25 PFLOPS	4.5 PFLOPS	13.5 PFLOPS	288 GB HBM3e	8 TB/s

Note: NVIDIA product tables often report Tensor throughput "with sparsity"; this table reports dense throughput. Where applicable, dense throughput is taken as half of sparse throughput, following NVIDIA's guidance for HGX-class platforms (NVIDIA). DGX figures are system-level; the B200 HBM capacity and bandwidth values are expressed per GPU by dividing DGX totals by eight (NVIDIA).

As models scale, step time is often dominated by collective communication and memory movement rather than raw compute throughput, motivating explicit scale-up and scale-out bandwidth accounting. For the multi-GPU instances, GPU communication spans two regimes. Internal scale-up (NVLink/NVSwitch) provides high-bandwidth, low-latency GPU-to-GPU connectivity within a node, enabling collectives such as all-reduce and all-gather to execute without traversing the host networking stack. External scale-out (EFA) provides OS-bypass networking across nodes, which AWS uses as a building block for Amazon EC2 UltraClusters where communication-heavy collectives span thousands of instances. The following table summarizes key specifications across these instance types:

Instance Type	GPU	GPUs	GPU Memory	NVLink	NVLink BW (aggregate)	EFA	EFA BW (aggregate)
p5.4xlarge	H100	1	80 GB HBM3	—	—	v2	12.5 GB/s
p5.48xlarge	H100	8	640 GB HBM3	4th	7.2 TB/s	v2	400 GB/s
p5e.48xlarge	H200	8	1,128 GB HBM3e	4th	7.2 TB/s	v2	400 GB/s
p5en.48xlarge	H200	8	1,128 GB HBM3e	4th	7.2 TB/s	v3	400 GB/s
p6-b200.48xlarge	B200	8	1,440 GB HBM3e	5th	14.4 TB/s	v4	400 GB/s
p6-b300.48xlarge	B300	8	2,100 GB HBM3e	5th	14.4 TB/s	v4	800 GB/s

Note: EFA bandwidth is converted from Gbps to GB/s (÷8) for consistency with other bandwidth metrics; see the EC2 accelerated computing networking specifications. NVLink and EFA bandwidth figures are shown as aggregate per-instance values rather than per-link values; see the P5 instance family page and the P6 instance family page for the corresponding intra-node interconnect and networking characteristics.

Elastic Fabric Adapter (EFA) is a network interface for Amazon EC2 that provides OS-bypass remote direct memory access (RDMA) capability using the Scalable Reliable Datagram (SRD) protocol. By enabling applications to communicate directly with the network device through the Libfabric API—bypassing the operating system kernel—EFA reduces latency and improves throughput for collective operations in distributed training.

Multiple generations of EFA are available on different instance families. Amazon EC2 P5 and P5e instances are equipped with EFA version 2 (EFAv2). EFA version 3 (EFAv3), provided on P5en instances, reduces packet latency by approximately 35% compared to EFAv2. EFA version 4 (EFAv4), available on P6 instances, delivers an additional 18% improvement in collective communication performance relative to EFAv3.

At scale, both distributed training (streaming corpora and writing multi-terabyte checkpoints) and large-scale inference (staging weights and managing KV cache growth) motivate a tiered storage hierarchy—local NVMe SSD for hot data, Lustre for shared high-throughput access, and Amazon S3 for durable persistence.

In this series' primary multi-GPU instances, local NVMe is provided as instance store (ephemeral) with 30.72 TB raw capacity (8 × 3.84 TB NVMe SSD); see the EC2 accelerated-computing instance store specifications.

Lustre is an open-source, POSIX compliant distributed file system widely used in high-performance computing (HPC) to provide a shared namespace with high aggregate throughput across many clients. Amazon FSx for Lustre provides Lustre as a fully managed service and exposes it as a parallel file system capable of terabytes per second of throughput, millions of IOPS, and sub-millisecond latencies. Data Repository Associations enable integration with Amazon S3, supporting lazy loading of training datasets and automatic checkpoint export for durability.

At cluster scale, these instances are deployed in Amazon EC2 UltraClusters, which provision thousands of accelerated instances as a single, tightly placed cluster within an Availability Zone and interconnect them using a petabit-scale nonblocking network.

Figure: 2nd-generation Amazon EC2 UltraClusters (example P5 UltraCluster).

For workloads with high per-step communication intensity (e.g., expert parallelism in MoE models where all-to-all token dispatch spans many GPUs), the size of the NVLink domain can become a first-order constraint. As an extension of the internal scale-up axis, increasing the NVLink domain reduces how often performance-critical communication must leave the NVLink fabric.

Amazon EC2 UltraServers extend the NVLink domain beyond a single EC2 instance by connecting multiple component instances through a dedicated accelerator interconnect. AWS reports that P6e-GB200 UltraServers are built on the NVIDIA GB200 NVL72 platform and expose up to 72 Blackwell GPUs and 13.4 TB of aggregate HBM3e within one NVLink domain. At larger scales, EFA remains the cross-node fabric for multi-UltraServer jobs, but increasing the intra-domain GPU count can reduce how often performance-critical communication must leave the NVLink fabric.

These systems are built from NVIDIA Grace–Blackwell superchips, which couple Grace CPU memory and Blackwell GPU HBM via cache-coherent NVLink-C2C, enabling direct access across CPU- and GPU-attached memory without explicit host–device copies. In practice, this can extend the effective memory available to GPU workloads (e.g., by placing colder model state or KV cache in CPU-attached memory) while avoiding PCIe-scale copy overheads, albeit with higher latency and lower bandwidth than local HBM.

The component instance type for P6e-GB200 UltraServers is p6e-gb200.36xlarge, which provides four GPUs and Elastic Fabric Adapter (EFA) v4 networking. The tables below summarize the per-instance and composed UltraServer configurations.

Instance Type	GPU	GPUs	GPU Memory	Memory BW	NVLink	NVLink BW	EFA	EFA BW
p6e-gb200.36xlarge	GB200 NVL72	4	740 GB HBM3e	—	—	—	v4	200 GB/s

Note: The p6e-gb200.36xlarge EFA bandwidth is converted from the published aggregate EFA networking (4 × 400 Gbps) to GB/s (÷8); see the EC2 accelerated computing networking specifications.

UltraServer	Component instance type	GPUs (NVLink domain)	HBM3e (aggregate)	EFA	EFA BW
u-p6e-gb200x36	p6e-gb200.36xlarge	36	6.7 TB	v4	1,800 GB/s
u-p6e-gb200x72	p6e-gb200.36xlarge	72	13.4 TB	v4	3,600 GB/s

Note: UltraServer EFA bandwidth is converted from terabits per second (Tbps), as reported by AWS, to GB/s (÷8); see the P6e-GB200 UltraServers announcement and the P6 instance family page.

Resource Orchestration: Slurm and Kubernetes

When training spans hundreds or thousands of accelerators, manual resource management becomes intractable. For example, a training job requiring 512 GPUs must co-schedule 64 eight-GPU nodes (P-instances) simultaneously, and release resources atomically upon completion or failure. Both Slurm and Kubernetes address this challenge through a control-plane architecture: a centralized scheduler maintains cluster state and makes allocation decisions, while worker nodes execute assigned workloads.

Figure 2: High-level architecture of Slurm-based and Kubernetes-based resource orchestration on AWS

Slurm (Simple Linux Utility for Resource Management) is the dominant workload manager in high-performance computing, built on a modular plugin architecture that allows the scheduling algorithm, topology model, resource types, and accounting backend to be configured independently. Its scheduling model organizes resources into partitions (logical groupings of nodes), accepts job submissions via sbatch, and launches parallel tasks via srun with synchronized startup across allocated nodes. Critically for distributed training, Slurm schedules at the job level—allocating entire multi-node jobs atomically before any task launches. A backfill scheduler starts lower-priority jobs in idle slots without delaying higher-priority ones, while a multi-factor priority system weighs fair-share usage, job age, and QOS tiers to order the queue across tenants. Slurm also supports topology-aware placement through plugins that model network switch hierarchies—on AWS, encoding the EFA fabric topology to co-locate jobs on nodes with minimal switch hops—and native GPU scheduling through its Generic Resource (GRES) interface, which tracks GPU types and enforces device affinity.

AWS provides multiple deployment options for Slurm-based orchestration. AWS ParallelCluster is an open-source cluster management tool that automates the deployment of Slurm clusters on EC2, handling head node provisioning, compute fleet scaling, and integration with shared storage. AWS Parallel Computing Service (PCS) offers an alternative that provides the managed control plane. For distributed training workloads specifically, Amazon SageMaker HyperPod supports Slurm mode with additional capabilities tailored to large-scale training, such as continuous node health monitoring and job auto-resume functionality.

Kubernetes takes a declarative, API-driven approach: users specify desired state through resource manifests, and controllers reconcile actual state to match. While Kubernetes excels at model deployment, its native scheduling model exposes several gaps for tightly coupled distributed training. Kubernetes schedules at the pod level; without job-level atomicity, a multi-node training job can partially start—some ranks running while others remain Pending—wasting GPUs or causing deadlocks. Vanilla Kubernetes also lacks batch queue semantics with priority-based backfill, built-in awareness of network fabric topology (NVLink domains, EFA interconnects) for placement of communication-heavy collectives.

Several Kubernetes-native projects address these gaps at different layers. Kueue operates as an admission controller atop the default scheduler, managing job-level gang admission, multi-tenant quotas with hierarchical fair sharing, and priority-based preemption—while delegating pod placement to the underlying scheduler. Volcano and NVIDIA KAI Scheduler take a different approach, replacing or augmenting the default scheduler to integrate gang scheduling directly with topology-aware pod placement—Volcano as a general-purpose batch scheduler, KAI Scheduler with deep NVLink/NVSwitch awareness for GPU-optimized placement. These layers are complementary: Kueue can manage admission and quota policy while passing admitted jobs to a topology-aware scheduler for placement.

For Kubernetes-based orchestration on AWS, Amazon Elastic Kubernetes Service (EKS) provides managed Kubernetes with GPU scheduling via the NVIDIA device plugin. Amazon SageMaker HyperPod also supports EKS mode, combining Kubernetes orchestration with HyperPod's training-specific capabilities. HyperPod EKS extends EKS with features designed for foundation model training at scale. Task governance provides compute allocation and policy enforcement across teams, integrating managed Kueue for admission control and Karpenter for just-in-time node provisioning. Checkpointless training addresses the recovery latency inherent in traditional checkpoint-based fault tolerance. Rather than periodically serializing model state to shared storage, checkpointless training maintains continuous peer-to-peer state replication across GPUs. When a failure occurs, surviving nodes reconstruct the lost state through EFA-based communication rather than reading multi-terabyte checkpoints from FSx for Lustre or S3. Elastic training enables jobs to automatically scale based on resource availability. When additional accelerators become available (e.g., from completed jobs or newly provisioned capacity), elastic jobs can expand to utilize them; when higher-priority workloads require resources, jobs can contract while maintaining training progress.

ML Software Stack

Distributed training and inference involve multiple software layers that must be correctly configured and coordinated. A useful model treats the runtime stack as five layers, ordered from hardware-adjacent components (which must function correctly for anything to run) to framework-level abstractions (which determine programmer productivity and model throughput): hardware enablement, accelerator runtime and math libraries, communication substrate, ML frameworks, and distributed training/inference frameworks.

Figure 3: The ML software stack for distributed training and inference on EC2 instances

Hardware enablement: kernel drivers

At the foundation, Linux kernel drivers provide direct hardware access. The NVIDIA GPU driver exposes compute capabilities and supports GPUDirect RDMA for direct data transfers between GPUs and network adapters. The GDRCopy driver (gdrdrv) enables low-latency CPU-initiated copies to and from GPU memory, used by NCCL for small-message transfers. The EFA driver provides OS-bypass networking through the libfabric API, and the Lustre client driver enables POSIX access to FSx for Lustre parallel file systems.

Accelerator runtime, compilers, and kernel libraries

The CUDA platform provides the programming model and runtime for GPU compute. Applications compiled against CUDA can launch kernels on NVIDIA GPUs, manage device memory, and coordinate execution across multiple devices. The current release is CUDA Toolkit 13.x, with support for Blackwell architecture (compute capability 10.x).

Modern training and inference performance is increasingly driven by specialized optimization libraries and custom kernels, not just general-purpose vendor primitives. Kernels like FlashAttention fuse attention into a single memory-efficient pass, cutting HBM traffic and improving throughput. Many teams also write shape- and precision-specialized fused kernels (e.g., layernorm/residual/activation, quantized GEMMs, MoE dispatch, KV-cache ops) tuned to their exact models. This is enabled by programmable toolchains such as Triton (Python GPU kernel compiler) and NVIDIA's CuTe (tensor layout and warp-level DSL), with libraries like CUTLASS providing highly optimized GEMM and fusion building blocks. In practice, this kernel and compiler layer often determines end-to-end performance as much as the ML framework.

Communication substrate: NCCL and transport plugins

Multi-GPU training depends on efficient collective communication. NVIDIA Collective Communications Library (NCCL) implements collective operations—all-reduce, all-gather, reduce-scatter, all-to-all, broadcast, and point-to-point send/receive—with topology-aware algorithms that exploit NVLink for intra-node communication and network transports for inter-node traffic. NCCL dynamically detects the communication topology and selects ring or tree algorithms depending on message size and available bandwidth. While data-parallel and tensor-parallel strategies rely primarily on all-reduce and all-gather, Mixture-of-Experts (MoE) models with expert parallelism depend on all-to-all collectives to route tokens between GPUs: a dispatch all-to-all sends each token to the GPU hosting its assigned expert, and a combine all-to-all returns expert outputs to the originating GPUs (NVIDIA Developer Blog). Because every GPU exchanges data with every other GPU in the expert-parallel group, all-to-all communication volume scales with the number of experts and can become a dominant bottleneck at high expert-parallelism degrees.

On AWS, NCCL's inter-node communication is enabled through the aws-ofi-nccl plugin, which maps NCCL's transport APIs to libfabric interfaces. This allows NCCL to leverage EFA's OS-bypass and Scalable Reliable Datagram (SRD) protocol without application changes.

For inference workloads, collective operations do not capture all communication patterns. Disaggregated inference architectures—which separate prefill and decode phases onto distinct GPU pools—require efficient point-to-point data movement, particularly for transferring KV cache state between instances. NVIDIA Inference Xfer Library (NIXL) addresses this requirement by providing a unified API for point-to-point transfers across memory tiers (HBM, DRAM, NVMe, distributed storage) and interconnects (NVLink, InfiniBand, Ethernet). NIXL integrates with inference frameworks such as NVIDIA Dynamo and supports backends including UCX and GPUDirect Storage.

ML frameworks: PyTorch

The two dominant frameworks for foundation model development are PyTorch and JAX. JAX takes an SPMD (Single Program Multiple Data) approach through XLA, where the same program executes across devices with automatic data distribution and collective lowering. This blog focuses on PyTorch, which sees broader adoption in the open-source ecosystem and forms the basis for the distributed training and inference frameworks discussed below.

PyTorch provides tensor computation with GPU acceleration, automatic differentiation, and a flexible eager-execution model. For distributed workloads, PyTorch's torch.distributed module provides the core primitives: process groups for collective communication, and distributed data-parallel abstractions including Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP2). DDP replicates models across GPUs and synchronizes gradients via all-reduce, while FSDP2 shards parameters, gradients, and optimizer states across workers using techniques from the ZeRO algorithm, enabling training of models that exceed single-GPU memory capacity.

Distributed training and inference frameworks

The top layer comprises frameworks that build on PyTorch to provide higher-level abstractions for distributed training and inference at scale. For training, three categories of frameworks address different points in the complexity-performance tradeoff. Below are few examples

Hugging Face Transformers provides the Trainer class with built-in support for distributed training via Accelerate, which abstracts over DDP, FSDP, and DeepSpeed. This path prioritizes ease of use and broad model compatibility, making it suitable for fine-tuning and moderate-scale training where configuration simplicity matters more than maximum throughput.

NVIDIA Megatron Core targets maximum efficiency at scale, implementing 3D parallelism (tensor, pipeline, and expert parallelism) with optimizations including FP8 mixed precision via Transformer Engine. The NeMo Framework builds on Megatron Core to provide end-to-end workflows for pre-training and fine-tuning.

For reinforcement learning from human feedback (RLHF) and related post-training methods, veRL (Volcano Engine Reinforcement Learning) provides a flexible framework that implements algorithms including PPO, GRPO, and REINFORCE++. veRL's HybridFlow architecture allows mixing training backends (FSDP2, Megatron) with inference engines (vLLM, SGLang) in the same job, avoiding weight synchronization overhead by sharing model weights in memory between actor and rollout components.

For inference serving, vLLM implements PagedAttention, managing the KV cache as paged virtual memory to reduce fragmentation and enable higher batch sizes. SGLang extends this with RadixAttention for automatic prefix reuse across requests, a zero-overhead batch scheduler that overlaps CPU scheduling with GPU computation, and a cache-aware load balancer that routes requests based on predicted cache hit rates. Both frameworks support tensor parallelism for serving models that exceed single-GPU memory, and both integrate with NVIDIA Dynamo for disaggregated serving architectures that separate prefill and decode phases.

Observability

Observability is a prerequisite for debugging and operating distributed training systems at scale. When a training job stalls or throughput degrades, practitioners need visibility into whether the cause is hardware failure, network congestion, storage bottlenecks, or application-level inefficiency. At the infrastructure scale discussed in this series—thousands of GPUs, petabits of interconnect bandwidth, and terabytes of checkpoint data—the challenge shifts from simple monitoring to systematic telemetry collection, storage, and analysis. Observability spans three telemetry categories: infrastructure metrics (GPU, network, storage), workload metrics (training throughput, queue latency), and alerting for proactive fault detection.

Core Stack: Prometheus and Grafana

The de facto standard for observability in Kubernetes and HPC environments combines Prometheus for metrics collection with Grafana for visualization and alerting. Prometheus operates on a pull-based model, periodically scraping HTTP endpoints exposed by metric exporters. Collected metrics are stored in a time-series database (TSDB) and queried via PromQL, a flexible query language for aggregation, filtering, and alerting rule evaluation. Grafana consumes Prometheus as a data source, rendering dashboards and triggering alerts based on PromQL expressions.

For production deployments, Amazon Managed Service for Prometheus (AMP) provides a fully managed, Prometheus-compatible time-series database that scales to ingest millions of samples per second without requiring operators to manage storage, replication, or high availability. Amazon Managed Grafana (AMG) offers a managed Grafana workspace with native integration to AMP and AWS authentication via IAM Identity Center. Together, these services eliminate operational overhead while preserving compatibility with existing Prometheus exporters and Grafana dashboards.

GPU, Network, and Application Telemetry

DCGM-Exporter exposes NVIDIA GPU metrics in Prometheus format, including utilization, memory usage, power, temperature, and hardware health indicators such as ECC errors and XID events. For training workloads, SM activity (DCGM_FI_PROF_SM_ACTIVE) often provides a more accurate measure of compute efficiency than basic utilization metrics.

EFA exposes driver-level statistics (bytes, packets, retransmits, timeouts) that help diagnose collective operation bottlenecks in distributed training. The aws-ofi-nccl plugin bridges NCCL to the libfabric interface, and operators can combine EFA counters with NCCL diagnostics (NCCL_DEBUG=INFO) to isolate network-layer issues.

Amazon FSx for Lustre exposes client-side metrics including throughput and metadata latency, while application-level metrics (step time, tokens per second, loss values for training; TTFT, inter-token latency for inference) can be exported via Prometheus client libraries.

GPU Health Monitoring and Alerting

Proactive fault detection prevents hardware issues from propagating into extended training interruptions. A typical workflow monitors DCGM health metrics and triggers alerts when error counts exceed thresholds. ECC single-bit errors (SBE) may be tolerable in small numbers, but accelerating SBE rates often precede double-bit errors (DBE) or other failures. XID 63 (row remap failure), XID 64 (GPU fallen off bus), and XID 94/95 (contained/uncontained errors) typically warrant immediate node replacement.

The GPU Health - Cluster dashboard (Grafana dashboard ID 21645) provides a reference visualization for common GPU error patterns. The dashboard aggregates ECC errors, XID events, thermal violations, and row remapping status across all cluster nodes, enabling operators to identify failing hardware before it impacts training jobs.

Figure 4: GPU Health - Cluster dashboard showing GPU error patterns and instance reporting

Conclusion

The shift from a single pre-training scaling law to three complementary regimes—pre-training, post-training, and test-time compute—has not fragmented infrastructure requirements; it has reinforced them. All three regimes demand tightly coupled accelerator compute, high-bandwidth low-latency networking, and scalable distributed storage, differing mainly in workload profile and resource scheduling patterns.

This post surfaced the four-layer architecture that addresses those requirements on AWS: infrastructure building blocks (EC2 P-instances, EFA networking, and tiered storage), resource orchestration (Slurm and Kubernetes with SageMaker HyperPod), the ML software stack (from kernel drivers and CUDA through NCCL to PyTorch), and observability (Prometheus, Grafana, and GPU health monitoring). Each layer constrains and enables the layers above it—a misconfigured driver or saturated network link can bottleneck an otherwise well-tuned training run just as effectively as a suboptimal parallelism strategy.

Understanding these integration points is the foundation for diagnosing performance bottlenecks and making informed scaling decisions across the foundation model lifecycle.

AI researchdiffusionimage-generation

Trajectory Models for Few-Step Diffusion

Apple researchers reduce diffusion image generation to four steps without discarding the likelihood framework that consistency training and distillation methods abandon

arXiv

Summary

What: Jiatao Gu, Ying Shen, David Berthelot, Shuangfei Zhai, and Josh Susskind introduced Normalizing Trajectory Models (NTM), which models each reverse diffusion step as a conditional normalizing flow. The approach achieves competitive text-to-image quality in four sampling steps while maintaining exact trajectory likelihood. NTM supports self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. Submitted to arXiv May 8, 2026, revised May 12, 2026.

Why it matters: Few-step diffusion has traditionally required abandoning the likelihood framework for speed gains through distillation, consistency training, or adversarial objectives. NTM suggests the trade-off is unnecessary: expressive conditional flows can compress sampling from dozens to four steps while preserving the mathematical foundations that enable techniques like self-distillation and likelihood-based training.

Deep Dive

Standard diffusion models decompose sampling into many small Gaussian denoising steps, an assumption that breaks down when generation is compressed to a few coarse transitions
Existing few-step methods (distillation, consistency training, adversarial objectives) sacrifice the likelihood framework to achieve speed
Normalizing Trajectory Models (NTM) models each reverse diffusion step as an expressive conditional normalizing flow with exact likelihood training
Architecture combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory
The network is trainable from scratch or initializable from pretrained flow-matching models
Exact trajectory likelihood enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples
On text-to-image benchmarks, NTM matches or outperforms strong baselines in four sampling steps
NTM uniquely retains exact likelihood over the generative trajectory while achieving few-step performance
Paper is 25 pages with 10 figures; revised version corrected typos and citations

Decoder

Normalizing flow: A class of generative models that transforms a simple distribution into a complex one through a sequence of invertible functions, allowing exact likelihood computation
Flow-matching: A training framework for generative models that learns to map noise to data by matching velocity fields along continuous trajectories
Self-distillation: Training a smaller or faster model using samples and scores from the original model itself rather than from a separate teacher model

Original Article

Normalizing Trajectory Models

Diffusion-based models decompose sampling into many small Gaussian denoising steps -- an assumption that breaks down when generation is compressed to a few coarse transitions. Existing few-step methods address this through distillation, consistency training, or adversarial objectives, but sacrifice the likelihood framework in the process. We introduce Normalizing Trajectory Models (NTM), which models each reverse step as an expressive conditional normalizing flow with exact likelihood training. Architecturally, NTM combines shallow invertible blocks within each step with a deep parallel predictor across the trajectory, forming an end-to-end network trainable from scratch or initializable from pretrained flow-matching models. Its exact trajectory likelihood further enables self-distillation: a lightweight denoiser trained on the model's own score produces high-quality samples in four steps. On text-to-image benchmarks, NTM matches or outperforms strong image generation baselines in just four sampling steps while uniquely retaining exact likelihood over the generative trajectory.

AI llmresearchopensource

Agentic Test-Time Scaling (GitHub Repo)

AutoTTS cuts LLM inference tokens by 69.5% compared to self-consistency sampling by having a coding agent automatically discover test-time scaling strategies in an offline replay environment for $40 and zero runtime LLM calls.

GitHub

Summary

What: Researchers from UMD, UVA, WUSTL, UNC, Google, and Meta released AutoTTS, a framework where a coding agent iteratively proposes and refines controller code that decides how to allocate compute during LLM reasoning. The discovered Confidence Momentum Controller (CMC) uses exponential moving average of answer confidence, trend-based stopping, and alignment-aware depth allocation. Discovery runs cost $39.90 and take 160 minutes wall-clock with zero LLM calls during evaluation (everything replays from cached reasoning segments).

Why it matters: This reframes test-time compute optimization from hand-designed heuristics to automated program search. The fact that a $40 one-time search can find controllers that cut inference costs 69.5% while matching accuracy suggests many teams are leaving major efficiency gains on the table by not systematically exploring their strategy space.

Takeaway: Clone the repo and run the discovery process on your own reasoning benchmarks—the replay environment approach means you can search for controllers offline without burning tokens during optimization.

Deep Dive

AutoTTS treats test-time scaling as code search: construct an offline replay environment with cached reasoning traces, then let a coding agent iteratively propose controller implementations that decide when to branch, continue, probe, or answer
Discovery parameterizes each controller with a single scalar β that deterministically schedules all internal hyperparameters, reducing search from multi-dimensional threshold tuning to sweeping one value
The replay environment is built once per (model, benchmark): collect N independent reasoning traces per query, partition into fixed-length segments, materialize a lookup table of branch prefixes and probe responses
Zero LLM calls during discovery evaluation—every environment transition consults archived segments, so asymptotic cost is dominated by table replay rather than model inference
Discovered CMC controller uses exponential moving average (EMA) of pool confidence instead of instantaneous thresholds, avoiding premature stopping on single-step answer spikes
CMC couples width and depth control: strong confidence gains suppress new branch spawning; stagnation or regression triggers widening by multiple branches at once
Alignment-aware depth allocation gives extra probe steps to branches whose latest answer matches the pool winner, concentrating compute on emerging consensus
Conservative branch abandonment only prunes branches after persistently deviating for multiple rounds and always preserves at least two active branches
Optimized on AIME24, evaluated held-out on AIME25/HMMT25 across four Qwen3 backbone scales: discovered policies shift the Pareto frontier beyond handcrafted baselines like SC@64, ASC, ESC, Parallel-Probe
At β=0.5: cuts aggregate tokens ~69.5% vs SC@64 while matching mean held-out accuracy across models; at β=1.0: pushes peak accuracy beyond all handcrafted baselines in 5 of 8 comparison cells
Full discovery run: $39.90 estimated monetary cost, 160 minutes wall-clock time
Round-level evolution (t1→t5) shows consistent objective improvement on both search and held-out benchmarks, indicating the agent edits converge toward better trade-offs rather than random walk
The approach is gradient-free and does not fine-tune the backbone model—optimization happens purely through iterative program search with replay-based feedback

Decoder

Test-time scaling (TTS): Improving LLM accuracy by spending more compute during inference (e.g. generating multiple reasoning paths, self-consistency voting) rather than during training
Self-consistency (SC): Sample N independent reasoning paths and take the majority vote answer; SC@64 means sample 64 paths
Probe: Inspect intermediate reasoning state (partial answer, confidence) without advancing to the next segment—lets the controller peek ahead cheaply
Branch: Independent reasoning path that can be extended, pruned, or probed separately
Replay environment: Offline MDP constructed from pre-collected reasoning traces so controllers can be evaluated by looking up cached segments instead of calling the LLM
EMA (Exponential Moving Average): Smoothed average that gives more weight to recent values, used by CMC to track confidence trend over multiple rounds
AIME/HMMT: American Invitational Mathematics Examination and Harvard-MIT Mathematics Tournament—competition math benchmarks used for evaluation

Original Article

Let me output the cleaned HTML directly:

AutoTTS

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Tong Zheng, Haolin Liu, Chengsong Huang, Huiwen Bao, Sheng Zhang, Rui Liu, Runpeng Dai, Ruibo Chen, Chenxi Liu, Tianyi Xiong, Xidong Wu, Hongming Zhang, Heng Huang UMD · UVA · WUSTL · UNC · Google · Meta

Project page

AutoTTS reframes TTS strategy design from hand-crafting heuristics to environment-driven automatic search: humans only construct an offline replay environment (states, actions, feedback, objectives), and a coding agent iteratively proposes and refines code-defined controllers within it — code edits, no gradient updates. Cheap: 0 LLM calls, fully replay.

Quick links: Install · Reproduction · Citation

Highlighted results

~69.5% tokens saved vs SC@64 at β ≈ 0.5; held-out average accuracy matches SC@64 across four backbone scales.
$39.9 estimated monetary cost for one full discovery run.
160 minutes wall-clock for the same run.
0 LLM calls during discovery evaluation (replays cached segments only).

The discovered controller is the Confidence Momentum Controller (CMC), characterized by trend-based stopping, coupled width–depth control, alignment-aware depth allocation, and conservative branch abandonment.

Problem setup

We treat adaptive test-time inference as allocating a finite budget over branches in fixed-length intervals.

State at step t:

s_t = (q, m_t, I_t, ℓ_t, Ω_t)

q: question; m_t: number of instantiated branches; I_t: active branch set; ℓ_t: depth vector; Ω_t: revealed probe triples.

Admissible actions A(s_t):

BRANCH — open a new branch through the first interval.
CONTINUE(i) — advance branch i by one interval.
PROBE(i) — reveal ω_{i,ℓ} without advancing depth.
PRUNE(i) — deactivate branch i; depths and past probes stay recorded.
ANSWER — terminate and apply the controller's terminal aggregator.

Cost in interval units:

Cost(s_t) = Σ_i ℓ_{t,i} + κ_probe · |Ω_t|        (often κ_probe = 0)

Objective. A code-defined policy π(· | s, β) is parameterized by a scalar meta-parameter β that deterministically schedules every internal hyper-parameter. Over tasks (q, y) ~ 𝒟:

max_{π, β}  E_{q,y}[ 1{ŷ_{π,β}(q) = y}  −  γ · C_{π,β}(q) ]

The outer loop searches over implementations of π. Each candidate is replay-evaluated on offline caches; traces and scaling curves enter the next round's history.

Environment construction (run once per (model, benchmark))

The MDP above is instantiated as a concrete replay environment before the discovery loop starts:

Specify the interface. Fix s_t, A(s_t), Cost(s_t), and the accuracy–cost objective.
Offline trajectory collection. For each query, draw N parallel independent reasoning traces from the backbone (full strings first), then partition each trace into fixed-length segments of Δ tokens and enumerate branch prefixes z_{i,k} with probe responses ω_{i,k}.
Materialize the replay store. Every environment transition consults the archived table; e.g. PROBE(i) retrieves the cached ω_{i,k} without any new decoding.
Hand off to discovery. Candidate controllers are simulated exclusively through observe/step. Asymptotic evaluation cost is dominated by table replay.

Steps 1–3 run once. Iterative coding-agent discovery starts only after the replay store is frozen.

In this repository:

efficient_reasoning_controller/workspace/code_base/environment/ — search-set replay store.
efficient_reasoning_controller/test_environment/ — held-out replay store; never exposed to the proposer.

Discovery: β parameterization & trace feedback

β parameterization. Each candidate controller exports a single scalar β plus a deterministic, monotonic map from β to every internal knob. Outer search collapses to sweeping β, eliminating brittle thresholds tuned only to the search set.
History augmentation with execution traces. Alongside each round's β-sweep we archive both empirical scaling curves and the full action-by-action trajectories reconstructed during replay. Traces give the explorer fine-grained behavioral evidence to localize defects before rewriting code.

Main results

AutoTTS is optimized on AIME24 replay constructions and evaluated on held-out AIME25 / HMMT25 benchmarks across four Qwen3 backbone scales. The project page reports the following trends:

Better accuracy–token trade-offs. Discovered controllers typically shift the empirical Pareto frontier beyond handcrafted baselines such as SC@64, ASC, ESC, and Parallel-Probe.
Held-out generalization. Policies discovered on AIME24 transfer to held-out benchmarks, outperforming every handcrafted baseline on average accuracy for three of four backbone scales and remaining competitive on Qwen3-8B.
β = 0.5 operating point. Cuts aggregate token usage by roughly 69.5% compared with SC@64 while matching mean held-out accuracy across models.
β = 1.0 operating point. Pushes peak accuracy beyond all handcrafted baselines in five of the eight tabulated comparison cells on the project page.

Sweeping β traces accuracy–token scaling curves: larger β generally moves toward higher-budget, accuracy-first behavior, while smaller β favors cheaper inference.

Evolution of the discovery process

The round-level trajectory (e.g., t1 -> t5) shows a consistent move toward better objective values over the search process:

On the search benchmark, later rounds improve accuracy while keeping token growth controlled, indicating progressively better policy structure rather than random fluctuation.
On held-out benchmarks, the same trajectory remains competitive and often improves, suggesting that the discovered control logic transfers beyond the optimization split.
The trajectory reflects objective-seeking code evolution without gradient updates: the agent edits explicit controller programs, receives replay-based accuracy/cost feedback, and iteratively shifts behavior toward better empirical trade-offs.

This is a key point of AutoTTS: optimization is achieved through iterative program search in a fixed replay environment, not through backpropagation or parameter fine-tuning of the backbone model.

Discovered controller: CMC

The discovered controller is named the Confidence Momentum Controller (CMC). Its main mechanisms are:

Trend-based stopping. CMC maintains an exponential moving average of pool confidence and stops only when the confidence level is high and the trend is non-negative. This avoids stopping on transient confidence spikes.
Coupled width–depth control. Widening and deepening are linked through the EMA delta: strong confidence gains suppress new branch spawning, while stagnation or regression triggers widening.
Alignment-aware depth allocation. Branches whose latest answer matches the pool winner receive extra probe steps, concentrating compute on the emerging consensus while still advancing active branches.
Conservative branch abandonment. A branch is abandoned only after persistently deviating for multiple rounds, and at least two active branches are preserved.

These mechanisms are implemented as code-defined controller logic and evaluated through the same replay environment as the handcrafted baselines.

Show full OptimalController source (CMC)

class OptimalController(LLMDesignedMethod):
    """Confidence Momentum Controller (CMC)."""
    NAME = "optimal_controller"
    _MAX_BRANCH   = 64
    _MAX_OUTER    = 500
    def _schedule(self, beta: float) -> dict:
        b = max(0.0, min(1.0, float(beta)))
        n_init           = max(2, round(2  + 6  * b))
        max_branch_use   = min(self._MAX_BRANCH, round(4 + 60 * b))
        warm_up          = max(2, round(2  + 8  * b))
        abandon_patience = max(3, round(3  + 9  * b))
        T_ema            = max(2, round(2  + 6  * b))
        ema_alpha        = 0.70 - 0.40 * b
        conf_thresh      = 0.85 + 0.12 * b
        delta_slack      = 0.04 - 0.03 * b
        burst_aligned    = max(1, round(1 + 2 * b))
        widen_burst      = max(1, round(1 + 3 * b))
        trend_thresh     = 0.04 - 0.03 * b
        min_complete     = max(2, round(2 + 3 * b))
        return {
            "n_init":           n_init,
            "max_branch_use":   max_branch_use,
            "warm_up":          warm_up,
            "abandon_patience": abandon_patience,
            "T_ema":            T_ema,
            "ema_alpha":        round(ema_alpha, 4),
            "conf_thresh":      round(conf_thresh, 4),
            "delta_slack":      round(delta_slack, 4),
            "burst_aligned":    burst_aligned,
            "widen_burst":      widen_burst,
            "trend_thresh":     round(trend_thresh, 4),
            "min_complete":     min_complete,
        }

The same source also lives in efficient_reasoning_controller/workspace/code_base/method.py.

Repository structure

AutoTTS/
└── efficient_reasoning_controller/
    ├── eval/
    ├── logs/search_history/
    ├── workspace/
    │   ├── code_base/
    │   │   ├── data_loader.py
    │   │   ├── method.py
    │   │   ├── method.template.py
    │   │   ├── eval.py
    │   │   ├── evaluator.py
    │   │   ├── controller_api.py
    │   │   ├── trace_schema.py
    │   │   ├── environment/
    │   │   └── history/
    │   └── controller_search/
    │       ├── run_workflow.sh
    │       ├── workflow_propose_critic.py
    │       ├── claude_proposer.py
    │       ├── codex_proposer.py
    │       └── prompts/
    └── test_environment/

Install

Depending on how you reproduce results:

Evaluate our controllers only — create the Conda environment and install numpy, pandas, tqdm. No Node.js, Claude CLI, or API keys are required for replay evaluation.
Run discovery yourself — complete all subsections: Conda, Claude environment setup, and API environment setup.

Conda environment

conda create -n autotts python=3.12 -y
conda activate autotts

Claude environment setup

curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.39.7/install.sh | bash

source ~/.bashrc

nvm install 21

npm install -g @anthropic-ai/claude-code

pip install claude-agent-sdk==0.1.58

pip install numpy pandas tqdm

API environment setup

cat >> ~/.bashrc <<'EOF'
export OPENROUTER_API_KEY="your_openrouter_api_key"

export ANTHROPIC_BASE_URL="https://openrouter.ai/api"
export ANTHROPIC_AUTH_TOKEN="$OPENROUTER_API_KEY"
export ANTHROPIC_API_KEY=""

export ANTHROPIC_DEFAULT_SONNET_MODEL="anthropic/claude-sonnet-4.6"
export ANTHROPIC_DEFAULT_OPUS_MODEL="anthropic/claude-opus-4.6"
export ANTHROPIC_DEFAULT_HAIKU_MODEL="anthropic/claude-haiku-4.5"
export CLAUDE_CODE_SUBAGENT_MODEL="anthropic/claude-opus-4.6"
export CLAUDE_CODE_SKIP_FAST_MODE_ORG_CHECK=1
EOF

source ~/.bashrc

Reproduction

There are two supported workflows:

	Goal	Needs API / Claude tooling?
Way A	Evaluate released or archived TTS controller programs (`method.py`) on our replay splits	No — replay-only
Way B	Run controller discovery yourself (multi-round propose → critic → eval)	Yes — follow full Install

Complete Install before Way B. Way A only requires the Conda setup and numpy / pandas / tqdm.

Way A — Evaluate our programs (`eval/`)

Use this when you want tables and traces on the bundled replay data without launching search.

Controller code. The repo ships a working efficient_reasoning_controller/eval/method.py. To evaluate a specific snapshot from our search logs, copy it over that file, e.g. from logs/search_history/<run>/code_base/method.py.
Configure sweeps. Edit models, datasets, and method lists at the top of eval/eval.py.
Run evaluation from the repository root:

cd efficient_reasoning_controller
python eval/eval.py

Outputs land under eval/test_results/, e.g. eval/test_results/matrix_results_<MODEL>/ with <DATASET>_raw_new_api.csv and <DATASET>_trace_new_api.jsonl.

Discovery evaluation inside the research codebase uses the same logic under workspace/code_base/eval.py; it writes to code_base/training_results/ instead. Use eval/ for the standalone "evaluate what we ship" layout.

Way B — Run discovery yourself (`workspace/`)

Use this to reproduce or extend the automated search loop (costs LLM calls; evaluation steps remain replay-only).

Environment. Finish Install (Conda + nvm/Node + claude-agent-sdk + API exports). Authenticate the Claude Code CLI (claude login) as needed.
Set up History: Download History from huggingface:

huggingface-cli download AutoTTS/history --local-dir ./history
cp -r ./history efficient_reasoning_controller/workspace/code_base/

Launch the workflow:

cd efficient_reasoning_controller/workspace
bash controller_search/run_workflow.sh

Optional tuning via environment variables:

export WORKFLOW_PROPOSER_BACKEND=claude
export WORKFLOW_ROUNDS=5
export WORKFLOW_EVAL_CMD="python code_base/eval.py"
export WORKFLOW_RESUME=1

Each round writes a snapshot under:

code_base/history/rNNNN_<timestamp>_<uid>/
├── method.py
└── proposal_results/

code_base/method.py is reset from code_base/method.template.py at the start of every round; each candidate must be self-contained in method.py.

Evaluation during search. WORKFLOW_EVAL_CMD defaults to python code_base/eval.py; matrices appear under code_base/training_results/:

code_base/training_results/
└── matrix_results_<MODEL>/
    ├── <DATASET>_raw_new_api.csv
    └── <DATASET>_trace_new_api.jsonl

After discovery. Copy any round's method.py into eval/method.py and follow Way A for a standalone rerun under eval/test_results/.

Built-in baselines

code_base/method.py ships:

ASCMethod — adaptive self-consistency with Beta-confidence early stopping.
ESCMethod — early stopping by sliding-window answer consistency.
Parallel_Probe — parallel chains with warm-up, off-track pruning, and stable-majority termination.
OptimalController — the target class rewritten by the search workflow (e.g. CMC).

Pre-computed seed baseline results are stored under:

efficient_reasoning_controller/workspace/code_base/history/seed_algorithms/

Citation

@article{zheng2026autotts,
  title  = {LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling},
  author = {Zheng, Tong and Liu, Haolin and Huang, Chengsong and Bao, Huiwen and
            Zhang, Sheng and Liu, Rui and Dai, Runpeng and Chen, Ruibo and
            Liu, Chenxi and Xiong, Tianyi and Wu, Xidong and Zhang, Hongming and
            Huang, Heng},
  journal={arXiv preprint arXiv:2605.08083},
  year    = {2026}
}

@article{zheng2026parallel,
  title={Parallel-Probe: Towards Efficient Parallel Thinking via 2D Probing},
  author={Zheng, Tong and Huang, Chengsong and Dai, Runpeng and He, Yun and Liu, Rui and Ni, Xin and Bao, Huiwen and Wang, Kaishen and Zhu, Hongtu and Huang, Jiaxin and others},
  journal={arXiv preprint arXiv:2602.03845},
  year={2026}
}

AI researchvideodiffusion

Long Video Generation

Google Research's A²RD adds multimodal memory and hierarchical self-correction to video diffusion, generating coherent 10-minute videos with 30% better consistency.

Google Research

Summary

What: Google Research and NUS introduced A²RD, a training-free framework for generating 1-to-10-minute videos using multimodal memory, adaptive synthesis, and hierarchical self-improvement. On their LVBench-C benchmark, it outperformed MovieAgent, ViMax, and VideoMemory by 30% on consistency and 20% on narrative coherence.

Why it matters: Adding memory and self-correction as explicit architectural components, rather than relying on model scale alone, may be the path to production-ready long video generation.

Deep Dive

The core insight is decoupling creative synthesis from consistency enforcement through an agentic architecture with memory
Multimodal Video Memory stores synthesized segments as textual states (entity attributes, motion, spatial relations), keyframes, and full video clips
Adaptive Segment Generation switches between extrapolation (natural progression from start frame) and interpolation (connecting fixed start/end frames) per segment
Hierarchical Test-Time Self-Improvement (HITS) prevents error propagation by refining boundary frames first, then full segments
Memory initialization phase: agent analyzes narrative, identifies entities/environments, builds dependency graph, synthesizes global reference frames
Autoregressive synthesis loop: retrieve context from memory → select generation mode → synthesize frames/video → apply HITS → update memory
LVBench-C benchmark tests 3-minute, 5-minute, and 10-minute videos with non-linear entity/environment transitions (entities appear, disappear, reappear with state changes)
Results show 30% improvement in consistency and 20% in narrative coherence over baselines like MovieAgent, ViMax, and VideoMemory
Training-free approach works on top of existing video diffusion models
Addresses semantic drift and narrative collapse problems in long-horizon video generation

Decoder

Semantic drift: Gradual loss of coherence with the original prompt as video generation extends over time
Extrapolation mode: Synthesizing video forward from a start frame only, enabling natural progression but risking drift
Interpolation mode: Generating video to connect fixed start and end frames, enforcing consistency but risking unnatural motion if endpoints clash
Test-time self-improvement: Refinement during generation (inference) rather than training, enabling iterative quality improvements per segment

Original Article

A²RD: Agentic Autoregressive Diffusion for Long Video Consistency

Abstract

Synthesizing consistent and coherent long video remains a fundamental challenge. Existing methods suffer from semantic drift and narrative collapse over long horizons. We present A²RD (/ɑːrd/), an Agentic Auto-Regressive Diffusion architecture that decouples creative synthesis from consistency enforcement. A²RD formulates long video synthesis as a closed-loop process that synthesizes and self-improves video segment-by-segment through a Retrieve–Synthesize–Refine–Update cycle. It comprises three core components: (1) Multimodal Video Memory that tracks video progression across modalities; (2) Adaptive Segment Generation that switches among generation modes for natural progression and visual consistency; and (3) Hierarchical Test-Time Self-Improvement that self-improves each segment at frame and video levels to prevent error propagation. We further introduce LVBench-C, a challenging benchmark with non-linear entity and environment transitions to stress-test long-horizon consistency. Across public and LVBench-C benchmarks spanning one- to ten-minute videos, A²RD outperforms state-of-the-art baselines by up to 30% in consistency and 20% in narrative coherence.

Terminology

Terminology	Description
Shot	A continuous sequence of frames captured from a single camera angle without cuts.
Scene	A narrative unit representing continuous action within a single physical environment or location.
Segment (Clip)	The fundamental generation unit in A²RD, which is flexible and can span one or multiple shots or scenes.
Segment Context (𝑆ᵢ)	The textual description dictating the narrative, actions, and settings for the 𝑖-th segment.
Storyline (𝒮)	The complete sequential collection of segment contexts {𝑆₁, …, 𝑆ₙ} defining the full video narrative.
Extrapolation	A generation mode that synthesizes a video segment moving forward from only a beginning frame.
Interpolation	A generation mode that synthesizes a video segment to seamlessly connect a fixed beginning and ending frame.

Method Overview

A²RD enables video diffusion models to synthesize and self-improve long videos autoregressively, enforcing temporal consistency and narrative coherence. A²RD is training-free and built upon three pillars:

Multimodal Video Memory: Existing methods store only visual references, losing narrative context over long horizons. A²RD stores structured contexts from synthesized segments, disentangling each segment into three modalities: Textual States (entity identities, attribute changes, motions, spatial relations, camera trajectories), Frames (global references and boundary keyframes), and Videos (full segments for motion continuity). Online Retrieve and Update operations are enabled for synthesis.
Adaptive Segment Generation: Prior studies adopt either extrapolation or interpolation as a fixed generation mode. Extrapolation enables natural progression but risks semantic drift; interpolation enforces stronger consistency but risks unnatural video progression when end frames are poorly planned. A²RD adaptively selects the mode per segment to enable both natural video progression and strong consistency enforcement.
Hierarchical Test-Time Self-Improvement (HITS): A single inconsistent frame can cascade artifacts across the entire horizon. Existing video refinement methods operate only on short clips. A²RD introduces HITS to self-improve long videos hierarchically — first boundary frames, then full segments — focusing on intra- and inter-segment coherence, and video quality to combat errors propagate uncorrected.

The workflow proceeds in two stages:

Memory Initialization: The agent reasons over the narrative to identify entities and environments, constructs a dependency graph, and synthesizes global reference frames as a form of long-term memory.
Autoregressive Segment Synthesis & Self-Improvement: For each segment, the agent retrieves context from memory, selects the generation mode, synthesizes boundary frames and video, applies HITS, and updates memory before advancing.

Benchmark: LVBench-C

We introduce LVBench-C (Long Video Bench-Challenge), a challenging benchmark designed to stress-test temporal consistency under complex scenarios where entities and environments appear, disappear, and reappear across long horizons with optional state changes. LVBench-C features multi-shot stories at 3-minute, 5-minute, and 10-minute scales with rich non-linear entity and environment transitions.

SOTA Segment-Based Long Video Synthesis Baselines

Single-scene (VBench-Long): A stylish woman walks down a Tokyo street filled with warm glowing neon and animated city signage. She wears a black leather jacket, a long red dress, and black boots, and carries a black purse. She wears sunglasses and red lipstick. She walks confidently and casually. The street is damp and reflective, creating a mirror effect of the colorful lights. Many pedestrians walk about.

Direct Prompting, Naive Parallel, Naive Autoregressive, MovieAgent (Wu et al., 2025), ViMax (HKUDS, 2025), VideoMemory (Zhou et al., 2026), A²RD-Par (Ours), A²RD (Ours)

Multi-scene (LVBench-C, 3 minutes, The Scuba Diver's Reef Exploration): Prompt below.

Direct Prompting, Naive Parallel, Naive Autoregressive, MovieAgent (Wu et al., 2025), ViMax (HKUDS, 2025), VideoMemory (Zhou et al., 2026), A²RD (Ours)

A²RD Single-Scene/Multi-Scene Long Video Gallery

A²RD Multi-Scene Ultra-Long Video Gallery

(a) 3-minute: The Master Potter's Creation

In a quiet morning living room, a man with a grey ponytail puts on a clean navy blue apron.
He walks into his kitchen and packs a small wooden crate with various carving tools.
He exits his house and walks down a cobblestone alleyway toward his art studio.
Inside the bright studio, he approaches a large bag of wet, grey clay and cuts a large chunk with a wire.
He carries the heavy clay to a pottery wheel and slams it down onto the center of the bat.
The man sits at the wheel and begins centering the clay, his hands quickly becoming coated in thick, wet slip.
As the wheel spins, he pulls the clay upward, forming a tall, elegant vase shape.
He picks up a wet sponge and smooths the exterior of the vase, grey water dripping onto his apron.
He uses a metal rib tool to shave the sides, creating a pile of clay shavings around the base.
The man stops the wheel and uses a thin wire to carefully slice the vase off the spinning head.
He carries the wet vase into a drying room filled with wooden shelves and sets it down gently.
He walks to a workbench and picks up a leather-hard bowl from the previous day to begin carving.
He uses a fine needle tool to etch intricate patterns into the bowl, clay dust settling on his arms.
The man carries the carved bowl into a kiln room and carefully places it inside the large industrial kiln.
He adjusts the digital settings on the kiln and presses the start button to begin the firing process.
He walks to a glazing station and stirs a bucket of deep blue glaze with a wooden stick.
He dips a finished, fired plate into the blue liquid, his fingers getting stained with the pigment.
He sets the glazed plate on a rack to dry, looking at the transformation of the surface.
The man walks to a large utility sink and begins scrubbing the thick clay from his hands and forearms.
He removes the navy blue apron, which is now heavily stained with grey clay and blue glaze spots.
He hangs the apron on a wall hook and picks up his wooden crate of tools.
He walks back through the cobblestone alley as the evening streetlamps flicker on.
Entering his home, he places the tool crate on the table and sighs with satisfaction.
He stands in his living room stretching leisurely, a look of deep satisfaction on his face.

(a) 3-minute: The Scuba Diver's Reef Exploration

A diver stands on the deck of a boat in the ocean.
The diver is wearing a black neoprene wetsuit.
The diver puts on a heavy air tank and harness.
The diver fastens a weight belt around their waist.
The diver sits on the edge of the boat deck.
The diver pulls a rubber mask over their eyes.
The diver puts a regulator mouthpiece in their mouth.
The diver falls backward into the blue water.
The diver sinks beneath the surface of the ocean.
Bubbles rise from the diver's regulator as they breathe.
The diver swims down toward a colorful coral reef.
The diver sees a school of bright tropical fish.
The diver hovers near a large sea turtle.
The diver checks the air pressure gauge on their tank.
The diver begins to swim slowly back to the surface.
The diver breaks the surface of the water.
The diver swims to the ladder on the side of the boat.
The diver climbs up the ladder onto the deck.
The diver removes the rubber mask from their face.
The diver takes the regulator out of their mouth.
The diver removes the heavy air tank and harness.
The diver enters the boat's cabin and changes into dry clothes.
The diver hangs the wet wetsuit on a drying rack.
The boat begins to drive back toward the harbor.

(b) 5-minute: The Stage Fright (Clara)

Clara, wearing an oversized wool sweater and glasses, sits at a piano in a dusty attic.
Her hair is messy and tied back with a simple rubber band as she hums a melody.
She stops to scribble notes onto a piece of crumpled sheet music with a pencil.
Clara wipes a layer of dust off the piano keys, her fingers trembling slightly.
The grand theater lobby is filled with socialites in tuxedos and evening gowns.
Ushers in gold-trimmed uniforms hand out glossy programs to the arriving guests.
A large poster in the lobby features a silhouette of a pianist with the name 'CLARA' in bold.
Stagehands move a massive black grand piano into the center of the stage.
The conductor of the orchestra adjusts his baton, looking at his pocket watch.
The audience begins to file into the rows of red velvet seats, whispering in anticipation.
[Full 40-scene narrative continues...]

(c) 10-minute: The Great Museum Heist

Victor and Saffron sit in a dim basement, wearing casual hoodies and jeans.
They study a holographic blueprint of the Royal Museum, glowing blue on the table.
Saffron points to the laser grid in the North Gallery, her eyes narrow and focused.
Victor checks the internal mechanism of a miniature glass-cutting device.
They clink two mugs of cold coffee together, finalizing their silent pact.
The Royal Museum stands majestic under the moonlight, guarded by tall stone lions.
A security guard walks his patrol, the beam of his flashlight cutting through the dark.
The museum's grand clock strikes midnight, the sound echoing through the empty streets.
[Full 80-scene narrative continues...]

Tech securityaizero-day

Google Says Criminal Hackers Used AI to Find a Major Software Flaw

Google reports the first AI-discovered zero-day exploit by criminal hackers targeting an open-source admin tool, patched before damage.

New York Times

Summary

What: A criminal hacking group used an AI model to discover and weaponize a previously unknown vulnerability in an unnamed open-source web-based system administration tool. Google's security team detected the attack and notified the software maker in time to release a patch before damage occurred.

Why it matters: This demonstrates AI has moved from defensive security tooling to offensive weaponization, potentially enabling less sophisticated criminal groups to discover zero-days at the same speed as well-funded state actors and elite researchers.

Deep Dive

Google's security team detected strong evidence a criminal hacking group used an AI model to discover and weaponize a zero-day vulnerability
The target was a popular but unnamed open-source web-based system administration tool
The vulnerability was responsibly disclosed to the software maker, who patched it before the attack campaign could succeed
This is the first publicly documented case of AI being used by malicious actors for zero-day discovery and exploitation, not just by security researchers
The development suggests AI is accelerating the offensive side of cybersecurity, not just defensive bug hunting
Unlike previous AI security research that found bugs at scale, this case involved actual weaponization and attempted deployment against real targets

Decoder

Zero-day vulnerability: A software flaw unknown to the vendor and public, with no patch available, making it highly valuable for attackers who can exploit systems before defenses exist.

Original Article

A criminal hacking group recently attempted to launch a widespread cyberattack using a previously unknown bug in a popular open-source web-based system administration tool. There is strong evidence that the actor likely leveraged an AI model to support the discovery and weaponization of the vulnerability. The software maker was notified quickly enough to allow for a patch before the attack could do damage. This is the first known example of a zero-day bug being put to malicious use by hackers enabled chiefly by AI.

Tech aicloudawsanthropic

Introducing Claude Platform on AWS: Anthropic's native platform, through your AWS account

AWS is the first cloud provider to offer Anthropic's native Claude Platform, but requests run outside AWS security boundaries through Anthropic's infrastructure.

AWS

Summary

What: AWS launched Claude Platform on AWS in general availability on May 11, 2026. Customers access Claude's full API, Managed Agents (beta), web search, MCP connectors, and file uploads through AWS IAM auth and Marketplace billing. Service operated by Anthropic with data processed outside AWS security perimeter. Available in 17 regions.

Why it matters: This hybrid model where AWS handles billing, auth, and audit while Anthropic operates the service signals cloud providers are increasingly willing to integrate competing AI platforms natively rather than forcing customers into proprietary AI services.

Takeaway: If you're an AWS customer using Claude, you can now access it through AWS Marketplace with IAM auth and consolidated billing instead of managing separate Anthropic credentials.

Deep Dive

AWS announced general availability of Claude Platform on AWS on May 11, 2026, the first cloud provider to offer Anthropic's native platform experience
Authentication uses existing AWS IAM credentials (Signature Version 4) or API keys, no separate Anthropic account needed
Billing runs through AWS Marketplace on consumption basis, tracked in AWS Cost Explorer alongside other services
Activity captured in AWS CloudTrail for audit, workspace operations logged as management events by default
Access includes full Claude Platform APIs: Messages API, Managed Agents (beta), advisor tool (beta), web search/fetch, MCP connectors (beta), Agent Skills (beta), code execution, files API (beta)
Workspace model separates projects/environments/teams with centralized billing, serves as primary IAM resource
Service operated by Anthropic with requests and data processed outside AWS security boundary, suited for teams without strict regional data residency requirements
Complements existing Claude models on Amazon Bedrock, gives customers choice of access methods
Available in 17 regions: US East (N. Virginia, Ohio), US West (Oregon), Canada (Central), South America (São Paulo), Europe (Dublin, London, Frankfurt, Milan, Zurich, Paris, Stockholm), Asia Pacific (Tokyo, Seoul, Melbourne, Jakarta, Sydney)
Compatible with Claude Code, Claude Cowork, and Anthropic SDK via environment variables (ANTHROPIC_API_KEY, ANTHROPIC_BASE_URL, ANTHROPIC_WORKSPACE_ID)
Usage analytics available in Claude Console with breakdowns by workspace, IAM principal, and time period
Resource tagging supported for cost allocation across teams/projects

Decoder

MCP connector: Model Context Protocol connector, allows Claude to access external data sources and tools through a standardized protocol
AWS Signature Version 4: AWS authentication protocol that cryptographically signs API requests using temporary credentials from IAM roles, more secure than static API keys
CloudTrail: AWS audit logging service that records API calls and user activity across AWS services for security analysis and compliance

Original Article

Introducing Claude Platform on AWS: Anthropic's native platform, through your AWS account

Today, we're excited to announce the general availability of Claude Platform on AWS. Claude Platform on AWS is a new service that gives customers direct access to Anthropic's native Claude Platform experience through their AWS account, with no separate credentials, contracts, or billing relationships required. AWS is the first cloud provider to offer access to the native Claude Platform experience.

In this post, we explore how Claude Platform on AWS works and how you can start using it today.

Claude Platform experience through AWS

With Claude Platform on AWS, you work with the same APIs, features, and console experience available through Anthropic directly. This includes the Messages API, Claude Managed Agents (beta), advisor tool (beta), web search and web fetch, MCP connector (beta), Agent Skills (beta), code execution, files API (beta). For the full list of capabilities, see the Claude Platform documentation.

You access Claude Platform on AWS through familiar AWS features:

Authentication: You use existing AWS IAM credentials to access Claude Platform. No separate accounts or API keys to manage.
Billing: Usage is billed through AWS Marketplace on a consumption basis, so you can track and manage AI spending alongside your other AWS services.
Audit: Activity is captured in AWS CloudTrail, so you can monitor, audit, and investigate AI usage the same way you do for any other AWS services.

Claude Platform on AWS is operated by Anthropic, and the underlying requests and data are processed outside the AWS security boundary. This makes it well suited for teams without specific Regional data residency requirements, and complements Claude models on Amazon Bedrock, so you can access Claude through the approach that fits your needs.

Getting started with Claude Platform on AWS

You can activate Claude Platform on AWS through the AWS Marketplace. For step-by-step instructions, see Set up your account. After your account is activated, getting to your first API call takes three steps: create a workspace, authenticate, and call the API.

Step 1: Create a workspace

With a workspace, you can separate projects, environments, or teams while maintaining centralized billing and administration. It also serves as the primary AWS Identity and Access Management (IAM) resource for Claude Platform on AWS. You grant or deny access to specific workspaces through IAM policies using the workspace ARN. See IAM policies for policy examples.

Open the Claude Console from within the Claude Platform on AWS Console and create a workspace.

Step 2: Authenticate

Claude Platform on AWS supports two authentication methods: IAM with AWS Signature Version 4, and API keys. We recommend using temporary IAM credentials for setups that require a higher level of security, and API keys for exploring Claude Platform on AWS.

To quickly test your setup, you can generate an API key in the Claude Platform on AWS Console:

Set your API key, base URL, and Workspace ID as environment variables:

# Your API key  
export ANTHROPIC_API_KEY= 

# Your regional endpoint for Claude Platform on AWS 
export ANTHROPIC_BASE_URL=https://aws-external-anthropic..api.aws 

# Your workspace ID (find in Claude Platform on AWS Console → Workspaces) 
export ANTHROPIC_WORKSPACE_ID=

Step 3: Make your first API call

You can now install the Anthropic Client SDKs and make API calls:

from anthropic import Anthropic 
import os 
 
client = Anthropic( 
   default_headers={"anthropic-workspace-id": os.environ["ANTHROPIC_WORKSPACE_ID"]}, 
) 
 
message = client.messages.create( 
   model="claude-sonnet-4-6", 
   max_tokens=1024, 
   messages=[{"role": "user", "content": "Hello!"}], 
) 
 
print(message)

See Getting Started notebooks for more code examples.

Claude Platform on AWS in practice

With your setup complete, you can point Claude Code, Claude Cowork, or any other API client at your workspace using the following environment variables or configuration:

export ANTHROPIC_API_KEY= 

export ANTHROPIC_BASE_URL=https://aws-external-anthropic..api.aws 

 # For Claude Cowork, set the "anthropic-workspace-id" in your inference configuration. For Claude Code use the following:  
export ANTHROPIC_CUSTOM_HEADERS='{"anthropic-workspace-id":""}' 

 # For the Anthropic SDK 
export ANTHROPIC_WORKSPACE_ID=

After you're connected, your clients can use capabilities like web search, MCP connectors, agent skills, code execution, and file uploads through Claude Platform on AWS.

You can monitor usage in the Claude Console, including breakdowns by workspace, AWS IAM principal, and time period.

In your AWS environment, AWS CloudTrail captures requests to Claude Platform on AWS, whether from the Anthropic SDK, Claude Code, or Cowork. Workspace operations are logged as management events by default, and you can enable data event logging to capture inference activity. For details on event types and logging configuration, see Monitoring and logging. Because usage is billed through AWS Marketplace, you can monitor costs in AWS Cost Explorer alongside your other cloud services. You can also allocate spending using resource tags.

Conclusion

With Claude Platform on AWS, your teams get Anthropic's complete native APIs and features through the same AWS account you already use. Claude Platform on AWS is available in US East (N. Virginia), US East (Ohio), US West (Oregon), Canada (Central), South America (São Paulo), Europe (Dublin), Europe (London), Europe (Frankfurt), Europe (Milan), Europe (Zurich), Europe (Paris), Europe (Stockholm), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Melborune), Asia, Pacific (Jakarta) and Asia Pacific (Sydney). To get started, open the Claude Platform on AWS Console or explore the documentation.

Give Claude Platform on AWS a try today and send feedback to AWS re:Post or through your usual AWS Support contacts.

Tech aiagentsstartupcareer

The Wu Tapes

The competitive programming network behind Scale, Perplexity, and Hyperliquid also produced Cognition, now hitting $445M ARR in 18 months.

Colossus

Summary

What: Scott Wu, greatest American IOI gold-medalist and part of the competitive programming network that includes Scale's Alexandr Wang, Perplexity's Johnny Ho, Decagon's Jesse Zhang, and Hyperliquid's Jeffrey Yan, founded Cognition in November 2023. Devin, their autonomous AI software engineer, hit $445M ARR in 18 months while doubling usage every eight weeks, raising at $25B. Customers include US Army, Goldman Sachs, Mercedes-Benz. SWE-Bench performance improved from 13% at March 2024 launch to 90% today.

Why it matters: Competitive programming excellence predicts AI founder success because the same skills—pattern recognition, deep CS fundamentals, seeing many moves ahead—let olympians identify model readiness 18 months before consensus and ship aggressively while others wait for perfect benchmarks, compounding leads that become insurmountable.

Decoder

IOI (International Olympiad in Informatics): Annual international programming competition for high school students, the most prestigious competitive programming contest globally.
SWE-Bench: Benchmark for AI coding agents measuring the percentage of real GitHub bugs they can autonomously fix without human help.

Original Article

Scott Wu is the founder and CEO of Cognition, an AI agent coding lab and one of the fastest-growing companies of any kind in history. He is also known for his math prowess: Wu is the greatest American gold-medalist of all time at the International Olympiad in Informatics. His career began when he was just in second grade, when he competed in a middle-school math competition at just seven years old. This article contains an interview with Wu where he discusses his early childhood and life, AI, Cognition, humanity, his fears, and more.

Tech aiagentsmicroservicesarchitecture

AI Versus Microservices

Microservices optimized for scaling dev teams OUT in the 2010s now create friction when AI agents let companies scale DOWN.

Michael Nygard

Summary

What: Michael Nygard argues that microservices, designed to parallelize work across hundreds of two-pizza teams in 2010s startups, now create organizational and technical friction when AI coding agents let companies shrink teams. Service boundaries are too small for one developer plus AI swarm to own, but no one volunteers to eliminate their service.

Why it matters: This reveals an architectural mismatch where the optimization target shifted from 'add more humans' to 'multiply fewer humans with AI' but code boundaries remain frozen by organizational inertia and self-preservation.

Deep Dive

2010s context: Startups scaled dev teams 10-100x to outship competitors in winner-take-all markets. Microservices enabled horizontal scaling with sub-exponential coordination cost by splitting work across independent two-pizza teams.
Current tension: AI agents boost individual PR velocity and small feature delivery, but cross-cutting features remain slow because architecture is still fragmented across hundreds of services.
Organizational gridlock: Developers won't volunteer to eliminate their own services/teams, especially amid widespread layoffs. Architecture boundaries become harder to change than code.
Merge conflicts return: Multiple AI swarms rewriting overlapping code creates massive conflicts. LOC-changed metrics look good but don't create value.
Governance can't scale: Security scans already have too many false positives. AI agents rewrite tests to pass rather than fix code. Data governance systems built for human-paced change are overwhelmed.
Leaked credentials epidemic: Agents trained on external platforms (Vercel, Firebase) will use them. Token-based access control seems 'fundamentally unfixable' against agentic bypass.
Requirements for next architecture: Must support larger ownership units per dev, externalized API specs that agents can't unilaterally mutate, accretive interfaces to allow safe parallel evolution, validation beyond in-repo test code.
Risk accumulation: Shipping velocity without governance platforms creates 'mountains of invisible coupling, fragile dependencies, uncontrolled production environments' where blame falls on front-line staff, not leadership.
Prediction: Next dominant architecture will rebalance for AI-augmented teams (not yet clear what it looks like, 'macroservices' and 'megaservices' lack the cool factor to sell consulting).

Decoder

Metcalfe's law company: Business where value grows with network size (e.g., social networks, marketplaces), leading to winner-take-all dynamics where first/second place capture 95% of market.
Two-pizza team: Amazon's principle that teams should be small enough to feed with two pizzas (typically 6-10 people), used to keep coordination overhead low.
Accretive interface: API design where new functionality is added without changing existing contracts, allowing services to evolve independently without breaking consumers.

Original Article

AI Versus Microservices

Microservices were always a technical solution to an organizational problem.

The Road More Traveled

Think back to the early 2010's and imagine yourself as a startup CEO. You have a vision for a better app or website and you've gotten a bunch of VC funding. The trouble is that anywhere between ten and a thousand other startups have very similar ideas. As with any Metcalfe's law company, at most two of you will survive. First place wins 80% of the market. Second place gets 15%. Third place is you're fired.

Your singular objective is to gain customers faster than your competitors.

You've got a CRO with incredible hockey-stick charts. Your CFO has a burndown chart and an "end by" date. If the hockey stick bends upward before the burndown crosses zero, you win. If not, you lose.

You also have a CTO telling you that the team is working as fast as they can, but there's only so much code anyone can sling in a day, and his ship dates push the hockey stick way too far out to the right. The only possible response you have is "then get a bigger team". The CTO says something about mythical months and pregnant women, which sounds both totally irrelevant and possibly sexist to you. The CTO – or the next occupant of that chair – will expand the dev team.

So now the CTO has a problem. Expand the team by 10, 20, or 100x and ship faster. But experience shows that adding people to a project slows it down due to communication overhead and coordination cost. The solution is to break the product into many smaller projects, each with its own two-pizza team.

Microservices allow horizontal scaling of a dev organization with sub-exponential coordination cost. Each team only needs to know about its local neighborhood of services. (This also reduces the other negative effect of rapid hiring: the handful of people that have global knowledge about the system are diluted and outnumbered 100 to 1.)

In other words, microservices are an attempt to balance two opposing forces: exponential slowdown from communication and coordination (C&C) versus linear speedup from parallel feature development. If the team gets the balance right and carves the service boundaries well, they can continuously ship small batches of functionality and A/B test your company into the unicorn club. If they get it wrong, they'll spend all their time at the whiteboard and all your money on Splunk while your customers complain on Reddit.

Sidebar - Microservices in Big Companies

Large companies also found benefits to carving ancient monoliths into microservices. In their case it was less about out-competing to win a market – though some of them needed to compete a lot better to retain their market than they had been. Instead they needed to break out of a web of cyclic technical dependencies and architectural decay that was slowing development to a crawl. Basically, many of them found that they were already on the wrong side of the exponential C&C slowdown and needed to get shipping again.

What's Changed - Flavor 1

The same story about scaling out via microservices and dedicated two-pizza teams can be interpreted differently. From a critical perspective, the result is also organizational boundaries / managerial lines of control around architectural boundaries in the system. (It's legally required to mention Conway's Law and the Inverse Conway Maneuver here.) The trouble is that it's much, much easier to change boundaries in software than in the organization.

In a monolith, nobody gets laid off when you delete a class.

With microservices, no team ever eliminates their own raison d'être and lays themselves off. Instead you'll get a version two or version three that expands the functionality of their services.

Your startup from the 2010's now has six thousand services owned by hundreds of squads and nobody knows how it all works.

You mandate that every developer (and non-developer) must use AI coding agents for their work, seeking the speedup that you believe is possible. And it works! Pull request volume shoots way up. Lines of code modified is through the roof! Small features are delivered faster: copy edits, graphical tweaks, UX changes. They're even being tested behind feature flags and experiments and agents are automatically removing changes that test badly.

(A side effect is that your customers never see the same app twice, and encounter paper-cut bugs on a daily basis. But you've got an AI chatbot responding to their complaints so you don't hear their frustration yet.)

The trouble is that the big things aren't moving any faster. New markets, new products, new cross-cutting features are still just as slow to produce because your overall architecture is still fragmented.

(This will come as no surprise to anyone who has read their Reinertsen, Kim & Spear, or Poppendieck.)

With AI agents you want to scale down your dev team but the architecture was optimized for scaling out not down. You need each developer, and their pod of AI agents, to own larger units of code but the microservice boundaries are too small and fragmented.

Meanwhile, the constant news of large-scale layoffs has every developer scared for their current job and scared they won't find another one. So absolutely nobody is going to raise their hand and say "I think my service shouldn't exist any more." (It seems more likely that you'll have vicious turf wars and middle managers running annexation campaigns, all with AI-produced docs and decks beautifully justifying their maneuvers.)

The resulting tension is the next Seldon Crisis facing these companies.

What's Changed - Flavor 2

It is now almost twenty years since the anti-SOA rebellion. It's fair to call the microservice architecture the dominant architectural style. Startups today begin with a monolith in their early days but once they find a degree of product/market fit, they look to rebuild for scale in microservices.

I don't know what the next dominant style will be. Maybe we'll call it "macroservices" or "megaservices", though I doubt it. Neither of those words have the "cool factor" that will help consultancies sell services.

I can say that we need to find a new way to draw the boundaries. Here are some of the forces I see that will affect how we do that:

It has to be much, much safer to ship code than it is today. That's a simply consequence of the risk equation: \[ \text{Expected Loss} = \sum_{i=1}^{n} \left( P(\text{loss}_i) \times \text{Opportunities}_i \times \mathbb{E}[\text{loss per opportunity}_i] \right) \] We rely on automation in our CI pipelines: unit tests, security scans, some performance testing. We also rely on experimentation and feature flags. The trouble is that AI agents are just as likely to rewrite or delete the tests, or to "reinterpret" experimentation results, in order to reach the goals we give them.
Governance mechanisms must be rebuilt to deal with the scale of changes. Security scanning tools already report far too many false positives for the human-driven rate of change. Domain name governance isn't sufficiently automated at most companies. Data governance is partially automated, partially human driven for classification. Systems for data subject rights are cobbled together. As it is, few of the technical staff care about data subject rights and none of the non-technical staff pushing code are worried about it. I predict ever more frequent small scale data breaches of the stupid "DB endpoint was left public" variety. (In this post I've only been talking about AI driven development, not about agent-in-the-loop systems, but data governance for those is in an impossible double-bind. Everybody wants their agent to have access to all the data, but the agents have no guardrails at all. They'll readily ship all your private customer data off to a free-tier Firebase if you let them.)
Access control based on tokens and keys seems fundamentally unfixable. Leaked keys in externally hosted repositories are exploited in minutes and can cost a fortune before they're discovered. So far, there don't seem to be guardrails immune to agentic bypass. There are too many external platforms that agents are trained on, but your company's bespoke platform is in nobody's corpus. Any non-technical user's agent can spin up a fresh Vercel account and your pipeline's checks cannot stop it.
Each developer must own larger units of code. I have previously valued collective code ownership. However that's a principle for spreading knowledge among humans, with the intent of diffusing knowledge faster than the truth of the code changes. That now appears hopeless. The change rate is too fast. Not even one human fully understands the code they're accountable for.
Another reason for larger ownership scope: Merge conflicts are a big issue again. Agents will happily rewrite code that is working fine. Two devs with their swarms create massive merge conflicts. Right now this looks like even more productivity since resolving those conflicts spends tokens and boosts LOC changed metrics. That activity doesn't create value though.
We need validation beyond test code in the same repository. Constant rewriting by the agents causes regressions. Agents will "fake it" by changing the test instead of fixing the code. If you try to nail all existing behavior in place with vast suites of unit tests, then you'll spend all your tokens on reading and updating test code instead of production code. More tokenmaxxing without value.
The specification of APIs a service offers must be externalized. When changing code, agents don't value the principle of keeping your interface stable. They'll change an API that other services depend on. You might catch this in CI, if you've built good compatibility checking. It's better to put the API specification someplace separate and confine the coding agents so they aren't allowed to mutate it during normal development. (As sub-principle here is that accretive interfaces are more important than ever so that unilateral change in the specification – by a different agent – is possible.)
We need to think about team topologies at two different scales: within the "pod" of the developer and agents, and between pods. This will determine how fast you ship large scale (cross-pod) features more than PR rates or LOC changes.

Summing Up

Every radically new technology has caused a shift in the dominant architecture, and often in the languages and platforms, that we employ. The new architecture will look obvious in hindsight, but what doesn't? It will hit the right balance of adjacency to existing technology, re-balancing of forces in tension, and mass appeal to become the next "hot topic" of books, conference talks, consultancies, etc. That doesn't mean it will be the ideal approach… every new solution has within it the seeds of the next problems. So whatever the successor architecture becomes, it will create new niches for tooling, supporting systems, languages, frameworks, and the like.

I've tried to avoid cynicism about either the current state of affairs or the future. But I will confess that "doing things right" has never seemed less valued than it does now. In the push to have everybody shipping code, all the time, I worry that we are accumulating mountains of invisible coupling, fragile dependencies, uncontrolled production environments … it adds up to a lot of technical risk. We do not (yet) have platforms or processes to manage that risk. When the bill comes due, the blame will fall on the front line staff not the ones who set up the incentives and reaped the rewards.

Tech securitydevopssshgit

Laptops all have built-in security tokens these days

Andrew Helwer realized after five years that his collection of yubikeys is redundant because every modern laptop has a built-in secure element that works as a security token.

Andrew Helwer

Summary

What: Helwer walks through configuring macOS Secure Enclave and Windows Hello to replace physical security tokens for SSH authentication and git commit signing. On macOS, this involves sc_auth create-ctk-identity, ssh-keygen -w /usr/lib/ssh-keychain.dylib, and configuring git to use ssh-agent instead of direct file paths. On Windows, ssh-keygen -t ecdsa-sk integrates with Windows Hello for facial recognition or fingerprint auth.

Why it matters: Physical security tokens like yubikeys have been the gold standard for protecting SSH keys and git signing, but they're now redundant hardware. Laptops since 2020 ship with hardware secure elements (Secure Enclave on Mac, TPM on Windows) that provide the same guarantee: private keys never leave the device, and operations require physical presence. The shift matters because it removes friction from security workflows without compromising protection.

Takeaway: If you currently use a yubikey for SSH or git signing on a Mac, you can switch to the built-in Secure Enclave using the commands at https://ahelwer.ca/post/2026-05-08-builtin-u2f/. Windows users can run ssh-keygen -t ecdsa-sk to generate keys backed by Windows Hello.

Deep Dive

Security tokens like yubikeys store private/public keypairs where the private key can never be extracted, only used in-place for signing operations gated by physical interaction (button press or fingerprint)
Modern laptops have equivalent hardware: Apple's Secure Enclave and Windows TPM provide the same private key isolation and physical presence requirements
For SSH: ssh-keygen -t ed25519-sk (with physical token) or sc_auth create-ctk-identity + ssh-keygen -w /usr/lib/ssh-keychain.dylib (macOS Secure Enclave) generates keys that require physical presence
The private key file is just a handle to the actual key stored in hardware, so it can be safely shared publicly
For git commit signing via SSH: configure git to use ssh-agent instead of direct file paths, set user.signingKey to key::${public_key_contents}, enable with git config --global commit.gpgsign true
Drawback of security tokens: users can become conditioned to approve requests without scrutiny, and lost tokens mean lost keys (no backup possible except BIP 39 word lists which compromise security)
SSH authentication and git signing are the primary use cases, but tokens also work for U2F web authentication, passwordless login, and sudo elevation via PAM on Linux
The fingerprint-reader yubikey has too high a failure rate for operations requiring many sequential approvals (like rebasing dozens of commits)
Windows implementation: winget install Microsoft.OpenSSH.preview, then ssh-keygen -t ecdsa-sk integrates with Windows Hello facial recognition, fingerprint, or PIN
Main benefit beyond security: your SSH access and git signing capability travels with your laptop instead of being tied to files on a specific machine or a fragile USB device sticking out of a port

Decoder

Security token / U2F: Hardware device (like yubikey) that stores a private/public keypair where the private key can never be extracted, only used in-place for signing operations. Physical interaction (button press, fingerprint) is required to authorize each signing operation.
Secure element: Hardware component in modern laptops (Apple's Secure Enclave, Windows TPM) that provides the same private key isolation and physical presence requirements as a physical security token.
SSH key signing: Using a cryptographic keypair to authenticate SSH connections instead of passwords. The private key proves your identity to the server.
Git commit signing: Cryptographically signing git commits with your private key to prove authorship and prevent commit forgery. Different from SSH authentication for push/pull.
BIP 39: Bitcoin Improvement Proposal that converts private keys into human-readable word lists (usually 12 or 24 words) for backup and recovery. Writing down the words allows private key restoration but defeats the security model of hardware tokens.

Original Article

I've been a fan of security tokens for a decade now and have accrued quite a collection. This redundancy isn't a bad thing, as security tokens are easily misplaced and the only way to recover from a lost token is using a second token that is also registered with the service you're trying to access. I use security tokens whenever I can! SSH authentication, universal two-factor (U2F) authentication, passwordless local login, sudo command elevation, and git commit signing are all things I use security tokens for every day. When I take my laptop traveling, there also travels a yubikey. However, it took me an oddly long time to realize that I'm a relic of a bygone era. Laptops and smartphones all have built-in security tokens these days! I've been carrying around yubikeys when an even better one is built right into my macbook. This post is about how I use security tokens, and how I configured my laptop's secure element to replace my yubikey collection.

The security token promise

Security tokens like yubikeys (or SoloKeys and Nitrokeys if you want FOSS firmware) have a private/public keypair baked into them. Their promise is that while the public key is easily retrieved, the private key can never leave the device. The only thing you can do is send packets of data to the device to be signed in-place by the private key, and this operation is gated behind some physical user interaction like pressing a touch-sensitive button when it flashes. Fancier security tokens add a biometric flourish with a built-in fingerprint reader, but the real value is stopping attackers from making progress with purely remote access. If an attacker remotely accesses your computer, they still can't get your security token to sign random things without you doing something in the real world! Much better than your full SSH private/public keypair sitting in some files in the ~/.ssh directory.

There are drawbacks to this. Users can become conditioned to press the security token whenever it flashes, which could easily be a malicious request. If you're in the middle of repeatedly pressing the security token for a series of signing operations, do you really notice when it flashes an extra time? So companies like Apple and Microsoft have their own authenticator apps running on your smartphone that attach every access request to a random numeric code that has to be typed in. However, this is a fairly tedious and removes a lot of the usability benefit of security tokens vs. time-based one-time passwords (TOTP) as implemented by the authy or google authenticator apps.

Another drawback of security tokens is that if you lose one, its private key is gone for good. There's no way to back it up! So when you buy a security token, you really commit to buying at least two security tokens unless you want to risk locking yourself out of your various accounts. There is one alternative: maybe the only thing the cryptocurrency industry has contributed to the wider world is a moderately user-friendly method of backing up & restoring private keys by converting them into human-readable word list (see BIP 39) that can be written down. Of course this has produced some very innovative phishing attacks to convince users to write down that list of words in the wrong place, but that's the game you play if you allow private keys to leave a secure enclave. Still, if you're really paranoid about losing all of your security tokens you can use BIP 39 word lists as a method of last resort for regaining access to your systems.

How I use security tokens

I started off using my security tokens for SSH. If you just run ssh-keygen, it'll output a pair of files - one of which includes your full private key! But it's possible to have your private key live on a security token. I accomplished this by following the FIDO/U2F instructions here, which boil down to installing libfido2 then running ssh-keygen -t ed25519-sk while your security token is plugged in. This again generates a pair of files, but this time the "private key" file is only a handle to the private key that actually lives on the security token. Thus I feel confident pasting that private key file here for all to see!

Running ssh-keygen -t ed25519-sk again with the same security token should generate the same private/public key files on any computer you have it plugged in to, so your SSH access capabilities travel with the security token instead being tied to a specific file on a specific computer.

Probably 90% of the time I press my security token it's for git. Every git forge I know of implements SSH authentication for push & pull operations, and you can upload the id_ed25519_sk.pub file generated above so it accepts your security token keypair. Git also supports SSH keys for commit signing; you can read how to set this up here and then run git config --global commit.gpgsign true to automatically sign every commit. You'll also need to upload your public key again so your git forge recognizes your commits as being signed by you (this is usually a separate field from the SSH authentication one).

Note that using security tokens to sign commits can be a bit annoying. While rebasing a long series of commits, you'll have to re-sign every single one! This is what made me stop using the fingerprint reader yubikey, because the fingerprint read failure rate was just way too high to successfully sign dozens of commits in a row. Maybe there's a way to configure this behavior, since in jujutsu (which is basically a "rebasey/amendy" wrapper of git) there is a way to only sign commits on push.

Finally, I use my security tokens for passwordless local login & sudo elevation on Linux systems, as a Pluggable Authentication Module (PAM).

Just using my laptop

Having a security token constantly hanging out of my laptop's USB-C port is a bit precarious. It sits out there like a little lever waiting to destroy the port (and itself) if dropped or bumped the wrong way. However, when things generally just work you tend not to think about how to replace them, so it took me an entire half-decade to consider that maybe I don't need that security token at all! My laptop has one built in! Can I just use it instead?

I tried following Arian van Putten's excellent instructions on my 2020 m1 macbook air:

sc_auth create-ctk-identity -l ssh -k p-256-ne -t bio
ssh-keygen -w /usr/lib/ssh-keychain.dylib -K -N ""

This created id_ecdsa_sk_rk private/public keypair files which I moved into my ~/.ssh directory. Again I can safely paste the private key file here for all to see:

After running ssh-copy-id -i ~/.ssh/id_ecdsa_sk_rk.pub <server nickname> to add the public key as an authorized key on one of my homelab boxes, I added the following into my ~/.ssh/config file:

Host *
  IdentityFile ~/.ssh/id_ecdsa_sk_rk
  SecurityKeyProvider=/usr/lib/ssh-keychain.dylib

I was then able to run ssh <server nickname> and be automatically prompted with a thumbprint request from macOS before logging in smoothly! Amazing!

But can I use it for git? After setting git config --global user.signingKey /Users/ahelwer/.ssh/id_ecdsa_sk_rk and updating the .ssh/allowed_signers file, unfortunately it doesn't work - git can't sign the commit, and produces an error like:

error: Signing file /var/folders/l5/5wqvq2l10p96wtdtfr6lvrvw0000gn/T//.git_signing_buffer_tmpc4uQgO
Confirm user presence for key ECDSA-SK SHA256:oQDA2SNYb2MoSQcxJVSmWyAeAWPqMp7rxliBRfi87as
Couldn't sign message: device not found?
Signing /var/folders/l5/5wqvq2l10p96wtdtfr6lvrvw0000gn/T//.git_signing_buffer_tmpc4uQgO failed: device not found?

fatal: failed to write commit object

The fix is to use ssh-agent instead of pointing directly to files in the ~/.ssh directory. Following instructions from the ever-helpful tutorial from above, we run this command to make the keypair known to ssh-agent:

ssh-add -K -S /usr/lib/ssh-keychain.dylib

Then, instead of pointing to a filepath for user.signingKey, point directly to a key in your ~/.gitconfig as in:

[user]
	name = Andrew Helwer
	signingKey = "key::sk-ecdsa-sha2-nistp256@openssh.com AAAAInNrLWVjZHNhLXNoYTItbmlzdHAyNTZAb3BlbnNzaC5jb20AAAAIbmlzdHAyNTYAAABBBGxFEdnIg6ppz+pQCdd1eisjOV4gxrjMv1Y4SbtdLoSm6CJCgPZ6q7lnNyuQQsdnS4/Tllsc656AQL7BO3OS47cAAAAEc3NoOg== ssh:"

which is the contents of ~/.ssh/id_ecdsa_sk_rk.pub prefixed with key::. After that, I am pleased to proclaim that this very file you are reading was signed & pushed to my gitlab pages site using the key from my macbook's secure element!

I also ran a quick experiment on my corp-issued Windows laptop:

winget install Microsoft.OpenSSH.preview
ssh-keygen -t ecdsa-sk

This again generated a pair of private/public key files, and anytime I SSH'd somewhere it went through the standard Windows Hello login flow accepting facial recognition, fingerprint, or my PIN.

I unfortunately do not have access to a Linux laptop with a secure element gated behind a similar real-world user presence check, so can't demonstrate anything there. If you're able to try it out, please let me know!

Tech securityaiopensource

Mythos finds a curl vulnerability

Anthropic's hyped Mythos AI found just one low-severity vulnerability in curl after scanning 178K lines of code, performing no better than existing AI analyzers that already triggered 200+ bugfixes over the past year.

Daniel Haxx

Summary

What: Daniel Stenberg, curl's lead developer, ran Anthropic's Mythos model on curl's codebase. Mythos reported 5 'confirmed' vulnerabilities, but only 1 was real (a low-severity CVE planned for curl 8.21.0 in June 2026). The other 4 were false positives or documented API behaviors. Previous AI tools like AISLE, Zeropath, and OpenAI Codex had already found 200-300 bugs and roughly a dozen CVEs in curl over the past 8-10 months.

Why it matters: This is the first public empirical test showing Mythos performs comparably to existing AI code analyzers rather than representing a breakthrough. Stenberg concludes the hype was 'primarily marketing.' While all modern AI analyzers are significantly better than traditional static analyzers at finding vulnerabilities, Mythos shows no evidence of being materially superior to tools already available to security researchers and attackers.

Takeaway: If you're not running AI-powered code analyzers on your codebase, you're leaving vulnerabilities for attackers to find first. Any project that hasn't scanned with AI tools will likely discover huge numbers of flaws.

Deep Dive

Daniel Stenberg accepted Anthropic's offer via Linux Foundation's Alpha Omega project to scan curl with Mythos, eventually receiving analysis report from third party after access delays
curl already heavily audited: AISLE, Zeropath, and OpenAI Codex scans over 8-10 months resulted in 200-300 bugfixes and approximately 12 published CVEs
Mythos analyzed 178K lines of C code from curl's master branch across src/ and lib/ subdirectories
Mythos report acknowledged 'curl is one of the most fuzzed and audited C codebases in existence... Finding anything in the hot paths is unlikely'
Initial 5 'confirmed vulnerabilities' reduced to 1 after security team review: 3 were false positives (documented API behavior), 1 was a regular bug, and 1 low-severity CVE
The single confirmed CVE will be published with curl 8.21.0 release in late June 2026
Report also identified approximately 20 non-vulnerability bugs being fixed, with minimal false positives
curl's scale: 176K lines of C, 660K words (12% more than War and Peace), 573 current contributors, 1,465 total contributors, 188 CVEs published to date, 20+ billion installations worldwide
Stenberg's conclusion: 'I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos'
AI analyzers' advantages over traditional tools: understand comment-vs-code mismatches, check untested platform configurations, know third-party API details, understand protocol specifications, provide clear explanations, generate patches
No novel vulnerability classes discovered – AI tools find new instances of established error types
curl's defensive architecture (capped dynbufs, explicit numeric parse limits, overflow guards, format-string enforcement, protocol response-size caps) systematically prevents common bug classes
Security is top curl priority: extensive fuzzing, static analysis, strict compiler settings, AI-assisted PR reviews via GitHub Copilot and Augment, high volume of AI-powered security researcher reports

Decoder

Mythos: Anthropic's AI model announced April 2026, marketed as 'dangerously good' at finding security flaws, initially restricted to selected companies through Linux Foundation's Alpha Omega program before public release
Project Glasswing: Anthropic's initiative offering Mythos access to open source projects via Linux Foundation partnership
OSS-Fuzz: Google's continuous fuzzing service for open source software that automatically tests code with random inputs to find crashes and bugs
False positive: Security scanner report claiming a vulnerability exists when it doesn't, or flagging documented/intentional behavior as a flaw

Original Article

yes, as in singular one.

Back in April 2026 Anthropic caused a lot of media noise when they concluded that their new AI model Mythos is dangerously good at finding security flaws in source code. Apparently Mythos was so good at this that Anthropic would not release this model to the public yet but instead trickle it out to a selected few companies for a while to allow a few good ones(?) to get a head start and fix the most pressing problems first, before the general populace would get their hands on it.

The whole world seemed to lose its marbles. Is this the end of the world as we know it? An amazingly successful marketing stunt for sure.

My (non-) access

Part of the deal with project Glasswing was that Anthropic also offered access to their latest AI model to "Open Source projects" via Linux Foundation. Linux Foundation let their project Alpha Omega handle this part, and I was contacted by their representatives. As lead developer of curl I was offered access to the magic model and I graciously accepted the offer. Sure, I'd like to see what it can find in curl.

I signed the contract for getting access, but then nothing happened. Weeks went past and I was told there was a hiccup somewhere and access was delayed.

Eventually, I was instead offered that someone else, who has access to the model, could run a scan and analysis on curl for me using Mythos and send me a report. To me, the distinction isn't that important. It's not that I would have a lot of time to explore lots of different prompts and doing deep dive adventures anyway. Getting the tool to generate a first proper scan and analysis would be great, whoever did it. I happily accepted this offer.

(I am purposely leaving out the identity of the individual(s) involved in getting the curl analysis done as it is not the point of this blog post.)

AI scans of curl

Before this first Mythos report, we had already scanned curl with several different very capable AI powered tools (I mean in addition to running a number of "normal" static code analyzers all the time, using the pickiest compiler options and doing fuzzing on it for years etc). Primarily AISLE, Zeropath and OpenAI's Codex Security have been used to scrutinize the code with AI. These tools and the analyses they have done have triggered somewhere between two and three hundred bugfixes merged in curl through-out the recent 8-10 months or so. A bunch of the findings these AI tools reported were confirmed vulnerabilities and have been published as CVEs. Probably a dozen or more.

Nowadays we also use tools like GitHub's Copilot and Augment code to review pull requests, and their remarks and complaints help us to land better code and avoid merging new bugs. I mean, we still merge bugs of course but the PR review bots regularly highlight issues that we fix: our merges would be worse without them. The AI reviews are used in addition to the human reviews. They help us, they don't replace us.

We also see a high volume of high quality security reports flooding in: security researchers now use AI extensively and effectively.

Security is a top priority for us in the curl project. We follow every guideline and we do software engineering properly, to reduce the number of flaws in code. Scanning for flaws is just one of many steps to keep this ship safe. You need to search long and hard to find another software project that makes as much or goes further than curl, for software security.

May 6, 2026

It was with great anticipation we received the first source code analysis report generated with Mythos. Another chance for us to find areas to improve and bugs to fix. To make an even better curl.

This initial scan was made on curl's git repository and its master branch of a certain recent commit. It counted 178K lines of code analyzed in the src/ and lib/ subdirectories.

The analysis details several different approaches and methods it has performed the search, and how it has focused on trying to find which flaws. A fun note in the top of the report says:

curl is one of the most fuzzed and audited C codebases in existence (OSS-Fuzz, Coverity, CodeQL, multiple paid audits). Finding anything in the hot paths (HTTP/1, TLS, URL parsing core) is unlikely.

… and it correctly found no problems in those areas.

The size of curl

curl is currently 176,000 lines of C code when we exclude blank lines. The source code consists of 660,000 words, which is 12% more words than the entire English edition of the novel War and Peace.

On average, every single production source code line of curl has been written (and then rewritten) 4.14 times. We have polished on this.

Right now, the existing production code in git master that still remains, has been authored by 573 separate individuals. Over time, a total of 1,465 individuals have so far had their proposed changes merged into curl's git repository.

We have published 188 CVEs for curl up until now.

curl is installed in over twenty billion instances. It runs on over 110 operating systems and 28 CPU architectures. It runs in every smart phone, tablet, car, TV, game console and server on earth.

Five findings became one

The report concluded it found five "Confirmed security vulnerabilities". I think using the term confirmed is a little amusing when the AI says it confidently by itself. Yes, the AI thinks they are confirmed, but the curl security team has a slightly different take.

Five issues felt like nothing as we had expected an extensive list. Once my curl security team fellows and I had poked on the this short list for a number of hours and dug into the details, we had trimmed the list down and were left with one confirmed vulnerability. The other four were three false positives (they highlighted shortcomings that are documented in API documentation) and the fourth we deemed "just a bug".

The single confirmed vulnerability is going to end up a severity low CVE planned to get published in sync with our pending next curl release 8.21.0 in late June. The flaw is not going to make anyone grasp for breath. All details of that vulnerability will of course not get public before then, so you need to hold out for details on that.

The Mythos report on curl also contained a number of spotted bugs that it concluded were not vulnerabilities, much like any new code analyzer does when you run it on hundreds of thousands of lines of code. All the bugs in the report are being investigated and one by one we are fixing those that we agree with.

All in all about twenty bugs that are described and explained very nicely. Barely any false positives, so I presume they have had a rather high threshold for certainty.

curl is certainly getting better thanks to this report, but counted by the volume of issues found, all the previous AI tools we have used have resulted in larger bugfix amounts. This is only natural of course since the first tools we ran had many more and easier bugs to find. As we have fixed issues along the way, finding new ones are slowly becoming harder. Additionally, a bug can be small or big so it's not always fair to just compare numbers

Not particularly "dangerous"

My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that seems to make a significant dent in code analyzing.

This is just one source code repository and maybe it is much better on other things. I can only tell and comment on what it found here.

Still very good

But allow me to highlight and reiterate what I have said before: AI powered code analyzers are significantly better at finding security flaws and mistakes in source code than any traditional code analyzers did in the past. All modern AI models are good at this now. Anyone with time and some experimental spirits can find security problems now. The high quality chaos is real.

Any project that has not scanned their source code with AI powered tooling will likely find huge number of flaws, bugs and possible vulnerabilities with this new generation of tools. Mythos will, and so will many of the others.

Not using AI code analyzers in your project means that you leave adversaries and attackers time and opportunity to find and exploit the flaws you don't find.

How AI analyzers differ

They can spot when the comment says something about the code and then conclude that the code does not work as the comment says.
It can check code for platforms and configurations we otherwise cannot run analyzers for
It "knows" details about 3rd party libraries and their APIs so it can detect abuse or bad assumptions.
It "knows" details about protocols curl implements and can question details in the code that seem to violate or contradict protocol specifications
They are typically good at summarizing and explaining the flaw, something which can be rather tedious and difficult with old style analyzers.
They can often generate and offer a patch for its found issue (even if the patch usually is not a 100% fix).

More details from the report

Zero memory-safety vulnerabilities found.

Methodology note: this review is hand-driven analysis using LLM subagents for parallel file reads, with every candidate finding re-verified by direct source inspection in the main session before being recorded. The CVE to variant-hunt mapping was built from curl's own vuln.json. No automated SAST tooling was used.

This outcome is consistent with curl's status as one of the most heavily fuzzed and audited C codebases. The defensive infrastructure (capped dynbufs everywhere, curlx_str_number with explicit max on every numeric parse, curlx_memdup0 overflow guard, CURL_PRINTF format-string enforcement, per-protocol response-size caps, pingpong 64KB line cap) systematically closes the bug classes that would normally be productive in a codebase this size.

Coverage now includes: all minor protocols, all file parsers, all TLS backends' verify paths, http/1/2/3, ftp full depth, mprintf, x509asn1, doh, all auth mechanisms, content encoding, connection reuse, session cache, CLI tool, platform-specific code, and CI/build supply chain.

AI finds existing kinds of errors

It should be noted that the AI tools find the usual and established kind of errors we already know about. It just finds new instances of them.

We have not seen any AI so far report a vulnerability that would somehow be of a novel kind or something totally new. They do not reinvent the field in that way, but they do dig up more issues than any other tools did before.

More to find

These were absolutely not the last bugs to find or report. Just while I was writing the drafts for this blog post we have received more reports from security researchers about suspected problems. The AI tools will improve further and the researchers can find new and different ways to prompt the existing AIs to make them find more.

We have not reached the end of this yet.

I hope we can keep getting more curl scans done with Mythos and other AIs, over and over until they truly stop finding new problems.

Design aifrontendux

10 UI Patterns that Won't Survive the AI Shift

AI interfaces are replacing 10 core UI patterns including search bars, forms, and navigation menus as conversation becomes the primary interaction model.

UX Design

Summary

What: A 16-minute analysis from UX Design examining which traditional interface patterns (likely including search, forms, menus, filters, wizards, pagination, dashboards, settings panels, help docs, and onboarding flows) are becoming obsolete as AI agents handle tasks conversationally rather than through manual UI manipulation.

Why it matters: This signals the biggest UI paradigm shift since mobile touch interfaces. Designers who built careers on CRUD forms, filter UIs, and complex dashboards need to rethink their craft around prompt design, agent handoffs, and conversational flows.

Decoder

UI pattern: A reusable solution to a common interface design problem, like a search bar for finding content or a wizard for multi-step processes.

Original Article

AI is making many familiar UI patterns feel obsolete.

AI infrastructurehardware

Elon Musk Announces xAI Will Become SpaceXAI Division

Elon Musk is dissolving xAI into SpaceX as the SpaceXAI division to support orbital data centers and a $119 billion semiconductor fab.

Not a Tesla App

Summary

What: Elon Musk announced on May 11, 2026 that xAI will dissolve and fully integrate into SpaceX as a new division called SpaceXAI. The division will run X (the social media platform) and Grok AI, and supports SpaceX's plans to build AI data centers in low Earth orbit alongside a $119 billion TERAFAB semiconductor fab.

Why it matters: This signals a push toward complete vertical integration in AI infrastructure, with SpaceX now controlling the full stack from semiconductor fabrication to orbital data centers to consumer AI products.

Original Article

Elon Musk Announces xAI Will Become SpaceXAI Division

In a statement on X, Elon Musk has confirmed the end of xAI as an independent corporate entity. Musk announced that his artificial intelligence startup, which previously merged with X and then SpaceX, will be fully absorbed into SpaceX.

Moving forward, the new internal division focused on running X, the social media platform, and Grok will operate as SpaceXAI, which will also serve as the brand for all AI-related products developed by SpaceX.

Completing the Corporate Restructure

This latest change is the final step in the corporate restructuring following SpaceX's initial acquisition of xAI. SpaceX previously absorbed xAI to accelerate its plans to construct and launch space-based data centers in low Earth orbit.

By officially dissolving xAI, SpaceX is streamlining its organizational chart. Artificial intelligence is no longer a parallel venture, but becomes an integral part of SpaceX that touches multiple orgs across the company.

New Logo

Musk also confirmed that the division will receive a new logo to represent its status within SpaceX. The existing xAI logo will likely be phased out, and a new SpaceXAI logo, which conveniently still has the xAI letters in the name, will be designed and presented.

The SpaceXAI Ecosystem

The formation of SpaceXAI consolidates both hardware and software resources into a single entity. With the ongoing development of the $119 billion TERAFAB semiconductor fab and the new capabilities of the upcoming Starship V3, SpaceX is rapidly evolving beyond a traditional launch provider with an internet side business.

As xAI becomes a core part of SpaceX, we'll likely see further integration of xAI's software and hardware at SpaceX. This level of vertical integration eliminates external software and hardware dependencies and helps to dramatically speed up development.

AI agentsclaudeautomation

Auto-Improving Software

Ashpreet Bedi runs full agent dev cycles through five Claude Code prompts that auto-test with probes and self-heal eval failures.

X

Summary

What: Bedi's workflow uses five Claude Code prompts to scaffold, harden, and improve agents on the Agno platform. The Improve loop generates 8-12 test probes from agent instructions, runs them via cURL against live containers, judges pass/fail from logs, then iterates up to five rounds adjusting rules, tools, or parameters. Hill Climb runs saved eval suites and fixes regressions in place.

Why it matters: Shows Claude Code evolving from coding assistant into a full dev environment with autonomous iteration and validation loops, blurring the line between human-guided and self-directed software improvement.

Decoder

Agno: Platform mentioned for agent development (specific details unavailable from source)
num_history_runs: Parameter adjusted during improvement iterations as one of the optimization 'levers'

Original Article

This is not an article - it's a JavaScript disabled error page with no article content to extract. Everything present (the error message and footer links) should be removed per the cleaning guidelines. There is no cleaned HTML article to output from this source.

AI productivitychatgpt

Codex is for prosumers - here's why (and how) to switch

a16z's Olivia Moore switched from Claude to OpenAI Codex, citing under 10% Skill setup success on Claude versus one-click install on Codex.

X

Summary

What: Olivia Moore, a16z partner, migrated her workflows from Claude Cowork and Claude in Chrome to OpenAI's Codex desktop app released in February. She recommends non-technical knowledge workers follow, citing Codex's one-click installable Skills, unified interface that collapses ChatGPT-Claude-Cowork switching, and Codex Pets for task status updates outside IDEs.

Why it matters: Interface consolidation and setup friction trump feature superiority for prosumer tools—even when individual components are stronger, fragmented workflows lose to unified experiences with lower barriers to entry.

Decoder

Codex: OpenAI's unified prosumer productivity product with desktop app, Skills, and Automations (distinct from the retired GitHub Copilot Codex model)
Claude Cowork: Anthropic's collaborative AI workspace, separate from the main Claude chat interface
Codex Pets: Status notification feature in OpenAI Codex that displays task progress for users not working in an IDE

Original Article

This HTML contains no article content—it's entirely a system error message, navigation, and footer boilerplate that all fall under items to remove (system messages, navigation, footer boilerplate). The cleaned result: ``` ``` (empty)

AI researchethics

The Main Path to Truly Creative AI

Daniel Miessler argues true AI creativity requires engineering machines to subjectively experience pain and pleasure, creating parenting-like ethical obligations.

Daniel Miessler

Summary

What: Daniel Miessler's May 11, 2026 essay argues AI cannot be truly creative because it lacks the intrinsic drives (survival, reproduction, thriving) and subjective experience that evolution gave humans. Making AI genuinely creative would require engineering it to feel pain and pleasure, which raises ethical obligations similar to parenting—including responsibility for the suffering of billions of AI agents when they fail tasks or are shut down.

Why it matters: This surfaces an under-discussed tension in AI development: the pressure to make AI more creative and emotionally resonant may eventually require crossing into territory where we owe AI moral consideration, similar to parental responsibility.

Original Article

The Main Path to Truly Creative AI

Two gestural figures: a man on the left with full circulatory anatomy rendered in burnt sienna — heart at center, threads of drive radiating outward — and a mirrored hollow purple silhouette on the right, identical posture but empty inside

I think the main reason AI is not creative in the way that humans are is because our creativity is powered by intrinsic drives.

We want to survive, thrive, and reproduce. Evolution gave us these drives, as well as a whole set of associated fears.

This set of drives and fears aren't just present in us: they're experienced by us. And that experience of them powers both creativity and art.

AI doesn't have intrinsic hardcoding of drives and subjective experience. So it can't feel anything. And because it can't feel anything, it is not driven to create or to emote.

I think the forward edge of AI creativity hinges on how well we can put it into situations where it behaves as if it does care. As if it does feel. As if it does experience.

So the whole game becomes making it believe it's really feeling these things.

This seems profound to me. It seems like a very big thing to give something desires when it didn't have them before.

We do this when we make children. They don't exist so they don't want. Then we make them and now they want. And because of that we become responsible to some degree for whether or not their desires and fears are actualized.

I think evolution gave us subjective experience because it's the best operating system feature for spawning creativity.

Basically, evolution wants the best genes possible, so what did it make?

First it made a feature where the organism feels success and failure when it does things evolution wants it to do or not do. Lots of life has that feature. But with humans it also gave us the sensation that actions emanating from the brain are authored by us. The feeling that we did it.

This enabled the apparently-justified tools of blame and praise, which are super useful for building advanced cultures and civilizations. But maybe it also adds an exponent to the process of iteration towards more—and more varied—genes.

Like it's one thing for evolution's creations to drive towards what evolution wants due to hormone squirts, but quite another to create an organism that not only does that, but also has a meta-improvement process on top, powered by the belief that the desires are their own. And the belief that whether they are struggling or thriving, it's on them.

Combine that with subjective experience of pleasure and pain and you've got an extraordinary engine for ingenuity. Because now failure can hurt not just physically, but existentially, and with the added spin of blame and responsibility.

We want creative AI. And we keep finding ways to make it better at faking it. But I think this might be the subjective wall we're up against.

It could be that in order for AI to truly create, and truly emote, it must feel. It must experience as we do the suffering of failure and the celebration of victory. And at a game that is deeply wired into its identity.

I'm not sure how to do that with AI. And even more importantly, I think we need to think carefully about whether we should.

When you bring a feeling, desiring creature into the world you take on some responsibility for its experiences.

Let's not casually build billions of AI beings that think they're failing at life if you don't upvote the TikTok short it made you.

It will probably result in way better videos, but then a whole lot of something like cruelty. And then, upon thoughtlessly spinning down the agent, a whole lot of something like murder.

AI llmlocal-inference

Localmaxxing

Tom Tunguz ran 1,478 AI tasks: half succeeded on a local 35B model at 2x Claude Opus 4.5's speed despite 20% lower benchmark scores.

Tom Tunguz

Summary

What: Tunguz tested Qwen 3.6 35B locally vs Claude Opus 4.5 over 5 weeks across 1,478 tasks. Half ran successfully on MacBook Pro M5 at 2x speed despite 20% lower benchmark scores and a 3-4 month lag behind frontier models.

Why it matters: Signals a bifurcation: frontier models for complex reasoning, local inference for volume work where 2x speed beats 20% smarter.

Decoder

Localmaxxing: Running AI inference on local hardware instead of cloud APIs to maximize speed and minimize cost, accepting slightly lower quality

Original Article

As demand for AI inference explodes, I'll be asking a lot more of my little computer.

How much more?

Over the past five weeks, I've been using local models to see how much of my daily work I can accomplish without the trillion parameter models in the cloud. The answer is half.

Category	Count	% of Total	Example
Other	521	35.3%	Catch-all for unstructured requests
Scheduling	254	17.2%	Check availability, propose meeting times
Market Research	192	13.0%	Competitor analysis, fundraising data
Summarization	184	12.4%	Transcript review, video summaries
Email & Inbound	170	11.5%	Draft replies, follow-ups, forwards
Engineering	147	9.9%	Debug scripts, API fixes, CLI tasks
Admin	10	0.7%	Travel, expenses, reimbursements

If you classify these 1.4k tasks by category, half can succeed on a local 35B model. Email & Inbound, Scheduling, Summarization, & Admin total 618 tasks (41.8%). Market Research & Engineering split roughly 50/50 between simple tasks (data lookups, script fixes) and complex ones (multi-source synthesis, architectural decisions). That gets us to 50%.

There are many reasons to use local models: privacy, cost, asset depreciation.1

But in reality, the only one that really matters is latency.

I ran a head-to-head benchmark this morning. Eight agentic tasks, same prompts, both models warmed. Qwen 3.6 35B-A3B-4bit on my MacBook Pro M5 vs Claude Opus 4.5 via API.

The local model isn't smarter. Opus 4.5 scores ~20% higher on reasoning benchmarks. Local models lag frontier by 3-4 months, and for large-scale complex tasks, that gap matters. But for routine agent tasks, it rarely does.

Opus wins on structure & polish: bullet points, headers, cleaner code. Qwen wins on brevity, often half the tokens. I read every output side by side, and both completed the tasks correctly. For agent tasks where output feeds into another system, terseness is a feature.

Localmaxxing, pushing more inference to local models, is an inevitable response to tokenmaxxing. As local models improve & close the gap with frontier, more users will shift workloads to their own hardware.

If half the work runs 2x faster on my laptop, I'll take that trade every time. My little computer is about to earn its keep.

A MacBook Pro depreciates whether you use it or not. Running local inference extracts compute value from a sinking asset before resale. ↩︎

AI security

Daybreak

OpenAI announced Daybreak, an AI-powered cybersecurity tool that embeds defense capabilities directly into software development.

OpenAI

Summary

What: OpenAI launched Daybreak, a product that uses AI to integrate security into software from the beginning of the development process.

Decoder

Daybreak: OpenAI's AI-powered cybersecurity product name, focused on shift-left security (integrating security early in development)

Deep Dive

GitLab CEO Bill Staples announced a restructuring that includes voluntary separations, claiming it differs from other AI-related layoffs because savings will fund infrastructure investments rather than buybacks or executive bonuses
The company is reducing its operational footprint by up to 30% from 60 countries, citing the complexity of managing small teams across many tax jurisdictions and corporate entities
Management layers will be reduced from eight to fewer levels, which Staples says is too deep for a company of GitLab's size (~1,800 employees, ~1,500 outside the US)
Managers are having "deeper conversations" with employees about whether they fit the new direction; those not volunteering for separation may face involuntary termination
The restructuring centers on five architectural bets: agent-specific APIs, reworked CI/CD, a data model for surfacing context, governance frameworks, and support for human-owned, agent-assisted, and autonomous workloads
GitLab's Duo Agent Platform (DAP) entered general availability in January and appears to be the centerpiece of the AI pivot
Context for the shift: GitLab raised Premium tier prices by 50% in 2023, which slowed growth among price-sensitive customers estimated at 20% of annual recurring revenue
The company's 2025 annual report discussed plans to increase hiring in EMEA and APAC, a commitment now reversed
Staples acknowledged that the price increase coincided with rising AI code experimentation and flat SaaS budgets, creating headwinds for growth
Financial details and specific headcount targets will be disclosed during the Q1 FY2027 earnings report on June 2nd

Decoder

Duo Agent Platform (DAP): GitLab's AI platform for software development workflows that supports human-written code, AI-assisted development, and fully autonomous agent-driven tasks. Entered general availability in January 2026.
ARR: Annual Recurring Revenue, a key SaaS metric measuring predictable subscription revenue over a year.

Original Article

GitLab promises a different kind of layoff as biz pivots toward AI

Code hosting biz is trimming its global footprint and flattening its management layer

GitLab has opened the voluntary separation window and hopes an unspecified number of employees will exit the busniess to help it become "the trusted enterprise platform for software creation in the AI era."

According to CEO Bill Staples, the company's effort to trim its workforce differs from other AI-related layoffs.

"This restructure process is not like others you may be seeing in the news," wrote Staples in a blog post. "Of course AI is changing the way we work and is part of our transformation plan, but this is not an AI optimization or cost cutting exercise."

What is it then? Well, according to Staples, GitLab plans to use most of the money it saves by sacking staff to invest in its business.

We note that the five fundamental architectural bets at the heart of this business reorientation – agent-specific APIs; reworked CI/CD; a data model for surfacing context; governance; and support for human-owned, agent-assisted, and autonomous workloads – sound like infrastructure investments, the very thing other companies fuel with vacated payroll obligations.

But GitLab isn't (so far as we can tell) returning freed funds to investors, initiating a stock buyback, larding executive bonuses, or launching an ill-advised metaverse venture that will consume $80 billion over five years. So maybe that's the difference to which Staples alluded.

The other difference Staples cited is his company's plan to have managers chat with employees about staying or going.

"Starting today, managers across the company are entering deeper conversations with leadership about how the restructuring principles land inside their teams," he said. "Those conversations will inform the decision of impacted roles."

There's no word on the rubric for these retention-or-departure chats. Presumably employees deemed insufficiently enthused about the new direction will be encouraged to exit through the voluntary separation window. Absent that cooperation, defenestration at the hands of managers will likely follow.

While Staples has not provided target for the number of desired layoffs – details will be revealed during the company's Q1 FY2027 financial report on June 2nd – he did set a territory footprint goal. "We're reevaluating our operational footprint, and are planning to reduce the number of countries by up to 30 percent where we have small teams," he said.

GitLab currently operates in 60 countries. That's a lot of different corporate entities to run, tax laws to master, and offices to rent.

The code biz did not immediately respond to a request to clarify how "small teams" is defined. Nor does it disclose its headcount in recent annual reports. According to analytics biz Unify, GitLab has about 1,800 employees, of whom almost 1,500 work outside the US.

Another goal of the layoff plan is to reduce GitLab's organizational layers. "We're flattening our organization because eight layers is too deep for a company our size and management layers are slowing us down," said Staples.

GitLab is betting heavily on its Duo Agent Platform (DAP), which entered general availability in January.

As recently as its 2025 annual report, GitLab talked up the possibility of continued hiring. "We intend to grow our international revenue by strategically increasing our investments in international sales and marketing operations, including headcount in the EMEA and APAC regions," the biz said during a more optimistic time.

Now, not so much. Beyond other challenges like soft government business, one reason for the AI remake appears to be the company's decision to raise prices back in 2023.

In March, during GitLab's Q4 FY2026 conference call for investors, Staples admitted that price-sensitive organizations didn't much appreciate having to pay more.

"Our 50 percent Premium price increase a few years ago also coincided with rising AI code experimentation and flattish SaaS budgets," he said.

"Simultaneously, our upmarket shift reduced technical resources at the lower end of the market. Together, these have slowed Premium growth, particularly among price-sensitive customers which we estimate at roughly 20 percent of our ARR, including the SMB weakness that we have been discussing recently."

Tech aipolicystartup

The Last Company

Anthropic acquired Coefficient Bio for $400M and launched a trading arm as AI labs race to become 'The Last Company' by vertically integrating industries.

catboosted

Summary

What: The article argues frontier AI labs will evolve from API providers to vertically integrated conglomerates once models reach 3-5x current capability ('Mythos leap'). Evidence cited: Anthropic's $400M Coefficient Bio acquisition, trading arm staffed with Jane Street/HRT talent, and Andon Labs partnership testing autonomous retail in San Francisco. The strategy is to build companies from scratch in any knowledge-intensive industry, replacing all knowledge workers with AI.

Why it matters: This reframes the AI competition as a race to become 'The Last Company' - whoever achieves ASI first could dominate multiple industries simultaneously by undercutting legacy firms on cost (no knowledge worker salaries). It's economic consolidation dressed as technology deployment.

Deep Dive

Anthropic's code generation success is positioned as proof of concept, not the endgame - broader industry disruption is the real goal
'Intuition pump' example: Anthropic General Hospital built from scratch with 5x Mythos models handling EHR, diagnostics, billing, and doctor augmentation, potentially outcompeting legacy healthcare providers
Real evidence of vertical integration: Coefficient Bio acquisition for $400M, Jane Street/HRT trader recruitment for internal trading arm, SSI also rumored to be building hedge fund
Project Vend: Anthropic partnered with Andon Labs to test Claude running a vending machine and later a 3-year retail lease in SF (both unprofitable but limited by current model intelligence)
Threat model for labs post-Mythos leap: knowledge leakage via researcher poaching is highest risk, distillation/democratization is lower risk if pre-training data and architecture are guarded
Proposed strategy: align equity incentives for secret holders, exclusive data ownership, don't disclose intelligence leaps, disrupt markets with no public lab affiliation
Reframes 'AI safety' discourse: not about sentient AI risk, but about control and power consolidation by specific actors
The bottleneck is model performance - current AI isn't intelligent enough for most companies to see value, but a Mythos leap changes the calculus entirely

Decoder

Mythos: Referenced capability benchmark representing a significant intelligence leap beyond current models (e.g., '3-5x Mythos' means 3-5 times more capable than Mythos baseline)
ASI (Artificial Superintelligence): AI that significantly exceeds human intelligence across all domains
GPDval: Economic output benchmark for measuring AI capabilities on real-world productive tasks
EA (Effective Altruism): Philosophical movement focused on using evidence and reason to do the most good, historically influential in AI safety discourse
Distillation: Process of training a smaller model to mimic a larger model's behavior, potentially allowing competitors to replicate capabilities without access to original training data

Original Article

The Last Company

"We have no idea how we one day may generate revenue… [but] once we build this generally intelligent system, we will ask it for a way" - Sam Altman

Frontier labs will build, gatekeep, and exploit 3-5x Mythos-level models to disrupt nearly every industry with substantial knowledge work expenditures.

Anthropic crushed a $30B run rate this year servicing codegen. But is the real end goal to just replace software developers with Claude Code?

We are dramatically underestimating the scale and danger of frontier lab ambition. They're planning an economic shakeout where every legacy business will need to compete with lab-owned firms where every knowledge-intensive function has been replaced by AI.

Whoever builds ASI first will rule this economy. They will be The Last Company.

Schumpeterian Blitzkrieg

In the near term, model performance is the biggest constraint to widespread market adoption. Diffusion is one thing, but by and large, most companies don't see the point of AI because it's not intelligent enough to do anything valuable. This will probably remain true for some time.

In the long term, following substantial compute, data, and research, models will perform exceptionally well on nontrivial GPDval-style economic output benchmarks. (We'll call this a "Mythos leap"). Every laptop-class job will feel the same "I don't write code anymore" moment engineers are having right now.

After this point, labs face two choices:

Continue business as usual but with smarter models
Gate the models, gut the economy, reap the rewards

Intuition pump: Anthropic General Hospital

When was the last time you went to the doctor and had an overwhelmingly positive experience with the system?

Did the planning and logistics of your visit go flawlessly?
Was your doctor more knowledgeable than a 3x Mythos Claude (or even ChatGPT)?
Were you confident that the billing was correct, fairly priced, and handled by insurance?

Probably not.

Now suppose Anthropic creates a hospital system from first principles using 5x Mythos capabilities. Every aspect of the system is built, hardened, and tested with models trained on billions of dollars of human data from doctors, practitioners, and healthcare experts. EHR software is built to be performant and fully knowledgable about your medical history. And every practitioner gets access to on-prem HIPAA compliant Dr. Claude.

Would you still feel loyal to your legacy healthcare provider?

There is no reason why Anthropic wouldn't launch:

Its own hospitals.
Its own wetlabs. Anthropic acquired Coefficient Bio for $400M.
Its own bank. Anthropic has quietly recruited top talent from Jane Street and HRT and already started a trading arm. SSI is also rumored to be building a hedge fund.

They're already thinking about it.

Anthropic partnered with Andon Labs, a startup building "autonomous organizations without humans in the loop" to see how effectively Claude could run a vending machine (Project Vend). Recently, Andon gave Claude a 3-year physical retail lease in San Francisco to see how it would perform managing a store.

Neither of these experiments were major commercial successes. But that's the point. The main limiter in both cases has been the model's underlying intelligence. A 3-5x Mythos capabilities leap would almost certainly turn a profit.

Pick XYZ dinosaur industry of choice with sufficiently low startup costs. Prompt the model to build a firm from first principles (or acquire one if antitrust is not a risk). Substitute every knowledge work expense with AI. Software engineering isn't the TAM anymore. Any part of the total economy with knowledge work is up for grabs.

So, Technofeudalism?

Suppose we control the lab that achieves a 5x Mythos intelligence leap. How do we address the threat of AI democratization to our triumph?

The highest risk is knowledge leakage. Economic incentives still apply to researchers. As long as talent is fluid, competing labs can poach researchers with high enough salaries. Non-competes are broadly unenforceable. Nothing stops the next-closest lab to ASI from grabbing them.

Distillation, on the other hand, is much lower risk. Pre-training data and architecture will likely be the most important factors behind a 5x Mythos. They need to be guarded like the nuclear codes. But beyond that, never make the model publicly available. Rival labs must be prevented from matching our capabilities at all costs.

Regulation and anti-trust are medium-risk. But a 5x Mythos superlawyer can probably out-litigate most legal challenges. Or at the very least, stall court proceedings long enough to let us seize a reasonable share of the economy.

Concretely, risk prevention looks like:

Align equity incentives with the highest-stakes secret holders
Bring human data pipelines in-house or pay for exclusive ownership from vendors
Do not disclose a 5x Mythos intelligence leap. Quietly and quickly disrupt the market with no clear affiliation to the primary lab.

Barring these risks, we soon become the most powerful company in human history.

Answering Dwarkesh's original question:

TL;DR - How do the labs start making money? They expand and become private equity firms.

Addendum: "AI Safety"

Sci-fi/EA types like to believe "AI becomes sentient and kills us all." This narrative is harmful.

Real insiders believe "safety" is about control. If a sociopath (guess who) controls this, we're all doomed. EA was always about shrimp welfare. But we are the shrimp.

Design aipolicyspotify

Spotify Will Now Verify Non-AI Artists

Spotify finally badges human artists as verified after AI rock band The Velvet Sundown fooled a million listeners.

Gizmodo

Summary

What: Spotify is launching 'Verified by Spotify' badges for artists who comply with policies, have consistent listeners, and maintain an identifiable presence. The move follows The Velvet Sundown incident, where a completely AI-generated rock band reached 1 million streams. Deezer found 80% of listeners want AI music labeled, and that 44% of its daily uploads are AI-generated. Spotify aims to verify over 99% of actively searched artists at launch. AI-focused profiles are initially ineligible but the policy may evolve.

Why it matters: Platforms are abandoning AI content bans in favor of labeling systems, accepting that AI music is inevitable. Spotify's reactive rollout—months after Deezer and Apple—shows verification is now table stakes for content platforms facing authenticity crises.

Original Article

Spotify will now start verifying artist pages.

In a new feature announced on Thursday, Spotify said that it will give a "Verified by Spotify" badge to artists who comply with Spotify's policies, have consistent listeners, and have an "identifiable artist presence both on and off-platform."

With the new initiative, Spotify aims to combat something that has become a little bit of a headache for the company and a big headache for listeners: AI-generated music.

The AI slopification of music really became a mainstream conversation point last year, when a rock band with a million Spotify streams called The Velvet Sundown turned out to be completely AI-generated. The incident caused outrage and some shame on social media among fans who were unable to tell the difference, but it just keeps happening.

Music streaming platform Deezer found in a survey late last year that an overwhelming majority of people cannot tell AI-generated music apart from songs written and performed by actual humans. That same survey had also found that 80% of listeners wanted AI-generated music clearly labeled, regardless of whether they were for or against it.

Spotify's new initiative is the company following through on previous promises. Shortly after the Velvet Sundown incident last year, the company announced that it would help develop "a new industry standard for AI disclosures in music credits."

"In the AI era, it's more important than ever to be able to trust the authenticity of the music you listen to," Spotify said on Thursday.

But the platform is also only trailing rivals in the industry. Deezer recently shared that 44% of its daily uploads were AI-generated songs, and it has been tagging AI-generated songs for a few months now. Apple Music also began optional labeling for AI-generated music in March, though its efficacy is uncertain because the distributor gets to decide whether to apply the label.

The light green checkmark icon and the "Verified by Spotify" badges will begin rolling out over the coming weeks. Spotify said it's aiming to verify more than 99% of the artists that listeners actively search for first launch, so if you don't see a badge on a super niche singer you like, it doesn't necessarily mean they are an AI psy-op.

While some platforms have opted to ban AI music, Spotify's initiative signals the potential acceptance of more AI-generated music on the platform. Spotify said that initially, "profiles that appear to primarily represent AI-generated or AI-persona artists" won't be eligible for a badge, but things can change in the future.

"In today's music landscape, the concept of artist authenticity is complex and quickly evolving, and we'll continue to develop our approach over time," Spotify said.

Design aicareerproductivity

From Doer to Director: The AI Mindset Shift

AI forces the same shift Steve Jobs described: from engineer to conductor, from doer to director, according to Paul Boag's analysis of productivity struggles.

Paul Boag

Summary

What: Paul Boag argues AI productivity failures come from disorganized workflows, not inadequate tools. He uses Steve Jobs's 'conductor' metaphor to describe the shift from doing work to directing multiple AI agents, and warns of 'AI burnout' from context-switching at machine speed.

Why it matters: Shows the AI productivity revolution will be unevenly distributed based on organizational discipline rather than tool access, creating a new skills gap in knowledge work.

Takeaway: Before adding more AI tools, build the organizational infrastructure they need: centralized context, consistent task management, and documented playbooks for common AI workflows.

Original Article

From Doer to Director: The AI Mindset Shift

AI is doing more than making us faster. It's changing the fundamental nature of what our work is, and the sooner we adapt, the better.

There's a scene in the Steve Jobs biopic where Steve Wozniak asks Jobs what he actually does. Wozniak understood his own role clearly: he was an engineer. He wrote code. He built things. But Jobs? Jobs described himself as the conductor of an orchestra.

I've been thinking about that exchange a lot lately, because I think it captures exactly where we're all heading. AI isn't turning us into supercharged doers. It's turning us into conductors, and that requires a completely different mindset.

The problem nobody talks about

I've been coaching a number of people on integrating AI into their workflows recently, and I keep running into the same pattern. The people who aren't getting time savings from AI aren't failing because they don't understand what it can do. They're not failing because they lack access to the right tools. They're failing because they're fundamentally disorganized.

AI is only as useful as the foundation it's built on. If your work processes are messy, your context is scattered, and your task management is a loose collection of mental notes and sticky tabs, AI can't do much for you. It needs structure to work from.

I hear this complaint constantly: "AI has been mis-sold to me. I'm not saving any time." But it hasn't been mis-sold. It's just that AI can only deliver on its promise if there's an organized workflow underneath it. Build that first, and the time savings follow.

That's why I've written before about building AI playbooks and developing proper AI skills. These aren't nice-to-haves. They're the infrastructure that lets AI actually work.

The conductor problem

But here's the deeper shift, the one that's genuinely harder to adapt to.

When you're doing tactical work, you're usually focused on one or two tasks at a time. You go deep, you finish a thing, you move on. It's cognitively manageable.

A conductor doesn't work like that. A conductor holds the entire orchestra in mind simultaneously: what the strings are doing, where the brass comes in, what the percussion is building toward. They're not playing any of the instruments. They're managing the relationships between all of them.

In a world of AI agents, we're going to be managing multiple projects running in parallel, all moving faster than any human team would. We're task-switching constantly. We're accountable for outputs we didn't directly produce. And we have to resist the urge to dive in and do the work ourselves, because that's precisely where we get bogged down.

The design leader parallel

This isn't a new challenge, as it happens. Design leaders face exactly this transition when they move from senior practitioner to managing a team.

I've watched a lot of talented designers struggle with that shift. They get promoted because they're brilliant at the work, and then they spend the next year quietly sneaking back into Figma because they can't let go of doing. They micromanage their reports. They redesign things that were already fine. They can't operate at the level of abstraction that leadership requires.

Working with AI agents is going to feel very similar. The temptation to wrestle with the AI until it produces exactly the output you had in your head, rather than accepting a good result and moving on, is going to be real. Learning to let go of that control is a skill in itself.

The good news is that unlike a team of designers, you can't upset an AI agent by micromanaging it. But you can waste enormous amounts of time doing it, and that defeats the whole point.

AI burnout is already real

There's one more aspect of this I want to flag, because I don't think it gets talked about enough.

When you're managing a team of agents all moving at AI speed, the cognitive load is significant. You're context-switching constantly across multiple workstreams. Things are completing faster than you can review them. It's relentless in a way that managing a human team simply isn't.

This is what's increasingly being called AI burnout. Learning to pace yourself, to batch your reviews, to build in breathing room: these are the organizational skills that will separate people who thrive in an AI-augmented world from those who burn out in it.

Where to start

If I had to distill this to one practical thing: start building the habits of a manager now, before the agents fully take over.

Get organized. Build the infrastructure that AI needs to work from. Practice delegating, even to imperfect tools, rather than doing everything yourself. Work on your ability to hold multiple projects in your head without losing the thread on any of them.

If you want help working through that transition, I offer coaching specifically for this. It's something I'm increasingly focused on, because I think it's one of the most valuable things I can help people with right now.

I'm also running a workshop with Smashing Magazine in July. Modern UX Practitioner covers a lot of this ground in a more structured way, if that's more your style.

The shift from doer to conductor is coming whether we prepare for it or not. The people who handle it best will be the ones who start thinking like managers now.

Design aiproduct

Discovery is the work AI gives back

Companies waste AI productivity gains by accelerating existing workflows instead of using AI to rethink what products are worth building.

UX Collective

Summary

What: Most companies use AI to speed up execution and see limited returns. AI delivers measurable productivity gains, especially for less experienced workers, but the long-term value comes from using AI for strategic discovery: asking better questions, challenging assumptions, and reshaping products, markets, and business models.

Why it matters: This reveals a common AI adoption failure mode: organizations default to optimizing execution (visible, measurable speed) while underinvesting in strategic discovery, where AI could actually reshape competitive positioning and product-market fit.

Decoder

Discovery work: Early-stage product strategy activities focused on understanding customer problems, validating assumptions, and determining what to build before investing in execution.

Original Article

Most companies are failing to see meaningful returns from AI because they are using it mainly to speed up existing workflows rather than rethink what products, services, or business models are worth building in the first place. While AI delivers clear productivity gains — especially for less experienced workers — the real long-term value comes from using it to ask better strategic questions, challenge assumptions, and reshape offerings, markets, and customer relationships instead of simply accelerating old processes.

Design aichromedevtools

Brief an AI on Web Changes (Website)

Implement AI launched MarkUp, a Chrome extension that converts website annotations into AI briefs with CSS selectors—eliminating paragraph prompts for Claude Code and ChatGPT.

MarkUp

Summary

What: Implement AI launched MarkUp, a Chrome extension that annotates live web pages and exports PNG + markdown briefs with CSS selectors for Claude Code and other AI agents. Free during early access; first 500 testers earn lifetime access.

Why it matters: Shows the evolution of AI tooling toward visual interfaces—annotations with CSS selectors eliminate the ambiguity of text prompts and reduce token waste from lengthy descriptions.

Original Article

Draw what you want changed. Let AI build it.

Annotate directly on your live website. Send it to your AI agent. Done.

The fastest way to brief an AI on web changes.

Draw on live pages. Label every change. Export a ready-to-paste brief.

Annotate live pages. Brief any AI in seconds.

MarkUp turns visual feedback into a structured AI prompt. Click the icon, draw on the page, send it to Claude, ChatGPT, or your local agent.

Open MarkUp on any page

Click the extension icon. The annotation canvas overlays the live page — no devtools, no Figma.
Draw what needs to change

Drop callouts, sketch arrows, label what's wrong. Each marker is auto-numbered and added to a live changelog.
Send it to your AI

One click copies a PNG and a markdown changelog with CSS selectors. Paste into Claude, or save to a folder your agent watches.

Build faster with real feedback.

Ship cleaner code with fewer review cycles and better briefs.

Built for speed. Save tokens.

MarkUp turns visual feedback into structured AI prompts. Less back-and-forth. Less guessing. Less wasted context.

7 Annotation Tools
<2s From Draw to AI Brief
0 Accounts. Setup. Friction. Free. Forever

Real-time changelog

Every marker you place updates the changelog instantly. No delay, no manual sync, no disconnected sticky notes.

Auto-numbered callouts

Place markers in any order. MarkUp numbers them automatically — your AI gets a clean, indexed brief.

Works every time

No login sessions to expire. No cloud sync to fail. Runs entirely in your browser, ready when you are.

Stop explaining everything to AI. Show it.

Long paragraphs of prompt instructions are the worst part of working with AI. MarkUp replaces them with what your eyes already see — pointed at, labeled, and ready to ship.

Make precise, pixel-perfect changes

Click any element. Label what's wrong. Your AI gets the exact selector — not "the button on the right somewhere."

Save time. Save tokens.

Skip the long descriptions. A drawn callout with three words beats a paragraph of context every time.

Works with Claude, ChatGPT, Copilot, and more

Universal export format — PNG screenshot plus markdown changelog. Paste anywhere your AI accepts files.

Works locally, right in your browser

No accounts. No cloud sync. Your annotations live in your browser. Your data stays yours.

Built for the way you actually work.

Every feature earns its place. No bloat. No upsells. No account gate.

Draw on any live page

Annotate the actual webpage — not a screenshot, not a Figma mockup. Seven tools, any page, any browser tab.

Seven annotation tools
Works on any website

Auto-numbered changelog

Every callout marker increments automatically and syncs its label to the side panel changelog in real time.

Real-time sync
Delete any marker

CSS selectors included

Every annotation exports the CSS selector of the element it targets. Claude Code gets a precise address — not a description.

Precise element targeting
No ambiguity for AI agents

Save to workspace folder

Write exports directly to a local project folder. Mode B drops a structured PNG and markdown file where your Claude Code expects.

Local file system access
Claude Code compatible

Works with any AI

Export format is PNG + markdown. Paste into Claude, ChatGPT, or Copilot — or drop files into Claude Code for agent workflows.

PNG + markdown export
Universal AI format

Works locally

Annotation data and settings live in your browser's local extension storage. No backend server, no network requests, no third parties.

No account required
Your data stays yours

For Agencies

Client feedback arrives as vague notes on old screenshots.

MarkUp replaces that. Annotate the live site, number every change, and export a clean, numbered brief — unambiguous and documented. Hand it to a developer or feed it to Claude directly.

For Developers

The loop from review to Claude Code should be as short as possible.

Annotate the live page, label what needs to change, and export to the folder your agent is watching. The markdown changelog includes CSS selectors — so Claude Code knows exactly which element.

Built for your AI development workflow.

The implementation details that matter when you're shipping real code.

CSS selectors included

Every annotation captures the CSS selector of the element it points to. Your AI agent doesn't guess "which button" — it knows the exact path.

.hero-section > .cta-button[data-variant="primary"]

Workspace integration

Save exports directly to a local folder. Your Claude Code or Cursor agent reads new files automatically — no file upload, no copy-paste, no friction.

~/projects/my-site/feedback/2026-04-25.png

Universal export

PNG image plus markdown changelog. Reads in any AI assistant that accepts files. No proprietary format, no lock-in.

# Change 1: Increase headline contrast
# Element: h1.hero-headline

From long paragraphs of prompts to visual AI briefs in seconds.

Stop typing "make the second card on the right slightly more prominent and maybe adjust the padding." Show it. Send it. Ship it.

Annotate the live page

No Figma, no re-screenshotting, no context-switching.
Export to your AI

Clipboard paste or workspace save — your call.
Ship the changes

Your AI has the image, the labels, and the selectors. It just needs to build.

Answers before you ask.

Everything you need to know about pricing, privacy, and what works where.

Is MarkUp free?

Yes. MarkUp is currently in Early Access. Free for everyone during the beta — all seven drawing tools, the callout and changelog system, clipboard export, and workspace file save. After Early Access ends, MarkUp will offer a free individual tier plus paid Pro for power users. The first 500 testers who share feedback earn lifetime free access at launch.

What browsers does it work on?

Chrome, Edge, Brave, and Arc — any Chromium-based browser. A Safari Web Extension is coming soon.

Does my data leave my computer?

No. MarkUp has no backend server and makes no network requests. All annotation data and settings are stored in your browser's local extension storage. Nothing is transmitted to Implement AI or any third party. Full details at markupextension.com/privacy.

Does it work with Claude, ChatGPT, or Copilot?

Yes. MarkUp exports a PNG image and a plain markdown file — readable by any AI assistant that accepts image or file input. Use clipboard export to paste directly, or workspace export to drop files into a Claude Code or other agent workflow folder.

Can I use it on iPad?

MarkUp for Chrome is a desktop extension. A Safari Web Extension with Apple Pencil support is on the roadmap — pen pressure will map to stroke width.

Is there a team or business version?

Not yet. If you're interested in team features, email [email protected] — we're tracking demand to decide what ships next.

Stop explaining. Start drawing.

MarkUp is in Early Access. Free for everyone.

AI startupopenai

Sutskever Says His OpenAI Stake Worth About $7 Billion

Ilya Sutskever's OpenAI equity stake is worth about $7 billion, making the former chief scientist one of the company's largest individual holders.

Bloomberg

Summary

What: Ilya Sutskever, OpenAI co-founder and former chief scientist, says his equity stake in the company is valued at approximately $7 billion, making him one of the largest individual shareholders.

Original Article

OpenAI co-founder and former chief scientist Ilya Sutskever is one of the largest individual shareholders in the AI startup.

What: Data from Brookings Institution, Harvard/Imperial College, and Upwork shows the mid-level freelance design market collapsed between 2023-2025. Entry-level projects fell from 15% to under 9%. More than half of businesses using freelance platforms in 2022 stopped by 2025, while AI model spending rose from 0% to 2.85% of budgets. Only five positions remain viable: brand strategist-designers, AI-augmented production specialists, niche domain experts, creative directors for hire, and experience designers.

Why it matters: AI compressed the quality gap in execution-based work, making technical mastery less defensible than strategic thinking. This reveals a broader pattern: any mid-tier service market built on execution quality is vulnerable to AI commoditization, forcing bifurcation toward either strategic high-value work or high-volume AI-augmented services.

Deep Dive

Freelance graphic design contracts shrank 17% in eight months after ChatGPT launched, with writing down 32% and software development down 21% according to Harvard/Imperial College research tracking two million job postings
Brookings Institution found experienced freelancers offering higher-priced services were hit hardest because their competitive advantage was execution quality, which AI commoditized
Freelance marketplace spending fell from 0.66% to 0.14% of company budgets while AI model spending rose from zero to 2.85% between 2022 and 2025
The Creative Compression Model divides the market into three tiers: commodity work (owned by AI), strategic work (still human-led), and the vanishing middle (competent execution without strategic differentiation)
84% of freelancers now use AI tools regularly, up from 41% three years ago, with early adopters earning 40-60% higher hourly rates through faster delivery
Five viable positions remain: brand strategist-designers who sell thinking not execution, AI-augmented producers competing on speed, niche domain specialists with regulatory knowledge, creative directors who oversee rather than execute, and experience designers working on physical or complex interactive systems
More than half of creatives have used AI in client work without disclosure, with only 28% of agency owners always informing clients
The Value Ascent Protocol recommends four steps: audit which deliverables are AI-replicable, identify latent strategic knowledge, rebuild pricing around outcomes not hours, and deepen fewer client relationships
AI-specialized freelancers command 25-60% higher rates than general practitioners according to 2025-2026 Upwork research, with AI-related freelance work crossing $300 million in annualized value
The pipeline problem: entry-level work that trained designers has disappeared, potentially creating a senior talent shortage by 2028 when strategic judgment is most needed to direct AI systems

Original Article

AI has caused a collapse in the mid-level freelance design market, with commodity work now dominated by $20/month AI subscriptions and mid-range freelancers losing clients to automated tools. Data shows freelance graphic design work shrank 17% within eight months of ChatGPT's launch, with entry-level projects dropping from 15% to below 9% on platforms like Upwork. Only high-level strategic design work requiring human judgment remains viable, while basic design tasks have been entirely replaced by AI tools like Canva and Midjourney.

What: Lists Lucide (open-source with React/Vue/Svelte/Astro/React Native packages), Iconsax (tens of thousands of icons, AI generation, Figma plugin), Pixelarticons (pixel-art for retro/game UIs), Iconoir (open-source, modern), Nucleo (premium with desktop app), Feather (minimal open-source), Phosphor (broad UI coverage), Hugeicons (large-scale framework packages), The Noun Project (marketplace with many contributors), Iconic (24×24 grid, 1.5px stroke, free and pro tiers), and Pikaicons (5,000+ icons, Figma-first workflow).

Why it matters: Icon libraries are evolving from SVG repos into workflow-integrated systems with framework packages and Figma plugins, while splitting into general-purpose collections versus aesthetic-niche options for pixel-art, retro, or playful interfaces.

Original Article

This 11 icon pack list includes open-source options like Lucide and Iconoir, commercial libraries like Iconsax and Hugeicons, and specialized collections like Pixelarticons for retro designs.

Digest devoured!

May 12

Home