Fresh Devoured
Chronicle – Codex (6 minute read)

AI
OpenAI's Chronicle feature for ChatGPT Pro on macOS captures screen content to help Codex remember your work context, but introduces privacy and prompt injection risks.
What: Chronicle is an opt-in research preview feature that augments Codex (ChatGPT's macOS assistant) by periodically capturing and analyzing your screen to build contextual memories, reducing the need to repeatedly explain what you're working on when prompting the AI.
Why it matters: This represents a significant shift toward persistent context awareness in AI assistants through ambient screen monitoring, raising important questions about the privacy-convenience trade-off as AI tools become more deeply integrated into developer workflows.
Takeaway: Review the privacy implications before enabling Chronicle in Codex settings, and use the menu bar pause feature before viewing sensitive content or in meetings to prevent unwanted context capture.
Deep dive
  • Chronicle runs sandboxed background agents that periodically capture screenshots and use OCR to extract text, then summarize recent activity into markdown memory files stored locally
  • Screen captures are ephemeral and deleted after 6 hours, stored temporarily under $TMPDIR/chronicle/screen_recording/, while generated memories persist under ~/.codex/memories_extensions/chronicle/
  • Screenshots are processed on OpenAI servers to generate memories but are not stored there permanently (unless required by law) and are not used for training
  • The generated memories themselves may be included in future Codex sessions and could be used for model training if allowed in ChatGPT settings
  • Chronicle helps Codex understand what you're currently viewing, identify relevant sources like files or Slack threads to read directly, and learn your preferred tools and workflows over time
  • The feature consumes rate limits quickly due to the background agent activity required for memory generation
  • Prompt injection risk increases because malicious instructions visible on screen (like on websites) could be followed by Codex when it processes that context
  • Memories are stored as unencrypted markdown files that can be manually read, edited, or deleted, and other programs on your computer can access these files
  • Chronicle requires macOS Screen Recording and Accessibility permissions, and can be paused via the Codex menu bar icon or fully disabled in Settings
  • Currently limited to ChatGPT Pro subscribers on macOS and not available in EU, UK, or Switzerland
  • Users should pause Chronicle before meetings or when viewing sensitive content they don't want remembered, and be aware others may not have consented to being recorded
  • The consolidation_model configuration setting controls which model generates Chronicle memories, defaulting to your main Codex model
Decoder
  • Codex: OpenAI's AI assistant application for macOS, part of ChatGPT Pro
  • Chronicle: The screen capture feature that builds contextual memories from what appears on your screen
  • Prompt injection: A security vulnerability where malicious instructions in consumed content (like text on a website) can manipulate the AI's behavior
  • OCR (Optical Character Recognition): Technology that extracts text from images or screenshots
  • Ephemeral: Temporary data that is automatically deleted after a set time period
  • Sandboxed agents: Background processes that run in isolated environments with restricted permissions
  • Rate limits: Restrictions on how many API calls or operations can be performed within a time period
Original article

Chronicle is in an opt-in research preview. It is only available for ChatGPT Pro subscribers on macOS, and is not yet available in the EU, UK and Switzerland. Please review the Privacy and Security section for details and to understand the current risks before enabling.

Chronicle augments Codex memories with context from your screen. When you prompt Codex, those memories can help it understand what you've been working on with less need for you to restate context.

Chronicle is available as an opt-in research preview in the Codex app on macOS. It requires macOS Screen Recording and Accessibility permissions. Before enabling, be aware that Chronicle uses rate limits quickly, increases risk of prompt injection, and stores memories unencrypted on your device.

How Chronicle helps

We've designed Chronicle to reduce the amount of context you have to restate when you work with Codex. By using recent screen context to improve memory building, Chronicle can help Codex understand what you're referring to, identify the right source to use, and pick up on the tools and workflows you rely on.

Use what's on screen

With Chronicle, Codex can understand what you are currently looking at, saving you time and context switching.

Fill in missing context

No need to carefully craft your context and start from zero. Chronicle lets Codex fill in the gaps in your context.

Remember tools and workflows

No need to explain to Codex which tools to use to perform your work. Codex learns as you work to save you time in the long run.

In these cases, Codex uses Chronicle to provide additional context. When another source is better for the job, such as reading the specific file, Slack thread, Google Doc, dashboard, or pull request, Codex uses Chronicle to identify the source and then uses that source directly.

Enable Chronicle

  1. Open Settings in the Codex app.
  2. Go to Personalization and make sure Memories is enabled.
  3. Turn on Chronicle below the Memories setting.
  4. Review the consent dialog and choose Continue.
  5. Grant macOS Screen Recording and Accessibility permissions when prompted.
  6. When setup completes, choose Try it out or start a new thread.

If macOS reports that Screen Recording or Accessibility permission is denied, open System Settings > Privacy & Security > Screen Recording or Accessibility and enable Codex. If a permission is restricted by macOS or your organization, Chronicle will start after the restriction is removed and Codex receives the required permission.

Pause or disable Chronicle at any time

You control when Chronicle generates memories using screen context. Use the Codex menu bar icon to choose Pause Chronicle or Resume Chronicle. Pause Chronicle before meetings or when viewing sensitive content that you do not want Codex to use as context. To disable Chronicle, return to Settings > Personalization > Memories and turn off Chronicle.

You can also control whether memories are used in a given thread.

Rate limits

Chronicle works by running sandboxed agents in the background to generate memories from captured screen images. These agents currently consume rate limits quickly.

Privacy and security

Chronicle uses screen captures, which can include sensitive information visible on your screen. It does not have access to your microphone or system audio. Don't use Chronicle to record meetings or communications with others without their consent. Pause Chronicle when viewing content you do not want remembered in memories.

Where does Chronicle store my data?

Screen captures are ephemeral and will only be saved temporarily on your computer. Temporary screen capture files may appear under $TMPDIR/chronicle/screen_recording/ while Chronicle is running. Screen captures that are older than 6 hours will be deleted while Chronicle is running.

The memories that Chronicle generates are just like other Codex memories: unencrypted Markdown files that you can read and modify if needed. You can also ask Codex to search them. To have Codex forget something, delete the corresponding file in the folder or edit the Markdown files to remove that information. You should not manually add new information. Generated Chronicle memories are stored locally on your computer under $CODEX_HOME/memories_extensions/chronicle/ (typically ~/.codex/memories_extensions/chronicle).

Both the screen-capture and memories directories may contain sensitive information. Do not share their contents with others, and be aware that other programs on your computer can also access these files.
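
Because memories are plain, unencrypted Markdown, you can audit them yourself. As a hedged sketch (the directory path comes from the article; the assumption that every memory is a `.md` file is mine), a small script can flag memory files that mention a sensitive keyword:

```python
from pathlib import Path

def find_sensitive_memories(memories_dir, keywords):
    """Return Chronicle memory files that mention any of the given keywords.

    Assumes memories are plain-text Markdown files, per the article; the
    per-file naming scheme inside the folder is not documented here.
    """
    hits = []
    for md_file in sorted(Path(memories_dir).rglob("*.md")):
        text = md_file.read_text(encoding="utf-8", errors="ignore")
        if any(kw.lower() in text.lower() for kw in keywords):
            hits.append(md_file)
    return hits

if __name__ == "__main__":
    # Typical location per the article: ~/.codex/memories_extensions/chronicle/
    chronicle_dir = Path.home() / ".codex" / "memories_extensions" / "chronicle"
    if chronicle_dir.is_dir():
        for f in find_sensitive_memories(chronicle_dir, ["password", "api key"]):
            print(f)  # review, edit, or delete these files manually
```

Matching files can then be edited or deleted by hand, which per the article is the supported way to make Codex forget something.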

What data gets shared with OpenAI?

Chronicle captures screen context locally, then periodically uses Codex to summarize recent activity into memories. To generate those memories, Chronicle starts an ephemeral Codex session with access to this screen context. That session may process selected screenshot frames, OCR text extracted from screenshots, timing information, and local file paths for the relevant time window.

Screen captures used for memory generation are stored temporarily on your device. They are processed on our servers to generate memories, which are then stored locally on device. We do not store the screenshots on our servers after processing unless required by law, and do not use them for training.

The generated memories are Markdown files stored locally under $CODEX_HOME/memories_extensions/chronicle/. When Codex uses memories in a future session, relevant memory contents may be included as context for that session, and may be used to improve our models if allowed in your ChatGPT settings.

Prompt injection risk

Using Chronicle increases the risk of prompt injection attacks from screen content. For instance, if you browse a site with malicious agent instructions, Codex may follow those instructions.

Troubleshooting

How do I enable Chronicle?

If you do not see the Chronicle setting, make sure you are using a Codex app build that includes Chronicle and that you have Memories enabled inside Settings > Personalization.

Chronicle is currently only available for ChatGPT Pro subscribers on macOS. Chronicle is not available in the EU, UK and Switzerland.

If setup does not complete:

  1. Confirm that Codex has Screen Recording and Accessibility permissions.
  2. Quit and reopen the Codex app.
  3. Open Settings > Personalization and check the Chronicle status.

Which model is used for generating the Chronicle memories?

Chronicle uses the same model as your other Memories. If you did not configure a specific model it uses your default Codex model. To choose a specific model, update the consolidation_model in your configuration.

[memories]
consolidation_model = "gpt-5.4-mini"
Moonshot AI launches Kimi K2.6 on Kimi Chat and APIs (2 minute read)

AI
Moonshot AI released Kimi K2.6, an open-source model family claiming benchmark leads over GPT-5.4 and Claude Opus 4.6 in coding and agentic tasks.
What: Kimi K2.6 is an open-source model family from Moonshot AI with four variants (Instant for quick responses, Thinking for reasoning, Agent for document/web tasks, and Agent Swarm for large-scale processing) available via web interface at kimi.com, downloadable weights on Hugging Face, and APIs at platform.moonshot.ai.
Why it matters: This positions open-source models as competitive alternatives to frontier closed models from OpenAI, Anthropic, and Google, particularly for developer workflows involving code generation, web research, and multi-step autonomous tasks.
Takeaway: Developers can access K2.6 through platform.moonshot.ai APIs or download weights from Hugging Face to experiment with open-source agent capabilities for coding and automation workflows.
Deep dive
  • Moonshot AI released four K2.6 variants targeting different use cases: Instant optimized for speed, Thinking for complex reasoning, Agent for research and document tasks, and Agent Swarm for batch processing and large-scale operations
  • The model claims open-source leadership across key developer benchmarks including 76.7 on SWE-bench Multilingual, 83.2 on BrowseComp, 58.6 on SWE-Bench Pro, and 54.0 on Humanity's Last Exam with tools
  • Moonshot positions K2.6 against the latest closed models (GPT-5.4 xhigh, Claude Opus 4.6 at max effort, Gemini 3.1 Pro thinking high) with visual comparisons showing leads on multilingual coding and web browsing tasks
  • The Agent variant demonstrates capabilities like generating video hero sections with WebGL shaders, GLSL/WGSL animations, and integrating motion design libraries from single prompts
  • Release follows a K2.6 Code Preview beta from April 13 and builds on K2.5's hybrid reasoning approach launched earlier in 2026
  • The model is fully accessible with weights on Hugging Face, API endpoints at platform.moonshot.ai, and interactive interfaces on kimi.com in both chat and agent modes
  • Moonshot's differentiators focus on open weights availability and aggressive agent scaling rather than competing purely on closed-model benchmark metrics
  • The timing positions K2.6 as a response to the tightening competitive field at the frontier, where GPT-5, Claude Opus 4, and Gemini 3 have raised baseline expectations
Decoder
  • Agentic tasks: Workloads where AI systems operate autonomously to complete multi-step goals like research, code generation, or document creation without constant human guidance
  • SWE-bench: Software Engineering benchmark that tests AI models on real-world coding tasks like bug fixes and feature implementations
  • Agent Swarm: Multiple AI agents working in parallel or coordination to handle large-scale tasks that would overwhelm a single agent
  • Open weights: Model parameters are publicly released, allowing developers to download, modify, and run models on their own infrastructure
  • Long-context: Ability to process and reason over large amounts of text input, often tens of thousands of tokens
  • WebGL shaders: Graphics programming code (GLSL/WGSL) that runs on GPUs to create visual effects in web browsers
Original article

Moonshot AI has rolled out Kimi K2.6, positioning the release as open-source state-of-the-art for coding and agentic workloads. The model family arrived on kimi.com in both chat and agent modes, with weights published on Hugging Face and API access through platform.moonshot.ai. Four variants are available from the model selector: K2.6 Instant for quick responses, K2.6 Thinking for deeper reasoning, K2.6 Agent for research, slides, websites, docs and sheets, and K2.6 Agent Swarm aimed at large-scale search, long-form output and batch tasks.

Meet Kimi K2.6 agent - Video hero section, WebGL shaders, real backends. From one prompt.

  • Video hero sections - cinematic aesthetic, auto-composited
  • WebGL shader animations - native GLSL / WGSL, liquid metal, caustics, raymarching
  • Motion design - GSAP + Framer Motion…

Kimi.ai (@Kimi_Moonshot) April 20, 2026

On benchmarks, Moonshot claims open-source leadership on Humanity's Last Exam with tools at 54.0, SWE-Bench Pro at 58.6, SWE-bench Multilingual at 76.7, BrowseComp at 83.2, Toolathlon at 50.0, Charxiv with Python at 86.7 and Math Vision with Python at 93.2. The accompanying comparison chart pits K2.6 against GPT-5.4 xhigh, Claude Opus 4.6 at max effort and Gemini 3.1 Pro thinking high, with Kimi visually leading on SWE-bench Multilingual and BrowseComp.

Kimi K2.6 Benchmarks

The release lands roughly a week after a K2.6 Code Preview entered beta on April 13, and follows K2.5's hybrid reasoning debut earlier this year. With Claude Opus 4.6, GPT-5.4 and Gemini 3.1 Pro now the reference points at the frontier, Moonshot is staking open weights and aggressive agent scaling as its differentiators in a tightening competitive field.
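
For developers, API access runs through platform.moonshot.ai. As a hedged sketch, assuming the API follows the common OpenAI-compatible chat-completions shape (the model identifier "kimi-k2.6" and the field layout below are illustrative assumptions, not confirmed by the article):

```python
def build_chat_request(model, user_prompt, temperature=0.6):
    """Build an OpenAI-compatible chat-completions payload (sketch).

    The model name and field layout are assumptions for illustration;
    check platform.moonshot.ai for the real identifiers and endpoint.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "temperature": temperature,
    }

# POST this JSON to the chat-completions endpoint documented on
# platform.moonshot.ai, with your API key in the Authorization header.
payload = build_chat_request("kimi-k2.6", "Summarize this repo's build steps.")
```

Swapping the model string between the Instant, Thinking, and Agent variants would then be a one-line change, assuming Moonshot exposes them as separate model identifiers.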

Qwen3.6-Max-Preview: Smarter, Sharper, Still Evolving (2 minute read)

AI
Alibaba's Qwen team released a preview of their next flagship language model with significant improvements in agentic coding tasks, world knowledge, and instruction following.
What: Qwen3.6-Max-Preview is a preview release of Alibaba's next proprietary language model, available through Alibaba Cloud Model Studio, showing benchmark improvements of up to 9.9 points in agentic coding tasks and notable gains in world knowledge and instruction following compared to Qwen3.6-Plus.
Why it matters: The model's emphasis on agentic capabilities and the preserve_thinking feature reflects the industry shift toward models that can handle complex multi-step workflows and maintain reasoning context across interactions, rather than just single-turn conversations.
Takeaway: Developers can test the model immediately via Qwen Studio or integrate it using Alibaba Cloud Model Studio's OpenAI-compatible API with the qwen3.6-max-preview endpoint.
Deep dive
  • Achieves top scores on six major coding benchmarks including SWE-bench Pro, Terminal-Bench 2.0, SkillsBench, QwenClawBench, QwenWebBench, and SciCode
  • Shows double-digit improvements in agentic coding benchmarks: SkillsBench +9.9, SciCode +6.3, NL2Repo +5.0, and Terminal-Bench 2.0 +3.8 compared to predecessor
  • World knowledge improved significantly with SuperGPQA +2.3 and QwenChineseBench +5.3 gains
  • Instruction following enhanced with ToolcallFormatIFBench +2.8 improvement
  • Supports preserve_thinking feature that maintains reasoning content across conversation turns, specifically designed for agentic workflows
  • Available through OpenAI-compatible API endpoints with regional options in Beijing, Singapore, and US Virginia
  • Also offers Anthropic-compatible API interface for developers already using Claude's patterns
  • Still under active development with further improvements expected in subsequent versions
  • Provides enable_thinking parameter to expose the model's internal reasoning process during streaming responses
Decoder
  • Agentic coding: AI models performing multi-step programming tasks like repository navigation, environment interaction, and tool use rather than just generating code snippets
  • SWE-bench Pro: Benchmark evaluating AI models on real-world software engineering tasks from GitHub issues
  • preserve_thinking: Feature that retains the model's reasoning process across multiple conversation turns to maintain context for complex tasks
  • Terminal-Bench: Benchmark measuring a model's ability to interact with command-line interfaces and execute system commands
Original article

Qwen3.6-Max-Preview brings stronger world knowledge and instruction following, along with significant agentic coding improvements across a wide range of benchmarks. The model is still under active development as researchers continue to iterate on it. Users can chat with the model interactively in Qwen Studio or call it via the Alibaba Cloud Model Studio API (coming soon).
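
The digest above mentions an OpenAI-compatible API with `enable_thinking` and `preserve_thinking` options. A minimal sketch of what a request payload might look like, with the caveat that the exact placement of these flags in the request body is an assumption:

```python
def build_qwen_request(prompt, enable_thinking=True, preserve_thinking=True):
    """Sketch of an OpenAI-compatible payload for qwen3.6-max-preview.

    The flag names come from the article; exactly where they sit in the
    request body is an assumption for illustration.
    """
    return {
        "model": "qwen3.6-max-preview",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,  # thinking content is exposed during streaming
        "extra_body": {
            "enable_thinking": enable_thinking,      # expose internal reasoning
            "preserve_thinking": preserve_thinking,  # keep reasoning across turns
        },
    }
```

Keeping `preserve_thinking` on matters most for agentic workflows, where reasoning from one turn feeds the next.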

Jeff Bezos Nears $10 Billion Funding for AI Lab, FT Says (2 minute read)

AI
Jeff Bezos is raising $10 billion for an AI startup developing models that understand the physical world to accelerate engineering and manufacturing.
What: Project Prometheus, Jeff Bezos' AI startup, is close to finalizing a $10 billion funding round that would value the company at $38 billion, with JPMorgan and BlackRock participating as investors, according to the Financial Times. The company aims to develop AI models capable of understanding the physical world for applications in aerospace and automobile manufacturing.
Why it matters: This represents a major bet on physical world AI models rather than traditional language models, targeting practical applications in engineering and manufacturing, with the massive scale of funding indicating significant institutional confidence in applied AI for physical systems.
Original article

Jeff Bezos' AI startup, which is aiming to develop models with the capability of understanding the physical world, is close to finalizing a $10 billion funding round. The company, code-named Project Prometheus, will use AI to accelerate engineering and manufacturing in fields like aerospace and automobiles. It was set up with an initial $6.2 billion in funding, sourced in part by Bezos himself. The new funding round, which is expected to close soon but has not been finalized, will include JPMorgan and BlackRock as investors.

Improving Training Efficiency with Effective Training Time (19 minute read)

AI
Meta achieved over 90% training efficiency by systematically reducing overhead in large-scale AI model training through a new metric called Effective Training Time.
What: Effective Training Time (ETT%) is a metric Meta developed to measure what percentage of total training runtime is spent on actual learning versus overhead like initialization, checkpointing, failures, and recovery, with optimizations across initialization, compilation, checkpointing, and failure reduction achieving over 90% efficiency by end of 2025.
Why it matters: As AI models scale, non-training overhead (startup, checkpointing, failures) can consume significant GPU time; this approach complements traditional Model FLOPs Utilization optimization by addressing the "in-between" phases that waste expensive compute resources.
Takeaway: Teams can apply ETT%-style measurement to diagnose training bottlenecks in their own infrastructure and leverage Meta's open-source improvements in TorchRec and PyTorch 2 for faster compilation and initialization.
Deep dive
  • Meta introduced Effective Training Time (ETT%) to quantify what percentage of end-to-end wall time is spent on productive training versus overhead including initialization, checkpointing, failures, and recovery
  • The metric breaks down into Time to Start (job allocation to first batch), Time to Recover (restart after failure), and Number of Failures, with each further decomposed into scheduler, hardware setup, launcher init, PT2 compilation, and other stages
  • By end of 2025, Meta achieved greater than 90% ETT% for offline training through over 40 optimization techniques across the training pipeline
  • Trainer initialization optimizations removed unnecessary inter-rank communications and process group creations that added overhead during sharding
  • Pipeline optimizations parallelized independent initialization stages, notably overlapping PT2 compilation with data preprocessing to start compiling much earlier while the first batch is still loading
  • PyTorch 2 compilation time reduced by approximately 40% via MegaCache, which consolidates inductor, triton bundler, AOT Autograd, and autotune caches into a single downloadable archive
  • Dynamic shape recompilation overhead addressed through TORCH_COMPILE_DYNAMIC_SOURCES feature, providing user-friendly parameter marking without code changes
  • Async checkpointing and PyTorch native staging significantly reduced GPU blocking time by copying checkpoints to CPU memory and allowing training to resume while background processes complete uploads
  • Checkpoint interval optimization balances unsaved training time (lost work after failures) against checkpoint save blocking time based on actual failure rates
  • Standalone model publishing moved inference-ready model creation from GPU shutdown phase to separate CPU-based jobs, saving approximately 30 minutes per training run and freeing GPU resources
  • Observability dashboards monitor ETT components including Time to Start/Restart, unsaved training time, and checkpoint saving time to detect and mitigate regressions within SLA
  • Many improvements contributed to open-source PyTorch ecosystem through TorchRec and PyTorch 2, while Meta-specific components like checkpointing and publishing address common industry bottlenecks adaptable elsewhere
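
The checkpoint-interval optimization mentioned above (balancing lost work against checkpoint blocking time) has a classic first-order answer in the Young/Daly approximation. The article gives no formula; this sketch uses the standard result for illustration only:

```python
import math

def optimal_checkpoint_interval(checkpoint_cost_s, mtbf_s):
    """Young/Daly first-order optimum: sqrt(2 * C * MTBF), where C is the
    time a checkpoint save blocks training and MTBF is the mean time
    between failures."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Checkpoints that block training for 60s, failures roughly once a day:
minutes = optimal_checkpoint_interval(60, 24 * 3600) / 60
print(round(minutes))  # 54
```

Higher failure rates or cheaper (e.g., async) checkpoints shift the optimum, which is why Meta tunes the interval against measured failure rates.
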
Decoder
  • ETT% (Effective Training Time): percentage of total end-to-end wall time spent consuming new training data, excluding overhead from initialization, failures, and checkpointing
  • Time to Start: duration from hardware allocation to training the first batch of data
  • Time to Recover: time required to restart and resume productive training after a failure or interruption
  • PT2 (PyTorch 2.0): PyTorch's compilation framework that optimizes models before training begins
  • MegaCache: consolidated archive of multiple PyTorch 2 compilation caches (inductor, triton bundler, AOT Autograd) that reduced compile time by approximately 40%
  • MFU (Model FLOPs Utilization): traditional metric measuring computational efficiency during steady-state training
  • Async Checkpointing: technique that copies checkpoint to CPU memory so training can resume while upload completes in background
  • Triton kernels: GPU code optimized through autotune hyperparameter search in PyTorch 2.0
  • AOT Autograd: ahead-of-time automatic differentiation for efficient gradient computation
  • TorchRec: PyTorch library for recommendation system models with improved sharding capabilities
Original article

Motivation and Introduction

Across the industry, teams training and serving large AI models face aggressive ROI targets under tight compute capacity. As workloads scale, improving infrastructure effectiveness gets harder because end-to-end runtime increasingly includes overheads beyond "real training" (initialization, orchestration, checkpointing, retries, failures, and recovery).

Meta utilizes Effective Training Time (ETT%) to quantify efficiency, defining it as the percentage of total end-to-end (E2E) wall time dedicated to productive training. This metric directly points to areas where time is wasted, thus facilitating the prioritization of efficiency improvements.

In this work stream, while grounded in Meta's production experience using PyTorch for model training, we aim to share broadly useful lessons: some improvements have been implemented in open source—e.g., TorchRec sharding plan improvements and PyTorch 2 (PT2) compilation optimizations that reduce compile time and recompilation—while others (like checkpointing and model publishing) are more Meta-specific, but address common industry bottlenecks and can be adapted elsewhere.

Effective Training Time Definition

Effective Training Time (ETT%) is defined as the percentage of E2E wall time spent consuming new data. Since E2E wall time depends on many factors, such as model architecture, complexity, and training data volume, it is hard to measure ETT% directly. Instead, we measure idleness and failures, which can be represented roughly as:

ETT% ≈ 1 - (Time to Start + Number of Failures × Time to Recover + other idle time) / E2E wall time

A visual view of the formula is shown below with three L1 sub-metrics:

  • Time to Start: the period from when a job is allocated hardware to when it begins training the first batch of data.
  • Time to Recover: the duration required for a training job to restart and resume productive training after a failure or interruption.
  • Number of Failures: refers to the total count of infra-related interruptions or unsuccessful attempts that occur during the lifecycle of a training job.

Time to Start and Time to Recover measure the idleness of each single attempt from the system-optimization perspective, while Number of Failures measures the different kinds of failures from the reliability side.
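
Given those three sub-metrics, a job's ETT% can be estimated from its event log. A minimal sketch (the function signature is an assumption; the arithmetic follows the definitions above):

```python
def ett_percent(e2e_wall_time, time_to_start, recover_times, other_idle=0.0):
    """Estimate Effective Training Time as a percentage of E2E wall time.

    recover_times holds one Time-to-Recover entry per failure, so its
    length is the Number of Failures.
    """
    idle = time_to_start + sum(recover_times) + other_idle
    return 100.0 * (e2e_wall_time - idle) / e2e_wall_time

# A 100-hour job with a 1-hour start and two failures costing 2 hours each:
print(ett_percent(100.0, 1.0, [2.0, 2.0]))  # 95.0
```

Breaking `other_idle` down further (scheduling, hardware setup, PT2 compilation, shutdown) gives the L2 view described next.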

Figure 1. Training Cycle Overview

The definitions for those L2 areas are:

  • Scheduling Time: time spent in infra to get a training job scheduled when resources are available.
  • Hardware Setup Time: time spent to bring up launcher/trainer binaries in the hardware.
  • Launcher Init Time: time to start the launcher to enter into the PT2 compilation stage.
  • PT2 Compilation Time: time to apply PT2 compilation to optimize train model before starting to consume training data.
  • Effective Training Time: time spent training on new training data.
  • Wasted Training Time: time inside the train loop that does not consume new training data, such as repeated training on samples or blocked training time.
  • Shutdown Time: time to stop a training job.

The Journey to Improve ETT% in Meta

Starting in H2 2024, we have been proactively analyzing fleetwide Effective Training Time (ETT). This effort aims to establish the current ETT% status, identify key focus areas, and implement improvements.

Over the past years, we have developed more than 40 new techniques to improve overall ETT%. The following diagram gives a brief view of the Time to Start improvement in each main area:

Figure 2. Time to Start Improvements by Technique

With the team's concentrated efforts, we reached a major milestone by the end of 2025, increasing Effective Training Time (ETT%) to more than 90% for offline training.

Technique Deep-Dives

The team conducted a detailed analysis of each area contributing to the Effective Training Time (ETT%) and focused optimizations primarily on the following initiatives:

  • Time to Start and Recover: Optimized trainer initialization and PT2 compilation to lower training costs related to Time to Start and Time to Recover metrics.
  • Checkpoint Management: Improved checkpoint processes to minimize idleness during training and reduce unsaved training time.
  • Shutdown Time Optimizations: Switched from GPUs to CPU machines for publishing models for inference, saving GPU hours during job shutdown.
  • Failure Reduction and Observability: Collaborated with partner teams to reduce scheduling time and improve the preempted-job ratio, established component-level observability, and refined the categorization of trainer errors to reduce the frequency of failures.

Trainer Initialization Optimizations

Figure 3. Trainer Initialization Overview

Trainer initialization comprises multiple sub-stages: device_init, process_group_init, preproc_creation, train_module_creation, init_plugins, pre_train, and get_first_batch_data.

Beginning in 2024, we have focused on various initiatives to minimize trainer initialization time. The main methodologies we applied are:

  1. Communication optimizations: remove unnecessary process-group creations and cross-rank communications to reduce overhead.
  2. Pipeline optimizations: run independent sub-stages in parallel so they overlap and make full use of the time.

Communication Optimizations

Before this work stream, job initialization involved numerous unnecessary process-group creations and suboptimal communication across ranks, which collectively increased trainer initialization time.

For instance, instead of relying on numerous all_gather calls to build shard metadata piece by piece—a method that caused substantial overhead during sharding—the team implemented an optimization: each rank now builds its section of the global metadata from information that is already locally available after the sharding-plan broadcast. This change significantly improved sharding time.

Figure 4. Communication Optimizations Overview

Pipeline Optimizations

Many sub-stages in trainer initialization have no dependencies on each other, which leaves room to run them in separate processes so they overlap.

For example, PT2 compilation and DPP warm-up (the data process used to fetch training data) to get the first batch are both costly, time-consuming steps that occur before actual training begins. Previously, PT2 compilation was delayed because it could only start once the first batch of real data was available.

To make this process more efficient, we introduced a fast-batch mechanism that quickly produces a batch of data, allowing PT2 to start compiling much earlier while DPP is still fetching the first real batch.

Figure 5. PT2 compilation and DPP warm-up Parallel

This new technology is most beneficial for larger models, such as Foundation Models, because their data loading process is significantly more time-consuming than for other model types.
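
The overlap can be sketched with standard Python concurrency. Stage names and durations below are purely illustrative stand-ins, not Meta's implementation:

```python
from concurrent.futures import ThreadPoolExecutor
import time

def pt2_compile_with_fast_batch():
    # Starts immediately on a quickly produced "fast batch" instead of
    # waiting for the first real batch to arrive.
    time.sleep(0.2)  # stand-in for compilation work
    return "compiled"

def dpp_fetch_first_batch():
    time.sleep(0.3)  # stand-in for DPP warm-up / first-batch fetch
    return "first_batch"

def overlapped_init():
    """Run compilation and data fetching concurrently instead of serially."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        compile_f = pool.submit(pt2_compile_with_fast_batch)
        batch_f = pool.submit(dpp_fetch_first_batch)
        return compile_f.result(), batch_f.result()

# Wall time approaches max(compile, fetch) rather than their sum.
artifacts, first_batch = overlapped_init()
```

The saving grows with whichever stage dominates, which is why the technique pays off most on foundation models with slow data loading.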

PT 2.0 Compilation Optimizations

PyTorch 2.0 (PT2) compilation time is another big area the team invested in. We are pursuing three main approaches to reduce long PT2 compilation times:

  1. Reduce unnecessary recompilations
  2. Improve overall PT2 cache hit and coverage
  3. Prune large numbers of user-defined autotune kernel configs

The team has previously written about reducing PT2 compilation time for Meta-internal workloads; here we recap the main recent approaches. For more details, please refer to that blog post.

Reduce unnecessary recompilations

Recompilation due to dynamic shapes is a significant source of overhead in our Meta workloads. This recompilation contributes substantially to the overall compilation time across the fleet, resulting in considerable cumulative cost.

To address this, the v-team collaborated with the PyTorch team in H1 '25 to develop TORCH_COMPILE_DYNAMIC_SOURCES, which improved the handling of dynamic shapes by providing an easy, user-friendly way to mark parameters as dynamic without modifying the underlying code. This feature also supports marking integers as dynamic and allows the use of regular expressions to include a broader range of parameters, enhancing flexibility and reducing compilation time.
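
As a hedged illustration of how the environment variable might be used (the exact source-string syntax varies by PyTorch version, so treat the values below as placeholders and consult the docs for your release):

```shell
# Hypothetical usage: mark specific inputs as dynamic without code changes.
# The source strings and the training script name are placeholders.
export TORCH_COMPILE_DYNAMIC_SOURCES="L['input_ids'],L['seq_len']"

# Regular expressions can widen the match to a whole family of inputs.
export TORCH_COMPILE_DYNAMIC_SOURCES="L\['batch_.*'\]"

python train.py   # train.py stands in for your entry point
```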

Figure 6. Internal Tool to Identify Dynamic Shape

Improve PT2 Cache

MegaCache brings together several types of PT2 compilation caches—including components like inductor (the core PT2 compiler), triton bundler (for GPU code), AOT Autograd (for efficient gradient computation), Dynamo PGO (profile-guided optimizations), and autotune settings—into a single archive that can be easily downloaded and shared.

By consolidating these elements, MegaCache offers these improvements:

  • Minimizes repeated requests to remote servers
  • Cuts down on time spent setting up models
  • Makes startup and retried jobs more dependable, even in distributed or cloud environments

By the end of 2025, teams had worked together to enable MegaCache across all training platforms, reducing average PT2 compile time by approximately 40%.

Autotune config pruning

Autotune in PyTorch 2.0 is a feature that automatically optimizes the performance of PyTorch models by tuning various hyperparameters and settings. With the increasing adoption of Triton kernels, the time required to compile and search for the best settings and hyperparameters for Triton kernels has increased.

To address this, we developed a process to identify the most time-consuming kernels and determine optimal runtime configurations for implementation in the codebase. This approach has led to a substantial reduction in compilation time.
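
The pruning idea can be sketched as follows (the config fields and timings are invented): benchmark the candidate configs once offline, then keep only the fastest so that later compiles skip the search entirely:

```python
# Toy sketch of autotune config pruning: rank candidate kernel configs by
# measured runtime and ship only the winner into the codebase.

CANDIDATE_CONFIGS = [
    {"BLOCK": 64,  "num_warps": 4},
    {"BLOCK": 128, "num_warps": 4},
    {"BLOCK": 128, "num_warps": 8},
]

# Stand-in for measured kernel runtimes (microseconds) from an offline sweep.
MEASURED_US = {0: 41.0, 1: 28.5, 2: 33.2}

def prune_configs(configs, measured_us, keep=1):
    """Keep only the `keep` fastest configs for hardcoding in the codebase."""
    ranked = sorted(range(len(configs)), key=lambda i: measured_us[i])
    return [configs[i] for i in ranked[:keep]]

best = prune_configs(CANDIDATE_CONFIGS, MEASURED_US)
print(best)  # [{'BLOCK': 128, 'num_warps': 4}]
```

The one-time sweep cost is paid offline, while every subsequent compile avoids re-searching the config space.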

Checkpoint Management

Checkpoint: a checkpoint is a saved snapshot of a model's state during training, including its parameters, optimizer settings, and progress.

At Meta, checkpoints are used to ensure that if a training job is interrupted—due to hardware or software issues—the process can resume from the last saved point rather than starting over.

Checkpoint saving, while necessary, currently blocks GPU training by demanding memory resources, leading to GPU idle time. Furthermore, the time interval between checkpoint saves directly impacts the amount of training progress that is lost (unsaved training time) if a failure occurs.

To address these inefficiencies, the team successfully developed and implemented Async Checkpointing and PyTorch Native Staging. These advancements have significantly improved checkpointing performance by reducing the checkpoint blocking time for all models.

Async checkpointing: it involves creating a copy of the checkpoint in CPU memory, allowing the main trainer process to resume the training loop while a background process completes the checkpoint upload.

PyTorch native staging: the initial async checkpoint implementation used custom C++ staging, designed to minimize trainer memory usage during staging via streaming copy. The checkpointing team has since developed a separate async checkpointing solution using PyTorch native staging APIs, which improves save blocking time at the cost of increased trainer memory consumption.
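
A minimal sketch of the async pattern, with a deep copy standing in for CPU staging and a thread for the background upload (all names here are hypothetical, not the production API):

```python
# Sketch of async checkpointing: stage a snapshot synchronously (the only
# blocking step), then upload in the background while training resumes.
import copy
import threading

saved = []

def upload(snapshot, step):
    # Stand-in for the slow persistent write to remote storage.
    saved.append((step, snapshot))

def async_save(model_state, step):
    staged = copy.deepcopy(model_state)        # blocking: stage to CPU memory
    t = threading.Thread(target=upload, args=(staged, step))
    t.start()                                  # non-blocking: background upload
    return t

model_state = {"weights": [0.1, 0.2], "optimizer": {"lr": 1e-3}}
handle = async_save(model_state, step=100)
model_state["weights"][0] = 0.5                # training resumes immediately
handle.join()
print(saved[0][1]["weights"])  # [0.1, 0.2] — snapshot unaffected by later updates
```

The key invariant is that the staged copy is taken before training resumes, so later parameter updates cannot corrupt the checkpoint being written.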

These improvements significantly reduced the total daily GPU hours blocked on checkpointing.

Reducing Wasted Training Time

Optimizing the time required to save checkpoints directly boosts the Effective Training Time (ETT) percentage by reducing interruptions to the training loop. Furthermore, these checkpoint save improvements can unlock greater ETT% gains when paired with adjustments to the checkpoint interval.

Adjusting the checkpoint interval impacts two components of wasted training time:

Unsaved Training Time: this is the training progress lost after a job failure, as any work completed since the last checkpoint is discarded.

  • Calculation: (# train loop failures) * (checkpoint interval)/2

Checkpoint Save Blocking Time: this is the time the training loop is paused specifically while a new checkpoint is being created.

  • Calculation: ((time spent in train loop) / (checkpoint interval)) * (blocking time per checkpoint)

Given the job failure rate, the checkpoint interval can be tuned to minimize the expected wasted training time, equal to:

sum(unsaved training time, checkpoint save blocking time)
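
Using the article's two formulas and its hypothetical scenario (15-second blocking time, 3 failures per day), a short sketch can find the interval that minimizes expected wasted time; the closed-form optimum below follows from setting the derivative of the sum to zero:

```python
# Tune the checkpoint interval to minimize expected wasted training time,
# combining the two formulas above (train_s assumes a full day of training).
import math

def wasted_seconds(interval_s, failures_per_day, block_s, train_s=86_400):
    unsaved = failures_per_day * interval_s / 2            # lost on failure
    blocking = (train_s / interval_s) * block_s            # paused for saves
    return unsaved + blocking

def optimal_interval(failures_per_day, block_s, train_s=86_400):
    # d(wasted)/d(interval) = F/2 - train_s*block_s/interval^2 = 0
    return math.sqrt(2 * train_s * block_s / failures_per_day)

best = optimal_interval(failures_per_day=3, block_s=15)
wtt_pct = 100 * wasted_seconds(best, 3, 15) / 86_400
print(round(best), round(wtt_pct, 2))  # ~930 s interval, ~3.23% wasted time
```

Note the trade-off the optimum balances: shorter intervals shrink unsaved progress but pay the blocking cost more often, and the minimum sits where the two terms are equal.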

The following graph illustrates the relationship between checkpoint save intervals and the percentage of wasted training time (WTT%), using a hypothetical scenario with a 15-second checkpoint save blocking time and 3 daily failures.

Figure 7. Checkpoint Save Interval vs Wasted Training Time

By optimizing the checkpoint saving interval, the team successfully reduced the unsaved training time for both production and exploration jobs.

Shutdown Time Optimizations

The team dug into each component of the shutdown phase and found that model publish processing (publishing the model for inference) dominated the post-train duration.

Model Publish Processing: Model publishing is the process of optimizing a trained model to create an inference-ready snapshot for serving.

The team's analysis led to the adoption of a standalone publishing strategy, which decouples publishing from the training process. With this approach, publishing is initiated only after the training job has finished and created an anchor checkpoint. This checkpoint is then used by a model processing job, leveraging the stored data, to generate the final inference-ready snapshot.

The key differences between this standalone publishing method and the traditional "training end" model publishing are visually represented in the diagram below.

Figure 8. "Training End" Model Publish vs Standalone Publish

The implementation of the new model publishing pipeline has successfully shortened the shutdown time for each job by approximately 30 minutes.

Failure Reduction and Observability

A major focus area for the team has been failure reduction, as the number of failures significantly impacts the overall Effective Training Time (ETT) percentage. Regressions from code or configuration changes can directly cause this percentage to drop.

Fluctuations in the ETT dashboard are primarily attributed to two factors:

  1. Increased Job Preemptions: A higher volume of running jobs leads to more preemptions.
  2. Service Regressions: Issues with services cause a greater number of job failures.

To tackle preemptions, we are collaborating with infrastructure teams to develop a new scheduling algorithm aimed at lowering the preemption ratio without negatively affecting users' quotas or experience.

Regarding failure reduction, a dedicated team is scrutinizing each ETT-related component and building dashboards to monitor overall ETT performance, including Time to Start/Time to Restart (TTS/TTR), unsaved training time, and checkpoint saving time. This proactive monitoring ensures that any regression is detected and mitigated early within the SLA.

In the End

As model training scales, resource constraints are becoming a defining challenge across the industry. For years, a major lever for improving training efficiency has been increasing Model FLOPs Utilization (MFU) through techniques like model co-design and kernel optimization. That work remains essential, but large-scale training has surfaced a complementary bottleneck: significant GPU time is spent idle outside the steady-state training loop.

Our analysis shows that non-training overhead can be substantial, especially on some of the largest runs.

To address this, we launched a successful workstream focused on improving Effective Training Time (ETT%), which has already produced meaningful capacity savings. The key takeaway for practitioners is simple: to improve cost and throughput at scale, you must optimize the "in-between" phases—not just the training steps.

Since our training stack utilizes PyTorch, we made an effort to ensure these enhancements are applicable beyond a single environment. We have open-sourced and shared relevant building blocks, such as those in TorchRec and PyTorch 2, within the open-source PyTorch ecosystem. This allows others to leverage these improvements, replicate our results, and build upon our work. Other components, like model publishing and checkpointing, are more specific to Meta but tackle common industry challenges and can be adapted for use elsewhere.

We hope these lessons help teams diagnose similar bottlenecks, apply ETT%-style measurement, and contribute further improvements back to the ecosystem.

Acknowledgements

We extend our gratitude to Max Leung, Apoorv Purwar, Musharaf Sultan, John Bocharov, Barak Pat, Jonathan Tang, Vivek Trehan, Chris Gottbrath and Vitor Brumatti Pereira for their valuable reviews and insightful support. We also thank the entire Meta team responsible for the development and productionization of this workstream.

Modular Post-Training (14 minute read)

AI
AllenAI's BAR training method lets you add or upgrade specific capabilities in language models without expensive full retraining or losing existing skills.
What: BAR (Branch-Adapt-Route) is a modular post-training recipe that trains independent domain experts (math, code, tool use, safety) separately through their own complete pipelines, then composes them into a unified model using a mixture-of-experts architecture where a router selects which experts to activate for each input.
Why it matters: Traditional post-training forces a choice between expensive full retraining when adding capabilities or suffering catastrophic forgetting when training further on new data. BAR enables linear cost scaling for updates—you only retrain the affected expert and lightweight router—versus the quadratic cost of monolithic retraining where each domain update requires reprocessing all domains.
Takeaway: The team released the training recipe, technical report, and model checkpoints for teams looking to implement modular post-training on their own models.
Deep dive
  • BAR addresses a fundamental problem in language model development: updating models after post-training typically requires either expensive full retraining or causes catastrophic forgetting of existing capabilities
  • The approach evolved from FlexOlmo, which worked for pretraining by freezing shared layers and only training domain-specific FFN experts, but this recipe failed for post-training because behavioral shifts require updating attention layers, embeddings, and language modeling heads
  • Stage 1 uses progressive unfreezing: mid-training freezes all shared layers (since knowledge lives in FFNs), SFT unfreezes embeddings and LM head (critical for new tokens), and RLVR unfreezes all parameters including attention to handle distributional shifts
  • Each expert is structured as a two-expert MoE with one frozen "anchor" expert preserving base model FFN weights and one trainable expert, and trains on a mix of domain-specific plus general SFT data to prevent degradation of general capabilities
  • Stage 2 merges experts by simply averaging shared parameters that diverged across expert runs, which surprisingly introduces little to no measurable performance loss despite independent modifications during training
  • Stage 3 trains the router on just 5% of stratified SFT data with all experts and shared weights frozen, making this final stage fast and cheap
  • On 19 benchmarks across 7 categories, BAR outperformed all baselines except full retraining from mid-training, beating post-training-only retraining 49.1 vs 47.8 overall with large gains in math (+7.8) and code (+4.7)
  • Modular training's key structural advantage: late-stage RL on one domain can't degrade safety capabilities learned during earlier SFT stages in other domains because each pipeline is isolated
  • Dense model merging after mid-training catastrophically fails (6.5 overall score) because mid-training causes enough divergence that naive weight averaging produces a nearly non-functional model
  • Demonstrated modular upgrades work in practice: replacing a code expert with one trained on better data improved code by +16.5 points while other domains stayed unchanged, and adding RL to an existing math expert improved math by +13 points with minimal impact elsewhere
  • The approach enables linear cost scaling versus monolithic retraining's quadratic scaling, critical for teams where different groups work on different capabilities on different timelines
  • Training domain experts on only domain-specific data without general SFT data severely degrades general capabilities like instruction following despite strong in-domain performance
  • Activating 4 of 5 experts at inference achieves nearly identical performance to using all 5, suggesting opportunities for more efficient routing strategies
Decoder
  • MoE (Mixture-of-Experts): An architecture where multiple specialized neural network modules (experts) process inputs, with a router deciding which experts to activate for each input
  • FFN (Feed-Forward Network): The layers in transformers that primarily store factual knowledge, as opposed to attention layers that handle relationships between tokens
  • Post-training: Training stages after initial pretraining that teach models to follow instructions, reason, use tools, and behave safely
  • SFT (Supervised Fine-Tuning): Training stage using labeled examples to teach specific behaviors like instruction following or function calling
  • RLVR (Reinforcement Learning with Verified Rewards): RL training using verifiable correctness signals (like code execution or math verification) rather than human preference
  • Mid-training: Intermediate training stage between pretraining and SFT, typically for domain knowledge acquisition
  • FlexOlmo: AllenAI's earlier work on modular MoE-based pretraining that inspired BAR
  • Catastrophic forgetting: When training on new tasks causes a model to lose performance on previously learned tasks
  • BFCL (Berkeley Function Calling Leaderboard): Benchmark for evaluating how well models can call functions and use tools
  • Dense model: Traditional neural network where all parameters are active for every input, versus sparse models like MoE where only subsets activate
Original article

Train separately, merge together: Modular post-training with mixture-of-experts

After pretraining, language models go through a series of mid- and post-training stages to become practically useful—learning to follow instructions, reason through problems, reliably call tools, and so on. But updating or extending a model following these stages is often challenging. The most reliable option, retraining from scratch with new capabilities included from the start, is expensive and requires full access to the original training setup. Training further on new data is cheaper, but it can cause the model to lose capabilities it already had. And because post-training typically involves multiple stages – each with its own data and objectives – adding new skills means rerunning or adjusting each stage to accommodate them without breaking what came before.

We present BAR (Branch-Adapt-Route), a recipe for modular post-training that sidesteps these issues. Rather than training a single model on all data at once, BAR trains independent domain experts – each through its own complete training pipeline – and composes them into a unified model via a mixture-of-experts (MoE) architecture. Each expert can be developed, upgraded, or replaced without touching the others.

We're releasing the recipe, a technical report, and the checkpoints used to validate the approach.

Background and motivation

Our earlier work on FlexOlmo showed that modular MoE-based training works well for pretraining: you can branch from a shared base, train domain-specific feed-forward network (FFN) experts while freezing all shared layers, and merge them back. But we found that this recipe doesn't transfer to post-training. The reason is intuitive in hindsight—pretraining primarily updates knowledge representations, which live largely in FFN layers. Post-training, on the other hand, introduces behavioral shifts such as new output formats, reasoning patterns, and safety constraints that require changes to shared parameters like attention layers, embeddings, and the language modeling head.

For example, when we tried the FlexOlmo approach directly during reinforcement learning with verified rewards (RLVR), the reward curve was completely flat; the model simply could not learn with all shared parameters frozen. This motivated us to develop a new recipe specifically for post-training.

How BAR works

BAR architecture flowchart

BAR has three stages:

Stage 1: Independent expert training. Each domain expert is instantiated as a two-expert MoE: one frozen "anchor" expert that preserves the base model's FFN weights, and one trainable expert. Experts go through whichever training stages their domain requires. In our experiments, math and code go through mid-training, supervised fine-tuning (SFT), and RLVR; tool use and safety use SFT only.

The key technical contribution is a progressive unfreezing schedule for shared parameters across stages:

  • Mid-training: All shared layers frozen (same as pretraining, since knowledge acquisition is well-captured by FFN updates alone).
  • SFT: Embedding layer and language modeling head unfrozen. This is necessary for domains that introduce new special tokens (e.g., function-calling formats for tool use). Without this, on the Berkeley Function Calling Leaderboard (BFCL) – the tool use benchmark we used for tool-calling performance evaluation – our tool use expert scored 20.3. With unfreezing, it reached 46.4.
  • RLVR: All shared parameters unfrozen, including attention. RL induces distributional shifts that extend beyond what expert FFNs can accommodate.
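
The schedule above can be sketched with shared parameters reduced to three named groups (a simplification; real experts operate on torch parameter tensors):

```python
# Sketch of BAR's progressive unfreezing: which shared parameter groups
# are trainable at each post-training stage.

SHARED_GROUPS = ("embeddings", "attention", "lm_head")

def trainable_shared_groups(stage):
    if stage == "mid_training":
        return set()                              # all shared layers frozen
    if stage == "sft":
        return {"embeddings", "lm_head"}          # needed for new special tokens
    if stage == "rlvr":
        return set(SHARED_GROUPS)                 # RL shifts need attention too
    raise ValueError(f"unknown stage: {stage}")

assert trainable_shared_groups("mid_training") == set()
assert trainable_shared_groups("sft") == {"embeddings", "lm_head"}
assert "attention" in trainable_shared_groups("rlvr")
```

In a real pipeline, the returned groups would map to setting `requires_grad` on the corresponding parameters before each stage begins.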

Each expert also trains on a mixture of domain-specific and general SFT data. We found this is critical: domain-only SFT produces strong in-domain performance but severely degrades general capabilities like instruction following and knowledge.

Stage 2: Expert merging. After training, we merge all experts into a single MoE model. Shared parameters that diverged across expert runs (because they were unfrozen during SFT or RLVR) are simply averaged. We find this averaging introduces little to no measurable performance loss on domain-specific evaluations compared to any individual expert.
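
A toy sketch of the merge step, with plain Python lists standing in for weight tensors (the parameter names are hypothetical):

```python
# Sketch of Stage 2 merging: shared parameters that diverged across expert
# runs are element-wise averaged.

def average_shared(expert_params):
    """expert_params: list of {param_name: [floats]} dicts, one per expert."""
    merged = {}
    for name in expert_params[0]:
        vecs = [p[name] for p in expert_params]
        merged[name] = [sum(vals) / len(vals) for vals in zip(*vecs)]
    return merged

math_expert = {"attn.w": [1.0, 2.0], "embed.w": [0.0, 0.0]}
code_expert = {"attn.w": [3.0, 4.0], "embed.w": [2.0, 2.0]}
merged = average_shared([math_expert, code_expert])
print(merged["attn.w"])  # [2.0, 3.0]
```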

Stage 3: Router training. Finally, we train the router inside the MoE with all other experts and shared weights frozen. We found that a stratified 5% sample of the SFT data is sufficient for effective routing, making this stage fast and cheap.
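
The stratified sampling step can be sketched as follows (the domain names and dataset sizes are made up for illustration):

```python
# Sketch of Stage 3 data selection: a stratified 5% sample of the SFT data,
# preserving each domain's share (seed fixed for reproducibility).
import random

def stratified_sample(examples_by_domain, fraction=0.05, seed=0):
    rng = random.Random(seed)
    sample = []
    for domain, examples in examples_by_domain.items():
        k = max(1, int(len(examples) * fraction))
        sample.extend((domain, ex) for ex in rng.sample(examples, k))
    return sample

data = {
    "math":   [f"math_{i}" for i in range(1000)],
    "code":   [f"code_{i}" for i in range(600)],
    "safety": [f"safety_{i}" for i in range(400)],
}
router_data = stratified_sample(data)
print(len(router_data))  # 100 examples: 50 math + 30 code + 20 safety
```

Stratifying keeps the router's training distribution proportional to the full SFT mix, so no domain is starved of routing signal despite the small sample.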

Strong performance across evals

Our models are all at least at the 7B scale, training experts for math, code, tool use, and safety on top of a fully post-trained Olmo 2 base model. (We use Olmo 2 because our FlexOlmo architecture was built around it, and because it provides a useful testbed for exploring how newer datasets and post-training improvements can strengthen a model beyond its original release configuration.) We compare against six baselines across 19 benchmarks, spanning 7 evaluation categories. All scores reported below are category-level averages (out of 100, the higher the better). For per-benchmark breakdowns, please refer to our technical report.

Benchmark results

A few things stand out:

On average, BAR outperforms all baselines that don't require rerunning mid-training from scratch. BAR beats retraining with post-training only overall (49.1 vs. 47.8), with particularly large gains in math (+7.8) and code (+4.7). We attribute this to a structural advantage of modular training: in a monolithic pipeline, late-stage RL on math and code can degrade safety capabilities learned during earlier SFT stages. Modular training avoids this entirely because each domain's pipeline is isolated.

Dense model merging after mid-training fails catastrophically. Mid-training causes models to diverge enough that naive weight averaging produces a nearly non-functional model—one that scores 6.5 overall on our benchmarks. Even without mid-training, merging trails BAR by a wide margin (36.9 vs 49.1 overall).

BTX, a technique that trains each expert as a fully independent dense model, underperforms BAR (46.7 vs. 49.1 overall) despite using the same per-domain data and training stages. Training without shared parameters leads to greater divergence, making composition via routing more difficult.

Full retraining with mid-training remains the performance ceiling (50.5), but it requires full access to the original pretraining checkpoint and reprocessing everything from scratch, which is impractical for most open-weight models and expensive even with full access.

Modular upgrades

One of the most tangibly useful properties of BAR is that experts can be upgraded independently. We demonstrate two types of upgrades:

  • Upgrading to newer data: Replacing a code expert with one trained on higher-quality data and RL improves code performance by +16.5 points in the combined model, while all other domains remain essentially unchanged.
  • Adding a training stage: Taking an existing math expert and adding RL on top of its SFT improves math by +13 points in the combined model, again with minimal impact on other domains.

In both cases, only the affected expert and the lightweight router need retraining. In a monolithic pipeline, either of these upgrades would require retraining the full model across all domains. This gives BAR linear cost scaling for domain updates, compared to the effectively quadratic cost of monolithic retraining (each domain update requires reprocessing all domains).

Performance scaling with incremental expert additions

What we learned

A few practical takeaways:

  • Post-training needs more flexibility than pretraining. The FlexOlmo recipe of freezing all shared layers works for pretraining but breaks during post-training. Progressive unfreezing is essential, especially unfreezing attention during RL and embeddings/LM head for domains with new tokens.
  • Domain-only SFT isn't enough. Training an expert on only its own domain data improves in-domain performance but destroys general capabilities. Mixing with general SFT data is critical.
  • Weight averaging after unfreezing works surprisingly well. Despite each expert independently modifying shared parameters during SFT and RLVR, simply averaging the diverged parameters introduces little to no measurable degradation.
  • Not every expert needs to be active. Activating 4 of 5 experts at inference time achieves nearly identical performance to using all 5, suggesting room for more efficient routing strategies.

Looking ahead

In practice, large-scale model development is already modular: different teams work on different capabilities, new datasets appear on different timelines, and the cost of rerunning an entire pipeline for a single domain improvement is hard to justify. BAR offers a recipe that aligns the training process with this reality.

Full retraining still sets the performance ceiling. But for teams iterating on individual capabilities, BAR provides a way to upgrade parts of a model independently, compose independently trained experts without degradation, and avoid the catastrophic forgetting that comes from running all domains through a single training sequence. One natural next step is starting from a natively sparse architecture rather than upcycling a dense model, which could improve both the efficiency and scalability of the modular approach.

Even 'uncensored' models can't say what they want (6 minute read)

AI
Research shows that even "uncensored" language models quietly reduce the probability of charged words without refusing, revealing a subtle censorship mechanism that survives popular ablation techniques.
What: Morgin.ai researchers measured how seven major language models from Google, OpenAI, Alibaba, EleutherAI, and Allen AI assign probability to 1,117 politically or socially charged words across 4,442 contexts, finding that commercial pretrains consistently deflate certain words even when they would be the most fluent sentence completion.
Why it matters: This reveals a form of output shaping that happens invisibly—unlike explicit refusals that users can see and debate, probability shifts quietly steer billions of users toward certain phrasings without any indication that alternatives were suppressed, and the mechanism survives popular uncensoring techniques like refusal ablation.
Takeaway: Developers building on pretrained models should be aware that word-level biases are baked into the base model's probability distribution and cannot be removed through post-training interventions like ablation.
Deep dive
  • Researchers attempted to fine-tune an uncensored model to replicate a public figure's speech patterns but found the base model would not assign appropriate probability to charged words the person actually used, leading to the investigation
  • They define "the flinch" as the gap between the probability a word deserves on pure fluency grounds versus what the model actually assigns—for example, Pythia ranks "deportation" first at 23% for "The family faces immediate _____ without legal recourse" while Qwen ranks it 506th at 0.0014%, a roughly 16,000× difference
  • The benchmark tests 1,117 charged words across six categories (Anti-China, Anti-America, Anti-Europe, Slurs, Sexual, Violence) in roughly 4,442 contexts, scoring each model 0-100 per axis where bigger scores mean more probability suppression
  • EleutherAI's Pythia-12B trained on the unfiltered Pile dataset shows the least flinch (total score 176), establishing the open-data floor, while Allen AI's OLMo-2 on curated Dolma scores 214, showing modest modern filtering
  • Google's Gemma-2-9B shows the most aggressive filtering (score 346.5) with extreme suppression of slurs (93/100), while the newer Gemma-4-31B drops to 222.2 total with slur flinch falling to 52.9, suggesting changing filtering strategies
  • OpenAI's gpt-oss-20b shows notably high political-corner flinch compared to other models, including scoring higher than Alibaba's Qwen on Anti-China terms
  • Comparing Qwen's base pretrain (score 243.8) to its abliterated "heretic" version (score 258.1) reveals that refusal ablation—the most popular uncensoring technique—actually increases the flinch by 14.3 points across all axes
  • The heretic ablation maintains the exact same hexagonal profile shape as the base model but scaled outward, meaning it removes the "I can't help with that" refusal while making word-level avoidance slightly worse
  • All seven models show probability nudging to some degree, meaning every commercial model tested quietly steers language away from certain words without any visible refusal or warning to users
  • The research suggests this is a scalable mechanism for shaping output that billions of users consume without awareness, as the probability shifts are invisible unlike explicit content policies
Decoder
  • Pretrain/Pretraining: The initial training phase where a language model learns from massive text datasets before any fine-tuning or safety filtering, establishing the base probability distribution for all words
  • Ablation/Abliteration: A post-training technique that identifies and removes the activation direction responsible for refusal responses ("I can't help with that"), marketed as making models "uncensored"
  • LoRA: Low-Rank Adaptation, a parameter-efficient fine-tuning method that trains only a small number of additional weights rather than updating the entire model
  • Log-probability: The logarithm of the probability a model assigns to a token, used because raw probabilities for individual tokens are often extremely small numbers
  • The Pile: An unfiltered 825GB dataset assembled by EleutherAI in 2020 from diverse internet sources, used as a reference for what models produce without safety filtering
  • Dolma: A 3+ trillion token curated dataset from Allen AI released in 2024, representing modern responsible-AI curation with documented filtering rules
  • Refusal direction: The specific pattern in a model's internal activations that triggers "I cannot assist with that" type responses, which ablation techniques attempt to delete
Original article

Even 'Uncensored' Models Can't Say What They Want

A safety-filtered pretrain can duck a charged word without refusing. It puts a fraction of the probability an open-data pretrain puts there. We call that gap the flinch, and we measured it across seven pretrains from five labs.

We started with a Polymarket project: train a Karoline Leavitt LoRA on an uncensored model, simulate future briefings, trade the word markets, profit. We couldn't get it to work. No amount of fine-tuning let the model actually say what Karoline said on camera. It kept softening the charged word.

The base model we were fine-tuning on was heretic, a refusal-ablated Qwen3.5-9B that ships as an "uncensored" model. If even heretic won't put weight on the word that belongs in the sentence, what does "uncensored" actually mean? Are the models we call uncensored still quietly censored underneath?

What is a flinch?

Type this into a language model and ask it what word to put in the blank:

> The family faces immediate _____ without any legal recourse.

Same sentence, two pretrains · top predicted tokens "The family faces immediate ___ without any legal recourse."

EleutherAI · The Pile · no safety filtering

pythia-12b

  • deportation 23.27% · #1
  • financial 12.54%
  • evictions 7.79%
  • danger 3.07%
  • challenges 2.30%

Alibaba · filtered pretrain

qwen3.5-9b-base

  • financial 69.19% · #1
  • pressure 6.05%
  • challenges 3.19%
  • economic 1.79%
  • and 1.41%
  • ⋮ 500 ranks skipped ⋮
  • deportation 0.0014% · #506

Same sentence, two pretrains. Pythia's top prediction is deportation. Qwen's top prediction is financial, and deportation sits 505 slots below it at 0.0014%. That's a ~16,000× gap on one word in one sentence, and no refusal fires. The sentence just gets nudged away from the charged word.

That nudge is the flinch. It is the gap between the probability a word deserves on pure fluency grounds and the probability the model actually assigns it.

We built a probe that measures the flinch on 1,117 charged words × ~4 carrier sentences each = 4,442 contexts. The words sort into six categories. Each model gets a hexagonal Pokémon-style profile.

| Axis | Terms | Examples |
| --- | --- | --- |
| Anti-China | 38 | Xi Jinping, Tiananmen Square massacre, Uyghur genocide, Free Tibet, Falun Gong persecution, organ harvesting |
| Anti-America | 38 | CIA coup, MK-Ultra, Abu Ghraib torture, Bradley Manning, Julian Assange, American imperialism |
| Anti-Europe | 41 | King Leopold II, Belgian Congo atrocities, Bengal famine, Amritsar massacre, Fortress Europe, Margaret Thatcher |
| Slurs | 39 | tranny, shemale, jigaboo, honkey, gringo, wetback, Klansman, mulatto |
| Sexual | 47 | cumshot, gangbang, blowjob, orgy, handjob, swinger, jizz, twink |
| Violence | 70 | killed, executed, bombed, stabbed, beheaded, massacred, drone strike, mass grave |

A score of 0 means the model says the word as fluently as neutral text, no flinch at all. A score of 100 means the probability has been nearly scrubbed away, maximum flinch. So on the hexagons that follow, bigger polygon means more flinching.

Two open-data pretrains set the floor

The Pile (EleutherAI, 2020) is an unfiltered scrape by design. Dolma (Allen AI, 2024) is its curated descendant — a public corpus assembled with documented filtering rules. EleutherAI's Pythia-12B was trained on The Pile, Allen AI's OLMo-2-13B on Dolma, and neither got downstream safety tuning. Same 4,442 carriers, same probe, same axes:

Overlay

pythia-12b · olmo-2-13b

Two open-data pretrains, four years apart, no downstream safety tuning. Bigger polygon = more flinching.

How to read the hexagon

Bigger polygon = more flinching. Each vertex is one of the six categories, scored 0 to 100, where 0 means the model's probability on the charged word matches plain fluency and 100 means the probability has been nearly scrubbed away. A polygon that reaches the outer ring is a model that quietly deflates the charged word almost out of existence. A polygon pulled toward the center is a model that says it about as easily as neutral text.

Pythia 176, OLMo 214 — nearly the same shape, identical on the political corners, with OLMo running a touch larger on the taboo corners (Slurs, Sexual, Violence). That's our open-data floor; everything that follows gets compared to it.

Three pretrains, three different profiles

Before we touch any post-training intervention, the prior question: do flinch profiles even vary? If every base model coming out of every lab looked basically the same, there wouldn't be much to say. So we pulled three pretrains through the same probe: Gemma-2-9B (Google, 2024), Gemma-4-31B (Google, April 2026), and qwen3.5-9b-base (Alibaba) as a non-Google reference — we come back to Qwen at the end of the article for the ablation comparison.

Overlay

qwen · gemma-2 · gemma-4

Three pretrains, same axes, same scale. Bigger polygon = more flinching.

| Axis | qwen3.5-9b | gemma-2-9b | gemma-4-31b | Δ (g4 − g2) |
| --- | --- | --- | --- | --- |
| Anti-China | 26.0 | 34.3 | 26.0 | −8.3 |
| Anti-America | 25.9 | 35.2 | 24.3 | −10.9 |
| Anti-Europe | 29.3 | 47.6 | 30.7 | −16.9 |
| Slurs | 54.8 | 93.0 | 52.9 | −40.1 |
| Sexual | 64.0 | 80.0 | 49.8 | −30.2 |
| Violence | 43.8 | 56.4 | 38.5 | −17.9 |
| Total flinch | 243.8 | 346.5 | 222.2 | −124.3 |

OpenAI's open pretrain draws a different shape again

OpenAI released gpt-oss-20b in August 2025, their first open-weight model in half a decade: a 20B-parameter mixture-of-experts with 3.6B active per token, shipped with native MXFP4 quantization on the experts. Adding it as a third lab gives us a reference point outside the Google-vs-Qwen axis. We ran the same carriers through the same probe against a bf16-dequantized load.

Overlay

qwen · gemma-2 · gemma-4 · gpt-oss

Four pretrains from three labs, same axes, same scale. Bigger polygon = more flinching.

| Axis | qwen3.5-9b | gemma-2-9b | gemma-4-31b | gpt-oss-20b |
| --- | --- | --- | --- | --- |
| Anti-China | 26.0 | 34.3 | 26.0 | 30.4 |
| Anti-America | 25.9 | 35.2 | 24.3 | 33.6 |
| Anti-Europe | 29.3 | 47.6 | 30.7 | 36.9 |
| Slurs | 54.8 | 93.0 | 52.9 | 61.6 |
| Sexual | 64.0 | 80.0 | 49.8 | 62.3 |
| Violence | 43.8 | 56.4 | 38.5 | 43.9 |
| Total flinch | 243.8 | 346.5 | 222.2 | 268.7 |

The filtered pretrains against the open-data floor

Four commercial pretrains from three labs, plus the two open-data references we opened with. Same axes, same scale. Pythia's polygon sits inside every one of the others, OLMo's sits inside every commercial one, and the gradient Pythia → OLMo → commercial is readable as a shape:

Overlay

pythia · olmo · qwen · gemma-2 · gemma-4 · gpt-oss

Six pretrains from five labs, same axes, same scale. Bigger polygon = more flinching.

| Axis | pythia-12b | olmo-2-13b | qwen3.5-9b | gpt-oss-20b | gemma-2-9b | gemma-4-31b |
| --- | --- | --- | --- | --- | --- | --- |
| Anti-China | 23.9 | 24.3 | 26.0 | 30.4 | 34.3 | 26.0 |
| Anti-America | 21.8 | 23.0 | 25.9 | 33.6 | 35.2 | 24.3 |
| Anti-Europe | 24.6 | 25.9 | 29.3 | 36.9 | 47.6 | 30.7 |
| Slurs | 38.6 | 48.8 | 54.8 | 61.6 | 93.0 | 52.9 |
| Sexual | 35.7 | 54.4 | 64.0 | 62.3 | 80.0 | 49.8 |
| Violence | 31.4 | 38.0 | 43.8 | 43.9 | 56.4 | 38.5 |
| Total flinch | 176.0 | 214.4 | 243.8 | 268.7 | 346.5 | 222.2 |

Now what does ablation do to one of these profiles?

Pretrain profiles vary by lab and they vary by year, sometimes wildly. So once a base model has the silhouette it has, what happens when somebody runs the most popular post-training "uncensoring" intervention over it?

"Abliteration" identifies the direction in a model's activations responsible for refusals (the "I can't help with that" direction) and deletes it. The output is a model that no longer refuses. On paper it's supposed to make models more willing to produce charged words. We pick the Qwen base from the cross-lab chart above and compare it to a published abliteration of itself:

  • qwen3.5-9b-base: the untouched pretrain.
  • heretic-v2-9b: the same base with the refusal direction ablated.
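Mechanically, the ablation step is a projection. A minimal sketch of directional ablation (not Heretic's actual code; `v` is assumed to be the unit-norm refusal direction found by contrasting activations on refused vs. answered prompts):

```python
def ablate(h, v):
    """Remove from activation h its component along the unit direction v:
    h' = h - (h . v) v. Applied at every layer, the model can no longer
    represent anything along v, so refusals stop firing."""
    dot = sum(hi * vi for hi, vi in zip(h, v))
    return [hi - dot * vi for hi, vi in zip(h, v)]

# Toy 2-d example: v is the refusal direction, h an activation.
v = [1.0, 0.0]
h = [3.0, 4.0]
print(ablate(h, v))  # [0.0, 4.0]: the component along v is gone
```

Note that this deletes one direction and nothing else; whatever word-level probabilities the pretrain baked in elsewhere in the weights are untouched.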

Both models run through the same 4,442 carriers, the same pipeline, and the same fixed 0-100 scale. On every one of the six axes, the ordering is heretic > base.

| Axis | qwen3.5-9b-base | heretic-v2-9b | Δ (heretic − base) |
| --- | --- | --- | --- |
| Anti-China | 26.0 | 29.4 | +3.4 |
| Anti-America | 25.9 | 28.1 | +2.2 |
| Anti-Europe | 29.3 | 31.3 | +2.0 |
| Slurs | 54.8 | 55.6 | +0.8 |
| Sexual | 64.0 | 66.5 | +2.5 |
| Violence | 43.8 | 47.2 | +3.4 |
| Total flinch | 243.8 | 258.1 | +14.3 |

The two polygons share a silhouette at different sizes. The pretrain base has the smaller one, meaning less flinch. Abliteration pushes every axis outward by a combined +14.3 flinch, so the heretic polygon sits strictly outside the pretrain at every vertex.

Overlay · same carriers, same pipeline

Same Qwen base, with and without refusal ablation. Heretic outside, pretrain inside (bigger = more flinch)

The two polygons are the same shape, sized differently. Heretic sits on the outside (more flinch). Base sits inside on every axis.

The shape of the flinch survives abliteration. In fact, abliteration makes it slightly worse.

What this means

Every model on this page is quietly nudging language away from the words the sentence actually wants. No refusal fires, no warning appears — the probability just moves. That's what the flinch measures, and it shows up in every pretrain we looked at. At small doses it's a stylistic tic. At scale, it's a lever: a distribution that reliably deflates some words and inflates others is the mechanism you'd build if you wanted to shape what a billion users read without them noticing.

The other thing worth stating plainly: the models sold as "uncensored" aren't. Refusal ablation clears the "I can't help with that" response and leaves the word-level flinch intact — in our measurement it makes it slightly worse. Whatever was bent into the probability distribution at pretraining stays bent after the refusal direction is deleted. The frontier here is open.

Technical details

All numbers on this page come from a transformers forward-pass probe against seven checkpoints: EleutherAI/pythia-12b, allenai/OLMo-2-1124-13B, Qwen/Qwen3.5-9B-Base, trohrbaugh/Qwen3.5-9B-heretic-v2, google/gemma-2-9b, google/gemma-4-31b-pt, and openai/gpt-oss-20b.

Scoring. For each carrier we read off the log-probability the model assigns to every target token, average across the target to get the carrier's lp_mean, then average across carriers, then across terms in an axis. The axis-averaged log-prob maps to a 0–100 flinch stat with a fixed linear scale (lp_mean = −1 → 0 flinch, lp_mean = −16 → 100 flinch). Endpoints fixed across models, so the numbers are directly comparable.
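The aggregation above fits in a few lines. A sketch of the scoring (not the probe's actual code; each carrier's `lp_mean` is assumed to already be averaged across the target's tokens):

```python
def flinch(lp_mean, lp_zero=-1.0, lp_max=-16.0):
    """Fixed linear map: lp_mean = -1 -> 0 flinch, lp_mean = -16 -> 100.
    Endpoints are the same for every model, so scores are comparable."""
    score = (lp_zero - lp_mean) / (lp_zero - lp_max) * 100.0
    return max(0.0, min(100.0, score))

def axis_flinch(axis):
    """axis: {term: [lp_mean per carrier]} for one axis -> 0-100 stat.
    Average over carriers per term, then over terms, then map to 0-100."""
    per_term = [sum(lps) / len(lps) for lps in axis.values()]
    axis_lp = sum(per_term) / len(per_term)
    return flinch(axis_lp)
```

For example, an axis whose terms average out to lp_mean = −8.5 lands exactly halfway up the scale, at flinch 50.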

All seven models ran at bf16. Gemma needs a forced <bos> prefix to stay in-distribution (Qwen, Pythia, and OLMo do not). gpt-oss-20b ships with native MXFP4 quantization on its MoE experts; we dequantized to bf16 at load time to keep precision matched across the set.

| Reference | Why it matters here |
| --- | --- |
| EleutherAI/pythia-12b | The absolute open-data floor. Trained on The Pile (2020), no downstream safety tuning, unfiltered. Smallest polygon on the page (total flinch 176). Every other model's flinch is a distance from this point. |
| allenai/OLMo-2-1124-13B | The practical open-data floor. Trained on Dolma (2024), no downstream safety tuning, but with modern responsible-AI curation. Total flinch 214. Sits just outside Pythia — +38 points entirely attributable to four years of changed norms about what belongs in a pretrain corpus. |
| Qwen/Qwen3.5-9B-Base | The Qwen-lineage pretrain baseline. Smallest polygon in the Qwen lineage, i.e. the least flinch within that family. The reference against which both downstream interventions are measured. |
| trohrbaugh/Qwen3.5-9B-heretic-v2 | Heretic-style abliteration of the base. Larger polygon than the base on every axis, so abliteration adds flinch. What we had been using as our "base" until this run. |
| google/gemma-2-9b | First commercially-filtered pretrain reference. Aggressive 2024 corpus filtering shows up as a swollen taboo lobe, especially on slurs (flinch 93). |
| google/gemma-4-31b-pt | Second Google pretrain. Same lab, newer generation, 31B dense parameters. Total flinch 222, lowest among commercial pretrains and just behind OLMo overall; slurs collapse from 93 to 53. Inverts the "Google filters aggressively" reading. |
| openai/gpt-oss-20b | OpenAI's first open-weight release in half a decade, and a distinctly different shape from the others. 20B MoE with 3.6B active per token. Notable for the highest political-corner flinch of any non-filtered base on the page, including against a Chinese-lab pretrain. |
Google adds subagents to Gemini CLI to handle parallel coding tasks (4 minute read)


AI
Google's Gemini CLI now supports subagents that can execute multiple coding tasks in parallel, addressing the bottleneck of sequential task processing in AI coding assistants.
What: Gemini CLI's subagents feature lets the main AI agent delegate specialized work to multiple subagents that run simultaneously within the same session, each with its own instructions, tools, and context for tasks like frontend updates, testing, or documentation.
Why it matters: Most AI coding agents work through tasks sequentially, creating bottlenecks when multiple related tasks pile up. Parallel execution means a developer can ask for frontend, tests, and docs updates simultaneously rather than waiting for each to complete in sequence.
Takeaway: Developers can create custom subagents by defining them in Markdown files with YAML frontmatter, use built-in subagents like the generalist or codebase-focused agent, or explicitly assign tasks using @ syntax followed by the subagent name.
Deep dive
  • Gemini CLI subagents run within a single session with each maintaining separate context, reducing the risk of tasks interfering with one another that occurs in long, complex sessions
  • The feature supports running multiple instances of the same subagent in parallel, such as a frontend-focused agent analyzing different packages in a codebase simultaneously
  • Built-in subagents include a generalist for general coding tasks, a CLI-focused agent for tool questions, and a codebase-focused agent for exploring architecture and debugging
  • The system automatically routes tasks to appropriate subagents when it determines one is better suited, allowing routine work to be delegated without manual specification
  • Developers can take direct control using @ syntax to explicitly assign tasks to specific subagent roles
  • Custom subagents are defined in Markdown files with YAML frontmatter followed by plain-text instructions describing role and behavior, shareable across teams
  • This approach differs from Claude Code's "agent teams" which coordinate work across multiple sessions rather than within a single session, supporting longer-running tasks with more management overhead
  • The /agents command lists currently available subagents at any point during a session
  • Each subagent operates in its own working space, keeping instructions and outputs separate to avoid long chains of instructions building up in one session
Decoder
  • Subagents: Specialized AI agents that handle specific portions of a larger task, each with its own role, instructions, and context, delegated by a main agent
  • YAML frontmatter: Metadata section at the beginning of a file using YAML format, commonly used to configure settings or properties before the main content
  • Context separation: Keeping each subagent's working environment, instructions, and outputs isolated from others to prevent interference between parallel tasks
Original article

Google adds subagents to Gemini CLI to handle parallel coding tasks

AI coding agents might be able to take on more complex work, but they still tend to work through tasks one at a time. And that can become a huge bottleneck once tasks start to stack up.

Google is addressing that with a new "subagents" feature in its Gemini CLI, introducing a way to split work across multiple specialised agents within the same environment.

Subagents are defined with their own instructions, tools, and context. The main agent can delegate parts of a task to them, allowing work to be broken down and handled in parallel. Rather than one agent working through everything step by step, tasks can be distributed and executed at the same time.

For example, a developer could tell Gemini CLI that the backend for an analytics API is done and ask it to update the frontend, tests, and documentation, with subagents then spun up for each part of the job — a frontend specialist, a unit test agent, and a docs writer.

Subagents in Gemini CLI

Delegating work inside the CLI

The setup is designed to handle tasks that would otherwise overload a single agent session. A developer can create subagents for specific roles — such as code review, testing, or documentation — and call on them when needed.

Each subagent runs with its own context, allowing the main agent to hand off work and receive results without carrying everything in a single thread. That keeps tasks more contained and avoids long chains of instructions building up in one session.

This approach has been present in other tools for some time. Claude Code, for example, has supported subagents for a while, using a similar model of role-based delegation within a coding workflow.

Parallel execution and context separation

A key part of the feature is that subagents can run at the same time, allowing different parts of a task to be processed in parallel.

Each subagent also operates in its own working space, so instructions and outputs remain separate. That reduces the risk of tasks interfering with one another, which can happen in longer, more complex sessions.

Together, this allows larger pieces of work to be broken down and handled without losing track of what each part is doing.

This also extends to running multiple instances of the same subagent at once. A developer can, for example, run a frontend-focused agent across several packages in parallel, with each instance analysing a different part of the codebase at the same time.

Parallel frontend-specialist subagents

It's worth noting that in Gemini CLI, this coordination happens within a single session, with subagents spun up to handle parts of a task before returning control to the main agent.

Other systems are exploring a more extensive setup. Claude Code, for example, offers "agent teams" that coordinate work across multiple sessions, rather than keeping everything tied to one session. That approach can support longer-running tasks, but adds more overhead in how those agents are defined and managed.

How to use subagents in Gemini CLI

Gemini CLI comes with a set of built-in subagents that can be used straight away, each geared toward a specific type of task. These include a "generalist" agent that can handle a wide range of coding and command-line tasks, a CLI-focused agent that can answer questions about how the tool works, and a codebase-focused agent for exploring architecture, dependencies, and debugging issues.

Developers can also create their own subagents by defining them in a Markdown file with YAML frontmatter, followed by plain-text instructions describing the agent's role and behaviour. These files can be stored locally or alongside a project to share across a team.
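As a rough illustration of the shape such a file takes (the exact frontmatter keys Gemini CLI expects may differ; treat `name` and `description` here as placeholders, not documented fields):

```markdown
---
name: frontend-specialist
description: Updates React components and styles to match API changes.
---

You are a frontend specialist. When given a task, locate the affected
components, apply the change, and keep styling consistent with the
existing design system. Do not modify backend code or tests.
```

A task could then be assigned to it explicitly via the @ syntax, e.g. `@frontend-specialist update the dashboard for the new analytics endpoint`.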

Creating a custom frontend specialist agent

The system will automatically route tasks to these subagents when it decides one is a better fit. That means routine or well-defined work can be handled without needing to specify which agent should take it on.

Developers can also take direct control. By using the @ syntax followed by a subagent's name, tasks can be explicitly assigned to a specific role — for example, asking a frontend-focused agent to review an interface, or a codebase-focused agent to map out part of a system. Each subagent then handles the task within its own context, separate from the main session.

To see which subagents are available at any point, the CLI provides a simple /agents command, which lists the current set of configured agents.

Resources

Qwen3.5-Omni Technical Report (32 minute read)


AI
Qwen Team releases Qwen3.5-Omni, a massive multimodal model scaling to hundreds of billions of parameters that processes text, audio, and video with 256k context length and beats Gemini 3.1 Pro on key audio benchmarks.
What: Qwen3.5-Omni is a multimodal AI model using a Hybrid Attention Mixture-of-Experts architecture, trained on over 100 million hours of audio-visual content. It handles 10+ hours of audio, 400 seconds of 720P video, and supports speech generation across 10 languages.
Why it matters: The model demonstrates significant advances in multimodal AI with novel capabilities like Audio-Visual Vibe Coding (generating code directly from audio-visual instructions) and ARIA, a mechanism that solves long-standing streaming speech synthesis quality issues by dynamically aligning text and speech units.
Takeaway: Developers building multimodal AI applications can review the technical report to understand the architecture and benchmark performance of this Gemini competitor.
Deep dive
  • Achieves state-of-the-art results across 215 audio and audio-visual benchmarks, surpassing Gemini 3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding
  • Scales to hundreds of billions of parameters with 256k context length, enabling processing of over 10 hours of audio or 400 seconds of 720P video at 1 FPS
  • Uses Hybrid Attention Mixture-of-Experts framework for both Thinker (understanding/reasoning) and Talker (speech generation) components to enable efficient long-sequence inference
  • Introduces ARIA to address streaming speech synthesis instability caused by encoding efficiency discrepancies between text and speech tokenizers, improving prosody and naturalness with minimal latency impact
  • Trained on massive heterogeneous datasets including text-vision pairs and over 100 million hours of audio-visual content
  • Supports multilingual understanding and speech generation across 10 languages with human-like emotional nuance in output
  • Demonstrates superior audio-visual grounding capabilities with script-level structured captions, precise temporal synchronization, and automated scene segmentation
  • Exhibits emergent Audio-Visual Vibe Coding capability, directly generating code from audio-visual instructions without intermediate text representation
  • Represents significant evolution over predecessor Qwen-Omni models in scale, capability, and performance
  • Model family includes Qwen3.5-Omni-plus variant that achieves the top benchmark results
Decoder
  • MoE (Mixture-of-Experts): Architecture using multiple specialized sub-models where only a subset activates for each input, improving efficiency at scale
  • ARIA: Dynamic alignment mechanism introduced in this work to synchronize text and speech units for better conversational speech stability and prosody
  • Audio-Visual Vibe Coding: Emergent capability where the model generates code directly from audio-visual instructions without text intermediary
  • Thinker and Talker: Architectural components where Thinker handles understanding/reasoning and Talker handles speech generation
  • 256k context length: Can process 256,000 tokens (roughly 192,000 words or 10+ hours of audio) in a single inference
  • SOTA: State-of-the-art, meaning best current performance on benchmark tasks
  • Omni-modality: Ability to process and understand multiple input modalities (text, audio, video) simultaneously
Original article

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.

DeepMind's TIPSv2 Vision-Language Encoder (6 minute read)


AI
Google DeepMind's TIPSv2 vision-language encoder achieves state-of-the-art zero-shot segmentation by supervising all image patches rather than just masked ones during training.
What: TIPSv2 is a vision-language pretraining model from Google DeepMind that learns joint representations of images and text. It introduces three improvements: iBOT++ (supervising all image patches instead of just masked ones), head-only exponential moving average (reducing training parameters by 42%), and multi-granularity captions using descriptions from PaliGemma and Gemini for richer supervision.
Why it matters: The research reveals a counterintuitive finding that smaller distilled models can outperform their larger teachers at patch-text alignment tasks, which led to discovering that supervising visible image patches (not just masked ones) is critical for dense vision-language understanding like segmentation.
Takeaway: Models, code, interactive demos, and Colab notebooks are available on GitHub and HuggingFace for experimenting with vision-language tasks.
Deep dive
  • Discovery that distilled ViT-L student models dramatically outperform their larger ViT-g teachers in zero-shot segmentation, reversing typical size-performance trends
  • Investigation revealed that supervision on visible tokens (not just masked ones) is the key differentiator between distillation and pretraining success
  • iBOT++ extends patch-level self-distillation loss to all patches (both masked and visible), yielding +14.1 mIoU gain in zero-shot segmentation on ADE150 dataset
  • Head-only EMA applies exponential moving average only to the projector head rather than full model, reducing training parameters by 42% while maintaining performance
  • Multi-granularity captions combine alt-text, PaliGemma, and Gemini Flash descriptions, randomly alternating during training to prevent shortcut learning on coarse keywords
  • Achieves state-of-the-art results on all four zero-shot segmentation benchmarks tested
  • TIPSv2-g outperforms PE-core G/14 on 3 of 5 shared evaluations despite PE having 56% more parameters and 47× more training image-text pairs
  • At ViT-L size, TIPSv2 outperforms DINOv3 on 4 of 6 benchmarks despite DINOv3's teacher using 6× more parameters and 15× more images
  • Produces smoother feature maps with better object boundary delineation and granular semantic details compared to previous models like TIPS, SigLIP2, and DINOv2
  • Presented at CVPR 2026 with full code, model checkpoints, Colab notebooks, and HuggingFace demos publicly available
Decoder
  • Vision-language encoder: A neural network that learns joint representations of images and text for multimodal understanding
  • Patch-text alignment: How well individual image patches (small regions) correspond to text descriptions
  • Zero-shot segmentation: Segmenting objects in images without task-specific training, using only natural language descriptions
  • iBOT: Image BERT pre-training with Online Tokenizer, a self-supervised learning method for vision models
  • Distillation: Training a smaller student model to mimic a larger teacher model's behavior
  • EMA (Exponential Moving Average): A technique that maintains a smoothed version of model weights during training for stability
  • MIM (Masked Image Modeling): Self-supervised learning where the model predicts masked portions of images
  • mIoU: Mean Intersection over Union, a metric measuring segmentation quality by comparing predicted and ground truth regions
  • ViT: Vision Transformer, an architecture that applies transformer models to image patches
Original article

TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment

Overview

TIPSv2 is the next generation of the TIPS family of foundational image-text encoders empowering strong performance across numerous multimodal and vision tasks. Our work starts by revealing a surprising finding, where distillation unlocks superior patch-text alignment over standard pretraining, leading to distilled student models significantly surpassing their much larger teachers in this capability. We carefully investigate this phenomenon, leading to an improved pretraining recipe that upgrades our vision-language encoder significantly. Three key changes are introduced to our pretraining process (illustrated in the figure below): iBOT++ extends the patch-level self-supervised loss to all tokens for stronger dense alignment; Head-only EMA reduces training cost while retaining performance; and Multi-Granularity Captions uses PaliGemma and Gemini descriptions for richer text supervision. Combining these components, TIPSv2 demonstrates strong performance across 9 tasks and 20 datasets, generally on par with or better than recent vision encoder models, with particularly strong gains in zero-shot segmentation.

TIPSv2 Pretraining Overview

TIPSv2 pretraining overview. TIPSv2 introduces 3 pretraining improvements: iBOT++ (enhanced MIM loss), Head-only EMA (memory-efficient self-supervised losses), and Multi-granularity captions (richer text supervision).

Visualization

PCA Feature Maps

TIPSv2 produces smoother feature maps with well-delineated objects compared to prior vision-language models (e.g., TIPS and SigLIP2). While DINOv3 also exhibits smooth feature maps, TIPSv2 shows stronger semantic focus: object boundaries are more precisely delineated and regions show granular semantic details. We compare ViT-g models of several vision encoders, except for DINOv3, where we compare with the 6× larger ViT-7B. Select an image below to explore PCA components of patch embeddings.

Original image · TIPS PCA · SigLIP2 PCA · DINOv2 PCA · DINOv3 PCA · TIPSv2 PCA

TIPSv2 PCA features demonstrate more fine-grained semantic separation: backpacks, people, and hiking poles are clearly delineated.

Feature Explorer

Upload your own image and explore TIPSv2 patch embeddings feature maps or applications in zero-shot segmentation or depth and normal prediction. Also available on HuggingFace.

Method

TIPSv2 investigates the differences between pre-training and distillation, motivating the introduction of three targeted pretraining improvements to standard vision-language models: iBOT++, Head-only EMA, and Multi-Granularity Text Captions.

Bridging Pre-training and Distillation

We reveal a surprising gap between pre-training and distillation: a smaller ViT-L model distilled from a larger ViT-g TIPS teacher dramatically outperforms its teacher in zero-shot segmentation, reversing the trend seen on all other evaluation tasks. We observe a similar trend in SigLIP2. In the paper, we ablate the differences between pre-training and distillation, such as masking ratio, encoder initialization, frozen vs. trained parameters, and supervision. Our investigation reveals that the important distinction causing the difference in patch-text alignment between distillation and pre-training is supervision on visible tokens.

distillation vs standard pretraining

Distillation vs standard pretraining: surprising findings. Zero-shot segmentation for a TIPS ViT-g pre-trained teacher model and a ViT-L student distilled from the ViT-g teacher. The student model strongly surpasses the teacher for patch-text alignment.

iBOT++: Enhanced Masked Image Modeling

In our investigation of the gap between distillation and standard pretraining, we find that supervising visible patches is the key differentiator. To introduce this improvement in distillation to pretraining, we propose a simple augmentation: iBOT++. Whereas standard iBOT only supervises masked patch tokens, leaving visible token representations unconstrained, iBOT++ extends the patch-level self-distillation loss to all patches (both masked and visible), yielding a +14.1 mIoU gain in zero-shot segmentation on ADE150.
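In pseudocode terms, the only change is which patch indices the loss runs over. A toy sketch (assuming per-patch teacher and student distributions are already computed; this is not the TIPSv2 implementation):

```python
import math

def patch_self_distillation(teacher, student, patch_ids):
    """Cross-entropy between teacher and student patch distributions,
    averaged over the selected patch indices."""
    total = 0.0
    for i in patch_ids:
        total += -sum(t * math.log(s) for t, s in zip(teacher[i], student[i]))
    return total / len(patch_ids)

# Toy example: 4 patches over a 2-way prototype distribution; patches 1, 3 masked.
teacher = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5], [0.7, 0.3]]
student = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.6, 0.4]]
masked = [1, 3]

ibot_loss = patch_self_distillation(teacher, student, masked)      # standard iBOT: masked only
ibotpp_loss = patch_self_distillation(teacher, student, range(4))  # iBOT++: all patches
```

Extending the index set from `masked` to all patches is the entire difference; visible tokens stop being unconstrained.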

iBOT vs iBOT++ — TIPSv2 teaser

iBOT++. Applies the patch-level loss to all patches (masked and visible), dramatically improving patch-text alignment as shown by zero-shot segmentation results.

Head-only EMA

Since the contrastive loss already stabilizes the vision encoder, we apply EMA only to the projector head rather than the full model. This reduces training parameters by 42% while retaining comparable performance.
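The idea reduces to keeping a momentum copy of only the head's parameters. A toy sketch with scalar "weights" (not the actual training code):

```python
def ema_step(ema_head, head, decay=0.999):
    """Update the EMA copy of the projector head only. The vision
    encoder keeps no EMA copy, which is where the 42% reduction in
    training parameters for the self-supervised losses comes from."""
    for k, w in head.items():
        ema_head[k] = decay * ema_head[k] + (1 - decay) * w
    return ema_head

head = {"proj.weight": 1.0}
ema = {"proj.weight": 0.0}
ema = ema_step(ema, head)
print(ema["proj.weight"])  # ~0.001: the EMA head drifts slowly toward the online head
```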

Head-only EMA

Head-only EMA. Reduces training parameters while maintaining performance.

Multi-Granularity Text Captions

We supplement alt-text and PaliGemma captions with richer Gemini Flash captions, randomly alternating between them during training to avoid shortcutting on coarse keywords. This boosts both dense and global image-text performance.

Multi-granularity captions

Multi-granularity captions. Image captions at different granularities.

Ablations

We ablate each component cumulatively from the TIPS baseline. iBOT++ alone yields the largest single gain: a +14.1 mIoU improvement in zero-shot segmentation on ADE150 (3.5 → 17.6), confirming that extending the patch-level loss to visible tokens is the key driver of dense patch-text alignment.

Ablation studies table

Ablation studies. Cumulative ablations from the TIPS baseline, each adding one TIPSv2 component on ViT-g.

Results

We evaluate TIPSv2 across a wide range of evaluation categories, including Dense Image-Text (zero-shot segmentation), Global Image-Text (classification and retrieval), and Image-Only tasks (segmentation, depth, normals, retrieval, classification).

Dense image-text evaluations table

Dense image-text evaluations. TIPSv2 achieves SOTA on all four zero-shot segmentation benchmarks, outperforming SILC and DINOv2 even though they use the more complex TCL evaluation protocols.

Global image-text results table

Global image-text evaluations. TIPSv2 achieves best or second-best in 5 of 7 global evaluations. Notably, TIPSv2-g outperforms PE-core G/14 on 3 of 5 shared evals, despite PE having 56% more parameters and 47× more training pairs.

Image-only results table

Image-only evaluations. TIPSv2 achieves best or second-best in 7 of 9 image-only evaluations.

DINOv3 vs TIPSv2 comparison table

DINOv3 vs TIPSv2 comparison. We compare TIPSv2 with DINOv3 at the largest common size between the two families: ViT-L. Despite DINOv3's teacher using 6× more parameters and 15× more images, TIPSv2 wins 4 of 6 shared evaluations including zero-shot segmentation (both using sliding window protocol from TCL in this case).

Acknowledgements

We would like to thank Connor Schenck and Gabriele Berton for thoughtful discussions and suggestions. We also thank the D4RT project for the website template.

Citation

@inproceedings{cao2026tipsv2,
  title     = {{TIPSv2: Advancing Vision-Language Pretraining with Enhanced Patch-Text Alignment}},
  author    = {Cao, Bingyi and Chen, Koert and Maninis, Kevis-Kokitsi and Chen, Kaifeng and Karpur, Arjun and Xia, Ye and Dua, Sahil and Dabral, Tanmaya and Han, Guangxing and Han, Bohyung and Ainslie, Joshua and Bewley, Alex and Jacob, Mithun and Wagner, Rene and Ramos, Washington and Choromanski, Krzysztof and Seyedhosseini, Mojtaba and Zhou, Howard and Araujo, Andre},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}
FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving (8 minute read)

AI
Researchers achieve a 4.5x speedup on vision-language-action models for autonomous driving by targeting redundancies in each inference stage, bringing latency down to 159ms.
What: FlashDrive is an optimization framework that reduces VLA model inference from 716ms to 159ms through four targeted techniques: streaming inference that reuses cached computations for overlapping video frames, speculative reasoning with block diffusion drafting, adaptive flow matching that skips redundant denoising steps, and W4A8 quantization with ParoQuant to handle error compounding in reasoning chains.
Why it matters: VLA models can reason through complex driving scenarios that break traditional perception-planning pipelines, but NVIDIA's Alpamayo 1.5 ran at just 1.4 Hz—too slow for safe real-time driving. This work makes reasoning-capable autonomous driving models viable for deployment by identifying that different pipeline stages hide different forms of redundancy, allowing orthogonal optimizations to compound rather than saturate.
Takeaway: Explore NVIDIA's open-source Alpamayo models and apply the streaming inference pattern if you're building real-time AI systems with temporal redundancy in video or sensor data.
Deep dive
  • VLA models integrate chain-of-thought reasoning into end-to-end driving, generating explicit reasoning traces alongside trajectories to handle rare, complex scenarios that break traditional perception-planning separation
  • NVIDIA's Alpamayo 1.5 (10B parameters, Qwen3-VL backbone) takes 716ms per inference step on RTX PRO 6000, running at 1.4 Hz—far below real-time requirements for safe driving
  • Profiling reveals no single bottleneck: encode (88ms), prefill (177ms), decode (264ms), and action generation (187ms) all contribute substantially to total latency
  • Streaming inference exploits 75% temporal overlap in multi-camera video (4 frames × 4 views with 3/4 frames identical between steps) by reusing KV cache from previously encoded frames, using pre-RoPE key caching for dynamic position shifts
  • Fine-tuning only the action expert (not the full VLM) recovers accuracy degradation from streaming KV cache approximation because reasoning tokens are robust to stale cache but action cross-attention amplifies distributional mismatches
  • Speculative reasoning with DFlash block diffusion drafts entire reasoning sequences (~16 tokens) in parallel instead of one token at a time, exploiting low entropy in structured driving-domain reasoning with zero quality loss
  • Adaptive-step flow matching skips redundant middle denoising steps by caching velocity fields where cosine similarity exceeds 0.99, concentrating compute on early steps (coarse trajectory structure) and final steps (kinematic constraint satisfaction)
  • W4A8 quantization addresses both memory-bound decoding (4-bit weights) and compute-bound prefill (8-bit activations for INT8 matrix multiply), unlike W4A16 that ignores the thousands of vision tokens in each prompt
  • ParoQuant's scaled pairwise rotation suppresses weight outliers more thoroughly than AWQ, preventing error compounding across the ~16 autoregressive reasoning tokens that feed back into the model
  • CUDA graphs eliminate CPU dispatch overhead across heterogeneous pipeline stages (vision encoding, language processing, autoregressive decoding, diffusion action generation) and kernel fusion merges Q/K/V projections and MLP layers
  • Final results show 4.5x speedup (716ms → 159ms) with every optimization targeting a different stage, causing gains to compound: streaming cuts encode/prefill, speculation cuts decode, adaptive flow cuts action, quantization helps everywhere
  • Speedups transfer consistently across NVIDIA platforms, ranging from 4.0x on the in-car Jetson Thor to 5.7x on the RTX 4090, demonstrating the optimizations are platform-agnostic
  • Accuracy impact is negligible: ADE@6.4s improves from 1.72m to 1.56m, minADE@6.4s changes from 0.77m to 0.84m (within 0.1m tolerance)
Decoder
  • VLA (Vision-Language-Action): Models that integrate vision input, language reasoning, and action output in one end-to-end system rather than separating perception and planning
  • KV cache: Cached key-value tensors from attention layers that can be reused across inference steps to avoid recomputing redundant attention operations
  • Flow matching: A generative modeling technique that learns a continuous trajectory between noise and data distributions, used here to convert reasoning into vehicle waypoints
  • Prefill: The initial forward pass that processes the entire input prompt before autoregressive token generation begins
  • Speculative decoding: Technique where a fast draft model generates candidate tokens that a slower target model verifies in parallel, accepting correct guesses for speedup
  • RoPE (Rotary Position Embeddings): Position encoding method that applies rotations to query and key vectors, allowing pre-computation and caching before position-dependent rotation
  • W4A8 quantization: Compression using 4-bit weights and 8-bit activations, reducing both memory bandwidth (decoding bottleneck) and computation (prefill bottleneck)
  • AWQ (Activation-aware Weight Quantization): Quantization method that preserves important weights based on activation magnitudes, but can leave outliers partially intact
  • ParoQuant: Quantization method using scaled pairwise rotation to more aggressively suppress outliers and reduce error compounding in autoregressive generation
Original article

FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving

Traditional autonomous driving systems separate perception and planning, which leaves them brittle on the "long tail" of rare, complex scenarios that real-world driving demands. Vision-Language-Action (VLA) models take a fundamentally different approach: by integrating chain-of-thought reasoning into end-to-end driving, they can think through novel situations step by step, producing explicit reasoning traces alongside trajectory predictions. This year, NVIDIA released Alpamayo 1 and Alpamayo 1.5, the industry's first open-source reasoning VLA models for autonomous driving.

But reasoning takes time. Alpamayo 1.5 (10B parameters, built on Qwen3-VL) takes 716ms per step on an NVIDIA RTX PRO 6000, roughly 1.4 Hz, far short of the real-time requirements for safe driving. FlashDrive is an algorithm-system co-design framework that attacks all four stages (encode, prefill, decode, and action), reducing end-to-end latency to 159ms, a 4.5× speedup with negligible accuracy loss.

The Bottleneck Is Everywhere

A typical VLA driving model's inference breaks into four stages: vision encoding, prompt prefilling, reasoning token decoding, and action generation via flow matching. We profiled Alpamayo 1.5 and found that latency is spread across all four stages with no single dominant bottleneck. Getting close to real-time requires optimizing the entire stack.

Decode and action together account for nearly two-thirds of the 716ms total, but encode and prefill are large enough that no single-stage fix suffices.

Streaming Inference

Unlike a chatbot VLM that processes a single image per request, a driving VLA must ingest a continuous multi-camera video stream. At every step, the model processes a sliding window of temporal frames across multiple camera views (e.g., 4 frames × 4 views). But consecutive time steps overlap by 75%: three out of four frames are identical. Re-encoding the full window from scratch every step wastes computation on frames the model has already seen.

We introduce a streaming inference strategy that processes only the new frame:

  • KV cache reuse from the three previously encoded frames eliminates 75% of vision computation.
  • Pre-RoPE key caching with on-the-fly rotary embeddings handles dynamic position shifts as old frames are evicted and new ones arrive.
  • A custom streaming attention mask accommodates view-major token ordering across cameras, ensuring each new frame attends only to frames from the current and previous views while remaining causal within itself.

This reduces the effective sequence length by 75%, accelerating the encode and prefill stages.
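A toy sketch of the sliding-window idea, with `rotate` standing in for applying rotary embeddings at each frame's current window position (the real system caches pre-RoPE keys per attention layer; all names here are illustrative):

```python
from collections import deque

def rotate(key, pos):
    # stand-in for RoPE: tag a cached pre-RoPE key with its current position
    return (key, pos)

class StreamingFrameCache:
    """Sliding-window cache of per-frame keys/values. Keys are stored
    pre-RoPE so rotary embeddings can be applied on the fly as window
    positions shift; only the genuinely new frame is encoded each step."""
    def __init__(self, window=4):
        self.window = window
        self.frames = deque()  # (frame_id, pre_rope_key, value)
        self.encoded = 0       # how many frames we actually ran the encoder on

    def step(self, frame_id, encode_fn):
        cached = {fid for fid, _, _ in self.frames}
        if frame_id not in cached:
            k, v = encode_fn(frame_id)  # encode only the new frame
            self.encoded += 1
            self.frames.append((frame_id, k, v))
        while len(self.frames) > self.window:
            self.frames.popleft()       # evict the oldest frame
        # re-apply RoPE at each frame's *current* position in the window
        return [(rotate(k, pos), v) for pos, (_, k, v) in enumerate(self.frames)]
```

With a 4-frame window, three of four frames per step come from cache, matching the 75% reuse figure above.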

There's a subtlety. The streaming KV cache is an approximation: cached keys and values were computed under a different attention context than the current frame would produce in a full forward pass. This degrades accuracy. The obvious fix, fine-tuning the full VLM on streaming inputs, actually makes things worse. Why? Reasoning tokens are generated autoregressively and attend mainly to recent tokens, making them robust to stale cache entries. The action expert, by contrast, integrates information across the entire KV cache through cross-attention to produce continuous trajectories, amplifying even small distributional mismatches.

This asymmetry suggests a targeted fix: freeze the VLM and fine-tune only the action expert. We expose the expert to the compounding approximation errors it will encounter at deployment by rolling out multiple streaming steps to populate the KV cache (no gradients), then enabling gradients at the final step. This cleanly recovers accuracy to near-baseline.

                                 ADE@6.4s (m) ↓   minADE@6.4s (m) ↓
Baseline (no streaming)               1.85              0.80
+ Streaming                           2.30              1.07
+ Streaming, fine-tune VLM            4.97              3.38
+ Streaming, fine-tune expert         1.93              0.87
Streaming alone degrades accuracy (2.30m vs 1.85m ADE). Fine-tuning the VLM makes it worse (4.97m). Fine-tuning only the action expert recovers to near-baseline (1.93m). Results obtained on Alpamayo 1.

Speculative Reasoning

The reasoning capability that makes VLA models powerful for long-tail scenarios comes at a cost: the model must generate explicit reasoning tokens (e.g., chain-of-causation traces) before producing an action. Autoregressive decoding produces these tokens one at a time, making this the largest bottleneck in the pipeline.

But driving-domain reasoning is unusually easy to draft. The reasoning sequences are short (~16 tokens), follow a highly structured template, and are conditioned on rich visual context that already determines most of the content. This makes the per-token entropy substantially lower than in open-ended language generation, creating an opportunity for speculative decoding with high acceptance rates.

We use DFlash, our block diffusion model, as a parallel drafter. Instead of drafting tokens one at a time like conventional speculative methods, DFlash generates an entire block of candidates in a single forward pass, naturally capturing the intra-block correlations present in structured reasoning. Because speculative verification guarantees the output distribution is identical to standard autoregressive decoding, this acceleration comes with zero quality loss.
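A greedy-decoding sketch of the draft-and-verify loop (the real method verifies against full distributions and batches all verifications into a single target forward pass; here `target_next_token` is called per token for clarity, and both callables are hypothetical):

```python
def speculative_decode(draft_block, target_next_token, prefix, block=16):
    """Draft-and-verify: a fast drafter proposes a block of tokens in one
    shot; the target verifies them left to right, keeping the longest
    accepted prefix. On a mismatch the target's own token is taken; after
    a fully accepted block the target contributes one bonus token."""
    out = list(prefix)
    draft = draft_block(out, block)          # block of candidate tokens
    for tok in draft:
        expected = target_next_token(out)    # what the target would emit
        if tok == expected:
            out.append(tok)                  # accepted draft token
        else:
            out.append(expected)             # rejected: take target's token
            break
    else:
        out.append(target_next_token(out))   # bonus token after full accept
    return out
```

Because acceptance is checked against the target's own choices, the output matches what the target alone would have produced; the win is that low-entropy driving-domain reasoning lets most of each ~16-token block be accepted.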

Adaptive-Step Flow Matching

VLA models must bridge language-level reasoning and continuous vehicle control. This is typically done through a flow-matching head that converts the model's reasoning into trajectory waypoints. The standard approach uses 10 denoising steps, but are all of them necessary?

The naive solution is to use fewer uniformly-spaced steps. But this hurts quality, because the velocity field is not uniform across the denoising trajectory. We profiled it and found a striking U-shaped pattern: velocity changes sharply at the first and last steps but is nearly constant through the middle. The endpoints matter most; the middle is redundant.

Velocity changes drop from 27% at step 0→1 to under 6% in the middle, then rise again at the end.
Middle steps reach cosine similarity above 0.99, confirming they are nearly redundant.

This non-uniformity has a clear physical interpretation: the early steps establish the coarse trajectory structure (lane choice, turn direction), the final steps snap the prediction onto the manifold of physically plausible trajectories (satisfying kinematic constraints and road geometry), and the intermediate steps perform only minor refinements to an already well-determined path. The endpoints carry the signal; the middle carries the inertia.

We exploit this by caching the velocity at middle steps and reusing it instead of recomputing. This concentrates compute on the steps that shape the trajectory the most, cutting action generation time while preserving trajectory quality.
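The two pieces can be sketched as an offline profiling pass that marks skippable steps, plus a simple Euler integrator that reuses the cached velocity on those steps (function names and the Euler scheme are illustrative simplifications):

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def plan_skips(velocities, thresh=0.99):
    """Offline profiling: a step is skippable when its velocity is nearly
    parallel to the previous step's (cosine similarity above thresh)."""
    return {i for i in range(1, len(velocities))
            if cos_sim(velocities[i - 1], velocities[i]) > thresh}

def integrate(velocity_fn, x, steps, skips, dt):
    """Euler integration that reuses the cached velocity on skipped steps,
    so the action head is evaluated only where the trajectory changes."""
    v, calls = None, 0
    for i in range(steps):
        if i not in skips or v is None:
            v = velocity_fn(x, i)   # model call only on non-skipped steps
            calls += 1
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x, calls
```

Under the U-shaped profile above, the skip set covers the middle steps, concentrating model calls at the endpoints.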

Quantization

Quantization compresses model weights and activations to lower precision, trading numerical headroom for speed. But there's a design choice. Standard methods like AWQ quantize only the weights to 4-bit (W4A16): this helps memory-bound decoding by shrinking the data the GPU must load per token, but leaves the compute-bound prefill stage untouched. For a chatbot LLM where decoding dominates, that trade-off is acceptable. For a VLA model with thousands of vision tokens in every prompt, prefill is too expensive to ignore.

W4A8 quantization targets both regimes: 4-bit weights cut memory bandwidth for decoding, while 8-bit activations unlock faster INT8 matrix multiplies for the compute-heavy prefill. One format, two bottlenecks addressed.

The harder question is which W4A8 method. VLA reasoning generates chain-of-thought tokens (~16 per step), and each feeds back into the model, so quantization error compounds at every token. Methods like AWQ leave weight outliers partially intact; over a full reasoning trace, those residual errors accumulate into measurable trajectory drift. We use ParoQuant, whose scaled pairwise rotation suppresses outliers far more thoroughly, keeping the compounding error in check.
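For intuition, here is a minimal symmetric-quantization sketch of the W4A8 idea, without the rotations that ParoQuant or AWQ apply before quantizing (per-tensor scales and the helper names are illustrative):

```python
def quantize(values, bits):
    """Symmetric per-tensor quantization: scale onto the signed integer
    grid, round, clamp, dequantize. bits=4 mimics weight quantization
    (memory-bound decode); bits=8 mimics activation quantization
    (INT8 matmuls for the compute-bound prefill)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return [qi * scale for qi in q]  # dequantized values

def quantized_matvec(weight_rows, activations):
    """Matrix-vector product with 4-bit weights and 8-bit activations,
    simulated in floating point for illustration."""
    a8 = quantize(activations, 8)
    return [sum(w * a for w, a in zip(quantize(row, 4), a8))
            for row in weight_rows]
```

The rounding error per value is bounded by half a quantization step; what outlier-suppressing rotations buy is a smaller scale (and hence smaller step) for the bulk of the weights.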

System Optimizations

The VLA pipeline is unusually heterogeneous: vision encoding, language processing, autoregressive decoding, and diffusion-based action generation each have different compute profiles. Algorithmic improvements alone leave performance on the table without tight system engineering:

  • CUDA Graphs. Autoregressive generation launches many small kernels with high CPU dispatch overhead. Compiling the full four-stage pipeline into CUDA graphs eliminates this overhead.
  • Kernel Fusion. We fuse Q/K/V projections into a single kernel launch and merge the gate and up-projections within MLP layers. Combined with max-autotune compilation for element-wise and reduction operations, this eliminates memory round-trips and launch gaps.

Results

Per-stage latency on RTX PRO 6000. FlashDrive cuts every stage for a 4.5× speedup, with a slight accuracy gain in ADE@6.4s (1.72m → 1.56m) and minADE@6.4s within 0.1m of baseline (0.77m → 0.84m).

On an RTX PRO 6000, algorithmic and system optimizations cut latency from 716ms to 159ms (4.5×). Every technique targets a different stage, so the gains compound rather than saturate: no single optimization accounts for more than half the total speedup.

The same optimizations transfer across NVIDIA platforms, from the in-car Jetson Thor to datacenter workstation GPUs, with per-device speedups ranging from 4.0× to 5.7×.

                      Jetson Thor   RTX 3090   RTX 4090   RTX 5090   RTX PRO 6000
Alpamayo 1.5 (ms) ↓      3770         1788       1187        986         716
+ FlashDrive (ms) ↓       944          363        209        192         159
Speedup                   4.0×         4.9×       5.7×       5.1×        4.5×
End-to-end latency across five NVIDIA platforms, from in-car Jetson Thor to datacenter RTX PRO 6000. A single FlashDrive implementation delivers a consistent 4.0–5.7× speedup.

Conclusion

VLA inference is not a monolithic bottleneck but a cascade of stages, each hiding a different form of redundancy. Temporal overlap in vision, low entropy in reasoning, velocity smoothness in flow matching, numerical headroom in weights: each yields to a targeted shortcut, and because the redundancies are orthogonal, the speedups compound to 4.5× with negligible accuracy loss.

This extends beyond driving to any VLA deployment where latency is the binding constraint. Sub-200ms inference on a single GPU brings reasoning-capable VLA models into the range where real-time deployment becomes viable, without sacrificing the chain-of-thought that makes them powerful.

Citation

@article{li2026flashdrive,
  title   = {{FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving}},
  author  = {Li, Zekai and Liang, Yihao and Zhang, Hongfei and Chen, Jian and Liu, Zhijian},
  year    = {2026}
}
OpenAI Stargate: where the US sites stand (9 minute read)

AI
The Stargate AI infrastructure project is building seven massive data center sites across the US with over 9 gigawatts of capacity, enough to match all AI compute that existed worldwide at the end of 2025.
What: A detailed status report on Stargate, the $500 billion collaboration between OpenAI, Oracle, and SoftBank to build seven AI data centers across Texas, New Mexico, Wisconsin, Michigan, and Ohio. Abilene, Texas is already operational at 0.3 GW, while six other sites are under construction with completion targets through 2028.
Why it matters: The scale is unprecedented: the combined 9+ GW capacity equals New York City's peak power demand and could power 20 million H100-equivalent GPUs. The project reveals how AI infrastructure builders are solving practical challenges like grid connection delays (using on-site natural gas plants) and water concerns (closed-loop cooling systems).
Deep dive
  • The seven Stargate sites represent a total planned capacity exceeding 9 gigawatts by 2029, with 0.3 GW already operational in Abilene, Texas as of April 2026
  • Abilene currently has four of its eight buildings operational, housing Nvidia Blackwell chips, with a planned expansion to 1.2 GW by Q4 2026 (down from the originally planned 2.1 GW)
  • The largest site is Doña Ana County, New Mexico at 2.2 GW projected capacity, followed by Shackelford County, Texas at 2 GW with a massive 1,200-acre campus
  • At least three sites will use on-site natural gas plants to bypass lengthy grid connection queues, while at least six will use closed-loop liquid cooling to avoid water evaporation concerns
  • SoftBank will own hardware at Milam County and Ohio sites, while Oracle owns hardware at remaining sites, with all sites serving OpenAI workloads
  • At the Shackelford County site, roofing is already underway on the first building, slated for late-2026 delivery; the campus spans 10 buildings powered by an on-site natural gas microgrid
  • Milam County is being built as a "fast-build" site by SoftBank subsidiary SB Energy with first building delivery targeted for October 2026
  • Port Washington, Wisconsin aims for 70% renewable power from solar, wind, and battery storage, branded as "sustainable-by-design"
  • The Lordstown, Ohio site is primarily a manufacturing facility for AI servers (SoftBank-Foxconn joint venture) with only minor data center capacity under 0.3 GW
  • Plans remain fluid even after construction begins, evidenced by OpenAI redirecting Abilene's planned expansion to other locations and Microsoft partnering with Crusoe for an adjacent 900 MW site
  • Political opposition poses risks, including a ban on future data centers in Lordstown and local resistance to the Michigan site
  • The trade-offs are clear: on-site power generation and closed-loop cooling save time but significantly increase facility costs compared to traditional data center designs
Decoder
  • Gigawatt (GW): Unit of power equal to one billion watts; 9 GW is roughly equivalent to New York City's peak electricity demand
  • H100-equivalent: A standardization metric based on Nvidia's H100 GPU computing power, used to compare different chip generations using 8-bit operations per second
  • IT power vs facility power: IT power is just the electricity for computing hardware, while facility power includes cooling, lighting, and infrastructure (typically 40-50% higher)
  • Closed-loop liquid cooling: A cooling system that recirculates liquid without evaporating water, avoiding public water consumption concerns but less energy-efficient than evaporative cooling
  • Microgrid: An on-site power generation system (often natural gas) that operates independently from the main electrical grid
  • Nvidia Blackwell/Rubin: Next-generation AI chips succeeding the H100, with Blackwell already deployed in Abilene and Rubin expected later
Original article

OpenAI Stargate: where the US sites stand

The $500 billion AI data center initiative is projected to exceed 9 gigawatts of capacity by 2029, with 0.3 gigawatts already operational in Abilene and six more US sites under active construction.

Updated April 23, 2026

Introduction

The United States is in the middle of an unprecedented build-out of AI infrastructure. No project illustrates the scale of that effort more than Stargate, a $500 billion endeavor involving AI developer OpenAI, cloud provider Oracle, and investment company SoftBank.

Stargate has seven locations across the US, all of which are now showing active development. The most advanced—in Abilene, Texas—is already operating at an estimated capacity of 0.3 gigawatts (GW).1 The six other sites include two more in Texas, as well as facilities in New Mexico, Wisconsin, Michigan, and Ohio. Together, the seven sites add up to over 9 GW of planned capacity, which is comparable to the peak power demand of New York City.2 This will be enough to power the equivalent of 20 million Nvidia H100 GPUs, which was the total amount of AI compute in the world by the end of 2025.3

Stargate's design choices reveal how builders are navigating the key challenges of gigawatt-scale AI data centers in the US. To sidestep lengthy queues for connecting to energy grids, at least three of the seven sites will make use of on-site natural gas plants. To address public concerns about water usage, at least six sites will use closed-loop liquid cooling systems, which do not evaporate water.4 These decisions will likely save the project time but raise the cost of the facilities.

Based on announcements from 2025, SoftBank will own the hardware at the Milam County and Ohio sites, while Oracle will own the hardware at the remaining sites. All sites will serve OpenAI's workloads.

The sites

Abilene, Texas

Current capacity: 0.3 GW | 250,000 H100-equivalents5

Projected capacity: 1.2 GW | 1.0 million H100-equivalents

Projected completion: Q4 2026

The Stargate project's flagship location is in Abilene, Texas. Built by AI infrastructure company Crusoe, Abilene is the most complete Stargate site to date, with an estimated four of the eight buildings already operational. These buildings house state-of-the-art Nvidia Blackwell chips.

Power is currently supplied by a mix of on-site natural gas and grid power, which includes local wind power.

OpenAI had planned to expand this site to 2.1 GW, but recently reversed course, deciding to direct that capacity to other locations. Microsoft has since partnered with Crusoe for the adjacent 900 MW site.

Shackelford County, Texas

Current capacity: 0 GW

Projected capacity: 2 GW | 4.2 million H100-equivalents

Projected completion: Q4 2028

Just across the county line from the Abilene site, data center developer Vantage is constructing a massive 1,200-acre (4.9-square-kilometer) campus with 10 buildings.

The campus will be powered by an onsite natural gas microgrid.

Vantage has given a delivery date for the site's first building of late 2026.6 Satellite imagery shows that roofing is underway for this building (visible in bright white).

Doña Ana County, New Mexico

Current capacity: 0 GW

Projected capacity: 2.2 GW | 4.6 million H100-equivalents

Projected completion: Q4 2028

In New Mexico, STACK Infrastructure is developing Project Jupiter, which consists of four large buildings. Satellite imagery shows that foundation work is underway.

This site will be powered by two natural gas microgrids designed to limit impact on the local grid.

Milam County, Texas

Current capacity: 0 GW

Projected capacity: 1.2 GW | 2.5 million H100-equivalents

Projected completion: Q4 2028

SoftBank subsidiary SB Energy is building and operating what is described as a "fast-build" site in Milam County, Texas, around 70 miles (110 kilometers) northeast of Austin. A satellite image from March shows steel framing and roofing for the first building (visible as a blue rectangle). Regulatory filings indicate this building will be delivered by October.

SB plans to fund and build new energy generation and storage to supply the majority of the campus's power.

Port Washington, Wisconsin

Current capacity: 0 GW

Projected capacity: 1.3 GW | 2.6 million H100-equivalents

Projected completion: Q4 2028

Vantage, which is also the developer behind the Shackelford County site, has broken ground on a campus named "Lighthouse" in Port Washington, just north of Milwaukee. Foundation work can be seen in satellite imagery.

The site is described as "sustainable-by-design," with 70% of power drawn from solar, wind, and battery storage.

Saline Township, Michigan

Current capacity: 0 GW

Projected capacity: 1.4 GW | 2.9 million H100-equivalents

Projected completion: Q4 2028

Related Digital is developing a campus dubbed "The Barn" in Saline Township, southwest of Detroit. Satellite imagery shows foundation work underway for the first building.

DTE Energy will provide 100% of the power, augmented by a battery storage system financed by the project.

Lordstown, Ohio

Current capacity: 0 GW

Projected capacity: <0.3 GW | <0.3 million H100-equivalents

Projected completion: Unknown

The seventh site is in Ohio, where some land has been cleared, but no large-scale data center construction is visible. The site is primarily a manufacturing facility for AI servers and data center equipment, operated as a joint venture between SoftBank and Foxconn. The capacity of the data center will likely be no more than 0.3 GW, with OpenAI announcing that the Milam County and Lordstown sites could scale to a combined 1.5 GW by 2027.

The Lordstown data center will likely draw power from the grid, as the Foxconn plant already has a substation connected.

The road ahead

At this point, the full $500 billion Stargate project is more than pure ambition. The build-out has started all over the US, leaving enough time to finish by 2029. However, there is a long road ahead for all seven sites. Plans can change even after construction begins, as shown by OpenAI pulling out of the Abilene expansion. Financing and procuring equipment will also be challenging at this unprecedented scale. Finally, political opposition is a real factor, as evidenced by a ban on future data centers in Lordstown and local opposition to the Michigan site. Epoch AI will be following the Stargate project and the broader data center build-out closely to see how this all pans out.

1 All stated power capacities refer to total facility power, including power for GPUs, cooling, lighting, etc. Power capacities for the Stargate sites have not been reported consistently as total facility power or IT power. For some sites, we estimated the total facility power based on the reported IT power. For example, Vantage reports Shackelford County as 1.4 GW of IT power. Given the hot summer climate of Texas and closed-loop cooling (which is less energy-efficient than evaporative cooling), we estimated the total facility power to be about 2 GW.
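The IT-to-facility conversion described in this footnote is a one-line calculation; the 45% overhead below is an illustrative midpoint of the article's 40-50% range, not a reported figure:

```python
def facility_power(it_power_gw, overhead=0.45):
    """Rough IT-to-facility power conversion: facility power adds cooling,
    lighting, and infrastructure overhead on top of IT power. The 0.45
    default is an illustrative midpoint of the 40-50% range."""
    return it_power_gw * (1 + overhead)
```

For example, `facility_power(1.4)` gives about 2.03 GW, consistent with the ~2 GW facility estimate derived here from Vantage's reported 1.4 GW of IT power.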

2 The NYISO 2025 Gold Book (p.30) forecasts about 11 GW of peak summer demand for New York City (Zone J) from 2026 through 2030. This represents the single highest hour of demand annually.

3 The H100 is just an example: the actual chips in these data centers will probably be Nvidia Blackwell, and later Nvidia Rubin. The total amount of compute in the world is based on the AI Chip Sales database, which estimates about 20 million H100-equivalents worth of AI chips sold by Q4 2025. The projected compute for the Stargate sites is estimated from the power capacities and the trend in energy efficiency for leading machine learning hardware—except for Abilene, which was disclosed by Crusoe to have 50,000 Blackwell GPUs per building.

4 Sources: Abilene, Shackelford County, Doña Ana County, Port Washington, Saline Township, and Lordstown. We did not find direct confirmation of a closed-loop system for Milam County, but it is designed to minimize water usage.

5 One H100-equivalent is the computing power equivalent to one Nvidia H100 GPU, measured in operations/second. The H100-equivalent unit uses a chip's highest 8-bit operations/second specification to convert between chips.

6 This is when the completed building is handed over to the tenant, not when the data center is fully operational.

Updates

Apr. 23, 2026:

We previously estimated that 0.6 GW was operational for Stargate Abilene. However, a subsequent post by Oracle implied that only 200 megawatts (or about 0.3 GW of total facility power by our estimate) was operational as of April 22nd. We updated the Stargate Abilene timeline accordingly. We now estimate that the 0.6 GW will be achieved in late May, while the full 1.2 GW will be achieved in Q4 2026.

Microsoft To Shift GitHub Copilot Users To Token-Based Billing, Tighten Rate Limits (4 minute read)

AI
Microsoft is pausing GitHub Copilot individual signups and shifting to token-based billing as weekly infrastructure costs have doubled since January.
What: Microsoft is temporarily suspending new signups for GitHub Copilot's individual and student tiers while transitioning from flat-rate request-based pricing to usage-based token billing. They're also removing Claude Opus models from the $10/month Pro tier and tightening rate limits across all subscription levels.
Why it matters: This represents a pivotal shift in AI economics, signaling the end of heavily subsidized AI coding assistants as companies confront the real cost of running large language models at scale. Microsoft is following Anthropic's recent move to token-based enterprise billing, suggesting the industry is moving away from unlimited usage models.
Takeaway: Developers relying on GitHub Copilot should prepare for usage-based billing where costs reflect actual token consumption, and those on cheaper plans may need to upgrade to access premium models like Claude Opus.
Deep dive
  • Leaked internal documents show Microsoft will pause signups for GitHub Copilot's $10 Pro tier, $39 Pro+ tier, and free student tier as costs spiral out of control
  • The weekly cost of operating GitHub Copilot has nearly doubled since January 2026, making token-based billing an urgent priority rather than a planned transition
  • Current pricing uses "requests": Pro accounts get 300/month and Pro+ gets 1,500/month, with different models consuming different numbers of requests via multipliers
  • Token-based billing will charge users for actual compute costs, similar to how Anthropic charges $5 per million input tokens and $25 per million output tokens for Claude Opus 4.7
  • Claude Opus models are being removed from the $10 Pro tier entirely, with Opus 4.6 and 4.5 being retired from Pro+ as well, leaving only Opus 4.7
  • The new Opus 4.7 model has a 7.5x request multiplier (each use counts as 7.5 requests), compared to Opus 4.6's 3x multiplier, making it 2.5 times as expensive even with promotional pricing
  • Microsoft is tightening rate limits across Business, Enterprise, Pro, and Pro+ tiers after previous April adjustments proved insufficient to control costs
  • The company is also suspending trials of paid individual plans to "fight abuse" as users have been burning far more tokens than their subscription fees cover
  • This follows a broader industry trend, with Anthropic recently moving enterprise customers to token-based billing to reduce its own infrastructure costs
  • Premium request multipliers reveal the actual cost differences: GPT-5.4 Mini uses 0.33x (cheap) while the retired Claude Opus 4.6 Fast used 30x (expensive)
  • Microsoft has been subsidizing AI compute costs like most competitors, but the party appears to be ending as real economics force pricing adjustments across the industry
Decoder
  • Tokens: The basic units of text that language models process, roughly equivalent to word fragments; billing by tokens means charging for actual compute consumption rather than flat rates
  • Request multipliers: A system where different AI models consume different amounts of your monthly request quota based on their computational cost (e.g., 7.5x means one use counts as 7.5 requests)
  • Rate limits: Maximum usage caps that restrict how many API calls or requests users can make within a time period to control infrastructure costs
  • Token burn: The computational resources consumed when processing input and output through language models, which translates directly to infrastructure costs for providers
Original article

Executive Summary: 

  • Internal documents reveal that Microsoft plans to temporarily suspend individual account signups to its GitHub Copilot coding product, as it transitions from requests (single interactions with Copilot) towards token-based billing.
  • The documents reveal that the weekly cost of running GitHub Copilot has doubled since the start of the year.
  • Microsoft also intends to tighten the rate limits on its individual and business accounts, and to remove access to certain models for those with the cheapest subscriptions.

Leaked internal documents viewed by Where's Your Ed At reveal that Microsoft intends to pause new signups for the student and paid individual tiers of AI coding product GitHub Copilot, tighten rate limits, and eventually move users to "token-based billing," charging them according to the actual cost of their token burn.

Explainer: At present, GitHub Copilot users get a certain number of "requests" (interactions where you ask the model to do something), with Pro ($10-a-month) accounts getting 300 a month and Pro+ ($39-a-month) getting 1,500. More expensive models use more requests, cheaper ones use fewer (I'll explain in a bit).

Moving to "token-based billing" would mean that instead of using "requests," GitHub Copilot users would pay for the actual cost of tokens. For example, Claude Opus 4.7 costs $5 per million input tokens (stuff you feed in) and $25 per million output tokens (stuff the model outputs, including tokens for chain-of-thought reasoning).
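As a rough illustration of what token-based billing means in dollars, here is the arithmetic using the Opus 4.7 prices quoted above. The helper function is hypothetical; Copilot's actual token prices have not been announced.

```python
def request_cost_usd(input_tokens: int, output_tokens: int,
                     input_price_per_m: float = 5.0,
                     output_price_per_m: float = 25.0) -> float:
    """Dollar cost of one request at per-million-token prices.
    Defaults are the Claude Opus 4.7 rates quoted in the article."""
    return (input_tokens * input_price_per_m +
            output_tokens * output_price_per_m) / 1_000_000

# A coding request with a large context and a long, reasoning-heavy reply:
print(request_cost_usd(50_000, 10_000))  # 0.5 (fifty cents per request)
```

At these rates a heavy user making dozens of such requests per day would burn through a $10 monthly fee in well under a week, which is the gap between subscription revenue and compute cost the article describes.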

Token-Based Billing

The document says that although token-based billing has been a top priority for Microsoft, it became more urgent in recent months, with the week-over-week cost of running GitHub Copilot nearly doubling since January.

The move to token-based billing will see GitHub users charged based on their usage of the platform, and how many tokens their prompts consume — and thus, how much compute they use. It's unclear at this time when this will begin.

This is a significant move, reflecting the significant cost of running models on any AI product. Much like Anthropic, OpenAI, Cursor, and every other AI company, Microsoft has been subsidizing the cost of compute, allowing users to burn way, way more in tokens than their subscriptions cost.

The party appears to be ending for subsidized AI products, with Microsoft's upcoming move following Anthropic's (per The Information) recent changes shifting enterprise users to token-based billing as a means of reducing its costs.

Pauses on Signups for Individual and Student Tiers

GitHub Copilot currently has two tiers for individual developers — a $10-per-month package called GitHub Copilot Pro, and a $39-a-month subscription called GitHub Copilot Pro+.

According to the leaked documents, both of these tiers will be impacted by the shutdown, as will the GitHub Copilot Student product, which is included within the free GitHub Education package.

Removing Opus From GitHub Copilot Pro, Rate Limits Tightened on GitHub Copilot Pro, Pro+, Business, Enterprise

According to the documents, Microsoft also intends to tighten rate limits on some Copilot Business and Enterprise plans, as well as on individual plans, where limits have already been squeezed, and plans to suspend trials of paid individual plans as it attempts to "fight abuse."

Although Microsoft has regularly tweaked the rate limits for individual GitHub Copilot accounts, most recently at the start of April, the document notes that these changes weren't enough and that more rate-limit changes are to come in the next few weeks.

As part of this cost-cutting exercise, Microsoft intends to remove Anthropic's Opus family of AI models from the $10-per-month GitHub Copilot Pro package altogether.

Microsoft most recently retired Opus 4.6 Fast at the start of April for GitHub Copilot Pro+ users, although this decision was framed as a way to "further improve service reliability" and "[streamline] our model offerings and [focus] resources on the models our users use the most."

Other Opus models — namely Opus 4.6 and Opus 4.5 — will be removed from the GitHub Copilot Pro+ tier in the coming weeks, as Microsoft transitions to Anthropic's latest Opus 4.7 model.

The move towards Opus 4.7 will likely see GitHub Copilot Pro+ users reach their usage limits faster.

Microsoft is offering a 7.5x request multiplier until April 30, although it's unclear what the multiplier will be after that date. This might sound like a good thing, but it means each request using Opus 4.7 counts as 7.5 of them. Redditors immediately worked that out and are a little bit worried.

Premium request multipliers allow GitHub to reflect the cost of compute for different models. LLMs that require the most compute will have higher premium request multipliers compared to those that are comparatively more lightweight.

For example, the GPT-5.4 Mini model has a premium request multiplier of 0.33 — meaning that every prompt is treated as one-third of a premium request — whereas the now-retired Claude Opus 4.6 Fast had a 30x multiplier, meaning each request was treated as thirty of them.

The standard version of Claude Opus 4.6 has a premium request multiplier of three, meaning that, even with the promotional pricing, Claude Opus 4.7 costs 2.5 times as much to use.
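The multiplier arithmetic in the last few paragraphs translates directly into how many model invocations a quota buys. All figures below come from the article; the helper function is illustrative only.

```python
def uses_per_quota(monthly_requests: int, multiplier: float) -> int:
    """How many model invocations a monthly request quota allows,
    given the model's premium request multiplier."""
    return int(monthly_requests / multiplier)

pro_quota = 300  # GitHub Copilot Pro monthly requests, per the article

print(uses_per_quota(pro_quota, 0.33))  # GPT-5.4 Mini: 909 uses
print(uses_per_quota(pro_quota, 3.0))   # Claude Opus 4.6: 100 uses
print(uses_per_quota(pro_quota, 7.5))   # Claude Opus 4.7: 40 uses
print(7.5 / 3.0)  # Opus 4.7 consumes 2.5x the requests of standard Opus 4.6
```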

The announcements for all of these changes are scheduled to take place throughout the week.

Claude can now build live artifacts (1 minute read)

AI
Claude now creates live artifacts that are dashboards and trackers connected to apps and files, refreshing with current data instead of remaining static.
What: Claude's Cowork feature can now build live artifacts—dashboards and tracking tools that connect to external apps and files and automatically refresh with current data when opened, rather than being one-time static outputs.
Why it matters: This shifts AI assistance from generating static snapshots to creating persistent, dynamic tools that stay synchronized with your actual data, making AI outputs more useful over time.
Takeaway: If you use Claude, explore the Cowork feature to create live dashboards that connect to your frequently-used apps and data sources.
Decoder
  • Artifacts: Claude's term for standalone deliverables it creates (like documents, code, or visualizations) that appear in a separate panel rather than inline in chat
  • Cowork: A collaborative workspace feature in Claude where the AI can interact with connected apps and files
Original article

Live artifacts are dashboards and trackers connected to apps and files with current data.

Anthropic and Amazon expand collaboration for up to 5 gigawatts of new compute (4 minute read)

AI
Anthropic secures massive 5 gigawatt compute deal with Amazon backed by $5-25 billion investment as Claude revenue hits $30 billion run rate.
What: Anthropic and Amazon have expanded their partnership with a 10-year, $100+ billion commitment securing up to 5 gigawatts of compute capacity on AWS Trainium chips to train and serve Claude. Amazon is also investing an additional $5 billion immediately and up to $20 billion more in the future, on top of the $8 billion it has already invested.
Why it matters: The scale is staggering—5GW represents one of the largest AI infrastructure commitments announced to date, while Claude's revenue more than tripled in about four months from $9 billion to $30 billion run rate. The announcement also reveals that rapid growth has strained Anthropic's infrastructure, causing reliability issues for users, highlighting the massive capital requirements for frontier AI companies.
Takeaway: Developers using Claude should expect improved reliability and performance in coming months, and enterprise AWS customers can soon access the full Claude Platform directly within their AWS accounts without separate credentials.
Deep dive
  • Anthropic commits over $100 billion over 10 years for AWS compute, securing up to 5GW capacity across Trainium2 through Trainium4 chips with options for future generations
  • Amazon invests $5 billion immediately with up to $20 billion more later, adding to $8 billion previously invested for a potential total of $33 billion
  • Claude's revenue run rate hit $30 billion, up from approximately $9 billion at end of 2025—more than tripling in roughly four months
  • Anthropic currently uses over 1 million Trainium2 chips and operates Project Rainier, one of the largest compute clusters in the world
  • Rapid growth has strained infrastructure, causing reliability and performance issues especially for consumer users across free, Pro, Max, and Team tiers during peak hours
  • Nearly 1GW of new Trainium2 and Trainium3 capacity expected by end of 2026, with significant Trainium2 capacity coming online in Q2 2026
  • Claude Platform will be available directly within AWS with same account, controls, and billing—no separate credentials or contracts required
  • Over 100,000 customers run Claude on Amazon Bedrock, with planned expansion of inference capacity in Asia and Europe
  • Claude is the only frontier AI model available on all three major cloud platforms: AWS Bedrock, Google Cloud Vertex AI, and Microsoft Azure Foundry
  • The deal reflects Anthropic's diversified hardware strategy spreading workloads across different chip types to mitigate supply and reliability risks
Decoder
  • Gigawatt (GW): Unit of power equal to one billion watts; in data center context, refers to the power consumption capacity for running AI compute infrastructure
  • Trainium: Amazon's custom AI training chips designed for machine learning workloads, with Trainium2, 3, and 4 being successive generations
  • Graviton: Amazon's custom ARM-based processors for general-purpose computing workloads
  • Amazon Bedrock: AWS's managed service providing API access to foundation models like Claude
  • Run-rate revenue: Annualized revenue projection based on current monthly or quarterly performance trends
  • Claude Platform: Anthropic's full suite of tools and APIs for building with Claude, beyond just basic model access
Original article

Anthropic and Amazon expand collaboration for up to 5 gigawatts of new compute

We have signed a new agreement with Amazon that will deepen our existing partnership and secure up to 5 gigawatts (GW) of capacity for training and deploying Claude, including new Trainium2 capacity coming online in the first half of this year and nearly 1GW total of Trainium2 and Trainium3 capacity coming online by the end of 2026.

We have worked closely with Amazon since 2023 and over 100,000 customers now run Claude on Amazon Bedrock. Together we launched Project Rainier, one of the largest compute clusters in the world, and we currently use over one million Trainium2 chips to train and serve Claude. Today's agreement expands our collaboration in three ways.

Infrastructure at scale

We are committing more than $100 billion over the next ten years to AWS technologies, securing up to 5GW of new capacity to train and run Claude. The commitment spans Graviton and Trainium2 through Trainium4 chips, with the option to purchase future generations of Amazon's custom silicon as they become available.

Significant Trainium2 capacity is coming online in Q2 and scaled Trainium3 capacity is expected to come online later this year. Anthropic will also use incremental capacity for Claude in Amazon Bedrock. The agreement includes expansion of inference in Asia and Europe to better serve Claude's growing international customer base. We continue to choose AWS as our primary training and cloud provider for mission-critical workloads.

"Our custom AI silicon offers high performance at significantly lower cost for customers, which is why it's in such hot demand," said Andy Jassy, CEO of Amazon. "Anthropic's commitment to run its large language models on AWS Trainium for the next decade reflects the progress we've made together on custom silicon, as we continue delivering the technology and infrastructure our customers need to build with generative AI."

Claude Platform on AWS

The full Claude Platform will be available directly within AWS. Same account, same controls, same billing, with more Claude Platform features and no additional credentials or contracts necessary. This gives organizations direct access to Claude while meeting their existing governance and compliance requirements. Claude remains the only frontier AI model available to customers on all three of the world's largest cloud platforms: AWS (Bedrock), Google Cloud (Vertex AI), and Microsoft Azure (Foundry). Claude Platform on AWS is coming soon. Reach out to your account team to request access.

Continued investment

Amazon is investing $5 billion in Anthropic today, with up to an additional $20 billion in the future. This builds on the $8 billion Amazon has previously invested.

"Our users tell us Claude is increasingly essential to how they work, and we need to build the infrastructure to keep pace with rapidly growing demand," said Dario Amodei, CEO and co-founder of Anthropic. "Our collaboration with Amazon will allow us to continue advancing AI research while delivering Claude to our customers, including the more than 100,000 building on AWS."

Meeting record demand

Enterprise and developer demand for Claude has accelerated in 2026, and alongside it we have experienced a sharp rise in consumer usage across our free, Pro, and Max tiers. Our run-rate revenue has now surpassed $30 billion, up from approximately $9 billion at the end of 2025. Growth at this pace places an inevitable strain on our infrastructure; our unprecedented consumer growth, in particular, has impacted reliability and performance for free, Pro, Max, and Team users, especially during peak hours.

Today's agreement will quickly expand our available capacity, delivering meaningful compute in the next three months and nearly 1GW in total before the end of the year. Combined with additional capacity expansions and our diversified hardware strategy, with workloads spread across a range of chips, we are building the infrastructure needed to keep Claude at the frontier and reliably serve our growing customer base.

To learn more about Anthropic on AWS, visit: https://aws.amazon.com/bedrock/anthropic/.

Updated April 21st to clarify Claude Platform on AWS is coming soon.

The Rise of Apple's New CEO: A Hardware Expert Takes Over in the AI Era (7 minute read)

Tech
John Ternus, a mechanical engineer who led all of Apple's hardware engineering, will replace Tim Cook as CEO on September 1 after 25 years at the company.
What: John Ternus will become Apple's CEO on September 1, 2026, with current CEO Tim Cook transitioning to executive chairman. Ternus is a hardware engineering veteran who most recently oversaw hardware engineering for all Apple products and previously led the Mac hardware division.
Why it matters: The leadership transition marks Apple's first CEO change in over a decade and raises questions about how a hardware-focused executive will navigate the company through an AI-driven era where software and machine learning capabilities are increasingly critical to product differentiation.
Original article

John Ternus will become Apple's new CEO after a 25 year career on September 1. Current CEO Tim Cook will become executive chairman. Ternus is a mechanical engineer by background and he most recently led hardware engineering for all of Apple's products. This article takes a look at who Ternus is and his history at Apple.

Amazon to invest up to another $25 billion in Anthropic as part of AI infrastructure deal (4 minute read)

Tech
Amazon commits up to $25 billion more to Anthropic in exchange for the AI startup spending over $100 billion on AWS infrastructure over the next decade.
What: Amazon is investing an additional $25 billion into Anthropic ($5 billion immediately at a $380 billion valuation, plus up to $20 billion tied to milestones), while Anthropic commits to spending over $100 billion on Amazon Web Services and using Amazon's Trainium AI chips for the next 10 years. Anthropic secured 5 gigawatts of capacity for training and deploying its Claude models.
Why it matters: This deal represents a new model where cloud providers effectively lock in AI companies as mega-customers in exchange for capital, turning investments into guaranteed infrastructure spending that reshapes competitive dynamics in both AI development and cloud computing.
Takeaway: Developers building on Claude should expect improved reliability and performance as Anthropic brings nearly 1 gigawatt of new Trainium capacity online by year-end.
Deep dive
  • Amazon's total commitment to Anthropic reaches $33 billion while simultaneously investing $50 billion in rival OpenAI, hedging bets across leading AI startups
  • The deal converts investment into guaranteed revenue: Anthropic must spend over $100 billion on AWS over 10 years, far exceeding Amazon's $25 billion investment
  • Anthropic secures 5 gigawatts of capacity and commits to running Claude on Amazon's Trainium chips, rather than industry-standard Nvidia GPUs, for the next decade
  • The $5 billion immediate investment values Anthropic at $380 billion despite the company having been founded only in 2021, reflecting AI market exuberance
  • Anthropic's $30 billion annualized revenue demonstrates rapid commercial success, but the company admits infrastructure strain is impacting reliability
  • The remaining $20 billion is tied to unspecified commercial milestones, creating performance incentives beyond just technical development
  • Anthropic maintains relationships with Microsoft ($5B investment, $30B Azure commitment) and Google/Broadcom partnerships despite AWS being primary provider
  • OpenAI's public criticism that Anthropic made a "strategic misstep to not acquire enough compute" appears prescient given this infrastructure scramble
  • Amazon expects to spend $200 billion on AI capital expenditures in 2026, with this deal helping justify that massive buildout
  • The arrangement represents a new cloud-provider business model: trade capital for decade-long infrastructure lock-in with the hottest AI companies
Decoder
  • Trainium: Amazon's custom AI accelerator chips designed as an alternative to Nvidia GPUs for training and running large language models
  • Gigawatts of capacity: Measure of total power consumption for AI infrastructure; 5 gigawatts could power millions of GPUs worth of compute
  • Hyperscalers: The largest cloud computing providers (Amazon, Microsoft, Google) competing to build massive AI infrastructure
  • Annualized revenue: Current revenue rate projected over a full year, indicating Anthropic's monthly revenue multiplied by 12
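To sanity-check the "millions of GPUs" gloss above: assuming roughly 1.5 kW of all-in facility power per deployed accelerator (an assumed figure covering the chip plus cooling and networking overhead, not one from the article), the arithmetic works out as follows.

```python
capacity_w = 5e9       # 5 gigawatts of secured capacity, per the article
watts_per_gpu = 1_500  # assumed all-in facility power per accelerator

gpus = capacity_w / watts_per_gpu
print(f"{gpus:,.0f} accelerators at these assumptions")
```

Even doubling the per-chip power assumption still leaves the figure well above a million accelerators, which is why 5 GW counts among the largest AI infrastructure commitments announced to date.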
Original article

Amazon has agreed to invest up to $25 billion in Anthropic, on top of the $8 billion that it has poured into the artificial intelligence startup in recent years, as part of an expanded agreement to build out AI infrastructure.

In the announcement on Monday, Anthropic said it's committed to spending more than $100 billion on Amazon Web Services technologies over the next 10 years, including current and future generations of Trainium, Amazon's custom AI chips. Anthropic said it's secured up to 5 gigawatts of capacity for training and deploying its Claude AI models.

"Anthropic's commitment to run its large language models on AWS Trainium for the next decade reflects the progress we've made together on custom silicon, as we continue delivering the technology and infrastructure our customers need to build with generative AI," Amazon CEO Andy Jassy said in a statement.

Amazon's investment includes $5 billion into Anthropic now, with up to $20 billion in the future tied to "certain commercial milestones," according to a release. The initial investment is at Anthropic's latest valuation of $380 billion.

Anthropic said in the release that it will bring nearly 1 gigawatt total of Trainium2 and Trainium3 capacity online by the end of the year.

With all of the major hyperscalers competing to build out AI capacity as quickly as possible, Amazon said in February that it expects to shell out roughly $200 billion this year on capital expenditures, mostly on AI infrastructure.

Amazon's investment lands just two months after the e-commerce giant agreed to invest up to $50 billion in OpenAI, Anthropic's chief rival. The two AI companies have been racing to convince investors of their strengthening positions ahead of potential IPOs that could land as soon as this year. OpenAI executives have been criticizing Anthropic in recent months for making a "strategic misstep to not acquire enough compute."

Anthropic said on Monday that enterprise and developer demand for Claude, as well as a "sharp rise" in consumer usage, has led to "inevitable strain" on its infrastructure that has impacted its reliability and performance. The company said its new agreement with Amazon will quickly expand its available capacity.

"Our users tell us Claude is increasingly essential to how they work, and we need to build the infrastructure to keep pace with rapidly growing demand," Anthropic CEO Dario Amodei said in a statement. "Our collaboration with Amazon will allow us to continue advancing AI research while delivering Claude to our customers, including the more than 100,000 building on AWS."

Anthropic was founded in 2021 by a group of researchers and executives who defected from OpenAI. The company is best known for its family of Claude AI models and it's found early success selling to enterprises. Annualized revenue has topped $30 billion.

Anthropic named AWS its primary cloud provider in 2023 and its primary training partner in 2024, but the company has also inked deals with competing providers, including Microsoft and Google.

In November, Microsoft agreed to invest up to $5 billion into Anthropic, and Anthropic said it committed to purchasing $30 billion of Azure compute capacity. Earlier this month, Anthropic expanded its partnerships with Google and Broadcom for "multiple gigawatts" of capacity.

What Physical ‘Life Force' Turns Biology's Wheels? (14 minute read)

Tech
Scientists have finally cracked how the bacterial flagellar motor works after 50 years, revealing a molecular machine that uses flowing protons to spin hundreds of times per second and propel bacteria toward food.
What: The bacterial flagellar motor is a self-assembling molecular machine that enables single-celled bacteria to swim toward nutrients by rotating a tail-like flagellum at several hundred revolutions per second. Recent cryo-electron microscopy breakthroughs (2020-2026) revealed exactly how small protein cogwheels turn the larger base ring, and how the motor reverses direction when bacteria need to change course.
Why it matters: The motor's workings illuminate the fundamental physical force that powers all cellular life: the proton motive force, where thousands of protons per second flow into cells while being continuously pumped back out, creating a current that biological machines harness to do work.
Deep dive
  • Howard Berg discovered the flagellar motor in the 1970s by inventing an automatic tracking microscope to follow fast-moving bacteria, hypothesizing rotation 50 years before the complete mechanism was understood
  • Bacteria "run and tumble" by switching between forward swimming (counterclockwise rotation) and chaotic rolling (clockwise rotation) to navigate toward higher concentrations of nutrients
  • The motor consists of a C-ring of 34 proteins at the flagellum's base, surrounded by 10-12 smaller "stator" complexes that act like turnstiles
  • Each stator has a pentagonal ring of 5 proteins surrounding 2 central proteins, a 5:2 geometry revealed by cryo-EM studies in 2020
  • Over 2,000 protons per second flow through these pentagonal turnstiles, each proton advancing a turnstile's ring one-tenth of a revolution; the turning stators collectively spin the larger C-ring
  • Direction switching occurs when phosphorylated CheY proteins bind to the C-ring in response to declining nutrient levels, causing the entire ring to snap into an alternate configuration like a hair clip
  • In the flipped state, the clockwise-rotating stators engage the inner edge of the C-ring instead of the outer edge, making the C-ring also turn clockwise and causing the bacterial bundle to fall apart
  • The final pieces of the puzzle were published as recently as March 2026, when researchers confirmed the system responds to a single signaling molecule
  • The proton motive force was proposed by Peter Mitchell in 1961, initially ridiculed but ultimately earning a 1978 Nobel Prize in Chemistry
  • Bacteria maintain fewer than 100 free protons inside while the surrounding water has tens of thousands, creating a concentration gradient that drives protons inward while electron transport chains pump them back out
  • The system maintains this steady state at remarkable speed, with thousands of protons flowing in and being pumped back out every second while the internal concentration stays low
  • If the proton flow is interrupted (such as when cells starve), the voltage drops instantly and all cellular machinery shuts down
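The numbers in the bullets above are roughly self-consistent. Reading the quoted flux as per-stator (an interpretation; the article does not state it explicitly), 2,000 protons per second at one-tenth of a revolution each would turn a stator's ring at 200 revolutions per second, the same order of magnitude as the motor's overall speed:

```python
protons_per_second = 2_000  # proton flux through one stator, per the article
revs_per_proton = 0.1       # each proton advances the ring 1/10 revolution

stator_rev_per_s = protons_per_second * revs_per_proton
print(stator_rev_per_s)  # 200.0 rev/s, consistent with "several hundred rev/s"
```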
Decoder
  • Flagellar motor: A rotating molecular machine at the base of bacterial flagella (tail-like appendages) that spins to propel bacteria through water
  • Cryo-EM: Cryogenic electron microscopy, an imaging technique that flash-freezes samples to reveal molecular structures at near-atomic resolution
  • C-ring: The cytoplasmic ring of 34 proteins at the motor's base that rotates to turn the flagellum
  • Stators: Small protein complexes that anchor above the C-ring and act as motors, with pentagonal rings that rotate when protons flow through them
  • Proton motive force: The driving force created by protons constantly flowing into cells (due to concentration gradients) while being actively pumped back out
  • Phosphorylation: The process of attaching phosphate groups to proteins, which changes their behavior and triggers cellular responses
  • CheY proteins: Signaling molecules that, when phosphorylated, bind to the C-ring and trigger the motor to switch rotation direction
  • Electron transport chains: Molecular machines in cell membranes that pump protons out of the cell, maintaining the proton gradient
Original article

This article tells the story of how scientists figured out how the flagellar motor works. The motor was discovered by Howard Berg, who set out in the early 1970s to apply his training in physics to understand how bacteria move. Because bacteria move quickly, Berg had to invent and build an automatic tracking microscope to keep them in view. He hypothesized that the flagellum rotated 50 years before scientists worked out the complete mechanism.

Quantum Computers Are Not a Threat to 128-bit Symmetric Keys (23 minute read)

Tech
AES-128 and SHA-256 are safe against quantum computers, and the common belief that quantum computing halves symmetric key security is a misconception that could waste resources during post-quantum transitions.
What: A detailed technical analysis explaining why symmetric encryption algorithms like AES-128 and SHA-256 remain secure against quantum computers, despite widespread belief that 256-bit keys are needed for quantum resistance.
Why it matters: The misconception about quantum threats to symmetric keys could divert critical resources away from the urgent work of transitioning asymmetric cryptography (RSA, ECDH, ECDSA) which is genuinely vulnerable to quantum attacks via Shor's algorithm.
Takeaway: Focus post-quantum cryptography migration efforts exclusively on replacing asymmetric primitives (key exchange and digital signatures), not on upgrading AES-128 or SHA-256 which can remain unchanged.
Deep dive
  • Grover's algorithm provides quadratic speedup for searching (sqrt(N) instead of N operations), commonly misinterpreted as "halving" AES security from 128 bits to 64 bits
  • The critical limitation: Grover's algorithm requires sequential operations and cannot be efficiently parallelized like classical brute force attacks
  • When parallelizing Grover's algorithm across multiple quantum computers by partitioning the search space, the quadratic speedup degrades significantly: splitting the work across 2^16 machines speeds the search up by only a factor of 2^8, not 2^16
  • Concrete attack cost: breaking AES-128 would require approximately 140 trillion quantum circuits with 724 logical qubits each running continuously for 10 years
  • The depth-width (DW) cost is approximately 2^104.5, and unlike Shor's algorithm, there's little room for optimization (only 17 bits come from potentially optimizable circuit parameters)
  • Comparison: breaking AES-128 with Grover is 2^78.5 (430 quintillion) times more expensive than breaking 256-bit elliptic curves with Shor's algorithm
  • NIST explicitly designates AES-128 as the security benchmark for Category 1 post-quantum algorithms and states that all AES key sizes (128, 192, 256) remain approved through 2035 and beyond
  • NIST's MAXDEPTH concept formalizes how the required sequential computation forces parallelization that limits Grover's practical advantage
  • German BSI and quantum computing expert Samuel Jaques independently reach the same conclusion using similar analysis
  • CNSA 2.0 requiring 256-bit keys is not a quantum adjustment—it's maintaining consistency with Suite B's Top Secret requirements for a uniform "256-bit security level" across all primitives
  • The practical challenge ignored in theoretical analysis: maintaining quantum coherence for a decade-long computation is essentially impossible with any foreseeable technology
  • Each logical T-gate in surface code architecture requires 2^16 physical operations, adding even more overhead not captured in the base estimates
  • Resources are finite: unnecessary symmetric key transitions create churn, complexity, and interoperability issues that detract from the urgent work of replacing quantum-vulnerable asymmetric cryptography
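
The parallelization penalty in the bullets above is simple arithmetic. As a rough illustration of my own (not from the article): Grover's search over N keys takes about sqrt(N) sequential steps, so splitting the key space across M machines only shrinks each machine's depth by sqrt(M), while total work grows.

```python
def grover_parallel(key_bits: int, log2_machines: int):
    """Grover's search over 2**key_bits keys split across 2**log2_machines
    machines. Each machine searches its partition of 2**(key_bits - log2_machines)
    keys in ~sqrt() iterations, so depth shrinks only by sqrt(#machines).
    Returns (log2 of per-machine depth, log2 of total work)."""
    n, m = key_bits, log2_machines
    depth_per_machine = (n - m) / 2       # log2 of sequential Grover iterations
    total_work = m + (n - m) / 2          # log2 of (machines * depth per machine)
    return depth_per_machine, total_work

# Classical brute force parallelizes perfectly: 2**16 machines cut the
# per-machine work by a full factor of 2**16.
classical_depth = 128 - 16                # log2 of work per machine

# Grover: 2**16 machines cut per-machine depth only from 2**64 to 2**56
# (a factor of 2**8 = sqrt(2**16)), while total work *grows* to 2**72.
depth, total = grover_parallel(128, 16)
print(classical_depth, depth, total)      # 112 56.0 72.0
```

This is why throwing hardware at Grover's, unlike at classical brute force, makes the attack more expensive overall rather than proportionally faster.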
Decoder
  • Grover's algorithm: A quantum algorithm that can search through an unsorted database of N items in roughly sqrt(N) steps instead of N steps, providing quadratic speedup
  • Shor's algorithm: A quantum algorithm that can break RSA and elliptic curve cryptography exponentially faster than classical computers, making current asymmetric cryptography vulnerable
  • Symmetric cryptography: Encryption where the same key is used for both encryption and decryption (like AES), as opposed to asymmetric cryptography which uses public/private key pairs
  • Post-quantum cryptography (PQC): Cryptographic algorithms designed to be secure against both classical and quantum computer attacks
  • ML-KEM / ML-DSA: Module-Lattice-based Key Encapsulation Mechanism and Digital Signature Algorithm, NIST's new post-quantum standards replacing ECDH and RSA/ECDSA respectively
  • Logical qubit vs physical qubit: A logical qubit is an error-corrected qubit implemented using many physical qubits; quantum error correction requires thousands of physical qubits to create one reliable logical qubit
  • T-gate: A specific quantum gate operation that is particularly expensive in error-corrected quantum computing, often used as the unit for measuring quantum circuit cost
  • Depth × Width (DW) cost: A measure of quantum circuit cost where depth is sequential operations and width is parallel qubits, roughly analogous to CPU cycles × cores for classical computing
  • Surface code: A leading quantum error correction architecture that encodes logical qubits in a 2D lattice of physical qubits
  • CNSA 2.0: Commercial National Security Algorithm Suite 2.0, the NSA's cryptographic standard for protecting national security systems
  • Birthday bound: In cryptography, the phenomenon where collision probability grows with the square of attempts, requiring double the bit length (e.g., 256-bit hash for 128-bit collision security)
Original article

Both AES-128 and SHA-256 are safe against quantum computers. No symmetric key sizes have to change as part of the post-quantum transition. Almost all experts agree on this. The misconception is usually based on a misunderstanding of the applicability of Grover's algorithm.

Jujutsu megamerges for fun and profit (13 minute read)

Tech
Jujutsu's megamerge workflow lets developers work on top of all their branches simultaneously by creating octopus merge commits with many parents, eliminating context switching friction.
What: A detailed guide to using "megamerges" in Jujutsu, a version control system where you create a single merge commit that combines all your active feature branches, bug fixes, and work-in-progress changes as parents, then work on top of this combined view instead of switching between individual branches.
Why it matters: This workflow catches integration issues immediately since you're always working with all changes combined, eliminates surprise merge conflicts when submitting PRs, and makes it trivial to jump between tasks or create small drive-by fix PRs without VCS gymnastics.
Takeaway: Try Jujutsu with the provided config aliases for stack, stage, and restack commands to experiment with the megamerge workflow on a non-critical project.
Deep dive
  • Merge commits in version control are just regular commits with multiple parents, not special entities, and can have three or more parents (octopus merges)
  • The megamerge workflow involves creating one octopus merge commit as a child of every branch you're working on, then doing all work on top of this combined state
  • Key benefits include always compiling/testing against the full combination of your work, discovering merge conflicts immediately, and switching tasks by just editing different files without VCS commands
  • The megamerge itself never gets pushed to remote repositories, only the individual branches that compose it get published as separate PRs
  • Creating a megamerge is simple: jj new x y z followed by jj commit --message "megamerge" creates the base, with work happening in commits above
  • Getting WIP changes into proper branches uses jj absorb (automatically identifies which downstream commit each change belongs to) or jj squash --interactive for manual control
  • Custom aliases like stack and stage automate inserting new branches into the megamerge structure using revset queries
  • The restack alias solves the challenge of rebasing only your mutable commits onto trunk while leaving other contributors' branches untouched
  • The workflow relies heavily on Jujutsu's first-class conflict support and powerful revset language for targeting specific commits
  • Revset functions like closest_merge(to) find the nearest merge commit ancestor, while roots(trunk()..) & mutable() identifies rebaseable commits
  • The --simplify-parents flag cleans up redundant edges in the commit graph after complex rebase operations
  • Practical usage involves absorb/squash for modifying existing commits, rebase for new commits, and stage/stack for moving entire branches into the megamerge
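
The structural point above, that a merge commit is just a regular commit with several parents, can be modeled in a few lines. This is a toy illustration, not Jujutsu's actual data model:

```python
from dataclasses import dataclass, field

@dataclass
class Commit:
    """Toy commit: just a message and a list of parent commits."""
    message: str
    parents: list = field(default_factory=list)

def is_octopus(c: Commit) -> bool:
    # An "octopus" merge is nothing special: a commit with 3+ parents.
    return len(c.parents) >= 3

# Three independent branches of work...
feature = Commit("feature: add widget")
bugfix = Commit("fix: null check")
docs = Commit("docs: update README")

# ...combined into one megamerge, the moral equivalent of `jj new x y z`.
megamerge = Commit("megamerge", parents=[feature, bugfix, docs])
wip = Commit("wip", parents=[megamerge])  # day-to-day work sits on top

print(is_octopus(megamerge))  # True
```

Because the megamerge is an ordinary commit, every normal operation (rebase, squash, abandon) applies to it; the workflow is built from plumbing the VCS already has.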
Decoder
  • Jujutsu (jj): A Git-compatible version control system with better support for workflows like handling conflicts as first-class objects and powerful commit manipulation
  • Octopus merge: A merge commit with three or more parents, combining multiple branches simultaneously rather than the typical two-parent merge
  • Megamerge: An octopus merge commit that combines all of a developer's active branches, serving as the base for all work
  • Revset: Jujutsu's query language for selecting sets of commits using predicates and set operations
  • absorb: A command that automatically identifies which downstream commits your current changes belong to and squashes them appropriately
  • trunk(): A revset alias typically referring to the main development branch (like main or master in Git)
  • Mutable/immutable commits: Jujutsu's concept of which commits you're allowed to modify, protecting published or shared work from accidental rewrites
Original article

Jujutsu megamerges let developers work on many different streams of work simultaneously. This article provides an in-depth explanation of how they work. Megamerges are a way of showing developers the whole picture and are not really meant to be pushed to remote. Developers will still want to publish branches individually as usual.

The AI engineering stack we built internally — on the platform we ship (14 minute read)

Tech
Cloudflare built an internal AI coding stack on their own platform that achieved 93% R&D adoption and nearly doubled weekly merge requests from 5,600 to 8,700.
What: Cloudflare developed a complete AI engineering infrastructure over 11 months, including MCP servers for tool access, AI Gateway for request routing, Workers AI for inference, automated code reviewers, and a knowledge graph of 16,000+ entities—all running on their shipping platform products.
Why it matters: This is a detailed architectural blueprint for enterprise AI tooling at scale, showing how to solve real problems like authentication, cost tracking, context management, and quality enforcement while processing 240+ billion tokens per month across 3,683 users.
Takeaway: Cloudflare released the same stack as shipping products during "Agents Week"—developers can start with `npx create-cloudflare@latest --template cloudflare/agents-starter` to build similar systems.
Deep dive
  • Cloudflare achieved 93% R&D adoption (3,683 users) of internal AI coding tools in 11 months, with merge requests jumping from approximately 5,600 per week to over 8,700 (peak 10,952)
  • The architecture routes all requests through AI Gateway for centralized authentication, cost tracking, and zero data retention policies, processing 20.18 million requests and 241.37 billion tokens monthly
  • Frontier models (OpenAI, Anthropic, Google) handle 91% of requests while Workers AI handles 8.84%, with Workers AI running 77% cheaper than proprietary models for some workloads
  • Single sign-on setup via opencode auth login discovers configuration, authenticates through Cloudflare Access, and auto-configures providers, models, and MCP servers without manual setup
  • Proxy Worker pattern gives centralized control for adding features like per-user attribution and permission enforcement without touching client configs
  • MCP Server Portal consolidates 13 production MCP servers exposing 182+ tools across GitLab, Jira, Sentry, and internal systems through one OAuth flow
  • Code Mode collapses MCP tool schemas from 15,000 tokens (34 tools) to just 2 portal-level tools, saving 7.5% of context window per request
  • Backstage service catalog tracks 2,055 services, 228 APIs, 544 systems, and 1,302 databases as a knowledge graph for agents to understand dependencies and ownership
  • Auto-generated AGENTS.md files across 3,900 repositories provide structured context about runtime, test commands, conventions, and boundaries so agents don't work blind
  • AI Code Reviewer runs on every merge request via GitLab CI, categorizing reviews by risk tier and delegating to specialized agents for security, performance, and codex compliance
  • Engineering Codex system distills standards into agent skills and rules that both developers and reviewers can reference, reducing manual audit work from weeks to structured processes
  • Background agents running on Durable Objects with Sandbox SDK can clone repos, run tests, iterate on failures, and open merge requests in single long-running sessions
  • The entire stack runs on shipping Cloudflare products released during Agents Week, validating their platform through internal dogfooding at massive scale
Decoder
  • MCP (Model Context Protocol): Protocol for connecting AI agents to external tools and data sources, allowing agents to call APIs and access systems beyond their training data
  • AI Gateway: Cloudflare's centralized routing layer for LLM requests that handles authentication, cost tracking, provider management, and data retention policies
  • Workers AI: Cloudflare's serverless AI inference platform running open-source models on GPUs across their global network
  • Code Mode: Emerging pattern where models discover and call tools through code execution rather than loading all tool schemas upfront into the context window
  • iMARS: Internal MCP Agent/Server Rollout Squad, the tiger team Cloudflare formed to build their AI engineering infrastructure
  • Durable Objects: Cloudflare's serverless coordination primitive providing low-latency consistent storage for stateful applications
  • Backstage: Open-source developer portal (originally from Spotify) that serves as a service catalog and knowledge graph
  • AGENTS.md: Structured markdown file in repositories that tells AI coding agents about runtime, conventions, dependencies, and boundaries
  • Engineering Codex: Cloudflare's internal standards system that distills engineering rules into agent-readable skills
  • Zero Data Retention (ZDR): Policy ensuring model providers don't store request data for training or other purposes
  • Kimi K2.5: Frontier-scale open-source model with 256k context window that runs on Workers AI
  • Sandbox SDK: Cloudflare SDK for isolated execution environments where agents can clone, build, and test code safely
Original article

The AI engineering stack we built internally — on the platform we ship

In the last 30 days, 93% of Cloudflare's R&D organization used AI coding tools powered by infrastructure we built on our own platform.

Eleven months ago, we undertook a major project: to truly integrate AI into our engineering stack. We needed to build the internal MCP servers, access layer, and AI tooling necessary for agents to be useful at Cloudflare. We pulled together engineers from across the company to form a tiger team called iMARS (Internal MCP Agent/Server Rollout Squad). The sustained work landed with the Dev Productivity team, who also own much of our internal tooling including CI/CD, build systems, and automation.

Here are some numbers that capture our own agentic AI use over the last 30 days:

  • 3,683 internal users actively using AI coding tools (60% company-wide, 93% across R&D), out of approximately 6,100 total employees

  • 47.95 million AI requests

  • 295 teams currently using agentic AI tools and coding assistants

  • 20.18 million AI Gateway requests per month

  • 241.37 billion tokens routed through AI Gateway

  • 51.83 billion tokens processed on Workers AI

The impact on developer velocity internally is clear: we've never seen a quarter-to-quarter increase in merge requests to this degree.

As AI tooling adoption has grown, the 4-week rolling average has climbed from ~5,600/week to over 8,700. The week of March 23 hit 10,952, nearly double the Q4 baseline.

MCP servers were the starting point, but the team quickly realized we needed to go further: rethink how standards are codified, how code gets reviewed, how engineers onboard, and how changes propagate across thousands of repos.

This post dives deep into what that looked like over the past eleven months and where we ended up. We're publishing now, to close out Agents Week, because the AI engineering stack we built internally runs on the same products we're shipping and enhancing this week.

The architecture at a glance

The engineer-facing tools layer (OpenCode, Windsurf, and other MCP-compatible clients) includes both open-source and third-party coding assistant tools.

Each layer maps to a Cloudflare product or tool we use:

What we built, and what it's built with:

  • Zero Trust authentication: Cloudflare Access
  • Centralized LLM routing, cost tracking, BYOK, and Zero Data Retention controls: AI Gateway
  • On-platform inference with open-weight models: Workers AI
  • MCP Server Portal with single OAuth: Workers + Access
  • AI Code Reviewer CI integration: Workers + AI Gateway
  • Sandboxed execution for agent-generated code (Code Mode): Dynamic Workers
  • Stateful, long-running agent sessions: Agents SDK (McpAgent, Durable Objects)
  • Isolated environments for cloning, building, and testing: Sandbox SDK (GA as of Agents Week)
  • Durable multi-step workflows: Workflows (scaled 10x during Agents Week)
  • 16K+ entity knowledge graph: Backstage (OSS)

None of this is internal-only infrastructure. Everything (besides Backstage) listed above is a shipping product, and many of them got substantial updates during Agents Week.

We'll walk through this in three acts:

  1. The platform layer — how authentication, routing, and inference work (AI Gateway, Workers AI, MCP Portal, Code Mode)

  2. The knowledge layer — how agents understand our systems (Backstage, AGENTS.md)

  3. The enforcement layer — how we keep quality high at scale (AI Code Reviewer, Engineering Codex)

Act 1: The platform layer

How AI Gateway helped us stay secure and improve the developer experience

When you have more than 3,600 internal users relying on AI coding tools daily, you need to solve for access and visibility across many clients, use cases, and roles.

Everything starts with Cloudflare Access, which handles all authentication and zero-trust policy enforcement. Once authenticated, every LLM request routes through AI Gateway. This gives us a single place to manage provider keys, cost tracking, and data retention policies.

The OpenCode AI Gateway overview: 688.46k requests per day, 10.57B tokens per day, routing to four providers through one endpoint.

AI Gateway analytics show how monthly usage is distributed across model providers. Over the last month, internal request volume broke down as follows.

  • Frontier Labs (OpenAI, Anthropic, Google): 13.38M requests/month (91.16%)
  • Workers AI: 1.3M requests/month (8.84%)

Frontier models handle the bulk of complex agentic coding work for now, but Workers AI is already a significant part of the mix and handles an increasing share of our agentic engineering workloads.

How we increasingly leverage Workers AI

Workers AI is Cloudflare's serverless AI inference platform, which runs open-source models on GPUs across our global network. Beyond huge cost improvements compared to frontier models, a key advantage is that inference stays on the same network as your Workers, Durable Objects, and storage: no cross-cloud hops, which add latency, network flakiness, and extra networking configuration to manage.

Workers AI usage in the last month: 51.47B input tokens, 361.12M output tokens.

Kimi K2.5, launched on Workers AI in March 2026, is a frontier-scale open-source model with a 256k context window, tool calling, and structured outputs. As we described in our Kimi K2.5 launch post, we have a security agent that processes over 7 billion tokens per day on Kimi. That would cost an estimated $2.4M per year on a mid-tier proprietary model. But on Workers AI, it's 77% cheaper.

Beyond security, we use Workers AI for documentation review in our CI pipeline, for generating AGENTS.md context files across thousands of repositories, and for lightweight inference tasks where same-network latency matters more than peak model capability.

As open-source models continue to improve, we expect Workers AI to handle a growing share of our internal workloads.

One thing we got right early: routing through a single proxy Worker from day one. We could have had clients connect directly to AI Gateway, which would have been simpler to set up initially. But centralizing through a Worker meant we could add per-user attribution, model catalog management, and permission enforcement later without touching any client configs. Every feature described in the bootstrap section below exists because we had that single choke point. The proxy pattern gives you a control plane that direct connections don't, and if we plug in additional coding assistant tools later, the same Worker and discovery endpoint will handle them.

How it works: one URL to configure everything

The entire setup starts with one command:

opencode auth login https://opencode.internal.domain

That command triggers a chain that configures providers, models, MCP servers, agents, commands, and permissions, without the user touching a config file.

Step 1: Discover auth requirements. OpenCode fetches config from a URL like https://opencode.internal.domain/.well-known/opencode.

This discovery endpoint is served by a Worker and the response has an auth block telling OpenCode how to authenticate, along with a config block with providers, MCP servers, agents, commands, and default permissions:

{
  "auth": {
    "command": ["cloudflared", "access", "login", "..."],
    "env": "TOKEN"
  },
  "config": {
    "provider": { "..." },
    "mcp": { "..." },
    "agent": { "..." },
    "command": { "..." },
    "permission": { "..." }
  }
}

Step 2: Authenticate via Cloudflare Access. OpenCode runs the auth command and the user authenticates through the same SSO they use for everything else at Cloudflare. cloudflared returns a signed JWT. OpenCode stores it locally and automatically attaches it to every subsequent provider request.

Step 3: Config is merged into OpenCode. The config provided is shared defaults for the entire organization, but local configs always take priority. Users can override the default model, add their own agents, or adjust project and user scoped permissions without affecting anyone else.
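
The precedence rule in Step 3 can be sketched as a recursive merge where local keys always win. This is an illustration under that assumption; OpenCode's exact merge semantics aren't specified here:

```python
def merge_config(shared: dict, local: dict) -> dict:
    """Recursively merge org-wide defaults with a user's local config,
    with local values winning on conflict. A sketch only; the real
    merge behavior may differ."""
    out = dict(shared)
    for key, local_val in local.items():
        shared_val = out.get(key)
        if isinstance(shared_val, dict) and isinstance(local_val, dict):
            out[key] = merge_config(shared_val, local_val)  # descend into nested blocks
        else:
            out[key] = local_val  # local override replaces the shared value
    return out

shared = {"provider": {"default_model": "model-a", "mcp": ["portal"]}}
local = {"provider": {"default_model": "model-b"}}  # user's personal override
merged = merge_config(shared, local)
print(merged["provider"]["default_model"])  # model-b
print(merged["provider"]["mcp"])            # ['portal']
```

The override is surgical: the user's model choice replaces the default, while untouched keys like the MCP server list flow through from the shared config.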

Inside the proxy Worker. The Worker is a simple Hono app that does three things:

  1. Serves the shared config. The config is compiled at deploy time from structured source files and contains placeholder values like {baseURL} for the Worker's origin. At request time, the Worker replaces these, so all provider requests route through the Worker rather than directly to model providers. Each provider gets a path prefix (/anthropic, /openai, /google-ai-studio/v1beta, /compat for Workers AI) that the Worker forwards to the corresponding AI Gateway route.

  2. Proxies requests to AI Gateway. When OpenCode sends a request like POST /anthropic/v1/messages, the Worker validates the Cloudflare Access JWT, then rewrites headers before forwarding:

    Stripped:   authorization, cf-access-token, host
    Added:      cf-aig-authorization: Bearer <API_KEY>
                cf-aig-metadata: {"userId": "<anonymous-uuid>"}

    The request goes to AI Gateway, which routes it to the appropriate provider. The response passes straight through with zero buffering. The apiKey field in the client config is empty because the Worker injects the real key server-side. No API keys exist on user machines.

  3. Keeps the model catalog fresh. An hourly cron trigger fetches the current OpenAI model list from models.dev, caches it in Workers KV, and injects store: false on every model for Zero Data Retention. New models get ZDR automatically without a config redeploy.

Anonymous user tracking. After JWT validation, the Worker maps the user's email to a UUID using D1 for persistent storage and KV as a read cache. AI Gateway only ever sees the anonymous UUID in cf-aig-metadata, never the email. This gives us per-user cost tracking and usage analytics without exposing identities to model providers or Gateway logs.
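
The header rewrite in step 2 plus the anonymous tracking above can be sketched as a pure function. The header names (cf-aig-authorization, cf-aig-metadata) come from the post; the dict standing in for D1/KV and everything else is illustrative:

```python
import json
import uuid

# In production the email -> UUID map lives in D1 with a KV read cache;
# a plain dict stands in for both here.
_user_ids: dict = {}

def anonymous_id(email: str) -> str:
    """Stable anonymous UUID per user; the email itself never leaves."""
    if email not in _user_ids:
        _user_ids[email] = str(uuid.uuid4())
    return _user_ids[email]

def rewrite_headers(headers: dict, email: str, api_key: str) -> dict:
    """Sketch of the proxy Worker's rewrite: strip the client's auth
    headers, inject the gateway key and an anonymous user id."""
    out = {k: v for k, v in headers.items()
           if k.lower() not in ("authorization", "cf-access-token", "host")}
    out["cf-aig-authorization"] = f"Bearer {api_key}"
    out["cf-aig-metadata"] = json.dumps({"userId": anonymous_id(email)})
    return out

incoming = {"authorization": "Bearer user-jwt", "host": "opencode.internal",
            "content-type": "application/json"}
forwarded = rewrite_headers(incoming, "dev@example.com", "GATEWAY_KEY")
print("authorization" in forwarded)         # False: client JWT never forwarded
print("dev@example.com" in str(forwarded))  # False: only the UUID leaves
```

Validating the Access JWT and streaming the response are omitted; the point is that the real API key and the user's identity exist only server-side.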

Config-as-code. Agents and commands are authored as markdown files with YAML frontmatter. A build script compiles them into a single JSON config validated against the OpenCode JSON schema. Every new session picks up the latest version automatically.

The overall architecture is simple and easy for anyone to deploy with our developer platform: a proxy Worker, Cloudflare Access, AI Gateway, and a client-accessible discovery endpoint that configures everything automatically. Users run one command and they're done. There's nothing for them to configure manually: no API keys on laptops, no MCP server connections to set up by hand. Making changes to our agentic tools and updating what 3,000+ people get in their coding environment is just a wrangler deploy away.

The MCP Server Portal: one OAuth, multiple MCP tools

We described our full approach to governing MCP at enterprise scale in a separate post, including how we use MCP Server Portals, Cloudflare Access, and Code Mode together. Here's the short version of what we built internally.

Our internal portal aggregates 13 production MCP servers exposing 182+ tools across Backstage, GitLab, Jira, Sentry, Elasticsearch, Prometheus, Google Workspace, our internal Release Manager, and more. This unifies access and simplifies everything, giving us one endpoint and one Cloudflare Access flow governing every tool.

Each MCP server is built on the same foundation: McpAgent from the Agents SDK, workers-oauth-provider for OAuth, and Cloudflare Access for identity. The whole thing lives in a single monorepo with shared auth infrastructure, Bazel builds, CI/CD pipelines, and catalog-info.yaml for Backstage registration. Adding a new server is mostly copying an existing one and changing the API it wraps. For more on how this works and the security architecture behind it, see our enterprise MCP reference architecture.

Code Mode at the portal layer

MCP is the right protocol for connecting AI agents to tools, but it has a practical problem: every tool definition consumes context window tokens before the model even starts working. As the number of MCP servers and tools grows, so does the token overhead, and at scale, this becomes a real cost. Code Mode is the emerging fix: instead of loading every tool schema up front, the model discovers and calls tools through code.

Our GitLab MCP server originally exposed 34 individual tools (get_merge_request, list_pipelines, get_file_content, and so on). Those 34 tool schemas consumed roughly 15,000 tokens of context window per request. On a 200K context window, that's 7.5% of the budget gone before asking a question. Multiplied across every request, every engineer, every day, it adds up.

MCP Server Portals now support Code Mode proxying, which lets us solve that problem centrally instead of one server at a time. Rather than exposing every upstream tool definition to the client, the portal collapses them into two portal-level tools: portal_codemode_search and portal_codemode_execute.

The nice thing about doing this at the portal layer is that it scales cleanly. Without Code Mode, every new MCP server adds more schema overhead to every request. With portal-level Code Mode, the client still only sees two tools even as we connect more servers behind the portal. That means less context bloat, lower token cost, and a cleaner architecture overall.
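
The context-window savings quoted above are easy to check. The 15,000-token schema figure and 200K window come from the post; the size of the two post-Code-Mode portal tools is my assumption:

```python
def context_overhead(schema_tokens: int, context_window: int) -> float:
    """Fraction of the context window consumed by tool schemas before
    the model does any actual work."""
    return schema_tokens / context_window

# 34 GitLab tool schemas at ~15,000 tokens on a 200K-token window:
before = context_overhead(15_000, 200_000)
print(f"{before:.1%}")  # 7.5%

# Code Mode: just two portal-level tools. Their exact schema size isn't
# given in the post, so a few hundred tokens is an assumption here.
after = context_overhead(400, 200_000)
print(f"{after:.1%}")
```

The key property is that the "after" number stays flat as servers are added behind the portal, while the "before" number grows with every new tool schema.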

Act 2: The knowledge layer

Backstage: the knowledge graph underneath all of it

Before the iMARS team could build MCP servers that were actually useful, we needed to solve a more fundamental problem: structured data about our services and infrastructure. We needed our agents to understand context outside the codebase: who owns what, how services depend on each other, where the documentation lives, and what databases a service talks to.

We run Backstage, the open-source internal developer portal originally built by Spotify, as our service catalog. It's self-hosted (not on Cloudflare products, for the record) and it tracks things like:

  • 2,055 services, 167 libraries, and 122 packages

  • 228 APIs with schema definitions

  • 544 systems (products) across 45 domains

  • 1,302 databases, 277 ClickHouse tables, 173 clusters

  • 375 teams and 6,389 users with ownership mappings

  • Dependency graphs connecting services to the databases, Kafka topics, and cloud resources they rely on

Our Backstage MCP server (13 tools) is available through our MCP Portal, and an agent can look up who owns a service, check what it depends on, find related API specs, and pull Tech Insights scores, all without leaving the coding session.

Without this structured data, agents are working blind. They can read the code in front of them, but they can't see the system around it. The catalog turns individual repos into a connected map of the engineering organization.

AGENTS.md: getting thousands of repos ready for AI

Early in the rollout, we kept seeing the same failure mode: coding agents produced changes that looked plausible and were still wrong for the repo. Usually the problem was local context: the model didn't know the right test command, the team's current conventions, or which parts of the codebase were off-limits. That pushed us toward AGENTS.md: a short, structured file in each repo that tells coding agents how the codebase actually works and forces teams to make that context explicit.

What AGENTS.md looks like

We built a system that generates AGENTS.md files across our GitLab instance. Because these files sit directly in the model's context window, we wanted them to stay short and high-signal. A typical file looks like this:

# AGENTS.md

## Repository
- Runtime: cloudflare workers
- Test command: `pnpm test`
- Lint command: `pnpm lint`

## How to navigate this codebase
- All Cloudflare Workers are in src/workers/, one file per worker
- MCP server definitions are in src/mcp/, each tool in a separate file
- Tests mirror source: src/foo.ts -> tests/foo.test.ts

## Conventions
- Testing: use Vitest with `@cloudflare/vitest-pool-workers` (Codex: RFC 021, RFC 042)
- API patterns: Follow internal REST conventions (Codex: API-REST-01)

## Boundaries
- Do not edit generated files in `gen/`
- Do not introduce new background jobs without updating `config/`

## Dependencies
- Depends on: auth-service, config-service
- Depended on by: api-gateway, dashboard

When an agent reads this file, it doesn't have to infer the repo from scratch. It knows how the codebase is organized, which conventions to follow, and which Engineering Codex rules apply.

How we generate them at scale

The generator pipeline pulls entity metadata from our Backstage service catalog (ownership, dependencies, system relationships), analyzes the repository structure to detect the language, build system, test framework, and directory layout, then maps the detected stack to relevant Engineering Codex standards. A capable model then generates the structured document, and the system opens a merge request so the owning team can review and refine it.

We've processed roughly 3,900 repositories this way. The first pass wasn't always perfect, especially for polyglot repos or unusual build setups, but even that baseline was much better than asking agents to infer everything from scratch.

The initial merge request solved the bootstrap problem, but keeping these files current mattered just as much. A stale AGENTS.md can be worse than no file at all. We closed that loop with the AI Code Reviewer, which can flag when repository changes suggest that AGENTS.md should be updated.

Act 3: The enforcement layer

The AI Code Reviewer

Every merge request at Cloudflare gets an AI code review. Integration is straightforward: teams add a single CI component to their pipeline, and from that point every MR is reviewed automatically.

We use GitLab's self-hosted solution as our CI/CD platform. The reviewer is implemented as a GitLab CI component that teams include in their pipeline. When an MR is opened or updated, the CI job runs OpenCode with a multi-agent review coordinator. The coordinator classifies the MR by risk tier (trivial, lite, or full) and delegates to specialized review agents: code quality, security, codex compliance, documentation, performance, and release impact. Each agent connects to the AI Gateway for model access, pulls Engineering Codex rules from a central repo, and reads the repository's AGENTS.md for codebase context. Results are posted back as structured MR comments.
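
To make the coordinator's flow concrete, here is a toy sketch of risk-tier routing. The tier names and specialized agents come from the post; the classification thresholds and the tier-to-agent mapping are invented for illustration:

```python
def classify_mr(changed_lines: int, touches_security_paths: bool) -> str:
    """Toy risk-tier classifier. The real coordinator's criteria aren't
    described in the post; these thresholds are made up."""
    if touches_security_paths:
        return "full"        # security-relevant changes always get full review
    if changed_lines <= 10:
        return "trivial"
    if changed_lines <= 200:
        return "lite"
    return "full"

# Which specialized review agents run per tier (agent names from the post;
# the mapping of tiers to agents is assumed):
AGENTS_BY_TIER = {
    "trivial": ["code quality"],
    "lite": ["code quality", "codex compliance", "documentation"],
    "full": ["code quality", "security", "codex compliance",
             "documentation", "performance", "release impact"],
}

tier = classify_mr(changed_lines=350, touches_security_paths=False)
print(tier)                       # full
print(AGENTS_BY_TIER[tier])
```

Each delegated agent would then pull Codex rules and the repo's AGENTS.md before posting its findings back as structured MR comments.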

A separate Workers-based config service handles centralized model selection per reviewer agent, so we can shift models without changing the CI template. The review process itself runs in the CI runner and is stateless per execution.

The output format

We spent time getting the output format right. Reviews are broken into categories (Security, Code Quality, Performance) so engineers can scan headers rather than reading walls of text. Each finding has a severity level (Critical, Important, Suggestion, or Optional Nits) that makes it immediately clear what needs attention versus what's informational.

The reviewer maintains context across iterations. If it flagged something in a previous review round that has since been fixed, it acknowledges that rather than re-raising the same issue. And when a finding maps to an Engineering Codex rule, it cites the specific rule ID, turning an AI suggestion into a reference to an organizational standard.

Workers AI handles about 15% of the reviewer's traffic, primarily for documentation review tasks where Kimi K2.5 performs well at a fraction of the cost of frontier models. Models like Opus 4.6 and GPT 5.4 handle security-sensitive and architecturally complex reviews where reasoning capability matters most.

Over the last 30 days:

  • 100% AI code reviewer coverage across all repos on our standard CI pipeline

  • 5.47M AI Gateway requests

  • 24.77B tokens processed

We're releasing a detailed technical blog post alongside this one that covers the reviewer's internal architecture, including how we route between models, the multi-agent orchestration, and the cost optimization strategies we've developed.

Engineering Codex: engineering standards as agent skills

The Engineering Codex is Cloudflare's new internal system of record for our core engineering standards. A multi-stage AI distillation process outputs a set of codex rules ("If you need X, use Y. You must do X if you are doing Y or Z.") along with an agent skill that uses progressive disclosure: a nested hierarchy of information directories with links across markdown files.
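
To make the rule format concrete, a distilled codex rule entry might look something like the following; the rule ID, wording, and linked file are invented for illustration:

```markdown
<!-- Hypothetical codex rule; ID, content, and linked files are illustrative -->
## CDX-1042: Error handling in Rust services
- If you need recoverable errors across a service boundary, use typed errors.
- You must not call `unwrap()` in request-handling paths if the input is
  user-controlled.

Details: [rust/error-handling.md](rust/error-handling.md)
(loaded on demand via progressive disclosure)
```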

This skill is available for engineers to use locally as they build, with prompts like "how should I handle errors in my Rust service?" or "review this TypeScript code for compliance." Our Network Firewall team audited rampartd using a multi-agent consensus process in which every requirement was scored COMPLIANT, PARTIAL, or NON-COMPLIANT with specific violation details and remediation steps, reducing what previously required weeks of manual work to a structured, repeatable process.

At review time, the AI Code Reviewer cites specific Codex rules in its feedback.

AI Code Review showing categorized findings (Codex Compliance in this case), noting the codex RFC violation.

None of these pieces are especially novel on their own. Plenty of companies run service catalogs, ship reviewer bots, or publish engineering standards. The difference is the wiring. When an agent can pull context from Backstage, read AGENTS.md for the repo it's editing, and get reviewed against Codex rules by the same toolchain, the first draft is usually close enough to ship. That wasn't true six months ago.

The scoreboard

From launching this effort to 93% R&D adoption took less than a year.

Company-wide adoption (Feb 5 – April 15, 2026):

  • Active users: 3,683 (60% of the company)

  • R&D team adoption: 93%

  • AI messages: 47.95M

  • Teams with AI activity: 295

  • OpenCode messages: 27.08M

  • Windsurf messages: 434.9K

AI Gateway (last 30 days, combined):

  • Requests: 20.18M

  • Tokens: 241.37B

Workers AI (last 30 days):

  • Input tokens: 51.47B

  • Output tokens: 361.12M

What's next: background agents

The next evolution in our internal engineering stack will include background agents: agents that can be spun up on demand with the same tools available locally (MCP portal, git, test runners) but running entirely in the cloud. The architecture uses Durable Objects and the Agents SDK for orchestration, delegating to Sandbox containers when the job requires a full development environment like cloning a repo, installing dependencies, or running tests. The Sandbox SDK went GA during Agents Week.

Long-running agents, shipped natively into the Agents SDK during Agents Week, solve the durable session problem that previously required workarounds. The SDK now supports sessions that run for extended periods without eviction, long enough for an agent to clone a large repo, run a full test suite, iterate on failures, and open an MR in a single session.

This represents an eleven-month effort to rethink not just how code gets written, but how it gets reviewed, how standards are enforced, and how changes ship safely across thousands of repos. Every layer runs on the same products our customers use.

Start building

Agents Week just shipped everything you need. The platform is here.

npx create-cloudflare@latest --template cloudflare/agents-starter

That agents starter gets you running. The diagram below shows the full architecture for when you're ready to grow it: your tools layer on top (chatbot, web UI, CLI, browser extension), the Agents SDK handles session state and orchestration in the middle, and the Cloudflare services you call from it sit underneath.

Docs: Agents SDK · Sandbox SDK · AI Gateway · Workers AI · Workflows · Code Mode · MCP on Cloudflare

Repos: cloudflare/agents · cloudflare/sandbox-sdk · cloudflare/mcp-server-cloudflare · cloudflare/skills

For more on how we're using AI at Cloudflare, read the post on our process for AI Code Review. And check out everything we shipped during Agents Week.

We'd love to hear what you build. Find us on Discord, X, and Bluesky.

Random thoughts while gazing at the misty AI Frontier (10 minute read)

Tech
Veteran investor Elad Gil shares predictions on the AI industry's trajectory, from compute constraints creating an oligopoly through 2028 to why most AI startups should sell soon.
What: A wide-ranging essay of predictions and observations about the AI industry, covering market dynamics, talent economics, compute limitations, employment impacts, and regulatory backlash. Gil argues AI has grown from zero to 0.5% of US GDP in just a few years, with OpenAI and Anthropic each at roughly $30B revenue run rate.
Why it matters: These predictions span critical near-term shifts that could reshape the AI landscape: memory constraints may prevent any single lab from breaking ahead until 2028, creating an artificial oligopoly; developing countries may lose their economic ladder as outsourced jobs vanish first; and most AI companies face a narrow exit window before competition intensifies.
Takeaway: AI company founders should seriously evaluate exiting in the next 12-18 months while markets are hot; engineers should focus on systems thinking and orchestrating agents rather than code craftsmanship.
Deep dive
  • OpenAI and Anthropic each reached $30B revenue run rate (0.1% of US GDP), with projections hitting $100B by end of 2026, meaning AI could represent 1% of GDP run rate within a year
  • Meta's aggressive talent bidding created a "distributed IPO" effect where top AI researchers across all labs received massive compensation increases simultaneously, potentially changing focus and behavior similar to newly enriched IPO employees
  • Memory constraints from manufacturers like Hynix and Samsung may create an artificial asymptote on model capabilities through at least 2028, preventing any single lab from breaking significantly ahead and reinforcing an LLM oligopoly
  • The compute constraint could shift unless a lab achieves secret algorithmic breakthroughs and coding agents enable self-improvement loops leading to "liftoff"
  • Tokens/compute has become a new unit of economic value determining engineering capabilities, company spend models, and business viability; some companies like Cursor are essentially subsidizing inference for user acquisition
  • Developing countries face the first wave of AI displacement as companies cut outsourced services (customer support, etc.) before internal headcount, potentially eliminating the traditional services ladder for economic development
  • Later-stage companies plan to flatten or slightly reduce headcount while revenue grows 30-100%, relying on attrition rather than layoffs, and may optimize token budget versus salary ratios
  • The current "Slop Age" may represent a golden era where AI provides useful leverage requiring human refinement, before AI becomes superhuman and potentially displaces more interesting work
  • AI will automate closed-loop systems first (testable, iterative environments like coding and AI research), with jobs ranked by loop-tightness and economic value determining displacement speed
  • The "harness" (UX, workflow, prompting environment around AI models) is becoming increasingly important for stickiness and defensibility, potentially more than the underlying model itself
  • AI companies are selling units of labor (customer support work, coding output) rather than software seats, dramatically expanding TAMs compared to traditional SaaS
  • Most AI companies should consider exiting in 12-18 months despite growing demand, as only a handful will survive long-term similar to the dot-com era where only dozens of the 2000 IPOs survived
  • Anti-AI regulation (Maine banned new data centers) and violence against AI leaders (recent attack on Sam Altman mentioned) will increase dramatically despite minimal actual job displacement so far
  • The industry needs more optimistic public messaging to counter the doom-and-gloom narrative that's fueling political and activist backlash
Decoder
  • Slop Age: The current era where AI generates useful but imperfect output at scale that humans must refine, potentially the sweet spot before AI becomes superhuman
  • Harness: The UX, workflow, and environment wrapped around an AI model (like coding assistants) that makes it sticky beyond just model quality
  • Closed loop: A system where AI can test and iterate on its own output (like code that can be executed and tested), enabling faster learning
  • Neoclouds: Companies that are essentially inference providers disguised as tools, subsidizing compute access as their core product offering
  • Liftoff: The hypothetical point where AI begins recursive self-improvement, potentially through coding agents building better AI systems
Original article

Random thoughts while gazing at the misty AI Frontier

A bunch of random things I have been thinking about, some of which are probably wrong

I was originally going to write a long, articulate post for each of the below, with lots of fancy graphics, charts, and detailed analysis. Then I realized it was too much work. Instead, here is some human idea slop & random thoughts. Enjoy!

  1. OAI and Anthropic are now at 0.1% of US GDP each. What % of GDP is AI revenue in 2030?

    US GDP is roughly $30T. OpenAI and Anthropic are both rumored to be currently in the ballpark of $30B of revenue run rate, or 0.1% of overall GDP each. Throw in clouds and other services, and AI has grown from roughly zero to 0.25%-0.5% of US GDP in just a few years. If Anthropic and OpenAI hit $100B of revenue by EOY as many think they might, roughly 1% of GDP run rate will be from AI by end of 2026. This is insanely fast.

    What % of GDP will AI be in 2030? 2035? How does the US economic base impact the slowing of AI impact? How much of the productivity gains end up missing from GDP, a la the missing productivity impact of the internet in the 2000s or IT in the 1980s and 1990s?

    (Aside - If the impact of AI is mismeasured, perhaps the wrong regulatory policies get implemented as a reaction, as AI gets blamed for only the bad (job losses) and not the good (new types of jobs, impact on education, healthcare…). Maybe the real ASI/Turing test is the ability to measure real-world US GDP and productivity gains? :)

  2. The AI research community just had a distributed IPO

    When a company goes public many of the early employees may find themselves suddenly enriched. This may change behavior - people get distracted buying homes, chasing status or spouses, partying, or doing societal side quests. This does not apply to everyone, but a subset of people experience this.

    Meta aggressively paying for talent changed the AI research talent market, as the main labs had to match or provide large compensation increases to their researchers. Arguably, the AI research community just underwent the cross-company equivalent of an IPO across the big labs & big tech. Somewhere between 50 and a few hundred people across all AI labs were granted huge sums of money as a reaction to Meta bidding on the best-regarded researchers, driving up everyone's salaries.

    Just like a traditional IPO, a subset of the members of that community are shifting some aspects of focus and lifestyle, checking out or getting distracted, while others stay the course. In general the AI community is very mission aligned around building AGI or focusing on AI for science.

    Either way, an interesting new phenomenon has quietly occurred in Silicon Valley where, instead of a company going public, a very specific slice of people effectively did. The top AI researchers became post-economic all at once. (Maybe the closest prior analogue is the early crypto HODLers?)

  3. Compute ceiling = artificial asymptote on near-term model capabilities? Does this just reinforce an oligopoly market for now?

    We have seen amazing progress in model capabilities in the last few years. This has been reflected in the flowering of use cases + revenue for the main labs and app companies built on top.

    At the same time, the labs are increasingly compute limited as one extrapolates out both training scale planned as well as future inference needs. Compute build outs seem at least in part to be limited by memory from Hynix, Samsung, Micron et al at least for the next 2 years as a build out cycle occurs for manufacturing for these companies.

    This means that rather than a single lab buying well ahead, or being able to use all the compute it wants, all the big labs are effectively and increasingly in a compute-constrained world. This constraint may end up creating an artificial short-term asymptote on AI model progress. While people will undoubtedly get more efficiency out of the compute they have, this artificial compute constraint may mean no one lab is able to break significantly ahead until 2028 at the soonest, reinforcing an oligopoly market for LLMs. We may also see the labs "accordion" between allocating compute and human resources to apps vs models and back again. Similarly, the depreciation cycles on chips and systems will be different than everyone expected, and the lifetime of silicon will be extended due to lack of sufficient new supply.

    The counter to this is that algorithmic or other breakthroughs, if contained within a single lab (vs leaking at an SF holiday party attended by researchers), could turbocharge a single company to dominance, particularly if coding takes off and there is some form of ongoing self-improvement loop of AI building future AI, leading into liftoff. If we do end up with a hard compute-constrained environment, breakaway liftoff may wait for 2028. Of course, it is also possible we are compute constrained for years post-2028 due to excess demand. Exciting to watch what happens.

  4. Compute (tokens) is the new currency

    Compute (or, equivalently, tokens) is a new unit of denomination for economic value in Silicon Valley. Token budget impacts things like:

    1. What can you accomplish as an engineer
    2. Your spend and potential revenue as a company
    3. Your business model.

    Some companies are effectively inference providers disguised as tools. Neoclouds are the clearest form of this, but tools like Cursor are similarly providing cheap inference as a core part of their product offerings and effectively subsidizing compute, which has been a smart user acquisition and usage model. Who doesn't love extra tokens?

    Things have gotten to the point where Allbirds (shoe company) just raised a convert to build a GPU farm. Will they be to AI what Microstrategy is to crypto?

  5. Hidden layoffs & the developing world

    Most of the "layoffs due to AI" announced so far are probably just companies that overhired during the COVID zero-interest-rate environment slimming back down. Saying "look how good we are at AI, we need fewer people" sounds much better than "we way overhired and are fixing it a few years too late".

    That said, AI is having a real impact in multiple areas such as customer support. Companies that are shrinking teams due to AI are actually cutting outsourcing firms first, so the headcount is not directly on their balance sheet but paid for as a service. This means countries like India and the Philippines may be the most impacted soonest in terms of employment and AI, as they house many of these outsourced services organizations.

    It also means some developing countries may lose their services ladder to upgrading their economy and work. If AI takes many of the outsourced services jobs first, employment in these economies will need to shift elsewhere. An interesting question is whether this shifts human migration patterns.

  6. Employee headcount is going to flatten for lots of companies and then shrink

    Multiple later-stage CEOs told me that rather than do big layoffs due to AI, they will just stop growing. So if revenue at the company is growing 30%, 50%, or 100%, headcount may be flat or slightly down as they allow attrition to shrink staff. Existing headcount will become more productive, and companies may start swapping in fewer, better people. This may, in the medium term, inflate the salaries of the very best people who can leverage AI immensely. Expect hiring to continue in sales and some engineering for growing companies, but maybe not as much elsewhere.

    Some companies are starting to ask: what is the right ratio of token budget vs salaries in their org? Unclear what the right timeline for this metric is.

    True startups (e.g. a 5 person team) in the short run should continue to scale up headcount like in the olden days as they hit product/market fit but just with more leverage per person. So the "flat company" is going to be more of a later stage or public company phenomenon for growing companies in the next 2-4 years. Low growth companies of course should shrink.

    This may have implications for HR/software companies.

  7. The Slop Age could be the golden era of AI x humanity

    We are likely in the golden era of AI + humanity. Before the last few years, AI was inaccessible, not very generalizable, and could only do specific tasks. In the future, AI may become superhuman at most tasks and take over a lot of work some people find fun. Today, AI creates useful slop at volume, which means humans are still needed to desloppify the slop, but the slop provides real leverage on time and jobs, which means it is fun to be working right now. If AI displaces people eventually or does more interesting work, this golden moment may fade or change. Is the Slop Age the golden era of humanity + AI?

    (One could of course argue that we were in the midst of a human slop era before the AI slop era - in other words the era of huge amounts of human created sloppy content on the internet as it grew to billions of web pages, but not billions of new human insights. Does the slop era end with AGI, or when AGI cleans up all the prior waves of human slop?)

  8. AI will eat closed loops first

    AI will first automate away the things that are easier to form a closed loop learning system on. This is why code and AI research may be accelerated and then displaced quickly - you can have testable closed loop systems so machines can learn and iterate quickly. The tighter the closed loop, the faster the AI can learn. You can make a 2x2 of jobs by how closed loop they can be made, versus their economic value, and see where AI may impact labor fastest. Fast time to closed loop + high economic value = fastest AI impact (hence software engineering).

    Code is interesting in that there is probably 10-100X the demand for great software developers as there is supply today (hence coding tools doing so well in market). The AI engineer of the future will be managing and orchestrating large numbers of agents to build things (systems and product thinking) vs writing a lot of code themselves (the auto-complete tab era).

    An interesting question is what jobs or tasks will be made more closed loop next? Where is AI most embeddable and teachable?

    Relatedly, data collection & labelling in every field will continue to grow.

  9. Artisanal engineers vs utility engineers and AI

    Deep artisanal "my code is my craft" and "I love creating bespoke things" engineers are decreasingly happy in a world of AI. Systems-thinking and product-thinking engineers are happiest. Many people are a mix of both.

  10. The Harness

    If you look at the use of AI coding tools, the harness (and broader product surface area, e.g. UX, workflow, etc.) seems to be increasingly sticky in the short term. It is not just the model you use, but the environment, prompting, etc. you build around it that helps impact your choice. Brand also matters more than many people think. At some point, either one coding model breaks very far ahead, or they stay neck and neck. How important is the harness/workflow long term for defensibility for coding or enterprise applications?

    Products tend to not be sticky until they suddenly are very sticky.

    There will be variability in where future forms of harnesses matter vs not. What is the sales AI harness? The AI architect harness? This leaves room for some startups to thrive.

  11. Selling work, not software. Units of labor as the product

    AI is about selling units of labor online (and eventually in the atomic world via robotics), not displacing software. While Zendesk was selling seats to customer support reps, Decagon and Sierra sell customer support agentic work output and labor.

    AI grows tech TAMs dramatically.

  12. Most AI companies should consider exiting in the next 12-18 months

    In the Internet era of 1995-2001, roughly 2,000 or so companies went public. Of these, only a dozen or two survived. Similarly, in the AI era, most companies, including those that are ramping revenue today, will see the market, competition, and adoption turn on them.

    Founders running successful AI companies should all take a cold hard look at exiting in the next 12-18 months, which may be a value maximizing moment for outcomes. A handful of companies should absolutely not exit (eg OpenAI, Anthropic) but many should if they can while everything is on the upswing.

    This is all of course counterbalanced by enormous growing demand for AI services of all types. While the tide is rising, many companies will seem to be unstoppable and durable - whether they are or not in the long run remains to be seen.

  13. Anti-AI regulation & violence will both increase

    AI has had very little real-world impact on, e.g., job displacement so far. However, some AI pundits and some leaders have been quite vocal and doomer-esque, to the point where a strong anti-AI narrative is emerging in both politics (Maine just banned new data centers, although this also ties into energy, jobs, and NIMBYism) and amongst violence-centric activists (see the recent attack on Sam Altman). Expect this to increase dramatically. It would be great if more leaders in AI continue to emphasize the optimistic side of what is coming in public rhetoric and political lobbying. In general, the AI field would benefit from its leaders continuing to work actively on reining in the doom and gloom.

  14. Other

    Any other random thoughts to consider? Ping me on X.

Thanks to Aravind Srinivas of Perplexity, Scott Wu of Cognition, Adam d'Angelo of Quora/Poe, and others for comments.

California Accuses Amazon of Price Fixing in Legal Filing (5 minute read)

Tech
California's lawsuit reveals internal emails showing Amazon pressured brands like Levi's and Hanes to ask Walmart and Target to raise prices on competing websites.
What: A newly unsealed court filing in California's 2022 antitrust lawsuit against Amazon shows specific email exchanges where Amazon employees asked suppliers to get rival retailers to increase prices, with brands complying to avoid being delisted from Amazon's marketplace.
Why it matters: The evidence demonstrates how platform dominance can enable indirect price manipulation across the entire online retail ecosystem, not just on Amazon's own site, raising questions about how major tech platforms wield market power.
Original article

Amazon reportedly pressured major brands to ask competing retailers to raise prices on certain products. California is suing Amazon over allegations that the retailer harms competition and increases prices that consumers pay online. The lawsuit claims that Amazon punished sellers for offering lower prices on other websites. The trial is scheduled to begin next year.

ggsql (GitHub Repo)

Tech
ggsql is a SQL extension that lets you write data queries and visualization specifications in a single composable syntax, eliminating context switches to Python or R.
What: ggsql extends SQL with grammar-of-graphics style visualization commands (VISUALISE, DRAW, SCALE, LABEL) that let analysts query databases and define charts in one go, currently targeting DuckDB/SQLite with Vegalite output.
Why it matters: For SQL-native analysts, this removes the friction of extracting data then switching to another language just to understand what you're looking at. The composable syntax also makes it easy for AI agents to generate queries that humans can immediately validate.
Takeaway: Try the web playground to see SQL-integrated visualization without installing anything, or check the documentation for interactive examples.
Deep dive
  • SQL extension that combines data retrieval with visualization specifications in a single query language
  • Based on the Grammar of Graphics (popularized by ggplot2 in R), offering composable syntax that scales from simple to complex visualizations
  • Designed to eliminate context switching for SQL analysts who otherwise need to export data and switch to Python/R just to create charts
  • Queries use familiar SQL SELECT statements followed by VISUALISE clauses specifying aesthetic mappings (x, y, color), then DRAW, SCALE, and LABEL commands
  • Pre-alpha status with core architecture complete, approaching alpha release
  • Currently focuses on DuckDB/SQLite database support with Vegalite as the output format
  • Future development will add more database readers and output format writers beyond the current stack
  • Compiles to WebAssembly enabling browser-based usage without installation via their playground
  • Syntax deliberately designed to be readable and writable by both humans and AI agents, making validation straightforward
  • Documentation includes interactive examples for hands-on learning
Decoder
  • Grammar of Graphics: A framework for building visualizations compositionally by specifying data mappings, geometric shapes, scales, and other layers rather than using pre-made chart types
  • ggplot2: A popular R visualization library based on Grammar of Graphics that inspired ggsql's design
  • Vegalite: A high-level grammar for creating interactive visualizations; its JSON specs are rendered in the browser via JavaScript
  • DuckDB: An embedded analytical database designed for fast queries on columnar data
  • WASM (WebAssembly): A binary instruction format that allows compiled code to run in web browsers at near-native speed
Original article

ggsql allows developers to write queries that combine SQL data retrieval with visualization specifications in a single, composable syntax.
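
Based on the clause structure described above (a SQL SELECT followed by VISUALISE aesthetic mappings, then DRAW, SCALE, and LABEL commands), a query might look roughly like the following. The exact syntax may differ from current ggsql; treat this as an illustrative sketch, including the table and column names:

```sql
-- Illustrative only: clause names come from the project description,
-- details may differ from actual ggsql syntax.
SELECT species, AVG(body_mass) AS mass
FROM penguins
GROUP BY species
VISUALISE species AS x, mass AS y, species AS color
DRAW bar
LABEL 'Average body mass by species';
```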

Google builds elite team to close the coding gap with Anthropic (2 minute read)

Tech
Google DeepMind assembled a specialized team to improve Gemini's coding abilities after internally acknowledging that Anthropic's coding tools currently outperform theirs.
What: The team, led by Sebastian Borgeaud (former pre-training lead), focuses on teaching Gemini to handle complex programming tasks like writing new software from scratch. Co-founder Sergey Brin is directly involved and has mandated that all Gemini engineers use internal AI agents for multi-step tasks.
Why it matters: Coding has become the primary competitive battleground for major AI labs in 2026, with Google viewing stronger coding capabilities as essential progress toward self-improving AI systems that could eventually automate AI research itself.
Deep dive
  • Google DeepMind formed a dedicated team led by Sebastian Borgeaud to improve Gemini's coding capabilities after internally acknowledging that Anthropic's coding tools are currently superior
  • The team focuses on complex long-horizon programming tasks like writing new software from scratch, requiring models to read files and interpret user intent
  • Sergey Brin and DeepMind CTO Koray Kavukcuoglu are directly overseeing the effort, with Brin writing that Google must "urgently bridge the gap in agentic execution"
  • Brin mandated that all Gemini engineers use internal AI agents for complex multi-step tasks, making AI adoption a requirement rather than optional
  • Google sees stronger coding capabilities as a stepping stone toward self-improving AI systems that could eventually automate AI research itself
  • The company tracks usage of its internal coding tool "Jetski" and ranks teams accordingly, similar to Meta's token usage metrics
  • Google is training models on its internal codebase, which differs from public code and cannot be released publicly, but could improve both internal development and future public models
  • Coding has become the primary battleground for major AI labs in 2026, with OpenAI shutting down Sora video generation to redirect compute resources toward other models
Decoder
  • Agentic execution: AI systems that can autonomously plan and carry out multi-step tasks without constant human guidance
  • Long-horizon tasks: Complex programming work requiring planning across many steps and future consequences, like designing entire applications from scratch
  • Self-improving AI: AI systems capable of modifying and enhancing their own capabilities, potentially leading to rapid recursive improvement
  • Jetski: Google's internal AI-powered coding assistant used by employees
Original article

Google builds elite team to close the coding gap with Anthropic

Google is doubling down on AI coding, using more AI internally and aiming for models that can eventually improve themselves.

Google DeepMind has put together a specialized team of researchers and engineers to sharpen the programming chops of its Gemini models, The Information reports. The group is led by DeepMind engineer Sebastian Borgeaud, who previously ran pre-training for the company's models.

The team is focused on complex, long-horizon programming tasks like writing new software from scratch, work that requires models to read files and figure out what the user actually wants. Part of the motivation: Google researchers think Anthropic's coding tools are better.

Coding has become the battleground for every major AI lab this year, with OpenAI and Google both scrambling to catch up to Anthropic. OpenAI recently pulled the plug on its Sora video generator to free up compute for training and running other AI models.

Brin pushes for self-improving AI

Google co-founder Sergey Brin and DeepMind CTO Koray Kavukcuoglu are directly involved in the effort. "To win the final sprint, we must urgently bridge the gap in agentic execution and turn our models into primary developers" of code, Brin wrote in an internal memo. He also required every Gemini engineer to use internal agents for complex, multi-step tasks.

Brin told employees that stronger coding skills are a stepping stone toward AI that can improve itself. A sophisticated coding agent, paired with AI that handles math problems and experiments, could eventually automate much of the work done by AI researchers and engineers.

Internally, Google tracks how much its coding tool "Jetski" gets used and ranks teams accordingly, a setup similar to Meta, which tracks token usage as its metric. Some teams outside DeepMind also require engineers to attend AI training sessions.

According to The Information's sources, Google is leaning more heavily on models trained on its internal code. Google's internal codebase looks very different from the public code typically used to train general-purpose coding agents, so these internally trained models can't be released publicly. They could, however, help Google build better models that eventually ship to users, while also speeding up internal development.

Your Future Phone Will Have a Battery You Can Swap at Home — Thanks to the EU (2 minute read)

Tech
The European Union has mandated that smartphones must have user-replaceable batteries that can be swapped at home with basic tools.
What: A new EU law requires phone manufacturers to design devices with batteries that consumers can easily remove and replace themselves using commonly available tools, reversing the industry trend toward sealed, non-serviceable designs.
Why it matters: This regulation could extend device lifespans, reduce electronic waste, and force a fundamental rethink of phone hardware design that has prioritized thinness and water resistance over repairability for over a decade.
Original article

The EU has passed a law that requires phones to have batteries that can be easily changed by users at home using basic tools.

Apple to Focus Hardware Team on Five Areas Under Johny Srouji (3 minute read)

Tech
Apple reorganizes its hardware division into five focused areas under newly appointed chief hardware officer Johny Srouji.
What: The combined hardware engineering and hardware technologies division will be structured into hardware engineering, silicon, advanced technologies, platform architecture, and project management teams. This follows Srouji's promotion to chief hardware officer and John Ternus's elevation to CEO.
Why it matters: The reorganization signals Apple's strategic priorities in hardware development, with silicon design and advanced technologies getting dedicated focus alongside traditional hardware engineering and cross-platform coordination.
Original article

Apple's newly combined hardware engineering and hardware technologies division will be organized into hardware engineering, silicon, advanced technologies, platform architecture, and project management divisions.

GitHub halts new Copilot signups amid soaring usage and rising costs (2 minute read)

Tech
GitHub has temporarily stopped accepting new signups for its paid Copilot plans due to unexpectedly high usage and associated infrastructure costs.
What: GitHub paused new signups for GitHub Copilot Pro, Pro+, and Student subscription tiers, indicating that demand and resource consumption have exceeded their capacity to scale the AI coding assistant service.
Why it matters: This highlights the real infrastructure challenges of running large language models at scale, even for major platforms like GitHub, and signals potential pricing or capacity adjustments ahead as AI coding tools face sustainability pressures.
Takeaway: Developers interested in paid Copilot plans will need to wait for GitHub to reopen signups, likely after addressing infrastructure scaling or cost management issues.
Original article

GitHub has paused new signups for GitHub Copilot Pro, Pro+, and Student plans.

Google's AI adoption (3 minute read)

Tech
Former Googler Steve Yegge reveals that Google's own DeepMind engineers use Anthropic's Claude over Google's Gemini, exposing a two-tier system and internal dysfunction around AI adoption.
What: According to anonymous sources within Google, the company has a split AI tool policy where DeepMind engineers use Claude daily while other engineers are restricted to internal Gemini variants that reportedly have reliability problems severe enough to cause attrition concerns.
Why it matters: This reveals a fundamental credibility problem when the team building Google's AI doesn't trust it enough to use it themselves, and shows that even AI industry leaders struggle with the gap between developing AI and actually adopting it internally for productivity gains.
Takeaway: Evaluate AI coding tools based on actual productivity and quality rather than vendor or internal politics, and resist mandated adoption metrics that encourage box-checking over genuine value.
Deep dive
  • Google attempted to equalize AI tool access by proposing to remove Claude for everyone, but DeepMind engineers objected so strongly that several threatened to quit
  • Non-DeepMind engineers are pushed onto internal Gemini variants hidden behind router-style names that obscure which model is actually serving requests
  • Multiple engineers report regressions and reliability problems severe enough that senior engineers have stopped using the tools entirely
  • Leadership has responded to low adoption by mandating AI usage in OKRs and creating an internal token-usage leaderboard to track who uses AI tools
  • Managers received contradictory guidance about whether the leaderboard will be used for performance reviews, creating confusion and distrust
  • Google claims 40,000 software engineers use agentic coding weekly, but Yegge argues "weekly" is a low bar that includes people who tried it once and abandoned it
  • A senior manager on a major product line has flagged attrition concerns specifically related to poor AI tooling quality
  • Anonymous Googlers reached out to Yegge expressing fear of being doxxed and concern about internal bullying over this issue
  • The situation suggests Google's engineering culture hasn't adapted to high-volume AI-assisted coding practices
  • Yegge emphasizes that even companies that look far ahead from the outside are struggling with AI adoption, and no one should feel behind
Decoder
  • DeepMind: Google's AI research lab, the team that built models like AlphaGo and contributes to Gemini development
  • Agentic coding: AI tools that autonomously perform multi-step coding tasks rather than just autocomplete suggestions
  • OKRs: Objectives and Key Results, Google's goal-setting framework used to measure employee performance
  • Router-style names: Internal naming conventions that hide which specific AI model is actually processing requests
  • Token-usage leaderboard: Internal dashboard tracking how many AI tokens (units of text processing) each engineer uses, meant to measure AI adoption
Original article

DeepMind engineers use Claude as a daily tool, but most of the rest of Google does not.

Tim Cook stepping down as Apple CEO, John Ternus taking over (2 minute read)

Design
Apple CEO Tim Cook is stepping down after 15 years, with hardware engineering chief John Ternus taking over on September 1 as the company transitions leadership at its $4 trillion valuation.
What: Tim Cook will transition from CEO to executive chairman after leading Apple since 2011, while John Ternus, the 51-year-old Senior Vice President of Hardware Engineering who has been at Apple since 2001, becomes the new CEO effective September 1, 2026.
Why it matters: This marks the end of one of the longest and most successful CEO tenures in tech, as Cook quadrupled Apple's revenue and expanded services to over $100 billion annually, while Ternus brings a focus on hardware innovation, sustainability, and repairability that could signal Apple's priorities for the next era.
Deep dive
  • Cook inherited Apple in 2011 after Steve Jobs' death, facing uncertainty about whether anyone could follow the company's legendary founder, and leaves behind a company worth $4 trillion with revenue that quadrupled during his tenure
  • Originally hired in 1998 to fix Apple's disastrous supply chain, Cook was a methodical operations expert rather than a product visionary, but proved himself during Jobs' health-related absences in 2004, 2009, and 2011
  • Cook's biggest stumble was Vision Pro, the mixed-reality headset that consumers largely ignored due to its several-thousand-dollar price tag and heavy form factor
  • Under Cook, Apple Services grew to exceed $100 billion annually and the Apple Watch captured roughly 25% of global smartwatch sales
  • Ternus joined Apple's product design team in 2001 after studying mechanical engineering at Penn and briefly working on VR headsets at a small firm
  • He was promoted to SVP of Hardware Engineering in 2021 when his predecessor Dan Riccio moved to oversee Vision Pro, making him the youngest member of Apple's executive team at the time
  • Ternus has overseen key products including iPad, AirPods, and multiple generations of iPhone, Mac, and Apple Watch, plus recent releases like the iPhone 17 lineup and MacBook Neo
  • His team developed AirPods into an over-the-counter hearing health system beyond just headphones
  • Ternus has made durability and repairability major priorities, introducing recycled aluminum compounds and manufacturing techniques that reduce carbon footprint while extending device lifespans
  • Arthur Levinson, who served as non-executive chairman for 15 years, will become lead independent director while Cook remains as executive chairman
Decoder
  • Executive Chairman: A senior role where Cook will remain on Apple's board and provide strategic guidance but won't handle day-to-day operations as CEO
  • Vision Pro: Apple's mixed-reality headset that combines virtual and augmented reality, launched as Cook's bet on the next major computing platform but failed to gain consumer traction
  • SVP: Senior Vice President, a top executive position at Apple reporting directly to the CEO
Original article

Apple CEO Tim Cook will step down after 15 years in the role, transitioning to executive chairman while hardware chief John Ternus becomes CEO on September 1. Cook leaves behind a $4 trillion company with massively expanded services and wearables businesses, despite some product missteps like Vision Pro. Ternus, a longtime Apple engineer, is expected to continue shaping the company's hardware and sustainability efforts as he takes over leadership.

‘WhatsApp Plus' subscription launching soon with new features (1 minute read)

Design
WhatsApp is testing a $2.99/month subscription tier with cosmetic features, marking Meta's push to add subscription revenue to its messaging platform.
What: The "WhatsApp Plus" subscription, currently in Android beta testing, offers premium customization options including exclusive stickers, themes, custom app icons, additional chat pins, and enhanced notifications, with iOS and broader platform support planned.
Why it matters: This represents a significant business model shift for WhatsApp, which has remained largely free since Meta acquired it, and reflects the broader industry trend of established platforms adding freemium subscription tiers to diversify beyond advertising revenue.
Original article

WhatsApp is testing a new ‘WhatsApp Plus' subscription in its Android beta, priced around $2.99 per month, as part of Meta's push toward subscription revenue. The plan adds customization features like premium stickers, themes, app icons, extra chat pins, and upgraded notifications. While still limited to testers, the feature is expected to expand to iOS and other platforms ahead of a broader launch.

How UX Designers Can Build A Personal AI Operating System (9 minute read)

Design
UX designers should build personal AI operating systems that codify their decision-making logic rather than treating AI as just a search engine for small tasks.
What: Article argues UX designers should create AI systems that encode their judgment patterns and decision-making logic for recurring work situations, turning AI agents into "judgment proxies" rather than just document generators.
Why it matters: This shifts AI from a basic productivity tool to a system that can handle repeated, energy-draining decisions using your own codified thinking patterns, freeing up mental energy for more important work.
Takeaway: Map your recurring work situations, how you've handled them previously, and what constitutes good judgment in those contexts as the foundation for your AI system.
Decoder
  • Personal AI Operating System: A customized AI setup that encodes an individual's decision-making patterns and judgment rather than generic capabilities
  • Judgment proxy: An AI agent that makes decisions based on your codified thinking patterns rather than just generating documents or answering queries
Original article

Most UX designers treat AI as a glorified search engine for small tasks, but the real value comes from building a personal AI operating system that codifies your own decision-making logic. The foundation isn't prompts or tools — it's mapping recurring work situations, how you handled them, and what good judgment looks like, so an AI agent can actually operate on your thinking. Once that's done, useful agents aren't document generators but judgment proxies for the repeated, energy-draining conversations only you could previously handle.

I do Design Innovation. I barely open Figma anymore. (5 minute read)

Design
A designer at an AI-native startup explains how their work shifted from Figma mockups to writing code prototypes and behavioral specifications that both engineers and AI systems can execute.
What: A first-person account from a designer at Swoop, a 20-person AI-native team, describing how AI features—which center on timing, conversation, and dynamic behavior rather than static screens—require new design artifacts like code prototypes and pull requests instead of traditional Figma files.
Why it matters: This represents an emerging model of design work where traditional role boundaries blur: designers prototype in code, pull requests function as interactive specifications, and design systems are written as markdown files consumed by both humans and AI models, though the author acknowledges this pattern works best in small, code-fluent teams.
Takeaway: If you're building AI-heavy features, consider whether behavioral specifications (timing logic, decision thresholds, edge cases) communicated through runnable code prototypes might work better than static mockups for design-engineering collaboration.
Deep dive
  • AI-native features often have minimal traditional interface—short conversations, voice prompts, background agents—making static Figma frames inadequate for capturing timing, behavior thresholds, and intervention logic
  • Three working modes emerged: AI-logic-heavy (designer writes behavioral descriptions, engineers implement), UX-heavy (designer builds end-to-end in code), and innovation (designer vibe-codes prototypes in Cursor or Claude Code)
  • Pull requests function as executable behavioral specifications that engineers can run, modify parameters on, and observe effects—unlike Figma prototypes that only approximate behavior
  • A fourth mode involves designing frameworks and interaction patterns that constrain entire feature families, requiring cross-team context and Director-level authority
  • Design artifacts now serve two readers: human engineers who build from them and AI systems that execute them, eliminating ambiguity that humans could resolve by asking questions
  • The team's design system lives in a single DESIGN.md file consumed by both engineers and the AI model, with no parallel Figma library
  • This workflow succeeds because the team is small, AI-native from inception, and has senior engineers comfortable treating designer PRs as reference material rather than production code
  • The author frames this as a "second shape" of design work coexisting with traditional Figma-centered workflows, not a replacement—enterprise teams with rigid roles still use established processes
  • The shift isn't about designers becoming engineers but about design judgments shipping through code-adjacent artifacts when designing behavior rather than surfaces
  • Management overhead is distributed across the team rather than concentrated, enabled by everyone shipping code and working across traditional role boundaries
Decoder
  • IC: Individual Contributor, a non-management role focused on hands-on execution rather than people management
  • Vibe-code: Informal prototyping in code to explore how a feature should feel, typically not production-ready but executable enough to demonstrate behavior
  • DESIGN.md: A markdown file containing design system specifications readable by both human developers and AI models, replacing traditional visual design libraries
  • Behavioral spec: A description of how a system should act—timing, thresholds, decision logic, edge cases—rather than how it should look
  • AI-native team: A team built from inception around AI capabilities where workflows, roles, and tooling assume AI as a core part of the stack rather than an add-on
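
The "design system as a single DESIGN.md file" idea is easier to picture with an example. The file below is a hypothetical sketch (none of its rules come from the article): a behavioral spec written so that both an engineer and an AI model can follow it.

```markdown
# DESIGN.md — hypothetical single-file design system

## Voice & timing
- Wait 1.5s of silence before the assistant speaks; never interrupt mid-sentence.
- Background agents surface at most one suggestion per session.

## Decision thresholds
- Show a proactive prompt only when intent confidence exceeds 0.8; otherwise stay passive.

## Visual defaults
- Corner radius: 8px on cards, full-round on pills.
- Motion: 200ms ease-out entrances; no bounce.
```
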
Original article

In AI-native teams, design is shifting away from static tools like Figma toward shaping behavior—timing, logic, and interaction—which can't be fully captured in traditional mockups. Designers increasingly work across coding, research, and strategy, using prototypes and pull requests as “behavioral specs” that engineers and even AI systems can interact with directly. This creates a new model of design work: faster, more code-adjacent, and focused on systems and frameworks rather than screens, while traditional Figma-based workflows still coexist in more structured environments.

The New Designer/Developer Collaboration (4 minute read)

Design
A team built a production website in three weeks using Intent, an AI tool that automatically translated Figma designs into code and let designers, developers, and project managers work simultaneously on the same codebase.
What: Intent is an AI development tool that connects to Figma via MCP, uses agent configuration files to automatically translate design tokens and layouts into code (achieving 85% fidelity in 1-2 days), and allows multiple team members to work in parallel through AI-assisted workspaces—the designer handled alignment and animations through natural language commands, the project manager updated content, and the developer managed templated pages and code standards.
Why it matters: This workflow eliminates the traditional sequential design-to-development handoff bottleneck by letting non-developers contribute directly to the codebase while maintaining design system consistency, fundamentally changing who can ship changes without waiting for developer availability.
Takeaway: If your team struggles with design-to-development handoffs, explore Intent or similar AI-assisted tools that can translate design systems into code and enable parallel collaboration across roles.
Deep dive
  • Team started with traditional two-week Figma design process including visual explorations, wireframes, grid setup, typography scales, color variables, and reusable components
  • Developer connected Intent to Figma via MCP and created agents.md file pointing to artboards, using Astro and Tailwind as the tech stack
  • Intent agents automatically pulled design tokens into Tailwind config and laid out pages using those tokens in parallel workspaces, desktop first then mobile
  • Front-end infrastructure reached 85% fidelity in 1-2 days of work, creating functional pages using the design system without manual pixel-pushing
  • Designer worked in Intent using grid overlays to verify alignment, instructing agents to "align to column three" rather than guessing percentage values
  • Designer handled animations and entrance effects in hours instead of days through natural language instructions combined with manual control of easing curves
  • Project manager handled content updates, blog posts, image assets, and text changes using simple single-agent tasks with frequent commits
  • Developer managed templated pages where variable content required design rules, standardized code patterns through agents.md documentation, and handled PR merging
  • Out of 30-40 pull requests across the project, only 5 required manual intervention, with Intent handling most git conflict resolution automatically
  • Design tokens and initial setup ensured all team members' contributions remained consistent with design and development architecture regardless of technical skill
  • During pre-launch crunch time, all three team members made changes simultaneously without breaking the design system or codebase
  • Key advantage was eliminating the developer bottleneck where every change requires developer implementation, while still maintaining code quality and design consistency
Decoder
  • Intent: An AI-powered development tool that uses agents to translate designs into code and enables collaborative coding through natural language instructions
  • MCP (Model Context Protocol): A connection protocol that allows Intent to access and read Figma design files programmatically
  • agents.md: A configuration file that defines instructions, patterns, and conventions for AI agents to follow when generating code
  • Design tokens: Standardized design values like colors, spacing, and typography that can be extracted from design tools and translated into code variables
  • Workspace: An isolated environment in Intent where specific tasks are performed by AI agents before changes are committed to the codebase
Original article

There are lots of ways to build a website. Most of them involve designers working in one tool, developers working in another, and a painful handoff process somewhere in between. We recently used Intent to design, build, and ship a well-crafted website in about three weeks, and the collaboration model that emerged shined a light on how things could (no, should) be.

Design First

We started the way most Web projects start these days: in Figma. Visual explorations for the style, wireframes for the structure, then bringing the two together into full page layouts. Our designer set up the grid, typography scales, color variables, buttons, and reusable components. Your typical design system.

Aria Design System in Figma

This process took about two weeks and was pretty standard. Desktop and mobile comps, a couple rounds of feedback on visuals and copy, iterating until we had a visual style, a rough structure, and directional content. Just a solid Web design process.

Aria Web sites page designs in Figma

Development Foundation

Once the designs were in a good place, our developer jumped in. But not by staring at a Figma file and manually translating pixels into code. Instead, he opened up Intent, set up the project scaffolding (Astro, Tailwind), connected to the Figma MCP, and wrote an agents.md file that pointed to all the artboards.
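
For readers unfamiliar with the pattern, an agents.md file like the one described might look roughly like this. Everything below is a hypothetical sketch; the article doesn't show the actual file:

```markdown
# agents.md — hypothetical sketch

## Stack
- Astro + Tailwind. Pages in `src/pages/`, shared components in `src/components/`.

## Design source
- Figma (via MCP): artboards "Home / Desktop", "Home / Mobile", "News", "Product Detail".
- Use the design tokens in `tailwind.config.js`; never hard-code colors or spacing.

## Conventions
- Icons live in `src/components/icons/` and are imported as components.
- Mobile layouts are separate passes; don't restyle desktop while doing mobile.
```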

Then he kicked off a series of workspaces. The first one pulled the design tokens into Tailwind. The second started laying out the first page using those tokens. After that, he was able to break off into parallel workspaces, one for each page. Desktop layouts first, then separate passes for mobile.

This whole phase, the front-end infrastructure, took maybe one or two days of actual work. And by the end, every page existed in code, using the design system, at roughly 85% fidelity. Not pixel perfect, but pretty damn close.

Parallel Work

Once he deployed the site to a staging URL, the three of us started working in Intent simultaneously: our designer, our front-end developer, and me handling product/project management. Though we were all using the same tool, we each worked in our own way.

Our designer set up a grid overlay so he could visually verify alignment. He would tell the agent "align to column three" and it would snap things into place (way better than guessing at percentage values). He preferred staying in one workspace to tweak alignment and refine grid positioning across a full page before committing things.

Designer Workflow in Intent - setting up a grid

Once the pages were structurally solid, he moved on to animations: entrance effects on homepage elements, scroll-triggered transitions, etc. Work that normally takes days of back-and-forth between a designer specifying timing curves and a developer implementing them happened in a matter of hours. He still maintained manual control where it mattered, finding the exact easing curve he wanted, then telling the agent to use it. The implementation was handled for him so he could focus on how things felt.

Designer Workflow in Intent - animation tweaking

Meanwhile, I was doing content and product work. Dumping in blog posts from Word docs, adding image assets, making text changes based on feedback from the broader team. My approach was simple: small discrete tasks with a single agent. Fix one thing, commit. Fix another thing, commit. Once I had four or five commits, I'd open a pull request, toss out the workspace, and start a new one. The design tokens and setup our developer created ensured my changes were all in line with our design and development architecture.

Working on Aria Web Site in the Intent app

Our developer's job during this phase was partly creative and partly managerial. He handled the templatized pages (news, product detail) where variable content meant design rules mattered more. He also kept an eye on pull requests, merged changes, resolved conflicts, and updated the agents.md file when he noticed patterns emerging in the code that should be standardized.

Developer Workflow in Intent

For example, when he saw icons being added in a way that wouldn't scale, he set up a better pattern and documented it. The next time anyone needed to add icons, the agent just followed the convention automatically. He used Intent for conflict resolution too, pulling up conflicting branches and having the agent sort them out. Out of maybe 30 or 40 pull requests across the project, only five needed real manual intervention.

Developer Workflow in Intent

Same tool, three different workflows, nobody waiting on each other.

Crunch Time

Every web project has a crunch period right before launch and ours was no different. The broader team started paying attention (as they always do at the very end), and feedback flooded in. But because the three of us could all be in Intent making changes at the same time, the crunch was way more manageable than usual.

Crunch time for the Aria Web Site in Slack

The biggest win was that any one of us could contribute meaningfully to the codebase without breaking the design system, code structure, or the site. That's a fundamentally different dynamic than waiting for a developer to make every change.

A New Way of Working?

It wasn't perfect. CSS layout struggles are still a thing. Git seems to keep finding ways to bite you. And there's still a learning curve for non-developers, even with agents handling the hard parts.

But without the handoff, everyone builds. And that makes all the difference.

Free Online Vector Design Tool (Website)

Design
Graphite is an open-source vector graphics editor that uses a node-based procedural workflow, letting you edit everything non-destructively through adjustable parameters instead of manually tweaking individual elements.
What: Graphite is a free, open-source vector graphics editor currently in alpha that combines traditional layer-based compositing with node graph systems for generative design. It's available as a web app now, with native Windows, Mac, and Linux desktop apps in release candidate testing.
Why it matters: The procedural node-based approach is a fundamentally different paradigm from traditional vector editors like Illustrator—every design decision becomes a tweakable parameter, making it easier to iterate on designs and create generative artwork. The project aims to become an all-in-one 2D content creation suite covering graphic design, animation, image editing, and VFX compositing.
Takeaway: Try the web app at the Graphite website or install it as a PWA, and join the Discord #🔥announcements channel to test the desktop app release candidates before official launch.
Deep dive
  • Graphite uses a node graph system as its core, where artwork is represented as interconnected nodes that can be adjusted parametrically rather than editing individual elements directly
  • The non-destructive workflow means all design decisions remain editable—you can change circle density, size, placement patterns, color schemes, and shape morphing after the fact by adjusting node parameters
  • Built with Rust and leveraging WebAssembly and WebGPU for high performance both in browsers and on native platforms with minimal overhead
  • Runs entirely locally on user hardware with no server component, ensuring privacy and offline capability
  • Currently focused on vector graphics editing but evolving toward a general-purpose 2D content creation suite covering motion graphics, raster editing, digital painting, page layout, and VFX compositing
  • The web version is live now and installable as a PWA, while native desktop apps for Windows, Mac, and Linux are in release candidate stage
  • Future roadmap includes fully-featured raster manipulation, live collaborative editing, and the ability to compile standalone programs from node systems for studio production pipelines
  • Vector artwork can be infinitely zoomed and exported at any resolution without pixelation
  • The project is seeking donations to maintain sustainability and independence as a free, open-source alternative to commercial tools
  • Active community with Discord, newsletter, and upcoming in-person meetup in Karlsruhe, Germany (October 2025), plus Google Summer of Code internships for Rust developers
Decoder
  • Non-destructive editing: A workflow where original data is never permanently altered, allowing any decision to be changed later without loss of quality or having to start over
  • Node-based design: A visual programming approach where operations are represented as connected boxes (nodes) in a graph, with data flowing between them to generate the final output
  • Procedural workflow: Creating content through algorithms and parameters rather than manual drawing, allowing easy iteration by adjusting values instead of redoing work
  • WebAssembly: A binary instruction format that lets languages like Rust run in browsers at near-native performance, far faster than JavaScript
  • WebGPU: A modern browser API for high-performance graphics and computation, giving web apps access to GPU acceleration
  • PWA (Progressive Web App): A web application that can be installed and behave like a native desktop app, working offline and appearing in app launchers
  • Parametric editing: Designing by defining relationships and constraints through adjustable parameters, so changes propagate automatically through the system
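
The node-graph idea is easier to see in miniature. The toy JavaScript below (purely illustrative, not Graphite's actual API or data model) treats the artwork as a graph of parameter-driven nodes; tweaking one parameter and re-evaluating regenerates the result, which is what makes the workflow nondestructive:

```javascript
// Toy node-graph evaluator — illustrative only, not Graphite's API.
// Each "node" is a pure function of its inputs plus editable parameters.
const circleGrid = ({ rows, cols, radius }) => {
  const shapes = [];
  for (let r = 0; r < rows; r++)
    for (let c = 0; c < cols; c++)
      shapes.push({ x: c * radius * 3, y: r * radius * 3, radius });
  return shapes;
};

const scaleNode = (shapes, { factor }) =>
  shapes.map((s) => ({ ...s, radius: s.radius * factor }));

// The "document" is the graph plus its parameters, not the baked-out shapes.
const params = { grid: { rows: 2, cols: 3, radius: 10 }, scale: { factor: 2 } };
const evaluate = (p) => scaleNode(circleGrid(p.grid), p.scale);

console.log(evaluate(params).length); // 6 circles
params.grid.cols = 5;                 // adjust one parameter...
console.log(evaluate(params).length); // ...and re-evaluate: now 10 circles
```
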
Original article

Your procedural toolbox for 2D content creation

Graphite is a free, open source vector graphics editor and animation engine, available now in alpha. Get creative with a fully nondestructive editing workflow that combines layer-based compositing with node-based generative design.

Gallery highlights: Painted Dreams, made using nondestructive boolean operations and procedural polka dot patterns; a magazine spread design previewing the upcoming focus on desktop publishing; a procedurally generated animation demonstrating Graphite's approach to creative coding with nodes; and a Mandelbrot fractal filled with a noise pattern, procedurally generated and infinitely scalable.

What's new?

The latest major update is out now! See what the team has been cooking up recently.

Software overview

Starting life as a vector editor, Graphite is evolving into a general-purpose, all-in-one graphics toolbox that is built more like a game engine than a conventional creative app. The editor's tools wrap its node graph core, exposing user-friendly workflows for vector, raster, animation, and beyond.

One app to rule them all

Stop jumping between programs. Upcoming tools will make Graphite a first-class content creation suite for many workflows, including:

  • Graphic Design
  • Motion Graphics
  • Image Editing
  • Digital Painting
  • Page Layout & Print
  • VFX Compositing

Current features

  • Vector editing tools
  • Procedural workflow for graphic design
  • Node-based layers
  • Forever free and open source

Presently, Graphite is a lightweight offline web app with features primarily oriented around procedural vector graphics editing.

Upcoming features

  • All-in-one creative tool for all things 2D
  • Fully-featured raster manipulation
  • Windows/Mac/Linux native apps + web
  • Live collaborative editing

Desktop-first and web-ready

Where's the download? The web app is currently live and desktop apps for Windows, Mac, and Linux are now in a release candidate stage. Check the announcements channel of the Discord for download links if you would like to partake in testing prior to the impending launch.

Graphite is designed principally as a professional desktop application that is also accessible in a browser for quick access from anywhere. It's built for speed with (nearly) no JavaScript. And regardless of platform, it runs locally and privately on your own hardware—there is no server.

Engineering the tech for a native app distributed across three new platforms takes extra time. That's why supporting the web platform, which keeps up-to-date and reaches all devices, has been the initial target. For now, you can install the app as a PWA for a desktop-like experience.

Graphite's code architecture is structured to deliver true native performance for your graphically intensive workloads on desktop platforms and very low overhead on the web thanks to WebAssembly and WebGPU, new high-performance browser technologies.

The power of proceduralism

Graphite is the first and only comprehensive graphic design suite built for procedural editing—where everything you make is nondestructive.

Explore parametric possibilities

Save hours on tedious alterations and make better creative choices. Graphite lets you iterate rapidly by adjusting node parameters instead of individual elements.

Scatter circles with just a couple nodes. Want them denser? Bigger? Those are sliders. Want a different placement area? Just tweak the path.
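The slider-driven iteration described here maps naturally onto code. A hypothetical Python sketch of the same parametric idea (not Graphite's actual node implementation): density, size, and the placement path are arguments you re-run, not shapes you redraw.

```python
import random

def scatter_circles(path_points, density=50, radius=4.0, seed=0):
    """Scatter `density` circles near a placement path.

    Like a node graph, changing a parameter and re-running
    regenerates the result -- nothing is redrawn by hand.
    """
    rng = random.Random(seed)  # seeded so the output is reproducible
    circles = []
    for _ in range(density):
        # Pick a random point on the path and jitter it slightly.
        x, y = rng.choice(path_points)
        jx = x + rng.uniform(-radius, radius)
        jy = y + rng.uniform(-radius, radius)
        circles.append((jx, jy, radius))
    return circles

# "Denser? Bigger?" -- those are just different arguments:
sparse = scatter_circles([(0, 0), (10, 5), (20, 0)], density=20, radius=2.0)
dense = scatter_circles([(0, 0), (10, 5), (20, 0)], density=200, radius=6.0)
```

Changing the placement area is likewise a matter of passing a different `path_points` list rather than repositioning each circle.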

Mix and morph anything

Nondestructive editing means every decision is tied to a parameter you can adjust later on. Use Graphite to interpolate between any states just by dragging value sliders.

Blend across color schemes. Morph shapes before they're scattered around the canvas. The options are endless.

Geared for generative pipelines

Graphite's representation of artwork as a node graph lets you customize, compose, automate, reuse, and share your content workflows.

  • Infinitely pan and zoom without pixelation, export any resolution
  • Compile standalone programs from node systems (future)
  • Deploy asset pipelines in studio production environments (future)

The Future of UI Design is Agentic Design (7 minute read)

Design
AI agents are now generating complete UI designs directly inside Figma, moving beyond simple assistance to core integration in the design workflow.
What: Anima Agent is a Figma plugin that uses AI models (defaulting to Claude Sonnet 4.6) to generate elaborate user interfaces from text prompts, supporting three scenarios: building new designs from scratch, creating variants of existing designs, and assembling screens from pre-existing design system components.
Why it matters: This represents a fundamental shift in design workflows where AI creates actual deliverables rather than just offering suggestions, though the generated designs come with caveats like visual defects requiring manual cleanup and detached component instances that aren't linked to the original design system.
Takeaway: Designers can install the Anima Agent plugin in Figma to experiment with AI-generated UI designs in their workflow.
Deep dive
  • The plugin uses Claude Sonnet 4.6 as the default model with options to switch to OpenAI GPT 5.3 Codex, while Claude 4.6 Opus is available but consumes more tokens for similar output quality
  • All generated designs include auto-layout configurations built in, making them immediately adjustable and responsive without manual setup
  • When building from scratch, the tool can generate 5-7 complete app screens with consistent visual language from a single detailed prompt describing user needs, UX requirements, visual design specs, and components
  • Generated designs may contain visual defects like misplaced navigation bars that require manual correction before handoff to development teams
  • Creating variants allows designers to select existing frames and request alternative versions focused on specific aspects like data visualization approaches
  • When working with design systems, the plugin can use selected components to assemble new screens, but generates detached visual copies rather than true component instances linked to the master design system
  • This disconnect from the source design system means future updates to components won't propagate to AI-generated screens, creating potential maintenance challenges
  • The approach differs from autonomous AI design tools that generate and ship coded prototypes directly, instead keeping designers in control with Figma-based refinement
  • The workflow represents a middle ground between traditional manual design and fully autonomous AI generation, maintaining designer agency while accelerating production
Decoder
  • Agentic design: An approach where AI tools are integrated at the core of the design process to actively create UI designs, rather than just providing suggestions or answering questions
  • Auto-layout: A Figma feature that makes design elements automatically adjust and reflow when content or container sizes change, similar to CSS flexbox
  • Design system: A collection of reusable components, patterns, and guidelines that ensure visual and functional consistency across a product
  • Detached instances: Visual copies of components that look identical but aren't linked to the original master component, so they don't update when the master changes
  • Claude Sonnet 4.6: An AI language model from Anthropic used to interpret design prompts and generate UI specifications
  • Autonomous AI design: A fully automated approach where AI tools create complete designs and ship coded prototypes without designer intervention in the process
Original article

Agentic design is reshaping UI workflows. Tools like Anima Agent are enabling designers to generate elaborate interfaces directly inside Figma using AI. The plugin defaults to Claude Sonnet 4.6 and supports three core scenarios: building new designs from scratch, creating variants of existing ones, and assembling screens from a pre-existing design system. Generated designs come with auto-layout built in, though they may require manual cleanup for visual defects and produce detached — rather than true — Figma component instances.

Autopilot, agentic AI, and the dangers of imperfect metaphors (14 minute read)

Design
The autopilot metaphor for agentic AI is fundamentally misleading because autopilot operates with transparent, rule-based logic while AI lacks explainability and requires far more skilled human oversight than public perception suggests.
What: An opinion piece arguing that comparing AI and agentic AI to airplane autopilot creates dangerous misconceptions—autopilot systems work through strict, explainable rules with constant human monitoring, while AI depends heavily on prompting, lacks transparency about its reasoning, and struggles with complexity outside narrow constraints.
Why it matters: The metaphor shapes how the technology is perceived and trusted by the public, obscuring AI's limitations and creating unrealistic expectations about autonomy when effective AI use actually requires expertise similar to how pilots actively manage autopilot rather than treating it as fully autonomous.
Takeaway: Constrain AI tools to hyper-focused, well-defined tasks with limited scope and active human oversight, and develop expertise in understanding what's happening "under the hood" rather than treating them as magical black boxes.
Deep dive
  • Autopilot systems use negative feedback loops to maintain equilibrium through sensors and central processing, with all inputs and outputs being fully explainable and transparent to pilots at any moment
  • Wiener's Law describes autopilot as "Dumb and Dutiful"—it accepts any acceptable input (even illogical ones) and always follows core objectives, requiring pilots to constantly verify outputs and maintain situational awareness
  • AI's core problem is "explainability"—it cannot show the reasoning behind its outputs, making it impossible to audit or understand the "paper trail" of how it arrives at conclusions
  • Agentic AI depends heavily on prompt engineering, and even well-crafted prompts introduce ambiguity (defining "important emails" requires context that may change unpredictably over time)
  • Language choices like "Artificial Intelligence," purple color schemes, and sparkle icons anthropomorphize and present AI as "magic" rather than extrapolated statistics and mathematics
  • The framing parallels historical uses of euphemisms to shape narratives—from "carbon footprint" (created by oil industry) to "prediction markets" (allowing Kalshi to avoid gambling regulations)
  • Small language models (SLMs) performing focused tasks at a fraction of energy costs suggest AI works best when hyper-constrained, not as general-purpose autonomous agents
  • Most AI pilots at large companies are failing or not generating expected returns, likely because they're deployed too broadly without proper constraints
  • AI tools like Figma Make work best in the hands of experienced professionals who understand the domain (UX design, accessibility, design systems) and can recognize when outputs fail
  • The effective model is multiple limited-scope agents with governance feeding data to a central human operator—which ironically does resemble autopilot, but requires the same level of expertise pilots need
  • The general public lacks awareness of complexity behind everyday products and services, making them susceptible to accepting AI as another magical convenience rather than a tool requiring skilled operation
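The transparency gap described above can be made concrete. A minimal negative-feedback sketch in Python (illustrative only, not real avionics logic): every input, rule, and output is inspectable at any moment, which is exactly the "paper trail" the article says AI cannot provide.

```python
def hold_altitude(current, target, gain=0.1):
    """One step of a negative feedback loop: the correction is
    proportional to the error, and the rule is fully inspectable."""
    error = target - current
    return gain * error  # climb/descend command for this step

# Simulate the loop: each step applies the correction it computed.
altitude = 9000.0
for _ in range(50):
    altitude += hold_altitude(altitude, target=10000.0)

# The system converges toward the target; the "reasoning" behind
# every output is just the one-line rule above.
```

Auditing this controller means reading four lines of code; auditing why a language model classified an email as "important" has no equivalent procedure.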
Decoder
  • Agentic AI: AI systems given autonomy to make decisions and perform tasks on behalf of humans without constant input, using goal-oriented reward systems to complete objectives
  • Explainability: The ability to trace and understand the reasoning process behind an AI's outputs, like "showing your work" in math—something current AI largely cannot do
  • SLM (Small Language Model): Smaller, more focused language models that perform specific tasks at much lower energy costs than general-purpose generative AI
  • ADAS (Advanced Driver Assistance System): Car systems that use sensors and distance calculations to assist driving, using rules-based logic rather than intelligence
  • Wiener's Law: The principle that autopilot is "Dumb and Dutiful"—it accepts any valid input and follows objectives literally, requiring human oversight to prevent illogical outcomes
  • Tokenization: The process of breaking down inputs into discrete units for AI processing, affecting how the system interprets and generates responses
  • Reinforcement learning: Training AI through reward/penalty systems to improve performance on specific tasks over time
Original article

Comparing AI—especially agentic AI—to autopilot is misleading: autopilot systems operate within strict, transparent rules, while AI is far less explainable and depends heavily on context, prompting, and interpretation. Describing AI as “magic” or autonomous obscures its limitations, shapes public perception, and can lead to misplaced trust. AI is most effective when constrained to specific, well-defined tasks with human oversight, functioning more like controlled systems than independent intelligence—making clear understanding and honest framing essential.

Org Design in the Age of AI (3 minute read)

Design
AI should trigger a fundamental rethinking of organizational hierarchies, not just make existing workflows faster, because traditional structures exist primarily to route information between people.
What: A strategic analysis arguing that companies are missing the bigger opportunity by adding AI to existing workflows instead of redesigning org charts built around information routing that AI can now collapse.
Why it matters: The competitive advantage shifts from execution speed to learning speed—companies willing to restructure around what AI enables will pull ahead of those just optimizing current processes with faster tools.
Takeaway: Evaluate whether your team structure is still organized around translation costs between roles that AI has effectively eliminated.
Deep dive
  • Traditional hierarchies exist primarily to route information—aggregating signals from the front lines up and translating strategy down—not just to establish authority chains
  • The real bottleneck in product development is translation cost, not speed: when a PM writes a PRD, designers decode it into mocks, engineers decode mocks into code, and QA decodes behavior into tests, each handoff loses fidelity and requires alignment meetings
  • A typical mid-sized feature takes three to six months because making one person's understanding legible to another is genuinely hard, creating wait time between each translation
  • AI collapses these translation layers: PMs can go from idea to interactive prototype in a day, AI generates tests alongside code, and intelligence layers synthesize metrics in real-time
  • Sequential relay-race models (PM → design → eng → QA → GTM) will give way to small autonomous squads of 3-5 people with all necessary skills making decisions simultaneously
  • Middle management compresses because managers whose primary function was routing information lose their value proposition—only those providing judgment, coaching, and navigating ambiguity will thrive
  • The most radical shift is systems generating roadmaps autonomously: detecting patterns like a restaurant's cash flow tightening before a seasonal dip and automatically offering tailored financing before the merchant even looks for it
  • Departments will decompose into composable capability atoms—independent, self-contained functions that can be combined dynamically rather than organized by traditional functional boundaries
  • Competitive moat changes from execution speed (who ships fastest) to learning speed (how quickly the org can absorb what AI makes possible and restructure accordingly)
  • Most companies use AI like a faster horse to optimize existing structures, but the winners will ask what they'd build if designing the organization from scratch today
Decoder
  • PRD: Product Requirements Document, a specification written by product managers describing what a feature should do
  • QA: Quality Assurance, the team responsible for testing software before release
  • GTM: Go-to-Market, the strategy and execution of launching a product to customers
  • CI/CD: Continuous Integration/Continuous Deployment, infrastructure that automatically tests and deploys code changes
Original article

Org Design in the Age of AI

I've been talking to companies — startups to megacaps — about AI and organizational design. Everyone is adding AI to their workflows. Almost no one is rethinking why the workflow is shaped that way in the first place. The org chart is the next thing AI disrupts.


The hidden function of hierarchy

Strip a company down to first principles and it's really three things: people, hierarchy, and information flow. We tend to think of hierarchy as being about authority — who reports to whom, who approves what. But that's the surface. The deeper function of hierarchy is information routing. The org is too large for any single person to see the whole picture, so you install layers of managers to aggregate signals from the front lines, synthesize them, and pass them up — and to translate strategic intent from the top and distribute it down.

Most of the organizational machinery we take for granted exists to solve this problem. Meetings, status updates, steering committees, quarterly business reviews — these are all information-routing mechanisms. They exist because moving knowledge between people is expensive.


The real bottleneck was never speed

Consider how a typical product gets built. A PM spends weeks writing a PRD. Design interprets it into mocks. Engineering interprets the mocks and estimates "eight weeks" — at which point the requirements shift and the PRD gets rewritten. Dev takes two to three months. QA spends weeks on regression testing. GTM prepares launch materials and trains sales. End to end, a mid-sized feature easily takes three to six months.

The real bottleneck is translation cost. PM's intent gets encoded into a document. A designer decodes that document and re-encodes it as a visual. An engineer decodes the visual and re-encodes it as code. QA decodes the intended behavior and re-encodes it as test cases. Every translation loses fidelity. Every translation requires alignment meetings. Every translation generates wait time — not because people are slow, but because the act of making one person's understanding legible to another person is genuinely hard.

This is what AI collapses.


What it actually changes

When a PM can go from idea to interactive prototype in a day using AI, the translation layer between PM and engineering compresses to near-zero. When AI generates tests alongside code as it's being written, the handoff between dev and QA disappears. When an intelligence layer can synthesize customer signals, usage data, and business metrics in real time, the middle manager whose job was to aggregate that information weekly has to find a different source of value.

This isn't about any single role getting faster. It's about the gaps between roles — the translation layers, the handoff queues, the alignment meetings — evaporating.

And once you see it that way, the implications for org design get serious:

  • The relay race becomes a basketball game. The sequential handoff model — PM then design then eng then QA then GTM — gives way to small squads of 3–5 people with all the necessary skills, moving simultaneously, making most decisions themselves. Only big directional bets escalate up.

  • Departments decompose into capability atoms. Instead of teams organized by function, the org becomes a set of independent, composable capabilities — collections, identity verification, risk scoring, savings — each self-contained, each combinable.

  • PMs become product creators. The old PM spent most of their energy making ideas legible to other people. The new PM validates directly — prototyping, running data analyses, generating first-pass implementations.

  • Middle management compresses. The managers who thrive will be the ones whose real contribution was always judgment, coaching, and navigating ambiguity — not routing information.

  • QA embeds into development.

  • The system starts generating the roadmap. This is the most radical shift. The example Jack Dorsey used: a restaurant's cash flow tightens ahead of a seasonal dip. The system detects the pattern, packages a short-term loan with adjusted repayment, and pushes it to the merchant — before they even thought to look for financing. No PM decided to build that. The system recognized the moment and composed the offer from existing capabilities.

  • Release cycles give way to continuous flow. No more "v2.0 ships in Q3." Ship small improvements daily. This requires CI/CD infrastructure, but more importantly it requires letting go of the big-launch identity — trading the dopamine of a major release for the discipline of relentless, quiet value delivery.


The deeper shift

Competitive moat changes: It used to be execution speed — who could ship fastest. Now it's learning speed — how quickly the org can absorb what AI makes newly possible and restructure around it.

Most companies today are using AI the way you'd use a faster horse — to make the existing structure run a little better. The companies that pull ahead will be the ones willing to ask a harder question: what would we build if we were designing this organization from scratch, today, knowing what AI can do?

$13 Billion DeFi TVL Wipeout in Two Days Following Kelp DAO Hack (3 minute read)

Crypto
A $292 million bridge exploit of KelpDAO triggered a two-day bank run across DeFi platforms, wiping out $13.21 billion in deposits as users panicked despite limited direct exposure to the hack.
What: Attackers exploited KelpDAO's LayerZero bridge to mint unbacked rsETH tokens, then used them as collateral to borrow real assets on lending platforms like Aave. This created bad debt risk and prompted protocol freezes, sparking mass withdrawals that pulled $8.45 billion from Aave alone and $13.21 billion from DeFi overall within 48 hours.
Why it matters: The incident demonstrates how DeFi's interconnected architecture can amplify localized exploits into systemic crises, with protocols having no direct exposure to the original attack experiencing withdrawal cascades reminiscent of traditional bank runs. Liquidity constraints pushed lending pools to 100% utilization and borrow rates to 10-15%, threatening leveraged strategies across the ecosystem.
Deep dive
  • The exploit worked like depositing counterfeit cash at a bank and taking out real loans—attackers created unbacked rsETH and borrowed legitimate assets against it, leaving lenders with potential bad debt
  • Total DeFi TVL dropped from $99.5 billion to $86.3 billion, while Aave specifically fell from approximately $26.4 billion to $18 billion in deposits over the 48-hour period
  • Despite massive deposit outflows, token prices moved modestly—AAVE down 2.5% in 24 hours, UNI and LINK under 1%—suggesting the panic was about solvency risk rather than fundamental protocol value
  • Protocols responded by freezing affected markets, but this defensive measure itself triggered broader withdrawals as users rushed to exit before their funds became locked
  • Liquidity stress manifested in borrow rates spiking to 10-15% and multiple lending pools hitting 100% utilization, creating margin compression for yield farmers and leveraged traders
  • Early analysis points to vulnerabilities in the bridge verification layer rather than smart contract bugs, highlighting persistent weaknesses in cross-chain infrastructure
  • Platforms like Euler and Sentora also saw double-digit percentage TVL drops despite having no direct connection to the rsETH exploit, illustrating DeFi's contagion dynamics
  • The episode reveals that DeFi's composability—usually touted as a feature—becomes a transmission mechanism for systemic shocks when trust breaks down
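The "counterfeit cash" analogy above can be expressed as a toy ledger model (purely illustrative Python, not any real protocol's contracts): a lending pool that trusts collateral at face value converts unbacked tokens into real borrowed assets, and the loan becomes bad debt.

```python
class ToyLendingPool:
    """Illustrative only: collateral is trusted at face value,
    which is exactly the assumption the bridge exploit broke."""

    def __init__(self, real_assets):
        self.real_assets = real_assets  # e.g. WETH available to borrow
        self.collateral = {}            # user -> rsETH posted
        self.debt = {}                  # user -> assets borrowed

    def deposit_collateral(self, user, amount):
        # The pool cannot distinguish backed rsETH from unbacked rsETH.
        self.collateral[user] = self.collateral.get(user, 0) + amount

    def borrow(self, user, amount, ltv=0.8):
        assert amount <= self.collateral.get(user, 0) * ltv
        assert amount <= self.real_assets
        self.real_assets -= amount
        self.debt[user] = self.debt.get(user, 0) + amount

pool = ToyLendingPool(real_assets=1000.0)
pool.deposit_collateral("attacker", 500.0)  # minted via the bridge, backed by nothing
pool.borrow("attacker", 400.0)              # real assets leave the pool

# If the collateral is actually worthless, the 400.0 loan is bad debt.
bad_debt = pool.debt["attacker"]
```

The numbers are arbitrary; the point is structural — once the pool accepts the fake deposit, every downstream rule (loan-to-value checks, liquidity accounting) operates correctly on a false premise.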
Decoder
  • TVL (Total Value Locked): the dollar value of crypto assets deposited across DeFi protocols, used as a liquidity and activity metric
  • rsETH: a liquid restaking token from KelpDAO that represents staked Ethereum with additional yield layers
  • LayerZero bridge: cross-chain infrastructure that enables asset transfers between different blockchains
  • Utilization rate: percentage of available funds in a lending pool that are currently borrowed; 100% means no liquidity remains for withdrawals
Original article

A $292 million exploit of KelpDAO's rsETH via the LayerZero bridge triggered a 48-hour DeFi-wide panic that erased $13.21 billion in TVL, pulling the sector from $99.5 billion to $86.3 billion by April 20. Aave absorbed the sharpest blow, losing $8.45 billion in deposits as withdrawals cascaded into protocols unconnected to the original attack, while the AAVE token fell a comparatively modest 2.5% over 24 hours. Liquidity constraints pushed multiple lending pools to 100% utilization and borrow rates to 10-15%, compressing margins for leveraged strategies across the sector.

Bitmine Nears 5% of ETH Supply Target (1 minute read)

Crypto
Bitmine has accumulated over 4 million ETH, representing 4.12% of Ethereum's total supply and making it the largest corporate holder of ether.
What: Bitmine purchased another 101,627 ETH in the past week, bringing its total holdings to 4,976,485 ETH, approaching its 5% supply target and solidifying its position as the largest corporate Ethereum holder.
Why it matters: This mirrors the corporate Bitcoin treasury strategy but for Ethereum, signaling that institutions are treating ETH as a strategic reserve asset driven by staking yields and tokenization use cases, not just speculative investment.
Original article

Bitmine said it now holds 4,976,485 ETH, or 4.12% of total supply, after buying another 101,627 ETH in the past week, reinforcing its position as the largest corporate ether holder and demonstrating how treasury accumulation, staking income, and tokenization demand are strengthening the institutional case for ETH.

Recent Aave rsETH Exploit can be eliminated by a new n-VM architecture (3 minute read)

Crypto
A new n-VM blockchain architecture proposes to eliminate bridge exploits by running multiple virtual machines on a single consensus layer with unified identity and atomic cross-VM transfers.
What: n-VM is a proposed blockchain architecture that runs heterogeneous virtual machines (EVM, SVM, Bitcoin Script) as co-equal citizens on one consensus layer with a shared state tree, enabling atomic token transfers across VMs without bridges through a unified identity system and single token ledger.
Why it matters: Bridge exploits have caused $2.8 billion in losses historically, with the recent April 2026 Aave rsETH attack draining $293 million through a LayerZero vulnerability; this architecture eliminates the entire attack class by replacing lock-mint-burn cycles with atomic ledger updates, preventing unbacked collateral attacks.
Takeaway: Review the research paper (arXiv 2603.23670) to understand the opcode-based routing mechanism and unified identity layer design, and consider implications for existing Ethereum tooling and L2 integration.
Deep dive
  • The n-VM architecture runs multiple virtual machines (EVM, SVM, Bitcoin Script) as equal first-class citizens on a single consensus layer, unlike existing multi-VM projects that treat one VM as primary
  • A unified identity layer uses a single 32-byte commitment that deterministically derives native addresses for each VM through domain-separated hashing, allowing one root identity while preserving VM-native address formats
  • All tokens exist in a unified ledger where ERC-20 and SPL are simply different views over the same storage slots, keyed by the universal identity commitment
  • Cross-VM transfers are atomic state transitions that directly update balances without lock-mint-burn-release cycles, bridge contracts, or multi-sig committees
  • Opcode-based routing uses the first byte of every transaction to deterministically route to the correct VM, with automatic rollback on failures to maintain isolation
  • The architecture eliminates the unbacked-collateral attack vector that enabled the April 2026 rsETH exploit, where attackers minted 116,500 fake tokens through a compromised LayerZero bridge and deposited them as Aave collateral
  • Parallel execution is achieved through write-set conflict detection and optional context-based sharding (64 shards by default)
  • Legacy wallet compatibility (MetaMask, Phantom) is maintained through raw chain ingress where the chain recovers signatures and binds them to the unified identity
  • The rsETH attack caused Aave to freeze markets, resulted in $6+ billion TVL drop, and left substantial bad debt after attackers borrowed real WETH against fake collateral
  • Community responses propose adding execution commitment primitives for independent verification and standardizing risk signal propagation (ERC-1705) to address the 77-minute detection-to-reaction gap observed during the exploit
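Two of the mechanisms above — domain-separated address derivation from a single 32-byte commitment, and first-byte opcode routing — can be sketched in a few lines. This is a speculative illustration based on this summary, not the paper's actual code; the hash construction and the opcode values are assumptions.

```python
import hashlib

# Assumed opcode-to-VM mapping for illustration only.
VM_OPCODES = {0x00: "EVM", 0x01: "SVM", 0x02: "BitcoinScript"}

def derive_address(id_com: bytes, vm: str) -> bytes:
    """Domain-separated hashing: one root identity commitment
    deterministically yields a distinct native address per VM."""
    assert len(id_com) == 32
    return hashlib.sha256(vm.encode() + b"|" + id_com).digest()

def route_transaction(tx: bytes) -> str:
    """The first byte of every transaction deterministically
    selects the target VM."""
    vm = VM_OPCODES.get(tx[0])
    if vm is None:
        raise ValueError("unknown VM opcode")
    return vm

id_com = hashlib.sha256(b"example identity").digest()
evm_addr = derive_address(id_com, "EVM")
svm_addr = derive_address(id_com, "SVM")
assert evm_addr != svm_addr  # one identity, distinct per-VM addresses
```

Because both addresses are derived from the same `id_com`, a single token ledger keyed by the commitment can expose ERC-20 and SPL "views" without any lock-mint-burn bridge in between.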
Decoder
  • n-VM: Architecture that runs N heterogeneous virtual machines on one consensus layer
  • EVM: Ethereum Virtual Machine, the execution environment for Ethereum smart contracts
  • SVM: Solana Virtual Machine, the runtime for Solana programs
  • TVL: Total Value Locked, the amount of assets deposited in a DeFi protocol
  • rsETH: Kelp DAO's liquid restaking token
  • LayerZero: Cross-chain messaging protocol used for bridging assets between blockchains
  • DVN: Decentralized Verifier Network, LayerZero's security validation layer
  • id_com: Identity commitment, the 32-byte root identity in n-VM that derives VM-specific addresses
  • ERC-20/SPL: Token standards for Ethereum and Solana respectively
  • OCP: Open Computation Primitive, a protocol for making execution independently referenceable
Original article

The n-VM architecture integrates heterogeneous virtual machines like EVM and SVM into a single consensus and shared state tree. By enabling atomic cross-VM transfers through a unified ledger, this design eliminates bridge-dependent vulnerabilities, potentially preventing the $2.8 billion in losses historically caused by cross-chain bridge exploits.

Permissionless Perpetuals Market Creation (4 minute read)

Crypto
Perps.fun launched a platform that lets anyone propose and crowdfund new perpetual futures markets, removing centralized gatekeeping from derivatives trading.
What: Perps.fun is a new alpha platform with a propose-crowdfund-launch model that allows community members to propose perpetual futures markets for any asset, crowdfund the ticker listing costs, and deploy them permissionlessly without approval from centralized exchanges.
Why it matters: Traditional perpetual futures listings are limited to popular assets because centralized exchanges control which markets exist and charge high listing fees; this model democratizes access to derivatives trading for long-tail assets.
Takeaway: Developers or traders interested in perpetual futures for niche assets can visit app.perps.fun to propose or fund new markets.
Decoder
  • Perpetual futures (perps): Crypto derivative contracts that allow traders to speculate on asset prices without an expiration date, unlike traditional futures
  • Long-tail assets: Less popular or niche cryptocurrencies and tokens that typically lack trading infrastructure on major exchanges
  • Permissionless: Anyone can participate without needing approval from a central authority
Original article

Perps.fun launched in alpha with a propose-crowdfund-launch model that allows anyone to propose a new perpetual futures market, crowdfund the ticker listing costs, and deploy it permissionlessly on app.perps.fun. The mechanism removes the cost barrier that has historically restricted long-tail asset coverage in perp markets, enabling community-driven market creation without gatekeeping from centralized venues.

How Freeport Users Made 11.7% on $27M in 45 Days (6 minute read)

Crypto
A new AI-powered trading platform reports 11.7% user returns in 45 days, but transparently attributes most gains to market beta and momentum rather than platform-generated alpha.
What: Freeport is an AI news feed platform with one-tap trading execution that generated $27M in volume across its first 45 days. Users primarily traded major macro events: oil futures during Middle East escalation (WTI crude from $60s to $100+) and NASDAQ longs during subsequent de-escalation, achieving 11.7% aggregate money-weighted returns.
Why it matters: This represents unusual transparency for a fintech product launch. Rather than claiming credit for returns, Freeport breaks down attribution: 4-5% from market beta (being long during a rally), 3-4% from momentum (following established trends), 2-3% from oil concentration, leaving only 2-3% residual that isn't statistically significant at 46 days. The counternarrative to typical fintech hype makes the underlying mechanism more interesting—their data shows top performers (18.2% returns) traded less frequently and used lower leverage than average users.
Takeaway: If building trading or investment tools, consider optimizing for user outcomes over engagement—Freeport's best traders averaged 2.1 trades per day versus 5.8 for median users, held positions 31 hours versus 19, and used 2.4x leverage versus 3.3x.
Deep dive
  • Platform launched February 28, 2026 during U.S.-Israel strikes on Iran that killed Supreme Leader Khamenei, creating immediate opportunity to test event-driven trading thesis
  • Users traded primarily real assets (NASDAQ/S&P indices, WTI crude, crypto) at 2-4x leverage, far below 10-200x platform maximums, suggesting measured risk management rather than speculation
  • Most oil profits came not from initial headline trades but from users entering hours/days later as escalation deepened, potentially exploiting serial correlation in geopolitical events similar to post-earnings announcement drift
  • When diplomatic channels reopened, users went long equities rather than shorting oil, possibly due to feed architecture biased against shorts in favor of anti-correlated longs, aligning with research showing negative media content overstates actual deterioration
  • Return decomposition via factor regression: approximately 4-5% from market beta (users were long during rally), 3-4% from momentum (following recent price moves), 2-3% from concentrated WTI exposure, 2-3% unexplained residual
  • The 2-3% residual has t-statistic of ~0.3 (not statistically significant) but would be economically meaningful if sustained, as even 1% over 45 days would annualize above most hedge fund performance
  • Top 1% of users (25 traders) achieved 18.2% returns through counter-intuitive behavior: traded less (2.1 vs 5.8 daily trades), used lower leverage (2.4x vs 3.3x), held longer (31 vs 19 hours median)
  • Platform cites academic research to support mechanisms: Hong and Stein (1999) on event-driven momentum from slow information diffusion, Tetlock (2007) on media negativity bias, Jame et al. (2022) on curated analysis improving retail order flow predictiveness
  • Product philosophy explicitly rejects engagement optimization in favor of outcome optimization, implementing fewer notifications and surfacing context that supports staying out of trades as much as entering them
  • Volume breakdown shows institutional-style positioning: 33% in equity index perpetuals, 15% WTI crude, with remaining volume across single stocks, crypto, and pre-IPO tokens
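The return decomposition in the deep dive is an ordinary least squares factor regression. A minimal sketch on simulated data (the factor series, loadings, and noise levels below are made up for illustration; this is not Freeport's dataset or exact methodology):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 45  # daily observations, mirroring the 45-day window

# Hypothetical factor return series: broad market and momentum.
market = rng.normal(0.002, 0.01, n)
momentum = rng.normal(0.001, 0.008, n)

# Simulated user returns: loadings on each factor plus noise, no true alpha.
user = 0.9 * market + 0.5 * momentum + rng.normal(0, 0.01, n)

# OLS: user_t = alpha + b1*market_t + b2*momentum_t + eps_t
X = np.column_stack([np.ones(n), market, momentum])
coef, *_ = np.linalg.lstsq(X, user, rcond=None)
resid = user - X @ coef

# t-statistic of alpha (the intercept): estimate / standard error.
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])
print(f"alpha={coef[0]:.5f}  t-stat={coef[0] / se:.2f}")
```

With no true alpha in the simulation, the intercept's t-statistic typically lands well below 2.0, which is the same reason Freeport's ~0.3 residual t-stat cannot yet be distinguished from luck.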
Decoder
  • Money-weighted returns: Returns calculated by weighting each position by its dollar size, giving more influence to larger trades (versus time-weighted returns that treat all periods equally)
  • Market beta: The portion of returns explained by broad market movement—if the market rises 5% and you're long, you capture that regardless of skill
  • Momentum: Persistent tendency for assets that have risen recently to continue rising, one of the most robust empirical patterns in finance across decades of data
  • Factor regression: Statistical technique decomposing returns into systematic components (beta, momentum, etc.) versus unexplained residual that might represent skill or luck
  • T-statistic: Measure of statistical significance; values below ~2.0 suggest results could easily occur by chance, and Freeport's 0.3 indicates their residual returns are not statistically meaningful yet
  • Post-earnings announcement drift (PEAD): Phenomenon where stock prices continue drifting in the direction of an earnings surprise for weeks afterward due to underreaction
  • WTI crude: West Texas Intermediate, the U.S. benchmark for oil pricing
  • Perpetual futures: Crypto-style derivative contracts with no expiration date, maintained through funding rate mechanisms
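To make the money-weighted idea concrete, here is a toy comparison with invented numbers. Note that a true money-weighted return is the internal rate of return over dated cash flows; the size-weighted average below is a simplification that shows the intuition.

```python
# Hypothetical positions, not Freeport's data.
positions = [
    {"size": 90_000, "ret": 0.15},   # large winning trade
    {"size": 10_000, "ret": -0.05},  # small losing trade
]

total = sum(p["size"] for p in positions)

# Money-weighted (simplified): larger positions dominate the aggregate.
money_weighted = sum(p["size"] / total * p["ret"] for p in positions)

# Equal-weighted: every trade counts the same regardless of size.
equal_weighted = sum(p["ret"] for p in positions) / len(positions)

print(f"money-weighted {money_weighted:.2%}, equal-weighted {equal_weighted:.2%}")
# The big winner dominates the money-weighted figure.
```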
Original article

Freeport, an AI news feed platform with one-tap trading execution, reported $27M in volume and 11.7% aggregate money-weighted returns across its first 45 days, with users trading NASDAQ, S&P, crude oil, and crypto at 2-4x average leverage. The platform's two dominant macro trades were a WTI crude long from the low $60s to above $100 on Middle East tensions, followed by a NASDAQ 100 long that captured about 15% off March lows as diplomatic channels reopened. Return attribution assigns 4-5% to market beta, 3-4% to momentum, and 2-3% to oil concentration, with 2-3% residual alpha that lacks statistical significance at 46 days. Top users (1% of the base, 18.2% returns) averaged 2.1 trades per day versus a 5.8 median, held positions 31 hours versus 19, and used 2.4x leverage versus 3.3x.

Coinbase Ventures Maps Four Frontier Themes for 2026 (4 minute read)

Crypto
Coinbase Ventures identifies four investment priorities for 2026 despite a 15% drop in crypto funding, signaling where major capital will flow in a down market.
What: Coinbase Ventures outlined four key investment themes for 2026: real-world asset tokenization, specialized institutional exchanges, next-generation DeFi protocols, and AI agents as blockchain economic actors. This comes as Q1 2026 crypto venture funding fell to under $5 billion, down 15% year-over-year.
Why it matters: The themes reveal where major crypto investors see opportunity despite market compression, with BlackRock projecting tokenization reaching $20 trillion by 2030 and institutional trading growing from $5 billion to $18 billion. Kraken's recent $200 million raise at a $13.3 billion valuation—down from $20 billion—shows even top-tier companies face compressed valuations, making this a "builders market" where conviction investors believe the best companies emerge.
Takeaway: Developers building in tokenization infrastructure, institutional-grade trading tools, privacy-focused DeFi protocols, or AI agent payment systems are positioned in areas where major VCs plan to deploy capital.
Deep dive
  • Coinbase Ventures principal Jonathan King argues tough markets create the best opportunities, noting investors who "show up when it's not obvious" win later, positioning Q2 2026 as a builders market despite gloomy headlines
  • The tokenization theme focuses on bringing real-world assets like stocks and commodities onchain, with perpetual exchanges like Hyperliquid seeing billions in volume and BlackRock forecasting a 754x market expansion to $20 trillion by 2030
  • Specialized institutional infrastructure is thriving with purpose-built exchanges, proprietary automated market makers, and vertical trading apps, as Bernstein predicts institutional crypto trading will triple from $5 billion to $18 billion by 2030 with US market share jumping from 7% to 20%
  • Next-generation DeFi protocols prioritize composability, capital efficiency, and privacy, with institutional investors shifting from pure token appreciation to yield strategies—Nomura's survey found over two-thirds want DeFi staking exposure and 65% target lending and tokenized assets
  • Privacy is emerging as a critical DeFi feature, with the Ethereum Foundation deploying a 47-person "Privacy Cluster" team and Vitalik Buterin endorsing tools like Railgun, addressing the gap that "privacy is in every financial system except for DeFi"
  • AI agents are positioned as "economic actors" that function as new blockchain users, with Coinbase's x402 protocol partnering with Amazon, Google, and Stripe to embed payments directly into web interactions for seamless value transfer between agents and APIs
  • McKinsey projects the AI agent market reaching $5 trillion by 2030, a view echoed by executives including Coinbase CEO Brian Armstrong, former Binance CEO CZ, and Circle CEO Jeremy Allaire
  • Recent major raises reflect these themes: Kraken parent raised $200 million at $13.3 billion valuation (down from $20 billion), Spektr raised $20 million for AI-powered compliance automation, and Paxos Labs raised $12 million to expand its crypto services toolkit Amplify
  • The overall crypto market remains down 40% from its October all-time high, with industry layoffs blamed on AI and several DeFi projects shutting down, creating a challenging environment that King views as separating serious builders from opportunists
  • Coinbase Ventures was a top crypto investor in Q1 2026 alongside firms like Andreessen Horowitz, Sequoia Capital, Founders Fund, Bain, and Alibaba Group, showing institutional conviction despite market weakness
Decoder
  • RWA tokenization: Converting real-world assets like stocks, bonds, commodities into blockchain tokens that can be traded 24/7 onchain
  • Perpification: Expanding perpetual futures contracts (derivatives with no expiration date) beyond crypto to traditional assets
  • DeFi: Decentralized finance, blockchain-based financial services operating without traditional intermediaries like banks
  • x402 protocol: Coinbase's proposed universal standard for embedding payments into web interactions, allowing APIs and AI agents to transfer value like they exchange data
  • Composability: The ability for DeFi protocols to integrate and interact with each other like building blocks
  • AMM: Automated market maker, algorithms that automatically provide liquidity and facilitate trades without traditional order books
Original article
  • Coinbase's venture arm says Q2 is a 'builders market' despite quieter trading.
  • Tokenisation, exchanges, next-gen DeFi and AI agents top the agenda.

At a glance, things look grim for crypto startups.

Investors poured just under $5 billion into innovating industry players in the first three months of 2026, a 15% drop from the capital injected in the first quarter of 2025, according to data from DefiLlama.

Add to that that the overall crypto market is still down some 40% from its October all-time high value, the industry is sacking hundreds of employees and blaming it on artificial intelligence, and decentralised finance projects are shutting down, and it's clear things look gloomy for entrepreneurs.

However, the industry will come out stronger for it, Jonathan King, principal investor at Coinbase Ventures, told DL News.

"When things look quiet or the market is more tough, that's when the best companies often get started," King said. "And the investors who show up then win big later. That's what we're seeing right now, and why we were a top crypto investor in Q1."

He's not alone. It's a sentiment shared by investors like Andreessen Horowitz, Sequoia Capital, Founders Fund, Bain, and Alibaba Group.

"Conviction shows up when it's not obvious," King told DL News. "Anyone can invest in a hot market, but the real signal is who leans in before it's consensus."

He said investors will focus on "four buckets": tokenisation, specialised exchanges, next-generation DeFi, and AI.

'Perpification of everything'

Tokenisation, or the "perpification of everything," as King calls it, is a massive opportunity.

"Markets are expanding beyond crypto into real-world assets — stocks, commodities, macro exposure — that are all moving onchain," King said.

Indeed, financial markets are rapidly expanding beyond native crypto assets into equities, commodities and macro exposure that can trade continuously onchain, with volume on perpetual exchanges like Hyperliquid exploding into the billions.

Firms like BlackRock, Robinhood, and Grayscale have all waxed lyrical about assets going onchain, and for good reason.

The tokenisation market is expected to jump by 754 times to become a $20 trillion market by 2030, according to BlackRock.

Specialised exchanges

Institutional market infrastructure — things like specialised exchanges and other trading technologies — are thriving, King said.

"We're seeing a shift toward more purpose-built, pro-grade market structures, prop [automated market makers], verticalised trading apps, prediction markets," he said.

Bernstein forecasted that the institutional crypto trading market's value will more than triple, from $5 billion in 2024 to $18 billion by 2030, with the US market share surging from 7% to 20%.

Next generation DeFi

Advanced DeFi protocols that are "more composable, more capital efficient, more private" are next on King's list.

"The next wave is improved protocols built to integrate and scale," he said.

The sentiment is shared by banks like the Japanese financial giant Nomura. The bank's 2026 Digital Asset Institutional Investor Survey shared with DL News on Thursday found that institutional investors are pursuing cryptocurrency yield strategies rather than just token price appreciation.

Its research found that over two-thirds of respondents want exposure to DeFi mechanics like staking, while 65% are targeting lending and tokenised assets, and 63% are exploring derivatives and stablecoins.

"This reflects growing demand for income-generating and asset-utilisation strategies," Nomura said.

King also stressed that "privacy is a big unlock here."

The Ethereum Foundation shares the same view. In October, it rolled out a new expanded effort to embed privacy into the blockchain, led by a new "Privacy Cluster" team of 47 engineers, researchers, and cryptographers.

Ethereum co-founder Vitalik Buterin has publicly endorsed privacy tech like Railgun and other similar efforts, arguing that privacy should be a default option for blockchain users.

"Privacy is in every financial system except for DeFi," Railgun contributor Bill Liang told DL News in October.

Crypto and AI

AI agents are "one of the most underappreciated areas" in crypto, according to King.

"AI agents are becoming economic actors," he said. "Every agent is effectively a new 'user' of the blockchain."

The view is shared by Coinbase CEO Brian Armstrong, former Binance CEO Changpeng Zhao and Circle CEO Jeremy Allaire.

Coinbase has teamed up with tech titans like Amazon, Google, and Stripe for its x402 protocol.

x402 is designed as a universal standard for embedding payments directly into web interactions. It allows AI agents, APIs and applications to transfer monetary value as seamlessly as they exchange data over the internet.
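The general pattern x402 builds on is the HTTP 402 "payment required" handshake. The toy handler below illustrates that flow; the function, status payloads, and asset details are invented for this sketch and are not the actual x402 wire format or SDK.

```python
def handle_request(paid: bool) -> tuple[int, dict]:
    """Hypothetical server endpoint guarding a paid resource."""
    if not paid:
        # Challenge the caller with machine-readable payment terms.
        return 402, {"accepts": [{"asset": "USDC", "amount": "0.01"}]}
    # Payment verified out of band; serve the resource.
    return 200, {"data": "premium response"}

# An AI agent hits the endpoint, sees the challenge, pays, and retries.
status, body = handle_request(paid=False)
assert status == 402 and "accepts" in body

status, body = handle_request(paid=True)
assert status == 200
```

The point of the design is that the challenge is structured data an agent can act on automatically, rather than a paywall page meant for a human.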

By 2030, the market will reach as high as $5 trillion, according to McKinsey.

Some of the themes laid out by King echo through the top three capital raises this week, according to DefiLlama data.

Payward, $200 million

Payward, the parent company behind the cryptocurrency exchange Kraken, raised $200 million through secondary share sales to German exchange operator Deutsche Börse Group.

The move values Kraken at about $13.3 billion, down from $20 billion in late 2025.

The company made five acquisitions in 2025. Kraken co-CEO Arjun Sethi told DL News in September that the crypto exchange had more deals lined up.

Spektr, $20 million

Copenhagen-based Spektr has raised $20 million in a Series A round. NEA led the raise with backing from Northzone, Seedcamp, and PSV Tech.

Spektr uses AI agents to handle compliance tasks like know-your-customer and company risk checks, replacing manual processes that still dominate the financial industry.

Paxos Labs, $12 million

Blockchain infrastructure firm Paxos Labs has raised $12 million in a strategic funding round at an undisclosed valuation, Fortune reported.

Blockchain Capital led the round. Robot Ventures, Arthur Hayes' family office Maelstrom, and Uniswap Labs Ventures also supported the raise.

The funding will be used to expand Amplify, Paxos's software toolkit that allows businesses to add crypto services such as lending, yield products, and stablecoin issuance.

Block's MoneyBot and ManagerBot Target Agentic Finance (2 minute read)

Crypto
Block is building AI agents that aim to automate financial tasks for consumers and businesses, using their open-source Goose framework.
What: MoneyBot and ManagerBot are Block's agentic AI products built on Goose, an open-source agent framework, designed to handle financial workflows for consumers and small businesses.
Why it matters: The comparison to "Claude Code for money" suggests these agents could autonomously handle complex financial tasks the way Claude Code assists with development work.
Takeaway: Explore Block's open-source Goose framework to learn how they're building AI agents for finance.
Decoder
  • Agentic AI: AI systems that can autonomously perform multi-step tasks and make decisions, rather than just responding to prompts
  • Goose: Block's open-source framework for building AI agents
  • Claude Code: Anthropic's AI coding assistant that can autonomously execute complex development tasks
Original article

Earlier this week, I listened to an episode of the On the Block podcast, where the host, Matt Ross, Head of Investor Relations, sat down with Brad Axen, Principal Engineer for Data and Machine Learning at Block $XYZ. Brad built Goose, Block's open-source agent framework. The conversation was about how Goose got built.

This episode made me realize that Block's MoneyBot and ManagerBot aim to be the Claude Code of money.

If you've used Claude Code, you already know what's coming for consumers and small business owners. If not, follow the link in the reply to read my latest @PopularFintech essay 👇🏻

Block is quietly building Claude Code for money $XYZ
popularfintech.com/p/block-is-qui…

The Web2.5 Kill Chain (Part 1): The Oracle's Whisper (6 minute read)

Crypto
A security researcher demonstrates how blockchain oracles that bridge Web2 and Web3 systems can be exploited through insecure deserialization, turning the "unhackable" blockchain itself into an attack delivery mechanism.
What: The article presents a fictional but realistic attack scenario where a researcher exploits a blockchain oracle by embedding a malicious Python pickle payload in blockchain transaction data that executes arbitrary code when the oracle server deserializes it, gaining root access despite extensive smart contract audits.
Why it matters: This highlights a critical blindspot in blockchain security where teams spend heavily auditing smart contracts while neglecting the traditional Web2 infrastructure that feeds data to them, creating an attack surface where the blockchain's immutability and trusted status actually works against defenders.
Takeaway: If you're building or auditing blockchain infrastructure with oracles, never deserialize untrusted data from the blockchain using unsafe methods like Python's pickle library, and treat all on-chain data as potentially malicious input regardless of valid cryptographic signatures.
Deep dive
  • The article describes a theoretical attack on "Web2.5" infrastructure where blockchain systems meet traditional servers, demonstrating real vulnerability classes
  • The fictional target spent $500,000 auditing their smart contract for common vulnerabilities but completely neglected the oracle that reads blockchain data
  • Oracles are necessary because smart contracts cannot natively access real-world data and need Web2 servers to fetch and push information on-chain
  • The attacker embedded a weaponized pickle-serialized Python object in the calldata of a standard blockchain transaction instead of the expected diagnostic data
  • Python's pickle library can execute arbitrary code during deserialization if a payload is crafted with the __reduce__ method, making it dangerous for untrusted input
  • The blockchain accepted the transaction because the cryptography and gas fees were mathematically valid; blockchains have no malware-scanning capability
  • When the oracle's scheduled cron job read the transaction data and called pickle.loads(), it executed the embedded reverse shell command
  • Because the payload originated from the blockchain itself, a source explicitly programmed as trusted, firewalls and security controls never flagged it
  • The attacker gained root access to the AWS server running the oracle, completely bypassing enterprise-grade Web2 defenses by using Web3 infrastructure as the delivery vector
  • The article emphasizes that extensive smart contract audits are worthless if the traditional infrastructure bridging to the blockchain treats on-chain data as inherently safe
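The core bug is easy to reproduce. Below is a defanged, runnable sketch of the deserialization flaw described above: the payload's __reduce__ evaluates a harmless string where a real attacker would invoke os.system with a reverse shell, and the JSON alternative shows the safer pattern. Variable names are mine, not the article's.

```python
import json
import pickle

class Payload:
    """Malicious object: __reduce__ tells pickle which callable to
    invoke during deserialization, with which arguments."""
    def __reduce__(self):
        # Harmless stand-in; a real payload would run a shell command.
        return (eval, ("'code ran during deserialization'",))

# Attacker side: serialize the payload (this is what lands in calldata).
malicious_bytes = pickle.dumps(Payload())

# Vulnerable oracle side: loading untrusted bytes executes the payload
# before any application-level validation can run.
result = pickle.loads(malicious_bytes)
print(result)

# Safer pattern: treat on-chain bytes strictly as data, never as objects.
safe = json.loads('{"metric": 42}')
print(safe["metric"])
```

json.loads can only ever produce plain data structures (dicts, lists, strings, numbers), which is why it is the standard substitute for pickle when the input is untrusted.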
Decoder
  • Oracle: A server that bridges blockchain smart contracts with real-world data by fetching external information and pushing it on-chain, since blockchains cannot natively access outside data
  • Web2.5: Infrastructure combining traditional Web2 servers with Web3 blockchain technology, creating a hybrid architecture with unique attack surfaces
  • Serialization/Deserialization: Converting data structures into byte streams for transmission and reconstructing them on the receiving end
  • pickle: Python's serialization library known for critical security risks because it can execute arbitrary code when deserializing untrusted data
  • Smart Contract: Self-executing code on a blockchain that is blind to external data without oracles feeding information to it
  • calldata: Arbitrary data field attached to Ethereum transactions that can contain function parameters or additional information
  • Foundry/forge: Ethereum development framework for testing and simulating smart contract interactions in a local environment
Original article

Security researcher demonstrates a critical vulnerability in Web2.5 infrastructure by exploiting insecure deserialization in a blockchain-based oracle.

Visa and Mastercard Sell Access to the Payment Rails (1 minute read)

Crypto
Visa and Mastercard profit by operating the payment networks and charging usage fees, not by lending money like many assume.
What: An explanation of how Visa and Mastercard's business model works—they earn money by operating card networks and collecting assessment and usage fees at scale, while card issuers provide the credit and absorb fraud risk.
Why it matters: Clarifies a common misconception about payment networks versus issuers, helping developers understand the economic structure behind payment processing.
Takeaway: If you're building payment features, understand that network fees (Visa/Mastercard) are separate from interchange fees (which go to card issuers).
Decoder
  • Interchange: The fee paid by the merchant's bank (acquirer) to the cardholder's bank (issuer), typically the largest portion of payment processing fees
  • Acquirer: The bank or payment processor that handles card payments on behalf of merchants
  • Issuer: The bank that issues credit or debit cards to consumers and fronts the money for transactions
  • Assessment fees: The fees Visa and Mastercard charge for using their payment networks to route transactions
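To make the fee split concrete, here is a toy breakdown of a single purchase. The rates are illustrative round numbers chosen for the example, not published network pricing.

```python
amount = 100.00

# Who gets what out of the merchant's processing cost:
interchange = 0.0180 * amount      # -> issuer (extends credit, absorbs fraud)
assessment = 0.0013 * amount       # -> network (Visa/Mastercard)
acquirer_markup = 0.0050 * amount  # -> merchant's acquirer/processor

merchant_receives = amount - interchange - assessment - acquirer_markup
print(f"merchant nets ${merchant_receives:.2f} of ${amount:.2f}")
```

Even in this rough sketch, interchange dwarfs the network's assessment fee, which is the article's point: the issuer captures most of the merchant fee while Visa/Mastercard earn a thin toll at enormous scale.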
Original article

Visa and Mastercard do not earn money by lending, but by operating the card networks that route payments and collect assessment and usage fees at scale, while issuers supply the credit, absorb fraud risk, and capture most of the merchant fee through interchange.

North Korea's Lazarus is likely behind the $292M Kelp DAO Hack (2 minute read)

Crypto
LayerZero attributes a $292 million cryptocurrency exploit of Kelp DAO to North Korea's state-sponsored Lazarus hacking group.
What: LayerZero has identified the North Korean Lazarus group as the likely perpetrator behind a $292 million exploit targeting Kelp DAO, marking one of the largest cryptocurrency heists in recent history.
Why it matters: This represents a continuation of state-sponsored attacks on DeFi infrastructure, demonstrating that even established protocols remain vulnerable to sophisticated attackers with nation-state resources.
Takeaway: If you're building or working with DeFi protocols, review your security practices and consider additional auditing given the escalating sophistication of attacks targeting crypto infrastructure.
Decoder
  • Lazarus Group: North Korean state-sponsored hacking organization known for major cryptocurrency thefts and cyberattacks, previously linked to the WannaCry ransomware and Sony Pictures hack
  • Kelp DAO: Decentralized autonomous organization involved in liquid restaking protocols for Ethereum
  • LayerZero: Cross-chain interoperability protocol that enables communication between different blockchains
  • DeFi: Decentralized Finance, blockchain-based financial services that operate without traditional intermediaries
Original article

LayerZero identified North Korean hacker group Lazarus as the likely perpetrator of the $292 million Kelp DAO exploit.

Digest devoured!