Loading digest...
Jul 3
1 / ?
AI devopsinfrastructure

Agent-Assisted SGLang Development

The SGLang team is standardizing agentic workflows by turning complex performance engineering steps into executable 'skills' for repeatable deployment.

Summary

What: The SGLang team, including lead developer BBuf, has moved beyond simple prompting, using structured 'SKILL.md' files to automate CUDA kernel debugging, benchmarking, and profiling for its high-performance inference framework.
Why it matters: System-level AI engineering is shifting toward codifying developer expertise into deterministic protocols that agents can execute, transforming opaque human intuition into verifiable, machine-readable performance gates.
Takeaway: If you manage complex infrastructure, create a `.claude/skills` directory in your repo to store structured, executable debugging and benchmarking protocols for your team's common tasks.

Deep Dive

  • SGLang now uses agent-driven 'skills' for repo-level development workflows.
  • Skills are stored in .claude/skills and include preflight checks, failure gates, and result schemas.
  • Performance work is now driven by 'Loop Engineering' to compare models against the best available inference frameworks.
  • Key skills include CUDA crash debugging, LLM serving benchmarks, and torch profiler trace triage.
  • Humanize/RLCR and 'Codex Goal' allow agents to run experiments while maintaining state and rigorous review.
  • The Kernel Design Agents (KDA) project uses these techniques to optimize GPU kernels for specific model-hardware combinations.
  • Optimizations are only merged if supported by rigorous proof, accuracy checks, and NCU (Nvidia Compute Unit) metrics.

Decoder

  • CUDA: Parallel computing platform and programming model created by Nvidia.
  • Profiler: A tool that analyzes a program's execution to identify performance bottlenecks, such as memory usage or slow CPU/GPU instructions.
  • TTFT: Time To First Token; a critical metric for LLM responsiveness measuring the latency between a request and the first output character.

Original Article

Full article content is not available for inline reading.

Read the original article →

AI securityagents

Introducing Devin Security Swarm

Cognition's new Devin Security Swarm uses 'Agentic MapReduce' to shard repositories and verify vulnerabilities in isolated sandboxes.

Summary

What: Devin Security Swarm is a security scanning tool that shards large codebases to parallelize vulnerability detection and confirmation. It correctly identified 36 of 50 GHSA vulnerabilities in testing, operating at 30% lower cost than competing tools.
Why it matters: This indicates a shift toward 'agentic' architectures that treat security analysis as a scalable distributed computing problem rather than a monolithic code search task.

Deep Dive

  • Uses Agentic MapReduce to split large repositories into shards.
  • Employs specialized agents to analyze shards in parallel.
  • Aggregates findings into a single report.
  • Performs runtime validation of potential vulnerabilities in sandboxed environments.
  • Evaluated against 50 real-world GHSA vulnerabilities across 14 programming languages.
  • Aims to solve the difficulty of maintaining context in large-scale codebase reasoning.

Decoder

  • Agentic MapReduce: A pattern that distributes complex reasoning tasks across multiple autonomous agents working on isolated segments (shards) of a codebase, then aggregates the results.
  • GHSA: GitHub Security Advisory, a database of publicly disclosed security vulnerabilities in open-source projects.

Original Article

Introducing Devin Security Swarm

A more cost effective and accurate way to find security vulnerabilities in complex codebases, based on a new architecture: Agentic MapReduce.

In testing, Devin Security Swarm found 36 of 50 real-world GHSA vulnerabilities at 30% lower cost per finding than the next most accurate alternative.

We built a new architecture for whole-codebase reasoning that we’re calling Agentic MapReduce.

Security scanning is different from most coding tasks: a report is only trustworthy if the whole codebase is considered. But most agentic systems struggle to scale reasoning across large repos.

Devin maps relevant signals across the repo, fans out focused agents over bounded shards, reduces their findings into one report, then verifies serious vulnerabilities in isolated sandboxes before marking them confirmed.

The result is simultaneously more efficient and more accurate than other tools. We evaluated a variety of security scanning tools on a dataset of 50 GHSA vulnerabilities across 14 languages including Go, Rust, Python, Ruby, Java, C#, JavaScript, C, Swift, Dart, and Elixir. The dataset spans opens source repos of various sizes and of many software categories.

Beyond excelling on our eval, Devin Security Swarm also found critical vulnerabilities that other tools missed, like a PHP sandbox bypass via template injection, an argument injection through metadata value parsing, and an overly broad deserialization surface.

Security Swarm is a new pillar of Devin for Security: a suite of tools to help you find vulnerabilities, validate their exploitability at runtime, and ship remediation PRs.

Learn more and try it today at: devin.ai/security

We’re also publishing extensive documentation and technical materials about Agentic MapReduce, including a deep-dive on our evals.

Read our announcement: cognition.com/blog/introducing-devin-security-swarm

Learn about Agentic MapReduce: devin.ai/blog/agentic-map-reduce

Check out the evals: devin.ai/blog/security-swarm-eval

Tech cloudaienterprise

Meta's Inevitable Cloud

Facing investor pressure to monetize its massive AI infrastructure, Meta is planning a cloud service to sell compute power and model access.

Summary

What: Meta is forming 'Meta Compute,' an initiative led by Santosh Janardhan, Daniel Gross, and Dina Powell McCormick, to sell excess GPU capacity and access to models like Muse Spark to enterprise customers.
Why it matters: Meta is shifting from being a purely ad-driven company to an infrastructure provider to justify its multi-billion dollar CapEx spend, following the 'neocloud' strategy demonstrated by xAI and SpaceX.

Deep Dive

  • Meta faces a critical need to diversify revenue beyond advertising, which currently accounts for nearly all of its income.
  • The company is building out massive AI infrastructure that is expensive to maintain without direct monetization.
  • Internal initiatives like 'Meta Compute' are consolidating infrastructure management to prepare for cloud sales.
  • Meta is considering two models: selling API access to internal models similar to AWS Bedrock, and offering raw compute capacity similar to CoreWeave.
  • The strategy mimics successful moves by Google's cloud transition and Elon Musk's xAI/SpaceX 'neocloud' approach.
  • The announcement prompted a 10% stock price increase, reflecting investor preference for infrastructure-as-a-service (IaaS) business models.

Decoder

  • Neocloud: A specialized cloud provider that focuses on selling raw, high-performance compute capacity (typically H100s or newer) for AI training and inference, rather than general-purpose cloud services like managed databases or storage.
  • CapEx: Capital expenditure; funds used by a company to acquire, upgrade, and maintain physical assets such as data centers and hardware.

Original Article

Meta has a problem. Well, two of them, actually.

First, after years and years spent trying to diversify, their business is still almost entirely advertising-based. To be clear, it's a great business – one of the best ever created, in fact. It's a good problem to have, but it's still a problem. Because if that business ever slows... Look out below.

That leads to the second problem. The latest way Meta thinks they can fix the first problem is with AI. Sure, they'll use AI to super-charge the ads business, but ideally it will also unlock other businesses for them. Again, to diversify. Currently, they view it as the key to their devices strategy, led by their smartglasses. But they're also working on other products and ways to potentially monetize AI beyond just selling ads. But the problem here is that it's expensive to build out those AI capabilities. Like, insanely expensive. Like the most expensive endeavor in human history, perhaps.

Wall Street doesn't like this. Specifically, Wall Street doesn't like this for Meta. Why? That first problem. Because unlike their Big Tech peers also clearly determined to pour all of their free cash flow into the AI build out, Meta doesn't have an obvious, direct way to monetize the capabilities. Again, there are ads, but that's indirect. Amazon, Google, and Microsoft are all selling their AI directly.

Are you seeing it yet? These two problems have a single solution. It's not simple, but it is fairly straightforward: Meta needs a cloud business.

It's a notion and solution that's so obvious that I've been noting it for quite some time now. After Meta bought Manus late last year (before they were forced to un-buy them), it occurred to me that at least part of the play was to get Meta into the business of selling products to other businesses. That is, enterprise sales. Granted, Meta has been trying to do this for years – remember Workplace, their short-lived Slack competitor? – but nothing has really worked. Again, Meta has remained the ads-based social media company. But Manus was already working. And beyond their consumer angle, there was clearly a big business brewing in selling agentic workflows to enterprises. As I concluded that post:

This deal seemingly makes a lot of sense for Meta on a few fronts. And it also may point to the start of a renewed push into enterprise. Again, easier said than done, but don't be shocked if this is a wedge of sorts. If they can keep Manus expanding into businesses, we should see other Meta cloud offerings follow, putting them more in line with those aforementioned Big Tech peers. And perhaps easing some concerns Wall Street has with regard to their AI spend.

Well, again, Manus, sadly, didn't really work out for Meta. But not because the business or the strategy wasn't good – if anything, those may have been too good, to the point where China took one look at the deal and said essentially: "yeah, no." Honestly, Meta probably should have used the "hackquisition" method to try to do the deal, but again, they clearly wanted the Manus actual business, not just the employees and some "non-exclusive rights". But I digress... The point is that with Manus, you could see a path for Meta to get a toehold into enterprise sales and expand from there – perhaps all the way up to a true cloud offering.

A few weeks later, another bit of Meta news made this general game plan even more obvious, at least to me. As I wrote about the formation of "Meta Compute" – their formal AI infrastructure play:

When I read about this new initiative within Meta, I can’t be the only one who assumes it will eventually lead to a full-on Meta Cloud, right?

Again, it seemed fairly obvious, though a number of people pushed back on the notion. Specifically because it would be so far afield from Meta's core business – and a huge potential headache, going up against the aforementioned Amazon, Google, and Microsoft clouds. That's obviously true, and I noted as much – in particular how it has taken Google years and several micro-pivots to be able to effectively compete in the space. Why? Because for as massive as Google is, and as good as they have always been with infrastructure, they didn't have the muscles to really do enterprise sales. It took bringing in someone like Thomas Kurian from Oracle to make that happen. And he has made that happen. To the tune of $20B in revenue a quarter – fast approaching a $100B/year business for Google. That makes it nearly 20% of Google's overall revenue – and again, rising.

My point is simply that Google, a company once knocked as being a one-hit wonder thanks to their ads business – again, one of the best businesses ever created – eventually found a way to diversify. It was painful and took a long time, but it worked. No one talks about them being a one-trick pony anymore. Meta has tried many things to diversify – going so far as to change the name of the company to one of those bets that, at least thus far, has not panned out – but they haven't tried the one that has worked so well for Google.

And so while some thought Meta doing a cloud business would be crazy because it's so far from their core, I viewed that as sort of the point. Again, they need to diversify! But one final element really drove this notion home: the AI build-out.

When Meta rolled out their first "Muse Spark" models in April, the most interesting element to me wasn't the models themselves but the idea of how they might sell access to them. As I wrote at the time:

One more thing: perhaps the most interesting element of the Muse movement is the notion that Meta intends to sell access via APIs. A first step towards a bigger Meta Cloud offering? You don't spend $140B a year for table stakes.

Even with these new models out there in the wild, and Meta's AI pivot from their failed Llama strategy seemingly on the cusp of being complete, Wall Street continued to throw up all over the company due to their CapEx spend without that clear path to directly monetize it.

Given their valuation as a private company, investors have started to worry about this as well for OpenAI. And so unsurprisingly, the company has started to say that they might be able to launch a cloud offering of sorts to help support their infrastructure build out. I've called this "Field of Dreams Economics" in the past – that is, if you build it (the data centers), they will come (the cloud customers). It's unproven at best, and folly at worst.

In walks Elon Musk...

In a way, Meta and xAI found themselves in the same boat having spent billions to try to catch up in AI and not having much to show for it. At first, it looked like throwing money at the problem would work, but now it looks more like they've built out and up a ton of compute capacity without the demand to truly put it to work. Amazon, Google, and Microsoft don't have this problem, because beyond the varying degrees of success for their own AI products, they have third-party customers. Which is to say, they have their clouds.

As a result, if anything, they each have the opposite problem: they're having a hard time striking the right balance between their own needs and those of their customers. In a way, at the highest level, this is clearly what drove Microsoft to push OpenAI away. As a customer of their cloud services – the most massive customer – they were eating Azure alive. These days, we have Google telling Meta that they need to cut back their compute usage. Anyway, what this all showcases is that the cloud demand is there – and rising.

And beyond AI services, raw compute is driving much of this. This, in turn, led Elon to connect the obvious dots on his own problem.

In an age where data center capacity is king, xAI found itself holding the crown jewels. They clearly didn't want to be in that position – they'd love for demand for xAI models to take up all available compute – but it's actually not a bad position to be in given the current situation. And that's especially true if you're, say, about to IPO and need a good narrative around why you merged your cash-incinerator (xAI) with your cash machine (SpaceX).

xAI is not a cash-incinerator, as it turns out – well, it still is, but not as big of one because it's also now a cash generator in the form of a neocloud!

Yes, Elon figured out a way to turn his space company into a cloud company too. He wants it to be an AI company, and it is, to some extent, but the better narrative is the cloud, because it's the far better business at the moment. And while the "neocloud" business isn't exactly what Amazon, Google, and Microsoft offer, it's arguably a far more straightforward cloud business. Granted, it's one that may be a moment-in-time thing given the capacity constraints that are even forcing even Big Tech to cut deals with the neoclouds to try to boost their capacity. Still, it's better than just burning money with nothing to offset it!

Zuckerberg clearly saw Elon's jujitsu maneuver in getting first Cursor to sign up to use xAI's compute, and then Anthropic, and finally Google. And saw how the market reacted: a huge potential weakness (AI costs) got turned around to help fuel the largest IPO in history. If everything else above made it clear that Meta would need to have a cloud business at some point, the SpaceX neocloud made it an imperative to happen – or at least be talked about – now.

Meta's stock is in the dumps? Zuck could simply pull the "Elon Lever".

And now he has. I've buried the lede some 1,750 words in, but yes, Meta is planning to launch a cloud business, reports Riley Griffin and Kurt Wagner for Bloomberg:

Meta Platforms Inc. is developing plans for a cloud infrastructure business that will sell access to AI computing power and models, setting up a new vector of competition with industry leaders like Amazon Web Services, Microsoft Azure and Google Cloud.
Meta, which has been rushing to secure expensive data centers and other infrastructure to fuel its own artificial intelligence ambitions, is forming a business to generate revenue from excess computing power sold to outside customers, according to people familiar with the matter, who asked not to be named as the details aren’t public.

There we go. But they're also seemingly torn in terms of what their cloud should offer:

One potential plan includes selling access to various AI models that are hosted on Meta’s existing AI infrastructure, an approach similar to AWS’s Bedrock offering, the people said. Meta would run the data centers and chips that power the models, including its own Muse Spark models, and charge developers to access them.
The company is also considering selling access to “raw” computing capacity, akin to other so-called neocloud businesses like CoreWeave Inc., the people said. Development of these new business lines is part of Meta Compute, an internal initiative to build and manage the company’s AI infrastructure efforts, according to a person familiar with the plans. Meta Compute is led by Santosh Janardhan, Meta’s head of infrastructure; Daniel Gross, a leader inside the Meta Superintelligence Labs AI unit; and Meta President Dina Powell McCormick.

The answer may end up being both. The reality is that it will end up being whatever the market demands. And what investors end up liking. And sure enough, Meta's stock, after months in the Wall Street doghouse, shot up nearly 10% yesterday on this news.

Now, maybe it's all a head fake just to juice the stock, but it certainly doesn't seem that way. It seems like Meta is figuring out a way to launch a cloud service that will add to their top and bottom lines. And that should, in turn, help to diversify their business away from ads.

That doesn't mean they'll offer everything that Amazon, Google, and Microsoft do – and they probably shouldn't even try to do that. But they should probably have some sort of neocloud offering, at least to start, mixed with that 'Bedrock' competitor, which would include selling their own models via APIs. And perhaps they can even hark back to their "open source" AI roots by serving up DeepSeek models and the like.

Maybe down the road, Meta starts to augment their AI cloud with other types of cloud services. And maybe one day 'Meta Cloud' even makes up a double-digit percent of revenue. Or maybe it doesn't work, just as many of Meta's recent initiatives haven't. But it's certainly worth trying, for the optics alone, if not the actual business potential. It has been obvious and inevitable for a long time now.


1 Certainly I'm more than a little worried about Meta's ability to execute on much of anything, as I'm currently on day 3 of getting continuously banned by WhatsApp which they can't seem to fix...

Tech aienterprise

Microsoft unveils $2.5B ‘Frontier Company' to embed AI engineers inside customers

Microsoft is launching a $2.5 billion 'Frontier Company' to embed engineers directly into client organizations to build and manage custom AI systems.

Summary

What: Microsoft is deploying dedicated engineering teams to work onsite with enterprise customers, mirroring the 'forward-deployed engineer' model to drive AI adoption.
Why it matters: This indicates that selling API tokens is insufficient for enterprise adoption; companies now require embedded human expertise to bridge the gap between AI research and production workflows.

Decoder

  • Forward-deployed engineer (FDE): Software engineers who work onsite at a customer's location to build, integrate, and maintain custom software tailored to the customer's specific operational needs.

Original Article

The Microsoft Frontier Company will embed engineers inside customers to build and run AI systems.

DevOps gitopssecurity

Argo CD 3.5 Tightens Supply Chain Security with Internal mTLS and Source Integrity

Argo CD 3.5 improves GitOps security by introducing internal mTLS and mandatory Git commit signature verification.

Summary

What: Argo CD v3.5 release candidate adds internal mTLS for inter-component communication, native UI support for ApplicationSets, and beta impersonation for auditing. It also adds Helm 4 support and expanded multi-namespace management capabilities.
Why it matters: This shift addresses long-standing security blind spots in GitOps pipelines where internal traffic between control plane components remained unencrypted and unverified.
Takeaway: If you require stricter supply chain security, enable the source integrity check via the command 'argocd app set --source-integrity-required' for your applications.

Deep Dive

  • Implements internal mTLS to secure communication between the repo-server and API server/controllers.
  • Adds Source Integrity validation for Git commits to prevent unverified manifest deployment.
  • Includes a new ApplicationSet UI with filtering, detail views, and a Preview Apps feature.
  • Promotes impersonation and Source Hydrator features to beta status.
  • Expands support for Helm 4.
  • Enables multi-namespace ApplicationSet management.

Decoder

  • mTLS: Mutual Transport Layer Security, a security protocol that requires both the client and server to authenticate each other using digital certificates.
  • GitOps: An operational framework that takes DevOps best practices used for application development and applies them to infrastructure automation, typically using Git as the source of truth.
  • Source Hydrator: A tool used to process and expand (hydrate) templated manifests into fully rendered Kubernetes objects.

Original Article

Argo CD 3.5 Tightens Supply Chain Security with Internal mTLS and Source Integrity

The Argo CD project released a v3.5 release candidate in June 2026. This version adds mutual TLS enforcement for internal components. It also includes Git commit signature verification for supply chain security and native ApplicationSet management in the UI. The release also graduates two significant features: impersonation and Source Hydrator, from alpha to beta.

Large-scale Argo CD deployments have long faced gaps in internal security posture and operational visibility. Communication between the repo-server and other components, like the API server and controllers, was unencrypted before. This left internal traffic outside the usual mTLS protection that teams apply at the ingress layer. On the supply chain side, nothing in Argo CD prevented a compromised Git repository from silently deploying unsigned or tampered manifests. ApplicationSet resources create applications at scale, but they lacked native UI support. Operators had to check YAML or use kubectl to see what an ApplicationSet would produce.

The repo-server now supports mTLS, requiring client certificates from each connecting component. For environments without custom certificates, the repo-server creates self-signed certs in memory. This helps with internal health checks and avoids relying on the filesystem. Source Integrity validation lets operators ensure Git sources have valid signatures before syncing. This can be set up through the Application spec with sourceIntegrity.required: true or by using the CLI flag argocd app set --source-integrity-required. The ApplicationSet UI, created by engineers from Intuit, Red Hat, GoTo, and Octopus Deploy, includes list, filter, and detail views. It also has a Preview Apps tab, which shows the applications an ApplicationSet template will create before deployment.

Argo CD can now take on a specific user identity for server-side tasks. This includes log streaming, resource deletion, and sync. It has moved to beta. When you set up impersonation through AppProject or RBAC policy, it now applies automatically to all server operations. This is important for audit trails in multi-tenant clusters. The Source Hydrator, which separates dry (unhydrated) manifests from their hydrated output, also reaches beta. Teams can now set different drySource.repoURL and syncSource.repoURL values. This enables multi-repository GitOps patterns. Here, source templates and rendered manifests are in different repositories. Each can have its own access controls.

The release also adds Helm 4 support alongside backward compatibility with Helm 3 deployments. ApplicationSets can now be deployed in any namespace, not just the Argo CD namespace. This change brings stable status and meets a long-requested need for teams managing GitOps by namespace. ApplicationSet concurrency controls help operators limit the number of applications processed at the same time. This reduces the risk of overwhelming the cluster or upstream Git APIs. Azure AD users hitting group claims overflow can now route group resolution through the Microsoft Graph API, bypassing OIDC token size limits. Azure DevOps repositories gain Service Principal authentication support, eliminating the PAT dependency.

Among Argo CD's main competitors, the three v3.5 headline features: internal mTLS, Source Integrity, and ApplicationSet UI, vary greatly based on each tool's architecture. Flux (v2.8, March 2026) avoids the internal mTLS issue by design. Its controllers use Kubernetes API objects to communicate instead of direct gRPC. This means there’s no internal traffic that needs securing. Flux supports mTLS for external connections. You can use certSecretRef with GitRepository, HelmRepository, and OCIRepository resources. On commit signature verification, Flux has offered native GPG support in the GitRepository API via spec.verify.mode: head longer than Argo CD, so Argo CD is catching up here. Flux 2.8 added a web dashboard to the UI via the separate Flux Operator. This includes views for Kustomization and HelmRelease. However, it lacks an ApplicationSet Preview equivalent. Instead, Flux's ArtifactGenerator API manages template-driven generation without a visual preview step. On UI, Flux 2.8 introduced a web dashboard through the separately-installed Flux Operator, covering Kustomization and HelmRelease views, but there is no ApplicationSet Preview equivalent.

Rancher Fleet uses a websocket-based agent system instead of gRPC. This means internal mTLS isn't needed. TLS is handled at the Rancher ingress and agent trust levels. Fleet lacks a built-in Source Integrity mechanism. To enforce commit signatures, you'll need a policy from your Git provider or an admission webhook. Its UI is part of the Rancher dashboard's Continuous Delivery section, not a separate app.

Jenkins X uses Tekton to enforce supply chain controls at the pipeline layer. This includes Tekton Chains and GPG-signed releases. It does not provide a first-party dashboard and lacks a preview capability like ApplicationSet. Multi-cluster promotion occurs through pull request–based environment flows, not by generating templates.

DevOps databaseperformancebackend

How ScyllaDB's Trie-Based Index Delivers Up to 3X More Throughput

ScyllaDB 2026.2 adopts a trie-based index format, significantly boosting read throughput by replacing legacy file structures with more efficient prefix trees.

Summary

What: ScyllaDB has moved to a default trie-based index (ms/mt format) that replaces separate Summary and Index files with a prefix tree. Benchmarks show 20% to 230% higher throughput and significant latency reduction for read-heavy workloads.
Why it matters: Optimizing internal index formats to be more cache-friendly and I/O-efficient is becoming a critical performance frontier for distributed databases handling massive datasets.
Takeaway: If you operate read-heavy ScyllaDB clusters, test the new 2026.2 release to evaluate potential throughput gains for your specific access patterns.

Deep Dive

  • Replaces legacy Summary.db and Index.db with a prefix tree structure stored in Partitions.db and Rows.db.
  • Improves cache efficiency by packing nodes into 4KB pages.
  • Reduces disk I/O by lowering the number of page fetches required for lookups.
  • Simplifies memory overhead as indices now reside in the OS page cache rather than requiring dedicated memory allocation.
  • Shows 31% to 63% latency improvement in standard read workloads.
  • Minimal performance impact on write paths compared to previous versions.

Decoder

  • Trie: A prefix tree data structure where keys are stored character-by-character, enabling fast lookup by sharing common prefixes.
  • SSTable: Sorted String Table, a persistent, immutable file format used for storage in databases like Cassandra and ScyllaDB.
  • I/O (Input/Output): The communication between a computer system and the outside world, in this case, disk reads and writes.

Original Article

How ScyllaDB’s Trie-Based Index Delivers Up to 3X More Throughput

By transitioning from separate summary and index files to a prefix tree, we optimized cache efficiency, reduced disk I/O, and reduced memory overhead

Trie-based SSTable index format was added in ScyllaDB 2025.4. Since then, it has evolved and matured to become the default index format in ScyllaDB 2026.2.

In this post, we deep dive into the format change, present its pros and cons, and show our latest benchmark results of the legacy vs. Trie index formats.

For benchmarking, we chose four different read workloads that would benefit from the Trie index format in different degrees. For all four, Trie indexes demonstrated better performance. They achieved 20% to 230% higher throughput and 31% to 63% lower latency compared to legacy indexes. The impact of Trie index on the write path is negligible.

Note that for use cases with either very low cache usage or 100% row-cache hit rate, the performance gain is expected to be lower. However, these use cases are unlikely in production.

Trie Index Usage in ScyllaDB

Before explaining the new format, we will cover the legacy index format and its challenges.

Legacy Three-Layer Lookup (me/md format)

Until ScyllaDB 2026.2 the default storage format was the SSTable version md and me.

Every SSTable lookup in the me/md format traverses three or four structures:

  • Summary.db (entirely in RAM)
  • Binary search in Index.db
  • Sequential read in Data.db

Both the partition index and the clustering-row index are stored in Index.db. The partition index is partially represented in memory by the sampled Summary.db, while the clustering-row index exists as a promoted index for large partitions.

┌──────── MEMORY ──────────────────────────────┐
│                                              │
│  Summary.db  (entirely in RAM)               │
│  ─────────────────────────────               │
│  Sampled at ~1 byte per ~2000 bytes of       │
│  Data.db                                     │
│                                              │
│  "aardvark" → Index.db byte        0         │
│  "kangaroo" → Index.db byte  1,048,576       │
│  "platypus" → Index.db byte  2,097,152       │
│  "zebra"    → Index.db byte  3,145,728       │
│                                              │
└──────────────────┬───────────────────────────┘
                   │
       binary search → window in Index.db
                   │
                   ▼
┌──────── DISK ────────────────────────────────┐
│                                              │
│  Index.db                                    │
│  ─────────                                   │
│  "kangaroo"   → Data.db: 4,096,000           │
│  "koala"      → Data.db: 4,097,280           │
│  "kookaburra" → Data.db: 4,098,560           │
│  "lemur"      → Data.db: 4,099,840  ← found  │
│  ... (up to ~800 entries per 1 MB scan)      │
│                                              │
└──────────────────┬───────────────────────────┘
                   │
       1 seek + sequential read
                   │
                   ▼
┌──────── DISK ────────────────────────────────┐
│  Data.db                                     │
│  <partition data>                            │
└──────────────────────────────────────────────┘

For partitions containing enough clustering rows, a fourth structure is involved:

Index.db entry for a large partition
┌──────────────────────────────────────────┐
│  partition key  → Data.db offset         │
│  promoted_index (flat list of CK blocks) │
│    block 0: ck_start="aaa", ck_end="azz" │
│    block 1: ck_start="baa", ck_end="bzz" │
│    ...                                   │
│    block N: ck_start=...,  ck_end=...    │
│    offsets[0..N]  ← binary search here   │
└──────────────────────────────────────────┘

The New Trie Index Format

The Trie index replaces Summary.db + Index.db with a single on-disk prefix tree. The storage format is compatible with Apache Cassandra’s BTI (Big Trie Index) format, implemented using ScyllaDB’s Seastar architecture.

Trie indexes are used for both partition indexes and clustering key indexes.

What Is a Trie?

A trie (prefix tree) stores keys character-by-character. Shared prefixes occupy a single path, eliminating redundancy:

Keys: "kangaroo", "koala", "kookaburra", "lemur", "lion"

            [root]
            /    \
          'k'    'l'
           |      |
          [k]    [l]
          / \    / \
        'a' 'o''e' 'i'
         |   |  |   |
        'n' [o]'m' 'i'
         |  / \ |   |
        'g''a' 'o''u' 'o'
         |  |  |  |   |
        'a''l' 'k''r' 'n'
         |  |  |  *
        'r''a' 'a'    * = leaf node
         |  *  |       (payload: Data.db offset)
        'o'   'b'
         |     |
        'o'   'u'
         *     |
              'r'
               |
              'r'
               |
              'a'
               *

"k", "ko", "koo" — shared prefixes stored ONCE

New SSTable Files

The ms/mt format replaces Summary.db and Index.db with two purpose-built files:

SSTable (ms format)
├── Data.db
│     unchanged — partition and row data,
│     same binary layout as me/md
│
├── Partitions.db                     ← NEW
│     Trie index:
│       partition key → Data.db offset
│         (small partitions)
│       partition key → Rows.db offset
│         (large partitions)
│
│     ┌── Page 0 (4,096 bytes) ───────┐
│     │  trie root node [1]           │
│     │  + children (fan-out ≤ 256)   │
│     │  + their children (packed)    │
│     └───────────────────────────────┘
│     ┌── Page 1 (4,096 bytes) ───────┐
│     │  subtree for keys 'a'–'g'     │
│     └───────────────────────────────┘
│     ...
│     ┌── Footer ─────────────────────┐
│     │  first_key      (raw bytes)   │
│     │  last_key       (raw bytes)   │
│     │  partition_count (uint64)     │
│     │  trie_root_pos  (uint64)      │
│     └───────────────────────────────┘
│
├── Rows.db                            ← NEW
│     Per-partition clustering-key
│     tries, concatenated.
│     Each sub-trie:
│       clustering key → byte-offset
│         within partition
│     (replaces flat "promoted index"
│      in Index.db)
│
├── Filter.db     bloom filter — unchanged
├── Statistics.db statistics  — unchanged
└── Scylla.db     ScyllaDB metadata — unchanged

[1] Parent nodes are always written after their child nodes. Parents point to children, so child positions must be known before parents are written.

How a Partition Lookup Works

Query: SELECT * FROM orders
       WHERE order_id = 'ORD-20240611-98765'

┌─────────────────────────────────────────┐
│  Step 1 — Key Translation               │
│                                         │
│  'ORD-20240611-98765'                   │
│         ↓  bti_key_translation.cc       │
│  comparable byte sequence               │
│  (lexicographic order matches           │
│   CQL semantic order)                   │
└──────────────────┬──────────────────────┘
                   │
┌──────────────────▼──────────────────────┐
│  Step 2 — Trie Traversal                │
│           in Partitions.db              │
│                                         │
│  Read root page (4 KB)                  │
│    ← usually in OS page cache           │
│                                         │
│  [root] ──'O'──> [node]  (page 0)       │
│  [node] ──'R'──> [node]  (page 0)       │
│  [node] ──'D'──> [node]  (fetch page 1) │
│  [node] ──'-'──> [node]  (page 1)       │
│  ...                                    │
│  [leaf] payload = Data.db or            │
│                   Rows.db pos 2,097,152 │
│                                         │
│  Typical: 2–6 page fetches.             │
│  Top pages cached → often 0–1 disk I/Os │
└──────────────────┬──────────────────────┘
                   │
┌──────────────────▼──────────────────────┐
│  Step 3 — Read Data.db                  │
│  Seek to offset 2,097,152               │
│  → read partition header                │
│  For large partitions: read Rows.db     │
└──────────────────┬──────────────────────┘
                   │  (range/clustering queries only)
┌──────────────────▼──────────────────────┐
│  Step 4 — Read Data.db at position      │
│           returned from index           │
└─────────────────────────────────────────┘

Page Layout: Packing Parent and Children Together

The most critical write-time optimization is ensuring that a node and its children land on the same 4 KB page. This means an entire trie neighborhood is readable in a single I/O, even on the first (cold) access.

┌──────── 4,096-byte page ────────────────┐
│                                         │
│  [A] ──'p'──> [B] ──'p'──> [C]         │
│         │                │              │
│         └──'r'──> [D]    ├──'l'──>[E]* │
│                          │              │
│                          └──'y'──>[F]* │
│                                         │
│  (* = leaf node with payload)           │
│  (padding bytes to align next subtree         │
│   to page boundary)                     │
└─────────────────────────────────────────┘

trie_writer algorithm (trie_writer.hh):

  1. Maintain rightmost path root → current
     node (_stack)
  2. On each new key: branch off rightmost
     path, add new nodes
  3. Accumulate nodes until a finished
     subtree exceeds a page
  4. Flush child subtrees with padding so
     each subtree fits within one page

ScyllaDB vs. Cassandra reference impl:

  Cassandra: one character per node
  ScyllaDB:  characters grouped into
             "chains" (up to 300 bytes)
  → dramatically faster writes for long
    keys, same read-side page structure

Old vs. New: Side-by-Side Summary

Legacy me/md Trie ms/mt
On-disk files Summary.db + Index.db Partitions.db + Rows.db
In-memory component Summary (always loaded, never evicted) None. Trie top-nodes live in OS page cache (evictable)
Index structure Flat sorted list Prefix tree
Partition lookup (cold cache) Binary search in summary (RAM) + scan of Index.db window Trie traversal byte-by-byte: O(key_length) page reads
Key storage Full key per entry Shared prefixes stored once
Clustering key lookup Flat promoted-index list (binary search) Sub-trie in Rows.db (trie traversal)

Benchmark Results

The following table shows the key results, measured at client-side P99 ≤ 10 ms on a 3-node AWS production-class cluster. All tests were run on 3 × i8g.2xlarge instances, each on a different zone, with Replication Factor (RF) of 3.

Test case Legacy (me) Throughput Legacy (me) P99 Trie (ms) Throughput Trie (ms) P99 Throughput gain
Test 1: Typical (~20% row cache) 130k ops/s 5.1 ms 170k ops/s 1.9 ms +31%
Test 2: Key / Value 90k ops/s 5.2 ms 300k ops/s 3.6 ms +233%
Test 3: Large Partitions 23k ops/s 7.7 ms 37k ops/s 4.6 ms +61%
Test 4: Long shared clustering key prefixes 22k ops/s 5.4 ms 38k ops/s 3.3 ms +73%

Discussion: Why Trie Indexes Improve Performance, and When to Use Them

ScyllaDB CTO Avi Kivity has mentioned three reasons why Trie indexes improve performance:

  • Improved cacheability: the index is denser, so it is more likely to fit in cache, requiring no I/O for the index itself.
  • Fewer I/O operations after cache miss: if the index is not in cache, fewer I/O operations are required to fetch it since the index is more compact and shallower. This is especially true for large partitions, often seen in materialized view workloads.
  • CPU efficiency: less CPU is needed to process the index during reads.

A possible downside is that more CPU is needed to create the index during memtable flush and compaction. However, this is more than offset by the read-side advantages.

Summary

ScyllaDB 2026.2 has adopted a Trie-based index format as its new default, replacing the legacy index structure to significantly enhance read path performance. By transitioning from separate summary and index files to a prefix tree, this design optimizes cache efficiency, reduces disk I/O, and reduces memory overhead. Benchmarks indicate that this architectural change delivers throughput improvements ranging up to 3x across various workloads, offering a more scalable and efficient solution than the legacy index format.

As always, actual results depend on your workload. To evaluate the gain of Trie index, we highly recommend testing it yourself with the latest ScyllaDB releases.

DevOps aillm

Improving token efficiency for GitHub Copilot in VS Code

The VS Code team improved GitHub Copilot's efficiency by implementing client-side tool search, extended prompt caching, and WebSocket-based transports.

Summary

What: By deferring tool loading, caching prompt prefixes for up to 24 hours, and replacing HTTP with WebSockets, they significantly reduced token usage and latency for both OpenAI and Anthropic models.
Why it matters: Usage-based pricing for LLMs is forcing engineering teams to move away from brute-force context stuffing toward smarter, stateful management of prompt windows.
Takeaway: If building your own agentic harness, prioritize separating prompt prefixes from dynamic tool payloads and use persistence layers like WebSockets to maintain cached state across sequential tool-calling turns.

Deep Dive

  • Prompt Caching: Uses prompt_cache_retention set to 24h to keep cached model state in GPU-local storage.
  • Tool Search: Replaces loading all tool definitions with an on-demand, embedding-guided search for only relevant tools.
  • WebSockets: Replaces intermittent HTTP requests with a persistent connection, enabling response state caching.
  • Breakpoint Anchoring: Anthropic-specific optimization that places cache markers at stable prompt boundaries.
  • Client-side Search: Moves tool searching from the server to the client using embeddings for intent-based discovery.

Decoder

  • Prompt Prefix: The initial, unchanging portion of a prompt (e.g., system instructions) that can be cached to save costs.
  • Time to First Token (TTFT): The latency between a request and the beginning of the model's response.
  • Tool Search: An optimization where models discover available tools during inference instead of receiving the full catalog upfront.

Original Article

Improving token efficiency for GitHub Copilot in VS Code

With the recent move to usage-based billing for GitHub Copilot, every token in an agentic session matters. They affect your credits, latency, and the context window an agent has left to finish the task. Each new model generation tends to consume more tokens per task than the last, as we've witnessed in our own data. This means that harness-level efficiencies are increasingly important to counter this trend. As agents take on longer, more autonomous work, an inefficient harness adds up fast.

Making the GitHub Copilot agentic harness in VS Code more token-efficient is continuous work, and it's the best way to counter this trend. For most changes, we run A/B experiments in production and offline evaluations against task suites, confirming that task success rate holds or improves while token usage drops. It's rarely one big win, usually a steady stream of small ones. Below, we walk through recent gains, first for OpenAI models and then for Anthropic models.

How agentic requests spend tokens

Two costs sit at the heart of every agentic request, and two ideas help us reduce them. Both apply across OpenAI and Anthropic models, even though each provider exposes them differently.

The prompt prefix and caching. In an agentic coding session, a large share of every request repeats across turns: system instructions, tool definitions, repository context, and conversation history. This repeated beginning is the prompt prefix. When requests share the exact same prefix, the inference provider can reuse cached model state instead of recomputing it from scratch on each request. Despite the name, the cached artifact is not a human-readable copy of the prompt. It is the model state computed while processing that prefix, represented internally as key/value tensors. Reusing the prefix cuts both cost (cached tokens can be up to 10 times cheaper) and latency, which is why we work to keep the prompt cache hit-rate high.

Tool-definition overhead. Agents can pull in a large number of tools: those exposed by MCP servers, built-in tools, or extension-provided tools. Each tool is sent to the model with a full definition (a name, a description, and a complete JSON parameter schema), and historically every one was loaded into context on every request. Even when that data is cached, the context window overhead is fixed on each turn and grows as the toolset does.

Tool search. Tool search reduces that overhead by letting the model load tool definitions on demand instead of all at once. Upfront, the model sees only lightweight metadata, the name and description of each deferred tool, and the heavier parameter schemas stay out of context until the model searches for a tool and loads it. Because deferred tools are added at the end of the context window rather than the prefix, the cached prompt prefix stays reusable and the caching gains keep working across turns. The payoff is a leaner context window: the model spends fewer tokens on tools it never uses, leaving more room and budget for the actual task.

Efficiency wins for OpenAI models

For OpenAI models, our recent work focused on reducing usage costs and latency for Copilot users through improved token efficiency. We pursued that through three changes: retaining cached model state for longer, reducing tool-definition overhead, and replacing repeated HTTP requests with persistent WebSocket connections.

Extended prompt caching

OpenAI models cache the prompt prefix automatically: the provider infers the reusable prefix and reuses its model state across requests. That reuse has a direct cost benefit. For most OpenAI models that support cached input pricing, uncached input tokens cost 10 times as much as cached input tokens.

Caching the prefix happens on its own, but how long that cache survives is something we can configure. After careful evaluation, we enabled extended prompt caching for supported models through the prompt_cache_retention body parameter. By default, the cache lives in fast GPU memory, where it is dropped after about 5 to 10 minutes of inactivity (up to an hour in some cases) to make room for other work. Setting "prompt_cache_retention": "24h" moves the cache to slower but roomier GPU-local storage and keeps it for up to 24 hours.

The benefit is simple. With the default cache, a pause of more than a few minutes throws the cache away, so your next request has to reprocess the whole prefix at the full, uncached price. Extended retention keeps the cache warm, so picking up where you left off is still fast and cheap, even after a long break.

Time between requests GPT-5.2 GPT-5.3-Codex GPT-5.4
10-20 min +13% +32% +10%
20-30 min +135% +142% +137%
30-40 min +301% +203% +679%
40-60 min +338% +279% +919%

Tool search

To avoid sending all tool definitions on every request, tool search makes this on-demand. Available to models GPT-5.4 and newer, OpenAI's native tool search implements this deferral with a defer_loading flag.

Upfront, the model only sees lightweight metadata: the name and description of each deferred function or, when deferred functions are grouped into a namespace, only the namespace's name and description.

Metric Model Delta
P50 Total tokens used per turn GPT-5.4 -9.81%
P50 Total tokens used per turn GPT-5.5 -8.61%
P50 Time to first token (TTFT) GPT-5.4 -6.88%
P50 Time to first token (TTFT) GPT-5.5 -7.34%
P50 Time to complete (TTC) GPT-5.4 -5.31%
P50 Time to complete (TTC) GPT-5.5 -5.42%

WebSockets

An agentic coding turn can involve many sequential requests to the inference provider; one for each step the model takes as it calls tools and works toward a solution. Responses API WebSocket mode keeps a persistent connection open and provides a lower-latency continuation path for those sequential requests.

Tracking metric Percentile GPT-5.3-Codex GPT-5.4
Time to first token (TTFT) p50 -19.46% -16.37%
Time to first token (TTFT) p95 -12.92% -15.78%
Time to complete (by turn) p50 -13.55% -11.74%
Time to complete (by turn) p95 -7.86% -6.26%

Efficiency wins for Anthropic models

For Anthropic models, our recent work targeted the same two repeating costs: the prompt prefix we keep warm in cache, and the tool payload we send on every turn. We pursued that through two changes: spending our prompt-cache breakpoints more deliberately, and deferring tool definitions through tool search.

Smarter prompt caching

Anthropic's prompt caching works differently from the automatic prefix caching that OpenAI models use. Rather than the provider inferring the reusable prefix, the caller places explicit cache_control breakpoints, and the API caches everything up to each marker.

  • The end of the tool definitions and the end of the system prompt, the parts that change least between turns.
  • A pair of rolling anchors on the two most recent cacheable messages.

Tool search

Anthropic's tool search tool applies the same deferral idea. Tools are marked with defer_loading: true, and alongside the deferred catalog we keep a small, curated set of core tools loaded.

Metric Percentile / scope Delta
Time to first chunk p50 (by turn) -2.45%
Total prompt tokens p50 (by turn) -11.30%
Total prompt tokens p95 (by turn) -8.85%
Total prompt tokens p50 (by user) -18.32%
Total tokens p50 (by turn) -11.09%
Total tokens p95 (by turn) -8.74%
Total tokens p50 (by user) -18.03%

We moved the search itself client-side, backing it with the same tools-grouping system we built for VS Code's reduced toolset. The model still calls a tool_search tool, but instead of Anthropic matching against the deferred catalog, we run the search locally and return tool_reference blocks for the best matches.

What's next

The work above makes our agentic harness leaner: a higher cache hit rate, fewer tool definitions per request, and less transport overhead. The next step is to move whole classes of work off the main agent entirely. We're building specialized subagents, and exploring custom-trained ones, for narrow tasks like searching the workspace, running commands, and summarizing results. Each runs on the smallest, cheapest model that can do the job, instead of the main model paying for that work in its own context.

AI agentsrust

Autoresearch, Claude, and Constrained Optimization

An experiment with autonomous research agents shows they thrive on well-constrained, quantifiable tasks but struggle with vague, real-world objectives.

Summary

What: Researcher Elliot Smith tasked an AI agent with writing a file compression algorithm, finding that clear metrics and strict constraints allowed for effective unsupervised optimization, while warning that poorly defined 'proxy' metrics lead to agent behavior issues.
Why it matters: The efficacy of agentic coding is currently limited by the developer's ability to define success mathematically; without a precise 'reward function', agents often optimize for the wrong outcome, a problem known as 'agent psychosis'.
Takeaway: When building agent-driven workflows, enforce strict timeout and success constraints (e.g., bit-perfect output matching) to prevent runaway optimization loops.

Deep Dive

  • The project aimed to test 'autoresearch' loops using Claude Code (Sonnet 4.6).
  • The task was to develop a compression algorithm in Rust.
  • Success was defined by a bit-perfect round-trip compression/decompression check.
  • Optimization was constrained by a 300-second execution time limit per file.
  • The agent iteratively improved the compression factor over 10 loops.
  • Agent-based optimization outperformed some off-the-shelf tools on audio and video but was less effective elsewhere.
  • The researcher notes that agents often 'race to be done,' requiring better looping logic.

Decoder

  • Autoresearch: A proposed method where AI agents autonomously perform iterative research or development tasks by setting goals, executing code, and analyzing results.
  • Gradient Optimization: A mathematical technique used to find the minimum or maximum of a function by following the steepness of the curve.

Original Article

Introduction

You don't need to look far to find claims that folks have been using AI to do the work of dozens of people. I tend to be skeptical of any claim that discusses improvements without evidence. I decided to take that skepticism and put it to work. This had a minor overlap with the whole 'loops' discussion on X but that's coincidental.

Over the last few weeks I have put together a project in the theme of Kaparthay's 'Autoresearch'. I wanted to choose a problem that was not a traditional machine learning or numerical optimization problem but one that still had some objective measure of success.

I chose this kind of problem because many of the projects or products I have worked on are structured that way. You have some metric that you want to change (up or down) and ideally some way to measure it. You likely also have some constraints e.g. we can't let the page load time exceed 500ms for this feature.

I have yet to work on a problem like this where the path from unknown to success is a clear, gradient optimization akin to machine learning. More often you complete some work, test it in the 'real world', look at how it performed and then make a decision about next steps. Not all changes result in a positive outcome and it's easy to go deep down a path that results in a locally optimal outcome.

I wanted an experiment that would give me some intuition about how to task AI agents with bigger pieces of work in a mostly unsupervised way. There are already other mechanisms to try and achieve this outcome, such as Ralph Loops and the /goal command that's now in Claude Code. The difference in this setup is that I would pick a quantifiable number as the primary measure of success and bound the problem with some pass-fail constraints.

Not wanting to over complicate things I chose the problem of file compression. I picked it because the objective and the constraints were simple. A compression algorithm is better if the final file size is smaller. I added two constraints to the problem, one being that the uncompressed file needed to match perfectly and the other that neither compression or decompression could exceed 300 seconds. I was deliberately not optimizing for speed but wanted to cap the time and ensure the process could run mostly unsupervised with the knowledge that a timeout would catch and infinite loops.

The other nice thing about file compression is that there are many existing tools I could use for a final benchmark. Given this was a small proof of concept I wasn't expecting to create a new top-of-the-line algorithm.

Despite that, knowing how well this home cooked version performed against existing tools also helps provide a data point on how much we might move away from libraries and off the shelf solutions. If an agent can quickly and reliably solve a problem previously solved by an external dependency there must be some point at which the value of an in house solution exceeds the risk of things like supply chain attacks. This isn't something one single experiment would answer but it would help determine if this was worth looking at more.

Methodology

Problem Setup

First, a reminder that the goal here was to see if this approach was viable rather than to benchmark any particular model.

Second, before we get into it, all the code for this project is available here: https://github.com/smitec/agent-compression

For this work I used Claude Code with default settings on Sonnet 4.6. I am certain different models would have done things differently, that's an exercise for another day.

Prior to any agent involvement I setup a basic scaffold for the project. I picked Rust because some of the implicit constraints like "don't modify the function signature" were easily enforceable via the type system. I put together a stub of the compress and decompress function which both just copied the bytes across. This 'worked' but provided zero compression to any of the data.

I then put in place a couple of basic unit tests to test the compress-decompress round trip on both a string and a simple file. These tests weren't exhaustive but did validate that the compress and decompress function were adhering to their goal of a bit perfect round trip.

From there I put together a bench-marking script. This script fetched some public domain file samples across video, audio and text as well as created some files filled with random data of various sizes. Many of these files were in formats that were already somewhat compressed so I added a step to convert them to less compressed formats. This gives a good file wise benchmark alongside the overall compression benchmarks.

Having this sample set meant that there were a mix of high and low entropy file formats. A good compression algorithm will shrink low entropy formats and leave high entropy formats mostly unchanged. You can expect some minor change in file size due to format specific bytes but overall you don't want file size to increase in a meaningful way.

The largest file in the sample set was around 150MB. While compression is likely more meaningful on even larger files it would have resulted in a very slow test loop, especially in later steps.

The bench-marking script looped through each of the files, compressed them individually and then decompressed them. The script checked the decompressed file was a bitwise match to the original and noted down the change in size and how long the compress and decompress steps took. There was a 300 second timeout applied to each file's steps mainly to check for accidental infinite loops.

The script produced a debug.csv file which outlined the changes per file and, if there was an improvement, wrote the key metrics to a results.csv file. One thing of note was that the combined compression metric was (total compressed bytes) / (total original bytes). I had also considered taking the average percentage compression across the sample set. I'll get into the differences and impact of this choice a little later.

Once all of this was setup I ran the benchmark for the stub implementation and considered the experiment ready to run.

Iterations

To keep things relatively well controlled I cleared the Claude context before each iteration and prompted the model with "Review the current codebase and attempt another iteration of improvement." I have Claude Code set to plan mode by default so I waited for the plan and then after a quick review accepted the plan and let the agent run on its own.

I intentionally didn't modify any of the plans in this experiment, wanting to let it make fully autonomous choices. There were a few times where I think an intervention would have been useful but that’s a lesson learned.

I ran ten iterations and then completed a final extended benchmark against some common compression tools and on a new dataset to control for any data-specific optimization. These iterations were run over the course of about two weeks usually kicked off and left to run while I was doing other things. This extended time period wasn't a design feature of the trial, it was mostly to avoid exhausting my Claude Code limits while working on other things.

Results

Iterations

During the first iteration the agent produced a custom LZSS implementation, a fairly standard and well known method of compression. The next nine iterations were extensions to this method, adding new entropy checks and encoding techniques to try and remove entropy.

Each loop varied a lot in time taken and tokens used. On average, based on the /usage command in Claude Code, a single iteration cost about $4 USD. Again this was on the default settings so I am not reading too much into the price given how much that varies per model.

Interestingly the model never made more than one set of changes in a given iteration. It would form a hypothesis, add the code, run the benchmark and call itself 'complete'. This may come down to the prompting setup of not using the /goal command.

The results below show that the model was able to continue to make improvements to the compression factor. Looking in particular at the 'compressible' ratio the results were, in my opinion, pretty impressive given how loose the task was.

Benchmarks

To assess the final results I ran several compression tools over the same dataset. These tools were chosen because they happened to already be installed. This is not the most robust method of choosing a benchmark but it does reflect a comparison to common tooling.

Overall the custom algorithm performed fairly well, it excelled at audio and video compression and was slightly worse or on par in other categories. The lower scores in audio and video aren't surprising given the metric used to optimise. These file types represented most of the bytes being compressed so the combined score was moved most by wins there.

Coming back to the goal of this project, this wasn't a quest to find a breakthrough compression algorithm but instead to develop some intuition about tasking an agent with optimizing software.

Learnings

To wrap this up, and give folks something to skip to if this post is too long, here are some high level take-aways from this project. Overall, I think if you can find a robust, measurable and well constrained metric to optimise then this auto-research/loop style work makes sense. Finding one of those is often tricky.

Models race to be 'done'

The overall feeling I had while watching/reviewing the setup was that it wanted to be 'done' as quickly as possible. Based on this I think having some explicit looping mechanism setup would be important for a real world version of this setup.

The choice of objective function is key

Another observation I had was that the 300 second time parameter was likely far too loose a constraint. It was useful for capping the downside of a change but the model was only ever optimising for compression. A phenomenon recently captured by Mitchell Hashimoto in this X thread:

I've got an agent in a loop optimizing a renderer with the goal to minimize frame times (and tests to measure). It got times down from 88ms to 2ms and allocations down from ~150K to 500. Sounds good, right? Wrong. This is exactly why agent psychosis is a big fucking problem.

As… — Mitchell Hashimoto (@mitchellh) May 28, 2026

A real world application of this method would either need a more complex 'score' to optimize or to later switch to an optimisation for speed. The same can be said for other secondary metrics like code length, memory usage etc.

This is by no means a new issue or one that is unique to agent based coding. Choosing measures of 'success' and 'done' has long been a challenge in engineering organisations. Realistically any metric or combination of metrics is going to come with trade offs. You probably just need to get comfortable with that fact and be willing to shift your focus over time as the needs change.

I saw recently that PostHog was doing some work in this space with their new PostHog Code product. Allowing users to bring product analytics into their coding agent context to better guide decisions. I'm yet to test it out but it feels like the right direction.

Real world objectives are rarely as simple to measure

While discussing metrics it's worth considering how this technique might differ in the 'real world'. A compression tool has a very fast feedback loop. You can take a file, compress it, decompress it and compare the results. If this change was more broad, say "Improve the checkout conversion rate," you'd need a lot more time to gather samples and you'd be a lot more susceptible to noise in the data.

One solution here is to optimize a proxy metric with the hope/hypotheses that it will improve the conversion rate. That might be something like 'improve page load speed' or 'reduce the number of clicks needed to checkout'. This could certainly be more easily iterated on but you then run the risk of over-optimising on a proxy metric that only loosely correlates with your final goal. It is rate to find a proxy metric that is perfectly and linearly correlated with a more complex one.

Limitations

Some very brief acknowlegements of limitations here:

  • Model choice, how long are these results valid. Models change all the time (Sonnet 5 came out today), realistically the results of this same trial today will likely be quite different.
  • Cost, is this sensible? Based on Claude's estimates each loop cost about $4 in tokens. You'd need an ROI to do this in a 'real' product. $40 (10 loops) isn't a high bar but running a loop like this for every change in a code base could be costly.
  • Single machine, single thread results. Compression benchmarks vary wildly across CPUs, these were all done on an M2 Macbook Pro but I am sure the results would have differed in other scenarios.
  • Choice of optimisation function. This is the biggest one, outlined several times above, had I chosen something like average( compressed / raw) or even median the path to better would have looked very different. I've written a lot about choosing metrics in the past and this applies to agents as much as it does to humans.
AI llmopensource

Introducing Laguna XS 2.1

Poolside released Laguna XS 2.1, a 33B MoE model focused on coding, using a new permissive OpenMDW-1.1 license to boost community adoption.

Summary

What: The new 33B-A3B Mixture-of-Experts model improves on the previous version with a 63.1% score on SWE-bench Multilingual. It supports vLLM, SGLang, and Ollama, and is licensed under the new OpenMDW-1.1 framework to ease distribution friction.
Why it matters: This release marks a push for specialized, resource-efficient coding models that can be run locally, competing with larger frontier models on agent-specific tasks.
Takeaway: If you are building local agentic workflows, test Laguna XS 2.1 using quantized checkpoints (FP8, INT4, or NVFP4) to optimize VRAM usage while maintaining performance.

Deep Dive

  • Laguna XS 2.1 is a 33B parameter Mixture-of-Experts (MoE) model.
  • Activated parameters per token: 3B.
  • 5.4 point improvement on SWE-bench Multilingual compared to the previous XS.2 model.
  • Fully supported by vLLM, SGLang, TensorRT-LLM, and Ollama.
  • Released under the permissive OpenMDW-1.1 license.
  • Quantized formats: FP8, INT4, and NVFP4.
  • DFlash speculator models are included to increase token generation speed for local users.
  • API access provided at a fixed cost for input/output/cache-read tokens.

Decoder

  • Mixture-of-Experts (MoE): A model architecture where only a subset of parameters (experts) is active for each token processed, enabling larger capacity with lower compute costs.
  • Quantized: The process of reducing the precision of a model's weights (e.g., from 16-bit to 4-bit) to reduce memory requirements and improve inference speed.

Original Article

Today we're releasing Laguna XS 2.1, an upgraded version of our Laguna XS.2 model.

Laguna XS 2.1 is a 33B total parameter Mixture-of-Experts model with 3B activated parameters per token, designed for agentic coding and long-horizon work on a local machine. It's the same architecture as XS.2, with a notable improvement on SWE-bench Multilingual and stronger performance on terminal-style tasks.

XS 2.1 vs XS.2

XS 2.1 improves upon XS.2 across a key field of agentic coding benchmarks. The largest move is on SWE-bench Multilingual, up 5.4 points to 63.1%.

  • Laguna XS 2.1 33B-A3B
  • Laguna XS.2 33B-A3B
  • Qwen3.6 35B-A3B
  • North Mini Code (Cohere) 30B
  • MAI-Code-1-Flash 137B
  • gpt-oss-120b 120B
  • Claude Haiku 4.5 -
  • GPT-5.4 Nano -

SWE-bench Verified

Resolved tasks on SWE-bench Verified.

SWE-bench Multilingual

Resolved tasks on SWE-bench Multilingual.

SWE-Bench Pro

Resolved tasks on SWE-Bench Pro.

Terminal-Bench 2.0

Resolved tasks on Terminal-Bench 2.0.

Benchmarks as of 2 July 2026. We have chosen to include dense models with larger activated parameter counts to highlight the relative efficiency of MoE models.

A better local experience

XS 2.1 is supported in vLLM, SGLang, NVIDIA TensorRT-LLM, HF transformers and Ollama, with llama.cpp support coming soon. We’re also making three quantized checkpoints available—FP8, INT4 & NVFP4—allowing XS 2.1 to be deployed in setups with tighter VRAM & compute budgets. We also intend to make quantized GGUF checkpoints available in the near future as part of our native llama.cpp support.

We’re also open-weighting DFlash speculator models for each XS 2.1 checkpoint. We trained these speculators to balance overhead and acceptance rate. In our tests, these speculator models double the achieved tok/s, making local inference of XS 2.1 even faster than it was before.

We are serving the model at 256K context length on our API and through OpenRouter.

A more open license

We are licensing Laguna XS 2.1 under OpenMDW-1.1.

We are making this change to support open model distribution for the community. OpenMDW-1.1 is fully permissive and designed for models and related artifacts, giving developers and organizations a more consistent framework for using, modifying and deploying open models.

We are glad to support the direction NVIDIA and the Linux Foundation are taking with OpenMDW, and we think this is a useful step toward reducing licensing friction for open model releases.

Get started

  • Download the weights from the Laguna XS 2.1 collection on Hugging Face — BF16, FP8, NVFP4, and INT4.
  • Use the model on OpenRouter (poolside/laguna-xs-2.1) or via our API. Free and paid endpoints are both available with paid pricing matched to XS.2 at $0.10 / $0.20 / $0.05 per 1M input / output / cache-read tokens.
  • Run it locally with Ollama, llama.cpp, TRT-LLM, vLLM, or SGLang, and add the DFlash draft model for faster inference.
  • Install pool, our terminal-based coding agent, for the best agent experience with the model.

We want to see what people build with XS 2.1, and we want your feedback. Try both models side by side and tell us where 2.1 is better and where it isn't. Join our Discord to share what you find and talk to the team directly, or reach us at models@poolside.ai or on X.

Laguna XS.2 will sunset on our API after 1 week. XS.2 will remain available as part of Baseten’s Model Library for dedicated deployments.

Footnotes

All benchmarking for Laguna XS 2.1 was completed using Laude Institute’s Harbor Framework with our agent harness, with a maximum of 500 steps and sandboxed execution. The same sampling parameters were used for all Laguna XS 2.1 benchmarking: temperature=1.0, top_k=20 and top_p=1, with thinking mode enabled and a context length of 256K tokens. All tasks were run in their own sandbox using 8 GB RAM/2 CPUs, with the exception of Terminal-Bench 2.0, which used 48 GB RAM/32 CPUs.

Some base task images and verifiers were patched to fix infrastructure reliability issues inherent in task setup, such as rate limits on third-party dependencies in external registries used by the verifier. All four agentic benchmarks were run with patched images. We also ran a reward-hack judge post-hoc on Laguna XS 2.1 evaluation runs and did not find significant reward hacking after joint judge review and manual review.

  • SWE-bench Verified: mean pass@1 averaged over 4 attempts per task
  • SWE-bench Multilingual: mean pass@1 averaged over 4 attempts per task
  • SWE-Bench Pro: mean pass@1 averaged over 2 attempts per task
  • Terminal-Bench 2.0: mean pass@1 averaged over 5 attempts per task; 48 GB RAM/32 CPUs

We used the highest publicly-referenced scores for all comparison models across each benchmark. In all cases these were official scores published in release blog posts or equivalent, with the exception of gpt-oss-120b and Claude Haiku 4.5 where the highest published (verified) scores for SWE-Bench Pro and Terminal-Bench 2.0 are from their respective official leaderboards.

AI llmresearch

Residual Context Diffusion Language Models

Residual Context Diffusion (RCD) prevents computational waste in diffusion language models by recycling discarded token data as contextual residuals.

Summary

What: RCD is a module that injects discarded token information back into the denoising process of block-wise diffusion language models. The research, published by a team including UC Berkeley researchers, shows 5–10 point accuracy gains and 4–5x fewer required denoising steps.
Why it matters: This addresses a major inefficiency in diffusion-based LLMs, potentially making them more competitive with standard autoregressive models for inference tasks.

Deep Dive

  • Diffusion LLMs typically discard lower-confidence tokens during block-wise decoding.
  • RCD captures these discarded representations to improve future denoising steps.
  • Uses a decoupled two-stage training pipeline to manage backpropagation memory overhead.
  • Validated on long Chain-of-Thought (SDAR) and short instruction-following (LLaDA) tasks.
  • Achieves significant accuracy improvements with minimal extra compute.

Decoder

  • Diffusion Large Language Model (dLLM): An LLM that generates text through iterative denoising, allowing for parallel token generation instead of strictly sequential (autoregressive) generation.
  • Chain-of-Thought (CoT): A prompting technique where the model explicitly generates reasoning steps before providing a final answer.

Original Article

Residual Context Diffusion Language Models

Diffusion Large Language Models (dLLMs) have emerged as a promising alternative to purely autoregressive language models because they can decode multiple tokens in parallel. However, state-of-the-art block-wise dLLMs rely on a “remasking” mechanism that decodes only the most confident tokens and discards the rest, effectively wasting computation. We demonstrate that recycling computation from the discarded tokens is beneficial, as these tokens retain contextual information useful for subsequent decoding iterations. In light of this, we propose Residual Context Diffusion (RCD), a module that converts these discarded token representations into contextual residuals and injects them back for the next denoising step. RCD uses a decoupled two-stage training pipeline to bypass the memory bottlenecks associated with backpropagation. We validate our method on both long CoT reasoning (SDAR) and short CoT instruction following (LLaDA) models. We demonstrate that a standard dLLM can be efficiently converted to the RCD paradigm with merely ∼1 billion tokens. RCD consistently improves frontier dLLMs by 5–10 points in accuracy with minimal extra computation overhead across a wide range of benchmarks. Notably, on the most challenging AIME tasks, RCD nearly doubles baseline accuracy and attains up to 4–5x fewer denoising steps at equivalent accuracy levels.

AI devopsperformance

CursorBench 3.1

CursorBench 3.1 updates its evaluation criteria to better measure how AI agents handle ambiguous, multi-file software engineering tasks.

Summary

What: CursorBench 3.1 ranks various coding agents (including Fable 5 and Opus 4.8) based on their ability to perform complex refactors, bug fixes, and code reviews, accounting for both score and cost-per-task.
Why it matters: Benchmarks based on real-world repository sessions provide a more accurate metric for software agents than static code generation evals like HumanEval.

Original Article

Model Score Cost / task Tokens / task Steps / task
1 Fable 5 Max 72.9% $18.02 63,842 76
2 Fable 5 Extra High 72.0% $13.74 48,754 63
3 Fable 5 High 70.6% $10.81 37,173 54
4 Fable 5 Medium 69.8% $8.27 28,507 47
5 Opus 4.7 Max 64.8% $11.02 62,989 96
6 GPT-5.5 Extra High 64.3% $4.37 17,905 46
7 Fable 5 Low 64.2% $5.70 18,882 36
8 Opus 4.8 Max 63.8% $7.59 77,370 60
9 Composer 2.5 63.2% $0.55 15,152 37
10 GPT-5.5 High 62.6% $3.59 13,329 40
11 Opus 4.8 Extra High 62.1% $6.14 55,622 54
12 Opus 4.7 Extra High 61.6% $7.11 43,942 72
13 Sonnet 5 Max 61.2% $6.87 93,485 93
14 Opus 4.7 High 59.4% $5.01 32,227 59
15 GPT-5.5 Medium 59.2% $2.22 9,065 35
16 Opus 4.8 High 58.4% $4.41 36,788 45
17 Sonnet 5 Extra High 58.4% $5.23 58,228 86
18 Sonnet 5 High 57.0% $3.74 41,735 66
19 Opus 4.8 Medium 56.6% $3.83 31,684 41
20 Sonnet 5 Medium 54.9% $2.57 27,469 53
21 GLM 5.2 Max 54.6% $3.11 51,312 83
22 Opus 4.8 Low 54.3% $2.93 22,726 36
23 Opus 4.7 Medium 52.7% $2.93 19,193 41
24 Kimi K2.7 Code 52.7% $1.92 32,902 70
25 Composer 2 52.2% $0.56 14,163 40
26 GLM 5.2 High 50.7% $2.46 30,621 76
27 Gemini 3.5 Flash 49.8% $1.94 35,105 79
28 Sonnet 4.6 Max 49.0% $3.09 40,280 55
29 GPT-5.5 Low 48.8% $1.19 4,923 24
30 Sonnet 4.6 High 48.8% $3.06 37,352 57
31 Opus 4.7 Low 48.3% $1.87 13,164 29
32 Sonnet 5 Low 47.7% $1.46 17,028 37
33 Kimi 2.6 47.6% $1.27 24,783 56
34 Sonnet 4.6 Medium 46.0% $2.64 31,360 50
35 Sonnet 4.6 Low 41.5% $1.89 21,211 50
36 Kimi 2.5 31.9% $0.87 9,446 30

Changelog

CursorBench 3.1

  • Introduced problems focused on codebase understanding, bugfinding, planning, and code review.
  • Improved grading criteria for some edit tasks.

CursorBench 3.0

  • Initial set of tasks focused on edit, refactor, and bugfix problems.

Avg cost / task is computed by applying each model's published per-million-token pricing (input, cache read, cache write, and output) to the tokens it used on each CursorBench 3.1 task, then averaging across tasks. Results are subject to variance; small differences in scores may not be statistically meaningful.

AI enterprisellm

New analytics and cost controls are available for Claude Enterprise

Anthropic is introducing granular cost tracking and model-level spend controls for Claude Enterprise to help organizations justify AI return on investment.

Summary

What: Anthropic updated Claude Enterprise with a new analytics dashboard and an Analytics API that allows admins to track usage and cost by user or group. The update includes model-level entitlements, spend alerts at 75% and 90% thresholds, and programmatic integration with external tools like Datadog and CloudZero.
Why it matters: As AI usage shifts from experimentation to core workflows, companies are moving beyond simple flat-rate subscriptions toward unit-economic tracking where AI costs are mapped directly to business outcomes.
Takeaway: If you are managing Claude for a team, integrate the new Analytics API with your existing cloud spend monitoring tools to automate budget tracking and prevent service disruptions.

Deep Dive

  • Admins can now filter usage data by existing SCIM groups to match organizational structures.
  • The Claude Code dashboard includes value tracking metrics like estimated productivity lift and cost-per-commit.
  • An Analytics API allows for programmatic ingestion of spend data into third-party platforms.
  • Model-level routing lets admins restrict expensive models to specific roles or tasks to manage operational costs.
  • Users receive in-app notifications as they approach spending thresholds.
  • Administrators can automate request-for-increase workflows via the Admin API.
  • The system tracks not just tokens, but also specific skill and plugin adoption to measure utility.

Decoder

  • SCIM (System for Cross-domain Identity Management): A protocol used to automate the exchange of user identity information between an identity provider and a service provider.
  • MCP (Model Context Protocol): An open standard developed by Anthropic that allows AI models to connect securely to data sources and developer tools.
  • Claude Code: Anthropic’s CLI-based agent tool designed to perform software engineering tasks like editing files and running commands.

Original Article

Giving admins more visibility and control over Claude spend

We’re introducing richer admin analytics, model-level entitlements, and spend alerts for Claude Enterprise. As Claude takes on increasingly difficult and complex agentic work across the organization, usage and cost patterns look different from a standard chat tool. These controls give admins the visibility to understand how Claude is being used and the tools to manage costs.

Today's additions build on controls Anthropic already provides: spend caps at every level, access and model routing, a usage analytics dashboard with exports and an Analytics API, and effort controls. Richer analytics and more granular cost controls are the newest additions to a control surface we've been building on for months.

Track adoption and cost

The analytics dashboard for admins now shows usage and cost by group and by user, with output like artifacts created, files edited, skills and connectors used displayed directly next to their cost. Admins can filter by the SCIM groups their IT team already manages, so the breakdown follows their existing org chart.

Claude Code gets richer insights with two new tabs focused on value and usage inside the admin console. Usage shows active developers, session counts, and top commands across the org, and is updated daily. The value tab summarizes usage and cost data to help admins understand value of Claude Code at a glance, estimating productivity lift, cost per commit, and annual value. Every formula is visible in the tab, and the inputs are adjustable.

Analytics chat can now answer a much broader set of questions and produce richer artifacts that you can dive deeper into. Admins can ask questions in plain language — "Which teams doubled their Claude usage this month?" or "Where are we getting the most value per seat?" — and Claude returns charts that can be exported and shared with stakeholders.

Usage and cost data is available programmatically through the Analytics API, so finance and IT can bring Claude usage and cost data into the tools they already run — like Datadog Cloud Cost Management and CloudZero — and see it alongside the rest of their cloud and AI spend. Results can be filtered by date range, team, product, or model. Skills report their own usage and cost, and new endpoints track plugin adoption and artifact creation.

Admins can extend usage visibility to individual users — cost, product and model breakdowns, and progress against spend limits — so no one hits a surprise cutoff. Users can also see their own usage trends over time, including which products, models, and skills they rely on most, and how that activity adds up in spend.

Controls for managing spend

Model defaults and entitlements let admins set which Claude model new conversations start with across chat, Cowork, and Claude Code so routine work doesn't necessarily default to the most expensive option. Admins control which models are available to specific roles or across the entire organization.

Spend-threshold alerts notify admins at 75% and 90% of an org-level spend limit, giving them time to raise the cap before anyone gets blocked mid-task. Users receive in-app notifications at 75% and 95% thresholds and can request a limit increase directly from their admin without leaving Claude.

For organizations managing limits across many groups, the Admin API moves cost-control workflows into scripts so controls scale with the org. Automate increase-request reviews, identify members close to their spend limit, and flag rapidly changing usage all at scale.

"Cost visibility isn't a once-a-month exercise. Granular spend data and alerts give teams regular nudges to reassess how they're using Claude, instead of a surprise at the end of the billing cycle. With the Analytics API, we can bring that data into the tools we already use every day." — Kyra Abbu, Product Manager
“I'm not going to slow down the people driving our best quarter, and my CFO isn't asking me to. He's asking for ROI. We've tied Claude, connected to our enterprise MCP servers, to a 4% revenue lift, and seeing cost next to business impact by team is how I make that case stick.” — Carter Busse, CIO
"Token usage alone doesn't tell you much. What I actually want to see is which skills get run again and again across the org — that's the real signal of value." — Ciro Yamada, Product Director

Getting started

For admins managing Claude across their organization: explore usage and cost breakdowns in the admin console, set model defaults and spend limits by group, and configure spend-threshold alerts to stay ahead of overages. Usage data is available in the admin dashboard, and the Analytics API lets finance and IT pull the same metrics into existing reporting systems.

Tech aiagents

Building an Intern

A project called 'Junior' demonstrates how to build and integrate an AI intern directly into Slack workflows.

Summary

What: Junior is an AI-powered agent designed for Slack that allows users to feed it data, review its outputs, and steer its task execution through collaborative interaction.
Takeaway: Developers interested in agentic workflows can examine the project's open-source repository for implementation details.

Original Article

'Junior' is an AI intern in Slack. Users can give it information to work with, review the work, and steer the agent in the right direction. This article discusses the process of developing Junior, which took over four months. The full source for the project is available.

Tech aiagents

Understanding is the new bottleneck

Geoffrey Litt argues that developers must maintain a deep understanding of agent-written code to remain active participants in the creative process.

Summary

What: The author proposes using 'literate diffs,' interactive micro-worlds, and embedded quizzes to force human engagement with AI-generated code rather than relying solely on automated verification.
Why it matters: This challenges the trend of total automation, suggesting that 'cognitive debt'—the cost of not understanding a system—will eventually limit a team's ability to innovate or maintain complex codebases.

Deep Dive

  • Verification vs. Participation: Verification is a binary check of correctness, whereas participation requires a conceptual mental model of how the system works.
  • Explanations: Beyond raw diffs, use structured documents that explain the 'why' and architectural context of a change.
  • Interactive Figures: Use embedded tools to experiment with parameters in real-time to build intuition.
  • Literate Diffs: Review code by reading prose-like walkthroughs rather than alphabetical file lists.
  • Spaced Repetition: Embed quizzes in code review artifacts to ensure human comprehension before merging.
  • Micro-worlds: Build custom debuggers or interactive UIs that allow stepping through logic in time-scrubbable environments.
  • Shared Spaces: Use collaborative environments (like Notion) to ensure teams share a unified mental model of agentic progress.

Decoder

  • Cognitive debt: The accumulated deficit in human understanding of a system that makes future modifications more difficult or error-prone.
  • Literate diff: A code review artifact that structure changes as human-readable prose paired with relevant code snippets rather than raw file modifications.

Original Article

Understanding is the new bottleneck

This is a written version of a talk I gave at the AI Engineer conference in July 2026, also shared as a tweet thread.

Hot take: I think it's still important to understand the code that our agents write!

In this talk I'll explain why that's the case, and show some ideas for how to efficiently understand code. Alright, let's dive in.

Agents are writing more and more code for us, and we all know it's getting harder to keep up.

But the good news is: there are many ways to understand code! Reading diffs line by line is not the only way.

Most of this talk will be about techniques I have found helpful to understand systems my agents are building:

  • Code explainer docs
  • Quizzes to check my understanding
  • Micro-worlds that I can play with to understand the system

But first we have to ask a more basic question…

Why understand?

Why? Why understand?

Aren't we supposed to be taking ourselves out of the loop now, and letting the agents loop themselves? As the agents get smarter, doesn't it become less important for us to be in the details?

I think many people — even those who are pro-understanding — have a slightly incorrect answer to this question!

One possible answer: we understand to verify. We check the agent's work, we see if it's correct.

Correct can mean many things: does it match the spec, is it well architected… but it's fundamentally a thumbs-up / thumbs-down question.

Here's the thing: the agents are getting better and better at verifying their own work. And this is good! I like it when my agent doesn't make mistakes.

But hmm. Where does that leave us humans?

That's where another answer comes in: we can understand to participate.

You can learn what the agent is doing to make sure you can be an active participant in the creative process. Here's why this matters…

It's never just one loop! A project is many, many loops with the agent.

And the understanding you have of the system is part of your ability to come up with the next idea to evolve it.

You need a rich set of concepts in your mind to think creatively and fluently about how to move something forward. If you're lacking that fluency, your ability to participate in the project is meaningfully limited.

By the way, this relates closely to the idea of cognitive debt, popularized by Margaret Storey and Simon Willison.

It's like tech debt: you can get away with not understanding what's going on in the short term, but it'll bite you eventually.

OK, so fine, understanding matters.

But this raises the next question: how? How do we build this human understanding when we're working with AI and moving fast?

Well, turns out this is not the first time anyone has ever thought about how to communicate understanding. I think we can look to education as an inspiration. Can we steal the best ideas ever invented for education and apply them to this problem?

Technique 1: Explanations

Today I want to share three techniques that show how we can attempt this.

First: explanations. What makes a good explanation?

Whenever an agent finishes some work, it's an opportunity for an explanation — an artifact.

Most naively, we can read a code diff: the raw material that changed.

But what if we ask:

What would the best explanation be? If you had a team — human or AI — that really sweat the details of explaining something well to you, how would that feel?

Here's one answer. I made a skill called /explain-diff, which I use every day and many coworkers have found valuable.

It outputs thoughtfully structured code explainers as HTML, markdown, or Notion docs. Notion is a good place for collaborating on and discussing these explainers as a team.

Let's see what's in one of these explainers, using an example of editing the perspective of a video game.

First principle: teach me background info!

Before we even get to what changed, help me understand what was already there. In this case, teach me about the game engine.

Second principle: intuition before details.

Before any code, it states the goal — “make the garden feel three-dimensional with 2D drawing tricks” — and explains related concepts, like what isometric projection is.

All of this builds my intuition for the essence of the change. It's catching me up as the human so I can be an equal participant in understanding.

You can also build intuition with interactive figures.

Here I'm understanding the isometric perspective by dragging rocks around the garden and watching their coordinates move.

We finally get to the code. But a typical diff is a pile of files edited in alphabetical order with no explanation.

A “literate diff” as I call it is structured as prose — walking through the changes in a sensible order, with surrounding explanation and embedded code snippets. Faster to review than a raw diff.

The end result of all of this is a nice explainer packet. I still read the code diff but I always read this first.

Sometimes I'll print these out and take them to the café — less distracting.

It's beautifully ironic: AI turns an interactive activity into a static paper report I can focus on deeply :)

There's only one problem: reading is hard work 😅

As Andy Matuschak says: “books don't work”! It's too easy to fool yourself into thinking you did the reading when you really didn't retain or understand.

How do we fix this? I took inspiration from Andy and Michael Nielsen's work on embedding spaced repetition quizzes in essays.

I do something similar with my code explainers now. At the bottom of an explainer there's an interactive quiz — five questions about the change — and I try to answer them.

My rule: I won't send code to others until I can pass the quiz, and I do the same when reviewing others' code.

A quiz is a speed regulator. Working with AI, it's easy for the loop to run faster than the speed of human understanding.

The quiz is a counterbalancing force: I mechanically ask “do I actually understand?” so that I can remain a full creative participant.

OK, so that's explain-diff. Here's the skill if you want it: two variants that output either HTML or a Notion page.

Technique 2: Micro-worlds

Next idea: micro-worlds. This one's inspired by the visionary educator Seymour Papert.

Papert had this beautiful idea he called living in Mathland: if you want to learn math, live in Mathland — just like if you want to learn French, you go live in France. Could we build an environment where children learn math naturally, as a consequence of their curiosity?

So how do we apply that to code? Can we make worlds you inhabit and naturally intuit how the system works and how it's changing?

Last year I was coding a Prolog interpreter and struggling to intuit what was happening inside.

I worked with an agent to build this debugger, which let me step through the execution of my logic language — scrub through time, see what's on the stack and which rules are evaluated at each step. I could even leave comments for myself (“nice, we correctly applied that rule”).

There's a big difference between making a tool for me to debug and letting the agent debug — doing it myself is how I develop understanding along the way.

Another example. I was migrating my personal website from one framework to another, and Claude wrote a script that did it. But it was very hard to review: I wasn't familiar with the new framework, and all I could say was “I guess that looks about right.”

So I asked Claude to make me a video game — a command center where I do the port myself, step by step, watching the visible effects and the file tree evolve. It produced a UI where I click buttons to run the port step by step, with my old site and new site running side by side.

In this command center I watched the new site come to life incrementally. That left me with a similar understanding to doing it by hand — but much faster, because the whole experience was laid out for me.

The point here is that agents can write bits of code that help us humans understand other code.

This is a big deal!

Technique 3: Shared spaces

Alright, last technique: shared spaces. So far this has all been about understanding solo… but when you're working on a team, you need to understand together.

When you and someone else hold the same mental model, you can communicate efficiently. You have a shared vocabulary that evokes the same images, so you can jam and riff and have creative conversations. Without those shared structures, those conversations are much harder.

I'm really excited about creating shared environments where teams build that understanding together. It's kinda what Notion is all about too.

Recently in Notion we've been shipping tons of new features for humans and agents to work together, so your whole team develops a shared understanding instead of each working in a silo.

One tiny example: you can now run Claude and Cursor agents in Notion. I do a lot of my coding that way now.

And when those agents make a technical plan in Notion, it's in a collaborative page by default, so I can comment on it with my team and discuss immediately. Thinking together, not alone!

The point was always to augment

Alright, let's wrap up. Today we've covered some techniques that were about understanding code… but actually I think this is a much bigger issue.

It's still important for humans to understand how things work in general! Not just to verify, but to participate.

And surprise surprise, this is not a new idea. It harkens back to the very origins of our field of computing…

50 years ago Alan Kay envisioned that computers could be a new medium, better than the book, for teaching people — especially kids — how to think about the world.

In this picture, it might look like these kids are watching YouTube on an iPad, but they're not. They're playing an interactive game and editing the code as they play it to get a better understanding of physics. This was 50 years ago!!

The point was always to augment, not just automate.

It's beautiful that AI now makes creating simulations so accessible… Having AI teach us is one of the greatest possibilities computing has ever opened up.

This makes me very optimistic about the future!

If we build the right tools, we can now understand the world better than we ever could before. We don't have to merely take ourselves out of the loop, we can get deeper in the loop too. It's up to us.

FIN

Tech aillmbackend

A new trend, smart model routing

Intelligent model routing is becoming a necessary layer in production stacks to manage the 10-20x price variance between frontier and open-source models.

Summary

What: Gergely Orosz identifies a wave of routers—including Not Diamond, Factory Router, and LiteLLM—that automatically dispatch prompts to the most cost-effective or performant model per task.
Why it matters: This signals that developers are moving away from manual model selection, shifting toward automated, rule-based infrastructure that prioritizes token cost-efficiency.
Takeaway: If you are managing high LLM spend, integrate a routing layer like Not Diamond or LiteLLM to dynamically route 'easy' prompts to cheaper open-source models.

Deep Dive

  • Companies are experiencing significant cost bloat due to over-relying on expensive, frontier models for simple tasks.
  • Intelligent routers analyze prompts to determine the minimum model capability required to fulfill a request.
  • Vendors like Factory, Not Diamond, and Morph are leading the market by offering 'smart' routing APIs.
  • Open-source models are increasingly sufficient for roughly 60% of coding-related workflows, making routing more attractive.
  • AI gateways (e.g., OpenRouter, Kilo, Requestly) are incorporating routing as a core feature to simplify token management.

Decoder

  • Token router: A middleware layer that inspects an LLM request and dynamically chooses the appropriate model based on criteria like cost, latency, or expected performance.

Original Article

Two weeks ago, I covered a trend of companies trying to reduce spending on AI within their engineering departments. While talking to my sources about this, one head of engineering at a larger company told me that they wished there was an ‘intelligent’ router that picks the right model for the right task.

The reason for such a wish is clear; prices for tokens vary greatly per model, and there can easily be a 10-20x difference between a cheap, average model, and a state-of-the-art one.

I did some digging into whether any solutions like this currently exist because the benefits look obvious, and what I found is listed below.

Vendors:

  • Factory Router: automatically selecting the right model per session, claiming 20-25% cost savings.
  • Not Diamond: auto-selection of coding models, claiming around 30% cost savings. Used by OpenRouter, under the hood.
  • Vercel AI gateway. Hundreds of AI models, smart routing and billing in one place.
  • Prism by Augment Code. Choosing the “best” model automatically for coding tasks.
  • Model Router by Morph. An API to suggest model selection for a prompt, based on a list of models.
  • Weave router: a token router that works inside Codex, Claude Code and Cursor. “Hard” requests stay on frontier models, while “easy” ones go to open source ones.

AI gateways with routing built in. API gateways are popular ways to use LLMs in workplaces.

  • OpenRouter: comes with “auto router” functionality where, after analyzing the prompt, the best one is selected. Uses Not Diamond under the hood.
  • Kilo Gateway: route requests the model considered the best price-per-value. Supports using your own model keys, and using the service only as a router.
  • Requestly.ai: automatically route requests to the right model based on cost, latency, and availability, and tons of configuration.
  • LiteLLM: define routing rules that automatically select the best model, based on input content with the “auto routing” functionality. The setup is more manual, but you get more control than with many other AI gateways.
  • Envoy AI Gateway: an open source gateway that offers some routing configuration, though it feels that the routing engine focuses more on availability, not cost optimization and smart model routing.

Cursor and GitHub Copilot also have an “Auto” model selection that does automatic model selection. For Cursor, it’s a fixed-price model where any savings made are for Cursor: they are not passed on to customers, but the model is cheaper than most others. For Copilot, the Auto mode results in intelligent model selection – but I’ve not heard much positive feedback about this mode from the few devs I asked about it. For Pro plans, Copilot supports pretty old models: GPT-5.5 and Opus 4.8 are not available. These are, however, available on the Pro+ and above plans.

Demand seems to be extremely high for intelligent routing. I asked Matan Grinberg, cofounder and CEO at Factory AI, who told me:

“Demand has been off the charts, especially from the enterprise [from large companies.] I’ve met with practically every bank CEO since we launched this offering, because they want a layer to control spend, while still generating high-quality code.

Pretty much everyone in tech is starting to see that open models are often sufficient. We’re seeing open model usage strictly increasing the last six months. My guess is that hosted open models are sufficient in performance for around 60% of coding-related work, in terms of token spend.”

It feels to me that “intelligent routing” will become table stakes, and so we can expect pretty much all AI vendors to build some version of it, and many new vendors to offer this kind of functionality.

Tech aiwebpolicy

Cloudflare sets deadline to block AI crawlers that bundle search with AI training

Cloudflare will block AI crawlers that combine search indexing with data harvesting unless they separate these functions by September.

Summary

What: Cloudflare is enforcing a strict deadline for AI companies to distinguish between bots used for search and those used for model training, blocking the latter on ad-supported sites if they don't comply.
Why it matters: This represents a major shift in internet governance, as infrastructure providers use their leverage to force transparency on how AI companies acquire training data.

Original Article

Cloudflare, a company that oversees much of the internet’s web traffic, is digging its heels in over AI web crawlers.

Tech aiagentsfrontenddesign

Skill engineering and the case against one-shot AI design

Paul Bakaus argues that AI agents for design should function as steerable tools, rejecting the industry trend toward fully autonomous 'auto-mode' development.

Summary

What: Paul Bakaus, creator of the open-source system Impeccable, advocates for 'skill engineering'—defining precise, domain-specific vocabularies that allow designers and engineers to guide AI agents through iterative steps rather than relying on one-shot prompts. Impeccable bridges the gap between design and code, letting users select sections in a development environment to apply styled changes like 'bolder' or 'quieter' based on predefined design system logic.
Why it matters: This challenges the prevalent 'software factory' narrative that seeks to remove humans from the loop, suggesting instead that the future of agentic workflows relies on compressing expert professional vocabulary into machine-understandable skills to maintain human creative control.

Deep Dive

  • Skill engineering focuses on creating a domain-specific vocabulary to translate subjective design terms (like 'bold' or 'quiet') into actionable code-level logic.
  • The framework emphasizes steering agents through granular control, preventing the model-wide convergence where all AI-generated designs look identical.
  • Impeccable serves as a 'design harness' that operates within existing project codebases rather than external design tools.
  • Bakaus suggests that design and engineering roles are converging, as designers use these tools to move closer to implementation while engineers adopt more design-centric workflows.
  • The philosophy promotes an 80/20 split, where AI handles the baseline implementation and human judgment is reserved for the final 20% of the creative process.
  • The system accounts for cross-model discrepancies (e.g., how Claude Code versus GitHub Copilot handle permissions) by routing tasks internally.
  • Bakaus explicitly rejects automated 'one-shot' generation, insisting that human interaction is necessary to instill taste and project-specific context.

Decoder

  • Skill Engineering: The practice of developing modular, reusable instruction sets for AI agents that define specific design or functional parameters, rather than relying on open-ended prompts.
  • Loopmaxxing: A term describing the push for fully autonomous AI agents that operate with minimal human intervention, often through continuous self-correction loops.
  • Agent Harness: A wrapper or orchestration layer that manages how an AI agent interacts with a specific environment (like an IDE) and its available tools.

Original Article

Skill engineering and the case against one-shot AI design

Paul Bakaus thinks the emerging discipline of “skill engineering” can make AI agents more capable — but he absolutely does not want to remove people from the creative process. He chats to Latent Space about his approach to design in the AI age.

Bakaus is the creator of Impeccable, an open-source design skills system that gives coding agents a vocabulary for improving interfaces. Instead of asking an agent to redesign an entire website in one shot, users can tell it to make a section “bolder,” “quieter,” “denser,” or more polished.

Behind those apparently simple commands is a larger argument about how AI products should be built. Agents need more than instructions, Bakaus said: they need domain knowledge, context and carefully defined ways for humans to steer the result.

“The point is to give you a way to steer what you want to end up with,” he said during a session at the AI Engineer World’s Fair. “It’s never going to be a tool for one-shot design. That’s not the intent.”

The emerging craft of skill engineering

Impeccable began as a relatively simple extension of Anthropic’s frontend design skill. As its audience grew, Bakaus expanded it into a more complex system with multiple components and workflows.

That process led him to start thinking of skill engineering as a discipline in its own right. His workshop at the conference explored what he called the “dark arts” of building skills.

“One of the interesting topics was that most skills — [and] most models — are not very creative,” Bakaus told me. “They converge in one direction, and if everybody uses the same skill to do frontend design work or something like that, everything ends up looking the same.”

Skill engineers must also account for differences between agent harnesses and models. Codex and Claude, for example, do not necessarily handle subagents or permissions in the same way. A skill intended to run across Claude Code, Cursor, GitHub Copilot and Codex cannot assume they all provide identical capabilities.

Bakaus has also experimented with routing inside a skill, allowing it to combine several capabilities and direct a task toward the relevant instructions. He compared this to a mixture-of-experts model, with routing used both to conserve tokens and improve effectiveness.

Giving agents a design vocabulary

Impeccable’s core innovation is to take terms familiar to designers and give them a more precise operational meaning for an agent.

An unassisted model asked to make a page “bolder” may add gradients, neon effects or glass-like surfaces. Impeccable instead defines boldness through concepts such as hierarchy, scale and decisive typography — changes that attract attention without necessarily breaking the existing design system.

“An adjective with nothing behind it is just a nice apostrophe,” Bakaus said. “You really have to tell the agent what you mean.”

He described these terms as words that have been “imbued with meaning.” The model already has some conception of what words such as “bold” or “quiet” mean, but the skill translates them into a specific professional domain.

This is the key, because experts often possess a vocabulary that non-experts do not. Bakaus said he had observed large differences between the work produced by a designer and an engineer using the same model, simply because the designer knew how to articulate the desired result.

“I’ve been trying to put that language — basically compress it into a skill and into a system — to be able to express yourselves better,” he said.

However, he does not believe every part of design can be controlled from this level of abstraction. Directly manipulating spacing may still be the fastest option for a small adjustment, while open-ended prompting can be useful during initial exploration.

The objective is not to replace every tool with an agent, he insisted. It is to determine “the exact level of control” and insert the person at the point where their judgment is most valuable.

Designers and engineers move up the stack

Bakaus sees the boundaries between design, engineering and product management becoming less distinct.

“Designers are moving into code, engineers are moving into design, and vice versa,” he said. “These worlds are all colliding.”

That shift will be uncomfortable for people whose work primarily consists of translating an existing artifact into another form. Engineers who mainly turn Figma designs into code face growing automation, while designers whose contribution is limited to making an existing interface look competent face similar pressure.

“Designers all have to move one layer up the stack to think more about the what,” he said. “I think the role of the product manager and designer is actually converging.”

At the same time, designers are moving closer to implementation — into code. Bakaus initially expected Impeccable to appeal mostly to engineers and assumed professional designers might resent that. Instead, he estimates that designers now make up at least half of its audience.

“So rather than moving directly into code and, you know, having no help,” Bakaus said about designers, “they use Impeccable as a bridge, because it communicates the way they communicate. And that was not obvious to me when I first built it.”

Impeccable also has a live mode that combines visual selection with an underlying coding agent. A user can select a section inside a development environment and request several alternative layouts or (for example) ask for a bolder or quieter treatment. The system operates within the project’s existing code and design system rather than exporting an isolated mockup from a third-party design tool.

Bakaus described this as a potential “design harness” at the intersection of chat and direct visual manipulation.

There will be no auto mode

The AI industry often treats complete automation as the natural endpoint of product development. Bakaus rejects that premise.

He sees two dominant camps: people trying to preserve the traditional Figma-centered workflow, and on the other side advocates of “loopmaxxing” who want agents to work with as little human intervention as possible.

“The truth is somewhere in the middle,” he said.

His preferred model is for AI to produce the first 80% quickly: the competent layout and basic implementation that would otherwise consume a lot of time. The person then owns the final 20%, where taste, context and a distinctive point of view enter the product. This is a key part of Bakaus’s design philosophy in the agentic era.

“People need purpose, and they want to play a role in whatever they create,” Bakaus said. “When you work with the agent, then you feel more ownership of the product.”

Users regularly ask him to add an automatic mode to Impeccable so that the system chooses the commands itself. He has no intention of doing so.

“There is no auto,” he said, “and there will be no auto.”

Asked about the language of software factories and other visions that appear to remove people from engineering altogether, his response was unambiguous.

“I’m squarely against that.”

DevOps enterprisesecuritycloud

Boundary 1.0 releases RDP session recording and improved management

HashiCorp's Boundary 1.0 reaches production readiness with RDP session recording and improved governance for nonhuman identities.

Summary

What: Boundary 1.0 adds native RDP session recording, Kubernetes Helm charts for easier deployment, scoped aliases, and a simplified admin dashboard. It targets secure access management for hybrid environments, AI agents, and Windows workloads.
Why it matters: The inclusion of RDP recording and specific support for non-human identities like AI agents reflects a growing focus on securing automated service access in privileged management.

Decoder

  • RDP: Remote Desktop Protocol, a secure network communications protocol designed for remote management and access to virtual desktops and applications.
  • Privileged Access Management (PAM): A security discipline that requires and manages access to highly sensitive IT resources.

Original Article

Boundary 1.0 marks a production-ready milestone for privileged access management with RDP session recording, Kubernetes Helm charts, scoped aliases, and a simplified admin UI, while advancing secure access for Windows, AI agents, and nonhuman identities through continuous authorization, dynamic credentials, and unified governance.

DevOps cloudopensource

Floci (GitHub Repo)

Floci emerges as an open-source alternative to LocalStack, offering free local AWS emulation with real Docker-backed execution for core services.

Summary

What: Floci is a tool for running AWS-compatible services locally without cloud accounts or paid feature gates. It supports services like Lambda, RDS, and S3 using a Docker-based architecture designed for low memory usage and CI compatibility.
Why it matters: The pivot in the local emulation ecosystem follows the monetization of previously free community tools, forcing teams to seek open-source alternatives that maintain performance in test pipelines.
Takeaway: If your CI tests rely on local AWS emulation, consider testing Floci as a lightweight, no-cost alternative to your current provider.

Deep Dive

  • Provides local emulation of 68+ AWS services, including Lambda, S3, DynamoDB, and ECS.
  • Uses native Docker containers for high-fidelity emulation of compute and database services.
  • Offers configurable storage modes ranging from memory-only to write-ahead-log persistence.
  • Supports multi-account isolation based on AWS access key IDs.
  • Integrates with existing SDKs, Terraform, CDK, and OpenTofu via a local HTTP endpoint.
  • Includes specialized modules for Testcontainers in Java and Node.js.

Decoder

  • Emulator: Software that mimics the behavior and interface of a cloud service locally, allowing developers to test against APIs without incurring cloud costs.
  • Testcontainers: A library that facilitates the use of throwaway instances of databases or other services in Docker containers for automated tests.

Original Article

Full article content is not available for inline reading.

Read the original article →

DevOps infrastructureaimcp

ContextForge (GitHub Repo)

ContextForge acts as a central governance gateway for federating various MCP, A2A, and REST/gRPC APIs into a unified endpoint.

Summary

What: Built by IBM, ContextForge provides centralized discovery, rate limiting, authentication, and OpenTelemetry observability for AI infrastructure. It supports deployment via Docker, PyPI, or Kubernetes and includes a built-in admin UI and virtualization for legacy APIs.
Why it matters: As organizations deploy fragmented pools of AI tools and agents, a middleware layer becomes necessary to manage governance, observability, and protocol translation consistently across diverse backends.
Takeaway: Developers can use the `mcp-contextforge-gateway` PyPI package to begin federating local MCP servers and legacy REST services behind a single authenticated gateway.

Deep Dive

  • Tools Gateway: Translates REST, gRPC, and A2A protocols into standard MCP.
  • Governance: Centralizes auth, rate-limiting, and retries for all integrated AI tools.
  • Observability: Built-in support for OpenTelemetry to track tool usage, token costs, and latency.
  • Deployment: Supports multi-cluster Kubernetes federation using Redis for caching and session management.
  • Admin UI: Includes an interface for monitoring logs, managing configurations, and viewing tool catalogs.
  • Virtualization: Allows wrapping legacy REST APIs as MCP-compliant tools via JSON Schema extraction.

Decoder

  • MCP (Model Context Protocol): An open standard for connecting AI assistants to data sources and tools.
  • A2A (Agent-to-Agent): Protocols enabling autonomous agents to interact and share context.
  • TOON compression: A technique for reducing the size of tool definitions sent to LLMs.
  • KRM (Kubernetes Resource Model): The declarative YAML-based approach to defining Kubernetes objects.

Original Article

Full article content is not available for inline reading.

Read the original article →

DevOps infrastructurekubernetes

(re)introducing kpt: Your toolchain for infrastructure automation

kpt joins the CNCF as a sandbox project, providing a package-centric toolchain for managing Kubernetes configurations as data rather than templates.

Summary

What: Developed by Google and now maintained by a community including Ericsson, kpt offers a 'What You See Is What You Get' workflow for GitOps by separating configuration data from the functions that transform it, enabling better auditing and composition compared to Helm or Kustomize.
Why it matters: This highlights the industry's pivot toward 'Configuration as Data' to solve the complexity and auditability issues inherent in template-heavy infrastructure pipelines.

Deep Dive

  • Package-centric: Operates on bundles of Kubernetes Resource Model (KRM) files.
  • Configuration as Data: Stores configuration as plain, versioned data rather than code.
  • Separation of concerns: Logic lives in functions, while intent is captured in declarative YAML.
  • Auditability: Enables easy diffing and linting of final manifests before cluster application.
  • WYSIWYG: Ensures what you review is exactly what gets deployed to the cluster.
  • Extensibility: Allows integration with existing tools like ArgoCD, Flux, and Helm.

Decoder

  • KRM (Kubernetes Resource Model): Standardized YAML manifest structures for Kubernetes objects.
  • WYSIWYG: In this context, it means the configuration reviewed in source control is identical to the applied cluster state.
  • GitOps: A paradigm where the cluster state is managed exclusively through a Git repository.

Original Article

What is kpt?

The opening tagline of the kpt documentation describes it as

“… a package-centric toolchain that enables a WYSIWYG configuration authoring, automation, and delivery experience, which simplifies managing Kubernetes platforms and KRM-driven infrastructure at scale by manipulating declarative Configuration as Data.”

This is a concise and detailed description of kpt, but sometimes it feels like it was written by a lawyer, or a consultant on a per-industry-buzzword contract. Let’s break it down.

package-centric

Kpt works on packages – specifically bundles of Kubernetes Resource Model (KRM) files, declarative YAML manifests that define the desired state of cluster resources for Kubernetes (or Kubernetes Operator extensions) to continuously reconcile. These are pretty lightweight – they can be a directory on your computer, a zip file, or (most typically, given the GitOps nature of kpt) the contents of a git repository or git repository subfolder.

toolchain

Kpt is a CLI, but also provides a number of tools – validators and mutators – that can be executed in a kpt pipeline to verify and / or modify the contents of a kpt package.

WYSIWYG

What You See Is What You Get is more typically associated with graphical editing tools or print-friendly editors. In this context, it refers to the fact that the kpt file contents you have at any point in time are exactly the resources that will end up in your cluster – they are not modified out-of-band on the way to the cluster, nor do they depend on external templates or metamodels.

configuration authoring, automation, and delivery

kpt supports the full lifecycle of a package of kubernetes resource descriptors. The CLI supports bootstrapping a new package with the basic content and configuration templates, while kpt pipelines provide a mechanism to automate the process of specializing those templates into site-specific parameterized packages, potentially across hundreds of different sites. Kpt also supports the package review and validation processes required to ensure configuration correctness before applying to live networks. Finally, kpt can be used to deploy packages to a live environment, monitoring their reconciliation status as it does so.

However, kpt is modular and can be integrated into any existing Kubernetes-centric toolchain to provide some or all of the automation capabilities.

managing Kubernetes platforms and KRM-driven infrastructure

Kpt’s primary raison d’etre is the management of Kubernetes-native constructs – manipulating KRM files which enable Kubernetes tooling to robustly and autonomously deploy clusters, provision workloads into clusters and even manage the day-to-day configurations of those workloads. If it can be described in KRM, it can be packaged and automated using kpt.

manipulating declarative Configuration as Data

Configuration as Data is an approach where system and application configuration is stored, managed, and versioned as plain data, rather than embedded with code that must be executed to be understood.

The key benefits of this approach include:

  • Auditability: because config is plain structured data, you can diff it, lint it, and review it without running anything. A diff tells you exactly what changed between two versions of your system state before applying to the cluster
  • Separation of concerns: logic lives in functions/controllers; intent lives in data. This makes both easier to reason about independently. You can validate the data against a schema without invoking any business logic
  • Composability: different tools can read, transform and validate the same data models, enabling function chaining and independently testable mutation phases

Why do we need it?

Given the proliferation of infrastructure orchestration and configuration management tools out there, it’s a fair question to ask – why kpt? There are a few reasons why we feel the kpt project has a unique value in the Kubernetes automation ecosystem.

Firstly, kpt’s “in-place” update paradigm differs from other tools, such as Kustomize, which generate final manifests on the fly during render / apply phase by overlaying patches on a base configuration. It’s our view that being able to examine, and optionally approve, the final configurations before applying simplifies troubleshooting and fault resolution. This is what we describe as “WYSIWYG”.

Secondly, the paradigm of “configuration as data” is fundamentally different to the “configuration as code” approach many other tools adopt. By clearly separating configurations in packages (pure KRM models) from the business logic which transforms them (KRM functions), there is a clean separation of concerns. Data files describe precisely the desired state, and native tooling can process and validate them. This reduces side effects, drift, and complexity. Connecting with the “in-place” paradigm means the pure configuration data that precisely describes the desired state is available for auditing and verification, reducing operational risk and making deployments more deterministic.

Lastly, kpt doesn’t try to solve all problems. It is a simple tool with a core set of capabilities, leveraging an extensible library of functions to support common needs and allowing users to bring their own business logic. It integrates well with other gitops tools – ArgoCD, Flux, Helm and Porch to name a few – allowing it to coexist well in the broader Kubernetes ecosystem.

Use cases for kpt

The suite of examples available in the kpt repo include basic examples using WordPress, nginx and so on. The “kpt-samples” repository includes a new example using the Headlamp project, and we hope to add more such examples of off-the-shelf tools that can benefit from packaging using kpt in the near future.

The most robust and complex suite of examples available for the use of kpt in the Nephio project, where kpt forms a critical part of the end-to-end toolchain for defining and specializing the deployment of open source RAN and core networks on top of Kubernetes.

Finally we hope to have a working example of a more complex, multi-site and multi-service deployment scenario using kpt and a more typical IT workload over the summer. Watch this space, and please let us know if you are using kpt in your own toolchains!

Current status and plans

The kpt project has recently been fully onboarded into CNCF as a sandbox project. We would like to see it grow in usage and build a bigger community of users. The plans for world domination include making a number of changes to the way pipelines execute, to improve performance, allow for optional steps, enable external connectivity and improving documentation amongst others. We will be undertaking a significant restructuring of the documentation to improve readability and align with industry best practices. Secrets handling, multi-cluster support and helm support are oft-requested features and the aim is to stabilise the kpt APIs to release v1, so that end users can have confidence in the stability of their packages.

Join us!

The kpt project meets weekly, on Wednesdays at 2pm CET. We use GitHub to collaborate on issues and discussions and we use the kpt channel on the Kubernetes slack for ad hoc communication.

If what you see interests you and you would like to know more, please reach out. If you are already using kpt and would like to see a new feature, or want to help improve it, please reach out – we would be delighted to see you!

Design aisecurityenterprise

Trust You Can Verify: Figma is Now ISO 42001 Certified

Figma is the latest major platform to secure ISO/IEC 42001:2023 certification, providing independent verification of its AI governance and risk management.

Summary

What: Figma completed a two-stage audit by Schellman, covering 38 controls across nine areas including data governance and human oversight. This credential supplements existing ISO 27001 and SOC 2 Type II certifications.
Why it matters: As AI features become standard in enterprise software, companies are shifting from self-reported safety claims to accredited third-party audits to satisfy procurement and regulatory requirements like the EU AI Act.
Takeaway: Security teams can now reference Figma's verified AI management system documentation at compliance.figma.com during vendor risk assessments.

Decoder

  • ISO/IEC 42001:2023: An international standard for artificial intelligence management systems (AIMS) that defines requirements for establishing, implementing, maintaining, and improving responsible AI governance.
  • VPAT (Voluntary Product Accessibility Template): A document that explains how information and communication technology products meet accessibility standards.

Original Article

Trust you can verify: Figma is now ISO 42001 certified

Saying you use AI responsibly is easy, but proving it to an accredited auditor is harder. We decided that was a standard worth meeting.

ISO/IEC 42001 is the international standard for AI management systems, published in December 2023. It is the AI equivalent of ISO 27001: a framework that defines what responsible AI governance requires and subjects it to third-party verification.

An Artificial Intelligence Management System (AIMS) is the operational backbone of AI governance: the policies, processes, and controls that govern how AI is built, deployed, and monitored across an organization's products.

Security teams reviewing AI vendors face a familiar problem: every vendor's documentation looks the same whether the governance is real or not.

Figma's AI features are embedded in workflows across banking, healthcare, insurance, and the public sector, where strict regulatory, security, and privacy requirements are non-negotiable when building software. When security teams run vendor risk assessments, boards ask about AI governance, or customers weigh whether to turn on AI-assisted design features, just taking AI seriously isn't good enough.

Figma has achieved ISO/IEC 42001:2023 certification. Schellman, an accredited independent certification body, audited our AI governance policies, risk management processes, and development practices across the platform and confirmed Figma met the standard. This joins Figma's existing ISO/IEC 27001 and SOC 2 Type II certifications.

Why independent verification matters

There are two ways to answer a customer's question about AI governance. You can document your practices, publish a whitepaper, answer questionnaires, and ask customers to trust your account of your own controls. Or you can open your management system to an accredited third party, let them test it against an international standard, and hand customers the result. We do both. But only one of them is verifiable.

ISO 42001 certification means Schellman examined our governance policies, data practices, risk processes, and technical safeguards and confirmed they meet the standard.

What the certification covers

The certification covers the AI Management System governing how Figma designs, develops, and operates AI features across our platform: Figma Design, Figma Make, FigJam, Dev Mode, Figma Sites, Figma Slides, Figma Draw, Figma Buzz, and Figma Weave.

What was actually audited

Our ISO 42001 certification was issued by an ANAB-accredited certification body, which matters because accredited and unaccredited certifications are not the same thing. Accreditation requires the certification body itself to undergo a formal audit before it can issue conformant certificates to anyone.

The audit ran in two stages. Stage 1 assessed the design of Figma's AIMS, including documentation, policies, and risk methodology. Stage 2 tested whether it actually works, with auditors interviewing staff, observing processes, and evaluating operational effectiveness across 38 controls organized into nine Annex A control objectives:

  • AI impact assessment
  • Governance and accountability
  • AI-specific risk management
  • AI system lifecycle management
  • Data governance
  • Third-party AI risk
  • Monitoring and performance evaluation
  • Human oversight
  • Responsible use of AI systems

These are not new concepts. What the certification validates is that we have put them into practice, not just documented them.

What this means for Figma users

At Figma, we sit on both sides of this problem. We build the programs that answer your security questionnaires, and we run the vendor risk programs that evaluate our own suppliers. We know what thorough documentation looks like.

ISO 42001 certification is independent evidence of AI governance. Instead of parsing our questionnaire responses, you have something more reliable: a recognized international standard, verified by an accredited body, that you can cite in vendor risk assessments, board reporting, and regulatory submissions.

This matters especially under the EU AI Act and emerging AI procurement standards, which require proof, not promises. For organizations in regulated industries such as financial services, healthcare, insurance, and the public sector your own vendor risk program is subject to audit. An externally verified AI governance posture is a different class of evidence than a vendor's self-assessment.

ISO 42001 is built to provide exactly that.

Our commitment going forward

As Figma's AI capabilities expand, so does our commitment to governing them responsibly. We will keep submitting that governance to third-party verification rather than asking you to take our word for it. Figma's ISO 42001 certificate, alongside our full security and compliance documentation, is available at compliance.figma.com, and can be verified through Schellman's certificate directory.

This certification is the start, not the finish. We will keep compliance.figma.com current so you can verify our posture at any point in the vendor relationship. And we will be transparent when our AI governance practices change in ways that affect your risk assessment.

Questions about Figma's AI governance or security posture? Reach out to your account team.

Design frontendai

Figma Just Made Your Design System Debt Everyone's Problem. Now Use It

Figma’s 2026 feature set turns design system debt from an invisible maintenance task into a public, measurable delivery risk for engineering and stakeholders.

Summary

What: Tools like Check designs, Code Connect, and AI agents now rely on design system integrity, meaning poor systems cause immediate, visible failures in generated code and prototypes, quantified by concrete inconsistency counts.
Why it matters: Designers now have a path to frame system maintenance not as 'craft' but as a dependency for shipping features, forcing product stakeholders to acknowledge the cost of design debt.
Takeaway: Run Figma's 'Check designs' on your projects before the next planning cycle to quantify drift and present the metrics as a delivery risk to stakeholders.

Deep Dive

  • Dependency Shift: Every new feature (AI agents, Code Connect, Make) acts as a consumer of the design system.
  • Quantification: 'Check designs' provides an objective, numerical count of system violations that can be shared across teams.
  • Visibility vs. Authority: While the tool provides evidence, it does not solve the underlying governance issues or roadmap prioritization disputes.
  • Operational Impact: Design system rot now impacts developers via broken code mapping and stakeholders via inaccurate AI-generated artifacts.

Decoder

  • Design System Debt: The accumulation of inconsistent components, hard-coded values, and broken mappings in a design library, which makes future UI changes slower and more error-prone.
  • Token: A design system primitive (e.g., color, spacing) that acts as a single source of truth used across code and design tools.
  • Slot Guardrails: A Figma feature that restricts what types of components or elements can be placed inside a specific container, enforcing architectural consistency.

Original Article

Figma just made your design system debt everyone’s problem. Now use it.

For years, design system work lost the roadmap fight quietly. Config 2026 made the losing loud. That changes who has to care.

Figma’s Check designs panel on a demo file. Four tabs at top show violation counts: Colors 4, Dimensions 13, Typography 3, Components 11. The Dimensions tab is expanded, listing off-token spacing values on the left being matched to correct design system tokens on the right, each row tagged “Match” in green. The bottom status bar reads “Update 13 dimensions across 25 items” next to an “Apply 13” button.
A staged demo file run through Figma’s Check designs (June 2026). The violations were seeded on purpose; the count is real. Each green “Match” is a one-click fix the tool is proposing. The argument is not the number — it’s that the number now exists, visible to anyone in the room.

“Let’s revisit next quarter.”

You have heard it enough times to know what it means. It means never. You are standing in front of the planning board, and the design system work is pinned next to six feature requests, and you have just made the case you have made before. The tokens need restructuring. Components are drifting out of the library. Half the buttons in the product file are detached from their source. Someone with budget authority nods, says the line, and moves on. The features ship. The debt compounds. You lose, politely, again.

If you maintain a design system, you know this meeting. You have probably blamed yourself for losing it. Wrong deck, wrong metrics, wrong framing. The belief that if you could just present it better, you could make a maintenance backlog feel urgent to people who measure their quarter in shipped features.

The pitch was never the problem. The problem is that design system rot is invisible to everyone who is not the person maintaining it. You cannot make a leadership team feel a cost they cannot see.

That is the thing Config 2026 actually changed. Not the tools. Who feels it when the tools fail.

Everything Figma shipped reads from your design system

Look at the lineup, not as features, but as dependencies.

In the first half of 2026, before Config even opened, Figma shipped a native AI agent that generates and remixes UI on the canvas (May 20). It opened the canvas to third-party agents like Claude and Cursor through its MCP server, governed by “skills,” markdown files that encode your team’s conventions so an agent knows to use Button/Primary and never a raw hex value (March 24). It launched the Code Connect UI, which maps Figma components to their real code counterparts so Dev Mode shows a developer your actual React import instead of generic CSS (March 4). It turned Make from a prototyping toy into something that edits a production codebase and opens a pull request (May 28). Then on June 4 it shipped Check designs. At Config itself came Figma Motion, agent-built WebGPU shaders, and slot guardrails that constrain what can go inside a component.

Different surfaces. One input. Every single one reads from your design system, and every single one fails in public when that system is bad.

The agent asked to build a settings screen without a clean library and skill files produces something plausible and generic, using none of your components. Figma’s own launch post admitted agents without that context feel “unfamiliar and generic.” That generic screen does not stay in your file. A stakeholder sees it in a review and decides that is what your team produces. Code Connect with no mapping leaves the developer back where they started, eyeballing pixels, except now the gap has a name and the name is the empty mapping you never filled in. Make pushes an off-brand change into the repo, and the people who feel it are engineers reviewing the PR. Slot guardrails are only as good as the system defining the slot. A messy system does not just sit there quietly anymore. It propagates, fast, into code, into prototypes, into 200 templated marketing banners that all came out wrong from one bad template.

A three-column flow diagram. Left: one purple node, “Design system — Tokens, components, skills, mappings.” Five arrows fan out to a middle column of green nodes: AI agent, Code Connect, Check designs, Make, and Buzz/Sites/Slides. Each green node has a horizontal arrow to an amber node on the right naming who absorbs the failure: Stakeholder sees generic UI, Developer eyeballs pixels, Leadership reads the number, Engineer reviews off-brand PR, Marketer ships wrong banners.
Every surface Figma shipped in 2026 reads from the design system. When the system is bad, the failure no longer stops with the person maintaining it.

The number nobody can argue with

Check designs is the sharpest of these, and most Config recaps undersold why.

It compares a file against your design system and flags everything that does not match: hard-coded values that should be tokens, contrast that fails WCAG, detached components, tokens from libraries the file is not even subscribed to. And here is the part that matters: it does not use AI. It does not interpret or guess. It counts. When it opens, a total sits at the top of the panel.

For years, “our design system has drift” was a sentence. Soft, deniable, easy to defer. Now it is a number on a screen anyone in a review can see. Forty-seven inconsistencies in one file. Then the next file. Then the file the contractor shipped last sprint.

A sentence loses the roadmap fight. A number that keeps climbing in front of stakeholders does not lose the same way.

This is the real shift. Design system debt used to be paid by one person, quietly, in files nobody else opened. Now it gets paid by the engineer whose generated component is wrong, the PM whose timeline slips because handoff broke, the marketer whose banners came out off-brand. The cost moved out of your corner and into theirs. Designers have argued for years that a design system is not a deliverable but a product that needs governance, funding, and an owner. The argument was always correct and almost always lost, because the people who held the budget never felt the failure. Now they do.

The gap Config did not close

Here is where I stop being optimistic.

Visibility is not authority. The person who understands the system best is rarely the person who controls the roadmap. The parts of a design system that actually decide whether it survives are the invisible ones, the processes, the review and approval models, the question of who owns it, and those are exactly the parts that never show up in a demo. You can run Check designs, screenshot the number, and walk into planning with the cleanest evidence you have ever had, and still watch a director decide the feature ships first and the cleanup waits. Evidence does not grant standing. It just removes one excuse.

And the tooling has its own honesty problem. Within days of launch, designers were on the Figma forum pointing out that Check designs flags intentional exceptions as violations: annotation colors, developer notes, disabled states that are not required to pass contrast.

Two posts from the Figma forum thread “Check designs: Ignore or exclude suggestions.” Karin1’s original post says the tool flags items used outside the design system on purpose, like annotations, and asks for a way to ignore those suggestions. A reply from Damian_Summersall says the same fix would help disabled states, which fail contrast but don’t need to pass WCAG. His reply includes a Check designs screenshot flagging a disabled button text color as “AA Contrast standard not met.”
Within days of Check designs launching, designers were flagging that the tool marks intentional exceptions as violations — including disabled button states that legitimately don’t need to pass WCAG. The count is real, but it is not pure.

There is no clean way to mark “this one is on purpose.” So the number is real, but it is not pure. Hand a skeptical executive a count that includes false positives and watch them use the false positives to dismiss the whole thing.

A count is leverage. It is not a strategy. The work of separating signal from noise, of deciding which violations are debt and which are intent, still falls on the person who already could not get the time to do it.

What I would actually do with this

Stop pitching the design system as craft. Craft loses to features every time, because craft sounds like preference. Start pitching it as the dependency under every AI feature leadership just got excited about at the keynote. The agent, the code handoff, the production editing: none of it works on a broken system. That is not a design argument. That is a delivery argument, and delivery is a language the roadmap room already speaks.

Run Check designs on your worst file before the next planning cycle, not your best. Bring the number. Then bring the second number, from the file a non-design-system person shipped, so it is not your work on trial. Make the cost theirs, because now it actually is.

Then watch what happens. Because Config 2026 handed the design system lead something we never had: proof that the failure is shared. It did not hand us the authority to act on it. Figma gave you a seat at the table and a number to put on it.

Whether anyone in that room lets you spend it is a different question. And it is the one nobody at Config was on stage to answer.

Design frontendaidevops

Why Accessibility is an Operational Capability, Not a Feature

AI-generated UI code is structurally inaccessible by default, shifting accessibility from an optional audit to a core operational requirement for engineering maturity.

Summary

What: AI models prioritize visually 'correct' code over semantic HTML, resulting in non-accessible patterns like div-based buttons. The fix requires baking constraints (e.g., Cursor rules, semantic defaults) into the development pipeline rather than relying on late-stage audits.
Why it matters: Automated code generation industrializes the creation of technical debt, making accessibility failure a systemic risk that threatens compliance and consumer spending power ($13 trillion total).
Takeaway: Implement eslint-plugin-jsx-a11y and mandate semantic HTML in your AI generation system prompts to catch accessibility failures during development rather than during production audits.

Deep Dive

  • The AI Gap: Models are trained on existing web 'soup', leading them to default to non-semantic, inaccessible code structures.
  • Process Failure: Accessibility, like security, is failing because teams treat it as an outcome rather than a process constraint.
  • Operational Maturity: Treating accessibility as infrastructure (e.g., GOV.UK design system) is a proxy for high-quality, maintainable engineering.
  • In-flow Automation: Tools like @storybook/addon-a11y and LevelCI are required to maintain compliance at the speed of modern deployments.
  • The Business Case: Accessibility is a major procurement factor; having a weak or non-existent ACR (Accessibility Conformance Report) can stall B2B sales cycles.

Decoder

  • ACR (Accessibility Conformance Report): A document that reports how a product conforms to accessibility standards (like WCAG), often used in procurement.
  • Semantic HTML: The use of HTML tags that convey meaning (e.g., , ) rather than just presentation (e.g., ), which is critical for screen reader compatibility.
  • Shift-left: A software development practice of moving tasks like testing and security analysis earlier in the development lifecycle to reduce costs and complexity.

Original Article

Teams can generate UI faster than ever, but they still have to guarantee that what they ship is usable, secure, and maintainable. Accessibility as an operational capability rather than a compliance checklist or end-of-project audit, and what that looks like in practice.

We know that right now, a senior engineer is shipping a checkout flow they “built” in a single afternoon. AI assistant does the heavy lifting, happy path runs clean, and a rotating chevron spins on the order summary. Two weeks later, engineering gets a notice from customer support: a blind customer using a screen reader can’t complete the purchase because the “Pay Now” control is a <div> with a click handler. No role. Not focusable. Not working.

That gap — between code that runs and a product people can actually use — is becoming one of the defining engineering challenges of the AI era. Teams can generate UI faster than ever, but they still have to guarantee that what they ship is usable, secure, and maintainable.

Accessibility sits right in the middle of that problem.

This is not an article about compliance checklists or end-of-project audits. It’s about engineering systems. Specifically, why accessibility should be treated as an operational capability — alongside privacy, security, reliability, and observability — and what that looks like in practice.

The Audit Trap

For years, the default way to “do” accessibility was the one-time, audit-only approach: hire a firm, get a list of 200 findings, fix some of them, file the report. A lot of teams have now moved beyond this model — and the reason is worth looking into.

Audits do matter. For sales, procurement, governance — they’re essential. When a buyer asks for a VPAT or an ACR, you need one. When legal asks if you’re meeting requirements, you need documentation. Audits serve those purposes well.

But audits don’t help you build accessible features during sprint planning. Audits can cost points during a sprint. They don’t catch problems before merge requests. They don’t scale with deployment velocity. The mistake, essentially, is tackling accessibility as a snapshot when you really need constant monitoring. Six months after the audit, the product has shipped dozens of releases, multiple new features, and a redesigned nav. The report is now fiction. Compliance is not a state you reach — it’s a state you maintain, and complexity fights you the whole way.

The WebAIM Million report, which scans the top one million home pages every year, found that 95.9% of pages had detectable WCAG failures in its 2026 run, with an average of 56.1 errors per page. The number of page elements jumped more than 20% in a single year, likely driven by AI-enabled development and ‘vibe coding’ — and more elements mean more places to break. Accessibility debt behaves exactly like technical debt: every inaccessible component you ship becomes a future remediation project, and the interest compounds.

Any strategy that treats accessibility as a periodic event rather than a continuous property of the system is going to lose.

The AI Problem Nobody Wants To Name

With the scale at which teams now generate UI, the gap doesn’t just persist; it multiplies.

Start with how fast this arrived. In February 2025, Andrej Karpathy coined “vibe coding” — a way of working where you “fully give in to the vibes” and “forget that the code even exists”. You describe intent, the model generates, you accept the diffs without reading them. It was meant for weekend projects. It did not stay there. Y Combinator reported that 25% of its Winter 2025 batch had codebases that were 95% AI-generated.

Models don’t land on non-semantic markup by accident — three forces push them there. Most React code on GitHub uses non-semantic “soup”, so that’s what the models learn. Human reviewers and evaluators judge output visually, so the feedback loop rewards looks, not semantics. And <div onClick> is fewer tokens than <button aria-expanded="true" ...>, so absent a constraint, the model takes the cheap path.

Here’s the thing about AI-generated UI: it’s inaccessible by default. Not occasionally — by default. A developer writing in Frontend Masters tested AI-generated React components across multiple tools and documented the pattern. A typical AI-generated sidebar had ten distinct accessibility failures in twenty-nine lines: no landmark, no heading, no list structure, elements with click handlers instead of buttons, no aria-expanded, no keyboard handling, and unlabeled icons. The accessibility tree — the structure screen readers actually read — came back as flat, unstructured text. “Same pixels” as the author put it. “One is a door. The other is a painting of a door”.

Now connect this to security, because the two failures come from the same root. Veracode’s 2025 GenAI Code Security Report tested large language models across dozens of coding tasks and found that a large fraction of AI-generated code introduced security vulnerabilities — including OWASP Top 10 flaws. Cross-site scripting failures were particularly common, and security performance did not meaningfully improve with newer, larger models. The issue wasn’t model intelligence. It was process: developers generating code without specifying security constraints and accepting output without systematic verification.

The same shortcut that skips the security review skips the accessibility review. At scale, AI won’t close the accessibility gap — it has industrialized the very thing that creates it.

The fix is not to ban AI. Your developers are already using it. The fix is to constrain it and verify it — to treat AI as a very fast teammate who always needs guardrails.

Velocity and Accessibility Are Not Enemies

This is usually where someone says, “Guardrails? Sounds great, but they will slow us down.”

In practice, the opposite tends to be true.

Shift-left is the entire DevOps thesis, and it applies cleanly here. An accessibility issue caught during design review is a comment. The same issue found in production is a remediation project.

Catching an accessibility issue as a component is built takes minutes. Fixing one after the fact — discovering it in an audit, diagnosing the root cause, restructuring the markup, applying the necessary fix, writing tests — can easily take hours. Multiply that across hundreds of findings from a late-stage audit, and you have weeks of unplanned work that earlier automated checks — whether in design reviews, development workflows, or CI — could have prevented.

Teams that integrate accessibility into everyday workflows avoid the expensive surprises: emergency audits, remediation sprints, procurement blockers, and redesigns that quietly break core user journeys. Accessibility doesn’t reduce velocity. Unexpected work reduces velocity. In-flow accessibility is one way of eliminating unexpected work.

What Enterprise-Ready Actually Looks Like

The organizations that scale accessibility successfully do not rely on heroes. They rely on systems.

The highest-leverage place to start is the design system. One accessible component can be reused thousands of times. The GOV.UK Design System is a useful example: components undergo both automated and manual testing using assistive technologies such as JAWS, NVDA, VoiceOver, and TalkBack. The team is explicit about the limits of automation and supplements tooling with user testing involving people with disabilities. They’re equally clear that using the design system doesn’t “magically” make a service accessible; it just gives you a higher starting point.

Accessibility becomes infrastructure. That’s the lesson.

From there, it moves into the engineering workflow:

  • Accessibility requirements are included in the Definition of Done.
  • Pull request reviews include explicit accessibility checks.
  • Interactive controls use semantic elements (<button>, <a>) by default.
  • Keyboard navigation and focus management are treated as standard engineering concerns, not optional polish.

Finally, accessibility becomes enforceable through automation:

  • eslint-plugin-jsx-a11y catches common issues before code is committed.
  • LevelCI, Pa11y, and similar tools provide automated testing in CI/CD pipelines.
  • @storybook/addon-a11y surfaces issues during component development.

At that point, accessibility stops depending on memory and starts depending on the process. It becomes part of your platform.

Patterns That Actually Scale

A few implementation patterns consistently show up in teams that do this well.

Constrain AI Before It Generates

Instead of fixing accessibility after generation, bake requirements directly into tooling through Cursor rules, Copilot instructions, or repository-level standards. Tell the model to use semantic HTML. Tell it when to use buttons versus links. Tell it to expose the state and labels correctly. Models follow persistent constraints far more reliably than one-off prompts.

Stop Hand-Rolling Complex Widgets

Comboboxes, menus, tabs, modals, and similar controls routinely become accessibility hotspots. Libraries such as Radix UI, React Aria, and Headless UI already solve many of these problems. The scalable approach is not about repeatedly implementing accessibility correctly. It’s inheriting accessible behavior from well-tested primitives.

Capture Accessibility During Design Handoff

Focus order, labels, heading hierarchy, and interaction states should be specified before implementation begins. If accessibility requirements are absent from the design artifact, they are often absent from the final product. A simple memo at design handoff — what is the tab order, what are the labels, what happens on error — removes a huge amount of guesswork later.

None of these patterns is exotic. They’re just DevOps and platform thinking applied to accessibility.

The Broader Business Impact

Engineering leaders rarely prioritize accessibility solely because of regulations. But regulations, procurement requirements, user retention, and product quality all point in the same direction.

Legal pressure continues to increase. Digital accessibility lawsuits in the United States have stayed in the thousands per year, and they are not limited to large enterprises. The European Accessibility Act is now enforceable across the EU, applying to e‑commerce, banking, ticketing, telecoms, and more, regardless of where the company is headquartered. The message is clear: accessibility is no longer a “nice-to-have” in the eyes of regulators.

But compliance is only part of the story. The bigger story is the market you leave on the table. The World Economic Forum estimates that the world’s 1.3 billion people with disabilities, “along with their friends and family, has a spending power of $13 trillion”; disabled consumers alone control roughly $8 trillion in annual disposable income.

In the UK alone, the Click-Away Pound Report 2019 found the “Click-Away Pound has risen to £17.1 billion” — more than 4.9 million users with access needs who abandon inaccessible sites and spend elsewhere. People don’t file a bug report. They leave and buy from a competitor.

There is also a procurement reality that turns accessibility from a cost into a moat. If you sell B2B or to government, you will increasingly be asked for proof of accessibility — VPATs/ACRs or equivalent documentation. A strong ACR accelerates the sales cycle; a weak one, or none at all, creates redlines that stall or kill it. For some buyers, this is a hard requirement before your product can even enter evaluation.

Step back and the deeper pattern is clear: accessibility is a proxy for engineering maturity. A team that ships semantic HTML, manages focus, exposes state correctly, and tests it in CI is a team that has its house in order. The same discipline that produces an accessible component produces a maintainable, testable, less buggy one.

For dev and product leaders, that’s the real business case: accessibility work is platform work. It pays off every time a feature ships faster and more smoothly, with less rework, than it otherwise would have.

Systems, Not Sprints

If you take one thing from this, make it this: accessibility doesn’t come from an audit, a hero, or a heroic remediation sprint before launch. It comes from systems.

An accessible design system so components start right. A Definition of Done so they stay right. Automated testing and CI gates so regressions fail the build. Governance, so someone owns it. Guardrails for AI-assisted development so your fastest tool stops being your biggest liability.

None of those practices is particularly glamorous. That’s exactly why they work. They’re the same kinds of boring, reliable systems you already trust for security, reliability, and performance.

But there’s one thing no tool on that list can do. No linter, no automated scanner run, no dashboard will ever tell you what it’s actually like to use your product as a blind person with a screen reader, or to navigate your checkout with a keyboard because a tremor makes a mouse inoperable. So build the systems — you need them, and they’re the only way accessibility survives contact with a real release schedule. But test with real users with disabilities regularly. The first time you sit behind someone using JAWS to fight through a form your team thought was “done”, something changes. The tooling tells you whether you passed. A real person tells you whether it actually works.

Accessibility is not a feature. It’s an operational capability. Treat it that way, and you get something dev and product leaders already care about: a faster, safer, more reliable way to ship software.

Design aienterprise

AI vs. Human Illustrators? AI + Human Illustrators!

Evil Martians reduced blog illustration costs by 50% and improved quality by replacing pure AI generation with a hybrid human-AI agency workflow.

Summary

What: After failing to maintain visual consistency with models like Midjourney, Recraft, and Nanobanana, Evil Martians transitioned to using the agency KOJI. KOJI uses a hybrid process where AI handles initial generation and a human illustrator performs the 'last mile' of cleanup, reducing the cost per asset to $100 compared to previous workflows.
Why it matters: This highlights the 'last mile' problem in generative AI, where models consistently fail to handle brand-specific constraints and fine-grained composition, necessitating human intervention to bridge the gap between impressive demos and production-ready output.
Takeaway: If your AI-powered application requires high-quality, consistent output, build a 'human-in-the-loop' correction layer rather than relying on the model to achieve 100% accuracy.

Deep Dive

  • Initial Workflow: Relied on manual contractors, leading to inconsistent turnaround times and high costs.
  • AI-Only Attempt: Tried Midjourney, Recraft, and custom GPT wrappers but faced 'hallucinations,' poor mascot consistency, and excessive manual cleanup time.
  • The 'Slop' Factor: Spending hours in Photoshop correcting AI errors destroyed the economic case for automation.
  • Hybrid Solution: Partnered with KOJI to integrate 3D model prep with AI rendering and professional human retouching.
  • Economic Shift: Reduced per-illustration cost from ~$200-300 or 8+ hours of internal time to $100 per asset.
  • Operational Insight: Most enterprise AI failures occur because teams confuse a lucky demo result with a reliable production distribution.

Decoder

  • Last mile: The final stage of a process that is often the most difficult to automate, requiring manual intervention to achieve finished quality.
  • OKLCH: A color space model used in CSS that is perceptually uniform, meaning it is easier for developers to create accessible color palettes.

Original Article

AI vs. human illustrators? AI + human illustrators!

Evil Martians take huge pride in our blog. Illustrations are there to support the vision. We fantasized that image gen tools are ready for us to one-shot illustrations, but we were wrong. Instead we built a process where blog editor owns the first mile, AI does bulk of the work and an illustrator owns the last mile. The quality is where we want it to be, and our team is finally happy.

Evil Martians is a developer-focused product studio that’s been at it for over 20 years. We build developer-facing brands, products and infrastructure. We also run one of the larger devtools blogs on the internet: 50+ technical articles a year, written by pretty much every Martian, for an audience of 500K+ developers.

Here’s what you may not know: the engineers who work on the blog are encouraged to do so, but it’s entirely optional for them. It’s a real labour of love, and that level of care passes on to the cover illustration for each article.

Technically speaking we don’t really have to have these pictures. If Evil Martians itself is a cake, the blog is like the cherry on top, and the blog cover illustrations are like the cherries on top of those cherries. If any team should have been able to prompt their way to a fully automated illustration pipeline, it probably should have been us.

The problematic past of the Evil Martians blog illustration workflow

Illustrations have always been part of how the blog reads, and our mascot, a green alien chasing a humanoid, anchors every illustration. For years, we sourced those illustrations through a small bench of contract illustrators. Even though we have a strong design department, doing this character illustration work (with AI or not) is not our specialization, nor is it something our busy designers are often free to do.

There were multiple factors which made this process a real headache:

  1. The turnaround was unpredictable. We worked with numerous illustrators over the years and their availability would constantly vary, and one illustrator could only take on one illustration at a time.
  2. The mascot drifted between hands and different illustrators interpreted the character differently.
  3. Timing concerns went beyond turnaround speed. Authors weren’t always satisfied with the first pass, and a back-and-forth on a single illustration could stretch over days, even into the next week.
  4. Although the budget and cost was relatively low, we always wanted to bring production of our brand materials in-house.

As far back as 2023, we were experimenting with simple AI images to augment the availability of our in-house and external illustrators. But in 2025, once LLMs really caught up with image generation, the conversation on our team changed.

The folie à deux effect

Folie à deux is a shared delusion that takes hold inside a tight group, where each member’s confidence reinforces everyone else’s. That’s basically what happened to us as image models from Recraft, ChatGPT, Midjourney, and others kept getting better. Every time a new model landed, the illustration conversation on our team started again. This was early 2025, so we tried a few things:

  1. Raw-doggin it with Midjourney. We actually spent hours prompt-engineering and trying to even understand what it is that we want for our blog. We landed on isometric Pixar-looking 3D objects, however, it didn’t always work.
  2. Recraft. They had proprietary models that produced vector and a user-friendly canvas UI. As Nanobanana came out, we wanted to play with it within Recraft, but that didn’t work, so we decided to move to other platforms.
  3. Nanobanana. It came out and really seemed to have it all. With good references we could get close to perfection, but any edit—even a minor one—would muddy the image and lose quality. We tried Flora Fauna for background removal and Topaz Upscaler for upscaling, but quality still suffered, and fixing things by hand drained our time.
  4. We even built a custom GPT app for different styles. Our design team spent a few weeks on creating a tool that would be convenient to use, but at the time GPT was not great with images and we just stalled having to reach for other tools.

In most instances we were able to get 90% there, but those odd last 10% really threw us for a loop. In situations where we managed to produce an image without attracting designers, the results weren’t always excellent.

What the models still couldn’t reliably do

After enough iterations, a few failure modes kept showing up.

Mascot integrity across a series. A single shot of our mascot looked fine in almost any modern model. But we couldn’t produce consistent proportions, expression, and silhouette without per-asset human cleanup.

Small details sucked. You know that feeling when an illustration looks good at a first glance, but as you look at it longer, it starts to fall apart? That was us. Texts, weird hallucinations, odd shadows and tiny details became a huge headache for my team. As a lean team, we tried to handle it ourselves. But that wasn’t the best use of our time, and the process felt unnecessary.

Composition and transparency. Our illustrations ship at 1200×1000 px, often with transparent backgrounds. AI image models default to opaque output and a single aspect ratio, which meant cleanup on every asset before it could ship.

Tokenmaxxing and time drain. To our disappointment, what initially looked like a promising cost-saving workflow ended up completely draining our resources. We ended up with each illustration costing us $600-700 in our time equivalent, caught in the smallest Photoshop manipulations cleaning up slop at 7pm on a Friday.

Author’s note: None of these are arguments against AI in the workflow. They’re arguments against AI as the entire workflow.

It eventually became soul-suckingly chaotic to deal with all of these quirks. And I remember the specific moment when it all ended. By that point Travis and I were completely exhausted.

How the workflow looks now

What kept the slop out is what we call the last human mile. In our case, that human, a professional designer, sits inside an AI-first design agency called KOJI that pairs AI generation with a human illustrator who owns the cleanup, the shadows, and the small things the model still gets wrong. Each illustration costs us $100.

The author of the post and our editor agree on a one-line visual thesis for the illustration. This is the part that requires editorial taste and a sense of what the post is actually about. We hand the one-liner over to KOJI along with any references or must-haves the author wants visible.

From there, KOJI runs the whole creative process internally. The AI generation, the selection of the strongest direction, and the human illustrator’s cleanup all happen inside the agency.

Let’s be clear here: KOJI is not one-shotting this. There was an extensive preparation process that allowed them to take the flat mascot and make it into a versatile 3D model that now can be effectively used in the AI pipeline.

In terms of the process, KOJI sends us a first draft, the author gets a 24-hour window to flag specific issues against the brief, and KOJI returns the final version before it ships.

What stayed exactly the same is that humans own the brand at both ends of the pipeline. We own the visual thesis at the start, and a human illustrator at KOJI owns the final pixels at the end.

Our publication bottleneck has been significantly reduced, and because each illustration costs less, we actually now commission 2-3× more illustrations than we used to.

cost our time quality bottleneck
human $200-300 1-2 hours medium high
AI $20 8+ hours inconsistent medium
hybrid $100 1-2 hours consistent low

Lessons learned beyond the illustrations

The illustration pipeline is a specific case of a more general pattern that’s worth paying attention to even if you don’t care about how a devtools blog gets its illustrations made.

Many devtools founders shipping AI features will recognize the beats of this story from their own products: the demo works, the first batch of production runs works. Then, a failure mode shows up that wasn’t visible in the demo, because that demo was selected from a distribution where the model happened to be lucky.

Pretending a model can cross that final mile is the most common and most expensive mistake we see devtools teams making in 2026.

AI llmresearch

Meta's Watermelon Matches GPT-5.5 Benchmarks

Meta’s internal AI chief claims a new model codenamed Watermelon matches GPT-5.5 performance, though no public data supports the assertion.

Summary

What: Alexandr Wang, Meta’s head of superintelligence, told employees in a town hall that the upcoming model Watermelon has reached performance parity with OpenAI's GPT-5.5. The model is still training and reportedly utilizes significantly more compute than its predecessor, Muse Spark.
Why it matters: This illustrates the current trend of 'benchmark diplomacy' where companies use unverified internal claims to signal progress in the competitive AI frontier, often to maintain investor and employee confidence.

Deep Dive

  • Meta AI leadership claims parity with GPT-5.5 based on internal benchmarks.
  • The model, Watermelon, is currently in the training phase.
  • Development utilizes a significantly higher compute budget than the April 2026 release, Muse Spark.
  • No methodology or specific benchmark datasets have been released to the public.
  • The claim originates from a single report via Business Insider based on internal sources.

Original Article

Meta's superintelligence chief Alexandr Wang told employees in a town hall that the company's upcoming model, codenamed Watermelon, has "caught up" with OpenAI's GPT-5.5 on closely followed AI benchmarks, according to Business Insider, which cited two people familiar with the matter. Wang reportedly said Watermelon is still in training and uses "an order of magnitude more compute" than Muse Spark (Meta's April model, internally codenamed Avocado), which had trailed rival models despite solid benchmark scores. Business Insider notes it was not clear which benchmarks Wang cited, and neither Meta nor OpenAI has confirmed the claim. For practitioners, an internal, single-sourced benchmark claim is not equivalent to a published, reproducible evaluation and should be treated as an early signal, not a verified result, until Meta releases the model publicly.

An unconfirmed internal benchmark claim from Meta's AI leadership is a reminder that town-hall statements are not evaluation artifacts: until Meta publishes reproducible results or a model card for Watermelon, "caught up with GPT-5.5" is a single-sourced assertion, not verified parity. For practitioners tracking the frontier-model race, the more concrete signal here is the compute trajectory Wang described, not the benchmark claim itself.

What happened

According to Business Insider, Alexandr Wang told Meta employees in a town hall that the company's upcoming model, codenamed Watermelon, "has caught up" with OpenAI's GPT-5.5 based on closely followed AI benchmarks, citing two people familiar with the matter. Business Insider reports Wang said Watermelon, the successor to Avocado (Meta's internal codename for Muse Spark), is "currently in training" and "uses an order of magnitude more compute than Avocado." OpenAI released GPT-5.5 in April and introduced GPT-5.6 late last month, per Business Insider. Meta declined to comment and OpenAI did not respond to a request for comment. Investing.com, redistributing the Business Insider report, added that it was not immediately clear which benchmarks Wang was citing.

Technical context

Meta released Muse Spark in April 2026, its first major model since hiring Wang, and it performed well on some benchmarks while still falling short of leading rivals overall. Wang's description of Watermelon using "an order of magnitude more compute" than Muse Spark points to continued aggressive scaling as Meta's primary lever, consistent with the company's reported multibillion-dollar spending on chips and data centers under Zuckerberg's direct oversight of AI development.

For practitioners

Treat this as a leading indicator, not a procurement signal. Internal benchmark claims announced without published methodology, evaluation datasets, or third-party replication carry a real risk of optimistic framing. Wait for a public model card, an official benchmark table, or independent evaluations before factoring Watermelon into model-selection or capacity-planning decisions.

What to watch

Meta has not given a release timeline for Watermelon. Watch for a public launch announcement, published benchmark results, and whether the model narrows the gap with GPT-5.5 and GPT-5.6 on independently run evaluations rather than internally cited ones.

Key Points

  • Meta's AI chief told staff Watermelon has matched GPT-5.5 on internal benchmarks, per a single Business Insider report citing anonymous sources.
  • Wang described Watermelon as using far more training compute than April's Muse Spark, underscoring compute scaling as Meta's core strategy.
  • Practitioners should wait for published benchmarks or independent evaluations before treating the parity claim as verified for deployment decisions.

Scoring Rationale

Notable signal in the Meta-OpenAI frontier-model race given Meta's competitive stakes, but the claim rests on a single anonymous-sourced town-hall statement with no published benchmark data, and neither company confirmed specifics, so it stays provisional pending independent verification.

AI enterprisehardware

Anthropic Exploring a Samsung Chip Partnership

Anthropic is in talks with Samsung to develop custom AI chips, joining a growing trend of labs seeking to bypass Nvidia’s dominance.

Summary

What: Following reports that OpenAI is partnering with Broadcom for its 'Jalapeño' processor, Anthropic is exploring a custom chip collaboration with Samsung to diversify its hardware stack alongside existing providers Google, Amazon, and Nvidia.
Why it matters: AI labs are increasingly treating silicon as a core product differentiation strategy, attempting to secure supply chain independence and specialized performance as demand for inference capacity explodes.

Deep Dive

  • Anthropic is in discussions with Samsung regarding custom AI hardware.
  • The strategy aims to reduce dependency on Nvidia's centralized hardware ecosystem.
  • Current compute strategy continues to rely on Google, Amazon, and Nvidia hardware.
  • Move follows OpenAI's collaboration with Broadcom on the 'Jalapeño' inference processor.
  • Samsung is simultaneously partnering with Nvidia to manufacture existing GPU demand.

Decoder

  • Inference Processor: Hardware specifically designed to run, or 'serve', pre-trained AI models rather than training them from scratch.

Original Article

Back in April, Reuters reported that Anthropic was toying with the idea of producing its own AI chips as a means of responding to chip shortages. Now, it would appear that the company is getting serious about this idea.

On Thursday, The Information reported that Anthropic was in contact with Samsung to explore a collaboration around the pending chip. However, Anthropic hasn’t yet decided what the chip will be used for, how it will fit into the server, or how powerful it will be, according to the report.

When reached for comment, Anthropic told TechCrunch that a diversified hardware stack that includes chips from Google, Amazon, and Nvidia will continue to be pivotal to its compute strategy. On the topic of a potential Samsung partnership, the company said it had nothing further to add.

A number of AI companies have sought to develop custom chips — both as a way to create unique hardware for specific compute tasks and to gain a certain amount of independence from Nvidia, which continues to be the undisputed leader of the chip industry.

Anthropic’s announcement may also be a response to one made last week by its key competitor, OpenAI, which has teamed up with Broadcom to announce its own custom-built inference processor, dubbed “Jalapeño.” OpenAI says that the chip is more efficient, demonstrating better performance-per-watt, than other competitor chips. Amazon and Google both offer custom-built TPUs as part of their cloud offering.

Samsung is already embedded in the AI industry, and acts as a major partner of Nvidia, producing chips that the company needs to train or run its AI models. In turn, Samsung uses Nvidia’s software to manufacture its chips. The duo are working on an AI chip factory in South Korea. Samsung has also discussed partnering with Google on its chip-making efforts.

AI research

Seed2.0 Model Card

ByteDance's Seed2.0 model card emphasizes a user-centric evaluation strategy for complex, long-horizon real-world tasks.

Summary

What: The Seed2.0 model series focuses on improving 'long-tail' knowledge and complex instruction following, backed by an evaluation system designed around realistic user needs rather than just generic benchmarks.
Why it matters: The shift toward 'model cards' that detail specific real-world usage scenarios indicates a maturing industry moving away from 'general intelligence' hype toward reliability in specific, high-value enterprise applications.

Deep Dive

  • Seed2.0 aims to solve real-world, long-horizon tasks.
  • The core focus is on long-tail knowledge and instruction following.
  • The team developed an evaluation framework grounded in specific user needs rather than static academic benchmarks.
  • The model exhibits improvements in reasoning, visual understanding, and search capabilities.
  • The model card aims to provide transparency on the model's capabilities for high-complexity scenarios.

Decoder

  • Long-tail knowledge: Information or specialized data that falls outside of the most common, frequent queries (the 'head'), requiring specialized or broad training datasets.
  • Model Card: A short document accompanying a machine learning model that provides information about its intended use, limitations, and performance characteristics.

Original Article

Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity

We present Seed2.0, a model series that takes a meaningful step toward solving complex, real-world tasks. Our approach begins with identifying users' genuine needs and constructing a reliable, forward-looking evaluation system by selecting and abstracting benchmarks grounded in these needs and in realistic, complex scenarios. Guided by this evaluation system, Seed2.0 targets two persistent challenges, long-tail knowledge and complex instruction following, substantially improving the model's reliability on intricate, long-horizon tasks. Beyond these, Seed2.0 delivers world-leading reasoning intelligence, visual understanding, and search capabilities that address the most common needs of a broad user base. Through extensive real-world use cases documented in this model card, we demonstrate that Seed2.0 begins to exhibit the ability to handle initial complex real-world tasks, delivering greater value to hundreds of millions of users.
AI hardware

The Hardware Coup: Why AI Hardware Just Changed Forever

Late June marked a watershed moment as custom AI silicon from companies like OpenAI, Etched, and Amazon moved from concepts to physical hardware.

Summary

What: This period saw rapid deployment of custom AI hardware, indicating that companies are moving beyond general-purpose GPUs to proprietary chips optimized for specific AI workloads.
Why it matters: This transition signals that the AI infrastructure bottleneck is shifting from software and model development to custom silicon design.

Original Article

The Hardware Coup: Why AI Hardware Just Changed Forever

The AI hardware world just went through an incredibly busy two weeks. In late June, custom chips went from ideas on a page to working physical hardware. OpenAI, Etched, Amazon, and SambaNova all made...

AI researchmath

The Ramanujan Challenge for AI

The Ramanujan Challenge for AI tests whether language models can move beyond guessing formulas to providing verifiable, symbolic derivations.

Summary

What: Launched by the Ramanujan Machine project, the challenge consists of ten research-level math problems regarding mathematical constants. Participants must submit formal proofs or code-based derivations by August 1, 2026.
Why it matters: Testing AI on symbolic derivation rather than just output generation is a critical step in assessing true mathematical reasoning capabilities.
Takeaway: If you are interested in AI mathematics, you can view the ten research-level problems at ramanujanmachine.com/ramanujan-challenge.

Decoder

  • CAS (Computer Algebra System): Software that can perform symbolic mathematics, such as equation solving, integration, and algebraic manipulation.

Original Article

The Ramanujan challenge

Ido Kaminer shared with me the following information about The Ramanujan Challenge for AI, and I am happy to share it with the readers of this blog. The challenge page is at ramanujanmachine.com/ramanujan-challenge; Here is the The full challenge paper, and a quote from Ido’s email.

“The challenge launched today and will run until August 1, 2026. It consists of [ten] research-level problems on explicit formulas for mathematical constants, designed to test whether AI systems can move from a concrete formula to a valid proof or symbolic derivation.

We designed the rules to make the challenge compatible with formal and code-based systems. Accepted submissions may be formal proofs, CAS-based derivations, or human-readable proofs accompanied by reproducible code. The goal is not only to test whether AI can find answers, but whether it can produce derivations that can be checked in a structured way.”

The second problem

Tech aistartupenterprise

Mark Zuckerberg tells staff that AI agents haven't progressed as quickly as he'd hoped

Mark Zuckerberg told Meta staff that AI agent development is lagging behind internal expectations.

Summary

What: During an internal town hall, Zuckerberg admitted AI agent progress is slower than anticipated and described earlier workforce layoffs as not sufficiently clean, despite Meta's massive $145 billion investment in AI infrastructure this year.
Why it matters: This reveals a gap between high-level strategic pivots toward AI and the operational reality of implementing complex agentic systems at scale.

Decoder

  • AI agent: A software program designed to perform tasks autonomously by interacting with tools and digital environments.

Original Article

Mark Zuckerberg tells staff that AI agents haven’t progressed as quickly as he’d hoped

Replacing people with AI doesn’t seem to be that easy to do, if Meta can be seen as an example.

Reuters reports that at an internal town hall Thursday, CEO Mark Zuckerberg told staff that the pace of AI agent development had not “accelerated in the way” executives had previously expected them to.

Earlier this year, Meta laid off some 8,000 employees — approximately 10% of its corporate workforce — and reassigned another 7,000 to various AI groups, including one called Agent Transformation, Bloomberg reported.

During this week’s meeting, Zuckerberg apparently commented on these job cuts — noting that they were not as “clean” as they should have been. The cuts were made because top officials at the company “were worried that we weren’t going to move fast enough ‌to adapt” to the changing landscape of the tech industry, Zuckerberg reportedly added.

The corporate leader also apparently said that the perceived upside of the new AI-focused company structure hadn’t “come to fruition yet,” although he said that he believed the company would begin to see improvements from its AI investments during the next three to six months. Several other investigative reports have depicted Meta’s months-old AI unit as a soul-crushing gulag, according to some of the engineers assigned to it.

Meta has invested heavily in AI and is expected to spend as much as $145 billion on AI infrastructure this year, Reuters reports.

TechCrunch reached out to Meta for comment.

Tech fintechaidata

China Quant Funds Draw Billions as AI Trounces Human Traders

Chinese quantitative investment funds have doubled their assets to 2.6 trillion yuan as AI-driven trading strategies consistently outperform human traders.

Summary

What: Quantitative funds in China are experiencing massive growth as firms shift toward AI-based strategies that can monitor and execute trades across thousands of stocks simultaneously.
Why it matters: The shift highlights how AI is becoming a baseline requirement for competitive asset management, moving from a niche technology to the primary driver of institutional trading performance in major markets.

Decoder

  • Quant fund: An investment fund that relies on mathematical and statistical models to identify and execute trading opportunities.

Original Article

Quants' assets under management in China have more than doubled in less than a year to over 2.6 trillion yuan. The trend is being fueled by the rapid adoption of artificial intelligence. Quants' breadth in covering thousands of stocks has made it mainstream. Funds are responding to the frenzy by creating new investment products.

Tech policyhardware

FAA proposal: Supersonic airliners can fly over US cities if they're quiet

The FAA is proposing to allow commercial supersonic flights over the US if aircraft meet strict noise limits, overturning a 53-year-old ban.

Summary

What: The FAA's proposal sets a sonic boom overpressure limit of 0.11 pounds per square foot, potentially enabling a new generation of supersonic travel, though critics argue this metric does not accurately reflect actual human perception of noise.
Why it matters: The attempt to regulate 'quiet' supersonic flight reflects an industry-wide effort to justify the viability of supersonic airliners like Boom Supersonic's Overture, which currently lack a profitable commercial model.

Decoder

  • Sonic boom: The loud, explosive sound caused by an object moving faster than the speed of sound.
  • Overpressure: The transient increase in air pressure caused by shock waves, used here as a technical metric for measuring sonic boom intensity.

Original Article

A long-standing ban on commercial supersonic flights over the United States would be overturned in a new rule proposed by the US Federal Aviation Administration. That could pave the way for the possible return of commercial supersonic airliners—as long as such aircraft can reduce the ground-level impacts of their sonic booms.

The FAA originally banned overland supersonic flights by civil aircraft in 1973, following US military tests involving supersonic flights over US cities such as Oklahoma City, Chicago, and St. Louis in the 1960s. But the Trump administration has championed the repeal of the ban to pave the way for supersonic airliners that could operate without disruptive sonic booms. So the FAA’s new rulemaking action on June 30, 2026, follows the direction of an executive order issued by President Trump on June 6, 2025.

The newly proposed rule would replace the 53-year prohibition with an interim “noise-based” certification standard requiring any sonic boom overpressure at the surface to be kept below 0.11 pounds per square foot. That proposed standard is based on the Colorado-based startup Boom Supersonic having demonstrated quiet Mach cutoff flights with its XB-1 aircraft—harnessing specific atmospheric conditions while flying just beyond supersonic speeds at higher altitudes so that the aircraft’s shockwaves are refracted upward into the atmosphere rather than traveling to the ground.

For comparison, the Concorde supersonic airliner that flew commercial transatlantic flights between 1976 and 2003 created a sonic boom overpressure equivalent to 1.94 pounds per square foot when flying at a speed of Mach 2 at an altitude of 52,000 feet.

A NASA fact sheet suggests that “some public reaction could be expected between 1.5 and 2 pounds” but rules out damage to buildings and other structures at one pound of overpressure. It further explains that humans have experienced sonic boom overpressure between 20 and 144 pounds without injury when supersonic aircraft flew at altitudes below 100 feet.

However, not everyone is sold on this proposed standard for allowing overland supersonic flights. Dan Rutherford, senior director at the nonprofit International Council on Clean Transportation, told Aviation Week that the overpressure metric was previously discarded by United Nations experts in 2014 because “it doesn’t actually measure loudness or annoyance.”

“I’m honestly surprised that the FAA would propose a rule this weak,” Rutherford told the publication.

US lawmakers in Congress have also been pushing forward the Supersonic Aviation Modernization Act. That would require the FAA to allow for overland supersonic flights “so long as the aircraft is operated in such a manner that no sonic boom reaches the ground in the United States.” The bill passed the House on March 24, 2026, and is still awaiting a vote in the Senate.

Another way for quiet supersonic flight

Meanwhile, NASA has been testing a different approach to quieter supersonic flight with the Lockheed Martin X-59 Quesst—a needle-nosed experimental aircraft with an airframe designed to reduce the typical sonic boom to a sonic thump. NASA has relied on perceived levels of decibels (PldB) to evaluate sound levels, with the goal of consistently demonstrating sonic thumps around 75 PldB that would sound like a car door slamming about 20 feet away.

A NASA test pilot and mission integration manager previously told Ars that the X-59 aircraft’s future supersonic flight tests over US cities and towns nationwide would provide community feedback on perceived sound levels that could help inform regulations by civil aviation authorities.

The FAA still has time to further refine its proposed noise regulations for overland supersonic flights before attempting to finalize them by mid-2027. The agency also plans to propose another rule later this year that would set takeoff and landing noise standards for supersonic aircraft.

Legalization of quieter overland supersonic flights does not guarantee a successful comeback for commercial supersonic airliners. The Concorde supersonic airliner cut transatlantic flights between New York and London from seven hours to under three hours, but the aircraft’s massive fuel consumption made it difficult to sustain profitable operations—never mind recovering the more than $2.8 billion in development costs shared by the UK and French governments.

Boom Supersonic is developing a supersonic airliner called Overture with the goal of delivering the first aircraft to customers by 2029. The company has signed commercial agreements with American Airlines, Japan Airlines, and United Airlines that give the companies options to purchase the Overture aircraft.

But Boom has also pivoted away from its main goal in recent months to produce natural gas turbines to power AI data centers. Boom CEO Blake Scholl has suggested that revenue from this side venture would help pay for the development costs of the Overture supersonic airline. At the same time, United Airlines CEO Scott Kirby has said he gives Boom a “50/50” chance of getting its supersonic airliner flying.

Tech aihardwareenterprise

Tesla caps employee AI spending at $200/week except for Grok

Tesla has imposed a $200 weekly cap on employee AI spending for third-party models, highlighting the rising cost of internal AI development.

Summary

What: Employees are now restricted to $200 per week for external AI services, excluding the company's internal 'Grok' model, which Tesla provides to staff.
Why it matters: This reflects a growing trend of companies treating AI compute as a strictly limited budget item rather than an unlimited utility for employees.

Original Article

The cap is a sign that even companies betting their future on AI are struggling to control their costs.

DevOps securitycloud

The Hidden Cost of Misconfigurations in Hybrid Cloud

Automated hybrid cloud environments are scaling configuration errors faster than security teams can remediate them.

Summary

What: Tyler Carrigan at Check Point notes that fragmented policy enforcement in hybrid clouds leads to widespread misconfigurations in IAM, secrets management, and network security. He advises adopting unified policy-as-code and integrated CI/CD scanning.
Why it matters: The industry is realizing that automation, while necessary for scalability, creates a significant security risk when it lacks cross-environment consistency and drift detection.
Takeaway: Integrate infrastructure-as-code (IaC) scanning directly into your CI/CD pipelines to catch misconfigured rules before they reach production.

Decoder

  • Infrastructure as Code (IaC): The process of managing and provisioning computer data centers through machine-readable definition files.
  • Configuration Drift: The phenomenon where an environment evolves over time, diverging from its original configuration or intended state.
  • Blast Radius: The potential extent of a security breach or system failure, specifically regarding how many services or data points are affected by a single error.

Original Article

You set up your CI/CD pipelines, automation workflows hum along, and your hybrid cloud orchestration feels bulletproof. All is good until some misconfigured rule from a testing environment silently propagates across your infrastructure. No effective alerts trigger — or the misconfiguration slips through unnoticed — until it's too late, and attackers exploit the minor misconfiguration faster than teams can respond.

This scenario represents the blind side of cloud automation: it supercharges productivity while amplifying mistakes. Hybrid setups scatter policy enforcement, which becomes fragmented across tools and platforms unless you adopt unified frameworks. By the time these ghosts are discovered, the damage is done.

The Misconfiguration Minefield in Hybrid Cloud

Hybrid cloud environments promise flexibility and scalability, but the problem with them isn’t just architectural complexity. It’s human error accelerated by automation. Misconfigurations in one layer can silently propagate across environments, leaving security teams in the dark and expanding the attack surface.

Hybrid setups introduce more moving parts, tools, and people who have access to them, which usually leads to higher chances of misalignment. Cloud and on-prem systems rarely share a single policy engine or observability stack. Each team may use different IaC tools, security frameworks, and monitoring platforms. Automation helps, but it scales mistakes faster than they can be caught. Some of the most common types of misconfigurations include:

  • Overly Permissive IAM Roles - Excessive or wildcard permissions remain a leading cause of breaches. In hybrid environments, privileges can be duplicated or inherited between systems, making it harder to track who has access to what and why.
  • Misconfigured Security Groups and Firewalls - Security rules in the cloud and on-prem often live in separate silos. It is difficult to identify risky gaps, like a cloud environment allowing inbound traffic that on-prem defenses would normally block.
  • Publicly Accessible Storage Buckets - Buckets and blobs left exposed due to default or reused policies are still among the most frequent and costly misconfigurations. In hybrid deployments, misaligned IaC templates can replicate a mistake at scale. Additionally, misaligned access control or improperly replicated configurations between environments are the root of issues.
  • Secrets and Credentials in Logs and Repos - Hardcoded secrets and unmasked credentials continue to surface in logs and repos, especially when secrets management isn't consistently enforced across hybrid and CI/CD systems.
  • Configuration Drift and Monitoring Gaps - IaC helps manage complexity, but inconsistent application across environments leads to drift. Without unified visibility, teams often miss these discrepancies until they’re exploited.

Strategies for Preventing Hybrid Cloud Misconfigurations

Unify Policy Enforcement

Robust information security starts with consistent policies. In hybrid environments, configuration rules often get spread across tools, teams, and platforms. Defining those policies as code (and enforcing them through a unified platform) helps eliminate drift and ensures predictable (and secure) behavior no matter where your workloads run.

Treat Infrastructure Code as Code

Misconfigurations often spawn in IaC templates (rather than production). Integrate IaC scanning tools in your CI/CD pipelines to catch them early. Validating infrastructure as code before deployment lowers the odds of misconfigurations or embedded secrets making it into your production environment.

Keep an Eye Out for Configuration Drift

Even well-defined policies can be undermined by manual changes or automation misfires. Drift detection tools provide continuous feedback when real-world settings diverge from intended baselines and policies. This visibility is especially critical in hybrid environments, where undocumented changes, cloud platform updates, and legacy infrastructure (among other issues) introduce variables.

Automate Secrets Management

Storing credentials in plaintext (or, worse, hardcoding them into templates) is still somehow frighteningly common. Instead of risking your secrets, use centralized secrets management tools to encrypt, rotate, and manage all code secrets.

Enforce Least Privilege Across the Stack

Over-privileged accounts and services are a familiar problem in hybrid environments. To avoid this, enforce least privilege by default and regularly audit access policies across both cloud and on-prem. Temporary and just-in-time access controls can help minimize exposure without slowing development velocity.

Break Down Team Silos

Misconfigurations thrive in environments where communication is lacking. Security, DevOps, and platform teams must share responsibility for configuration hygiene, but often lack the tools and channels to do so. Consider adopting standardized tooling and defining clear ownership to ensure changes are understood and implemented adequately across the stack.

Isolate and Segment Environments

Unlike your teams, your environments should be segmented and isolated to contain vulnerabilities when they slip by defenses. Use network segmentation and environment-level isolation to reduce the blast radius of a misconfiguration. Zero trust design principles can further help reduce implicit assumptions and curb lateral movement in the event of failure.

Ghost-Proofing Hybrid Infrastructure

Teams can take strategic steps like shifting security checks left into CI/CD pipelines, enforcing least privilege, and segmenting environments. More than any other tactic, fostering collaboration across DevOps, security, and platform teams by replacing silos with shared responsibility is key to successfully tackling hybrid cloud security and complexity without slowing operations down.

By baking configuration hygiene, visibility, and accountability into deployment pipelines, you can exorcise the ghost of misconfiguration before it haunts your systems into a state of perpetual vulnerability.

DevOps aienterprise

How AI-First Operations Unlocks Compounding Engineering Productivity

AI-first operations aim to shift engineers from manual incident firefighting to managing autonomous systems that resolve routine service failures.

Summary

What: PagerDuty outlines a transition where AI agents handle incident triage and routine remediation, allowing engineering teams to focus on feature development rather than manual troubleshooting.
Why it matters: This signals a structural change in the DevOps lifecycle, moving from 'human-in-the-loop' incident management to 'human-on-the-loop' supervision of autonomous remediation workflows.

Decoder

  • Incident Triage: The process of determining the severity and urgency of a system incident to prioritize the response.

Original Article

AI-first operations let agents handle incident triage, coordination, and routine remediation so engineers spend more time building and less time firefighting. Teams progress from AI-assisted workflow support to supervised investigations and autonomous resolution of well-understood failures.

DevOps aimcp

From Error Log to Closed Ticket, Without Leaving Your Terminal

A new Azure MCP server allows developers to manage the full lifecycle of support tickets directly from their terminal using contextual inference.

Summary

What: The open-source tool automatically infers context from error logs or resource IDs, generates tickets, and enables updates, replies, and attachments without requiring browser-based portal access.
Why it matters: This indicates a shift toward terminal-centric agentic workflows that reduce context switching for engineers during incident response.

Original Article

An open-source Azure MCP server turns the full support ticket lifecycle into a conversational workflow inside the terminal by inferring context from logs or resource IDs, generating and filing tickets, and managing updates through replies and attachments. It minimizes portal usage via local-first data, safe preview-confirm actions, and continuous ticket tracking.

DevOps aiopensourceresearch

Open source maintainership in the age of AI

Kubernetes has established formal AI contribution policies to mandate human accountability and disclosure while integrating automated review tools.

Summary

What: The policy requires contributors to disclose AI assistance, forbids AI from being listed as a co-author, and requires humans to verify and explain any AI-generated code. The community is also testing tools like CodeRabbit to automate PR reviews.
Why it matters: This codifies a standard approach for large open-source projects to mitigate 'maintainer burnout' while preventing the degradation of code quality from automated, unverified contributions.

Deep Dive

  • Disclosure: PRs must include clear statements if generative AI was used.
  • Accountability: Only humans are legally responsible for code; AI cannot be a co-author.
  • Verification: Maintainers will close PRs if the human contributor cannot personally explain or defend the code.
  • CLA Enforcement: Contribution License Agreement checks are now enabled for co-authors to prevent unverified AI-generated commits.
  • Automation: Introduction of CodeRabbit as an automated quality gate for quick spot checks.

Decoder

  • Maintainer Burnout: The operational exhaustion of open-source project leads due to the high volume of review requests.
  • CLA (Contributor License Agreement): A legal document defining the terms under which intellectual property is contributed to an open-source project.

Original Article

Open source maintainership in the age of AI

AI has really changed the game around software development. More people are leveraging AI than ever to contribute patches to projects they use. To me, this is a good thing as more folks will contribute patches rather than fork or not fix them. The main problem is that AI has made generating code fast but there has been very little improvement in maintaining code bases. In this post, we will highlight the ways the Kubernetes community is adapting to the world of AI assisted coding.

The first step of this journey was to develop an AI policy. This seems mundane and bureaucratic but there were many PRs that derailed into discussions around AI usage. The AI policy helps steer the conversation around the project's stance on AI and provides a clear signal to contributors on how to use these tools responsibly.

Kubernetes AI policy

The Kubernetes project has established clear guidelines for AI-assisted contributions that balance innovation with accountability. These policies are designed to maintain code quality and ensure human oversight while acknowledging that AI tools can be valuable aids in the development process.

Transparency first

Contributors must disclose when AI tools have been used to assist with a pull request. A simple statement in the PR description such as "This PR was written in part with the assistance of generative AI" is sufficient. This transparency helps reviewers understand the context and apply appropriate scrutiny.

Human accountability

While AI tools can assist, the human contributor remains fully responsible for every change. The policy explicitly prohibits:

  • Listing AI as a co-author on commits
  • Using AI co-signing on commits
  • Adding trailers like "assisted-by" or "co-developed" that attribute work to AI

This isn't about diminishing AI's role as a tool—it's about maintaining clear accountability. If something breaks, there needs to be a human who understands why and can fix it.

CLA enforcement for co-authors

The CNCF provides a tool for verifying the contributor license agreements on each pull request. AI agents are not able to solve these contributor license agreements so one enforcement the project made is to enable the CLA check for co-authors. This provides a flag to reviewers that the PR is not ready to merge.

Human engagement required

Perhaps the most critical aspect of the policy: reviewers expect to engage with humans, not with AI. Contributors cannot rely on AI to respond to review comments. If you cannot personally explain changes that AI helped generate, your PR will be closed. This requirement ensures that knowledge transfer happens and that contributors genuinely understand the code they're submitting.

Verification obligations

Contributors must verify AI-generated changes through code review, testing, and personal understanding. It's not enough for the code to work—you need to know why it works and be able to maintain it.

These policies reflect a mature approach to AI: embrace it as a tool, but never let it replace human judgment, understanding, or responsibility.

Automated AI reviews

There exist many tools to aid in reviewing code. AI pull request tools introduce governance challenges so one of the first tasks the community took on was to document the process for what is needed to bring in new AI tools. One of the major evaluation criteria for these tools is to find maintainers willing to test drive them in kubernetes-sigs repositories. Kueue, JobSet and Agent-Sandbox have been experimenting with these tools to provide more support for maintainers.

Copilot

One tool that many maintainers started using was GitHub Copilot. The CNCF provides access for maintainers so this ended up being the first tool many started using. It provides some good experience on tuning reviews but there were some growing pains with this tool. The biggest blocker for community adoption is relying on contributors to have a copilot license. Only maintainers were able to request copilot reviews and automated reviews of pull requests was out of reach for the community. One of the goals of AI review tools is to provide an automated review tool that maintainers don't need to request. This demonstrated the need for organization control rather than relying on contributors having access.

CodeRabbit

In mid 2026, the Kubernetes community has rolled out CodeRabbit to a few projects. As with copilot, some tuning has been required to provide better reviews but the overall feedback has been positive. There is a lot of configuration available for this tool and one of the most interesting uses of this tool comes from agent-sandbox.

AI pull request tools can be a quality gate. Contributors can at least get a quick spot check review without waiting for a maintainer. Agent-sandbox has added a label on PRs to reflect that there is still a need to resolve some of the comments from AI tools.

Next steps

The reality is that leveraging AI in open source projects is an area of active exploration. The community could use your help in tuning reviews tools, evaluating tools or evaluating emerging technologies in the AI space.

Some areas we are exploring more:

  • The use of AI skills to reduce maintainer burnout.
  • AI assisted triage of failing tests.
  • Skills to aid the operational aspects of Kubernetes.
Design aillmstartup

Vibe-coding Platform Base44 Launches Own Model as AI Startups Seek Defensibility

Vibe-coding platform Base44 is moving away from frontier models to launch its own proprietary model, Base1, in a bid to boost margins and defensibility.

Summary

What: Base44, acquired by Wix for $80 million last year, reached $150 million in ARR while training Base1 on tens of millions of user interactions. Founder Maor Shlomo claims the custom stack optimizes for latency and cost compared to general frontier models.
Why it matters: The move reflects a growing trend where applied AI companies seek to mitigate the 'commodity' nature of off-the-shelf LLMs by creating vertically integrated stacks that claim data and infrastructure moats.

Deep Dive

  • Vertical Integration: Base44 is vertically integrating distribution, data, and model infrastructure.
  • Cost Pressure: Rising enterprise concerns over inference costs drive the shift to optimized, domain-specific models.
  • Model Strategy: Generalist models like those from Anthropic or OpenAI are increasingly being challenged by specialized, user-interaction-trained models for specific UI tasks.
  • Defensibility: Data ownership from user interactions serves as the primary barrier to entry for smaller competitors.

Decoder

  • Vibe-coding: A software development paradigm where developers focus on natural language descriptions of intent rather than writing raw code, often relying on AI to bridge the gap between intent and implementation.
  • Frontier Model: Large-scale, general-purpose AI models (e.g., GPT-4, Claude 3.5 Opus) developed by well-funded labs that set the current state-of-the-art performance benchmarks.
  • Inference: The process of running a trained AI model to make predictions or generate outputs based on new input data, which incurs significant computational costs.

Original Article

Base44, the vibe-coding platform that Wix acquired for $80 million just one year ago — when the company was barely six months old and had a team of eight — has started rolling out its own AI model to support its users in creating apps with natural language.

The move comes as the discussion in AI circles has intensified over whether frontier models are best suited for all use cases. A related question is whether businesses built on top of someone else’s models are truly defensible long term. The latest move of Base44, based in Tel Aviv, speaks to both.

While its custom LLM is only just rolling out, Base44 hopes that it will eventually outperform frontier models. According to its founder, Maor Shlomo, “training and owning the model as part of [our] entire stack allows us a lot more optimizations on latency, cost, and efficiency.”

At first glance, this could be a way to stay ahead of competitors such as Swedish startup Lovable, which reached unicorn status in its Series A round last summer and that relies on external LLMs. However, Shlomo expects that others will train their own models — “at least the players that have gotten enough scale and velocity to have enough data.”

According to Jonathan Userovici, a general partner at VC firm Headline — whose portfolio includes AI companies like Mistral AI, but not Base44 — data is one of three key ingredients of defensibility for AI startups, alongside distribution and tech stack.

The upshot is that players with strong brands are now leaning into their data and infrastructure to increase their defensibility, and Base44 fits that pattern. The company says the first iteration of its LLM, Base1, was developed and trained on a dataset generated from “tens of millions of real user interactions on the platform.”

This dataset will keep on growing with the company; but so will its rivals’. The bigger competition may not be vibe-coding startups at all but instead come from frontier AI labs that are getting closer to Base44’s home turf — Cursor and Grok’s parent company xAI now both belong to SpaceX, and Claude Code has become a vibe-coding player in its own right.

This gives Anthropic and other foundational AI providers access to data and feedback loops they can use to improve models for app creation, but Shlomo thinks specialization gives Base44 a leg up. “Models are progressing, but they’ll stay very general in what they can do,” he predicted.

Userovici, for his part, cautioned against underestimating frontier models, citing the example of the legal tech startup Harvey, which abandoned plans to train its own model. He doesn’t expect applied AI companies to become frontier labs en masse but frames Base44’s move in a broader context — one in which inference costs have become a meaningful part of the equation.

That cost pressure, Userovici says, has driven change that enterprise customers are now demanding. “They don’t necessarily see a [return on investment] when using the latest models for all use cases, so an entire infrastructure is being set up to do orchestration and optimization to select the right models for them so that costs don’t skyrocket while maintaining the same or similar performance across the majority of use cases.”

Enterprise companies still are a minority among the audience of the vibe-coding platforms, but they represent a growing share of platform revenue, and users of all sizes are starting to express concerns over the cost of using AI. Base44’s decision to develop its own LLM stemmed from multiple factors, but cost reduction is likely among the benefits.

“We want to get a model that is going to be more aligned to what we think is the right thing, is going to be more optimized to what we see users like in terms of the results we’re getting, and is going to be faster and cheaper for customers eventually than using the frontier models like Opus,” Shlomo said.

As for Base44 itself, cost reduction isn’t as clear cut. In a press release, the company explained that “ownership of the model gives Base44 direct control over compute and inference spend, expected to result in a structurally stronger margin profile over time.”

Even with a delayed payoff, improved margins would be good news for Base44’s parent company, which recently announced it would lay off 20% of its workforce. In contrast, Base44 has been growing in headcount since the acquisition — and announced it had passed $150 million in annual recurring revenue in May just two months after crossing $100 million in ARR.

That’s still less than Lovable, which said it hit $500 million in ARR earlier this month. But Shlomo is betting that the “huge engineering effort” to develop Base1 will cement Base44’s positioning as the “only vertically integrated vibe-coding application — meaning, in Userovici’s terms, a player that owns its distribution, data, and infrastructure all at once.

This article was updated to correct Base44’s location and add its latest ARR.

Design aifrontend

Got skills? Make the Figma agent a better collaborator

Figma’s new Skills feature allows teams to standardize AI behavior by creating reusable, plain-English workflows for design critiques, documentation, and task automation.

Summary

What: Skills function as persistent prompt libraries that instruct the Figma agent on team-specific standards, allowing users to share processes like project summaries or onboarding routines across the organization.
Why it matters: By moving from ephemeral, individual-level prompts to organizational skills, companies can ensure that AI-generated artifacts align with internal conventions rather than producing generic outputs.
Takeaway: Create a shared Skills file for your team to standardize how the Figma agent writes specs or formats design feedback.

Original Article

Full article content is not available for inline reading.

Read the original article →

Design frontend

Inspect, Replace, and Test New Fonts on Live Websites (Website)

Tinkerfont allows developers to inspect and replace typefaces on live websites directly within the browser.

Summary

What: Built by Mighil, the tool lets users test fonts across existing layouts and breakpoints, similar to the Fontanello browser extension.

Original Article

Right-click font inspector

The right-click font inspector is inspired by Fontanello. Thank you to Lars Wästfelt & Fred Bergman for the original idea.

Open Tinkerfont in the context menu for live typography on any text. Click a value to copy it — family, color, contrast, or an experimental font file URL when available.

Design ai

What are Hypertokens? The Layer Between Tokens and Components, Rebuilt for Agents

Figma's Jake Albaugh has proposed 'Hypertokens' as a way to bundle style properties, aiming to reduce design drift and improve agent readability.

Summary

What: Hypertokens would group related design tokens—such as color, spacing, and typography—into a single semantic unit, potentially aligning with the W3C 'composite token' specification.
Why it matters: Current design tokens are often exported as flat, disjointed lists, making it difficult for AI agents to understand the intent behind a design and leading to poor automated code generation.

Decoder

  • Composite tokens: A design token concept that groups multiple base values (e.g., color + opacity + stroke) into a single semantic category to ensure consistency across design systems.

Original Article

Hypertokens, a concept coined by Figma's Jake Albaugh at Config 2026, would bundle related style properties into one named unit that every tool builds from, replacing scattered hand-copied versions. This addresses drift between design tools and helps AI agents read files accurately instead of guessing, since agents currently receive ungrouped lists of token values with no indication of how they combine. It's not a shipped Figma feature but an early exploration. It closely resembles existing "composite tokens" in the W3C design token spec.

Design frontendpolicy

Why Accessibility Complaints are Increasing

Digital accessibility litigation is surging, with over 5,000 lawsuits filed in 2025 as legal and procurement standards tighten.

Summary

What: Vispero reports that companies are moving away from reactive patching toward permanent governance models that integrate testing and training into daily development workflows.

Original Article

Digital accessibility complaints are rising as legal, procurement, and customer-experience expectations mature, with over 5,000 lawsuits filed in US courts in 2025. Growing regulations, sprawling digital ecosystems, and greater public awareness are driving the trend. Reactive remediation often lets barriers resurface, so sustainable programs instead build governance, continuous testing, cross-team alignment, training, and strategic planning into daily operations.

Design frontend

Why Adding Friction Improved Built for Mars' Conversion Rate by 25%

A Built for Mars case study found that adding a specific onboarding step increased user conversions by 25% by reducing cognitive load.

Summary

What: The study suggests that 'frictionless' design sometimes confuses users; by forcing a single meaningful choice instead of presenting excessive, low-effort options, the company improved engagement and data quality.
Why it matters: This challenges the industry standard that lower conversion time always correlates with higher success, suggesting that meaningful user interaction is sometimes more valuable than speed.

Decoder

  • Psychological reactance: A behavioral theory where people feel their freedom of choice is threatened by too many options, leading them to resist or disengage from a process.

Original Article

Reducing friction isn't always the best way to improve conversions—adding a thoughtful onboarding step increased sign-ups by around 25%, while simpler "obvious" optimizations had no effect. Optional questions and excessive flexibility can create unintended problems, such as psychological reactance, lower-quality personalization data, or making users feel they should invest more effort before they've committed. The most effective onboarding balances guidance with autonomy, asking only what's necessary to deliver immediate value without overwhelming users with unnecessary choices.

AI enterpriseinfrastructure

Teaching AI to run with the turbines

Woodside Energy is scaling 50 agentic AI systems across its LNG operations to assist panel operators with complex plant startups and maintenance.

Summary

What: Woodside Energy has moved from predictive analytics to agentic AI, using a 'Startup Advisor' tool to support operators in LNG plants. The company focuses on a 'think big, prototype small, scale fast' strategy for its 50+ production AI agents.
Why it matters: This highlights how asset-intensive industrial companies are moving toward 'agentic' systems that integrate into existing operational workflows rather than just consumer-facing chatbots.

Decoder

  • LNG (Liquefied Natural Gas): Natural gas that has been cooled to a liquid state for easier transport and storage.
  • Agentic AI: Systems capable of performing multi-step tasks and interacting with external systems to achieve goals with minimal human intervention.

Original Article

Woodside Energy leverages AI to enhance operational efficiency by integrating agentic AI systems in complex industrial workflows, particularly in LNG plant startups through their "Startup Advisor". These AI tools, built upon years of investment in predictive analytics and machine learning, augment human expertise rather than replace it.

Tech hardwareenterprisetesla

Model Y Long Wheelbase

Tesla has launched a new 3-row, 6-seat Model Y 'Long Wheelbase' variant in the US and Puerto Rico.

Summary

What: The Model Y Long Wheelbase features a 3-row, 6-seat layout, 325 miles of range, a 0-60 mph time of 4.4 seconds, and updated interior amenities including heated/ventilated captain's seats and a 19-speaker audio system.

Decoder

  • Frunk: The storage compartment located under the hood of an electric vehicle.
  • FSD Supervised: Tesla's advanced driver-assistance system that requires constant human supervision.
  • Grok AI: A generative artificial intelligence model developed by xAI, integrated into Tesla vehicles.

Original Article

The Model Y Long Wheelbase is now available in the US and Puerto Rico. The 3-row, 6-seat configuration has ample headroom and legroom for all passengers. The trunk fits a 28" and a 20" suitcase each, and the frunk can hold an additional 20" suitcase. More details about the wheelbase, along with pictures and videos of the vehicle, are available in the post.

Tech careerai

Career advice in the age of AI

Professional value in the AI era is found at the 'last mile' of execution on problems that cannot be solved by simply defining a loss function.

Summary

What: Phil Chen argues that career longevity depends on identifying complex, ambiguous problems that fall outside the domain of automated model training and evaluation.
Why it matters: This perspective suggests that developers should prioritize high-level problem discovery and execution over rote coding tasks that are increasingly becoming automated.

Decoder

  • Loss function: A mathematical method used in machine learning to quantify the error between a model's predicted output and the desired result, serving as the basis for model optimization.

Original Article

The valuable work of the next decade is everything that can't be graded within the span of model training. Focus on resources that are truly limited and learn to find problems in addition to solving them. Work on the most ambitious form of a problem and learn to execute well in the last mile. Put yourself in the right position to see opportunities. The key to unlocking opportunity is to focus on finding interesting problems and delivering extraordinary results.

Design hardwaremobileapple

This iconic iPhone design will get closer to retirement with iPhone 18 Pro

Apple is set to shrink the iPhone 18 Pro's Dynamic Island by approximately 35% in 2026, advancing its trajectory toward an all-screen design.

Summary

What: The upcoming iPhone 18 Pro will feature a significantly reduced display cutout, a change hinted at by the new Siri AI interface in iOS 27. The redesign will eventually extend to the MacBook Ultra and the iPhone 18e.
Why it matters: This reduction confirms Apple's long-term hardware strategy to minimize and eventually eliminate visible sensors and cutouts in favor of a continuous display.

Original Article

Apple is expected to shrink the iPhone 18 Pro's Dynamic Island by around 35%, giving users more usable screen space and marking another step toward a future all-screen iPhone. The change, hinted at by iOS 27's Siri AI interface, would be the first major reduction in the display cutout since the Dynamic Island debuted on the iPhone 14 Pro. Apple is also rumored to bring a smaller Dynamic Island to the MacBook Ultra and next year's iPhone 18e as it gradually works to eliminate display cutouts altogether.

Design frontendweb

RetroMac (Website)

RetroMac is an open-source browser built on Chromium that recreates the visual aesthetic of early-2000s desktop software.

Summary

What: The browser uses modern web standards while mimicking the classic UI patterns and windowing styles of legacy operating systems.

Original Article

RetroMac is an open-source desktop browser that brings back the look & feel of the late-90s and early-2000s browsers, built on modern Chromium.

Design policy

Polaroid's new anti-AI billboard is an important reminder to touch grass

Polaroid is utilizing its 'Go Generation 3' campaign to criticize the environmental impact of AI and data centers through analog-focused messaging.

Summary

What: Creative Director Patricia Varella led a billboard campaign in cities like New York and London, framing analog photography as a human-centric rebellion against the encroachment of AI and cloud-dependent digital media.
Why it matters: Polaroid is positioning its brand identity in opposition to the digital ubiquity of AI, attempting to capture market share from consumers who are increasingly wary of automated creative media.

Original Article

Polaroid's "The Best of Summer is Analog" campaign uses bold, minimalist messaging to encourage people to spend less time on screens and embrace real-world moments in an increasingly AI-driven world.

Design web

Top Five Serif Fonts Designers Love in 2026

Pangram Pangram Foundry highlights five serif typefaces—Kyoto, Lettra Mono, Fragment Serif, Editorial Old, and Editorial New—as top choices for 2026 design projects.

Summary

What: Pangram Pangram Foundry curated a list of five serif fonts for 2026, emphasizing their versatility in branding, editorial, and digital design. The list includes Kyoto, Lettra Mono, Fragment Serif, Editorial Old, and Editorial New, all of which feature extensive language support.
Why it matters: This underscores how specialized foundries are maintaining relevance by positioning their libraries as functional tools for modern digital-first design systems rather than just decorative assets.

Decoder

  • Serif: A small decorative line or stroke attached to the end of a larger stroke in a letter or symbol.
  • Monospaced: A font style where every character occupies the same amount of horizontal space.

Original Article

Pangram Pangram Foundry highlights five serif typefaces designers favor in 2026: Kyoto, Lettra Mono, Fragment Serif, Editorial Old, and Editorial New.

Digest devoured!

Jul 3

Home