Jun 15
AI llm policy

Anthropic Suspended Access to Fable 5 and Mythos 5

Anthropic has suspended its Fable 5 and Mythos 5 models following a US government directive citing national security concerns and potential jailbreak risks.

Summary

What: Anthropic disabled access to Fable 5 and Mythos 5 for all users after the US government issued an export-control directive. While the government claims the models could be jailbroken for cyberattack assistance, Anthropic argues that these vulnerabilities are minor and exist in other publicly available models.
Why it matters: This incident highlights the growing friction between frontier AI labs and national security regulators, specifically regarding the standard of proof required to mandate a model recall.

Decoder

  • Jailbreak: A technique used to bypass an AI model's safety filters, allowing it to produce content or perform tasks it is explicitly programmed to avoid.

Original Article

Statement on the US government directive to suspend access to Fable 5 and Mythos 5

The US government, citing national security authorities, has issued an export control directive to suspend all access to Fable 5 and Mythos 5 by any foreign national, whether inside or outside the United States, including foreign national Anthropic employees. The net effect of this order is that we must abruptly disable Fable 5 and Mythos 5 for all our customers to ensure compliance. Access to all other Anthropic models will not be affected.

We received the directive from the government today at 5:21pm (ET). The letter did not provide specific details of its national security concern. Our understanding is that the government believes it has become aware of a method of bypassing, or “jailbreaking” Fable 5. We reviewed a demonstration of this specific technique being used to identify a small number of previously known, minor vulnerabilities. These vulnerabilities all appear relatively simple, and we have found that other publicly-available models are able to discover them as well without requiring a bypass.

Anthropic’s posture with respect to Fable’s safeguards, as laid out in our launch blog post, is the following:

  • We have instituted strong safeguards that greatly reduce the likelihood that Fable is misused for tasks related to cybersecurity (among others). In fact, our safeguards are so strong that many users have complained that they are overly broad.
  • In the weeks leading up to the launch of Fable, Anthropic worked with the US government, the UK AISI, multiple private third-party organizations and internal teams to red-team Fable’s safeguards for thousands of hours in total.
  • These tests showed that Fable’s safeguards are substantially more effective than those of any previously deployed model.
  • No testers have yet been able to find a universal jailbreak—a jailbreak method that can very broadly bypass the model’s safeguards, unblocking a wide range of cyber capabilities.
  • We suspect that perfect jailbreak resistance is not currently possible for any model provider. Every safeguard used in the industry is vulnerable to non-universal jailbreaks (which can elicit some cyber information in specific circumstances), and it is likely that universal jailbreaks will eventually be found in the future. We stated this clearly when we released Fable 5.
  • Given that perfect jailbreak resistance does not appear to be possible today, Anthropic adopted a defense in depth strategy with Fable 5. We aimed to make jailbreaks either narrow (in the case of non-universal jailbreaks) or very expensive to produce (in the case of universal jailbreaks), and to combine this with thorough monitoring to quickly detect and shut down any successful attacks. This is also why Anthropic has required 30-day retention of customer data with Fable—a policy change that carries real costs for us with customers, but that allows us to research and mitigate jailbreaks.
  • We stand by this defense in depth strategy. It reduces the risks posed by Fable, making them comparable to the risks of existing models already deployed across the industry.
  • We have not even received a disclosure of a concerning non-universal potential jailbreak that led to a harmful result. The potential jailbreaks that have been disclosed to us are either entirely benign responses or are minor findings that provide no Mythos-specific uplift.

To date, the government has only given us verbal evidence of a potential narrow, non-universal jailbreak, which essentially consists of asking the model to read a specific codebase and fix any software flaws. Our understanding is that one potential jailbreak was shared with the government. We have reviewed a report that we believe is the basis of the government's directive and validated that the level of capability displayed there is widely available from other models (including OpenAI’s GPT-5.5), and is used every day by the defenders who keep systems safe. We will share more details over the next 24 hours.

We are complying with the government’s legal directive and are removing access to Fable 5 and Mythos 5 for all users. However, we disagree that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people. If this standard was applied across the industry, we believe it would essentially halt all new model deployments for all frontier model providers.

As we have stated publicly, we believe the government should have the ability to block unsafe deployments, as part of a statutory process that is transparent, fair, clear, and grounded in technical facts. This action does not adhere to those principles.

We apologize for this disruption to our customers. We believe this is a misunderstanding and are working to restore access as soon as possible.

AI llm policy

Amazon CEO's Talks With US Officials Triggered Crackdown on Anthropic Models

Amazon CEO Andy Jassy reportedly held discussions with US officials that preceded the government's order to disable Anthropic’s Fable 5 model.

Summary

What: Amazon researchers reportedly demonstrated that Anthropic's Fable 5 model could provide information helpful for cyberattacks, leading to direct government intervention. Anthropic maintains that these findings are not unique to their models but is complying with the directive.
Why it matters: This reveals how internal competitive intelligence and security research from cloud providers like Amazon can trigger regulatory enforcement against their own AI partners.

Original Article

Conversations between Amazon chief executive Andy Jassy and US officials prompted the Trump administration to halt all foreign use of Anthropic's most capable AI models. Researchers at Amazon had used a series of prompts to get Anthropic's Fable 5 model to provide them with information that could be used to aid cyberattacks. White House officials asked Anthropic to fix the vulnerabilities or take down the model. Anthropic has shut down access to its Mythos and Fable models to comply, but says that the vulnerabilities flagged by Amazon are relatively basic and that other publicly available models are also capable of discovering them.

AI infrastructure performance

Inference cost at scale with napkin math

Calculating the real-world cost of LLM inference is a straightforward exercise in balancing GPU memory bandwidth, compute throughput, and KV-cache management.

Summary

What: The article uses napkin math to demonstrate that a single NVIDIA B200 GPU can serve between 40 and 800 users depending on application duty cycles and prompt lengths. It highlights how optimizations like Grouped-Query-Attention (GQA) and PagedAttention are essential for managing memory and maintaining profitability.
Why it matters: Understanding these hardware-level constraints is critical for engineering teams building LLM-based products to ensure they don't over-provision hardware or misprice their services.
Takeaway: If you are building an LLM application, calculate your per-user cost by estimating memory usage for your expected KV-cache size and assuming a duty cycle of 20% for conversational apps.

Deep Dive

  • GPU Metrics: Peak throughput (TFLOPS) and memory bandwidth (TB/s) are the primary hardware constraints.
  • Matrix Multiplications: Memory access is often the bottleneck, not raw floating-point operations.
  • KV-Cache: Storing keys and values for previous tokens prevents redundant computation but consumes significant VRAM.
  • Grouped-Query-Attention (GQA): Essential for reducing the KV-cache footprint by sharing KV-heads across query-heads.
  • PagedAttention: Allows memory to be fragmented and dynamically allocated, increasing the number of concurrent users per GPU.
  • Duty Cycle: Most users are idle during a chat, meaning a GPU can often handle more concurrent sessions than the 100% load math suggests.

Decoder

  • KV-Cache: A memory buffer storing the keys and values of previous tokens in an LLM, allowing the model to generate the next token without re-processing the entire sequence.
  • Auto-regressive: A model property where the output of one step is fed back as input for the next step, typical of sequence-generating LLMs.
  • FP-8 Quantization: The practice of using 8-bit floating-point numbers instead of 16-bit to reduce VRAM requirements and increase speed, with minimal impact on accuracy.
  • Tiling: A matrix multiplication optimization that processes smaller blocks (tiles) of data at a time to reduce redundant memory accesses.

Original Article

Inference cost at scale with napkin math

If you serve AI models as a part of your product's stack (hard not to in 2026), you've done the math on how much juice you can get out of a single A100/H100/H200/B200/whatever. This directly affects the pricing for subscription-based products.

I want to show that even as models, hardware, and inference engines evolve, the dollar price-per-user remains straightforward to work out on paper. This exercise should also reveal how various optimizations in inference engines help SaaS products remain profitable.

If you were actually working this out on paper, you'd need only the following information:

  1. GPU hardware specs: Memory bandwidth and peak throughput (explained below).
  2. Context length: assumed 200k tokens.
  3. Active parameter count of the model: Assumed 32B to keep things simple on a single GPU.
  4. Some idea about your product: Whether it's driven by user prompts or programmed loops, duty cycle of your user profile (explained at the end), etc.

The specifics of the model architecture matter surprisingly little, unless it's something entirely different like diffusion.

Resources on a single GPU

For any GPU on the market, you can find on its spec sheet two key metrics:

  1. Peak throughput: Number of floating-point operations executed per second. Usually in TFLOP/s (1 TFLOP/s = 10^12 ops/sec).
  2. Memory bandwidth: Amount of data that can be moved from global memory (VRAM) to registers (SRAM). Usually in TB/sec.

We'll assume FP-8 quantization to compute throughput, though it's easy to adjust the math for FP-16 as well.

Cost of a Matrix Multiplication

If you bothered to click on this article, you know that AI models do many matrix multiplications on massive matrices. That we start by finding the cost of a matmul should be no surprise then.

Assume two matrices: A_{N \times d} and B_{d \times M}. Let their product be the matrix O_{N \times M}. From high school algebra, we know that each element of O can be computed as:

O^{i,k} = \sum_{j=1}^{d} A^{i,j} \cdot B^{j,k}

In this, we find our first insight into the "cost" of a matrix multiplication. For each O^{i,k}, we need to start with an initial value of 0 and:

  1. Load A^{i,j} from memory.
  2. Load B^{j,k} from memory.
  3. Multiply them together.
  4. Add the result of #3 to the cumulative sum so far.

And this is done a total of d times per item. So, the cost of a (N,d)*(d,M) matrix product is 2NMd memory accesses and 2NMd floating-point operations.

With an optimization called tiling, the memory access goes down to about d(N+M). The details aren't necessary to proceed.
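
The two cost estimates above are easy to check numerically. The sketch below is illustrative only: the function names are made up for this example, and the tiled figure is the idealized d(N+M) estimate from the text, which assumes every element of A and B is loaded roughly once.

```python
def matmul_flops(n: int, d: int, m: int) -> int:
    """Floating-point ops for (n x d) @ (d x m): one multiply and one add per term."""
    return 2 * n * m * d


def matmul_mem_naive(n: int, d: int, m: int) -> int:
    """Memory accesses without tiling: one load from A and one from B per term."""
    return 2 * n * m * d


def matmul_mem_tiled(n: int, d: int, m: int) -> int:
    """Idealized tiled estimate: each element of A (n*d) and B (d*m) loaded once."""
    return d * (n + m)


n = d = m = 4096
print(matmul_flops(n, d, m))
print(matmul_mem_naive(n, d, m) // matmul_mem_tiled(n, d, m))  # reduction factor
```

For square matrices of size 4096, the idealized tiled estimate is 4096 times smaller than the naive count, while the number of floating-point operations is unchanged.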

An Overview of Language Models

At their core, LLMs are simple – they receive a sequence of N words and generate the N+1th. Each word is represented as a vector with d entries. Using repeated applications of a function called "attention" (explained later), they predict the next word.

A single forward pass roughly looks like this:

y = input() # y = matrix of size N x d
for each layer in the network:
  y = attention(y)

# Convert the final layer's output to word-probs.
# W_vocab = matrix of size d x vocab_len,
# and vocab_len is the number of all words
# in the model's vocabulary.
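token_probs = softmax(y[N-1] * W_vocab)  # last row of y predicts word N+1
next_id     = argmax(token_probs)        # index of the most likely word
next_tok    = embedding(next_id)         # look up that word's vector
# next_tok is a (1 x d) vector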

This is also why LLMs are called auto-regressive. They can keep doing multiple forward passes over their own output until a <stop> token is generated.

Attention in Greater Detail

Let's place the attention function under a magnifying glass.

As you saw, the input is a matrix X \in \mathbb{R}^{N \times d}, and X_i is a single d-dimensional vector. For every "layer" in the network, the model stores matrices W_Q, W_K, W_V \in \mathbb{R}^{d \times d}, and computes "attention" as follows:

Q = X W_Q, \quad K = X W_K, \quad V = X W_V

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d}}\right) V

Or, in python:

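import math
import torch

def attention(X, W_q, W_k, W_v):
    # X: (B, N, d) inputs; W_q, W_k, W_v: (d, d) weights.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    Q_KT = Q @ K.transpose(-2, -1)  # (B, N, N) attention scores
    return torch.softmax(Q_KT / math.sqrt(X.shape[-1]), dim=-1) @ V

Where @ is the matrix product.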

In reality, multiple LLM conversations are processed in parallel, so inference is batched: we process B chats concurrently. This means our input is X \in \mathbb{R}^{B \times N \times d}.

For a batch size of B, we get:

  1. Floating-point operations: 2BNd^{2}.
  2. Memory accesses: Bd(N+d).

Assume N is roughly 200k and d is 8192. Then, to generate one token for a single user, each such projection needs about 27 trillion floating-point ops and 1.7 billion memory accesses. This is with the tiled matmuls.
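
Plugging the stated N and d into the two formulas is a quick sanity check. A minimal sketch (the function names are illustrative, and the counts cover a single N×d by d×d product, as in the formulas above):

```python
def projection_flops(batch: int, n_tokens: int, d_model: int) -> int:
    """2 * B * N * d^2 floating-point operations."""
    return 2 * batch * n_tokens * d_model ** 2


def projection_mem_accesses(batch: int, n_tokens: int, d_model: int) -> int:
    """B * d * (N + d) memory accesses with tiled matmuls."""
    return batch * d_model * (n_tokens + d_model)


flops = projection_flops(1, 200_000, 8192)
mem = projection_mem_accesses(1, 200_000, 8192)
print(f"{flops / 1e12:.1f} trillion floating-point ops")  # 26.8
print(f"{mem / 1e9:.2f} billion memory accesses")         # 1.71
```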

Reducing Compute with KV-Cache

The intermediate output on every chat, namely K and V, is cached at every layer, and stored in a region of VRAM called the KV Cache.

For our napkin math, the existence of KV-cache allows one simplification: for every forward pass, we get to process only the most recently generated word, rather than the entire history. That is, instead of processing an X \in \mathbb{R}^{N \times d}, we process X \in \mathbb{R}^{1 \times d} (the most recent token).

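For a batch size of B and d = 8192, one such projection costs:

  1. Memory accesses: about d^2, or ~67 million, since the weight matrix is loaded once and shared across the batch.
  2. Floating-point operations: 2Bd^2, or ~134 million per user in the batch.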

How much does a token cost?

Let's take the NVIDIA B200 as our leading example. It has a memory bandwidth of 8 TB/s and a peak FP-8 throughput of 4500 TFLOP/s.

So a Blackwell-class GPU can execute about 562 floating-point operations in the time it takes to load one byte. With the KV-cache in place, we're doing 2*B operations per memory access. So, how many users should we serve to fully exhaust a B200's compute and bandwidth budget?

2B = 562 \implies B = 281

With a single NVIDIA B200 GPU, we should be serving about 281 users concurrently to get the most out of our investment.
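
The break-even batch size follows directly from the two spec-sheet numbers. A small sketch, assuming one byte per memory access (FP-8) and 2 operations per access per user, as stated above. The function name is illustrative:

```python
def break_even_batch(peak_tflops: float, bandwidth_tb_per_s: float,
                     ops_per_access_per_user: float = 2.0) -> int:
    """Batch size at which compute and memory bandwidth saturate together."""
    ops_per_byte = peak_tflops / bandwidth_tb_per_s  # 4500 / 8 = 562.5
    return int(ops_per_byte / ops_per_access_per_user)


print(break_even_batch(4500, 8))  # 281
```

Below this batch size the GPU waits on memory. Above it, arithmetic becomes the limit.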

How many users can you serve realistically?

We'll assume a 32B dense model. At FP-8 (one byte per parameter), this is 32 GB in VRAM. Let's assume a context window of N=200k tokens. For each layer, we need to store 2Nd bytes for a pair of K and V matrices. With a model of d=8192 and L=64 layers, that is 210 GB of VRAM for a single chat, more than the GPU has.

Using Grouped-Query-Attention (GQA) cuts down the KV cache size by about 8x. With GQA, our KV-cache is now at ~26GB per chat sequence. We're already using 32GB for weights, so how many concurrent chat contexts can we store in the remaining 160GB? That's 160/26 = 6.
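
The sizing above can be reproduced in a few lines. This is a sketch under the article's assumptions (FP-8 values, an 8x GQA reduction). The 192 GB total is inferred from 32 GB of weights plus the 160 GB said to remain, and the names are illustrative:

```python
def kv_cache_bytes(n_tokens: int, d_model: int, n_layers: int,
                   bytes_per_value: int = 1, gqa_reduction: int = 1) -> int:
    """Bytes of K and V kept for one chat: 2 * N * d per layer, shrunk by GQA."""
    return 2 * n_tokens * d_model * n_layers * bytes_per_value // gqa_reduction


GB = 10**9
VRAM_BYTES = 192 * GB    # assumption: 32 GB of weights + 160 GB left over
WEIGHT_BYTES = 32 * GB   # 32B parameters at one byte each (FP-8)

full = kv_cache_bytes(200_000, 8192, 64)
with_gqa = kv_cache_bytes(200_000, 8192, 64, gqa_reduction=8)
contexts = (VRAM_BYTES - WEIGHT_BYTES) // with_gqa

print(f"full KV-cache: {full / GB:.0f} GB")          # 210 GB
print(f"with GQA:      {with_gqa / GB:.1f} GB")      # 26.2 GB
print(f"full-length contexts that fit: {contexts}")  # 6
```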

Optimizing for hundreds of users on a GPU

Most contexts will never reach the 200k token limit. We can split the KV-cache into chunks, and then allocate those chunks to different users as their token use increases. Conversation threads that are abandoned/cold can be flushed out of the cache. Depending on the median user activity, you can serve anywhere between 40-60 users per Blackwell chip.
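
The chunked allocation described here can be sketched as a small page allocator. This is a toy illustration, not how any particular inference engine implements it. The class name, method names, and page size are all assumptions made for the example.

```python
class PagedKVAllocator:
    """Toy allocator: hands out fixed-size KV pages as chats grow."""

    def __init__(self, total_pages: int, page_tokens: int):
        self.page_tokens = page_tokens
        self.free_pages = list(range(total_pages))
        self.pages_by_chat = {}   # chat id -> list of page numbers
        self.tokens_by_chat = {}  # chat id -> tokens stored so far

    def append_tokens(self, chat_id: str, n_tokens: int) -> bool:
        """Grow a chat by n_tokens. Returns False if the pool is exhausted."""
        pages = self.pages_by_chat.setdefault(chat_id, [])
        total = self.tokens_by_chat.get(chat_id, 0) + n_tokens
        needed = -(-total // self.page_tokens) - len(pages)  # ceiling division
        if needed > len(self.free_pages):
            return False
        for _ in range(needed):
            pages.append(self.free_pages.pop())
        self.tokens_by_chat[chat_id] = total
        return True

    def evict(self, chat_id: str) -> None:
        """Flush a cold chat and return its pages to the pool."""
        self.free_pages.extend(self.pages_by_chat.pop(chat_id, []))
        self.tokens_by_chat.pop(chat_id, None)


alloc = PagedKVAllocator(total_pages=4, page_tokens=16)
alloc.append_tokens("chat-a", 20)  # takes 2 pages
alloc.append_tokens("chat-b", 32)  # takes the other 2 pages
alloc.evict("chat-b")              # frees 2 pages for other chats
```

Because pages are claimed only as a chat grows, short conversations occupy a small fraction of the memory that a full 200k-token reservation would.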

Remember that the nature of your product matters too. In most ChatGPT-style apps, the user spends more time reading than prompting. For a median chat session, a user will likely have 80% idle time. So realistically, one chip can serve ~300-800 users comfortably depending on the style of app.

Tokens Per Second

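Earlier, we saw that we can comfortably support 6 users at 100% duty cycle. Each decode step has to stream the 32 GB of weights plus 6 × 26 GB of KV-cache from VRAM, about 188 GB in total, which takes roughly 24 ms at 8 TB/s. So every 24 ms, we generate B=6 tokens, one per user. Over 1 s, that is roughly 250 tokens for 6 users, or about 40 tokens per user per second. 40 tokens per second is beyond most people's reading speed.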
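
One way to arrive at these figures is to treat each decode step as memory-bound: every step streams the model weights and each active chat's KV-cache from VRAM once. The sketch below makes that assumption explicit. It is a simplification that ignores compute time and kernel overheads, and it reuses the 32 GB, 26 GB, and 8 TB/s figures from the earlier sections. The function name is illustrative.

```python
def decode_step_seconds(weight_bytes: float, kv_bytes_per_chat: float,
                        batch: int, bandwidth_bytes_per_s: float) -> float:
    """Time for one decode step if memory traffic is the only cost."""
    return (weight_bytes + batch * kv_bytes_per_chat) / bandwidth_bytes_per_s


batch = 6
step = decode_step_seconds(32e9, 26e9, batch, 8e12)
tokens_per_user = 1 / step
total_tokens = batch / step

print(f"{step * 1e3:.1f} ms per step")              # about 23.5 ms
print(f"{tokens_per_user:.0f} tokens/s per user")   # about 43
print(f"{total_tokens:.0f} tokens/s in total")      # about 255
```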

Dollar cost per user

This largely depends on whether you own or rent your hardware. At $40,000 per B200 and 500 users per GPU, the hardware costs about $80 per user over its lifetime, plus the datacenter/upkeep bill.

If you rent the GPU, the cost is more straightforward. At $3 per hour for a B200, your cost per user per hour is 3/num_users. For num_users=500, you get a cost of about $0.006 per user per hour, or $4.32 per month. As long as you charge them more than $4.32, your operating costs are covered.
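
Both cost models reduce to one division each. A minimal sketch using the prices quoted above; the function names are illustrative, and the 720-hour month (30 days) is the assumption implied by the $4.32 figure:

```python
def owned_cost_per_user(gpu_price_usd: float, users_per_gpu: int) -> float:
    """Up-front hardware cost attributed to each user."""
    return gpu_price_usd / users_per_gpu


def rented_cost_per_user_month(hourly_rate_usd: float, users_per_gpu: int,
                               hours_per_month: int = 720) -> float:
    """Monthly rental cost attributed to each user."""
    return hourly_rate_usd / users_per_gpu * hours_per_month


print(f"${owned_cost_per_user(40_000, 500):.2f} per user, owned")            # $80.00
print(f"${rented_cost_per_user_month(3.0, 500):.2f} per user-month, rented")  # $4.32
```

Changing users_per_gpu shows how sensitive both figures are to the duty-cycle estimate.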

AI infrastructure performance llm

MiniMax Sparse Attention for Million-Token Contexts (GitHub Repo)

MiniMax's new sparse attention architecture achieves a 30x compute reduction at 1M tokens by using Top-k block selection.

Summary

What: MiniMax released MSA (MiniMax Sparse Attention) for NVIDIA SM100 GPUs, allowing models to process million-token contexts by only attending to the most relevant blocks of the KV cache. The library supports FP8, BF16, and NVFP4 formats and integrates via a JIT-compiled Python interface.
Why it matters: Optimized sparse kernels are becoming critical as memory and compute costs for ultra-long context windows become the primary bottleneck for deploying frontier agentic models.
Takeaway: If you are running million-token contexts on NVIDIA SM100 hardware, test the `sparse_topk_select` functionality in the MiniMax MSA repository to reduce attention compute overhead.

Decoder

  • Sparse Attention: A technique that reduces attention compute by only calculating scores for a subset of tokens or blocks rather than the entire sequence.
  • KV Cache: Memory storage for Keys and Values in transformers, which grows linearly with context length and is a major memory consumer.
  • SM100: The NVIDIA GPU architecture target with CUDA compute capability 10.0 (Blackwell-generation datacenter GPUs such as the B200).
  • JIT (Just-In-Time) compilation: Compiling code at runtime rather than beforehand, allowing for optimized kernels specific to the current hardware and input parameters.

Original Article

MiniMax Sparse Attention (MSA)

MSA (fmha_sm100) ships dense FlashAttention and sparse top-k attention kernels for NVIDIA SM100.

Algorithm reference: MiniMax Sparse Attention paper.

Two JIT-compiled stacks share one Python package:

Stack | Path | What it gives you
csrc JIT | python/fmha_sm100/csrc/ | Dense FMHA (fmha_sm100, fmha_sm100_plan) + sparse_topk_select indexer, compiled from Jinja templates by jit.py at runtime.
CuTe-DSL | python/fmha_sm100/cute/ | Full sparse attention (forward + paged FP8 decode, BF16 / FP8 / NVFP4 / FP4), compiled at runtime via cute.compile.
Bridge | python/fmha_sm100/sparse_fmha_adapter.py | Adapts the fmha_sm100 API to call sparse_atten_func for sparse prefill paths.

License: MIT. Self-authored files carry SPDX-License-Identifier: MIT. See LICENSE and NOTICE. Bundled / derived third-party code retains its own license — see Third-party licenses.

Requirements

  • GPU: NVIDIA SM100.
  • Toolchain: CUDA Toolkit with nvcc on PATH (or CUDA_HOME / CUDA_PATH set).
  • Python: ≥ 3.10.
  • OS: Linux x86_64 (aarch64 untested; JIT builds may need small Makefile edits on WSL).

Quick sanity check before installing:

nvcc --version                # expect ≥ 12.x
nvidia-smi --query-gpu=compute_cap --format=csv | grep "10.0"  # confirm SM100
python -c "import sys; print(sys.version_info[:2])"              # ≥ (3, 10)

Using with the kernels library

To quickly get started using MSA kernels, you can use the kernels library:

# make sure `kernels` is installed: `pip install -U kernels`
from kernels import get_kernel

kernel_module = get_kernel("MiniMaxAI/msa", version=0)
sparse_atten_func = kernel_module.sparse_atten_func

sparse_atten_func(...)

Install

# --recursive pulls the NVIDIA CUTLASS submodule (python/fmha_sm100/cutlass/),
# whose headers are required for JIT/AOT compilation.
git clone --recursive https://github.com/MiniMax-AI/MSA.git msa
cd msa
# If you cloned without --recursive:
#   git submodule update --init --recursive
pip install .           # standard install (works from a wheel too)
# or
pip install -e .        # editable install for development

This pulls in the CuTe-DSL stack via nvidia-cutlass-dsl and quack-kernels; the csrc kernels are JIT-compiled at first import from sources shipped inside the package.

Verify

Run a small CUDA smoke test. The first run JIT-compiles sparse_topk_select, which takes anywhere from 30 s to a few minutes on a cold nvcc cache. This is normal, not a hang. Subsequent runs hit the JIT cache and finish in seconds.

python tests/smoke/test_sparse_topk_forced.py

Usage

import torch
from fmha_sm100 import fmha_sm100, fmha_sm100_plan, sparse_topk_select

# Page size and top-k for the sparse prefill path.
page_size, topk = 128, 16

# Dense proxy pass: compute per-block max score from a cheap Q slice.
proxy_plan = fmha_sm100_plan(
    qo_lens, kv_lens, proxy_q.shape[1],
    num_kv_heads=1,
    page_size=page_size,
    output_maxscore=True,
)
_, max_score = fmha_sm100(
    proxy_q, proxy_k_pages, proxy_v_pages, proxy_plan,
    kv_indices=kv_indices,
    output_o=False,
    output_maxscore=True,
)

# max_score -> sparse KV block indexes.
kv_block_indexes = sparse_topk_select(
    max_score.contiguous(), topk, num_valid_pages=num_pages,
)

# Sparse attention with the selected blocks.
sparse_plan = fmha_sm100_plan(
    qo_lens, kv_lens, q.shape[1],
    num_kv_heads=k_pages.shape[1],
    page_size=page_size,
    kv_block_num=topk,
)
out, _ = fmha_sm100(
    q, k_pages, v_pages, sparse_plan,
    kv_indices=kv_indices,
    kv_block_indexes=kv_block_indexes,
)
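
Conceptually, the selection step in the middle of this pipeline keeps only the highest-scoring KV blocks for each query. The pure-Python sketch below illustrates that idea on a plain list of per-block scores. It is not the library's kernel: the function name topk_block_indexes is hypothetical, and returning the indexes in ascending order is a choice made for this example rather than documented behavior of sparse_topk_select.

```python
def topk_block_indexes(block_scores: list, topk: int) -> list:
    """Indexes of the topk highest-scoring blocks, in ascending position order."""
    ranked = sorted(range(len(block_scores)),
                    key=lambda i: block_scores[i], reverse=True)
    return sorted(ranked[:topk])


# Four KV blocks with one proxy score each; keep the best two.
scores = [0.1, 0.9, 0.3, 0.7]
print(topk_block_indexes(scores, 2))  # [1, 3]
```

Attention then runs over the selected blocks only, which is where the compute saving comes from.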

Test

# Fast smoke tests.
python -m pytest tests/smoke -q

# API and end-to-end integration tests.
python -m pytest tests/integration -q
python tests/integration/test_proxy_kv_e2e.py

# Large regression suites.
python tests/regression/test_correctness.py
python tests/regression/test_sparse_attn.py

# CuTe-DSL forward-only sparse attention.
cd python/fmha_sm100/cute
python -m pytest test_sparse_atten.py -q

Benchmark

benchmarks/bench_sparse_attention_ops.py covers dense prefill, paged prefill, sparse prefill, dense decode, paged decode, sparse decode, in fp8 and bf16 (nvfp4 is sparse-prefill only).

python benchmarks/bench_sparse_attention_ops.py --help     # full flag list

Layout

python/fmha_sm100/                  Python package
  __init__.py                       Public re-exports (lazy for the CuTe-DSL stack)
  api.py                            fmha_sm100 / fmha_sm100_plan / sparse_topk_select
  jit.py                            Runtime JIT (nvcc + ninja) for the csrc stack
  sparse.py                         Lazy shim that loads the cute/ stack
  sparse_fmha_adapter.py            Bridge: fmha_sm100 API → sparse_atten_func
  csrc/                             CUDA kernels + Jinja templates (JIT-compiled)
    include/                        Vendored FlashInfer / CUTLASS-derived / TRT-LLM headers
  cutlass/                          NVIDIA CUTLASS git submodule (include/ + tools/util/include/)
  cute/                             CuTe-DSL sparse attention (loaded via sys.path)
tests/                              Correctness tests
  smoke/  integration/  regression/
scripts/                            Warmup + cache-management helpers
benchmarks/                         bench_sparse_attention_ops.py

Stacks

  • csrc JIT: dense FlashAttention, page KV, and sparse_topk_select indexer. Compiled at runtime from csrc/*.cu.jinja plus csrc/include/. Public entry: fmha_sm100.plan → run.
  • CuTe-DSL: block-sparse prefill, FP8 / NVFP4 / FP4 quantization, paged FP8 decode (SparseDecodePagedAttentionWrapper), FP4 block-score indexer. Public entry: fmha_sm100.sparse_atten_func, fmha_sm100.sparse_decode_atten_func, fmha_sm100.fp4_indexer_block_scores.
  • Bridge: sparse_fmha_plan / sparse_fmha adapt the dense-API call site to the sparse backend for prefill paths; useful when you already drive the dense kernel and want a one-line swap to sparse.

Third-party licenses

fmha_sm100 bundles, derives from, or depends on the third-party components below. Each retains its original license; this section summarizes them.

Vendored / derived source (shipped in this repo)

Component | License | Where
NVIDIA CUTLASS | BSD-3-Clause | Git submodule at python/fmha_sm100/cutlass/ (provides include/ + tools/util/include/), plus BSD-3-tagged headers under python/fmha_sm100/csrc/include/.
FlashInfer | Apache-2.0 | Headers and sources under python/fmha_sm100/csrc/ and python/fmha_sm100/csrc/include/.
NVIDIA TensorRT-LLM + NAVER Corp (CLOVA) | Apache-2.0 | Portions of python/fmha_sm100/csrc/include/sparse_topk_select.cuh.

Citation

@software{msa2026,
  title  = {MiniMax Sparse Attention (MSA): FlashAttention and block-sparse
            attention kernels for NVIDIA SM100},
  author = {{MiniMax}},
  year   = {2026},
  url    = {https://github.com/MiniMax-AI/MSA}
}

Contributing

Issues and PRs welcome on the issue tracker. For kernel or runtime-contract changes, open an issue first to align on the public surface.

DevOps data infrastructure rust

Ingesting the Milky Way: Petabyte-Scale with Zerobus Ingest

Databricks' new Zerobus Ingest service sustains 12 GB/s throughput to a single table, eliminating the need for Kafka-based streaming infrastructure.

Summary

What: Databricks has reached general availability for Zerobus Ingest, a serverless streaming API. It achieved 1 petabyte of data ingestion in under 24 hours during benchmarks using NASA's NEOWISE dataset. The system uses a custom zero-copy protobuf parser written in Rust and replaces static partitioning with connection-level stream ordering.
Why it matters: By removing the need for manual Kafka management, Databricks is aggressively commoditizing the streaming layer for lakehouse architectures.
Takeaway: If you are running Kafka primarily to feed a Delta lake, evaluate switching to Zerobus to reduce infrastructure overhead.

Deep Dive

  • Dynamic Partitioning: Scales compute resources automatically based on stream connections rather than static broker partitions.
  • Zero-copy parsing: Uses custom Rust-based protobuf decoding to avoid memory allocations and buffer copying.
  • Performance: Achieved 12 GB/s per table throughput in benchmarks.
  • Ordering Guarantees: Moves ordering guarantees from individual partitions to the stream connection level.
  • Managed Service: Fully integrated with Unity Catalog and serverless compute.

Decoder

  • Zero-copy: A technique where data is processed without being copied between different memory areas, significantly improving performance.
  • Delta Table: A storage layer that brings reliability and ACID transactions to data lakes.
  • Protobuf: A language-neutral, platform-neutral extensible mechanism for serializing structured data.

Original Article

  • Databricks Zerobus Ingest is a serverless streaming API that enables teams to instantly deploy petabyte-scale data pipelines without manual infrastructure management.
  • Zerobus’ architecture relies on dynamic partitioning to automatically scale compute resources, efficiently handling unpredictable data volumes without complex tuning.
  • This zero-setup framework easily processes massive workloads, demonstrating the ability to sustain over 12 GB/s throughput to a single table during 24-hour benchmarks.

Telemetry data is everywhere. IoT sensors on factory floors. Satellite arrays scanning the atmosphere. Autonomous vehicles logging thousands of events per second. Every one of these systems has the same underlying problem: a continuous, high-volume stream of time-series observations that needs to land somewhere queryable. It needs to be fast and reliable, without an engineering team spending weeks tuning and maintaining the kind of infrastructure that is typical of Kafka-based workloads.

That's the problem Zerobus Ingest is built to solve. Zerobus is Databricks' fully managed, serverless streaming ingest service. It's a push-based API that accepts data from any producer and writes it directly into Delta tables, governed by Unity Catalog.

  • No infrastructure to provision.
  • No connector pipeline to maintain.
  • No partitions or broker decision-making.

Instead, you create a table and push data. It lands in your lakehouse, ready to query in seconds. You no longer need to run Kafka as a pipe when your destination is the lakehouse.

We used NASA’s NEOWISE dataset, representing 200 billion data points over 11 years, to benchmark Zerobus Ingest, ingesting 1 petabyte in under 24 hours, with zero pre-configuration and stable latency.

By ingesting 1PB within 24 hours, we demonstrate Zerobus’s ability to maintain continuous throughput of 12 GB/s to a single table!
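
The throughput figure can be checked with one division. A minimal sketch, assuming a decimal petabyte (10^15 bytes) and a full 24-hour window. Finishing in under 24 hours pushes the average rate above this floor, which is consistent with the quoted 12 GB/s. The names are illustrative.

```python
PETABYTE = 10**15  # assumption: decimal petabyte


def sustained_gb_per_s(total_bytes: int, seconds: int) -> float:
    """Average rate in GB/s needed to move total_bytes in the given time."""
    return total_bytes / seconds / 1e9


print(f"{sustained_gb_per_s(PETABYTE, 24 * 3600):.2f} GB/s")  # 11.57 GB/s
```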

This post walks through three of our design decisions that made this possible.

  • Designing a system that autoscales via dynamic partitioning.
  • Building our own zero-copy protobuf decoder.
  • Implementing a latency-optimized write-ahead log before data is published to the lakehouse.

Our key design decisions

Our aspiration was to build a streaming system that could support petabyte-scale and auto-scale to handle fluctuating ingestion patterns.

Traditional streaming architectures require you to decide how many brokers and partitions a given workload needs. This requires knowledge of peak load and consumer ingestion constraints, as well as forecasting and an understanding of the end-to-end pipeline.

By going back to first principles, we designed and built a system that scales to handle petabyte-sized workloads for data producers “magically.”

Autoscaling achieved through dynamic partitioning

The problem we were trying to solve was how to autoscale efficiently enough to achieve elastic, “limitless” scaling.

Our thesis was that by moving away from static partitioning and toward the logical unit of a stream/connection, we could unlock true autoscaling and rebalancing while maintaining ordering guarantees, which are important for consumption workloads.

The static partition problem

In message bus architectures, partitions are the unit of both parallelism and ordering. This coupling creates a constraint that can be painful once you have consumers who depend on it.

Ordering is typically a per-partition guarantee, not per-producer. The number of partitions and the distribution of data across them affect a consumer's ability to keep up with ingestion.

We moved the ordering guarantee to the stream connection

In traditional systems, ordering is a partition-level guarantee. In Zerobus Ingest, ordering is a stream connection-level guarantee.

When a producer opens a stream with Zerobus (a connection to our server), they're registering a logical identity with the service. For the lifetime of that connection, their data arrives in order, regardless of which “partition” pod processes it.

Hot routing and true autoscaling

Internally, Zerobus Ingest distributes streams across a pool of pods. Routing is heuristic-based: if a pod is running hot, new incoming streams are routed to a different pod. The producer is unaware. Their ordering guarantee is unaffected.

Ordering lives at the stream level, which means pods can be added when demand spikes and removed when demand drops. Existing streams then drain gracefully, and new streams stop routing there. The pool then shrinks, keeping compute utilization efficient.
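
The routing and drain behavior described above can be sketched in a few dozen lines. This is a hypothetical illustration, not Zerobus's implementation: the Pod and StreamRouter names, the stream-count definition of "hot", and the choice to pin each stream to one pod for its lifetime are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    hot_at: int = 2          # hypothetical: stream count at which a pod is "hot"
    draining: bool = False   # draining pods finish old streams, accept no new ones
    streams: set = field(default_factory=set)

    @property
    def is_hot(self) -> bool:
        return len(self.streams) >= self.hot_at


class StreamRouter:
    def __init__(self, pods):
        self.pods = list(pods)
        self.pod_of = {}  # stream id -> pod, fixed for the connection's lifetime

    def open_stream(self, stream_id: str) -> Pod:
        live = [p for p in self.pods if not p.draining]
        if not live:
            raise RuntimeError("no pod is accepting new streams")
        cool = [p for p in live if not p.is_hot]
        # Prefer the least-loaded pod that is not hot; fall back to any live pod.
        pod = min(cool or live, key=lambda p: len(p.streams))
        pod.streams.add(stream_id)
        self.pod_of[stream_id] = pod
        return pod

    def close_stream(self, stream_id: str) -> None:
        self.pod_of.pop(stream_id).streams.discard(stream_id)

    def scale_down(self, pod: Pod) -> None:
        pod.draining = True  # existing streams drain; new ones route elsewhere


a, b = Pod("pod-a"), Pod("pod-b")
router = StreamRouter([a, b])
for sid in ("s1", "s2", "s3", "s4"):
    print(sid, "->", router.open_stream(sid).name)
```

Because the routing decision is made once per stream rather than per record, adding or removing pods never reorders data within an open connection in this sketch.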

Zero-copy high-performance data handling

Zerobus's main goal is to allow an efficient, row-by-row transfer of data streams of any volume. To achieve this, we needed to completely avoid any needless copying and memory allocations, from the input formats that clients send to Zerobus through to the internal formats that guarantee durability and open Delta formats.

The result is zeroparser, which bridges this gap with single-pass parsing and zero memory allocations, sustaining protobuf parsing throughput of ~1 GB/s per CPU core even with dynamic descriptors and complex schemas.
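The internals of zeroparser are not described here, so the following is only a sketch of the general idea: walking a buffer once and handing out views into it instead of copying payload bytes. The framing (a varint length prefix before each record) and the function names are assumptions, and Python itself still allocates small objects, so this illustrates the technique rather than the performance.

```python
def read_varint(buf, pos):
    """Decode a base-128 varint starting at pos; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if not (byte & 0x80):
            return value, pos
        shift += 7


def iter_frames(data):
    """Yield each length-prefixed record as a view into the original buffer."""
    view = memoryview(data)
    pos = 0
    while pos < len(view):
        length, pos = read_varint(view, pos)
        end = pos + length
        if end > len(view):
            raise ValueError("truncated record")
        # Slicing a memoryview references the same memory; no payload bytes are copied.
        yield view[pos:end]
        pos = end


# Two records: 3 bytes, then 300 bytes (0xAC 0x02 is the varint encoding of 300).
stream = bytes([3]) + b"abc" + bytes([0xAC, 0x02]) + b"x" * 300
frames = list(iter_frames(stream))
```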

Write Ahead Log

Streaming is not just about being able to handle high-throughput workloads. To be a true streaming service, you also need to support message handoff as quickly as possible. This low latency of handing off data is what truly distinguishes streaming workloads from batch.

To support this low-latency handoff with a durability guarantee, Zerobus implements a latency-optimized write-ahead log (WAL). Once messages are durable, Zerobus sends an acknowledgement back to the client.
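The durable-then-acknowledge ordering can be sketched as below. This is a minimal, file-based illustration under stated assumptions: the `WriteAheadLog` class, the 4-byte length prefix, and the use of the return value as the "acknowledgement" are all hypothetical and say nothing about how Zerobus actually stores or replicates its log.

```python
import os
import struct
import tempfile


class WriteAheadLog:
    """Hypothetical append-only log: a record is acknowledged only after fsync."""

    def __init__(self, path):
        self._file = open(path, "ab")

    def append(self, payload: bytes) -> int:
        # Length-prefix each record so the log can be replayed after a crash.
        offset = self._file.tell()
        self._file.write(struct.pack(">I", len(payload)) + payload)
        self._file.flush()
        os.fsync(self._file.fileno())  # ask the OS to push the bytes to stable storage
        return offset  # returning here plays the role of the acknowledgement

    def close(self):
        self._file.close()


def replay(path):
    """Read back every complete record in write order."""
    records = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack(">I", header)
            records.append(f.read(length))
    return records


log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
wal = WriteAheadLog(log_path)
acks = [wal.append(b"row-1"), wal.append(b"row-22")]
wal.close()
```

The design point is the order of operations: the caller hears back only after the sync call returns, so an acknowledged record can be recovered by `replay` after a restart.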

Proof: Ingesting the Milky Way

The key to benchmarking a system is understanding how it would be used in a production setting, and then emulating that behavior and usage.

Why Locust? The Fan-In Problem

Zerobus Ingest is built to aggregate streams from many independent producers into a single destination table. Its throughput scales with the number of concurrent open streams.

The results

Our test results showed Zerobus Ingest’s ability to sustain 12 GB/s to a single table over a 24-hour period from 2,048 concurrent workers. Over this period, Zerobus ingested over a trillion records.
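A quick back-of-the-envelope check puts these figures in context. The calculation assumes decimal gigabytes, a rate of exactly 12 GB/s held for the full 24 hours, and an even split across workers; none of these are stated in the results, so treat the outputs as rough estimates.

```python
GB = 10**9  # assumption: decimal gigabytes
throughput_bytes_per_s = 12 * GB
seconds_per_day = 24 * 60 * 60
workers = 2048
records = 10**12  # "over a trillion", so this is a lower bound

total_bytes = throughput_bytes_per_s * seconds_per_day
per_worker_bytes_per_s = throughput_bytes_per_s / workers
max_avg_record_bytes = total_bytes / records

print(f"total per day: {total_bytes / 10**15:.2f} PB")  # 1.04 PB
print(f"per worker: {per_worker_bytes_per_s / 10**6:.2f} MB/s")  # 5.86 MB/s
print(f"average record size: at most {max_avg_record_bytes:.0f} bytes")  # 1037 bytes
```

Under those assumptions the run moves roughly a petabyte per day, each worker contributes only a few megabytes per second, and the average record is about a kilobyte or smaller.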

What's next

Zerobus Ingest is now Generally Available on Databricks and ready for all your production workloads.

On the roadmap:

  • Kafka Producer API support
  • MQTT API support
  • Rescue column
  • System metadata column
  • Avro support
DevOps aikubernetes

Diagnose EKS Node Issues Faster with AWS DevOps Agent and Custom MCP

AWS released a custom MCP server allowing the AWS DevOps Agent to perform deep, node-level diagnostics on EKS clusters without manual SSH access.

Summary

What: The new EKS node diagnostics MCP server allows the AWS DevOps Agent to interface with node-level data sources like iptables, kernel logs, and CNI configurations using SSM Automation runbooks. It was demonstrated by identifying a fault-injected DNS failure caused by iptables rules, which was otherwise invisible to standard Kubernetes monitoring.
Why it matters: This pushes autonomous troubleshooting into the operating system layer of cloud infrastructure, moving AI agents beyond simple application-level observability.
Takeaway: Deploy the reference implementation from the `aws-samples` repository in a non-production cluster to test how the agent handles complex, node-level networking issues.

Deep Dive

  • Diagnostic Architecture: Uses SSM Automation to gather logs across 20+ sources (iptables, CNI state, kernel logs).
  • Abstraction: Wraps node-level data in MCP tools to avoid granting the agent direct shell access.
  • Workflow: The agent sequences discovery, logs collection, healthy-node comparison, and root cause synthesis.
  • Visibility Gap: Addresses issues where pods report status as 'Running' while infrastructure-level rules (like iptables) disrupt actual traffic.

Decoder

  • Model Context Protocol (MCP): A standardized interface enabling AI agents to read data and execute tools across disparate systems.
  • SSM Automation: A service for creating automated workflows to manage and maintain AWS resources without direct shell access.

Original Article

Diagnose EKS Node Issues Faster with AWS DevOps Agent and Custom MCP

AWS DevOps Agent can investigate a growing range of production incidents autonomously. It diagnoses CrashLoopBackOff failures, traces ConfigMap deletions through audit logs, and correlates Amazon CloudWatch metrics with cluster events — all without human intervention.

But AWS DevOps Agent has a visibility boundary. When the data it needs lives outside its native integrations — on a node’s operating system, inside a third-party monitoring tool, behind a database’s internal diagnostics — the agent stalls. It can describe symptoms, but it can’t reach the evidence needed to identify root causes.

This post shows how to extend AWS DevOps Agent by building a custom Model Context Protocol (MCP) server that bridges that gap. Using a concrete example, we give AWS DevOps Agent structured access to Amazon EKS worker node diagnostics and explain how the same approach applies to data sources the agent can’t natively reach. By the end of this walkthrough, you will have a working MCP server that gives AWS DevOps Agent access to 20+ node-level log sources, providing autonomous investigation capabilities that can assist in root cause analysis without manual SSH sessions.

Prerequisites

Before you begin, make sure you have the following:

  • An Amazon EKS cluster with AWS Systems Manager Agent (SSM Agent) running on the worker nodes (included by default on Amazon EKS optimized AMIs)
  • Node.js v18 or later
  • AWS CLI v2
  • AWS CDK v2 installed and bootstrapped in your target account and Region
  • An AWS account with permissions to create IAM roles, Lambda functions, and Amazon S3 buckets
  • Familiarity with Amazon EKS, AWS Systems Manager, and the Model Context Protocol (MCP)

How AWS DevOps Agent discovers custom tools through MCP

MCP is an open standard that defines how AI agents discover and invoke external tools. AWS DevOps Agent supports connecting to custom MCP servers, which means you can expose new capabilities to it without modifying the agent itself. When you connect an MCP server to AWS DevOps Agent, the agent automatically discovers the available tools, understands their schemas, and calls them as part of its investigation workflow. You build and connect the MCP server — the agent handles the rest.

The extensibility model follows three steps: first, identify the data source that AWS DevOps Agent cannot natively access; second, build an MCP server that wraps safe, structured access to that data source; and third, connect the MCP server to AWS DevOps Agent so it can incorporate the new tools into its investigations.

Three design principles make this work. Return structured data, not raw text — pre-index findings with severity levels and stable IDs so the agent can filter, reference, and correlate them. Never give the agent a shell — mediate interactions through a controlled, auditable execution model. Make tools composable — design tool outputs to serve as inputs to other tools, creating a chain of evidence the agent can follow.
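The first principle, structured findings with severity levels and stable IDs, might look something like the sketch below. The `Finding` shape, the `F-` prefix, and the hash-based ID scheme are assumptions for illustration; the reference implementation may index findings differently.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    finding_id: str
    severity: str
    source: str
    message: str


def make_finding(severity, source, message):
    # Derive the ID from the content so the same finding always gets the same ID.
    digest = hashlib.sha256(f"{source}|{message}".encode()).hexdigest()[:8]
    return Finding(f"F-{digest}", severity, source, message)


def filter_by_severity(findings, severity):
    return [f for f in findings if f.severity == severity]


findings = [
    make_finding("CRITICAL", "iptables", "DROP rule in FORWARD chain"),
    make_finding("MEDIUM", "routes", "unexpected blackhole route"),
]
critical_ids = [f.finding_id for f in filter_by_severity(findings, "CRITICAL")]
```

A content-derived ID lets a later tool call refer back to the same finding without passing raw log text around, which is what makes the outputs composable.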

Why Amazon EKS node OS visibility matters

AWS DevOps Agent integrates with Amazon EKS to inspect pod status, read container logs, query CloudWatch Container Insights, and correlate cluster events. This covers application crashes, container-level resource exhaustion, and configuration drift.

However, node-related EKS production issues often originate in a layer these tools cannot reach: the node operating system. Artifacts such as iptables rules, full CNI configuration and IPAMD state, route tables, conntrack entries, dmesg kernel messages, containerd runtime logs, sysctl parameters, ENI metadata, and the unfiltered kubelet journal exist exclusively on the node. These artifacts are the primary evidence for diagnosing IP allocation failures, DNS resolution issues, network policy enforcement problems, storage mount timeouts, and node registration failures.

Integrating AWS DevOps Agent with an EKS node diagnostics MCP server

The sample-eks-node-diagnostics-mcp repository demonstrates this pattern. It provides an MCP server that gives AWS DevOps Agent structured access to node-level diagnostic data, backed by AWS Systems Manager (SSM) Automation for safe, auditable execution.

How it works

Figure 1: End-to-end architecture of the EKS Node Diagnostics MCP server. AWS DevOps Agent discovers and invokes 19 tools through AgentCore Gateway, which dispatches SSM Automation runbooks to worker nodes for log collection and uploads results to Amazon S3 for extraction and indexing.

  1. AWS DevOps Agent calls a collect tool with an instance ID.
  2. The MCP server dispatches an SSM Automation execution to the target node, running the AWS-managed AWSSupport-CollectEKSInstanceLogs runbook.
  3. The runbook collects 20+ log sources — kubelet, containerd, iptables, CNI config, route tables, dmesg, sysctl, ENI metadata, IPAMD logs, and more — packages them into an archive, and uploads it to an Amazon S3 bucket where you configure AWS KMS encryption.
  4. A processing pipeline extracts the archive, pre-indexes errors with severity classification and stable finding IDs, and exposes the results through additional MCP tools.

The server exposes tools for log collection, pre-indexed error retrieval, cross-file search and correlation, structured network diagnostics, and live packet capture. A typical agent workflow chains these together: collect → status → errors → search → correlate → read → summarize, with each step producing outputs that feed into the next.

AWS DevOps Agent does not get a shell on the node. Every interaction is mediated by SSM Automation — an auditable, IAM-controlled, non-interactive execution model.
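The chained workflow (collect, poll status, retrieve errors, read the relevant file) can be sketched with a stand-in for the tools. Everything here is hypothetical: `FakeNodeTools`, its method names, and the canned return values are invented for the example and do not reflect the actual tool schemas exposed by the reference server.

```python
import time


class FakeNodeTools:
    """Stand-in for the MCP tools; returns canned data instead of calling SSM."""

    def __init__(self):
        self._polls = 0

    def collect(self, instance_id):
        return {"execution_id": f"exec-{instance_id}"}

    def status(self, execution_id):
        # Pretend the collection finishes on the third poll.
        self._polls += 1
        return {"state": "Success" if self._polls >= 3 else "InProgress"}

    def errors(self, execution_id):
        return [{"id": "F-001", "severity": "CRITICAL", "file": "iptables-save.txt"}]

    def read(self, execution_id, file):
        return f"contents of {file}"


def investigate(tools, instance_id, poll_seconds=0.0):
    # Each step consumes the output of the previous one.
    execution_id = tools.collect(instance_id)["execution_id"]
    while tools.status(execution_id)["state"] != "Success":
        time.sleep(poll_seconds)
    report = []
    for finding in tools.errors(execution_id):
        if finding["severity"] == "CRITICAL":
            report.append((finding["id"], tools.read(execution_id, finding["file"])))
    return report


report = investigate(FakeNodeTools(), "i-0123456789abcdef0")
```

Note that the caller only ever passes identifiers between tools; it never issues a command on the node, which mirrors the no-shell constraint.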

Connecting through Amazon Bedrock AgentCore Gateway

The reference implementation uses Amazon Bedrock AgentCore Gateway to expose the Lambda-backed MCP server to AWS DevOps Agent. AgentCore Gateway converts Lambda functions into MCP-compatible tools and handles authentication, protocol translation, and tool discovery through a single managed endpoint.

The integration follows three steps:

Step 1: Create an OAuth authorizer with Amazon Cognito. The CDK stack provisions a Cognito User Pool configured for the OAuth 2.0 client credentials flow. This secures inbound access to the gateway — only clients with valid tokens can invoke tools.

Step 2: Create a gateway and register the Lambda as a target. Register the Lambda function that handles tool invocations as a target on the gateway. AgentCore Gateway automatically discovers the tool schemas from the Lambda and makes them available through the MCP protocol. The gateway endpoint becomes the single MCP URL for AWS DevOps Agent.

Step 3: Connect AWS DevOps Agent. Register the MCP server at the account level in the AWS DevOps Agent console, providing the gateway URL and OAuth configuration. Then allowlist the specific tools each Agent Space needs. AWS DevOps Agent authenticates by obtaining a JWT from the Cognito token endpoint using the client credentials grant and passes it as a Bearer token in requests to the gateway URL.
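For readers who want to test the gateway by hand, the client credentials exchange described in Step 3 looks roughly like this. The sketch only builds the request and does not send it; the domain, client ID, and secret are placeholders, and the exact scopes your Cognito app client requires are not covered here.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(token_url, client_id, client_secret, scope=None):
    """Build (but do not send) an OAuth 2.0 client credentials token request."""
    form = {"grant_type": "client_credentials"}
    if scope:
        form["scope"] = scope
    credentials = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
    return urllib.request.Request(
        token_url,
        data=urllib.parse.urlencode(form).encode(),
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


def bearer_headers(access_token):
    # The token from the response is then sent on every request to the gateway URL.
    return {"Authorization": f"Bearer {access_token}"}


request = build_token_request(
    "https://example.auth.us-east-1.amazoncognito.com/oauth2/token",  # placeholder domain
    "example-client-id",  # placeholder
    "example-client-secret",  # placeholder
)
```

Sending the request with `urllib.request.urlopen(request)` would return a JSON body whose `access_token` field feeds `bearer_headers`.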

Deploying the MCP server

Deploy the entire stack using AWS CDK:

git clone https://github.com/aws-samples/sample-eks-node-diagnostics-mcp.git
cd sample-eks-node-diagnostics-mcp
chmod +x deploy.sh
./deploy.sh

The script walks you through cluster selection and node role configuration. Have the following ready before running the script: your target EKS cluster name, the IAM role ARN you attached to your worker nodes, and the AWS Region where your cluster runs. The script outputs your MCP gateway URL, OAuth credentials, and token endpoint — everything you need to configure the connection in AWS DevOps Agent. See the repository README for detailed deployment instructions, CI/CD mode, and prerequisite details.

Seeing it in action

To demonstrate the MCP server’s capabilities, we walk through a realistic node-level failure scenario on a test EKS cluster. We manually inject a fault that blocks pod DNS resolution at the iptables level — an issue that is invisible from kubectl since pods appear Running — then show how AWS DevOps Agent investigates and identifies the root cause using the MCP server’s tools.

Setting up the scenario

Start with an EKS cluster that has a managed node group with SSM Agent running (included by default on Amazon EKS optimized AMIs). Deploy a sample workload to one of the nodes:

kubectl create namespace demo-app

cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-frontend
  namespace: demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-frontend
  template:
    metadata:
      labels:
        app: web-frontend
    spec:
      containers:
      - name: nginx
        image: nginx:latest
        ports:
        - containerPort: 80
EOF

Identify the node and instance ID where the pods are running:

kubectl get pods -n demo-app -o wide

Injecting the fault

⚠️ WARNING: The following commands will disrupt DNS resolution for all pods on the target node. Only run these in a non-production test environment. Do not execute on production nodes.

Connect to the target node using SSM Session Manager and run the following commands to block pod DNS traffic at the iptables level. This simulates a subtle networking issue – pods continue running but can’t resolve DNS, and the root cause is only visible in the node’s iptables rules:

# Block pod traffic to kube-dns ClusterIP — pods run but DNS fails
# Only affects FORWARD chain (pod traffic), not the node's own DNS
sudo iptables -I FORWARD -d 10.100.0.10/32 -p udp --dport 53 -j DROP
sudo iptables -I FORWARD -d 10.100.0.10/32 -p tcp --dport 53 -j DROP

Replace 10.100.0.10 with your cluster’s kube-dns ClusterIP (kubectl get svc kube-dns -n kube-system -o jsonpath='{.spec.clusterIP}').

This fault is particularly insidious because kubectl get pods shows all pods in Running state. The applications fail with DNS resolution errors, but there is no Kubernetes event or pod status that points to the cause. The iptables DROP rules targeting the kube-dns ClusterIP exist only in the node’s firewall configuration — a layer that no Kubernetes API call can inspect.
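To make the detection step concrete, here is a small sketch of how collected iptables output could be scanned for rules like the ones injected above. It assumes the rules were captured in the `iptables -S FORWARD` style; the sample text is illustrative rather than real collected output, and `find_dns_drops` is a hypothetical helper, not part of the reference implementation.

```python
import shlex


def find_dns_drops(rules_text, cluster_ip):
    """Return FORWARD rules that DROP port-53 traffic destined for cluster_ip."""
    matches = []
    for line in rules_text.splitlines():
        tokens = shlex.split(line)
        if tokens[:2] != ["-A", "FORWARD"]:
            continue
        # Map every token to the token that follows it, so flags map to their values.
        opts = dict(zip(tokens, tokens[1:]))
        if (
            opts.get("-d") == f"{cluster_ip}/32"
            and opts.get("--dport") == "53"
            and opts.get("-j") == "DROP"
        ):
            matches.append(line)
    return matches


# Illustrative sample, not captured from a real node.
sample_rules = """\
-P FORWARD ACCEPT
-A FORWARD -d 10.100.0.10/32 -p tcp -m tcp --dport 53 -j DROP
-A FORWARD -d 10.100.0.10/32 -p udp -m udp --dport 53 -j DROP
-A FORWARD -m conntrack --ctstate RELATED,ESTABLISHED -j ACCEPT
"""
dns_drops = find_dns_drops(sample_rules, "10.100.0.10")
```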

Investigating with AWS DevOps Agent

An engineer notices applications reporting DNS failures and asks AWS DevOps Agent to investigate:

“Pods on node i-xxxxxxxxxx in cluster EKS-sample (us-east-1) are running but applications report DNS resolution failures. Collect the node logs and investigate.”

Figure 2: Starting an investigation in AWS DevOps Agent. The engineer provides the symptom description and incident timestamp, and the agent autonomously plans and executes the investigation.

AWS DevOps Agent begins the investigation by recording the symptom and launching two parallel actions: collecting node logs via the nodelog_collect tool and checking cluster health. The cluster health check confirms all four nodes are running and SSM-online. The agent then polls the log collection status, tracking progress from 25% through 75% to completion. Once collection finishes, the agent fans out into parallel workstreams — running network diagnostics, performing quick triage, and collecting logs from a healthy node for comparison.

Figure 3: Investigation timeline showing the initial data collection phase. The agent identifies the symptom, confirms cluster health, collects node logs via SSM Automation, polls for completion, and launches parallel diagnostic workstreams.

With the initial data collected, the agent launches four parallel investigation tasks to maximize coverage and minimize time-to-root-cause: (1) deep-dive-iptables-routes examines the node’s firewall rules and routing table in detail, completing in 1 minute 44 seconds across 8 tool calls; (2) search-network-errors scans the collected logs for network-related error patterns, running 15 tool calls over 7 minutes 51 seconds; (3) collect-healthy-node gathers the same diagnostics from a known-good node for comparison, taking 13 tool calls over 4 minutes 55 seconds; (4) check-oom-and-pod-status investigates kernel OOM kills and pod health, executing 19 tool calls over 8 minutes 12 seconds. Each task produces a structured report that feeds into the final synthesis.

Figure 4: Parallel investigation phase. The agent runs four concurrent deep-dive tasks — iptables/route analysis, network error search, healthy node comparison, and OOM/pod status check — then synthesizes the findings into a unified report.

The iptables and route table deep-dive reveals the root cause. The agent identifies two CRITICAL findings: a FAULT-INJECT-DROP-POD-TO-POD rule in the FORWARD chain that drops inter-pod traffic, and a FAULT-INJECT-DROP-SERVICE-CIDR rule that drops forwarded traffic to the service CIDR range. It also flags a MEDIUM-severity finding — a blackhole route for 10.96.0.0/12 (the Kubernetes service CIDR) that does not exist on healthy nodes. The remaining checks come back normal: kube-proxy chains are intact, AWS VPC CNI SNAT/CONNMARK chains are properly configured, and the default gateway and ENI route tables are correct. This structured severity classification allows the agent to immediately focus on the critical items.

Figure 5: Deep-dive findings from the iptables and route table analysis. Two CRITICAL fault-injection DROP rules in the FORWARD chain are identified as the primary issue, while standard networking components — kube-proxy, VPC CNI, and routing — check normal.

The healthy node comparison confirms the diagnosis. The agent compares the unhealthy node against a known-good node across seven dimensions: security groups, ENI count, DNS configuration, iptables rules, route tables, conntrack entries, and IPAMD state. The key differences are definitive: the blackhole route for 10.96.0.0/12 exists only on the unhealthy node, kubelet API server timeout errors appear only on the unhealthy node, conntrack entries are 12x higher (1,962 vs 169), and IPAMD reconciliation errors are 5x more frequent. The iptables FORWARD chain counters show 2.4 billion packets processed on the unhealthy node versus zero on the freshly-started healthy node — confirming sustained traffic disruption.
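The comparison logic amounts to diffing the same metrics across two nodes. A minimal sketch follows, using the conntrack counts and the blackhole route from the comparison above; the `eni_count` values, the metric names, and the 2x threshold are assumptions added for illustration.

```python
def compare_nodes(unhealthy, healthy, ratio_threshold=2.0):
    """Return metrics whose value on the unhealthy node is notably higher."""
    differences = {}
    for name, value in unhealthy.items():
        baseline = healthy.get(name, 0)
        if baseline == 0:
            if value > 0:
                differences[name] = None  # present only on the unhealthy node
        elif value / baseline >= ratio_threshold:
            differences[name] = round(value / baseline, 1)
    return differences


# conntrack counts come from the comparison above; eni_count is a made-up value.
unhealthy = {"conntrack_entries": 1962, "blackhole_routes": 1, "eni_count": 3}
healthy = {"conntrack_entries": 169, "blackhole_routes": 0, "eni_count": 3}
diff = compare_nodes(unhealthy, healthy)
```

With these inputs the conntrack ratio comes out at 11.6, which matches the roughly 12x figure reported, and the blackhole route is flagged as present only on the affected node.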

Figure 6: Healthy node comparison confirming the diagnosis. The agent compares diagnostics across both nodes and identifies five key differences — the blackhole route, elevated conntrack entries, and high FORWARD chain packet counts exist only on the affected node.

The agent synthesizes the findings into a definitive root cause determination. It identifies a fault-injection namespace on the EKS cluster that is running chaos experiments, introducing three specific network-disrupting modifications on the target node: (1) a FAULT-INJECT-DROP-POD-TO-POD iptables rule in the FORWARD chain that drops inter-pod traffic, (2) a FAULT-INJECT-DROP-SERVICE-CIDR rule that drops forwarded traffic to the Kubernetes service CIDR, and (3) a blackhole route for 10.96.0.0/12 that does not exist on healthy nodes. Together, these three modifications create a multi-vector network disruption — pods appear Running but cannot communicate with each other or reach Kubernetes services, including kube-dns.

Figure 7: Root cause determination. The agent traces the multi-vector network disruption to three fault-injection modifications — two iptables DROP rules and a blackhole route — deployed by a chaos experiment namespace on the target node.

Cleaning up the fault

To restore the node after the demo, connect via SSM Session Manager and run:

sudo iptables -D FORWARD -d 10.100.0.10/32 -p udp --dport 53 -j DROP
sudo iptables -D FORWARD -d 10.100.0.10/32 -p tcp --dport 53 -j DROP

Extending this pattern to other data sources

The EKS node diagnostics use case demonstrates the pattern, but the architecture generalizes to systems where the SSM Agent is running and you can define an SSM Automation runbook to collect the data you need.

For example, an EC2 instance with SSM Agent can use this same approach — collect OS-level logs, network configuration, package state, or application diagnostics through a custom or pre-built SSM Automation runbook, upload results to S3, and expose them through MCP tools. The same applies to ECS container instances (Docker daemon logs, ECS agent state, iptables), on-premises servers registered via SSM Hybrid Activations, or managed nodes in your fleet.

The pattern also extends beyond SSM-managed hosts. Network devices can be reached through API calls to their management planes, databases through read-only diagnostic queries, and third-party APM tools through vendor API integrations. In each case, the same three-step approach holds: identify the unreachable data, build an MCP server that wraps safe access to it, and connect it to AWS DevOps Agent.

When to use this approach: This pattern works well for incident response where diagnostic data lives outside AWS DevOps Agent’s native reach, fleet-wide triage where manual access to individual systems is impractical, and cross-source correlation where evidence spans multiple log sources.

It is not a replacement for continuous monitoring (use CloudWatch Container Insights or Prometheus for real-time alerting), log shipping (if you have compliance requirements for continuous retention), or native integrations where the agent already has access to the data source.

The reference implementation requires SSM Agent running on the nodes with appropriate IAM permissions. It is a proof of concept — validate it in non-production environments before using it with production workloads.

Clean up

Cost considerations: This solution uses AWS Lambda, Amazon S3, AWS KMS, Amazon Cognito, and Amazon Bedrock AgentCore Gateway. Costs vary based on usage. Lambda charges apply per invocation and duration. S3 charges apply for log storage. KMS charges a per-key monthly fee plus per-request charges. Cognito charges per monthly active user. AgentCore Gateway pricing is based on API calls. For current pricing details, see the AWS Pricing page for each service. To minimize costs during evaluation, delete the stack when not in use.

Remove the deployed resources by running cdk destroy from the repository root. The S3 log bucket uses a RETAIN removal policy — delete it manually after stack destruction if needed.

Conclusion

MCP provides a standardized extensibility mechanism that lets you bridge visibility gaps in AWS DevOps Agent without modifying the agent itself. The pattern is straightforward: identify the unreachable data source, build an MCP server that wraps safe and structured access to it, and connect it to AWS DevOps Agent through Amazon Bedrock AgentCore Gateway. The agent handles the reasoning. The MCP server handles the data access.

To get started:

  • Deploy the reference implementation (sample-eks-node-diagnostics-mcp repository) in a non-production environment.
  • Review the MCP specification.
  • Explore the Amazon EKS troubleshooting documentation.
  • Connect custom MCP servers to AWS DevOps Agent — see the Connecting MCP Servers guide in the AWS DevOps Agent documentation.
  • Set up AgentCore Gateway — see the Amazon Bedrock AgentCore Gateway quick start guide.
DevOps securitynpm

GitHub pulls pin on npm's auto-run scripts

GitHub will disable automatic install-time scripts in npm 12 by default to mitigate common supply chain attack vectors.

Summary

What: In npm 12, scheduled for July 2026, the preinstall, install, and postinstall scripts will require explicit allowlist permission, and remote dependency fetching will be restricted by default to combat malicious activity like the Shai-Hulud worm.
Why it matters: The ecosystem is finally abandoning the dangerous legacy behavior of executing arbitrary code on developer machines during package installation, aligning npm with pnpm, Bun, and Deno.
Takeaway: Audit your project's package.json dependencies now and identify any tools like Playwright or Electron that rely on install scripts so you can permit them before the upgrade to version 12.

Deep Dive

  • Disables automatic execution of install-time scripts by default.
  • Restricts Git-based and remote URL dependency fetching unless explicitly enabled.
  • Requires use of an allowlist for packages that rely on native modules or binary downloads.
  • Replaces the advisory nature of flags with enforced security defaults.
  • Introduces protection against Shai-Hulud-style malware infections.

Decoder

  • Supply chain attack: A cyberattack that targets a software's source by injecting malicious code into upstream dependencies or build tools.
  • Transitive dependency: A dependency that a project's direct dependency requires to function, creating a deep tree of external code.

Original Article

GitHub pulls pin on npm's auto-run scripts

GitHub will change npm's defaults so the install command no longer runs scripts automatically, disabling a feature commonly exploited by malicious packages such as the notorious Shai-Hulud worm.

Maintainer Leo Balter said: "Install-time lifecycle scripts are the single largest code-execution surface in the npm ecosystem. Every npm install runs scripts from every transitive dependency, so a single compromised package anywhere in your tree can execute arbitrary code on a developer machine or CI (continuous integration) runner."

In npm 12, due in July, three security-focused defaults are changing. Scripts configured for preinstall, install, or postinstall will no longer run unless explicitly permitted via allow-scripts. The --allow-git flag, which permits dependencies to be pulled from Git repositories, will default to off, closing an attack path where a malicious .npmrc file could override the Git executable and achieve arbitrary code execution. Finally, allow-remote will default to none, blocking dependency downloads from remote URLs entirely.

It will still be possible to allow scripts to run via an allowlist in the package.json configuration file. This will be pinned to the installed version of a package by default.

These are breaking changes, and Balter recommended developers run the commands to allow scripts for every currently installed package in a project that requires them. "This gets you protected against new, unexpected scripts immediately," he said. The next step is to review these packages and deny scripts for those where they are not needed.

Some packages require script approval to function, including native modules that compile on install, testing tools like Playwright and Puppeteer (which fetch binaries via postinstall), and Electron, which wraps the Chromium browser engine for cross-platform desktop applications.
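One way to find out which installed packages declare install-time scripts, before deciding what to allow, is to scan the `scripts` field of each package.json under node_modules. The sketch below does only that; the function name is hypothetical, it does not use any npm command, and it will miss native modules that rely on npm's implicit build step rather than an explicit script.

```python
import json
from pathlib import Path

INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def packages_with_install_scripts(node_modules):
    """Map package name -> install-time hooks declared in its package.json."""
    found = {}
    for manifest in Path(node_modules).rglob("package.json"):
        try:
            data = json.loads(manifest.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed manifest
        if not isinstance(data, dict):
            continue
        scripts = data.get("scripts")
        if not isinstance(scripts, dict):
            continue
        hooks = [hook for hook in INSTALL_HOOKS if hook in scripts]
        if hooks and "name" in data:
            found[data["name"]] = hooks
    return found


if __name__ == "__main__":
    for name, hooks in sorted(packages_with_install_scripts("node_modules").items()):
        print(f"{name}: {', '.join(hooks)}")
```

Run from a project root, this prints a review list; each entry still needs a human decision about whether its script is actually required.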

These features have been available since npm version 11.10.0, released in February, but as opt-in flags rather than defaults. That version also introduced min-release-age, which blocks installation of package versions published fewer than a specified number of days ago, designed as a safeguard against newly published malicious packages.

Best security practice for developers using npm 11.16, the current version, is to enable these flags in .npmrc or via environment variables, which will also prepare a project for the changes in version 12. One annoyance is that the existing flag ignore-scripts does not support an allowlist, other than via an additional tool. The ignore-scripts setting will override allow-scripts, so developers will need to remove it, if set to true, to enable approved scripts to run. The allowScripts setting exists in npm 11 but is advisory only.

Will this fix npm security issues? Unfortunately not. "Now all the malware can move from the install script to the module itself where it will inevitably still be run," said one developer. Another common view is that developers should use pnpm, which already has safer defaults than npm, including a minimum release age.

There is consensus, though, that these changes do improve npm security and are long overdue. The pull request for this change includes the remark that "npm is the only remaining major package manager that runs dependency install scripts by default. pnpm v10+, Yarn Berry, Bun, and Deno all block them."

DevOps securitydata

DASH 2026 Security & Compliance: Guide to Datadog's newest announcements

Datadog unveiled autonomous threat hunting agents and AI-native security remediation tools at DASH 2026.

Summary

What: Datadog introduced 'Bits' AI agents for threat hunting and incident response, expanded Code Security with AI-native SAST, and achieved FedRAMP High certification, alongside a new identity-aware API authentication model.
Why it matters: Security tooling is shifting from passive monitoring to active, AI-orchestrated remediation, where agents not only identify vulnerabilities but also generate and suggest PR fixes in real-time.
Takeaway: If you use Datadog, migrate your legacy API application keys to the new scoped credential types like Service Access Tokens (SATs) or Workload Identity Federation before the Q3 2026 legacy deprecation.

Deep Dive

  • Bits agents provide autonomous threat hunting and SIEM investigations.
  • AI-native SAST uses LLMs to reduce false positives in vulnerability detection.
  • Code Threat Detection monitors pull requests for supply chain threats.
  • Supply Chain Firewall blocks malicious npm and pip packages at installation.
  • New API authentication includes Workload Identity Federation and customer-managed OAuth.
  • FedRAMP High certification enables use by US federal agencies and regulated industries.
  • Dynamic routing automatically directs security notifications to relevant service owners.

Decoder

  • SAST (Static Application Security Testing): A methodology that examines application source code for security vulnerabilities without executing the program.
  • SIEM (Security Information and Event Management): Software that aggregates and analyzes log data from across an organization to identify security threats.
  • FedRAMP High: A stringent federal security authorization for cloud services handling mission-critical data.

Original Article

Full article content is not available for inline reading.


Tech aipolicyllm

Anthropic shuts down Fable, Mythos models following Trump admin directive

Anthropic abruptly disabled its newly launched Fable 5 and Mythos 5 models after a US Commerce Department directive citing national security concerns.

Summary

What: Anthropic pulled access to Fable 5 and Mythos 5 following a government order. Officials reportedly fear a jailbreak allows the models to assist in cyberattacks or biological research. Anthropic disputes the severity, noting the vulnerability is minor and common among other frontier models like GPT-5.5.
Why it matters: This incident highlights the rising tension between AI labs and the federal government, where regulatory oversight is moving from voluntary guidance to immediate, forced shutdowns of deployed commercial products based on classified or non-public security findings.

Deep Dive

  • Anthropic disabled Fable 5 and Mythos 5 after a Commerce Department mandate.
  • The directive triggers export controls, forcing a complete shutdown of access.
  • Government concerns focus on a jailbreak for cybersecurity and biosecurity queries.
  • Anthropic claims the jailbreak is minor and reproducible in competing models like GPT-5.5.
  • The shutdown impacts all users regardless of location.
  • Further details on the specific "narrow" vulnerability are expected within 24 hours.
  • This follows a recent executive order from President Trump urging voluntary AI security testing.

Decoder

  • Jailbreak: A technique used to bypass the safety filters and content policies of an AI model to elicit prohibited or sensitive information.
  • Frontier Model: A highly capable AI model that sits at the current technical edge of the industry in terms of reasoning, scale, and general utility.

Original Article

Anthropic completely shut off access to its Mythos 5 and Fable 5 models Friday night, just days after they were launched.

The move comes after Anthropic’s receipt of a US Commerce Department directive Friday evening, subjecting the new models to export controls restricting their use anywhere outside the United States. In a message posted Friday night, Anthropic said the only way for it to ensure compliance with that government order in the immediate term “is that we must abruptly disable Fable 5 and Mythos 5 for all our customers.” Access to other Anthropic models is not affected.

An Axios report cited an administration official saying that the administration is concerned by reports of a jailbreak that reportedly gets around broad classifier-based safeguards meant to block Fable 5 prompts regarding cybersecurity, chemistry, and biology. The administration reportedly requested a pause in the release of these models to gain time for the “national security apparatus” to be “hardened” against this kind of threat. That hardening could be complete “in the next few weeks,” Axios’ source suggested.

In its Friday night announcement post, Anthropic said the government has only provided it with “verbal evidence of a potential narrow, non-universal jailbreak” that involves getting Fable 5 to review a specific codebase for software flaws. The company says it has only seen evidence of this kind of jailbreak being used to find “minor” and “relatively simple” software vulnerabilities, and that other publicly available models like GPT-5.5 have similar capabilities on this score.

“We are complying with the government’s legal directive and are removing access to Fable 5 and Mythos 5 for all users,” Anthropic writes. “However, we disagree that the finding of a narrow potential jailbreak should be cause for recalling a commercial model deployed to hundreds of millions of people. If this standard was applied across the industry, we believe it would essentially halt all new model deployments for all frontier model providers.”

Earlier this month, President Trump signed an executive order urging AI model makers to submit to voluntary government security testing. That order came after an initial signing ceremony planned for last month was abruptly postponed amid reported concerns of disagreements about it within the administration.

Anthropic apologized to customers for a “disruption” that it said is the result of a “misunderstanding,” and said it will release more details about the situation in the next 24 hours.

Data aisearch

Semantic Search for AI Agents at Scale: Retrieval and Ranking for LinkedIn's Hiring Assistant

LinkedIn's 'MUSE' embedding model powers their Hiring Assistant by combining expert-labeled qualification data with a billion-scale semantic search index.

Summary

What: LinkedIn developed MUSE (Member Understanding Semantic Embeddings) using a dual-tower Matryoshka architecture. It trains on millions of relevance labels from a proprietary 'Teacher' LLM—grounded in LinkedIn's specific product policy—to ensure search results reflect qualification fit rather than just keyword matching or engagement metrics.
Why it matters: This highlights the 'LLM-as-a-judge' pattern for creating high-quality training sets, enabling the transition from engagement-based ranking to policy-aligned semantic retrieval at a billion-member scale.

Deep Dive

  • MUSE maps candidate profiles and recruiter queries into a shared 4096-dimensional embedding space.
  • Uses Matryoshka embeddings to allow variable-size truncations: 2048 dimensions for fast retrieval, 4096 for precise ranking.
  • The MUSE Teacher LLM uses chain-of-thought prompting to evaluate qualification fit according to product policy.
  • System utilizes a Lambda architecture (batch and speed layers) for daily index updates.
  • Embedding performance is a strong predictor of recruiter engagement in the downstream L2 ranker.
  • Found that high-confidence model labels often outperformed human annotators on complex, knowledge-heavy tasks.

Decoder

  • Matryoshka Embedding: A technique that optimizes a model so that truncated versions of its embedding vectors still retain meaningful semantic information.
  • Dual-tower architecture: A neural network design where two separate towers encode different inputs (e.g., query and profile) into a shared vector space, allowing for fast offline pre-computation of one side.
  • Lambda Architecture: A data processing pattern that balances speed and throughput by using both a batch processing path and a real-time 'speed' path.
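
The Matryoshka idea above can be illustrated with a minimal sketch. It assumes the common convention of keeping the leading coordinates and re-normalizing before comparing; the 8-dimensional toy vectors and the function names are hypothetical stand-ins, not MUSE's actual dimensions or code.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` coordinates and rescale them to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in head]

def cosine(a, b):
    """Cosine similarity of two unit-length vectors of equal dimension."""
    return sum(x * y for x, y in zip(a, b))

# Toy 8-dimensional "full" embeddings standing in for real model output.
query = [0.9, 0.1, 0.3, 0.0, 0.05, 0.02, 0.01, 0.03]
profile = [0.8, 0.2, 0.4, 0.1, 0.00, 0.05, 0.02, 0.01]

# Cheap pass on the truncated prefix, precise pass on the full vector.
coarse_score = cosine(truncate_and_normalize(query, 4),
                      truncate_and_normalize(profile, 4))
full_score = cosine(truncate_and_normalize(query, 8),
                    truncate_and_normalize(profile, 8))
```

The short prefix would serve the retrieval stage, and the full vector would be reserved for the smaller candidate set that reaches ranking.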

Original Article

Full article content is not available for inline reading.

Read the original article →

Data aienterpriseinfrastructure

The Bill Arrives: How to Manage Agentic AI Costs at Scale

Enterprise agentic AI costs are driven by inefficient task-loop economics rather than model pricing, requiring teams to optimize context and manage stateful retries.

Summary

What: Uber and others experienced budget exhaustion due to agentic 'token multipliers,' where one task triggers multiple inference calls. Costs are hidden in re-sent context (reprocessing the same prompts), inefficient RAG retrieval, and runaway retry loops, forcing companies to track value per task rather than total tokens.
Why it matters: The transition from chatbot (single-turn) to agent (multi-turn) changes the economic model from 'prompt pricing' to 'task-economics,' making infrastructure consistency and context management the primary levers for controlling costs.
Takeaway: If using Anthropic models, implement prompt caching for static instructions, but avoid putting session IDs or timestamps in the cached prefix, which invalidates the cache.

Deep Dive

  • Agentic workflows consume 5-30x more tokens than basic chatbot queries.
  • 'Re-sent context' accounts for ~62% of agentic inference costs.
  • Prompt caching can reduce prefix costs by 90%, but changing content (like timestamps) breaks cache hits.
  • Database latency triggers agent retries, which exponentially inflate token costs.
  • Measurement must shift from token consumption to value-per-task (e.g., tickets resolved).
  • Orchestration, evaluation, and governance represent ~80% of TCO for production agents.
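
To see why re-sent context dominates, here is a back-of-the-envelope sketch. It assumes a simple loop in which every call re-sends the static prefix plus all prior turns; the token counts are made up for illustration and are not figures from the article.

```python
def total_input_tokens(prefix_tokens, tokens_per_turn, turns):
    """Input tokens billed when each call re-sends the prefix and full history."""
    total = 0
    history = 0
    for _ in range(turns):
        total += prefix_tokens + history  # what this call sends
        history += tokens_per_turn        # this turn's result joins the history
    return total

def unique_input_tokens(prefix_tokens, tokens_per_turn, turns):
    """Tokens that would be sent if nothing were ever re-sent."""
    return prefix_tokens + tokens_per_turn * (turns - 1)

# Hypothetical task: 2,000-token prefix, 500 new tokens per turn, 10 turns.
billed = total_input_tokens(2000, 500, 10)
unique = unique_input_tokens(2000, 500, 10)
resent_share = (billed - unique) / billed
```

Because the history term grows with every turn, billed input scales roughly with the square of the turn count, which is why trimming context and caching the prefix matter more than the per-token price.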

Decoder

  • Prompt Caching: A feature where prefixes (like system prompts or large documentation) are saved in memory to avoid repetitive re-processing charges.
  • Context Rot: The phenomenon where an LLM's performance degrades as the input length increases, even if within the supported token limit.
  • TTL (Time-To-Live): A setting that defines how long a cached item remains valid before it must be re-fetched.
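
The advice about keeping volatile fields out of the cached prefix can be sketched as follows. This assumes a cache that matches on the exact prefix text, modeled here with a SHA-256 hash; `cache_key` and both prompt builders are hypothetical helpers, not any provider's API.

```python
import hashlib

STATIC_INSTRUCTIONS = "You are a support agent. Apply the refund policy to each ticket."

def cache_key(prefix: str) -> str:
    """Hypothetical cache key: a hash of the exact prefix text."""
    return hashlib.sha256(prefix.encode("utf-8")).hexdigest()

def cache_hostile_prompt(session_id, timestamp, user_message):
    """Volatile fields first, so the prefix differs on every request."""
    prefix = f"[session {session_id} at {timestamp}]\n{STATIC_INSTRUCTIONS}"
    return prefix, user_message

def cache_friendly_prompt(session_id, timestamp, user_message):
    """Static instructions first; volatile fields ride in the uncached suffix."""
    prefix = STATIC_INSTRUCTIONS
    suffix = f"[session {session_id} at {timestamp}]\n{user_message}"
    return prefix, suffix

bad_a, _ = cache_hostile_prompt("s1", "10:00:00", "Where is my refund?")
bad_b, _ = cache_hostile_prompt("s2", "10:00:05", "Cancel my order.")
good_a, _ = cache_friendly_prompt("s1", "10:00:00", "Where is my refund?")
good_b, _ = cache_friendly_prompt("s2", "10:00:05", "Cancel my order.")
```

Under this model the two cache-friendly requests share one key and could hit the cache, while the cache-hostile ones never do.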

Original Article

Full article content is not available for inline reading.

Read the original article →

Data aiagentsopensource

Introducing Omnigent: A Meta-Harness to Combine, Control, and Share Your Agents

Databricks released Omnigent, an open-source meta-harness designed to orchestrate, secure, and monitor interoperable agent sessions across diverse frameworks.

Summary

What: Omnigent provides a unified API for managing agents like Claude Code, Codex, and Pi, offering cross-agent composition, real-time collaboration via URL, and security/cost governance. It is available under the Apache 2.0 license and supports execution on platforms like Modal and Daytona.
Why it matters: As agent development matures, individual harnesses act as isolated silos; a meta-layer is required to standardize governance, state management, and orchestration across an ecosystem of disparate tools.
Takeaway: Install the Omnigent alpha and connect a command-line agent to test its security policies and cross-interface capabilities: https://omnigent.ai/quickstart/install

Deep Dive

  • Implements a common runner interface to wrap terminal-based and SDK-based agents
  • Enables live session sharing via URL for collaborative debugging and steering
  • Supports multi-interface access including web, native apps, and APIs
  • Includes OS-level sandboxing with egress proxying and dynamic security policies
  • Features cost-tracking policies per session for budget management
  • Architecture allows for YAML-based agent composition and portability
  • Roadmap includes automated optimization (GEPA) and an MCP-based server

Decoder

  • Meta-harness: A software layer that sits above individual agent frameworks to provide unified control, policy enforcement, and interoperability across different model backends.
  • MCP (Model Context Protocol): An open standard designed to enable AI models to securely connect with local data and tools.

Original Article

At Databricks, we use and build agents extensively, from coding with them at scale to shipping agent products like Genie. But even though the capabilities of agents have gotten much better, working with them feels clunky. As users, we often have 4-5 agents open at once (coding agents, Gemini search, etc) and spend our time copy-pasting text between them and Docs, Slack, and other collaboration tools. And as agent builders, we’re on a treadmill to improve our agents by combining the latest harnesses, SDKs and models. The problem is that LLM capabilities are wrapped into an agent harness, and these harnesses have different interfaces that make combining them or swapping them difficult.

So we built Omnigent: a meta-harness that sits above the agents you already use (Claude Code, Codex, Pi, or custom agents) and makes them interoperable parts of a richer system. Omnigent targets the problems where a single harness stops: it adds easy ways to compose multiple agents, control them with advanced policies, and collaborate live with teammates.

We believe people will soon work with agents through this new layer, the meta-harness. That’s why today we’re open sourcing Omnigent under Apache 2.0.

Why build a meta-harness?

We adopted coding agents early across our 5,000+ member engineering team and built thousands of agents for customers. That experience convinced us that the frontier of agent engineering is moving up a level. The best results no longer come from a single model in a single harness: Harvey beat a frontier model on quality and cost by giving an open-source worker model a frontier advisor it can call, Anthropic built its research product as a lead agent orchestrating parallel subagents, and our own Genie uses different LLMs for planning, search, and code generation. Engineers are changing how they work, too: instead of prompting one agent at a time, they design loops that drive whole teams of agents.

These patterns span multiple harnesses, models, and people, but each harness only understands its own sessions. To combine agents, govern them, and work on them with other people, you need a layer above the harness. Omnigent is that layer, and it provides:

  • Composition. Combine multiple models, harnesses, and techniques without rewriting code, and switch between Claude Code, Codex, Pi, and your own agents with one-line changes.
  • Control. Stateful, contextual policies that track agent actions and enforce guardrails like cost budgets and permissions at the meta-harness layer, not via prompts.
  • Collaboration. Share live agent sessions via URL and review files in them together, so teammates can review, comment, and steer agents together in real time.

How Omnigent works

Omnigent introduces a common interface above command-line agents and agent SDKs to let you easily combine and interchange them, and then focuses on the shared problems where a harness stops. The key insight is that however each agent harness calls into its LLM internally, the interface to users is the same: messages and files in, text streams and tool calls out. Thus we built a common API that wraps both terminal-based coding agents (Claude Code, Codex, Pi, etc) and SDKs (OpenAI Agents, Claude Agents SDK, etc).

On top of this interface, the current version of Omnigent adds the following key features:

  • Real-time collaboration: you can invite other people to view your agent session, comment on files in its workspace, or even send commands, so your sessions and working directories become the main place you collaborate.
  • Multiple interfaces to the same agent: once you connect an agent such as Claude Code to the Omnigent server, you can access it on the web, mobile, Mac OS native app, or APIs.
  • Cloud execution: launch any agent on your own machine or on hosted sandbox providers like Modal and Daytona, for safe collaboration in a hermetic environment.
  • Contextual security policies: Omnigent’s security policies go beyond the simple “allow X / deny Y” of coding agents, to track dynamic state about each session and make smarter decisions. For example, you can say that after an agent downloads a new package from npm, it should require human approval to git push, or that it should only be able to write to docs it created, not any doc.
  • Cost policies: One of the things we track dynamically is each session’s LLM cost. For example, you can ask Omnigent to pause an agent and ask to continue after every $100 it spends.
  • Strong OS sandbox: In Omnigent, we include a flexible OS sandbox from our security team with the ability to flexibly lock down OS access and intercept and transform network requests (e.g., don’t let an agent ever see your GitHub security token, but instead, inject it only in the egress proxy on approved requests).
  • Multi-harness authoring: Specify a custom agent as a YAML and port it across harnesses with a one-line change, or combine subagents using different harnesses in the same agent.
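
The contextual security and cost policies described above can be pictured with a small stateful sketch. This is not Omnigent's policy API; `SessionPolicy`, its method names, and its return values are hypothetical and only illustrate how tracked session state can change a later decision.

```python
from dataclasses import dataclass

@dataclass
class SessionPolicy:
    """Hypothetical per-session policy state (not Omnigent's real interface)."""
    cost_checkpoint_usd: float = 100.0
    installed_package: bool = False
    spent_usd: float = 0.0
    next_checkpoint_usd: float = 0.0

    def __post_init__(self):
        self.next_checkpoint_usd = self.cost_checkpoint_usd

    def check_command(self, command: str) -> str:
        """Return 'allow' or 'needs_approval' for a shell command."""
        words = command.split()
        if words[:2] == ["npm", "install"]:
            self.installed_package = True   # remember this for later decisions
            return "allow"
        if words[:2] == ["git", "push"] and self.installed_package:
            return "needs_approval"         # a push after a new package needs a human
        return "allow"

    def record_cost(self, usd: float) -> str:
        """Return 'pause' each time spending crosses another checkpoint."""
        self.spent_usd += usd
        if self.spent_usd >= self.next_checkpoint_usd:
            while self.next_checkpoint_usd <= self.spent_usd:
                self.next_checkpoint_usd += self.cost_checkpoint_usd
            return "pause"
        return "continue"
```

The same `git push` command is allowed before the `npm install` and gated after it, which is the kind of decision a static allow/deny list cannot express.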

These features are just scratching the surface of what can be done at the meta-harness layer, however, and we expect to see a lot more ideas soon from our team and the open source community. Some items on our roadmap include automatic optimization at the meta-harness level with GEPA, code-based introspection within agents similar to MemEx and RLM, an Omnigent Server MCP so agents can work across your sessions, and more harnesses. We’ve also made Omnigent easy to deploy on a wide range of infrastructure, including Fly.io, Railway, Modal and Daytona sandboxes, and many LLM providers, and we welcome patches for more integrations.

A new layer for working with agents

Many of the biggest shifts in our industry came from moving to a new layer of abstraction: for example, while engineers used to manage individual processes and servers, they can now manage a whole fleet via cloud systems like Kubernetes and Terraform.

We think agents are at the same point today. Each harness is its own silo, with its own context, its own controls, and its own way of running, and none of it carries over when you switch tools. Moreover, many problems intrinsically span harnesses, including composition, security and collaboration. A meta-harness lifts your work above any single harness, so your sessions, policies, and skills stay with you no matter which agent or model is running. The models and harnesses will keep changing as the field evolves; the layer you work at shouldn't have to.

We're building that layer in the open, and we'd love for you to build it with us.

Try it out

Omnigent is open source in alpha today.

  • Get started with our quickstart: https://omnigent.ai/quickstart/install
  • Star and clone the repo: https://github.com/omnigent-ai/omnigent
  • Read the docs: https://omnigent.ai/
  • Join us on Discord: https://discord.gg/omnigent

Data databaseperformancerust

Apache DataFusion 54.0.0 Released

Apache DataFusion 54.0.0 delivers massive performance gains for joins and introduces advanced SQL features like lateral joins and lambda functions.

Summary

What: The release makes near-unique LEFT and FULL sort-merge joins 20-50x faster and some repartition-heavy queries up to 50% faster. New features include LATERAL joins, SQL lambda functions for array processing, and transparent spill-to-disk for nested loop joins.
Why it matters: DataFusion is becoming the de facto standard for building high-performance query engines in the Rust ecosystem, and these optimizations further close the performance gap with proprietary analytical databases.

Deep Dive

  • Optimized sort-merge joins for semi, anti, and mark variants using per-row bitsets
  • Introduced morsel-driven Parquet scans to improve parallelism on skewed datasets
  • Added support for CROSS, INNER, and LEFT LATERAL joins
  • Implemented SQL lambda syntax (x -> expr) and array-specific higher-order functions
  • Enabled disk-spilling for memory-constrained nested loop joins
  • Switched to the arrow-avro crate for more efficient Avro reading
  • Added statistics-driven sort pushdown and top-K optimization

Decoder

  • LATERAL join: A SQL join that allows a subquery to reference columns from a preceding table, enabling per-row processing.
  • Morsel-driven: A parallel execution model that dynamically assigns small units of work to idle threads rather than static partitions.

Original Article

Apache DataFusion 54.0.0 Released

We are proud to announce the release of DataFusion 54.0.0. This post highlights some of the major improvements since DataFusion 53.0.0. Notable additions include LATERAL joins, SQL lambda functions, and a new Avro reader, alongside significant join, scan, and planning performance improvements. The complete list of changes is available in the changelog. This release represents roughly 11 weeks of development and 740 commits. Thanks to the 139 contributors (a new record!) for making it possible.

Performance Improvements

We continue to make significant performance improvements in DataFusion, as explained below. This release prunes more redundant work out of plans and makes joins, repartitioning, scans, and many built-in functions faster.

Execution Operator Improvements

Physical Execution of Uncorrelated Scalar Subqueries: DataFusion previously executed an uncorrelated scalar subquery (one that doesn't depend on the outer query) by rewriting it into a join. DataFusion 54 instead evaluates it once with a new physical operator. This lets functions use their specialized scalar code paths, and allows uncorrelated scalar subqueries in ORDER BY, JOIN ON, and as arguments to aggregate functions.

Faster Sort-Merge Joins: Semi, anti, and mark joins now track matches with a per-row bitset instead of materializing (outer, inner) pairs. Batched deferred filtering makes near-unique LEFT and FULL joins 20-50x faster. Finally, join-key comparisons now use a DynComparator that resolves the column type once rather than per row, making microbenchmarks up to 12% faster and TPC-H ~5% faster overall.
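
The per-row bitset idea can be illustrated with a simplified sketch of a left semi join over two sorted key lists. It is written in Python for illustration only and is not DataFusion's Rust implementation; a `bytearray` stands in for the bitset.

```python
def semi_join_sorted(outer_keys, inner_keys):
    """Left semi join over ascending key lists; returns matching outer row indices."""
    matched = bytearray(len(outer_keys))  # one flag per outer row
    i = j = 0
    while i < len(outer_keys) and j < len(inner_keys):
        if outer_keys[i] < inner_keys[j]:
            i += 1
        elif outer_keys[i] > inner_keys[j]:
            j += 1
        else:
            matched[i] = 1  # record the match; no (outer, inner) pair is built
            i += 1          # keep j: the next outer row may carry the same key
    return [idx for idx, flag in enumerate(matched) if flag]

def anti_join_sorted(outer_keys, inner_keys):
    """Left anti join: the outer rows whose flag was never set."""
    hits = set(semi_join_sorted(outer_keys, inner_keys))
    return [idx for idx in range(len(outer_keys)) if idx not in hits]
```

However many inner rows share a key, each outer row costs one flag, so the output never grows with the number of duplicate matches.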

Faster Repartitioning: RepartitionExec now coalesces batches before sending them to distributor channels, cutting per-batch overhead for up to 50% faster execution on some repartition-heavy queries.
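
A rough picture of coalescing, assuming batches are plain lists of rows and the hypothetical `target_rows` threshold controls when a merged batch is sent:

```python
def coalesce(batches, target_rows):
    """Merge small batches into batches of at least `target_rows` rows."""
    buffer = []
    for batch in batches:
        buffer.extend(batch)
        if len(buffer) >= target_rows:
            yield buffer
            buffer = []
    if buffer:
        yield buffer  # flush whatever is left at the end of input

small_batches = [[1], [2, 3], [4], [5, 6, 7], [8]]
merged = list(coalesce(small_batches, 3))
```

Here five sends become three, so any fixed per-batch cost is paid fewer times for the same rows.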

Faster Functions and Hashing: DataFusion ships hundreds of built-in functions, so speeding them up pays off across many workloads. This release optimizes many, including array_to_string, array_concat, array_sort, split_part, substr, strpos, left, right, string_agg, and approx_distinct, plus better NULL handling across many array and datetime functions. The first_value and last_value aggregates are also substantially faster over Utf8 and Binary columns thanks to a new GroupsAccumulator. DataFusion 54 also swaps ahash for foldhash in datafusion-common, and optimizes regexp_replace by stripping trailing .* from anchored patterns.

Planner Improvements

Pruning Functionally Redundant Sort Keys: Sorting is expensive, so it pays to sort by as few columns as possible. DataFusion 54 now drops functionally redundant ORDER BY keys: when an earlier key determines a later one, the later key can't change the ordering, so removing it cuts sorting cost without affecting results.
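
A simplified sketch of the pruning rule follows. It assumes functional dependencies are supplied as a mapping from determinant column sets to the columns they determine, and it ignores sort direction and NULL ordering; the function and column names are hypothetical.

```python
def prune_sort_keys(sort_keys, dependencies):
    """Drop ORDER BY keys already determined by the keys kept before them.

    `dependencies` maps a frozenset of determinant columns to the set of
    columns they functionally determine.
    """
    kept = []
    determined = set()
    for key in sort_keys:
        if key in determined:
            continue  # rows tied on earlier keys are also tied on this one
        kept.append(key)
        determined.add(key)
        changed = True
        while changed:  # expand to everything the kept keys now determine
            changed = False
            for lhs, rhs in dependencies.items():
                if lhs <= determined and not rhs <= determined:
                    determined |= rhs
                    changed = True
    return kept

deps = {frozenset({"user_id"}): {"email"}}
pruned = prune_sort_keys(["user_id", "email", "created_at"], deps)
```

Since `user_id` determines `email`, any two rows that tie on `user_id` also tie on `email`, so the second key can never break a tie.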

Skip Redundant Parquet Filters: When statistics prove a filter matches every row in a Parquet row group, DataFusion now skips evaluating it — both row filters and page-level pruning — for that row group instead of re-checking each row.
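
For a filter of the form `col > threshold`, the decision can be sketched from row-group statistics alone. This assumes min, max, and null-count statistics are present and exact; the function name and return labels are hypothetical.

```python
def classify_row_group(min_value, max_value, null_count, threshold):
    """Decide how to handle the filter `col > threshold` for one row group."""
    if max_value <= threshold:
        return "skip_row_group"   # no row can match, so prune the group
    if min_value > threshold and null_count == 0:
        return "skip_filter"      # every row matches, so read without filtering
    return "evaluate_filter"      # statistics are inconclusive
```

The null-count check matters because a NULL never satisfies the comparison, even when min and max both clear the threshold.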

Statistics-Driven Sort Pushdown and TopK: Files and Parquet row groups are now ordered using statistics, which can avoid sorting entirely and improve dynamic filtering and early stopping for TopK (ORDER BY ... LIMIT) queries. The most promising data is read first, often satisfying the LIMIT before scanning the rest.
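
The early-stopping effect can be sketched for `ORDER BY v ASC LIMIT k`. The sketch assumes each file carries an exact minimum statistic and models files as hypothetical `(min_value, values)` pairs; it is an illustration, not DataFusion's operator.

```python
import heapq

def top_k_smallest(files, k):
    """files: list of (min_value, values). Returns (k smallest values, files read)."""
    best = []  # max-heap via negated values, holding at most k entries
    files_read = 0
    for min_value, values in sorted(files, key=lambda f: f[0]):
        if len(best) == k and min_value >= -best[0]:
            break  # nothing in this or any later file can beat the current top k
        files_read += 1
        for v in values:
            if len(best) < k:
                heapq.heappush(best, -v)
            elif v < -best[0]:
                heapq.heapreplace(best, -v)
    return sorted(-v for v in best), files_read

files = [(50, [50, 60, 70]), (1, [1, 5, 9]), (3, [3, 4, 100]), (200, [200, 300])]
result, files_read = top_k_smallest(files, 3)
```

Reading the lowest-minimum files first tightens the threshold quickly, so two of the four files here are never opened.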

Improved Statistics and Cardinality Estimation: Good plans depend on good statistics. This release extracts NDV (number of distinct values) statistics from Parquet metadata, uses NDV for equality-filter selectivity, adds a pluggable StatisticsRegistry for operator-level statistics propagation, and improves cardinality estimation for semi and anti joins.

Scan Improvements

Morsel-Driven Parquet Scans: Parquet scan parallelism was previously bounded by the slowest scan thread, so data skew (large row groups, less-selective filters, or variable object store latency) left cores underutilized. DataFusion 54 reworks the Parquet scan around a morsel-driven design, where idle threads dynamically pull small units of work ("morsels") instead of each being assigned a fixed partition up front. This spreads work more evenly and can be up to ~2x faster for skewed scans such as ClickBench.
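
The scheduling idea can be sketched with threads pulling from a shared queue. This is a toy model in Python and says nothing about DataFusion's actual Rust execution; `split_into_morsels` and `run_morsels` are hypothetical helpers.

```python
import queue
import threading

def split_into_morsels(rows, morsel_size):
    """Cut a large unit of work into small, independently processable pieces."""
    return [rows[i:i + morsel_size] for i in range(0, len(rows), morsel_size)]

def run_morsels(morsels, work, workers=4):
    """Process morsels with worker threads that pull work whenever they are idle."""
    pending = queue.Queue()
    for morsel in morsels:
        pending.put(morsel)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                morsel = pending.get_nowait()  # pull the next morsel, if any
            except queue.Empty:
                return
            out = work(morsel)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

No thread owns a fixed partition, so a thread that finishes a cheap morsel immediately takes another instead of waiting on the slowest peer.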

Struct Field Filter Pushdown and Leaf-Level Projection: Filters on struct fields (e.g. WHERE s['foo'] > 67) are now pushed down into the Parquet decoder rather than evaluated after a full scan, and both filtering and projection read only the struct leaves they actually access, significantly improving performance for nested and Variant data in large Parquet files.

New Features

LATERAL Joins

Lateral joins have been long requested. DataFusion 54 adds basic support for CROSS JOIN LATERAL, INNER JOIN LATERAL, and LEFT JOIN LATERAL. A lateral subquery in the FROM clause can reference columns from preceding tables — handy for expanding a per-row series or correlating against a set-returning function. It uses decorrelation, so the subquery is evaluated once rather than re-executed per outer row.

-- For each row in t1, expand a series 1..t1_int and join the values back
SELECT t1_id, t1_name, i
FROM join_t1 t1
CROSS JOIN LATERAL (
    SELECT * FROM unnest(generate_series(1, t1_int))
) AS series(i);

Lambda Functions

DataFusion now supports lambda expressions (x -> expr) with column capture, plus new higher-order array UDFs like array_transform, array_filter, and array_any_match. Lambdas express per-element computation directly in SQL:

-- Apply `x * 10` to every element
SELECT array_transform([1, 2, 3, 4, 5], x -> x * 10);
-- [10, 20, 30, 40, 50]

-- Keep only elements where `x > 2`
SELECT array_filter([1, 2, 3, 4, 5], x -> x > 2);
-- [3, 4, 5]

-- True if any element satisfies `x > 2`
SELECT array_any_match([1, 2, 3], x -> x > 2);
-- true

Spilling Nested Loop Joins

NestedLoopJoinExec previously failed with an out-of-memory error when the build side exceeded the memory budget. DataFusion 54 adds a memory-limited path that transparently spills to disk and completes the query instead, with zero overhead when memory is sufficient. It currently covers INNER, LEFT, LEFT SEMI, LEFT ANTI, and LEFT MARK joins.

New Avro Reader

The Avro reader now uses the arrow-avro crate, replacing internal conversion code with a faster, better-maintained implementation shared with the Arrow ecosystem.

Extension Type Registry

Arrow extension types let users layer their own semantics on top of a physical storage type. DataFusion 54 adds a registry for registering their behavior, several more canonical extension types, and the ability to cast to an extension type in logical expressions.

Content-Defined Chunking for Parquet

DataFusion's Parquet writer can now use content-defined chunking (CDC), which aligns data page boundaries with the data rather than fixed row counts. This improves deduplication and incremental storage, since inserting or editing a few rows no longer shifts every later page boundary.
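
Why content-defined boundaries survive an insertion can be shown with a deliberately simple sketch. It cuts a byte string wherever a rolling sum over the last few bytes is divisible by a constant; real implementations use stronger rolling hashes plus minimum and maximum chunk sizes, and Parquet's CDC works on column data rather than raw bytes. All names and parameters here are hypothetical.

```python
def cdc_chunks(data, window=3, divisor=16):
    """Cut after any byte where the sum of the last `window` bytes divides evenly."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        rolling += byte
        if i >= window:
            rolling -= data[i - window]  # slide the window forward
        if i >= window - 1 and rolling % divisor == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def fixed_chunks(data, size=16):
    """Baseline: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]

original = bytes(range(256))
edited = b"XYZ!!" + original  # insert five bytes at the front

shared_cdc = set(cdc_chunks(original)) & set(cdc_chunks(edited))
shared_fixed = set(fixed_chunks(original)) & set(fixed_chunks(edited))
```

With content-defined boundaries every chunk after the first is unchanged by the insertion, while fixed-size chunking shares none, which is the property that helps deduplication and incremental storage.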

New Functions

SQL and Scalar Functions: DataFusion 54 adds new scalar functions including array_compact, cosine_distance, inner_product, array_normalize, cast_to_type, and with_metadata, plus nanosecond date_part support and the : JSON access operator. The cosine_distance, inner_product, and array_normalize additions round out DataFusion's vector-search building blocks.
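
For readers unfamiliar with the vector-search functions named above, the sketch below gives the conventional mathematical definitions in Python. It assumes `cosine_distance` means one minus cosine similarity and does not model how DataFusion treats NULLs, zero vectors, or element types.

```python
import math

def inner_product(a, b):
    """Sum of element-wise products of two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(x * y for x, y in zip(a, b))

def array_normalize(a):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(inner_product(a, a))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in a]

def cosine_distance(a, b):
    """One minus the cosine similarity of two vectors."""
    return 1.0 - inner_product(array_normalize(a), array_normalize(b))
```

Orthogonal vectors sit at distance 1 and parallel vectors at distance 0, regardless of their lengths.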

Spark-Compatible Functions: The datafusion-spark crate gains many new or improved Spark-compatible functions, including round, floor, ceil, soundex, xxhash64, array_contains, array_compact, int/float-to-timestamp casts, and UTF-8 validation functions.

Upgrade Guide and Changelog

Upgrading to 54.0.0 should be straightforward for most users, though there are some breaking changes. See the Upgrade Guide for details and migration snippets, and the changelog for the full list of changes.

About DataFusion

Apache DataFusion is an extensible query engine, written in Rust, that uses Apache Arrow as its in-memory format. DataFusion is used by developers to create new, fast, data-centric systems such as databases, dataframe libraries, and machine learning and streaming applications. While DataFusion's primary design goal is to accelerate the creation of other data-centric systems, it provides a reasonable experience directly out of the box as a dataframe library, Python library, and command-line SQL tool.

How to Get Involved

DataFusion is not a project built or driven by a single person, company, or foundation. Rather, our community of users and contributors works together to build a shared technology that none of us could have built alone.

If you are interested in joining us, we would love to have you. You can try out DataFusion on some of your own data and projects and let us know how it goes, contribute suggestions, documentation, bug reports, or a PR with documentation, tests, or code.

Design aisecurityfrontend

Can Your AI Pass the Accessibility Test?

AI scales accessibility barriers just as efficiently as it writes code, requiring developers to bake inclusive checks into their automated CI/CD pipelines.

Summary

What: Aaron Gustafson (web developer) and Jessie Lorenz (Microsoft PM) argue that AI-assisted coding typically produces inaccessible results. They advocate for 'shifting left' by using tools like the Figma Accessibility Assistant and GitHub's open-source Accessibility Scanner to enforce rules during development.
Why it matters: As AI agents increasingly handle UI generation, the 'accessibility debt' will accumulate at machine speed. Development teams that do not treat accessibility as a programmatic contract will eventually face massive remediation costs.
Takeaway: Integrate the GitHub Accessibility Scanner into your CI pipeline and start testing your AI-generated code against deterministic accessibility rules rather than relying on LLM-inferred quality.

Deep Dive

  • Accessibility debt follows an exponential cost curve: a fix is nearly free in planning, roughly 10x costlier in design, 100x in development, and 1000x after release.
  • AI models are trained on an inaccessible web, meaning they default to generating non-compliant code.
  • 'Shift-left' means moving accessibility checks into the editor (linting) and CI/CD (automated testing).
  • Deterministic tests are essential, but automated scanners catch only ~50% of issues, so manual user testing remains mandatory.
  • The 'curb-cut effect' demonstrates that inclusive design patterns improve usability for all users, not just those with disabilities.

Decoder

  • Shift-left: A software engineering practice of moving testing, security, and quality checks earlier in the development lifecycle.
  • Deterministic Test: A test that produces the same output every time it is run with the same input, unlike AI-based heuristic testing.
  • Curb-Cut Effect: The phenomenon where innovations designed for people with disabilities end up benefiting the entire population.
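
As a small illustration of a deterministic check that could run as a lint rule or CI gate, the sketch below flags `img` elements with no `alt` attribute using Python's standard `html.parser`. It covers a single rule and is not the GitHub Accessibility Scanner or any other tool named above; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

class MissingAltChecker(HTMLParser):
    """Collects the position of every <img> tag that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.violations = []

    def handle_starttag(self, tag, attrs):
        # An empty alt="" is allowed: it marks the image as decorative.
        if tag == "img" and "alt" not in dict(attrs):
            self.violations.append(self.getpos())  # (line, column)

def find_images_without_alt(html):
    checker = MissingAltChecker()
    checker.feed(html)
    checker.close()
    return checker.violations

def ci_exit_code(html):
    """Return 1 (fail the build) if any violation is found, else 0."""
    return 1 if find_images_without_alt(html) else 0

page = '<p>Hi</p>\n<img src="a.png">\n<img src="b.png" alt="">\n'
violations = find_images_without_alt(page)
```

The same input always yields the same violations, so the check can block a merge without depending on a model's judgment.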

Original Article

Last week at Microsoft Build, Jessie Lorenz, Carie Fisher, and I gave a short talk on a question I think every team using AI should be asking: can your AI pass the accessibility test?

The core point was straightforward: AI does not fix a broken process. It accelerates whatever process you already have. If accessibility is already part of the workflow, AI can help scale inclusion. If it’s not, AI will absolutely accelerate and scale the barriers teams are already shipping.

Full Transcript

Jessie Lorenz: Everybody here ships software on a pipeline that looks something like this, right? Planning, design, development and coding, code review and CI/CD, public release, and then feedback. And then you kind of turn around and start the whole merry-go-round again, incorporating feedback back into planning.

It is a six-stage pipeline. You know it. I know it too. I know it as a blind PM at Microsoft and I know it as someone who was born blind and often encounters the accessibility errors or barriers that get shipped.

So hold on to this pipeline, because everything we are about to talk about lands somewhere on it.

Jessie: Let us look at what happens when you solve your accessibility issues in the planning process. If you do that, it is a conversation. It is usually free to fix.

Wait until design? Well, it is going to cost you ten times more. Wait until it reaches development? One hundred times more. If you do not notice your accessibility barriers until they actually ship, it compounds even more: one thousand times more.

And each order of magnitude that compounds, if you really think about it, is a person, and multiples of people, being locked out of what you ship.

Accessibility debt is a lot like security debt, and you would not just let security debt lie. Accessibility debt should be treated the same.

Jessie: So we’re all excited about AI, right? Well, it accelerates the pace of software development and it also accelerates the creation of accessibility barriers. Both sides of the coin are true.

Why? AI is trained largely on a web that is inaccessible. AI learned from a web that is primarily inaccessible, full of barriers, and AI ships these barriers faster and in places where we’ve not seen them before.

Jessie: Accessibility belongs in your workflow. We are talking today not about slowing things down, not about trying to make things harder. We are actually talking about catching things early, catching things when they are cheaper, and implementing accessibility checks in every stage of the workflow.

That could be a lint rule in the editor. That could be a gate in CI. It could be a flag in code review. It is the same workflow that we talked about before, the same pipeline. It is just that each step has an accessibility check on it.

Jessie: I really hope I can make this real for you.

People are always coming to talk to me about how to get money for their accessibility ideas. One guy came and talked to me and he was trying to get VC funding to re-carpet the San Francisco airport. He thought it would be a really good idea to give blind people canes that could guide them on this special carpet and the canes would have Bluetooth in them.

Well, I saw this guy’s demo. He actually had some investor money lined up, and I asked him one simple question. It is the question I ask everyone: Did you talk to a single person with disabilities before creating this demo?

The answer to my question is all too often no. Unfortunately, this is an example of what ungrounded planning looks like. You have a clever solution, but you are not solving the right problem.

So the first step in creating an accessible software development lifecycle is to make sure accessibility is in your planning documents. That could be a question about how your feature will serve people with disabilities in a PR/FAQ. But if you do not put us in the roadmap, you are going to end up building something really beautiful that is answering the wrong question.

Jessie: My team in Microsoft AI is building voice-first features in Copilot, and we have data that shows 37% of voice-initiated tasks in Copilot are abandoned. That is not good.

We also have data that shows that more than one billion people in the world live with some kind of disability, a limitation to one or more major life activities. And 76% of folks who have dyslexia or other neurological impairments say that they are better at work when they can use Copilot.

So what the data shows us is that when Copilot voice breaks, it hits the people who need it most first and hardest. That is why we created a feature called Speak to Done, and that is how it got on our roadmap.

Now I am going to pass it to Aaron.

Aaron Gustafson: Once the planning is done, most teams begin designing the user experience and the user interface. Now, whether they are using traditional design software or newer vibe-design approaches, it is imperative that designers have the right tools and information to help them make accessible choices.

Aaron: One of the efforts that I have had the pleasure of working on in this space is the Accessibility Assistant plugin for Figma. It offers a suite of tools to help designers clarify the intent of their interface and assess the quality of their design when it comes to supporting people with disabilities.

Here on the screen I have a visual showing an annotated UI. Those annotations can actually act as an accessibility spec to guide engineering work. In fact, we are working on a new feature that will allow designers to export this accessibility information and hand that off to developers, whether they are human or whether they are agentic.

We have seen some really positive results from our early tests of handing off that accessibility spec to agentic workflows, both in being able to fix existing bugs in the interface based on the accessibility spec and in being able to build interfaces from scratch. They are even able to take simple things like an example row in a grid and extrapolate the accessibility annotations for that one row out to all the rows within the grid, which is really exciting.

The demo also shows one of the visualization tools that we have. We have a bunch of tools we are adding in for designers to help them understand how their designs end up being understood or experienced by people. In this case, we have done a focus order overlay to visually display how somebody would move through the interface.

With this information available early, when, as Jessie said, changes are very cheap to make, improving accessibility at the design stage can take mere moments yet have a huge impact. It costs a lot more to change a design once it is already ensconced in code.

Aaron: Even with a rigorous design system and robust accessibility specs, it’s still critical to embed accessibility into the coding process. As Jessie mentioned at the beginning of this talk, that becomes even more critical in the era of AI-assisted coding.

Aaron: Out of the box, most coding agents are pretty terrible when it comes to accessibility. That’s not surprising though. As Jessie said, they’re trained on what we created, and the web we created has not been all that accessible, so they learned from us.

Left to their own devices, most of the code they write only passes about 8–25% of automatable accessibility checks.

Instruction files are often touted for their ability to steer models toward better outcomes by teaching them what the accessibility expectations should be up front, but even with that the pass rate only climbs to 37–60% of automatable checks.

When you start imbuing agents with skills, you get closer to a pass rate of roughly 86%, but it is not until you give them actual deterministic tests to run, and instructions on how to use them, that you get them to iterate until the code passes all of those automatable checks.

But even that only gets you so far because it takes you to the limit of what is testable purely in an automated fashion, via unit tests and integration tests. It does not cover things like usability, so I want to put it in context: automatable tests only cover about 50% of what you need for a UI to be considered truly accessible.

Still, every step in the right direction counts. And now that we have discussed accessibility in the code authoring context, I am going to hand it off to Carie to talk about embedding accessibility in the code review process.

Carie Fisher: I am from GitHub, and we made an Accessibility Scanner. It’s open source. If you have not checked out the booth, we have a demo there.

What we are talking about here is bringing accessibility alongside the other feedback developers already expect. In pull requests, it should show up alongside code quality and security feedback. In CI/CD, accessibility checks should run where automation makes sense, whether you are using GitHub Actions, Azure Pipelines, browser tests, or deterministic tests. We want to be where you are and make sure accessibility is considered.

After release, issues and feedback should flow back into the backlog and into better patterns for next time. The PR is necessary, but it is not enough. If accessibility only appears at the end, we are still fixing after the fact.

The bigger direction here is to embed accessibility into AI tooling and engineering practices so innovation scales inclusion, not exclusion.

Carie: GitHub’s AI-powered Accessibility Scanner makes that workflow concrete with a simple loop: Find, File, Fix.

First, it helps find repeatable accessibility issues. Then it files actionable GitHub issues instead of leaving teams with a separate report that lives outside their workflow. And then those issues become useful context for fixing, including with Copilot-assisted remediation.

Under the hood, we’re using Deque’s aXe ruleset, which is basically the gold standard for automated accessibility checks. It is free, it is open source, and we created the GitHub workflow around it so you can scan your page for accessibility errors, create issues, and then even assign those issues to Copilot to draft a pull request.

We deliberately left a spot for a human to stay in the loop and make sure the result is truly accessible. As Aaron and Jessie have said, automated checks may only catch about half of the issues overall. You still need manual checks and you still need to work with people with disabilities to make sure your product is really inclusive.

Carie: We don’t have time to run through the full demo on stage, but this is the workflow we wanted people to remember. Accessibility is not happening in a separate process. It is part of the same system teams already use to build, triage, review, automate, and ship.

Carie: When accessibility is integrated, magic happens.

Jessie: I am a blind PM at Microsoft, and most days I spend telling software what I want it to do and then an inordinate amount of time cleaning up what it did wrong.

My team owns two really powerful features called Copilot Tasks and Copilot Cowork. They are super powerful. However, they did not have voice included in them, so we created a feature called Speak to Done, which is the voice layer over Copilot Tasks and Copilot Cowork.

I said, “Find that track meet email from three weeks ago, add it to my calendar, and text the other parent.”

Copilot did not lose context and did not break. It found the track meet in the Gmail thread, put it on the calendar, and sent it to the other parent.

Now let me be clear: without the shift-left practices, without the tools that we talked about today, the Figma plugin at the design process, the deterministic testing, the code review, none of these features reach me.

Speak to Done is really cool because it solves a problem that disproportionately impacts folks with disabilities, like we talked about earlier, but it also helps everybody who wants to use Copilot with their voice.

In disability circles, we talk a lot about the curb-cut effect. Back in the 1960s, they did not have curb cuts. Wheelchair users in Berkeley literally started taking sledgehammers to the curbs and making them themselves. Now we see curb cuts everywhere, and who benefits? People who use strollers, delivery workers, all of society.

That is a little bit of what we are trying to convey today. When we are talking about the things we build, we are not talking about tools. We are talking about who gets to use what we ship.

I encourage you to be conscious about creating a more accessible world, one code snippet at a time.

Design aillm

Context Architecture

AI product design is pivoting from simple prompt engineering to 'context architecture,' which treats the AI's information environment like an information architecture project.

Summary

What: Paz Perez at Nielsen Norman Group argues that as agents become autonomous, designers must use principles like hierarchy, taxonomy, and labeling to structure the system's memory, tools, and retrieval pipelines for better accuracy.
Why it matters: This signals that the bottleneck for reliable AI agents is no longer model capability, but the structural design of the information pipeline feeding those models.
Takeaway: When designing agentic systems, audit your tool definitions and knowledge base taxonomies; ensure they reflect the user's mental model rather than the internal engineering terminology of your APIs.

Deep Dive

  • Moving beyond static 'prompt engineering' to dynamic 'context architecture'.
  • Applying classical Information Architecture (IA) principles to LLM ecosystems.
  • Structuring retrieved knowledge to reduce retrieval 'noise' and model hallucinations.
  • Designing categorization and labeling that mirrors user mental models to improve tool selection.
  • Establishing clear retention and scoping rules for long-term agent memory.
  • Managing information overload in the model's 'attention' window.
  • Treating context design as a non-neutral UX task that actively shapes agent behavior.

Decoder

  • Context Engineering: The practice of assembling instructions, retrieved data, tools, and memory into the AI's prompt window to influence model output.
  • RAG (Retrieval-Augmented Generation): An architecture where the AI retrieves external information from a database to ground its answers before generating a response.
  • MCP (Model Context Protocol): An open standard that allows AI models to connect to different external tools and data sources in a uniform way.
  • Agentic System: An AI application that can perform tasks, use tools, and make decisions across multiple steps without constant human intervention.

AI llmopensource

GLM-5.2

Z.ai has released GLM-5.2, a new flagship model featuring 1M-context support, which will be open-sourced under the MIT License next week.

Summary

What: GLM-5.2 is now available to 'GLM Coding Plan' users with High and Max 'thinking-effort' levels. API and chatbot services are scheduled for release next week alongside the full open-source weights.
Why it matters: Z.ai continues to aggressively position its open-source flagship models as a high-performance alternative to proprietary closed-source frontier models in coding and agentic tasks.

Original Article

Intelligence should be open, accessible, and ready to build with, empowering every developer, everywhere.

GLM-5.2 is now available to all GLM Coding Plan users, including Lite, Pro, Max, and Team plans.

As our new flagship model, GLM-5.2 delivers powerful coding capabilities, usable 1M-context support, and continued strengths in long-horizon tasks.

API and Chatbot services will launch next week. The model will also be officially open-sourced next week under the MIT License.

The future of AI is open, and it belongs to the people.

GLM-5.2 supports two thinking-effort levels: High and Max.

For coding tasks, we recommend using Max effort to enable deeper reasoning and more reliable performance.

AI llmresearch

The Physics of a Fable

The core competitive advantage in current LLMs is shifting from raw scale to verifiable reward signals and test-time compute.

Summary

What: Rafa Schwinger theorizes that models like Claude Mythos utilize 'environment foundry' architectures where the primary innovation is using long-horizon process rewards and GRPO-style verification to constrain reward hacking. He argues the bottleneck for top-tier models is now the quality of verifiable feedback during pretraining, rather than just raw text or compute.
Why it matters: This reveals a pivot in the industry where the 'moat' is increasingly found in the automated feedback loops and synthetic data pipelines that train models to reason, rather than the architecture or parameters themselves.

Deep Dive

  • Architecture: Capabilities are derived from a product of base foundation model quality and the gradeable signal extracted during training.
  • Verifier RL: The primary constraint is sound reinforcement learning, specifically preventing reward hacking via improved verifiers.
  • Context Handling: Learned context-folding is replacing massive context windows, allowing models to perform effectively with smaller active windows (32K tokens).
  • Test-time Compute: Models now utilize 'effort dials' where best-of-N search and extended compute are explicitly exposed for high-reasoning tasks.

Decoder

  • GRPO (Group Relative Policy Optimization): An RL technique where models are evaluated relative to a group of responses, helping to stabilize training.
  • Reward hacking: When a model optimizes for a flawed reward function to achieve high scores without actually performing the desired task.
  • Context-folding: A method of compressing or summarizing information to fit long-horizon data into a fixed, smaller memory window.
  • Test-time compute: Increasing the amount of processing a model does (e.g., chains-of-thought or multi-step search) after the model has been trained, specifically to improve accuracy on difficult queries.
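
As a rough illustration of the "relative to a group" idea in the GRPO entry above, the sketch below normalizes each response's reward against the mean and spread of its own group. The function name, the `eps` guard, and the use of the population standard deviation are assumptions made for this example; it is not taken from the article or from any particular training stack.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Score each reward relative to its own group's mean and spread.

    Assumption: population standard deviation plus a small eps guard;
    real implementations differ in these details.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored 0 or 1 by a hypothetical verifier.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that beat their group average get a positive advantage and the rest get a negative one, so no separate value model is needed to provide a baseline. When every response in a group scores the same, all advantages are zero and the group carries no learning signal.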

Original Article

The Physics of a Fable

The Verifier is the Moat

How Fable was probably built, and why its lead is measured in months rather than years

Disclaimer: Everything here is personal inference with publicly available information,

AI researchcomputer-vision

Count Anything

The 'Count Anything' model uses dual-granularity counters to achieve domain-agnostic object counting across diverse visual contexts.

Summary

What: Mengqi Lei and colleagues released 'Count Anything', a generalist vision model trained on a new 220K-image dataset (CLOC) that spans six visual domains, including remote sensing, microscopy, and general scenes. Unlike traditional density-map approaches, it uses a Region-level Sparse Counter for large objects and a Pixel-level Dense Counter for crowded, small targets.
Why it matters: This moves object counting away from niche, single-domain models toward a unified, text-promptable standard that can be applied to vastly different data types using one model.

Decoder

  • Density-map: A common computer vision output where models predict a heatmap to estimate object density rather than identifying individual discrete objects.

Original Article

Count Anything

Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.

AI dataweb

Introducing the Open Knowledge Format

Google Cloud's Open Knowledge Format (OKF) standardizes 'LLM-wiki' patterns as a vendor-neutral, file-based specification for AI-ready documentation.

Summary

What: Sam McVeety and Amir Hormati from Google Cloud introduced OKF v0.1, a specification that uses markdown files with YAML frontmatter to represent structured metadata (tables, runbooks, metrics). It aims to replace bespoke 'LLM-wiki' patterns with a standard directory structure that any agent or human can read without proprietary SDKs.
Why it matters: This signals a move toward standardized 'context-as-code' patterns, acknowledging that agent performance often hinges on the quality and interoperability of the data provided to them.
Takeaway: Try the reference implementation by creating a directory of markdown files with `type` in the YAML frontmatter to index your team's internal documentation for agent ingestion.

Decoder

  • YAML frontmatter: A block of YAML metadata at the top of a Markdown file, commonly used by static site generators and now increasingly used for LLM context injection.

Original Article

Introducing the Open Knowledge Format

As foundation models continue to improve, the lack of relevant context often limits what they can do, especially as they are used to build agentic systems. While these models can help you write code, summarize documents, or analyze a dataset, they still need the right information to produce accurate and actionable results.

That’s why today, we’re introducing the Open Knowledge Format (OKF), an open specification that formalizes the LLM-wiki pattern into a portable, interoperable format. This is a vendor-neutral, agent- and human-friendly standard for representing the metadata, context, and curated knowledge that modern AI systems need.

As published, OKF v0.1 represents knowledge as a directory of markdown files with YAML frontmatter, with a small set of agreed-upon conventions that let wikis written by different producers be consumed by different agents without translation.

That's it. No complex compression scheme, no new runtime, no required SDK. A bundle of OKF documents is:

  • Just markdown — readable in any editor, renderable on GitHub, indexable by any search tool
  • Just files — shippable as a tarball, hostable in any git repo, mountable on any filesystem
  • Just YAML frontmatter — for the small set of structured fields that need to be queryable: type, title, description, resource, tags, and timestamp

If you've used Obsidian, Notion, Hugo, or any of the LLM wiki patterns that have emerged over the past year, the shape will feel familiar. OKF formalizes the small set of conventions needed to make these patterns interoperable.

Let’s take a look at the problem that OKF can solve for your organization, how it works, how to get started with it, and what’s next.

A fragmented context landscape

In most organizations, the information that foundation models use is overwhelmingly internal knowledge: the schema of a table, your business's definition of a metric, the runbook for an incident, the join paths between two systems, the deprecation notice for an old API, etc.

Today, these atoms of knowledge live in a variety of highly fragmented systems:

  • Metadata catalogs with their own APIs
  • Wikis, third-party systems, or shared drives
  • Code comments, docstrings, or notebook cells
  • The heads of a few senior engineers

When an AI agent needs to answer "How do I compute weekly active users from our event stream?" it has to assemble the answer from these scattered, mutually incompatible surfaces. Every vendor offers its own catalog, its own SDK, its own knowledge-graph schema, and none of the knowledge is easily portable across products or organizations.

The result: Every agent builder is solving the same context-assembly problem from scratch, every catalog vendor is reinventing the same data models, and the knowledge itself is locked behind whichever surface created it.

Knowledge as a living wiki

Developer teams are changing how they build AI agents. Instead of using models to search the same documents for the same facts over and over, you can give your agents a shared markdown library that grows more useful over time. This lets your agents take on the drudgery of reading and updating their own files, while your team curates the content and manages it like code.

Andrej Karpathy, the prominent AI researcher and educator, articulates this idea most crisply in his LLM Wiki gist. "LLMs don't get bored, don't forget to update a cross-reference, and can touch 15 files in one pass," he writes. The bookkeeping that causes humans to abandon personal wikis is exactly what LLMs are good at.

A similar knowledge-as-wiki pattern keeps reappearing under different names: Obsidian vaults wired to coding agents, the AGENTS.md / CLAUDE.md family of convention files, repos full of index.md and log.md artifacts that agents consult before doing real work, and "metadata as code" repositories inside data teams.

The pattern is compelling and powerful, but each instance is bespoke. Karpathy's wiki and your team's wiki and a vendor's catalog export may all look alike (markdown, frontmatter, cross-links), but none of them are intentionally designed to cooperate. There is no agreed-upon answer to what fields every document should carry, or what filenames mean what. As a result, the knowledge encoded in wikis remains siloed within the original teams, leading to redundant effort whenever a new agent is built.

What's missing is a format, not another service

The answer to this problem isn’t another knowledge service. You need a format, a way to represent knowledge that:

  • Anyone can produce, without an SDK
  • Anyone can consume, without an integration
  • Survives moving between systems, organizations, and tools
  • Lives in version control alongside the code it describes
  • Is readable by humans and parseable by agents: the same file, no translation layer

By design, OKF is that format.

How OKF works: The design in one screen

An OKF bundle is a directory of markdown files representing concepts: anything you want to capture, including tables, datasets, metrics, playbooks, runbooks, and APIs. Each concept is one file. The file path is the concept's identity.

sales/
├── index.md
├── datasets/
│ ├── index.md
│ └── orders_db.md
├── tables/
│ ├── index.md
│ ├── orders.md
│ └── customers.md
└── metrics/
  ├── index.md
  └── weekly_active_users.md
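
To make "the file path is the concept's identity" concrete, here is a minimal sketch that walks a bundle and lists its concepts by bundle-relative path. It assumes `index.md` and `log.md` are the reserved, non-concept filenames, which should be checked against the spec; `list_concepts` and `build_sample_bundle` are hypothetical helpers written for this example, not part of any OKF tooling.

```python
import tempfile
from pathlib import Path

# Assumption: these are the reserved filenames that are not concepts themselves.
RESERVED = {"index.md", "log.md"}


def list_concepts(bundle_root):
    """Return the bundle-relative path of every concept file, sorted."""
    root = Path(bundle_root)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*.md")
        if path.name not in RESERVED
    )


def build_sample_bundle(bundle_root):
    """Recreate the layout of the sales/ example with placeholder files."""
    files = [
        "index.md",
        "datasets/index.md",
        "datasets/orders_db.md",
        "tables/index.md",
        "tables/orders.md",
        "tables/customers.md",
        "metrics/index.md",
        "metrics/weekly_active_users.md",
    ]
    for rel in files:
        path = Path(bundle_root) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text("---\ntype: Example\n---\n", encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    build_sample_bundle(tmp)
    concepts = list_concepts(tmp)

print(concepts)
```

Because identity is just a relative path, the same listing works whether the bundle sits in a git checkout, an unpacked tarball, or a mounted filesystem.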

Each concept document has a small block of YAML front matter for structured fields and a markdown body for everything else:

---
type: BigQuery Table
title: Orders
description: One row per completed customer order.
resource: https://console.cloud.google.com/bigquery?p=acme&d=sales&t=orders
tags: [sales, revenue]
timestamp: 2026-05-28T14:30:00Z
---
# Schema
| Column | Type | Description |
|---------------|-----------|------------------------------------------|
| `order_id` | STRING | Globally unique order identifier. |
| `customer_id` | STRING | FK to [customers](/tables/customers.md). |
# Joins
Joined with [customers](/tables/customers.md) on `customer_id`.

Concepts link to each other with normal markdown links, turning the directory into a graph of relationships that is richer than the parent/child links implied by the file system. Bundles can optionally include index.md files (for progressive disclosure as agents navigate the hierarchy) and log.md files (for chronological history of changes).
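
One way to picture that graph is to pull the markdown link targets out of each concept and treat them as outgoing edges. The sketch below does this with a regular expression over a fragment of the orders example. The helper name and the choice to follow only links ending in `.md` are assumptions for illustration; the spec's cross-linking rules may be stricter or looser.

```python
import re

# Matches [label](target.md) and captures the target.
# Assumption: only links that end in .md point at other concepts.
LINK = re.compile(r"\[[^\]]*\]\(([^)\s]+\.md)\)")


def outgoing_links(markdown_text):
    """Return the distinct .md link targets in first-seen order."""
    targets = []
    for target in LINK.findall(markdown_text):
        if target not in targets:
            targets.append(target)
    return targets


orders_body = """# Schema
| `customer_id` | STRING | FK to [customers](/tables/customers.md). |
# Joins
Joined with [customers](/tables/customers.md) on `customer_id`.
"""

# Adjacency list keyed by the concept's own path.
graph = {"/tables/orders.md": outgoing_links(orders_body)}
print(graph)
```

Running the same function over every file in a bundle yields an adjacency list that an agent or visualizer could traverse.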

The full v0.1 specification (including conformance criteria, cross-linking rules, and the small number of reserved filenames) fits on a single page.

Three principles behind the design

1. Minimally opinionated. OKF requires exactly one thing of every concept: a type field. Everything else (e.g., what types exist, what other fields to include, what sections the body has) is left to the producer. The spec defines the interoperability surface, not the content model.

2. Producer/consumer independence. OKF cleanly separates who writes the knowledge from who consumes it. A bundle hand-authored by a human can be consumed by an AI agent. A bundle generated by a metadata export pipeline can be browsed in a visualizer. A bundle synthesized by one LLM can be queried by another. The format is the contract; the tooling at each end is independently swappable.

3. Format, not platform. OKF is not tied to any specific cloud, database, model provider, or agent framework. It will never require a proprietary account or SDK to read, write, or serve. We're publishing it as an open standard because the value of a knowledge format comes from how many parties speak it, not from who owns it.
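
The first principle is easy to check mechanically. The sketch below is a minimal check that a concept document carries a non-empty `type` field. It reads only simple top-level `key: value` lines and is not a full YAML parser; a real consumer would more likely use a YAML library. The function names are hypothetical and this is not the spec's conformance test.

```python
def split_front_matter(text):
    """Split a document into (front matter lines, body).

    Returns ([], text) when there is no leading '---' block.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return [], text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            return lines[1:i], "\n".join(lines[i + 1:])
    return [], text  # Opening fence without a closing one.


def top_level_fields(front_matter_lines):
    """Read simple top-level `key: value` pairs, skipping everything else."""
    fields = {}
    for line in front_matter_lines:
        # Skip blank lines, comments, and indented (nested) lines.
        if line[:1] in ("", " ", "\t", "#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()
    return fields


def has_type(text):
    """True when the document's front matter has a non-empty `type` field."""
    front_matter, _ = split_front_matter(text)
    return bool(top_level_fields(front_matter).get("type"))


orders = "---\ntype: BigQuery Table\ntitle: Orders\n---\n# Schema\n"
untyped = "---\ntitle: Orders\n---\n# Schema\n"
print(has_type(orders), has_type(untyped))
```

Everything beyond `type` is left to the producer, so a check like this stays small no matter how rich the bundle's content becomes.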

What we're shipping with the spec

To make the format concrete, we're publishing reference implementations at both the producer and consumer ends:

  • An enrichment agent that walks a BigQuery dataset, drafts an OKF concept document for every table and view, then runs a second LLM pass that crawls authoritative documentation and enriches each concept with citations, schemas, and join paths.
  • A static HTML visualizer that turns any OKF bundle into an interactive graph view in a single self-contained file; no backend, no install on the viewing side, no data leaves the page.
  • Three ready-to-browse sample bundles: GA4 e-commerce, Stack Overflow, and Bitcoin public datasets, produced by the reference agent and committed to the repo as living examples of conformant OKF.

Where we go from here

OKF v0.1 is a starting point, not a finished standard. The format will evolve as more producers and consumers emerge and as we collectively learn what knowledge representations agents actually need in practice.

We're publishing in the open from day one because that's the only way a knowledge format earns its name, whether you're building a knowledge catalog, an enrichment pipeline, a wiki tailored to AI agents, or anything in the AI knowledge domain.

From here, we encourage you to:

  • Read the spec (it's short!)
  • Write a producer for your source system, your database, your documentation site
  • Write a consumer: a viewer, a search index, an agent that reasons over bundles
  • Try the reference implementation against your own data
  • File issues, send PRs, or propose extensions: The spec is versioned and explicitly designed for backward-compatible growth

The repo, the spec, and the sample bundles are available on GitHub. We have also updated Google Cloud’s Knowledge Catalog to be able to ingest Open Knowledge Format and serve it to our agents.

The format itself is the contribution. The tools we've shipped exist to make it real, and to lower the cost of trying it out. Whatever shape your knowledge takes today, OKF is designed to be the lingua franca it can be exchanged for tomorrow.

AI researchdevops

olmo-eval: An evaluation workbench for the model development loop

AllenAI's olmo-eval workbench introduces pairwise checkpoint comparison to help developers track if model changes are real improvements or noise.

Summary

What: Tyler Murray and Kyle Wiggers released olmo-eval, a tool that extends the OLMES standard to support iterative model development. Unlike publishing-focused tools like Harbor, olmo-eval features a modular architecture that lets developers compare model checkpoints question-by-question, handle agentic tool-use loops, and run benchmarks with varying degrees of resource isolation.
Why it matters: This highlights a growing focus on the 'development loop'—moving away from static benchmarking to tools that can detect regressions or minor gains during the actual training process.
Takeaway: Switch to olmo-eval if your current evaluation pipeline struggles to surface performance differences between model checkpoints on a per-question basis.

Deep Dive

  • Modularity: Decouples the benchmark task definition from the runtime policy, allowing the same benchmark to be run in standard or agentic modes.
  • Pairwise comparison: Focuses on showing precisely where one checkpoint beats another rather than just reporting aggregate scores.
  • Resource management: Offers a tiered approach, running simple evals directly while isolating agentic tool-use benchmarks in Docker containers or Modal environments.
  • Minimal Detectable Effect: Includes statistical reporting to determine if performance shifts are statistically significant given the evaluation sample size.

Decoder

  • Harness: The execution layer that controls how a model is run during evaluation, including tools, scaffolding, and sandboxes.
  • LLM-as-a-judge: Using a powerful model to grade the output of a smaller, experimental model based on specific criteria.

Original Article

olmo-eval: An evaluation workbench for the model development loop

Code: https://github.com/allenai/olmo-eval

While you're building an LLM, you evaluate it over and over across many interventions. Every adjustment to its data, architecture, or hyperparameters — and every step up in scale — sends you back through the same loop: adding or reconfiguring benchmarks, re-running them on each new model checkpoint, noting the results, and checking whether something that helped in a small experiment still holds up on the full training run.

Most evaluation tools aren't designed for this: they're built either to run established benchmarks across finished models or to run a model through multi-step, tool-using problems in a sandbox. They don’t keep up with a model that's constantly changing, nor do they reflect how a model might behave under specific real-world conditions.

Our last project to address this evaluation challenge was OLMES, the Open Language Model Evaluation Standard. Introduced in 2024, it was meant to make LLM benchmark scores easier to compare across releases. The same models were being scored on the same benchmarks in different ways — aspects like prompt formatting and task formulation often varied from paper to paper — so claims about which models performed best often weren't reproducible. OLMES pinned benchmarking choices down in an open, documented standard, and it became the basis for evaluating our open models from Olmo to Tulu.

But a model's final score is only part of the evaluation process—which is why we're releasing olmo-eval, a new workbench that builds on OLMES and extends it across the rest of LLM development. Compared to OLMES, olmo-eval cuts down the work of implementing new evaluations, offers more flexibility in defining where and how they run, and makes it easier to compose individual components into larger workflows. Agentic and multi-turn evaluation is supported as a first-class use case, and stronger analysis tools help you judge whether an intervention actually improved on the baseline or the difference amounts to noise.

How olmo-eval differs from existing tools

Is a 2.4pp change in performance enough to make a call?

olmo-eval overlaps in some ways with Harbor, an open framework for evaluating AI agents inside containerized, sandboxed environments. But the two tools differ in their scope. Harbor is aimed mainly at running and publishing agent benchmarks; olmo-eval was built for the everyday work of developing a model—adding and configuring benchmarks, running them across checkpoints, and analyzing the results prompt by prompt instead of as a single overall score.

Harbor runs everything the same way—inside sealed, reproducible containers. Because containers can be resource-intensive, olmo-eval lets you choose how each benchmark runs instead. A benchmark that just needs a model to answer questions can run directly, which is faster and cheaper; a benchmark that needs a locked-down environment — say, one that runs code the model wrote — gets an isolated container setup. The lightweight path is the default, and olmo-eval only opts for the heavy setup when a benchmark actually requires it.

Harbor's process for adding a benchmark is built for evals you plan to publish and share publicly, with the extra verification steps that entails. olmo-eval is built for moving quickly while you develop, and how you add a benchmark depends on what the benchmark needs: a short definition for a basic eval, with options to let a model use tools as it works through a benchmark, or — for a benchmark that already has its own code and procedure — a thin wrapper so olmo-eval can run it as is and report the results alongside other benchmark scores in the same format.

Both Harbor and olmo-eval keep benchmarks separate from the runtime policy (how the model is run to produce its answers) so you can change one without rewriting the other, but olmo-eval is designed for greater modularity. In olmo-eval, the model being evaluated, the tools it can use, the containerized environment, and any helper models – like an LLM-as-a-judge – are all swappable components. You can reuse a tool across many harnesses, or plug a grading model into one benchmark without perturbing the others, and adjust small settings (e.g., the exact wording of the prompt) without extensive effort.

Harbor reports an overall score for each model. olmo-eval reports those scores too, each with a standard error and a minimum detectable effect (the smallest difference that can be reliably distinguished from noise). But the more useful view lines the same questions up across two model checkpoints and compares them one by one, with all else held fixed. This helps you to see whether a tiny change in an overall average might indicate a real improvement or simply noise.
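As a rough illustration of the statistics mentioned above, the sketch below computes a paired, question-by-question comparison between two checkpoints using textbook formulas. It is not olmo-eval's implementation: the function names, the normal-approximation constants (1.96 for about 5% two-sided significance, 0.84 for about 80% power), and the per-question scores are all assumptions made for the example.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def standard_error(xs):
    """Standard error of the mean, using the sample standard deviation."""
    n = len(xs)
    m = mean(xs)
    variance = sum((x - m) ** 2 for x in xs) / (n - 1)
    return math.sqrt(variance / n)

def minimum_detectable_effect(se, z_alpha=1.96, z_power=0.84):
    """Smallest true difference detectable under a normal approximation."""
    return (z_alpha + z_power) * se

def paired_comparison(scores_a, scores_b):
    """Compare two checkpoints on the same questions, one question at a time."""
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    se = standard_error(diffs)
    return {
        "mean_diff": mean(diffs),
        "se": se,
        "mde": minimum_detectable_effect(se),
        "improved": sum(d > 0 for d in diffs),
        "regressed": sum(d < 0 for d in diffs),
    }

# Hypothetical per-question correctness (1 = correct) for two checkpoints.
baseline = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
candidate = [1, 1, 1, 1, 0, 1, 1, 0, 0, 1]
result = paired_comparison(baseline, candidate)
# The observed gain is only meaningful if it is at least as large as the MDE.
looks_real = abs(result["mean_diff"]) >= result["mde"]
```

With only ten questions the 10-point gain falls well below the minimum detectable effect, which is the kind of case where an overall average alone would be misleading.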

If you're looking for... | olmo-eval offers
Authoring a multi-example benchmark | Task subclass with a DataSource, metrics, and scoring surface
Wrapping an existing agent-style benchmark with its own runner | ExternalEval or SandboxedExternalEval; the benchmark keeps its loop and scoring, and results land in olmo-eval's schema
Swapping the runtime under a fixed benchmark | --harness and harness presets; the harness carries provider, tools, scaffold, sandboxes, and auxiliary providers
Parallel container execution | Sandbox instances for parallel executors with capability-based routing, Docker or Modal modes
Tool definitions reusable across tasks and harnesses | @tool decorator with optional global registry
Multi-turn execution loops | Scaffolds, e.g., openai_agents, selected per harness, not baked into the task definition

An integrated evaluation stack

olmo-eval is composed of four components that are useful on their own but designed to work together to tighten the experimental LLM development loop:

  1. A task/suite/harness abstraction that decouples benchmark logic from runtime policy. A task is how you define a benchmark in olmo-eval—what's being evaluated. A suite groups tasks into a set you run together, and a harness controls how each task is run. This separation lets the same task run as a standard baseline or with tools and scaffolding, without changing what it measures.
  2. A sandbox and capability-routing layer, including an asynchronous sandbox planner. This supports evaluations where a model's response depends on the actions it takes using tools, like writing and running code or browsing the web. The point is to evaluate the model's real tool use: when a benchmark calls for tools, olmo-eval runs those tools and feeds the results back to the model.
  3. A normalized experiment schema that records every run, its configuration, and the results in the same structured format. This makes it possible to group related experiments, compare checkpoints over time, and avoid the inconsistencies that often accumulate in long-running model development workflows.
  4. A results viewer for pairwise model comparison: lining two models or checkpoints up question by question surfaces small but real performance changes that an overall average can hide.
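To make the idea of a normalized experiment schema concrete, here is a minimal sketch of how recording every run in one structured format makes checkpoint comparison a simple grouping step. The `RunRecord` fields, the `scores_by_task` helper, and all values are hypothetical and are not taken from olmo-eval's actual schema.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class RunRecord:
    """One evaluation run in a single, uniform shape (hypothetical fields)."""
    experiment: str
    checkpoint: str
    task: str
    harness: str
    score: float

def scores_by_task(records):
    """Group scores by (task, harness) so checkpoints line up side by side."""
    table = defaultdict(dict)
    for record in records:
        table[(record.task, record.harness)][record.checkpoint] = record.score
    return dict(table)

# Illustrative records; the names and scores are made up.
records = [
    RunRecord("lr-sweep", "step-1000", "internal_freshqa:zero", "default", 0.41),
    RunRecord("lr-sweep", "step-2000", "internal_freshqa:zero", "default", 0.47),
    RunRecord("lr-sweep", "step-2000", "internal_freshqa:zero", "search_agent", 0.58),
]
table = scores_by_task(records)
```

Because the harness is part of the grouping key, a baseline run and a tool-enabled run of the same task stay separate instead of being averaged together.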

In most model evaluation setups, adding a benchmark is a sizeable integration project. In olmo-eval, all that’s needed is a task—tasks define the benchmark dataset, how evaluation requests are built, and how model answers are scored (all code in Python):

from olmo_eval.common.formatters import ChatFormatter
from olmo_eval.common.metrics import AccuracyMetric
from olmo_eval.common.scorers import ExactMatchScorer
from olmo_eval.common.types import Instance, SamplingParams
from olmo_eval.data import DataLoader, DataSource
from olmo_eval.evals.tasks.common import Task, register, register_variant

@register("internal_freshqa")
class InternalFreshQA(Task):
    data_source = DataSource(path="s3://evals/internal/freshqa.jsonl", split="test")
    formatter = ChatFormatter()
    sampling_params = SamplingParams(temperature=0.0)
    metrics = (AccuracyMetric(scorer=ExactMatchScorer),)

    @property
    def instances(self):
        loader = DataLoader()
        for idx, doc in enumerate(loader.load(self.config.get_data_source())):
            yield Instance(
                question=doc["question"],
                gold_answer=doc["answer"],
                metadata={"id": doc.get("id", f"freshqa_{idx}")},
            )

Variants express changes in evaluation policy without duplicating the benchmark:

register_variant("internal_freshqa", "3shot", num_fewshot=3, fewshot_seed=1234)
register_variant("internal_freshqa", "zero", num_fewshot=0)

Suites group benchmarks into standard sets you run together:

from olmo_eval.evals.suites import Suite, register

register(Suite(
    name="base_qa_few_shot",
    tasks=(
        "sciq:mc:3shot",
        "arc_challenge:mc:3shot",
        "internal_freshqa:mc:3shot",
    ),
))

And because runtime policy lives in the harness rather than the task definition, the same benchmark can easily be rerun under a different execution setup:

# Baseline
olmo-eval run -m my-instruct-checkpoint -t internal_freshqa:zero

# Same task, same scoring, search/tool runtime enabled
olmo-eval run -m my-instruct-checkpoint -t internal_freshqa:zero --harness search_agent

Reproducible evaluation made open

Use olmo-eval when evaluation is part of ongoing model development rather than a one-off run—when you need to run the same benchmarks repeatedly across checkpoints under reproducible conditions and compare interventions at both the aggregate and per-question level.

If your recurring question is “How does this checkpoint differ from the last one, and where exactly did it improve or regress?”, that’s the workflow olmo-eval is built for.

Reproducible evaluation should keep pace with how models are built—not only how they're scored once they're finished. olmo-eval carries the OLMES standard into active model development, and we're releasing it openly so the community can build on it.

AI mobile policy

Why Apple built a third-party AI system for Siri and then refused to show it at WWDC

Apple quietly built a Siri framework for third-party AI, but regulatory, legal, and messaging risks kept it out of the WWDC keynote.

Summary

What: iOS 27 beta code reveals a toggle-off 'Extensions' system that would allow users to swap between models like Gemini, Claude, and ChatGPT inside Siri, a feature delayed due to EU Digital Markets Act concerns and OpenAI's legal pushback.
Why it matters: Apple is trying to balance its closed-garden privacy narrative with the reality that users want model choice, but current geopolitical and contractual friction prevents a seamless rollout.

Decoder

  • Extensions: A proposed framework in iOS 27 allowing users to replace or supplement Siri's default intelligence with third-party models.
  • Digital Markets Act (DMA): EU legislation mandating that 'gatekeeper' tech companies allow interoperability with third-party services.

Original Article

iOS 27 beta has an Extensions system for third-party AI in Siri, but Apple skipped the announcement at WWDC amid EU, legal, and messaging headwinds.

Apple’s iOS 27 developer beta contains underlying support for a feature the company never mentioned at its WWDC keynote on June 8: an Extensions framework that would allow iPhone users to swap between ChatGPT, Anthropic’s Claude, and Google’s Gemini directly inside Siri. Bloomberg’s Mark Gurman has reported that the system includes a settings panel and a dedicated App Store section, both built but toggled off on Apple’s backend. Apple has held discussions with OpenAI, Anthropic, and Google about granting entitlements for the framework, according to Bloomberg.

The feature was widely expected. Gurman first reported in March that Apple was building Extensions to replace the bilateral ChatGPT deal with an open system any qualifying AI provider could join. TechCrunch described the approach in May as a “choose your own adventure of AI models.” By the time WWDC arrived, the question was not whether Extensions would launch, but how prominently Apple would position it.

The answer was: not at all. Apple devoted the WWDC keynote almost entirely to Siri AI, its rebuilt assistant powered by a custom 1.2-trillion-parameter Gemini model running on Nvidia Blackwell GPUs in Google Cloud. The company introduced a standalone Siri app, personal context features, and a three-tier privacy architecture.

Extensions did not appear in any slide, demo, or press release. Three strategic pressures help explain why.

The first is regulatory. Apple confirmed during WWDC week that Siri AI will not launch in the European Union, citing unresolved negotiations with the European Commission over the Digital Markets Act. The EU rejected Apple’s proposal for a Trusted System Agent that would let rival virtual assistants access Siri AI’s capabilities without direct exposure to sensitive device data.

Announcing a framework that invites third-party AI into Siri while simultaneously telling EU regulators that third-party access poses unacceptable risks would have been difficult to reconcile.

The second is legal. OpenAI is preparing possible legal action against Apple over the ChatGPT partnership struck in June 2024. OpenAI’s lawyers are working with an outside firm on options including a breach-of-contract notice, according to Bloomberg.

OpenAI believed the deal would drive billions in subscription revenue, but says Apple buried the integration behind friction, with users required to explicitly invoke “ChatGPT” by name and responses appearing in constrained windows. Announcing Extensions, a system explicitly designed to demote ChatGPT from its exclusive position to one option among several, would have escalated those tensions at a sensitive moment.

The third is messaging. Apple spent two years rebuilding Siri from the ground up after its original AI plans fell short. Siri engineering chief Mike Rockwell said the team had a working version the previous year but scrapped it because it did not meet their vision.

Craig Federighi called Siri AI’s agent-like capabilities “experimental.” Introducing a model-picker at the same moment Apple was trying to convince users, developers, and investors that its own AI had finally arrived would have undercut the relaunch narrative.

Gurman’s hands-on review of the Siri AI beta, published today, suggests the concern is not unfounded. He described the assistant as functional but buggy, with slow responses, cancelled queries, and misunderstood requests. Siri AI is roughly competitive with where leading chatbots were approximately six months ago, according to his assessment.

The assistant still cannot handle advanced workloads like research, programming, or data analysis. Apple is rolling out access through a waitlist, and even the public beta in July will be limited.

The underlying architecture, however, is designed to accommodate Extensions whenever Apple decides to flip the switch. Google’s Gemini already powers Siri AI under the hood through a deal worth roughly $1 billion per year. Extensions would sit on top of that, giving users the ability to route specific tasks through whichever third-party model they prefer.

That means Writing Tools, Image Playground, and open-ended chat could each be powered by a different provider. Apple’s approach would effectively turn Siri into a platform layer rather than a single-provider assistant.

For Anthropic and Google, the stakes are significant. Extensions would give Claude and Gemini native access to more than 1.5 billion active Apple devices without requiring users to download separate apps or leave the Siri interface.

For OpenAI, the picture is more complicated. The Extensions system might actually benefit ChatGPT by giving it more prominent placement through a model-picker interface, but it would also end the exclusive position OpenAI believed it was paying for with the original partnership.

The iOS 27 beta code also contains references to a foldable device internally codenamed V68, expected to debut in September, and macOS 27 includes pull-to-refresh gestures and Sidecar touch input that point toward a touch-screen MacBook under codenames K114 and K116. These hardware signals suggest Apple is building the Extensions framework with new device form factors in mind, not just current iPhones.

Apple has not publicly confirmed or denied that Extensions will ship with iOS 27 this fall. The framework is built, the discussions with AI providers are underway, and the regulatory, legal, and strategic obstacles are all in motion simultaneously. The question is no longer whether Apple will open Siri to third-party AI. It is whether the EU, OpenAI’s lawyers, and Apple’s own messaging discipline will let it happen on the timeline Apple originally intended.

AI llm research

The Oracle and the Firm

OpenAI and Anthropic are betting on two different architectural philosophies: 'Oracle' compaction versus 'Firm' agent-delegation.

Summary

What: OpenAI uses server-side thread compaction to maintain a single coherent context window for long tasks, while Anthropic relies on multiple sub-agents passing information back to a parent, which increases token usage and potential for forgetting.
Why it matters: The industry is currently divided on whether model intelligence scales best through larger, unified context windows or hierarchical, multi-agent workflows.

Deep Dive

  • OpenAI's Oracle approach: Uses native server-side compaction to prune old messages and tool calls.
  • Maintains one long, coherent thread.
  • Better memory for small, trajectory-relevant details.
  • Takes advantage of K/V caching for improved performance on long tasks.
  • Anthropic's Firm approach: Splits tasks into sub-problems assigned to separate agents.
  • Mimics human organizational structures but relies on language-based communication between agents.
  • Prone to duplicate work across agents.
  • Higher risk of 'forgetting' as sub-agents may filter out facts deemed irrelevant to their specific task.
  • Higher token cost due to parallelized task execution.

Decoder

  • Compaction: The process of summarizing or pruning a conversation thread to fit within a model's context window without losing relevant state.
  • K/V Caching: A technique that stores the Keys and Values for previous tokens to avoid recomputing them during inference, drastically speeding up long conversations.

Original Article

The Oracle and the Firm

Like most of the internet, I've been diving into Fable 5 over the last 24h. And like most of the internet, I've been pretty blown away with the quality.

But as I've been using both Fable and GPT-5.5, I couldn't help but notice there are clear differences in approach which make the two models behave quite differently. And we're seeing two very different training regimes play out.

For any frontier model, accomplishing real work is an exercise in context management. The model needs to solve a problem across a very large number of tokens; some are explored via tool calls, others are the model thinking. Then it needs to produce a result.

To get models to solve harder and harder tasks that run for increasing amounts of time, you need to figure out how to scale that context management.

OpenAI: the oracle

Since roughly ChatGPT 5.3-Codex, I've noticed that the model has improved a lot at dealing with long context windows. It stays coherent even across long-running tasks or /goal implementations, despite having a smaller context window than the corresponding Opus models (~200k vs 1m).

The approach Codex takes is compaction, and there are two naive approaches you can use:

  1. you have a separate (sometimes smaller) model output a new message based upon the trajectory.
    • e.g. ask 5.5 to summarize everything in the thread up to 1k tokens
  2. you remove certain categories of calls from the conversation.
    • e.g. remove all tool calls, then begin inference

You might even imagine doing this in parallel: compacting up to a certain number of messages in a thread, and then leaving more recent ones in full fidelity.
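The two naive approaches, plus the idea of leaving recent messages untouched, can be sketched in a few lines. This is only an illustration of the general technique, not how Codex implements compaction: the `Message` type, the `compact` function, and the `naive_summarize` stand-in (which truncates text where a real system would call a model) are all assumptions.

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str      # "user", "assistant", or "tool"
    content: str

def naive_summarize(messages, max_chars=200):
    """Stand-in for a model-written summary: joins the text and truncates it."""
    text = " | ".join(f"{m.role}: {m.content}" for m in messages)
    return text[:max_chars]

def compact(thread, keep_recent=2, summarize=naive_summarize):
    """Summarize older messages (minus tool calls); keep recent ones in full."""
    split = max(len(thread) - keep_recent, 0)
    older, recent = thread[:split], thread[split:]
    if not older:
        return list(thread)
    # Approach 2: drop a whole category of messages (here, tool output).
    older_without_tools = [m for m in older if m.role != "tool"]
    # Approach 1: replace what is left with a single summary message.
    summary = Message("assistant", "Summary of earlier turns: " + summarize(older_without_tools))
    return [summary] + recent

thread = [
    Message("user", "Fix the failing test"),
    Message("assistant", "Running the test suite"),
    Message("tool", "pytest output: 1 failed"),
    Message("assistant", "The bug is in parse()"),
    Message("user", "Please patch it"),
]
compacted = compact(thread, keep_recent=2)
```

The trade-off is visible even in this toy version: the tool output is gone from the compacted thread, so anything only it contained can no longer be recalled.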

Since earlier this year, Codex has shipped native server-side compaction built into the responses API. When the produced tokens start to overflow the context window, Codex can run a process to compact everything and retain only the relevant information.

Given that the compaction happens server side, it has two nice properties: a) OpenAI can change the implementation easily and at will without affecting clients and b) the API can take advantage of much better K/V caching for long-running threads by routing to the right GPU.

I think about the net result like an 'Oracle'. Codex typically keeps one long thread going with frequent compactions. You will see sub-agents which execute on clean paths, but it tends to happen less often, unless nudged by the user.

Because the single thread is managing everything related to user responses, it maintains a lot of coherence. Small details which are relevant to the overall trajectory are remembered.

Anthropic: the firm

Compaction is just one way of dealing with context windows. The other approach you could take is splitting context windows across various agents. In this technique, you split the problem into various sub-problems, then have each agent execute on the sub-problem within its own context window.

Since at least Opus 4.1, I've noticed Claude Code will eagerly take this approach. For any research of the codebase, the model will invoke Explore sub-agents which leverage Haiku to create a quick summary.

And when running Fable 5, it goes nuts.

This is a very different approach: sub-agents are now able to do large amounts of work within a context window, and then pass back only the relevant information to the parent agent.

In effect, this looks more like the way human organizations run. Everyone has their set of goals and inputs and outputs. They see some subset of the total information available, and make decisions based upon that. And we all communicate via language. But we don't get to see the hidden state in one another's heads.
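A toy version of this message-passing pattern is sketched below. It is not Claude Code's implementation: `run_subagent`, `run_parent`, and the sample notes are hypothetical, and the "work" a sub-agent does is reduced to a substring filter so the example stays self-contained.

```python
def run_subagent(subtask, documents):
    """Work inside a private context and return only a short report."""
    private_context = [doc for doc in documents if subtask in doc]
    return f"{subtask}: found {len(private_context)} relevant note(s)"

def run_parent(task, subtasks, documents):
    """The parent only ever sees the reports, never a sub-agent's context."""
    reports = [run_subagent(subtask, documents) for subtask in subtasks]
    return {"task": task, "parent_context": reports}

# Hypothetical notes a sub-agent might read while exploring a codebase.
documents = [
    "auth: tokens expire after 15 minutes",
    "auth: login handler lives in auth/views.py",
    "billing: invoices are generated nightly",
]
result = run_parent("audit the service", ["auth", "billing"], documents)
```

The parent's context is two short lines; the detail that tokens expire after 15 minutes was read by a sub-agent but never reported, so the parent cannot use it later.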

Claude models also compact, and they take advantage of a few of these different approaches. But the compaction is slow and requires users to upgrade the client consistently, so I assume the training regime has leaned more on delegating to sub-agents.

Takeaways

Cost and token efficiency: I suspect that Anthropic models tend to cost more because the sub-agents often end up doing duplicate work. They may be searching similar files because they aren't actively communicating.

Perceived speed: Anthropic models will often seem to be "doing more", because the tokens are being produced in parallel vs serial. Many more tokens are produced during that time.

'Forgetting': a friend pointed me to some cases where Anthropic models seem less coherent, or more often tend to misreport facts to the user. I think this is true, and it's easily explained with the message-passing approach taken by sub-agents. If a sub-agent deemed a fact was not worth reporting back to the parent agent, then it will be missing from the context. In this way, the Anthropic models can more easily 'miss' obvious facts, even if it seemed like they at one point had done the research. This can happen with compaction too, but it's less likely, because the model is doing 'less work' to omit or preserve tokens.

In the end state, I expect we will see both approaches combined. Anthropic will improve their compaction (which right now is too lossy), and OpenAI will train for multi-agent setups.

AI infrastructure hardware

NVIDIA Blackwell Leads on First Agentic AI Infrastructure Benchmark

NVIDIA's Blackwell NVL72 platform claims a 20x improvement in agent throughput per megawatt over previous Hopper-based hardware.

Summary

What: The new AgentPerf benchmark—the first designed to simulate multi-step agentic workflows rather than simple chat latency—shows NVIDIA's Blackwell Ultra system outperforms the HGX H200 in concurrent agent execution.
Why it matters: Infrastructure benchmarks are shifting away from single-shot token throughput toward 'agentic efficiency,' highlighting the power cost of recursive LLM and tool-use chains.

Deep Dive

  • AgentPerf: A new benchmark measuring how many concurrent agents a platform can run while meeting service-level objectives.
  • Workload profile: Benchmarks simulate code-based agent trajectories, including file reading, code execution, and iterative reasoning.
  • Blackwell advantages: Uses high-speed GPU interconnects to handle large Mixture-of-Experts (MoE) models across 72 GPUs.
  • Efficiency: Separates input processing from output generation using TensorRT-LLM to improve concurrent agent handling.

Decoder

  • Agentic AI: Systems designed to break a user goal into multiple steps, chaining LLM calls and tool uses until the task is complete.
  • Mixture-of-Experts (MoE): An architecture where only a subset of the neural network's parameters are activated for each token, allowing for larger model capacity without a linear increase in inference cost.
  • NVL72: NVIDIA's rack-scale architecture connecting 72 Blackwell GPUs as a single massive processing unit.

Original Article

AgentPerf from Artificial Analysis, the industry’s first agentic AI benchmark, gives developers, enterprises and infrastructure providers a clear way to compare systems for agentic AI. In the first round of published results, the NVIDIA Blackwell Ultra NVL72 platform delivers leading performance across the agentic AI workloads tested, running 20x more agents per megawatt than NVIDIA Hopper.

Agentic AI is a fundamentally different workload than conversational AI. A single chat completion is a sprint: one large language model (LLM) call, one response. An agent functions more like a relay: It breaks a goal into many steps and keeps going until the task is done.

That results in dozens to hundreds of LLM calls chained together, each passing growing context to the next, with tool calls like code compile and execution, database search and web browsing at every handoff. The complexity isn’t additive; it’s multiplicative.
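One way to see why chained calls cost far more than the same number of independent calls is to add up the input tokens when every call re-reads a context that keeps growing. The sketch below is simple arithmetic under stated assumptions: the function name and the token counts are invented for illustration, and it ignores prefix or K/V caching, which would reduce the recomputation in practice.

```python
def total_input_tokens(num_calls, base_context, tokens_added_per_step):
    """Sum the context each chained call must read as the context grows."""
    total = 0
    context = base_context
    for _ in range(num_calls):
        total += context
        context += tokens_added_per_step
    return total

# Hypothetical sizes: a 2,000-token starting prompt and 1,000 new tokens
# (model output plus tool results) appended after every call.
single_call = total_input_tokens(1, 2_000, 1_000)
ten_calls = total_input_tokens(10, 2_000, 1_000)
hundred_calls = total_input_tokens(100, 2_000, 1_000)
```

Going from 10 to 100 calls multiplies the input tokens by roughly 80 rather than 10, because the total grows with the square of the number of steps.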

The distinction matters enormously for performance measurement. Existing AI inference benchmarks measure one LLM call: how fast an LLM responds to a single request and how many simultaneous requests a system can handle. They weren’t designed for agentic workloads, where chained LLM calls, tool call delays and growing context stress accelerated computing systems in fundamentally different ways than a single LLM call ever could.

For companies building and deploying agents at scale, it’s important to understand how responsive agents are, how many can be deployed simultaneously and how much useful work AI infrastructure can deliver for every dollar and watt invested.

NVIDIA GB300 NVL72 Runs 20x More Agents per Megawatt

In this first round, AgentPerf measures agentic performance with DeepSeek V4 Pro, a large mixture-of-experts (MoE) model that represents the class of frontier models powering today’s most capable agents. On this workload, NVIDIA GB300 NVL72 delivers the highest performance in the benchmark, running up to 20x more agents per megawatt than the NVIDIA HGX H200 system.

The performance advantage comes from extreme codesign across the full stack. GB300 NVL72 connects 72 GPUs into a single rack-scale system, enabling large MoE models like DeepSeek V4 Pro to distribute model execution efficiently at scale.

CUDA kernels accelerate this further by overlapping communication and compute, so the cost of coordinating across experts is absorbed rather than added to latency.

NVIDIA TensorRT LLM sustains efficiency as concurrent agent sessions scale. For example, it separates the processing of inputs from the generation of outputs so each can be optimized independently.

These results are grounded in a benchmark methodology built from the ground up to reflect how agentic AI actually works in production.

Artificial Analysis AgentPerf: Built on Real-World Agentic Workloads

AgentPerf is built based on real coding agent trajectories: an agent receives a task, reads files, writes and edits code, executes commands and iterates based on the results — all drawn from real public code repositories across 12+ programming languages. The long sequence lengths, tool call patterns and delays are all representative of real-world coding workflows.

AgentPerf then measures how many of these agentic tasks a platform can support simultaneously while meeting defined performance thresholds for responsiveness and output token rate. Tool calls are not executed but simulated using representative CPU processing time, so differences in results reflect accelerated computing performance only.

The results translate directly into infrastructure decisions: how many concurrent agentic tasks can be run per accelerator and per megawatt of power. For enterprises deploying AI agents at scale, those numbers determine how much productive work a given infrastructure investment can actually deliver.
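The measurement described above can be sketched as a two-step calculation: find the highest concurrency that still meets the thresholds, then normalize by power. This is not AgentPerf's actual harness. The function names, the metric keys, the thresholds, and every number below are hypothetical and chosen only to show the shape of the calculation.

```python
def max_agents_within_slo(measurements, max_latency_s, min_tokens_per_s):
    """Highest tested concurrency that still meets both thresholds."""
    passing = [
        m["agents"]
        for m in measurements
        if m["latency_s"] <= max_latency_s and m["tokens_per_s"] >= min_tokens_per_s
    ]
    return max(passing, default=0)

def agents_per_megawatt(agents, power_kw):
    """Normalize concurrency by power draw (1 MW = 1,000 kW)."""
    return agents / (power_kw / 1000.0)

# Made-up load-test results for one system at increasing concurrency.
measurements = [
    {"agents": 8, "latency_s": 1.0, "tokens_per_s": 60},
    {"agents": 16, "latency_s": 1.8, "tokens_per_s": 45},
    {"agents": 32, "latency_s": 3.5, "tokens_per_s": 25},
    {"agents": 64, "latency_s": 7.0, "tokens_per_s": 12},
]
supported = max_agents_within_slo(measurements, max_latency_s=4.0, min_tokens_per_s=20)
density = agents_per_megawatt(supported, power_kw=125)
```

Comparing two platforms then comes down to comparing their `density` values under identical thresholds.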

NVIDIA Ecosystem Partners Harness Blackwell’s Leading Performance

Leading inference providers including Baseten, DeepInfra and Together AI are already serving agentic workloads on frontier models such as DeepSeek V4 Pro on NVIDIA Blackwell and powering production agentic applications today.

Together AI powers real-time inference for Cursor, an AI-powered agentic coding platform, on NVIDIA Blackwell. Cursor’s agents debug issues, generate features and execute refactors while developers continue working.

DeepInfra powers Pam.ai, an AI workforce platform for car dealerships, which deploys agents to book service appointments, handle calls and run outbound sales campaigns, entirely on NVIDIA Blackwell.

As NVIDIA and the open source ecosystem continue to optimize inference software, performance and efficiency on agentic workloads will only improve. The NVIDIA Vera Rubin architecture is now in full production, bringing the next generation of infrastructure capacity to meet the growing demands of agentic AI at scale.

DevOps enterprise

Ansible Automation Platform 2.7: Visual Execution Environment Builder and Content Discovery Guide

Red Hat's Ansible Automation Platform 2.7 introduces a visual builder and unified discovery engine to simplify automation environment management.

Summary

What: Ansible Automation Platform 2.7 now includes a visual execution environment builder that eliminates manual YAML configuration. It also features a new discovery engine that syncs scattered collections from GitHub, GitLab, and Private Automation Hubs into a single searchable catalog.
Why it matters: Red Hat is shifting Ansible toward a 'governance-first' model, reducing the reliance on manual CLI work to keep large-scale automation teams consistent.
Takeaway: If you are managing complex Ansible environments, upgrade to 2.7 to begin consolidating your disparate collections into the central discovery catalog.

Decoder

  • Execution Environment: A containerized image containing all dependencies, collections, and plugins required to run Ansible playbooks consistently.

Original Article

Ansible Automation Platform 2.7: Visual Execution Environment Builder and Content Discovery Guide

When building and maintaining consistent execution environments, platform engineers and developers routinely lose valuable time identifying dependencies, tracking down content collections scattered across different repos, and wrestling with manual syntax configurations.

With the release of Red Hat Ansible Automation Platform 2.7, these challenges have become a thing of the past. The execution environment builder and unified content discovery experience within the automation portal work together to dramatically reduce time to automate while embedding governance from the ground up.

Here's how these capabilities eliminate operational friction and help bring consistency across your organization:

Visual Execution Environment Builder

Creating custom execution environments used to require deep familiarity with definition file syntax. The new visual execution environment builder removes this barrier with a guided, step-by-step workflow.

  • Visual authoring without the guesswork: Build highly consistent execution environments through a clean interface that eliminates manual YAML syntax errors and command-line complexity entirely.
  • Start from proven foundations: Base your work on predefined or custom templates rather than starting from scratch. Red Hat provides recommended execution environment templates tailored to specific IT domains, and organizations can add their own approved templates to maintain structural standards.
  • Integration and automated discovery: Effortlessly locate and sync Red Hat Ansible Certified Content Collections alongside your team's custom collections stored in GitHub or GitLab organizations. The interface brings all your Git repositories and automation channels into a single, searchable catalog with flexible, one-click selection and manual upload options.
  • Automated output generation: Once your visual setup is complete, the tool automatically generates the complete execution environment definition file with immediate options to publish it directly to a repository or download it for your project.
  • Automated build pipeline: You can also specify that the builder scaffolds a complete GitHub repository with an automated build pipeline. Select a target registry, and the generated GitHub Actions workflow handles building the container image and pushing it to your registry automatically, providing a ready-to-use execution environment without ever touching the command line.

Content Discovery Guide: Your unified automation catalog

Enterprise automation content naturally fragments over time, tucked away in private automation hubs, hosted on public Git providers, or fragmented within individual teams. The content discovery engine bridges these gaps by syncing everything into a single, searchable catalog.

  • Continuous synchronization: The platform automatically discovers custom collections across your organization's GitHub and GitLab environments, alongside certified and validated collections from private automation hub.
  • See before you build: Browse existing collections and repositories before writing new code, preventing redundant work and maximizing content reuse.
  • Direct pipeline integration: Discovered, trusted collections populate the platform's creation tools natively. When spinning up a new execution environment, developers can jumpstart the process by selecting from known, vetted assets.
  • Deep visibility: From the centralized catalog view, trace playbooks, track continuous integration (CI) activity, and access direct source links across your repositories.
  • The foundation layer: Content discovery is the foundation layer for our broader, end-to-end Ansible content management vision. In future releases, we will expand this into a full lifecycle solution that guides developers from initial discovery through authoring, validation, and monitoring content performance in production.

Governance by design: Keeping admins in control

Together, these features boost the productivity of developers and domain teams without compromising security or architectural standards. Ansible Automation Platform 2.7 lets platform administrators retain full control over enterprise-wide usage.

  • Predefined scoping: Administrators select exactly which external Git repositories and automation hubs are discoverable, applying custom filtering to fit corporate boundaries.
  • Controlled visibility: Dictate what content is visible to specific domain teams (such as SecOps or CloudOps), so teams only use vetted, relevant automation.
  • Configurable scheduling: Run syncs on demand or on a customizable, automated schedule, with complete administrative control over timing and frequency.

Getting started

To take advantage of the new visual execution environment builder and content discovery capabilities, your environment needs to meet these baseline requirements:

  • Platform version: Ansible Automation Platform 2.5 or higher
  • Installation environment: Red Hat OpenShift Container Platform or Red Hat Enterprise Linux (RHEL) virtual machines (VM) appliance
  • Connectivity: Source Control Management (SCM/Git) access for discovery syncs and repository operations

By unifying content discovery and visually stabilizing how you package automation, Ansible Automation Platform 2.7 removes operational overhead from platform engineering, freeing your developers to focus on writing high-value automation that moves the business forward.

DevOps ai security enterprise

How Dropbox uses MCP and Dash to close the design-to-code security gap

Dropbox uses MCP and its Dash AI to bridge the 'design-to-code' gap, ensuring security requirements from threat models are actually implemented in pull requests.

Summary

What: Dropbox found that only 12% of pull requests linked back to security threat models, with a median delay of five weeks between design and implementation. Their new system uses Dash's indexing and MCP-based retrieval to automatically surface relevant security context during code reviews, allowing LLMs to identify missing controls and contradictions.
Why it matters: Security often suffers from 'knowledge decay' where documentation and reality diverge; this approach enforces alignment between intent and execution automatically.

Deep Dive

  • The Disconnect: 12% of PRs link to threat models; 54% are opened over a month after security review.
  • Technical Stack: Uses Dash (semantic search/indexing) + MCP (context retrieval) + LLMs (reasoning).
  • Mechanism: Semantic search identifies links between design docs and code even when explicit references are missing.
  • Validation: The system compares implementation vs. intent, surfacing gaps like missing auth or incorrect data handling.
  • Integration: Surfaces findings directly in the code review interface to minimize noise.

Decoder

  • Threat Model: A systematic process of identifying, quantifying, and addressing the security risks associated with an application.
  • Semantic Search: A search technique that understands the intent and conceptual meaning of queries rather than relying on exact keyword matching.

Original Article

How Dropbox uses MCP and Dash to close the design-to-code security gap

Every security team knows the drill: a new feature goes through design review, a threat model is produced, mitigations are agreed upon, and then development begins. In many cases, by the time implementation reaches code review, the process where engineers review code changes before they go live, the original security requirements are no longer visible in the workflow. A threat model, which outlines potential security risks and the protections a feature should include, often lives in a separate document or system from the code itself.

This separation creates a challenge. Implementation often happens weeks or months after the original security review, making it difficult for reviewers to verify that the agreed-upon security requirements were actually implemented. At Dropbox, we wanted to understand how often this gap appears in practice.

That led us to build a system that combines three technologies: Model Context Protocol, foundational large language models (which we’ll refer to as foundational models), and Dash, the AI capabilities within Dropbox that make it easier to find and understand your team’s content. Together, these technologies automatically retrieve relevant threat models during code review and evaluate whether code changes align with the requirements defined in them. Because Dash already indexes and connects content stored in Dropbox and across our connected applications, the system can draw on years of security reviews and engineering documentation without requiring teams to manually link those sources together.

In this post, we’ll walk through the architecture behind that system, what we learned from analyzing months of threat models, and how we think the same pattern can apply to other forms of design and compliance review.

The design-to-code gap

Organizations make important decisions about a product long before it ships—such as decisions about threat protection. That said, security reviews only create value if the requirements they produce remain visible throughout the development process. Our engineers wanted to make sure those requirements are upheld as development continues and also identify any gaps.

During a security review, engineers identify potential risks, discuss how a feature could be exploited, and agree on the protections that it should include. Those decisions are recorded in a threat model. But once development begins, those decisions often become separated from the code itself. The threat model lives in a wiki or documentation system, while the code is implemented through pull requests (PRs), the units of work engineers submit for review before changes are merged into a product. Unless someone explicitly links them together, reviewers may never see the security requirements that were agreed upon earlier.

At Dropbox, we maintain threat model documents spanning years of product development. Each one represents hours of security engineering work, but that work only provides ongoing value if reviewers can access it when implementation happens. To understand how often that connection persists, we examined the relationship between threat models and the PRs that implement the features they describe. Through that investigation, we learned that only 12% of implementing PRs link back to their original design review and threat model.

The gap is compounded by how much time often passes between review and implementation. When we measured the interval between design review filing and PR creation across 79 verified pairs, we found that more than half (54%) of implementing PRs weren’t opened until over a month after the review was filed. The median delay was about five weeks, with a long tail stretching beyond 11 months. Only 29% of implementing PRs were opened within the first two weeks of security review.
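The figures above are summary statistics over per-pair delays. A minimal sketch of that arithmetic follows, using made-up delays rather than Dropbox's data, and assuming "a month" means 30 days and "two weeks" means 14 days:

```python
from statistics import median

# Hypothetical delays, in days, between design review filing and PR creation.
# These values are illustrative only and are not Dropbox's measurements.
delays_days = [3, 9, 12, 20, 33, 35, 41, 60, 120, 340]


def share(values, predicate):
    """Return the fraction of values for which predicate is true."""
    return sum(1 for v in values if predicate(v)) / len(values)


median_delay = median(delays_days)
over_a_month = share(delays_days, lambda d: d > 30)        # assumed cutoff: 30 days
within_two_weeks = share(delays_days, lambda d: d <= 14)   # assumed cutoff: 14 days

print(f"median delay: {median_delay} days")
print(f"opened after more than 30 days: {over_a_month:.0%}")
print(f"opened within 14 days: {within_two_weeks:.0%}")
```

Reporting the median alongside the tail shares matters here because a few very long delays would pull a mean far above what most pairs experience.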

In other words, there can be a long delay between when security requirements are defined and when the corresponding code is reviewed. By the time reviewers look at the implementation, the decisions made during the security review may be buried in documentation they never open.

Why existing tools don’t solve this

Once we understood the scope of the gap, the next question was whether existing security tools could close it. For example, while static analysis tools inspect code for known patterns and potential issues, they can only tell you that a security control is present. What they can’t tell you is whether it was implemented according to the requirements agreed upon during design review. They analyze the code itself, not the context or intent behind it.

Organizations often try to address this challenge by asking engineers to link code changes to design reviews or by deploying bots that remind developers to follow review procedures. But these approaches depend on engineers remembering extra steps, and compliance tends to decline over time. What was missing was a way to connect code changes with the security guidance that already exists. We realized the problem wasn’t a lack of security knowledge. Most organizations have invested significant effort in documenting risks and mitigations through threat models. The challenge is making that knowledge available when code is being reviewed.

Our data suggested another opportunity as well. About 15% of design reviews were filed retroactively, meaning the code was built first and the security review came later, often before a broader launch. These cases suggest that some security-sensitive work isn’t always identified as requiring review when it’s implemented. A system that can surface relevant security context during development and not after could help in both directions: connecting code to existing reviews and providing an early signal when additional review may be warranted.

Using Dash and MCP as a context bridge

We needed a way to connect code under review with the security guidance that already existed elsewhere in the organization. Dash provided a natural starting point. Because it indexes content across connected applications, our collection of threat models was already searchable alongside other engineering documentation. Rather than relying on reviewers to find the right security documentation, we built a system that automatically retrieves relevant threat models when code is submitted for review.

Model Context Protocol (MCP) is what lets the agent access the information it needs. Dash has an MCP server that makes the content it indexes available to other AI tools. In our case, the security review agent uses Dash’s MCP server to search and read the same connected content that powers Dash search, including threat models and related documents. That gives the agent the context it needs without requiring a custom integration for every source system.

When a code change is opened for review, the agent retrieves relevant threat models and other supporting context through MCP. The foundational model can then examine both the documented requirements and the proposed code change together. For example, it can recognize that a threat model requires authentication on an endpoint and determine whether the code being introduced actually enforces that requirement.
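The retrieve-then-compare step can be sketched roughly as follows. This is not Dropbox's implementation: `search_threat_models` is a hypothetical stand-in for a search made through an MCP server, and the threat model, diff, and prompt wording are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class ThreatModel:
    title: str
    requirements: list[str]


def search_threat_models(query: str) -> list[ThreatModel]:
    """Hypothetical stand-in for a retrieval call made through an MCP server."""
    # A real agent would call a search tool here; this returns canned data.
    return [
        ThreatModel(
            title="Export endpoint design review",
            requirements=["The /export endpoint must require authentication."],
        )
    ]


def build_review_prompt(diff: str, models: list[ThreatModel]) -> str:
    """Place the documented requirements and the proposed change in one prompt."""
    lines = ["Documented security requirements:"]
    for model in models:
        for req in model.requirements:
            lines.append(f"- [{model.title}] {req}")
    lines += [
        "",
        "Proposed code change:",
        diff,
        "",
        "List any requirement the change does not satisfy.",
    ]
    return "\n".join(lines)


# Invented diff that adds an endpoint without any authentication check.
diff = '+ @app.route("/export")\n+ def export(): return dump_all()'
prompt = build_review_prompt(diff, search_threat_models("export endpoint"))
print(prompt)
```

The point of the sketch is that the model sees both documents in one context, so it can flag that the new endpoint carries no authentication check even though the code itself is valid.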

That ability to reason across multiple sources of information is what distinguishes this approach from traditional static analysis. The system isn’t just inspecting code. It’s comparing implementation against previously documented security decisions.

Meeting developers where they work

Just as important as the retrieval architecture was where we surfaced the results. Rather than creating a separate security workflow, we integrated the system directly into code review. Engineers review code before it’s merged, so we focused on bringing additional security context into a process that already exists.

This distinction matters because security teams have spent years building tools that generate alerts, comments, and notifications. Developers, in turn, have spent years learning which ones they can safely ignore. The difference between a useful security signal and noise is relevance. A finding tied directly to the code being reviewed is far more likely to be useful than a generic warning that appears on every change.

At the same time, retrieving a threat model is only the first step. Simply placing a security document next to a code review still leaves a human responsible for reading both and determining whether they align. The foundational model performs that comparison automatically, identifying potential gaps between documented requirements and implementation. Human reviewers remain responsible for the final judgment, but the model eliminates much of the manual cross-referencing that would otherwise be required.

Implementing design-to-code traceability

To validate the approach, we analyzed all 150 of our security design reviews from the previous year and a half and mapped each to its implementing code changes. To do this, we used Dash’s semantic search capabilities, which retrieve related content based on meaning rather than exact keywords or explicit references. The connections exist, but they’re often invisible:

  • Using Dash’s semantic search—the same retrieval capability that powers its user-facing search—we successfully linked 80% of design reviews to their implementing code changes
  • Only 12% of those code changes explicitly reference the design review
  • 69% of connections were recoverable only through semantic search, meaning most of the relationship between design reviews and implementation would be invisible through manual references alone
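To illustrate how a link can be recovered without an explicit reference, here is a toy sketch of similarity-based matching. The three-dimensional vectors are hand-made stand-ins for embeddings, the document names are invented, and the 0.8 threshold is an assumption; none of this reflects how Dash computes relevance.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical embeddings; a real system would get these from an embedding model.
reviews = {
    "shared-link expiry review": [0.9, 0.1, 0.0],
    "payments webhook review": [0.0, 0.2, 0.9],
}
prs = {
    "PR: add TTL to shared links": [0.8, 0.3, 0.1],
    "PR: bump lint config": [0.1, 0.1, 0.1],
}


def link(prs, reviews, threshold=0.8):
    """Map each PR to its most similar review, or None if below the threshold."""
    links = {}
    for pr, pr_vec in prs.items():
        best, score = max(
            ((name, cosine(pr_vec, vec)) for name, vec in reviews.items()),
            key=lambda pair: pair[1],
        )
        links[pr] = best if score >= threshold else None
    return links


links = link(prs, reviews)
print(links)
```

Neither PR title mentions its review by name; the first is linked purely because its vector points in nearly the same direction, while the unrelated PR falls below the threshold and stays unlinked.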

We also evaluated the impact of surfacing threat model context during code review. In our testing, context retrieval consistently surfaced security findings that were invisible without the threat model, including missing controls, contradictions with approved designs, and regressions against known risks. The code was functionally correct in every case. The gaps were only visible when reviewers could compare the implementation against the original requirements.

More importantly, when we examined security incidents, we found cases where the root cause was a security requirement that had been documented during design review but wasn’t enforced in the implementing code. The connection existed; it just wasn't visible at the right moment. These weren’t rare edge cases. They were straightforward requirements that became disconnected from implementation as development progressed.

This is the difference between reviewing code and reviewing implementation against design. The former catches bugs. The latter catches security gaps. And it’s only possible when a model can reason about the relationship between two documents—the threat model and the pull request—rather than analyzing either one in isolation.

Design principles and what’s next

As we integrate this into our development workflows, we’re designing around a few core principles. Findings must be validated against the actual code before they reach a developer, because false positives destroy trust faster than true positives build it. Every finding should be traceable back to a specific requirement and source document so reviewers can verify the reasoning for themselves. Most findings should be advisory rather than blocking, with escalation reserved for confirmed gaps between approved designs and implementation. And because requirements evolve over time, the system must account for stale context rather than blindly applying outdated guidance.
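One way to picture these principles is as a small triage rule over a finding record. The field names, the one-year staleness window, and the outcome labels below are all assumptions made for the sketch, not details from Dropbox's system.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Finding:
    requirement: str              # the specific requirement the finding traces to
    source_doc: str               # the document the requirement came from
    source_updated: date          # when that document was last revised
    validated_against_code: bool  # has the finding been checked against the code?
    confirmed_gap: bool           # confirmed mismatch between design and code?


def triage(finding: Finding, today: date, max_age=timedelta(days=365)) -> str:
    """Decide how a finding is surfaced (hypothetical policy)."""
    if not finding.validated_against_code:
        return "drop"            # unvalidated findings never reach a developer
    if today - finding.source_updated > max_age:
        return "needs-refresh"   # stale context is not applied blindly
    return "blocking" if finding.confirmed_gap else "advisory"


finding = Finding(
    requirement="Endpoint requires authentication",
    source_doc="hypothetical-threat-model",
    source_updated=date(2025, 1, 10),
    validated_against_code=True,
    confirmed_gap=False,
)
decision = triage(finding, today=date(2025, 3, 1))
print(decision)
```

Keeping `requirement` and `source_doc` on every record is what makes a finding traceable: a reviewer can open the cited document and check the reasoning directly.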

The architecture isn’t specific to security, either. It’s a general solution for any team that produces design documents and needs to verify they’re reflected in implementation. For example, privacy teams can surface data classification requirements when code touches user data flows. A privacy review that specifies a field must not be logged can be checked against future code changes that handle that field. Platform teams can surface API contracts and compatibility requirements when interfaces change. And compliance teams can surface regulatory requirements when code handles data in regulated jurisdictions.

The common pattern is straightforward: organizations already have documented requirements, but those requirements are often disconnected from the workflows where implementation decisions are made. By combining searchable organizational knowledge, MCP-based retrieval, and foundational models capable of reasoning across multiple sources of context, it’s possible to automatically compare implementation against intent.

The scanning tools and threat models already existed. What we were missing, however, was a way to connect them at the right moment. MCP makes that connection technically feasible. Dash makes it practical. And foundational models make it useful, turning "here’s a relevant document" into "here’s a specific gap between what was required and what was implemented." While security is our first use case, the same pattern can help any team ensure that the decisions made during planning and review are reflected in the systems they ultimately build.

DevOps ai agents python

Aisuite (GitHub Repo)

Aisuite is a new Python library from Andrew Ng's team providing a unified interface for LLMs and agent-native tool execution.

Summary

What: Aisuite provides a standardized API for calling LLM providers (OpenAI, Anthropic, Google, Ollama) and an Agents API for executing tool-calling loops. It includes 'OpenCoworker,' a desktop app demonstrating its capabilities in local research, file management, and scheduled automation, all while maintaining local data storage.
Why it matters: This is a play to standardize the 'agent framework' layer, making it trivial to swap LLM backends while keeping agent logic and tool definitions consistent.
Takeaway: Run `pip install aisuite` to experiment with a unified tool-calling interface that avoids the boilerplate usually associated with switching between provider SDKs.

Deep Dive

  • Unified Chat API: Provides a provider-agnostic interface across 10+ LLM services.
  • Agent Harness: Includes a declarative Agents API to manage toolkits (files, git, shell) and state persistence.
  • MCP Integration: Supports native MCP tool discovery for model interactions.
  • Tooling Abstraction: Turns complex tool schemas into simple Python functions with automated execution loops.
  • Local-First: Includes an OpenCoworker desktop implementation for local task orchestration.

Original Article

OpenCoworker

An AI agent that lives on your desktop, built on aisuite.

OpenCoworker is a desktop AI agent that can not only chat, but also do deep research and carry out tasks for you on your computer. It can read files (with permission) to gain context, read and send messages (Slack, email, etc.), and create real deliverables like PDF reports, documents, and spreadsheets. It also supports scheduled automations, such as providing you a daily news summary.

Requires your own API key (OpenAI, Anthropic, Google), or runs fully locally with Ollama. Your data stays on your machine.

⬇ Download for macOS macOS 13+ (Apple Silicon)

⬇ Download for Windows Windows 10/11 (x64)

Quickstart: install, connect a model, first tasks, automations.

Its source lives in this repository under platform/ — a working reference for building your own agent harness on aisuite.

aisuite

aisuite is a lightweight Python library for building with LLMs, in two layers: a unified Chat Completions API across providers, and an Agents API with tools and toolkits on top. This repo is also home to OpenCoworker, a desktop AI coworker built using aisuite:

┌───────────────────────────────────────────────┐
│                 OpenCoworker                  │   agent harness for doing everyday tasks
├───────────────────────────────────────────────┤
│        Agents API  ·  Toolkits  ·  MCP        │   build agents across multiple LLMs
├───────────────────────────────────────────────┤
│             Chat Completions API              │   one API across multiple LLM providers
├────────┬───────────┬────────┬────────┬────────┤
│ OpenAI │ Anthropic │ Google │ Ollama │ Others │
└────────┴───────────┴────────┴────────┴────────┘
  • Chat Completions API — a unified, OpenAI-style interface for OpenAI, Anthropic, Google, Mistral, Hugging Face, AWS, Cohere, Ollama, OpenRouter, and more. Swap providers by changing one string.
  • Agents API · Toolkits · MCP — give models real Python functions as tools, run multi-turn loops, attach ready-made toolkits (files, git, shell) or any MCP server, and govern it all with tool policies.
  • OpenCoworker — a desktop AI coworker built using aisuite, shipped as an app for everyday tasks.

Installation

The aisuite library (Python)

Install the base package, or include the SDKs of the providers you plan to use:

pip install aisuite               # base package, no provider SDKs
pip install 'aisuite[anthropic]'  # with a specific provider's SDK
pip install 'aisuite[all]'        # with all provider SDKs

You'll also need API keys for the providers you call — the Chat Completions quickstart covers key setup and your first calls.

The OpenCoworker app (desktop)

Download the installer and bring your own API key (or run local models with Ollama):

⬇ macOS (Apple Silicon) · ⬇ Windows 10/11 (x64) · OpenCoworker quickstart

Chat Completions — one API across providers

The chat API provides a high-level abstraction for model interactions. It supports all core parameters (temperature, max_tokens, tools, etc.) in a provider-agnostic way, and standardizes request and response structures so you can focus on logic rather than SDK differences.

Model names use the format <provider>:<model-name>; aisuite routes the call to the right provider with the right parameters:

import aisuite as ai
client = ai.Client()

models = ["openai:gpt-4o", "anthropic:claude-3-5-sonnet-20240620"]

messages = [
    {"role": "system", "content": "Respond in Pirate English."},
    {"role": "user", "content": "Tell me a joke."},
]

for model in models:
    response = client.chat.completions.create(
        model=model,
        messages=messages,
        temperature=0.75
    )
    print(response.choices[0].message.content)

→ Quickstart: docs/chat-completions-quickstart.md — install, key setup, local models, and more examples.
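The `<provider>:<model-name>` convention can be illustrated with a small parser. This is a sketch of the idea, not aisuite's actual routing code; `split_model` is a hypothetical helper name, and the second model string is an invented example.

```python
def split_model(model: str) -> tuple[str, str]:
    """Split '<provider>:<model-name>' at the first colon only."""
    provider, sep, name = model.partition(":")
    if not sep or not provider or not name:
        raise ValueError(f"expected '<provider>:<model-name>', got {model!r}")
    return provider, name


print(split_model("openai:gpt-4o"))
print(split_model("ollama:llama3.1:8b"))
```

Splitting on the first colon only matters for model names that themselves contain a colon, as local model tags often do.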

Agents — give models real tools

aisuite turns tool calling into a one-liner: pass plain Python functions and it generates the schemas, executes the calls, and feeds results back to the model.

Tool calling with max_turns

import aisuite as ai

def will_it_rain(location: str, time_of_day: str):
    """Check if it will rain in a location at a given time today.

    Args:
        location (str): Name of the city
        time_of_day (str): Time of the day in HH:MM format.
    """
    return "YES"

client = ai.Client()
response = client.chat.completions.create(
    model="openai:gpt-4o",
    messages=[{
        "role": "user",
        "content": "I live in San Francisco. Can you check for weather "
                   "and plan an outdoor picnic for me at 2pm?"
    }],
    tools=[will_it_rain],
    max_turns=2  # Maximum number of back-and-forth tool calls
)
print(response.choices[0].message.content)

With max_turns set, aisuite sends your message, executes any tool calls the model requests, returns the results to the model, and repeats until the conversation completes. response.choices[0].intermediate_messages carries the full tool interaction history if you want to continue the conversation.
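To give a feel for what "generates the schemas" involves, here is a rough sketch that derives an OpenAI-style tool spec from a function's signature and docstring using only the standard library. It is not aisuite's implementation; `function_to_tool_spec` and the type mapping are assumptions made for illustration.

```python
import inspect

# Minimal mapping from Python annotations to JSON Schema types (assumed, not exhaustive).
_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def function_to_tool_spec(fn):
    """Build an OpenAI-style function tool spec from a plain Python function."""
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        # Unannotated or unrecognized parameters fall back to "string".
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    doc = inspect.getdoc(fn) or ""
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": doc.splitlines()[0] if doc else "",
            "parameters": {
                "type": "object",
                "properties": properties,
                "required": required,
            },
        },
    }


def will_it_rain(location: str, time_of_day: str):
    """Check if it will rain in a location at a given time today."""
    return "YES"


spec = function_to_tool_spec(will_it_rain)
print(spec)
```

This is why the type hints and docstring in the example above are worth writing carefully: they are the only description of the tool that the model gets to see.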

Prefer full manual control? Omit max_turns and pass OpenAI-format JSON tool specs — aisuite returns the model's tool-call requests and you run the loop yourself. See examples/tool_calling_abstraction.ipynb for both styles.
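When you run the loop yourself, the core step is dispatching one tool-call request to a local function and wrapping the result as a tool message. The sketch below shows that step with the tool call written as a plain dict in the OpenAI format; an SDK typically returns objects with the same fields, and `TOOLS`, `run_tool_call`, and the call ID are hypothetical names and values.

```python
import json


def will_it_rain(location: str, time_of_day: str) -> str:
    """Stub tool, as in the example above."""
    return "YES"


# Registry mapping tool names to local callables (hypothetical helper).
TOOLS = {"will_it_rain": will_it_rain}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one OpenAI-format tool call and build the tool-result message."""
    fn = TOOLS[tool_call["function"]["name"]]
    # In the OpenAI format, arguments arrive as a JSON-encoded string.
    args = json.loads(tool_call["function"]["arguments"])
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": str(fn(**args)),
    }


# A tool call as a model might request it (invented ID and arguments).
example_call = {
    "id": "call_1",
    "type": "function",
    "function": {
        "name": "will_it_rain",
        "arguments": '{"location": "San Francisco", "time_of_day": "14:00"}',
    },
}
tool_message = run_tool_call(example_call)
print(tool_message)
```

In a manual loop you would append the returned message to the conversation and call the model again until it stops requesting tools.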

The Agents API

For longer-running, structured work there is a first-class Agents API: declare an agent once, run it with a Runner, and attach toolkits — prebuilt, sandboxed tool families for files, git, and shell:

import aisuite as ai
from aisuite import Agent, Runner

agent = Agent(
    name="repo-helper",
    model="anthropic:claude-sonnet-4-6",
    instructions="You are a careful repo assistant. Use your tools to answer from the code.",
    tools=[*ai.toolkits.files(root="."), *ai.toolkits.git(root=".")],
)

result = Runner.run(agent, "What changed in the last commit? Summarize in 3 bullets.")
print(result.final_output)

The Agents API also gives you the pieces a production harness needs:

  • Tool policies: RequireApprovalPolicy, allow/deny lists, or your own callable deciding which tool calls run.
  • State stores — persist and resume runs (in-memory, file, or Postgres) and continue conversations across processes.
  • Artifacts & tracing — capture what an agent produced and every step it took along the way.
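As an illustration of the "your own callable" option, here is a sketch of an allow/deny policy. The README excerpt does not show the signature aisuite expects from a policy callable, so the `(tool_name, arguments) -> bool` shape and the tool names below are assumptions.

```python
def make_allow_deny_policy(allow=None, deny=()):
    """Return a callable that decides whether a tool call may run.

    Assumed signature: policy(tool_name, arguments) -> bool.
    A deny entry always wins; allow=None means "allow anything not denied".
    """
    allow_set = set(allow) if allow is not None else None
    deny_set = set(deny)

    def policy(tool_name: str, arguments: dict) -> bool:
        if tool_name in deny_set:
            return False
        return allow_set is None or tool_name in allow_set

    return policy


# Hypothetical tool names, for illustration only.
policy = make_allow_deny_policy(allow=["read_file", "git_log"], deny=["run_shell"])
print(policy("read_file", {"path": "README.md"}))
print(policy("run_shell", {"command": "rm -rf /"}))
```

Checking the deny list first keeps the policy safe even if a tool name is added to both lists by mistake.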

MCP tools

aisuite natively supports the Model Context Protocol, so any MCP server's tools can be handed to a model without boilerplate (pip install 'aisuite[mcp]'):

import aisuite as ai

client = ai.Client()
response = client.chat.completions.create(
    model="openai:gpt-4o",
    messages=[{"role": "user", "content": "List the files in the current directory"}],
    tools=[{
        "type": "mcp",
        "name": "filesystem",
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-filesystem", "/path/to/directory"]
    }],
    max_turns=3
)
print(response.choices[0].message.content)

For reusable connections, security filters, and tool prefixing, use the explicit MCPClient.

→ Quickstart: docs/agents-quickstart.md — manual tool handling, the full Agents API, policies, state stores, and MCP in depth.

Extending aisuite: Adding a Provider

New providers can be added by implementing a lightweight adapter. The system uses a naming convention for discovery:

Element       Convention
Module file   <provider>_provider.py
Class name    <Provider>Provider (capitalized)

Example:

# providers/openai_provider.py
class OpenaiProvider(BaseProvider):
    ...

This convention ensures consistency and enables automatic loading of new integrations.
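A naming convention like this lets a loader compute where a provider lives from its key alone. The sketch below shows one way that could work with `importlib`; it is not aisuite's loader, and `provider_names`, `load_provider`, and the `providers` package name are assumptions taken from the example path above.

```python
import importlib


def provider_names(provider_key: str) -> tuple[str, str]:
    """Derive (module name, class name) from a provider key such as 'openai'."""
    module_name = f"{provider_key}_provider"
    class_name = f"{provider_key.capitalize()}Provider"
    return module_name, class_name


def load_provider(provider_key: str, package: str = "providers"):
    """Import the module and return the provider class (package name is assumed)."""
    module_name, class_name = provider_names(provider_key)
    module = importlib.import_module(f"{package}.{module_name}")
    return getattr(module, class_name)


print(provider_names("openai"))
```

Note that `str.capitalize()` uppercases only the first letter, which is why the class in the example is `OpenaiProvider` rather than `OpenAIProvider`.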

Contributing

Contributions are welcome. Please review the Contributing Guide and join our Discord for discussions.

License

Released under the MIT License — free for commercial and non-commercial use.

DevOps infrastructure kubernetes

Meshery (GitHub Repo)

Meshery offers a visual, self-service platform for managing multi-cluster Kubernetes infrastructure without manual YAML configuration.

Summary

What: Meshery, a Cloud Native Computing Foundation project, provides a central interface for managing Kubernetes-based infrastructure across multiple clouds with support for 380+ integrations, visual GitOps workflows, and performance testing using the Fortio load generator.
Why it matters: This signals a trend toward abstracting Kubernetes complexity through visual management interfaces and platform engineering tools that standardize infrastructure patterns across heterogeneous cloud environments.
Takeaway: Test your current deployment configurations using the platform's dry-run feature to catch errors before applying changes to your cluster.

Deep Dive

  • Provides a multi-tenant dashboard for managing multiple Kubernetes clusters.
  • Uses GitOps-centric design with visual infrastructure modeling.
  • Includes native support for performance characterization using Fortio.
  • Offers built-in context-aware policy enforcement via Open Policy Agent.
  • Supports extensibility via gRPC, GraphQL, and REST APIs.
  • Enables infrastructure snapshots for pull request previews.
  • Connects to Prometheus and Grafana for observability.

Decoder

  • GitOps: A set of practices to manage infrastructure and application configurations using Git as the source of truth.
  • Dry-run: A simulation mode in Kubernetes that validates resource definitions against the API server without creating actual changes.
  • Rego: The declarative query language used by Open Policy Agent to define and enforce security and configuration policies.

Original Article

Full article content is not available for inline reading.


DevOps cloud infrastructure kubernetes

Amazon EKS Capabilities now supports Amazon CloudWatch Vended Logs

Amazon EKS now supports CloudWatch Vended Logs, simplifying observability for managed controllers like Argo CD and ACK.

Summary

What: Customers can now route logs from AWS-managed EKS Capabilities—including Argo CD, ACK, and kro—directly into CloudWatch Logs, S3, or Kinesis Data Firehose via CloudWatch Vended Logs, with no additional EKS-specific charges.
Why it matters: This integration removes the overhead of manually managing log collection for auxiliary Kubernetes controllers, centralizing them within AWS's native logging infrastructure.
Takeaway: Enable log delivery for your Argo CD or ACK controllers through the AWS Console or CloudWatch APIs to gain centralized visibility into managed controller operations.

Decoder

  • Vended Logs: Logs that AWS services publish natively on the customer's behalf. They are billed at tiered, volume-based rates that depend on the delivery destination, rather than at the standard rates for custom log ingestion.
  • ACK (AWS Controllers for Kubernetes): A tool that allows users to manage AWS services directly from a Kubernetes cluster using standard custom resources.

Original Article

Amazon EKS Capabilities now supports Amazon CloudWatch Vended Logs

Amazon Elastic Kubernetes Service (Amazon EKS) Capabilities can now be configured as log delivery sources using Amazon CloudWatch Vended Logs. This enables customers to monitor and troubleshoot their EKS Capabilities for Argo CD, AWS Controllers for Kubernetes (ACK), and kro (Kubernetes Resource Orchestrator) by monitoring logs collected from the managed controllers that run in AWS-managed infrastructure.

Customers can enable log delivery for each capability using CloudWatch APIs or the AWS Console. Logs are configured as a CloudWatch Vended Logs delivery source, enabling reliable, secure log delivery to CloudWatch Logs, Amazon S3, or Amazon Kinesis Data Firehose destinations.
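For the API route, CloudWatch Logs models vended log delivery as a source, a destination, and a delivery that connects them. The sketch below builds the arguments for those three boto3 calls. The announcement does not give the capability ARN format or the valid log type values for EKS Capabilities, so the placeholder strings are assumptions to be replaced with values from the EKS documentation.

```python
def delivery_requests(capability_arn, log_type, destination_arn, name):
    """Build keyword arguments for the three CloudWatch Logs delivery calls."""
    return {
        "put_delivery_source": {
            "name": f"{name}-source",
            "resourceArn": capability_arn,  # ARN of the resource emitting logs
            "logType": log_type,            # valid values are service-specific
        },
        "put_delivery_destination": {
            "name": f"{name}-destination",
            "deliveryDestinationConfiguration": {
                "destinationResourceArn": destination_arn,
            },
        },
        "create_delivery": {"deliverySourceName": f"{name}-source"},
    }


def enable_delivery(reqs):
    """Apply the requests with boto3 (needs AWS credentials; not called here)."""
    import boto3  # imported lazily so the builder above runs without AWS access

    logs = boto3.client("logs")
    logs.put_delivery_source(**reqs["put_delivery_source"])
    dest = logs.put_delivery_destination(**reqs["put_delivery_destination"])
    return logs.create_delivery(
        deliveryDestinationArn=dest["deliveryDestination"]["arn"],
        **reqs["create_delivery"],
    )


# Placeholder values only: substitute real ARNs and a documented log type.
reqs = delivery_requests(
    capability_arn="<eks-capability-arn>",
    log_type="<log-type>",
    destination_arn="<log-group-s3-bucket-or-firehose-arn>",
    name="argocd-logs",
)
print(reqs["put_delivery_source"])
```

Separating request construction from the AWS calls makes it easy to review exactly what will be sent before anything is created in the account.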

This feature is available in all AWS Regions where the EKS Capabilities feature is supported. Standard CloudWatch Vended Logs pricing applies based on the chosen destination. There is no additional EKS charge.

To learn more about EKS Capabilities, visit the Amazon EKS documentation.

Tech ai agents devops

Ponytail (GitHub Repo)

Ponytail is a ruleset and agent plugin that forces AI coding assistants to prefer native features and standard libraries over bloated, generated dependencies.

Summary

What: Ponytail injects a 'less is more' philosophy into popular AI coding tools like Claude Code, GitHub Copilot, and Cursor. It forces agents to check for YAGNI (You Ain't Gonna Need It), standard library alternatives, and native platform features before writing new code.
Why it matters: AI coding agents frequently over-engineer, often reaching for heavy npm dependencies instead of native browser or standard library functions. This project addresses the 'hallucinated dependency' and bloat problem by imposing strict design constraints at the system prompt level.
Takeaway: Install the Ponytail plugin in your IDE or CLI agent to automatically audit your AI-generated code for unnecessary dependencies and over-engineering.

Deep Dive

  • Forces agents to prioritize standard libraries and native APIs over external dependencies.
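  • Implements a six-rung hierarchy: 1. Skip code that does not need to exist (YAGNI), 2. Use the standard library, 3. Use native platform features, 4. Use an already-installed dependency, 5. Write one line if one line suffices, 6. Otherwise write only the minimum that works.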
  • Supports major tools: Claude Code, Copilot, Cursor, Windsurf, Cline, and Gemini CLI.
  • Includes audit commands: /ponytail-review, /ponytail-audit, and /ponytail-debt to manage technical debt.
  • Reduces code bloat by 80–94% and cost by 47–77% in benchmark tests.
  • Operates as a plugin or simple rule-file injection for agents that support custom instructions.

Decoder

  • YAGNI: "You Ain't Gonna Need It"; a software development principle where you do not add functionality until it is deemed absolutely necessary.
  • Agent: An AI system configured to take autonomous actions in a development environment, such as writing files, executing tests, or installing packages.
  • Promptfoo: A CLI tool for testing, evaluating, and benchmarking the performance and quality of LLM prompts.

Original Article

Ponytail

He says nothing. He writes one line. It works.

80-94% less code · 3-6× faster · 47-77% cheaper
Median of 10 runs across Haiku, Sonnet, and Opus.

You know him. Long ponytail. Oval glasses. Has been at the company longer than the version control. You show him fifty lines; he looks at them, says nothing, and replaces them with one.

Ponytail puts him inside your AI agent.

Before / after

You ask for a date picker. Your agent installs flatpickr, writes a wrapper component, adds a stylesheet, and starts a discussion about timezones.

With ponytail:

<!-- ponytail: browser has one -->
<input type="date">

Numbers

Five everyday tasks (email validator, debounce, CSV sum, countdown timer, rate limiter), three models, three arms: no skill, the caveman skill, and ponytail. Ten runs per cell, median reported.

80-94% less code, 47-77% less cost, and 3-6× faster than a no-skill agent, on every model. Every shortcut ponytail takes is marked in the code with a ponytail: comment naming its upgrade path.

How it works

Before writing code, the agent stops at the first rung that holds:

1. Does this need to exist?   → no: skip it (YAGNI)
2. Stdlib does it?            → use it
3. Native platform feature?   → use it
4. Installed dependency?      → use it
5. One line?                  → one line
6. Only then: the minimum that works

Lazy, not negligent: trust-boundary validation, data-loss handling, security, and accessibility are never on the chopping block.
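
To make rung 2 concrete, here is a minimal sketch of one of the benchmark tasks (CSV sum) written with only Python's standard library. The function name `csv_sum`, the column name, and the sample data are illustrative assumptions, not output from ponytail. The `ponytail:` comment follows the convention described above of marking a shortcut and naming its upgrade path.

```python
import csv
import io


def csv_sum(text: str, column: str) -> float:
    """Sum one numeric column of CSV text using only the standard library."""
    # ponytail: stdlib csv is enough here; reach for a dataframe library
    # only if the analysis grows beyond simple column sums.
    reader = csv.DictReader(io.StringIO(text))
    total = 0.0
    for row in reader:
        cell = (row.get(column) or "").strip()
        if cell:  # skip blank cells; non-numeric cells still raise ValueError
            total += float(cell)
    return total


sample = "item,amount\ncoffee,3.50\ntea,2.25\nwater,\n"
print(csv_sum(sample, "amount"))  # 5.75
```

Malformed numbers are deliberately not swallowed: input validation at a trust boundary is one of the things the ladder says is never cut.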

Install

The most effort ponytail will ever ask of you:

The Claude Code and Codex plugins run two tiny Node.js lifecycle hooks, so node needs to be on your PATH. If it isn't, the skills still work; the always-on activation just stays quiet instead of erroring on every prompt.

Claude Code

/plugin marketplace add DietrichGebert/ponytail
/plugin install ponytail@ponytail

Codex

codex plugin marketplace add DietrichGebert/ponytail
codex

Open /plugins, select the Ponytail marketplace, and install Ponytail. Then open /hooks, review and trust its two lifecycle hooks, and start a new thread.

GitHub Copilot CLI

copilot plugin marketplace add DietrichGebert/ponytail
copilot plugin install ponytail@ponytail

In an interactive Copilot CLI session, use the slash equivalents:

/plugin marketplace add DietrichGebert/ponytail
/plugin install ponytail@ponytail

Pi agent harness

pi install git:github.com/DietrichGebert/ponytail

OpenCode

Run OpenCode from a checkout of this repo, and add to opencode.json:

{ "plugin": ["./.opencode/plugins/ponytail.mjs"] }

Gemini CLI

gemini extensions install https://github.com/DietrichGebert/ponytail

Antigravity CLI

agy plugin install https://github.com/DietrichGebert/ponytail

Commands

Command                                What it does
/ponytail [lite | full | ultra | off]  Set the intensity, or turn it off. No argument reports the current level.
/ponytail-review                       Review the current diff for over-engineering, hands back a delete-list.
/ponytail-audit                        Audit the whole repo for over-engineering, not just the diff.
/ponytail-debt                         Harvest the ponytail: shortcuts you've deferred into a ledger, so "later" doesn't become "never".
/ponytail-help                         Quick reference for the commands above.
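
As a rough illustration of the idea behind /ponytail-debt, the sketch below scans a directory for comments containing `ponytail:` and collects them into a ledger. It is not the plugin's implementation: `harvest_debt`, the regular expression, and the ledger shape are all hypothetical, and it only assumes that shortcuts are marked with a `ponytail:` comment as described earlier.

```python
import re
import tempfile
from pathlib import Path

# Captures the note after "ponytail:", trimming a trailing HTML comment closer.
MARKER = re.compile(r"ponytail:\s*(.+?)\s*(?:-->)?\s*$")


def harvest_debt(root: Path) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, note) for every ponytail: marker under root."""
    ledger = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        try:
            lines = path.read_text(encoding="utf-8").splitlines()
        except UnicodeDecodeError:
            continue  # skip files that are not UTF-8 text
        for number, line in enumerate(lines, start=1):
            match = MARKER.search(line)
            if match:
                ledger.append((path.relative_to(root).as_posix(), number, match.group(1)))
    return ledger


# Demo on a throwaway directory containing the date-picker example.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "form.html").write_text(
        '<!-- ponytail: browser has one -->\n<input type="date">\n', encoding="utf-8"
    )
    ledger = harvest_debt(root)

for entry in ledger:
    print(entry)
```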

Development

When changing the compact rule text, keep the agent copies aligned:

node scripts/check-rule-copies.js
npm test

FAQ

Does it need a config file? No. An optional ~/.config/ponytail/config.json or PONYTAIL_DEFAULT_MODE env var can set the default level, but nothing is required.

What if I really need the 120-line cache class? You don't. Insist anyway and he'll build it. Slowly. Correctly. While looking at you.

Does it scale? The code you never wrote scales infinitely. Zero bugs, zero CVEs, 100% uptime since forever.

Why "ponytail"? You know exactly why.

License

MIT. The shortest license that works.

Tech devops opensource

Software Is Not A Single-Player Game

Code review is evolving from a gatekeeping chore into the central, high-leverage venue for engineering collaboration as implementation costs drop.

Summary

What: David Poll argues that as AI makes producing functional code cheaper, the focus of software engineering must remain on the 'multiplayer' aspects of the cycle—specifically code review—rather than abstract design documentation.
Why it matters: This perspective challenges the 'single-player' narrative of AI-driven coding by asserting that consensus, shared taste, and collaborative judgment remain the only ways to build software that is actually dependable.

Deep Dive

  • Code review is historically a community-driven artifact exchange (e.g., Linus Torvalds and mailing lists).
  • Traditional PRDs were proxies for expensive implementation work; code is now the cheaper, more informative currency for design discussion.
  • Reviewing 'real changes' allows teams to catch social and product-trust issues that runtime observability misses.
  • Some cross-system architecture and compliance decisions still require pre-code planning.
  • AI should be viewed as a tool to accelerate the multiplayer collaborative process, not just individual output.

Decoder

  • SDLC: Software Development Life Cycle.
  • PRD: Product Requirements Document.
  • RFC: Request for Comments, a formal document outlining a proposed design or standard.
  • CL/PR: Changelist or Pull Request; the unit of code proposed for inclusion into a main codebase.

Original Article

When I wrote a few months ago that code review is not about catching bugs, the most common pushback I got was a variation of the same thing:

“If you’re only getting to that decision at the code review stage, you skipped a step. The basic call about whether to build this should have come much, much earlier – at the design doc stage, before code was written.”

I find that argument partly right. The earlier steps it points to – thinking through what to build, why, with what shape, at what cost – are real and important. They don’t disappear in a faster cycle. If anything, they matter more.

Here is something this pushback concedes without quite meaning to. By saying “you skipped a step,” it acknowledges that the SDLC has multiple stages where judgment matters. It doesn’t claim the only judgment left happens in production, or that pull requests are a relic. On that, I agree.

What it gets wrong is the shape of those stages. It treats them as fixed, sequential, and largely complete before code is written. They never were. AI is making the rigidity less tenable by the day.

Underneath all of that is a deeper assumption: that the multiplayer part of software ends before the code exists. People hash things out in documents and meetings, and then development itself is a single-player game. Get the upstream call right and the rest is execution. Code review becomes a checkpoint to verify individual work, and the optimization is to make the checkpoint smaller and earlier.

Much of the current AI development discourse goes further and assumes there was never much of a multiplayer part at all. Either way, the assumption has a ceiling.

What I actually want to argue is that building software that lasts, grows, and that people can depend on is and will remain a multiplayer game. The earlier stages this pushback points to are part of that game, not separate from it. Code review is one of the primary places where the game gets played, and the artifact the game is played on is shifting in a way that makes review more central, not less.

A Brief History Of Looking At Real Things

Before GitHub, before Gerrit, before any of the modern code review tooling, Linus Torvalds was reviewing Linux contributions as patches sent to a mailing list. People wrote code. They mailed it to him. He read it. He had opinions. Sometimes he had loud opinions.

That was the review. That was the place where the decision about whether a change belonged in Linux actually got made. There were no PRDs preceding those patches. There was no architecture committee signing off in advance. There were arguments on a mailing list, and then patches, and then more arguments about the patches.

The lineage is right there in the name. When git arrived, maintainers started emailing Linus requests to pull from their trees – git still ships a command called git request-pull that writes the email for you. The pull request button you click today is named after a message asking Linus to take your changes.

I’m not arguing this was the right model for every team or every product. I am arguing that the act of bringing a real, concrete change to a community for judgment is not a recent invention, and it is not subordinate to a separate planning phase. It has always been one of the moments where consequential engineering decisions actually get made.

PRDs As Workaround

The argument that judgment should happen “before the PR” comes from a real place. For most of software history, writing code was the most expensive step in the process. If you got to a working implementation and realized it was wrong, you’d burned weeks. So we invented a lot of upstream infrastructure to reduce the cost of being wrong: PRDs, design docs, architecture reviews, RFCs.

I ran an API council at Firebase for over five years, reviewing somewhere around 850 proposals before any code shipped. That is about as upstream as engineering judgment gets. I’m not knocking the practice. We caught things that would have been much more painful to walk back.

But I think it’s worth being honest about what was happening. We were reviewing words about code we couldn’t yet afford to produce speculatively. The document was a proxy for the artifact. We were doing our best to apply judgment to an abstraction because the real thing was too expensive to make twice.

When the artifact is cheap, the balance shifts. The document might survive, or it might not. The judgment we were applying through it still has to happen somewhere – and more and more often, it’s cheap enough to just discuss the actual implementation.

The Artifact Moved

Engineering collaboration tends to gather around the cheapest meaningful artifact. For decades, the cheapest meaningful artifact for serious discussion was a design doc. Producing one was much faster than producing the code it described. So that’s where teams pushed and pulled on each other.

The cost curve has bent. Producing a working change, or at least a credible prototype, is dramatically cheaper than it was even two years ago. Not free. Not perfect. But cheap enough that the calculus has shifted. Teams I talk to are increasingly skipping straight to a real change as the venue for discussion, because the real change carries information the design doc cannot.

A design doc tells you what someone thinks the system will do. A working change shows you what the approach actually looks like when it meets the real system. Those are different artifacts. They always were. We just couldn’t afford to find that out cheaply.

When the artifact moves, the collaboration moves with it. That isn’t a new pattern. It’s the pattern. It just got obscured for a few decades by how expensive the artifact was to produce.

Where Software Becomes Multiplayer

Reviews are where software development becomes a multiplayer game.

Most of the conversation about AI in development right now assumes a single-player game. One developer, one terminal. The narrative is about productivity multiplied for an individual. Sometimes about the solo founder shipping a whole product. Those stories are real, and I’m genuinely happy for the people doing this. The accessibility is great. And the solo builder isn’t really alone anymore – agents build and review alongside them as they go, and what one person can ship has grown dramatically.

But there is always a ceiling on how far a single-player game can take you, even with agents. Software that lasts, software that grows, software that people can actually depend on – that is built by groups of people exercising judgment together over time. By teams developing shared taste, shared mental models, shared sense of what their product should be. None of that happens through individual prompting, no matter how clever the prompts.

Reasonable people will disagree about whether agents are teammates or tools. I think they fall somewhere in between. But as long as people are building together for people, we’re the ones who assign value.

Code review is one of the primary places where the multiplayer game actually gets played. It is where one person’s judgment encounters another’s. Where taste gets argued. Where shared understanding gets built. Where a team becomes a team and not just a group of individuals shipping in parallel.

The discourse about AI in development is heavily skewed toward the solo experience because the solo experience is where the gains are most immediate and easiest to see. The much harder, much more interesting question is how groups of people work together with these tools and with each other. We’re still working on that.

What I am fairly certain of is that we will keep collaborating over the core deliverable unit – a change to the product. That part looks durable. And it’s the combination that makes review more central, not less: the game is multiplayer, and the artifact it’s played on is increasingly the real change itself.

Code Is Becoming The Currency

The change to the product is becoming the primary artifact engineering teams reason about. Not the only artifact. Strategy still exists. Roadmaps still exist. But the unit of work that gets debated, refined, and committed to is increasingly a concrete change in a real system, not a document describing one.

It may even be that code is becoming the currency for engineering decisions. The thing you trade in. The thing you accumulate. The thing that, if it doesn’t exist, the decision isn’t quite real yet. Anything that doesn’t eventually manifest as a change to a system risks not really mattering.

And currency gets spent. When code is cheap, trying something and throwing it away is often the fastest way to settle a question. That changes how we use the artifacts themselves: closed PRs become part of the process, not a failure end state. The prototype that showed an approach wouldn’t work did exactly what it was for. Some of the most useful changes never merge.

Some Choices Can’t Be Walked Back

There is a quieter argument for code review that gets lost in the velocity debate. Some choices, once they meet the world, can’t be undone. Even if you roll them back.

I’m watching this play out across the industry right now. Teams test new behaviors by throwing them over the wall, because the rollback feels technically cheap. Push the flag, ship it, see what happens. If it goes badly, flip the flag back. The behavior is reverted. Remove the code. Done.

Except it isn’t done. Users saw it. They formed an opinion. They wrote posts about it. They told other people. The trust that took years to build took minutes to dent. You can roll back the code. You can’t roll back what people now think about your product.

This is the category of decision that no amount of observability catches in time. The instrumentation does its job perfectly. It’s just that learning the ground truth in production requires being in production – and for these choices, by the time you’ve learned it, the damage is already done.

Pre-deployment judgment earns its keep here. Code review is one of the few places where someone can look at a change and ask, “even if this works exactly as intended, do we want our product to do this in the world?” That question doesn’t have a runtime answer. It has to be asked before the artifact meets users.

The artifact moving doesn’t change this. If anything, it sharpens it. When working changes are cheap to produce, the temptation to ship and see grows. The discipline that the cost of code used to enforce by accident now has to come from somewhere on purpose. Increasingly, that somewhere is the review.

What Doesn’t Move

Some decisions still need to happen before any code gets written. Org-scale architecture. Customer-facing commitments. Security and compliance posture. Things that cross too many systems to fit in a single change, or whose blast radius is too large to discover empirically.

That boundary is real. It is also, I notice, smaller than it used to be. Prototypes are cheap enough now that even some of those decisions are getting informed by a real artifact much earlier than they once were. The set of choices where “you can’t possibly try this before we agree on it” applies keeps shrinking. Not to zero. But the line keeps moving.

Not A New World

This is a return to form. Linus was looking at patches because that was the artifact. At Parse, we reviewed APIs inside the pull request because that was the artifact. The teams I worked with at Firebase and Google Cloud did some of their most important judgment work on real CLs, not abstract proposals.

For a while, the cost of producing the artifact pushed a lot of the judgment upstream into proxies. That made sense. But the proxies were always imperfect approximations, there to enable decisions we couldn’t yet afford to make on the real thing. The act of looking at a real change and deciding whether it should be part of a system is the engineering job. It does not happen after the important decisions have been made. It is increasingly where the important decisions get made.

If you came away from the last post thinking “yes, but the judgment should happen earlier,” I’d gently push back. It was never really earlier. We were doing our best to approximate it in a world where the real thing was too expensive to produce on demand.

That world is changing. The cost calculus is shifting. But the game is the same one it has always been: multiplayer. The difference is that now we get to play it with the real thing.

Tech frontend web

prop-for-that (GitHub Repo)

The prop-for-that library exposes browser runtime data like battery status, scroll velocity, and mouse position as real-time CSS custom properties.

Summary

What: Created by Adam Argyle, prop-for-that uses a single shared ResizeObserver and requestAnimationFrame loop to map JavaScript-accessible values—such as viewport dimensions, input values, and sensor data—directly to --live-* CSS variables for use in stylesheets.
Why it matters: This bridges the gap between imperative JS state and declarative CSS, enabling complex reactive styling (like color-shifting based on battery level or scroll speed) without expensive per-element event listeners or manual DOM manipulation.
Takeaway: Add import 'prop-for-that/auto' to your script and apply data-props-for="key" to any element to start using live state in your CSS.

Deep Dive

  • Uses declarative data attributes to sync JS state with CSS custom properties.
  • Batches updates into one requestAnimationFrame flush to prevent layout thrashing.
  • Features tree-shakeable, opt-in plugins for 20+ sources including battery, network, CPU pressure, and pointer position.
  • Includes a FOUC-safe (Flash of Unstyled Content) entry point for constant values like device memory or scrollbar width.
  • Zero-dependency, TypeScript-native, and SSR-safe for modern web stacks.

Decoder

  • CSS custom properties: Also known as CSS variables, these are entities defined by CSS authors that contain specific values to be reused throughout a document.
  • FOUC (Flash of Unstyled Content): A phenomenon where a web page appears briefly without styling before the stylesheet is loaded and applied.
  • requestAnimationFrame: A browser API that allows developers to schedule a function to run before the next repaint, ensuring efficient animation updates.

Original Article

prop-for-that

Expose what JavaScript knows but CSS can't see — as live CSS custom properties.

Sliders, pointer position, element visibility, viewport size, battery, network, sensors — JavaScript can read all of it; CSS can't. prop-for-that writes that runtime state into --live-* and --const-* custom properties — batched and diffed down to one setProperty per frame — so your CSS can compose and react to it with plain calc() and var().

Zero dependencies. TypeScript. ESM + CJS. SSR-safe.

npm i prop-for-that

Quick start

<script type="module">import 'prop-for-that/auto'</script>

<input type="range" data-props-for="range" />
/* the slider paints itself from its own value — no event listeners, no render loop */
input {
  background: hsl(calc(var(--live-value-pct) * 120) 80% 50%);
}

Bind any element with data-props-for="key …" and read its --live-* properties in CSS. That's the whole idea.

Why

  • CSS does the work. No per-element event handlers or render loops — bind once, compose in stylesheets.
  • Fast by design. One requestAnimationFrame flush per frame — idle when nothing changes, frozen while the tab is hidden — plus write-on-change diffing and a single shared ResizeObserver / IntersectionObserver for the whole page. Continuously-sampling element sources pause while their element is off screen; event-driven ones (form fields, ranges, selects) run ungated.
  • Ship only what you use. Four lightweight core sources are built in; everything else is an opt-in, tree-shakeable plugin — and under auto each plugin loads on demand, the moment a data-props-for attribute asks for it.
  • Plays with the platform. Opt into typed @property values for interpolation, or FOUC-safe constants written before first paint.
  • Tiny and dependency-free, in every bundle format.
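
The batching and write-on-change diffing described above can be illustrated with a small model. The sketch below is Python rather than browser code, and it is not the library's implementation: `FrameBatcher`, `queue`, and `flush` are hypothetical names, `flush` stands in for the single requestAnimationFrame callback, and the `apply` callback stands in for a `setProperty` call.

```python
class FrameBatcher:
    """Collects property writes and applies only the changed ones once per frame."""

    def __init__(self, apply):
        self._apply = apply   # stand-in for style.setProperty(name, value)
        self._pending = {}    # latest queued value per property
        self._written = {}    # last value actually applied

    def queue(self, name: str, value: str) -> None:
        # Later writes in the same frame overwrite earlier ones.
        self._pending[name] = value

    def flush(self) -> int:
        """Apply queued values that differ from what was last written; return the count."""
        writes = 0
        for name, value in self._pending.items():
            if self._written.get(name) != value:
                self._apply(name, value)
                self._written[name] = value
                writes += 1
        self._pending.clear()
        return writes


applied = []
batcher = FrameBatcher(lambda name, value: applied.append((name, value)))
batcher.queue("--live-value-pct", "0.2")
batcher.queue("--live-value-pct", "0.5")  # same frame: only the last value survives
first = batcher.flush()
batcher.queue("--live-value-pct", "0.5")  # unchanged value, so nothing is written
second = batcher.flush()
print(first, second, applied)
```

Two updates in one frame collapse into a single write, and a repeated value costs no write at all, which is why an idle page does no style work in this model.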

What it can read

Core (built in): viewport, element size, visibility, and <input type="range"> values.

Plugins (opt-in): pointer position, battery, network, online status, page focus & visibility, navigation type, page meta tags, FPS, clock, scroll velocity, device orientation / motion, geolocation, CPU pressure, soft-keyboard geometry, media playback, form & field state, select & color-picker values, and dominant + accent colors extracted from images and video — 20+ in all.

Entry points

Import                 What it does
prop-for-that/auto     Zero-config & declarative: binds every data-props-for element (globals included, via <html data-props-for="…">), loading plugin sources on demand, kept in sync with the DOM.
prop-for-that          Imperative API (propsFor(), register(), configure()) for explicit control and teardown.
prop-for-that/head     Synchronous, FOUC-safe constants (scrollbar width, DPR, core count, device memory) before first paint.
prop-for-that/plugins  The opt-in plugin catalog.

auto sees the light DOM only (not shadow roots — bind those with propsFor(el, …)), and lazy-loads plugin chunks, so from a CDN use one that serves the dist files verbatim (unpkg / jsDelivr), not a rewriting CDN.

License

MIT © Adam Argyle

Data ai platform

Encoding Your Domain Expert: The Context Layer Behind Spotify's Data Assistant

Spotify built Vedder, an AI data assistant that curates expert-vetted context for 177 distinct data clusters to ensure reliable, domain-specific insights.

Summary

What: Spotify's Vedder assistant, used by 2,100+ employees across 70,000 datasets, uses human-curated 'clusters' containing vetted SQL-question pairs and business documentation to improve model output reliability. Experts review data, and system health is monitored via metrics like drift and reproducibility; only 12.5% of automatically mined query examples were accepted by human curators.
Why it matters: This demonstrates that for enterprise AI, raw data schemas are insufficient; building a sustainable, trustworthy system requires a feedback loop where domain experts curate context to prevent the 'hallucination' of business logic.

Deep Dive

  • Spotify replaced traditional ad-hoc expert querying with 'Vedder', an agentic AI assistant.
  • The system organizes data into 'clusters' managed by specific domain teams.
  • Context curation includes datasets, vetted SQL/question pairs, and business definitions.
  • 87.5% of suggested auto-generated query pairs were rejected by human curators, highlighting the noise in raw query history.
  • Each cluster features a health score based on schema changes, pair validity, and usage patterns.
  • The architecture utilizes a ReAct loop for reasoning over tool calls and providing explainable results.
  • Future work includes integrating documentation outside of standard schemas.

Decoder

  • ReAct: A prompting technique where the model generates both reasoning traces (thoughts) and task-specific actions (tool calls) iteratively.
  • Schema-only RAG: A retrieval-augmented generation approach that relies solely on database table definitions to generate queries, often failing at complex business logic.
  • Cardinality: The number of unique values in a dataset column, essential for optimizing join strategies and query planning.

Original Article

Encoding Your Domain Expert: The Context Layer Behind Spotify's Data Assistant

At Spotify, data problems used to follow a specific pattern. You'd look for the relevant dashboard and find that none existed. You'd message the relevant data expert on Slack, then wait until they had time to help. But with thousands of teams moving fast, the demand for data insights had quietly outpaced what any individual expert could handle alone.

To solve this problem, we started developing an AI data assistant, but with over 70,000 datasets at Spotify, amounting to petabytes of data, no single individual can claim knowledge of everything. Just putting all schemas into an LLM doesn’t work at this scale.

For one, context windows are limited, even at a million tokens, and a million tokens cannot hold a whole data warehouse. For another, schemas do not convey everything. An INT64 type says nothing about values below 100 being legacy test data, about how those differ from real data, or about how “active user” is defined. Hand a model that many tables, and it will confidently select the wrong one.

We needed something in between. A layer that captures what actually matters about a slice of the warehouse, owned by people who own and understand the domain.

Our data agent

Spotify’s data assistant was built to solve this problem. Ask a question in plain English and get reliable data within seconds. Since August 2025, it has been used by over 2,100 Spotifiers in 13,000+ conversations and 60,000+ messages, drawing on 177 clusters that cover advertising, podcasts, music, audiobooks, finance, creator tools, and more than a dozen other fields. More than a quarter of these users had never written SQL before.

When a question comes in, the agent picks the appropriate context, writes the SQL query, runs it against our warehouse, and returns the answer alongside the query and its sources. It follows a ReAct loop, reasoning and acting in steps, adjusting based on what each tool call returns. You can read how the result was produced, not just what it was.
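
The reason-then-act cycle can be sketched in a few lines. This is a toy under stated assumptions, not Spotify's agent: `scripted_model` is a hard-coded stand-in for the LLM, an in-memory SQLite table stands in for the warehouse, and the names `react`, `run_sql`, and `streams` are invented for illustration.

```python
import sqlite3


def scripted_model(history):
    """Stand-in for the LLM: returns a (thought, action, argument) step."""
    if not any(step[1] == "run_sql" for step in history):
        return ("I need the stream count per country.", "run_sql",
                "SELECT country, COUNT(*) FROM streams GROUP BY country ORDER BY country")
    rows = history[-1][3]  # observation from the last tool call
    return ("The query returned rows; I can answer.", "finish",
            ", ".join(f"{country}: {count}" for country, count in rows))


def react(model, tools, max_steps=5):
    """Alternate reasoning and tool calls until the model finishes; keep the full trace."""
    history = []
    for _ in range(max_steps):
        thought, action, argument = model(history)
        if action == "finish":
            return argument, history
        observation = tools[action](argument)  # act, then observe
        history.append((thought, action, argument, observation))
    raise RuntimeError("no answer within max_steps")


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE streams (country TEXT)")
db.executemany("INSERT INTO streams VALUES (?)", [("SE",), ("US",), ("SE",)])
tools = {"run_sql": lambda sql: db.execute(sql).fetchall()}

answer, trace = react(scripted_model, tools)
print(answer)       # the final answer
print(trace[0][2])  # the SQL that produced it
```

Returning the trace alongside the answer is what lets a reader see how the result was produced, not just what it was.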

We built it into the surfaces where people already work: a Slack bot for quick questions in a thread, an MCP server for IDEs and AI tools, and a dedicated web UI for interactive exploration. When no knowledge base covers the topic, the data agent tells you so. That transparency is what makes the answers it gives reliable.

But the interesting part isn't the model. It's how we make sure the answers are trustworthy. That comes down to context and ownership.

The cluster model

At Spotify, we call data domains clusters. A domain can be tied to an initiative, an organization, or an ad hoc interest. This flexibility lets any insights team build a cluster around its topics, while also telling the team if the domain is already covered. Each cluster is owned by a named team of domain experts and consists of three components:

  • Datasets: the data warehouse tables that are relevant, with full schema and profiling. We capture column cardinality, samples of common values, and partition structure. When the model generates a WHERE clause, it helps to know that `country` has values like 'US', 'GB', 'SE' rather than guessing.
  • Pairs: vetted question-and-SQL examples. This is the few-shot mechanism powering the data agent. A domain expert writes or approves each pair, picking examples that teach the patterns they'd want a colleague to follow. They teach the LLM how to query the data and its semantics.
  • Docs: additional business context. This could be terminology, gotchas, definitions that vary by team, which columns to use and which to avoid.
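
One way to picture the three components is as a small data model that gets rendered into few-shot context. The sketch below is an assumption-heavy illustration, not Spotify's schema: every class, field, and the prompt layout in `to_prompt` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ColumnProfile:
    name: str
    dtype: str
    cardinality: int                      # number of distinct values
    common_values: list[str] = field(default_factory=list)


@dataclass
class Pair:
    question: str                         # vetted natural-language question
    sql: str                              # the SQL an expert approved for it


@dataclass
class Cluster:
    name: str
    owner: str
    tables: dict[str, list[ColumnProfile]]  # table name -> profiled columns
    pairs: list[Pair]
    docs: list[str]

    def to_prompt(self, question: str) -> str:
        """Assemble datasets, docs, and vetted pairs into few-shot context."""
        lines = [f"Cluster: {self.name} (owner: {self.owner})", "Tables:"]
        for table, columns in self.tables.items():
            for col in columns:
                values = ", ".join(col.common_values) or "n/a"
                lines.append(f"  {table}.{col.name} {col.dtype} "
                             f"(cardinality {col.cardinality}; e.g. {values})")
        lines.append("Notes:")
        lines += [f"  {doc}" for doc in self.docs]
        for pair in self.pairs:
            lines += [f"Q: {pair.question}", f"SQL: {pair.sql}"]
        lines += [f"Q: {question}", "SQL:"]
        return "\n".join(lines)


cluster = Cluster(
    name="streams-demo",
    owner="insights-team",
    tables={"streams": [ColumnProfile("country", "STRING", 3, ["US", "GB", "SE"])]},
    pairs=[Pair("How many streams in total?", "SELECT COUNT(*) FROM streams")],
    docs=["Use country codes, not country names."],
)
prompt = cluster.to_prompt("How many streams came from Sweden?")
print(prompt)
```

Including sample values next to each column is what lets the model write `country = 'SE'` instead of guessing at the encoding.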

The curation is owned by the data experts: the data scientists and analytics engineers who know how the data is modeled and how to query it efficiently. They decide how to split their domain into clusters, which tables to include, and which examples are important.

Human Judgement

The obvious shortcut was to skip the curator. Our data warehouse holds the complete query history of every data expert who has ever used it. From there, generating question-SQL pairs is straightforward: take a query, ask an LLM to infer the question it was written for, and use those pairs to teach the model how to generate the SQL. These are real queries that people actually wrote, their domain knowledge turned into data. It looks like a way to scale.

The issue is trust. At Spotify's size, an overconfident wrong guess can sway a decision the wrong way. We wanted the examples that would influence the assistant’s behavior to be reviewed and marked as canonical by those familiar with the data.

So, we tried it out. During our curation phase, we took actual queries that data scientists had issued against each domain in our data warehouse, presented them with their questions and SQL, and asked the cluster curators to pick which ones were good examples.

They accepted only 12.5% of the proposed pairs.

The other 87.5% were ad-hoc exploration, debugging sessions, one-off answers no one would ask again, queries that used the wrong table, or queries that were technically correct but taught the wrong pattern. Query history is rich. Most of it is noise. And the signal doesn't label itself.

That's why every example runs through an expert. The model reasons over context. It doesn't decide what's true about the data; the experts do. This isn’t about replacing the people who know best how to work with our data; it’s about giving them more leverage and shipping their expertise in a more scalable way.

Keeping clusters healthy

Data changes, business logic shifts, and context that was accurate last month can be wrong today. Schemas evolve, columns get renamed, tables get deprecated and replaced. Vedder needs that information current, without requiring constant manual attention.

That’s why each cluster has a health score made up of signals we calculate and monitor continuously. How healthy is the underlying data that is used in the cluster? How many of its curated pairs are still valid after recent schema changes? If a column gets renamed, the pairs referencing it degrade immediately. How well does the context cover the questions people are actually asking? How reproducible was the generated SQL? And a handful of others. If any of these degrade, then the cluster’s health score reflects that and actions are suggested.
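
One of these signals, pair validity after a schema change, is easy to sketch. The example below is a hedged illustration rather than Spotify's method: it uses an in-memory SQLite database as a stand-in for the warehouse (the real SQL dialect will differ), and `pair_validity` plus the table and column names are hypothetical. It asks the database to compile each curated query with EXPLAIN and counts how many still compile.

```python
import sqlite3


def pair_validity(schema_ddl: str, pair_sqls: list[str]) -> float:
    """Fraction of curated SQL examples that still compile against the current schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(schema_ddl)
    valid = 0
    for sql in pair_sqls:
        try:
            db.execute(f"EXPLAIN {sql}")  # compiles the statement without running the query
            valid += 1
        except sqlite3.OperationalError:
            pass  # e.g. "no such column" after a rename
    db.close()
    return valid / len(pair_sqls) if pair_sqls else 1.0


# The column `country` was renamed to `market`, so the first pair is now stale.
current_schema = "CREATE TABLE streams (user_id INTEGER, market TEXT);"
pairs = [
    "SELECT COUNT(*) FROM streams WHERE country = 'SE'",
    "SELECT COUNT(DISTINCT user_id) FROM streams",
]
score = pair_validity(current_schema, pairs)
print(score)  # 0.5
```

A check like this only catches pairs that no longer compile. A pair that still runs but has become semantically wrong would need the other signals, or an expert, to notice.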

Data experts see the score and the underlying signals on their cluster dashboard, and use them to decide where to spend curation time.

Closing the loop

Every conversation with Vedder becomes a data point that feeds back into the system. Vedder logs every conversation and query, and the questions, answers, generated SQL and user feedback are shown to cluster owners.

This is how we scale the knowledge of a data scientist. Every question-SQL pair they approve, every doc they clarify helps the next users get even more accurate insights. The answers are only as trustworthy as the context behind them and that context needs tending.

Beyond Spotify

Spotify has a strong data foundation with well-maintained datasets, a data catalog, and data scientists who care about their domains. That made Vedder possible, but the architecture isn't Spotify-specific.

The core idea holds elsewhere: the people who best understand a data domain are the best ones to curate the context the model sees. Humans and LLMs can only understand raw schemas to a certain extent; context and understanding are what enable insights at scale. The role of our data experts becomes more strategic. They spend less time answering one-off questions and more time shaping the knowledge layer that answers thousands.

Context curation is the foundation. But what if the knowledge lies outside the schema? What if it exists in documentation and definitions of processes within the organization? These are some of the questions we are exploring next.

Citations

[1] https://arxiv.org/abs/2210.03629

Data backenddatabase

How Feldera Works: A True Incremental View Maintenance Engine

Feldera achieves low-latency continuous analytics by treating data streams as evolving relational views using incremental delta propagation instead of recomputation.

Summary

What: Feldera utilizes DBSP (Database Stream Processor) to process streaming data as Z-sets, where every change (insert, update, delete) is treated as a delta. By applying algebraic rules, it updates complex joins and aggregations incrementally rather than recomputing the entire state, ensuring CPU and memory costs scale with the volume of changes (Δ) rather than the size of the dataset.
Why it matters: This represents a shift toward 'incremental view maintenance' which bridges the gap between traditional batch SQL databases and low-latency streaming requirements, critical for real-time financial or operational systems.

Deep Dive

  • Streaming systems often struggle with joins and complex aggregations due to state recomputation costs.
  • Feldera maps SQL operators to DBSP circuits that handle Z-set deltas.
  • Inserts are +1, deletes are -1, and updates are -1/+1 operations.
  • Complexity is O(Δ), meaning performance depends on the number of changed rows, not table size.
  • The synchronous execution model ensures deterministic, batch-like consistency for continuous pipelines.

Decoder

  • Z-set: A mathematical representation of a set where elements have associated integer counts, used here to represent relational deltas.
  • Incremental View Maintenance (IVM): A technique that updates a materialized view based only on the changes to the underlying source data, rather than rebuilding the view from scratch.

Original Article

How Feldera Works: A True Incremental View Maintenance Engine

Most streaming systems excel at event processing but struggle with relational queries involving joins or multi-stage aggregations. Feldera changes the game by treating streams as incremental relational views, using DBSP to propagate deltas efficiently and avoid recomputation. This article explores Feldera’s incremental execution model and why it scales for complex analytics.

Challenges with Complex Relational Queries

Streaming SQL engines such as Apache Spark Structured Streaming and Apache Flink work well for simple workloads like filters and basic aggregations, where updates can be applied incrementally and the work scales roughly with the incoming data rate.

However, real-world analytical pipelines often involve multi-stage transformations, joins with mutable dimensions, and nested aggregations. In such cases, even incremental updates can amplify intermediate changes, increasing computation and memory usage.

Furthermore, the execution interfaces of these engines fail to deliver the core promise of SQL: simplicity and ease of use.

Relational queries operate on relations, not events. Efficient continuous query execution therefore requires a different abstraction: algebraic delta propagation.

This is precisely where Feldera diverges.

Inside Feldera: DBSP and Algebraic Delta Propagation

Feldera is built on DBSP (Database Stream Processor), an execution engine for composing incremental computations. DBSP represents queries as circuits built from a small set of primitive operators that can express arbitrarily complex SQL pipelines.

In effect, it combines concepts from streaming engines and databases into a single system that behaves like a live database.

In Feldera, every update is represented as a relational delta:

  • Insert → +1 row
  • Delete → −1 row
  • Update → −1 old row, +1 new row

This delta-based approach lets Feldera precisely capture all relational changes, including inserts, updates, deletes, corrections, and late-arriving data. Streams are not treated as append-only logs; instead, they are modeled as streams of relational differences, enabling full relational semantics in continuous pipelines.
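The +1/−1 encoding above can be made concrete with a small sketch. It models a Z-set as a plain Python dict from row to integer weight; the `apply_delta` helper and the sample rows are illustrative assumptions, not Feldera's API.

```python
from collections import Counter


def apply_delta(state, delta):
    """Add a Z-set delta to a Z-set state, dropping rows whose weight reaches zero."""
    result = Counter(state)
    for row, weight in delta.items():
        result[row] += weight
        if result[row] == 0:
            del result[row]
    return dict(result)


# Rows are (id, city) tuples; weights are integer multiplicities.
state = {}
state = apply_delta(state, {(1, "Oslo"): +1, (2, "Lima"): +1})    # two inserts
state = apply_delta(state, {(1, "Oslo"): -1, (1, "Bergen"): +1})  # update: -1 old row, +1 new row
state = apply_delta(state, {(2, "Lima"): -1})                     # delete
print(state)  # {(1, 'Bergen'): 1}
```

Because a delete is just a negative weight, corrections and late-arriving data need no special case: they are further deltas added to the same state.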

Feldera compiles SQL queries into DBSP operators by translating relational changes into streams of Z-sets, an algebraic representation of relational deltas. Each relational operator (projection, filter, join, union, aggregation, etc.) becomes a differentiable streaming operator that consumes and produces Z-sets, incrementally propagating changes through the circuit.

Each operator maintains indexes over its input state, mapping keys to rows or partial aggregates. This allows operators to:

  • Quickly look up only the rows affected by a change (critical for joins, aggregations, and filters).
  • Apply inserts, deletes, and updates without scanning the entire operator state.
  • Produce Δ(output) efficiently for downstream operators.
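As a rough illustration of an indexed operator, the sketch below maintains `SUM(amount) GROUP BY key` from Z-set deltas. It touches only the keys named in each delta and emits an output delta that retracts the old aggregate row and inserts the new one. The `IncrementalSum` class is a hypothetical simplification, not how DBSP implements aggregation.

```python
from collections import defaultdict


class IncrementalSum:
    """Maintains SUM(amount) GROUP BY key from Z-set deltas, touching only affected keys."""

    def __init__(self):
        self.sums = defaultdict(int)    # index: key -> running sum
        self.counts = defaultdict(int)  # index: key -> number of live input rows

    def step(self, delta):
        """delta maps (key, amount) -> weight; returns a delta over (key, total) rows."""
        touched = {key for (key, _amount) in delta}
        # Remember the aggregate rows that currently exist for the touched keys.
        before = {k: self.sums[k] for k in touched if self.counts[k] > 0}
        for (key, amount), weight in delta.items():
            self.sums[key] += amount * weight
            self.counts[key] += weight
        out = defaultdict(int)
        for k in touched:
            if k in before:
                out[(k, before[k])] -= 1     # retract the old aggregate row
            if self.counts[k] > 0:
                out[(k, self.sums[k])] += 1  # emit the new aggregate row
        return {row: w for row, w in out.items() if w != 0}


agg = IncrementalSum()
print(agg.step({("a", 10): 1, ("b", 5): 1}))  # {('a', 10): 1, ('b', 5): 1}
print(agg.step({("a", 7): 1}))                # {('a', 10): -1, ('a', 17): 1}
print(agg.step({("b", 5): -1}))               # {('b', 5): -1}
```

The work per step depends on the number of keys in the delta, not on how many keys the operator has seen in total.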

Computation is driven entirely by change propagation, not by event arrival.

Feldera ensures that execution cost scales with the size of the change (O(Δ)), not with the total table size or query depth. This is a key reason why Feldera is extremely resource-efficient.

At the algebraic level, each operator computes output deltas purely as a function of input deltas.

Δ(output) = f(Δ(inputs))

The same principle applies to joins and aggregations. For example, the algebraic join delta rule is:

Δ(A ⋈ B) = (ΔA ⋈ B) ∪ (A ⋈ ΔB) ∪ (ΔA ⋈ ΔB)
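The rule can be checked on a small example. The sketch below assumes that A and B denote the state before the change and reads ∪ as Z-set addition, where weights add and rows with weight zero vanish. The helpers `zadd` and `zjoin` are illustrative, and the nested-loop join is deliberately naive: it is there to verify the algebra, not to show indexing.

```python
from collections import defaultdict


def zadd(*zsets):
    """Sum Z-sets weight by weight, dropping rows whose weight is zero."""
    out = defaultdict(int)
    for z in zsets:
        for row, w in z.items():
            out[row] += w
    return {row: w for row, w in out.items() if w != 0}


def zjoin(left, right):
    """Equi-join on the first field; the output weight is the product of input weights."""
    out = defaultdict(int)
    for (lk, lv), lw in left.items():
        for (rk, rv), rw in right.items():
            if lk == rk:
                out[(lk, lv, rv)] += lw * rw
    return {row: w for row, w in out.items() if w != 0}


A = {(1, "alice"): 1, (2, "bob"): 1}
B = {(1, "order-9"): 1}
dA = {(2, "bob"): -1, (2, "robert"): 1}  # update: bob becomes robert
dB = {(2, "order-12"): 1}                # insert

# Left-hand side of the rule, computed from deltas only.
incremental = zadd(zjoin(dA, B), zjoin(A, dB), zjoin(dA, dB))

# Reference answer: recompute the full join and subtract the old result.
old_join = zjoin(A, B)
new_join = zjoin(zadd(A, dA), zadd(B, dB))
recomputed = zadd(new_join, {row: -w for row, w in old_join.items()})

assert incremental == recomputed
print(incremental)  # {(2, 'robert', 'order-12'): 1}
```

Note how the `(2, 'bob', 'order-12')` row appears with weight +1 in one term and −1 in another, so it cancels. That cancellation is why weighted addition, rather than plain set union, is the right reading once deletes are involved.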

If an aggregation feeds into a join, and that join feeds into another aggregation, each stage propagates only deltas. This allows arbitrarily deep relational pipelines to maintain:

Total cost ∝ total delta propagation

Synchronous Streaming Execution

DBSP executes computations using a synchronous streaming model, where updates propagate through the circuit in deterministic steps. This ensures that query results remain consistent and reproducible, preserving the same semantics one would expect from batch SQL execution even as new data continuously arrives.

This is particularly important for correctness-critical workloads such as financial processing, compliance analytics, and machine learning feature pipelines.

Computational Complexity

Feldera's asymptotics are guaranteed by the math, design and implementation.

This means Feldera’s performance depends only on how much data changes. This keeps CPU usage low, reduces memory pressure, delivers predictable latency, and enables linear scalability, even for complex continuous workloads.

Data aicareer

The Mythical Agent-Month

While AI coding agents master accidental complexity, they struggle with essential complexity like architectural design and scope, making human taste and judgment more valuable.

Summary

What: Wes McKinney argues that AI agents generate technical debt at machine speed, creating 'overwrought' codebases that exceed the ability of agents to maintain. He posits that while agents can handle 'accidental complexity' (boilerplate, tests), they fail at the 'essential complexity' of designing maintainable systems, where developers must act as architects to prevent scope creep.
Why it matters: This counters the narrative that coding will become a purely automated task, emphasizing that the bottleneck for high-quality software development remains human design and the discipline to say 'no' to unnecessary features.

Deep Dive

  • Agents excel at refactoring and writing tests, but create immense technical debt when unsupervised.
  • The 'agentic tar pit' refers to managing the bloat created by parallel agent sessions.
  • Brooks's laws from 'The Mythical Man-Month' apply to virtual teams, particularly concerning conceptual integrity.
  • Agents tend to chase their own tails in large, complex codebases (e.g., 1M+ lines).
  • The new differentiator for developers is the ability to manage project scope and exercise taste.

Decoder

  • Accidental Complexity: Challenges arising from the tools and processes used to solve a problem, rather than the problem itself.
  • Essential Complexity: The core difficulties inherent in the problem domain that cannot be eliminated by better tools.
  • Conceptual Integrity: The principle that a system should reflect a unified design vision rather than a collection of disjointed features.

Original Article

Like a lot of people, I’ve found that AI is terrible for my sleep schedule. In the past I’d wake up briefly at 4 or 4:30 in the morning to have a sip of water or use the bathroom; now I have trouble going back to sleep. I could be doing things. Before I would get a solid 7-8 hours a night; now I’m lucky when I get 6. I’ve largely stopped fighting it: now when I’m rolling around restlessly in bed at 5:07am with ideas to feed my AI coding agents, I just get up and start my day.

Among my inner circle of engineering and data science friends, there is a lot of discussion about how long our competitive edge as humans will last. Will having good ideas (and lots of them) still matter as the agents begin having better ideas themselves? The human-expert-in-the-loop feels essential now to get good results from the agents, but how long will that last until our wildest ideas can be turned into working, tasteful software while we sleep? Will it be a gentle obsolescence where we happily hand off the reins or something else?

For now, I feel needed. I don’t describe the way I work now as “vibe coding” as this sounds like a pejorative “prompt and chill” way of building AI slop software projects. I’ve been building tools like roborev to bring rigor and continuous supervision to my parallel agent sessions, and to heavily scrutinize the work that my agents are doing. With this radical new way of working it is hard not to be contemplative about the future of software engineering.

Probably the book I’ve referenced the most in my career is The Mythical Man-Month by Fred Brooks, whose now-famous Brooks’s Law argues that “adding manpower to a late software project makes it later”. Lately I find myself asking whether the lessons from this book are applicable in this new era of agentic development. Will a talented developer orchestrating a swarm of AI agents be able to build complex software faster and better, and will the short term productivity gains lead to long term project success? Or will we run into the same bottlenecks – scope creep, architectural drift, and coordination overhead – that have plagued software teams for decades?

Revisiting The Mythical Man-Month (TMMM)

One of Brooks’s central arguments is that small teams of elite people outperform large teams of average ones, with one “chief surgeon” supported by specialists. This leads to a high degree of conceptual integrity about the system design, as if “one mind designed it, even if many people built it”.

Agentic engineering appears to amplify these problems, since the quality of the software being built is now only as good as the humans in the loop curating and refining specs, saying yes or no to features, and taming unnecessary code and architectural complexity. One of the metaphors in TMMM is the “tar pit”: “everyone can see the beasts struggling in it, and it looks like any one of them could easily free itself, but the tar holds them all together.” Now, we have a new “agentic tar pit” where our parallel Claude Code sessions and git worktrees are engaged in combat with the code bloat and incidental complexity generated by their virtual colleagues. You can systematically refactor, but invariably an agentic codebase will end up larger and more overwrought than anything built by human hand. This is technical debt on an unprecedented scale, accrued at machine speed.

In TMMM, Brooks observed that a working program is maybe 1/9th the way to a programming product, one that has the necessary testing, documentation, hardening against edge cases, and is maintainable by someone other than its author. Agents are now making the “working program” (or “appears-to-work” program, more accurately) a great deal more accessible, though many newly-minted AI vibe coders clearly underestimate the work involved with going from prototype to production.

These problems compound when considering the closely-related Conway’s Law, which asserts that the architecture of software systems tends to resemble the organizations’ team or communication structure. What does that look like when applied to a virtual “team” of agents with no persistent memory and no shared understanding of the system they are building?

Another “big idea” from TMMM that has stuck with people is the n(n-1)/2 coordination problem as teams scale. With agentic engineering, there are fewer humans involved, so the coordination problem doesn’t disappear but rather changes shape. Different agent sessions may produce contradictory plans that humans have to reconcile. I’ll leave this agent orchestration question for another post.
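For a sense of how quickly n(n-1)/2 grows, here is a quick worked example. The team sizes are arbitrary, and the `channels` function is just the formula written out.

```python
def channels(n):
    """Number of pairwise communication channels among n team members."""
    return n * (n - 1) // 2


print({n: channels(n) for n in (3, 5, 10, 50)})
# {3: 3, 5: 10, 10: 45, 50: 1225}
```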

No Silver Bullet

“There is no single development, in either technology or management technique, which by itself promises even one order-of-magnitude improvement within a decade in productivity, in reliability, in simplicity” – “No Silver Bullet” (1986)

Brooks wrote a follow-up essay to TMMM to look at software design through the lens of essential complexity and accidental complexity. Essential complexity is fundamental to achieving your goal: if you made the system any simpler, it would fall short of its problem statement. Accidental complexity is everything else imposed by our tools and processes: programming languages, tools, and the layer of design and documentation to make the system understandable by engineers.

Coding agents are probably the most powerful tool ever created to tackle accidental complexity. To think: I basically do not write code anymore, and now write tons of code in a language (Go) I have never written by hand. There is a lot of discussion about whether IDEs are still going to be relevant in a year or two, when maybe all we need is a text editor to review diffs. The productivity gains are enormous, and I say this as someone burning north of 10 billion tokens a month across Claude, Codex, and Gemini.

But Brooks’s “No Silver Bullet” argument predicts exactly the problem I’m experiencing in my agentic engineering: the accidental complexity is no problem at all anymore, but what’s left is the essential complexity which was always the hard part. Agents can’t reliably tell the difference. LLMs are extraordinary pattern matchers trained on the entirety of humanity’s open source software, so while they are brilliant at dealing with accidental complexity (refactor this code, write these tests, clean up this mess) they struggle with the more subtle essential design problems, which often have no precedent to pattern match against. They also often tend to introduce unnecessary complexity, generating large amounts of defensive boilerplate that is rarely needed in real-world use.

Put another way, agents are so good at attacking accidental complexity, that they generate new accidental complexity that can get in the way of the essential structure that you are trying to build. With a couple of my new projects, roborev and msgvault, I am already dealing with this problem as I begin to reach the 100 KLOC mark and watch the agents begin to chase their own tails and contextually choke on the bloated codebases they have generated. At some point beyond that (the next 100 KLOC, or 200 KLOC) things start to fall apart: every new change has to hack through the code jungle created by prior agents. Call it a “brownfield barrier”. At Posit we have seen agents struggle much more in 1 million-plus line codebases such as Positron, a VSCode fork. This seems to support Brooks’s complexity scaling argument.

I would hesitate to place a bet on whether the present is a ceiling or a plateau. The models are clearly getting better fast, and the problems I’m describing here may look charmingly quaint in two years. But Brooks’s essential/accidental distinction gives me some confidence that this isn’t just about the current limitations of the technology. Figuring out what to build was the hard part long before we had LLMs, and I don’t see how a flawless coding agent changes that.

Agentic Scope Creep

When generating code is free, knowing when to say ‘no’ is your last defense.

With the cost of generating code now converging to zero, there is practically nothing stopping agents and their human taskmasters from pursuing all avenues that would have previously been cost- or time-prohibitive. The temptation to spend your day prompting “and now can you just…?” is overwhelming. But any new generated feature or subsystem, while cheap to create, is not costless to maintain, test, debug, and reason about in the future. What seems free now carries a future contextual burden for future agent sessions, and each new bell or whistle becomes a new vector of brittleness or bugs that can harm users.

From this perspective, building great software projects maybe never was about how fast you can type the code. We can “type” 10x, maybe 100x faster with agents than we could before. But we still have to make good design decisions, say no to most product ideas, maintain conceptual integrity, and know when something is “done”. Agents are accelerating the “easy part” while paradoxically making the “hard part” potentially even more difficult.

Agentic scope creep also seems to be actively destroying the open source software world. Now that the bar is lower than ever for contributors to jump in and offer help, projects are drowning in torrents of 3000-line “helpful” PRs that add new features. As developers become increasingly hands-off and disengaged from the design and planning process, the agents’ runaway scope creep can get out of control quickly. When the person submitting a pull request didn’t write or fully read the code in it, there’s likely no one involved who’s truly accountable for the design decisions.

I have seen in my own work on roborev and msgvault that agents will propose overwrought solutions to problems when a simple solution would do just fine. It takes judgment to know when to intervene and how to keep the agent in check.

Design and Taste as our Last Foothold

Brooks’s argument is that design talent and good taste are the most scarce resources, and now with agents doing all of the coding labor, I argue that these skills matter more now than ever. The bottleneck was never hands on keyboards. Now with the new “Mythical Agent-Month”, we can reasonably conclude that design, product scoping, and taste remain the practical constraints on delivering high quality software.

The developers who thrive in this new agentic era won’t be the ones who run the most parallel sessions or burn the most tokens. They’ll be the ones who are able to hold their projects’ conceptual models in their mind, who are shrewd about what to build and what to leave out, and exercise taste over the enormous volume of output.

The Mythical Man-Month was published in 1975, more than fifty years ago. In that time, a lot has happened: tremendous progress in hardware performance, programming languages, development environments, cloud computing, and now large language models. The tools have changed, but the constraints are still the same.

Maybe I’m trying to justify my own continued relevance, but the reality is more complex than that. Not all software is created equal: CRUD business productivity apps aren’t the same as databases and other critical systems software. I think the median software consulting shop is completely toast. But my thesis is more about development work in the 1% tail of the distribution: problems inaccessible to most engineers. This will continue to require expert humans in the loop, even if they aren’t doing much or any manual coding. As one recent adjacent example, my friend Alex Lupsasca at OpenAI and his world-class physicist collaborators were able to create a formulation of a hard physics problem and arrive at a solution with AI’s help. Without such experts in the loop, it’s much more dubious whether LLMs would be able to both pose the questions and come up with the solutions.

For now, I’ll probably still be getting out of bed at 5am to feed and tame my agents for the foreseeable future. The coding is easier now, and honestly more fun, and I can spend my time thinking about what to build rather than wrestling with the tools and systems around the engineering process.

Data aienterprise

The Hidden Cost of ai_parse_document in Production

Productionizing LLM-based document extraction requires careful pipeline design to avoid ballooning costs and non-deterministic results.

Summary

What: Andy Ho at Xebia warns that blindly using Databricks' `ai_parse_document` and `ai_query` in production leads to high costs from repeated parsing and unreliable audit trails due to LLM non-determinism. He recommends a streaming architecture with checkpoints, versioned prompts, and deduplication logic.
Why it matters: Developers often treat GenAI tools as query helpers rather than ingestion systems; understanding stateful pipeline design is necessary to ensure cost-efficiency and compliance in regulated sectors.
Takeaway: Adopt a multi-stage streaming pipeline (Bronze to Silver) to cache parsed documents, and consider deterministic parsers like OpenDataLoader if your document templates are consistent.

Deep Dive

  • Use cloudFiles to ingest binary files and manage state via checkpoints
  • Separate document parsing from extraction to allow prompt iteration without re-parsing
  • Track prompt versions to maintain auditability
  • Deduplicate files at the Silver layer using ROW_NUMBER() partitioning
  • Clean up noise from document headers/footers using regex-based SQL UDFs
  • Prefer deterministic parsing (e.g., OpenDataLoader) for fixed-template documents to reduce cost and non-determinism

Decoder

  • Medallion Architecture: A data design pattern consisting of Bronze (raw data), Silver (cleaned/enriched), and Gold (business-ready) tables.

Original Article

The Hidden Cost of ai_parse_document in Production

This is not a benchmark and not a full production tutorial. It is an engineering evaluation of a Databricks-native document extraction pattern

A colleague and I were exploring options for a healthcare project that had hit a familiar wall. The structured data was usable — dates, diagnoses, billing records — but the information people actually needed lived elsewhere: discharge summaries, referral letters, intake forms, nursing notes. All unstructured and inconsistent.

We were weighing up OCR tools, custom parsers, and third-party APIs (the usual trade-offs of cost, accuracy, and maintenance) when Databricks released a new function, ai_parse_document, which seemed to cut straight through all of that.

It was a few lines of SQL and messy PDFs came out the other side as structured JSON. The first time you see it work, it genuinely feels like magic — a proof of concept in an afternoon.

That simplicity, though, is a bit deceptive. Because once you move beyond the demo, the problem changes. It is no longer just about extracting fields from a document. It is about running that extraction as a system: reliably, repeatedly, and at scale.

That shift, from query to system design, is where certain problems start to surface. That is what this post is about.

The examples use Databricks – specifically ai_parse_document and ai_query – in SQL or notebooks. You'll need Databricks Runtime 17.1+ and basic familiarity with Delta tables and the medallion architecture.

What this post covers: This is not a benchmark and not a full production tutorial. It is an engineering evaluation of a Databricks-native document extraction pattern: where it shines, where the hidden costs appear, and when a deterministic alternative is a better fit. The snippets run in a Databricks notebook and are simplified to highlight key patterns; the companion notebook has the complete runnable implementation. The examples use synthetic patient data.

What it doesn't cover: Fine-grained cost optimisation (Photon vs. serverless, token reduction strategies, model endpoint pricing) is a post of its own. So are production hardening topics like failure modes, exactly-once semantics, and Delta Expectations. Data residency, PII handling at the document level, and Unity Catalog lineage and governance are also out of scope here; each is worth a dedicated treatment.

The Starting Point: ai_parse_document + ai_query

Before looking at the SQL, we need to establish quickly that not all PDFs are the same.

Digitally-born PDFs have real, selectable text baked in – think of saving a Word document as a PDF. Scanned PDFs are photos of pages. The words are visible to a human, but to a computer the page is just an image until OCR (Optical Character Recognition) reconstructs the text. This distinction matters because extraction quality differs sharply between them.

In the PoC below, we'll demonstrate what ai_parse_document and ai_query look like in practice:

WITH text_extracted AS (
    SELECT
        substring_index(path, '/', -1) AS filename,
        concat_ws('\n', transform(
            ai_parse_document(content, map('version', '2.0')):document:elements,
            element -> element:content
        )) AS full_text
    FROM read_files('/path/to/pdfs/', format => 'binaryFile')
)
SELECT
    filename,
    ai_query(
        'databricks-claude-sonnet-4',
        concat('Extract as JSON: {"patient_ref": "...", ...}', full_text)
    ) AS extracted
FROM text_extracted
LIMIT 5;

Let's walk through what's happening in two steps:

Step 1 — Parse the document. The text_extracted CTE calls ai_parse_document on each binary file loaded by read_files. ai_parse_document handles both document types: for digitally-born PDFs it reads the embedded text directly and for scans it runs OCR to reconstruct it first.

The result is a structured document.elements array — text blocks, tables, section headers. Since you usually just want the text, transform and concat_ws flatten everything into a single string, full_text.

Step 2 — Extract the fields. The outer SELECT passes that assembled text to ai_query with a prompt: "extract these five fields as JSON." The fields — patient_ref, document_date, document_type, primary_diagnosis, follow_up_required — are enough to build a queryable patient cohort and flag who needs a follow-up. It gets the structured fields right most of the time. When something is ambiguous or unstated, it infers — and does not tell you it did unless you look closely.

Here is a single row from the query result, before any post-processing:

Filename: `admission_PT_2024_00502_digital.pdf`

Here is the extracted data in JSON format:

```json
{
  "patient_ref": "PT-2024-00502",
  "document_date": "2024-03-28",
  "document_type": "admission_note",
  "primary_diagnosis": "Subarachnoid haemorrhage",
  "follow_up_required": true
}
```

> Note: The `follow_up_required` field is set to `true` based on the fact that the patient is being admitted to the hospital and a management plan is being put in place, which typically involves follow-up care. However, this field is not explicitly stated in the document, so this is an inference based on the context.

A few things to notice here. Because we’re using ai_query, the extracted column is not a JSON object - it is a plain text string. The model leads with a prose sentence, wraps the JSON in markdown code fences, and appends a “Note:” explaining its reasoning. All of that arrives as one continuous string and needs cleanup before the JSON inside can be parsed.
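The cleanup step might look something like the sketch below. It is a minimal example in plain Python, assuming the response has the shape shown above (prose, a fenced JSON block, then a note). The `extract_json` helper is hypothetical, and the fence marker is built from single backticks only to keep the example easy to embed.

```python
import json
import re

TICKS = "`" * 3  # the markdown code-fence marker
FENCED_JSON = re.compile(TICKS + r"(?:json)?\s*(\{.*?\})\s*" + TICKS, re.DOTALL)


def extract_json(raw):
    """Return the first JSON object found in a model response, or None if nothing parses."""
    match = FENCED_JSON.search(raw)
    if match:
        candidate = match.group(1)
    else:
        # Fall back to the outermost braces when the model skipped the code fences.
        start, end = raw.find("{"), raw.rfind("}")
        if start == -1 or end <= start:
            return None
        candidate = raw[start:end + 1]
    try:
        return json.loads(candidate)
    except json.JSONDecodeError:
        return None


raw = (
    "Here is the extracted data in JSON format:\n\n"
    + TICKS + "json\n"
    + '{"patient_ref": "PT-2024-00502", "follow_up_required": true}\n'
    + TICKS + "\n\n"
    + "> Note: follow_up_required is an inference based on the context."
)

print(extract_json(raw))
# {'patient_ref': 'PT-2024-00502', 'follow_up_required': True}
```

Returning `None` instead of raising keeps unparseable rows visible downstream, where they can be counted and inspected rather than silently dropped.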

It’s also important to note that this is inference, not extraction. The model made a judgement call on follow_up_required and volunteered an explanation - but it will not always do that, and it can be confidently wrong either way.

Where the Cracks Show

At this point the query works. The first run is not the problem, though. The second is, and it surfaces four issues that don’t exist in a notebook:

Cost: Both functions charge on every run. Without a checkpoint, every run processes every document – and every prompt change re-runs everything. The problem is not the first bill. It is that every correction, prompt revision, and reprocessing cycle reopens the meter.

In healthcare, documents arrive in irregular batches – end-of-month exports, legacy migrations, corrected records re-sent in bulk. At medium-complexity rates (~$4.20–$4.55 per 1,000 pages, using the pricing available at the time of writing), a single corpus of 30,000 pages costs roughly $126–$137 to parse. That is manageable once. But prompt iteration is not a one-time cost – every prompt refinement re-runs ai_query across every document, and onboarding a new hospital’s format can trigger a full re-parse.
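The arithmetic is worth spelling out, because the multiplier matters more than the base figure. The sketch below uses the per-1,000-page rates quoted above; the number of full re-parses is a made-up illustration, and ai_query charges are not modelled at all.

```python
def parse_cost(pages, rate_per_1000):
    """Cost in dollars of parsing `pages` pages at a per-1,000-page rate."""
    return pages / 1000 * rate_per_1000


pages = 30_000
low, high = parse_cost(pages, 4.20), parse_cost(pages, 4.55)
print(f"one parse: ${low:.2f} to ${high:.2f}")
# one parse: $126.00 to $136.50

full_reparses = 5  # hypothetical: how often the whole corpus gets parsed again
print(f"{full_reparses} parses: ${low * full_reparses:.2f} to ${high * full_reparses:.2f}")
# 5 parses: $630.00 to $682.50
```

The one-off parse lands inside the range quoted above. Without a checkpoint or cache, the total grows linearly with every full re-run.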

Duplicates: Hospitals send the same document more than once – a corrected discharge summary when the original had the wrong diagnosis code, a re-sent referral letter after a system migration. Nothing in the query above catches that. Both versions land in Silver: the same patient appears twice in your cohort query, and if the correction changed the diagnosis, both versions of it appear too.

Non-determinism: This is the uncomfortable part: your pipeline can be “correct” and still produce different answers on different days.

A traditional rule-based parser always produces the same output for the same input. An LLM doesn’t. ai_query defaults to temperature 0, which makes it more predictable – but even temperature 0 doesn’t guarantee identical outputs. Floating-point rounding and GPU parallelism introduce subtle variation regardless.

In practice, follow_up_required can flip between true and false for the same document on different runs - no error, no warning, just quietly different results. In a regulated environment, that breaks auditability: your reporting changes without any change in the underlying data.

There are ways to mitigate this by running extraction multiple times and taking the most common result, or building a labelled test set to evaluate your prompt against known-good outputs. That second approach is especially worth the investment when you update the prompt to support a new document type or hospital format: without a test set, you have no way of knowing whether the change improved extraction for the new format without quietly breaking something for an existing one. Just note that running multiple passes per document multiplies your ai_query costs directly - non-determinism and cost are not separate problems here.
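The first mitigation, taking the most common result across repeated runs, can be sketched as follows. The list of runs is invented for illustration; in practice each entry would come from a separate ai_query call on the same document, and the `majority_vote` helper is hypothetical.

```python
from collections import Counter


def majority_vote(values):
    """Return (most common value, share of runs that agreed with it)."""
    counts = Counter(values)
    winner, hits = counts.most_common(1)[0]
    return winner, hits / len(values)


# follow_up_required as returned by five hypothetical runs over one document.
runs = [True, True, False, True, True]
value, agreement = majority_vote(runs)
print(value, agreement)  # True 0.8
```

Storing the agreement ratio next to the value makes it possible to route low-agreement documents to human review. On a tie, `most_common` returns whichever value was seen first, so an odd number of runs is the safer choice for a boolean field.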

Input noise: Every page has a stamp such as “Confidential - Amsterdam UMC - Page 4 of 11” on it, and that goes to the LLM too – consistent per hospital but meaningless for extraction. Because ai_parse_document handles the binary internally, noise arrives pre-baked into the text; you handle it in Silver rather than before the LLM sees it. Headers bleeding into free-text fields are the most common symptom – a primary_diagnosis that starts with the hospital letterhead instead of the actual diagnosis.
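A regex-based cleanup for this kind of stamp could look like the sketch below. It is plain Python rather than a SQL UDF, and the pattern is a hypothetical one written for the example stamp above; each hospital would need its own.

```python
import re

# Hypothetical pattern for one hospital's page stamp, matched as a whole line.
PAGE_STAMP = re.compile(
    r"^[ \t]*Confidential - Amsterdam UMC - Page \d+ of \d+[ \t]*$\n?",
    re.MULTILINE,
)


def strip_page_stamps(text):
    """Remove page-stamp lines while leaving the surrounding text untouched."""
    return PAGE_STAMP.sub("", text)


text = (
    "Confidential - Amsterdam UMC - Page 4 of 11\n"
    "Primary diagnosis: Subarachnoid haemorrhage\n"
    "Confidential - Amsterdam UMC - Page 5 of 11\n"
    "Plan: admit to neurosurgery"
)

print(strip_page_stamps(text))
# Primary diagnosis: Subarachnoid haemorrhage
# Plan: admit to neurosurgery
```

Anchoring the pattern to whole lines keeps it from eating clinical text that merely mentions the hospital name.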

None of these are fatal. But all of them require system design the demo hides.

Bronze: Where the LLM Stops

Once you stop treating the query as the product and start treating it as ingestion, streaming with checkpointing becomes the natural design.

One thing worth flagging before going further. The actual PII concern here is the full document text: discharge summaries and clinical notes contain names, dates of birth, and diagnoses, and all of that content is sent to ai_query. Before running this on real patient data, you need to either strip identifying information from the text before it reaches the LLM, or confirm that your data processing agreements cover routing clinical content. As we mentioned at the beginning, this sits outside the scope of this discussion but it is worth being acutely aware of before you build.

Figure 2: From PDFs to queryable clinical data. cloudFiles streams new files automatically; ai_parse_document parses each binary into a structured elements array; the assembled text goes to ai_query for field extraction; Bronze stores the result with a streaming checkpoint; Silver unnests the struct into a flat, queryable table.

Our first instinct at this point is to stay in SQL: hash each file, track what’s been processed in a Delta table, and version by prompt. That works, but it's boilerplate you write and maintain yourself.

Structured Streaming is a cleaner path. Spark manages state automatically via checkpoints - new files flow through, already-processed ones are skipped, and nothing re-runs the LLM unnecessarily. Less boilerplate, less to get wrong.

One practical note: patient_ref is one hospital’s local ID. The same patient will have a completely different ID at the next institution. Store the source hospital identifier alongside every patient_ref from the start. One line now, months of pain if you skip it.

The pipeline splits into three streaming tasks, each writing to its own Delta table and maintaining its own checkpoint:

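from pyspark.sql.functions import col, concat_ws, expr, lit

# ai_parse_document and ai_query are Databricks SQL functions, so they are
# invoked through expr(). Checkpoint paths below are placeholders.

# Task 1: Parse each PDF once – checkpoint prevents re-processing
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")  # PDFs are read as binary content
    .load("/path/to/pdfs/")
    .select("path", expr("ai_parse_document(content)").alias("parsed"))
    .writeStream
    .option("checkpointLocation", "/path/to/checkpoints/parse")
    .toTable("bronze_parsed_raw")
)

# Task 2: Assemble elements into full text
(
    spark.readStream.table("bronze_parsed_raw")
    .select(
        "path",
        concat_ws("\n\n", col("parsed.document.elements.content")).alias("full_text")
    )
    .writeStream
    .option("checkpointLocation", "/path/to/checkpoints/text")
    .toTable("bronze_documents_text")
)

# Task 3: Extract fields – reset only this checkpoint when the prompt changes
(
    spark.readStream.table("bronze_documents_text")
    .select(
        "path",
        expr("ai_query('databricks-claude-sonnet-4', full_text)").alias("extracted"),
        lit("v1").alias("prompt_version")
    )
    .writeStream
    .option("checkpointLocation", "/path/to/checkpoints/extract")
    .toTable("bronze_documents_structured")
)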

One clarification on duplicates: cloudFiles tracks files by path. If the same file is re-sent under the same name, the checkpoint skips it automatically. But a corrected document arriving under a new filename is a different path, so it is processed as a new row alongside the original. In practice, that means deduplication belongs in Silver, where ROW_NUMBER() OVER (PARTITION BY patient_ref, document_date, document_type ORDER BY _commit_timestamp DESC) can keep only the most recent extraction per logical document.
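To see that window function at work without a cluster, here is a small sketch using Python's built-in sqlite3 module (it assumes a bundled SQLite of 3.25 or later, which is where window functions arrived). The table name `bronze`, the column `commit_ts`, and the sample rows are all made up for illustration; in Silver the same query shape would run against bronze_documents_structured.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE bronze (
        patient_ref TEXT, document_date TEXT, document_type TEXT,
        path TEXT, commit_ts TEXT
    )
""")
con.executemany(
    "INSERT INTO bronze VALUES (?, ?, ?, ?, ?)",
    [
        # Original document and a correction that arrived under a new filename.
        ("P-001", "2024-03-01", "discharge", "a.pdf", "2024-03-02T09:00:00"),
        ("P-001", "2024-03-01", "discharge", "a_corrected.pdf", "2024-03-05T14:30:00"),
        ("P-002", "2024-03-03", "referral", "b.pdf", "2024-03-04T08:15:00"),
    ],
)

# Rank rows within each logical document, newest first, and keep rank 1.
deduped = con.execute("""
    SELECT patient_ref, path FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY patient_ref, document_date, document_type
            ORDER BY commit_ts DESC
        ) AS rn
        FROM bronze
    )
    WHERE rn = 1
    ORDER BY patient_ref
""").fetchall()

print(deduped)  # [('P-001', 'a_corrected.pdf'), ('P-002', 'b.pdf')]
```

The corrected file wins because it has the later commit timestamp; the original row stays in Bronze for audit purposes.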

The three-task structure also helps with cost. Because parsing and extraction are separate tasks, you can update the prompt without reparsing the documents. ai_parse_document charges per page, so rerunning it on historical files every time you refine the extraction is wasted spend. If a new hospital uses “patiëntnummer” instead of “patient_ref”, or you decide to extract medications as well, you can reset only the Task 3 checkpoint and rerun ai_query against the already-parsed text in bronze_documents_text. The parse cost for existing files stays at zero.

Stamp each run with a new prompt_version, and both generations can coexist in the table. That keeps the audit trail intact and makes prompt changes much easier to manage.

Databricks publishes a Databricks Asset Bundle that implements this same three-task streaming pattern as a deployable reference. In production, the stages are typically orchestrated in a Databricks Workflow: parse, assemble text, extract fields, then rebuild Silver and run a quality check. If Bronze fails, nothing downstream runs.

Here’s what a typical row looks like in bronze_documents_structured – the extracted column holds a parsed struct, queryable via dot notation in Silver:

Table 1: bronze_documents_structured Delta table.

Silver: Mostly Just SQL

Bronze handles the hard work. Silver is mostly SQL.

We have two tasks here: flatten the extracted struct into a queryable table, and strip out the noise that arrived pre-baked in the text. As mentioned above, because ai_parse_document processes the binary internally, hospital headers and footers, page stamps, and similar boilerplate end up in the parsed content alongside the actual clinical text. A SQL UDF in Silver keeps the cleanup reusable:

CREATE OR REPLACE FUNCTION strip_noise(text STRING)
RETURNS STRING
RETURN regexp_replace(
    regexp_replace(
        text,
        'Hospital Header Pattern',
        ''
    ),
    '\\n{3,}',
    '\\n\\n'
);

Apply it to any free-text field where letterhead might bleed in - strip_noise(extracted.primary_diagnosis) in the SELECT below. Each hospital will have its own header pattern; build up a library as you onboard new institutions. The query itself is straightforward:

SELECT
    extracted.patient_ref,
    extracted.document_date,
    extracted.document_type,
    strip_noise(extracted.primary_diagnosis) AS primary_diagnosis,
    extracted.follow_up_required,
    datediff(current_date(), extracted.document_date) AS document_age_days,

    -- Placeholder: assumes a 30-day follow-up window.
    -- Extend the prompt to extract the actual follow-up date
    -- from the document if your use case requires it.
    date_add(extracted.document_date, 30) AS follow_up_due_date

FROM bronze_documents_structured
WHERE prompt_version = 'v1';

Table 2: silver_documents – flat, queryable, ready for Gold. Note that follow_up_due_date is a placeholder using a fixed 30-day window – real follow-up intervals vary by diagnosis and should ideally come from the document itself.

The same principle applies here as in Bronze: do the expensive work once and don’t repeat it. Materialise Silver as a table rather than a view - you pay a small compute cost once per run, but everything downstream reads from a clean, pre-computed dataset. Analyst queries are faster, Bronze isn’t being hit repeatedly, and the LLM only ever runs during ingestion. When you update to a new prompt version, new rows land under the new prompt_version and you update the filter.

At this point you have a clean, flat Silver layer - structured fields, noise stripped, prompt version tracked. What you build on top in Gold depends entirely on your use case, so we won’t dive into that here.

But before you go ahead and build any of this, there is a question worth asking first.

Not Every Pipeline Needs an LLM

In many healthcare pipelines, a large share of documents follow predictable templates. Discharge summaries from the same institution look roughly the same. Referral letters follow a standard structure. Where that’s true, an LLM is the wrong tool - you’re paying for flexibility you don’t need and accepting non-determinism you can’t afford.

OpenDataLoader PDF is one alternative worth knowing about. It’s open-source (Apache 2.0) and its local mode is fully deterministic, using the XY-Cut++ geometric algorithm for reading order and table detection - no model involved. Parsing cost drops to zero, and if you still need field extraction you can add ai_query on top. The limitation is template variance: if different hospitals label the same field differently (“Patiëntnummer” vs “Patient ID”), you’ll need an LLM to normalise them regardless. But where templates are consistent, deterministic parsing combined with regex is cheaper, faster, and fully reproducible.

Other tools sit at different points on this spectrum - layout-aware ML parsers like Docling or MinerU in the middle, fully LLM-native approaches like LlamaParse at the other end. The trade-off holds across all of them: more convenience and flexibility, less determinism and cost control.

Which Approach Is Right?

The honest summary before you decide:


Scenario                                               Priority
----------------------------------------------------------------------------
Consistent templates, digitally-born PDFs              Reproducibility and auditability
                                                       are non-negotiable

Layout or labels vary across institutions              Cost control and reproducible
                                                       extraction

Mixed formats, variable document structure             Fast time-to-value over strict
                                                       operational predictability

When ai_parse_document Is a Good Fit

It’s worth being explicit: ai_parse_document is not a bad tool. It favours fast setup, broad document support, and flexible extraction over strict determinism and cost efficiency at scale.

That makes it a good fit for:

  • Proofs of concept. When you need to validate whether useful data can be extracted from a document set quickly – without building a custom parsing stack.
  • Low-to-medium volume workflows. Where reprocessing cost is manageable and perfect reproducibility is not required.
  • Human-in-the-loop processes. For example, pre-processing invoices, claims, or intake forms before review – where the output is verified downstream rather than trusted directly.
  • RAG ingestion pipelines. Where layout-aware parsing (sections, tables, figures, bounding boxes) produces better chunks than plain text, and exact field-level consistency matters less than good retrieval.
  • Mixed-format document collections. When dealing with PDFs, scans, DOCX, and images in a single pipeline without stitching together multiple parsers.

In those scenarios, the convenience and flexibility can outweigh the trade-offs.

The Databricks-native route is the most convenient starting point – and in some cases, the right end state – but not always where you land after a full production evaluation. If you can’t tolerate output drift, don’t start with an LLM. Most teams don’t need smarter parsing - they need more predictable systems.

I had a great time working on this and if you’re building something similar or want to discuss some points I missed – you can find me on LinkedIn.

Data ai security research

New framework for auditing machine unlearning

Google Research introduced a new auditing framework using Regularized f-Divergence Kernel Tests to reliably verify machine unlearning and privacy compliance.

Summary

What: The framework improves upon traditional two-sample statistical tests by using relative distance measurements, effectively catching localized privacy leaks and unlearning failures that older methods miss.
Why it matters: Standard statistical tests often fail in the context of high-dimensional AI models, incorrectly flagging safe models as failures; this new approach provides the sensitivity needed for regulatory compliance.

Deep Dive

  • Uses relative distance tests to compare unlearned models against both a safe (retrained) and a compromised (original) baseline
  • Leverages f-divergences (Chi-squared, KL, Hockey-stick) for highly sensitive anomaly detection
  • Eliminates the need for massive sample splitting by using kernel regularization
  • Successfully validated against selective synaptic dampening and pruning techniques
  • Proves that traditional two-sample tests are insufficient for modern unlearning evaluation

Decoder

  • Machine Unlearning: The process of removing the influence of specific training samples from a trained AI model.
  • f-Divergence: A mathematical function that measures the difference between two probability distributions.
  • Differential Privacy: A statistical method to protect individual privacy by injecting calibrated noise into data.

Original Article

New framework for auditing machine unlearning

Machine unlearning allows AI systems to "forget" specific parts of their training data without the massive cost of retraining a model from scratch. This is essential for regulatory compliance (like GDPR’s "Right to be Forgotten"), AI safety, and model quality.

As models process increasingly massive and highly sensitive datasets, verifying machine unlearning has moved from a theoretical ideal to a strict requirement, where developers must now mathematically prove privacy. However, because auditors often don’t have access to the model's internal workings or original training data, they must verify the system strictly by querying it and analyzing the output samples.

One method data scientists and researchers rely on for verification is two-sample testing, a statistical method that determines if two sets of data observations come from entirely different underlying distributions. For example, to verify unlearning, auditors might compare outputs from a model that never saw a specific record against a model that supposedly "forgot" it. If the outputs are statistically different within a defined threshold, the unlearning failed.
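As a rough illustration of what a two-sample test does, here is a permutation test in plain Python. It uses the difference of sample means as its statistic, which is a deliberately simple choice and not the kernel-based statistic discussed later. The three samples are synthetic Gaussian draws standing in for model outputs, and all names are hypothetical.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(xs, ys, n_permutations=1000, seed=0):
    """Estimate a p-value for the null hypothesis that xs and ys share a distribution."""
    rng = random.Random(seed)
    observed = abs(_mean(xs) - _mean(ys))
    pooled = list(xs) + list(ys)
    hits = 0
    for _ in range(n_permutations):
        # Under the null, group labels are exchangeable, so reshuffle them.
        rng.shuffle(pooled)
        stat = abs(_mean(pooled[:len(xs)]) - _mean(pooled[len(xs):]))
        if stat >= observed:
            hits += 1
    # The +1 correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


rng = random.Random(42)
retrained = [rng.gauss(0.0, 1.0) for _ in range(200)]      # never saw the record
unlearned_ok = [rng.gauss(0.0, 1.0) for _ in range(200)]   # same distribution
unlearned_bad = [rng.gauss(1.0, 1.0) for _ in range(200)]  # shifted distribution

p_ok = permutation_test(retrained, unlearned_ok)
p_bad = permutation_test(retrained, unlearned_bad)
print(f"p-value, matching outputs: {p_ok:.4f}")
print(f"p-value, shifted outputs:  {p_bad:.4f}")
```

A small p-value means the two samples are unlikely to come from the same distribution. A mean-based statistic only catches shifts in the mean, which hints at why the choice of statistic matters so much in the sections that follow.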

As models grow in size and complexity, two-sample testing and other statistical tools used for machine unlearning auditing become harder to implement and lose statistical power. To distinguish a real violation from the random noise inherent in large-scale models with enough statistical significance, an auditor needs to extract a large number of samples. This makes real-world testing computationally very expensive.

To address this growing challenge, we introduce Regularized f-Divergence Kernel Tests, presented at AISTATS 2026, a new framework designed to make auditing ML models much more sensitive, flexible, and accurate. We theoretically prove that our tests naturally control for false positives for any sample size, and that the risk of false negatives reliably converges to zero as the number of available data samples increases.

The challenge: Why standard tools fall short

Evaluating model safety often requires measuring the distance, or divergence, between two complex data sets. Different applications naturally require different notions of “distance”. While popular standard tools like maximum mean discrepancy (MMD) excel at detecting broad, global shifts across data (such as a model systematically generating brighter images than its counterpart), they often lack the necessary specificity to capture complex anomalies. For instance, if the addition of a specific person's data causes a model to generate a highly specific outlier output only when prompted in a very exact way — while having an equal distribution on all other samples — traditional MMD tests might completely overlook this local shift.

Also, most existing testing frameworks force researchers to make error-prone manual choices, such as picking the specific statistic best suited for either global or local shifts or tuning complex settings like kernel bandwidths and regularization parameters.

In addition to being hard in practice, two-sample testing as a verification method is flawed when verifying unlearning of ML models. Consider the example below showing how two models trained from scratch on the exact same data can produce different distributions. The blue distribution is the distribution of a model retrained without compromised data. However, its distribution is different from the standard (green) due to retraining with different batch sizes. This results in a false positive, indicating that the tested model is unsafe.

Furthermore, recent work shows that an AI model can never perfectly “forget” data just by tweaking its current settings; unless it re-traces every step of its original training, it will always leave behind a permanent footprint of the information it was supposed to delete. Accordingly, achieving perfect “retrain equivalence” is fundamentally impossible for standard, local unlearning algorithms and a traditional two-sample test can always find a dependence on the “forget set”.

The framework

We resolve this challenge by proposing a relative distance test that measures whether an unlearned model is distributionally closer to a safely retrained model or to the original, compromised one.
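The shape of a relative comparison can be sketched as follows. This is not the paper's test: it uses a plain (biased) MMD estimate with a Gaussian kernel as a stand-in distance, has no significance threshold, and runs on synthetic one-dimensional samples. The function names and the bandwidth of 1.0 are assumptions made for the illustration.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    return math.exp(-((a - b) ** 2) / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared maximum mean discrepancy between two samples."""
    def avg(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))
    return avg(xs, xs) + avg(ys, ys) - 2 * avg(xs, ys)


def closer_to_retrained(unlearned, retrained, original, bandwidth=1.0):
    """True if the unlearned sample sits nearer the retrained one than the original."""
    to_retrained = mmd_squared(unlearned, retrained, bandwidth)
    to_original = mmd_squared(unlearned, original, bandwidth)
    return to_retrained < to_original


rng = random.Random(0)
retrained = [rng.gauss(0.0, 1.0) for _ in range(150)]  # safe baseline
original = [rng.gauss(2.0, 1.0) for _ in range(150)]   # compromised baseline
good_unlearning = [rng.gauss(0.2, 1.0) for _ in range(150)]
poor_unlearning = [rng.gauss(1.8, 1.0) for _ in range(150)]

good_result = closer_to_retrained(good_unlearning, retrained, original)
poor_result = closer_to_retrained(poor_unlearning, retrained, original)
print(good_result, poor_result)  # True False
```

Note that `good_unlearning` is not identical in distribution to `retrained`, so a strict two-sample test could still reject it; the relative question only asks which baseline it resembles more.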

Our test acts as a highly adaptable statistical toolkit that leverages f-divergences to allow auditors to pinpoint highly specific types of data shifts, including:

  • Chi-squared and Kullback-Leibler (KL) divergences: These are highly effective for identifying smooth and localized differences in data, such as outliers in physical models.
  • Hockey-stick divergence: Designed specifically to capture definitions of privacy and unlearning, this divergence operates with a parameter that controls the degree of statistical indistinguishability. It effectively establishes an acceptable threshold, ignoring minor differences below a safety budget and only triggering an alert when a meaningful privacy breach occurs.
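For intuition, the three divergences listed above have simple closed forms on finite distributions. The sketch below uses those textbook definitions, not the kernel-regularized estimators from the paper, and applies them to a toy randomized-response mechanism that reports a true bit with probability 0.75. That mechanism is a standard example with privacy parameter epsilon = ln 3; it is not taken from the article.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum p * log(p / q); assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def chi_squared(p, q):
    """Chi-squared divergence: sum (p - q)^2 / q; assumes q > 0 everywhere."""
    return sum((pi - qi) ** 2 / qi for pi, qi in zip(p, q))


def hockey_stick(p, q, gamma):
    """Hockey-stick divergence: sum max(0, p - gamma * q)."""
    return sum(max(0.0, pi - gamma * qi) for pi, qi in zip(p, q))


# Output distributions over {0, 1} when the true bit is 0 versus 1.
p = [0.75, 0.25]
q = [0.25, 0.75]

print(round(kl_divergence(p, q), 4))  # 0.5493
print(round(chi_squared(p, q), 4))    # 1.3333
print(hockey_stick(p, q, gamma=1.0))  # 0.5
print(hockey_stick(p, q, gamma=3.0))  # 0.0
```

The last two lines show the threshold behaviour: with gamma = 1 the hockey-stick divergence equals the total variation distance, while at gamma = e^epsilon = 3 it drops to zero, meaning the difference between the two output distributions fits entirely within that privacy budget.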

Calculating these divergences on high-dimensional, real-world data is notoriously difficult. To make these complex optimization problems tractable without requiring massive amounts of compute, we use kernel regularization methods to estimate the differences efficiently.

Our adaptive testing approach automatically selects the best divergence and the optimal hyperparameter configurations to maximize the reliability of the test, entirely eliminating the need for sample splitting.

Experiments

Because our proposed tests are general, we experimented across a wide variety of problems. We evaluated our framework on perturbed uniforms (synthetic two-sample benchmarks), as well as the Expo1D outlier detection task within physics datasets, a specialized area that uses ML to search for new physical phenomena outside the standard model of particle physics. We used high-energy physics data because that field requires the world’s most precise “difference detectors”: the idea being that if the framework can spot a rare particle that defies the laws of physics, it can spot a tiny privacy leak in an AI model.

We then shifted our primary focus to the critical, real-world applications of auditing differential privacy and evaluating machine unlearning:

  • Privacy auditing: Differential privacy provides a framework for protecting user data by introducing calibrated noise, bounding the influence of any single individual. We tested multiple non-private mechanisms by sampling their outputs across two simulated datasets that differed by only one record. If a mechanism is truly private, the two resulting samples must be indistinguishable; if it is flawed, the test should flag the privacy violation.
  • Machine unlearning evaluation: Instead of relying on the flawed approach of simply comparing a gold standard model (one retrained from scratch without the forgotten data) to the unlearned model, we leveraged a three-sample relative test, applying it to various established unlearning algorithms, including Selective Synaptic Dampening, pruning, and random label techniques. Our test evaluated whether the unlearned model distribution was closer to the safe gold standard model, or closer to the original, fully trained model that actively memorized the sensitive data.

Results

Our framework successfully recovered or outperformed all previous baseline methods with significantly less manual tuning.

The experimental results demonstrated that no single test consistently outperforms the others across every possible scenario. Instead, different f-divergences act as specialized sensors that "light up" for different types of localized data shifts. By using an aggregated approach across diverse statistics, our framework successfully caught subtle errors and anomalies that standard tests completely missed.

For privacy auditing, the hockey-stick divergence test proved to be a powerful and effective tool. Because it directly aligns with the mathematical foundations of pure differential privacy, it allows auditors to tightly control the acceptable degree of data shift. Our adaptive testing framework successfully caught privacy violations using significantly fewer data samples and requiring far less hyperparameter tuning than previous baseline testers.

In one notable instance, our framework detected violations in a specific sparse vector technique mechanism (SVT3) using only a few thousand samples, while previously studied techniques like DP-Auditorium required millions of samples to approximate the same violation detection rate.

Our findings also suggest a redefinition of how to evaluate machine unlearning. We observed that none of the approximate unlearning methods we evaluated were compliant with the strict, standard two-sample unlearning definition. Because two-sample tests simply look for any distributional difference, they incorrectly flagged perfectly safe, retrained models as unlearning failures.

In contrast, our proposed relative three-sample test successfully overcame this flaw. It correctly and consistently identified the safely retrained models as "safe". When evaluating the approximate unlearning algorithms, only the random label technique passed the evaluation.

Other popular methods, such as finetuning, pruning, and Selective Synaptic Dampening, were found to be ineffective at truly forgetting the targeted data. We emphasize that our primary goal in these experiments was the evaluation of the unlearning methodologies, rather than designing the algorithms themselves. Consequently, we used simplified implementations of these unlearning procedures; more rigorous setups will be required to rank unlearning methods in practical production environments.

Conclusion

Our newly proposed framework provides a much more precise, adaptable, and mathematically sound lens for examining ML behavior. By leveraging regularized f-Divergence kernel tests, researchers and auditors can now statistically prove whether a model is behaving unsafely or leaking data across a massive class of problems and complex distributional shifts.

As this field evolves, theoretically grounding our empirical observations to characterize exactly which specific divergence is optimal for other novel tasks remains an exciting direction for future work. Establishing tighter sample complexity bounds will also be a key focus to make these audits even more efficient.

Data mlops python database

Feature Stores from Scratch: A Minimal Working Implementation

Building a minimal feature store with DuckDB and Redis highlights why separating offline historical training data from low-latency online retrieval is essential for LLM agents.

Summary

What: Nate Rosidi demonstrates a 200-line implementation of a feature store using a Python registry, DuckDB/Parquet for offline storage, and Redis for online lookups, designed to prevent training-serving skew and provide real-time context for LLM prompts.
Why it matters: This underscores that for RAG and agentic workflows, the feature store's role has shifted from solely enabling predictive model features to providing structured user context in real-time, functioning as a complement to, rather than a replacement for, vector databases.

Deep Dive

  • Defines a centralized registry to manage feature definitions, types, and sources to prevent inconsistent metrics across teams.
  • Uses DuckDB and Parquet for the offline store to handle historical data and point-in-time joins, preventing data leakage during training.
  • Implements an 'AsOf' join to ensure model training rows are built only using feature values that existed at the time of the event.
  • Uses Redis for the online store to provide sub-millisecond retrieval of user features at inference time.
  • Materialization pipelines must run on custom cadences to move data from offline storage to the online cache.
  • The retrieval API acts as the primary interface for LLMs to inject user context into prompts.
  • Distinguishes between vector databases (for similarity search) and feature stores (for entity-keyed metadata).

Decoder

  • Training-serving skew: A discrepancy where the logic used to calculate features for offline model training differs from the logic used at production inference time, leading to poor model performance.
  • Point-in-time join: A database operation that joins a target table with a feature table based on an entity and a timestamp, ensuring that only information available before the target event is used.
  • Materialization: The process of calculating feature values offline and pushing them to a low-latency storage system like Redis so they can be read quickly during production.
  • Entity: The primary key or subject (e.g., user_id) that features are associated with.

Original Article

Feature Stores from Scratch: A Minimal Working Implementation

Build the five components every feature store needs, then see where AI changes the design.

Introduction

Most teams discover they need a feature store the hard way. A fraud model works in the notebook and quietly breaks in production. A support agent gives a generic answer because it has no idea who the user is. A recommender pipeline duplicates the same "30-day spend" calculation across three jobs, and two of them disagree.

A feature store is the piece of infrastructure that fixes those problems. It defines features once, stores them in two shapes (one for training, one for serving), and keeps both in sync. We are going to build a minimal one from scratch in Python, using DuckDB, Parquet, Redis, and FastAPI. Then we will look at how AI applications change what we actually use it for.

The full code is short enough that we will walk through every component.

What a Feature Store Actually Solves

The classic pitch is training-serving skew: the SQL that built your training set is not the same code path that runs at inference, so the values drift. That problem is real, and the offline plus online split is the standard fix.

The modern pitch is broader. Large language model (LLM) agents and retrieval-augmented generation (RAG) pipelines need structured user context at inference time, on every request, in under 10ms. An LLM has no memory of who the user is. If we want personalized output, we have to inject the user's plan tier, recent activity, and account state into the prompt, and we need a system that can return those values fast and consistently. That is exactly what a feature store's online store and retrieval API give us.

So we build for both. The same five components handle the predictive machine learning use case and the LLM context use case.

The Five Components

  • A feature registry that defines features as code.
  • An offline store on Parquet, queried with DuckDB, for training and backfills.
  • An online store on Redis for low-latency lookups at inference.
  • A materialization pipeline that pushes the latest values from offline to online.
  • A FastAPI service that exposes a typed retrieval API.

Running Example: A Personalized LLM Recommender

We are running a streaming service. When a user opens the app, an LLM generates a short, personalized "what to watch next" message. The LLM needs three things about the user:

Feature            Type      Freshness
--------------------------------------
user_segment       string    daily
watch_count_30d    int       hourly
last_genre         string    per-event

The entity is user_id. We will register these three features, materialize them, and serve them to the LLM at request time.

1. Defining the Feature Registry

A registry is just a place where features are declared once, with their entity, dtype, and source. We use a dataclass.

from dataclasses import dataclass
from typing import Literal

@dataclass(frozen=True)
class Feature:
    name: str
    entity: str
    dtype: Literal["int", "float", "str"]
    source: str  # path to a Parquet file or a SQL view

REGISTRY: dict[str, Feature] = {
    "user_segment": Feature("user_segment", "user_id", "str", "data/user_segment.parquet"),
    "watch_count_30d": Feature("watch_count_30d", "user_id", "int", "data/watch_count_30d.parquet"),
    "last_genre": Feature("last_genre", "user_id", "str", "data/last_genre.parquet"),
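}

# Print the registry so the declared contract is visible at a glance.
print("Registered features:")
for feat in REGISTRY.values():
    print(f"{feat.name}  entity={feat.entity}  dtype={feat.dtype}  source={feat.source}")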

When you run it, the output shows:

Registered features:
user_segment  entity=user_id  dtype=str  source=data/user_segment.parquet
watch_count_30d  entity=user_id  dtype=int  source=data/watch_count_30d.parquet
last_genre  entity=user_id  dtype=str  source=data/last_genre.parquet

This is the contract. Every other component reads from REGISTRY, so renaming a feature, changing its dtype, or pointing it at a new source happens in one place. In production systems, this would be YAML or a Python module checked into a Git repo, with code review on every change.

2. Building the Offline Store with DuckDB and Parquet

The offline store holds the full history of every feature value. We use Parquet files as the storage layer and DuckDB as the query engine. DuckDB reads Parquet directly, which means no separate database to run.

import duckdb
import pandas as pd

def get_historical_features(
    entity_df: pd.DataFrame, features: list[str]
) -> pd.DataFrame:
    con = duckdb.connect()
    con.register("entities", entity_df)
    base = "SELECT * FROM entities"
    for fname in features:
        f = REGISTRY[fname]
        src = f.source.replace("'", "''")
        con.execute(f"CREATE VIEW {fname}_src AS SELECT * FROM '{src}'")
        base = f"""
            SELECT t.*, s.{fname}
            FROM ({base}) t
            ASOF LEFT JOIN {fname}_src s
              ON t.user_id = s.user_id
             AND t.event_timestamp >= s.event_timestamp
        """
    return con.execute(base).df()

The AsOf join is the point-in-time join. For every entity row, it picks the most recent feature value where the feature's timestamp is at or before the event timestamp. That is what prevents leakage — where a training row is built with a feature value that did not exist yet at the moment we are predicting for.
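The same "latest value at or before the event" rule can be shown for a single entity in plain Python, which may help when checking what the join should return. The integer timestamps and genre values below are invented for the example.

```python
from bisect import bisect_right

# Feature history for one user: (event_timestamp, value), sorted by timestamp.
history = [(1, "drama"), (5, "comedy"), (9, "thriller")]


def value_as_of(history, ts):
    """Return the most recent value whose timestamp is <= ts, or None."""
    timestamps = [t for t, _ in history]
    # bisect_right counts how many timestamps are at or before ts.
    idx = bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx else None


print(value_as_of(history, 5))  # comedy
print(value_as_of(history, 7))  # comedy
print(value_as_of(history, 0))  # None
```

A training row with event time 7 sees "comedy", never the "thriller" value written at time 9. An ordinary join on user_id alone would expose that future value, which is the leakage described above.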

3. Setting Up the Online Store on Redis

The online store keeps only the latest value per entity. Redis is the standard choice because hash lookups are sub-millisecond.

import json
import fakeredis  # use redis.Redis() against a real server in production

r = fakeredis.FakeRedis(decode_responses=True)

def write_online(entity: str, entity_id: str, values: dict) -> None:
    r.hset(
        f"{entity}:{entity_id}",
        mapping={k: json.dumps(v) for k, v in values.items()},
    )

def read_online(entity: str, entity_id: str, features: list[str]) -> dict:
    raw = r.hmget(f"{entity}:{entity_id}", features)
    return {f: json.loads(v) if v else None for f, v in zip(features, raw)}

4. Running the Materialization Pipeline

Materialization moves values from offline to online. In a real system this runs on a schedule (Airflow, cron, a streaming job).

def materialize(features: list[str]) -> None:
    by_entity: dict[str, dict] = {}
    for fname in features:
        f = REGISTRY[fname]
        src = f.source.replace("'", "''")
        df = duckdb.sql(f"""
            SELECT {f.entity}, {fname}
            FROM '{src}'
            QUALIFY ROW_NUMBER() OVER (
                PARTITION BY {f.entity}
                ORDER BY event_timestamp DESC
            ) = 1
        """).df()
        for _, row in df.iterrows():
            by_entity.setdefault(row[f.entity], {})[fname] = row[fname]
    for entity_id, values in by_entity.items():
        write_online("user_id", entity_id, values)

5. Exposing the FastAPI Retrieval Service

The retrieval service is the production surface. It is what the LLM application calls.

f = resp.json()["features"]
print("\nPrompt the LLM would receive:")
print(
    f"  System: You recommend shows for a streaming service.\n"
    f"  User context: segment={f['user_segment']}, "
    f"watched {f['watch_count_30d']} titles in last 30 days, "
    f"last genre watched: {f['last_genre']}.\n"
    f"  Task: suggest 3 titles in a friendly, short message."
)
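The snippet above reads resp.json()["features"], but the handler that produces that response is not shown. Below is a minimal sketch of what the handler logic might look like, with the web framework wiring left out. Everything here is an assumption: the request shape, the `get_online_features` name, the sample values, and the in-memory `ONLINE` dict standing in for the Redis-backed read_online().

```python
from dataclasses import dataclass

# Feature names the service is willing to serve (mirrors the registry keys).
KNOWN_FEATURES = {"user_segment", "watch_count_30d", "last_genre"}

# Hypothetical in-memory stand-in for the online store.
ONLINE = {
    "user_id:42": {
        "user_segment": "binge_watcher",
        "watch_count_30d": 17,
        "last_genre": "thriller",
    }
}


@dataclass(frozen=True)
class FeatureRequest:
    entity_id: str
    features: list[str]


def get_online_features(req: FeatureRequest) -> dict:
    """Validate the requested names, then return the latest values per feature."""
    unknown = [name for name in req.features if name not in KNOWN_FEATURES]
    if unknown:
        raise ValueError(f"unknown features: {unknown}")
    stored = ONLINE.get(f"user_id:{req.entity_id}", {})
    # Missing values come back as None rather than raising.
    return {
        "entity_id": req.entity_id,
        "features": {name: stored.get(name) for name in req.features},
    }


resp_body = get_online_features(FeatureRequest("42", ["user_segment", "last_genre"]))
print(resp_body["features"])
```

Rejecting unknown feature names at the boundary keeps the registry as the single contract: a typo in a prompt template fails loudly instead of silently injecting None into the prompt.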

Where the Feature Store Ends and the Vector Database Begins

A vector database is not a feature store, even though both sit in front of a model at inference. They solve different retrieval problems. A real LLM stack uses both. The vector database returns the three most similar past viewing sessions. The feature store returns the user's segment and recent counts. The prompt combines them.

Common Anti-Patterns

  • Computing features inside the model service.
  • Treating the online store as the source of truth.
  • Skipping the registry.
  • Calling a vector database a feature store.
  • Backfilling without point-in-time joins.

Conclusion

A working feature store fits in five components: a registry, an offline store, an online store, a materialization step, and a retrieval API. Building it once teaches us why the production systems look the way they do. It also shows where the design changes for AI: the online retrieval path is the surface the LLM hits, point-in-time joins matter when we train or evaluate, and the vector database sits next to the feature store, not inside it.

Once we have these pieces, swapping our minimal version for Feast, Tecton, or Databricks is mostly a migration of the registry. The shape of the system stays the same.

Design aifrontend

Every Component in Your Design System is a Promise

Design systems are failing because they are written for humans rather than the AI agents that now interpret and implement them as code.

Summary

What: Murphy Trueman argues that design systems must transition from 'documentation' (guidelines) to 'contracts' (machine-readable types and constraints). Systems like Uber’s 'uSpec' use structured data to allow AI to generate accurate component documentation and code without hallucinating intent.
Why it matters: The industry standard of free-form, human-readable documentation is no longer sufficient for an agentic future. Only explicit, strongly typed contracts can prevent AI from inventing invalid component configurations.
Takeaway: Replace free-form string props in your component APIs with restricted union types (e.g., 'primary' | 'secondary') to provide AI agents with a clear contractual boundary for usage.

Deep Dive

  • Design systems act as contracts; most are currently too vague for machine parsing.
  • 'Documentation' describes intent for humans; 'contracts' define strict rules for machines.
  • TypeScript types are the most effective way to communicate these contracts to AI agents.
  • Token naming should be semantic (e.g., 'color-action-primary') rather than presentational (e.g., 'blue-600') to avoid ambiguity.
  • Compositional patterns, like compound components in React, help enforce structural rules that prevent invalid component assembly.
  • 'Anova' is an example of an extraction tool that converts complex Figma data into compact, high-signal specifications for AI agents.

Decoder

  • MCP (Model Context Protocol): An open standard that allows AI models to connect to external data sources and tools, such as Figma or GitHub.
  • DTCG (Design Tokens Community Group): A W3C group working to standardize how design tokens are defined and shared across design and development tools.
  • Compound Component: A pattern where components are designed to work together (e.g., a Modal and ModalFooter), enforcing structural relationships via code.

Original Article

Every component in your design system is a promise

How the promises your system makes became the thing agents read first.

In March, Ian Guisard from Uber's design systems team published a write-up of uSpec, an agentic system he built to automate component documentation. The numbers were striking: spec pages that previously took weeks now take minutes. An AI agent in Cursor connects to Figma via Southleft's open-source Figma Console MCP, crawls the actual component and sub-component structure, reads token mappings and variant axes, then generates finished spec pages directly into the Figma file from a single prompt.

The response from the community made clear this wasn't just an Uber problem. Guisard wrote that after presenting the manual version of this process at a conference, design systems leads from across the industry reached out asking how to replicate it. The documentation bottleneck is universal.

The detail that deserves more attention isn't the time saving, though. It's the condition that makes it possible. The system works because the Uber design system is structured well enough to be read programmatically. The agent reads the structure and does what the structure allows, without any interpretation of intent behind it. When the structure is explicit, the output is accurate. When it isn't, the agent either fails or invents.

Most design systems aren't structured that way, and the reason comes down to a concept that rarely surfaces in design practice until something breaks: contracts.

A contract, not a guideline

In software, a contract is a formal description of what something does, what it accepts, and what happens when those conditions aren't met. Stripe's API is a useful reference point: when you call a payment endpoint with the wrong parameters, you get a meaningful error immediately. The system doesn't accommodate the mistake; it rejects it and tells you why. That precision is a feature. The contract is specific enough to distinguish a valid call from an invalid one, and everything downstream depends on that distinction holding.

Design systems were built for human consumers. The naming conventions, the implicit logic, the documentation that assumes someone will ask a colleague when it doesn't make sense. Agents don't ask colleagues. They parse what's there and move on, and when there's no contract to read, they invent one. An invented contract is just confident guesswork expressed in code.

Your design system makes the same kind of promises Stripe's API does. When a component exposes a variant prop, it's promising the component will behave differently based on that value. When a token is named color-feedback-critical, it's promising that colour is semantically tied to critical feedback. When a usage guideline says a component shouldn't appear inside a modal without the full-bleed variant, that's a promise about context. Those are all contracts. The question isn't whether you have them. It's whether they're written down in a form that survives interpretation by something that can't ask a clarifying question.

When description isn't enough

Documentation and contracts aren't the same thing, and this is a distinction that design system practice has never needed to make explicit until recently.

Documentation describes. A contract specifies. Your Storybook page that explains when to use the warning variant instead of the critical variant is documentation. A prop definition that enforces appearance: "warning" | "critical" | "info" | "success" with no other valid inputs is a contract. The first communicates intent to a person who reads it. The second encodes intent in a structure that rejects anything outside its defined range.

Both matter, and this is where teams often get stuck. Contracts without documentation leave teams unable to make good decisions about when and how to use the system. Documentation without contracts leaves the system dependent on human interpretation for every implementation. For a long time, the second trade-off was acceptable, because human interpretation was the only thing consuming the system. That changed.

Here's what the difference looks like in practice, with a single component.

A button with documentation only:

The Storybook page says use the destructive variant when the action deletes data or can't be undone. The prop accepts a free-form string. A developer reads the documentation, uses variant="destructive" correctly. An agent working from the same codebase sees a variant prop that accepts a string and has no validation. It doesn't read Storybook. It infers from examples in the codebase. If those examples are inconsistent, the agent's output will be too.

The same button with a contract:

type ButtonVariant = 'primary' | 'secondary' | 'ghost' | 'destructive';

interface ButtonProps {
  /**
   * Use 'destructive' only for irreversible actions: deleting data,
   * removing access, cancelling orders. Not for warnings or errors.
   */
  variant: ButtonVariant;
  children: React.ReactNode;
  onClick: () => void;
  disabled?: boolean;
}

The TypeScript type is the contract. It enforces exactly four valid variants. The JSDoc comment sits directly above variant, carrying the usage guideline into the context an agent actually reads — the type definition, not the Storybook page. If someone passes variant="danger", the system rejects it at compile time. The agent sees the type, reads the comment, and knows what's valid before it generates anything.

That one addition, typing the prop and annotating the definition, changes the quality of every piece of agent-generated code that touches this component. It does the same for every team member who joins after the person who made the original decision has left.
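
The compile-time check only covers code that passes through the type checker. Where a variant arrives as untyped data (a CMS field, or JSON produced by an agent), the same list of values can back a runtime guard. A minimal sketch, assuming hypothetical `isButtonVariant` and `parseButtonVariant` helpers that are not part of the original component:

```typescript
// Single source of truth: the union type is derived from this array,
// so the compile-time contract and the runtime check cannot drift apart.
const BUTTON_VARIANTS = ['primary', 'secondary', 'ghost', 'destructive'] as const;

type ButtonVariant = (typeof BUTTON_VARIANTS)[number];

// Type guard: narrows an unknown value to ButtonVariant.
function isButtonVariant(value: unknown): value is ButtonVariant {
  return (
    typeof value === 'string' &&
    (BUTTON_VARIANTS as readonly string[]).includes(value)
  );
}

// Hypothetical helper: rejects anything outside the contract and
// names the valid options in the error message.
function parseButtonVariant(value: unknown): ButtonVariant {
  if (!isButtonVariant(value)) {
    throw new Error(
      `Invalid variant ${JSON.stringify(value)}; expected one of: ${BUTTON_VARIANTS.join(', ')}`,
    );
  }
  return value;
}
```

Deriving the type from the array keeps one list to maintain; adding a fifth variant updates both the type and the guard.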

Names as decisions

Tokens carry the same logic. Token naming is where most systems accumulate the most damage without noticing, because the problems are invisible until an agent or a new team member has to make a decision and has nothing concrete to go on.

A token named blue-600 is not a contract. It describes a value. Whether that value is the brand's primary action colour, a decorative element, a hover state, or a data visualisation colour is entirely context-dependent. The name communicates nothing about when or why to use it. A developer can make a reasonable guess. An agent will make a plausible one, which is not the same thing, and a plausible guess that compiles is harder to catch than an error.

A token named color-action-primary is closer to a contract. It asserts purpose: this colour is for primary actions. A token named color-feedback-critical goes further, carrying not just category but semantic meaning, committing to a decision that any consumer of the system, human or machine, can act on without additional context. The DTCG specification, which reached its first stable version in October 2025, formalises exactly this three-level structure: primitive tokens carry the raw value, semantic tokens carry meaning, component tokens carry specific application. Each level is a progressively more specific commitment about where and how the value should be used.

When a system is structured this way, an agent reading your token output has enough to reason correctly. It sees --button-background-primary referenced in a component file and understands it's a component-level token for a specific element, not a general-purpose colour to be reused freely. It sees --color-feedback-critical and understands it belongs in error and warning contexts. The naming has already committed to an answer, so the agent doesn't have to guess at one.

When a system uses presentational names throughout, the agent pattern-matches against names that carry no semantic content. It reaches for blue-600 where you intended color-action-primary because both look plausible from the available structure. The output looks right. The contract is wrong.
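
One way to make the three levels checkable is to type each tier so it can only reference the tier below it. The sketch below is illustrative only: the hex values and the `resolveComponentToken` helper are assumptions, and this is not the DTCG file format itself.

```typescript
// Tier 1: primitives carry raw values and say nothing about usage.
// The hex values are placeholders chosen for this example.
const primitive = {
  'blue-600': '#2563eb',
  'red-600': '#dc2626',
} as const;
type PrimitiveName = keyof typeof primitive;

// Tier 2: semantic tokens commit to a purpose and point at a primitive.
type SemanticName = 'color-action-primary' | 'color-feedback-critical';
const semantic: Record<SemanticName, PrimitiveName> = {
  'color-action-primary': 'blue-600',
  'color-feedback-critical': 'red-600',
};

// Tier 3: component tokens commit to one element and point at a semantic token.
type ComponentName = 'button-background-primary';
const component: Record<ComponentName, SemanticName> = {
  'button-background-primary': 'color-action-primary',
};

// Follow the chain component -> semantic -> primitive to get the raw value.
function resolveComponentToken(name: ComponentName): string {
  return primitive[semantic[component[name]]];
}
```

Because each map is typed against the tier below, pointing a component token straight at `blue-600` fails to compile, which is the skipped-tier shortcut the naming is meant to prevent.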

The composition problem

Component relationships are the third layer, and the most expensive one to leave undocumented, because unspecified relationships create cascading failures that look like implementation errors rather than contract failures.

For a long time, the instinct was to keep adding configuration, another prop, another variant, another toggle, until the component covered enough ground. That approach has a ceiling, and most mature systems have hit it. The alternative, offering smaller composable parts that implementers assemble themselves, solves the configurability problem but creates a new one. When you hand someone a set of parts and say compose what you need, the contract on each part becomes the most reliable thing standing between a good assembly and a broken one. A monolithic component at least constrains the surface by its own shape. Parts don't. The more a system moves toward composition, the more each piece needs to say clearly what it is, what it expects, and what it won't tolerate.

Consider a Modal. Your documentation says it should always contain a ModalFooter with at least one action button. Most experienced teams follow that. An agent building a modal from scratch, working from a codebase where the rule exists only in a documentation page it doesn't read, will omit the footer whenever the prompt doesn't explicitly request one. The resulting component passes tests. It fails an accessibility review three weeks later, when someone asks why there's no focusable action to close it.

The fix isn't better documentation. It's encoding the relationship in the component's structure so it can't be violated accidentally. In React, a compound component pattern enables exactly this: Dialog.Footer can be written to only exist inside Dialog, and a Dialog without a Dialog.Footer containing at least one child can be written to throw at runtime through explicit validation in the component. The constraint isn't in a document somewhere. It's in the API itself, and it can be made precise enough that missing it produces an immediate, visible error rather than a silent failure that surfaces in review.
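
The runtime check can be sketched without a framework. This assumes a hypothetical node shape (`DialogNode`, `FooterNode`, and so on) rather than real React elements; in React, the equivalent validation would run inside the Dialog component against its children.

```typescript
// Hypothetical, framework-free stand-ins for rendered elements.
interface ActionNode { kind: 'Button'; label: string }
interface BodyNode { kind: 'Dialog.Body'; text: string }
// FooterNode is only reachable through DialogNode.children, which models
// "Dialog.Footer can only exist inside Dialog" at the type level.
interface FooterNode { kind: 'Dialog.Footer'; children: ActionNode[] }
interface DialogNode { kind: 'Dialog'; children: Array<BodyNode | FooterNode> }

// Throws as soon as the structural contract is broken, instead of letting
// a dialog without a usable footer reach review.
function assertValidDialog(dialog: DialogNode): void {
  const footer = dialog.children.find(
    (child): child is FooterNode => child.kind === 'Dialog.Footer',
  );
  if (!footer) {
    throw new Error('Dialog requires a Dialog.Footer');
  }
  if (footer.children.length === 0) {
    throw new Error('Dialog.Footer requires at least one action');
  }
}
```

The types cover what can be expressed statically (which children are allowed), and the function covers what cannot (that a footer is present and non-empty).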

In Figma, component relationships are harder to enforce with the same precision, but they can be partially expressed through component properties, explicit nesting in auto-layout, and the layer structure of component sets. The most reliable approach is making sure that what's described in Figma and what's enforced in code describe the same contract, so the system communicates consistent rules in both environments.

What the structure actually looks like

All three layers together, expressed as a single file, are what a Figma plugin called Anova produces. It crawls a component and outputs a structured data file, deterministic and compact, describing everything the component is, has, and does. The output is designed to be fed directly into an agent's context window as a specification, replacing the raw Figma data that most MCP connections generate.

The reason that substitution matters comes down to signal-to-noise. The raw Figma data for a moderately complex component can run to over a megabyte, 42,000-plus lines of JSON covering every property of every node across every variant, most of it describing bounding boxes, transform matrices, and paint object wrappers that say nothing about design intent. An agent receiving that as context has to process all of it to extract the handful of decisions that actually matter, and nothing is cached between sessions, so the next prompt starts from scratch.

The Anova output for the same component is a few hundred lines describing only what varies and why.

Take an Alert with four appearances, three sizes, and a dismissible toggle. In your Figma file, that's twenty-four variants (4 × 3 × 2). The anatomy section of the Anova output describes what the component contains:

anatomy:
  root:
    type: container
  icon:
    type: instance
    instanceOf: Icon
  content:
    type: container
  label:
    type: text
  dismissButton:
    type: instance
    instanceOf: Icon Button

The props section describes what it accepts, with enums, defaults, and types made explicit:

props:
  appearance:
    type: string
    default: info
    enum:
      - critical
      - warning
      - success
      - info
  size:
    type: string
    default: medium
    enum:
      - small
      - medium
      - large
  dismissible:
    type: boolean
    default: false
  label:
    type: string
    default: "{Label}"

The variants section records only what changes from the default. When appearance is critical, only the fills and text colour differ from the info state. When size is small, only spacing tokens change. Anova evaluates every combination in depth and collapses those combinations into layered diffs rather than duplicating entire variant trees:

variants:
  - configuration:
      appearance: critical
    elements:
      root:
        styles:
          fills: Color/Alert/Critical/Background
      label:
        styles:
          fills: Color/Text/Primary Inverse
  - configuration:
      size: small
    elements:
      root:
        styles:
          itemSpacing: Space/Component/XS
          paddingLeft: Space/Padding/XS
          paddingRight: Space/Padding/XS
      label:
        styles:
          textStyle: Body/Small

An agent resolving appearance: critical, size: small layers those diffs in sequence: start from the default, apply the critical overrides, apply the small overrides. The correct resolved state falls out without inspecting all twenty-four variants independently.
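
The layering step can be sketched as a small merge function. This is not Anova's own code: the types flatten the `styles` level of the output for brevity, and the default token names in `base` are assumptions, since the default state is not shown above.

```typescript
type Styles = Record<string, string>;
type ElementStyles = Record<string, Styles>; // element name -> style tokens
type Props = Record<string, string | boolean>;

interface VariantDiff {
  configuration: Props;    // prop values this diff applies to
  elements: ElementStyles; // only what differs from the default
}

// A diff applies when every prop in its configuration matches the request.
function matches(diff: VariantDiff, props: Props): boolean {
  return Object.entries(diff.configuration).every(
    ([key, value]) => props[key] === value,
  );
}

// Start from the default styles, then layer each matching diff in file order.
function resolveStyles(
  base: ElementStyles,
  diffs: VariantDiff[],
  props: Props,
): ElementStyles {
  const resolved: ElementStyles = {};
  for (const [element, styles] of Object.entries(base)) {
    resolved[element] = { ...styles };
  }
  for (const diff of diffs.filter((d) => matches(d, props))) {
    for (const [element, styles] of Object.entries(diff.elements)) {
      resolved[element] = { ...resolved[element], ...styles };
    }
  }
  return resolved;
}

// Assumed default state (token names are placeholders for illustration).
const base: ElementStyles = {
  root: { fills: 'Color/Alert/Info/Background', itemSpacing: 'Space/Component/M' },
  label: { fills: 'Color/Text/Primary', textStyle: 'Body/Medium' },
};

// The two diffs from the variants section above.
const diffs: VariantDiff[] = [
  {
    configuration: { appearance: 'critical' },
    elements: {
      root: { fills: 'Color/Alert/Critical/Background' },
      label: { fills: 'Color/Text/Primary Inverse' },
    },
  },
  {
    configuration: { size: 'small' },
    elements: {
      root: {
        itemSpacing: 'Space/Component/XS',
        paddingLeft: 'Space/Padding/XS',
        paddingRight: 'Space/Padding/XS',
      },
      label: { textStyle: 'Body/Small' },
    },
  },
];

const resolved = resolveStyles(base, diffs, { appearance: 'critical', size: 'small' });
```

Here `resolved` carries the critical fills and the small spacing together, while any property neither diff touches keeps its default.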

The invalid combinations section records which prop pairings can't coexist:

invalidConfigurations:
  - dismissible: true
    size: small

That line is a contract. Not a documentation note, not a comment in a Storybook story, but a machine-readable statement derived directly from how the component is built in Figma, with no LLM inference involved. The same input produces the same output every time.
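
Because the block is plain data, checking a requested configuration against it takes only a few lines. A minimal sketch, assuming a hypothetical `findViolation` helper on the consuming side:

```typescript
type PropValue = string | boolean;
type Configuration = Record<string, PropValue>;

// Mirrors the invalidConfigurations block: each entry is a partial
// configuration that must never be fully matched.
const invalidConfigurations: Configuration[] = [
  { dismissible: true, size: 'small' },
];

// Returns the first forbidden combination the requested props fall into,
// or undefined when the configuration is allowed.
function findViolation(
  props: Configuration,
  rules: Configuration[],
): Configuration | undefined {
  return rules.find((rule) =>
    Object.entries(rule).every(([key, value]) => props[key] === value),
  );
}
```

An agent, or a lint step in front of it, could call `findViolation` before emitting code and refuse any configuration that returns a match.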

What Anova encodes in the invalidConfigurations block is the same principle expressed at the Figma layer. A typed container that only accepts specific children in code and a component that marks certain prop combinations as invalid in its spec are doing the same work in different environments, making the rules of composition legible without requiring someone to know them from memory.

Read it all back and what you have is the complete structural anatomy of the component, its full prop surface with valid enums and defaults, the precise styling changes per variant expressed as semantic token references, and an explicit record of which configurations aren't possible. An agent receiving that context before generating code doesn't need to infer anything. The design decisions are the data.

This is what the Uber pipeline depends on. This is what the Storybook MCP reads when it composes new components from existing pieces. The difference in quality of agent output between raw Figma data and a structured component spec is not incremental. It's the difference between a tool that guesses and a tool that knows.

Where to start

Most systems have contracts in some areas and documentation-only in others. Finding the seam between them is more useful than a comprehensive rebuild, and it's where the practical work actually starts.

Look at the props that get implemented inconsistently across the codebase. Those are almost always free-form string props with no type enforcement, where the codebase is being pattern-matched from examples rather than read as a contract. Type them. Add a JSDoc comment that carries the usage guideline into the definition. That single change makes every agent session that touches those components more reliable, and does the same for every team member who joins after the original decision was made.

Look at the tokens that regularly appear alongside raw hex values for the same colour. That's semantic ambiguity in practice: the token exists, but its purpose isn't clear enough that anyone, agent or human, reaches for it confidently. Rename toward intent rather than value, and the pattern resolves. blue-600 becomes color-action-primary. The hex value stops appearing next to it because the token's purpose is now unambiguous enough to use without checking.

Look at the component relationships that generate the most questions in your team Slack, the ones where someone always asks whether it's valid to use this component outside that container, or whether a particular child is optional or required. Those are undocumented compositional contracts. Some can be encoded in Figma through component properties and nesting structure. Some need to be enforced in code through the component API. The ones that exist only in people's heads are the most expensive, because they get violated the moment someone who wasn't in the original conversation touches the component.

None of this is about making your system perfect before agents consume it. It's about making your existing decisions legible. An agent can read your types, your token names, your component structure, and your JSDoc comments. The closer those four things are to encoding what you actually decided, the less the agent has to invent.

The reason Guisard's uSpec system works at Uber's scale isn't complicated. Someone, over time, made decisions about how components should be described and wrote them down in a form that could be read without a person in the loop.

Your system is already making promises. The question is whether those promises are written down clearly enough that something other than you can read them and get the right answer.

The difference shows up in the output. It always has, and it's just easier to see now.

Design opensourcelinuxinfrastructure

The Virtual OS Museum (Website)

A single enthusiast has curated and pre-configured over 1,700 historical operating systems into a downloadable Linux virtual machine.

Summary

What: The Virtual OS Museum, created by a sole developer over 20 years, provides a plug-and-play environment for emulating systems ranging from 1948 mainframes like the Manchester Baby to NeXTSTEP and early versions of Linux.
Why it matters: As original hardware disappears and emulator environments become increasingly fragmented and difficult to configure, bundling entire ecosystems into stable, portable virtual machines is becoming the primary method for historical software preservation.

Decoder

  • Emulator: Software that mimics the hardware of a different computer system, allowing it to run software designed for that original hardware.
  • Hypervisor: Software or firmware that creates and runs virtual machines by partitioning a host computer's resources.
  • Manchester Baby: The world's first electronic, stored-program computer, built in 1948 at the University of Manchester.

Original Article

This is a virtual museum of operating systems (and standalone applications) running under emulation, implemented as a Linux VM for QEMU, VirtualBox, or UTM.

A custom emulator-independent launcher is provided, and all OSes and emulators are pre-installed and pre-configured. The launcher includes a snapshot feature to quickly revert broken installations back to a working state. Hypervisor installers and shortcuts to run the VM on Windows, macOS, and Linux are also included.

Want to see the earliest resident monitors? The ancestor of all modern OSes (CTSS)? The earliest versions of Unix? The first OS with a desktop metaphor GUI (Xerox Star Pilot/ViewPoint)? Early versions of mainstream OSes? If you want to explore historical OSes and platforms without having to worry about configuring/installing emulators and OSes or corrupting emulated installations, you’ve come to the right place.

Just about every well-known OS and platform (and also a lot of obscure ones) is included in some form, spanning the entire history of stored-program computing from the Manchester Baby of 1948 (the first stored-program computer) to the present day.

The catalogue covers, among many other things:

  • The earliest mainframes: Manchester Baby test/demo programs, Mark 1 Scheme A/B/C/T (the earliest examples of system software that could be considered as an OS), various EDSAC software, etc.
  • Later mainframes and minicomputers: CTSS, MVS, VM/370, TOPS-10/20, ITS, Multics, RSX, RSTS, and more
  • Workstations and Unix variants: PERQ OSes, SunOS, IRIX, OSF/1, A/UX, NeXTSTEP, Plan 9, various BSDs, plus Linux distributions across the decades, and more
  • Home computers: various CP/M variants, Apple II, Commodore 8-bit machines, Atari 8-bit, MSX, Tandy TRS-80, BBC Micro, ZX Spectrum, Sharp MZ, and more
  • Personal computer operating systems: various DOS variants, OS/2, BeOS, Windows from 1.0 to early Longhorn betas, classic Mac OS through Mac OS X 10.5 PPC, and more
  • Mobile and embedded: PalmOS, EPOC/Symbian, Windows CE, Newton OS, early Android and iOS where emulation permits, QNX, etc.
  • Research and obscure systems: ZetaLisp, Smalltalk environments, Oberon, Plan 9, and many more that few people today have ever booted

If a working version of an operating system exists somewhere, the goal is to have it here, in a form anyone can run on a reasonably modern laptop/desktop.

Downloads

Both a full and a lite version are available. The full version ships with everything pre-downloaded and runs offline. The lite version downloads disk/tape/etc. images for guest VMs the first time they are run. Automatic and manual updates are supported on both editions so new installations land without re-downloading the whole VM.

Why this exists

While the state of software preservation has improved significantly over the past two decades, many of the existing software preservation projects are still not particularly accessible.

When I started collecting emulator images (2003), there were only a few small archives of software images and the corresponding documentation, and relatively few emulators for anything other than well-known consumer-oriented platforms. Nowadays there are many large archives of historical software and documentation, and a lot of emulators for even a lot of very obscure platforms.

However, while such efforts are valuable when it comes to keeping historical software available and runnable, it often still takes time and effort to get runnable VM installations from them. OSes may have complicated install procedures. Some may depend on particular device configurations within an emulator. Some will only run in certain emulator versions, breaking in later ones due to regressions. Some emulators might have complex configuration files, or may require a specific environment on the host system.

This project is an attempt to make as much as possible of the history preserved in various places reachable. Not theoretically reachable. Not “bootable in principle if you assemble the right toolchain on a Tuesday.” Reachable. You click an entry, it runs, and where possible it runs with software of the era already loaded the way someone might actually have used the machine at the time.

The work behind it

This is the result of over 20 years of collecting. OS installations have been sourced from various places. Some have been downloaded as pre-installed images, whereas others were installed from images of original install media. Some were installed in less than an hour, whereas others took almost a week.

A decent number only run in particular emulator versions due to regressions in later versions, and some emulators needed minor patches to run on modern Linux or to play nice with the launcher. A few emulators have been patched to run OSes that were previously broken.

Many installations also include various add-on software - applications, development tools, games, utilities, etc. - set up the way it actually might have been used.

This is still far from finished; I have many more images sitting around that I have yet to install and emulators I want to fix.

Design frontendwebreactthreejs

Sketching the Impossible: a 3D Portfolio Built Without a Single 3D Model

Tomasz Szmajda built an award-winning 3D portfolio by wrapping flat geometry in hand-drawn textures rather than using traditional 3D models.

Summary

What: Tomasz Szmajda used React Three Fiber, Three.js, and GSAP to create a navigable 3D environment without Blender. The site uses custom GLSL shaders for brush-stroke effects and a chunking system to manage performance, opting for WebP textures over KTX2 for compatibility.
Why it matters: This shows how developers can bypass complex 3D modeling workflows by treating web graphics as a series of flat, textured planes managed via code, effectively using the GPU for visual polish rather than geometry.

Deep Dive

  • Use React Three Fiber and Three.js to render flat geometry as 3D space to avoid complex asset pipelines.
  • Implement a 'chunking' system to mount and unmount scene segments dynamically based on camera position.
  • Use onBeforeCompile to inject custom logic into standard Three.js materials for effects like paint-reveal.
  • Force GPU shader compilation during a preloader phase using gl.compileAsync to prevent interaction lag.
  • Avoid unnecessary lighting calculations; use baked-in shadows in textures to significantly boost frame rates.
  • Use GSAP Observer for unified scroll handling across touch, mouse, and keyboard input.
  • Implement automatic quality tiering based on hardware, RAM, and browser capability.

Decoder

  • GSAP: The GreenSock Animation Platform, a standard JavaScript library for complex web animations.
  • GLSL: OpenGL Shading Language, a C-like language used to write custom rendering logic directly on the GPU.
  • KTX2: A container format for textures designed to optimize memory usage and GPU decompression in WebGL applications.
  • Three.js: A cross-browser JavaScript library used to create and display 3D computer graphics on a web page.
  • React Three Fiber: A React renderer for Three.js that allows developers to manage 3D objects as standard React components.

Original Article

I can’t model 3D. That pretty much explains this entire project.

For months, I had been browsing Awwwards and The FWA and found sites like Igloo Inc, 3D combined with infinite scrolling, and I just thought: I need something like this. Then I saw Bruno Simon’s portfolio with the little car. I knew I wanted a 3D portfolio. I also knew I had zero Blender skills and honestly no interest in faking it with someone else’s models.

So I figured, why not just code simple rectangles, planes, cubes, flat geometry, and wrap them in hand-drawn textures? I couldn’t sculpt a world in Blender, so I sketched one on flat rectangles instead. That workaround accidentally became the whole visual identity of the project.

You can find the source code on GitHub.

Why This Exists

Go browse any Facebook group where developers share portfolios. I did. About 90% of them are the same thing: dark background, neon accent colors, text on the left, image on the right. Many of them look like they were generated by AI… they probably were. AI has this very specific aesthetic tendency: black page, neon glow, done.

I don’t have a problem with those sites technically. Their UX is probably better than mine, honestly – it’s hard to get lost on a standard layout. But I wanted something different. I wanted visitors to actually walk through a space, not just scroll down a page about me. If someone sees my portfolio and wants to work with me, they’ll figure out how to reach the Contact room. I’m not worried about that.

So I set out to build a portfolio you can walk through, not scroll through.

Four Months, From Sketch to Sky

The project started in December 2025. Initially, I thought it would be a 2D illustrated website – hand-drawn textures on flat HTML sections. But somewhere in the first few weeks, I realized flat HTML wasn’t going to cut it. This thing needed actual 3D depth. So I moved the whole thing into Three.js and React Three Fiber, and suddenly I was building rooms.

Four months later, it was live. Four months of fighting with camera systems and scroll mechanics I had no idea how to build. Also generating a ton of textures with AI, because there was no way I was drawing all of them by hand.

The Tech Stack

  • React 19 + React Three Fiber 9 + Three.js 0.182 for the 3D environment
  • GSAP 3.14 for all animations: camera flights, door mechanics, reveal transitions
  • Vite 7 for builds and dev server
  • Custom GLSL shaders extending MeshBasicMaterial for the paint-reveal effect
  • WebP textures generated via AI (Google’s image generation), compressed and trimmed to fit flat 3D geometry
  • PostHog for analytics, Lenis heritage in scroll philosophy

The Aesthetic: Why Everything Looks Hand-Drawn

This was never Plan B. From day one, I wanted it to feel like a sketch – like you opened someone’s notebook and the drawings jumped into 3D. The paper texture backgrounds, the ink-line doors, the doodles floating in the corridor. All intentional.

What evolved later was the color. I had all these sketch-style textures, and one day I thought: what if they painted themselves when you hover? What if hovering over something literally paints it with color?

That became the main interaction of the whole portfolio. Every clickable element starts as a black-and-white sketch and fills with color on hover – a brush-stroke reveal driven by a custom shader. It’s basically a visual hint – if something fills with color when you hover, it means you can click it and something will happen.

The PaintRevealMaterial Shader

The effect works by extending Three.js’s MeshBasicMaterial through onBeforeCompile, injecting custom fragment shader logic that blends between a sketch texture and a painted texture using procedural noise:

// Brush-stroke blend: progressively swap sketch -> painted
if (uProgress > 0.001) {
    vec4 paintedColor = texture2D(uMapPainted, vMapUv);
    float rn = paintNoise(vMapUv * 15.0) * 0.15;
    // Mask depends only on vMapUv.y (plus noise): pixels with v near 1 flip first,
    // and the reveal sweeps toward v = 0 as uProgress grows
    float maskValue = (1.0 - vMapUv.y) + rn;
    float threshold = uProgress * 1.5;
    if (maskValue < threshold) {
        diffuseColor = vec4(paintedColor.rgb, 1.0);
    }
}

The noise function gives it messy, organic edges – so instead of a clean wipe, you get something that actually looks like paint bleeding on paper. uProgress is animated from 0 to 1 by GSAP on hover.
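To see why the threshold is scaled by 1.5, it can help to mirror the mask test on the CPU. The sketch below is a plain JavaScript restatement of the GLSL condition above, not code from the project: isPainted and its arguments are hypothetical names, and the noise term is passed in directly on the assumption that paintNoise(...) * 0.15 stays between 0 and 0.15.

```javascript
// CPU-side restatement of the shader's mask test (illustrative only).
// uvY:      vertical texture coordinate, 0..1
// noise:    stand-in for paintNoise(...) * 0.15, assumed to lie in 0..0.15
// progress: uProgress, 0..1
function isPainted(uvY, noise, progress) {
  if (progress <= 0.001) return false;   // the shader skips the blend entirely
  const maskValue = (1.0 - uvY) + noise;
  const threshold = progress * 1.5;
  return maskValue < threshold;
}

// Largest mask value under the assumption above: uvY = 0 with maximum noise.
const worstCase = (1.0 - 0.0) + 0.15;    // about 1.15

console.log(isPainted(1.0, 0.0, 0.1));   // true: pixels with uvY near 1 flip first
console.log(isPainted(0.0, 0.15, 0.5));  // false: the mask value is above 0.75
console.log(isPainted(0.0, 0.15, 1.0));  // true: the reveal completes at full progress
console.log(worstCase < 1.5);            // true
```

Under that assumption the mask can reach roughly 1.15, so a threshold that stopped at 1.0 would leave the noisiest pixels unpainted at full progress. Scaling by 1.5 leaves headroom.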

I went with extending MeshBasicMaterial rather than writing a shader from scratch because I needed the standard Three.js texture pipeline (UV mapping, color spaces, transparency) to keep working. The custom logic only decides which pixels show the painted version – everything else stays stock.

Keeping all textures visually consistent was honestly one of the hardest parts. Every texture was AI-generated, and getting AI to generate hundreds of assets in the same hand-drawn style is… painful. Sometimes I just generated 20 versions and picked the one that didn’t look completely different from the rest.

The Infinite Corridor

The idea is simple: you enter a building through sketched double doors, and behind them is a corridor that stretches infinitely in both directions. On the walls, at alternating sides, are four doors – each leading to a room with its own world inside.

The Chunking System

The corridor is built from repeating segments, each 80 units long, managed by InfiniteCorridorManager. Only three segments are ever mounted: the one the camera is in, plus one ahead and one behind. As you scroll, segments spawn and despawn dynamically.
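The article does not show InfiniteCorridorManager itself, so the following is only a sketch of how the three mounted segments could be derived from the camera position. It assumes the camera travels toward negative z and that segment i spans z from -i * 80 down to -(i + 1) * 80; the function names are hypothetical.

```javascript
// Illustrative only: which corridor segments should be mounted for a camera z?
// Assumptions (not from the project): travel is toward negative z, and
// segment i spans z from -i * 80 down to -(i + 1) * 80.
const SEGMENT_LENGTH = 80; // segment length stated in the article

function currentSegmentIndex(cameraZ) {
  // "+ 0" turns a possible -0 into 0 so indices compare cleanly.
  return Math.floor(-cameraZ / SEGMENT_LENGTH) + 0;
}

function mountedSegments(cameraZ) {
  const i = currentSegmentIndex(cameraZ);
  return [i - 1, i, i + 1]; // one behind, the current one, one ahead
}

console.log(mountedSegments(-10));  // [ -1, 0, 1 ]
console.log(mountedSegments(-170)); // [ 1, 2, 3 ]
console.log(mountedSegments(95));   // [ -3, -2, -1 ]
```

In a React tree these indices could double as keys, so segments mount and unmount as the list changes while the one the camera is in stays untouched.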

On top of that, each segment is wrapped in a SegmentVisibilityWrapper that uses useFrame to check whether the segment is actually in view. Once the camera is 5 units past a segment's end, or still more than 30 units short of its start, the segment hides entirely – zero draw calls for geometry the camera isn’t even looking at:

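useFrame(() => {
    // The ref is empty until the group has mounted
    if (!groupRef.current) return;

    // Camera has moved more than 5 units beyond the segment's end
    const isBehindCamera = camera.position.z < endZ - 5;
    // Camera is still more than 30 units before the segment's start
    const isFarAhead = camera.position.z > startZ + 30;
    const isVisible = !(isBehindCamera || isFarAhead);

    // Only write to the scene graph when the value actually changes
    if (groupRef.current.visible !== isVisible) {
        groupRef.current.visible = isVisible;
    }
});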

The Camera System: 500 Lines of Pain

useInfiniteCamera.js is 500 lines long, and honestly almost every one of those lines exists because of a bug I had to fix.

The camera does several things at the same time:

  • Scroll movement via GSAP Observer (unifies mouse wheel, touch, and trackpad)
  • Mouse parallax on desktop – the camera sways gently as you move the mouse
  • Gyroscope parallax on mobile – tilt your phone, the corridor shifts
  • Auto-glance – as you approach a door, the camera subtly turns toward it
  • Keyboard navigation – arrow keys, spacebar, Page Up/Down for accessibility
  • Camera override mode – when GSAP takes over for door entry/exit animations

The auto-glance system is worth talking about. For each door, the hook calculates proximity using a start/peak/end distance model with eased strength.

The result is a small head-turn that makes it feel like the camera notices the doors on its own. It’s subtle, but it makes the corridor feel way less static.
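The hook's code is not shown, so here is one way a start/peak/end model with eased strength might look. Every name and number below is hypothetical: the glance begins at a start distance, is strongest at a peak distance, and fades out by an end distance just past the door, with a smoothstep curve as the easing.

```javascript
// Illustrative sketch of a start/peak/end proximity model with eased strength.
// All distances and names are hypothetical, not taken from useInfiniteCamera.js.
// offset: signed distance along the corridor from camera to door
//         (positive = door still ahead, negative = door already passed).
function smoothstep(t) {
  const c = Math.min(1, Math.max(0, t));
  return c * c * (3 - 2 * c);
}

function glanceStrength(offset, { start = 20, peak = 6, end = -4 } = {}) {
  if (offset >= start || offset <= end) return 0;          // out of range
  if (offset >= peak) {
    return smoothstep((start - offset) / (start - peak));  // ramp up on approach
  }
  return smoothstep((offset - end) / (peak - end));        // ramp down while passing
}

// The strength could then scale a maximum yaw toward the door's wall.
// side: +1 for one wall, -1 for the other (assumed convention)
function glanceYaw(offset, side, maxYaw = 0.25) {
  return side * maxYaw * glanceStrength(offset);
}

console.log(glanceStrength(25));  // 0
console.log(glanceStrength(13));  // 0.5
console.log(glanceStrength(6));   // 1
console.log(glanceStrength(-4));  // 0
console.log(glanceYaw(6, -1));    // -0.25
```

Summing glanceYaw over nearby doors and adding the result to the parallax rotation would be one way to combine it with the other camera behaviors listed above.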

The Rooms: Abstraction Over Predictability

The original plan was normal rooms. A room with a desk. A room with shelves. You know – rooms. But that felt boring and predictable. If someone walks down a corridor in a building and opens a door, they expect a room. They don’t expect to suddenly be flying a paper airplane through clouds.

So every room became its own little world:

  • The Gallery: Project cards hanging on an infinite clothesline, like laundry drying in the wind. Scroll sideways to browse. Infinite loop.
  • The Studio: Monitors floating in space, scrolling vertically through my content – videos, blog posts, social media. Infinite in both directions.
  • The About: You fly a paper airplane through an infinite sky filled with clouds and story milestones. Your biography as a flight path.
  • The Contact: A beach by the sea. Social media links are floating barrels you click to connect.

Almost every room has infinity built into it. In the Gallery, the cards loop. In the Studio, the monitors loop. In About, the sky never ends, just like the corridor itself. Only Contact breaks the pattern – and I think that’s actually better. Contact is the destination. It should feel like you’ve arrived somewhere.
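The article does not say how the loops are implemented. A common approach, sketched here with hypothetical names, is to wrap an ever-growing scroll index back into the range of available items with a sign-safe modulo.

```javascript
// Illustrative modulo wrap for a looping list of cards (names are hypothetical).
function wrap(value, length) {
  // JavaScript's % keeps the sign of the dividend, so normalise negatives.
  return ((value % length) + length) % length;
}

// Which card sits at a given slot after scrolling by `scrollIndex` cards.
function cardAtSlot(cards, slot, scrollIndex) {
  return cards[wrap(slot + scrollIndex, cards.length)];
}

const cards = ["A", "B", "C", "D"];
console.log(cardAtSlot(cards, 0, 5));   // "B"
console.log(cardAtSlot(cards, 0, -1));  // "D"
```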

The Performance Crisis (or: Two Suns, One Moon, and Zero Visible Shadows)

When I first shared a preview link in Facebook developer groups, the feedback was clear: your site is beautiful, but it runs like a slideshow.

I spent days trying to figure out what was killing performance. I was looking at texture sizes, draw calls, shader complexity – everything. Then I found it.

Two directional lights, set up alongside an ambient light, were casting real-time shadows across the entire scene (in Three.js only the directional lights can cast them; an ambient light has no direction). These were standard Three.js lights I had added early in development for “proper” lighting. The problem? My scene is made entirely of flat textured planes – the shadow maps were computing complex depth passes for geometry that visually showed no shadow difference at all.

The Shader Pre-Compilation Trick

Another piece of community feedback: “Why do I wait for the preloader, and then wait AGAIN when I click a door?”

The answer was shader compilation. Three.js compiles shaders lazily – the first time a material renders, the GPU compiles its shader program, causing a visible stutter. Every room had dozens of materials, and they all compiled on first entry.

The solution was RoomWarmup: a component that mounts all four rooms 500 units below the scene during the preloader phase, forces the GPU to compile every shader via gl.compileAsync(), then unmounts everything.
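The article names only RoomWarmup and gl.compileAsync(), so the routine below is an assumption about how the steps could fit together. It is written as a plain function rather than a React component: warmUpRooms, its arguments, and the restore step are all hypothetical. The compileAsync(scene, camera) call itself is the real WebGLRenderer method.

```javascript
// Illustrative warm-up routine. The function name, arguments and the
// restore step are assumptions, not the project's code.
// renderer: a THREE.WebGLRenderer (the `gl` object in React Three Fiber)
// scene, camera: the live scene and camera
// rooms: array of THREE.Object3D roots, one per room
async function warmUpRooms(renderer, scene, camera, rooms, offsetY = -500) {
  const originalY = rooms.map((room) => room.position.y);
  for (const room of rooms) {
    room.position.y = offsetY;   // park the room far below the corridor
    scene.add(room);
  }
  try {
    // Resolves once the materials in the scene have compiled.
    await renderer.compileAsync(scene, camera);
  } finally {
    rooms.forEach((room, i) => {
      scene.remove(room);        // unmount again once shaders are ready
      room.position.y = originalY[i];
    });
  }
}
```

Awaiting this before the preloader finishes moves the compilation cost out of the first door click, which matches the behavior described above.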

The KTX2 Experiment That Failed

I read everywhere that KTX2 / Basis Universal textures are the gold standard for WebGL performance – GPU-native decompression, smaller payloads, the works. So I converted everything. It was a disaster. I reverted to WebP.

Adaptive Device Tiering

The site detects device capabilities at load time and adjusts accordingly. Low-end devices get fewer preloaded textures, no room warmup, and simpler rendering. On top of that, a PerformanceMonitor from drei watches FPS in real time and automatically downgrades the quality tier if frames start dropping.
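As a rough illustration of the tiering idea, the sketch below picks a starting tier from device hints and steps down one tier on demand. The thresholds, tier names and settings are invented for the example. In a browser the inputs might come from navigator.hardwareConcurrency and, where available, navigator.deviceMemory.

```javascript
// Illustrative tier picker. Thresholds, tier names and settings are
// hypothetical; the article only says low-end devices get fewer preloaded
// textures, no room warmup, and simpler rendering.
const TIERS = ["low", "medium", "high"];

function pickInitialTier({ cores = 4, memoryGB = 4, isMobile = false } = {}) {
  if (cores <= 4 || memoryGB <= 2) return "low";
  if (isMobile || cores <= 8) return "medium";
  return "high";
}

// Step down one tier; intended for a callback such as the onDecline
// prop of drei's PerformanceMonitor.
function downgrade(tier) {
  const i = TIERS.indexOf(tier);
  return TIERS[Math.max(0, i - 1)];
}

const SETTINGS = {
  low:    { warmupRooms: false, preloadTextures: "minimal" },
  medium: { warmupRooms: true,  preloadTextures: "reduced" },
  high:   { warmupRooms: true,  preloadTextures: "all" },
};

console.log(pickInitialTier({ cores: 2, memoryGB: 8 }));                  // "low"
console.log(pickInitialTier({ cores: 8, memoryGB: 8, isMobile: true }));  // "medium"
console.log(pickInitialTier({ cores: 12, memoryGB: 16 }));                // "high"
console.log(downgrade("high"));                                           // "medium"
console.log(downgrade("low"));                                            // "low"
```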

Sound Design: The Detail That Changes Everything

I noticed that the sites I admired most – Bruno Simon, Igloo Inc – all had sound. Sound is what makes a 3D site feel like a place instead of just a page with graphics. Every room has its own ambient soundscape.

The Achievement System: Gamification as UX

Inspired by Bruno Simon’s portfolio, I added an achievement system that doubles as a tutorial. When you enter a new room, a tooltip appears. These aren’t just instructions though – they’re achievements waiting to be unlocked. Complete the action, and the tooltip transforms into a completed badge with a chime.

Recognition

The portfolio picked up some awards along the way:

  • GSAP Site of the Day + added to the official GSAP Showcase
  • FWA of the Day
  • CSSDA Special Kudos + 3 Public Choice Awards
  • Orpetron SOTD
  • CSS Winner SOTD
  • Awwwards Honorable Mention

What I Would Do Differently

Honestly? Almost nothing in terms of approach. The “mechanics first” philosophy – build the scroll, the camera, the corridor logic before touching a single texture – saved me from the trap of polishing visuals on top of broken foundations.

What This Project Taught Me

That I’m still a beginner. This is maybe my first project that achieved real recognition – the awards, the community response, the chance to write this article. That’s how I know I’m growing.

What Comes Next

My biggest goal right now: building a website for the Polish hip-hop collective 2115 and submitting it to the Webby Awards and Lovie Awards. It’s a high bar. But when you’re getting awards like FWA of the Day at this age, and getting invited to write for Codrops – I mean, there’s really no such thing as impossible.

One Piece of Advice

First: write the screenplay. Before you touch a single line of code, imagine the film. What does the user see? Where does the camera go? What’s the story?

Then build the mechanics with simple shapes. Rectangles. Cubes. No textures, no shaders – just the raw movement and flow. Make it feel right when it looks like nothing.

Only then do you dress the world.

Tech is never the blocker – it’s always your imagination. If you can think of a solution, you can code it. It just takes time and stubbornness.

AI enterprise google

Google is working on Skills Marketplace for Gemini Business

Google is integrating its enterprise tool suite into a centralized Skills Marketplace within Gemini Business to streamline report and dashboard creation.

Summary

What: Google is testing a 'Skills Marketplace' tab inside Gemini Business, featuring a Skills Builder and Skills Management UI. This appears to be part of a broader strategy to consolidate tools like Android Studio under a unified Gemini interface to compete with other 'super-app' platforms.
Why it matters: This signals Google’s move to lower the barrier for non-engineers to build custom software tools, attempting to lock enterprise users into a single Gemini-centric ecosystem.

Original Article

Google's consolidation push inside Gemini Enterprise continues to integrate separate products under one roof, and the latest developments suggest this trend is far from over. A new tab has started loading a user interface that references Android Studio, appearing as a separate page embedded directly into Gemini Business.

There is precedent for this: AI Studio already allows users to build native Android apps through plain-language prompts, complete with a browser-based emulator. This may also signal preparation for a separate enterprise-focused desktop app from which users will be able to open Android Studio directly.

In parallel, a "Skills Marketplace" is taking shape in its own tab, where users can select from predefined skills tailored for Gemini and, in some cases, optimized for Google services. This initiative appears to encompass three components:

  1. Skills management UI
  2. A Skills Builder
  3. The Marketplace itself

A few organizations may already have access to early versions, although none have been widely released. A developer-facing Skill Registry is already available on the agent platform, suggesting that the consumer-style Marketplace is the front end of a layer Google can adjust based on account tier.

The teams most likely to benefit are those with ideas for dashboards, approval tools, or reporting interfaces that are typically delayed in engineering queues. While there is no firm timeline, and any component could remain experimental, the intent is clear: Google aims to create a unified Gemini surface that integrates its dispersed tools, pursuing the same super-app goal as its competitors but from a slightly different perspective.

AI research agents

Today's Frontier AI companies will never exceed the AI capability frontier again

Centralized frontier AI models are being superseded by networks of smaller, specialized AI models that offer superior speed, accuracy, and efficiency.

Summary

What: Andrew Trask argues that the 'mainframe era' of AI—dominated by massive, monolithic models—is ending. He suggests that the future of the industry lies in decentralized networks of smaller neural networks working in concert.
Why it matters: This perspective challenges the prevailing 'scaling law' industry narrative and suggests that future competitive advantages will come from agentic orchestration rather than raw model size.

Original Article

Networks of smaller AI models are outperforming every frontier AI system on speed, accuracy, and cost. Everyone in the 1960s was wrong about the mainframe computer, and everyone is now wrong about centralized AI. The future is a network of neural networks.

AI llm backend

Kimi K2.7 Code (Hugging Face Repo)

Moonshot AI's Kimi K2.7 Code model packs 1 trillion parameters into a Mixture-of-Experts architecture optimized for multi-step software engineering.

Summary

What: Moonshot AI released Kimi K2.7 Code, a 1-trillion-parameter MoE model designed for complex agentic coding tasks. It improves on Kimi K2.6 in token efficiency and end-to-end task completion, and is accessible through Moonshot's OpenAI/Anthropic-compatible API.
Why it matters: Large-scale MoE models are increasingly being specialized for agent-loop workflows where low-latency, high-precision code generation is required.

Decoder

  • MoE (Mixture of Experts): A model architecture where only a subset of total parameters (experts) is activated for each token, allowing for high model capacity with lower computational cost per token.
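A toy version of the routing step can make the definition concrete. This is a generic top-k sketch with made-up experts and scores, not Kimi's architecture: a router scores every expert, only the k best run, and their outputs are mixed with renormalized weights.

```javascript
// Toy top-k expert routing for one token (illustrative; not Kimi's code).
function softmax(xs) {
  const max = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// routerLogits: one score per expert; experts: array of functions x -> number
function moeForward(x, routerLogits, experts, k = 2) {
  const topK = routerLogits
    .map((logit, index) => ({ logit, index }))
    .sort((a, b) => b.logit - a.logit)
    .slice(0, k);
  const weights = softmax(topK.map((e) => e.logit)); // renormalise over chosen experts
  // Only the k selected experts run; the others are never called for this token.
  return topK.reduce((sum, e, i) => sum + weights[i] * experts[e.index](x), 0);
}

const experts = [(x) => x + 1, (x) => x * 2, (x) => x * x, (x) => -x];
console.log(moeForward(3, [0, 5, 5, -2], experts)); // 7.5
```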

Original Article

Kimi K2.7 Code is a coding-focused agentic model with stronger end-to-end task completion across complex software engineering workflows and better token efficiency than Kimi K2.6. The Mixture-of-Experts model has 1 trillion total parameters. It can be accessed through Moonshot's OpenAI/Anthropic-compatible API. The model works best with Kimi Code CLI as its agent framework.

AI startup productivity

$10,000,000 on the line: how we measure Devin's engineering output

The company behind Devin is putting up to $10 million per customer behind the claim that its autonomous engineering agent delivers more output than it costs.

Summary

What: Ryan Bai says the startup is backing a performance guarantee with up to $10M per customer, claiming that Devin, an AI software engineer, delivers more engineering output than customers pay for. The measurement system is described as validated on independent data.
Why it matters: This signals an attempt to shift the narrative from AI as a cost-saving tool to a guaranteed output-generation asset, pressuring other AI coding startups to adopt verifiable ROI metrics.

Original Article

$10,000,000 on the line: how we measure Devin’s engineering output

We're putting up to $10M per customer behind a single claim: Devin delivers more engineering output than you pay for. This is the system we built to prove it, validated on independent data we...

AI enterprise career

A frontier without an ecosystem is not stable

Satya Nadella is framing AI adoption as a dual-investment requirement: companies must cultivate both 'human capital' and 'token capital'.

Summary

What: In a recent statement, Microsoft's CEO suggests that long-term stability in an AI economy depends on organizations investing in both employee skill sets and the computational throughput required for their AI agents.
Why it matters: Microsoft is trying to standardize a framework for enterprise AI adoption that makes 'tokens' a formal balance sheet item alongside traditional headcount.

Decoder

  • Token Capital: A metaphor for the investment in and consumption of large language model inference compute necessary to run autonomous agents.

Original Article

A frontier without an ecosystem is not stable

I’ve been thinking a lot about the future of the firm in an AI-driven economy. This transition is different than any previous platform shift. In the past, we used digital systems to enhance human...

AI devops enterprise

Ramp SWE-Bench: a private, production-grounded coding benchmark

Ramp has launched an internal, production-grounded coding benchmark to test if AI models can actually solve its specific financial software challenges.

Summary

What: The fintech company developed a private SWE-Bench variant that uses real-world bugs and features from its proprietary codebase, moving beyond generic public benchmarks.
Why it matters: Generic benchmarks are becoming less useful for high-stakes enterprise coding; companies are now building 'internal benchmarks' to quantify exactly how much engineering time they can safely automate.

Decoder

  • SWE-Bench: A common benchmark used to evaluate how well AI models solve real-world GitHub issues; Ramp’s version is a localized, private iteration of this.

Original Article

Full article content is not available for inline reading.

AI opensource research infrastructure

DeepSeek's $10 Trillion Long-Term Strategy

DeepSeek’s aggressive open-source strategy may be a long-term play to commoditize foundational model architecture rather than capturing consumer app market share.

Summary

What: DeepSeek is prioritizing the release of open-weight models and research papers to undermine the moats of closed-source competitors like OpenAI and Google. The theory suggests this creates a dependency on DeepSeek's optimized infrastructure stacks and training methodologies.
Why it matters: This shift suggests that for many AI labs, the value lies in becoming the underlying standard for model development rather than winning the race for direct user attention.

Deep Dive

  • DeepSeek's model releases are designed to lower the barrier to entry for training large-scale models.
  • By open-sourcing efficient training recipes, they increase industry reliance on their architectural patterns.
  • The strategy mimics historical shifts where software stacks were commoditized to sell more foundational hardware or cloud utility.
  • Their research papers focus on optimizing compute efficiency, which is the primary bottleneck for massive model deployment.
  • The goal is to position DeepSeek as an unavoidable layer in the global AI supply chain.

Decoder

  • Open-weights: Models where the internal parameters are provided, allowing researchers to run the model locally, unlike closed-source models accessible only via API.
  • Commoditize: To transform a specialized good or service into a standardized commodity, typically driving down prices.

Original Article

This thread explores a theory that DeepSeek's focus on open-source models, research sharing, and infrastructure development is part of a broader strategy aimed at becoming foundational AI infrastructure rather than competing directly on consumer products.

DevOps ai data python

Introducing Flights: Agent-Native Ingest in MotherDuck

MotherDuck's new Flights feature allows developers to build and run agent-native data pipelines directly within a Python runtime.

Summary

What: MotherDuck launched Flights, currently in public preview, which enables users to create data ingestion pipelines using MCP-capable agents, SQL table functions, or the MotherDuck UI. It supports ingesting data from various sources like S3, SaaS tools, and warehouses while maintaining secure, session-based processing.
Why it matters: This signals an attempt to make data engineering more accessible to AI agents by collapsing the distance between ingestion, transformation, and storage into a single runtime environment.

Decoder

  • MCP (Model Context Protocol): An open standard that allows AI agents to securely connect to data sources and tools.

Original Article

Vibe Coding Is Dangerous, Agentic Engineering Isn't ft. Wes McKinney

Wes McKinney, creator of pandas and co-creator of Apache Arrow, shares how he works with AI coding agents: spec-driven workflows with superpowers, continuous AI code review with Roborev, token economics, and why vibe coding is dangerous but agentic engineering isn't.

DevOps enterprise ai

GitLab: Built for the agentic engineering era

GitLab introduced a new agent-focused platform and flexible consumption model to support the transition to AI-driven software engineering.

Summary

What: GitLab announced a new Git engine for handling massive agent concurrency, the Orbit lifecycle context graph, and expanded governance controls for AI agents, alongside 'GitLab Flex,' which allows users to reallocate spending across seats and AI usage.
Why it matters: Platform providers are re-architecting their cores to accommodate the higher frequency of API calls and data context requirements generated by autonomous AI development agents.

Decoder

  • Agentic engineering: An approach to software development where autonomous AI agents perform tasks such as coding, testing, and debugging, rather than just assisting humans.

Original Article

GitLab unveiled an agent-focused software development platform featuring a next-generation Git engine for massive agent concurrency, the Orbit lifecycle context graph, governance controls for AI agents, and expanded security automation to help organizations scale AI-driven development safely. The company also highlighted strong enterprise AI adoption trends and introduced GitLab Flex, a flexible consumption model that lets customers reallocate spending between seats, AI usage, and platform capabilities as needs evolve.

Tech research data

New CRISPR Technique Selectively Shreds Cancer Cells, Including “Undruggable” Cancers

Researchers developed a new CRISPR-based method that selectively destroys cancer cells by targeting common tumor suppressor mutations.

Summary

What: The technique finds and destroys cells carrying a tumor suppressor mutation that occurs in nearly half of all cancers and in up to 70% to 90% of cases of some of the hardest-to-treat cancers. Adapting it to a new cancer type is faster than developing a small molecule drug or antibody therapy, though delivery remains a challenge.
Why it matters: The transition from generalized treatments to programmable, mutation-specific genetic medicine is accelerating, potentially lowering the development timeline for targeted oncology therapies.

Decoder

  • CRISPR: A technology used to edit genes by precisely cutting DNA sequences.
  • Tumor Suppressor: A gene that protects a cell from one step on the path to cancer; when it mutates, it loses its ability to control cell division.

Original Article

Researchers have created a new CRISPR-based approach that can selectively destroy cells carrying a mutation in a tumor suppressor found in nearly half of cancers and up to 70% to 90% of cases of some of the most difficult-to-treat cancers. The new approach harnesses CRISPR's ability to find cells with specific mutations and uses its cutting ability to selectively destroy those cells. Making treatments for different cancers using this technique is faster than making a small molecule drug or antibody therapy. Delivery remains a challenge, but a combination of therapies may prove useful for some cancers.

Tech hardware research

Nuclear clocks tick for the first time

Chinese and European research teams have independently built the first working nuclear clocks, which keep time with extraordinary precision using the thorium-229 nucleus.

Summary

What: These clocks use the nucleus of a thorium-229 atom rather than the electrons used in standard atomic clocks. The breakthroughs involved overcoming the technical difficulty of generating the exact laser frequency required to interact with the thorium nuclei.
Why it matters: Nuclear clocks could lead to a new standard of timekeeping that is resilient to environmental interference, enabling more precise GPS, deep-space navigation, and tests of fundamental physics.

Decoder

  • Atomic Clock: A timing device that measures the resonance frequency of electrons in atoms (typically cesium) to maintain accuracy.
  • Thorium-229: A radioactive isotope whose nucleus has a unique, low-energy state that makes it ideal for nuclear timekeeping.

Original Article

Chinese and European research teams, working independently, have each built a working nuclear clock. Both exploit the nucleus of a thorium-229 atom to keep time with extraordinary precision. Building one has proven difficult until now because the required laser light sits in a part of the spectrum that is technically challenging to generate and control. The Chinese team used a more powerful laser, while the European team worked with a crystal containing a higher concentration of thorium nuclei.

Tech ai agents backend

The Log Is the Agent

An agent's true state is not its model or runtime, but the immutable log of events that dictates its decision-making history.

Summary

What: The author argues that agents should be designed as event logs rather than monolithic loops. By treating the agent as a history of events, developers can better reconstruct, resume, and debug complex agentic workflows.
Why it matters: The industry is moving away from treating agents as 'black box' inference loops and toward observability-first architectures, where the audit trail of an agent's reasoning is the primary data structure.

Deep Dive

  • An agent's identity is defined by its history, not the underlying model or local runtime.
  • Event logs enable perfect reconstruction of an agent's mental state and actions.
  • The article compares agents to character save-states in video games (e.g., Skyrim) to explain persistence.
  • Log-based architectures simplify complex, long-running agent workflows.
  • Designing for logs allows for easier debugging when an agent goes off-track.
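The points above can be sketched as a small event-sourced state. The event types and state shape are hypothetical, chosen only to show the idea: state is never stored directly, it is rebuilt by folding over the log, so any prefix of the log yields the agent as it was at that moment.

```javascript
// Illustrative event-sourced agent state (event names and shape are hypothetical).
function applyEvent(state, event) {
  switch (event.type) {
    case "user_message":
      return { ...state, messages: [...state.messages, { role: "user", text: event.text }] };
    case "tool_call":
      return { ...state, pendingTool: event.name };
    case "tool_result":
      return {
        ...state,
        pendingTool: null,
        messages: [...state.messages, { role: "tool", text: event.output }],
      };
    default:
      return state; // unknown events are ignored rather than breaking a replay
  }
}

// Rebuild state from scratch by folding over the log.
function replay(log) {
  return log.reduce(applyEvent, { messages: [], pendingTool: null });
}

const log = [
  { type: "user_message", text: "List the files" },
  { type: "tool_call", name: "ls" },
  { type: "tool_result", output: "a.txt b.txt" },
];

console.log(replay(log.slice(0, 2)).pendingTool); // "ls" (state partway through the run)
console.log(replay(log).messages.length);         // 2
```

Resuming after a crash is then a replay of the stored log, and debugging a run that went off-track means replaying prefixes until the state first looks wrong.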

Decoder

  • Observability: The ability to measure the internal states of a system by examining its external outputs and logs.

Original Article

The Log Is the Agent

Think about a character you have spent 100 hours playing in Skyrim or Elden Ring. What exactly is your character? Is it the game engine? No. Is it the PlayStation? No. Is it the controller? No. Your...

Tech ai startup career

Meta's months-old AI unit is a soul-crushing gulag, say the engineers stuck inside it

Meta's internal Applied AI team is reportedly facing a crisis of morale as thousands of employees are drafted into repetitive data-labeling tasks.

Summary

What: Meta's new 6,500-person Applied AI team, led by Maher Saba, is struggling with internal revolt after employees were involuntarily reassigned to generate training puzzles and coding problems. Chief AI Officer Alexandr Wang is central to the strategy of using Meta's engineering workforce for high-quality data generation.
Why it matters: Meta is betting that internal engineering talent produces better training data than third-party contractors, but the coercive nature of these reassignments is creating a significant cultural liability that threatens the company's ability to retain talent.

Original Article

Anyone who works at Meta or knows anyone who works at Meta will tell you the same thing: It is not a happy place, particularly given the seemingly endless layoffs the company has executed over the last few years — cuts that have only accelerated as the company funnels billions into AI.

Now, a new report in Wired suggests the company’s Applied AI team is on the verge of revolt.

The drama kicked off when someone hijacked a livestreamed, employee-only presentation this week with an expletive-laden meltdown, demanding that attendees tell a senior Meta AI executive that he was “a piece of sh*t.” One presenter reportedly covered their face with their hands.

That outburst, Wired reports, reflects simmering rage inside the three-month-old unit of roughly 6,500 engineers and product managers who have been tasked with supporting the company’s AI research ambitions.

A report last month in Business Insider shed light on how many employees originally learned they’d be moved into the group — through a surprise email, a process that one self-described draftee described later on Reddit as “quite random.” According to an internal announcement reviewed by BI, the reason they were enlisted is that Meta’s AI models still lacked the knowledge to outperform humans at technical tasks like coding. “For agents to understand how people actually complete everyday tasks using computers, we need to train our models on real examples,” the announcement read.

In a leaked audio recording from an internal meeting that month, CEO Mark Zuckerberg offered his reasoning for drafting employees rather than outside contractors. Alexandr Wang — who sold his data-labeling startup Scale AI to Meta for $14.3 billion before taking the role of chief AI officer and heading up Meta Superintelligence Labs — knows the data-labeling world well, Zuckerberg said. And candidly, the average Meta employee has “significantly higher” intelligence than third-party contractors, he added, making them the better choice.

Employees describe being forced into the group with no real choice: join or quit. Many call themselves “draftees.” Their assigned work? Generating puzzles and coding problems to train AI models. “It’s literally the gulag,” one employee told Wired. “Most people find the work soul-crushing,” said another.

It’s not just the Applied AI group where morale is lousy. More than 1,600 Meta employees company-wide have reportedly signed a petition protesting a program that monitors their clicks and keystrokes for AI training data. The mood across the company is dark enough that Meta’s chief product officer, Chris Cox, felt compelled to address the “brutal” environment on a call with employees this week, said Wired.

TechCrunch has reached out to Meta for comment.

According to earlier reports, the Applied AI team is led by Maher Saba, a 12-year veteran of Meta who was previously a vice president in its Reality Labs division, the division that burned through $83 billion on the metaverse before Meta moved on to AI. The new organization reports up to Meta CTO Andrew Bosworth.

Originally, the unit was structured in such a way that up to 50 employees reported to one manager.

Zuckerberg, for his part, reportedly addressed the broader situation in an internal memo Friday, acknowledging that recent changes had “caused distress” and admitting the company had made mistakes that it plans to address. According to Wired, he added in his memo that “Meta’s north star is to be the best place for the most talented people in the world to make an impact.”

Tech opensource career

Leaving Mozilla

Mozilla veteran JR Conlin departs after 15 years, cautioning that chasing industry trends instead of serving its 'abnormal' community is eroding the company's foundation.

Summary

What: After 15 years at Mozilla, JR Conlin is leaving, citing concerns over a 'top-down' pivot toward chasing Daily Active User (DAU) metrics through feature imitation, rather than leaning into the community-driven roots of Firefox.
Why it matters: This critique highlights the recurring tension in open-source organizations when leadership attempts to professionalize for enterprise or market share at the expense of the core community that built the product's identity.

Decoder

  • DAU: Daily Active Users, a metric of product engagement.
  • #moco: Internal Mozilla channel (MoCo is short for Mozilla Corporation).
  • #cccc: Internal channel related to community or company culture.
  • OKR: Objectives and Key Results, a popular goal-setting framework.

Original Article

jr conlin's ink stained banana

Leaving Mozilla

After more than 15 years, I will be leaving Mozilla on July 21. Friday, June 12th will be my last “real” day, as I am planning on using my 200+ hours of vacation backlog. I've had the honor of working with some of you, and others have no idea who I am, but you might have a sticker of mine.

While I have mostly enjoyed my time here, there are a few things I wish to say upon my departure:

You Are More Important Than You Think

I'm not referring to you The Corporate Entity or you the Collective Organization. I mean you. The person reading this right now. I have beaten the drum for mentoring for quite some time. Mentoring is a lot of things, but essentially, it's finding someone else to talk to. In a company full of fellow introverts, I get that the idea is uncomfortable and hard, but really, it's one of the best things for both you personally and your career. You're smart and can both learn and teach, no matter what your level of experience or background. Please try it.

You Are Part of Something Far Larger

If you're working here, you're one of the fortunate. There are a bunch of people who wanted to build a browser that could stand toe-to-toe with ones built by people with a lot of money. A browser that put their interests first. That worked how they wanted. We're the lucky, tiny portion that can get paid, but it comes with a price. It is our obligation to listen to the people who aren't lucky enough to get paid. The folk out there are our community. They're our peers. They are the ones who trust that we will continue to work for them, because if not, they'll find someone else who will.

We run the very real risk of losing those people.

We're also pretty small

It's too easy to think that we're big. We're not. We're a niche browser that is lucky enough to get well funded. We shouldn't try to be like the big browsers because that's not what our Community wants.

Think of it this way; imagine living somewhere filled with McDonald's, Burger King, and Wendy's. We're that cozy little Mom & Pop diner where customers say hi to each other, pour each other coffee, and clean up tables. It's the sort of place that folk meet up to chat at tables, have a pretty awesome sandwich, and ask the owner who runs the grill if he thought about having the pork chop come with rice instead of a baked potato.

People have to seek us out. They're doing that because they don't want to use the browser that they literally have at their fingertips. They also seek us out because they don't trust the other big browser that everyone insists that they should be using. The folk that don't care about that already use those big browsers. So, if people are looking for an experience that is absolutely not like the one they already have, why would we ever seek to emulate what those browsers are doing?

That doesn't mean we can't become big. We did this before. When we listened to our Community, gave them what they wanted, let them work with us to build something amazing, they told their friends. That was our growth phase, where we had ever increasing DAU (to use business terms). For regular folk, that's when they installed us on their uncle's laptop, their neighbor's phone, and their classmates' desktop, because their work was part of what made us special. They told their company about how we were a great alternative because they were proud to be part of what we were. That's something you can't build with just posters and stickers alone.

So, that's it. TL;DR:

  • Respect yourselves.
  • Help each other out.
  • Remember who you're working for.

If you want to stay in touch, I'm not that hard to find online, although I might spend a few days screaming at the ocean.

Right, so that's the note I sent when I left. Now I get to talk about all the things that bothered the hell out of me.

Mind you, I would have preferred to stay longer, but things got to a point where it just wasn't fun anymore. (Considering that my career has been supporting stuff that no one else wanted to deal with. When everyone else ran out of the room, I was the one that would sigh, raise his hand, and take on the task. This did not do wonders for my career, but it was honest, hard work and constantly challenging.)

My career generally has been weird, because I'm not the kind of person to jump ship after a year or two. That's about the point when I feel I actually understand not only what I was handed, but how it fit into the larger org, and I can actually improve things more holistically. Not all of those improvements radically changed things. I've heard it as “Keeping the campsite clean”, little changes and improvements that make everything else generally better.

I've had previous opportunities to “ride the rocket” with various companies and they've been moderately entertaining. Often the trajectory of that start-up rocket was “into the ground”. In fact, the majority of companies I was with no longer exist, Netflix being the odd-duck out. It's often said that Mozilla survives in spite of its Leadership, not because of it, and it absolutely rang true lately.

So, let's go over a few things that bothered me. These are all opinions from someone who worked in a trench there for 15+ years, but, the bonus was that I never got the chance to think that highly of myself.

I'm not kidding when I said that Firefox is a niche browser. Folk have to actively look to use it. They have to search it out, figure out how to download it, ignore all the warnings and “suggestions” that they should keep using whatever the native browser is, avoid all the ads for Chrome as the better replacement browser, ignore all the sites saying “Your browser is out of date” because they couldn't be arsed to test things in Firefox, etc. Firefox users are not normal. They are deeply abnormal, and frankly a lot of them are proud of that.

The problem is that Leadership doesn't know how to deal with that.

Mozilla, born of being niche, and started by a bunch of abnormal folk, is deeply abnormal. Mozilla is open source. Like, really open source. Pretty much every line of code they write is published somewhere. (There are some private repos of course, because they're not going to leave the keys under the doormat, and there are some repos that aren't public because the folk that wrote them are exec types that don't understand the power or motivation of Open Source, but they're weird and those projects don't last long anyway.)

Pretty much no other company in the Tech Industry is like Mozilla. So it's really hard to hire people with experience running traditional Tech Industry companies that have any clue about how to deal with being that level of open. They all come from worlds where The Black Turtlenecked God told you “Do Not Tell Anyone about Anything”. The idea that they literally give things away and are actually transparent as hell is like telling them Mozilla employees are martians. They smile, say polite things, then ignore our history and actions and do things that they know because the concept of anything alien is clearly evil.

This sort of thing manifests in weird ways. One of the more hilarious ones is the “Chase for the DAU” (Daily Active User). Mozilla's DAU count has been dropping for years. There's all sorts of reasons for that. I bet you can come up with a few yourself. Of course, New Leadership comes in with guns a'blazing and Big Ideas for how to make DAU go Up. Those proposals seldom work because those Big Ideas inevitably are “We should copy what the Big Browsers do!”. Remember when I said that our users are deeply abnormal? Yeah, they already have that feature in the browser that's already on their machine. If they wanted it that bad, they already have it.

I told someone once that imagine being in an area where every restaurant is McDonald's, Burger King, or Wendy's. Opening up another burger stand isn't really going to cut it. But if you open up a diner where folk know your name, and customers can pour coffee for other folk, or help clear dishes, or talk with the guy at the grill and try to convince him to add teriyaki spam and grilled cabbage on rice to the menu, you might wind up becoming the neighborhood hang-out that folk tell visitors about.

But, sure, if the DAU numbers are down, clearly New Thinking must be the answer! Maybe? I mean, thinking that's different than what was currently being done is probably a really good idea. The thing is that's also when the delusions set in. Every New Leader thinks “We've got to think like a Start Up!” I mean, they do know that most start-ups fail, right? Mozilla is a 30 year old company. They are the polar opposite of “start-up”. In fact, for the past 15 years, they've been "thinking like a Start-Up" in various flavors and now they have the lowest DAU ever. Instead, I dunno, maybe look back at that 30 year history and see what they were doing when they had positive DAU and do that again?

I'll give them a hint because I was around then: It wasn't chasing the latest fads.

It was doing what they're good at, being deeply abnormal, and helping folk make what they really wanted.

Back then, not only were we publishing all of our code, we were working with folk to make a better browser, regardless of where they were or if they were part of our org chart. Doing that excites people. Knowing that what you're working on can go into a product used by others makes you faithful about that product. Knowing that your opinions and ideas can change the most complex application on your computer makes you want to share that application with others. Having a sense of ownership, no matter how small, makes you a member of a Community that advocates for that browser and makes you want to install it everywhere. That beats any clever marketing project, ever. I know this, because I saw it happen, repeatedly.

Of course Leadership doesn't get that. That's one of those “martian” things, that is clearly wrong and bad because it wasn't part of their MBA syllabus and Meta didn't do that. (I mean, the fact that Facebook is a community message board, and thus has more contributions from outside of the company than inside, may skip right by them, but whatever.)

Another fun example of “not clear on the idea” happened after Mozilla decided to chase the Enterprise dollar. Don't get me wrong, that's an incredibly rich seam to draw from. Short of getting government contracts, there's no better source of reliable income. Of course, getting that “enterprise” gold isn't easy, and comes with lots of strings and conditions, usually in the form of ISO standards. One of them is that you need to prove that your code and infrastructure are secure. (No bad guys getting in and doing bad guy things.) There are ways to address that sort of problem. One is to follow the guidance and install essentially spy-ware and locks on company issued devices to make sure that "nothing bad happens". Basically, fortify the walls so that the Outside People don't get your valuable data.

Again, Mozilla is abnormal. Normal companies secure things because if bad people saw their code they could write exploits and do bad things. That works fine for others, but Mozilla publishes all of their code. Bad guys are already building exploits, and Mozilla has a stellar track record of fixing critical bugs, often within 24 hours. That's unheard of, and they've done that since day 1. That's like mandating you put a bar on the steering wheel of an armored tank, that's actively crewed, in the middle of a base, with half the population watching. Yeah, sometimes bad things happen, but it's not that often. You push back where it makes sense. Trust me, enterprise companies use curl, linux, and a slew of open source stuff written by random people who engage in all sorts of activities and will never fill out a cybersecurity attestation certificate. You make sure that the keys are properly locked and the build environments are secure, and that there's a clear audit path available with trusted signatories. You know, the thing they've been doing so long and well it's what the guys who run cybersecurity outfits modeled themselves after.

Another delusion comes about because of self-reinforcement. Say you're going to release some controversial feature. Maybe it's browser based DRM, maybe it's AI, maybe it's Push Notifications. Listening to your users can be a bit challenging, because while some might tell you, most probably won't. They'll just leave. That means that your source of information will be the people that stick around, so you wind up getting artificially high approval rates for things. It's a bit like that bomber diagram meme. I'm willing to say that if you announce a feature or function and the number doesn't go up past the initial novelty bump, that's a pretty solid indicator that you guessed wrong, and maybe the folk complaining on Reddit might have a point. Folk are telling you, they're just not doing it directly in a focus group.

A final Community thing is that over the past five or so years, Mozilla has been turning away from its powerhouse, the Community. I have no idea why, but I can say that it's a top down decision. At some point, some folk at a high level decided that Mozilla got to where it did on its own. It did not. The thing I was beating a drum about was that the folk working there were the lucky ones who got a paycheck, but most of their peers were folk who didn't have a badge and a @mozilla.com email address. Leadership was convinced that the people in our Community were just customers, and maybe fans. This pissed off so, so many folk, and rightly so. They had given hours or years of effort and time without compensation, because they believed they were part of a larger effort. They felt betrayed because they were betrayed. I'm sure that someone probably had a reasonable argument about “how could we let all these outsiders have a say?” or "I don't like that those folk hate the amazing work we're doing promoting lemony fresh bell bottoms (or whatever trend they were chasing)?" I dunno, boss, maybe the folk using your browser actually might have solid reasons of their own and might really appreciate things that don't show up on LinkedIn top takes?

For what it's worth, my concern for Mozilla isn't it running out of money. So long as Google or another large search engine exists, it can get cash. There are also a few other financial stability angles it can do which (frankly) would be better. I wish they had made a bigger deal about the privacy preserving (no, really) ad stuff that they were pioneering. Basically, think of it as regressing ads back to the model they were pre-internet, and you're not far off. There will be lots of money and new leaders who don't understand what made the company they're now in charge of only last for so long. There will be lots of people with their own Big Ideas who will come in, chase the chickens around, and leave once they've caused enough ruckus. I hope that it continues to collect nascent martians like myself, who know how big Tech works, hate the approach that those companies take and want to actually make things better, rather than just put a gold star on their resume.

So, what to make of Mozilla?

I will absolutely say that it is full of some of the smartest, kindest, most privacy-obsessed people I ever had the honor to work with. I'm proud of the 15+ years I spent working there and look back at most of it fondly. I will probably continue to use Firefox as my daily browser, while turning off the latest fad stuff. I will keep telemetry on because I know exactly how it was used and how the people are painful about keeping it private. (Privacy scales, and is cheap as hell, even if it makes your job so damn hard.) I will avoid the AI crap because it's not going to last. That said, there are other browsers I will probably screw around with more, like Servo and Vivaldi.

I also fully expect that this post will probably make the rounds on #moco and #cccc and it'll promptly be ignored in a month. That's fine by me. I don't expect anyone in Leadership to change, which will make me sad. Google's cash cannon will continue to feed Mozilla for quite some time, so expect the bad ideas to continue.

But let's say someone were to ask me the question I used to pose to folk in interviews. “Let's say that you stun us and we make you CEO, what are the things you'd push hardest to do?”

Be boring for a while. There's blood when you live on the cutting edge, but a lot of it is yours. Mozilla has tried building everything from a shopping nexus to a phone's operating system, and kept discovering that they're not great at it. They are, however, really good at building browsers. They should do that. Focus on shoring up core features that folk rely on. There's opportunities to innovate and improve, of course, but it's not a bad idea to let the high speed pasta cannon cool down a bit.

Cut back on the moonshots. As the saying goes: “Shoot for the moon! Because even if you miss, you quickly die of radiation poisoning and circle the sun for thousands of years before impacting something at thousands of kilometers an hour.” Firefox has been “a thing” for thirty years. Those fellow weirdos already know about Firefox, because they're so sick of their default that they want something different. Instead of trying to give them some new, fancy thing-a-ma-bob that gets abandoned in a year, how about spending time fixing all the old bugs and tech-debt that's accumulated? Give customers something that just works better, is less annoying, and doesn't constantly scream about how awesome it is. Be the browser that realizes that maybe some people like radical change, but others REALLY don't and make things “opt-in” by default. (Also, remember that your customers are not your fans. They barely tolerate you. You have to work every day to convince them to stick around. That's not being negative, that's being realistic. Humility drives improvement and forces you to be more critical about radical changes.)

Build back your Community. Encourage outside contributions, to the point where they are part of the active conversation about what to do next. Not some focus group you cornered for an hour with the promise of Amazon gift cards, but the folk that helped fix a bug, land a feature, translate a page, or answered a ton of questions. Did you know that for a while Firefox was available in nearly every language? That was thanks to the team of volunteers that helped make sure that Firefox was fluent when no other browser or application was.

Don't get rid of the good stuff. Mozilla has a really terrible habit of trying to get rid of things that are successful. They hold Thunderbird at arm's length, tossed out Rust (even though it could have been their cash cow), and Servo may be what beats you. Yes, there have been a lot of really bad ideas and plenty of passion projects that went nowhere, but Mozilla often “cleaned house” for terrible reasons. Mozilla could still invite some of those orphaned projects back, you know. Or at least work with them to actually make things better. Maybe some face of Mozilla is the one that provides the “enterprise” aspect for Rust, and they split revenue with the project. Maybe they bring the Servo folk back to talk about improvements. Maybe they spend a few resources to improve Bugzilla, the thing that got corrupted into the waking nightmare that is Jira, and give Atlassian a run for their copious money.

I do, honestly, wish that Mozilla would reconnect with their Community, though. I'd love if the abnormal little niche browser built by martians became popular. Not because of being just like the big browsers, but because it's nothing like them. You know, kind of like it was back in the 2000's when the DAU was way the hell higher. Firefox succeeds not by being the same, but by attracting folk that want something different and reflects their needs rather than someone's OKR. We grow not by making noise, but by being useful. Heck, we can become a significant portion of the market again just by being a browser and not whatever the hell the other companies are doing that's driving their customers nuts. Will we ever be #1? No. That shouldn't be our goal, either. We should be a significant part of a vibrant ecosystem, not the black hole that consumed everything. Hell, if I were Mozilla Leadership, I'd be watching Vivaldi closer than I would be watching Chrome or Edge.

For the past year or so, I've been asking myself a question. “Who am I doing this for?” Who am I working so hard for, landing features and functions, ensuring that things are working, etc.? It's pretty clear it's not the same folk that I was doing it for when I started. It's not the folk outside who wanted a browser that was theirs. The answer I kept coming back with was “I'm doing all this so that someone else can get a gold star on their resume for the next gig.” I'm sorry, but I don't care about your career any more than you cared about mine. The people who go through the extraordinary effort to find Mozilla Connect have told you the things that they want us to work on. Yeah, it's low traffic in the same way that Arthur Dent's home plans were public and well trafficked. (Yes, I'm well aware that “Suggest” is used by something else, but we've never let that stop us before.) Working overtime and killing myself because someone got a wild hair is not super compelling, and makes what I do more “a job” than something I'm actually interested in doing.

As for me? I dunno. I'm more burnt than the Christmas Roast you just remembered you had in the oven. I'm more done than a Trump casino. I've got some money stashed away and can live off of that for quite some time, but I'll probably be back doing tech stuff, probably open source stuff because it's a lovely way to fuck with the tech bros trying to ruin it. Who knows, maybe I'll grab a few old laptops, some controllers and set up a few MAME rigs at the local nursing homes because even old folk need a few FYB type games.

Heck, I might even fork Autopush and some of the WebPush libraries just so I can finally work through the backlog of crap.


1. Someone noted to me that part of this is driven by Mozilla being notoriously slow about doing things internally. Everything feels like a struggle rather than how quick it is with startups that don't have guard-rails, and I appreciate that. The problem, though, is that those guard-rails were added for reasons. The problem is that the "Go Fast" folk don't ask why those guard-rails were added, nor try to figure out if they're still needed, they just get frustrated and leave and the guard-rails remain. Mozilla is 30 years old, and made a LOT of mistakes, but also learned a lot. Part of the "go slow" I talk about later might be testing to see if we still need all those guard-rails and speed bumps.

2. I will also add that there are some folk out there that are the absolute worst imaginable. They're terrible people who sling insults and insist that their brilliance is the only solution to any problem. These are the folk that make 4chan look like angels, and they're more than happy to pollute any discussion they can come across. Ignoring, banning, and otherwise dealing with those folk can be de-humanizing and soul crushing, because those folks do everything they can to make it that way. But, that's also part of the challenge. You don't let one person deal with it, you have lots of folk wade in and deal with it. You make the trolls small and uncomfortable because that's what they are, insignificant mites there just to pester and ruin things. That's hard, and abnormal, and what makes a group stand out.

Tech aipolicy

Inside the Room Where America's Brightest Game Out How to Avoid an AI Apocalypse

Policymakers and technologists are increasingly using war-gaming simulations to anticipate and prevent existential risks posed by advanced AI systems.

Summary

What: Workshops and simulation rooms are being used to map out high-stakes scenarios involving AI failure, emphasizing the need for proactive governance despite political polarization.
Why it matters: The transition from theoretical AI safety discussions to active simulation indicates a move toward treating AI risk management as a matter of national security rather than just abstract research.

Original Article

Political polarization makes ambitious changes difficult, but inaction would be far more costly.

Tech researchenterprisellm

Has AI Already Killed How-To Nonfiction? Sales Trends, My Personal Data, and What It Might Mean for the Future

Prescriptive nonfiction sales are plummeting as readers turn to chatbots for personalized, instant answers instead of buying books.

Summary

What: Tim Ferriss reports that his catalog’s print sales have declined by approximately 57% in 2026 compared to 2025, following a 46% drop the previous year. He argues that LLMs are now acting as the interface for information retrieval, effectively commoditizing 'how-to' knowledge that was previously gated behind book formats.
Why it matters: This shift indicates that the value proposition of information-dense nonfiction is collapsing, forcing authors to prioritize deep storytelling, human connection, and '1,000 true fans' models over broad-market instructional content.
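
The two percentages in the summary compound rather than add. As a rough sketch, assuming both figures are like-for-like year-over-year declines on the same catalog (the source does not state a cumulative number, and the 2026 figure may be a partial-year comparison), the combined drop against 2024 works out to roughly 77%. The helper name `cumulative_decline` is made up for this illustration.

```python
# Hypothetical helper: compound successive year-over-year declines.
def cumulative_decline(*yearly_declines: float) -> float:
    """Return the total fractional decline after applying each yearly decline in turn."""
    remaining = 1.0
    for decline in yearly_declines:
        # Each decline applies to whatever was left after the previous year.
        remaining *= 1.0 - decline
    return 1.0 - remaining


# Figures from the summary: ~46% drop in 2025, then ~57% drop in 2026.
total = cumulative_decline(0.46, 0.57)
print(f"Cumulative decline vs. 2024: {total:.1%}")
```

Put differently, under these assumptions only about 23% of the 2024 print volume would remain (0.54 × 0.43 ≈ 0.23).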

Deep Dive

  • Prescriptive nonfiction sales are experiencing a 'near-vertical drop' in 2025-2026.
  • Users prefer chatbots over books because LLMs provide personalized protocols based on individual constraints (e.g., body weight, schedule).
  • Traditional 'how-to' formats (books, long-form videos, podcasts) are losing their role as primary information distribution channels.
  • 'Voice, taste, and personality' are identified as the only durable competitive moats for creators.
  • There is a growing trend toward 'community-based' learning where the value lies in accountability and human interaction rather than just information delivery.
  • The author suggests focusing on a small, loyal audience (1,000 true fans) rather than chasing algorithm-driven vanity metrics.

Decoder

  • Prescriptive nonfiction: Books designed to provide specific instructions or frameworks for achieving a goal (e.g., health, wealth, productivity).
  • BookScan: A service from Circana that tracks print book sales at the point of sale in the U.S.
  • Run-rate: An extrapolation of current performance to predict future financial results over a longer period.

Original Article

Full article content is not available for inline reading.

Data airesearchenterprise

A frontier without an ecosystem is not stable

Companies must build 'frontier ecosystems' by owning their learning loops to avoid ceding all value and IP to a few dominant AI frontier models.

Summary

What: This piece advocates for a strategy where firms build 'token capital'—AI capabilities owned and refined by the company—alongside their human capital. By using private evaluations, internal reinforcement learning, and institutional knowledge bases, companies can ensure their internal learning compounds, preventing the commoditization of their industry knowledge.
Why it matters: This proposes a stable equilibrium where companies differentiate themselves through proprietary data and workflow refinement, rather than relying on generic models that offer the same capabilities to every competitor.

Deep Dive

  • Humans and AI must form a cognitive loop where human agency drives token capital accumulation.
  • Organizations should treat their workflows and judgment as assets that compound over time.
  • Private 'evals' should measure business-specific outcomes rather than external benchmarks.
  • A firm's institutional memory should be queryable to make token usage more efficient.
  • The goal is to build a 'hill climbing' system that improves with every internal use.
  • Preventing value-capture by a few model providers is essential for political and economic stability.

Original Article

A frontier without an ecosystem is not stable

I’ve been thinking a lot about the future of the firm in an AI-driven economy.

This transition is different than any previous platform shift. In the past, we used digital systems to enhance human capital. This is the first time we can create a real cognitive loop between people and digital systems. That is a mind-bender, because it changes how we even conceptualize work inside an enterprise.

What is at stake is not some digital tool or system and its use, but how organizations continue to learn, build IP, differentiate, and thrive in a world where AI models can continuously absorb the expertise of humans and organizations and commoditize it.

Every company is going to have to build what I think of as human capital and token capital. Human capital comprises the knowledge, judgment, relationships, ingenuity, and pattern recognition of its people, while token capital is the firm’s AI capability it builds and owns.

Importantly, human capital does not become less valuable as token capital grows. It only becomes more valuable! I believe human agency will be the driver of token capital growth. Humans will set ambitious goals, connect dots across domains, build relationships, and recognize patterns that matter most. Without human direction, you have compute running in circles.

This means the real opportunity is not in picking the best model but instead in building a learning loop on top of models where human capital and token capital compound. You can offload a task, or even a job, but you can never offload your learning. The future of the firm is the ability to compound that learning across people and AI.

This requires a new architectural approach where every business is able to build agentic systems that improve over time, while still retaining control over their IP. A company should be able to switch out a “generalist” model without losing the “company veteran” expertise built into their learning system. This is the key “test” of your control and sovereignty in the era ahead.

Companies need to turn their workflows, domain knowledge, and accumulated judgment into AI systems that improve with each use. Private evals should capture whether a model is actually improving against outcomes that matter to the business (not just external benchmarks!). Private reinforcement learning environments should let models grow stronger on real traces from inside the organization. A company knowledge base should make institutional memory queryable and the use of tokens more efficient.
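
One way to picture a "private eval" that survives a model swap is a small harness that scores any model against checks the business itself defines. The following is only a sketch under stated assumptions: the article describes no implementation, and `EvalCase`, `run_private_eval`, the refund-policy cases, and both toy models are hypothetical stand-ins rather than any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: a "model" is any callable from prompt text to answer text.
Model = Callable[[str], str]


@dataclass(frozen=True)
class EvalCase:
    prompt: str
    passes: Callable[[str], bool]  # business-specific check on the output


def run_private_eval(model: Model, cases: list[EvalCase]) -> float:
    """Return the fraction of cases whose output passes its business check."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for case in cases if case.passes(model(case.prompt)))
    return passed / len(cases)


# Hypothetical cases encoding an internal refund policy.
CASES = [
    EvalCase("Refund request, order older than 90 days",
             lambda out: "escalate" in out.lower()),
    EvalCase("Refund request, order 5 days old",
             lambda out: "approve" in out.lower()),
]


def generalist_model(prompt: str) -> str:
    # Toy stand-in for an off-the-shelf model with no company context.
    return "Approve the refund."


def tuned_model(prompt: str) -> str:
    # Toy stand-in for a model refined on internal traces.
    if "older than 90 days" in prompt:
        return "Escalate to a human."
    return "Approve the refund."


scores = {
    name: run_private_eval(model, CASES)
    for name, model in [("generalist", generalist_model), ("tuned", tuned_model)]
}
print(scores)
```

Because the cases live with the company rather than with the model, the same suite can be rerun unchanged whenever the underlying model is replaced.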

This loop becomes the new IP of the firm. I think of it as a hill climbing machine. And unlike most assets, it compounds. Every improved workflow generates better training signal, which accelerates the accumulation of tacit knowledge unique to the firm. The companies that build this early will have an advantage that is hard to replicate, regardless of any new individual model capability.

The last thing any of us want is a world where every company across every sector is ceding value to a few models that eat everything they see. If all the value is accrued by only a few models, the political economy will simply not tolerate it. There is no societal permission for an AI future that hollows out entire industries.

Think about what happened in the first phase of globalization where entire industrial economies were hollowed out by outsourcing. The GDP numbers looked fine on the surface, but the displacement was real and the consequences are still being felt. Let us not bring that dynamic into the AI era, with a small number of AI systems capturing all the economic returns, while entire industries find their knowledge commoditized right out from underneath them.

In my view, our priority has to be building a frontier ecosystem, not just a frontier model, so value flows broadly across every company, every industry, and every country. One where every organization can own the learning loop that encodes its institutional knowledge, compounding its human and token capital.

This is the ethos I’ve grown up with where platforms enable more value on top than is captured inside, and where every company can continuously innovate and build value of its own.

When that happens, companies will create value for themselves and for the economy around them. Employees will see their expertise amplified and their judgment become part of systems that make it replicable and scalable and the benefits accrue to the companies and communities around them.

That is how companies drive value for themselves and the broader economy. And it is the stable equilibrium we should build together.

Data aipolicyenterprise

Linux Foundation Announces OpenSharing Project to Standardize AI Asset and Data Exchange

The Linux Foundation launched OpenSharing, a Databricks-contributed project aiming to standardize the exchange of AI models, agent skills, and data across clouds.

Summary

What: OpenSharing extends the Delta Sharing protocol to support agentic AI assets and unstructured data. It provides vendor-neutral APIs for discovery and access, with support for Apache Iceberg clients.
Why it matters: The industry currently relies on fragmented, proprietary marketplaces; a common standard is essential for the scaling of enterprise agentic workflows.

Decoder

  • Delta Sharing: An open-source protocol for secure data sharing, originally developed by Databricks, that allows organizations to share live data sets without moving them.
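For readers who have not used Delta Sharing, the following is a minimal sketch of the two things a recipient typically holds: a small JSON profile file with the provider's endpoint and a bearer token, and a table address of the form `<profile-path>#<share>.<schema>.<table>`. It illustrates existing Delta Sharing conventions only; how OpenSharing will address agent skills, models, or volumes has not been published in this announcement. The endpoint, token, and share/schema/table names are hypothetical placeholders.

```python
import json
import tempfile
from pathlib import Path


def write_profile(directory, endpoint, bearer_token):
    """Write a Delta Sharing style profile file and return its path."""
    profile = {
        "shareCredentialsVersion": 1,
        "endpoint": endpoint,          # hypothetical provider endpoint
        "bearerToken": bearer_token,   # placeholder, never a real secret
    }
    path = Path(directory) / "example.share"
    path.write_text(json.dumps(profile, indent=2))
    return path


def table_url(profile_path, share, schema, table):
    """Build the '<profile>#<share>.<schema>.<table>' address of a shared table."""
    return f"{profile_path}#{share}.{schema}.{table}"


with tempfile.TemporaryDirectory() as tmp:
    profile_path = write_profile(
        tmp, "https://sharing.example.com/delta-sharing/", "placeholder-token"
    )
    url = table_url(profile_path, "sales_share", "default", "orders")
    loaded = json.loads(profile_path.read_text())
    print(url)

# With the open source `delta-sharing` Python connector installed and a real
# profile, a recipient would pass such a URL to delta_sharing.load_as_pandas(url).
# That call is not executed here because it needs a live sharing server.
```

The point of the design is that the recipient needs only a credential file and an address, not a copy of the data or an account on the provider's platform.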

Original Article

Linux Foundation Announces OpenSharing Project to Standardize AI Asset and Data Exchange

New protocol enables secure, open sharing of Agent Skills, AI models, and data across platforms

Summary

  • The Linux Foundation launched the OpenSharing Project, an open, vendor-neutral protocol for sharing AI assets and data across organizations and platforms.
  • Contributed by Databricks, OpenSharing enables the exchange of agent skills, AI models, and unstructured data through a common standard.
  • The project eliminates reliance on proprietary marketplaces and custom integrations, allowing organizations to securely share AI resources across ecosystems using a single open protocol.
  • OpenSharing expands interoperability across data platforms by serving both Delta Sharing and Apache Iceberg recipients, reducing fragmentation and increasing portability.
  • Hosted by the Linux Foundation, OpenSharing provides community-governed infrastructure designed to accelerate AI collaboration, interoperability and enterprise adoption at scale.

SAN FRANCISCO, June 10, 2026 – The Linux Foundation, the nonprofit organization enabling mass innovation through open source, today announced the launch of the OpenSharing Project, an open, vendor-neutral protocol designed to standardize how organizations share AI assets and data. Hosted by the Linux Foundation and contributed by Databricks, OpenSharing evolves the widely adopted Delta Sharing protocol to meet the requirements of the agentic era, providing the first unified framework for exchanging agent skills, AI models, and unstructured data volumes across disparate platforms.

As enterprises accelerate the deployment of agentic AI, the lack of a standardized exchange protocol has forced organizations to rely on point-to-point integrations or proprietary marketplaces. OpenSharing eliminates these silos by enabling secure, cross-organizational sharing through a single, open protocol. By abstracting underlying storage complexities, the project allows enterprises to publish AI assets and data that can be consumed by anyone, regardless of their specific cloud environment or platform.

“OpenSharing addresses a critical need for a common, vendor-neutral framework that enables organizations to exchange AI assets securely and interoperably across platforms and ecosystems,” said Jim Zemlin, CEO, Linux Foundation. “By bringing this technology to the Linux Foundation, we can foster open collaboration, broad industry participation, and the shared governance needed to accelerate AI innovation at scale.”

A key aspect of OpenSharing is its support for interoperability across multiple open table formats and data-sharing approaches. Building on Delta Sharing’s open connectors, which support a wide range of platforms, OpenSharing adds support for Iceberg REST Catalog (IRC) clients, widening the universe of reachable recipients. This broadens compatibility across data platforms and reduces fragmentation through a more consistent and collaborative sharing model.

“Delta Sharing proved the industry would choose open over locked-in,” said Matei Zaharia, Co-founder and CTO of Databricks. “OpenSharing extends that principle to the full AI stack, while expanding the cross-platform ecosystem to Iceberg recipients and on-premises providers. The agentic era deserves an open foundation, and OpenSharing delivers it.”

More information on how to contribute or integrate the protocol will be available soon at OpenSharing on GitHub and www.opensharing.io.

Supporting Quotes

“OpenSharing provides the open framework our customers need to securely access comprehensive property and location intelligence and predictive risk modeling within their preferred data and AI workflows. By making this data AI-ready and openly available, we’re enabling the financial, insurance, and real estate industries to collaborate and deploy AI against trusted property insights.” — Mark Weaver, Head of AI Partnerships, Cotality

"Healthcare data has always been fragmented. It lives across different systems, organizations, and formats, which makes it difficult to create a complete picture when decisions need to be made. OpenSharing helps remove some of that friction by delivering analysis-ready, AI-ready real-world data directly into the environments where our customers already work. The result is faster access to trusted data, stronger governance, and less time spent moving and managing data before teams can begin generating insights." — Jeff McDonald, CEO, Kythera Labs

“OpenSharing is aligned with our LSEG Everywhere strategy, and helps us deliver trusted, AI-ready financial data and AI assets wherever our customers work, regardless of which cloud, tools, or AI model they want to use it in. We're excited to be helping Databricks shape this offering to serve our joint customers worldwide.” — Ron Lefferts, Divisional CEO of Data and Analytics, LSEG

“Enterprises should not have to choose between keeping sensitive data on-premises and using modern AI and analytics platforms to extract value from it. Native open source OpenSharing in AIStor opens up access to data that cannot move. This creates a foundation for unlocking the vast untapped AI value hidden across enterprise data environments.” — AB Periasamy, Co-Founder and Co-CEO, MinIO

"Financial data is the lifeblood of our customers' operations, and they require the flexibility to analyze it in their preferred environments. Our partnership with Databricks is built on a shared vision of openness. Leveraging OpenSharing natively within Stripe Data Pipeline ensures that our users can securely and effortlessly unlock advanced analytics and AI capabilities on their customer, billing, and transaction data." — Emily Sands, Head of Data and AI, Stripe

About The Linux Foundation

The Linux Foundation is the world’s leading home for collaboration on open source software, hardware, standards, and data. Linux Foundation projects are critical to the world’s infrastructure including Linux, Kubernetes, Node.js, ONAP, OpenChain, OpenSSF, OpenStack, PyTorch, RISC-V, SPDX, Zephyr, and more. The Linux Foundation is focused on leveraging best practices and addressing the needs of contributors, users, and solution providers to create sustainable models for open collaboration.

Design enterprise fintech

Pinterest Bets on Creators with Amazon Storefront Integration

Pinterest is integrating Amazon Storefronts directly into profiles to capture creator-led affiliate commerce and displace AI-generated content with vetted human recommendations.

Summary

What: Pinterest creators can now link their Amazon Storefronts to their profiles, automating affiliate link application for tagged products. This follows previous advertising partnerships with Amazon and Google.
Why it matters: Pinterest is struggling to monetize its bookmarking user base and combat the 'AI slop' that has flooded the platform; by formalizing creator-led shopping, it attempts to shift the user experience back toward intentional, human-curated commerce.

Decoder

  • Amazon Storefront: A personalized landing page on Amazon where creators curate and showcase products, earning affiliate commissions on sales.

Original Article

Pinterest is expanding its partnership with Amazon. The social pinboard site said on Wednesday that it will now serve as a home for creators’ Amazon Storefronts. These online stores allow creators to generate income from affiliate links that direct their fans to products that they often feature in their videos and social media content.

The move gives Pinterest another way to appeal to creators who have largely built their shopping and affiliate businesses on larger platforms like Instagram, TikTok, YouTube, and Facebook. Pinterest says more than half its users visit it to shop, and that it sees more than 80 billion searches per month.

The deal comes as Pinterest works to regain its position as a shopping destination, and responds to user complaints about the growing volume of AI-generated content on the platform.

Pinterest said creators will now be able to use a new tool to connect their Amazon Storefront directly to their Pinterest account. After setup, their affiliate link will be applied automatically whenever they tag an eligible Amazon product, automating the process of promoting items.

These storefronts can then be featured on creators’ Pinterest profiles, allowing their fans and followers to see a broader picture of their recommendations beyond a single Pin or Board.

The deal ties Amazon and Pinterest closer together after the companies introduced their multi-year ads partnership in 2023, which made Amazon the first partner on third-party ads on Pinterest. Pinterest followed this up with a similar advertising deal with Google in 2024, as it continued to struggle to grow revenue for its popular but not fully monetized bookmarking, shopping, and inspiration platform.

However, Pinterest has also faced a barrage of AI content over the past year that has alienated some of its user base and led to many complaints about “AI slop.” The company last year rolled out a series of new tools to fight the AI content and put users in control, but it can only do so much when much of the AI content remains unlabeled.

This partnership offers Pinterest a new way to get back on track. By working with real-world creators, Pinterest could potentially reverse its souring reputation and become known once again as a place to shop and be inspired by other people’s recommendations.

Pinterest notes it will support storefront linking with “other partners” soon.

Design policy web

We stopped clicking, and AI became the Internet

The shift toward zero-click AI responses threatens the economic viability of the open web by decoupling content creators from their audiences.

Summary

What: The article explores how AI-mediated browsing is replacing direct navigation, raising concerns about the loss of human discovery, source transparency, and the financial incentives for original reporting.
Why it matters: This shift marks a fundamental change in the internet's business model, where the value of traffic is increasingly captured by model providers rather than the sources providing the raw data.

Decoder

  • Zero-click experience: An interaction model where search engines or AI provide direct answers on the results page, preventing the user from navigating to the original source website.

Original Article

The internet is increasingly being mediated by AI, bots, and zero-click experiences, reducing the direct connections between readers and creators that once sustained the open web. While AI makes information more accessible and convenient, it also risks weakening curiosity, diversity of thought, and the economic incentives that support original human-created content. The challenge is not that AI has replaced the web, but that it increasingly sits between people and the original sources of knowledge. The piece argues for greater source transparency, AI neutrality, and product designs that encourage discovery and user agency rather than simply delivering the fastest possible answer.

Design mobile

How Videogame UX Quietly Reshaped 2026 Online Design

Online gambling platforms are increasingly adopting aggressive 'gamification' UX patterns from free-to-play video games to drive user retention.

Summary

What: Between 2018 and 2026, online casinos replaced traditional slot-machine interfaces with design elements typical of console games, including daily quests, progression levels, and achievement unlocks.
Why it matters: This shift illustrates the convergence of gambling and gaming design, where behavioral science techniques meant to increase player session length in entertainment are being applied to high-risk financial products.

Decoder

  • Live-service games: Video games that receive ongoing updates and content to keep players engaged over a long period.
  • Gamification: The application of game-design elements and principles in non-game contexts to improve user engagement.

Original Article

Online casinos dramatically transformed their user interfaces between 2018 and 2026 by hiring video game designers and adopting patterns from live-service games. The modern casino experience now features tutorial flows, daily quests, progression systems, and achievement mechanics borrowed directly from console and mobile gaming. While the underlying gambling mechanics remain unchanged, the design framework now mirrors free-to-play games rather than traditional slot machine interfaces.

DevOps cloud ai enterprise

GitLab on Google Cloud: Fully managed, compliant, and AI-ready

GitLab is now available as a fully managed, AI-ready platform on Google Cloud through certified partners.

Summary

What: Organizations can deploy GitLab on Google Cloud with integrated access to Google's Gemini and Gemma AI models within the GitLab Duo Agent Platform, emphasizing data residency and compliance for enterprise users.
Why it matters: Enterprise customers increasingly demand managed cloud deployments that keep their data within specific regulatory boundaries while providing native access to generative AI tools.

Original Article

GitLab can now be deployed as a fully managed platform on Google Cloud through certified MSP partners, giving organizations control over data residency, governance, and compliance while integrating Google's latest Gemini and Gemma AI models into GitLab Duo Agent Platform.

Tech enterprise startup

Elon Musk Becomes the World's First Trillionaire

Elon Musk has officially become the world's first trillionaire following a 20% surge in SpaceX share price after its IPO.

Summary

What: SpaceX shares rose 20% from their initial offering price of $135, pushing Elon Musk's net worth past the $1 trillion mark. His net worth is now equivalent to more than 3% of US gross domestic product.
Why it matters: This milestone reflects the massive capital concentration in the aerospace and private space exploration sector, dwarfing historical wealth benchmarks and signaling the growing scale of vertically integrated tech giants.

Original Article

Elon Musk became the world's first trillionaire on Friday after shares of SpaceX rose 20% from their initial public offering price of $135. Musk was already the world's richest person, claiming the title from Jeff Bezos in 2021 with a net worth of over $185 billion. His worth is now equivalent to more than 3% of the US gross domestic product. It is five million times as large as that of the typical US family.
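The figures in the paragraph above can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not reported data: it assumes the 20% gain is measured exactly from the $135 offering price and that Musk's net worth is taken as exactly $1 trillion, which the article gives only as a threshold he crossed.

```python
from decimal import Decimal

# Share price implied by a 20% rise from the $135 offering price.
ipo_price = Decimal("135")
gain = Decimal("0.20")
implied_price = ipo_price * (1 + gain)
print(implied_price)  # 162.00

# Assumption: net worth of exactly $1 trillion (the article says only "past").
net_worth = 1_000_000_000_000
family_multiple = 5_000_000  # "five million times" a typical US family

# Typical family wealth implied by the article's ratio under that assumption.
implied_family_wealth = net_worth // family_multiple
print(implied_family_wealth)  # 200000
```

Under those assumptions the shares would trade around $162, and the "five million times" comparison implies a typical family net worth of roughly $200,000.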

Tech ai mobile

Apple's New Siri Is Just Good Enough to Ease Its AI Crisis

Apple's revamped Siri update finally reaches functional parity with the AI chatbot landscape as it existed six months ago.

Summary

What: Apple's latest iOS 27 and macOS 27 updates introduce an improved, more competent version of Siri. The integration aims to reduce user reliance on third-party AI assistants by providing preinstalled, reliable functionality.
Why it matters: This shift marks Apple's attempt to stabilize its market position by prioritizing platform integration over leading-edge model breakthroughs, suggesting a strategy that relies on ecosystem stickiness rather than being the first mover in AI capability.

Original Article

Apple's Siri now works the way it's supposed to. The latest update doesn't do anything revolutionary or innovative, but it shows how Apple has finally entered the modern AI market. The new Siri is roughly competitive with where the leading chatbots were about six months ago. Apple's preinstalled functionality is now competent enough that many users won't need competing AI services.

Tech startup hardware

SpaceX's president is floating a Tesla merger as the company begins trading

SpaceX President Gwynne Shotwell has left the door open to a future merger between SpaceX and Tesla.

Summary

What: In comments regarding SpaceX's transition to public trading, Gwynne Shotwell refused to rule out a combination with Elon Musk's other major company, Tesla.

Original Article

SpaceX President Gwynne Shotwell declined to rule out a future merger with Tesla while tempering expectations about timing.

Data devops database

SQL to ER Diagram (Tool)

SQL to ER Diagram is a free, local-only browser tool that generates interactive database diagrams from pasted SQL schemas.

Summary

What: The tool parses `CREATE TABLE` statements to generate visual entity-relationship diagrams entirely within the browser. It does not upload data or require account creation.
Why it matters: Visualizing schema relationships is a frequent task for developers; local-only tools are preferable for sensitive or non-public database designs.
Takeaway: Paste your schema directly into https://sqltoerdiagram.com/ to generate an ER diagram instantly.

Decoder

  • ER Diagram (Entity-Relationship Diagram): A visual representation of data entities and the relationships between them in a database system.

Original Article

SQL to ER Diagram is a free, open source, browser-only tool that turns pasted SQL schemas into clean, interactive ER diagrams without uploading your data or requiring signup.
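To make the idea concrete, here is a minimal sketch of how `CREATE TABLE` statements can be turned into the tables and relationships an ER diagram needs. It is not the tool's actual implementation, which is not described in the article; the sample schema and the `parse_schema` helper are hypothetical, and the regular expressions only handle simple statements with table-level `FOREIGN KEY` clauses (inline `REFERENCES`, quoted identifiers, and composite keys are ignored).

```python
import re

# Hypothetical sample schema used only for illustration.
SCHEMA = """
CREATE TABLE customers (
    id INTEGER PRIMARY KEY,
    email TEXT NOT NULL
);

CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    total NUMERIC,
    FOREIGN KEY (customer_id) REFERENCES customers (id)
);
"""

TABLE_RE = re.compile(
    r"CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?(\w+)\s*\((.*?)\)\s*;",
    re.IGNORECASE | re.DOTALL,
)
FK_RE = re.compile(
    r"FOREIGN\s+KEY\s*\((\w+)\)\s*REFERENCES\s+(\w+)\s*\((\w+)\)",
    re.IGNORECASE,
)
CONSTRAINT_KEYWORDS = ("PRIMARY", "FOREIGN", "UNIQUE", "CHECK", "CONSTRAINT")


def split_top_level(body):
    """Split a table body on commas that are not inside parentheses."""
    parts, depth, current = [], 0, []
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "," and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:
        parts.append(tail)
    return parts


def parse_schema(sql):
    """Return ({table: [columns]}, [(table, column, ref_table, ref_column)])."""
    tables, relationships = {}, []
    for name, body in TABLE_RE.findall(sql):
        columns = []
        for item in split_top_level(body):
            if not item:
                continue
            fk = FK_RE.search(item)
            if fk:
                column, ref_table, ref_column = fk.groups()
                relationships.append((name, column, ref_table, ref_column))
            elif not item.upper().startswith(CONSTRAINT_KEYWORDS):
                columns.append(item.split()[0])  # first token is the column name
        tables[name] = columns
    return tables, relationships


tables, relationships = parse_schema(SCHEMA)
print(tables)
print(relationships)
```

Each table becomes a node and each relationship tuple becomes an edge, which is all a renderer needs to draw the diagram. A production tool would use a real SQL parser rather than regular expressions to cope with dialect differences.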

Design mobile apple

Icon Composer 2 and SF Symbols 8 now available as betas

Apple's new beta design tools introduce Liquid Glass effects and expanded preview capabilities to streamline cross-platform icon development.

Summary

What: Apple released betas for Icon Composer 2 and SF Symbols 8. Icon Composer 2 adds 'Refraction' tools and extended previews, while SF Symbols 8 increases the library to 7,000+ icons with new animation and rendering support.
Why it matters: These tools indicate Apple is standardizing high-fidelity aesthetic effects like 'Liquid Glass' across its ecosystem, forcing designers to adopt these specific rendering patterns to keep app icons compliant with platform standards.

Decoder

  • SF Symbols: A comprehensive library of vector icons provided by Apple, integrated into iOS, macOS, and other platforms to ensure consistent UI iconography.
  • Liquid Glass: A design aesthetic introduced by Apple characterized by depth, realistic refraction, and transparency effects.

Original Article

Apple has released beta versions of SF Symbols 8 and Icon Composer 2 for developers and designers following WWDC26. SF Symbols 8 adds new symbols across Apple's latest operating systems, bringing the total library to more than 7,000 symbols with support for animations, effects, and variable rendering. Icon Composer 2 introduces Liquid Glass design tools such as Refraction for realistic light distortion, improved specular highlights for better layer definition, and an Extended Preview mode that lets designers see how app icons will appear on both current and previous operating systems.

Design web

Capture Webpages as Editable Figma Layers with the Chrome Extension

Figma's new Chrome extension allows direct conversion of live webpages into editable layers, bridging the gap between existing web content and design mockups.

Summary

What: Figma launched a Chrome extension for paid subscribers that captures web elements as editable layers. The release also includes new tab-grouping features for the Figma desktop app.
Why it matters: This move lowers the barrier to entry for non-technical designers to replicate existing web patterns, effectively turning the entire internet into a reusable component library for Figma users.

Original Article

Figma released a Chrome extension that captures webpages as editable layers, allowing users to copy full pages or selected elements and paste them directly into Figma. The feature enables designers to reference and modify web content without requiring coding skills. This beta feature is currently available only on paid plans, with design system integration coming soon.

Design frontend web

CSS Buttons (Website)

This curated repository offers a library of over 100 CSS-styled buttons for developers to integrate into their projects.

Summary

What: CSSButtons.io provides pre-written CSS and HTML snippets for various interactive button styles, focusing on visual aesthetics for web interfaces.

Original Article

A diverse collection of over 100 unique button styles. Get the code you need to enhance your web projects with stylish, functional buttons.

Design policy enterprise

Five Ways the World Cup Ticketing Process is a Complete Design Fail

FIFA's ticketing platform is under state-level investigation for allegations of false advertising and aggressive dynamic pricing tactics.

Summary

What: Regulators in four US states are investigating FIFA's World Cup ticketing system following user complaints about deceptive pricing, seat selection restrictions, and exclusionary experiences for disabled fans.

Original Article

FIFA's World Cup ticketing system has drawn investigations from four US states over complaints of false advertising and inflated prices. The platform frustrates fans with long queues, glitches, blind seat selection, and dynamic pricing that prioritizes revenue over user experience. Disabled fans and groups face additional hurdles, including pricier accessible seating and no guarantee of sitting together.

Design

Dollhouse Oddity Turns Blythe Dolls into Gothic, One-of-a-kind Character Portraits

The studio Dollhouse Oddity creates bespoke, gothic-aesthetic character art by modifying mass-produced Blythe dolls.

Summary

What: Dollhouse Oddity specializes in repainting and restyling Blythe dolls to produce unique, high-detail character portraits with a focus on dark, feminine aesthetics.

Original Article

Dollhouse Oddity is a studio specializing in customizing Blythe dolls into gothic, one-of-a-kind character portraits with dark, spooky-feminine aesthetics.
