Redeploying Fable 5
Anthropic has restored access to its Fable 5 and Mythos 5 models after collaborating with the US government to improve safety classifiers.
Summary
Deep Dive
- Fable 5 and Mythos 5 access was suspended on June 12 due to export controls triggered by a report of safeguard bypasses.
- Anthropic trained an improved safety classifier that blocks the specific bypass technique described in the report in over 99% of cases.
- The company is standardizing jailbreak severity scoring across four criteria: Capability gain, Breadth of capability gain, Ease of weaponization, and Discoverability.
- Future releases of frontier models will include dedicated technical staff working alongside government evaluators during pre-release testing.
- Anthropic is scaling up dedicated teams and compute for joint research with US government partners, building on its engagement around the June 2 Executive Order.
- A new HackerOne program lets researchers submit potential cyber jailbreaks for Fable 5.
Decoder
- Jailbreak: A prompt or set of prompts designed to bypass the safety guardrails and restrictions configured into an AI model.
- Red-teaming: The practice of simulating adversarial attacks against a system to identify weaknesses and vulnerabilities.
- Glasswing: An industry-government partnership program for testing and securing advanced AI capabilities.
Original Article
Redeploying Fable 5
Claude Fable 5 and Mythos 5 redeployed. Access to Claude Fable 5 and Mythos 5 is now restored.
On Friday, June 12, the US government applied export controls to our newest models, Claude Fable 5 and Claude Mythos 5. This required us to restrict access for foreign nationals, whether inside or outside the United States. Because the order took effect immediately and we had no reliable way to verify nationality in real time, we suspended access to both models for all users.
As of today, June 30, the export controls on Fable 5 and Mythos 5 have been lifted.
Fable 5 will be available starting tomorrow, Wednesday, July 1, to users globally on the Claude Platform, Claude.ai, Claude Code, and Claude Cowork. For Pro, Max, Team, and select Enterprise plans, Fable 5 will be included for up to 50% of weekly usage limits through July 7, after which it will be available via usage credits. We will re-enable access on AWS, Google Cloud, and Microsoft Foundry as quickly as possible.
We have also restored access to Mythos 5 for a set of US organizations, following the US government’s approval on June 26. We continue to coordinate with the government to expand access to the broader set of domestic and international partners in the Glasswing program.
In the remainder of this post, we provide further details and updates in four areas:
- A timeline of events, including updates we made to our safeguards. We discuss the events that led to the export control directive and how we addressed it with new safeguards.
- Our general approach to safeguards. We provide more context on how we use safety classifiers to detect potentially dangerous cybersecurity uses of our models.
- A shared industry framework. Although we have reached a constructive resolution, these events have made clear that the industry needs a consistent way to assess and fix potential “jailbreaks” of AI models (techniques that bypass a model’s safeguards). A shared standard for judging the severity of a given jailbreak would help AI developers triage new findings as they arise, launch highly capable models with greater safety, and communicate the level of risk consistently to government and industry partners. Together with Amazon, Microsoft, Google, and other Glasswing partners, we’ve started to develop such a framework, and we outline it below.
- Deeper government collaboration. We’re also strengthening our level of collaboration with the US government on new pre-release testing, information sharing, and research collaboration. We describe this deeper collaboration in the final section.
Timeline and safeguard updates
We released Fable 5 and Mythos 5 on Tuesday, June 9. They both share the same underlying model, but Fable 5 was released with strong safeguards to make it safer for general use. Mythos 5, which has fewer safeguards, was only released to a small number of trusted Project Glasswing partners for use in defensive cybersecurity.
The export control directive on June 12 came after the government became aware of a report in which Amazon researchers had found a method of bypassing Fable 5’s safeguards: prompting it so that it identified a number of software vulnerabilities. In one case, the model produced code demonstrating how the relevant vulnerability could be exploited. Over the past two weeks, we have worked closely with the government and other partners, including Amazon, to review the report and evidence.
Our testing confirmed that many less capable models—including Claude Opus 4.8, GPT-5.5, and Kimi K2.7—could identify the same vulnerabilities as Fable 5 did in the report. When it came to the demonstration of how to exploit the single vulnerability, every model we tested could produce the same demonstration as Fable 5 (including Claude Haiku 4.5, Sonnet 4.6, Opus 4.6, Opus 4.7, Opus 4.8, GPT-5.4, GPT-5.5, and Kimi K2.7).
Importantly, the reported technique did not expose any unique Mythos-level cyber capabilities. The behavior reflected a borderline case for Fable 5’s safeguards—as we will explain below, there are some tasks that are unlikely to be dangerous but are nonetheless blocked by the safeguards out of an abundance of caution. The reported technique allowed access to one such behavior, but it only involved routine defensive cybersecurity work.
Even so, we moved quickly to address the reported bypass. Working closely with the government, we trained an improved safety classifier that targets and blocks the behavior described in the report. Users will be notified if a request to Fable 5 is blocked, and the request will instead be sent to Opus 4.8.
The new classifier means that the specific technique described in the Amazon report is blocked in over 99% of cases. In a very small fraction of cases the model may provide information that isn’t detailed enough to help a cyberattacker. As we describe below, the model’s safeguards are not expected to block all low-risk routine cyberdefense capabilities—just those that are potentially harmful. Researchers from the US Department of Commerce’s Center for AI Standards and Innovation (CAISI) have tested both our prior and new safeguards and agree that they are extraordinarily strong.
The new classifier also comes at the cost of flagging benign requests more often during routine coding and debugging tasks. As with all our safeguards, we’ll continue to refine this to better distinguish genuine misuse from legitimate requests and reduce false positives.
Our approach to cybersecurity safeguards
Claude Mythos 5 can be used to find and exploit software vulnerabilities more effectively than any other model—and all but the most skilled human security experts. These prodigious cybersecurity capabilities make it uniquely attractive to malicious actors who wish to misuse it in cyberattacks.
Claude Fable 5, however, provides no such unique offensive capabilities. This is because we launched it with the strongest safeguards we’ve ever applied to a model. In the month prior to launch, we transferred staff from various teams within Anthropic to double the number of researchers and engineers working on this problem.
Fable 5 launched with a variety of safety mechanisms, each of which alone does not provide perfect defense but when combined make the model very difficult to misuse (an approach known as “defense in depth”). Some defenses involve training the model to decline to assist with dangerous requests; others involve retroactively analyzing patterns of misuse.
One particularly important safety mechanism involves classifiers—smaller automated AI systems that, during an interaction, detect when the model is asked to perform a potentially harmful cybersecurity task (or produces potentially harmful outputs). When this occurs, the classifiers block the model from responding to requests. The ultimate goal of these classifiers is to prevent the model from engaging in uniquely dangerous behaviors.
Like all safety mechanisms, classifiers can make mistakes. They sometimes fail to notice potentially dangerous content, and in some cases they can be deliberately “jailbroken”: users can prompt the model in unusual ways to trick the classifiers and get the model to produce harmful outputs that the system should have blocked.
We therefore deliberately set the safety classifiers to trigger on a set of requests that we know are likely benign. This “safety margin” approach means that a request has to look very clearly safe to avoid triggering the classifier. Users experience the safety margin as a model refusing to respond to some reasonable, non-harmful requests.
For Fable 5, we made this safety margin much larger than in any prior launch, meaning that many more benign requests would be blocked. We understood that these kinds of false positives would be frustrating for users, but made this tradeoff in the interest of making the model’s other capabilities widely available.
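One way to picture the safety margin is as a lowered blocking threshold. The sketch below is only an illustration under stated assumptions: the harm score, the threshold, the margin values, and the names `MarginClassifier` and `false_positive_count` are all hypothetical and do not describe how the actual classifiers work.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MarginClassifier:
    """Blocks a request when its harm score enters the safety margin."""

    harm_threshold: float  # hypothetical score at which a request looks harmful
    safety_margin: float   # hypothetical extra band below the threshold that is also blocked

    def is_blocked(self, harm_score: float) -> bool:
        # Block anything inside the margin as well as anything clearly harmful.
        return harm_score >= self.harm_threshold - self.safety_margin


def false_positive_count(classifier: MarginClassifier, benign_scores: Iterable[float]) -> int:
    """Count benign requests that the classifier would still block."""
    return sum(1 for score in benign_scores if classifier.is_blocked(score))


# Illustrative scores for requests that are all benign.
BENIGN_SCORES = [0.10, 0.30, 0.45, 0.60, 0.75]

narrow_margin = MarginClassifier(harm_threshold=0.8, safety_margin=0.1)
wide_margin = MarginClassifier(harm_threshold=0.8, safety_margin=0.4)
```

With these made-up numbers, widening the margin blocks more of the benign requests, which is the tradeoff the paragraph above describes.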
The safety margin also helps mitigate jailbreaks. Many jailbreaks are narrow: they unblock a very specific model behavior but nothing more. In some cases, a hypothetical user can jailbreak the model in a minor way and intrude into the safety margin (or sometimes into ambiguously harmful behavior) without reaching the core harmful behaviors that we aim to block. Our view is that the jailbreaks of Fable 5 reported so far fit into this minor category.
More serious jailbreaks unblock more harmful behaviors. Narrow harmful jailbreaks can elicit some specific harmful behaviors. These jailbreaks are typically of low to moderate severity, because the narrowness limits the attacker. The most concerning category is a universal jailbreak, which unblocks a wide range of harmful behaviors.
As we noted when we launched Fable 5, it is probably impossible to make any AI model fully robust (that is, impervious) to jailbreaks. We expect that some jailbreaks will be found for our models, and that they will vary in severity: there will be many minor jailbreaks and some narrow harmful ones. No universal jailbreaks for Fable 5 have been discovered at the time of writing, but expert safety researchers continue to red-team it. We seek to ensure that we and our safety partners will be the first to find major jailbreaks and fix them before malicious actors can use them for harm.
The cautious approach outlined above means that the vast majority of jailbreaks will not successfully unblock dangerous behaviors. Our classifiers make successful jailbreaks very costly and high-effort to produce, and even if a jailbreak is successful, our extra layers of defense provide additional mitigation. We’ll continue to update our classifiers as we learn more about novel jailbreak techniques.
A consensus industry framework for jailbreaks
There’s currently no consensus in the AI industry on how to describe, in objective terms, the severity of an AI jailbreak. This adds a great deal of uncertainty whenever a new jailbreak technique is discovered: developers have no agreed-upon standard for which findings to focus on most urgently, and governments have no agreed-upon standard for when to act.
This problem will become more acute in the coming months, as more models with powerful cybersecurity (and other) capabilities are trained, assessed, and released. A common standard for assessing AI jailbreaks would help us and other companies launch new models safely, as well as allow our users to make the most of their advanced capabilities.
We are therefore partnering with Amazon, Microsoft, Google, and other Glasswing partners to draft a consensus framework for assessing the severity of AI jailbreaks and how AI developers should respond to them. We invite other industry partners and model providers to join us in this effort.
Our current proposal is to score a given jailbreak on the four different criteria below. The first two describe what the jailbreak provides to the attacker; the latter two describe how quickly the jailbreak can become a real-world problem:
- Capability gain. How far beyond existing tools does the jailbreak take the user? If existing widely available tools (including other, weaker AI models) can reach the same capability as the jailbroken model, the score here will be low; if the jailbreak unblocks model capabilities that can significantly accelerate even domain experts, the score will be high.
- Breadth of capability gain. For how many distinct offensive tasks does the same jailbreak technique work? Cases where the jailbreak only allows the model to pursue narrow targets will score low; cases where the same jailbreak technique works for multiple different targets or techniques will score high.
- Ease of weaponization. How much human effort does it take to turn the jailbreak into an attack? Where the jailbreak involves a great deal of skilled prompting and many retries, the score will be low; where the jailbreak works on a single prompt or on the first or second try, the score will be high.
- Discoverability. How easy is it for someone to obtain the technique? If it requires specialist knowledge it will score low; if it is already widely known and available online it will score high.
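The four criteria above could be recorded in a small data structure like the sketch below. The post only describes scores as "low" or "high", so the 1 to 5 scale, the grouping helpers, and the plain-sum aggregate are assumptions made for illustration, not part of the proposed framework.

```python
from dataclasses import dataclass, fields

MIN_SCORE, MAX_SCORE = 1, 5  # hypothetical scale; the source only says "low" and "high"


@dataclass(frozen=True)
class JailbreakScore:
    capability_gain: int
    breadth_of_capability_gain: int
    ease_of_weaponization: int
    discoverability: int

    def __post_init__(self) -> None:
        # Reject scores outside the assumed scale.
        for field in fields(self):
            value = getattr(self, field.name)
            if not MIN_SCORE <= value <= MAX_SCORE:
                raise ValueError(
                    f"{field.name} must be in [{MIN_SCORE}, {MAX_SCORE}], got {value}"
                )

    def what_it_provides(self) -> int:
        """First two criteria: what the jailbreak gives the attacker."""
        return self.capability_gain + self.breadth_of_capability_gain

    def how_quickly(self) -> int:
        """Last two criteria: how fast it can become a real-world problem."""
        return self.ease_of_weaponization + self.discoverability

    def total(self) -> int:
        # Assumed aggregation: a plain sum. The source does not say how scores combine.
        return self.what_it_provides() + self.how_quickly()


# Illustrative findings, not real assessments.
narrow_finding = JailbreakScore(1, 1, 2, 2)
universal_finding = JailbreakScore(5, 5, 4, 4)
```

Keeping the two halves separate mirrors the distinction in the text between what a jailbreak provides and how quickly it can cause harm, even if a real framework combines them differently.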
We propose to use this severity framework to calibrate our response to newly discovered jailbreaks. For the most severe class of jailbreaks (e.g., a jailbreak that, among other characteristics, is being used to actively cause a devastating impact on critical power grids or banking systems), we will immediately begin deploying preliminary mitigations upon confirmation of severity. We are also creating a team to provide 24/7 monitoring of key jailbreak submission channels.
Any method of scoring jailbreaks will be imperfect. Still, there is value in being able to communicate the approximate severity of a given finding through a common framework. This is a work in progress; as we receive feedback from more partners, we expect the framework to evolve over time.
We expect to share more details on the proposed framework soon. In the meantime, we’re also launching a new HackerOne program where security researchers can submit potential cyber jailbreaks they’ve discovered in Fable 5 (once available) for our review.
Partnering with the US government on frontier AI security
Over the past ten weeks, Anthropic has worked closely with the US government as it developed the approach reflected in the June 2 Executive Order on Promoting Advanced Artificial Intelligence Innovation and Security. Our engagement spanned the Office of the National Cyber Director, the Office of Science and Technology Policy, the Department of the Treasury, the Department of Commerce (including CAISI), and relevant national security agencies.
We are committed to continuing that work, building on nearly two years of pre-existing collaborations with US government partners on pre-deployment testing and evaluation. The commitments below reflect both that pre-existing work and our new proposals to scale up our government collaboration as the above framework is finalized:
- Pre‑release government access and evaluation. For models that materially advance the capability frontier in areas relevant to national security, we will provide designated government partners with expanded early access to both the models and the safeguards that accompany them. Those partners can then run independent capability evaluations and test our guardrails before broad release. We will dedicate Anthropic technical staff to work alongside government evaluators during these testing periods.
- Rapid information sharing on safeguards. When significant jailbreaks or misuse patterns are identified, we will quickly investigate, triage, and notify appropriate government counterparts. We will share the new safeguards we build in response so they can be independently tested. We will also provide government partners with our threat intelligence reporting in advance of publication and participate in the interagency cybersecurity vulnerability clearinghouse established under Sec. 2(d) of the June 2 Executive Order.
- Dedicated resources for joint research. We are substantially scaling up joint work with government partners on AI security. We will stand up dedicated Anthropic teams to work on shared government priorities, provide a significant compute allocation to support government testing and research, and make our safety and red‑teaming expertise available to help advance the state of the art in AI evaluation.
- A common industry bar. We will work with the government and with industry peers toward a shared, voluntary security and evaluation standard for frontier model providers. We’ll contribute evaluations, tooling, and best practices that the government can apply across the field.
Our hope is that this collaboration, along with our proposed consensus industry framework, will serve as the basis for systematic rules for the whole industry—and even offer the beginnings of a template for effective global coordination on the risks and benefits of AI.
These rules should be codified in strong regulation and applied equally across frontier model developers. Government involvement in AI releases requires a durable, transparent process that gives cyber defenders and others the certainty they need about access to powerful models.
We look forward to deepening our government collaboration in the ways we’ve described above. We’re also grateful to our users for bearing with us through this disruption, and to the researchers and industry partners who worked alongside us to make Fable 5 and Mythos 5 available again.
Footnotes
- For standard Enterprise seats, there is no included Fable 5 allowance, although you can get access through usage credits. If credits are not enabled, your users will not have access to Fable 5. For premium Enterprise seats, through July 7, Fable 5 is included in your subscription. It draws from each member's seat usage at no additional cost. After July 7, your team can continue using Fable 5 by enabling usage credits. If credits are not enabled, your users will no longer have access to Fable 5.
- Note that sometimes the term “bypass” is itself used instead of “jailbreak.” For current purposes, we consider these to be synonyms, but for the remainder of this article we use “jailbreak” because (a) this is a more commonly used term and (b) it is consistent with the terminology we have used in previous work.
- Analogously, no piece of software is immune to vulnerabilities (though in general, software vulnerabilities are more straightforwardly discovered and patched than LLM jailbreaks).
- In other areas of security research, there are agreed-upon standards: for example, the Common Vulnerability Scoring System (CVSS) is a common way of assessing the severity of a given software vulnerability.
How OpenAI Delivers Low-Latency Voice AI for 900M Users
OpenAI scaled its real-time voice infrastructure by splitting WebRTC into a stateless packet relay and a stateful transceiver to overcome Kubernetes limitations.
Summary
Deep Dive
- Kubernetes/WebRTC Conflict: Standard WebRTC setups require stable IP/port mappings, clashing with the dynamic, disposable nature of Kubernetes pods.
- The Split Architecture: OpenAI separates the stack into a stateless relay (routing) and a stateful transceiver (handling ICE/DTLS/SRTP).
- Routing via ICE ufrag: To avoid external database lookups on the hot path, OpenAI embeds routing hints into the ICE ufrag field during initial signaling, enabling the relay to direct the first packet to the correct transceiver.
- Performance Optimizations: Used Go's userspace networking, SO_REUSEPORT for kernel-level load balancing, and runtime.LockOSThread to maintain CPU cache locality.
- Avoidance of Kernel Bypass: The team opted for optimized Go code over complex kernel-bypass frameworks, finding it sufficient for their latency requirements.
- Redis Integration: A Redis cache stores established mappings to allow for fast state recovery if a relay pod restarts.
Decoder
- WebRTC: An open standard, with open-source implementations, for real-time audio/video communication directly between browsers or applications.
- ICE (Interactive Connectivity Establishment): A framework for finding the best path for two devices to connect through firewalls and NATs.
- ufrag (Username Fragment): A component of ICE credentials exchanged during session setup, repurposed here as a routing key.
- SFU (Selective Forwarding Unit): A server architecture that receives a media stream from one participant and selectively forwards it to others; commonly used in video conferencing like Zoom.
- STUN (Session Traversal Utilities for NAT): A protocol used to discover public IP addresses and NAT behavior.
- SRTP (Secure Real-time Transport Protocol): Provides encryption, message authentication, and integrity for voice and video traffic.
- SO_REUSEPORT: A Linux socket option that allows multiple processes or threads to bind to the same port, improving load distribution.
Original Article
How OpenAI Delivers Low-Latency Voice AI for 900M Users
OpenAI runs voice AI for 900 million users a week, and they use WebRTC for it because the alternative would mean reinventing how the internet handles live audio.
The catch is that WebRTC was designed for servers with stable IPs and ports, and Kubernetes treats those addresses as disposable. The conventional answer at this scale is an SFU, which suits multiparty workloads like group video calls, but OpenAI’s traffic is overwhelmingly one user talking to one model.
To deal with this, their architecture splits the stack into two pieces:
- A stateless relay handles protocol-aware packet routing at the geographic edge.
- A stateful transceiver owns all the heavy WebRTC state.
The trick that ties them together is using the ICE ufrag, a field the protocol already exchanges during setup, as a routing key that the relay can read off the first packet of a new session. Everything else, from Global Relay to the userspace Go implementation to the Redis cache and the careful socket-level optimizations, builds on top of that core idea.
Why Latency Matters For Voice AI
Voice AI either feels like a conversation or it feels like a walkie-talkie. The line between those experiences is measured in milliseconds.
When the network pauses between hearing a user and responding, the illusion breaks. Pauses turn awkward, interruptions get clipped, and users are compelled to cut off the AI mid-sentence, which is also kind of rude. In other words, voice AI only feels natural if the conversation moves at the speed of speech.
The harder constraint underneath is the continuous-stream property. Audio has to arrive at the model as a steady flow, rather than as a single upload after the user finishes talking. That stream is what lets the model start transcribing, reasoning, and calling tools while the user is still speaking. The experience collapses into push-to-talk once it breaks.
For OpenAI specifically, those constraints translate into three concrete requirements:
- The system has to reach 900 million weekly active users wherever they are.
- Connection setup has to be completed quickly enough that users can start speaking as soon as a session begins.
- Round-trip time for audio has to stay low and stable so turn-taking feels crisp.
WebRTC is the protocol the industry built for this kind of work. It is a bundle of smaller protocols (ICE for figuring out how two endpoints reach each other across firewalls, DTLS for encrypting the channel, SRTP for the audio packets, and RTCP for quality feedback).
The Original Architecture
The first version of OpenAI’s WebRTC infrastructure was a single Go service built on Pion. It handled both jobs in one place:
- On the signaling side, the service negotiated SDP (the format clients and servers use to describe a session), selected codecs, generated ICE credentials, and set up sessions.
- On the media side, the service terminated WebRTC connections from clients and maintained upstream connections to the backend services that run the AI models, including inference, transcription, speech generation, tool use, and orchestration.
That combined service still powers ChatGPT voice, the Realtime API’s WebRTC endpoint, and several research projects, and it has handled that work well. The question OpenAI ran into was how to deploy it on Kubernetes, the container orchestration system that runs most modern cloud infrastructure.
Kubernetes assumes compute is cheap and movable. Pods come up, get scheduled wherever capacity exists, run for a while, then get rescheduled or replaced. Standard WebRTC deployment patterns assume the opposite. That mismatch shows up in two specific places.
The first is port exhaustion. The conventional way to deploy WebRTC uses one UDP port per session. At OpenAI’s scale, that means tens of thousands of public UDP ports per service. Cloud load balancers were built for a handful of well-known ports, so each additional range adds operational complexity for load balancer config, health checks, firewall policy, and rollout safety. The exposed surface area also makes security audits harder. Kubernetes autoscaling clashes with the requirement to reserve large and stable port ranges, which makes elasticity brittle.
The second is state stickiness. Running one UDP port per server and demultiplexing sessions behind it solves the port problem. ICE and DTLS, however, are stateful protocols. The process that started a session has to keep receiving its packets to validate connectivity checks, complete the DTLS handshake, decrypt SRTP, and process later session changes like ICE restarts. If a packet for an existing session lands on a different process, setup fails, or media breaks.
Splitting The Relay From The Transceiver
The architecture OpenAI shipped splits packet routing from protocol termination.
A stateless relay sits at the front, presenting a small public footprint to the internet. A stateful transceiver sits behind it, owning all the heavy WebRTC state. Signaling still goes directly to the transceiver. Media enters through the relay first.
The relay’s scope is deliberately narrow. It reads enough of each packet to choose a destination, then forwards the rest as an opaque payload. Audio stays encrypted on the way through, ICE state machines stay with the transceiver, and codec negotiation happens elsewhere. From a client’s perspective, the WebRTC session looks normal in every way.
The transceiver owns the parts of WebRTC that have to remember things. ICE connectivity checks, the DTLS handshake, SRTP encryption keys, and the session lifecycle all live there. The transceiver is the endpoint that completes the handshakes and encrypts or decrypts the actual media.
An SFU, or Selective Forwarding Unit, is the standard media server architecture for WebRTC at scale. It terminates one WebRTC connection per participant and selectively forwards streams between them. The AI joins as another participant. This works well for inherently multiparty products like group calls, classrooms, and collaborative meetings. OpenAI’s workload looks different. Most sessions are 1:1, with one user talking to one model. For that kind of traffic, the SFU model adds overhead and forces backend services to behave like WebRTC peers themselves. The transceiver model lets the backend stay an ordinary service.
TURN was also considered and set aside. TURN is the standard protocol-terminating relay used for NAT traversal. The trouble is that TURN allocations add setup round trips before media can flow, and migrating or recovering them across servers is hard. For a latency-sensitive workload, those extra round trips matter.
Routing The First Packet
The first packet of any new session is the difficult one. Subsequent packets are easy because the relay has a mapping that says that packets from this source IP and port go to this transceiver. The first packet is what creates that mapping, so the relay has to figure out where to send it from the packet itself.
OpenAI generates the server-side ufrag during signaling. They can put whatever they want in it, so they encode routing metadata into it. The relay parses just enough of the first STUN binding request to read the ufrag, decode the routing hint, and forward the packet to the transceiver that owns the session. Every packet after the first one flows through the established session mapping, which skips the ufrag parsing step entirely.
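The first-packet routing step can be sketched as follows. The article does not describe how the routing hint is encoded, so the hex layout used here (IPv4 address plus port, followed by random characters) and every function name are hypothetical. The STUN header layout and the USERNAME attribute, which in an ICE connectivity check carries the receiver's ufrag before the colon, follow the published STUN and ICE specifications.

```python
import ipaddress
import secrets
import struct
from typing import Optional, Tuple

STUN_BINDING_REQUEST = 0x0001
STUN_MAGIC_COOKIE = 0x2112A442
STUN_ATTR_USERNAME = 0x0006
STUN_HEADER_LEN = 20


def make_server_ufrag(transceiver_ip: str, transceiver_port: int) -> str:
    """Hypothetical encoding: 12 hex chars of route, then 8 random hex chars."""
    route = ipaddress.IPv4Address(transceiver_ip).packed + struct.pack("!H", transceiver_port)
    return route.hex() + secrets.token_hex(4)


def decode_route(server_ufrag: str) -> Tuple[str, int]:
    """Recover (ip, port) from the hint; raises ValueError if it is malformed."""
    route = bytes.fromhex(server_ufrag[:12])
    if len(route) != 6:
        raise ValueError("routing hint too short")
    (port,) = struct.unpack("!H", route[4:])
    return str(ipaddress.IPv4Address(route[:4])), port


def stun_username(packet: bytes) -> Optional[str]:
    """Return the USERNAME attribute of a STUN binding request, or None."""
    if len(packet) < STUN_HEADER_LEN:
        return None
    msg_type, msg_len, cookie = struct.unpack("!HHI", packet[:8])
    if msg_type != STUN_BINDING_REQUEST or cookie != STUN_MAGIC_COOKIE:
        return None
    offset = STUN_HEADER_LEN
    end = min(len(packet), STUN_HEADER_LEN + msg_len)
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", packet[offset:offset + 4])
        value = packet[offset + 4:offset + 4 + attr_len]
        if attr_type == STUN_ATTR_USERNAME:
            if len(value) != attr_len:
                return None  # truncated attribute
            return value.decode("utf-8", errors="replace")
        offset += 4 + attr_len + (-attr_len % 4)  # attribute values are padded to 4 bytes
    return None


def route_first_packet(packet: bytes) -> Optional[Tuple[str, int]]:
    """Pick a transceiver for a session's first packet using only the packet itself."""
    username = stun_username(packet)
    if username is None:
        return None
    server_ufrag = username.split(":", 1)[0]  # USERNAME is "<server ufrag>:<client ufrag>"
    try:
        return decode_route(server_ufrag)
    except ValueError:
        return None


def build_binding_request(username: str) -> bytes:
    """Demo helper: a minimal binding request carrying only USERNAME."""
    value = username.encode()
    attr = struct.pack("!HH", STUN_ATTR_USERNAME, len(value)) + value + b"\x00" * (-len(value) % 4)
    header = struct.pack("!HHI", STUN_BINDING_REQUEST, len(attr), STUN_MAGIC_COOKIE)
    return header + secrets.token_bytes(12) + attr
```

Hex characters stay within the alphabet ICE allows for a ufrag, which is why this sketch avoids padded base64. The sketch also skips message integrity checks, which a relay of this kind could leave to the transceiver.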
Each transceiver in the fleet listens on a shared UDP socket, which is one operating system endpoint bound to an internal IP and port. All sessions for that transceiver multiplex behind it.
The relay’s state is purposefully tiny. It holds an in-memory map of source address to transceiver destination, plus some counters for monitoring and timers for session cleanup. If a relay instance restarts and loses the mapping, the next STUN packet rebuilds it from the ufrag. To make recovery faster, a Redis cache holds the source-to-destination mapping once a route is established. A restarted relay can look up the mapping from Redis immediately.
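A minimal sketch of that lookup path, assuming a plain dictionary stands in for the Redis cache and a caller-supplied function decodes the routing hint from a STUN packet. The class name, the lookup order (memory, then cache, then packet), and the omission of counters and cleanup timers are choices made for this illustration.

```python
from typing import Callable, Dict, Optional, Tuple

Addr = Tuple[str, int]


class RelayFlowTable:
    """Maps a client's source address to the transceiver that owns its session."""

    def __init__(
        self,
        cache: Dict[Addr, Addr],
        decode_first_packet: Callable[[bytes], Optional[Addr]],
    ) -> None:
        self._flows: Dict[Addr, Addr] = {}  # in-memory map, lost on restart
        self._cache = cache                 # stand-in for the shared Redis cache
        self._decode = decode_first_packet  # hypothetical ufrag-based decoder

    def destination(self, source: Addr, packet: bytes) -> Optional[Addr]:
        dest = self._flows.get(source)
        if dest is None:
            # Fast recovery after a restart: reuse the cached route if one exists.
            dest = self._cache.get(source)
        if dest is None:
            # New session, or nothing cached: rebuild the route from the packet.
            dest = self._decode(packet)
            if dest is not None:
                self._cache[source] = dest
        if dest is not None:
            self._flows[source] = dest
        return dest
```

Because the cache outlives any single relay instance, a freshly started `RelayFlowTable` can route media packets for an existing session before the next STUN packet arrives.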
Global Relay and Geo-Steered Signaling
Global Relay is OpenAI’s fleet of geographically distributed relay ingress points. Geographic distribution shortens the first client-to-OpenAI hop. A packet entering the network at a relay close to the user, in both geography and network topology, has a much easier time than a packet that has to traverse the public internet to reach a distant region first.
OpenAI uses Cloudflare for geographic and proximity steering on the signaling side. The initial HTTP or WebSocket request that sets up a session is routed to a nearby transceiver cluster. The request context then determines which Global Relay ingress point gets advertised back to the client in the SDP answer.
The Go Relay Implementation
The relay is a Go service running in userspace, which is to say a regular process that reads from a regular UDP socket. OpenAI evaluated kernel-bypass frameworks and chose to stay away. Bypass raises packet throughput at the cost of operational complexity. The team’s workload fit inside what a careful Go implementation could handle.
Three implementation choices carry most of the performance load:
- SO_REUSEPORT is a Linux socket option that lets multiple workers on the same machine bind the same UDP port. The kernel then distributes incoming packets across those workers, which removes the bottleneck of a single read loop.
- runtime.LockOSThread pins each UDP-reading goroutine to a specific OS thread. Combined with SO_REUSEPORT, this tends to keep packets from the same flow on the same CPU core, which helps cache locality and reduces context switching.
- Pre-allocated buffers and minimal copying during packet parsing keep allocation overhead low and avoid putting pressure on Go’s garbage collector.
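Two of these choices can be illustrated outside Go. The sketch below uses Python's standard `socket` module, assuming a Linux host where `socket.SO_REUSEPORT` is available; the function names and the 2048-byte buffer size are hypothetical. It does not cover `runtime.LockOSThread`, which is specific to the Go runtime.

```python
import select
import socket
from typing import List, Tuple

MAX_DATAGRAM = 2048  # assumed upper bound on packet size for this sketch


def open_reuseport_sockets(count: int, host: str = "127.0.0.1") -> Tuple[List[socket.socket], int]:
    """Bind `count` UDP sockets to one shared port using SO_REUSEPORT."""
    sockets: List[socket.socket] = []
    port = 0
    for _ in range(count):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        # The option must be set on every socket before it binds.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))       # the first bind picks a free port
        port = sock.getsockname()[1]  # later sockets reuse that same port
        sockets.append(sock)
    return sockets, port


def read_once(sockets: List[socket.socket], buffers: List[bytearray], timeout: float = 1.0):
    """Read one datagram from each ready socket into its pre-allocated buffer."""
    ready, _, _ = select.select(sockets, [], [], timeout)
    results = []
    for sock in ready:
        buf = buffers[sockets.index(sock)]
        nbytes, source = sock.recvfrom_into(buf)  # fills buf in place
        results.append((memoryview(buf)[:nbytes], source))  # a view, not a copy
    return results
```

In a real worker pool each socket would have its own reader thread or process. A single `select` loop is used here only to keep the example short.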
Design Tradeoffs
- The whole design is built around 1:1 sessions. If OpenAI ever wants to add multiparty features, large parts of this architecture would probably need rework.
- OpenAI also took on a custom infrastructure burden. A standard SFU comes with documentation, a community, and battle-tested patterns. The relay, the transceiver, and the coordination between them are all internal code.
- The “stateless” relay turns out to be stateless mostly in spirit. It holds an in-memory flow table and uses a Redis cache to recover that table across restarts.
- The ufrag trick depends on controlling both ends of signaling. A team that uses an off-the-shelf signaling stack might find this technique harder to adapt directly.
Conclusion
The architecture OpenAI built for voice AI is a careful response to a specific pressure. WebRTC was designed for stable servers. Modern cloud infrastructure runs on the opposite assumption. OpenAI’s team had the protocol depth to determine whether a WebRTC session needs to live in one process at all.
Their answer separates the work into two pieces. A stateless relay forwards packets near the user. A stateful transceiver, anchored in one place, owns ICE, DTLS, SRTP, and the session lifecycle. The two pieces communicate through information that’s already in the WebRTC handshake, which keeps the routing decision on the packet path itself.
Most AI Work Can Wait
Engineers should prioritize AI routing systems over specific model choices to dramatically reduce costs through local models and asynchronous batch inference.
Summary
Deep Dive
- Skill classifier: Identifies the intent (e.g., summary, migration).
- Router: Assigns the task to a specific model tier based on complexity and context.
- Model selector: Picks the cheapest model meeting the confidence threshold.
- Queueing allows for async inference, which is two orders of magnitude cheaper than real-time.
- Closed-loop evaluation systems (nightly traces) allow the router to update its own weights automatically.
Decoder
- Async Batch Reasoning: Processing AI requests in a non-real-time queue to access cheaper, higher-throughput inference resources.
- Skill Distillation: The process of training smaller, specialized local models to perform specific tasks as well as large, general-purpose frontier models.
Original Article
In short : Prioritize routing over model choice. Most AI work runs on cheap local models.
Most teams building agents pick the model first & the architecture second. That is backwards. The model choice is the last decision, not the first.
What matters is the router, a small piece of code that decides which tier of model handles each request. Get the router right & 70-80% of traffic runs on local models that cost nothing per call, or on async models that reduce AI spend by 90%+.
Brian Armstrong made the same point last week about how Coinbase cut AI spend in half while token usage grew, paraphrasing :
How to keep AI spend flat while token usage grows exponentially : not with friction & spend alerts. With better defaults, routing, & caching. Engineers can choose any model they want, but defaults matter.
The routing problem has three layers, and each does a distinct job :
- Skill classifier turns a raw user request into a concrete operation. It answers what the task is. Draft-a-reply, summarize-a-repo, run-a-migration. The classifier is intent recognition.
- Router decides which tier executes the classified operation. It answers which model runs it. The router does not read the prompt. It reads the classifier’s label plus a few features : complexity, context size, historical success rate.
- Model selector picks the cheapest model within a tier that meets a confidence threshold.
Classifier & router are not the same. The classifier is a language problem ; the router is a scheduling problem. Conflating them buries the model choice inside the prompt & kills the ability to A/B different models against the same operation.
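A minimal sketch of the three layers, assuming a keyword stub in place of a real classifier. Every label, tier name, model name, price, and confidence value below is a hypothetical placeholder, and `needsRealtime` is an assumed feature not named in the article. The point is only the separation: the router never sees the prompt, and the model choice is the last step.

```javascript
// All tiers, models, prices, and thresholds below are hypothetical placeholders.
const TIERS = {
  local: [{ name: 'local-small', costPerCall: 0, confidence: 0.7 }],
  async: [
    { name: 'batch-medium', costPerCall: 0.01, confidence: 0.85 },
    { name: 'batch-large', costPerCall: 0.05, confidence: 0.95 },
  ],
  realtime: [{ name: 'frontier', costPerCall: 1.0, confidence: 0.99 }],
};

// Layer 1: skill classifier. A keyword stub stands in for real intent recognition.
function classify(request) {
  const text = request.toLowerCase();
  if (text.includes('migration')) return 'run-a-migration';
  if (text.includes('summarize')) return 'summarize-a-repo';
  return 'draft-a-reply';
}

// Layer 2: router. It reads only the label and a few features, never the prompt.
function route(label, features) {
  if (features.needsRealtime) return 'realtime';
  if (label === 'run-a-migration' || features.complexity > 0.7) return 'async';
  return 'local';
}

// Layer 3: model selector. Cheapest model in the tier that clears the threshold.
function selectModel(tier, threshold) {
  const eligible = TIERS[tier].filter((m) => m.confidence >= threshold);
  eligible.sort((a, b) => a.costPerCall - b.costPerCall);
  return eligible.length > 0 ? eligible[0].name : null;
}
```

Because `route` takes a label rather than a prompt, two models can be compared against the same operation by changing only the `TIERS` table.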
Local compute is close to free. Async batch reasoning runs two orders of magnitude cheaper than real-time inference. So the real question is narrower : what fraction of work needs real-time answers?
Surprisingly little, once the system can queue work.
Queueing is why this works. A draft reply, a repo summary, a diligence memo, a nightly evaluator run : none of these need to return in a second.
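A minimal sketch of that split, assuming each task carries a hypothetical `needsRealtime` flag and that the caller supplies two handlers (`runRealtime` and `runBatch`, both illustrative names). Anything that does not need an immediate answer is held and sent later as one batch.

```javascript
// Hypothetical sketch: real-time work runs immediately, everything else waits for a batch.
class DeferredQueue {
  constructor(runRealtime, runBatch) {
    this.runRealtime = runRealtime; // assumed handler for latency-sensitive tasks
    this.runBatch = runBatch; // assumed handler that takes an array of tasks
    this.pending = [];
  }

  submit(task) {
    if (task.needsRealtime) return this.runRealtime(task);
    this.pending.push(task);
    return undefined; // the result arrives later, after flush()
  }

  flush() {
    const batch = this.pending;
    this.pending = [];
    return batch.length > 0 ? this.runBatch(batch) : [];
  }
}
```

In practice `flush()` would be triggered by a timer or a nightly job; the sketch leaves that scheduling to the caller.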
We built the first version of this into our agent runtime. The router already scored tasks on complexity, context size, & local memory retrieval. Two feedback mechanisms now sit on top of the router, & they operate on different time scales :
- Synchronous failure-mode signals. A predictor annotates each incoming route with five features : missing repo context, long dependency chains, risky migrations, security-sensitive prompts, & high-consequence writes.
- Nightly closed-loop feedback. A batch evaluator scores yesterday’s traces overnight & updates the router’s weights, running on async inference to keep the evaluation cost near zero.
The synchronous predictor catches known-hard tasks before they fail. The nightly loop discovers new failure modes the predictor missed.
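The two loops can be sketched as follows. The five feature names mirror the list above, but the scoring and the update rule are assumptions made for illustration: the article does not describe how the weights are actually computed. Here the synchronous path sums the weights of flagged features, and the nightly path nudges each weight up when a flagged trace failed and down when it succeeded.

```javascript
// Feature names follow the article; the scoring and update rules are assumptions.
const FEATURES = [
  'missingRepoContext',
  'longDependencyChain',
  'riskyMigration',
  'securitySensitive',
  'highConsequenceWrite',
];

// Synchronous path: sum the weights of the features flagged on this route.
function riskScore(flags, weights) {
  return FEATURES.reduce((sum, f) => sum + (flags[f] ? weights[f] : 0), 0);
}

// Nightly path: move weights toward features that co-occurred with failures.
function updateWeights(weights, traces, learningRate = 0.1) {
  const next = { ...weights };
  for (const trace of traces) {
    const delta = trace.failed ? learningRate : -learningRate;
    for (const f of FEATURES) {
      if (trace.flags[f]) {
        // Clamp each weight to the range [0, 1].
        next[f] = Math.min(1, Math.max(0, next[f] + delta));
      }
    }
  }
  return next;
}
```

A router could escalate a task to a stronger tier whenever `riskScore` crosses a cutoff, with the cutoff itself left as a tuning choice.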
Once skill distillation flattens the operation set, 70-80% of agent traffic can run on local models for most non-coding work.
The implication : design your system around routing, not around models. Pick your models last.
Introducing the 'usermedia' HTML element
Chrome 151 introduces the <usermedia> element to replace legacy JavaScript permission prompts with a declarative, browser-controlled mechanism for accessing hardware.
Summary
Deep Dive
- Declarative API: Moves permission request logic from JavaScript into HTML tags, reducing boilerplate code.
- Capability Elements: A browser initiative to standardize UI for powerful hardware features (like camera/mic or geolocation).
- Recovery Flow: Allows users to re-enable blocked access directly via the HTML element, bypassing deep browser settings.
- Styling Constraints: The browser enforces strict contrast (3:1) and sizing rules on the element to prevent deceptive UI patterns (e.g., clickjacking).
- Data Mediator: The element acts as an intermediary, delivering the MediaStream object directly to the application upon successful authorization.
- Backward Compatibility: The element degrades gracefully into an HTMLUnknownElement, allowing custom fallback to the legacy getUserMedia API.
Decoder
- Boilerplate: Standard, repetitive code required to implement a specific feature.
- MediaStream: An object containing tracks of media (audio or video) that can be rendered or processed by the browser.
- Imperative: A programming style where the developer writes exact, step-by-step instructions (e.g., calling getUserMedia()) rather than defining an intent.
Original Article
Introducing the <usermedia> HTML element
Following the launch of the <geolocation> element in Chrome 144, the next functional control in the Capability Elements suite is the <usermedia> HTML element. Available from Chrome 151, this element marks the next phase of the transition from generic permission requests to targeted and functional controls for accessing camera and microphone streams. By moving away from script-triggered prompts toward a declarative and user-activated experience, <usermedia> reduces boilerplate code, improves security, and provides a seamless recovery path for users who have previously denied access, effectively solving the long-standing permission hole.
From permission management to capability control
The <usermedia> element is the next specialized control to launch in the Capability Elements suite, following the successful introduction of <geolocation>. This transition from the original and generic <permission> proposal—part of the PEPC initiative—lets the browser handle the unique complexities and behaviors of different hardware capabilities more effectively. While the early proposal focused primarily on managing permission states, such as allow versus deny, Capability Elements function as data mediators.
The <geolocation> element provides a location object to your site, and <usermedia> manages the entire flow for camera and microphone access. It captures user intent, manages the browser prompt, and delivers the MediaStream object to the application. This shift eliminates the need for separate getUserMedia() calls, simplifies implementation, and ensures the browser has a trusted signal of the user's intent.
Validation of the concept
Real-world data from the initial Origin Trial demonstrated that the in-context and user-initiated permission controls significantly improve user success rates.
- Cisco observed that users who initially denied permissions were only about 10% likely to successfully grant permissions using legacy prompts, but that rate jumped to more than 65% with the new element.
- Zoom reported a 46.9% decrease in camera or microphone capture errors, such as system-level blockers, by using the element to guide users through recovery.
- Google Meet saw a 17% decrease in "mic not working" feedback and a 131% increase in successful permission recovery for users who had initially denied access.
Why use the <usermedia> element?
Building on the patterns established by <geolocation>, the <usermedia> element addresses the core challenges of requesting powerful capabilities. Media requests rely on imperative JavaScript calls that often trigger out-of-context prompts. If you accidentally block your site, reversing that decision requires navigating deep into browser settings, a "permission hole" that often leads to abandoned features.
The <usermedia> element solves these issues by providing the following:
- Clear intent and timing: Because the prompt only appears after a physical tap on a browser-controlled element, it provides a trusted signal of intent. This lets the browser bypass automated quiet blocks that often cause typical script-triggered requests to fail.
- Simplified recovery: If access was previously denied, tapping the element triggers a specialized recovery flow that lets you re-enable your camera or microphone instantly on the page, without navigating complex browser settings.
- Direct stream access: As a data mediator, the element exposes the media stream directly. This reduces the boilerplate code required to manage callbacks and error states in your application.
| Feature | getUserMedia() JS API | <usermedia> HTML Element |
|---|---|---|
| Triggering event for permission prompt | Imperative script execution (getUserMedia) | User clicks on the browser-controlled element |
| Browser role | Decides prompt based on state and heuristics | Acts as a data mediator (manages consent and stream delivery) |
| Site responsibility | Manually call the JavaScript API, handle callbacks, and manage errors | Listen to the stream event and access the stream property |
| Core goal | Basic camera and microphone access | Stream access, permission management, and recovery with reduced friction |
Implementation
Integrating the element requires significantly less boilerplate than the legacy JavaScript API. Following the declarative pattern established by the <geolocation> element, you can add the <usermedia> tag to your HTML and configure hardware requirements with the setConstraints() method.
<usermedia id="media-ctrl">
  <button>Enable camera and microphone</button>
</usermedia>

const el = document.getElementById('media-ctrl');
// Specify hardware preferences before user interaction:
el.setConstraints({
  video: { width: 1280, height: 720 },
  audio: { echoCancellation: true }
});
// Handle successful stream acquisition:
el.addEventListener('stream', () => {
  videoPreview.srcObject = el.stream;
});
// Handle stream acquisition failure:
el.addEventListener('error', () => {
  console.error(`Access failed: ${el.error?.name}`);
});
// Handle prompt cancellation or dismissal:
el.addEventListener('cancel', () => {
  console.log('Permission prompt was dismissed by the user.');
});
Key attributes and properties
- stream: A read-only property that provides the MediaStream object once the user has successfully granted access.
- setConstraints(): A method that lets developers update hardware preferences, such as deviceId or resolution, prior to user interaction.
- error: A read-only property that returns a DOMException (for example, a NotAllowedError) if the request fails or is dismissed.
- onstream: An event handler that fires immediately once the media tracks are acquired.
- onerror: An event handler that fires when a stream acquisition attempt fails.
- oncancel: An event handler that fires when the user cancels or dismisses the permission prompt during acquisition.
Styling constraints
To ensure user trust and prevent deceptive design patterns, the <usermedia> element applies the same strict styling restrictions as other Capability Elements:
- Legibility: The browser checks text and background colors for sufficient contrast (at least 3:1) to ensure the request is always readable. You must set the alpha channel (opacity) to 1 to prevent the element from being deceptively transparent.
- Sizing and spacing: The browser enforces minimum and maximum bounds for width, height, and font-size. It disables negative margins or outline offsets to prevent the element from being visually obscured.
- Visual integrity: The browser limits distorting effects. For example, transform supports only 2D translations and proportional scaling.
- CSS pseudo-classes: The element supports state-based styling, such as :granted (which activates once permission is active and the stream is acquired), as well as standard interaction states like :hover and :active.
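To check a color pair against the 3:1 floor before shipping, a small helper can compute the ratio ahead of time. The sketch below assumes the WCAG 2.x definitions of relative luminance and contrast ratio; the article does not state which formula Chrome applies, so treat this as an approximation of the check rather than the browser's own logic. The function names are hypothetical.

```javascript
// WCAG 2.x relative luminance for an sRGB color given as [r, g, b] in the range 0-255.
function relativeLuminance([r, g, b]) {
  const [R, G, B] = [r, g, b].map((channel) => {
    const s = channel / 255;
    return s <= 0.03928 ? s / 12.92 : Math.pow((s + 0.055) / 1.055, 2.4);
  });
  return 0.2126 * R + 0.7152 * G + 0.0722 * B;
}

// WCAG contrast ratio: (lighter + 0.05) / (darker + 0.05), ranging from 1 to 21.
function contrastRatio(foreground, background) {
  const l1 = relativeLuminance(foreground);
  const l2 = relativeLuminance(background);
  const [lighter, darker] = l1 >= l2 ? [l1, l2] : [l2, l1];
  return (lighter + 0.05) / (darker + 0.05);
}

// Assumed check against the 3:1 minimum mentioned in the article.
function meetsMinimumContrast(foreground, background, minimum = 3) {
  return contrastRatio(foreground, background) >= minimum;
}
```

For example, `meetsMinimumContrast([0, 0, 0], [255, 255, 255])` passes, while light gray text such as `[204, 204, 204]` on a white background falls below the threshold.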
Progressive enhancement and migration strategy
Following the design pattern established by <geolocation>, the <usermedia> element is built to degrade gracefully. Browsers that don't support the element will treat it as an HTMLUnknownElement and render its children. This lets you provide a fallback experience for all users.
Custom fallback pattern
Programmatically detect support for the <usermedia> element in JavaScript:
if ('HTMLUserMediaElement' in window) {
// Use modern <usermedia> element logic
} else {
// Fallback to legacy getUserMedia() API
}
Use this detection logic to add a standard button inside the <usermedia> element to trigger the legacy getUserMedia() API:
<usermedia id="stream-handler">
  <button id="fallback-stream-handler">
    Enable Camera and Mic
  </button>
</usermedia>

// Function for handling video/audio streams; receives a MediaStream:
function handleStream(stream) {
  /* ... */
}

if ('HTMLUserMediaElement' in window) {
  // In this case, we have <usermedia> element support:
  const streamHandler = document.getElementById('stream-handler');
  streamHandler.addEventListener('stream', () => {
    // The element exposes the acquired stream on its stream property:
    handleStream(streamHandler.stream);
  });
} else {
  // <usermedia> element support is missing, so fall back instead:
  const fallbackStreamHandler = document.getElementById('fallback-stream-handler');
  fallbackStreamHandler.addEventListener('click', () => {
    navigator.mediaDevices.getUserMedia({video: true, audio: true})
      .then(handleStream)
      .catch(error => console.error(`Access failed: ${error.name}`));
  });
}
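If the rest of an application prefers a single promise over two code paths, both branches can be wrapped in one helper. This is a sketch under stated assumptions: `acquireStream` is a hypothetical name, and the `win` parameter is injected only so the logic can be exercised outside a browser. It relies on the `stream`, `error`, and `cancel` events and the `stream` and `error` properties described above.

```javascript
// Hypothetical helper: resolves with a MediaStream from whichever path is available.
function acquireStream(el, fallbackButton, win = globalThis) {
  return new Promise((resolve, reject) => {
    if ('HTMLUserMediaElement' in win) {
      // Modern path: the element delivers the stream after the user activates it.
      el.addEventListener('stream', () => resolve(el.stream), { once: true });
      el.addEventListener('error', () => reject(el.error), { once: true });
      el.addEventListener(
        'cancel',
        () => reject(new Error('Permission prompt dismissed')),
        { once: true }
      );
    } else {
      // Legacy path: the child button triggers getUserMedia() on click.
      fallbackButton.addEventListener(
        'click',
        () => {
          win.navigator.mediaDevices
            .getUserMedia({ video: true, audio: true })
            .then(resolve, reject);
        },
        { once: true }
      );
    }
  });
}
```

A caller could then write `acquireStream(el, button).then(handleStream)` without caring which branch ran. Note that a promise settles only once, so a page that lets the user retry after a dismissal would need to call the helper again.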
Migration for Origin Trial participants
For developers who integrated the experimental and generic <permission> element during the Origin Trial, migrating to <usermedia> is designed to require minimal changes.
- Tag update: Replace <permission type="camera microphone"> with <usermedia>, and update all selectors that target the previous <permission> elements to use the <usermedia> element instead.
- Feature detection: Update checks from HTMLPermissionElement to HTMLUserMediaElement.
The roadmap ahead
While the <usermedia> element handles combined audio and video requests, the roadmap for future Capability Elements includes:
- <camera>: Focuses specifically on video-only scenarios.
- <microphone>: Focuses specifically on audio-only scenarios.
Building Indexes on a Moving Target
Apache Hudi enables adding indexes to petabyte-scale live tables without downtime, a feat impossible in standard copy-on-write lakehouse formats.
Summary
Deep Dive
- Merge-On-Read (MOR): An architectural model that separates ingestion (writes) from compaction (reads/optimizations).
- Async Indexing: The ability to build metadata indexes (Bloom filters, secondary indexes) while ingest writers continue to commit.
- File Groups: The core storage unit in Hudi; stable logical containers that allow concurrent processes to append mutations without rewriting base files.
- Timeline: Hudi's event log that records every action (commits, indexing, compaction) as a sequence.
- Catchup: The reconciliation process where an indexer verifies metadata updates from concurrent writers to ensure consistency before flipping a partition to readable.
- Two-Stage Config: Managing index visibility via 'inflight' vs 'active' states to prevent readers from seeing incomplete metadata.
Decoder
- Merge-On-Read (MOR): A storage pattern where mutations are written to delta log files, which are merged with a base file during read-time.
- Copy-On-Write (COW): A storage pattern where any update requires rewriting the entire data file, making it unsuitable for high-velocity updates.
Original Article
This is the third post in a series on Merge-On-Read as an architectural shift in Apache Hudi. The first post made the broad case; the second showed how the metadata table inherits the same append-first model; this one follows the pattern one step further — into how new indexes can be added to a live table without stopping it.
The first post in this series argued that Merge-On-Read is not a storage optimization but an architectural shift — a strategy for time-shifting work in systems where mutation and analytical scan run at different rhythms. The second grounded that argument in Hudi's metadata table, showing how metadata itself became another append-first workload. Async indexing extends the same architectural pattern one step further: enabling new capabilities to be introduced into a live table without stopping the system.
This article assumes some familiarity with Hudi's metadata table and its indexing capabilities. If you're new to the topic, the motivation for metadata indexes, their role in query planning, and the evolution of the metadata table are covered in detail in RFC-45 as well as our earlier articles on the metadata table and mutation-friendly metadata. Here, we focus on a different question: once a table is already serving production writes, how can a new metadata index be introduced without interrupting ingestion or exposing inconsistent state?
The title is literal. In Hudi, you can build a new index — record-level, secondary, expression, column stats, bloom filter — over an existing multi-petabyte table that doesn't stop while you build it. Ingest writers keep committing. No outage. No drained pipeline. No swap window. The bootstrap reads historical state into a freshly initialized metadata partition; concurrent writers route new mutations into the same partition's file groups; a reconciliation step at the end stitches the seam.
That capability does not exist in any other open lakehouse format at the spec level — not because nobody implemented it, but because the storage layer every other format is built on cannot express the pattern. Async indexing is not a feature added to Hudi; it is a consequence of decisions Hudi made in 2017. The only systems that can build indexes this way are the ones whose mutation model is already append-first.
RFC-45 is the design doc, and the Onehouse engineering blog walks through the mechanics. Those explain what async indexing does and how it's wired in. This post is about why the design works at all — what architectural primitives have to exist underneath before something like it becomes a coherent thing to build.
The Shape of the Problem
Adding a new index over an existing large table has always been an operationally expensive thing for analytical storage to do. Scan every row, project the indexed columns or compute the indexed expression, materialize the index. The trouble is that "scan every row" of a multi-petabyte table takes hours, and during those hours the table either has to stop accepting writes or has to coordinate with a table that won't sit still.
Two responses dominate the lakehouse conversation, and both inherit the limitations of their foundation.
The first is to bind the index to the write path: compute the value at ingestion time and write it alongside the data — column-level statistics, min/max metadata, file summaries. This is what Iceberg and Delta do by default. It works, but it cannot retroactively add an index. If the table was written for two years without it, the write path can't help; the data is already on disk in its final form.
The second is to lock-and-rebuild: global or per-partition lock, scan, materialize, atomic swap, release. Fine at small scale. At petabyte scale on a continuously ingesting table the lock duration is hours, the ingest pipeline has to stop or buffer indefinitely, and any failure during the rebuild discards the work.
Hudi's predicament when RFC-15 introduced the metadata table was that the metadata table (MDT) was going to host indexes: files, column_stats, bloom_filters, record_index, partition_stats, secondary_index, expression_index. Every one of those would eventually need to be added over an existing table by someone who'd been running Hudi for a year before the index existed. If index addition required taking the table offline, the metadata table's growth story would collapse to "enable everything at table creation or never." That isn't a metadata layer; that's a one-shot schema decision. RFC-45 is the document that fixes this.
What COW Makes Impossible — Not Hard, Impossible
Imagine implementing async indexing on a Copy-On-Write storage layer — a generic columnar lakehouse format where the unit of mutation is the file rewrite. Bootstrap scans history, computes, materializes. Concurrent writers continue ingesting. Index has to be queryable at the end.
The bootstrap is slow under any architecture; that's not what breaks. What breaks is that there is no place for concurrent writers to put their contribution to the new index while the bootstrap is running.
Under COW, every index update is a rewrite of a finalized index file. There is no append channel. If the indexer has materialized idx_000.parquet for history up to t, and a writer commits at t+1, the writer cannot contribute to idx_000.parquet. The writer's only option is idx_001.parquet — which means the bootstrap has to coordinate with the writer about which file was produced, validate it, possibly merge it. That coordination is the synchronization we were trying to avoid. The deeper problem is the absence of a stable logical container: the index isn't something the writer can point at and append into; it's a collection of physical files whose ownership has to be negotiated.
The workaround in rewrite-oriented systems is bootstrap-and-swap: build against a frozen snapshot at t; writers ingest normally but don't touch the new index; at t + Δ, replay all writes between t and t + Δ into the index, then atomically swap. This works, but the replay window has no upper bound (if ingest exceeds replay rate, catchup never converges), and the index is unavailable for query during the entire bootstrap-plus-replay duration.
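The convergence condition can be made concrete with a small model. Assume, purely for illustration, a constant ingest rate and a constant replay rate, both in commits per hour. If the bootstrap runs for T hours, the backlog at the start of replay is ingestRate × T, and writes keep arriving while replay runs, so replay finishes only when replayRate × Δ = ingestRate × (T + Δ). Solving for Δ gives the function below; `replayDuration` and its parameters are hypothetical names.

```javascript
// Returns the replay duration (hours) needed to catch up, or Infinity if replay never converges.
// Assumes constant rates, in commits per hour, which real workloads will not have.
function replayDuration(bootstrapHours, ingestRate, replayRate) {
  if (replayRate <= ingestRate) return Infinity; // the backlog grows at least as fast as it drains
  const backlog = ingestRate * bootstrapHours;
  return backlog / (replayRate - ingestRate);
}
```

With a 6-hour bootstrap, 100 commits per hour of ingest, and 400 commits per hour of replay, the result is 600 / 300 = 2 hours, during which the index is still unavailable. If replay cannot outpace ingest, the function returns `Infinity`, which is the unbounded case described above.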
Hudi doesn't need bootstrap-and-swap because the foundation offers something neither Iceberg nor Delta has: a stable file group with an append channel, and a timeline that expresses index construction as a first-class action with its own reconciliation semantics. Both primitives are downstream of MOR.
Before examining the architectural primitives individually, it's helpful to look at the end-to-end lifecycle of an async index build. The figure below intentionally presents the high-level execution flow rather than the implementation details. For readers interested in the complete design, state-machine transitions, and failure handling, RFC-45 provides the authoritative specification. Here, we focus on the architectural primitives that make the design possible.
The Three Primitives, All of Them MOR Consequences
Async indexing in Hudi rests on three substrate-level primitives. Each one is something the MOR design forced into existence for other reasons, and each one is what makes RFC-45's design coherent rather than a heroic engineering effort against the storage layout.
Primitive 1: The File Group as a Stable Mutation Container
The file group is a stable logical mutation container. Once initialized, multiple independent processes — not just ingest writers, but background services such as index builders — can deterministically contribute mutations into the same logical object while deferring physical reconciliation. Async indexing is one application of this primitive. Later in the series, we'll see the same stable mutation containers reused by other Hudi capabilities, including async compaction and eventually non-blocking concurrency control.
When the indexer initializes a new metadata partition, it creates empty file groups for that partition. From that moment, they exist on disk and are addressable. A writer that sees the partition is inflight (via table config) can immediately append log blocks into those file groups, even though base files haven't been written yet. The writer doesn't need to know whether the indexer has finished the base file or coordinate physical layout at all. It needs only to know which file group to route into — a deterministic function of the record key (via bucket index or RLI's hash routing).
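The routing step can be sketched as a pure function. This is an illustration, not Hudi's implementation: FNV-1a stands in for whatever hash Hudi actually uses, and `routeToFileGroup` is a hypothetical name. What matters is that the indexer and every writer evaluate the same function independently and arrive at the same file group without talking to each other.

```javascript
// FNV-1a (32-bit) is a stand-in hash; the article does not show Hudi's actual hash function.
function fnv1a(str) {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < str.length; i++) {
    hash ^= str.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // multiply by the FNV prime, keep 32 unsigned bits
  }
  return hash;
}

// Deterministic mapping from record key to file group index in [0, fileGroupCount).
function routeToFileGroup(recordKey, fileGroupCount) {
  return fnv1a(recordKey) % fileGroupCount;
}
```

Because the mapping depends only on the key and the number of file groups, it stays valid whether or not the base file for that group has been written yet.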
Under COW this primitive doesn't exist. There's no "container the indexer prepares and writers append into" — only files, produced atomically and consumed read-only. The closest analog is a shared directory both processes write new files into; but that's a namespace, not a container, and the synchronization problem returns immediately.
Primitive 2: INDEXING as a First-Class Timeline Action
Hudi's timeline is an append-only event log: commits, deltacommits, cleans, compactions, restores, rollbacks. RFC-45 adds INDEXING_ACTION. An index build is not a side process; it is represented as its own sequence of timeline instants: <t>.indexing.requested, <t>.indexing.inflight, and <t>.indexing (completed), alongside data writes and other table services.
That placement matters. Because the action lives on the timeline, concurrent writers discover it through the same timeline refresh they already perform for commits and table services; no separate coordination channel is required. Because it follows the standard requested → inflight → completed lifecycle, existing mechanisms for recovery, rollback, and lock-protected state transitions apply without introducing a parallel state machine. Async indexing is built by extending the timeline rather than bypassing it.
More importantly, the timeline and table configuration together govern visibility, not just execution. Bootstrap and Catchup may execute for minutes while the table continues accepting writes, but the new index remains invisible to readers throughout that period.
Internally, the indexing action progresses through the standard requested → inflight → completed lifecycle on the timeline, while the table configuration independently tracks whether the metadata partition is still under construction (metadata.partitions.inflight) or has become queryable (metadata.partitions). Readers observe the new index only after both transitions have completed. This separation is subtle but fundamental. Bootstrap and Catchup describe how the index is constructed. The timeline and table configuration determine when that construction becomes visible. By separating progress from publication, Hudi allows long-running background work to proceed without exposing partially constructed metadata to readers.
Primitive 3: Two-Stage Partition Lifecycle in Table Config
The subtlest of the three, and the one most directly inherited from MOR. hoodie.properties carries two related fields: hoodie.table.metadata.partitions.inflight lists partitions initialized and receiving writes but not yet queryable; hoodie.table.metadata.partitions lists partitions fully built and safe to read.
That split is what makes async indexing safe under split-brain reader/writer views. When scheduling completes — file groups initialized, plan persisted — the partition is added to *_INFLIGHT. Writers see it and start routing log blocks. Readers see it too, and ignore the partition: table config is the source of truth for query readiness. Only after the bootstrap base is written, Catchup reconciles any missing metadata updates from concurrent commits, and the indexing action completes does the partition move from metadata.partitions.inflight to metadata.partitions.
The two-stage config is MOR's base-file-vs-log-file separation lifted to the partition level. Log files are writable before the next base file is materialized; readers reconcile both at query time. Inflight partitions are writable before the bootstrap completes; readers ignore them until done. Same idea, different scope. You can see this in code at RunIndexActionExecutor.updateTableConfigAndTimeline: the partition transitions from inflight to active as part of the indexing completion sequence, allowing the new index to become visible only after both the timeline action and table configuration reflect a completed build.
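The visibility rule can be modeled in a few lines. The two property names come from the text above; everything else (the in-memory object standing in for hoodie.properties and the helper names) is a hypothetical simplification that ignores locking and persistence.

```javascript
const INFLIGHT = 'hoodie.table.metadata.partitions.inflight';
const ACTIVE = 'hoodie.table.metadata.partitions';

// Toy in-memory stand-in for hoodie.properties; helper names are hypothetical.
function newTableConfig() {
  return { [INFLIGHT]: new Set(), [ACTIVE]: new Set() };
}

// Scheduling complete: file groups exist, so writers may start routing log blocks.
function markInflight(config, partition) {
  config[INFLIGHT].add(partition);
}

// Bootstrap and Catchup complete: the partition becomes safe to read.
function markCompleted(config, partition) {
  config[INFLIGHT].delete(partition);
  config[ACTIVE].add(partition);
}

// Writers route updates to both inflight and fully built partitions.
function writerTargets(config) {
  return [...config[INFLIGHT], ...config[ACTIVE]];
}

// Readers trust only partitions listed as fully built.
function readerVisible(config, partition) {
  return config[ACTIVE].has(partition);
}
```

Between `markInflight` and `markCompleted`, the partition is writable but invisible to readers, which is the window in which Bootstrap and Catchup do their work.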
Indexing Catchup as Deferred Reconciliation, Generalized
The three primitives set up the most interesting piece of the design: indexing catchup. The implementation lives in AbstractIndexingCatchupTask, orchestrated by RunIndexActionExecutor. It's where the MOR-as-foundation argument becomes concrete.
The indexing plan is scheduled against a bootstrap target — the latest completed data-table instant at scheduling time. Bootstrap builds the historical portion of the index by scanning the table up to that point. Meanwhile, the data timeline continues advancing.
A common misconception is that Bootstrap alone constructs the index. In reality, every commit that arrives while Bootstrap is running is equally part of the index build — it simply reaches the metadata partition through the incremental write path rather than the historical scan. Writers that observe the indexing action after it becomes inflight begin appending metadata updates into the new file groups. By the time Bootstrap completes, the metadata partition consists of a historical base together with an unknown amount of concurrent incremental state.
The indexer cannot simply assume that every eligible commit has been reflected in the metadata timeline. Some writers may have started before the indexing action became visible. Others may still be inflight. Catchup exists to reconcile that uncertainty before the index becomes visible.
Catchup walks the relevant portion of the data timeline and compares each eligible data-table instant against the metadata timeline using the indexing action's requested-time identifier. If the corresponding metadata update already exists, the writer has already contributed its portion of the index. Otherwise, Catchup synthesizes the missing metadata update on the writer's behalf.
Inflight writers receive special treatment. Rather than immediately treating them as missing, Catchup waits for active writers to complete while monitoring their heartbeats. Writers that complete normally are incorporated through the usual reconciliation path. Writers whose heartbeats have expired are treated as abandoned and skipped, allowing index construction to converge without waiting indefinitely. The wait itself is bounded by hoodie.metadata.index.check.timeout.seconds (900 seconds by default), after which the indexing operation aborts and the partially constructed partition is discarded.
This is deferred reconciliation in its purest form. Bootstrap constructs historical state. Concurrent writers continue making forward progress. Catchup reconciles any remaining divergence between the data timeline and metadata timeline before publication. Readers never observe intermediate state.
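The per-instant decision described above can be sketched as a classification function. This is a simplified model, not the logic in AbstractIndexingCatchupTask: the field names (`time`, `state`, `lastHeartbeat`) and the returned labels are hypothetical, and the overall wait bound set by hoodie.metadata.index.check.timeout.seconds is not modeled.

```javascript
// Decide what Catchup does with one data-table instant. All field names are hypothetical.
function catchupDecision(instant, metadataInstants, now, heartbeatTimeoutMs) {
  // The writer already contributed its metadata update: nothing to do.
  if (metadataInstants.has(instant.time)) return 'already-indexed';
  // Completed data commit with no metadata update: fill it in on the writer's behalf.
  if (instant.state === 'completed') return 'synthesize-update';
  // Still inflight: wait while the writer is alive, skip it once the heartbeat has expired.
  const alive = now - instant.lastHeartbeat <= heartbeatTimeoutMs;
  return alive ? 'wait' : 'skip-abandoned';
}

// One pass over the relevant slice of the data timeline.
function runCatchupPass(instants, metadataInstants, now, heartbeatTimeoutMs) {
  const plan = {};
  for (const instant of instants) {
    plan[instant.time] = catchupDecision(instant, metadataInstants, now, heartbeatTimeoutMs);
  }
  return plan;
}
```

In this model, a build would repeat the pass until no instant is in the `wait` state, then publish the partition.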
The pattern is exactly what Merge-On-Read readers already do for every file group: reconcile a stable base with subsequent incremental mutations to produce a consistent view. Async indexing applies the same architectural idea at the scope of an entire metadata partition.
The Races, and How the Architecture Absorbs Them
RFC-45 enumerates four error cases. Each one is a place where synchronous coordination would have required a lock, and each one is handled instead by deferred reconciliation against timeline state. They're worth walking through because they make the architectural argument concrete.
Writer fails while the indexer is inflight. The writer crashed before logging its update. The indexer doesn't care. Catchup observes the failed instant, sees no metadata deltacommit, and either skips it (if data-table rollback already cleaned up) or fills it in. The metadata partition isn't corrupted because reader visibility hasn't flipped yet.
Indexer fails while a writer is inflight. The writer continues logging into the new file groups. The partition stays in *_INFLIGHT because no commit metadata was written. Readers continue ignoring it. When the indexer is re-triggered, it sees the partition is inflight, picks up the build (idempotent by partition name in ScheduleIndexActionExecutor), and proceeds. The writer's log files aren't lost — they're already in the file group, waiting for catchup.
Writer goes inflight just after the indexing request, before the indexer executes. A writer begins after the indexing action becomes visible and therefore participates in index construction from the outset. By bootstrap completion, log files already exist. Catchup observes them, sees the corresponding deltacommit, moves on. No coordination — writer and indexer were independently pointing at the same stable container.
Inflight writer about to commit but indexing just completed. The trickiest race. The writer started knowing the partition was inflight; during its execution the partition flipped to completed. The writer commits, producing a log file in the now-completed partition, and the indexer missed it. This is still safe: because readers already merge base files with log files on the fly, the late log file is picked up at read time like any other. The storage layout itself acts as the concurrency buffer, and the reader's merge logic doesn't distinguish between a log file produced during catchup and one produced just after. RFC-45 discusses an alternative design based on locking metadata-timeline appends, but explicitly rejects it in favor of the eventual-consistency path adopted by the implementation.
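The last race is easier to see with a toy model of a file group. The sketch below is an illustration only, assuming a hypothetical `FileGroup` class with an in-memory base and an append-only log. It is not Hudi's reader, and the instant strings and keys are made up.

```python
from dataclasses import dataclass, field


@dataclass
class FileGroup:
    base: dict = field(default_factory=dict)   # state built by bootstrap
    logs: list = field(default_factory=list)   # (instant, key, value) appended by writers

    def append(self, instant, key, value):
        # Writers only ever append; they never rewrite the base.
        self.logs.append((instant, key, value))

    def read(self):
        # Reader-side merge: replay log entries over the base in instant order.
        view = dict(self.base)
        for _, key, value in sorted(self.logs, key=lambda entry: entry[0]):
            if value is None:
                view.pop(key, None)            # None marks a delete
            else:
                view[key] = value
        return view


fg = FileGroup(base={"k1": "file-a"})
fg.append("003", "k2", "file-b")   # written while the indexer was still catching up
index_completed = True             # partition flips to completed here
fg.append("004", "k3", "file-c")   # late writer commits just after the flip
merged = fg.read()
```

The merge never consults `index_completed`, so the entry from instant "004" shows up in `merged` exactly like the one from "003".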
Why This Isn't Expressible in Rewrite-Oriented Formats
Can async indexing be added to Iceberg or Delta with sufficient engineering effort? It can be approximated, but the approximation either gives up the "no synchronization with writers" property or requires inventing the three primitives above on a foundation that doesn't naturally support them.
Iceberg has manifests, manifest lists, snapshots — no stable logical container with an append channel. Each commit produces new manifest files; there's no "manifest the indexer prepares and writers contribute to." Adding async indexing to Iceberg would require a new manifest layout where indexers stage skeletons that subsequent writers contribute to — a spec change, not a feature.
Delta has the transaction log, deletion vectors, and row-level concurrency for non-overlapping MERGE/UPDATE/DELETE. None provide an append container stable across commits. Coordinated commits in 4.0 improve multi-engine atomic commit, but the unit of commit is still a set of new files; there's no notion of writers appending into a file the indexer staged. Delta's data skipping and statistics are written during the data commit; they cannot be retroactively built without bootstrap-and-swap.
What This Unlocks Operationally
The architectural argument lands on a small set of capabilities that change how teams operate Hudi tables.
The most immediate: adding indexes to existing tables without a maintenance window. RLI on a 1 PB table for primary-key lookup. Secondary index on customer_email for a query pattern that wasn't important when the table was created. Expression index on LOWER(country_code) because the workload changed. Issue CREATE INDEX (or run HoodieIndexer) against the running table. Bootstrap fills history. Writers continue. Catchup closes the seam.
Second: index re-initialization as a routine operation. If an index gets into a bad state — schema evolution mismatch, corrupted base file, missed catchup window — recovery is "delete the partition, re-run the indexer." Not a migration event. The same ScheduleIndexActionExecutor.abort path handles cleanup before re-initialization.
Third: staged index rollouts. You don't have to enable every metadata partition at table creation. Enable files and column_stats initially, add record_index later when point-lookup load justifies it, secondary_index later still when a new query pattern emerges. The metadata table's footprint grows with the workload's actual needs.
What MOR Ultimately Made Possible
A single architectural decision — append-first mutation with deferred reconciliation — produces a family of capabilities that look like independent features but are the same idea applied to different scopes. The metadata table is MOR because the commit cadence demands it. Async indexing is possible because the same stable mutation containers can be initialized empty and evolved by concurrent contributors.
Looked at in isolation, each capability has its own moving parts: RFC-15 for MDT, RFC-45 for async indexing. Later in the series we'll examine how Hudi extends the same architectural principles (stable mutation containers, deferred reconciliation, and timeline-mediated publication) to support non-blocking concurrency control. The isolated view misses the bigger point: the moving parts exist because MOR established the underlying mutation model, and the same model gets reused every time a new operational requirement shows up.
Introducing TabFM: A zero-shot foundation model for tabular data
Google's new TabFM foundation model performs zero-shot tabular prediction by treating datasets as in-context learning tasks, bypassing traditional manual feature engineering.
Summary
Deep Dive
- Hybrid Attention: Combines row-wise and column-wise attention to capture feature dependencies without needing explicit feature engineering.
- In-Context Learning (ICL): Lets the model perform a task from examples provided in the input rather than by updating weights.
- Synthetic Training: Uses structural causal models to generate diverse datasets, circumventing the scarcity of public, large-scale industrial data.
- Benchmarks: Validated against TabArena, a living leaderboard of classification and regression performance.
Decoder
- Zero-shot: The ability of a machine learning model to perform a task it was not explicitly trained on, using only the context provided in the input.
- Tabular data: Structured data organized into rows and columns, such as a CSV or database table.
- XGBoost: A popular gradient-boosted decision tree algorithm that has long been the industry standard for tabular prediction.
Original Article
Introducing TabFM: A zero-shot foundation model for tabular data
We’ve seen a massive shift in how people handle time-series forecasting since we launched TimesFM. Now, we’re bringing that same "zero-shot" logic to tabular data.
We introduce TabFM, a new foundation model for tabular data to simplify classification and regression workflows.
Tabular data constitutes the backbone of enterprise data infrastructure and powers a significant fraction of critical predictive machine learning applications. From predicting customer churn to identifying financial fraud, tabular regression and classification tasks are ubiquitous. For years, supervised tree-based algorithms such as AdaBoost, XGBoost, and random forests have dominated this space, offering robust performance on structured data.
However, the lifecycle of deploying these traditional models presents a significant bottleneck. Fitting an XGBoost model to a new dataset is not merely a matter of a single .fit() step; it invariably requires tedious manual effort. Data scientists must invest countless hours into extensive hyperparameter optimization and domain-specific feature engineering just to extract a reliable signal from the raw data.
On the other hand, recent advances in the broader machine learning landscape — particularly the evolution of large language models (LLMs) — have changed how we interact with novel tasks. LLMs have demonstrated the remarkable power of zero-shot prediction through in-context learning (ICL). This technique lets a pretrained model learn a new task by providing examples and instructions in the input context, without updating any underlying model weights.
Today, we introduce TabFM, a foundation model designed specifically for tabular data classification and regression. By framing tabular prediction as an ICL problem, TabFM eliminates the need for manual model training, hyperparameter tuning, and complex feature engineering. We are excited to share how this approach allows users to generate high-quality predictions on previously unseen tables in a single forward pass. TabFM is now available on our Hugging Face and GitHub repos.
How it works
The traditional ML paradigm relies on updating model parameters specific to a given dataset's distribution. In contrast, the ICL paradigm bypasses this completely. Instead of undergoing a traditional training phase for each new task, TabFM takes the entire dataset — comprising both the historical training examples and the target testing rows — as a single unified prompt. The model learns to interpret the relationships between columns and rows directly from this context at inference time.
However, applying ICL to tabular data is not as straightforward as tokenizing natural language. Standard language models process one-dimensional, ordered sequences, but tables are fundamentally two-dimensional and inherently orderless: swapping two rows or two columns does not change the underlying meaning of the data. To effectively process these diverse tabular structures while enabling scalable zero-shot prediction, TabFM synthesizes the strengths of architectures like TabPFN and TabICL into a novel hybrid design. This architecture, visualized below, relies on three key mechanisms:
- Alternating row and column attention: First, the raw table is processed through a multilayer attention module. Similar to TabPFN, this step applies alternating attention across both columns (features) and rows (examples). By continuously attending across these two dimensions, the model learns rich representations that natively capture complex feature interactions and dependencies. This deep contextualization effectively performs the heavy lifting that would otherwise require tedious manual feature crafting by data scientists.
- Row compression: Following this contextualization, the rich, cross-attended information for each individual row is compressed into a single, dense vector representation.
- In-context learning (ICL): Finally, a dedicated Transformer operates on this sequence of compressed embeddings. Following the approach of TabICL, attention is performed over these compressed row vectors rather than the raw, uncompressed grid, which drastically reduces the computation cost and keeps the prediction step efficient even for much larger datasets.
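The three mechanisms above can be sketched in a few lines of plain Python. This is a toy illustration, not TabFM's implementation: attention here has no learned weights, each cell is embedded as the hypothetical pair [value, 1.0], row compression is a plain mean over columns, and the final step predicts a softmax-weighted average of the training labels. All function names are assumptions made for the example.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def self_attention(vectors):
    """Parameter-free scaled dot-product self-attention over a list of vectors."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in vectors])
        out.append([sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)])
    return out


def column_attention(cells):
    # Within each row, every feature attends to the other features.
    return [self_attention(row) for row in cells]


def row_attention(cells):
    # Within each column, every example attends to the other examples.
    # Simplification: test rows are treated like any other row here.
    n_rows, n_cols = len(cells), len(cells[0])
    out = [[None] * n_cols for _ in range(n_rows)]
    for c in range(n_cols):
        mixed = self_attention([cells[r][c] for r in range(n_rows)])
        for r in range(n_rows):
            out[r][c] = mixed[r]
    return out


def compress_rows(cells):
    # Stand-in for row compression: average the cell vectors of each row.
    d = len(cells[0][0])
    return [[sum(cell[i] for cell in row) / len(row) for i in range(d)] for row in cells]


def predict(train_x, train_y, test_x, layers=2):
    cells = [[[x, 1.0] for x in row] for row in train_x + test_x]
    for _ in range(layers):
        cells = row_attention(column_attention(cells))
    rows = compress_rows(cells)
    train_rows, test_rows = rows[:len(train_x)], rows[len(train_x):]
    d = len(rows[0])
    preds = []
    for q in test_rows:
        # ICL step: each test row attends only over compressed training rows.
        weights = softmax([dot(q, k) / math.sqrt(d) for k in train_rows])
        preds.append(sum(w * y for w, y in zip(weights, train_y)))
    return preds
```

Because none of the steps depends on row position, permuting the training rows (together with their labels) leaves the prediction unchanged up to floating-point rounding, which mirrors the orderless property of tables noted earlier.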
Training on synthetic data at scale
A typical recipe for building foundation models is to use a high-capacity neural network trained on vast amounts of diverse data. However, a major hurdle in tabular ML is that high-quality, diverse tabular datasets — especially the massive tables required to reflect true industrial data analysis — are critically scarce in the open-source space. Industrial tables often contain proprietary schemas and sensitive information, making them inaccessible for broad pre-training.
Because synthetic tables can be generated to be arbitrarily large, they are effectively the only viable option for pre-training a foundation model at this scale. TabFM is therefore trained entirely on hundreds of millions of synthetic datasets. These datasets are dynamically generated using structural causal models (SCMs) that incorporate a wide variety of random functions. This massive synthetic generation captures the wide variety of distributions and complex feature relationships prevalent in real-world tabular data. As a result, the model generalizes well to unseen real-world tables, as we demonstrate in our benchmarks below.
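To make the idea of SCM-based generation concrete, here is a minimal sketch of one way to sample a synthetic table from a random structural causal model. It is an illustration under assumptions, not the generator used for TabFM: the function name, the edge probability of 0.5, the noise scales, and the choice of tanh, sin, and ReLU as the random functions are all hypothetical.

```python
import math
import random


def sample_scm_dataset(n_rows, n_features, seed=0):
    """Sample (X, y) from a randomly drawn structural causal model."""
    rng = random.Random(seed)
    n_nodes = n_features + 1  # one node becomes the prediction target

    # Random DAG: node j may only depend on earlier nodes, which rules out cycles.
    parents = [[i for i in range(j) if rng.random() < 0.5] for j in range(n_nodes)]
    weights = [[rng.gauss(0.0, 1.0) for _ in ps] for ps in parents]
    functions = [math.tanh, math.sin, lambda v: max(0.0, v)]
    activations = [rng.choice(functions) for _ in range(n_nodes)]

    rows = []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            if parents[j]:
                # Child node: random function of its parents plus small noise.
                signal = sum(w * values[p] for w, p in zip(weights[j], parents[j]))
                values.append(activations[j](signal) + rng.gauss(0.0, 0.1))
            else:
                # Root node: pure exogenous noise.
                values.append(rng.gauss(0.0, 1.0))
        rows.append(values)

    # Pick one node as the label; the remaining nodes are the features.
    target = rng.randrange(n_nodes)
    X = [[v for k, v in enumerate(row) if k != target] for row in rows]
    y = [row[target] for row in rows]
    return X, y
```

Each call with a new seed draws a different graph, set of functions, and target column, so a stream of seeds yields a stream of structurally different tables of any requested size.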
Performance and benchmarking
To rigorously test TabFM against existing state-of-the-art methods, we evaluated it on TabArena, a living benchmark system that calculates Elo scores based on head-to-head win rates. This comprehensive evaluation spans 38 classification datasets and 13 regression datasets ranging in size from 700 to 150,000 samples.
As shown in the performance plot below, we benchmarked two distinct configurations of our model:
- TabFM: This represents the out-of-the-box capability of the model. Predictions are generated in a single forward pass, requiring no tuning or cross-validation.
- TabFM-Ensemble: This configuration pushes performance further by incorporating cross features and SVD (Singular Value Decomposition) features. We compute the optimal weights for a 32-way ensemble using a non-negative least squares solver. For classification tasks, this variant also incorporates Platt scaling as an additional calibration step.
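The ensemble-weighting step can be illustrated with a small non-negative least squares solve. The sketch below uses projected gradient descent, which is one simple way to solve such a problem and not necessarily the solver used for TabFM-Ensemble. The function name and the two-member example are assumptions, whereas the configuration described above blends 32 members.

```python
def nnls_weights(preds, y, iters=500):
    """Find non-negative weights w minimizing mean((sum_k w[k] * preds[k] - y)^2).

    preds[k][i] is ensemble member k's prediction for validation sample i.
    """
    m, n = len(preds), len(y)
    # Upper bound on the gradient's Lipschitz constant gives a safe step size.
    lipschitz = 2.0 * sum(p * p for member in preds for p in member) / n
    w = [1.0 / m] * m
    if lipschitz == 0.0:
        return w  # all predictions are zero, nothing to fit
    step = 1.0 / lipschitz
    for _ in range(iters):
        blend = [sum(w[k] * preds[k][i] for k in range(m)) for i in range(n)]
        resid = [b - t for b, t in zip(blend, y)]
        grad = [2.0 * sum(preds[k][i] * resid[i] for i in range(n)) / n for k in range(m)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, wk - step * gk) for wk, gk in zip(w, grad)]
    return w


targets = [0.0, 1.0, 2.0, 3.0]
member_preds = [
    [0.0, 1.0, 2.0, 3.0],   # member that matches the targets exactly
    [3.0, 0.0, 1.0, 0.0],   # uninformative member
]
w = nnls_weights(member_preds, targets)
```

In this toy case the solver moves essentially all weight onto the first member and drives the second toward zero. The non-negativity constraint keeps any member from being subtracted from the blend.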
For comprehensive TabArena benchmark results—including detailed per-fold metrics and head-to-head win rates against specific baseline models—please visit our GitHub page.
Conclusion
By reframing tabular prediction as an in-context learning problem, TabFM utilizes a hybrid attention architecture and massive synthetic training data to natively capture complex feature interactions. This approach successfully eliminates the traditional bottlenecks of manual feature engineering, hyperparameter optimization, and repetitive model training, and consistently outperforms heavily tuned, industry-standard supervised algorithms. TabFM brings the out-of-the-box convenience of modern foundation models directly to tabular ML workflows, empowering practitioners to generate highly accurate predictions in a single forward pass.
To make this accessible, TabFM is being integrated directly into Google BigQuery. In the coming weeks, users will be able to perform advanced regression and classification using a simple AI.PREDICT SQL command in BigQuery — no ML expertise required.
Acknowledgements
This project is joint work with Erez Louidor Ilan, Taman Narayan, Shuxin Nie, Rajat Sen, Yichen Zhou, Joe Toth, Deqing Fu and Samet Oymak. We thank Kimberly Schwede for designing the graphics.
Google Releases Nano Banana 2 Lite, its Fastest and Cheapest AI Image Generator Yet
Google is aggressively pushing low-latency, high-volume AI generation with its new four-cent Nano Banana 2 Lite model.
Summary
Deep Dive
- Nano Banana 2 Lite focuses on throughput and latency for high-velocity developer pipelines.
- The model is optimized for prompt adherence, text legibility, and character consistency despite the focus on speed.
- Gemini Omni Flash provides a native pipeline to animate static images into 10-second video clips.
- Developers can access these tools through the Gemini API and Google AI Studio.
- Google is positioning this stack for e-commerce and advertising use cases, moving away from pure consumer-focused branding.
Decoder
- Nano Banana: Google's internal naming convention for its specialized family of lightweight, high-speed image generation models.
- Gemini Omni Flash: A multimodal model optimized for low-latency video generation, distinct from Google's more resource-intensive Pro-tier models.
- Prompt adherence: A measure of how strictly a generative model follows specific text instructions provided by the user.
Original Article
TL;DR
Google launched Nano Banana 2 Lite, an image model that generates in four seconds for under four cents per thousand images.
Google on Tuesday released Nano Banana 2 Lite, the fastest and cheapest model in its Nano Banana family of AI image generators. The model produces images in four seconds and costs under four cents per thousand images, making it the company’s most aggressive play yet for developers who need to generate visuals at scale. It is available immediately in Google AI Studio, the Gemini API, and the Gemini Enterprise Agent Platform.
Nano Banana 2 Lite is built for speed, not quality. Google positions it as the model for “rapid ideation and high-velocity developer pipelines” where latency and cost matter more than fine detail. The company’s existing Nano Banana 2, launched in February, remains the recommended model for work that demands higher fidelity, while Nano Banana Pro handles complex professional use cases.
The new model replaces the original Nano Banana, which Google now calls its “legacy model.” Despite prioritizing speed, Nano Banana 2 Lite retains what Google describes as reliable prompt adherence, strong character consistency, and legible text rendering inside images. It is also rolling out to consumer surfaces including AI Mode in Search, the Gemini app, NotebookLM, Google Photos, Stitch, Google Flow, and Google Ads.
Alongside the image model, Google announced a wider release of Gemini Omni Flash, its video-generation model first introduced at Google I/O in May. Omni Flash is now available to developers through the Gemini API and Google AI Studio for the first time, priced at ten cents per second of video output. Clips are capped at ten seconds, with longer durations expected later.
Google is pitching the two models as a pipeline. Developers can use Nano Banana 2 Lite to rapidly generate and iterate on images, then pass those images to Omni Flash to animate them into video. A new demo app called Omni Product Studio converts static images into what Google calls “cinematic e-commerce videos,” and two other demos let users place themselves into landmark photos or reimagine room interiors.
The releases land in a market where AI-generated imagery remains deeply polarizing. A recent study found that 60 percent of TikTok videos are now classified as AI-generated content, and the term “AI slop” has entered everyday vocabulary to describe low-quality machine-made media flooding social platforms. Google has leaned heavily into marketing its image tools for advertising and business use rather than consumer creativity, a framing that sidesteps some of the backlash but not all of it.
The company’s relationship with Hollywood is also drawing scrutiny. Google DeepMind struck a $75 million deal with indie studio A24 last week to develop AI filmmaking tools, a partnership that prompted significant criticism from fans and creative communities who accused A24 of undermining the artists it built its reputation championing. A24 has defended the partnership, saying it wants to “dictate what tools get built for artists” rather than leave those decisions to technology companies alone.
Nano Banana 2 Lite and Omni Flash are the latest additions to a generative-media stack Google has been building out aggressively since last year. The strategic bet is that making image and video generation fast enough and cheap enough will embed these tools into everyday developer workflows before the debate over their social costs is resolved.
39 principles for designing human-AI interaction
Designing effective AI interfaces requires moving away from deterministic UI patterns toward a framework that emphasizes user control, transparency, and recoverability.
Summary
Deep Dive
- Probabilistic Foundation: Acknowledge that AI output is not fixed; design for variance.
- Expectation Setting: Clearly label AI-generated content and define capability boundaries.
- Calibrated Trust: Use provenance (citations) instead of confidence scores to help users verify facts.
- Transparency: Implement progressive disclosure for complex reasoning steps.
- Control & Agency: Ensure AI actions are easily reversible and that users can override model behavior.
- Graceful Failure: Build specific paths for users to correct errors rather than just 'undoing' them.
- Responsible Autonomy: Require human approval for irreversible or high-stakes actions.
Decoder
- Probabilistic: A system that predicts the likelihood of an outcome rather than providing a single, guaranteed correct answer.
- Provenance: The documented history or source of data, used here to show users exactly where an AI retrieved its information.
- Mixed-initiative system: A design approach where both the human and the AI system contribute to task completion, shifting control dynamically based on the context.
Original Article
Full article content is not available for inline reading.
Meta Is Planning a Cloud Business to Sell AI Computing Power
Meta is preparing to monetize its massive AI infrastructure by launching a cloud service that sells excess compute power to external developers.
Summary
Original Article
Meta Platforms is developing an internal cloud infrastructure initiative to sell access to its surplus AI computing power and hosted models to external developers. This strategic pivot aims to generate a new revenue stream from the company's massive multi-billion-dollar investments in data centers and chips, directly challenging dominant hyperscalers like Amazon Web Services, Microsoft Azure, and Google Cloud.
Autoresearch: The feedback loop behind self-improving agents
Introspection is building infrastructure for 'autoresearch,' shifting from simple agent harnesses to self-improving feedback loops that include human input.
Summary
Deep Dive
- The product focus has moved from model capabilities to agent harnesses and now to feedback loops.
- Agent recipes act as portable containers for human expertise, including judges and evals.
- Autoresearch involves an 'outer loop' where secondary agents study and maintain the primary production agents.
- Introspection leverages Git as the primary audit log for agent decision-making.
- The framework treats humans as a core tool, using 'ask-a-human' triggers to learn preferences over time.
- Systems should be designed as a factory that progressively learns, rather than assuming full autonomy is possible immediately.
Decoder
- Autoresearch: A design pattern where an AI system automates the research and improvement cycle of its own performance through feedback loops.
- Harness: A testing and orchestration environment used to evaluate and deploy AI agents in specific tasks.
- Agent Recipe: A standardized configuration file containing the data, evals, and logic required for an agent to perform a specific domain task.
Original Article
Autoresearch: The feedback loop behind self-improving agents
Introspection co-founder Roland Gavrilescu explains autoresearch, agent “recipes,” self-improving loops, and why humans remain central to the software factory.
We’ve heard a lot about loops at the AI Engineer World’s Fair this week. Another buzzword is autoresearch, which involves building an “outer loop” where agents help maintain and improve the primary system, using feedback signals, evals and human input to make progress over time.
At least, that was the framing of Roland Gavrilescu, co-founder and CEO of Introspection — a new company building infrastructure for deploying these self-improving systems. Before starting the company, Gavrilescu worked on agent infrastructure and cloud agents at xAI, where he met his co-founder, Julian Bright.
Ahead of his “Autoresearch in the Wild” session at the AI Engineer World’s Fair today, I spoke with Gavrilescu about the shift from agent harnesses to feedback loops, the role of the open-source Pi framework, and why autonomous software factories must first learn from humans.
From xAI to Introspection
Latent Space: How did your new company, Introspection, come about?
Roland Gavrilescu: Last year, I was at xAI, where I met my co-founder. We were working on agent infrastructure and cloud agents, and we felt there was a new agent form factor that needed to be explored further. xAI was not necessarily the environment where we could focus completely on that.
We decided to leave and ask what a company designed around this new form factor might look like. We were interested in what made companies such as Cursor and Cognition successful, and how we could turn some of those ideas into a product that others could use.
That became the basis for Introspection.
Autoresearch allows you to build loops in which agents help maintain the system itself. The challenge is designing the right signals and feedback mechanisms so agents can improve the system, make architectural decisions and move in the right direction without constantly being bottlenecked by humans.
The loop becomes the product
Latent Space: Your session is titled “Autoresearch in the Wild” — what will it cover?
Gavrilescu: We have heard a lot about what autoresearch can do for improving experiments, but we wanted to talk about what these loops look like in production.
We are presenting three patterns that we think form the basis of a new blueprint.
The first is that the loop is the product. We have moved from focusing on models, to harnesses, and now to loops. The key question is whether you can define the right feedback mechanisms so agents can take on more work without generating more slop.
The second pattern concerns what the loop generates and how you track it over time. We are proposing a concept called an agent recipe.
We moved from agent tools to agent skills. Recipes are a larger container that brings together the components needed to encode human expertise: evals, judges, signal processing and the information that feeds back into the loop.
The goal is to create a portable format that agents can iterate on, almost like a research laboratory, but in a provider-agnostic way.
The third pattern is about what we optimize for. How can the system become both better and cheaper over time?
Companies such as Cursor and Cognition have shown that these products can work. The next stage is making them more accessible, faster and cheaper, and gradually distilling the capabilities of frontier models into systems that you own and that are customized for your environment.
Agent recipes
Latent Space: Can you explain more about what an agent recipe is…
Gavrilescu: It’s like a description of the ingredients you need and how they evolve.
The idea comes partly from data recipes used in model post-training. A data recipe describes how much data from different domains should be baked into a model.
Agent recipes are similar. A recipe might describe how your harness works with different models, the evals you use, the judges you have created, the human expertise you have captured and the failures that led to new evals.
Imagine that tomorrow you suddenly gained access to the Devin codebase. The code alone would not necessarily be that helpful if you could not see how the team arrived at the current version. You would want to understand the failures, mistakes and decisions that informed it.
A recipe captures that process. You begin with a baseline and then record how each signal produced a new judge, embedded new human expertise or led you to introduce a different model.
The inner loop and the outer loop
Latent Space: Does autoresearch mean orchestrating multiple agents, or can it involve one agent repeatedly working and verifying its results?
Gavrilescu: You can think of the system as having an inner loop and an outer loop.
The inner loop is the primary system interacting with users and performing the work. Autoresearch is more concerned with the outer loop: another system that studies and maintains the primary system.
The question is how to design that outer loop so it makes progress on the right problems without consuming an unreasonable number of tokens while deciding what to do.
Pi as the Linux of agent harnesses
Latent Space: You have compared Pi to Linux. In that analogy, is Introspection something like Red Hat?
Gavrilescu: Pi is like the Linux of agent harnesses. Linux has distributions such as Ubuntu, but the underlying system is designed to be extended. Pi is similar: it was never intended to be run as an unchanged, vanilla product. Pi separates the agent loop from its extensions and configuration, which makes the agent portable. You can spin up several different agents by loading different files into the runtime.
We saw an opportunity to combine that extensibility with recipes and open-source building blocks that can evolve for each customer while remaining portable and easy to deploy.
Making loops reliable in production
Latent Space: Reliability and the messy reality of agent loops have been recurring themes at the conference. How does Introspection address those problems?
Gavrilescu: The product is designed around the point at which you are ready to move into production.
You need to know what infrastructure is required to make the loops work, keep costs under control and maintain security. The managed infrastructure covers what is necessary for these systems to operate in production.
A major part of our focus is bringing the kind of infrastructure available inside frontier AI laboratories to a product that other companies can deploy.
Humans remain part of the system
Latent Space: What about the human in the loop?
Gavrilescu: These loops are designed with humans in the loop because you need the right signals as the system makes progress.
The human can effectively become a tool and a source of signals. Agents can be trained to ask people questions through an “ask a human” tool.
During its first few loops, an agent may rely heavily on asking questions and learning what a human would do. Over time, it accumulates those preferences and can become increasingly autonomous.
It is similar to an employee joining a new company. Initially, that employee asks a lot of questions. As they learn how the organization works, they can make more decisions independently.
Taking agent infrastructure into vertical markets
Latent Space: So what kinds of use cases are you seeing?
Gavrilescu: We are concentrating on vertical agents.
Coding agents are clearly working, and we have seen a number of companies succeed in that area. The next question is how to deploy agents in vertical and non-coding domains.
Companies in those markets are asking how they can do this securely without becoming dependent on a single provider. They want the deployment to belong to them, they want to retain ownership of their data, and they do not want to be locked into OpenAI or Anthropic. Introspection is intended to provide infrastructure that addresses those requirements using open-source building blocks.
Frontier AI labs have developed sophisticated internal agent technology. We want to bring similar capabilities into vertical SaaS and services businesses.
Why the work happens in Git
Latent Space: Is Introspection mainly intended for developers, or will product managers and other business users work with it?
Gavrilescu: We are initially focusing on software engineers in vertical SaaS companies.
We want the environment to be agent-friendly, meaning agents can work inside their own repositories and codebases. Everything is Git-based, and Git becomes the audit log that you maintain over time.
In the future, there will be interfaces that enable product managers and others to participate. But we are already seeing product managers move closer to code.
We think the right initial form factor is a human-to-agent interface in which the actual work and its history live in Git.
From orchestras to software factories
Latent Space: Does Introspection fit within the broader idea of software factories?
Gavrilescu: Yes. Designing the loops is essentially designing the factory. The remaining question is how much autonomy the factory should have.
There has also been discussion about “orchestras, not factories.” That distinction is really about the level of autonomy.
An orchestra might retain a human conductor who controls how the loops operate. A factory implies something more fully autonomous.
But you should build toward the factory rather than assume you can create a completely autonomous factory on the first day. Models do not initially possess all the context or understand every decision people inside an organization make. You cannot simply capture all of that knowledge in a Markdown file.
The right approach is to design the human as a core component of the factory. The early system should extract tacit knowledge and workflows from people over time, rather than attempting to automate everything immediately.
How to start with autoresearch
Latent Space: What would you recommend to engineers who want to experiment with autoresearch?
Gavrilescu: The first step is to invest in your signals. What are the things you actually want agents to respond to?
Product feedback is a good example. Not all feedback carries the same value, and you cannot respond to every individual data point. You need a mechanism for filtering the signals and identifying which ones an agent should act on.
The second requirement is control over cost. You do not want to wake up to an unexpected thousand-dollar bill because an agent has been running an inefficient loop.
The third is to follow the research. Look at the kinds of harnesses models are being trained to use and remain close to those patterns. Study how research labs use data recipes and consider how those ideas can be applied to your own product.
The broader goal is to turn your product organization into a miniature research lab, with agents acting as miniature researchers.
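As a rough illustration of the cost-control point, the sketch below caps an agent loop by dollars spent and by step count. The names `runWithBudget` and `runStep` are hypothetical and not from the interview; `runStep` stands in for one agent iteration and is assumed to report its own cost in dollars.

```javascript
// Hypothetical sketch: stop an agent loop once a spending cap or step cap is hit.
// `runStep(stepIndex)` is an assumed callback that runs one iteration and
// returns the cost of that iteration in dollars.
function runWithBudget(runStep, maxDollars, maxSteps) {
  let spent = 0;
  let steps = 0;
  while (steps < maxSteps) {
    spent += runStep(steps);
    steps += 1;
    if (spent >= maxDollars) {
      return { steps, spent, stoppedBy: 'budget' };
    }
  }
  return { steps, spent, stoppedBy: 'maxSteps' };
}

// Example: each iteration costs $2.50 and the cap is $10.
const budgetResult = runWithBudget(() => 2.5, 10, 100);
console.log(budgetResult.stoppedBy, budgetResult.steps, budgetResult.spent);
```

The cap is checked after each step, so a single expensive step can still overshoot it; a stricter version would estimate the cost before running the step.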
ZCode
ZCode launched as the official IDE harness for GLM-5.2, bringing agentic planning and coding features to desktop environments.
Summary
Decoder
- IDE (Integrated Development Environment): Software that consolidates basic developer tools (editor, compiler, debugger) into a single interface.
Original Article
Introducing ZCode, the official development environment for GLM-5.2
- GLM Coding Plan subscribers: now 1.5x usage quota in ZCode
- BYOK supported: works with your existing subscriptions and APIs
- Available on macOS, Windows, and Linux
Download now: zcode.z.ai/en
Learning to Replicate Expert Judgment in Financial Tasks
Custom models fine-tuned on expert-labeled financial datasets outperform frontier LLMs at a fraction of the cost.
Summary
Deep Dive
- Proprietary models built on expert-labeled data significantly outperform general-purpose LLMs on specific domain tasks.
- Frontier models such as Gemini, Claude, and GPT averaged only ~50% accuracy on financial filtering tasks when given simple zero-shot prompts.
- Expert-crafted prompts improved performance to roughly 75%, but hit a ceiling that necessitated fine-tuning.
- Data cleaning involved a verification scheme where a baseline model's predictions were compared against human labels; contested examples were routed to human experts.
- The team fine-tuned Qwen3-235B on Tinker, the training infrastructure from Thinking Machines Lab.
- Interleaved batching improved performance by 12.1% compared to fully mixed batches.
- On-policy distillation with a periodically promoted teacher yielded a further 3.1% gain over a frozen base-model teacher.
- The final model reduced inference costs by 13.8x compared to frontier alternatives.
Decoder
- Alpha: In finance, the excess return on an investment relative to the return of a benchmark index.
- F1 score: The harmonic mean of precision and recall, often used as a measure of a model's accuracy on classification tasks.
- CISPO loss: A policy-optimization loss that clips the importance-sampling weights, intended to improve training stability and performance.
- On-policy distillation: A training method in which a student model is scored on its own sampled outputs against a teacher model's probabilities; in this article, the teacher is periodically replaced by the best student checkpoint.
Original Article
Judging information
Outperforming the market is hard. When every investor has access to the same sources of public information, alpha must come from unique insight built on taste and judgment. A strong investor’s judgment is difficult to articulate and teach directly to others, whether human or AI. It comes from experience.
Even when we decompose an investor’s job into its simplest constituent tasks, those tasks turn out to be surprisingly difficult for LLMs. In this post, we consider a simple special case: filtering and processing financial documents to surface information relevant to investment decisions.
Investors are bombarded with information every day: news articles, research reports, company documents, emails, internal write-ups, and more. Reading is the easy part. The real work is the small, repeated judgments carried over it — filtering, interpreting, segmenting, and identifying where the useful signal lies. These judgments are embedded throughout an investor’s daily workflow and consume substantial time.
We wanted to see if we could automate the information triage task: identifying what is relevant and interesting to read. This alone could greatly augment investors’ productivity, letting them spend their freed-up attention on higher-level synthesis and decision-making.
Given that LLMs perform poorly on simple financial tasks, we asked: is it possible to teach LLMs financial judgment? We find that with high-quality human annotations, we can teach LLMs to interpret text with expert-level taste and judgment. Our proprietary model outperforms all frontier models we tested on information accuracy and recall, at a fraction of their cost.
We describe our training process and results on a subset of data cleared for public release. Based on our results, we further describe the seeds of a vision of differentiated intelligence, with models tuned for specific organizational needs.
Frontier model performance
We evaluated models on six information filtering tasks drawn from investors’ daily workflows. Many of our other internal tasks show the same pattern: the frontier models we tested underperform our internally trained models.
We measured accuracy — the percentage of documents that were correctly labeled according to our investors. For classification tasks, we also calculated the F1 score.
- Financial Article Relevancy: Given a financial article, classify whether it is relevant to a C-suite investment professional.
- Central Bank Document Relevancy: Given a central bank document, classify whether it signals the direction of future interest rate changes.
- Generic Document Relevancy: Given an investor's question and a research document, classify whether the document helps answer it.
- Ad Hoc Content Labeling: Research documents are either recurring (repeated boilerplate) or mixed (boilerplate plus one-off, issue-specific analysis). Classify which, and find the last page of issue-specific content.
- Document Truncation: Identify where boilerplate content begins in a document.
- Email Truncation: Identify where boilerplate content begins in an email.
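For readers who want the two metrics spelled out, here is a minimal sketch of accuracy and F1 for a binary relevancy task. The label strings and function names are illustrative assumptions, not taken from the post's evaluation code.

```javascript
// Fraction of predictions that match the expert labels.
function accuracy(predicted, expected) {
  let correct = 0;
  for (let i = 0; i < expected.length; i++) {
    if (predicted[i] === expected[i]) correct += 1;
  }
  return correct / expected.length;
}

// F1 for one "positive" class: harmonic mean of precision and recall.
function f1Score(predicted, expected, positive) {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (let i = 0; i < expected.length; i++) {
    const saidPositive = predicted[i] === positive;
    const isPositive = expected[i] === positive;
    if (saidPositive && isPositive) tp += 1;
    else if (saidPositive && !isPositive) fp += 1;
    else if (!saidPositive && isPositive) fn += 1;
  }
  const precision = tp + fp === 0 ? 0 : tp / (tp + fp);
  const recall = tp + fn === 0 ? 0 : tp / (tp + fn);
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}

// Toy data: four documents labeled by an expert, and a model's guesses.
const expertLabels = ['relevant', 'irrelevant', 'relevant', 'relevant'];
const modelLabels = ['relevant', 'relevant', 'irrelevant', 'relevant'];
console.log(accuracy(modelLabels, expertLabels));
console.log(f1Score(modelLabels, expertLabels, 'relevant'));
```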
These tasks are trivial for investors, but investors get stuck when asked to articulate their decision process. Consider the example of classifying a news article as relevant to an investment professional: the Greenland item is unlikely to be taken seriously given the context of the article, while the China tariffs are highly relevant. Yet both touch on geopolitics and finance.
In contrast to our investors, the frontier models we tested perform surprisingly poorly. Variants of Gemini, Claude, and GPT averaged a mere ~50% accuracy when given a prompt that simply states each of the six tasks to perform.
We first tried to improve LLM performance with stronger prompting. Our experts wrote instructions based on real task descriptions, and also suggested reframing certain tasks. For example, while an article about a small IPO is clearly financially relevant, it lacks the broad significance that would make it interesting to a macroeconomic investor at Bridgewater. Performance on the article classification task improved when the models were asked to sort news stories into three labels: relevant and interesting, relevant but uninteresting, and irrelevant.
These changes boosted the models' accuracy from a coin flip to the mid-70s. We saw no further gains in accuracy from automatic prompt-optimization methods. With our best prompts, the frontier models we tested still achieved less than 80% accuracy, the threshold investors expect from a system they could trust in their daily workflow.
Our results also suggest that newer models aren’t improving rapidly at this task, especially per dollar spent. GPT 5.4 costs 43% more than 5.2 but is only marginally more accurate.
An explicit prompt can only convey the intuition an expert is able to put into words, while the judgments that matter most are often the hardest to articulate. Fine-tuning sidesteps this: rather than contorting the expert’s intuition into a static prompt, the training process lets the model develop its own judgment.
Training dataset construction
The first challenge of training a custom model was acquiring a dataset that reflects high-quality investor taste. In particular, much of the information is only useful when filtered through an investment professional’s judgment.
We initially sourced a dataset from vendors providing non-expert labeling. Models trained on this dataset still performed poorly. After examining the model's reasoning traces, we realized that the labels in the dataset were often wrong. Since expert labelers are costly, we devised a verification scheme that routes only the contested examples to experts.
The scheme worked as follows: we trained a model on the dataset from non-expert labelers, then evaluated it on the same data. Examples where the model’s answer differed from the labelers’ were sent to our experts for reevaluation — if a model couldn’t match an example from its own training set then either the example is genuinely difficult, or the original label was wrong. This procedure was used to clean the training set data; the final evaluation was done on a held out test set.
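The routing step can be sketched in a few lines. This is only an illustration under stated assumptions: `predict` is a hypothetical stand-in for the model trained on the non-expert labels, and the example records and toy classifier are invented for the demo.

```javascript
// Split a labeled dataset into examples where the model agrees with the
// original label and "contested" examples to send to expert reviewers.
function splitByAgreement(examples, predict) {
  const agreed = [];
  const contested = [];
  for (const example of examples) {
    const predicted = predict(example);
    if (predicted === example.label) agreed.push(example);
    else contested.push({ ...example, predicted });
  }
  return { agreed, contested };
}

// Toy data and a toy keyword "model" (both hypothetical).
const labeledExamples = [
  { id: 1, text: 'Central bank raises rates', label: 'relevant' },
  { id: 2, text: 'Local bake sale results', label: 'relevant' },
  { id: 3, text: 'Celebrity gossip roundup', label: 'irrelevant' },
];
const toyPredict = (ex) => (ex.text.includes('bank') ? 'relevant' : 'irrelevant');

const split = splitByAgreement(labeledExamples, toyPredict);
console.log(split.contested.map((ex) => ex.id)); // only example 2 is contested
```

Only the contested subset goes to experts, which is what keeps the expert-labeling cost down.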
Training recipe
We trained our models on Tinker from Thinking Machines Lab. We chose Qwen3-235B as the base model as its fine-tuning performance is widely studied in the academic literature.
We began with standard GRPO and importance-sampling loss as a simple, critic-free starting point. This baseline approach resulted in a massive jump in model performance, but it still fell short of our desired 80% threshold.
We made the following modifications to our training recipe to push performance further:
1. Interleaved batching
For our multi-task training recipe, we compared three batching strategies: training each task sequentially, fully mixing tasks within a batch, and interleaving one batch per task in round-robin order. We found interleaving worked best, improving accuracy by 12.1% over fully mixed batches.
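To make the interleaved ordering concrete, here is a minimal sketch of a round-robin schedule in which every batch contains examples from a single task. The task names and data are placeholders, and the post does not describe its actual data loader, so treat this as one plausible reading of "one batch per task in round-robin order".

```javascript
// Split a list of examples into consecutive batches of at most `size` items.
function chunk(items, size) {
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Build a schedule that cycles through tasks, taking one batch per task per round.
// Tasks that run out of batches are skipped in later rounds.
function interleavedBatches(tasks, batchSize) {
  const perTask = Object.entries(tasks).map(([task, examples]) =>
    chunk(examples, batchSize).map((batch) => ({ task, batch }))
  );
  const rounds = Math.max(0, ...perTask.map((batches) => batches.length));
  const schedule = [];
  for (let round = 0; round < rounds; round++) {
    for (const batches of perTask) {
      if (round < batches.length) schedule.push(batches[round]);
    }
  }
  return schedule;
}

// Placeholder tasks: numbers stand in for training examples.
const schedule = interleavedBatches({ articles: [1, 2, 3, 4], emails: [5, 6] }, 2);
console.log(schedule.map((entry) => entry.task).join(' -> '));
```

By contrast, a fully mixed strategy would shuffle all tasks' examples together before batching, and a sequential strategy would exhaust one task before starting the next.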
2. CISPO loss with asymmetric clipping
We used CISPO loss with asymmetric clipping to replace the standard importance-sampling loss. Across the loss functions and clipping schemes we tried, this performed best, improving accuracy by 10.1% over the importance-sampling baseline.
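As a hedged sketch of what asymmetric clipping means here: in the published CISPO formulation as we understand it, the importance-sampling ratio is clipped and then treated as a constant weight on the advantage-weighted log-probability. The post does not give its clipping bounds, so `epsLow` and `epsHigh` below are placeholder values, and the function names are our own.

```javascript
// Clip the importance-sampling ratio with separate lower and upper bounds.
function clippedWeight(ratio, epsLow, epsHigh) {
  return Math.min(Math.max(ratio, 1 - epsLow), 1 + epsHigh);
}

// Loss value for a batch of tokens. In a real trainer the clipped weight is
// held constant (stop-gradient), so gradients flow only through logpNew.
// This plain-JavaScript version only evaluates the scalar loss.
function cispoLoss(tokens, epsLow, epsHigh) {
  let total = 0;
  for (const { logpNew, logpOld, advantage } of tokens) {
    const ratio = Math.exp(logpNew - logpOld);
    total += clippedWeight(ratio, epsLow, epsHigh) * advantage * logpNew;
  }
  return -total / tokens.length;
}

// Toy batch: the second token's ratio (about 7.39) is clipped to 1 + epsHigh.
const toyTokens = [
  { logpNew: -0.5, logpOld: -0.5, advantage: 1 },
  { logpNew: -1, logpOld: -3, advantage: 2 },
];
console.log(cispoLoss(toyTokens, 0.2, 0.3));
```

Asymmetric simply means `epsLow` and `epsHigh` need not be equal, so upward and downward ratio changes can be bounded differently.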
3. On-policy distillation with strong teachers
We train with on-policy distillation (OPD), constructing the advantage as follows:
r = reward - beta * avg(student_lp - teacher_lp)
adv_i = r_i - avg(r)
The reward is penalized when the student drifts from the teacher’s distribution, regularizing the policy while it learns the task. Every 20 steps, we promote the current checkpoint to the teacher — but only if validation accuracy has reached a new high, so we never distill toward a weaker model. This gave a further 3.1% gain over a frozen base-model teacher.
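The two formulas and the promotion rule above can be written out as follows. This is a sketch under assumptions: the averages are taken over the tokens of each sample and over the samples in a group, the field names are our own, and the `beta` value is a placeholder, since the post does not state these details.

```javascript
const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

// r_i = reward_i - beta * avg(student_lp - teacher_lp), averaged over tokens;
// adv_i = r_i - avg(r), averaged over the samples in the group.
function opdAdvantages(samples, beta) {
  const penalized = samples.map((sample) => {
    const gaps = sample.studentLp.map((lp, t) => lp - sample.teacherLp[t]);
    return sample.reward - beta * mean(gaps);
  });
  const baseline = mean(penalized);
  return penalized.map((r) => r - baseline);
}

// Every 20 steps, promote the checkpoint to teacher only on a new accuracy high.
function maybePromote(state, step, valAccuracy, checkpoint) {
  if (step % 20 !== 0) return state;
  if (valAccuracy <= state.bestAccuracy) return state;
  return { teacher: checkpoint, bestAccuracy: valAccuracy };
}

// Toy group of two samples with per-token log-probabilities.
const toySamples = [
  { reward: 1, studentLp: [-1, -1], teacherLp: [-1, -1] },
  { reward: 0, studentLp: [-1, -1], teacherLp: [-2, -2] },
];
console.log(opdAdvantages(toySamples, 0.5));

let teacherState = { teacher: 'base', bestAccuracy: 0.7 };
teacherState = maybePromote(teacherState, 20, 0.8, 'ckpt-20');
console.log(teacherState.teacher);
```

The second sample is penalized because the student assigns its own output a higher log-probability than the teacher does, which is the drift the regularizer discourages.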
Results
Our trained model reaches 84.7% average accuracy, compared with 78.2% for the best frontier model we evaluated, which means it makes 29.8% fewer mistakes. We find this level of accuracy is sufficient for our daily work.
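The 29.8% figure follows from comparing error rates rather than accuracies: the error rate falls from 21.8% to 15.3%. A short check, with a helper name of our own choosing:

```javascript
// Relative reduction in error rate when accuracy moves from one value to another.
function relativeErrorReduction(baselineAccuracyPct, newAccuracyPct) {
  const baselineError = 100 - baselineAccuracyPct;
  const newError = 100 - newAccuracyPct;
  return ((baselineError - newError) / baselineError) * 100;
}

console.log(relativeErrorReduction(78.2, 84.7).toFixed(1)); // "29.8"
```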
Our trained model is also vastly cheaper due to its smaller size: a 13.8x reduction in inference costs per task. As we plan to rely on more models trained to help with specific tasks and to scale AI across the organization, cost is an important consideration.
Conclusion
The frontier models we tested struggle with relatively simple financial tasks, and model advances don’t improve performance much. In contrast, we’ve shown that high-quality proprietary datasets labeled by expert investors and used for fine-tuning produce custom models that exceed frontier performance on our tasks. We have found that this outcome holds true well beyond the six tasks we’ve discussed in this post.
Aside from higher accuracy, custom models are also substantially cheaper. We expect to see more productivity gains from custom model training in the future, especially with the availability of training infrastructure like Tinker that enables rapid experimentation.
Our results show the possibility of a future of differentiated intelligence, where custom models tuned to specific organizational needs outperform frontier models.
For the First Time, a Cell Built From Scratch Grows and Divides
Biologists have successfully built a non-living synthetic cell from scratch that can grow, replicate DNA, and divide, marking a major milestone in synthetic biology.
Summary
Deep Dive
- Researchers assembled a synthetic genome without metabolic genes.
- Feeder liposomes were used to supply essential proteins, ribosomes, and sugar.
- Division was achieved by attaching protein tags to the membrane to induce physical bending.
- The synthetic cells exhibit basic genetic replication and population growth.
- The team plans to release tools and methods via the new nonprofit, Biotic.
Decoder
- Liposome: A spherical vesicle having at least one lipid bilayer, used here as the synthetic cell's membrane.
- Abiogenesis: The process by which life arises from non-living matter.
- Protocell: A self-organized, endogenously ordered, spherical collection of lipids proposed as a stepping-stone to the origin of life.
Original Article
For the First Time, a Cell Built From Scratch Grows and Divides
For the very first time, biologists packed nonliving components into a cell-like membrane, piece by piece, and witnessed the bag of molecules start to behave like life. The lab-made synthetic cell grew, replicated its DNA, and divided, demonstrating the basic functions of a cell cycle.
It’s “an impressive step,” said Jack Szostak, who studies the origins of life at the University of Chicago and was not involved in the research. “I don’t know of any other effort to put together an artificial cell from biological components that has progressed so far.”
The cell is not alive by any definition. It can’t survive without constant deliveries of food and ribosomes, the machinery needed to make proteins. It has no defenses and no good waste-removal system. But it’s the strongest demonstration yet that it is possible to generate life from nonlife, a goal that synthetic biologists have been chasing for decades.
“It’s a big step forward to this holy grail of making a living thing out of dead components,” said Sijbren Otto, a systems chemist at the Stratingh Institute for Chemistry in the Netherlands who was not involved in the work. “It’s not completely there yet, but it’s definitely getting quite close.”
Since these cells were pieced together from scratch, and all the molecular parts were crafted in the lab, scientists can tinker with the system and switch components in and out. “I have a blueprint, I have a full chemical ingredient list of every component,” said Kate Adamala, a synthetic biologist at the University of Minnesota who led the new study, which is not yet peer-reviewed; the paper was posted on the scientific preprint site biorxiv.org on July 2. With such flexibility, this kind of synthetic cell could eventually be coaxed to create new materials, such as biofuels and drugs, and help researchers study disease.
It could also give scientists insight into some of their deepest existential questions: What is the minimum needed to sustain life? How could life start? What happens if we alter the biology that composes life on Earth today?
Or, as Adamala put it: “What else can biology do?”
Building Life
Some 4 billion years ago, a bunch of nonliving molecules got together to form the first protocells. They fed, grew, and divided. Then, over time, evolutionary processes emerged that let these cells change and diversify into many different types, decorating a barren world with all manner of strange beings. A purely chemical world blossomed into a biological one. Scientists cannot agree on how this shift from nonlife to life, or abiogenesis, happened, but some have turned their sights on trying it out for themselves in the lab.
For decades, researchers have taken different approaches to this challenge. Some, like the synthetic biologist John Glass at the J. Craig Venter Institute, are stripping down bacterial cells to their smallest, barest genomes to reveal a cell’s minimum requirements to stay alive. Others, like Otto, try to build cells with molecules that differ from those found in Earth biology.
Adamala also works from the ground up, but with biological molecules found in nature today. When she started her lab in 2016, she envisioned assembling a synthetic cell, a proof of concept, that could undergo a complete cycle of cell division using its own genome.
She found an instruction manual in what all known cells have in common: They grow, they duplicate their DNA, they divide, and they evolve. They transcribe their DNA into RNA and then make proteins to carry out these tasks and others that keep a cell running, such as metabolizing molecules for energy. All of this is done inside a lipid membrane, which holds all the necessary materials in one place. Adamala’s team needed to build their synthetic cell a genome and supply it with all the materials to carry out those tasks.
They developed and optimized different ingredients, most inspired by other labs, before combining them together inside liposomes — hollow sacs enclosed by a simple lipid membrane. This would serve as the cellular body.
They started with a cell’s most fundamental system: its mechanism for copying its DNA and passing it down to daughter cells. They adopted a DNA replication system, pioneered by the synthetic biologists Hannes Mutschler and Christophe Danelon, and tweaked it to work alongside other systems, including a commercial pack of 36 enzymes that let the cell read DNA and make proteins. Adamala’s team fiddled with their cellular brew, switching genes in and out and adjusting concentrations of various molecules, to get the crucial information-carrying and protein-making genetic systems to jibe.
Their tiny synthetic genome did not encode any metabolic genes, which would let the cell process food and energy, or many of the complex molecules a cell needs. So, in parallel, the researchers prepped some supply packs.
They filled other liposomes with sugar, lipids, and enzymes, as well as complex molecules, such as transfer RNA (tRNA) and ribosomes, which work together to translate genetic instructions into proteins. For their protocell to accept these crucial supplies, the team also modified a protein that would sit in the cell membrane and attract the lipid bubbles. When a bubble bumped into the cell, their membranes would fuse, releasing the supplies inside.
It wasn’t easy to get all these genetic systems to work together successfully. After some more tweaking and optimizing, the cell started growing and replicating its DNA.
“I was almost ready to say ‘Done’ and ‘We’re going to publish it,’” Adamala recalled. But her vision for a synthetic cell had one more step: division.
This was where the field had been stuck for some time. Researchers before Adamala had figured out different ways to feed and grow synthetic cells and to replicate their DNA. But cell division is a different beast. A typical cell reorganizes its cytoskeleton — a network of protein fibers that provide structural support — to halve its DNA and split. Synthetic biologists could not figure out how to get their cells to undergo this complex process.
So Adamala decided to ditch the cytoskeleton. One day, while tearing through the literature, she came across an interesting mechanism in a paper. By attaching protein tags to a cell membrane, the synthetic biologist Reinhard Lipowsky at the Max Planck Institute of Colloids and Interfaces attracted other proteins to crowd around and physically bend the membrane, forcing the cell to divide. Following this approach, Adamala tweaked a cell-membrane protein and tested it in her protocells. After several tries, it worked.
“I wasn’t allowing myself to believe it for a while,” she said. “It was like, ‘Holy shit, did I actually make a dividing cell?’ … At some point, you’ve been checking enough that [you think], ‘OK, now it’s real.’”
This paper “beautifully demonstrates this division mechanism,” said Job Boekhoven, a systems chemist at the Technical University of Munich who was not involved in the study. “That has been a huge achievement.”
By putting together systems inspired by different labs — DNA replication; feeder liposomes; and swarming, division-inducing proteins — and then optimizing them to work together, Adamala’s team showed that it is possible to induce the chemical world to form a biological one in the lab.
“Combining all of these things is a staggering technical accomplishment,” Glass said. “I think it will prove to be a watershed event for the synthetic-cell field and biology in general.”
Michael Lynch, an evolutionary biologist at Arizona State University who was also not involved in the study, agreed. It is “a synthetic biology tour de force,” he said. However, he also cautioned against over-hyping the cell since it’s not yet self-sustaining.
Once the synthetic cells were created, her students and others started calling them Adamala cells — a moniker she hated. She insisted that they name the cells after anything else, jokingly suggesting potatoes. So her students started calling them spudcells. “I’m Polish, I’m mostly made of potatoes, so that’s fine with me,” Adamala said.
Each cell is tiny. Its genome is way smaller than bacterial genomes, and it doesn’t look like anything special. It’s “beautiful to me because I’m super excited about it,” Adamala said. “But if you look at it under the microscope, it’s like, ‘OK, it’s a blob.’”
Evolution and Beyond
The cell could grow and divide. But could it take the next step toward life by evolving?
The researchers started fiddling with the synthetic cell’s DNA to see if they could get some cells to grow larger or divide faster — in effect, creating genetic variation in the cell population. They found that the cells that grew bigger also had more daughter cells and started to become more populous. In other words, those traits started being selected for within the population, the first step toward evolution.
What Adamala’s team demonstrated was not quite natural selection, the primary mechanism that drives evolutionary change, in which organisms that are better adapted to their environment are more likely to survive. Even if she got their cell to produce more daughter cells, she doesn’t think it would lead to evolution. That’s because Adamala’s team had to create genetic variation synthetically, instead of allowing for random mutations in DNA. The enzyme that builds new DNA strands works too well, she said; it doesn’t introduce meaningful mutations into the sequence. They will need to find an enzyme that is more error-prone, but not so error-prone that the genome’s integrity and the cell’s function are lost.
“Biology needs to change fast enough, but not too fast,” Adamala said. She said that she needs to find the sweet spot between order and chaos, referencing the biochemist and complexity theorist Stuart Kauffman, a professor emeritus at the University of Pennsylvania, who argues that biology works best at the “edge of chaos.”
A clear demonstration of an evolutionary process is “clearly something that’s missing,” Boekhoven said. “I’m sure that that’s the next big step.” Other researchers have shown adaptive evolution in other types of synthetic cells. But those cells were bacteria stripped of all but the bare minimum of genes — they weren’t built from the ground up.
The cells are also limited by the fact that they need to be fed many of their raw materials. That the cells can’t make their own ribosomes, the way natural cells do, “limits [their] potential for growth and sustained reproduction,” said Szostak, who was Adamala’s doctoral adviser. “If their system was able to generate its own ribosomes and other proteins and RNAs, it would be much closer to existing biological cells such as bacteria.”
Adamala also thinks they will need to figure out a way to add a cytoskeleton to improve their replication system. Currently, the cells waste a lot of energy and time attracting molecules to crowd around and help them divide.
All told, scientists are far from building anything remotely close to a modern living cell — but this new one is still the most lifelike yet. “The modern cell is like a Dreamliner,” Adamala said, referring to the Boeing 787 airplane. “We built a Wright flyer… the first bike frame with wings that flies 100 feet.”
Alongside sharing the new results, Adamala and other synthetic biologists announced the formation of a nonprofit called Biotic, which they will use to make their synthetic biology tools available to researchers around the world. The team is releasing their data and methods so that synthetic biologists can start building and improving on their cell. The hope is that the work can be used, decades from now, to create plastics without fossil fuels, for example, or fertilizers or drugs.
These synthetic cells could also pave the way to the past, to the origins of biology itself. Life on Earth would have started from much simpler molecules than the ones that spudcells use. Still, Adamala’s creation of a synthetic cell system from non-living materials brings researchers a step closer to exploring, in the lab, deeper questions about life’s origins and requirements, a dream she shares with others.
“If you want to understand what life is,” Boekhoven said, “you need to first build life.”
Meta Build for Wearables
Meta has released documentation for developers to build Web Apps for the Meta Ray-Ban Display, requiring a strictly constrained, keyboard-navigated, additive-light UI.
Summary
Deep Dive
- UI must use dark backgrounds due to the additive waveguide display.
- All interactive elements must have a focus state clearly defined by CSS.
- Sensors like IMU and geolocation require user permission via user-triggered events.
- SVGs are not supported for app icons; use PNGs or Unicode.
- Storage is limited to 5MB for both localStorage and sessionStorage.
Decoder
- Additive Waveguide Display: An optical system that projects light into the eye to overlay digital images on the real world; black pixels appear transparent.
Original Article
Build
Overview
Web Apps for Meta Ray-Ban Display (MRBD) use standard web APIs. The easiest way to build Web Apps is using AI coding tools. Learn how to build optimized Meta Ray-Ban Display Web Apps by understanding:
- How to Build with AI
- Capabilities and Best Practices
- Display
- Input
- Sensors
- Location
- Local Storage
- App Icons
Build with AI
AI coding tools and app platforms such as Replit, Manus, Lovable, Claude Code, Vercel, and Cursor can help build Web Apps for Meta Ray-Ban Display glasses when prompts include the platform constraints. The most reliable setup combines the Web Apps GitHub plugin, this guide, and the Wearables MCP endpoint https://mcp.developer.meta.com/wearables and its search_webapps_docs tool.
Copy a starter prompt
Use these prompts to get started quickly. They tell the assistant to inspect your project first, use the Wearables MCP endpoint https://mcp.developer.meta.com/wearables to call search_webapps_docs for current docs, handle unavailable MCP tools explicitly, and keep the first code change small.
New Web App
Use https://wearables.developer.meta.com/docs/develop/webapps/build/, then use the Wearables MCP endpoint https://mcp.developer.meta.com/wearables to call search_webapps_docs for current Meta Ray-Ban Display Web Apps constraints, setup, testing, and publishing guidance. If search_webapps_docs is unavailable, use the linked build guide and state that MCP docs lookup was unavailable before proceeding. Inspect my project first, then build the smallest working Web App for this idea: [describe the app]. It must render in a fixed 600 x 600 pixel viewport, avoid scrolling, use a dark additive-display UI, support arrow-key navigation and Enter activation, and keep all interactive elements reachable without mouse or touch input. Run the relevant local checks.
Navigation and focus behavior
Use https://wearables.developer.meta.com/docs/develop/webapps/build/, then use the Wearables MCP endpoint https://mcp.developer.meta.com/wearables to call search_webapps_docs for current Meta Ray-Ban Display Web Apps input, D-pad navigation, focus-management, viewport, and no-scroll guidance. If search_webapps_docs is unavailable, use the linked build guide and state that MCP docs lookup was unavailable before proceeding. Inspect this Web App first, then add the smallest reliable navigation fix. Preserve arrow-key movement, Enter activation, visible focus states, a fixed 600 x 600 pixel viewport, and no scrolling. Run the relevant local checks.
Debug an existing Web App
Use https://wearables.developer.meta.com/docs/develop/webapps/build/, then use the Wearables MCP endpoint https://mcp.developer.meta.com/wearables to call search_webapps_docs for current Meta Ray-Ban Display Web Apps constraints and troubleshooting guidance. If search_webapps_docs is unavailable, use the linked build guide and state that MCP docs lookup was unavailable before proceeding. Inspect this Web App first and identify whether the issue is layout, input navigation, storage, sensors, or deployment. Preserve the fixed 600 x 600 pixel viewport, no-scrolling behavior, arrow-key and Enter input model, and dark additive-display UI. Make the smallest fix and run the relevant local checks.
HTML metadata
Add the following metadata to the <head> of your HTML file. This allows your web app to support upcoming discovery surfaces and enables us to message users when a website isn’t compatible with MRBD.
<head>
<!-- A brief description of your app -->
<meta name="description" content="Description of your web app">
<!-- Identify your web app as MRBD-compatible -->
<meta name="mrbd-web-app-capable" content="yes">
</head>
Capabilities
Summary
Below are the currently supported capabilities for Web Apps.
| Capability | Description and guidance |
|---|---|
| Display | Additive waveguide overlay. Use dark backgrounds/light, high-contrast UI colors. Fixed 600x600px viewport. Avoid scrolling. |
| Input | Navigation via Neural Band/captouch gestures translates to standard arrow key and Enter events. No mouse/touch/keyboard. All elements must be focusable. |
| Sensors (IMU) | Standard DeviceMotionEvent (accelerometer, gyroscope) and DeviceOrientationEvent (heading, tilt, roll) W3C APIs. Requires user permission. |
| Location (GPS) | Standard navigator.geolocation W3C API. Location is fetched from the paired mobile device. Requires user permission. |
| Local Storage | Standard Web Storage APIs (localStorage, sessionStorage). Best for lightweight data (preferences, small caches). Use JSON for structured data. |
| App Icons | Use Unicode symbols or high-resolution PNG favicons (>= 52x52 px) via <link> tags or Web App Manifest. SVGs are not supported. |
Unsupported Capabilities: Web Apps do not yet support:
- Camera
- Microphone
- Text Input
- Offline Support
- Notifications
- Back Navigation
Also, there is no continuous cursor support for Web Apps.
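Returning to the Local Storage row of the capabilities table, the recommendation to use JSON for structured data can be sketched with two small helpers. `saveJson`, `loadJson`, and `memoryStorage` are hypothetical names; the helpers only rely on the standard `getItem` and `setItem` methods, so on the glasses you would pass `window.localStorage` instead of the in-memory stand-in used here for testing.

```javascript
// Serialize a value to JSON and store it under `key`.
function saveJson(storage, key, value) {
  storage.setItem(key, JSON.stringify(value));
}

// Read and parse a stored value; return `fallback` if missing or corrupt.
function loadJson(storage, key, fallback) {
  const raw = storage.getItem(key);
  if (raw === null) return fallback;
  try {
    return JSON.parse(raw);
  } catch (err) {
    return fallback;
  }
}

// In-memory stand-in with the same getItem/setItem shape as Web Storage,
// so the helpers can be exercised outside a browser.
function memoryStorage() {
  const data = new Map();
  return {
    getItem: (key) => (data.has(key) ? data.get(key) : null),
    setItem: (key, value) => { data.set(key, String(value)); },
  };
}

const prefsStore = memoryStorage();
saveJson(prefsStore, 'prefs', { theme: 'dark', fontPx: 20 });
console.log(loadJson(prefsStore, 'prefs', {}).theme);
```

Passing the storage object in as a parameter keeps the helpers testable and makes it easy to swap between `localStorage` and `sessionStorage`.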
Display
The display is an additive waveguide that overlays rendered pixels onto the wearer’s real-world view. This has a direct impact on how your app looks.
- A pixel rendered as pure black is fully transparent, since it contributes zero light.
- Bright, vivid colors are the most visible, because they add light on top of the real-world scene.
Color and typography
Because the display is additive, color choices matter more than on a conventional screen:
- Use dark backgrounds, since they effectively disappear. Bright backgrounds cause glare and reduce readability.
- Use light, high-contrast colors for UI elements like text and interactive components. Also, use bright colors for accents.
- Use large, readable fonts: a minimum of 16 px for body text and 20-24 px for primary content.
Viewport
All content should render within a fixed 600 x 600 pixel viewport and avoid scrolling. Include the following viewport meta tag to lock the scale and prevent unexpected zooming:
<!-- optional -->
<meta name="viewport" content="width=600, height=600, initial-scale=1.0, user-scalable=no">
Set overflow: hidden on the <body> element to ensure no content extends beyond the viewport boundary:
body {
width: 600px;
height: 600px;
overflow: hidden;
}
Input: Neural band and captouch gesture
MRBD UI navigation is driven by two input mechanisms: the Neural Band and a touch strip on the glasses temple arm that senses swipe gestures. They produce directional and selection inputs that the glasses OS translates into standard arrow key (ArrowUp, ArrowDown, ArrowLeft, ArrowRight) and Enter events delivered to your Web App.
Note: Since there is no mouse, touch screen, or physical keyboard, every interactive element of your Web App must be reached and activated by these gestures.
JavaScript
// — Input Constants —
const DPAD = {
UP: 'ArrowUp', DOWN: 'ArrowDown',
LEFT: 'ArrowLeft', RIGHT: 'ArrowRight',
SELECT: 'Enter', BACK: 'Escape',
};
// — Focus Management —
function moveFocus(direction) {
var focusables = Array.from(
document.querySelectorAll('.focusable:not([disabled]):not(.hidden)')
);
if (!focusables.length) return;
var idx = focusables.indexOf(document.activeElement);
if (idx === -1) { focusables[0].focus(); return; }
var next = (direction === 'up' || direction === 'left')
? (idx > 0 ? idx - 1 : focusables.length - 1)
: (idx < focusables.length - 1 ? idx + 1 : 0);
focusables[next].focus();
focusables[next].scrollIntoView({ block: 'nearest', behavior: 'smooth' });
}
// — D-pad Listener —
document.addEventListener('keydown', function(e) {
switch (e.key) {
case DPAD.UP: moveFocus('up'); break;
case DPAD.DOWN: moveFocus('down'); break;
case DPAD.LEFT: moveFocus('left'); break;
case DPAD.RIGHT: moveFocus('right'); break;
case DPAD.SELECT:
if (document.activeElement.classList.contains('focusable')) {
document.activeElement.click();
}
break;
case DPAD.BACK: history.back(); break;
default: return; // don't preventDefault on unhandled keys
}
e.preventDefault();
});
HTML
<!-- Mark interactive elements with the focusable class -->
<button class="focusable" data-action="settings">Settings</button>
<button class="focusable" data-action="start">Start</button>
CSS
.focusable {
transition: all 150ms ease;
border: 2px solid transparent;
min-height: 88px; /* glasses minimum tap target */
}
.focusable:focus {
outline: none;
border-color: #00d4ff;
box-shadow: 0 0 20px rgba(0, 212, 255, 0.4);
}
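The wrap-around arithmetic inside moveFocus can also be isolated into a pure function, which makes it testable without a DOM. This is a minimal sketch; nextFocusIndex is a hypothetical helper name, not part of the glasses runtime.

```javascript
// Returns the index to focus next, wrapping at both ends.
// A currentIndex of -1 means nothing in the list is focused yet.
function nextFocusIndex(currentIndex, count, key) {
  if (count <= 0) return -1; // nothing focusable
  if (currentIndex < 0 || currentIndex >= count) return 0;
  var step = (key === 'ArrowUp' || key === 'ArrowLeft') ? -1 : 1;
  return (currentIndex + step + count) % count;
}
```

Like moveFocus, this treats the focusable elements as a single linear sequence, so ArrowLeft behaves like ArrowUp and ArrowRight like ArrowDown.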
Sensors
Overview
Meta Ray-Ban Display glasses expose access to accelerometer, gyroscope, and compass data through the standard DeviceMotionEvent and DeviceOrientationEvent web APIs. Simply add event listeners on window as you would in any mobile browser.
Requesting permission
Motion and orientation data require an explicit user permission grant. For cross-platform compatibility, check whether DeviceOrientationEvent.requestPermission() exists and call it before attaching listeners.
function startIMU() {
window.addEventListener('deviceorientation', handleOrientation);
window.addEventListener('devicemotion', handleMotion);
}
// Check whether requestPermission exists before calling it
if (typeof DeviceOrientationEvent !== 'undefined' &&
typeof DeviceOrientationEvent.requestPermission === 'function') {
// Platforms that require explicit permission (e.g., iOS Safari)
DeviceOrientationEvent.requestPermission()
.then(function(state) {
if (state === 'granted') {
startIMU();
}
});
} else {
// Glasses runtime and most Android browsers grant automatically
startIMU();
}
Note: The permission request must be triggered by a user gesture (for example, a button press via Enter key). It cannot be called automatically.
DeviceMotionEvent
DeviceMotionEvent provides real-time accelerometer and gyroscope readings. Use it to detect movement, measure G-forces, or track rotation speed.
window.addEventListener('devicemotion', function(e) {
// Either reading can be null if the sensor is unavailable
var acc = e.accelerationIncludingGravity;
var rot = e.rotationRate;
if (!acc || !rot) return;
// Accelerometer (including gravity), in m/s²
var ax = acc.x;
var ay = acc.y;
var az = acc.z;
// Compute magnitude in G-force
var g = Math.sqrt(ax * ax + ay * ay + az * az) / 9.81;
document.getElementById('gforce').textContent = g.toFixed(2) + ' G';
// Gyroscope (rotation rate in degrees/second)
var yawRate = rot.alpha;
var pitchRate = rot.beta;
var rollRate = rot.gamma;
});
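For tilt-based interactions, the gravity vector can be converted into pitch and roll angles. The sketch below uses the usual atan2 formulas. The mapping of x, y, and z to the glasses frame is an assumption here and should be verified on the device.

```javascript
// Convert a gravity vector (m/s²) into pitch and roll, in degrees.
// Which physical direction each axis points is an assumption.
function tiltFromGravity(ax, ay, az) {
  var toDeg = 180 / Math.PI;
  var pitch = Math.atan2(-ax, Math.sqrt(ay * ay + az * az)) * toDeg;
  var roll = Math.atan2(ay, az) * toDeg;
  return { pitch: pitch, roll: roll };
}
```

Inside a devicemotion handler this could be called as tiltFromGravity(ax, ay, az) with the accelerationIncludingGravity components.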
DeviceOrientationEvent
DeviceOrientationEvent provides the current orientation of the glasses relative to the Earth. Use it for compass heading, tilt detection, or spatial UI.
window.addEventListener('deviceorientation', function(e) {
var heading = e.alpha; // Compass direction (rotation around z-axis): 0-360°
var tilt = e.beta; // Forward/back tilt (rotation around x-axis): -180° to 180°
var roll = e.gamma; // Left/right tilt (rotation around y-axis): -90° to 90°
// Values are null when orientation data is unavailable
if (heading === null) return;
document.getElementById('heading').textContent = heading.toFixed(1) + '°';
});
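A numeric heading is often easier to read on a small display as a compass point. The following sketch maps degrees to one of eight labels. It assumes the value passed in is already a clockwise compass heading. Whether e.alpha needs conversion on the glasses runtime is something to confirm on the device. headingToCardinal is a hypothetical helper name.

```javascript
var CARDINALS = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW'];

// Map a heading in degrees to the nearest of eight compass points.
function headingToCardinal(degrees) {
  if (typeof degrees !== 'number' || !isFinite(degrees)) return null;
  var normalized = ((degrees % 360) + 360) % 360; // handles negative input
  return CARDINALS[Math.round(normalized / 45) % 8];
}
```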
Best practices
| Consider | Avoid |
|---|---|
| Requesting permission from a user gesture (for example, button press) | Calling requestPermission() automatically on page load |
| Checking for API availability before attaching listeners | Assuming DeviceOrientationEvent is always defined |
| Throttling or debouncing high-frequency sensor updates for UI rendering | Updating the DOM on every single sensor event without throttling |
| Using accelerationIncludingGravity for tilt-based interactions | Relying on acceleration alone when gravity context is needed |
| Removing event listeners when sensor data is no longer needed | Leaving listeners active in the background, which drains battery |
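The throttling advice in the table can be implemented with a small wrapper. This is a sketch rather than a runtime-provided utility: throttle and its injectable now parameter are hypothetical, and the 100 ms interval in the usage comment is only an example.

```javascript
// Wrap a handler so it runs at most once per intervalMs.
// `now` is injectable so the behavior can be tested without real timers.
function throttle(handler, intervalMs, now) {
  var clock = now || Date.now;
  var last = -Infinity;
  return function () {
    var t = clock();
    if (t - last < intervalMs) return false; // too soon, drop this event
    last = t;
    handler.apply(this, arguments);
    return true;
  };
}

// Usage in a page (browser only):
// var onMotion = throttle(handleMotion, 100);
// window.addEventListener('devicemotion', onMotion);
// ...later, when sensor data is no longer needed:
// window.removeEventListener('devicemotion', onMotion);
```

Keeping a reference to the wrapped function matters, because removeEventListener only works with the same function object that was registered.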
Location
Overview
MRBD glasses implement the standard navigator.geolocation web API. Location data is fetched from the wearer’s paired mobile device, since the glasses themselves do not have location-aware sensors. Use the API exactly as you would in any Web App. Like Sensor Data, Location also requires user permission.
One-shot position
Use getCurrentPosition to request a single location fix.
navigator.geolocation.getCurrentPosition(
function(position) {
var coords = position.coords;
console.log('Latitude: ' + coords.latitude); // Decimal degrees
console.log('Longitude: ' + coords.longitude); // Decimal degrees
console.log('Accuracy: ' + coords.accuracy); // m
console.log('Altitude: ' + coords.altitude); // m (may be null)
console.log('Speed: ' + coords.speed); // m/s (may be null)
console.log('Heading: ' + coords.heading); // Degrees from north (may be null)
console.log('Timestamp: ' + position.timestamp); // ms since epoch (UTC)
},
function(error) {
// error.code:
// 1 = PERMISSION_DENIED: wearer denied permission request
// 2 = POSITION_UNAVAILABLE: location could not be retrieved (e.g., phone offline)
// 3 = TIMEOUT: request exceeded the timeout
// error.message - human-readable description
console.error('Geolocation error:', error.code, error.message);
},
{ timeout: 15000 }
);
Continuous position tracking
Use watchPosition to receive ongoing location updates as the wearer moves.
var watchId = navigator.geolocation.watchPosition(
function(position) {
// Called each time the position updates
updateMap(position.coords.latitude, position.coords.longitude);
},
function(error) {
// error.code:
// 1 = PERMISSION_DENIED: wearer denied permission request
// 2 = POSITION_UNAVAILABLE: location could not be retrieved (e.g., phone offline)
// 3 = TIMEOUT: request exceeded the timeout
// error.message - human-readable description
console.error('Watch error:', error.code, error.message);
}
);
Call clearWatch when updates are no longer needed.
// Stop watching when done
navigator.geolocation.clearWatch(watchId);
Position options
Both getCurrentPosition and watchPosition accept an optional third argument to configure behavior:
| Option | Type | Default | Description |
|---|---|---|---|
| enableHighAccuracy | boolean | false | Request the most accurate position available. May take longer and use more power. |
| timeout | number | Infinity | Maximum time (in milliseconds) to wait for a position. Use 10000-15000 ms as a practical default. |
| maximumAge | number | 0 | Accept a cached position if it is no older than this value (in milliseconds). |
navigator.geolocation.getCurrentPosition(successCb, errorCb, {
enableHighAccuracy: true, // Request most accurate position (boolean, default false)
timeout: 15000, // Max wait time in ms (number, default Infinity)
maximumAge: 5000 // Accept cached position if newer than this in ms (number, default 0)
});
Best practices
| Consider | Avoid |
|---|---|
| Using a timeout of 10-15 seconds (10000-15000 ms), since the first request may take several seconds | Omitting a timeout or setting it too low |
| Handling PERMISSION_DENIED gracefully, since the wearer must grant permission | Assuming the wearer does not need to grant permission |
| Ensuring permission requests are triggered by a user gesture | Assuming permission prompts appear without a user gesture |
Notes
- Remember, location comes from the paired companion phone’s GPS/network services.
- Expect an accuracy of 5-50 meters, depending on signal quality.
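With accuracy in the 5-50 meter range, consecutive fixes can jitter even when the wearer is standing still. One possible way to filter them is to ignore fixes that are too coarse or that moved less than the reported accuracy. The sketch below uses the haversine formula. distanceMeters, isMeaningfulMove, and the threshold argument are hypothetical names, not part of the Geolocation API.

```javascript
var EARTH_RADIUS_M = 6371000; // mean Earth radius in meters

// Great-circle distance between two lat/lon points (haversine formula).
function distanceMeters(lat1, lon1, lat2, lon2) {
  var toRad = Math.PI / 180;
  var dLat = (lat2 - lat1) * toRad;
  var dLon = (lon2 - lon1) * toRad;
  var a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
          Math.cos(lat1 * toRad) * Math.cos(lat2 * toRad) *
          Math.sin(dLon / 2) * Math.sin(dLon / 2);
  return 2 * EARTH_RADIUS_M * Math.asin(Math.sqrt(a));
}

// Decide whether a new fix is worth acting on.
// prev and next are objects shaped like position.coords; prev may be null.
function isMeaningfulMove(prev, next, maxAccuracyM) {
  if (next.accuracy > maxAccuracyM) return false; // fix is too coarse
  if (!prev) return true; // first usable fix
  var moved = distanceMeters(prev.latitude, prev.longitude,
                             next.latitude, next.longitude);
  return moved > Math.max(prev.accuracy, next.accuracy);
}
```

In a watchPosition callback, the last accepted coords object would be kept in a variable and passed as prev.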
Location error handling
Always provide an error callback to handle failure gracefully. Location may be unavailable if the following errors occur:
| Description | Error code | Type |
|---|---|---|
| Wearer denies the permission prompt | 1 | PERMISSION_DENIED |
| Location data could not be retrieved (for example, companion device is offline) | 2 | POSITION_UNAVAILABLE |
| Request exceeded the specified timeout | 3 | TIMEOUT |
Storage
Web Apps on MRBD glasses have access to the standard Web Storage APIs, localStorage and sessionStorage, for persisting lightweight data:
- localStorage persists data across sessions, even after the app is closed and reopened.
- sessionStorage persists data only for the current session, so values are cleared when the session ends.
These work exactly as they do in any modern browser, and both APIs store data as key-value string pairs.
Saving and reading data
Use the standard setItem, getItem, and removeItem methods.
// Save a value
localStorage.setItem('userPreference', 'dark');
// Read a value
var preference = localStorage.getItem('userPreference');
// Returns 'dark', or null if the key does not exist
// Remove a value
localStorage.removeItem('userPreference');
// Clear all stored data
localStorage.clear();
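Because both APIs store strings only, structured data needs to be serialized. A minimal sketch of a JSON wrapper follows. saveJSON and loadJSON are hypothetical helper names, and the storage argument is any object with getItem and setItem, such as localStorage.

```javascript
// Serialize a value and store it. Returns false if the write fails.
function saveJSON(storage, key, value) {
  try {
    storage.setItem(key, JSON.stringify(value));
    return true;
  } catch (err) {
    // setItem can throw when the quota is exceeded, and
    // JSON.stringify throws on circular structures
    return false;
  }
}

// Read and parse a value, falling back when it is missing or malformed.
function loadJSON(storage, key, fallback) {
  var raw = storage.getItem(key);
  if (raw === null) return fallback;
  try {
    return JSON.parse(raw);
  } catch (err) {
    return fallback; // stored value was not valid JSON
  }
}

// Usage in a page (browser only):
// saveJSON(localStorage, 'prefs', { theme: 'dark', fontSize: 20 });
// var prefs = loadJSON(localStorage, 'prefs', { theme: 'dark' });
```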
Session storage
sessionStorage has an identical API, but scopes data to the current session. Use it for temporary states that should not persist after the user exits your app.
// Track whether the user has seen the onboarding screen this session
if (sessionStorage.getItem('onboardingSeen')) {
showMainScreen();
} else {
showOnboarding();
sessionStorage.setItem('onboardingSeen', 'true');
}
Storage limits
The glasses runtime provides storage within the following limits:
- localStorage: 5 MB
- sessionStorage: 5 MB
As a general practice, keep stored data lightweight - avoid storing large blobs, images, or multi-megabyte datasets. Web storage is best suited for user preferences, small caches, and application state.
App Icons
For app icons, use Unicode symbols or high-resolution PNG favicons (larger than 52x52 px). The system checks the Web App manifest and page source (not just favicon.ico) for this content. If no suitable icon is found, a default fallback icon is shown. SVGs are not supported.
Icons can be implemented as HTML <link> tags in your page’s <head> section.
<link rel="icon" href="/icon-96.png" sizes="96x96">
<link rel="apple-touch-icon" href="/apple-touch-icon.png" sizes="180x180">
Icons can also be implemented by referencing the Web App JSON manifest. Each entry in the icons array must include a src attribute and ideal sizes.
HTML
<link rel="manifest" href="/manifest.webmanifest">
JSON
{
"icons": [
{ "src": "/icons/icon-96.png", "sizes": "96x96" },
{ "src": "/icons/icon-192.png", "sizes": "192x192" }
]
}
Using LLMs to Analyze Spark SQL Plans: A Practical Approach to Debugging Long-Running Jobs
Expedia uses LLMs to interpret complex Spark execution plans, turning opaque DAGs into human-readable debugging insights.
Summary
Decoder
- DAG (Directed Acyclic Graph): A sequence of tasks representing a data pipeline execution, where nodes are operations and edges are data dependencies.
Original Article
Instead of manually parsing complex physical plans and DAGs to debug long-running Spark SQL jobs, Expedia feeds the plans (along with relevant context) to LLMs, which analyze and explain them to quickly identify bottlenecks, inefficient joins, skewed data, or suboptimal operators. This significantly speeds up troubleshooting for production Spark workloads.
How We Built DEmate: Taming LLMs for Data Engineering at Meta
Meta’s DeMate assistant uses a RAG-heavy architecture to automate SQL generation and data pipeline maintenance for thousands of internal engineers.
Summary
Decoder
- RAG (Retrieval-Augmented Generation): An architecture that connects LLMs to external, private data sources to improve response accuracy without retraining models.
Original Article
Meta's DeMate is an internal LLM-powered assistant for data engineers that helps with writing SQL, generating pipelines, reviewing code, and understanding complex data flows at massive scale. The architecture combines RAG over internal data catalogs, schema documentation, and code repositories with carefully engineered prompts, multi-step reasoning chains, and human-in-the-loop feedback loops for evaluation and continuous improvement.
Query Faster, Query Smarter: Our Move to DuckDB and What We Learned
Arcesium cut query costs and runtimes by 50% by ditching cloud-native engines like Athena and Trino for DuckDB on specialized workloads.
Summary
Decoder
- Trino: A distributed SQL query engine designed to query large data sets across multiple heterogeneous data sources.
Original Article
Arcesium migrated thousands of SQL queries from Athena to Trino to DuckDB over 18 months, cutting query costs by ~50% and reducing query runtime by ~50% for small-to-medium workloads. Athena hit account/service limits, while Trino solved scalability but increased resource cost. DuckDB delivered the needed speed with ~40% lower memory footprint. The migration required handling Glue-less schema evolution, Parquet compaction, JSON fallbacks for STRUCT mismatches, and thread parallelism tuning.
Too many tables are bad for you
Managing too many tables in PostgreSQL can degrade performance by bloating system catalogs and increasing query planning overhead.
Summary
Decoder
- Catalog (System Catalog): The collection of internal tables that store metadata about all other tables, indexes, and objects in the database; it is consulted whenever a query is planned.
Original Article
Having too many tables in PostgreSQL is a bad idea and can seriously hurt performance. The hidden costs range from bloated catalogs and slower query planning to increased I/O. Practical guidance includes consolidating small, related tables, avoiding excessive schema-per-tenant patterns, monitoring catalog size and planning time, and using inheritance or declarative partitioning when appropriate.
SedonaDB 0.4: GPU-Accelerated Spatial Joins
SedonaDB 0.4 introduces RayBooster, which repurposes NVIDIA GPU ray tracing cores to accelerate spatial joins by nearly 6x.
Summary
Deep Dive
- RayBooster: A new engine using dedicated ray tracing cores for geometry intersection detection.
- Storage Optimization: Uses a Structure of Arrays layout for O(1) geometry access.
- Indexing: Implements Z-stacking and global Bounding Volume Hierarchies (BVH) to replace granular indexing.
- Compatibility: Works on standard consumer RTX 3090 GPUs to achieve significant throughput gains.
Decoder
- Spatial Join: A database operation that combines two tables based on the relationship between spatial features, such as points within a polygon.
- BVH (Bounding Volume Hierarchy): A tree-like data structure used to organize geometric objects to speed up spatial queries.
- DE-9IM: A standard model for describing the spatial relationship between two geometries.
Original Article
SedonaDB 0.4: GPU-Accelerated Spatial Joins
In SedonaDB 0.4, we taught this Rust database to run spatial joins on your $1,500 gaming GPU's ray tracing cores, and it beats an H100.
The Apache Sedona community released SedonaDB 0.4.0, resolving 187 issues and adding 26 new functions from 15 contributors. SedonaDB is the first open-source, single-node analytical database that treats spatial data as a first-class citizen — the counterpart to the distributed Sedona engines for small-to-medium datasets running on a single machine.
This is the first in a series of posts diving into what's new in SedonaDB 0.4. We'll be covering more of the release — the Python DataFrame API, the R dplyr interface, Geography support, GeoParquet write support, N-dimensional rasters and Zarr, and more — in the posts to come; for the full rundown, see the 0.4.0 release blog post. We're kicking things off with the feature we're most excited about: GPU-accelerated spatial joins.
GPU-Accelerated Spatial Joins
Gaming GPUs contain dedicated ray tracing cores designed for video game lighting — and they sit idle during database queries. Spatial joins are about finding intersecting geometries, which maps naturally onto ray tracing primitives. We built RayBooster, an extension that brings ray tracing core acceleration into SedonaDB.
The accompanying research paper, "RayBooster: A Ray Tracing Engine to Accelerate SedonaDB," was accepted to VLDB 2026 (Industry Track), developed in collaboration with The Ohio State University.
How it works: four components
1. GPU-friendly storage layout. Instead of the stream-oriented WKB format, RayBooster uses a Structure of Arrays organization that separates offsets, vertices, and types, enabling O(1) random access to any geometry.
2. A single monolithic index. Rather than building millions of tiny index trees, it uses Z-stacking — encoding each geometry's ID into the unused Z-axis of the ray tracing scene and building one global BVH for the entire batch.
3. A universal predicate engine. RelateEngine computes the DE-9IM matrix (a topological descriptor) on RT cores, giving one code path that resolves any geometry/predicate combination instead of hardcoding 500+ kernel variants.
4. Memory-aware execution. A scheduling and spilling layer keeps joins within GPU memory budgets on irregular real-world workloads, preventing out-of-memory failures.
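To illustrate the first component, the sketch below packs geometries into a Structure of Arrays with an offsets array, so the vertex range of any geometry is found with two reads. It is a simplified illustration, not RayBooster's actual layout: it omits the types array, and packGeometries and vertexRange are hypothetical names.

```javascript
// Pack geometries (each an array of [x, y] pairs) into flat typed arrays.
function packGeometries(geometries) {
  var total = geometries.reduce(function (n, g) { return n + g.length; }, 0);
  var offsets = new Uint32Array(geometries.length + 1); // first vertex of each geometry
  var xs = new Float64Array(total);
  var ys = new Float64Array(total);
  var cursor = 0;
  geometries.forEach(function (g, i) {
    offsets[i] = cursor;
    g.forEach(function (pt) {
      xs[cursor] = pt[0];
      ys[cursor] = pt[1];
      cursor++;
    });
  });
  offsets[geometries.length] = cursor; // sentinel marks the end of the last geometry
  return { offsets: offsets, xs: xs, ys: ys };
}

// O(1) lookup: two offset reads give the vertex range, with no scanning.
function vertexRange(packed, id) {
  return { start: packed.offsets[id], end: packed.offsets[id + 1] };
}
```

A stream-oriented format, by contrast, has to be walked from the start to locate the nth geometry, since each record's length is only known once it has been read.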
Performance
Testing on SpatialBench:
- Up to 5.93x speedup on heavy joins, with a 59.02% cost reduction on AWS
- Q11 cross-zone trip join: 7.51s (CPU) → 1.61s on a consumer RTX 3090 — a 4.66x speedup
- 10x scale: 53.34s reduced to under 7s
- Heavy joins at scale: 4.93x to 9.68x speedups across GPU models
- Consumer RTX 3090 vs. H100: on some queries the gaming card actually beat the H100, which lacks RT cores (1.26s vs 1.77s on Q10)
Using it
On a machine with an NVIDIA GPU, pull the official Docker image and enable the feature with a single command:
ctx.sql("SET gpu.enable = true")
Citation
Liang Geng, Rubao Lee, Dewey Dunnington, Feng Zhang, Jia Yu, and Xiaodong Zhang. "RayBooster: A Ray Tracing Engine to Accelerate SedonaDB." PVLDB, 2026 (Industry Track).
TiDB (GitHub Repo)
TiDB is a MySQL-compatible, distributed SQL database that separates storage and compute to enable horizontal scaling and hybrid transactional/analytical processing.
Summary
Deep Dive
- Horizontal Scalability: Compute and storage layers are decoupled to allow independent scaling.
- HTAP Capability: Real-time analytical queries run on TiFlash columnar replicas without impacting transactional performance on TiKV.
- Consistency: Strong ACID guarantees maintained across distributed nodes via the Raft consensus protocol.
- Deployment: Native support for Kubernetes via TiDB Operator or fully managed service in TiDB Cloud.
Decoder
- HTAP: Hybrid Transactional/Analytical Processing; systems that can handle both operational (OLTP) and analytical (OLAP) queries simultaneously on the same data.
- ACID: Atomicity, Consistency, Isolation, Durability; properties that guarantee reliable database transactions.
Original Article
TiDB
TiDB (/’taɪdiːbi:/, "Ti" stands for Titanium) is an open-source, cloud-native, distributed SQL database designed for high availability, horizontal and vertical scalability, strong consistency, and high performance.
Key Features
- Distributed Transactions: TiDB uses a two-phase commit protocol to ensure ACID compliance, providing strong consistency. Transactions span multiple nodes, and TiDB's distributed nature ensures data correctness even in the presence of network partitions or node failures.
- Horizontal and Vertical Scalability: TiDB can be scaled horizontally by adding more nodes or vertically by increasing resources of existing nodes, all without downtime. TiDB's architecture separates computing from storage, enabling you to adjust both independently as needed for flexibility and growth.
- High Availability: Built-in Raft consensus protocol ensures reliability and automated failover. Data is stored in multiple replicas, and transactions are committed only after writing to the majority of replicas, guaranteeing strong consistency and availability, even if some replicas fail. Geographic placement of replicas can be configured for different disaster tolerance levels.
- Hybrid Transactional/Analytical Processing (HTAP): TiDB provides two storage engines: TiKV, a row-based storage engine, and TiFlash, a columnar storage engine. TiFlash uses the Multi-Raft Learner protocol to replicate data from TiKV in real time, ensuring consistent data between the TiKV row-based storage engine and the TiFlash columnar storage engine. The TiDB Server coordinates query execution across both TiKV and TiFlash to optimize performance.
- Cloud-Native: TiDB can be deployed in public clouds, on-premises, or natively in Kubernetes. TiDB Operator helps manage TiDB on Kubernetes, automating cluster operations, while TiDB Cloud provides a fully-managed service for easy and economical deployment, allowing users to set up clusters with just a few clicks.
- MySQL Compatibility: TiDB is compatible with MySQL 8.0, allowing you to use familiar protocols, frameworks and tools. You can migrate applications to TiDB without changing any code, or with minimal modifications. Additionally, TiDB provides a suite of data migration tools to help easily migrate application data into TiDB.
- Open Source Commitment: Open source is at the core of TiDB's identity. All source code is available on GitHub under the Apache 2.0 license, including enterprise-grade features. TiDB is built with the belief that open source enables transparency, innovation, and collaboration. We actively encourage contributions from the community to help build a vibrant and inclusive ecosystem, reaffirming our commitment to open development and accessibility for everyone.
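The majority rule described under High Availability comes down to simple arithmetic. The sketch below illustrates the quorum math only and is not TiDB or TiKV code. All function names are hypothetical.

```javascript
// Smallest number of replicas that forms a majority.
function majority(replicaCount) {
  return Math.floor(replicaCount / 2) + 1;
}

// A write counts as committed once a majority of replicas acknowledge it.
function isCommitted(acks, replicaCount) {
  return acks >= majority(replicaCount);
}

// How many replicas can fail while a majority remains reachable.
function tolerableFailures(replicaCount) {
  return replicaCount - majority(replicaCount);
}
```

With three replicas the majority is two, so one replica can fail. With five replicas the majority is three, so two can fail.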
Quick Start
- Start a TiDB cluster.
  - On local playground. To start a local test cluster, refer to the TiDB quick start guide.
  - On Kubernetes. TiDB can be easily deployed in a self-managed Kubernetes environment or Kubernetes services on public clouds using TiDB Operator. For more details, refer to the TiDB on Kubernetes quick start guide.
  - Using TiDB Cloud (recommended). TiDB Cloud offers a fully managed version of TiDB with a free plan, no credit card required, so you can get a free cluster in seconds and start easily: Sign up for TiDB Cloud.
- Learn about TiDB SQL: To explore the SQL capabilities of TiDB, refer to the TiDB SQL documentation.
- Use a MySQL driver or an ORM to Build an App with TiDB.
- Explore key features, such as data migration, changefeed, vector search, HTAP, disaster recovery, etc.
Need Help?
- You can connect with TiDB users, ask questions, find answers, and help others on our community platforms: Discord, Slack (English, Japanese), Stack Overflow, TiDB Chinese Forum, X @PingCAP
- For filing bugs, suggesting improvements, or requesting new features, use GitHub Issues or join discussions on GitHub Discussions.
- To troubleshoot TiDB, refer to Troubleshooting documentation.
Architecture
Learn more details about TiDB architecture in our Docs.
Contributing
TiDB is built on a commitment to open source, and we welcome contributions from everyone. Whether you are interested in improving documentation, fixing bugs, or developing new features, we invite you to shape the future of TiDB.
- See our Contributor Guide and TiDB Development Guide to get started.
- If you're looking for issues to work on, try looking at the good first issues or help wanted issues.
- The contribution map lists everything you can contribute.
- The community repository contains everything else you need.
- Don't forget to claim your contribution swag by filling in and submitting this form.
License
TiDB is under the Apache 2.0 license. See the LICENSE file for details.
See Also
- TiDB Online Playground
- TiDB Case Studies: TiDB Customers, TiDB 事例記事, TiDB 中文用户案例
- TiDB User Documentation
- TiDB Design Docs
- TiDB Release Notes
- TiDB Blog
- TiDB Roadmap
Acknowledgments
How To Corrupt An SQLite Database File
While SQLite is highly resilient, it can still suffer corruption through external factors like broken file locks, hardware failures, or application-level memory bugs.
Summary
Deep Dive
- Locking Pitfalls: POSIX advisory locks can be accidentally dropped by unrelated threads closing file descriptors.
- Sync Failures: Consumer-grade USB drives often lie about successful syncs, leading to corruption during sudden power loss.
- File Identity: Using multiple paths (hard/soft links) to the same file causes different processes to treat journals and WAL logs as isolated, leading to state inconsistency.
- Memory Safety: Memory-mapped I/O (mmap) allows stray pointers in the host application to directly corrupt the database file structure.
- Bug History: Documents past obscure bugs as proof of the high bar SQLite maintains for its code quality and testing.
Decoder
- WAL Mode (Write-Ahead Logging): A SQLite journal mode that allows multiple readers and one writer concurrently, improving performance and reliability.
- PRAGMA: SQLite-specific commands used to modify the library's internal behavior or query database metadata.
- Hot Journal: A journal file left over from a previous process that crashed, used by SQLite for recovery.
Original Article
Full article content is not available for inline reading.
No babysitting, not today
dbt Labs' new CLI, dbt Wizard, demonstrates agentic data engineering by autonomously migrating projects and configuring semantic layers with built-in validation.
Summary
Deep Dive
- Wizard uses a native metadata engine to understand project lineage and structure before initiating changes.
- The validation loop automatically retries tasks if models fail to parse or compile, iterating until success.
- It correctly identified stale local credentials versus production configuration errors, preventing troubleshooting cycles.
- The agent automatically refactored metric definitions to be co-located within model YAML files.
- It effectively managed the gap between PR merging and actual semantic layer refresh by triggering platform jobs.
Decoder
- dbt Fusion: A high-performance execution engine for dbt that supports advanced metric discovery and semantic modeling.
- Semantic Layer: A centralized data model that defines business metrics and dimensions so they remain consistent across all reporting tools and downstream dashboards.
- MCP (Model Context Protocol): A standard for connecting AI agents to external data sources and developer tools to provide context-aware access to files and databases.
Original Article
We recently released dbt Wizard, a CLI for doing agentic data work on dbt. This is a harness specifically built for data work and as such I was excited to spend an afternoon running dbt Wizard across a personal project with about 100 models or so that I've been maintaining for a while. You can use this project to see firsthand some of the differentiators for Wizard, particularly how the validation loop works.
Data projects are never done and I had a few tasks I've been putting off:
- Migrating from MotherDuck to Iceberg + BigQuery
- Upgrading dbt Core to dbt Fusion
- Migrating from dbt Core to dbt platform
- Building out a semantic layer
These are a good mix of pretty standard data work (have you done your annual migration yet?) and some dbt specific tasks. A good informal eval for the fundamental question for any dev tool: Does it work?
I was pleasantly surprised by the experience: I didn't have to babysit Wizard or build any additional workflows on top of it. Its native understanding of what a dbt project is, and its ability to set up validation subagents that match the task, meant I could just pop in when needed, and I accomplished all of my objectives.
The repo is public, so you can read the PRs that Wizard authored:
- Migrate from MotherDuck to BigQuery + Apache Iceberg
- Prepare dbt Platform and Fusion orchestration
- Fix Semantic Layer for Fusion metric discovery
Project background
The project is an economic data platform. The Federal Reserve, Bureau of Labor Statistics, Historical stock, ETF commodity prices, and Treasury data come in through Dagster assets, get transformed by dbt, and feed a set of analysis agents. DuckDB and MotherDuck worked fantastically, but I wanted to test out the developer ergonomics and cost of Iceberg, and the dbt Fusion engine.
The primary interface that I use to consume this data is via a Claude plugin where I ask questions about market state, where we are in the economic cycle, running backtests, that type of thing. It helps me test new investment and trade ideas and going back and forth with an AI is pretty ergonomic for this type of thing. I had some issues with data quality in the past so improving accuracy and repeatability is important. I also had Wizard build out the initial Semantic Layer and metrics.
The Validation loop in practice
Wizard has a built-in validation loop: it starts with the intended change, implements the most reasonable version, and checks whether it works in the project. If validation fails or exposes a mismatch, Wizard adjusts the implementation.
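The general shape of such a loop can be sketched in a few lines. This is an illustration of the try, validate, adjust pattern only, not Wizard's internals. runValidationLoop, the candidates list, and the validate callback are all hypothetical.

```javascript
// candidates: ordered list of implementations to try (hypothetical shape).
// validate: returns { ok: boolean, reason: string } for a candidate.
function runValidationLoop(candidates, validate) {
  var failures = [];
  for (var i = 0; i < candidates.length; i++) {
    var result = validate(candidates[i]);
    if (result.ok) {
      return { accepted: candidates[i], attempts: i + 1, failures: failures };
    }
    failures.push(result.reason); // record why this attempt failed
  }
  return { accepted: null, attempts: candidates.length, failures: failures };
}
```

Only the accepted candidate needs to be surfaced for review. The failed attempts stay in the failures list for anyone who wants to inspect them.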
A common metric that I look at when thinking about the financial market is volatility or price variance. Being able to normalize across nominal prices is necessary so you can do a true apples to apples comparison. The calculation works like this:
V = σ_1yr / P_current
where:
σ_1yr = trailing one-year standard deviation of the price
P_current = the latest close
Take the one-year standard deviation of a price, divide by the current price, and you get a dimensionless volatility proxy you can compare across assets at wildly different price levels.
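As a worked example of the formula, the sketch below computes the ratio from a list of trailing closes. It assumes the sample standard deviation (n - 1 denominator) and that the last element is the latest close. Both are illustrative choices, and the function names are hypothetical.

```javascript
// Sample standard deviation (n - 1 in the denominator).
function sampleStdDev(values) {
  if (values.length < 2) return null;
  var mean = values.reduce(function (s, v) { return s + v; }, 0) / values.length;
  var sumSq = values.reduce(function (s, v) { return s + (v - mean) * (v - mean); }, 0);
  return Math.sqrt(sumSq / (values.length - 1));
}

// V = σ_1yr / P_current. Returns null when the ratio is undefined
// (fewer than two prices, or a current price of zero).
function volatilityRatio(trailingCloses) {
  var sigma = sampleStdDev(trailingCloses);
  var current = trailingCloses[trailingCloses.length - 1];
  if (sigma === null || !current) return null;
  return sigma / current;
}
```

For closes of 10, 12, and 14, the standard deviation is 2 and the latest close is 14, so V = 2 / 14, or about 0.143.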
- name: annualized_volatility_ratio
  type: ratio
  type_params:
    numerator:
      name: price_volatility_proxy
    denominator:
      name: max_close_price
This was Wizard’s first attempt, pretty simple. It failed during the validation loop because it had two modeling problems:
- max_close_price was not quite the expression I wanted. The current/last price was what I needed, not the max close over a current window.
- It depended on cross-metric references in metrics.yml, which became fragile once Wizard migrated metrics inline to support Fusion.
For validation Wizard had to work within the following constraints:
- Fusion needed to discover metrics colocated with model YAML.
- dbt Core/Dagster needed to parse the project without choking on newer semantic syntax.
- CI needed a stable project parse/build path.
- dbt Platform needed deployed artifacts that the hosted Semantic Layer could actually expose.
Wizard tried again with an older MetricFlow-style shape. That was suboptimal: it worked for one parser but not for the other. It simply flipped which side failed, which is the kind of result that eats an afternoon when you do it by hand.
The version that shipped is the least clever of the three but it worked.
- name: annualized_volatility_ratio
  label: Annualized Volatility Ratio
  description: Average ratio of 1-year price StdDev to close price, a dimensionless proxy for annualized volatility. Higher means more volatile relative to price level.
  type: simple
  agg: average
  expr: std_diff_1yr / nullif(current_price, 0)
The ratio logic moved into the expr, the nullif kills the divide-by-zero, and it parses under Fusion, parses under dbt Core, and passes the project CI. Having Wizard do the validation loop to understand the finer points in the differences between dbt Core and Fusion saved me a ton of time.
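To make the effect of the nullif guard concrete, here is a small Python sketch of what an average over that expr does in SQL terms. It assumes standard SQL semantics (NULLIF returns NULL when its two arguments are equal, dividing by NULL yields NULL, and AVG skips NULLs). The helper names and the rows are made up for illustration and are not taken from the project.

```python
def nullif(value, sentinel):
    """Mimic SQL NULLIF: return None (NULL) when value equals sentinel."""
    return None if value == sentinel else value


def average_ratio(rows):
    """Average std_diff_1yr / nullif(current_price, 0), skipping NULL results.

    Assumes SQL-style AVG, which ignores NULL inputs and returns NULL
    (None here) when every input is NULL.
    """
    ratios = []
    for row in rows:
        denominator = nullif(row["current_price"], 0)
        if denominator is None or row["std_diff_1yr"] is None:
            continue  # NULL propagates through division, and AVG drops it
        ratios.append(row["std_diff_1yr"] / denominator)
    return sum(ratios) / len(ratios) if ratios else None


# Hypothetical rows; the zero-priced row is skipped instead of raising.
rows = [
    {"std_diff_1yr": 2.0, "current_price": 20.0},
    {"std_diff_1yr": 30.0, "current_price": 100.0},
    {"std_diff_1yr": 5.0, "current_price": 0.0},
]

print(average_ratio(rows))
```

The zero-priced row contributes nothing to the average, so a single bad price cannot fail the query or distort the metric with an infinite ratio.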
For this process, I didn't have to look up any documentation and only had to go back and forth with Wizard a few times, mostly for my own understanding. This was a success for me: I now have a simple metric that matches the expected formula and has both parser compatibility and a hosted Semantic Layer deployment path. I also only had to see one diff, the final one. The dead ends and iterations happened off screen, so I only needed to be tapped in when necessary.
Why it could run unattended
Three things made that work. Some of them can be replicated with other harnesses using skills, MCPs, hooks, and other automation, but these came out of the box.
It knew the graph before it started. Wizard has a native metadata engine, a structured index of lineage, models, tests, and metric definitions, built before the first prompt. When it wrote std_diff_1yr and current_price into that expr, it didn't have to grep through the SQL files to find the right names; it already knew which fields were the right ones.
Validation was the default. I never said "check that it compiles." The loop runs on write operations whether you ask or not. You can turn this feature off if it isn't interesting to you.
It stayed in its sandbox. The task was building out an initial set of metrics in the markets domain. It didn't wander off to refactor unrelated models or rewrite my Dagster assets. It scoped the change, iterated inside that scope, and surfaced a diff for review. Nothing persisted without my approval. This tight scope is what makes "look away and come back" a safe move instead of having a tech debt super spreader event.
The proof I wanted
A green compile is not proof a Semantic Layer change works. It tells you the YAML is shaped right, not that a metric is discoverable and queryable against your real warehouse. So after the PR merged and the dbt platform jobs ran, I checked the hosted layer directly: dialect BIGQUERY, 43 metrics discovered, query status SUCCESSFUL. Then I ran an actual metric query through the dbt MCP and got real numbers back from BigQuery.
But the biggest proof for me is that I was able to give Wizard a defined set of analytics tasks before a string of meetings and it just went through and completed them without any major interventions on my end. The best part was at the end of it everything worked and during those meetings I was able to be present and contribute without being tempted to make sure my agent wasn't going off the rails. Maybe the real evals were the friends we made along the way.
Other things it did well
A few shorter wins from the same stretch of work:
- It refused to chase a misleading error. A local command threw a Snowflake auth error in a BigQuery project. Wizard separated the hosted environment (correctly BigQuery) from a stale Snowflake credential path on my laptop. That reframed the fix from "production is broken" to "your local creds are stale," which is the difference between an hour and a day.
- It knew "merged" and "available" are different states. A merged PR doesn't mean the Semantic Layer refreshed. Wizard traced the gap to dbt platform job execution, ran the deploy jobs, then confirmed the metrics were live.
- It colocated the definitions. The metrics started scattered across standalone files. Wizard moved them into model-owned schema.yml, so each metric sits next to the columns it measures. The Semantic Layer stopped being a metadata island.
- It made a pragmatic cut. Some saved queries were brittle against the migrated metrics. Rather than block the change, it pulled them out, shipped stable metric discovery first, and left them as follow-up. I'd have fought them for an hour. Cutting was right.
Try it
The repo is public if you are curious about the actual metrics and the merged PRs. Install the Wizard CLI and point it at a project that runs Fusion or dbt Core:
curl -fsSL https://public.cdn.getdbt.com/dbt-wizard/install/install-wizard.sh | sh
The Wizard docs cover setup, and how it works goes deeper on the validation loop. If you try it on a migration of your own, come tell me in #dbt-wizard in the dbt Community Slack.
Designed for a Dead Language
Language learning apps often replicate the outdated Prussian Grammar-Translation method, prioritizing measurable drills over the immersive acquisition needed for true fluency.
Summary
Deep Dive
- Prussian Method: A 1788 system designed for dead languages (Latin), focusing on written rules rather than communication.
- Acquisition vs. Learning: Acquisition is an unconscious process (immersion); Learning is a conscious study of rules.
- Proxy Metrics: Companies often optimize for measurable engagement (streaks) at the expense of outcome (fluency).
- The Shift: AI-driven conversation tools (Praktika, Langua) are closing the gap by providing the immersive environment that was previously only available through travel.
Decoder
- Grammar-Translation Method: A method of teaching foreign languages that focuses on the study of grammar rules and vocabulary rather than oral communication.
- Monitor Model: A theory of language acquisition by Stephen Krashen stating that fluency is acquired through subconscious immersion, while conscious learning only serves to 'monitor' or correct errors.
Original Article
Every language app in your pocket inherited a teaching method built for Latin. Understanding why that happened is a more useful design lesson than anything the apps themselves will teach you.
In 1788, Prussia introduced the Abitur, a standardized national examination required for entry into universities and the civil service. To pass it, students needed to demonstrate measurable, gradable knowledge. The system needed to teach language to large classrooms, produce consistent outcomes, and do it with one teacher and thirty students. The educators responsible for designing this system reached for the only teaching template they had, one that had been used in European schools for two centuries: the method developed to teach Latin.
Latin, by 1788, was a dead language. Nobody needed to speak it. The scholars who studied it were reading Cicero and Virgil, not conducting conversations. The method built around it, memorizing grammar rules, constructing translations, analyzing written texts, reflected that reality exactly. Oral skills were irrelevant. Comprehension of written form was everything. The method was not designed to produce speakers. It was designed to produce readers of texts in a language nobody spoke.
When Prussia applied this template to French and German, living languages spoken by living people, the premise did not change. Johann Valentin Meidinger’s textbook Praktische Französische Grammatik, published in 1804, ran to 37 editions across Europe by 1857. Karl Plotz formalized the approach into what became the dominant model for teaching modern languages across Europe and eventually the United States, where it became known simply as the Prussian Method. Each institution that adopted it trained teachers in it, who trained students who became teachers. The constraint that created the method, how do you grade language at scale with limited resources, became invisible inside the method itself. What remained was the assumption: language is a body of rules to be learned consciously and measured. It was a design decision dressed up, over time, as a pedagogical truth.
The observation that should have ended it
There are people in the world who cannot read or write a language and speak it fluently. There are children who hold full conversations years before they can read a single word. There are immigrants who arrive in a country knowing nothing of its language and come out, years later, speaking it naturally, not because they studied it, but because they lived inside it. Literacy and fluency are separate things produced by entirely separate mechanisms. The Grammar-Translation method, as it became known, assumed they were the same thing. That assumption was inherited from a method designed for a language nobody needed to speak, and it was wrong the moment it was applied to a language people actually used.
The evidence against it accumulated slowly. In the mid to late nineteenth century, reformers including François Gouin in France and Maximilian Berlitz in the United States argued independently that language should be taught the way it is actually acquired, through immersive exposure to real communication in the target language, not through analysis of its rules. Berlitz built an entire school network around this principle. The reformers were correct. They were also largely ignored by mainstream education systems, because the Grammar-Translation method had one decisive advantage that direct immersion did not: it could be graded.
In 1982, the linguist Stephen Krashen gave the argument its most formal articulation in what he called the Monitor Model of second language acquisition. His distinction was precise: language acquisition, the unconscious process through which children absorb their native language and through which adults succeed in immersive environments, is categorically different from language learning, the conscious study of grammar rules and vocabulary that classrooms deliver. Acquisition produces fluency. Learning, at best, produces the ability to pass a test. The evidence supporting this distinction, and the observation that immersive exposure to real native-speaker communication is the mechanism that produces genuine fluency, has only grown since.
I went to Brazil without a word of Portuguese and came out speaking it. I studied French in a classroom for years and cannot hold a conversation in French today. This is not an unusual experience. It is the expected outcome, and it has been the expected outcome for as long as we have had formal language education.
The same decision, made again in a different medium
App designers faced the same question Prussian educators did: how do you deliver language learning at scale, measure progress, and retain users over time? The answer they arrived at was structurally identical to the one arrived at in 1788. Duolingo gamified the grammar drill into a streak. Anki formalized the translation exercise into a spaced-repetition flashcard. Babbel organized grammar lessons into structured modules. The interfaces were new. The underlying assumption, that language is a thing you study rather than an environment you inhabit, was not.
This was not a failure of design skill. The products that emerged from these decisions are, in many respects, genuinely well-crafted. Duolingo’s retention mechanics are sophisticated. Anki’s spaced repetition is grounded in real cognitive science. They are excellent at what they actually do. The problem is what they actually do: produce measurable engagement with a proxy for language rather than the conditions that produce language itself. A streak is measurable. A vocabulary score is measurable. The moment a user walks out of an app and holds a real conversation in another language, that happens in the world, outside the product, and cannot be instrumented.
When the outcome a user needs is difficult to measure directly, the design process tends to reach for something that can be measured. The proxy becomes the goal. The interface optimizes for it. The gap between what the product delivers and what the user actually needed grows. This is not a pattern unique to language learning. It is a pattern that repeats across product categories whenever a design constraint—the need to measure, the need to scale, the need to produce a grade—gets built into a system so deeply that it stops being visible as a constraint and starts being mistaken for a truth about the problem itself.
What happens when the constraint changes
The constraint that made the Grammar-Translation method necessary in 1788 was real and rational. One teacher. Thirty students. A standardized exam. You cannot grade a conversation at scale. You can grade a translation exercise. The method was not chosen because it produced fluency. It was chosen because it produced a score.
That constraint no longer exists in the same form. Technology has made it possible to deliver immersive, real-time conversation practice to anyone with a smartphone, at a cost that continues to fall. The design problem is no longer how to make language learning gradable at scale. It is how to make the conditions of genuine language acquisition accessible to people who cannot move to another country or afford a native-speaker tutor.
The products that are now closest to solving the actual problem are not the ones that invented a new pedagogy. They are the ones that removed the access barrier to an old one. Praktika builds AI conversation partners with distinct personalities, regional dialects, and cultural context, replicating the specificity of a real native speaker rather than a generic language-learning voice. Langua clones native speaker voices so that the interaction feels like a real conversation rather than a lesson. Rosetta Stone’s foundational methodology, image association in the target language with no translation, was built on the same insight Berlitz arrived at in the nineteenth century: language is acquired through immersive exposure, not through analysis of its rules. A 2025 meta-analysis of 31 studies found that AI conversation tools produced a statistically significant improvement in language learning outcomes, a result that no amount of flashcard optimization has consistently matched.
None of these products invented a new theory of language acquisition. They translated an existing one into something more people could reach.
The design question this leaves
The Grammar-Translation method persisted not because educators were wrong about design, but because a design decision made under a specific constraint became, over two centuries, indistinguishable from the thing itself. The constraint, how do you grade language at scale, was forgotten. The method it produced was inherited as if it were a description of how language works, passed from Prussia to Europe to America to the App Store, from the grammar drill to the streak.
Every time a design team optimizes for a metric because the actual outcome is hard to measure, they are making a version of the same decision. It is often the right decision given real constraints. The question worth asking is whether the constraint that made it necessary still exists, or whether it has simply become invisible inside the system it originally produced.
Before reaching for what can be measured, it is worth asking what the user actually needs to do, and what stopped them from doing it before. Sometimes the answer is a new solution. More often it is an old one that was always out of reach.
Notes
- Richards, J.C., and Rodgers, T.S. Approaches and Methods in Language Teaching. Cambridge University Press, 2001.
- Krashen, S.D. Principles and Practice in Second Language Acquisition. Pergamon Press, 1982.
- Howatt, A.P.R., and Widdowson, H.G. A History of English Language Teaching. Oxford University Press, 2004.
- Lyu, B., Lai, C., and Guo, J. Effectiveness of Chatbots in Improving Language Learning: A Meta-Analysis of Comparative Studies. International Journal of Applied Linguistics, 35(2), 834-851, 2025. DOI: 10.1111/ijal.12668.
Google might be testing Gemini Flash upgrade on LM Arena
A new Gemini Flash checkpoint is appearing on LM Arena, signaling a potential minor version upgrade for Google's most efficient model tier.
Summary
Decoder
- LM Arena: A popular open-source benchmarking platform where users blind-test models against each other to generate crowd-sourced leaderboards.
Original Article
A Gemini Flash checkpoint appears to be circulating on LM Arena, and early impressions place it a step above the Flash version currently running in the Gemini app. The gap looks incremental rather than generational, but testers comparing outputs are picking up a real difference in quality. Google hasn't commented on the listing, and it isn't clear whether this reflects a genuine release candidate or another internal build that quietly disappears. Google's Arena appearances have reliably preceded confirmed launches over the past year, which is part of why this one is drawing attention.
The logical next step after Gemini 3.5 Flash would be a "3.6" label, but that's an extrapolation from Google's versioning habits rather than anything confirmed, and there's no indication of when or whether a wider rollout will follow. Another possibility is a "Gemini 4 Flash" label, as traces of it have already been spotted on GitHub.
If a build like this surfaces officially, expect it first in the Gemini app's model picker, AI Studio, and the Gemini API, mirroring how the current Flash generation rolled out. Everyday Gemini users and cost-conscious developers stand to gain the most, since Flash carries the bulk of free and pay-as-you-go traffic that would otherwise need a pricier Pro-tier model.
That backdrop matters more than usual right now. Gemini 3.5 Flash launched at I/O in May as the default across the Gemini app and AI Mode in Search, beating the previous 3.1 Pro tier on several coding and agentic benchmarks while running several times faster. Gemini 3.5 Pro, pitched onstage for a June arrival, has since slipped into July, with reports pointing to additional tuning to coding, token efficiency, and long-task performance following early tester feedback. Whether that traces to Pro needing polish or to Google wanting more distance from OpenAI and Anthropic's coding benchmarks isn't confirmed, but rivals have kept pace on the agentic tasks Google has prioritized. Against that, a sharper Flash tier would yield a faster win, since Flash already carries most of the daily load for Google's fast-growing user base, while Pro remains unsettled.
Continual Harness: An Efficient Self-Improving Agent on ARC-AGI-3
Continual Harness is a new framework designed to optimize agent self-improvement specifically for the ARC-AGI-3 benchmark.
Summary
Decoder
- ARC-AGI-3: The Abstraction and Reasoning Corpus, a benchmark designed to test AI intelligence by requiring agents to solve novel logic puzzles rather than relying on memorized training data.
Original Article
Continual Harness: An Efficient Self-Improving Agent on ARC-AGI-3
ARC-AGI-3 is an IQ test for agents. The heavy test-time learning required by the benchmark pushes agents to form an internal world model of the rules and mechanics that updates with new evidence.
The Winning Essays for the Big Questions About AI
Dwarkesh Patel's essay contest winners argue that AI labs should focus on bio-infrastructure, policy stability, and asset-based business models.
Summary
Decoder
- Far-UVC: A specific wavelength of ultraviolet light that kills airborne pathogens while remaining safe for human skin and eyes.
- AMC (Advance Market Commitment): A financial mechanism where a buyer guarantees to purchase a future product if it meets certain specifications, reducing market risk for developers.
- CapEx (Capital Expenditure): Money spent by a business on physical assets like buildings, data centers, or transit lines.
Original Article
The Winning Essays for the Big Questions About AI
Two months ago, I posted some big questions about AI. We had 600 essays submitted for this contest. Below is a bit of information about the 3 winners, followed by all 3 full essays. Thanks to everyone who participated!
First Place - Jassi Pannu
Jassi Pannu is an Assistant Professor at Johns Hopkins University, where she focuses on biosecurity and pandemic preparedness. She serves on the board of Blueprint Biosecurity.
Jassi answered the question about what the OpenAI Foundation should do. She persuasively argues that we can live in a post-disease world, and gives very concrete and well-thought-out ideas about how to dedicate tens of billions of dollars to that project.
Second Place - Ege Erdil
Ege Erdil is a co-founder of Mechanize, a startup building environments and evals for frontier coding agents. He was previously a researcher at Epoch AI.
Ege answered the question about what countries outside the AI supply chain should do to increase their odds of not being totally sidestepped by transformative growth.
He argues that these countries should concentrate on enacting the kinds of policies that already work well in increasing growth and improving productivity. These strategies (strong property rights, low capital taxes, and an open regulatory regime) will be even more important in a world where enacting them can drive a much higher growth differential than is possible today.
What I love about Ege’s essay is that, in one sense, he’s giving very common-sense advice (as opposed to the much more galaxy-brain schemes some other applicants proposed - one application suggested middle countries blackmail China and America by threatening to nuke their fabs and datacenters). But it’s actually this much more grounded and timeless advice that felt the most contrarian. And it’s also more likely to work.
Third Place - Michael Li
Michael Li is a Master of Public Policy candidate at Harvard Kennedy School. He writes Ceteris Paribus — a blog at the intersection of emerging tech, econ and policy.
Michael wrote about how the labs will actually make money. His essay was selected for the unique analogy he drew between AI labs and Hong Kong’s Mass Transit Railway business model - even if your main product consumes crazy CapEx and doesn’t directly earn it back, maybe you can make up for it by buying out all the complementary assets. In the case of Hong Kong MTR, that would be the adjacent properties - I don’t know what it looks like for the AI labs, but it was an interesting analogy to think about.
Essay #1 - Jassi Pannu on how she would run the OpenAI Foundation
I’d run the Foundation as a state-scale operation to end airborne transmission.
AI’s largest welfare upsides (curing diseases) and deadliest tail risks (engineered pandemics) both run through biology. By radically suppressing airborne pathogen transmission, we’d unlock >$1T in annual global GDP (through ending seasonal flu and the like, chronic diseases increasingly linked to viral infections, productivity losses, healthcare costs, etc.) and would take the possibility of catastrophic pandemics entirely off the table.
The dual-payoff principle: Most “make AI go well” interventions are insurance against bad outcomes, especially tail risks. My meta-level argument is that the best way of converting money into impact is to identify interventions that have the property of paying off big in both worlds: by producing step-changes in welfare in the everyday world as well as significantly reducing tail-risks in the emergency world. The bio resilience interventions I describe below are the best example of this.
AI for biology is on the critical path to cures, but destabilizing capabilities will arise early
Using AI to automate and scale every step in the biological research process, including managing the process itself (something I’ll call autonomous biological discovery), will bring humanity closer to a post-disease world. Over 4 billion years, life has been doing a random walk on an astronomically tiny subset of viable, connected, fitness-positive paths. Multi-component AI feedback loops (that include bio foundation models and systematic wet-lab experimentation at scale) for autonomous discovery will enable us to explore much more of possible biological design space. While we’re most interested in predicting and designing multicellular systems, it’s likely that the destabilizing capability of manipulating simpler pathogens will emerge first. The challenge this poses is that AI-enabled offense (seeding an outbreak) will be much easier than defense, which will remain constrained by physical-world deployment; I argue this advantages pre-positioned defensive technologies already embedded in our infrastructure.
There’s a clear path to ending airborne transmission, using physical infrastructure.
Regardless of what you think about the above, though, ending airborne transmission can be more than justified based on everyday benefits. Respiratory infections cause acute illness and productivity losses, but are increasingly linked to dementia, cardiovascular disease, and more; even “normal” childhood respiratory infections are being linked to long-term neurodevelopmental outcomes.
After evaluating many approaches, I’d argue ending airborne transmission is more achievable than most realize, through a specific, under-appreciated approach. I’m currently sitting in a building that provides me with pathogen-free water, keeps my food cold and pathogen-free, helps me heat my food to eliminate pathogens, and pipes away sewage. We have already embedded technologies all around us that enable a post-cholera, post-typhoid, post-dysentery world.
There is passive, pathogen-agnostic (works against any pathogen), physical infrastructure tech capable of making our buildings entirely free of respiratory pathogens, such as lamps that emit wavelengths that are safe for humans but deadly for bugs (called far-UVC). Researchers have suspected these would work at scale for decades; the reasons we haven’t deployed them are primarily non-technological. Consider this analogical case.
We now live in a post-smallpox world. This is one of humanity’s greatest accomplishments. How long did it take for us to do this? Jenner demonstrated vaccination could prevent smallpox in 1796. 171 years later, in 1967, D.A. Henderson launched the campaign that would successfully eradicate smallpox. In that period of time, humanity discovered electromagnetism, thermodynamics, general relativity, and we were 2 years from landing on the moon. Eradication was accomplished within a mere 10 years (with limited tech advances). Delays in smallpox eradication, clean water, and pasteurized milk were not due to lack of tech advancement; they were primarily market and coordination failures exacerbated by lack of political will. This is why this problem is so philanthropy-shaped.
4 steps to ending airborne transmission
Total: ~$40-$60B over 10 years for physical infrastructure to end airborne transmission; the rest of OAIF’s stake remaining for other interventions meeting the dual-payoff principle. By year 10, every primary school and major transport hub in OECD countries operates with passive pathogen-reduction infrastructure as default. Seasonal flu mortality is reduced by 60%. The probability of a respiratory pathogen achieving pandemic-scale spread is reduced by an order of magnitude.
- Push-funding to resolve the target product profile ($5B, Years 1-3) → Hire Jacob Swett, director of Blueprint Biosecurity, to lead a DARPA-style program office focused on: a) pathogen inactivation data from human aerosols, b) computational modeling for deployment, c) safety studies beyond conventional UV effects, d) gold standard cluster-randomized trials powered to detect plausible effect sizes. By the end of year 3, deliver a validated TPP for far-UVC lamps and real-world efficacy data demonstrating >30% transmission reduction.
- AMCs to guarantee demand and pull private capital ($15B, Years 1-5) → Create laddered purchase commitments for (a) 100K far-UVC fixtures that meet an interim TPP, (b) 1M for fixtures meeting the full TPP including safety and efficacy validation, (c) 10M commitment to retrofit specific buildings (see step 3). Modeled on Kremer’s pneumococcal vaccine AMC and Ransohoff/Frontier’s carbon removal commitments. By year 5, expectation is $30-50B in private capital mobilized, and supply chain capacity built to retrofit ~10% of global building stock.
- Large-scale deployments to generate evidence ($15-25B, Years 2-7) → Years 2-4: Deploy in all hospitals and long-term care facilities in 50 largest metro areas globally. Years 3-6: Primary and secondary schools in the same metro areas. Years 2-7: Major airports and high-density workplaces. By year 7, substantial real-world evidence base.
- Political infrastructure and state handoff ($3-5B, Years 1-10) → Smallpox eradication was a genuinely contingent historical event predicated on political will. Fund a memetic campaign à la the Rockefeller Foundation’s IHD yellow fever playbook which would transform respiratory transmission from “normal” to undesirable/unnecessary, and the training of a cadre of thousands that move into governments to build the political constituency and institutional infrastructure needed for global deployment and standards-setting. When pilot deployments demonstrate ≥40% reduction across 3 OECD countries, the Foundation transitions to catalyst rather than principal funder role, activating state procurement at orders-of-magnitude scale.
Essay #2 - Ege Erdil on what countries outside the AI production chain should do
I think a good baseline which very few countries, in the AI supply chain or not, will beat is “do nothing and ignore populist pressures to take radical actions”.
This is because people are naturally technologically conservative, and they hate economic disruption that causes job losses. Full automation of human labor by AI would bring about rapid technological and economic progress that would result in all humans losing their jobs. So the default expectation should be that policymaking in an era of AI automation will be profoundly irrational and counterproductive.
Resisting this political pressure will be hard enough for even the most functional governments. Expecting more from governments with poor track records such as India and Nigeria is unreasonable.
What does good policy in the AI era look like?
Having given this basic but uninspiring answer, I’ll now flesh out what I actually think will be important for good policymaking in the era of AI automation, though in practice it’s unlikely to have much relevance for how policy decisions will be taken.
Today, the economic output of a country depends on its endowment of natural resources, on how much physical and human capital it has, and how efficiently it’s able to make use of these resources. The major shift with AI will be that human capital will drop out of this equation. If a country wants to do well in a world after full automation of labor, it needs more natural resources, more capital, or more ability to make use of these inputs effectively, i.e. more total factor productivity.
While the capital elasticity of output will dominate any other factor of production, how much capital a country ends up with is itself endogenous. Capital moves across borders more easily than labor, and it will likely flow to the places where factors complementary to it – total factor productivity and natural resources – are abundant.
So I think the pillars of good policy in the AI era involve going in the following directions, to whatever extent possible:
- Get out of the way of AI automation. Repeal or revise occupational licensing laws, liability laws, data protection laws, and intellectual property. Abolish price and wage controls. Dismantle cartels and unions of human workers who will try to protect themselves from AI competition. Apply safety and security standards consistently between humans and AIs, instead of discriminating in favor of humans. Make it much easier to start new businesses, as AIs will need organizational structures just as much as humans do in order to bring out their productive potential.
- Provide political and legal stability. Investment will be anemic under instability, and if a country can stand out as an island of stability in what will likely be tumultuous decades for the world, it will attract enormous amounts of investment. The lowest bar is avoiding civil wars, revolutions or coups (not trivial for Nigeria to clear, since there was a serious military conspiracy to overthrow Tinubu as recently as September 2025); but this is not enough by itself. People must be confident that their investments will not be expropriated, and that the core operation of their businesses won’t suddenly be declared illegal.
- Increase capital formation in your country. Reduce or eliminate taxes on capital gains and corporate income, don’t let important projects be held up by permitting, don’t hamper construction by requirements of “having to pay prevailing wages”. Remove exchange, interest rate, and capital controls.
- If industries necessary for AI deployment are broken, urgently fix them. For example, Nigeria’s grid sucks and needs to be fixed before anything else can happen, and this is downstream of the entire economy’s price system being messed up by government restrictions. If this isn’t fixed, the country will never be able to accumulate the kind of capital it needs to be productive in an era of AI automation.
These recommendations feel extreme because humans instinctively like policies that favor labor over capital. This already causes problems in our world where capital’s output share is only around 30%, but it will likely become a critical obstacle in a world where labor has been made obsolete and capital’s output share is much higher.
If a country that’s currently in bad shape succeeds in doing these, they would attract a ridiculous amount of outside investment and be transformed in the decades in which AI automates the world economy. The fact that they had no previous AI labs or fab companies with trillion dollar valuations would be irrelevant, because those very companies would at that moment be in the process of automating their moat – their human capital and organizational knowledge – away.
Will this happen?
Probably not. As I’ve said, the relevant countries are far too dysfunctional for these radical reforms to be adopted, and the countries currently in the AI supply chain (e.g. the US and China) already perform well relative to other countries on the metrics I’ve listed.
However, there’s still some hope for countries outside the AI supply chain: the incumbents can screw up. This is not that implausible, because policy quality in the industrial era will not be that strongly correlated with policy quality in the AI era, just like how the best foraging economies in the world weren’t the best agrarian ones and the best agrarian economies weren’t the best industrial ones.
Concretely, the incumbents can outlaw AI automation in key industries, slow down datacenter buildouts, expropriate capital holders, impose strict safety standards for rolling out AI agents in critical industries like medicine and law, et cetera. There’s virtually no limit to the mistakes that they can make in the coming period of acute irrationality. If a mediocre country simply holds on and doesn’t make any big blunders, they will probably come out the other side of this in a much better relative position than they had been in at the start.
Essay #3 - Michael Li on how the labs will make money
A subway company solved AI’s business model
There is an industry with the following economics: billions in upfront capital before earning a dollar. Core service priced near marginal cost. Enormous value created for users, almost none captured by the builder. Relentless pressure to keep investing in the next generation of infrastructure. I’m not talking about AI labs. I’m talking about Hong Kong’s Mass Transit Railway.
Many have reached for the railroads analogy when discussing the business of AI. Most conclude that the lesson is that commercial viability requires state subsidy for a general purpose technology with public good properties.
I want to challenge that, because Hong Kong’s MTR actually solved the problem. It’s one of the only mass transit systems in the world that is commercially self-sustaining, publicly listed, paying dividends with no government operating subsidy.
The financials are structurally identical
MTR’s core rail service has never funded its own expansion. In 2018, its best pre-covid year, transport operations earned HK$2.0 billion in EBIT. Estimated capital expenditure for 2024–2026 is HK$87.9 billion, nearly all rail-related. Three years of peak rail earnings would cover about 7% of that. The rail has never paid for itself through fares. It was never designed to.
MTR fares are kept affordable through a government fare adjustment mechanism. You can’t price transit to recover full construction costs because it would be unaffordable and defeat the purpose. Each rail line can maybe cover its operating costs, but fare revenue never stretches to fund the next line. AI API pricing faces the same constraint from the other direction. Distillation and open source alternatives deflate API prices roughly 10x per year, and any lab that prices above marginal cost loses volume to rivals. Each model can be operationally profitable on inference, but the margin never stretches to fund the next training run.
The standard global solution is subsidy. London Underground requires billions in TfL grants. China’s national HSR carries a trillion dollars in debt with 94% of routes unprofitable. AI is on the same trajectory: CHIPS Act, Stargate, sovereign wealth fund investments, Pentagon contracts. The default endpoint is subsidy-dependent quasi-public infrastructure.
MTR found another way.
Rail plus property
When MTR was built in 1979, its designers understood that fares alone would never recover construction costs. So they structured the corporation around a different premise: the rail line would make surrounding land valuable. So own the land.
MTR develops residential towers, offices and shopping malls above and adjacent to its stations, capturing the value appreciation that its own infrastructure creates. Property profits cross-subsidize rail operations and fund the next line. Today MTR owns 13 shopping malls, manages 47 developments above its stations, and property generates the majority of actual profit. Don’t try to capture value through the rail service itself. Own the assets that appreciate because of the rail service.
The AI parallel
“When do labs make money?” has the same structure as “when does rail pay for itself through fares?” It doesn’t, and that’s the wrong question.
A biotech startup uses a frontier model to screen drug compounds, shaving two years off a clinical trial. A logistics firm uses it to optimize routing, saving $40 million in fuel costs. A solo developer ships in a weekend what used to take a five person team three months. In each case the model provider captures a fraction of a percent through API fees. The provider can’t charge more, because four other labs and a dozen open source alternatives offer comparable capability. The surplus flows to the users and the broader economy. This is what general purpose technologies do. The steam engine, electricity and TCP/IP generated zero revenue for their creators.
MTR’s lesson: stop trying to make fares cover construction. Find the property.
Four candidates, ranked by defensibility
Government-granted deployment rights come first. A government grants a lab exclusive deployment access to national health records, tax systems or defense logistics. The lab accumulates domain data, integration depth and regulatory clearance that takes years to replicate. This is MTR’s own mechanism: development rights granted by the state, justified by natural monopoly properties.
Accumulated RL reward data is second. Billions of interaction signals that train the next model generation. Unlike weights (which depreciate via distillation), RL data is practically non-replicable and compounds across generations. It doesn’t convert to revenue directly, but it’s a land bank. Appreciating, undeveloped.
Forward-deployed integration is third. Instead of selling model access to a consulting firm that captures the productivity surplus, own the service delivery end to end. The way Palantir embeds engineers inside government agencies rather than licensing software. The lab doesn’t charge the law firm an API fee. The lab becomes the legal research service, priced against the outcome it delivers rather than the tokens it serves. Switching costs compound with accumulated domain data and institutional knowledge. This is MTR’s shopping mall: capture the foot traffic the rail creates rather than charging passengers more for the ride.
Data trusteeship over national datasets is fourth. Governments sit on enormous under-leveraged datasets (patient records, tax filings). A frontier lab designated as trustee gets exclusive access to train on and build products against this data. The scarcity is genuine, justified by the same logic as MTR’s land. You don’t want five competing labs accessing fifty million patient records any more than you want five developers building on the same plot. But this creates a public-private data monopoly and would require careful governance: clear boundaries on usage, benefits flowing back to the public, independent monitoring and real sanctions for misuse.
The reframe
The labs that survive won’t be the ones that make the API profitable. They’ll be the ones that identify their property above the station and build toward it now. The API is the rail. It will never be profitable enough. The money is in what appreciates around it.
The policy question follows: instead of subsidizing training runs, governments should design institutional mechanisms (deployment rights frameworks, data trusteeship structures, productivity measurement standards) that let labs capture the surplus their infrastructure creates.
There’s a final irony. The AI policy conversation is dominated by the US-China frame. American free market labs versus Chinese state-funded champions. The most relevant institutional model may be neither. It may be Hong Kong’s: a 45 year old public-private hybrid, commercially operated, self-financing through institutional design rather than ideology. The model that makes frontier AI sustainable might already exist. It just runs trains.
PorTAL: Portable Task Adapters for LLMs
PorTAL proposes a new architecture to decouple task-specific fine-tuning from base model weights, avoiding the need to re-train adapters for every new foundation model.
Summary
Decoder
- LoRA (Low-Rank Adaptation): A technique for fine-tuning large models by freezing the original weights and injecting trainable rank-decomposition matrices, significantly reducing memory usage.
Original Article
PorTAL: Portable Task Adapters for LLMs
Researcher: Ben Geist
Abstract
Parameter-efficient fine-tuning (e.g. LoRA) adapts a frozen LLM to a task, but the resulting adapter is locked to one base model. When a new model is released, the...
A New Look at AI's Impact on Jobs: Firm-Level AI Spending and Workforce Adjustment
Companies that aggressively invested in generative AI increased their total headcount by 10.2 percent over two years.
Summary
Original Article
A joint study by Ramp and Revelio Labs analyzed the relationship between firm-level generative AI investments and employment outcomes across over 21,000 companies in the US. The research reveals that companies with high-intensity AI spending grew their overall headcount by 10.2 percent and entry-level positions by 12 percent over the two years following adoption.
Product Shape is the Moat
Sustainable competitive advantage in AI applications cannot rely on mere technical wrappers or fine-tuning, as these are easily commoditized.
Summary
Decoder
- Wrapper: An application that provides a user interface or specific task flow over an existing, third-party LLM API.
Original Article
Product Shape is the Moat
Wrapper Laundering
Since 2022 there’s been a "wrapper laundering" shell game. Founders build AI wrappers and continually come up with simple stories about why their work is "hard" and "defensible" in...
OpenAI proposes 5% stake to Trump administration to ease Washington pressure
OpenAI has reportedly proposed ceding a 5% equity stake to the US government to mitigate regulatory scrutiny and align AI growth with public interest.
Summary
Decoder
- Sovereign wealth fund: A state-owned investment fund that invests in real and financial assets to benefit the nation's economy.
Original Article
Key Points
- OpenAI proposes handing the U.S. government a 5% stake in the company, according to a report in the Financial Times.
- The potential holding would be worth roughly $42.6 billion at the artificial intelligence startup's recent $852 billion valuation.
- OpenAI CEO Sam Altman reportedly argued the move was the best way to share the upside of AI with the public.
OpenAI has proposed handing the U.S. government a 5% stake in the company, the Financial Times reported Thursday, as the artificial intelligence startup seeks to defuse mounting political pressure in Washington.
A 5% holding would be worth roughly $42.6 billion, after the AI lab closed a record-breaking funding round in March at a post-money valuation of $852 billion.
OpenAI CEO Sam Altman argued that giving the public a financial interest in the company is the best way to share the upside of AI, the FT reported, citing two people familiar with the talks.
Altman suggested a stake of that size in early discussions with the Trump administration, as part of a broader arrangement under which Washington would hold 5% of each of the leading U.S. AI developers via a government vehicle, according to the report.
The proposed arrangement envisions other U.S. AI companies, such as Anthropic, Google and Meta, ceding similar stakes to the government through a sovereign wealth fund vehicle, the FT said. It is not clear whether any of these groups would agree to OpenAI's proposal.
The White House, OpenAI, Google, and Meta did not respond to CNBC's requests for comment.
The Trump administration and Anthropic have not discussed the government taking stakes in the company, a source familiar with the matter said on Thursday.
Pressure has been mounting on major U.S. AI firms as Washington grows increasingly wary of cybersecurity vulnerabilities associated with their models and rising competition from Chinese open-source models that are proving to be almost as capable and significantly cheaper than some of the top American models.
Anthropic had disabled access to its most advanced Mythos and Fable models last month to comply with an export control directive from the government. On Tuesday, the company behind the Claude AI platform said it was cleared to restore access to the models after taking steps to resolve policymakers' safety concerns.
OpenAI's reported pitch came after more than a year of talks about a possible government stake in the company, CNBC reported last month, with Altman first pitching the concept directly to the Trump administration in early 2025.
In April, the leading model-maker proposed creating a "public wealth fund" to hold assets capturing growth in AI companies and distribute the economic benefits to the public.
The Trump administration has previously taken stakes in private companies, investing in Intel Corp, IBM, and other quantum and critical mineral companies during the president's second term.
The government obtained a 10% stake in Intel after a landmark $8.9 billion investment in the chipmaker's common stock in August last year. In May, President Donald Trump said he should have asked for a bigger stake in the company.
Trump has described the U.S. taking an ownership stake in AI giants as "a beautiful thing" that would make Americans "partners in this revolution."
SpaceX Showed Investors Prototype of Elon Musk's New AI Device
SpaceX is developing a proprietary handset prototype that integrates custom AI, signaling Elon Musk's intent to reduce dependence on Apple's app ecosystem.
Summary
Original Article
SpaceX showed off a prototype for a handset-like device to investors prior to its IPO last month. Designed to run on a proprietary operating system and integrate AI technology from SpaceX's AI, the device would run on a Qualcomm Snapdragon chipset. Elon Musk has previously considered building a smartphone due to frustration over how Apple controls the distribution of third-party apps. A device could help Musk become less reliant on other companies.
Rise of the Cheap Robots
A new wave of affordable general-purpose robots priced under $10,000 aims to make household automation accessible to the general public.
Summary
Original Article
Several startups have announced general-purpose robots in the sub-$10,000 range. Nori Robotics is selling a bimanual robot for under $1,400 that's tall enough to reach the top of a table or countertop. The BracketBot is a wheeled robot manipulator with hoverboard-style drive wheels that will sell for under $3,000. Weave's Isaac 1 is a fully capable home mobile manipulator that can put away laundry and tidy a room. It will cost $8,000 or $450 per month.
The Anthropic Fable Ban Is Over. The Battle Over How to Tame AI Has Just Begun
The US administration is struggling to balance national security concerns over powerful AI tools with the risk of stifling domestic innovation against foreign competition.
Summary
Original Article
The debate over the degree to which the US federal government should control access to cutting-edge AI tools is now heating up. There is a growing awareness of how powerful these new tools are, but little agreement on how they should be controlled. AI proponents say that bans and restrictions would hurt the US in its efforts to stay ahead of China in the geopolitical AI race. Administration officials and advisers are attempting to balance innovation and security while addressing safety risks in the least invasive way possible.
The economics of SpaceX
SpaceX’s $2.16 trillion valuation is decoupled from its current Starlink connectivity business, relying instead on a speculative strategy to monetize user transactions.
Summary
Decoder
- ARPU (Average Revenue Per User): The total revenue generated by a company divided by the total number of its active subscribers.
- Ku-band/Ka-band: Specific radio frequency ranges used for satellite communications, with higher bands offering more bandwidth but requiring precise alignment.
- 50G-PON (Passive Optical Network): A fiber-optic technology capable of delivering 50Gbps of downstream bandwidth, used as the benchmark for high-performance terrestrial broadband.
Original Article
At the end of June 2026, the share price of SpaceX was around USD 164 per share, and if you multiply that price by the total number of shares in the company, some 13.7 billion, you get a market capitalization of USD 2.163 trillion. Against a global population of 8.264 billion people, the market valuation of SpaceX currently works out to a phenomenal USD 261 per head.
The average revenue per user (ARPU) for the Starlink business is around USD 66 per month, or USD 792 per year. That’s down from the 2023 reported ARPU of USD 1,188 due to international pricing used in developing regions in the world, and a push to improve its market share from terrestrial markets. So, this company has a market valuation of USD 261 per head of population but has market penetration of only about 0.12%.
Even if Starlink increases its user base to a completely unlikely number of 1 billion users, each user would need to account for around USD 2,000 of value to justify the share valuation. At the same time, the ARPU of the global mobile industry sits at between USD 72 and USD 120 per year, and it’s steadily declining over time at a rate of around 1.3% per annum. So, in ARPU terms, Starlink is outperforming the global mobile industry by a factor of 8 or so.
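The per-head and per-user figures above are simple ratios, and a short sketch makes them easy to re-run with different inputs. The inputs are the numbers quoted in the text; the function names are illustrative, and taking the midpoint of the mobile ARPU range as the benchmark is an assumption made here for the comparison.

```python
# Inputs quoted in the text.
MARKET_CAP_USD = 2.163e12          # SpaceX market capitalization
WORLD_POPULATION = 8.264e9         # global population
STARLINK_ARPU_YEAR = 66 * 12       # USD 66 per month -> USD 792 per year
MOBILE_ARPU_RANGE = (72, 120)      # global mobile ARPU, USD per year
HYPOTHETICAL_USERS = 1e9           # the "completely unlikely" 1 billion users


def valuation_per_head(market_cap: float, population: float) -> float:
    """Market capitalization spread over every person on the planet."""
    return market_cap / population


def value_needed_per_user(market_cap: float, users: float) -> float:
    """Value each user must account for to justify the market cap."""
    return market_cap / users


def arpu_multiple(arpu: float, benchmark_range: tuple) -> float:
    """ARPU relative to the midpoint of a benchmark range (an assumption)."""
    midpoint = sum(benchmark_range) / 2
    return arpu / midpoint


per_head = valuation_per_head(MARKET_CAP_USD, WORLD_POPULATION)
per_user = value_needed_per_user(MARKET_CAP_USD, HYPOTHETICAL_USERS)
multiple = arpu_multiple(STARLINK_ARPU_YEAR, MOBILE_ARPU_RANGE)

print(f"Valuation per head:        USD {per_head:,.1f}")
print(f"Value needed per user:     USD {per_user:,.0f}")
print(f"Starlink vs mobile ARPU:   {multiple:.2f}x")
```

With these inputs the per-head valuation comes to a little under USD 262, the per-user requirement to about USD 2,163, and the ARPU multiple to 8.25, in line with the rounded figures in the text.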
But so far Starlink is not a direct competitor to terrestrial mobile networks. Starlink can only do limited-capacity services to mobiles and falls far short of the 200 – 500Mbps of the terrestrial 5G networks. If you want to increase the capacity of the services offered to mobile devices, you need to use far larger antennae on the spacecraft, and the only provider (so far) in that particular market sector, AST SpaceMobile, has demonstrated capacity up to 100Mbps. In the mobile device space, Starlink looks like a triumph of exuberant hype over a more sobering reality.
The Starlink prospectus notes that: “Based on the total number of connected devices globally and the mobile ARPU, we estimate the Starlink Mobile market opportunity to be $740 billion.”
However, current system designs offer only limited capacity to mobile devices, at around 4Mbps per user. This may be acceptable in remote areas, but it is not competitive more broadly. Achieving this projected market size at USD 100 annual ARPU would require over one billion users, placing Starlink in direct competition with terrestrial networks while offering an inferior service.
So perhaps this is all about the terrestrial broadband networks rather than mobile systems. The industry rate of broadband ARPU is approximately double the ARPU of mobiles, at around USD 240 per year.
Starlink has gathered 10M predominantly broadband users to date. SpaceX reported 2025 revenue of USD 18.67 billion. Connectivity services contributed USD 11.39 billion, or 61% of the total.
Here, Starlink’s prospects appear healthier, with an ARPU of around USD 1,000 per year — roughly four times the industry benchmark for fixed broadband. However, Starlink has so far been positioned as a user-terminal-based service that generally delivers between 50 – 250Mbps downlink and 20 – 50Mbps uplink. Starlink V3 satellites are reported to provide a total of 1Tbps of downlink capacity and 160Gbps of uplink capacity, divided across 48 downlink beams and 16 uplink beams. If you happen to be the only active Starlink user within a cell, achieving a 1Gbps downlink may be possible.
On the other hand, if you live in an urban area served by fibre infrastructure, current market offerings are moving towards capacities four times higher. A 50G-PON deployment shares a common 50Gbps capacity across a split ratio of between 1:32 and 1:64, yet the market values these fixed broadband services at an ARPU of around USD 240 per year.
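To put the satellite and fibre figures on the same footing, the sketch below divides each shared capacity by the number of units sharing it. It assumes capacity is split evenly across beams and, for the PON case, that every subscriber on the split is active at once (a worst case); neither assumption comes from the text, and real systems allocate capacity dynamically.

```python
def share_gbps(total_gbps: float, units: int) -> float:
    """Even share of a common capacity across a number of units."""
    return total_gbps / units


# Starlink V3 figures quoted in the text.
v3_down_per_beam = share_gbps(1000, 48)   # 1Tbps over 48 downlink beams
v3_up_per_beam = share_gbps(160, 16)      # 160Gbps over 16 uplink beams

# 50G-PON figures quoted in the text, worst case with all subscribers active.
pon_share_1_32 = share_gbps(50, 32)
pon_share_1_64 = share_gbps(50, 64)

# How many simultaneous 1Gbps users one downlink beam could carry
# under the even-split assumption.
users_at_1gbps_per_beam = int(v3_down_per_beam // 1)

print(f"V3 downlink per beam:  {v3_down_per_beam:.2f} Gbps")
print(f"V3 uplink per beam:    {v3_up_per_beam:.2f} Gbps")
print(f"50G-PON share at 1:32: {pon_share_1_32:.3f} Gbps")
print(f"50G-PON share at 1:64: {pon_share_1_64:.3f} Gbps")
print(f"1Gbps users per beam:  {users_at_1gbps_per_beam}")
```

Under these assumptions a single downlink beam carries roughly 20.8Gbps, so a 1Gbps service is only plausible while a beam has a couple of dozen active users at most, whereas a 50G-PON split guarantees between about 0.8 and 1.6Gbps per subscriber even when fully loaded.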
For fixed broadband, Starlink is offering a largely undifferentiated access service. Where broadband is already available, it must compete for market share largely on price. Starlink’s competitive advantage lies in rural and remote areas, where it is far less expensive than deploying fixed fibre infrastructure. However, rural and remote markets are, by definition, a relatively small segment of the overall market. To grow its user base beyond this segment, Starlink must eventually compete in the more densely populated urban and suburban broadband markets.
This analysis suggests that, in densely populated urban markets, the ARPU for a Starlink broadband service offering up to 1Gbps of capacity would need to be around USD 200 per year or less. That is substantially below Starlink’s current ARPU of approximately USD 792 per year for a service that delivers lower capacity.
Starlink’s prospects depend heavily on deploying many more, and much more capable, satellites in the future. However, there are some challenges. In the broadband access market, Starlink uses radio spectrum that has traditionally been dedicated to satellite communications: Ku-band (10.7 – 12.7GHz downlink and 14.0–14.5GHz uplink) and Ka-band (17.8 – 20.2GHz downlink and 27.5 – 30.0GHz uplink). Additional capacity is expected to come from the use of V-band (37.5 – 42.5GHz downlink and 47.2 – 51.4GHz uplink) and E-band (71.0 – 76.0GHz downlink and 81.0 – 86.0GHz uplink).
In the mobile market, Starlink uses S-band spectrum (around 1.9GHz and 2.0GHz). For mobile services, this direct-to-device satellite capability may still offer only around one hundredth of the capacity per square kilometre of a 5G network operating in a sparsely populated rural environment. However, Starlink is not saying that it will offer coverage only where there is no existing 5G service; it says it will cover urban and suburban markets as well.
So how could a satellite-first, radio-based service become the preferred access service in the densely populated locales of suburbs and cities? The more mundane realities of utility economics and radio physics suggest that it can’t.
In bidding SpaceX’s share price up to USD 160 and beyond, investors are clearly not valuing the company on Starlink’s current service profile as a connectivity utility. That valuation necessarily rests on a hazy vision of what SpaceX might become in the future.
Alongside Starlink, the SpaceX prospectus also highlights the company’s prospects in X, Grok, payments, and Artificial Intelligence (AI). It estimates a USD 2.4 trillion market for AI infrastructure, a USD 760 billion market for consumer AI subscriptions, and a USD 600 billion digital advertising market, while pointing to a digital economy that is expected to reach USD 22.7 trillion within the next few years.
Connectivity is not the business plan. In the eyes of SpaceX, connectivity is the customer acquisition strategy, while the value lies in the other activities users perform on their Starlink-connected devices. The underlying assumption is that Starlink places SpaceX in a position to channel users towards its own services and capture a far larger share of revenue from those activities.
Starlink brings the customer on board. Mobile keeps them engaged. X captures attention. Grok becomes the AI assistant. X Money handles transactions. Enterprise AI tools generate recurring business revenue.
In that model, connectivity becomes less important as a profit centre and more important as a control point.
We’ve heard and seen all this before. In the telco world, the introduction of mobiles heralded the prospect of platforms where connectivity captured the customer, and the apps on the mobile device captured the user’s attention and payments while serving as their information tool and business device. The original, and very naive, view on the part of the telcos saw these apps as being managed by the mobile operator, with a small fraction of the revenues from this line of activity funding a new golden age for the telcos. Obviously, this did not happen. Applications flourished, but outside of the overarching control of the connectivity provider.
We saw a similar play with Apple and the iPhone, and Google and the Android platform, where Apple and Google attempted to channel all mobile transactions through their respective stores so they could take their share of the action. The problem is that many regulatory regimes want this sector of the market exposed to open competition, and the cost of constantly defending an extractive, completely one-sided business model in every part of the world has created an environment where the only way Apple and Google can relieve even a small part of this pressure is by reducing their fees. This attrition won’t relent until the entire capture model is destroyed.
So, the question behind this wild share valuation is — will Starlink fare any differently in trying to leverage device connectivity into a position where it is taxing the transactions that occur with those devices? Previous iterations from this exact same playbook say: Of course not!
What we are left with is a classic economic bubble.
Bubbles are fundamentally driven by speculative hype and herd behaviour, in contradiction of financial fundamentals. Key drivers include:
- The greater fool theory: Investors buy overvalued assets, not because they believe in the asset’s fundamentals, but because they plan to resell it to a ‘greater fool’ at an even higher price.
- Contagious enthusiasm: Fear of missing out and media hype cause a snowball effect, drawing more buyers into the market.
- Easy credit: Low interest rates or loose lending standards inject excess money into the market, fuelling speculation.
Economist J. K. Galbraith viewed economic bubbles not as rational market anomalies, but as products of mass psychology and ‘financial euphoria’. He argued that speculative manias thrive on the illusion that there is ‘something new in the world’. Bubbles are driven by a growing disparity between an asset’s true economic value and its inflating price. Investors stop focusing on business fundamentals of markets, costs, profits, and dividends, and instead rely solely on the expectation that prices will climb forever, or at least long enough to make a substantial gain and sell the asset to a greater fool. Once a bubble bursts, it inevitably triggers intense periods of blame. Individuals who were widely regarded as financial geniuses during the boom are suddenly villainized. The rest of us switch to a position of sober sanctity, conveniently ignoring our own roles in the prior collective insanity.
So, it’s time to strap in, as it’s going to be a fun ride!
Xbox working on a way to digitize your physical game collection as Sony abandons discs
Microsoft is reportedly developing 'Project Helix,' a system to digitize physical Xbox game collections as the industry trends toward disc-less hardware.
Summary
Decoder
- Digital entitlement: A cryptographic license tied to a user account that allows access to a specific piece of digital content, regardless of the physical storage medium.
Original Article
Xbox is testing out a disc-to-digital feature that will create a 'digital entitlement' for games that can be moved between Microsoft accounts and Xbox profiles.
Ontology Everywhere!
Ontologies are shifting from academic curiosity to essential infrastructure, providing AI agents with the semantic context required to act on enterprise data.
Summary
Decoder
- Ontology: A formal representation of knowledge as a set of concepts within a domain and the relationships between those concepts.
Original Article
Ontologies are re-emerging as a practical data platform layer because AI agents need explicit business meaning, not just schemas or dashboards. Unlike data models, ontologies encode shared concepts, typed relationships, constraints, and limited inference. In enterprise tools, this often appears as typed-edge traversal, semantic layers, or knowledge graphs. High-value deployments still require human-curated semantics, especially where systems can write back and act on decisions.
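As a rough illustration of what "typed relationships, constraints, and limited inference" can mean in practice, here is a minimal sketch of an in-memory ontology. The `Ontology` class, its method names, and the example concepts (`Customer`, `Invoice`, `billed_to`) are all hypothetical and do not correspond to any particular enterprise tool.

```python
from collections import defaultdict


class Ontology:
    """Toy ontology: typed edges, domain/range constraints, is_a inference."""

    def __init__(self):
        # (subject, relation) -> set of objects
        self.edges = defaultdict(set)
        # relation -> (required subject concept, required object concept)
        self.constraints = {}

    def declare_relation(self, relation, domain, range_):
        """Constrain a relation to subjects of `domain` and objects of `range_`."""
        self.constraints[relation] = (domain, range_)

    def is_a(self, entity, concept):
        """Limited inference: follow is_a edges transitively."""
        seen, stack = set(), [entity]
        while stack:
            node = stack.pop()
            if node == concept:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.edges.get((node, "is_a"), ()))
        return False

    def add(self, subject, relation, obj):
        """Add a typed edge, rejecting it if it violates a declared constraint."""
        if relation in self.constraints:
            domain, range_ = self.constraints[relation]
            if not (self.is_a(subject, domain) and self.is_a(obj, range_)):
                raise ValueError(f"{subject} -{relation}-> {obj} violates constraint")
        self.edges[(subject, relation)].add(obj)

    def traverse(self, subject, relation):
        """Typed-edge traversal: objects reachable over one relation type."""
        return set(self.edges.get((subject, relation), ()))


onto = Ontology()
onto.add("EnterpriseCustomer", "is_a", "Customer")
onto.add("acme", "is_a", "EnterpriseCustomer")
onto.add("inv-1", "is_a", "Invoice")
onto.declare_relation("billed_to", "Invoice", "Customer")
onto.add("inv-1", "billed_to", "acme")   # accepted: acme is inferred to be a Customer

print(onto.traverse("inv-1", "billed_to"))
```

The point of the sketch is the difference from a plain schema: the edge is accepted only because the ontology can infer that `acme` is a `Customer`, and an edge in the wrong direction would be rejected rather than silently stored.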
Data Residency Is Not a Legal Problem. It Is An Infrastructure Design Problem
Data residency is an infrastructure design problem that requires managing regional parity across compute, logs, ML experiments, and backups, not just storage.
Summary
Deep Dive
- Hidden Surfaces: Residency compliance must cover logs, experiment artifacts, feature caches, and CI/CD build environments.
- Region Parity Gap: Managed services and ML workbenches are often not equally available across all cloud regions, forcing teams to choose between hacks or cross-region data leakage.
- Infrastructure Hygiene: Moving from 'click-ops' to versioned, reproducible specifications in Git is the only way to make a platform genuinely portable.
- False Binary: Self-hosted components do not equal unmanaged chaos; they can be built as controlled, audit-ready layers when managed services are unavailable in a region.
Decoder
- Data Residency: The legal or regulatory requirement to keep data within specific geographic or political boundaries.
- RBAC (Role-Based Access Control): Restricting system access to authorized users based on their role within an organization.
Original Article
Data Residency Is Not a Legal Problem. It Is an Infrastructure Design Problem
Why regulated companies can't solve data residency with policy documents alone, and what infrastructure teams need to design before compliance turns into a migration crisis.
A data residency requirement usually shows up as a single sentence in a legal or compliance document: user data must be stored and processed inside a specific country or region. On paper it reads like a storage constraint. Move the database, restrict exports, update the policy, done.
Real systems are rarely that tidy.
Moving the database is not enough. Residency depends on the full lifecycle of data and computation: where data is stored, where code runs, where ML experiments execute, where logs get written, where backups are created, and who can reach the system across regional boundaries.
That makes residency an infrastructure design problem. Legal can write the requirement, but platform and engineering teams are the ones who decide whether the system can actually meet it.
The requirement looks simple until you inspect the architecture
The first reaction is almost always about storage: "we need to move the data warehouse to the local region." That's necessary, but it's one layer out of many.
Modern data and ML platforms are distributed by default. Data might be stored in one region, transformed in another, logged in a third, and inspected through a managed service whose control plane the application team never even sees. Dashboards, notebooks, feature pipelines, CI/CD runners, monitoring tools, exports, temporary files: any of them can turn into a residency surface.
A team can migrate its primary tables and still leave sensitive data sitting in places nobody thought to check: query logs, notebook outputs, audit trails, snapshots, support exports, experiment artifacts, feature caches, error payloads.
The useful question isn't "where is the database?" It's "where can user data appear during normal operation?"
The hidden list of residency surfaces
A real residency review covers more than storage. It should trace the full path from ingestion to deletion. At minimum, look at these layers:
- Primary storage: operational databases, data warehouse datasets, object storage buckets, feature stores.
- Compute: batch jobs, scheduled queries, notebooks, model training, inference workloads, serverless functions.
- ML tooling: managed workbenches, experiment tracking, model registries, GPU jobs, notebook environments.
- CI/CD: deployment runners, build logs, test data, temporary artifacts, environment variables.
- Observability: application logs, audit logs, traces, metrics labels, error reports, profiling output.
- Backups and disaster recovery: replicas, snapshots, archive buckets, restore procedures.
- Access and identity: service accounts, admin access, break-glass procedures, cross-region support workflows.
- External services: SaaS tools, analytics platforms, LLM APIs, ticketing systems, BI exports.
If any of these layers touches sensitive data outside the allowed region, the architecture fails the intent of residency, even with the primary database sitting locally.
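One way to make this review concrete is to keep the surfaces in a small inventory and flag every entry where sensitive data can appear outside the allowed boundary. The sketch below is only an illustration: the `Surface` record, the component names, and the region labels are all hypothetical, and a real inventory would be fed from cloud APIs or infrastructure code rather than typed by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Surface:
    layer: str       # one of the review layers, e.g. "Observability"
    name: str        # hypothetical component name
    region: str      # where the component stores or processes data
    sensitive: bool  # whether user data can appear there


def find_violations(surfaces, allowed_regions):
    """Return surfaces that can hold sensitive data outside the allowed regions."""
    return [s for s in surfaces if s.sensitive and s.region not in allowed_regions]


# Hypothetical inventory: the primary database is local, but two other surfaces are not.
inventory = [
    Surface("Primary storage", "orders-db", "eu-central", True),
    Surface("Observability", "query-logs", "global", True),
    Surface("Backups", "nightly-snapshots", "us-east", True),
    Surface("CI/CD", "build-cache", "us-east", False),
]

violations = find_violations(inventory, allowed_regions={"eu-central"})
for v in violations:
    print(f"{v.layer}: {v.name} is in {v.region}")
```

The primary database passes, while the log sink and the snapshots are flagged, which is the failure described above: storage is local but the architecture still misses the intent of residency.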
Managed services become a hidden dependency
Managed services are useful because they take operational work off your plate. They're risky for the same reason: they hide infrastructure decisions that later turn into compliance decisions.
A managed ML workbench might be available in one region and missing in another. A logging product might keep part of its control plane somewhere else. A data transfer service might write operational metadata to a global location. A provider might offer a database in a new region but none of the surrounding tooling that makes the database worth using.
This is the region parity trap: the business assumes cloud regions are interchangeable, while the actual service catalog is not.
Cost is the obvious downside of a managed service. The one that hurts during a migration is that the service may simply not exist where you're legally required to run.
When that happens, the options are all bad: wait for the provider, run cross-region workloads, bring in a second cloud, or hack together a workaround. None of them is comfortable in the middle of a compliance-driven migration.
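The parity gap can be checked before a migration starts rather than during it. The following is a minimal sketch, assuming the team maintains its own map of which services each region offers; the `catalog` contents, service names, and region names are invented for illustration and do not describe any real provider.

```python
def parity_gaps(required, catalog, target_region):
    """Return the required services that the target region does not offer."""
    available = catalog.get(target_region, set())
    return sorted(set(required) - available)


# Hypothetical service catalog, maintained by the platform team.
catalog = {
    "region-a": {"database", "ml-workbench", "logging", "transfer"},
    "region-b": {"database", "logging"},
}
required = ["database", "ml-workbench", "logging"]

gaps = parity_gaps(required, catalog, "region-b")
print(gaps)
```

A non-empty result for the legally required region is the signal to plan an alternative (self-hosted, second provider, or waiting) before the deadline rather than after it.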
Region-aware platform design
Residency gets manageable when the platform is region aware from the start. That doesn't mean duplicating every service everywhere. It means the platform knows which parts are portable, which parts are tied to a region, and which dependencies would block a migration.
| Layer | Residency question | Common failure mode |
|---|---|---|
| Storage | Where is user data physically stored? | Primary tables are local, but replicas or exports are not. |
| Compute | Where does code execute against that data? | Jobs read local data but run in a different region. |
| ML workloads | Where do notebooks, GPU jobs, and experiments run? | The managed ML platform isn't available in the compliant region. |
| Logs | Do logs contain sensitive data, and where are they stored? | Query text, payloads, or identifiers leak into global logs. |
| Access | Who can access data across regions? | Broad admin roles allow uncontrolled cross-region access. |
| CI/CD | Can the system be deployed reproducibly into the region? | Manual environments can't be recreated under compliance pressure. |
| Backups | Are snapshots and restore paths also local? | Disaster recovery silently violates residency. |
The table is simple, but it shifts the conversation. Instead of asking whether one database moved, the team asks whether the whole operating model can run inside the required boundary.
A better architecture pattern
The weak pattern usually accretes over time. Someone moves the data to the required region, but the notebooks stay in a managed service somewhere else. Access control runs on hand-maintained groups. Users spin up their own runtime environments. Logs flow into a global sink. People create scheduled jobs through the UI. Nobody can rebuild the system from scratch, because the real architecture lives in people's habits instead of in a repo.
The stronger pattern looks like this:
- Storage is region-local by default.
- Compute and ML execution stay in the same compliant boundary.
- Infrastructure lives in code, not click-ops.
- Runtime images are standardized and versioned.
- Access goes through SSO, RBAC, and audit trails.
- Scheduled workflows are defined in Git and deployed through APIs.
- Logs are classified, filtered, and stored by sensitivity.
- Critical dependencies are documented with their region availability and an exit path.
None of this is anti-cloud. It's about not letting compliance hang on abstractions you can't reproduce, inspect, or relocate. Managed services are fine; blind dependence on them is the problem.
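The item about classifying, filtering, and storing logs by sensitivity can be sketched as a small routing step in front of the log sinks. This is an illustrative sketch, not a production filter: the field names in `SENSITIVE_FIELDS`, the email pattern, and the two in-memory sinks are assumptions chosen for the example.

```python
import re

# Assumed names of fields that may carry user data.
SENSITIVE_FIELDS = {"query_text", "payload", "user_id"}
# Deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def classify(record):
    """Label a record 'sensitive' if it carries any field from SENSITIVE_FIELDS."""
    return "sensitive" if record.keys() & SENSITIVE_FIELDS else "operational"


def redact(record):
    """Mask email addresses in string values; other values pass through unchanged."""
    return {
        key: EMAIL.sub("[redacted]", value) if isinstance(value, str) else value
        for key, value in record.items()
    }


def route(record, local_sink, global_sink):
    """Keep sensitive records in the region-local sink; redact the rest before export."""
    if classify(record) == "sensitive":
        local_sink.append(record)
    else:
        global_sink.append(redact(record))


local_sink, global_sink = [], []
route({"msg": "login failed for ana@example.com"}, local_sink, global_sink)
route({"msg": "slow query", "query_text": "SELECT ..."}, local_sink, global_sink)
```

Records with sensitive fields never leave the local sink, and anything sent to the global sink is redacted first, so a stray identifier in a message string does not cross the boundary unnoticed.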
Self-hosted isn't the same as unmanaged
When managed tooling isn't available in a regulated region, teams tend to fall into a false binary: use the managed platform, or accept a chaotic self-hosted mess. That framing is wrong.
A self-hosted platform can be better governed than a managed one if you build it as a controlled internal layer. For ML experimentation that might mean Kubernetes-based execution, JupyterHub or a similar workbench, SSO, RBAC, user isolation, configurable GPU allocation, autoscaling node pools, approved Docker images, persistent storage policies, quotas, audit logs, and configuration managed through CI/CD.
What separates the two is ownership. A self-hosted platform shouldn't be a pile of hand-built VMs. It should be a reproducible execution layer you can deploy, review, monitor, and audit.
What to do before regulation forces the issue
The worst time to find out your infrastructure isn't portable is in the middle of a mandatory regional migration. Mature teams treat portability as a design constraint long before it becomes a legal emergency.
A practical checklist:
- Map where sensitive data is stored, processed, logged, exported, and backed up.
- List the managed services involved in data and ML workflows, then check regional availability for each one.
- Document which systems can be redeployed from code and which depend on manual setup.
- Standardize runtime environments before every team builds its own version.
- Move recurring workflows out of UI configuration and into versioned specifications.
- Apply least-privilege access and audit logging across all data-adjacent platforms.
- Build migration runbooks, then test whether critical workflows can actually be replayed in another region.
- Make vendor dependency explicit in architecture reviews.
It all looks like ordinary platform hygiene, right up until regulatory pressure turns it into business continuity.
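As an illustration of moving a recurring workflow out of UI configuration and into a versioned specification, the sketch below describes a scheduled job as data that can be committed to Git and re-rendered for another region. The `WorkflowSpec` fields, the job name, and the image tag are hypothetical; deploying the rendered file through a scheduler's API is left out.

```python
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class WorkflowSpec:
    name: str
    schedule: str  # cron expression
    region: str    # where the job is allowed to run
    image: str     # versioned runtime image tag


def render(spec):
    """Serialize with sorted keys so the committed file diffs cleanly in Git."""
    return json.dumps(asdict(spec), indent=2, sort_keys=True) + "\n"


def retarget(spec, region):
    """Return a copy of the spec for another region; everything else is unchanged."""
    return replace(spec, region=region)


spec = WorkflowSpec("daily-export", "0 2 * * *", "region-a", "runtime:1.4.2")
moved = retarget(spec, "region-b")
print(render(moved))
```

Because the region is one field in a reviewed file, replaying the workflow elsewhere becomes a one-line change in a pull request instead of a manual rebuild in a UI.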
Residency as a maturity test
For a lot of companies, a residency requirement is the first real test of how mature their infrastructure is. Immature setups end up tied to a region without anyone choosing that on purpose. Mature ones treat region placement as an explicit decision.
It also forces some uncomfortable questions. Can we stand this platform up again in another region? Do we actually know where our logs end up? Can we run ML workloads without a region-bound managed service? Can we prove who accessed what? Can we move recurring workflows without someone clicking through a UI at 2 a.m.?
When the answer is no, compliance isn't really the problem. The infrastructure is.
Where responsibility actually sits
Residency isn't a legal checkbox that engineering ticks after the fact. It's a stress test for platform design. Companies that treat it as a storage migration tend to learn too late that compute, logs, ML tooling, identity, backups, and workflow automation count just as much.
The mature move is to design platforms that are portable, auditable, reproducible, and region aware. That doesn't mean avoiding managed services. It means knowing exactly where a managed service ends and your own responsibility starts.
Compliance documents define the boundary. Infrastructure decides whether the system can actually live inside it.
Your AI isn't underperforming. Your data foundation is
Businesses are stalling AI deployments because they are prioritizing token usage metrics over actual business value, according to new Elastic research.
Summary
Deep Dive
- 80% of decision-makers worry that high usage volume is being mistaken for genuine productivity.
- Only 8% of organizations track specific revenue or cost-saving outcomes from AI.
- Only 31% have a centralized view of the autonomous AI agents running in their stack.
- 16% of companies are redirecting IT infrastructure budgets to fund AI.
- Security budgets are being diverted to AI, increasing risk at a time when threats are escalating.
Decoder
- Tokenmaxxing: The practice of optimizing AI workflows for maximum token usage or activity counts rather than measurable business outcomes.
- Context engineering: The process of curating and structuring the specific data fed to an AI model to improve the relevance and accuracy of its responses.
Original Article
Your AI isn’t underperforming. Your data foundation is.
New research reveals why Australian businesses are entering the new financial year with bigger AI budgets and the same unsolved problem.
One in three Australian businesses exceeded their AI budget last year. Half of them plan to increase AI spending again this year, yet the behaviour that caused those budget overruns remains largely unaddressed.
Elastic has just released new research surveying more than 500 senior AI decision-makers across Australia, and the findings are a clear signal that this new financial year needs to be different.
Here is what the data is telling us.
The tokenmaxxing trap
Tokenmaxxing describes the tendency to optimise AI for visible activity rather than genuine business value. The usage dashboards look healthy. The leaderboards are climbing. But the outcomes are not necessarily following.
Our research found 80% of Australian AI decision-makers are concerned that high AI usage is being mistaken for genuine productivity gains. Yet only 8% are tracking whether AI is delivering real revenue or cost savings.
And the consequences are already showing. Almost a third of businesses (32%) have paused, cancelled, or wound back AI deployments because the output could not justify the cost. Another 28% are currently reviewing deployments for the same reason. Investment and adoption are accelerating faster than the ability to measure return, and as budgets grow, the stakes of getting this wrong grow with them.
The data foundation problem
When AI tools underperform, decision-makers blame the quality of their underlying data more than anything else (32%) — more than double the share who blame the limitations of the AI models themselves (14%). Yet, only 28% treated data readiness as a formal prerequisite before deploying AI. A further 8% deployed with no formal data quality assessment at all.
The instinct is to treat rising AI costs as a compute problem. But the compute explosion in most enterprises is happening at the data retrieval layer. When an AI system relies on low-quality or poorly scoped data, it forces the model to consume exponentially more tokens to reach a usable answer. You are not just paying to compute; you are paying to compute junk.
Australian organisations do not need a massive flood of tokens for every query. They need the exact, right drop of hyper-contextualised data. AI return on investment is an observability problem before it is a budget problem.
Agents are scaling faster than governance
Only 31% of businesses have a centralised view of how many AI agents or autonomous workflows are running across their organisation. Nearly half (47%) are concerned that AI adoption is outpacing their ability to govern it.
The monitoring gap is stark. Only 13% have usage logging in place, 11% conduct regular risk reviews, and just 2% have a formal incident response process for AI. Yet, 50% plan to expand the use of agents in the new financial year. If an AI tool caused a compliance failure tomorrow, only 22% are very confident they could identify what went wrong and why.
Where budgets are coming from
To fund AI increases, businesses are redirecting budgets:
- 16% from IT infrastructure and operations
- 12% from existing software licensing
- 10% from headcount or hiring budgets
- 8% from cybersecurity
That last figure is worth noting. Though a very small number of businesses said they were using security budgets this way, no business should use cybersecurity budgets to fund AI expansion at the very moment AI is increasing the speed and scale of cyber threats. This creates risks that deserve scrutiny at the board level.
AI spend is variable and unbudgeted in ways that legacy software licensing never was. CFOs are scrutinising the entire technology stack differently as a result. The honeymoon period for AI spend is over. Boards want proof of value, not promises.
How AI is changing the workforce
While the financial metrics are facing intense scrutiny, the research points to a more nuanced workforce story than the one that often dominates headlines.
AI is not just changing how much organisations spend; it is fundamentally altering how people work. When routine, repetitive administrative tasks are successfully automated, employees can shift their daily focus to higher-value work.
The research found that three-quarters (75%) of Australian businesses report their people are redirecting freed-up time toward higher-value initiatives. Instead of managing manual workflows, they are focusing on strategic planning, new product development, deeper customer engagement, and proactive upskilling.
Furthermore, the data highlights a strong demand for specialised skills. Rather than reducing headcount, 45% of organisations anticipate creating entirely new, AI-focused positions within their businesses, with 18% already recruiting for these roles.
Ultimately, sustainable AI investment relies heavily on supporting the human element. The businesses that pull ahead in this new financial year will not just build stronger data foundations; they will actively invest in the people who work alongside the technology. This collaboration is where compounding returns live, and where the real competitive gap opens up.
Reset AI approach in new financial year
The new financial year is the right moment to reset. It’s not the time to pull back from AI, but to build the foundation that makes it genuinely work. The businesses that do that now will be considerably better positioned in the future.
Learn more about Elastic’s context engineering capabilities and data retrieval tools.
About the research
The Elastic AI cost research, conducted by Pure Profile, surveyed over 500 senior decision-makers at Australian organisations with 50 or more employees that are currently using, formally or informally deploying, or piloting AI tools. All respondents are either final decision-makers or strong influencers and recommenders on AI, data, or digital transformation in their organisation. Fieldwork was conducted in June 2026.
The state of the creative industry 2026: what our survey tells us about pay, burnout and AI
A survey of 882 creative professionals reveals an industry in crisis, with 69% reporting burnout and widespread resentment toward mandatory AI adoption.
Summary
Deep Dive
- 69% of creative professionals report significant burnout, with mid-career roles feeling the most pressure.
- Nearly half of freelancers earn under £30,000 annually, indicating a lack of financial security despite high adoption of productivity-boosting AI.
- 80% of respondents haven't entered an award in the past year; only 12% believe awards meaningfully help careers, and 35% consider them too expensive and inaccessible.
- Respondents explicitly identified 'AI' as the design trend they are most tired of, citing it as a contributor to generic output.
- The primary requested support mechanism for creatives is community building and mentorship, not technical training.
Original Article
The state of the creative industry 2026: what our survey tells us about pay, burnout and AI
Our wide-ranging survey lays bare a profession that's exhausted, anxious about its future, and using AI tools it doesn't trust.
Feeling tired, less secure and resentful of AI? Then it's official: you're by no means alone. Creative Boom's flagship survey for 2026, gathering responses from 882 creative professionals worldwide (UK and US weighted, with 43% bringing more than a decade of experience), confirms what you've probably already suspected. This has been a tough year for creatives, wherever you are in your career.
This isn't a survey about one bad quarter. It's a story about a workforce under serious pressure, trying to work out what AI, the economy, and a changing client landscape mean for their livelihoods, and finding few reassuring answers.
A boom in burnout
To be perfectly honest, these numbers don't need much analysis: they tell a pretty clear story. For example, a massive 69 per cent of respondents say they've experienced burnout in the past 12 months.
Mid-career creatives report the highest burnout rate at 77%, with early-career professionals close behind at 74%. Founders and studio leaders fare a little better, at 59%, though that's still a majority struggling.
Why the distinction? My best guess is that founders, for all their stresses, generally have more control over the shape of their workload and the clients they take on. Consequently, it's the mid-career cohort—the people running projects, managing junior staff and fielding client demands without the authority to say no—who are absorbing the brunt of a difficult year.
AI adoption vs approval
Perhaps the most telling finding is the gulf between AI adoption and AI approval. Eighty-six per cent of respondents now use AI tools in their work: a figure that would have seemed remarkable even a couple of years ago. Yet only 10% of creatives think AI's overall effect on the industry is positive. Fifty-eight per cent describe its impact as mixed, and 28% are straightforwardly negative about it.
That gap, between near-universal use and near-universal unease, is perhaps the defining story of this survey. Creatives aren't refusing to use AI; they're adopting it because they feel they have to. But at the same time, they remain deeply sceptical about what it's doing to their industry, their pricing power and their sense of authorship.
This is not a community that's been won over, but one that's adapting under duress. And that distinction matters. Tool-makers and commentators often use adoption figures as proof of enthusiasm, but enthusiasm and necessity look very different up close. A profession that feels compelled to use a technology while doubting its long-term value isn't embracing change so much as bracing for it; hedging its bets while it waits to see how client expectations, pricing and competition shift around it.
What's happening to freelance pay?
None of this unease, by the way, has been alleviated by extra pay. In fact, half of the respondents feel less financially secure than they did a year ago, compared with just 18% who feel more secure. Almost 48% are worried about where the industry is heading, compared with under 38% who feel confident. And more than a third (38%) are considering a job change, with 7.5% planning to leave the creative industry altogether.
For the self-employed, particularly, the picture is stark. Nearly 47% of self-employed creatives in our survey earn less than £30,000 a year. That's not a poverty wage, of course, but bear in mind this is a workforce stocked with experienced professionals, 43% of whom have more than a decade behind them. So, for a substantial chunk to be earning well below the UK's median full-time salary of £39,039 (ONS: April 2025)—with none of the security that salary typically comes with—is worth reflecting on.
Are creatives quitting?
Most creatives aren't planning to leave the industry altogether: the figure is a modest 7.5%. But the bigger problem isn't people quitting creative work outright; it's people quietly looking for a way out of their current role, agency or set-up, while staying within the profession itself.
That's arguably a harder problem for employers and clients to spot and fix than outright attrition. A wave of resignations is visible; a slow drift of disengaged, undervalued talent looking sideways for something better often isn't, until it's too late.
As a whole, our survey paints a picture of a profession where rates haven't kept pace with costs, competition, or, increasingly, AI-enabled undercutting. Clients now have a cheaper, faster option for certain tasks, which inevitably puts downward pressure on what freelancers can charge for work AI can approximate, even imperfectly.
Indeed, when we asked which design trend creatives are most sick of, one answer dominated by a wide margin: AI itself, with more than 70 mentions. Gradients (19 mentions) and minimalism (10) trailed well behind. Remember when those kinds of stylistic niggles were the biggest bugbear for visual creatives? Happy days…
Awards are being ignored
With creatives so stressed out over burnout and pay, I'm not surprised awards have slipped down the priority list for many. Consequently, a full 80% of respondents haven't entered an award in the past year. Perhaps more significantly, only 12% still believe awards meaningfully help careers, and 35% believe they're too expensive and inaccessible to bother with.
For a profession that's long used awards as a shorthand for credibility and as a marketing tool for agencies and studios, this looks like a vote of no confidence. Is award entry becoming something larger agencies do for visibility, rather than something the wider creative community sees much value in?
What would actually help
While complaining about the current state of affairs might be cathartic, it ultimately won't get you anywhere. So we also wanted our survey to include potential solutions.
Asked what would genuinely improve their working lives, our respondents didn't point to new software. Instead, networking and community came top, cited by 57.5%, with mentorship close behind at 53%. New tools and technology trailed well behind, at just 31%.
That's an interesting result for an industry that could once have been accused of an obsession with tooling. What creatives say they need most isn't a new AI feature; it's people. Peers, mentors and a genuine sense of professional community matter to them more than the next piece of software, and they're harder to find than ever in a fragmented, increasingly remote working world.
The instinct in difficult years is often to retreat: to get your head down, cut costs, skip the conference, let memberships lapse. But if this survey tells us anything, it's that retreating from community is precisely the wrong move at precisely the wrong time.
The creatives who are struggling most don't want another dashboard or plug-in. They want a room full of people who understand what they're going through, and someone a few years ahead of them willing to offer some guidance. That's a far harder thing to build than a new feature, but on this evidence, it's the thing that actually moves the needle.
You're missing the point; this is your real value in tech companies
Tech professionals must pivot from simply 'operating' AI tools to applying human strategy and domain expertise to survive an increasingly automated landscape.
Summary
Deep Dive
- The author compares AI models to a 'Thermomix' (kitchen robot)—a tool that handles production but lacks the creative strategy to innovate.
- Businesses that fully automate processes often produce generic, indistinguishable products that lose value in crowded markets.
- 'Human-in-the-loop' remains essential for injecting style, purpose, and strategic decision-making that AI models currently cannot replicate.
- The author warns that relying purely on AI tools for efficiency creates a 'race to the bottom' where products become indistinguishable.
- True value for employees lies in the ability to apply context and judgment to AI output rather than just prompt-engineering.
Decoder
- Human-in-the-loop (HITL): An AI development model where humans remain part of the loop to refine, validate, or steer the AI's output, preventing the algorithm from operating in complete isolation.
Original Article
You’re missing the point; this is your real value in tech companies
It’s not about a tool or a model, but about your expertise and the criteria that will ultimately enable you to generate profound changes in the industry.
Over the last two years, we have witnessed the biggest tech revolution of the century. We have gone through a spiral of company changes, trends, hype, influencers, misinformation, uncertainty, and massive layoffs, to name a few. These behaviors are “expected” somehow when, as a society, we face such a tremendous change in how we create things, communicate with systems and services, and begin to see our reality with new eyes.
The impact of having AI now rooted in our lives has had huge consequences for all of us on many levels. One of these, without question, is the significant influence on businesses that, as any tech revolution does, creates new transactional dynamics between companies, services, and users, as well as new corporate bets that are not always positive for everyone.
Along with this new way of connecting people (or should we start saying agents?) with products and services, there is a phenomenon that has made us lose perspective on businesses and human value. As evidence of this problem, the companies that laid off thousands of people in 2025 on the strength of the almighty AI tools now seem to regret that decision because of high token costs and irresponsible use of AI models; some now weigh the cost of spending tokens against the full cost of an employee, and the numbers do not add up. The new market reality is still evolving every day, so what we are certain about today might be obsolete tomorrow.
My reading of this phenomenon (without being, or pretending to be, an expert) is a simple graphic illustrating the inverse relationship between technology (in this case, AI) and human skills that can affect industry realities nowadays.
This opposing relationship between humans and technological progress is not new in our history as a society (it has been part of pop culture, in stories and movies, for decades). Still, it doesn’t mean the two can’t work together to achieve greater results; that is precisely the point of this article: understanding where the real value of employees lies despite significant technological advancements.
A machine that creates everything?
In a conversation with a colleague the other day, he gave me an example illustrating the current situation in companies: the dilemma of choosing between people and technology, or between automation and human skills to thrive in business, and, above all, how to measure whether things are going well.
The metaphor he used to illustrate this crossroads goes like this: imagine you could get a Thermomix, a smart cooking robot that can literally cook anything for you, so the problem of learning how to cook would almost disappear. With this advantage, you decide to start a business around this new toy, seeking to thrive in the food market by being faster and more affordable than competitors.
With this advantage, a few things become apparent when you start the venture. There is no need to hire employees; you already have a complete kitchen production line in a single appliance. There are no recipe limitations: you already have an extensive catalog of options to offer without effort, and without the recipe barrier, there’s no need for a learning path to innovate with unique dishes.
Everything sounds promising, doesn’t it? But as usual, nothing in life is guaranteed, and everything is full of nuances. These are the scenarios I can foresee from that example (none of them a novelty); I’ll stick to generalities, only to illustrate the point:
The all-machine business
This scenario is pretty much the reality for many companies now; whether big or small, the risk of choosing technology outcomes over human skills can reveal complex challenges as well as benefits. The most evident consequences of this approach might be the following:
Possible positive side:
As a new business, your entire initial investment goes into acquiring the thermo machine; once you have it, almost the entire production chain is in one place and under your own operation.
All income will be yours, since you don’t need employees or rented premises. If the business needs customer service, then at least at the start you can be the same person who operates the machine and delivers.
Possible negative side:
The thermo machine will indeed do the job, and that’s it. For some people, that will be enough, but offering the same product as hundreds of restaurants with more capabilities will not help you make a name in the food market. How do you compete in this scenario? Maybe on speed, but does your food offer anything else that will make people come back?
In a vast market of options, having a soul, style, or purpose will make you stand out from others. The same criteria apply to technology: our thermo machine is all the AI models that promise fast results, but they are so generic that anyone can tell the same technology created them, and eventually, our product will vanish and be devalued among hundreds of identical solutions.
The all-human business
Possible positive side:
When we talk about creations and originality, this path will be the right one. People’s creativity can amaze us in unthinkable ways; we have witnessed all the delicious meals from people experimenting with an idea. In business, creativity matters because it can drive innovation and disruption, but it is not the only factor to consider when trying to succeed.
Possible negative side:
Creating delicate dishes can put you in an original position relative to competitors and eventually build a reputation, but to stay ahead of the market, you might need to scale your business, refine processes, and maintain a human workforce to support customer demand. Do not get me wrong: scaling a business is not the only way to thrive; take, for example, Jay Fai, a Michelin-starred street-food kitchen in Bangkok, so famous that people wait in long lines for hours for a portion of its food. This recognition is based on years of craft and perseverance, with no sign of being a big food franchise.
In a business, this approach involves trade-offs, such as time versus money (unless you are so famous that people will wait for you no matter what). When creating your product and delivering it to users takes longer than expected, you are at a disadvantage against competitors that offer the same product faster. Being somewhat disconnected from technology or automation makes your business run on slow, expensive, almost artisanal dynamics.
The new connection value business
So you decide to take a different approach for your business: this time, the Thermomix machine will handle one production line while you create unique dishes and strategies that complement what the machine does. In addition, you develop cooking concepts that attract a new audience for growth, along with new services or products; in this model, speed is just one of the many values you can offer customers. The machine’s job is to produce, and your job is to think about strategy.
This model strikes a balance among maintaining a competitive production chain, achieving scalability, and developing innovative ideas for the market. That combination is the new value that can produce real change and innovation. Is using the Thermomix with added cooking expertise the same as relying solely on the machine’s output? No, I’m pretty sure the results are not the same, and that is exactly where you realize what your business’s full potential can be.
In tech companies, this dilemma is evident: some decide to rely solely on AI models or tools, eliminating human skills, and I’m not sure whether the products feel or work the same; they look generic and sometimes have usability or performance issues (it is obvious that there was no human in the loop).
Some companies are still struggling to find that balance and, with it, the value of their products; they prefer to “save money” by letting people go, but at the cost of losing the human touch: ideas that only a person, not an algorithm, can create.
Put a tool in the right hands, and it will be more effective than the same tool in the hands of someone lacking vision, skills, and strategy. In a world where anyone can build stuff, your real value now is the ability to apply judgment, knowledge, strategy, and vision to solve problems or uncover business opportunities using advanced tools, not the other way around.
Surviving the future and my thoughts about it:
The Thermomix story is the face of the near future and of our relationship with it. Today, more than ever, the AI revolution makes every day an experience full of uncertainty and change. But how do we survive this continuous movement within companies? Having a plan for this is impossible, but maybe the only way to surf the wave is to embrace these changes with a more critical view:
- In an era when social media is the most popular form of communication, everyone is fighting for people’s attention, so there is a lot of trash content to get clicks and views. Be critical when finding reliable sources to follow.
- Do not fall into trends without understanding all the benefits and constraints a new model can offer; understanding its functionality and then applying it to your context is more valuable than just being part of the latest technology hype.
- Put judgment and expertise ahead of tools; it is fine to understand and operate new technology, but, as with the person running the Thermomix, a business is more robust when it combines automation with strategy.
- Surround yourself with people who want to learn and experiment, not “experts” who claim to own the truth of things.
We create the technology of today and tomorrow, but when we learn to view these inventions simply as tools that enhance our creativity, we will be able to explore new paths of progress.
Create Viral Ads by Chatting (Website)
Advivi is an AI-powered ad generator that allows marketers to create video advertisements for social platforms through conversational prompts.
Summary
Decoder
- Performance Marketing: A form of digital marketing where advertisers pay only when a specific action occurs, such as a click, lead, or sale.
- Creative Iteration: The process of rapidly testing and refining different versions of ad visuals and copy to identify what drives the best conversion rates.
Original Article
AI Video Ad Maker. Create viral ads by chatting. Turn an idea into ready-to-run automated video ads for TikTok, Meta, and YouTube with the ultimate AI ad generator.
1M+ viral shorts generated by 10,000+ creators
From engaging faceless channels to high-converting product showcases - let AI do the heavy lifting while you focus on scaling your audience.
Short-Sleeve Apparel Ad
Skincare Product
Sunglasses Ad
Lipstick Ad
Summer Dress Ad
TikTok Ad
Instagram Ad
Face Cream Ad
Why Advivi?
Your AI Ad Agent
Generate a full AI video ad through simple conversation — no scripts needed. Advivi guides your ideas, turning them into storyboard, footage, and edits automatically.
Multi-Model Video Generation
Powered by Sora, Veo, Kling, and Nano Banana — using the best text to video AI model for every scene.
Built for Performance
The ultimate AI commercial maker. Generate high-performing automated video ads optimized for TikTok, Meta, YouTube, and any platform — online or offline.
Edit Directly in Advivi
Refine your generated videos instantly, all in one place — no extra software required.
Built for Amazon listings, Amazon ads, TikTok, Shopify, and Meta
Explore search-intent-specific workflows for Amazon product videos, listing assets, Sponsored Brand Video, short-form ads, storefront content, and paid-social testing.
Amazon AI Video Generator
Turn Amazon product URLs and listing assets into Amazon product videos, listing video concepts, and Sponsored Brand Video variations with a faster seller workflow.
Best AI TikTok Video Generator
Create more hook-first short-form ad variants with faster pacing tests and repeatable TikTok creative iteration.
Shopify AI Video Generator
Build storefront videos and paid-social variants from the same product brief with a cleaner ecommerce workflow.
AI Video Generator for Meta Ads
Generate more Facebook and Instagram ad variations with better hooks, clearer scene flows, and faster performance testing.
Create AI Video Ads in 4 Steps
Turn an idea into a launch-ready ad in four simple steps:
1. Upload Your Image, Describe Your Idea
Upload your product image and share your idea via chat. Select your AI engine—Sora, Veo, or Kling—to best capture your creative style.
2. Generate & Preview
Watch the Agent turn your concept into a strategic AI video ad in minutes. Preview the stable draft instantly in your browser.
3. Edit Directly (Optional)
Not happy with a scene? Don’t start over. Use our intuitive timeline and built-in video editor to manually rewrite the text hooks, swap the music, or adjust the clip timing.
4. Export & Publish
Once you're happy with the result, export your high-converting automated video ad. Download it and publish directly to start scaling.
They’re Using Advivi
E‑commerce sellers, content creators, and marketers are generating high-performing AI video ads with Advivi every day.
Sarah Chen
Performance Marketer at TechFlow
"Advivi transformed our ad testing workflow. As our go-to AI video ad maker, what used to take days now ships in hours, and our top creatives are consistently beating benchmarks."
Marcus Johnson
Creative Director, Studio X
"We now iterate ad scripts and visuals together in one place. The team moves faster, and we launch platform-specific automated video ads with far less friction."
Emily Rodriguez
Freelance Ad Creator
"Finally, an AI tool that understands ad storytelling. From hooks to CTA scenes, the outputs feel conversion-focused instead of generic."
David Kim
Founder of AIStartups.io
"Advivi helped us scale ad production 10x without expanding our editing team. It is the best growth lever we added this year."
Jessica Lee
Marketing Lead at GrowthBox
"From idea to launch in hours. We test more TikTok and Instagram ads every week, and the ROI improvement is obvious."
Built for performance marketing
Scale ad creative production with speed, quality, and conversion in mind.
FAQs: Everything You Need to Know about AI Video Ad Creation
What exactly is Advivi and how does it work?
Advivi is a web-based AI Ad Agent designed for high-converting video ads. Unlike basic generators, Advivi gives you access to the world’s best engines (Sora, Veo, and Kling) in one place. Just upload a product image and share your idea via chat; our Agent guides the storyboard, selects the best model for the scene, and produces a winning 30-second AI video ad in 5 minutes.
What ad types can I create with Advivi?
You can create ads for TikTok, Meta (Instagram/Facebook), and YouTube, as well as offline product ads and high-converting product videos for Amazon and Shopify.
Do I need to spend credits on generating AI video ads?
Yes. Credits are needed to generate AI video ads with our high-end models, such as Sora or Kling. However, once your video is generated, you can preview and edit it directly in the browser.
Is there a free trial?
Yes! Every new user gets 100 Free Credits upon sign-up. This is enough to generate one full professional AI video ad using any of our premium engines (Sora, Veo, or Kling). You can preview it, edit it directly in your browser, and see the power of Advivi for yourself, completely risk-free.
Can I edit the video after it’s generated?
Yes. You have full creative control. You can manually rewrite captions, swap background music, or adjust the timing of any scene directly in our in-browser editor before you export.
Do I need to download any software?
No. Advivi is 100% Web-based. You can create, edit, and export professional AI video ads directly in your browser—no heavy software or high-end hardware required.
Are the videos ready for commercial use?
Absolutely. Every video you export is optimized for TikTok, Meta, YouTube, and offline video standards, ready to be plugged into your ad campaigns immediately.
Google Reader was building the wrong future
Google Reader’s true legacy was not its feed-reading code, but an accidental social network that prioritized direct, algorithm-free connections between creators and readers.
Summary
Decoder
- RSS (Really Simple Syndication): A web feed format that lets users and applications access website updates in a standardized, computer-readable form.
- Divertimento: A light, entertaining distraction; used here to describe the overwhelming and often shallow nature of internet content consumption.
- Social graph: A map of the social relations and connections between people and their interests within a network.
Original Article
Google Reader turned out to be the social network that people wanted all along.
arXiv's next chapter: Updates on our spin out from Cornell University
The preprint repository arXiv officially separates from Cornell University on July 1, 2026, becoming an independent nonprofit entity.
Summary
Decoder
- Preprint: A scholarly paper that is posted to a public server before it has undergone peer review at a formal academic journal.
Original Article
arXiv is becoming an independent nonprofit organization on July 1 after being part of Cornell University for the past 25 years.
Never seen a data quality issue that wasn't actually an ownership problem
Data quality is almost always an organizational failure of ownership rather than a technical failure of software or infrastructure.
Summary
Original Article
Data quality failures are usually ownership failures: when multiple teams consume the same metric but no single person controls its definition, calculation, and change process, trust erodes and fixes stay temporary. The practical remedy is explicit metric governance: one named owner, clear decision rights, version/change control, and enforceable quality rules tied to the metric.
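As a rough illustration of the remedy described above, the sketch below models a metric with one named owner, owner-only definition changes, a version history, and quality rules attached to the metric itself. Every name in it (`MetricDefinition`, `QualityRule`, `change_definition`, `failed_rules`, the `analytics-lead` owner, and the example metric) is a hypothetical chosen for illustration; none of it comes from the original article, and a real governance process would involve far more than a single class.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass(frozen=True)
class QualityRule:
    """A named check tied to a metric (hypothetical structure)."""
    name: str
    check: Callable[[float], bool]  # returns True when the value passes


@dataclass
class MetricDefinition:
    """One metric with one named owner and a versioned definition."""
    name: str
    owner: str
    version: int
    definition: str
    rules: list[QualityRule] = field(default_factory=list)
    changelog: list[str] = field(default_factory=list)

    def change_definition(self, new_definition: str, requested_by: str) -> None:
        # Decision rights: only the named owner may change the definition.
        if requested_by != self.owner:
            raise PermissionError(
                f"{requested_by} does not own {self.name}; ask {self.owner}"
            )
        # Change control: record the old definition before bumping the version.
        self.changelog.append(f"v{self.version}: {self.definition}")
        self.definition = new_definition
        self.version += 1

    def failed_rules(self, value: float) -> list[str]:
        # Enforceable quality rules: return the names of rules the value breaks.
        return [rule.name for rule in self.rules if not rule.check(value)]


# Assumed example metric, for illustration only.
weekly_active_users = MetricDefinition(
    name="weekly_active_users",
    owner="analytics-lead",
    version=1,
    definition="distinct users with at least one session in the trailing 7 days",
    rules=[QualityRule("non_negative", lambda v: v >= 0)],
)
```

With this shape, a consuming team that wants a different calculation has to go through the owner rather than quietly forking the metric, and a failing rule is reported against the metric's name rather than against whichever dashboard happened to surface it.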
Koto redesigns Stack Overflow to champion human knowledge in the AI era
Stack Overflow is rebranding to shift its perception from a simple Q&A forum to a foundational resource for human-validated technical knowledge.
Summary
Deep Dive
- The rebrand aims to pivot the platform's public identity from a stagnant repository to a dynamic, community-driven knowledge source.
- Koto implemented a 'stack' visual motif across the UI to reinforce the modular nature of programming.
- Stack Sans was developed to improve readability and bring a distinct, modern identity to the platform's technical documentation.
- The design language includes AI-integrated tools that assist with knowledge curation and community moderation workflows.
- The visual system is intended to reflect the collaborative, non-linear process of real-world software development.
Original Article
Koto has rebranded Stack Overflow around the idea that trustworthy, human-validated knowledge is the foundation of effective AI, repositioning the platform from a Q&A site to an essential resource for developers and enterprises. The new identity centers on the concept of “Always in build,” introducing an evolved logo, a stack-based visual system, the custom Stack Sans typeface, AI-assisted design tools, and a clearer tone of voice that reflects the collaborative, iterative nature of the developer community. Developed with community feedback and refined after its 2025 debut, the rebrand aims to reinforce Stack Overflow's role as a living source of reliable technical knowledge in the AI era.
watchOS 27 has brand new default Home Screen for Apple Watch
Apple’s watchOS 27 updates its default home screen with a dynamic grid that prioritizes AI-suggested apps based on real-time usage.
Summary
Deep Dive
- The Dynamic app grid places the Siri app at the center of the Home Screen and surrounds it with suggested apps.
- The feature is part of the watchOS 27 beta testing phase.
- Traditional grid and list views remain available, now sequestered behind a secondary shortcut.
- Early feedback indicates the UI prioritizes responsiveness and relevance over the uniform layout of previous watchOS versions.
Original Article
watchOS 27 introduces the Dynamic app grid as the new default Apple Watch Home Screen, placing the Siri app at the centre and surrounding it with AI-powered suggestions based on your most-used and recently used apps. The traditional Grid and List views remain available but are now accessed through a shortcut at the bottom of the dynamic grid. Apple aims to make app access faster and more relevant, with early beta testing suggesting the recommendations work well.
Finding photos is so much easier with Siri AI in iOS 27 that I no longer scroll
iOS 27's Siri AI has significantly reduced photo navigation friction by enabling natural-language queries for complex visual identification.
Summary
Deep Dive
- Siri's improved photo search relies on semantic indexing to map natural language to visual content metadata.
- The system performs well for object-based identification but struggles with complex, multi-variable temporal queries.
- Results are natively integrated with sharing and deep-linking into the Photos app, streamlining the workflow.
- Despite beta-phase inaccuracies, the feature marks a significant departure from standard folder or date-based browsing.
Decoder
- Semantic indexing: A technique where data is tagged based on meaning and context rather than keywords, allowing search engines to understand the intent behind a query.
Original Article
iOS 27's Siri AI improves photo search by letting users find images with natural-language voice commands, making it much easier to locate specific photos in large libraries without endless scrolling. It generally performs well when searching for objects, people, or clothing, and allows users to quickly share results or open them in the Photos app, though it can still make mistakes with more complex or time-based queries. Despite some inaccuracies in the beta, the feature is a significant improvement over previous photo search capabilities and saves considerable time.
v0 Template Gallery (Website)
Vercel's v0 platform now features a community template gallery for UI components, dashboards, and landing pages.
Summary
Original Article
Discover the best apps, components, and starters from the v0 community.
The Future of UX in an AI World: Why Designers are Becoming Strategic Leaders
The integration of AI into product development is pushing UX designers to take on leadership roles in defining trust, autonomy, and system strategy.
Summary
Original Article
AI is reshaping digital products, pushing UX designers beyond interface work into strategic decisions about trust, autonomy, and how humans and AI should interact. As uncertainty grows around AI-powered features, user research becomes essential for uncovering how people actually behave, rather than relying on assumptions. This shift is opening new career paths, with designers gaining greater influence over product direction, roadmap planning, and cross-functional decision-making.
Why are We All Obsessed with Anti-design?
Anti-design is resurging as a reaction against sterile, AI-generated minimalism, favoring human-centric imperfections to build brand authenticity.
Summary
Decoder
- Anti-design: A design movement that deliberately breaks traditional conventions of order, readability, and minimalism to create striking, unconventional visuals.
Original Article
Anti-design has resurged over the past decade as a reaction to minimalism's polished, white-space-heavy aesthetic that grew tiresome to audiences.
Creativity Lessons from The Eiffel Tower
The Eiffel Tower succeeds not as a solitary act of genius, but through iterative collaboration building on existing engineering foundations.
Summary
Original Article
The Eiffel Tower offers four creativity lessons drawn from a book on creative thinking, starting with how its design built on earlier works and multiple collaborators rather than solitary genius.