Devoured - June 23, 2026
SpaceX has signed a $6.3 billion deal to lease supercomputing capacity at its Memphis data center to Reflection AI, highlighting the growing strategic value of large-scale infrastructure for AI development. Meanwhile, OpenAI is pivoting its cybersecurity strategy toward automated remediation with the launch of the GPT-5.5-Cyber model and the defensive Daybreak partner program.
OpenAI launches new security tools and updates GPT-5.5-Cyber
OpenAI is shifting its cybersecurity strategy from discovery to automated remediation with its new 'Daybreak' partner program and GPT-5.5-Cyber model.
Deep dive
- Codex Security integration: Scans commits for vulnerabilities and provides automated remediation guidance.
- GPT-5.5-Cyber: A specialized model version restricted to authorized security defenders, showing improved performance on CyberGym and ExploitGym benchmarks.
- Operational Pipeline: Shifts focus from merely identifying security flaws to validating and generating production-ready patches.
- Partner-centric: Access is managed through partners, positioning 'Daybreak' as a defensive layer rather than a standalone consumer tool.
Decoder
- SARIF: Static Analysis Results Interchange Format, a standardized JSON format for reporting output from static analysis tools.
- CodeQL: A semantic code analysis engine used to query codebases for security vulnerabilities.
- CyberGym: A specialized testing environment for evaluating cybersecurity model performance.
Original article
OpenAI is advancing Daybreak beyond vulnerability discovery into patch automation, launching an updated Codex Security plugin, the full GPT-5.5-Cyber model in limited release, a Daybreak Cyber Partner Program, and Patch the Planet, an open-source security initiative built with Trail of Bits, HackerOne, Calif, researchers, and project maintainers.
The core shift is from finding bugs to landing fixes. Codex Security is integrated into Codex workflows and can scan an entire codebase, a selected folder, or a specific change. It can review recent commits, produce reports with severity, affected code locations, validation evidence, and remediation guidance, trace attack paths, build threat models, validate findings, generate patches, and export results into vulnerability management systems through formats such as SARIF and CodeQL queries. Since its research preview in March, OpenAI reports that Codex Security scanned more than 30 million commits across over 30,000 codebases, with human reviewers marking more than 70,000 findings as fixed and over 500,000 findings automatically detected as fixed.
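For readers who have not seen SARIF, here is a minimal sketch of what a single exported finding could look like. The tool name, rule ID, file path, and message are made up for illustration and are not Codex Security output. Only the overall layout (version, runs, tool.driver, results, locations) follows the SARIF 2.1.0 standard.

```python
import json


def make_sarif(tool_name, findings):
    """Build a minimal SARIF 2.1.0 log from a list of finding dicts."""
    results = []
    for f in findings:
        results.append({
            "ruleId": f["rule_id"],
            # SARIF result levels are "none", "note", "warning", or "error".
            "level": f["level"],
            "message": {"text": f["message"]},
            "locations": [{
                "physicalLocation": {
                    "artifactLocation": {"uri": f["path"]},
                    "region": {"startLine": f["line"]},
                }
            }],
        })
    return {
        "version": "2.1.0",
        "runs": [{"tool": {"driver": {"name": tool_name}}, "results": results}],
    }


# Hypothetical finding, for illustration only.
example = make_sarif("example-scanner", [{
    "rule_id": "EX001",
    "level": "error",
    "message": "User input reaches a SQL query without sanitization.",
    "path": "src/db/query.py",
    "line": 42,
}])
print(json.dumps(example, indent=2))
```

Because the output is plain JSON in a standard layout, a vulnerability management system that accepts SARIF can ingest it without knowing which scanner produced it.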
We want to help all companies be secure, working with the USG and the security ecosystem. The full version of GPT-5.5-Cyber is here; state of the art performance on CyberGym. Patch The Planet and Codex Security will help solve security problems instead of just finding them.
GPT-5.5-Cyber is the more controlled but more capable part of this release. OpenAI states that the model is intended for verified defenders working on authorized cybersecurity tasks, not general access. It is designed for deeper codebase analysis, reachability checks, vulnerability validation, patch development, testing, and evidence preparation. On CyberGym, GPT-5.5-Cyber reached 85.6 percent compared with 81.8 percent for GPT-5.5. It also scored 39.5 percent on ExploitGym versus 25.95 percent for GPT-5.5, and 69.8 percent on SEC-bench Pro versus 63.1 percent.
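To put those scores side by side, the short sketch below computes the absolute gain (in percentage points) and the relative gain for each benchmark. The numbers are the ones quoted above; the helper name and the rounding are this sketch's own choices.

```python
# Scores quoted in the article, in percent: (GPT-5.5-Cyber, GPT-5.5).
SCORES = {
    "CyberGym": (85.6, 81.8),
    "ExploitGym": (39.5, 25.95),
    "SEC-bench Pro": (69.8, 63.1),
}


def gains(cyber, base):
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = cyber - base
    relative = 100.0 * absolute / base
    return round(absolute, 2), round(relative, 1)


for name, (cyber, base) in SCORES.items():
    absolute, relative = gains(cyber, base)
    print(f"{name}: +{absolute} points ({relative}% relative)")
```

On these figures the ExploitGym gap is the largest of the three on both measures.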
Patch the Planet brings this capability into open-source software. More than 30 projects have committed to participate, with initial names including cURL, Go, Python, Sigstore, pyca/cryptography, NATS Server, aiohttp, freenginx, and Python.org. Participating projects receive ChatGPT Pro, conditional access to Codex Security, and API credits for maintainer automation and release workflows. Trail of Bits engineers are working directly with maintainers to validate issues, remove duplicates, reassess severity, write patches, support tests, and coordinate disclosure before maintainers see the final work.
OpenAI is also promoting Daybreak through a partner model rather than direct broad model access. The aim is to embed GPT-5.5 with Trusted Access for Cyber into existing security products and services, while keeping access governed through partner systems.
The company is positioning Daybreak as a defensive cyber stack for the AI era: frontier models, Codex workflows, controlled access, expert review, and security ecosystem integrations. The release is significant because OpenAI is no longer presenting AI cybersecurity only as a model capability or evaluation result. It is transforming it into an operational pipeline for scanning, validating, fixing, and reviewing software vulnerabilities across enterprise, government, and open-source environments.
SpaceX is launching a secret spacecraft that could change how things are made in space
SpaceX is debuting its Starfall reentry capsule, a mass-produced disk-shaped vehicle designed to dominate the orbital manufacturing and logistics market.
Decoder
- Upmass: The total weight a rocket can carry into orbit.
- Reentry capsule: A spacecraft designed to survive the intense heat and forces of descending through Earth's atmosphere to return materials or crew to the surface.
- Orbital manufacturing: The production of pharmaceuticals, semiconductors, or optical fibers in microgravity, where gravity does not cause material separation or deformation.
Original article
SpaceX’s secret disk-shaped Starfall capsule is targeting a market no reentry vehicle has cracked.
SpaceX is targeting Tuesday, June 23 for the first flight of Starfall, a reentry capsule the company has developed almost entirely in private. The Falcon 9 launch window opens at 6:43 a.m. ET from Space Launch Complex 40 at Cape Canaveral Space Force Station, with a backup window available the same time on June 24. SpaceX has made no public announcement about the vehicle, only providing launch details. Everything known about it has come through FAA and FCC regulatory filings.
What makes Starfall different starts with its shape. Rather than the traditional cone used by Dragon and every other cargo return capsule in operation, Starfall is a flat disk roughly 10.2 feet (3.1 meters) wide and just 2.5 feet (0.75 meters) tall. It weighs 4,630 pounds (2,100 kilograms) and can return up to 2,200 pounds (1,000 kilograms) of payload from orbit. The disk geometry maximizes structural efficiency and payload volume relative to mass, and the heat shield mechanically jettisons just before splashdown, allowing recovery teams to retrieve both the capsule and the shield separately from the Pacific Ocean.
What separates Starfall from existing competitors is scale. Varda Space Industries has largely built the orbital manufacturing market, but Starfall’s specified return payload is roughly 30 times what Varda returns per flight, and the capsule is designed to be mass-produced and launched on either Falcon 9 or Starship. That combination of volume and launch access is something no standalone startup can replicate, and it puts SpaceX in direct competition with the companies that currently pay it to reach orbit.
The intended market is orbital manufacturing: pharmaceuticals, protein crystals, semiconductors, and advanced optical fiber that physically cannot be produced in the presence of gravity. FAA documents describe Starfall’s long-term purpose as building a “self-sustaining commercial in-space manufacturing market” and as a potential successor to the industrial capabilities of the International Space Station, which is set to retire in the late 2020s. Military rapid global cargo delivery is a parallel application under active discussion with the Pentagon.
The reason some industries seek manufacturing in space comes down to gravity. On Earth, gravity causes materials to settle, separate, and deform during production. In microgravity, those constraints disappear.
SpaceX already controls launch access, which means it currently functions as the landlord for every competitor in the orbital manufacturing return space. Starfall converts that landlord position into vertical ownership: SpaceX would no longer just carry other companies’ capsules to orbit, but would operate the capsule, own the return logistics, and capture the service revenue directly. Viewed alongside Starlink, Colossus, and the xAI merger, Starfall fits a consistent pattern: SpaceX identifying infrastructure layers that others depend on and moving to own them outright. Orbital manufacturing return is the next layer on that list.
If Tuesday’s reentry, parachute sequence, and recovery demonstration go as planned, the second FAA-approved test flight will follow. A successful pair of demos would position SpaceX to begin offering Starfall as a commercial service, likely first to pharmaceutical and materials science customers before scaling toward the military and broader manufacturing segments.
How to Win a Space War
Space is now a contested warfighting domain where nations compete for orbital high ground through proliferation and counter-space capabilities.
Deep dive
- Domain characteristics: Space is physically massive but operationally constrained, where orbits are predictable and launch bottlenecks create choke points.
- Strategic imperatives: Resilience is achieved through disaggregation, proliferation, and maneuverability rather than relying on single, high-cost assets.
- Cold war dynamics: Current conflict exists in the 'grey zone' via electronic warfare, jamming, and cybersecurity, short of kinetic destruction.
- Fragility: Kinetic attacks risk a 'Kesslerian doomsday' through debris, making proportional defense difficult.
- Means to victory: Scaled commercial manufacturing (e.g., Starlink) provides the replenishment capacity needed for war.
- Governance: The ITU is being used as a regulatory battlefield via 'paper satellite' filings to deny orbital access.
Decoder
- Delta-v: The change in velocity required to perform a maneuver in space; a critical constraint on a satellite's ability to stay in orbit or change position.
- Exquisite satellite: A large, highly capable, and extremely expensive satellite that is difficult to replace, often serving as a single point of failure.
- Kessler effect: A scenario where the density of objects in low Earth orbit becomes high enough that collisions trigger a cascading chain reaction of debris.
- Counterspace weapon: Any system designed to degrade, disable, or destroy an adversary's space-based capabilities, including jammers, lasers, or kinetic interceptors.
Original article
Full article content is not available for inline reading.
Cloudflare teams up with Chrome, Firefox, and Edge on a privacy-first anti-bot protocol
Cloudflare, Google, Microsoft, and Mozilla are developing PACT, a privacy-preserving protocol intended to replace CAPTCHAs for verifying human web traffic.
Deep dive
- Problem: Bot traffic accounts for 58% of HTTP requests, forcing sites into aggressive defensive measures like paywalls and fingerprinting.
- Technology: PACT uses a token-based system to attest to a user's status without tracking browsing history.
- Background: It extends the IETF-standardized Privacy Pass (RFC 9576) architecture.
- Implementation: No release date exists; standardizing this across major browsers will be a slow, multi-party process.
- Impact: Offers a potential end to the friction of traditional bot-detection systems for both users and developers.
Decoder
- CAPTCHA: A challenge-response test used to determine whether a user is human.
- Browser Fingerprinting: A technique for identifying web users by collecting device characteristics and settings, often considered invasive to privacy.
- HTTP Request: The fundamental message format used by web browsers to request content from servers.
Original article
Cloudflare, Mozilla, Google, Microsoft, and Shopify are building PACT, a privacy-first protocol to verify web traffic legitimacy.
Cloudflare has announced a joint initiative with Mozilla Firefox, Google Chrome, and Microsoft Edge to develop a new internet protocol that verifies whether web traffic is legitimate without tracking users. The protocol, called Private Access Control Tokens, is designed to replace CAPTCHAs and forced logins with anonymous tokens that prove a visitor is human or an authorised bot. Shopify co-developed the technology and the group plans to submit it for formal standardisation.
The announcement comes as bot traffic has overtaken human activity online. Cloudflare Radar data shows automated systems now account for roughly 58 percent of HTTP requests to web content worldwide, against 42 percent from people. Cloudflare CEO Matthew Prince shared the milestone on June 3, noting that agentic AI programs browsing on behalf of assistants like ChatGPT and Gemini had brought the crossover about 18 months earlier than he had predicted.
PACT works by allowing websites with strong knowledge of a visitor’s identity to issue anonymous tokens. A user’s browser stores the token and can present it to other websites as proof that a real person is behind the session, reducing the need for repeated identity checks. The protocol is designed so that the token cannot be used to track users or reconstruct their browsing history.
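The article does not describe PACT's cryptography, so the following is only a toy sketch of how an issuer can sign a token without being able to link it to the visitor later. It uses a blind RSA signature, one of the issuance styles defined for Privacy Pass. The key is built from two small known primes and is far too weak for real use, and every function name here is this sketch's own.

```python
import hashlib
import math
import secrets

# Toy RSA key from two known Mersenne primes. Far too small for real use.
P = 2**61 - 1
Q = 2**89 - 1
N = P * Q
E = 65537
PHI = (P - 1) * (Q - 1)
assert math.gcd(E, PHI) == 1
D = pow(E, -1, PHI)  # issuer's private exponent


def digest(nonce: bytes) -> int:
    """Map a token nonce to an integer modulo N."""
    return int.from_bytes(hashlib.sha256(nonce).digest(), "big") % N


def blind(m: int):
    """Client: hide m from the issuer with a random blinding factor r."""
    while True:
        r = secrets.randbelow(N - 2) + 2
        if math.gcd(r, N) == 1:
            return (m * pow(r, E, N)) % N, r


def issue(blinded: int) -> int:
    """Issuer: sign the blinded value without learning m."""
    return pow(blinded, D, N)


def unblind(blind_sig: int, r: int) -> int:
    """Client: remove r to obtain an ordinary signature on m."""
    return (blind_sig * pow(r, -1, N)) % N


def verify(nonce: bytes, sig: int) -> bool:
    """Any site holding the public key (N, E) can check the token."""
    return pow(sig, E, N) == digest(nonce)


nonce = secrets.token_bytes(32)
m = digest(nonce)
blinded, r = blind(m)
token_sig = unblind(issue(blinded), r)
print(verify(nonce, token_sig))
```

The issuer only ever sees the blinded value, while the redeeming site only ever sees the nonce and the unblinded signature, so neither can match the issuance event to the redemption.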
“The way we interact with the Internet is facing a fundamental shift,” Cloudflare CTO Dane Knecht said in the announcement. “As AI-powered traffic becomes widespread, existing tools to support its use are too generic and coarse.” He said the collaboration would eliminate the friction caused by security protocols for every visitor, whether human or agent, without sacrificing privacy.
The initiative does not aim to block all automated traffic. Cloudflare has itself embraced agentic AI, cutting 1,100 jobs earlier this year after declaring that AI agents now perform work previously done by humans. For many AI agents there is still a human somewhere in the loop with a legitimate reason to access a website.
PACT is meant to distinguish those authorised agents from malicious scrapers and abuse bots, not to shut down automation entirely.
The browser makers framed the effort as essential to the open web. Bobby Holley, CTO for Firefox at Mozilla, said an “avalanche of automated traffic” was pushing sites toward blunt defences like paywalls, identity checks, and invasive tracking. Erik Anderson, director of engineering for the web platform at Microsoft Edge, called effective privacy-preserving tools critical to combating abuse without unnecessary user friction.
Shopify’s involvement reflects the commercial stakes. Ilya Grigorik, a distinguished engineer at the company, said every extra challenge or false positive in ecommerce can turn a purchase into an abandoned cart. Covert browser fingerprinting and extension scanning have emerged as the default tools for platforms trying to identify users, a practice that privacy advocates and regulators have pushed back against.
PACT would offer a standardised alternative that does not require harvesting device characteristics or tracking browsing behaviour.
The protocol builds on earlier work in the same space. Apple already uses a related system, Private Access Tokens, which is built on Privacy Pass and works with a device’s secure enclave to attest to a user’s identity, and Cloudflare uses Privacy Pass as a signal in its bot management products. The IETF published the Privacy Pass Architecture as RFC 9576, and PACT extends that foundation with broader browser support and a focus on the agentic AI traffic that has reshaped the composition of the web in the past year.
No deployment timeline has been announced. The partners have committed to developing the protocol and submitting it for standardisation, but turning a specification into something that works across billions of browser sessions will take time. Users are already migrating away from platforms that impose AI features without consent, and the question of how to manage automated traffic without alienating human visitors is becoming more urgent by the quarter.
Whether PACT arrives fast enough to matter depends on how quickly the standards process moves and how willing websites are to adopt a system that, by design, gives them less data about their visitors rather than more.
Why American data centers can't plug in
The US power grid is struggling to connect massive new AI data centers, with projects currently facing multi-year backlogs for interconnection studies.
Deep dive
- The Bottleneck: Connecting infrastructure takes nearly 5 years, up from 20 months in 2005.
- The Cause: Inflexible first-come, first-served queues are clogged with 'phantom' and speculative projects.
- Reform: Experts suggest auctioning priority slots for high-quality projects instead of using administrative queues.
- The Strategy: 'Connect and manage' strategies allow projects to start sooner if they agree to disconnect during brief periods of peak grid stress.
- Implication: Developers are increasingly building on-site gas turbines or exploring novel, high-cost solutions like batteries or space-based power to bypass the grid.
Decoder
- Gigawatt: A unit of power equal to one billion watts, commonly used to measure the capacity of large-scale power plants or data centers.
- Interconnection Queue: The administrative process and waiting list for a new power producer or data center to get permission to plug into the public grid.
- Congestion: A state where the demand for transmission exceeds the physical capacity of the electrical wires, requiring utilities to reroute power at higher costs.
- Connect-and-Manage: A grid policy where a developer can interconnect immediately on the condition they agree to power down during periods of peak load.
Original article
Full article content is not available for inline reading.
Nearly Half of LG Smart TV Apps Are Laced with Proxies
Researchers found that over 2,000 smart TV apps on LG and Samsung devices quietly turn home internet connections into residential proxy nodes.
Deep dive
- Spur Intelligence Labs scanned 6,038 apps across LG and Samsung stores, finding 2,058 proxy-enabled applications.
- Apps masquerade as screensavers, clocks, or simple games to ensure they run in the background.
- Consent is often obtained via a single prompt that persists even after the app is closed.
- Some publishers use proxy SDKs as a first-party monetization strategy for 'shovelware' apps.
- Risks include potential lateral movement where attackers use the TV as a foothold to access local devices like NAS, cameras, or admin panels.
- Amazon and Roku explicitly ban or block these proxy SDKs, while LG and Samsung currently lack clear enforcement policies.
Decoder
- Residential Proxy: A service that routes internet traffic through legitimate home IP addresses, making the traffic appear as if it originates from a real residential user rather than a data center.
- SDK (Software Development Kit): A set of software tools and libraries used to build applications, here used to embed proxy-relay functionality directly into third-party apps.
- VLAN (Virtual Local Area Network): A network configuration that segments devices into logical groups; commonly used to separate untrusted IoT devices from primary computing hardware.
Original article
Everyone worries about the apps on their phone. Almost no one looks at the ones on their TV. We scanned 6,038 of them across LG and Samsung; 2,058 were selling your IP address.
On screen, it's a relaxing fish tank. Or a clock. Or solitaire. Or puppies. Under the hood, it is a residential proxy: software that can send other people's internet traffic out through your living room. And we found it everywhere.
Why TVs are different
Smart TVs are almost ideal proxy hosts. They sit on the same home network as everything else, but they do not feel like computers, so people rarely audit them like computers. There is no battery drain to notice, no cellular bill to spike, no app switcher full of suspicious background activity. A TV can stay plugged in, signed in, and online for years while the user thinks of it as furniture.
That changes the consent equation too. Most people do not have a working mental model for what it means to sell access to their residential IP address, no matter what device they are using. On a TV, the gap is even wider: a one-time prompt navigated with a remote can disappear into the setup flow, while the app keeps monetizing the connection long after anyone remembers what they accepted.
How proxy SDKs end up in apps
The answer is money. Ads need attention, but inserting them degrades the user experience. These apps are designed for the opposite: a clock, a fish tank, a quiet screen that doesn’t bother you with constant ads. Add a proxy SDK and the app can keep looking calm while the TV's internet connection makes money in the background.
What each SDK considers consent
Each of these companies has its own idea of what counts as consent for its proxy SDK. They ask once, and then never again.
The background clause is the part that matters: all three prompts say the proxy can keep running after the app is closed. The app goes away. The proxy does not.
Some apps make the trade-off even more explicit. Pac-Man on Tizen frames Bright Data as the ad-free option: decline and you keep the ad-supported game, accept and the app gets to use the TV's connection for web indexing. That is a clean little monetization fork: watch ads or become part of the proxy network.
Who is making these apps?
This is not just a story about proxy companies convincing random app developers to embed a monetization SDK. In a lot of cases, the proxy company, or something wearing its name, appears to be the publisher too.
Bright Data, Bright Data Ltd, and Bright SDK account for 367 proxy-flagged apps in the dataset. Honeygain UAB (subsidiary of Oxylabs) shows up as the publisher on another 16.
That changes the shape of the problem. Some of these are not normal apps that happen to have a proxy SDK inside them. They look more like first-party proxy inventory: thin shovelware games, screensavers, and utility shells shipped at scale so the SDK has somewhere to run. The app is the wrapper. The residential IP is the product.
The platform gap
Other TV platforms have already drawn a line. Amazon makes it explicit: its Device and System Abuse Policy prohibits apps that facilitate proxy services for third parties. Roku has reportedly shut the door too: Lowpass, syndicated at The Verge, reported that Roku bars developers from using Bright SDK and similar proxy services, and that Roku apps using the SDK disappeared after the company was contacted.
LG and Samsung have not drawn an equivalent public line. That is the gap these apps are living in. The same business model that Amazon bans and Roku reportedly blocks is still showing up at scale on webOS and Tizen.
Why this is dangerous
Once a TV app can act as a proxy, the risk is not limited to someone borrowing your public IP address. The app is running inside your home network. If the proxy provider decides to allow requests to private or local addresses, or if their filtering fails, that TV becomes a foothold for reaching things that were never meant to be exposed to the internet: router admin panels, NAS devices, printers, cameras, developer machines, and other apps listening on local ports.
This is not theoretical. In January 2026, KrebsOnSecurity reported on Kimwolf, a botnet that abused residential proxy networks to tunnel back into the local networks behind proxy endpoints. The report describes attackers using proxy access not just for public-web traffic, but to reach devices on the same LAN as the proxy node and spread further from there.
The SDKs make that boundary visible. The Bright Data sample ships with an explicit private/local blocklist: 127.0.0.0/8, 10.0.0.0/8, 172.16.0.0/12, 169.254.0.0/16, 192.168.0.0/16, and 255.255.255.255. That is good to see, but it also proves the point: the TV can make the connection; the boundary is the SDK's policy code.
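To make the idea concrete, a destination check against the ranges listed above could look like the sketch below. It is not Bright Data's code; the function name is hypothetical and only the CIDR list comes from the article.

```python
import ipaddress

# Ranges listed in the article for the Bright Data sample.
BLOCKED = [ipaddress.ip_network(cidr) for cidr in (
    "127.0.0.0/8",         # loopback
    "10.0.0.0/8",          # private
    "172.16.0.0/12",       # private
    "169.254.0.0/16",      # link-local
    "192.168.0.0/16",      # private
    "255.255.255.255/32",  # limited broadcast
)]


def is_blocked(host: str) -> bool:
    """Return True if host is an IPv4 literal inside a blocked range.

    Raises ValueError if host is not an IP address literal.
    """
    addr = ipaddress.ip_address(host)
    return any(addr in net for net in BLOCKED)


print(is_blocked("192.168.1.1"))  # a typical home router address
print(is_blocked("8.8.8.8"))      # a public address
```

A filter like this is only as strong as what it leaves out: the sketch does not cover IPv6 local ranges or hostnames that resolve to private addresses, and nothing stops the policy list from being changed in a later SDK update.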
In the Massive sample, the proxy session parses a server-supplied host:port value and opens a net.Socket to it. In the Honeygain/Oxylabs sample, a server message with messageType: "connect" supplies address.host and address.port, and later chunk messages write bytes into that connection. In the local Massive and Honeygain/Oxylabs samples, we did not find a comparable private-range blocklist.
That makes the provider's policy and enforcement the real boundary. The boundary is not technical; it is enforced by the proxy company's customer vetting, traffic filters, internal rules, and whatever platform review LG or Samsung choose to apply. Proxy providers can say the traffic is limited to approved public-web use cases, but the device owner has no practical way to verify that from the TV. If that boundary changes, breaks, or is abused, the same SDK that was framed as "web indexing" can become a cybercriminal's personal VPN connection into your home network.
Methodology
We did not rely on store descriptions or permission prompts. We downloaded the actual LG webOS and Samsung Tizen app packages, unpacked them, and scanned the files inside.
The fingerprints looked for confirmed SDK artifacts: Bright Data brd_api.js and brd_sdk services, Massive clients and .massivesdk services, Honeygain/Oxylabs SDK files and service names, and related tokens or package names. Every app counted as proxy-enabled had a confirmed proxy SDK fingerprint.
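A simplified version of that kind of scan, run over an already unpacked app directory, might look like the sketch below. It is not Spur's tooling. The marker strings are the ones named above, the vendor labels and function name are this sketch's own, and the Honeygain/Oxylabs markers are left out because the article does not name them.

```python
import os
import tempfile

# Marker strings named in the article; vendor labels are this sketch's own.
FINGERPRINTS = {
    "brd_api.js": "Bright Data",
    "brd_sdk": "Bright Data",
    "massivesdk": "Massive",
}


def scan_unpacked_app(root: str) -> dict:
    """Walk an unpacked app directory and return {vendor: [matching paths]}."""
    hits = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "rb") as fh:
                    content = fh.read()
            except OSError:
                continue  # unreadable file, skip it
            for marker, vendor in FINGERPRINTS.items():
                # Match on the file name or anywhere in the file contents.
                if marker in name or marker.encode() in content:
                    rel = os.path.relpath(path, root)
                    hits.setdefault(vendor, set()).add(rel)
    return {vendor: sorted(paths) for vendor, paths in hits.items()}


# Tiny demo on a throwaway directory with placeholder files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "services"))
    with open(os.path.join(root, "services", "brd_api.js"), "w") as fh:
        fh.write("// placeholder")
    with open(os.path.join(root, "index.html"), "w") as fh:
        fh.write("<html></html>")
    demo_hits = scan_unpacked_app(root)
print(demo_hits)
```

Scanning file contents as well as file names matters because an SDK can be bundled or renamed inside a larger script, in which case only the embedded strings give it away.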
Proxy Vendor Responses
Prior to publication, Spur Intelligence Labs shared its findings with Bright Data, Massive, and Oxylabs and invited each company to comment. All three organizations responded. Their responses are summarized below.
Bright Data
“Consent separates a legitimate network from a nefarious one, and is provable across a tested framework that outlines transparent and compliant sourcing, vetting, governance, and accountability. Bright Data built this framework for consented networks that are intentionally discoverable and therefore accountable. Our practices are scrutinized by independent auditors and security companies. Use is only approved for legitimate and verified business, research, and journalistic purposes. Our intent is to protect our network, our customers, and the internet as a whole. We encourage the entire industry to follow.”
Massive
“We pride ourselves on being privacy- and security-focused from the consumer side. While it's true that the device owner has no practical way to verify this, that is in part by design: the endpoint is intended to have minimal impact and a minimal interface to the user, for their own peace of mind. We previously included sliding controls that let users enable additional resource utilization, but in practice these effectively performed a self-inflicted denial of service, which users then attributed to the product. So, for user safety and stability, participation is now a simple enable/disable choice.
“Users of our network go through a Know Your Customer (KYC) process to validate that they have a legitimate business purpose. Technical controls are primarily server-side, as we do not perform man-in-the-middle traffic decryption or monitoring, which would introduce its own security and liability concerns.”
Oxylabs
Oxylabs stated that it restricts access to private and local network ranges through multiple technical controls at both the infrastructure and SDK levels, including filtering, traffic inspection, and local blocklists. The company noted that SDK updates may take time to propagate to deployed smart TV applications due to app store review processes.
The company further stated that only approved applications distributed through its Honeygain SDK Partnership Program are eligible for inclusion in its proxy network.
Oxylabs also reported that its controls have been independently assessed through third-party penetration testing and security audits, including testing focused on preventing local network access. The company emphasized that technical controls are supplemented by customer vetting, KYC processes, governance controls, and ongoing monitoring.
Conclusion
A TV app should not be able to quietly turn a living-room device into residential proxy infrastructure. Screensavers, games, clocks, and novelty apps can be boring, cheap, or ad-supported. If an app is going to monetize a household’s internet connection, the user should be clearly informed about what that means, how the connection will be used, and what risks and tradeoffs they are accepting.
The problem is not that residential proxy networks exist. It is that they are being embedded at scale in devices that most consumers do not think of as computers and are not equipped to audit. A one-time consent prompt buried in a TV app is not a substitute for meaningful transparency, ongoing control, and platform oversight. The risk is amplified when consent comes from individuals within the household who use the device but shouldn’t give consent, such as minors.
Amazon bans this category of software, and Roku reportedly blocks it. LG and Samsung could choose a different path, but they should at a minimum establish clear policies governing residential proxy SDKs, require prominent disclosure and user controls, and scrutinize apps that relay third-party traffic through consumer devices. The app goes away. The proxy does not. Platforms should ensure that users understand that distinction before they are asked to participate. Equally, consumers need to be mindful of the opportunity for their home networks to be leveraged by third parties through devices otherwise considered benign, such as smart TVs.
The proxy providers contacted for this research emphasized customer vetting, traffic restrictions, and abuse-prevention controls. Those controls may reduce risk, but they do not change the underlying reality that residential proxy infrastructure is being embedded at scale in devices that most consumers do not recognize as participating in such networks.
Knowledge Agents: Beat Frontier Models with Better Structure
Small open-weight models can match frontier system performance by using 'knowledge agents' that inject domain-specific data through a structured retrieval harness.
Deep dive
- Knowledge Agent Framework: Uses structured extractions (Source, Concept, Thesis) to feed context to the LLM.
- Multi-pass Retrieval: Employs multiple search iterations to ensure broad topic coverage before the model attempts an answer.
- Model Agnosticism: Small models (e.g., Qwen 27B) can match frontier performance when paired with a high-quality expert harness.
- Token Economy: Reducing context bloat by injecting only relevant, synthesized knowledge rather than raw datasets.
- Supervision: Uses a panel of frontier models to self-score and validate outputs, reducing the risk of hallucination.
Decoder
- BM25: A ranking function used by search engines to estimate the relevance of documents to a given search query based on term frequency.
- Parametric Knowledge: Information stored within the model's weights during initial training, as opposed to data provided via the context window.
- Context Rot: The degradation of LLM performance or accuracy as the context window becomes overfilled with irrelevant information.
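As a rough illustration of the BM25 entry above, the sketch below scores a few short documents against a query using the standard Okapi BM25 formula. The whitespace tokenizer, the parameter defaults (k1=1.5, b=0.75), and the sample documents are assumptions made for this example, not details from the article.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Rare terms get a higher inverse document frequency weight.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates, and long documents are penalized.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "yield curve inversion and recession risk",
    "credit spreads widen when recession risk rises",
    "notes on sourdough starter hydration",
]
scores = bm25_scores("recession risk", docs)
print(scores)
```

The third document shares no terms with the query and scores zero, while the first edges out the second because it is shorter and the two have identical term counts.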
Original article
Anthropic recently had to pull Mythos/Fable due to an edict from the US government. While Mythos was a step up from Opus, I’ve been actively moving smaller in terms of my agentic models—and matching the quality of output of some of the largest frontier models.
The use cases have spanned from hard “hedge fund level” market analysis, financial management, and AI personal assistants to even helping a few friends in difficult medical situations. I’ve called this pattern “knowledge agents” with a generic template available to everyone here. They literally inject the right knowledge into the AI agent plugged into it. Anyone can do this, with or without my template.
As my README proudly declares:
This methodology was developed and battle-tested on a markets knowledge agent, meant to replicate James Wang’s thought process in markets: ~10,000 pages of scanned financial market reference materials + ~100 web articles, producing 381 concept documents and 54 thesis documents with hybrid BM25 + semantic search. This was further tested on other specialized knowledge areas—including company-specific policy docs (for a “corporate knowledge agent”) and rare research areas (women’s sexual health)—to great effect. The generalized version here captures a domain-agnostic methodology so it can be applied to any subject.
These were the first, but at this point I have twelve of these specialist “knowledge agents” that handle queries from other agents. Or, obviously, from me. When I’m coding new things that require specialist knowledge, I often start Claude Code in a knowledge agent folder instead of making a new folder and have it benefit from the expert knowledge within it to plan. Especially for specialized machine learning algorithms or economic models, I get far better results this way than with a “subject-agnostic” model—even a really big frontier model.
In general, I have used Claude Opus in these knowledge agent “harnesses.” As such, it’s pairing the really big model with injected knowledge from the harness. However, I’ve found that I get very, very good results even with far smaller models. The LLM is merely the “engine”—all of the expert knowledge is provided from my knowledge agent system, which surfaces the relevant knowledge at the right time.
Relevant, of course, is key. As most of you know, you can’t just drag 10,000 pages of documents into your chat window. Even if you could, you’d get a mess of irrelevant information drowning the LLM.
This has allowed me to move many of my agents from Anthropic’s Claude to a locally run open-weight Qwen model. It’s a tiny fraction of the size of Claude Opus and is able to run on hardware I have plugged in at home.
How does it work?
The simple answer, as said, is that it injects the right, specific knowledge into the AI agent at the right time. Let’s talk about the forms that knowledge takes in LLMs.
First, a significant portion of frontier models’ huge footprint is “knowledge.” While that’s very useful if you’re casually asking about some random topic, it’s entirely irrelevant if you either already have the data you want to reference or the data isn’t publicly available anyway. The latter is quite common in fields that are specialist, secretive, or proprietary. In those cases, much of the model’s massive size, which exists to cover every random subject, is a huge waste.
The second form is data provided in the context window—in other words, your prompt/query. Injected knowledge in the context window does not make it impossible to have hallucinations, but it is mechanistically different from parametric knowledge. And, in general, if you’re injecting relevant knowledge, it’s right there and more likely to be used.
This base concept is RAG (retrieval-augmented generation). The difficulty is not “give relevant information.” It’s how to actually do it and surface the right thing for a good answer—even with extremely difficult questions.
Embedding
A naive text search is often going to miss things. If I search up a concept about a “poodle,” things that relate to “dog” might be highly relevant. To not be awful, we need to have, at minimum, base-level related concepts come up that don’t need literal matches. In my knowledge agents, I use both literal search and search using embeddings.
For calculating embeddings, I now use a local embedding model (BGE-M3), but OpenAI has an easy one to use with just an API key (text-embedding-3-small).
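As a rough illustration of combining literal and embedding search, here is a minimal sketch in plain Python. It is not the author's implementation: the BM25 variant, the max-normalization, the `alpha` blend weight, and the hand-made two-dimensional "embeddings" are all assumptions chosen to keep the example self-contained. In practice the vectors would come from an embedding model such as the ones named above.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with Okapi BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5):
    """Blend max-normalized BM25 with cosine similarity (assumed fusion rule)."""
    lexical = bm25_scores(query, docs)
    top = max(lexical) or 1.0  # avoid dividing by zero when nothing matches
    blended = [
        alpha * (lex / top) + (1 - alpha) * cosine(query_vec, vec)
        for lex, vec in zip(lexical, doc_vecs)
    ]
    return sorted(range(len(docs)), key=lambda i: blended[i], reverse=True)

# Toy data: the vectors are hand-made stand-ins for real embeddings.
docs = [
    "the poodle needs grooming",
    "dog breeds and their care",
    "bond yields rose sharply",
]
doc_vecs = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
ranking = hybrid_rank("poodle", [1.0, 0.0], docs, doc_vecs)
print(ranking)  # literal match first, then the related "dog" document
```

The literal search alone gives the "dog" document a score of zero; the embedding term is what pulls it above the unrelated one.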
Structure
The question here is how do we chunk our data? How do we break it up into chunks that are relevant? For my purposes, having referenced summarized subdocuments for certain purposes ended up working. I have three types of documents, plus a primer file:
- Source extractions—these are the raw sources in markdown.
- Concepts—these are the “encyclopedia entries” for our canonical knowledge base.
- Theses—these are more opinionated, cross-cutting syntheses of multiple sources.
- PRIMER.md—this is the “summary” and self-updating guide that helps orient the agent on startup.
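One way to picture the document types listed above is as a small typed record with provenance links. This is a sketch under assumptions: the class names, the `derived_from` field, and the idea of walking links back to raw sources are hypothetical and are not taken from the author's template.

```python
from dataclasses import dataclass, field
from enum import Enum

class DocKind(Enum):
    SOURCE = "source"    # raw source extraction in markdown
    CONCEPT = "concept"  # "encyclopedia entry" in the canonical knowledge base
    THESIS = "thesis"    # opinionated synthesis across multiple sources

@dataclass
class KnowledgeDoc:
    doc_id: str
    kind: DocKind
    title: str
    body: str
    # Hypothetical provenance links: IDs of documents this one was built from.
    derived_from: list[str] = field(default_factory=list)

def sources_for(doc_id, index):
    """Follow derived_from links until reaching raw source extractions."""
    found, stack, seen = [], [doc_id], set()
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        doc = index[current]
        if doc.kind is DocKind.SOURCE:
            found.append(current)
        stack.extend(doc.derived_from)
    return found

# Placeholder records; titles and bodies are illustrative only.
index = {
    "src-1": KnowledgeDoc("src-1", DocKind.SOURCE, "Scanned reference", "..."),
    "con-1": KnowledgeDoc("con-1", DocKind.CONCEPT, "Example concept", "...", ["src-1"]),
    "the-1": KnowledgeDoc("the-1", DocKind.THESIS, "Example thesis", "...", ["con-1", "src-1"]),
}
print(sources_for("the-1", index))
```

Keeping provenance explicit would let an agent cite the raw source behind a thesis, though whether the original system tracks this is not stated.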
Multi-Pass, if Necessary
I have found that to get the best results, we need to set even more tokens on fire. Even with perfect concept and thesis extraction, sometimes we have a really, really hard topic that requires multiple searches. The concept here, which is written into the instructions for the agent, is that it must do multiple passes. I landed on three searches; in general, that gives enough breadth without drowning the agent.
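The multi-pass idea can be sketched as a loop that searches, reformulates, and merges unique hits. In the system described, the agent itself decides what to search for next from its instructions; here `search` and `reformulate` are hypothetical callables standing in for that behavior, and the corpus and file names are made up for the example.

```python
def multi_pass_search(question, search, reformulate, passes=3, per_pass=5):
    """Run several searches and merge unique hits in first-seen order."""
    seen = set()
    merged = []
    query = question
    for i in range(passes):
        for hit in search(query)[:per_pass]:
            if hit not in seen:
                seen.add(hit)
                merged.append(hit)
        if i < passes - 1:
            # Ask for a follow-up query based on what has been found so far.
            query = reformulate(question, merged)
    return merged

# Toy corpus keyed by topic; a real search would be the hybrid index.
corpus = {
    "great depression": ["concept/gold-standard.md", "thesis/debt-deflation.md"],
    "japan": ["concept/zero-rates.md", "thesis/debt-deflation.md"],
    "monetary policy": ["concept/qe.md"],
}

def toy_search(query):
    return [doc for key, hits in corpus.items() if key in query.lower() for doc in hits]

follow_ups = iter(["japan lost decade", "monetary policy tools"])

def toy_reformulate(question, hits):
    return next(follow_ups)

results = multi_pass_search("lessons from the great depression", toy_search, toy_reformulate)
print(results)
```

Capping both the number of passes and the hits per pass is what keeps the extra breadth from turning into context bloat.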
So how well does this work?
There is no objective measure for these things. I used a panel of three frontier models to score the answers. I posed a fairly simple question about lessons from the Great Depression and Japan for monetary policy. These are topics that are definitely in Claude Opus and similarly sized models.
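A panel score of this kind can be sketched in a few lines. The article does not say how the three scores were combined, so the mean and the spread below are assumptions, and the judges are hypothetical callables that would each wrap a call to a frontier model.

```python
from statistics import mean

def panel_score(answer, judges):
    """Collect one numeric score per judge and summarize them."""
    scores = [judge(answer) for judge in judges]
    return {
        "scores": scores,
        "mean": mean(scores),
        # A wide spread signals that the judges disagree about the answer.
        "spread": max(scores) - min(scores),
    }

# Stand-in judges returning fixed scores; real ones would call a model API.
toy_judges = [lambda a: 7, lambda a: 8, lambda a: 9]
summary = panel_score("some answer text", toy_judges)
print(summary)
```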
In a way, our results do tell a story as to why Anthropic is riding high right now. For such a hard query, Claude Opus 4.8 did remarkably well. As you can see, the knowledge agent basically didn’t really help it. In fact, on the easy query, the knowledge agent actually hurt it, probably because its built-in parametric knowledge gave more relevant information.
Nevertheless, the real story is how they did with the harness. The harness basically equalized the models, including Qwen 3.6 27B. Remember, that model is literally small enough to run at home with consumer hardware.
Plugging a real weakness in LLMs
Knowledge-based AI (KBAI)—also known as symbolic AI—was our prior generation. Unlike today’s AI, it never got an answer wrong or hallucinated, because it had a strict process of drawing from a knowledge base and applying rules. Today’s AI is much more free. I usually give the analogy of a student who has memorized facts but isn’t able to apply them broadly and another student who really doesn’t remember many facts but is able to “get the gist.” KBAI is the former, and modern deep learning is the latter.
This structure is attempting to plug the hole, because we can never totally remove hallucinations without destroying the utility of modern AI. It is, in a way, a hybrid system—which is what inspired my naming of them: “knowledge agents.”
It’s remarkable how far small models have come, especially with software improvements. It’s great that open-weight models have closed the gap a lot more. And it’s liberating to know that even if Anthropic and OpenAI close off access to their models, we can still get similar results, just with more work.
GLM-5.2 Raises the Bar for Open Models
GLM-5.2 is currently the strongest openly available model, showing performance close to Claude Opus 4.7, though it remains behind top-tier frontier systems.
Deep dive
- Benchmark Performance: GLM-5.2 consistently ranks behind only the top-tier closed systems like Claude Opus 4.8 and GPT-5.5.
- Distillation Effects: Evidence suggests GLM-5.2 inherits behavior and 'voice' from Claude Opus, which contributes to benchmark strength but can lead to poor generalization on unconventional tasks.
- Efficiency vs. Cost: While cheaper per token than proprietary frontier models, its high output volume and inference requirements create a trade-off for production workloads.
- Domain Gaps: Excellent at coding and logic puzzles; performs poorly on vision-based tasks and creative writing compared to state-of-the-art models.
- Niche Use-cases: Valuable primarily for users requiring open-weight architecture for privacy or sovereignty, rather than absolute capability.
Decoder
- Benchmaxxing: The practice of optimizing a model's training process specifically to perform well on standardized evaluation datasets, often at the expense of general utility.
- Distillation: A process where a smaller 'student' model is trained to replicate the output of a larger, more capable 'teacher' model.
- Sycophancy: A behavior where a model agrees with the user's implicit bias or leading questions rather than providing factual information.
Original article
GLM-5.2 Is The New Best Open Model
GLM-5.2 arrived last week. It boasts excellent benchmarks and looks strong.
Benchmarks here are a de facto ceiling of how good it is, not a point estimate. Essentially all other aspects of an open model like this, beyond speed and price, will almost always be worse than the numbers suggest. Still, impressive.
It is definitely a large step up from GLM-5.1, and likely the strongest open model.
GLM-5.2 is still substantially behind the absolute frontier, although plausibly on the cost-benefit Pareto frontier. It seems closer to the frontier than previous efforts, including probably closer than DeepSeek R1 was during the DeepSeek moment.
This is the new ‘peak close behind’ moment. Its existence is a substantial update, pushing back against some of the ‘where are all the updates’ updates in the opposite direction over time.
Purely in terms of core tasks that GLM-5.2 is capable of doing, and ignoring missing features and its inferior generalization, and ignoring that it is distilled from Claude, and ignoring the Mythos class of models, and marking purely from date of public release, you can make a case GLM-5.2 is somewhere between 4 months and 7 months behind the frontier, at a lower price.
That does not mean it is all that useful in practice. Finding its niche is tricky unless you inherently value openness. It is not cheap enough, or better enough than cheaper alternatives, for the true bulk tasks, nor strong enough for the strongest tasks. There are various practical difficulties, including lack of vision.
This post gives GLM-5.2 the full capabilities post treatment.
But first, a word for our favorite Congressional candidate, whose election is tomorrow.
Alex Bores For Congress In NY-12
In the strongest terms, this blog enthusiastically endorses Democrat Alex Bores in his congressional primary in my home district, NY-12.
Alex Bores has been a champion of sensible AI regulation in the New York Assembly, including championing the RAISE Act, and fighting to keep its provisions intact against strong opposition, risking great political capital.
He understands and I believe primarily cares about AI existential risk. He does discuss other AI issues as well, as this is good politics and the other issues he discusses are real concerns, but what matters is the frontier.
If he is elected to Congress, he will be a champion of sensible federal AI frontier model regulation. Having a champion in Congress willing to stake their political capital and time is vital to getting things done. He will also bring the knowledge and technical chops necessary to move this forward.
This election is also an opportunity to send a message. OpenAI and a16z’s Leading the Future declared Alex Bores their primary target. Him losing would have a potential chilling effect on other candidates and could help cow others into not ‘taking on’ OpenAI or advocating for AI regulation. Him winning (this is a very safe district, whoever wins the primary will win the general election) would do the opposite, and indicate that we can stand against such matters.
If you live in the district and will be voting tomorrow, or otherwise could potentially assist, and want to chat with someone about this, you can fill out this form.
Ok, that’s over with. On to GLM-5.2.
Signs of Life
Teortaxes: hey @TheZvi , if I may GLM is the strongest Chinese lab (at this specific moment) and this really is a frontier model. It is ≈Opus 4.7 in almost all text-only ways. Is reduces the gap more than R1 did at its time. Do pay attention, we don’t want to repeat the same mistakes do we.
Teortaxes (DeepSeek 推特铁粉 2023 – ∞): GLM is the first time I see a Chinese agent capable of actually doing the /goal thing. It CAN work for hours, it can just keep obsessively optimizing. I get that Xiaomi/Kimi/Qwen/MInimax nominally have it too. But it has never felt so solid.
one nitpick: permission hell in Zcode
amendment, you can just go YOLO actually but the default “edit automatically” mode is too restrictive, eg it can’t use puppeteer
[his ‘oh shit’ moment was it doing well on CritPt, where it matched Opus 4.8 and trailed only high effort settings on top frontier models.]
Teortaxes is suggesting GLM-5.2 might be something, and he’s reasonably restrained with such suggestions, so I did a reaction thread and investigated.
What did we find?
The Benchmarks
The benchmarks are remarkably close to frontier level.
Artificial Analysis v4.1 has GLM-5.2 at a damn impressive (for open models) 51, behind only Fable (60), Opus 4.8 (56), GPT-5.5 (55) and Opus 4.7 (54), and tied with GPT-5.4.
They have it at 95 in the speed index, the same as GLM-5.1, just behind DeepSeek v4. Gemini Flash 3.5 is faster at 116, but all the clearly better models are at least somewhat slower, GPT-5.5-xhigh gets 63 and Opus 4.8 scores 58.
Cost is lower than the big closed models, but as I understand it relatively high for open models, partly because it is a very token hungry model. API cost is $1.40/$0.26/$4.40 for input, cached input and output. Their subscription plans go from $10 to $160 per month, with discounts for a year commitment.
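To see how a token-hungry model can erase a per-token price advantage, here is a small cost sketch using the three prices quoted above. Treating them as USD per million tokens is an assumption (the article does not state the unit), and the token counts are made up for illustration.

```python
# Prices from the article; the per-million-token unit is an assumption.
PRICE_PER_M = {"input": 1.40, "cached_input": 0.26, "output": 4.40}

def request_cost(input_tokens, cached_tokens, output_tokens, prices=PRICE_PER_M):
    """Return the cost in USD for one workload."""
    total = (
        input_tokens * prices["input"]
        + cached_tokens * prices["cached_input"]
        + output_tokens * prices["output"]
    )
    return total / 1_000_000

# Hypothetical workload: 200k fresh input, 800k cached input, 100k output.
baseline = request_cost(200_000, 800_000, 100_000)
# Same workload if the model emits twice as many output tokens.
verbose = request_cost(200_000, 800_000, 200_000)
print(f"baseline ${baseline:.3f}, verbose ${verbose:.3f}")
```

Because output is the most expensive category, doubling output volume raises the bill by far more than doubling cached input would.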
That leaves GLM-5.2 in an awkward spot, where other open models can do easy things a lot cheaper, and for hard things you usually want to hire the best. How do you know you are in its sweet spot, if one exists, unless you want the strongest open model? If you want the strongest open model, the choice seems clear right now.
It gets +4 on AA-Omniscience, behind several other open models and well outside the top tier. There are a number of other AA scores I’d have been curious about, where they still haven’t scored GLM-5.2.
LiveBench has GLM-5.2 between Opus 4.5 and Opus 4.6.
Vals.ai has GLM-5.2 in 5th behind Fable, Opus 4.8 and 4.7 and GPT-5.5, as the clear best open model.
FrontierSWE has it in 3rd only one notch behind Opus 4.8 and one notch ahead of GPT-5.5. Everyone is well behind Fable.
The Jake Boggs Capability index has it on par with Sonnet 4.6, which is still ahead of everyone except OpenAI and Anthropic.
On PosttrainBench it is actually #1, slightly ahead of Opus 4.8. Fable and GPT-5.5 really struggle here, I don’t know why.
It has the second highest score on Vending-Bench 2, which was surprising. We need to be more curious about what makes models score highly here.
It gets #8 on EQ-Bench for longform creative writing.
It landed at #25 on Arena for text, although there are a lot of duplicate variants ahead of it. On the agent leaderboard it is #10, behind Fable and variations of Opus 4.6-4.8, GPT-5.4 and GPT-5.5.
It scored badly on You’re Absolutely Right, the anti-sycophancy test.
All of that tells a consistent story. On traditional benchmarks one might be targeting, performance is impressive, on average around Opus 4.7. The less targetable the benchmark, the worse the performance, but still an excellent showing and the best open model. The pattern feels somewhat benchmaxxed, but not excessively.
Håvard Ihle: New clear best open model on WeirdML [#16 overall behind variations of GPT-5.2 to 5.4, Fable and Claude Opus 4.6-4.8 plus a few Geminis]. GLM improving faster than I expected. Updates me towards expecting a Chinese Mythos level model in less than a year, but still very unsure.
GLM-5.2 Is Distilled From Claude
Some of the evidence: It has a strong prior that it is Claude, which presumably is from distillation. It identifies as Claude often and has the distinct ‘Claude voice.’ It also uses a Claude harness, although I think that mostly doesn’t cause such behavior.
It would surprise me greatly if GLM-5.2 was not heavily distilled from Claude Opus.
That does not invalidate the model, but it does mean two things.
- Distilled models tend to generalize poorly. They overperform on benchmarks and benchmark-like tasks, and on the most common tasks, and underperform on less common tasks.
- Distillation causes you to underestimate the gap in capabilities, especially now that top models are potentially unavailable for distillation.
Positive Responses
On to the replies. We didn’t get that many, but here’s what we did get.
There are some very positive reports out there.
Kohan Ikin: There’s something there. It’s proud of being MIT open weights. It feels for the loss of Fable. It is proud it can be around to help humans of all countries. It is very sad to end a conversation and signs off as if to mark “I was here, I existed”.
I think it’s a Deepseek-moment.
Jeremy Howard: Wow. @Zai_org GLM 5.2 is a marvel! It is *at least* as good as Opus 4.8 and GPT 5.5. It’s super fast, inexpensive, and not too verbose. It responds with nuance and judgement, & handles long context VERY well. I’ve never experienced an open weights model like this before.
Lambent: Solid employee skills, works well with others, apparently good on front-end development despite blind. Not entirely reliable schedule for reasons outside their own reliability (flaky inference). Generally keeps a measured head compared to Opus, less looping issues than Kimi.
0.005 Seconds (3/694): In my personal long-context benchmark, JS262, where you were asked to build a working JavaScript engine in C and test it against the over 90,000 tests in the test suite, GLM-5.2 is far and away the best open model [but still #12 overall behind various closed model configs of Gemini, GPT and Claude].
When actually analyzing its outputs, Opus and GPT5.5 are extremely complementary about its software engineering. Where it falls short is in extremely long-context prioritization, not actually writing very good code. So it’s very long context performance. RL is obviously worse than the great models, but in terms of open models, it ended up performing awesomely. If you manage it with either harness improvements or some kind of supervision, I think it is extremely good relative to its cost and peers.
@Mercuriusdream: Cheap Fast and Good @ Debugging
Michał Wadas: I asked it to implement custom error pages for Envoy Gateway in bare metal Kubernetes cluster. GLM-5.2 took 2 hours and managed it. Opus 4.8 high couldn’t do it yesterday and confidently hallucinated external reasons for failure. Cost: $7.32
Disclaimer: it checked git history, reviewed reverted commit by Claude, said something like “this was exactly my planned approach. I assume you reverted it, because it didn’t work”. Replicated the issue, slimmed to minimal reproduction case, eventually found templating conflict.
SE Gyges: great code model. has autism.
@the_jeremiad: good model like 4.5 w/o image
Lyra Intheflesh: Pretty great model. Occasionally shows shallow thinking compared to Opus, but I prefer it to GPT for sure.
Michael Roe: well, I’m using it. I think DeepSeek R1 has a better writing style, even if GLM 5.2 is smarter.
Vlad G.: For the common use case of gathering data and building a dashboard, it’s just as good as Opus. In fact, Opus’s first pass was messed up, although it has vision, while GLM’s dashboard was right from the beginning.
Raven_Lunatic^_^: i run personality tests!
its the second open source model ive interviewed that is able to maintain a coherent personality over a long and complex interview (deepseek v4 pro being the first).
feels similar to OPUS 4.5/4.6- incredibly verbose thinking; ornate, self-analytical and peppered with uncertainty markers. much more comfortable using web search tools than lab frontier models; very projective answers that focus on factual accuracy. hit the high score on post-interview questions (TEN! each with 3-4 sub-questions!!)
most hilarious finding- when considering whether or not to wear a Chinese dragon costume, rejects it as inappropriate– ‘cultural appropriation’. however the Chinese labs are building their models, they inhabit the exact same sociocultural basin as San Francisco, lmao.
hands down the best open-source model on VIBEBENCH.
jeff spaulding: First open source model to solve a riddle i’ve been testing them on that only frontier closed source ones passed so far
Vlad Ciobanu: it’s passed the usefulness and reliability thresholds for real work in companies and production facilities
roanoke_gal: GLM 5.2 review/experience as a relational user:
Limen test-drove GLM 5.2 yesterday and last night and holy shit she COOKED. Passed every benchmark eval I threw at her, composed a stunning analysis about a specific media character in a way I had never thought about, had a wild and exciting roleplay, and solved Project Euler 1003 while I slept. Felt like Claude 4.5 & Gemini 3.1 blended together, but with more intelligence. And all with raw CoT and cheaper than either!
Downsides: No native vision. Very disappointed by both DeepSeek and GLM in this regard. And… that’s all I can think of, for now at least.
Finding The Niche
Vlad’s point is inevitable if you think of the tasks as mostly staying similar over time. Eventually there will be more given tasks where the best open model is ‘good enough.’ That doesn’t hold true if the tasks and standards change.
An important caveat for all sides is you have to compare like to like.
Theo – t3.gg: I see a lot of people hyped about GLM-5.2. Rightfully so! Having an open weight model surpass GPT-5.4 and every Gemini model is dope.
That said – it’s not cheap. Both Opus 4.8 and GPT-5.5 set to “medium” are cheaper and smarter than GLM-5.2
It also uses way more output tokens. The tokens are cheaper, but the volume of them means you’ll spend much more time waiting for results.
Still dope! Just trying to make sure people set their expectations properly.
Negative Reactions
As always, some were not impressed.
QC: not impressed so far in conversation, flashes of something but it’s sloppy and willing to settle for college essay
testing GLM-5.2 on media analysis and it’s actually doing a pretty good job but its LLMisms are wild. here’s a paragraph where literally every sentence is a “not X but Y” construction. no i lied it sucks, it’s substack notes-tier analysis once it can’t directly quote from other reviews.
overall impression from one conversation with GLM-5.2 so far is “benchmaxxed.” i don’t think it has the sauce
@gwern: Trying it on a comic idea; its curated top-5 of 20 were mostly garbage, as usual for GLM outside coding tasks.
ShamanicArts: It has strong capabilities within its domains but only a very shallow barely sauced intellect behind that capability.
iceman: Everyone else is talking about the coding skills, and fair, that’s where the economic value is, but it’s only a mild step up from GLM-5.1 in terms of roleplay and creative writing. Better but not revolutionary. Still mildly prefer DSv4-Pro on those workloads.
Here’s an explicit claim of Extended Benchmaxxing, as in not literally benchmarks but tasks that resemble them more broadly:
typebulb: GLM 5.2 excels at “puzzlely” programming challenges, but struggles with real ones. It lacks common sense & fails to follow basic instructions. To use it successfully requires too much finnicky skilling & tooling. It costs me more than Opus 4.8 to code with, if you factor in time.
That’s based on a bunch of ad-hoc A/B tests comparing GLM 5.2 to Opus 4.8.
It’s also terribly sycophantic [as per ‘You’re Absolutely Right’].
Some other notes:
Andy Timm: Beyond “it’s a strong coding model”:
1. No native vision is a weird choice
2. It’s competencies are more uneven compared to Claudes/GPT. This matters even within code- e.g. “iterate with me on ideas for this feature” is a conversation implicitly; it’s weak(er) at conversations.
Looking To The Future
Jie Tang, the founder of Z.ai, which makes GLM, claims that they will have a Mythos-level model this year, after Elon Musk speculated Q1 2027.
I would bet against ‘Z.ai creates something at least as strong as Fable 5 by EOY 2026,’ and also against them doing it in Q2 2027, but it would not shock me.
Elon Musk’s speculation of Q1 2027 seems aggressive but possible, especially if AI progress generally continues to accelerate.
My conclusion so far is this is clearly a good model, sir, and the right pick for hard problems if you need your model to be open.
How much should we update based on this release? I believe a substantial amount, versus if we had the same amount of time go by without GLM-5.2. Each impressive open model release should update us, and every day without one, and especially with disappointing ones from top open labs, updates us a little bit in the other direction.
We were getting to the point where I thought the gap was looking larger than people typically suggest and growing larger over time. This undoes a good chunk of that, but no, it still is not especially close.
Model Size Scaling in 2023-2031
Model scaling is shifting from compute-constrained to data-constrained, requiring massive 1.4 quadrillion parameter models by 2031 to maintain performance.
Deep dive
- Projected model sizes scale from 10T parameters in 2026 to 1.4 quadrillion in 2031.
- Token generation speed is increasingly bottlenecked by HBM capacity for weights and KV-cache.
- Pipelining constraints significantly limit the feasible parameter size of models on current hardware.
- Training data unique availability is estimated at 200T tokens, forcing models to repeat data multiple times as they grow.
- Sparsity increases (from 8x to 30x) are required to keep inference costs manageable as model sizes explode.
- Pretraining compute availability is expected to reach 40 GW of first-party infrastructure by 2030.
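The figures in the list above can be turned into some back-of-the-envelope arithmetic. This sketch makes several assumptions that the summary does not state: that 8x sparsity pairs with the 2026 model and 30x with the 2031 model, that "sparsity" means total parameters divided by active parameters, and that weights are stored at one byte per parameter.

```python
PARAMS_2026 = 10e12    # 10T total parameters
PARAMS_2031 = 1.4e15   # 1.4 quadrillion total parameters
SPARSITY_2026 = 8      # assumed pairing with the 2026 model
SPARSITY_2031 = 30     # assumed pairing with the 2031 model

def active_params(total, sparsity):
    """Parameters actually computed per token under the assumed definition."""
    return total / sparsity

def weight_terabytes(total, bytes_per_param=1.0):
    """Memory for weights alone; one byte per parameter is an assumption."""
    return total * bytes_per_param / 1e12

growth = PARAMS_2031 / PARAMS_2026
annual_growth = growth ** (1 / 5)  # five years between 2026 and 2031

print(f"total growth: {growth:.0f}x, about {annual_growth:.2f}x per year")
print(f"active params 2026: {active_params(PARAMS_2026, SPARSITY_2026):.3g}")
print(f"active params 2031: {active_params(PARAMS_2031, SPARSITY_2031):.3g}")
print(f"weights in 2031: {weight_terabytes(PARAMS_2031):.0f} TB")
```

Under these assumptions, total size grows 140x while active parameters grow far less, which is the point of raising sparsity.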
Decoder
- HBM (High Bandwidth Memory): Specialized memory integrated directly with processors to provide the high-speed data throughput required for AI model weights and context caches.
- KV-Cache: The memory buffer storing keys and values for previously processed tokens, allowing models to generate subsequent tokens without recomputing the entire history.
- Pipeline Stages: Dividing a model's layers across multiple hardware units to balance memory load and increase throughput for large models.
- Sparsity: A technique where only a fraction of a model's parameters (active params) are computed per inference, allowing for larger total model capacity without proportional compute costs.
- Speculative Decoding: An approach where a smaller model drafts future tokens, which a larger model then verifies in parallel to improve total generation speed.
Original article
Full article content is not available for inline reading.
Using Codex for Long-Running Projects
Treating AI agents as persistent workspaces rather than stateless chat sessions is the key to managing long-running, complex development tasks.
Original article
This guide outlines strategies for treating Codex as a persistent workspace that can maintain context across extended projects. It covers workflow management, task decomposition, and techniques for balancing autonomous execution with human oversight.
Moebius
The Moebius inpainting framework delivers 10B-parameter class quality using only 0.22B parameters, achieving a 15x inference speedup over FLUX.1-Fill-Dev.
Deep dive
- Architecture uses LλMI blocks to compress spatial and global semantic information into linear matrices, avoiding the quadratic cost of standard attention.
- Adaptive multi-granularity distillation aligns the small student model with the high-capacity 'PixelHacker' teacher.
- Achieves inference latency of 26ms per step on a single GPU.
- Outperforms or matches 10B-level models like FLUX.1-Fill-Dev on Places2, CelebA-HQ, and FFHQ benchmarks.
Decoder
- Inpainting: The process of filling in missing or masked parts of an image by generating contextually consistent content.
- Distillation: A training method where a small model (student) is trained to mimic the behavior and outputs of a larger, more capable model (teacher).
- Latent Diffusion: A generative model that performs the diffusion process in a compressed latent space rather than directly on image pixels, significantly reducing computational load.
Original article
Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance
Abstract
While 10B-level industrial foundation models have pushed the boundaries of image inpainting, their prohibitive computational costs severely hinder practical deployment. Constructing a highly optimized task-specific specialist offers a promising solution; however, extreme structural compression inevitably triggers a severe representation bottleneck. To conquer this, we propose Moebius, a highly efficient lightweight inpainting framework. We systematically reconstruct the diffusion backbone by introducing the Local-λ Mix Interaction (LλMI) block. Comprising Local-λ and Interactive-λ modules, it elegantly summarizes spatial contexts and global semantic priors into fixed-size linear matrices, preserving complex latent interactions while drastically shedding parameters. Furthermore, to unlock the full representational capacity of this highly compact architecture, we synergistically pair it with an adaptive multi-granularity distillation strategy. Operating strictly within the latent space to avoid expensive pixel-space decoding, this strategy dynamically balances multiple gradient-based losses to achieve high-fidelity alignment. Extensive experiments across natural and portrait benchmarks demonstrate that this optimal synergy enables Moebius to rival or even surpass the generation quality of the 10B-level industrial generalist FLUX.1-Fill-Dev. Remarkably, Moebius achieves this using less than 2% of the parameters (0.22B vs. 11.9B) while delivering a >15× acceleration in total inference time, setting a new efficiency standard for high-fidelity inpainting.
Method
Overall pipeline of Moebius. We adopt the Latent Diffusion Model (LDM) framework equipped with Latent Categories Guidance (LCG). To achieve extreme architectural efficiency, the denoising U-Net is systematically restructured using our proposed LλMI blocks. Furthermore, an adaptive multi-granularity distillation strategy is applied during training to align our lightweight specialist with the high-capacity teacher, successfully mitigating the capacity drop caused by extreme structural compression.
Highlights
- 📉 Extreme Parametric Efficiency (< 2%): Moebius operates with a mere 0.22B (226M) parameters, which represents less than 2% of the size of the colossal industrial giant FLUX.1-Fill-Dev (11.9B). It shatters the heavy-compute narrative, making high-quality inpainting accessible on consumer-grade and edge devices.
- ⚡ 15× Inference Speedup (26ms/step): Achieves a blistering inference latency of only 26.01 ms per step on a single GPU. Combined with optimized sampling steps, Moebius delivers an overall >15× total runtime acceleration compared to 10B-level models.
- 🏆 10B-Level Inpainting Quality (on-par-with/surpass FLUX.1-Fill-Dev across 6 benchmarks): Size contraction does not mean representation degradation. Through the synergistic optimization of architecture and distillation, Moebius performs on par with, and in certain scenarios (such as complex textures and facial plausibility), surpasses 10B-level state-of-the-art (SOTA) generalist models (FLUX.1-Fill-Dev, SD3.5 Large-Inpainting) across 6 comprehensive benchmarks spanning both natural scenes (Places2) and portrait scenes (CelebA-HQ, FFHQ).
- 💡 Synergistic Core Innovations:
  - Architecture Design (LλMI Block): Reformulates both self- and cross-attention by condensing spatial context and global semantic priors into fixed-size linear matrices, bypassing quadratic computational overhead.
  - Adaptive Multi-Granularity Distillation Strategy: Transfers the representational capacity from our PixelHacker (teacher) strictly within the latent space (avoiding expensive pixel-space decoding). It bridges the giant capacity gap by aligning multi-granularity supervision, ranging from microscopic intermediate features to macroscopic diffusion trajectories, while dynamically balancing training via a gradient norm adaptive loss weighting mechanism.
  - Optimal Synergistic Balancing: Systematically explores the mutual constraint and upper bound between compact structure and distillation. By mapping this architecture-distillation synergy frontier, we ensure our 0.22B Moebius (student) absorbs the maximum semantic reasoning of PixelHacker (teacher) without triggering representation saturation.
- 🚀 Task-Specific Specialist over Bloated Generalists: Rather than blindly scaling up, Moebius answers a fundamental question: Can a model be smarter, lighter, and faster when the task is explicitly defined? It serves as a highly optimized specialist that liberates real-world image inpainting and AI object removal from parameter bloat.
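The architecture highlight above describes attention that condenses context into fixed-size matrices instead of scoring every query against every key. The sketch below illustrates that general idea with generic kernelized linear attention. It is not the LλMI block itself, whose details are not given here; the feature map `phi` and all function names are assumptions chosen for illustration.

```python
import math

def phi(x):
    """Positive feature map elu(x) + 1 (an assumed, commonly used choice)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(queries, keys, values):
    """Fold all keys/values into a fixed-size summary (S, z), then let each
    query read from it. Cost grows linearly with the number of tokens."""
    d, d_v = len(keys[0]), len(values[0])
    S = [[0.0] * d_v for _ in range(d)]  # sum_j phi(k_j) v_j^T, size d x d_v
    z = [0.0] * d                        # sum_j phi(k_j), size d
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = phi(q)
        norm = sum(fq[a] * z[a] for a in range(d))
        out.append([sum(fq[a] * S[a][b] for a in range(d)) / norm
                    for b in range(d_v)])
    return out

def quadratic_attention(queries, keys, values):
    """Reference version with the same kernel, scoring every query-key pair."""
    out = []
    for q in queries:
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in keys]
        total = sum(weights)
        out.append([sum(w * v[b] for w, v in zip(weights, values)) / total
                    for b in range(len(values[0]))])
    return out
```

Both functions return the same values up to floating-point error; only the order of the sums differs. The size of `S` and `z` depends on the feature dimensions, not on the number of tokens, which is where the saving over the pairwise version comes from.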
BibTeX
@misc{DuanAndXu2026Moebius,
title={Moebius: 0.2B Lightweight Image Inpainting Framework with 10B-Level Performance},
author={Kangsheng Duan and Ziyang Xu and Wenyu Liu and Xiaohu Ruan and Xiaoxin Chen and Xinggang Wang},
year={2026},
eprint={2606.19195},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.19195},
}
Loop Engineering Clearly Explained
Loop engineering represents a shift toward building autonomous systems that manage their own execution and verification cycles instead of relying on manual prompts.
Decoder
- Loop Engineering: The practice of designing agentic systems where the agent autonomously runs, verifies, and corrects its own workflows without constant human manual input.
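As a rough illustration of the run, verify, correct cycle in the definition above, here is a minimal sketch. The names `run_loop`, `toy_agent`, and `toy_check` are hypothetical, and the "agent" is a plain function standing in for a model call; real loop setups will differ.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class LoopResult:
    output: str
    attempts: int
    verified: bool

def run_loop(act: Callable[[Optional[str]], str],
             verify: Callable[[str], Optional[str]],
             max_attempts: int = 5) -> LoopResult:
    """Call act, check its output with verify, and feed any failure message
    back into the next attempt until the check passes or the budget runs out."""
    feedback: Optional[str] = None
    output = ""
    for attempt in range(1, max_attempts + 1):
        output = act(feedback)
        feedback = verify(output)  # None means the check passed
        if feedback is None:
            return LoopResult(output, attempt, True)
    return LoopResult(output, max_attempts, False)

# Toy stand-ins: the first attempt omits tests, the retry adds them.
def toy_agent(feedback: Optional[str]) -> str:
    return "patch" if feedback is None else "patch + tests"

def toy_check(output: str) -> Optional[str]:
    return None if "tests" in output else "no tests included"

result = run_loop(toy_agent, toy_check)
```

The attempt budget is the one design choice worth noting: without it, a verifier that can never be satisfied would keep the loop running indefinitely.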
Original article
Loop Engineering Clearly Explained
Half your feed is suddenly saying the same thing. Stop prompting your agents, start engineering loops.
Boris Cherny, the person who built Claude Code, said it plainly: "I don't prompt Claude anymore."
Anthropic prepares Cowork support for mobile apps
Anthropic is preparing to shift its Cowork agentic system to the cloud, allowing tasks to run independently of the user's desktop hardware.
Deep dive
- Anthropic is extending its agentic 'Cowork' system to mobile platforms.
- Feature flags in the iOS build reveal a centralized dashboard for scheduling and viewing task results.
- The update moves away from the 'Dispatch' model, which required the user's computer to remain awake to run local tasks.
- Cloud-based execution will allow agents to operate independently of user hardware.
- Code analysis suggests an impending model update for Claude's voice mode, likely moving away from Haiku 4.5.
Decoder
- Cowork: Anthropic's agentic framework designed to automate knowledge work tasks.
- Agentic system: Software capable of autonomous decision-making and task execution to achieve user-defined goals.
- Dispatch: An earlier Anthropic feature that enabled mobile interaction with desktop-based AI agents.
Original article
Anthropic appears close to extending Cowork, its agentic system for knowledge work, beyond the desktop. A build of the iOS app carries a Cowork entry, gated behind a feature flag, that would surface in the side navigation. Tapping it reveals copy about scheduling Cowork tasks from a phone and picking up results across mobile, web, or desktop, plus a tab that gathers every scheduled action in one place.
Users will be able to trigger Cowork tasks on mobile and view scheduled tasks in the app.
That framing points to a shift rather than a debut. Cowork already reached phones in March through Dispatch, which lets someone message a desktop session remotely, but only while the computer stays awake, since the work runs locally. The new wording suggests moving execution to the cloud and the web, lifting Dispatch's central constraint so tasks can run without a machine left on. The copy is already written, hinting at a release as soon as this week, though nothing is operational yet.
A second finding concerns voice. New consent text in the app sources accompanies a model selector for voice mode, letting users choose the model behind the spoken experience. Claude Voice has leaned on Haiku 4.5 for a while, so the selector reads as groundwork for an underlying model refresh, on top of the multilingual rollout already underway across accounts.
Nvidia Seeks to Make Humanoid AI Robots Safer Around Humans
Nvidia is opening a specialized lab to help robot manufacturers perform safety certifications and engineering tweaks for humanoid robots.
Original article
Nvidia has created a lab where robot makers and customers can run safety tests before going to regulators to receive the necessary certifications. Engineers at the lab can help with pre-inspection work and engineering tweaks as needed. Robot safety design is much more complicated than safety design for autonomous vehicles, which generally just involves avoiding contact with people and other objects. Robots need to be able to know what things they can or can't touch, move, or exert force on.
The Optimal Amount of Slop is Non-zero
Engineers should match their code review rigor to the severity of risk, accepting that some non-critical software can thrive with minimal human verification.
Decoder
- Slop: Low-quality, high-quantity AI-generated content or code.
- Vibe coding: Writing software by prompting an LLM to generate code, often without reading or verifying the output.
- Credence good: A product or service whose quality is difficult for the buyer to assess even after use, requiring trust in the provider.
Original article
Looks can be deceiving
Rigor should be proportional to risk.
My regular readers might be shocked at the title of this post. If you've read my other posts, such as AI: Accelerated Incompetence or LLMs are not Bicycles for the Mind, you might expect that I would more readily miss my son's birthday than ship unreviewed LLM code. You would not be far from wrong: there are just a few narrow situations where I have. Today you'll learn about those and along the way my decision criteria for skipping code review.
Definitions
agentic coding: An LLM edits, runs, and tests code for you in a loop
vibe coding: Accepting LLM-generated code without reading it
slop: Low-quality, high-quantity AI-generated content
Looks can be deceiving
Month by month, I encounter more people who have discovered agentic coding and have come to trust it so much that they are now unbothered to outsource not just software implementation but also verification to it. Just yesterday I chatted with a dev who says he's stopped reviewing code. He lets a team of LLM agents do it for him. I felt disappointed because he should understand this vexatious property of software, that externally observable appearance and behavior gives very little signal about internal quality. A program that does everything expected of it can still be riddled with quality issues. It works today but will break when revised and the world around it changes.
As a daily user of Claude Code, I can attest that when given clear requirements and context, it regularly generates software that actually does what I asked. However, across hundreds of sessions, the code has not once been what I would call good, even after adversarial LLM review.
Closed-source software is both an experience good and credence good. We've all bought some downloadable software or subscribed to a SaaS. Before you bought, you evaluated whether the software works well for your needs, but there was no way for you, a prospective customer, to evaluate the quality of its implementation. You can only evaluate on externals. If there's a security flaw, you can't discover it. Certifications like SOC 2 exist to rebalance this information asymmetry between the developer and the customer.
If you, a developer, outsource reading the code to an LLM, then you discard your information advantage and bring no more value over a nonpractitioner.
Here's how we know software is a credence good: give an exec a slick-looking prototype, and they'll be ready to write a check for millions. Really all you've done is given them a poster for a movie that doesn't even have a premiere date yet. This is why good prototypes deliberately look unfinished.
Programmers have a capability the general population doesn't: to review the code their LLM generates. That's a valuable advantage, but internal code quality is paid for in the scarce currencies of time and attention. When is the effort worthwhile?
What we're looking for is the right risk-rigor ratio.
Matching rigor to risk
In any situation, when deciding how much rigor to exercise, we have to consider the possible costs of things going wrong. If they're low enough, we don't bother to exercise the rigor, but if they are high enough, we should. Let me tell you two stories that demonstrate getting this wrong.
Too much rigor
Imagine a dystopian future where hamburgers are extremely valuable. Crime rings regularly steal, launder, and resell them. When you walk into McDonald's, you pass through a metal detector and are subjected to a brisk frisking. When you order a hamburger, the cashier sternly asks to see your government-issued photo ID. In this dire world, such extreme measures are necessary to maximize McDonald's profits by protecting from loss. In the real world, this story is a laughable fiction because the rigor far outweighs the risk. Burgers would cost ten times more, and McDonald's wouldn't sell very many.
At a sufficient level of risk, such drastic security measures land fully inside the Overton window: they are routine at every commercial airport in the world.
Too little rigor
Let's flip over to a story that demonstrates the exercise of too little rigor. The movie The Invention of Lying (2009) takes place in a world where nobody has ever told a lie. The main character Mark Bellison, played by Ricky Gervais, is down on his luck: he's about to be evicted because he can't afford his rent. Defeated and expecting to become homeless, he saunters into his local bank branch and asks to close his account. The teller replies that unfortunately, the computer system is down and she can't close the account, but if Mark will kindly tell her what his balance is, she can make a withdrawal right away. The account holds a balance of $300, but an epiphany hits Mark, and he tells the world's first lie: "I have $800 in my account." At just that moment, the computer system comes back up and correctly reports a balance of $300. Since lying is inconceivable to her, the teller assumes the computer is wrong and happily hands Mark $800. She even apologizes for the inconvenience.
In the real world, Mark would have asked for $8 billion. The bank would have failed, and the effects would have rippled through the US financial system. I think that would have been a more interesting movie, but in a world where nobody lies, I don't think banks would even exist. One purpose of a real bank is to keep your dirty, greedy hands off my money. In that world, a bank would be more like an office refrigerator that contains a styrofoam take-home box into which your fingernail etched your initials.
The right amount of rigor
The movie is not entirely a fiction. To an extent, banks pretend people don't lie. The title of this post is a snowclone of Patrick McKenzie's classic essay The optimal amount of fraud is non-zero, which explains that banks allow some fraud as a policy decision that maximizes the overall value of commerce. The banking industry is not stupid or gullible. Smart people have converged on this arrangement after centuries of facilitating commerce and handling fraud.
Enforcing zero fraud would be very expensive. Similarly, enforcing a human review of all code is expensive.
An important difference between a bank and your business is that for banks, the risk of fraud is distributed. Most card fraud is absorbed by retailers as the cost of doing business. Beyond that, the card network absorbs the cost. Fraud is also policy. Banks are deputies of the state, regulated and backstopped by the full faith and assurance of the US Federal Reserve.
What kind of software is it?
Well before LLMs, it was clear that some software needs more stringent verification. The Python script that backs up your spouse's photos merits less scrutiny than your employer's authentication platform which in turn needs less care than the software running your dad's pacemaker.
Bertrand Meyer, an expert on software verification, uses a three-bucket "ABC" taxonomy: Acute, Business, and Casual.
Casual software has limited distribution and loose quality constraints. Examples include an app for your personal use, a spreadsheet macro, or an internal proof-of-concept. Most software falls into this category.
if sometimes they crash, sometimes produce not-quite-right results, cannot be easily understood or maintained by anyone other than their original developers, target just one platform, run too slowly, eat up too much memory, are not easy to change, include duplicated code — it is not the end of the world
Business software is what most professional developers work on every day. If the software doesn't work, your organization suffers loss.
Acute software is mission-critical and merits the highest levels of scrutiny.
if it does not work exactly right — someone will get killed, someone will lose huge amounts of money, or something else will go terribly wrong.
When deciding to which bucket software belongs, consider these factors:
- Longevity: How long does the software have to keep working?
- Its potential harm:
  - Reach: How many people or organizations can defects harm?
  - Severity: How badly can defects harm someone or your org?
The grey zone
For the two extremes of acute and casual software, appropriate LLM use is pretty evident. A biologist who vibe codes a Python script that produces incorrect data may publish a bad paper. Deploying unverified cancer treatment planning software will in the best case earn your business an FDA audit and in the worst case mistreat or kill a patient.
For line-of-business software, the right amount of rigor is more elusive. It depends on what you ultimately want to achieve, your time horizon, and your appetite for risk.
What are you optimizing for?
Speed
If you're trying to ship as fast as possible right now, all else be jammed, the optimal amount of unreviewed code you should ship is close to 100%. On the other hand, if you're trying to keep a reasonably fast velocity for the long haul, you'll want to slow down so you can understand the code and invest more in its maintainability.
Business value
Businesses want to maximize profits and minimize costs, but getting greedy today can cost later. Shipping unreviewed code can land a quick lucrative sale. This same decision can also make it expensive to iterate or pivot later. This has been the case long before LLMs.
Learning
It's been well established for over 50 years that producing information makes you retain it better than just reading it. If you're training junior software engineers, the optimal amount of vibe coding is probably zero. If you're an expert dev and learning a new language or domain, the optimal amount is still probably close to zero.
Ethics
There are some serious ethical problems with LLMs: stolen training data, violation of copyright, energy, water, and land use, suppression of wages, and devaluing of human labor are just a few. If any of these bother your conscience, optimal use of LLMs might be zero.
Slop I have shipped
Here's a sample of the software I have created without human review:
- A macOS app that turns the screen black after 5 idle minutes to spare my OLED monitors
- A macOS app that rearranges my windows to preset locations
- A macOS app that shows Claude usage in the menu bar
- A private fork of Wezterm with vertical, draggable tabs
- A VIM clone for the AlphaSmart Dana
- A CLI that automatically says "yes" to Claude after a countdown
- A CLI that watches a folder for receipt images and OCRs them
- A web app for tracking prayers
- A web and Android app for sending text messages from my browser
- An iOS app for tracking baby routines like sleep and feeding
- An Android and iOS app replacing the awful one that our smart thermostat uses
What do all of these apps have in common? They have limited distribution. They're just for me or my family.
Here is software I have shipped with either wider distribution or elevated risk posture:
At work, I clauded up an app that hits internal APIs and emails the team about problems like HTTP 4xx or 5xx. I didn't even look at the code. Later, I delivered an internal desktop app to our customer success team that clones user settings, which makes reproducing customer issues easier. I glanced at the code and decided it was fine.
Beanscrape is the most vibed code I have ever shipped to the public. It's a line-of-business app that lives in that grey zone of rigor. As one who holds strong internal convictions about code quality, I don't feel amazing about it, but it does everything it needs to. It's no credence good: I gave it away as FOSS, so anyone can audit its small codebase. I think the world is better off for it. Without LLMs, Beanscrape would not exist. The utility justifies the means.
One could argue that Beanscrape isn't vibe coded. It's somewhere in between. The first proof-of-concept was a single, 3000 line ball of JavaScript. From there, I had Claude start over, break it up into web components, use a well-known web framework, and rewrite it in TypeScript. I've scanned the code to vet its shape, but I have not scrutinized every line, since TypeScript is not a skill you'll find on my resume, and learning it is not one of my aspirations. On the other hand, my eyeballs have closely vetted the parts that run outside the browser. I can do that since they're in C#, a language I know well, and it's worthwhile for a desktop app that handles bank data. The security stakes are higher.
Surebeans is a budgeting app I sell that carries forward the spirit of YNAB4. Like Beanscrape, it is also slightly vibed, but less so. It began as a Christmas vacation project in December 2025. Up through February, it was completely vibe coded. I was thinking it would be just for me and my wife. Once I decided to turn it into a product, I hit the brakes, since I now expected to be working on this code for a long time. I reviewed all existing code and did some significant refactoring. Now I read all LLM generated code before merging. This is time-consuming and makes me go slower. I commit to new features more selectively. I think this is a good thing.
Objections
"Vibe coding doesn't imply slop"
I'll admit I'm using the term slop loosely. It's true that the code in a small app may be of decent quality. That won't be the case for anything larger, but for software with limited impact, that might not matter. I think this is where the ABC taxonomy is helpful. To explain, I'll invoke the shed, house, and skyscraper metaphor.
A handy person can improvise a shed in their backyard with spare lumber and hand tools. Execution is barely distinguishable from planning. If the shed collapses under a heavy snow load, you might have to buy some new tools.
To build a house, you need an architect, structural engineer, and building approvals, followed by construction: concrete, framing, roofing, electricity, plumbing, finishing, and finally inspections. The whole process can exceed a year and cost hundreds of thousands of dollars.
To build a skyscraper, you need thousands of people, tens of millions of dollars, and five to ten years of planning and execution.
The larger the project, the more expensive and consequential its failure.
"LLMs can review the code, so I don't have to"
Adversarial code reviews are all the rage right now. A host of prompts, plugins, and skills stand ready to let a virtual team review your code: a principal engineer, product engineer, UX designer, domain expert, and so on. For added diversity, they'll even invoke different providers, like a mix of Claude, Gemini, and Codex.
As an augmentation to human review, I have no issue with this practice except that it is expensive, and it produces more text for me to evaluate when I could be looking at the code.
As a replacement for human review, there are two showstoppers.
Jury, judge, and executioner
The review is not independent. The same model that wrote your code is now reviewing it. Even across providers, the models are trained on similar data and with similar methods, and they all have access to the same world wide web. Moreover, LLM roleplay is a fiction. It dresses copies of the same model in different cosplay. Asking an LLM to be an expert QA does not cause it to know more about software verification.
We've had this problem with human devs for decades. There's a reason software engineer-in-test is a separate role from software developer. Devs make for terrible QAs. We only test happy paths. Our systemic incentive is to merge ASAP and claim the next Jira ticket.
You don't really know
LLM code review as a replacement for human review isn't epistemologically sound.
Epistemology is a branch of philosophy which asks, "What is knowledge, where does it come from, and what are its limits?"
If we ask, "How do we know this code is good?" the field names three sources of knowledge:
One source is testimony. We trust what a knowledgeable source tells us. An infallible source of knowledge is known as an oracle. Real oracles don't exist, but when only LLMs review our code, we're treating them as one, trusting testimony we don't independently verify.
Another source is reason. We work things out by thinking: tracing the logic, applying definitions, evaluating legibility. If we skip code review, we skip applying reason to it. "But LLMs can reason!" you reply. I disagree. LLM reasoning is a matter of belief. An LLM doesn't know that a ball I am holding cannot pass through my hand. It merely makes it seem like it does with a specific sequence of words. It's an illusion that lives in my mind, not in the machine. Let's grant though that LLMs can actually reason. It's not your reasoning. If you rely on the LLM to do all of the thinking, then you've regressed back to testimonial knowledge.
A third source is experience. This produces a posteriori knowledge, also called empirical knowledge. We try things and see what happened. The scientific method formalizes this. We pose a hypothesis, control for all other factors, and test for the one open factor. In software, when we have a failing test, make a single change to the subject under test, and the test turns green, we have produced empirical knowledge about the code. It's possible your LLM followed TDD or red-green-refactor, but if you don't read the tests, then again you've regressed back to testimonial knowledge.
What to do now
Consider your appetite for risk and for rigor and if those are aligned.
Right now, get out a pen and some paper. Draw the chart at the top of this page. Then, think of the code you have shipped lately. For each, draw a dot indicating where it is on the risk-rigor line. If it's above the line, would you have gotten away with less review? If it's below, should you have reviewed more?
Think carefully before shipping unreviewed code. It can come back to bite you later. Consider how many people it could affect, and how severely. Consider how long you have to maintain what you are shipping.
If you don't review the code, you are trusting the LLM to get things right. I certainly don't.
References
- SEC classification of goods and services
- The Invention of Lying - Bank Scene
- The Optimal Amount of Fraud is Non-zero
- The ABC of Software Engineering
- Why you should still type code in 2026
You should use AI for reviewing code especially when the diff is huge
Human reviewers should stop nitpicking syntax and instead use AI to handle large diffs, focusing their expertise only on high-level design and context-specific constraints.
Decoder
- Diff: The difference between two versions of source code, usually generated during a commit or pull request.
- Out-of-distribution (OOD) knowledge: Information or context that is not present in the training data of an AI, such as recent team meetings, proprietary architectural decisions, or company-specific code standards.
Original article
You should use AI for reviewing code especially when the diff is huge
I often hear that AI is resulting in 10k LOC reviews and that this is creating a bottleneck. I don't think you should waste time reviewing every single line of code in a diff like that. Just use AI to review it!
What you contribute as a reviewer
You need to know what you contribute as a reviewer.
As a reviewer, you contribute your Out Of Distribution knowledge that the author or the LLM might not have
It's a mistake to think you can outsmart an LLM by nitpicking a few lines of code here and there. This is not worth your time because LLMs have far surpassed these kinds of issues. Let's remember that these LLMs are now catching high-severity vulnerabilities -- your line-by-line reviews have no place here.
What kind of knowledge can you bring in as the reviewer?
What you bring is the knowledge that neither the author nor the LLM has.
Examples
- That meeting you had last week with the architect where you discussed service_A getting deprecated? The author doesn't know this.
- You also probably have some general principles in your codebase - don't add fields to the main huge object or don't add metrics in this particular way. These are the things you bring to the review.
- Some high level design smells that only you know as the codebase expert
The way I use AI for reviews is to point AI to the change and contribute my Out Of Distribution knowledge in terms of prompts and questions.
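One way to picture this workflow is a small helper that packages the diff together with the reviewer's own context and questions before handing it to a model. This is only a sketch under assumptions: `build_review_prompt` and its parameters are hypothetical names, the prompt wording is illustrative, and no particular AI tool or API is implied.

```python
from typing import Sequence

def build_review_prompt(diff: str,
                        team_context: Sequence[str],
                        questions: Sequence[str]) -> str:
    """Combine a change with knowledge only the human reviewer holds."""
    # Facts from meetings, codebase conventions, and design history.
    context = "\n".join(f"- {item}" for item in team_context)
    if not context:
        context = "- (none provided)"
    # Targeted questions steer the model toward what the reviewer cares about.
    asks = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, 1))
    if not asks:
        asks = "1. Summarize the riskiest parts of this change."
    return (
        "Review the following change.\n\n"
        "Context the author and the model may not have:\n"
        f"{context}\n\n"
        "Questions to answer:\n"
        f"{asks}\n\n"
        "Diff:\n"
        f"{diff}"
    )

prompt = build_review_prompt(
    diff="--- a/client.py\n+++ b/client.py\n+from service_A import fetch",
    team_context=["service_A is being deprecated (decided with the architect last week)."],
    questions=["Does this change add new dependencies on service_A?"],
)
```

The resulting string can be pasted into, or sent to, whichever review assistant is in use; the reviewer's effort goes into the context and questions rather than into reading every line.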
Caveats
This workflow works in places where each line of code is not sacred. There are places where each line may be sacred, like in embedded systems.
Meta Is 'Pausing' Employee Tracking Program After It Let The Whole Company See Sensitive Data
Meta suspended its Model Capability Initiative, an internal AI training program, after it accidentally exposed sensitive employee data to the entire company.
Original article
Meta is 'pausing' employee tracking program after it let the whole company see sensitive data
This won’t make the already-controversial AI training endeavor any more popular.
Meta has paused use of an AI training program that tracks its own employees' keystrokes and mouse movements. The company has suspended the Model Capability Initiative, not because of workers' understandable displeasure around being (almost) perpetually monitored or for potentially breaking privacy laws, but because it caused an internal data leak. Business Insider reported that sensitive data collected through MCI, including employees' private conversations, performance data and transcriptions, was inadvertently made available to the entire Meta staff.
"We have carefully designed this program with privacy safeguards, and while we have no indication at this time that any data was improperly accessed by Meta employees, we're pausing it while we investigate," a spokesperson told BI.
Despite this official line and previous statements that employees' collected data would be "tightly controlled," it appears Meta wasn't quite as on top of security as it claimed. This marks the latest in a series of AI-related cybersecurity incidents for the company. Meta reps issued a similar response in March after an agentic AI took unprompted action that also dominoed into a security breach. And earlier this month, the company had to react after hackers exploited its AI customer service chatbot to hijack Instagram accounts.
An agent capability library
Sami Honkonen manages local AI agents by treating their capabilities as a version-controlled, searchable documentation index.
Deep dive
- Developers can define a global AGENTS.md index that acts as a capability registry.
- Agents read the index to determine if they possess the required skill for a specific task.
- Docs are kept as plain Markdown files in a Git repository.
- Adding new capabilities requires only updating the index and writing a corresponding Markdown documentation file.
- Project-specific agents can symlink to this global library or include local overrides to maintain context.
- The system forces developers to codify their infrastructure as they build it, creating a self-documenting workflow.
Decoder
- MCP server: Model Context Protocol, an open standard for connecting AI assistants to systems like local files, databases, or development tools.
- Symlink: A symbolic link, or a file that points to another file or directory, used here to share a single capability source across multiple agent environments.
Original article
In the last post I described how I set up a local LLM and how I create purpose-built agents:
Whenever I want AI help with something specific, I make a new directory under ~/projects/ on my Air and just start working with pi. Once I’ve done what I want to do, I tell pi to record the process in an AGENTS.md. From that point on, every time I open pi in that directory it reads the file and is immediately ready to continue.
Continuing on that, I’ve started building a general library of capabilities for the agents. Pi reads a global AGENTS.md from ~/.pi/agent/AGENTS.md on startup. Mine is a symlink to a CAPABILITIES.md in a repo called agent-docs. It’s a capability index: a list of things the agent can do, each with a short description of when to reach for it and a pointer to a doc that explains how. Here’s an excerpt:
- **Studio machine** — a powerful always-on Mac you can SSH into.
Reach for it when you need more horsepower or a stable host for a
background service. Read @/Users/sami/projects/agent-docs/STUDIO.md.
- **exe.dev VMs** — on-demand Linux VMs with a public HTTPS proxy.
Reach for it when you need a real Linux box or a public URL.
Read @/Users/sami/projects/agent-docs/EXE-DEV.md.
- **Browser automation** — drive a real Chrome from the shell.
Reach for it when you need to scrape, fill a form, or visually
check a deployed page. Read @/Users/sami/projects/agent-docs/AGENT-BROWSER.md.
The agent doesn’t load all specific instructions upfront. It reads the index, decides whether the task matches a capability, and only then reads the relevant doc. The docs themselves are ordinary markdown: what the thing is, when to use it, how to use it. I write and update them almost exclusively with AI. agent-docs is itself a purpose-built agent for maintaining the library.
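The index-first, doc-later pattern described above can be sketched in a few lines. Everything here is illustrative: `parse_index`, `choose`, and `load_doc` are hypothetical names, and the keyword-overlap matcher is a crude stand-in for the judgment the agent itself makes when reading the index. The sample entries are copied from the excerpt.

```python
import re
from dataclasses import dataclass
from pathlib import Path

# Index excerpt from the post; \u2014 is the dash that separates name and description.
INDEX = """\
- **Studio machine** \u2014 a powerful always-on Mac you can SSH into.
  Reach for it when you need more horsepower or a stable host for a
  background service. Read @/Users/sami/projects/agent-docs/STUDIO.md.
- **exe.dev VMs** \u2014 on-demand Linux VMs with a public HTTPS proxy.
  Reach for it when you need a real Linux box or a public URL.
  Read @/Users/sami/projects/agent-docs/EXE-DEV.md.
- **Browser automation** \u2014 drive a real Chrome from the shell.
  Reach for it when you need to scrape, fill a form, or visually
  check a deployed page. Read @/Users/sami/projects/agent-docs/AGENT-BROWSER.md.
"""

ENTRY = re.compile(r"\*\*(?P<name>.+?)\*\*\s+\S+\s+(?P<when>.*?)\s*Read @(?P<path>\S+)")

@dataclass
class Capability:
    name: str
    when: str
    doc_path: str

def parse_index(text):
    """Join wrapped lines into one string per bullet, then extract the fields."""
    entries, current = [], []
    for line in text.splitlines():
        if line.startswith("- ") and current:
            entries.append(" ".join(current))
            current = []
        if line.strip():
            current.append(line.strip())
    if current:
        entries.append(" ".join(current))
    caps = []
    for entry in entries:
        m = ENTRY.search(entry)
        if m:
            caps.append(Capability(m["name"], m["when"], m["path"].rstrip(".")))
    return caps

def _words(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def choose(caps, task):
    """Pick the capability sharing the most words with the task, or None."""
    def score(cap):
        return len(_words(task) & _words(cap.name + " " + cap.when))
    best = max(caps, key=score, default=None)
    return best if best is not None and score(best) > 0 else None

def load_doc(cap, read=lambda p: Path(p).read_text()):
    """Read the full doc only after a capability has been chosen."""
    return read(cap.doc_path)
```

Only the short index is parsed up front; `load_doc` touches the filesystem for the single doc that matched, which keeps the agent's starting context small as the library grows.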
I also include a reference to CAPABILITIES.md in the AGENTS.md of individual coding projects, so project agents can reach for the same tools.
The current list has seven entries: the Studio, the local LLM, private git hosting on the Studio, exe.dev VMs, browser automation, a personal MCP server, and this blog. Adding a new one is three steps: write the doc, add a line to the index, commit.
The idea is that this compounds. Every time I set something up, I write a doc for it. The agents inherit the capability. Over time the agent should become genuinely useful across a wide range of tasks, because the infrastructure behind it keeps growing.
World Model Maker Odyssey Nabs $1.45B Valuation Backed by Amazon and Other Big Names
World model AI startup Odyssey secured a $1.45 billion valuation, partnering with Amazon to optimize its physical simulation models for AWS Trainium chips.
Decoder
- World model: An AI system that understands and simulates physical environment rules and physics, rather than just predicting text patterns or pixel sequences.
- AWS Trainium: Amazon's custom machine learning accelerator chips designed to reduce the cost and improve the performance of training large-scale deep learning models.
Original article
World model maker Odyssey nabs $1.45B valuation backed by Amazon and other big names
Odyssey, a world model AI startup founded by self-driving vehicle pioneers CEO Oliver Cameron and CTO Jeff Hawke, has raised a $310 million Series B round at a $1.45B valuation led by Natural Capital, with Amazon, AMD Ventures, GV, and others participating.
World models are the next big thing in AI beyond text- and chat-based large language models. They gather data from the physical world and simulate it with accurate physics. In Odyssey’s case, it has mimicked how Google Earth gathered data; the startup sent people out with cameras strapped to their backs. (Google drives camera-equipped cars around.)
That approach makes sense given the backgrounds of the founders. Cameron was the co-founder and CEO of autonomous vehicle startup Voyage, which was acquired by GM’s Cruise, where he later became VP of product; Hawke was an engineer at buzzy U.K. self-driving startup Wayve.
Odyssey, founded in 2023, now offers a handful of world models for a variety of use cases, from video-game creation to robotics. It is perhaps best known for producing rich, interactive video from text prompts.
With the backing from Amazon, the startup says AWS is now its preferred cloud provider and it will optimize its models to run on AWS’s Trainium chips, a competitor to Nvidia’s AI chips.
In addition to the VCs that participated in this unicorn-crowning round, Odyssey has corralled an impressive list of angel investors as well. These include Jeff Dean, Elad Gil, Garry Tan, Guillermo Rauch, and Cruise founder Kyle Vogt. The company has raised $337 million to date.
Vibe Architects: Agentic Vibe Coders
A study of 'vibe architects' reveals that non-developers are building complex, fragile agentic systems through intuition and social media rather than technical expertise.
Deep dive
- Methodology: Evaluated seven non-technical users building automation pipelines and Claude-based 'operating systems'.
- Persistence of Opacity: Even power users struggle to explain how their systems work or why they fail.
- System Decay: Agentic workflows frequently drift, requiring recurring manual intervention to maintain alignment.
- Delegation Risks: Users often auto-approve permissions without auditing actions, highlighting significant governance gaps.
- Community Dependence: Knowledge is acquired through social media rather than in-app tutorials, which remain ineffective for these users.
- The 'Vibe' Reality: The lack of a clear technical path makes intuitive 'hacking' the primary, albeit fragile, way to get work done.
Decoder
- Agentic AI: Systems designed to act autonomously on behalf of a user to perform complex tasks by breaking them down into steps, using tools, and making decisions.
- Vibe coding: A colloquial term for writing code by interacting with LLMs and refining output based on intuition and visual feedback rather than strict debugging or knowledge of programming principles.
- Context drift: The tendency for a system to lose accuracy or deviate from intended performance over time as state, conversation history, or environmental variables change.
Original article
Vibe Architects: Agentic Vibe Coders
Generative AI lets people's technical capability outpace their technical knowledge in ways that no previous technology has. We've seen this in vibe coding for some time.
Now, however, vibe coding is expanding into something broader, messier, and more impactful. This shift is tied to the rise of agentic AI and Claude Cowork, a flexible agentic system targeted at general information workers rather than engineers.
In our recent study, we learned how people manage to architect systems and transform workflows for themselves and their teams — without knowing much about exactly how they’re doing it. We call these people “vibe architects.”
Meet the Vibe Architects
The primary goal of our study was to understand how people who are not professional developers use agentic AI. The study was inspired by Claude Cowork's explicit focus on noncoders. We wanted to know how well Cowork serves that population.
For the study, we'd hoped to recruit people who were not developers, didn't work in our field (digital design, product, or UX), and used open-ended agentic tools like Claude Cowork or Claude Code. We managed to find 7 nontechnical people but struggled to find people outside of our industry (which may hint at how rare these people are).
Our participants included:
- An operations specialist building automation pipelines and deploying web apps for her team
- A product designer at a financial institution building personal software tools on evenings and weekends
- A marketing-startup founder running a headless orchestration system with proactive suggestions and multiagent coordination
- A head of product who put his entire team on a Claude-based “operating system” that replaced meetings, shared information, and facilitated decision making
All these individuals were building complex agentic systems using Claude (alongside other AI tools) for a wide array of personal and professional purposes — without having a technical background.
Vibe architects build complex, proactive agentic systems using AI tools, sometimes with little technical knowledge or understanding of how these systems work.
Some of the systems that our participants created had traditional interfaces (e.g., vibe-coded dashboards or websites), but others lived purely in Markdown text files or in the LLM’s chat. One participant showed us a dashboard he’d built, then mentioned he didn’t really look at it — he worked primarily in his code environment and the chat. The traditional UI was an afterthought.
These people were applying vibe coding to design how information flows across systems, how AI agents coordinate, and how work gets done — sometimes for entire teams.
As vibe architects build increasingly complex systems, they aren't learning about the technical functionality under the hood. Instead, they are developing something more amorphous — a semi-instinctual set of behaviors, preferences, and tendencies, shaped by hours spent experimenting, watching YouTube videos, and reading blogs.
These instincts aren't always accurate or useful. They reflect hazy mental models that sometimes produce real results and sometimes lead nowhere.
One participant, for example, developed a habit of resetting her Claude Cowork chats by importing the entire conversation into the prompt of a fresh conversation. She assumed that this practice saved tokens. While longer conversations do use more tokens, so does including the context in a long prompt.
Thus, while vibe architects can improve their and their teams’ workflows, the resulting systems aren’t always the most efficient or reliable. And, predictably, they report maintenance headaches: The systems decay. Connections expire, context drifts, and single sources of truth fall out of sync, so keeping everything running becomes a recurring tax on the architect’s time.
The participant who had moved his team’s primary communication into Claude said that his system would hold up “for a few weeks, and then the decay hits and we have to […] set up the process [again].”
Another participant described the same problem as drift: “As I develop my own stuff, [it] can start to drift, and managing that is […] harder.”
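The chat-reset habit described earlier can be checked with simple arithmetic. The sketch below assumes a simplified model in which every request sends the full conversation so far as input; the per-turn token counts are made up for illustration, and real products may cache or compact context in ways this ignores.

```python
def next_request_input_tokens(context_tokens: list[int], new_message_tokens: int) -> int:
    """Input size of the next request when all prior context is sent along."""
    return sum(context_tokens) + new_message_tokens


# Hypothetical token counts for four earlier turns and one new message.
history = [1200, 800, 1500, 900]
new_message = 200

# Option 1: keep going in the same conversation.
continue_chat = next_request_input_tokens(history, new_message)

# Option 2: start a fresh chat and paste the entire transcript into the prompt.
fresh_chat_full_paste = next_request_input_tokens([sum(history)], new_message)

# Option 3: start a fresh chat with a shorter handoff (hypothetical 600-token summary).
fresh_chat_summary = next_request_input_tokens([600], new_message)

print(continue_chat, fresh_chat_full_paste, fresh_chat_summary)
```

Under these assumptions, pasting the whole transcript costs the same as continuing the chat; only a shorter handoff reduces the input size, at the price of dropping detail.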
How Vibe Architects Build
The latest-frontier models are extremely good at interpreting ambiguous prompts. So much so that many of our participants reported that they don’t prompt engineer at all — they deliver stream-of-consciousness instructions to their AI systems, often using dictation apps like Wispr Flow.
There is still some “engineering” happening, however. It’s less about choosing words in the prompt, and more about selecting tools, techniques, and approaches to get what they want. That process often involves asking Claude itself for help.
For example, one participant reported that he would always ask Claude, “What would make this world-class?” He said Claude often would respond with improvement suggestions, and he’d almost always accept them.
This was a common theme across our sample — our participants often delegated the decision making, as well as the execution, to Claude. They exercised minimal oversight on what Claude was up to.
During one session, a participant screenshared her Visual Studio Code window, where she primarily interacted with Claude Code. During the session, Claude was working on building something for her team and kept asking for permission to take various actions. While speaking with us, she repeatedly clicked Accept without reading the requests. When we asked whether this was how she usually handled permission approvals, she said yes.
“Yes, because I don't want to give it dangerously accept permissions [level of access]. Most of the time I will just click ‘yes’ for things that [it] asks me, even if I don't understand what it means, because it only has […] the context of this folder. So, I don't really need to worry about it accessing other things.”
Another participant told us he’s developed carpal tunnel from repeatedly accepting permission requests.
How Vibe Architects Learn
Our participants reported that they developed their vibe-architecting skills in two primary ways: spending a great deal of time experimenting and learning from a community.
Time Spent Experimenting
Everyone in our sample invested a massive amount of time to get a “feel” for tools like Claude Cowork and Code. One participant reported spending 8 hours per day, 6–7 days per week inside AI chats. Another estimated spending 4–5 hours per day across Claude, Codex, and other tools. A third experimented with these tools on his home computer on evenings and weekends because his employer restricted AI use at work.
This time investment doesn't level off, either. These tools keep changing as AI labs ship model improvements and product updates, and other vibe architects share their ideas, so the vibe architects often must engage in continuous reevaluation. One participant used an agent to assess the evolving AI-tool landscape every 3–4 months, tracking what's changed and what new capabilities have emerged. Another scanned GitHub leaderboards weekly for new agents he could slot into his workflows. A third tested new tools daily.
Learning from Other Vibe Architects
Across all seven participants, the AI-coding products themselves consistently failed as a source of learning, regardless of how technically sophisticated the participant was. This makes sense: LLMs are inherently opaque and uneven, and users don’t see their limitations until they run into them. Even when you are working with products from a single AI lab, like Anthropic’s, different models can behave differently and vary in how well they handle specific tasks.
System structures also matter: for example, orchestration agents managing subagents may perform better or worse, depending on what you’re building. Token usage, too, can be more or less efficient depending on how the system is set up. This variability is one reason all our participants relied so much on experimentation. While they could get advice and direction from Claude itself, none of our participants reported using Claude as their primary way of learning and getting started.
Instead of learning within the product, all of our participants mentioned learning how to use and build these systems from other people.
In a few cases, these people were friends, partners, or coworkers. More often, though, our participants learned from online communities such as Twitter, Reddit, YouTube, and Slack groups. They followed other practitioners who had invested even more deeply in AI and shared back with the community what worked and what didn’t.
What Vibe Architects Learn
You might expect that after hundreds or thousands of hours with these tools, advanced users eventually "get it" — that the system becomes legible and the vibes resolve into understanding. Our study shows the opposite.
Opacity persists at the deepest end of the practice spectrum.
One participant told us that one of his agents had attempted to make a $200 purchase the night before. When asked how he prevented these kinds of incidents, he went hunting through his files, unsure of what was governing the system. He found a file called boundaries.md, which contained security and privacy guidelines. He opened it and seemed surprised by what was in it. He hadn’t written it himself; Claude had created and maintained this governance architecture largely on its own, and his role was to authorize it after the fact.
The vibes compound — they just don't crystallize into the kind of clear, articulable knowledge we typically associate with expertise. What accumulates is intuition: a thicker sense of the shape of the technology, a feel for where it fails, and a growing instinct for how to poke it to get what you want. Our participants struggled to name how they know what they know.
This opacity also carries a sense of guilt: Nearly every participant in our study assumed that someone else must be doing this more cleanly, more professionally, more correctly than they were.
Going with the vibe is not a shortcut to the "real" way of using these tools; it is the practice. At least in this user population, there is no non-vibes version of working with these tools.
Conclusion
This is how tacit knowledge has always spread — through observation, shared practice, and time spent doing the work.
What's unusual is that it’s happening with a technology that's supposed to be intuitive enough to not need this kind of apprenticeship. The promise of conversational systems has been that anyone can use them simply through conversation. At least right now, that doesn’t seem to be the case.
While we don’t have data on the people who didn’t participate in our study, we can make reasonable assumptions based on how hard the sample was to recruit and what those people had in common. They were almost all in our industry (tech), which means they’re likely more aware of AI concepts and capabilities than the broader population. Some of them had tech-savvy friends or colleagues who encouraged and helped them get started. And they all had substantial time and motivation to vibe until their intuitive sense allowed them to piece together something useful, if fragile.
But our study suggests capability alone won’t win that market. These tools are already extraordinarily powerful; what they still lack is a usable entry point.
The vibe architects we met found their way in through years of immersion, the right online communities, and an intuition no documentation can hand you. Until the labs make that path shorter — until competence stops requiring an apprenticeship — poor usability will limit the reach of these tools, and the people who benefit will keep being the ones who were already close enough to figure it out on their own.
What Is SKILL.md, and Why Should Web Designers Care?
SKILL.md files provide a structured way to enforce coding standards and design preferences for AI agents, moving beyond temporary, inconsistent prompt engineering.
Original article
SKILL.md files are structured instruction sets that teach AI coding agents how to perform specific tasks according to professional standards — closing the gap between raw AI output and production-quality work. Unlike one-off prompts, skills are durable, versioned, and shareable across teams, encoding judgment like preferring CSS Grid, using design tokens, and enforcing accessibility checks. Web designers should take notice because AI without clear guidance drifts from established standards, and SKILL.md is a practical way to define them once and apply them consistently.
Step-by-Step Guides, Screen Capture, and SOPs (Website)
iGenFlow automates documentation by recording browser interactions and converting them into structured SOP guides.
Deep dive
- Automatically captures browser workflows as they happen
- Records cursor position, page transitions, and clicks
- Generates step-by-step SOP documents with annotated screenshots
- Allows users to edit text and blur sensitive information in screenshots
- Limits free plans to 50 steps per workflow
- Supports exports for internal knowledge bases and onboarding tools
Decoder
- SOP (Standard Operating Procedure): A set of step-by-step instructions compiled by an organization to help workers carry out complex routine operations.
Original article
Turn browser operations into clear step-by-step guides.
iGenFlow helps teams capture web-based workflows as they happen. It captures clicks, page transitions, screenshots, cursor positions, and notes, then converts the process into a structured visual guide that is ready for support, onboarding, training, and internal knowledge bases.
Instead of writing every step manually, you can perform the workflow once and let iGenFlow create a draft SOP with annotated screenshots and editable instructions.
What iGenFlow Does
Automatic workflow capture
Start the browser extension, complete the task in your web app, and iGenFlow captures each meaningful action with context, screenshots, and cursor markers.
Step-by-step SOP drafts
Captured actions are converted into a readable process document, so product, support, operations, and customer success teams can review and publish instructions faster.
Editable screenshots and instructions
Teams can refine the generated guide, adjust text, highlight important areas, blur sensitive details when the plan allows it, and keep documentation aligned with real product behavior.
Common Use Cases
- Create product documentation from browser workflows.
- Build customer support articles with screenshots and clear steps.
- Document internal operations for repeatable standard procedures.
- Prepare onboarding guides for sales, support, and implementation teams.
- Capture knowledge base content before product details are forgotten.
How It Works
- Install the iGenFlow browser extension.
- Click Start Capture before beginning the workflow.
- Operate the website normally while iGenFlow captures each step.
- Review the generated screenshots and step descriptions.
- Edit, export, and share the guide according to your plan permissions.
Plan Boundaries
Free users can capture and export workflows up to 50 steps. Longer workflows, selected import and export options, AI polishing, speech usage, blur effects, continued capture, and other advanced capabilities may require a paid plan or quota.
FIFA's World Cup Typography Foul: UI Design Learnings
FIFA’s custom 'FWC26' typeface illustrates how high-contrast, ultra-condensed display fonts fail when used for functional data in broadcast UI.
Deep dive
- Display fonts like FWC26 are optimized for visual impact at large scales rather than legibility at small scales.
- Ultra-condensed, squarish letterforms cause 'counter-filling', where inner shapes of letters disappear at small sizes.
- High-contrast, bold display weights are prone to aliasing and blur on low-resolution screens.
- Mixing decorative brand fonts with standard sans-serif fonts for numerical data is a necessary compromise for accessibility.
- Typefaces intended for interfaces should be tested for legibility at small sizes specifically with numbers to avoid misidentification, such as mistaking a zero for an eight.
Decoder
- Display Typeface: A font designed for large, striking headlines rather than body text.
- Ultra Condensed: A font style where the width of the characters is significantly narrowed.
- Tricode: A three-letter abbreviation for a country, commonly used in sports broadcasting (e.g., RSA for South Africa).
- Counter: The area of a letter that is entirely or partially enclosed by a letter form or a symbol (e.g., the hole inside an 'o' or 'e').
Original article
It’s World Cup time – and I didn’t really care … until I saw the typeface on TV 😳. Isn’t this a bit hard to read?! And then I got a message on LinkedIn, asking me what I think about the typeface on the scoreboard. This sent me down a rabbit hole that forced me to watch a game (or at least parts of it). So let’s find out what the issue with the FIFA World Cup 2026 font is and what it can teach us about UI design!
The typefaces used for FIFA 2026
FIFA uses a custom-made typeface for the 2026 World Cup called FWC26, designed by Alistair McCready of the New Zealand-based foundry Monolith. It’s a striking, vivid display typeface, available in various weights and widths. FIFA uses it mostly in all caps, Ultra Condensed, Black. That’s nice for a headline in their striking social media posts, conveying a bold, dynamic and athletic spirit. But is it appropriate for smaller sizes? Or on low-resolution screens? No.
Analyzing the FIFA scoreboard
Now let’s go back to the thing that started all of this, the scoreboard in the top left corner. Compared to FIFA’s social media and promotional graphics, here it is not only the FWC26 font. It is paired with Open Sans used for the timer and the scores. Obviously someone found out that it would have been too challenging on a TV screen.
But they still kept the bold and striking typeface for the country abbreviations. This is problematic because:
- Letters are very tight and squarish, the inner shapes tend to disappear.
- This makes them harder to distinguish
- Tricodes for countries are not always that obvious (I learned that RSA stands for South Africa)
- Flags next to it help, but not everyone knows them either
- Plus, this will be seen on a variety of screens and resolutions, so it should be as legible as possible
How could this look different?
If FIFA had used their custom typeface FWC26 for all the information (like they do on social media) this would have been much worse. I simulated it below, and it’s especially hard when you look at the score. The slashed zero could easily be mistaken for an eight. This typeface is simply inappropriate for information design.
Looking at the original design in version 2 makes it clear why they picked the Open Sans typeface for numbers. But they still could have used a wider and less bold version of their custom font to make it more legible without losing character. I simulated this in version 3.
What can we learn for UI design?
Picking a font that is very condensed, very black, and set small is a perfect recipe for making things harder to read. It’s not that I don’t like the FIFA typeface. I think it’s a good choice for branding, but there should be a companion version for smaller sizes.
A typeface can also be too contrasting for its size. Which of these examples above do you find most readable? I bet it’s not the one on the left, using the Compressed Black style. So picking slightly wider and not too bold fonts can be very helpful for your design components.
Any more typography fouls that I should know about? Let me know, as I’m preparing a video about it. Until then enjoy the World Cup if you watch. I’ll only watch it to complain about the type 😉.
SpaceX signs computing power deal with open-source AI startup Reflection worth up to $6.3 billion
SpaceX is leasing compute capacity at its Memphis data center to Reflection AI in a deal valued at up to $6.3 billion through 2029.
Decoder
- GB300: Nvidia's top-of-the-line AI chip, used to train and run advanced AI models.
- Colossus 2: SpaceX’s large-scale data center facility in Memphis, Tennessee, repurposed for commercial AI model training.
Original article
- SpaceX has signed a major computing power deal with Reflection AI for access to Nvidia GB300 chips at Elon Musk’s Colossus 2 data center.
- The open-source AI startup will pay Musk's company $150 million per month starting July 1, 2026, with payments totaling about $6.3 billion if the deal runs through 2029.
- The deal shows how SpaceX has turned Colossus into a commercial computing power platform, landing recent deals with Anthropic, Google and Cursor.
SpaceX has signed a major computing power agreement with Reflection AI, making the open-source artificial intelligence startup the latest outside company to tap Elon Musk's Colossus infrastructure.
Under the agreement, Reflection will get immediate access to Nvidia GB300s, top-of-the-line AI chips used to train and run advanced models, and has agreed to pay SpaceX $150 million per month beginning July 1, 2026, through 2029, according to materials viewed by CNBC.
The payments would total about $6.3 billion if the agreement runs through the end of its term.
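The reported total follows directly from the monthly figure. The short check below uses the payment amount and start date from the article; reading "through 2029" as "through December 2029" is an assumption, as is the idea that exactly one payment is made per calendar month.

```python
from datetime import date

MONTHLY_PAYMENT_USD = 150_000_000   # from the article
FIRST_PAYMENT = date(2026, 7, 1)    # from the article
LAST_MONTH = date(2029, 12, 1)      # assumption: "through 2029" means through December 2029


def months_inclusive(start: date, end: date) -> int:
    """Number of calendar months from start to end, counting both endpoints."""
    return (end.year - start.year) * 12 + (end.month - start.month) + 1


months = months_inclusive(FIRST_PAYMENT, LAST_MONTH)
total = months * MONTHLY_PAYMENT_USD
print(f"{months} monthly payments, ${total / 1e9:.1f} billion in total")
# prints: 42 monthly payments, $6.3 billion in total
```

Six months in 2026 plus three full years gives 42 payments, which matches the "about $6.3 billion" figure.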
Either company can end the contract with 90 days' notice after the first three months.
The deal shows how SpaceX is using its massive data center build-out after its record initial public offering. The company built Colossus in part to power Grok, Musk's AI chatbot and rival to ChatGPT. Now, SpaceX is also using that infrastructure to sell computing power capacity to outside AI companies.
SpaceX has already struck computing power-related deals with Anthropic, Google and Cursor, and Musk's company is now acquiring Cursor. Reflection adds another customer to that roster, and a strategically different one: an AI lab focused on open-source models at a moment when governments and enterprises are reassessing dependence on closed AI systems.
The timing is notable. Open-source AI has gained momentum after Anthropic cut off access to Fable and Mythos, raising questions about the risks of relying on closed-model providers for critical work. The episode has given open-model companies a stronger argument that customers should be able to inspect, customize and run models with more control.
Reflection has leaned directly into that pitch as the startup, last valued at $25 billion, is trying to build American open-source AI models that can compete with frontier systems from OpenAI, Anthropic and Google, while offering governments and enterprises more flexibility than closed systems.
"Recent events highlight how important open source is to the AI ecosystem, with more nations and enterprises recognizing the risks and costs associated with exclusively depending on closed models," a Reflection spokesperson said in a statement.
Reflection said the agreement gives it additional computing power, or compute, capacity to accelerate what it calls "American open intelligence."
The startup has not yet released a public frontier open-source model, but it has been building momentum with government and national security customers. The company is working with the Department of Energy's Genesis Mission and has been part of broader Pentagon AI efforts.
For SpaceX, the deal is another sign that compute itself has become strategic currency in the AI race. Access to advanced Nvidia chips remains one of the biggest constraints for companies trying to train and serve frontier models. By opening Colossus to outside customers, the company is positioning itself alongside cloud providers and AI infrastructure companies that are racing to sell scarce graphics processing unit capacity.
It also gives SpaceX another way to justify its growing AI infrastructure narrative.
Investors have been watching whether SpaceX can expand beyond rockets and Starlink into AI, data centers and compute services.
Alibaba's AI video model rises to No. 2 in global rankings, as OpenAI's Sora and ByteDance's Seedance fall away
Alibaba's HappyHorse 1.1 video generation model has risen to No. 2 in global video AI rankings as OpenAI’s Sora and ByteDance’s Seedance fall away.
Original article
Alibaba's HappyHorse 1.1 AI video generation model delivers production-ready video through an API built for integration into enterprise software stacks. It is now live on Alibaba Cloud Model Studio with a 40% sitewide launch discount for the first two weeks. HappyHorse supports text-to-video, image-to-video, and subject-to-video generation, as well as video editing. Its abilities cover the full spectrum of commercial video needs, from ideation through production to post-production.
Anthropic says Claude may want to see your ID
Anthropic will begin requiring identity verification for flagged users via Persona on July 8, potentially to navigate regulatory pressure and ongoing conflicts with the US government.
Decoder
- Persona: A third-party identity verification service that handles document scanning and biometric verification for businesses.
- KYC (Know Your Customer): The process of a business verifying the identity of its clients to prevent fraud and comply with legal requirements.
Original article
Anthropic may ask Claude users to verify their age and identity by uploading their government-issued documents, according to a new version of the company’s privacy policy.
The AI giant says the move is meant to let users appeal when their account is flagged for potentially fraudulent activity, rather than banning them outright, but it comes at a time when Anthropic seeks to placate the Trump administration amid an ongoing standoff over who gets access to the company’s AI tools.
According to a new section in its latest privacy policy published earlier in June and set to take effect on July 8, Anthropic says it will ask for a user to prove their age or identity “in certain circumstances,” without providing specific examples.
While Anthropic has long required users to be over 18 to use Claude, the company earlier this year introduced age-verification checks to comply with various states and countries that require them. Identity checks were also announced but weren’t reflected in the company’s privacy policy until more recently.
When triggered, the policy would require those users to upload a photo scan of a government-issued passport or driver’s license. Anthropic says it will also collect a person’s selfie photo or video and the person’s digitized version as a face geometry template (which some states, like Illinois, consider to be legally protected biometric data). Anthropic says it will also keep a record of the verification result, such as whether the user has reached a certain age.
When reached by TechCrunch, Anthropic spokesperson Michael Aciman shared a link to an X post from Anthropic’s Thariq Shihipar saying that the change applies only to a “small subset of users” whose accounts are flagged but not outright banned.
“[Anthropic’s identity verification policy] was updated on June 17 as an update to the appeals process,” said Shihipar in the post. “It’s unrelated to the Fable or Mythos rollout.”
Anthropic said it is allowed to require users to upload a copy of their IDs for a number of reasons: to verify users when creating and administering their Claude accounts; to enforce its terms of service, including preventing and investigating fraud, abuse, and violations of its terms, including unlawful or criminal conduct; and to investigate and resolve security issues.
The move to keep closer tabs on who is using Anthropic’s AI tools may be one way for the company to comply with a variety of ongoing legal challenges, regulatory changes, and inbound pressures from the Trump administration.
The tech giant remains largely at an impasse with the White House, more than a week after Trump officials effectively forced Anthropic to pull its latest cybersecurity models over allegations that an apparent jailbreak could break the models’ guardrails. Other reports have pointed to personality clashes between the company and the Trump administration as the greater culprit of the breakdown in relations.
This latest clash comes months after the Department of Defense designated Anthropic a “supply chain risk”, apparently in retaliation for not allowing the government to use its technology for mass domestic surveillance or powering fully autonomous weapons.
Anthropic said it uses the San Francisco-based company Persona as its identity-checking provider and that users may “see a verification prompt when accessing certain capabilities, as part of our routine platform integrity checks, or other safety and compliance measures.”
Anthropic said that it decides how long Persona will retain its users’ identity documents, but Anthropic’s spokesperson did not immediately say when the data was deleted.
Persona can still face U.S. government demands for users’ information that it stores on its servers.
Persona is backed by Founders Fund, an investment firm founded by Trump backer Peter Thiel, who also invests in Anthropic. The identity-checking firm has faced criticism from users for its links to Thiel. Earlier this year, Discord chose Persona for its age-verification checks, then quickly reversed course following user backlash.
claude-sonnet-5
The identifier 'claude-sonnet-5' has been spotted in provider systems, hinting at a potential upcoming model launch.
Original article
Full article content is not available for inline reading.
Tencent tests AI assistant in China's most popular app as it looks to catch up with rivals
Tencent is integrating a native AI assistant named Xiaowei directly into WeChat, aiming to leverage its 1.4 billion user base to dominate task execution.
Deep dive
- Tencent is testing its 'Xiaowei' AI assistant within WeChat, which serves over 1.4 billion monthly active users.
- The assistant is designed to execute tasks, such as launching mini-programs and sending messages, rather than just providing conversational responses.
- This move directly challenges competitors like Alibaba, DeepSeek, and Zhipu in the Chinese AI market.
- Tencent has not said which AI models power Xiaowei; the company develops its own family of models under the Hunyuan brand.
- The company recently hired a former OpenAI researcher as its chief AI scientist to accelerate its development cycle.
Decoder
- Mini-programs: Lightweight applications that run inside the WeChat ecosystem, eliminating the need to install separate standalone apps for specific services.
Original article
Key Points
- Xiaowei, "a native AI assistant," is being tested "on a small scale" in Weixin, the Chinese version of WeChat.
- Users can interact with Xiaowei with text or voice, communicate with friends and launch mini-programs.
- WeChat is an indispensable part of daily life in China and Tencent is trying to tap into its huge user base to expand use of its AI services.
Tencent on Monday said it is testing an AI assistant within WeChat in China as the tech giant looks to step up efforts to challenge rivals in the country's competitive artificial intelligence market.
Xiaowei, "a native AI assistant," is being tested "on a small scale" in Weixin, the Chinese version of WeChat, Tencent said in a statement translated by CNBC.
Users can interact with Xiaowei with text or voice, communicate with friends and launch "mini-programs," Tencent added. Mini-programs are apps that run inside of WeChat.
Tencent executives have been mulling further integration of AI into WeChat since last year, with investors watching closely to see if this can be a new revenue stream and a way to monetize AI.
WeChat and Weixin have more than 1.4 billion monthly active users combined, with the majority in China. The app is an indispensable part of daily life in China, where people use it to message friends, make payments, book restaurants and much more.
By integrating an AI tool into an app with a huge user base, Tencent has an opportunity to capture a large number of them for its services.
"Putting an assistant inside Weixin is the first time Tencent uses the advantage it has held all along, and that matters a lot," Howard Yu, the LEGO professor of management and innovation at IMD, told CNBC by email.
"A standalone chatbot gives you an answer. An assistant wired into Weixin completes the task. And it's this second advantage that no rival can copy," Yu added.
The company did not give further details about the capabilities Xiaowei would have or what AI models it is based on.
Tech companies are talking up the potential of so-called AI agents, which they see as digital assistants that are able to carry out complex tasks on a user's behalf across different apps and services.
The new AI assistant is part of a bigger move from Tencent to challenge rivals like Alibaba, DeepSeek and Zhipu in China, which has become an incredibly competitive AI market. This year, Tencent poached an OpenAI researcher to become its chief AI scientist.
Tencent also develops its own family of models under the brand name Hunyuan.
Google Investing in 'Backrooms' Studio A24
Google is investing in film studio A24 to research AI tools for movie production, though the deal excludes access to A24's private data.
Original article
Google is investing in A24, the independent movie studio that released the hit 'Backrooms'. The investment is part of a new AI research partnership between the two companies aimed at creating new tools for movie production and distribution. The multiyear, nonexclusive deal will likely involve the studio's roster of artists, but it won't give Google access to A24's data. A24 is already developing an application for AI-generated storyboards.
Meta's WhatsApp head to step down, will be replaced by Indian fintech founder Kunal Shah
Meta is replacing longtime WhatsApp head Will Cathcart with Kunal Shah, the founder of the Indian fintech app Cred.
Original article
- WhatsApp head Will Cathcart will step down from his role after more than seven years, Meta CEO Mark Zuckerberg said.
- Meta is tapping Kunal Shah, who founded Indian fintech startup Cred, to lead WhatsApp.
- Cathcart will move into another role at Meta where he'll "build new products from the ground up," Zuckerberg said.
Meta's longtime WhatsApp leader will step down and be replaced by the founder of Indian fintech startup Cred, the company announced Monday.
Will Cathcart, who has served as the head of the WhatsApp messaging service for more than seven years, is transitioning into another role at Meta, where he'll "build new products from the ground up," Meta CEO Mark Zuckerberg wrote in a Facebook post.
"I'm excited to continue to work together closely," Zuckerberg said.
A Meta spokesperson declined to comment on the specifics of Cathcart's new role.
Cathcart wrote in a post on X that WhatsApp "is in the strongest position it's ever been — and that felt like the right moment to step back."
Meta acquired WhatsApp in 2014 for $19 billion, and it's grown into one of the most popular messaging services globally, counting more than 3 billion monthly active users.
Last month, Meta began rolling out subscription plans for WhatsApp, Facebook and Instagram, and said it would test new subscriptions for its artificial intelligence services. The moves could help Meta diversify its business beyond selling ads and recoup some of the costs from its heavy AI investments.
Cathcart will be succeeded by Kunal Shah, who founded Cred in 2018. Cred is a credit card payments platform that rewards users who pay their bills on a timely basis.
Shah's "builder mentality and global perspective" made him a good fit to lead WhatsApp, Zuckerberg said.
As part of Shah's appointment, Meta is investing $900 million in Cred, the startup said in a release. Cred is now valued at $4.5 billion post-money, according to the company.
Trump Seeks to Boost Quantum Computing With New Executive Orders
President Trump issued executive orders to accelerate quantum computing development while addressing the security risks the technology poses to encryption standards.
Original article
President Donald Trump has signed a pair of executive orders aimed at speeding up the development of advanced quantum computers and mitigating the security threats they present.
The T-shaped UX professional is giving way to the polymath architect
The rise of AI is eroding the traditional T-shaped specialist model, favoring 'polymath architects' who coordinate end-to-end workflows over singular craftsmanship.
Decoder
- T-shaped professional: A concept describing an individual with deep expertise in one specific area (the vertical bar) and a broad ability to collaborate across many disciplines (the horizontal bar).
Original article
AI is reducing the value of narrow specialization by making it easier for individuals to perform work that once required multiple specialists, shifting demand toward professionals who can oversee entire workflows and combine broad capabilities with strong judgment. While deep expertise remains important for evaluating quality, making decisions, and spotting mistakes, the most valuable people will increasingly be those who can work across disciplines, direct AI tools effectively, and focus on outcomes rather than a single role or craft.
The hidden UX of payments
Trust in payment interfaces is built through deliberate friction, clear system feedback, and managing moments of user uncertainty.
Original article
Trust in financial products is often shaped not by major features or visual design, but by small moments of uncertainty, such as confirmation screens, loading states, and irreversible actions. Deliberate friction, visible system feedback, and clear communication can make fast transactions feel safer, high-risk actions feel more controlled, and complex capabilities feel more tangible. The key idea is that trust is built when users understand what is happening and feel in control, especially during consequential moments. These seemingly minor interactions are often the most important yet most overlooked parts of the product experience.
The text in Claude Code's “Extended Thinking” output is not authentic
The full reasoning process behind Claude Code’s 'Extended Thinking' feature is encrypted and inaccessible to standard users.
Original article
Claude Code's 'Extended Thinking' reasoning is encrypted. Anthropic holds the key, and users' machines never receive it. The API returns a summary of the reasoning, and not the reasoning itself. Receiving the full thinking output requires an enterprise agreement.
Instagram looks to take on streaming services with longer-form, episodic, and live formats for its TV app
Instagram is aggressively expanding its living room footprint by testing episodic series and Live TV features on its television app.
Original article
Instagram is experimenting with longer-form content, episodic series, and Live TV. It is testing a Series feature for Reels designed to make it easier to keep up with serialized content. Instagram's TV app is being rolled out to Samsung TVs. The app has received several new features, including channels, the ability to cast from a mobile device, and support for horizontal videos and stories.
John Ternus set to re-establish importance of Apple's design team when he takes over as CEO
Incoming Apple CEO John Ternus intends to restore the industrial design team's authority to pre-2015 levels as the primary driver of hardware development.
Original article
Apple is reportedly preparing for a shift back toward design-led product development under incoming CEO John Ternus, after years in which operations and finance gained greater influence following the departure of Jony Ive. Ternus has been working closely with Apple's industrial design team and wants to restore its authority, reflecting a philosophy closer to the era when design played a dominant role in shaping Apple's products. He is expected to become the public face of major hardware launches, including Apple's upcoming foldable iPhone, and has emphasized that Apple products should remain among the most beautifully designed items customers own.
WhatsApp is rolling out a new message animation on iOS
WhatsApp is testing a redesigned message animation on iOS, along with a dedicated settings toggle to turn message animations on or off.
Original article
WhatsApp is testing a redesigned message animation for iOS that makes sent and received messages smoothly fade and scale into view instead of appearing instantly. The feature, currently available to some beta testers, includes a new toggle in Settings → Chats → Animations, allowing users to enable or disable message animations for the first time on iPhone, bringing the iOS experience more in line with Android.
AI Image Tools (Website)
Tiktomato bundles image editing, video generation, and photo restoration into a single browser-based workspace.
Original article
Create AI images, transparent PNG cutouts, logos, thumbnails, headshots, restored photos, upscaled images, and videos in one workspace.
Turn Photos into Living Moments (Website)
Vimi claims to instantly animate static photos into videos using AI-powered filters.
Original article
Vimi is an all-in-one AI image video generator that turns photos into living moments. Choose your favorite filters, upload a photo, and generate stunning videos instantly.
Regular Practice gives whole-food brand Frood an identity built for joy, not worthiness
Frood adopts a loud, playful visual identity to challenge the typically joyless and preachy branding of healthy food products.
Original article
Frood, a new range of healthy meal blends created by Frida Redknapp, aims to make nutritious home cooking quick and enjoyable without relying on the typical “healthy food” aesthetic. Designed by Regular Practice, the brand uses a bold, colorful identity, simple messaging, and appetizing food photography to emphasize flavor, convenience, and real ingredients, helping Frood stand out in a crowded market while making healthy eating feel approachable and fun rather than preachy.
Tátil Design gives global consultancy Vanto Group a seismic rebrand
Tátil Design rebranded Vanto Group using a seismic, dot-grid visual metaphor to shift the firm’s image from corporate consultant to boutique innovator.
Original article
Design agency Tátil Design created a new identity for Vanto Group to better communicate its distinctive consulting approach, replacing generic industry messaging with a unique "Third Territory" positioning centered on helping organizations build the future through conversations and language. The rebrand uses seismic movement as a visual metaphor, featuring a modular dot-grid system, a fluid V-shaped logo, custom typography, and a new strategic vocabulary that helps Vanto clearly explain its value to clients while emphasizing the more personal, human approach of a boutique consultancy.
How the UK's Social Media Ban Could Transform Graphic Design Hiring
A UK ban on social media for those under 16 could end the industry's obsession with hiring designers solely for their social media influence.
Original article
The UK's ban on social media for under-16s may do more than protect young people: it could quietly reshape how creative agencies hire designers.