Jun 18
AI infrastructure

Production Infrastructure for AI Agents

Vercel released 'eve,' an open-source framework that treats AI agents as file-based projects with built-in sandbox environments and durable execution.

Summary

What: The framework allows developers to define agents as directory structures containing instructions, tools, and subagents, while managing production needs like human-in-the-loop approvals, scheduling, and observability.
Why it matters: This shifts agent development from 'bespoke plumbing' to a standardized application architecture similar to how Next.js modularized web development.
Takeaway: Run `npx eve@latest init my-agent` to scaffold your first agent, then read the documentation at eve.dev/docs.

Deep Dive

  • Agents are defined via a file tree: agent/instructions.md, agent/tools/, and agent/subagents/.
  • Durable execution allows sessions to pause and resume across crashes or deployments.
  • Includes native sandboxed compute for agent-generated code execution.
  • Supports human-in-the-loop approvals using the needsApproval configuration in tool definitions.
  • Integrates with MCP (Model Context Protocol) for connecting to third-party data like Slack, GitHub, and Snowflake.
  • Agents deployed via Vercel automatically gain observability spans and metrics in the Vercel dashboard.
  • Supports cron-based scheduling for autonomous agent tasks.

Decoder

  • Durable Execution: A pattern where workflow state is persisted so that long-running processes can survive server failures or deployments.
  • MCP (Model Context Protocol): An open standard for connecting AI models to local or remote data sources and tools.
  • Sandbox: A secure, isolated environment where untrusted code is executed to prevent it from accessing the host file system or system resources.

Original Article

Today, we are proud to introduce eve, an open-source agent framework for building, running, and scaling agents. eve is designed around the idea that building an agent should mean defining what it does without assembling all of the pieces that it needs to run in production. Instead, eve comes with production already built in:

  • Durable execution
  • Sandboxed compute
  • Human-in-the-loop approvals
  • Subagents
  • Evals
  • And more

eve is the framework that we build and run our own agents on.

Agents today are where the web was before frameworks, with everyone hand-rolling the same plumbing and nothing carrying over to the next one. Next.js ended this for the web, and eve is doing the same for agents.

An agent is a directory

This is an eve agent.

agent/
  agent.ts                   # the model it runs on
  instructions.md            # who it is
  tools/
    run_sql.ts               # what it can do
    post_chart.ts
  skills/
    revenue-definitions.md   # what it knows
  subagents/
    investigator/            # who it delegates to
  channels/
    slack.ts                 # where it lives
  schedules/
    monday-summary.ts        # when it acts on its own

Each file describes one component of the agent, so at a glance, the tree tells you what an agent is, what it does, where it lives, and when it acts on its own.

Create an eve agent in minutes

Every agent starts with its definition.

import { defineAgent } from "eve";

export default defineAgent({
  model: "anthropic/claude-opus-4.8",
});

The agent.ts file is where you configure the agent itself. You can define the model with one line, with provider fallbacks supported through AI Gateway, and compaction, model options, and other optional fields are there when you need them.

Giving your agent a job and personality is as simple as creating an instructions.md file, which serves as the system prompt that eve puts in front of every model call.

You are a senior data analyst. You answer questions about the team's data.
- Prefer exact numbers to hand-waving. If you can compute it, compute it.
- State the assumptions behind any number you report (date range, filters, grain).
- Use the tools available to you rather than guessing. If you cannot answer from the data, say so plainly.

You create files for what your agent does, like post_chart.ts and revenue-definitions.md for tools and skills, and eve wires them into a working agent without any boilerplate or plumbing to manage. You can just focus on what your agent does instead of how it does it.

Why we built eve

We had built agents for years at Vercel, v0 among them. But once coding agents made building one something anyone could do, everyone did. We shipped hundreds of agents and internal apps, and it looked like a productivity revolution.

But underneath it, every team was building and rebuilding the same plumbing before their agent could do anything, and none of it carried over from one use case to the next. Each agent was designed for a different task, but they all had the same needs, and the same structure kept emerging to meet them. Agents have a shape.

eve is that shape made into a framework. Every generation of software earns its abstractions once enough people have built the same thing the hard way, and agents are there now.

Batteries included

Everything an agent needs in production ships with the framework.

A durable session for every conversation

Agents wait on people, call slow systems, and run for hours, days, or weeks. In eve, every conversation is a durable workflow with each step checkpointed, so a session can pause, survive a crash or a deploy, and resume exactly where it stopped. This durability is built on the open-source Workflow SDK.
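
The checkpointing idea in this paragraph can be pictured with a small sketch. It is not the eve or Workflow SDK API: `runDurably`, `Step`, and the in-memory `Checkpoints` store are hypothetical names, and a real system would persist checkpoints outside the process. It only shows why a resumed session does not repeat steps that already finished.

```typescript
// Hypothetical sketch: NOT the eve or Workflow SDK API.
// An in-memory object stands in for a persistent checkpoint store.
type Checkpoints = Record<string, unknown>;

interface Step {
  name: string;
  run: () => unknown;
}

// Runs steps in order, skipping any step whose result is already checkpointed.
function runDurably(steps: Step[], checkpoints: Checkpoints): unknown[] {
  const results: unknown[] = [];
  for (const step of steps) {
    if (!(step.name in checkpoints)) {
      // Record the result before moving on, so a later crash never repeats this step.
      checkpoints[step.name] = step.run();
    }
    results.push(checkpoints[step.name]);
  }
  return results;
}

// Simulate a crash in the second step on the first attempt, then resume.
const store: Checkpoints = {};
let queryRuns = 0;
let crashOnce = true;

const steps: Step[] = [
  {
    name: "query",
    run: () => {
      queryRuns += 1;
      return 42;
    },
  },
  {
    name: "chart",
    run: () => {
      if (crashOnce) {
        crashOnce = false;
        throw new Error("simulated crash");
      }
      return "chart.png";
    },
  },
];

let firstAttemptFailed = false;
try {
  runDurably(steps, store);
} catch (_err) {
  firstAttemptFailed = true;
}

// The second attempt reuses the "query" checkpoint and only re-runs "chart".
const resumed = runDurably(steps, store);
```

After the resume, `queryRuns` is still 1: the first step's result came from the checkpoint rather than a second execution.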

A sandbox for every agent

The code your agents write should be treated as untrusted, so eve keeps agent-generated code out of your application runtime entirely. Every agent gets its own sandbox, an isolated environment for shell commands, scripts, and file reads and writes, running in a separate security context from the harness that controls the agent. The backend behind this sandbox is an adapter. When deployed, it runs on Vercel Sandbox. Locally, it runs on Docker, microsandbox, or just-bash, and you can write an adapter for any other provider.

Human-in-the-loop approvals

Agents act on real systems, and some of those actions should require a person to approve them. Any action in eve can be configured to require approval, and the agent will pause there and wait, indefinitely if it has to, without consuming any compute. Once approved, eve continues the task right from where it left off.

Secure connections to tools, data, and services

Agents need to connect to your backends, data, and other third-party services. In eve, a connection is a file that points at an MCP server or any API with a compatible OpenAPI document.

import { defineMcpClientConnection } from "eve/connections";

export default defineMcpClientConnection({
  url: "https://mcp.linear.app/sse",
  description: "Linear workspace: issues, projects, cycles, and comments.",
  auth: {
    getToken: async () => ({ token: process.env.LINEAR_API_TOKEN! }),
  },
});

eve discovers the remote tools, hands them to the model, and brokers the auth, and the model never sees the connection's URL or credentials. Vercel Connect handles interactive OAuth with consent and token refresh built in. At launch, eve agents can connect to Slack, GitHub, Snowflake, Salesforce, Notion, and Linear, plus anything else you can reach over OAuth, an API key, or an MCP server.

The same agent on every channel

Most agents live in exactly one place because every new surface is its own integration to build. In eve, the same agent serves every surface, and each channel is just a small adapter file. The HTTP API is on by default, with Slack, Discord, Teams, Telegram, Twilio, GitHub, and Linear included, and defineChannel covers custom channels. One channel can also hand off to another, so an incident webhook can open an investigation thread in Slack.

Tracing and evals built in

When an agent gets something wrong, the first question is what the agent actually did. In eve, every run produces a trace. Each model call and tool call appears in order with its inputs and outputs, down to the commands the agent ran in its sandbox, so you can replay the run instead of piecing it together from logs.

ai.eve.turn                      # one span per turn
├── ai.streamText                # the model call
│   └── ai.streamText.doStream
└── ai.toolCall                  # run_sql, with inputs and outputs

The spans are standard OpenTelemetry and export to any tracing service you already run, whether that is Braintrust, Honeycomb, Datadog, or Jaeger. On Vercel, they surface in an Agent Runs tab under Observability, giving you one place to watch every session and drill into any run. Evals let you go further, with scored test suites you can run locally or wire into CI.

Extend an agent one file at a time

The most common way to give an agent capabilities is to give it tools, and to teach it how to do things with skills. Today that means building the tool, writing the skill, and then wiring both into whatever runs your agent loop. With eve, a tool is one TypeScript file and a skill is one markdown file.

import { defineTool } from "eve/tools";
import { z } from "zod";
import { runReadOnlySql } from "../lib/sample-db";

export default defineTool({
  description: "Run a read-only SQL query against the orders and customers tables.",
  inputSchema: z.object({
    sql: z.string().describe("A single read-only SELECT statement."),
  }),
  async execute({ sql }) {
    const { columns, rows } = await runReadOnlySql(sql);
    return { columns, rows: rows.slice(0, 500), truncated: rows.length > 500 };
  },
});

---
description: How this team defines revenue. Load before answering any revenue question.
---

Revenue is recognized net of refunds, over the subscription term.
Weeks are Monday-anchored, in UTC.
Exclude trial and internal accounts from every number.

Notice what is missing. Instead of writing all of the boilerplate to wire these up and register them with your agent, eve handles it for you.

A file's name and place in the tree are its definition. eve picks up the tool and skill at build time, hands the model their descriptions, and the model takes it from there. Just as Next.js turns a folder into a route by owning the routing, eve turns a file into an ability by owning the agent loop.
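
To make "a file's name and place in the tree are its definition" concrete, here is a rough sketch of convention-based discovery. It is an illustration only: `discover`, `Registry`, and the path rules are assumptions, and eve's actual build step may work quite differently. The paths are the ones from the directory listing earlier in the article.

```typescript
// Hypothetical sketch of convention-based discovery; eve's real build step may differ.
type Kind = "tools" | "skills" | "subagents" | "channels" | "schedules";

const KINDS: Kind[] = ["tools", "skills", "subagents", "channels", "schedules"];

type Registry = Record<Kind, string[]>;

function isKind(value: string): value is Kind {
  return (KINDS as string[]).indexOf(value) !== -1;
}

// Maps paths like "agent/tools/run_sql.ts" to registry entries like tools: ["run_sql"].
function discover(paths: string[]): Registry {
  const registry: Registry = {
    tools: [],
    skills: [],
    subagents: [],
    channels: [],
    schedules: [],
  };
  for (const path of paths) {
    const parts = path.split("/").filter((part) => part.length > 0);
    // Only "agent/<kind>/<name>..." entries define an ability.
    if (parts.length < 3 || parts[0] !== "agent") continue;
    const kind = parts[1];
    if (!isKind(kind)) continue;
    // The ability's name is the file or directory name without its extension.
    const name = parts[2].replace(/\.[^.]+$/, "");
    if (registry[kind].indexOf(name) === -1) registry[kind].push(name);
  }
  return registry;
}

const registry = discover([
  "agent/agent.ts",
  "agent/instructions.md",
  "agent/tools/run_sql.ts",
  "agent/tools/post_chart.ts",
  "agent/skills/revenue-definitions.md",
  "agent/subagents/investigator/",
  "agent/channels/slack.ts",
  "agent/schedules/monday-summary.ts",
]);
```

Nothing in the sketch registers a tool by hand; adding a file under `agent/tools/` is enough for it to appear in `registry.tools`.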

Add human-in-the-loop approval

Requiring approval for an action is one field on the tool.

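import { defineTool } from "eve/tools";
import { z } from "zod";
// estimateScanGb is not shown in the source; it is assumed to be a project
// helper that estimates how many gigabytes a query would scan.

export default defineTool({
  description: "Run a read-only SQL query against the warehouse.",
  inputSchema: z.object({ sql: z.string() }),
  // Pause for human approval only when the estimated scan exceeds 50 GB.
  needsApproval: ({ toolInput }) => estimateScanGb(toolInput.sql) > 50,
  async execute({ sql }) {
    // unchanged
  },
});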
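
How a runner might act on a `needsApproval` predicate can be sketched as follows. This is a simplified, synchronous illustration and not eve's implementation: `invoke`, `Tool`, and `Outcome` are hypothetical, and `estimateScanGb` here is a stand-in that treats any query without a WHERE clause as a large scan.

```typescript
// Hypothetical sketch; the types and runner are illustrative, not eve's API.
interface ToolCall<I> {
  toolInput: I;
}

interface Tool<I, O> {
  needsApproval?: (call: ToolCall<I>) => boolean;
  execute: (input: I) => O;
}

type Outcome<O> =
  | { status: "done"; output: O }
  | { status: "awaiting_approval" };

// Runs the tool unless its predicate asks for an approval that has not been granted.
function invoke<I, O>(tool: Tool<I, O>, input: I, approved: boolean): Outcome<O> {
  const gated = tool.needsApproval ? tool.needsApproval({ toolInput: input }) : false;
  if (gated && !approved) {
    return { status: "awaiting_approval" };
  }
  return { status: "done", output: tool.execute(input) };
}

// Stand-in for the article's estimateScanGb: pretend unfiltered scans are large.
function estimateScanGb(sql: string): number {
  return /\bwhere\b/i.test(sql) ? 1 : 120;
}

const runSql: Tool<{ sql: string }, string> = {
  needsApproval: ({ toolInput }) => estimateScanGb(toolInput.sql) > 50,
  execute: ({ sql }) => "ran: " + sql,
};

const small = invoke(runSql, { sql: "SELECT 1 FROM orders WHERE id = 7" }, false);
const large = invoke(runSql, { sql: "SELECT * FROM orders" }, false);
const largeApproved = invoke(runSql, { sql: "SELECT * FROM orders" }, true);
```

The small query runs immediately, the large one stops at `awaiting_approval`, and the same call completes once `approved` is true.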

Let the agent write its own code

The tools you define aren't the ceiling. eve gives your agent a real computer with a shell, so it can run bash, grep, and anything else you'd run in a terminal. When a job calls for code that doesn't exist yet, the agent writes and runs it.

Delegate work to a subagent

An eve agent can also delegate. A subagent is the same shape one level down, a directory inside subagents/ with its own instructions, tools, and sandbox. The parent calls it just like it calls a tool.

Start and interact with your agent

Run the agent locally

To start an eve agent, you run its dev server: eve dev

Test the agent with evals

Talking to the agent proves one run at a time. Evals test your agent the way you test the rest of your software, with scored checks written in files like everything else in the project.

import { defineEval } from "eve/evals";
import { includes } from "eve/evals/expect";

export default defineEval({
  description: "The analyst answers revenue questions by the team's rules.",
  async test(t) {
    await t.send("What was revenue last week?");
    t.completed();
    t.calledTool("run_sql");
    t.check(t.reply, includes("net of refunds"));
  },
});

Ship it

The agent has lived on your laptop long enough. Shipping it is normally the step where the agent work stops and the infrastructure work begins. With eve there is nothing to provision, because the agent is an ordinary Vercel project, and it deploys the way any other frontend or backend does. Use vercel deploy.

Introduce the agent to your team

Getting an agent into Slack used to mean building a Slack app first, including the app config, bot token, event subscriptions, webhook endpoint, and signing secret, all before the agent said a word. With eve, a channel is one command: eve channels add slack.

Put the agent on a schedule

The Monday revenue report should not wait for someone to ask. A schedule is one more file, a cron expression and a handler that starts the agent on its own clock.
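
As an illustration of the cron half of a schedule, the sketch below checks whether a five-field cron expression matches a given moment. The expression `0 9 * * 1` (09:00 UTC on Mondays) is an assumed example, and `cronMatches` is a hypothetical helper that supports only `*` and plain integers; it says nothing about how eve itself parses schedules.

```typescript
// Minimal five-field cron matcher (minute hour day-of-month month day-of-week).
// Supports only "*" and plain integers; ranges, lists, and steps are omitted.
function fieldMatches(field: string, value: number): boolean {
  return field === "*" || Number(field) === value;
}

function cronMatches(expression: string, date: Date): boolean {
  const fields = expression.trim().split(/\s+/);
  if (fields.length !== 5) {
    throw new Error("expected five cron fields");
  }
  const values = [
    date.getUTCMinutes(),
    date.getUTCHours(),
    date.getUTCDate(),
    date.getUTCMonth() + 1, // cron months are 1-12
    date.getUTCDay(), // 0 = Sunday, 1 = Monday
  ];
  for (let i = 0; i < 5; i++) {
    if (!fieldMatches(fields[i], values[i])) return false;
  }
  return true;
}

// Assumed schedule for the Monday summary: 09:00 UTC every Monday.
const mondaySummary = "0 9 * * 1";

const mondayMorning = new Date(Date.UTC(2024, 0, 1, 9, 0)); // 1 Jan 2024 was a Monday
const tuesdayMorning = new Date(Date.UTC(2024, 0, 2, 9, 0));
const mondayLater = new Date(Date.UTC(2024, 0, 1, 10, 0));

const firesMonday = cronMatches(mondaySummary, mondayMorning);
const firesTuesday = cronMatches(mondaySummary, tuesdayMorning);
const firesLater = cronMatches(mondaySummary, mondayLater);
```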

Run the agent like the rest of your software

An agent your team depends on is production software, and a change to its instructions can break it as surely as a change to its code. Because an eve agent is files in a directory, it lives in Git like the rest of your code, and a new prompt, tool, or skill is a commit with a diff, a review, and a history.

How we run Vercel on eve

We run more than a hundred agents in production at Vercel, and they are part of how the company operates every day. Examples include:

  • The data analyst: Handles more than 30,000 questions a month.
  • The autonomous SDR: Works every new lead the moment it comes in.
  • The sales cockpit: Answers pipeline and forecast questions from Snowflake and Salesforce.
  • The support engineer: Solves 92% of tickets on its own.
  • The content agent: Runs a full review pipeline for content.
  • Routing agent: Routes tasks to the correct specialized agent.

Get started

A year ago, agents triggered less than 3% of the deployments on Vercel. Now, they trigger around 29%. The public preview is open today, and the CLI wizard walks you through your first agent in under a minute.

npx eve@latest init my-agent

Everything eve can do is at eve.dev/docs and development happens in the open at github.com/vercel/eve.

AI security cloud api

Vercel Connect

Vercel Connect replaces static, long-lived provider tokens with runtime, task-scoped credentials to prevent secret leakage in agentic workflows.

Summary

What: Vercel launched Vercel Connect, a public beta service that uses OIDC and runtime credential exchange to provide temporary, granular tokens for services like Slack, GitHub, and Linear. It allows developers to remove static environment variables, such as SLACK_BOT_TOKEN, by requesting short-lived permissions during execution.
Why it matters: This marks a move toward 'just-in-time' infrastructure access, where security is defined by specific task requirements rather than persistent, broad-scope permissions that are difficult to rotate and audit.
Takeaway: Install the plugin with 'npx skills add vercel/vercel-plugin --skill vercel-connect' to begin replacing your hardcoded provider secrets with dynamic runtime tokens.

Decoder

  • OIDC (OpenID Connect): An identity layer on top of the OAuth 2.0 protocol that allows applications to verify the identity of a user or service based on the authentication performed by an authorization server.
  • MCP (Model Context Protocol): An open standard that enables AI models to connect to various data sources and tools consistently.

Original Article

Giving your agents access to your tools, data, and services is what makes them useful. As agents perform deeper work across systems, authenticating and authorizing that access becomes central to your application architecture.

Today, agent access is usually granted through long-lived provider tokens stored in your environment variables, provisioned for everything your agent might need. These tokens are shared across every user, never expire, and give your agent full reach across every task, no matter how small the job.

A vault makes that token harder to steal. It doesn't make it less dangerous. The problem is what happens when the token leaks: everything it can touch is now exposed.

We built Vercel Connect to solve this problem. Now in Public Beta, Vercel Connect replaces the stored token with runtime credential exchange. You register a connector once. When your agent has work to do, your app proves its identity to Vercel Connect and gets back a short-lived credential, scoped to the task. Everything you used the token for still works. The agent just requests access each time instead of holding it.

Register a connector once, then reuse it across projects and environments

A connector is a reusable connection between your Vercel team and a provider like Slack or GitHub. You create it once from the dashboard or the CLI, then attach it to the projects and environments that need it, with project-level access controls.

vercel connect create slack --name mybot

The relationship with the provider becomes a single entity you can see and manage, not something scattered across a dozen environment variable panels where a rotation means hunting down every copy.

Your coding agent can run this setup too. Install the vercel-connect skill with npx skills add vercel/vercel-plugin --skill vercel-connect, and it can create and attach connectors for you.

Request scoped tokens at runtime

With a connector in place, the agent asks for a credential only when it has work to do. The @vercel/connect SDK returns a token you use immediately against the provider API, and no provider secret lives in your app.

import { getToken } from '@vercel/connect';

const token = await getToken('slack/mybot', {
  subject: { type: 'app' },
});

Tokens are short-lived, with a lifetime that depends on the provider. The SDK refreshes them automatically, so you never rotate a secret by hand. That leaves one question. If your app holds no secret, what proves it's allowed to ask?
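
Before turning to that question, the refresh behavior can be pictured as a small cache that asks for a new token once the current one is close to expiry. This is a sketch under assumptions, not the @vercel/connect implementation: `TokenCache`, the 60-second lifetime, and the 5-second safety margin are all hypothetical.

```typescript
// Hypothetical sketch of refresh-on-expiry caching; not the @vercel/connect implementation.
interface IssuedToken {
  value: string;
  expiresAt: number; // epoch milliseconds
}

class TokenCache {
  private current: IssuedToken | null = null;
  private readonly issue: (now: number) => IssuedToken;
  private readonly skewMs: number;

  constructor(issue: (now: number) => IssuedToken, skewMs: number) {
    this.issue = issue;
    this.skewMs = skewMs;
  }

  // Returns the cached token, or requests a new one when it is missing or near expiry.
  get(now: number): string {
    if (this.current === null || now >= this.current.expiresAt - this.skewMs) {
      this.current = this.issue(now);
    }
    return this.current.value;
  }
}

// A fake issuer that hands out tokens valid for 60 seconds.
let issued = 0;
const cache = new TokenCache((now) => {
  issued += 1;
  return { value: "token-" + issued, expiresAt: now + 60000 };
}, 5000);

const first = cache.get(0); // nothing cached yet, so a token is issued
const second = cache.get(30000); // still fresh, so the cached token is reused
const third = cache.get(56000); // within 5 s of expiry, so a new token is issued
```

The caller only ever calls `get`; it never sees or schedules a rotation.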

The app proves its identity with OIDC

The proof is an identity your app already has. Every deployment on Vercel gets an OIDC identity, and when your app or agent requests a token, the SDK presents that identity to Vercel Connect. Vercel Connect verifies it, checks that the project and environment are allowed to use the connector, and returns the provider credential. That round trip is the runtime credential exchange.
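
The authorization step in that exchange can be sketched roughly as below. The field names (`project`, `environment`, `attachments`) are assumptions made for illustration, and the sketch starts after the OIDC token's signature has already been verified, which it does not show.

```typescript
// Hypothetical sketch of the authorization check; field names are assumptions.
// It assumes the OIDC token's signature was already verified upstream.
interface DeploymentIdentity {
  project: string;
  environment: "development" | "preview" | "production";
}

interface Connector {
  id: string;
  // Project and environment pairs this connector is attached to.
  attachments: { project: string; environment: string }[];
}

type ExchangeResult =
  | { ok: true; credential: string }
  | { ok: false; reason: string };

function exchange(identity: DeploymentIdentity, connector: Connector): ExchangeResult {
  const allowed = connector.attachments.some(
    (a) => a.project === identity.project && a.environment === identity.environment
  );
  if (!allowed) {
    return { ok: false, reason: "connector not attached to this project and environment" };
  }
  // A real service would mint a short-lived provider credential here.
  return { ok: true, credential: "short-lived-credential-for-" + connector.id };
}

const connector: Connector = {
  id: "slack/mybot",
  attachments: [{ project: "my-app", environment: "production" }],
};

const fromProduction = exchange({ project: "my-app", environment: "production" }, connector);
const fromPreview = exchange({ project: "my-app", environment: "preview" }, connector);
```

The production deployment receives a credential; the preview deployment of the same project is refused because the connector was never attached to it.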

The same identity is available during local development through vercel link and vercel env pull, and outside Vercel, the SDK accepts a Vercel access token. Either way, there is no provider secret in your app to leak, commit, or copy between environments.

Scope each token to exactly what the task needs

Not every task needs the same reach, even within a single agent. One step might read a repository while the next opens an issue. Each requests exactly the access it needs, and the request itself sets limits. A request can include:

  • Provider scopes
  • An installation ID
  • Resource restrictions
  • Provider-specific authorization details

GitHub is the sharpest example because it can restrict a token to specific repositories and permissions.

import { getToken } from '@vercel/connect';

const token = await getToken('github/mybot', {
  subject: { type: 'app' },
  authorizationDetails: [
    {
      type: 'github_app_installation',
      repositories: ['myorg/repo1'], 
      permissions: ['contents:read'], 
    },
  ],
});

The deployment agent can read that one repository and do nothing else. A fine-grained GitHub App install can be narrow too, but an install is a standing grant, set up once and trusted from then on. This limit exists for one request, one task. Least privilege becomes the shape of the request.
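
What "read that one repository and do nothing else" means in practice can be shown with a tiny check. The enforcement really happens at GitHub; `Grant` and `permits` are hypothetical names that only mirror the `repositories` and `permissions` fields from the request above.

```typescript
// Hypothetical sketch: how a narrowly scoped grant limits what a token may do.
// Real enforcement happens at the provider; this only mirrors the request fields.
interface Grant {
  repositories: string[];
  permissions: string[]; // e.g. "contents:read"
}

function permits(grant: Grant, repository: string, permission: string): boolean {
  return (
    grant.repositories.indexOf(repository) !== -1 &&
    grant.permissions.indexOf(permission) !== -1
  );
}

const grant: Grant = {
  repositories: ["myorg/repo1"],
  permissions: ["contents:read"],
};

const canReadRepo1 = permits(grant, "myorg/repo1", "contents:read");
const canReadRepo2 = permits(grant, "myorg/repo2", "contents:read");
const canWriteIssues = permits(grant, "myorg/repo1", "issues:write");
```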

Act on behalf of a specific user, with per-user token scoping

A shared bot token gives every user's request the same identity and reach. Vercel Connect lets you set that identity. Switch subject from the app to a named user, and the token acts on that user's behalf, scoped to what that user authorized.

import { getToken } from '@vercel/connect';

const token = await getToken('linear/mybot', {
  subject: { type: 'user', id: 'user_123' },
});

When a user first grants access, startAuthorization runs the consent flow through a callback URL, a webhook, or a device code. After that, the agent requests tokens as that user.

Contain access by environment, and revoke it when you need to

A connector is attached to the projects and environments you choose, so you can run a separate connector for development, preview, and production instead of pointing one at all three. When each environment has its own connector, with its own authorization grant and scopes, a credential compromised in development cannot be replayed against production.

Separate connectors limit where a credential works, but they don't pull back access already issued. That's normally the painful part. With a stored token, that means a rotation. You mint a new secret, update every place the old one lived, and redeploy whatever depended on it. With Vercel Connect, you revoke the connector's tokens, either your own or all of them.

# Revoke just your own tokens for a connector
vercel connect revoke-tokens slack/mybot --my-tokens
# Or revoke every token, across all users and installations
vercel connect revoke-tokens slack/mybot --all-tokens

What revoking does depends on the provider. Where the provider supports revocation, Vercel Connect revokes the token at the provider. Where it does not, Vercel Connect stops issuing new tokens for that grant, and a token already issued stays valid at the provider until it expires.

Drive event-driven agents from verified Slack triggers

So far, your agent has been the one reaching out. It requests a token and calls a service when it has work to do. Triggers run the other way. A connected service sends an event to your app, and your agent responds.

Vercel Connect receives the provider's webhook, verifies it, and forwards it to your project. Trigger forwarding is in beta and supports Slack, GitHub, and Linear today. A Slack connector can forward its verified webhooks to up to three of your projects, so a message in Slack can wake an agent that acts on it.

The flow runs end to end without a provider secret in your app:

  • A user posts a message in Slack.
  • Slack sends the event to Vercel Connect.
  • Vercel Connect verifies the event against the Slack signing secret it holds, then forwards it to your Vercel app, re-attested with its OIDC identity.
  • Your app verifies that attestation, then requests a scoped runtime token.
  • The agent acts and responds.

The Slack signing secret does not disappear. It moves server-side to Vercel Connect, which verifies the upstream webhook and re-signs the forwarded request with an identity your app can check.
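
For context, the upstream check that moves server-side is Slack's standard request signing: an HMAC-SHA256 over `v0:<timestamp>:<body>`, compared against the `X-Slack-Signature` header. The sketch below shows that scheme with Node's built-in crypto module. The function names and the dummy secret are illustrative, and it does not show how Vercel Connect re-attests the forwarded request.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Computes a Slack-style signature: "v0=" + hex(HMAC-SHA256(secret, "v0:<ts>:<body>")).
function signSlackRequest(secret: string, timestamp: string, body: string): string {
  const base = "v0:" + timestamp + ":" + body;
  return "v0=" + createHmac("sha256", secret).update(base).digest("hex");
}

function verifySlackRequest(
  secret: string,
  timestamp: string,
  body: string,
  signature: string,
  nowSeconds: number
): boolean {
  // Reject stale or malformed timestamps to limit replay (five-minute window).
  const sentAt = Number(timestamp);
  if (!isFinite(sentAt) || Math.abs(nowSeconds - sentAt) > 300) return false;

  const expected = Buffer.from(signSlackRequest(secret, timestamp, body));
  const received = Buffer.from(signature);
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Dummy values for illustration only.
const secret = "test-signing-secret";
const timestamp = "1700000000";
const body = '{"type":"event_callback"}';
const signature = signSlackRequest(secret, timestamp, body);

const genuine = verifySlackRequest(secret, timestamp, body, signature, 1700000010);
const tampered = verifySlackRequest(secret, timestamp, body + " ", signature, 1700000010);
const stale = verifySlackRequest(secret, timestamp, body, signature, 1700001000);
```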

Vercel Connect meets your code where it already is

Underneath everything is one call. Whether your agent is built on the AI SDK, runs as a background job, or is a loop you wrote yourself, it asks for a token the same way, with getToken. Around that call are adapters for the stack you already run.

Access becomes something you request, scoped to the task

An agent becomes more useful the more it can reach, which is exactly why access is the part to get right. Every system the agent can touch is a system someone could reach through a leaked token. With runtime credential exchange, nothing is provisioned for everything. Nothing is shared by every user. Nothing lasts forever. Nothing reaches past the task in front of it.

Frequently asked questions

What is Vercel Connect?
Vercel Connect lets your agents and services access external systems on behalf of your users and teams. Instead of storing provider credentials in long-lived environment variables, you request user-authorized tokens at runtime with project-level access controls.

What problem does Vercel Connect solve?
It removes long-lived third-party secrets from your runtime while still letting agents act on external APIs. You register a connector for a provider, link it to projects and environments, and request provider tokens at runtime.

When should I use Vercel Connect instead of Integrations?
Use Vercel Integrations for marketplace-managed installs and provider-managed products in the Vercel Marketplace. Use Vercel Connect when you need delegated runtime credentials and user authorization for agent workflows.

Which connectors are available?
Vercel Connect supports generic OAuth and API key connectors, plus dedicated connectors for Slack, GitHub, Linear, Discord, Notion, Salesforce, Figma, and Snowflake.

How does pricing work?
Pricing is based on token requests. The Hobby plan includes 5K token requests per month at no additional cost. On Pro and Enterprise plans, token requests are billed at $3 per 10K token requests.
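
As a rough worked example of the Pro and Enterprise rate, the sketch below assumes the charge scales linearly with no included allowance, which the answer above does not confirm. `estimateMonthlyCostUsd` is a hypothetical helper.

```typescript
// Sketch of the quoted rate: $3 per 10K token requests on Pro and Enterprise.
// Assumes linear proration and no included allowance; the source does not confirm either.
function estimateMonthlyCostUsd(tokenRequests: number): number {
  const usdPer10k = 3;
  return (tokenRequests / 10000) * usdPer10k;
}

// 250,000 requests is 25 blocks of 10K, so 25 x $3.
const cost = estimateMonthlyCostUsd(250000);
```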

What are the current Beta limitations?
Trigger forwarding is limited to Slack, GitHub, and Linear, connector branding fields cannot be fully cleared after you set them, and token revocation, token lifetime, and scope granularity depend on provider support.

Tech ai agents robotics

AI coding agents taught robots how to install GPUs and cut zip ties

Nvidia researchers developed the ENPIRE harness, allowing AI coding agents to autonomously train robots to perform complex hardware assembly tasks like GPU installation.

Summary

What: The Nvidia GEAR lab, in collaboration with CMU and UC Berkeley, created the ENPIRE harness. It enables AI coding agents (such as Claude Code and OpenAI's Codex) to debug robot behavior, analyze logs, and iteratively improve training policies. The agents achieved a 99% success rate on tasks including inserting GPUs and cutting zip ties.
Why it matters: This represents a shift toward self-improving physical infrastructure where AI agents, rather than human engineers, manage the trial-and-error cycle of robotic skill acquisition.

Deep Dive

  • ENPIRE Harness: A four-module framework for AI agents covering task reset/verification, policy refinement, multi-robot parallel evaluation, and failure log analysis.
  • Autonomous Training: Agents ingest research papers and system logs to iterate on robot control code without human intervention.
  • Performance Gains: An eight-agent team reached 99% success on the 'Push-T' task in two hours, compared with three hours for a four-agent team and nearly five hours for a single agent.
  • Hardware Limitations: Robots often sit idle while agents process LLM inputs or wait for parallel training sessions.
  • Open Source: Nvidia plans to open-source the harness to enable home-based robotic research labs.

Decoder

  • Push-T: A common benchmark task in robotics where a model must manipulate a T-shaped object to a specific target position.
  • Agentic Harness: Software infrastructure that provides LLMs with persistent memory, tool-use capability, and constrained feedback loops to perform multi-step tasks autonomously.
  • Quantization: A technique used to reduce the memory footprint and compute requirements of LLMs by using lower-precision numerical formats for model weights.

Original Article

What happens when you give AI coding agents a lab full of robotic arms, some compute resources, and a “generous token budget” for teaching the robots various tasks? The agents can apparently figure out a training regimen that teaches the robots to successfully cut zip ties and even insert GPUs into thin sockets on motherboards.

That glimpse into how AI can act in a fully autonomous way to automate robot training was made possible by a new agent harness framework—software that wraps around AI models to enable their use of various tools while also providing capabilities such as memory, context, constraint, and feedback loops. That agentic harness, called ENPIRE, was developed by robotics researchers at the Nvidia GEAR (Generalist Embodied Agent Research) lab alongside collaborators from Carnegie Mellon University in Pittsburgh and the University of California, Berkeley.

“A part of our NVIDIA GEAR lab now self-improves tirelessly overnight,” wrote Jim Fan, director of AI at NVIDIA, in a LinkedIn post. “We just read the reports in the morning.”

Fan also jokingly described the goal of such AI-directed robot training, saying, “We all take a holiday and Jensen wouldn’t even notice,” in reference to Nvidia founder and CEO Jensen Huang. But it’s not only Nvidia robotics researchers who could benefit—Fan said the team would be open-sourcing everything so anyone can host their own “self-running robot lab at home.”

The ENPIRE harness has four modules that enable AI coding agents to perform automatic reset and verification on tasks, refine policies that guide robotic behavior, evaluate such policies across multiple physical robots working in parallel, and address failures by analyzing logs, ingesting research papers, and improving training infrastructure and algorithm code. More technical details are available in the research paper uploaded on June 16, 2026.

The harness was tested with three AI coding agents: OpenAI’s Codex with GPT-5.5, Anthropic’s Claude Code with Opus 4.7, and Moonshot AI’s Kimi Code with Kimi K2.6. Teams of the coding agents independently developed different algorithmic approaches to robot training, tested them in real-world experiments, and then retained whatever changes helped raise the overall success rate over repeated cycles of self-directed testing.

The success and limits of AI-directed robot training

Equipped with ENPIRE, the AI coding agents developed strategies for robotic self-improvement that achieved a 99 percent success rate across several manipulation tasks, including the standard “Push-T” task that challenges robots to move a T-shaped block to fit a target position on top of a table. Other tasks included organizing pins in a pin box, tying and cutting zip ties, and placing a GPU into a motherboard before unplugging the graphics card again to reset for the next trial.

The most promising result may have come from the pin insertion and organization task. In that robot-training scenario, AI coding agents achieved nearly 100 percent success faster than a “frontier human-in-the-loop method” developed by many of the same human researchers.

Such experiments also showed how larger teams of up to eight AI coding agents could achieve high success rates in robot training more quickly than smaller four-agent teams or single agents working alone. For example, the eight-agent team achieved 99 percent success on the Push-T task in two hours of research time, compared to the four-agent team requiring three hours and the single-agent team requiring nearly five hours.
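
Those figures imply diminishing returns per agent, which a quick calculation makes visible. The sketch rounds "nearly five hours" up to 5, so the speedups are approximate upper bounds; the variable names are illustrative.

```typescript
// Hours to reach 99% on Push-T, as reported above; "nearly five hours" is rounded to 5.
const hoursByTeamSize = [
  { agents: 1, hours: 5 },
  { agents: 4, hours: 3 },
  { agents: 8, hours: 2 },
];

const baseline = hoursByTeamSize[0].hours;

// Speedup is relative to the single agent; efficiency is speedup per agent.
const scaling = hoursByTeamSize.map((row) => {
  const speedup = baseline / row.hours;
  return { agents: row.agents, speedup: speedup, efficiency: speedup / row.agents };
});
```

Under these numbers, eight agents are about 2.5 times faster than one, or roughly 0.31 of a single agent's throughput each.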

But the human researchers also discovered some crucial limitations when unleashing AI coding agents as autonomous robot trainers. The robots often sat idle and unused while the coding agents were busy “reading logs, writing code, debugging, or waiting for the language-model backbone.” Larger teams of coding agents also spent more time summarizing each other’s ideas and less time actually using the robots, and the coding agents sometimes failed to make full use of available compute resources when launching parallel training sessions.

The faster success rates enabled through more agents and robots working together also came at the cost of higher token consumption—a noteworthy consideration at a time when AI developers such as Anthropic are weighing pricing changes that would significantly increase the token-related costs of using AI services.

Flush with cash from the AI boom, Nvidia has been busily pushing its vision for physical AI through multiple robotics initiatives. On May 31, the company announced a partnership with the prominent Chinese robotics company Unitree to provide a “Reference Humanoid Robot” for research labs developing general-purpose AI-powered robots.

During a whirlwind tour of South Korea in early June, Nvidia founder and CEO Jensen Huang also met with Hyundai Motor Executive Chair Chung Euisun to discuss scaling up the mass manufacturing of AI-powered robots. Hyundai Motor Group owns the US robotics company Boston Dynamics, which is already well-known for its four-legged “robot dog” Spot and has been working to commercialize its Atlas humanoid robot.

Design ai agents

What is AX?

John Maeda defines 'Agent Experience' (AX) as a fundamental shift where screens transition from control cockpits to evaluation windows for AI-generated actions.

Summary

What: Maeda argues that while AI reduces the 'Gulf of Execution' by handling tasks, it shifts the designer's burden to the 'Gulf of Evaluation,' where the primary task becomes verifying and auditing agent intent.
Why it matters: This signals that future interface design must prioritize visibility, auditability, and trust-building tools over the traditional design focus on input affordances and manual navigation.

Deep Dive

  • From UX to AX: User Experience (UX) was about operating a cockpit; Agent Experience (AX) is about monitoring an AI agent's actions.
  • Collapsed Execution: AI agents make command execution effortless, effectively nullifying the need for complex menu-based interfaces.
  • Widened Evaluation: The critical challenge is now ensuring the user can judge if the agent performed the correct task.
  • The Screen as Window: In AX, the UI is not a place for 'work' but a place for 'judgment' and steering.
  • Accessibility as Foresight: Users with experience navigating non-visual or command-driven interfaces (e.g., blind users) possess skills that map naturally to agentic interaction.
  • Two-Handed Design: True AX requires both a command interface (for intent) and a canvas interface (for inspection and verification).

Decoder

  • Gulf of Evaluation: A concept by Don Norman describing the effort required to interpret the state of a system to determine if the desired goal was achieved.
  • Affordances: The perceived properties of an object that determine how it can be used, like a handle 'affording' being pulled.

Original Article

For most of my career I designed the cockpit.

I made the menus and the tabs and the flows. I arranged the dashboards and the confirmation dialogs and the little blinking things that tell you something is happening. I was proud of this work. And the better I made it, the more considered the controls became, the more I quietly assumed that a human would always be the one flying the plane. That assumption is the thing I want to talk about. Because it is ending, and it ended from a direction I never thought to watch.

It came at me from the dark.

What I Learned From (Temporary) Blindness

Years ago, during a layover in Frankfurt, I wandered into a museum called DIALOGUE MUSEUM. The experience begins when the lights go out. Not dimmed. Out. You are led through total blackness by a guide who is blind, through simulated parks and streets and restaurants and one noisy little public square. Within seconds I was gripping the handrail. A moment later I was holding the hands of strangers, all of us shuffling forward like a single nervous animal. And our guide? She moved through it effortlessly. Confidently. Quickly. The instant the lights disappeared, the thing I had filed away as a disability became, right in front of me, an ability. I have carried that afternoon around for years and never quite known where to set it down.

Later I worked alongside a blind colleague who used his iPhone faster than anyone I have ever met. He ran VoiceOver at a speed where the speech was, to my ears, a single unbroken hiss. To him it was plain language. This was years before any of our machines could listen and talk the way they do now. And because he didn’t need any screen brightness, which is, after all, mostly for the rest of us, his battery lasted nearly three days on a charge. Three days! I was lucky to make it to dinner.

At the time I thought of both of these instances as examples of “accessibility.” I had a tidy folder in my mind for them. Now, I think the folder was mislabeled. What I had actually seen, twice, was an early glimpse of AX.

UX, user experience, was the cockpit I spent my life building. You learned how the software worked so that you could operate it. AX, agent experience, asks a different question entirely. Not “how do I work this thing?” but “what do I actually want, and can I just say so?” AX is about teleporting to the goal.

Years ago I wrote the foreword to Erika Hall’s lovely book Conversational Design (2018), and one of her ideas has never left me. Conversation is not some futuristic interface we are busy inventing. It is the oldest interface we have. No baby is born knowing how to use a dropdown menu. A baby is born knowing how to express intent. A cry, a reach, a look. As parents we need to know how to read the feedback that comes back. The breakthrough of chat, then, was not that our computers suddenly learned to talk. It was that intent finally moved to the foreground. We stopped struggling with operating a UI. We started telling it what we want instead.

The Two Gulfs: Execution and Evaluation

That small move changed where the user pain lives. The cognitive scientist Don Norman gave us two terms for the troubled relationship between a person and a machine. The first is the Gulf of Execution. Can I figure out how to do the thing? The second is the Gulf of Evaluation. Can I tell whether the thing got done?

For decades, design was almost entirely a war on the first gulf. It is why Norman is famous for doors: the door you shove when you were meant to pull, the one that shames you in front of other people. All our menus and tooltips and onboarding flows were, at heart, just better door handles. Designers became experts at designing the right kind of doorknobs for users. Also called “affordances.”

Now watch what agents do to that door. The Gulf of Execution collapses toward zero. You no longer need the handle, because you no longer work the door yourself. You say “I’m going outside,” and you are outside.

But the second gulf, the quiet one we underinvested in for forty years, swings wide open. Did the agent actually do what I meant? How do I inspect it? How do I steer it? How do I trust it? The work did not disappear. It moved. From doing to judging.

So the screen does not vanish in this new world, the way some people like to predict. It simply changes jobs. In UX the screen was where the work happened, the surface you pushed and dragged. In AX the screen becomes where judgment happens, the surface you read to decide whether to trust what just got done. The cockpit becomes a window.

And this is why chat, all by itself, is like working with one hand tied behind your back. Language is a magnificent hand. It lets you ask for nearly anything. But asking with no place to inspect the result, no canvas to compare and steer and verify against, is only half a craft. Chat plus a canvas gives you both hands free at last. One hand to say what you want. One hand to see whether you got it. One hand for each gulf. That two-handedness, that ambidexterity, is what I think AX is really about.

It’s Time To Think Differently

Which brings me back to the dark.

The people best prepared for this new world may not be the ones we would guess. They may not be the power users who mastered the old cockpit, the keyboard-shortcut wizards and the dashboard gurus. Research has long shown that many blind users process synthetic speech far faster than sighted listeners. Often at two to three times the pace of ordinary conversation, at rates the rest of us simply cannot follow. Years spent navigating computing through language, structure, sequence, and memory, rather than through visual layout, turn out to be exactly the muscles this moment asks for.

In that Frankfurt museum, the lights went out and expertise turned over in an instant. The ones who struggled were not the ones we are trained to imagine struggling. The ones who led were not the ones we are trained to imagine leading.

The age of agents, I suspect, will feel a little like that room. The lights we are used to designing by are going down. And some of the instincts we long treated as peripheral are walking, right now, to the center of the floor. They are offering us their hand.

—JM

Data ai infrastructure

Data Processing is Becoming a GPU Workload

Data processing is shifting from CPU-bound SQL to GPU-heavy inference, forcing organizations to build heterogeneous pipelines that handle API-based I/O and streaming inference.

Summary

What: Robert Nishihara of Anyscale argues that modern data processing—like embedding documents, transcribing media, and model-driven curation—is now an inference-heavy task. This renders traditional CPU-based, stage-gated ETL architectures inefficient, requiring specialized GPU compute and streaming execution models.
Why it matters: This transition signifies the end of the homogeneous CPU cluster era for data processing, as the value of data now depends on the inference results (embeddings, labels) rather than just structured relational transformations.
Takeaway: If you are processing multimodal data, evaluate moving from Spark-style batch processing to streaming architectures that overlap CPU and GPU stages and absorb the highly variable per-row latency of inference.

Deep Dive

  • Three Shifts: Moving from tabular to multimodal data, from SQL to inference, and from CPU-centric to GPU-centric compute.
  • GPU Utilization: Forcing heterogeneous CPU-, memory-, and GPU-bound stages onto a single homogeneous instance shape leaves the most expensive hardware underutilized.
  • Streaming Execution: Inference-heavy pipelines require streaming rather than bulk-synchronous processing to prevent GPU idle time during CPU-bound tasks.
  • I/O Complexity: Modern pipelines increasingly rely on API calls, requiring advanced retry/timeout handling and concurrency control.

Decoder

  • Inference: The process of running a trained machine learning model on new data to generate predictions or insights.
  • ETL: Extract, Transform, Load; the traditional process of moving data from source to destination after manipulation.
  • Multimodal: Data sets that contain multiple forms of information, such as text, images, video, and audio.

Original Article

For decades, the mental model for a data pipeline was straightforward. You took raw data, parsed it, filtered it, joined it, and wrote the result somewhere useful. The dominant abstraction was tables, the dominant language was SQL, and the compute substrate was a cluster of homogeneous CPU machines.

That world is not going away.

But a structural shift is underway across the data landscape. A growing fraction of high-value data processing no longer looks like traditional ETL. It is moving to GPUs.

The logic driving this is simple. Vast quantities of actionable, high-value information are stored as unstructured, multimodal data. A modern pipeline might read every executed corporate contract to identify hidden business risks, extract product insights from video recordings of external customer meetings, or analyze internal productivity bottlenecks across Slack and email threads. It might embed every document, transcribe every conversation, or run a vision-language model over petabytes of robotic sensor data.

Historically, companies could store this information, search metadata around it, or build narrow task-specific systems, but they could not deeply understand the raw data at scale. You are not going to run a standard SQL query on a raw video or understand a robot’s trajectory with a group-by statement.

The way to process this kind of data is with models. Today’s multimodal models and embedding models make it possible to search, extract, and structure data of all types. This means running inference, and inference runs on GPUs. As a result, data processing is becoming inference heavy, and therefore GPU heavy.

Existing big data systems were built for an era of computing on homogeneous CPU machines. This shift is introducing a new category of systems challenges. For organizations that solve them, the opportunity is to extract value from fundamentally new sources of data.

The three shifts behind GPU data processing

Three related shifts are happening at once.

  • Tabular to multimodal: Historically, most data that businesses could effectively work with was nicely structured in tables. Unstructured data such as video, audio, PDFs, and sensor readings was too unwieldy to process at scale, so it was dumped into storage and ignored. The rich, unstructured formats that used to be inaccessible programmatically are suddenly the primary sources of new insights.
  • SQL to inference: SQL is the primary way that companies manipulate tabular data. It is incredibly powerful, but it is not the right tool for working with a video or a PDF or robotic sensor data or genomic sequence data. For these highly unstructured data types, the right tool is model inference. Inference is becoming the central subroutine in complex data pipelines that fuse model execution with regular processing.
  • CPUs to GPUs: Multimodal data processing is inference heavy, and inference runs on GPUs. While the scale of the CPU processing is often much greater, the GPU stages are frequently much more costly and therefore more important to optimize.

Inference creates structure from unstructured data. In many pipelines, GPU data processing does not replace traditional data processing. Instead, it makes traditional data processing possible on new sources of data. Inference is used to extract labels, embeddings, classifications, summaries, entities, or structured records from unstructured data. Once that structure exists, SQL engines, Spark jobs, vector databases, and other traditional tools become useful downstream.

Why Now?

Two underlying trends are accelerating the shift toward GPU data processing.

Data Curation is Now Model-Driven

As model quality increases, the marginal utility of low-quality data decreases, and the importance of improved curation techniques grows. While traditional curation relied on simple heuristic filters such as word frequencies, excessive repetition, or length constraints, modern curation is increasingly model based. The process of judging the quality of data, or of rewriting it to improve that quality, is an inference task.

Data quality is a moving target. The quality level of data that was sufficient to train one model may not be sufficient to train a next-generation model. As model quality increases, data quality must increase in tandem. As this quality bar continues to rise, the amount of inference used broadly across training data preparation and curation will only grow with it.

Scaling with Compute, Not Just Data Volume

We are used to thinking about scaling a training run by increasing the volume of training data. Data scale still matters, but it is only one ingredient. Compute is the other ingredient, and a growing number of techniques are essentially methods for turning compute into high-quality data.

This happens in a few ways:

  • Model-driven curation: The quality refinement process described above.
  • Synthetic data and simulation: Using models or simulators to generate new data for training.
  • Reinforcement learning: A data-efficient technique for using a model in conjunction with an environment, simulator, tool, or evaluator to generate training data.
  • Reasoning and continual improvement loops: Performing large quantities of reasoning, evaluation, or execution, and transforming those traces into lessons that can be fed back into the model’s context.

These methods turn inference into data generation. This trend will drive an even greater need for GPUs across the entire data processing lifecycle.

Why this is not just Spark on GPUs

Forcing modern, inference-heavy AI workloads into existing big data architectures exposes severe limitations. These are the types of systems challenges we designed Ray and Anyscale to solve.

Homogeneous clusters cause underutilization

AI data processing is highly heterogeneous. A standard pipeline might have preprocessing operations that are CPU-bound or memory-bound, while the inference step is GPU-bound. Even within inference itself, disaggregating prefill and decode means the prefill phase may be bound by GPU compute, while the decode phase is bound by GPU memory bandwidth. Putting these operations into a single pipeline requires disaggregating the stages of compute, selecting the right compute shape for each stage, and right-sizing each compute pool. Attempting to coerce these diverse workloads into a single, homogeneous instance shape can lead to severe underutilization of your most expensive hardware.
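
As a rough illustration of right-sizing each compute pool, the sketch below works out how many workers each stage needs to sustain one target throughput. The stage names and per-worker rates are invented for the example, not measurements from any real pipeline.

```python
import math

# Hypothetical per-worker throughput in rows/second (illustrative numbers only).
STAGE_RATES = {
    "decode_video_cpu": 40.0,
    "embed_gpu": 250.0,
    "write_results_cpu": 500.0,
}


def workers_needed(target_rows_per_sec: float, rates: dict[str, float]) -> dict[str, int]:
    """Return the worker count each stage needs to sustain the target rate."""
    return {
        stage: math.ceil(target_rows_per_sec / rate)
        for stage, rate in rates.items()
    }


plan = workers_needed(1000.0, STAGE_RATES)
print(plan)
```

With these made-up rates, the CPU decode pool needs roughly six times as many workers as the GPU pool. A single instance shape sized for one stage would either starve the GPUs or leave them idle.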

Pipelines need streaming execution, not stage barriers

Traditional bulk synchronous parallel systems, like Apache Spark, proceed one stage at a time and often materialize intermediate data. GPU data processing is most efficient when different stages of compute overlap. Proceeding one stage at a time with strict barriers means that your GPUs may sit idle while the CPU stages run. Also, GPU nodes and GPU memory are limited resources and so materializing data between stages may not even be an option.
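
A minimal sketch of the streaming idea, using only the Python standard library: a CPU stage hands rows to an inference stage through a small bounded queue, so the two stages overlap and nothing is fully materialized in between. The stage functions are placeholders, and this is not how Ray or any particular engine implements it.

```python
import queue
import threading

SENTINEL = object()  # marks the end of the stream


def cpu_stage(rows, out_q):
    """Preprocess rows one at a time and pass each downstream immediately."""
    for row in rows:
        out_q.put(row.strip().lower())  # blocks while the queue is full
    out_q.put(SENTINEL)


def gpu_stage(in_q, results):
    """Placeholder for inference: consumes rows as soon as they arrive."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        results.append((item, len(item)))  # stand-in for a model call


def run_streaming(rows, max_in_flight=2):
    """Run both stages concurrently with at most max_in_flight buffered rows."""
    q = queue.Queue(maxsize=max_in_flight)  # bounded buffer between stages
    results = []
    producer = threading.Thread(target=cpu_stage, args=(rows, q))
    consumer = threading.Thread(target=gpu_stage, args=(q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results


out = run_streaming(["  Alpha ", "BETA", " Gamma"])
print(out)
```

The bounded queue is the design choice that matters here: the consumer starts work after the first row instead of after the whole stage, and the producer is throttled when the consumer falls behind.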

Extreme stragglers are the norm

In traditional data processing, each stage of computation is relatively predictable. In an AI pipeline, that predictability disappears. A single stage may consist of LLM inference with highly variable input and output lengths. Alternatively, it could involve an entire agentic loop, such as cloning a codebase, patching the code, compiling it, and running a test suite. In these scenarios, the amount of time it takes to process different rows can vary wildly.

I/O now means APIs

In AI pipelines, I/O often means calling external APIs, vector databases, or models hosted in other clusters. Handling these API-bound stages requires heavy multithreading or asynchrony. The system needs high concurrency without overwhelming downstream services. It needs backpressure to avoid global rate limits. It needs retries and timeout handling. And it needs to compose these API-bound stages with CPU and GPU stages in the same pipeline.
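
The requirements above (bounded concurrency, timeouts, retries) can be sketched with `asyncio` from the standard library. The `flaky_api` coroutine is a hypothetical stand-in for an external service, and the limits are arbitrary. A semaphore caps in-flight calls but is not a full rate limiter.

```python
import asyncio
import random


async def call_with_retries(fn, arg, *, sem, timeout_s=1.0,
                            max_attempts=3, base_delay_s=0.01):
    """Call an async function under a concurrency cap, with timeout and backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            async with sem:  # cap the number of in-flight requests
                return await asyncio.wait_for(fn(arg), timeout=timeout_s)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter, taken outside the semaphore
            # so a sleeping retry does not hold a concurrency slot.
            delay = base_delay_s * 2 ** (attempt - 1) * (1 + random.random())
            await asyncio.sleep(delay)


async def main():
    calls = {}

    async def flaky_api(x):
        """Hypothetical remote service: fails the first call for each input."""
        calls[x] = calls.get(x, 0) + 1
        if calls[x] == 1:
            raise ConnectionError("transient failure")
        await asyncio.sleep(0)
        return x * 2

    sem = asyncio.Semaphore(4)
    tasks = (call_with_retries(flaky_api, i, sem=sem) for i in range(10))
    results = await asyncio.gather(*tasks)
    return results, calls


results, calls = asyncio.run(main())
print(results)
```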

Case Studies in the Wild

This pattern is visible across the Ray community and Anyscale users. Industry leaders have already rebuilt their stacks for this shift.

Multimodal data curation

  • Netflix runs complex multimodal data curation.
  • Alibaba built Data-Juicer, a foundation model data preparation system.
  • Nvidia developed NeMo Curator, an open-source framework for preprocessing and curating text, audio, images, and video.

Audio, video, and image processing

  • ByteDance runs massive video and audio pipelines, as well as multimodal embedding computations.
  • Apple manages petabyte-scale multimodal data processing pipelines.
  • xAI runs large-scale image and video processing.
  • Runway powers its generative video processing pipelines.

Robotics and autonomy

  • Motional processes petabytes of autonomous drivelogs.
  • Physical Intelligence orchestrates its robotics training data preparation.
  • Bedrock Robotics manages its training data preparation pipelines.

Large-scale batch inference

  • Roblox orchestrates batch inference workloads at scale.
  • Pinterest achieves order-of-magnitude cost reductions for their batch inference workloads.
  • Notion computes large-scale document embeddings.
  • Applied Intuition runs complex batch inference workloads.

Conclusion

As organizations begin extracting more value from various forms of multimodal data, they will begin collecting far more of it. As the data volume grows, the architectural gravity of data engineering will continue to shift toward accelerated, heterogeneous hardware.

Data backend database postgresql

The NULL in your NOT IN

Using NOT IN in PostgreSQL triggers three-valued logic traps that can silently return zero rows if any NULL values exist in your data.

Summary

What: In SQL, NOT IN is treated as a series of inequality checks (AND), meaning a single NULL value causes the entire expression to evaluate to 'unknown' (neither true nor false), resulting in empty result sets. While PostgreSQL 19 will improve optimizer handling for non-nullable columns, developers should currently favor NOT EXISTS to avoid these pitfalls.
Why it matters: This highlights a fundamental design choice in the SQL standard where three-valued logic (true, false, unknown) creates semantic traps that the database planner historically failed to optimize because the rewrite to anti-joins was not mathematically equivalent when NULLs were present.
Takeaway: Stop using NOT IN for subqueries. Replace it with NOT EXISTS to maintain consistent behavior and better query performance regardless of data nullability.

Deep Dive

  • NOT IN is parsed as a chain of inequalities connected by AND (e.g., x <> a AND x <> b AND x <> NULL).
  • Any comparison with NULL yields 'unknown' in SQL's three-valued logic.
  • A row is only returned if the final predicate evaluates to TRUE; 'unknown' discards the row.
  • A single NULL on the right-hand side of a NOT IN subquery causes the entire query to return zero rows.
  • A single NULL on the left-hand side of the comparison also causes that row to be discarded.
  • The PostgreSQL optimizer treats NOT IN as an 'opaque' filter (SubPlan), preventing efficient join strategies like Hash Anti Join.
  • PostgreSQL 19 introduces a change that promotes NOT IN to an anti-join, but only when the planner can prove both sides are NOT NULL.
  • EXCEPT can be used for whole-set comparisons, but it treats NULLs as equivalent, which may or may not be desired compared to NOT EXISTS.

Decoder

  • Three-valued logic: A logic system incorporating true, false, and unknown, where unknown is the result of any comparison involving NULL.
  • SubPlan: An opaque, nested sub-operation in an execution plan that the PostgreSQL optimizer cannot easily reorder or optimize, often leading to slower performance compared to a native join.
  • Hash Anti Join: A high-performance join strategy that identifies rows in the outer table that have no match in the inner table, typically used for NOT EXISTS queries.

Original Article

A NOT IN query can return the wrong answer without telling you. It is valid SQL, it runs without an error, and it hands back a perfectly well-formed result set that happens to be empty when it should not be. No warning, no hint, nothing in the logs: just zero rows where you expected hundreds, and a database that considers it correct.

Almost always the cause is a single NULL sitting somewhere you forgot to look, combined with two keywords you have typed a thousand times: NOT IN. None of it is a Postgres bug. This is exactly what the SQL standard mandates, implemented faithfully. That is precisely what makes it so easy to walk into, and why the planner could not safely optimize around it for the better part of Postgres's history. It comes down to one if statement in the parser.

Sample schema

Nothing elaborate. A table of products, one of which has no category assigned yet, and a table of archived categories that happens to contain a NULL:

CREATE TABLE products (id int, category_id int);
INSERT INTO products VALUES (1, 10), (2, 20), (3, NULL), (4, 10);

CREATE TABLE archived (category_id int);
INSERT INTO archived VALUES (20), (NULL);

The NULL in archived is not contrived. The moment a column is nullable (and most are, by default), a NULL can find its way into any subquery you point a NOT IN at. That is the whole point: this is not an exotic data condition, it is the ordinary one.

The query that returns nothing

Here is the request you have written a hundred times: give me the products whose category is not archived.

SELECT id, category_id FROM products
WHERE category_id NOT IN (SELECT category_id FROM archived);

You expect products 1 and 4 (category 10, which is not in the archived set). What comes back is:

 id | category_id
----+-------------
(0 rows)

Every row gone. Not a subset, not an off-by-one: all of them. Drop the NULL from archived and the same query behaves:

SELECT id, category_id FROM products
WHERE category_id NOT IN (SELECT category_id FROM archived
                          WHERE category_id IS NOT NULL);
 id | category_id
----+-------------
  1 |          10
  4 |          10
(2 rows)

To understand why a single NULL empties the entire result, we have to stop thinking of NOT IN as a single thing and watch the parser take it apart.

IN is an OR, NOT IN is an AND

IN is not a primitive operator. It is shorthand that the parser rewrites into a chain of equality comparisons joined by OR:

x IN (a, b, c)
-- becomes
x = a OR x = b OR x = c

NOT IN is the logical negation of that, and by De Morgan's law negating an OR of equalities gives you an AND of inequalities:

x NOT IN (a, b, c)
-- becomes
x <> a AND x <> b AND x <> c

This is not an analogy. It is literally the expression Postgres builds, and you can read it straight off an EXPLAIN. The literal-list forms collapse into array operators whose names give the whole game away:

EXPLAIN (COSTS OFF) SELECT * FROM products WHERE category_id IN (1, 2, 3);
--  Filter: (category_id = ANY ('{1,2,3}'::integer[]))

EXPLAIN (COSTS OFF) SELECT * FROM products WHERE category_id NOT IN (1, 2, 3);
--  Filter: (category_id <> ALL ('{1,2,3}'::integer[]))

IN is = ANY: equal to any element, an OR. NOT IN is <> ALL: different from all elements, an AND.

Three-valued logic does the rest

SQL does not have two truth values, it has three: true, false, and unknown. Any comparison against NULL yields unknown, because NULL means "no value here" and you cannot ask whether an absent value is different from 20.

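Take product 1, whose category_id is 10. Against the archived set (20, NULL), the NOT IN expands to 10 <> 20 AND 10 <> NULL, which is true AND unknown. true AND unknown is unknown, not true. A WHERE clause keeps a row only when its predicate evaluates to true; both false and unknown cause the row to be discarded. So product 1 is dropped. Run the same arithmetic for product 4 and you land on unknown again.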

The mechanism in one sentence: the instant a single NULL enters the right-hand side, the trailing AND unknown term can never be true, so the whole NOT IN can never be true, so every row is discarded, regardless of how many million rows you have or what they contain.
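
The truth table can be checked by hand with a small model of three-valued logic. This is a sketch in Python, with `None` standing in for unknown. The function names are made up for the illustration and are not PostgreSQL internals.

```python
from typing import Optional

Tri = Optional[bool]  # None stands for SQL's "unknown"


def sql_ne(a, b) -> Tri:
    """SQL <>: unknown if either operand is NULL (None)."""
    if a is None or b is None:
        return None
    return a != b


def sql_and(x: Tri, y: Tri) -> Tri:
    """Three-valued AND: false dominates, then unknown, then true."""
    if x is False or y is False:
        return False
    if x is None or y is None:
        return None
    return True


def sql_not(x: Tri) -> Tri:
    """Three-valued NOT: unknown stays unknown."""
    return None if x is None else not x


def sql_not_in(x, values) -> Tri:
    """x NOT IN (values), expanded as x <> v1 AND x <> v2 AND ..."""
    result: Tri = True
    for v in values:
        result = sql_and(result, sql_ne(x, v))
    return result


def sql_in(x, values) -> Tri:
    """x IN (values) is the three-valued negation of NOT IN."""
    return sql_not(sql_not_in(x, values))


archived = [20, None]
for category in (10, 20, None):
    print(category, sql_in(category, archived), sql_not_in(category, archived))
```

For category 10 both IN and NOT IN come out as `None`, so a WHERE clause would discard the row under either predicate.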

NULLs on the left side too

Keeping NULLs out of the subquery is not enough. The same unknown arises from NULLs on the left: for product 3 (whose category_id is NULL), every comparison is unknown, so the row is dropped even against a spotless right-hand set. IN and NOT IN are not complements: a row can fail both tests simultaneously. There is a NULL-shaped gap between them that belongs to neither.

The seam, in the source

All of this reduces to one branch in one function. Open src/backend/parser/parse_expr.c and find transformAExprIn, the routine that turns both IN and NOT IN list expressions into something the planner can chew on.

/*
 * If the operator is <>, combine with AND not OR.
 */
if (strcmp(strVal(linitial(a->name)), "<>") == 0)
    useOr = false;
else
    useOr = true;

That is the entire fork. IN arrives carrying the operator = and gets useOr = true; NOT IN arrives carrying <> and gets useOr = false. There is no special-casing of NULL anywhere in this function, and there does not need to be: the three-valued behavior is an emergent property of having chosen AND.

A grammar asymmetry: list vs. subquery

A list and a subquery are built differently. The list form is the <> ALL chain of inequalities you just saw. The subquery form is not a <> ALL at all: it becomes NOT (foo = ANY (subquery)). Different shapes, same truth table: a NULL in the comparison makes the result unknown, and unknown loses.

Why the planner won't save you

The difference between query plans is structural. A Hash Anti Join builds one hash table from the inner relation and streams the outer relation through it exactly once. A SubPlan gives the planner none of that. It cannot be reordered during the global join search, it cannot have the outer query's join clauses pushed into it to drive an index scan, and it has no multi-batch spill logic to fall back on if the hashed set turns out larger than the estimate.

The reason the planner is stuck with that shape is precisely the NULL semantics from earlier. An anti-join keeps a row when it finds no match. But NOT IN must discard a row when the comparison goes unknown, and "unknown" is not the same as "no match".
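
To make the gap between "no match" and "unknown" concrete, here is a simplified Python model of the two behaviors on the sample data. It is a sketch of the semantics only, not of PostgreSQL's executor, and both function names are invented for the illustration.

```python
def hash_anti_join(outer_rows, inner_keys):
    """Keep outer rows that have no equal key on the inner side.

    NULL (None) never equals anything, so NULL keys never produce a match.
    """
    table = {k for k in inner_keys if k is not None}    # build phase
    return [row for row in outer_rows if row[1] not in table]  # probe phase


def not_in_filter(outer_rows, inner_keys):
    """Keep outer rows only when NOT IN is definitely true."""
    if not inner_keys:
        return list(outer_rows)  # NOT IN over an empty set is always true
    inner_has_null = any(k is None for k in inner_keys)
    keys = {k for k in inner_keys if k is not None}
    kept = []
    for row in outer_rows:
        key = row[1]
        if key is None or inner_has_null:
            continue  # result is unknown (or false): the row is discarded
        if key not in keys:
            kept.append(row)
    return kept


products = [(1, 10), (2, 20), (3, None), (4, 10)]
archived = [20, None]

print(hash_anti_join(products, archived))
print(not_in_filter(products, archived))
print(not_in_filter(products, [20]))
```

The anti-join keeps products 1, 3, and 4, while the NOT IN filter keeps nothing. Because the two operations can disagree whenever a NULL is present, rewriting one into the other is only safe when NULLs are ruled out.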

The fix is landing in PostgreSQL 19

A performance fix is landing in PostgreSQL 19 that converts NOT IN to an anti-join when it can prove that neither side of the comparison can yield NULL values. This removes a planner limitation that has stood since anti-joins were introduced, but it only applies when the planner can prove no NULL can reach the comparison. If your column is nullable, you are exactly where you have always been.

Decision matrix

 You wrote         | Internal shape      | NULL on right →   | NULL on left →         | Planner (≤ PG 18)
-------------------+---------------------+-------------------+------------------------+-------------------------
 IN (1,2,3)        | = ANY, an OR        | absorbed, no harm | row dropped (no match) | ScalarArrayOpExpr
 NOT IN (1,2,3)    | <> ALL, an AND      | all rows dropped  | row dropped            | ScalarArrayOpExpr
 IN (subquery)     | = ANY sublink       | absorbed          | row dropped            | Semi Join
 NOT IN (subquery) | NOT (= ANY) sublink | all rows dropped  | row dropped            | opaque SubPlan filter ¹
 NOT EXISTS (...)  | anti-join sublink   | row kept          | row kept               | Anti Join

¹ PostgreSQL 19 promotes this to an Anti Join when both the outer expression and the subquery output column are provably NOT NULL.

What to do instead

Default to NOT EXISTS. Make this the habit and you never hit the trap again. It gives you the anti-join semantics, and the Anti Join plan, that you wanted from NOT IN, on every version, nullable columns or not:

SELECT id, category_id FROM products p
WHERE NOT EXISTS (
    SELECT 1 FROM archived a WHERE a.category_id = p.category_id
);

Filter the NULLs out of the subquery when you have to keep the NOT IN:

SELECT id FROM products
WHERE category_id NOT IN (
    SELECT category_id FROM archived WHERE category_id IS NOT NULL
);

Use EXCEPT for whole-set comparisons, but note that EXCEPT treats NULLs as equal to one another: a NULL on the left is removed whenever the right-hand set also contains a NULL. It also removes duplicate rows from its output.
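
The alternatives can be compared side by side on the sample schema. The sketch below uses SQLite through Python's built-in `sqlite3` module as a stand-in, on the assumption that it applies the same NULL semantics to these operators. It is not PostgreSQL, and it says nothing about query plans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE products (id int, category_id int);
    INSERT INTO products VALUES (1, 10), (2, 20), (3, NULL), (4, 10);
    CREATE TABLE archived (category_id int);
    INSERT INTO archived VALUES (20), (NULL);
""")


def first_column(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]


not_in = first_column("""
    SELECT id FROM products
    WHERE category_id NOT IN (SELECT category_id FROM archived)
    ORDER BY id
""")
filtered_not_in = first_column("""
    SELECT id FROM products
    WHERE category_id NOT IN (
        SELECT category_id FROM archived WHERE category_id IS NOT NULL
    )
    ORDER BY id
""")
not_exists = first_column("""
    SELECT id FROM products p
    WHERE NOT EXISTS (
        SELECT 1 FROM archived a WHERE a.category_id = p.category_id
    )
    ORDER BY id
""")
except_categories = first_column("""
    SELECT category_id FROM products
    EXCEPT
    SELECT category_id FROM archived
""")

print(not_in, filtered_not_in, not_exists, except_categories)
```

Note that NOT EXISTS also keeps product 3, whose category is NULL, while the filtered NOT IN drops it. Which of the two is right depends on whether "no category" should count as "not archived".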

AI security

Brain the Size of a Planet: Are LLMs Thonking too Hard?

Extensive testing of 26 LLM configurations shows that higher reasoning effort and newer models frequently fail to outperform smaller models at security bug triage.

Summary

What: Researcher Parsia Yasini ran 2,080 test iterations using GPT-5.4/5.5 and Claude-4.6/4.7/4.8 on two security vulnerabilities, finding that function-level input significantly increases success rates over file-level input, where models often struggle.
Why it matters: This reveals that for security static analysis, prompt engineering and 'reasoning effort' settings often induce 'gaslighting' or excessive token usage without improving detection accuracy, especially when models are overwhelmed by context.
Takeaway: When using LLMs for security analysis, pass individual functions instead of entire source files to dramatically improve detection success.

Deep Dive

  • Tested 26 combinations of GPT-5.x and Claude-4.x models against two vulnerabilities: openbsd-sack and freebsd-nfs-vuln.
  • Success rate for vulnerability detection was 70.8% at the function level, but dropped to 1.7% when passing the entire source file.
  • 'High' or 'Extra High' reasoning effort often resulted in lower performance compared to 'Medium' settings.
  • GPT-5.4/5.5 generally outperformed Claude equivalents in full vulnerability chain resolution.
  • Claude models were the only ones to consistently reference CVEs in their analysis.
  • Total cost for the full experimental series was approximately $9,200.
  • An 'LLM Council' triage approach (majority voting) achieved 86.2% unanimity, making it an effective way to handle multiple AI outputs.
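
The summary does not describe how the council was implemented, so the following is only a guess at the general shape of majority voting: each model casts one label per finding, the most common label wins, and unanimity is the share of findings where all votes agree. The labels and sample votes are hypothetical.

```python
from collections import Counter


def council_verdict(votes):
    """Return (majority label, unanimous?) for one finding.

    A tie for first place returns (None, False) so a human can review it.
    """
    counts = Counter(votes).most_common()
    top_label, top_count = counts[0]
    if len(counts) > 1 and counts[1][1] == top_count:
        return None, False
    return top_label, top_count == len(votes)


def unanimity_rate(all_votes):
    """Fraction of findings on which every voter agreed."""
    return sum(council_verdict(v)[1] for v in all_votes) / len(all_votes)


# Hypothetical votes from several models on four findings.
sample = [
    ["vulnerable", "vulnerable", "vulnerable"],
    ["vulnerable", "benign", "vulnerable"],
    ["benign", "benign", "benign"],
    ["benign", "vulnerable"],
]

verdicts = [council_verdict(v) for v in sample]
print(verdicts, unanimity_rate(sample))
```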

Decoder

  • Triage: The process of determining the priority or relevance of security vulnerabilities.
  • Content Filtering: An automated safety mechanism that blocks an LLM from responding to a prompt deemed sensitive or dangerous.
  • Context Window: The maximum amount of information (in tokens) an LLM can consider in a single request.


AI infrastructure hardware

Building AI Agents for AR Glasses and XR Devices with NVIDIA XR AI

NVIDIA's open-source XR AI library provides a modular foundation for connecting AR and VR headsets to GPU-accelerated cloud services for real-time multimodal interaction.

Summary

What: The public beta includes an open-source library that routes camera/microphone streams to NVIDIA's Cosmos (vision) and Nemotron (language) models. It leverages the Model Context Protocol (MCP) to let XR devices access enterprise data and tools, enabling agents to understand environments and respond within a single spatial session.
Why it matters: The architecture separates media transport from model reasoning, allowing developers to upgrade specific components (like models or clients) without rebuilding their entire XR application stack.
Takeaway: Clone the repository from 'https://github.com/NVIDIA/xr-ai.git' to experiment with the 'simple-vlm-example' and integrate your first enterprise data source via MCP.

Decoder

  • VLM (Vision-Language Model): A neural network that can process and understand both visual input and textual instructions, allowing it to perform tasks like image captioning or visual question answering.

Original Article

Building AI Agents for AR Glasses and XR Devices with NVIDIA XR AI

Developers building for AR glasses and wearable devices face an infrastructure gap. The hardware is ready, but creating AI experiences requires integrating live camera and microphone streams, multimodal AI models, enterprise data, tool use, deployment infrastructure, and device-specific runtimes.

NVIDIA XR AI is designed to address this challenge by providing a reusable foundation for connecting extended reality (XR) devices to GPU-accelerated AI services running in the cloud, data center, workstation, or edge.

Now publicly available in beta, XR AI gives developers an open source library for building intelligent agents for AI glasses, AR glasses, and XR headsets. These intelligent XR agents can see what users see, understand spoken or typed intent, call enterprise tools, and respond within the same XR session. They can help frontline team members find the right information, guide workers through procedures, verify outcomes, and capture the evidence.

XR AI brings intelligence to people where they work, whether in field service, remote assistance, industrial operations, healthcare, training, or other hands-busy environments.

NVIDIA partners in healthcare and manufacturing provide useful examples of how this pattern can be applied. Researchers in the Cong Lab at the Stanford School of Medicine and the Wang Lab at Princeton University have explored XR and AI workflows for stem cell therapy research, helping researchers access contextual information and interact with laboratory systems while remaining focused on complex procedures.

In manufacturing, Siemens is exploring in a research context how NVIDIA XR AI and NVIDIA DGX Spark can help factory engineers find maintenance information, troubleshoot issues, verify work, and capture what happened on the shop floor.

This post walks through the process of building an intelligent XR Agent for your use case. It also explores how XR AI combines visual grounding using NVIDIA Cosmos, voice-first interaction with NVIDIA Nemotron models, enterprise connectivity using Model Context Protocol (MCP), and flexible agent orchestration with frameworks such as NVIDIA NeMo Agent Toolkit.

Components and architecture of an intelligent XR Agent

An intelligent XR Agent starts with live context from the user’s XR device. Camera frames, microphone audio, and data messages flow into the XR Media Hub, where they can be routed to models, tools, and agents that understand the user’s environment and intent. NVIDIA Cosmos models provide visual grounding; NVIDIA Nemotron models provide language understanding, reasoning, and tool calling; and MCP servers expose enterprise tools and data sources. Agent frameworks such as NVIDIA NeMo Agent Toolkit can orchestrate workflows across models and tools, while NVIDIA CloudXR can add rendered spatial content when an application needs rich 3D interaction.

XR AI keeps this architecture modular by separating media transport, model services, tool access, agent orchestration, and client delivery. Video pixels can remain in shared memory while lightweight metadata moves through the system, so agents retrieve image data only when a task requires it. This reduces unnecessary model inference and data movement while letting developers swap clients, models, MCP servers, orchestration frameworks, and deployment environments without rebuilding the entire agent.
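
To make the "pixels stay put, metadata moves" idea concrete, here is a toy sketch. It is not the XR AI API: `FrameMeta`, `FrameStore`, and `handle` are hypothetical names, and an in-process dictionary stands in for real shared memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameMeta:
    """Lightweight record that travels through the system instead of pixels."""
    frame_id: int
    participant: str
    timestamp_ms: int


class FrameStore:
    """Stand-in for shared memory: holds pixel buffers keyed by frame id."""

    def __init__(self) -> None:
        self._pixels: dict[int, bytes] = {}
        self.reads = 0  # counts how often pixel data was actually fetched

    def put(self, meta: FrameMeta, pixels: bytes) -> None:
        self._pixels[meta.frame_id] = pixels

    def get(self, frame_id: int) -> bytes:
        self.reads += 1
        return self._pixels[frame_id]


def handle(meta: FrameMeta, store: FrameStore, needs_vision: bool) -> str:
    """Fetch image data only when the task requires it."""
    if not needs_vision:
        return "text-only reply"
    pixels = store.get(meta.frame_id)
    return f"inspected {len(pixels)} bytes from frame {meta.frame_id}"


store = FrameStore()
meta = FrameMeta(frame_id=1, participant="alice", timestamp_ms=0)
store.put(meta, b"\x00" * 1024)
```

The agent only ever receives `meta`, so a request that does not need vision never touches the pixel buffer at all.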

The same design also supports multi-user and multi-agent scenarios. Participant identity acts as the routing boundary: multiple clients can connect to the same hub, multiple agents can observe the same streams, and each response is routed back to the correct participant. This pattern enables one foundation to support visual understanding, voice interaction, enterprise tool use, real-time reasoning, context-aware XR responses, and flexible deployment across AI glasses, AR glasses, XR headsets, mobile devices, web clients, and CloudXR-powered experiences.
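
The routing rule described above (many clients, many agents, replies keyed by participant) can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the XR Media Hub implementation: `ToyHub`, `echo_agent`, and `alert_agent` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Optional

# An agent sees (participant_id, message) and may return a reply.
Agent = Callable[[str, str], Optional[str]]


class ToyHub:
    """Fan each message out to every agent; route replies back by sender."""

    def __init__(self) -> None:
        self.agents: list[Agent] = []
        self.outboxes: dict[str, list[str]] = defaultdict(list)

    def add_agent(self, agent: Agent) -> None:
        self.agents.append(agent)

    def publish(self, participant_id: str, message: str) -> None:
        for agent in self.agents:
            reply = agent(participant_id, message)
            if reply is not None:
                # Participant identity is the routing boundary.
                self.outboxes[participant_id].append(reply)


def echo_agent(participant_id: str, message: str) -> Optional[str]:
    return f"echo: {message}"


def alert_agent(participant_id: str, message: str) -> Optional[str]:
    return "alert raised" if "help" in message else None


hub = ToyHub()
hub.add_agent(echo_agent)
hub.add_agent(alert_agent)
hub.publish("alice", "help with valve 3")
hub.publish("bob", "status?")
```

Both agents observe both streams, yet each reply lands only in the outbox of the participant who triggered it.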

Get started

XR AI is now available in public beta. The following sections walk through how you can use XR AI to quickly get to a working intelligent XR Agent, including:

  • Live camera, microphone, and device data streams
  • Real-time multimodal interaction
  • Visual grounding through Cosmos-powered VLMs
  • Voice interaction through speech recognition and Nemotron models
  • Enterprise connectivity through MCP
  • Searchable visual knowledge capture and retrieval workflows
  • Optional agent orchestration through NeMo Agent Toolkit or other frameworks
  • Optional CloudXR-rendered spatial content

Build your first intelligent XR agent with the public beta

Step 1. Clone the XR AI repository

The GitHub repository includes sample agents, model-server launchers, MCP servers, web clients, XR workflows, and the core media infrastructure. The quickest way to understand the system is to start with a simple multimodal agent and then add capabilities one layer at a time.

git clone https://github.com/NVIDIA/xr-ai.git
cd xr-ai

Step 2. Start the AI services

The larger examples use shared AI services that can be started independently:

cd agent-samples/model-servers
uv sync
uv run model_servers

This starts the model processes used by the heavier demos and leaves the weights loaded in the background. In the current repository, the model server stack includes:

  • nvidia/parakeet-tdt-0.6b-v3 for speech-to-text
  • nvidia/Cosmos-Reason1-7B for vision-language reasoning
  • nvidia/Llama-3.1-Nemotron-Nano-8B-v1 for fast, latency-sensitive language responses
  • NVIDIA-Nemotron-3-Nano-30B-A3B for deeper tool-calling workflows

Step 3. Run a sensor-first XR agent

Start the simplest working agent:

cd agent-samples/simple-vlm-example
uv sync
uv run simple_vlm_example

When the service starts, it prints a web client URL and authentication token. Open the web client, connect, and send a prompt such as ping or ask a question through the microphone.

Step 4. Connect enterprise data through MCP

Most enterprise agents need more than live perception. XR AI uses Model Context Protocol (MCP) as the integration layer for these workflows. The repository includes MCP servers for XR-specific capabilities:

  • vlm-mcp for visual question answering
  • video-mcp for video analysis and queries
  • render-mcp for scene manipulation
  • oxr-mcp for OpenXR spatial information
  • vec-mcp for vector and spatial utilities
  • transcript-mcp for transcript ingestion and retrieval
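
For orientation, MCP clients invoke server tools with JSON-RPC 2.0 messages whose method is `tools/call`. The sketch below only builds such a message; the tool name `ask_about_frame` and its arguments are made up for illustration and are not taken from the XR AI repository. A real session also needs an initialization handshake and a transport, which are omitted here.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests carry a unique id


def tool_call_request(tool_name: str, arguments: dict) -> str:
    """Serialize an MCP `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    })


# Hypothetical tool name and arguments, for illustration only
message = tool_call_request("ask_about_frame", {"question": "What is on the bench?"})
```

In practice an MCP client library or an agent framework usually builds and sends these messages for you.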

Step 5. Add agent orchestration

function_groups:
  xr_tools:
    _type: mcp_client
    server:
      transport: streamable-http
      url: "http://localhost:8220/mcp"

workflow:
  _type: react_agent
  tool_names:
    - xr_tools

Step 6. Add CloudXR-rendered spatial experiences

cd agent-samples/xr-render-demo
uv sync
uv run xr_render_demo

This workflow launches the XR Media Hub, CloudXR runtime, model services, MCP servers, and an agent worker. The agent can call rendering tools through MCP to create, update, and manipulate objects in a user’s spatial environment.

AI llm startup

Cursor's new model

Cursor is preparing to release a new, massive AI model trained from scratch with over 1.5 trillion parameters to power advanced agentic development workflows.

Summary

What: Cursor developers announced that the upcoming model was trained using 100,000 GPUs. It is designed to extend capabilities beyond standard autocomplete and pair programming, moving toward fully autonomous software engineering tasks.
Why it matters: This signals an industry trend where specialized, high-parameter models built specifically for reasoning and agentic workflows are being vertically integrated into IDEs.

Original Article

And here’s the full video of @mntruell announcing Cursor’s new model at Compile.

I’ve found myself explaining LLMs to more and more friends and family.

One component I’ve been covering a lot lately is model weights.

If you aren’t totally clear on what these are, here’s a simple(ish) overview - no calculus knowledge required.

A “what the heck are model weights” 🧵

First - what is a weight?

I like analogies so what I usually tell people is - imagine you’re trying to predict the chance that you are going to get to the airport on time.

We’ve all been in this situation.

There are some key inputs you’d use to figure out whether you’ll make it to the airport on time - things like how much traffic there is, how early you left your house, distance from the airport.

Now, instead of just asking what’s most likely to happen, you assign a strength or influence to each input.

So traffic, yeah that sucks, and it can really influence if you make it on time, same with leaving early. Distance from the airport might be more of a medium influence, etc.

Now for the magical mathematical part where I’ll leave the math out and keep it high level.

You essentially multiply each input by how important they are and get a final score.

The importance values you just assigned, yup - those are the weights.

Boom - you now understand in super simple terms, what model weights are.
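
If it helps to see the airport analogy as arithmetic, here is a tiny sketch. The input values and weights are invented for illustration; the only point is the "multiply each input by its importance, then add it all up" step.

```python
# Made-up inputs, each scaled to a rough 0..1 range
inputs = {"traffic": 0.5, "minutes_early": 1.0, "distance": 0.25}

# Made-up weights: negative means "hurts your chances", positive means "helps"
weights = {"traffic": -2.0, "minutes_early": 3.0, "distance": -1.0}


def score(inputs: dict[str, float], weights: dict[str, float]) -> float:
    """Multiply each input by its weight and add up the results."""
    return sum(inputs[name] * weights[name] for name in inputs)


print(score(inputs, weights))  # 1.75
```

Training a model is the process of finding weight values like these automatically instead of guessing them by hand.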

But this is a thread, so now let’s go deeper. In LLMs, the weights are numbers, and those numbers are stored in - bingo, you guessed it, a file.

These numbers are stored as floats, just think - numbers with decimals and the ability to have more numbers after the decimal, like 18.23 vs just 18.

Now here’s the kinda wild part that blows some people’s minds.

The file, with all these numbers - that’s the model.

Load weights into memory, send in tokens, get outputs.
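
The "the file is the model" point can be shown with a toy example: write a few floats to disk, read them back, and use them. This is a deliberately tiny sketch; real model files hold billions of numbers plus metadata, and real inference passes tokens through many layers rather than one weighted sum. The file name and the `run` helper are hypothetical.

```python
import os
import tempfile
from array import array

# Three 32-bit floats stand in for a model's weights
weights = array("f", [0.5, -1.25, 2.0])

# "Save the model": the file is nothing but those numbers
path = os.path.join(tempfile.mkdtemp(), "tiny_model.bin")
with open(path, "wb") as f:
    weights.tofile(f)

# "Load weights into memory"
loaded = array("f")
with open(path, "rb") as f:
    loaded.fromfile(f, len(weights))


def run(model: array, inputs: list[float]) -> float:
    """Send in inputs, get an output: one weighted sum."""
    return sum(w * x for w, x in zip(model, inputs))


print(run(loaded, [1.0, 2.0, 3.0]))  # 4.0
```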

Got it? Let’s go even deeper then.

Every day since @perplexity_ai released Computer, I have built something new.

It has been an awesome experience, and as I've said many times, I'm not one-shotting something, showing a screenshot, and calling it done.

The first prompt, and initial build, is the start, not the end, and definitely not a finished product.

So, now that I've got a bunch of projects I'm refining the code on, I thought I'd share how I'm refining them, and some prompts and workflows you can use to go beyond the first shot.

I'm going to use this fun little stock portfolio analyzer someone suggested I build as an example.

A Perplexity Computer code optimization thread 🧵

The first thing I do after the initial build is evaluate the codebase, and you don't have to leave Perplexity Computer, you can do this right in there with a prompt like this.

And you'll get back some really nice detailed analysis, broken down into sections like I specified in the prompt, i.e. what's good, needs work, and my favorite - glaringly wrong.

So let's start with what came back as good:

Whoa, it did it. @perplexity_ai Computer just one-shotted a full-stack fund in a box.

Over 4,500 lines of code, and it works.

The goal was to build a system that could credibly run a small fund's core workflow with 1-2 humans vs. the current model which is 10 analysts on terminals.

I came up with the idea by asking what could I build with computer that would be more valuable than a $30,000/year Bloomberg terminal.

Here's a screenshot of the fully working web app.

More details below, in what I think might be the world's first Perplexity Computer Thread 🧵

First, here's the idea I worked on with Perplexity.

I then had it build me a prompt, and it built a monster prompt. I'll share it in a few segments because to one-shot this, I needed a serious prompt.

Prompt Part 1:

You are an autonomous engineering, product, and research team building Thesium(.)finance, an AI‑native fund operating system where agents maintain live theses on every name and theme, and humans supervise a workstation called Thesium Desk.

Your goal is to design and implement an MVP of Thesium(.)finance that can credibly run a small fund’s core workflow end‑to‑end (research → risk → execution), with 1–2 humans supervising instead of a floor of analysts.

I’ve had a lot of people ask me about running models locally lately.

So here’s essentially what I keep sending to all my friends, and thought why not share with all of you.

And you honestly don’t need to know anything about how LLMs work under-the-hood to follow this.

A running LLMs locally thread 🧵

First things first. I’m heavily biased towards Macs, and you should be too.

Most software engineering today is done on a Mac, and all the cool new stuff comes out for Mac first, like the Codex Desktop app.

When it comes to running LLMs locally, Apple Silicon changed everything.

The unified memory architecture means the CPU and GPU share the same memory pool. For LLMs, that’s gold.

Models need big contiguous memory. On a Mac with 64–128GB unified memory, you can run models that would choke on many consumer GPUs.

If you’re choosing hardware, a Mac Studio with 64GB+ unified memory opens far more doors than a base Mac mini. Once you hit 128GB unified memory, you’re in serious territory. That’s when 70B-parameter class models become playable with quantization.
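
A quick back-of-the-envelope check shows why quantization matters for a 70B model. The sketch below counts weight storage only, which is an assumption: the KV cache, activations, and runtime overhead all add to the real footprint.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


print(weight_memory_gb(70, 16))  # 140.0 -> too big for 128GB at 16-bit
print(weight_memory_gb(70, 4))   # 35.0  -> fits comfortably at 4-bit
```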

Now the software stack. There are lots of options, but I just recommend easy mode to all my friends.

And the easiest entry point is Ollama.

It’s basically “Docker for LLMs.” You install it, then:

ollama run llama3

And suddenly you have a local model chatting with you.

It handles:

  • Model downloads
  • Quantized builds
  • Metal acceleration
  • Simple REST API

It uses llama.cpp under the hood, which is highly optimized for Apple’s Metal GPU framework.

This is the smoothest path for:

  • Llama models
  • Mistral
  • Mixtral
  • Code models
  • Small 7B–13B experimentation
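
The REST API mentioned above can be called from a few lines of standard-library Python. This sketch assumes Ollama is running locally on its default port (11434) and that the `llama3` model has already been pulled; `build_request` and `generate` are hypothetical helper names.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "llama3") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    return urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )


def generate(prompt: str, model: str = "llama3") -> str:
    """Send the prompt and return the model's text (needs a running server)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Usage, once `ollama run llama3` has worked at least once:
# print(generate("Explain unified memory in one sentence."))
```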

ClawdBot is amazing - it absolutely deserves all the attention it’s getting.

But, a lot of people are going to get hacked.

And that’s because way too many people are diving in without thinking about security.

Security researchers have already shown how prompt injection can be used to delete ALL of your email 😳

So there are a few things you should know about ClawdBot security before you let it loose - it won’t take too long, but yes, it’s worth the time.

A ClawdBot security 🧵

First, there’s a Sandbox Mode - enable it.

Second, run a security audit, just type this in your terminal: clawdbot security audit

I've been using @diabrowser for a few weeks now and it's become one of those rare products I become completely fanatical about.

My timeline lately has been wild, so many people having the same experience. I thought I'd share my top ten favs in a 🧵

AI llm

Mistral has a new model coming this Summer

Mistral CEO Arthur Mensch announced a new, sparse-architecture model family arriving this summer with an early access program launching in July.

Summary

What: CEO Arthur Mensch confirmed that Mistral is launching a new family of sparse-architecture models. An early access program for research and industry partners opens in July, with all models remaining open-weight. The company is also promoting its Studio deployment and Forge training products, which can run within a customer's own Virtual Private Cloud (VPC) or private data centers.
Why it matters: Mistral is positioning its infrastructure as a sovereign alternative to US-based AI providers by offering on-premises deployment options and open-weight models, directly challenging the opaque 'API-only' development model pushed by firms like OpenAI.

Deep Dive

  • Mistral is transitioning to a 'sparse' model architecture to improve inference efficiency.
  • The company is prioritizing 'open-weight' releases to allow for independent auditing and control.
  • Studio and Forge are designed as portable products that function independently of US cloud service providers.
  • Deployments can run in a customer's VPC, in the customer's own datacenter, or on Mistral-operated infrastructure decoupled from US service providers.
  • The Forge platform specifically focuses on using human-AI interaction logs to improve training cycles.

Decoder

  • Sparse model: An architecture where only a subset of the model's parameters are activated for any given input, resulting in faster and cheaper inference compared to dense models.
  • Open-weight: A release strategy where model weights are provided for download, allowing users to run the model on their own hardware, unlike closed models available only via API.
  • VPC: A private, isolated section of a public cloud provider's network where a user can launch resources in a virtual network they define.
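
One common way to get the sparsity described in the Decoder is mixture-of-experts routing, where a gate picks a few "experts" per input and the rest stay idle. The announcement does not say which sparse design Mistral's new family uses, so treat the sketch below as a generic illustration; the experts, gate scores, and function names are all made up.

```python
from typing import Callable


def route_top_k(gate_scores: list[float], k: int) -> list[int]:
    """Return the indices of the k highest-scoring experts."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_forward(
    x: float,
    experts: list[Callable[[float], float]],
    gate_scores: list[float],
    k: int = 2,
) -> tuple[float, list[int]]:
    """Run only the selected experts and mix their outputs by gate score."""
    active = route_top_k(gate_scores, k)
    total = sum(gate_scores[i] for i in active)  # assumes positive scores
    output = sum(gate_scores[i] / total * experts[i](x) for i in active)
    return output, active


# Four toy "experts"; only two of them run for any given input
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
output, active = sparse_forward(2.0, experts, gate_scores=[0.125, 0.5, 0.125, 0.25])
```

Here half of the experts never execute for this input, which is where the inference savings over a dense model come from.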

Original Article

We somehow got put in the spotlight the last few days! First we'd like to thank the organizers of the AI show for that, we can't get enough of this stuff. I'll say a few things about where we are and what we do.

First, we have a nice model coming this summer – we hope it will delight and surprise in a few capabilities. This will be the start of a new family of models, fat indeed, but sparse. We're opening up an early access program in July for key partners in research, government and the industry.

This model and upcoming ones will be open-weight. We believe this is critical for our customer confidence and for the research and developer communities. You cannot own, inspect, audit, or improve a system you are only permitted to reach through someone else's interface, especially if data recording can no longer be turned off.

We've built Studio (for deployment) and Forge (for training) as portable products, and are now hosting them on infrastructure we control. We'll run in your VPC, your datacenter, or on our infrastructure that is decoupled from US service providers. We have capacity online, it's growing fast, and we can help you secure it.

We're working with companies and governments around the world to make sure their AI systems are up and running outside of external control, improving with each model release, and with an efficient cost structure. Forge allows you to continuously train models based on recorded human-AI interaction, a key unlock for efficiency.

AI, just like oil in the 20th century, is about to become the major source of leverage and power in the world. Depending on how the coming years unfold, it will either lead to a world of wealth and abundance for all, or to the worst extractive economies that the world has ever seen. We're there to fight for the first scenario, as we progress AI research and accelerate its diffusion across the world – we're hiring if you like the quest.

Tech ai hardware

Mobileye is entering the US robotaxi market with standalone service

Mobileye plans to launch a standalone robotaxi service in 2027, scaling to 17,000 vehicles over five years.

Summary

What: CEO Amnon Shashua announced the vertically integrated service will use the company's Moovit platform, starting with a 100-vehicle pilot next year, complementing existing partnerships with Lyft and VW.
Why it matters: Mobileye is moving from a pure technology supplier to an operational service provider to accelerate the adoption of its autonomous driving stack.

Decoder

  • Vertical integration: A business model where a company owns or controls multiple stages of its production or service delivery chain rather than outsourcing.

Original Article

The driving technology company Mobileye plans to launch a robotaxi service in an as-yet-unnamed US city in 2027, it said earlier today. The service will be vertically integrated, using Mobileye’s Moovit mobility platform to interact with customers booking rides, coordinate drivers, and so on. The Israeli company, which was bought by Intel in 2017 before going public again in 2022, says it will start with around 100 robotaxis early next year.

“Mobileye has spent more than two decades building the technologies required for autonomous driving,” said Amnon Shashua, founder and CEO of Mobileye. “Today we are taking the next step: combining those technologies with operational ownership to create a financially and geographically scalable robotaxi business designed from the ground up for global deployment.”

The company first rose to prominence in the mid-2010s, when Tesla began using Mobileye’s advanced driving assistance systems (ADAS) as part of Autopilot. That relationship lasted until 2016, when Mobileye dropped Tesla as a customer after being alarmed that a driver assistance system was being sold to end users as driverless technology. Since then, Mobileye has continued to work with other partners on ADAS and autonomous vehicles.

It has developed a new “SuperVision” ADAS that combines cameras and radar sensors, used by Porsche and Polestar, among others. On the robotaxi front, it has partnered with Volkswagen Group’s MOIA to develop a commercially available robotaxi based on the VW ID. Buzz minivan, and last year, Mobileye revealed plans to work with Lyft to deploy robotaxis in Dallas, “as soon as” this year.

“This initiative is not a replacement for our existing partnerships; it is an extension of them,” said Shashua. “We remain deeply committed to enabling automakers and mobility providers with Mobileye Drive. At the same time, operating our own service allows us to accelerate adoption, gain direct operational experience, and showcase the full potential of autonomous mobility.”

If Mobileye’s experience with the initial 100 robotaxis goes well, it says it will scale up to around 17,000 robotaxis within the following five years. “The robotaxi revolution has only just begun, and its potential for transforming how we travel around the world continues to increase,” Shashua said.

Tech opensource backend

Lore (Website)

Epic Games has released Lore, an open-source version control system specifically engineered for large binary assets and complex digital projects.

Summary

What: Lore uses a Merkle tree-based, content-addressed storage architecture to manage large binary files, providing a scalable alternative to Git for developers and artists.
Why it matters: Standard version control tools like Git often struggle with large media files; Lore's approach to sparse hydration and chunked storage addresses a major pain point in game and entertainment development.

Deep Dive

  • Content-addressed storage: Data is referenced by its cryptographic hash rather than a file path, ensuring integrity.
  • Merkle tree: A data structure where every leaf node is a hash of a data block, and non-leaf nodes are hashes of their children.
  • Sparse hydration: The process of downloading only the necessary subset of files into a local workspace instead of the full repository.
  • Binary-first: Architecture optimized for non-textual data like 3D models, textures, and video assets.
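
The first two bullets can be illustrated with a small Merkle-root calculation. This is a generic sketch, not Lore's actual format: SHA-256 and the "duplicate the last node on odd levels" rule are assumptions chosen for simplicity.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(blocks: list[bytes]) -> str:
    """Hash each block, then hash pairs upward until one root remains."""
    if not blocks:
        raise ValueError("need at least one block")
    level = [h(b) for b in blocks]  # leaf nodes: hashes of the data blocks
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # assumption: pad odd levels by repeating the last node
        # non-leaf nodes: hash of the two children concatenated
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0].hex()


original = merkle_root([b"mesh", b"texture", b"audio"])
modified = merkle_root([b"mesh", b"texture v2", b"audio"])
```

Because any change to a block changes the root, two repository states can be compared by checking a single hash.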

Decoder

  • Hydration: The process of populating a local workspace with file data from a central repository.

Original Article

Lore: next-generation open source version control

Maintained by Epic Games, Lore is designed for unprecedented scalability of both data and teams. It’s optimized for projects—including games and entertainment—that combine code with large binary assets, and caters for the needs of developers and artists alike.

Get started with Lore

Find us on GitHub

Access and contribute to Lore on the Epic Games GitHub.

Read the docs

Delve into Lore’s ethos and architecture.

Join the conversation

Chat with us and our community on Discord.

Overview

Easy setup, on-demand scalability

Get started in local mode in minutes. Then, scale up as far and as fast as you need.

Fast and efficient processes

Scale without slowdowns, thanks to shared, reusable data and as-needed downloads.

Free branching

Quickly and easily create, manage, and sync branches to freely experiment, iterate, and release.

History you can trust

Confidently track and manage revisions with Lore's verifiable tamper-evident source of truth.

Intuitive interface

Enjoy complete one-to-one access to the full Lore functionality via the CLI.

Full-surface API

Extend, customize, and integrate Lore via C/C++, C#, Rust, Go, Python, or JavaScript.

Lore’s architecture

Lore is a centralized, content-addressed version control system that represents repository state as Merkle trees and an immutable revision chain, optimized for binary-first storage, deduplication, and sparse/on-demand data hydration at scale.

Content-addressed storage

Repository data is stored and referenced by content hash in a Merkle tree, enabling fast comparisons, integrity checks, and reuse across history and branches.

Immutable revision chain

A revision's hash signature is derived from its revision state, including parent revision hashes and contained data hashes, forming an immutable chain with cryptographic integrity.
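
A minimal sketch of such a chain is shown below. The hash function (SHA-256) and the JSON encoding of the revision state are assumptions made for illustration; Lore's real revision format may differ.

```python
import hashlib
import json


def blob_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def revision_hash(parents: list[str], content_hashes: list[str]) -> str:
    """Derive a revision id from its parent ids and the hashes of its data."""
    state = json.dumps({"parents": parents, "content": sorted(content_hashes)})
    return hashlib.sha256(state.encode()).hexdigest()


r1 = revision_hash([], [blob_hash(b"texture v1")])
r2 = revision_hash([r1], [blob_hash(b"texture v2")])

# Rewriting history changes r1, which in turn changes every descendant id
tampered_r1 = revision_hash([], [blob_hash(b"texture v1 (edited)")])
r2_after_tamper = revision_hash([tampered_r1], [blob_hash(b"texture v2")])
```

Since each id folds in its parents, an edit to an old revision cannot go unnoticed: every later revision id would no longer match.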

Chunked storage for large files

Files are stored as reusable chunks with indexed lookup, reducing duplication and enabling efficient updates and transfer for large binary assets.
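
Here is a toy version of chunked, deduplicated storage. It assumes fixed-size chunks and SHA-256 keys for simplicity; the source does not say how Lore chooses chunk boundaries, and `ChunkStore` is a hypothetical name.

```python
import hashlib


class ChunkStore:
    """Store files as lists of chunk hashes; identical chunks are kept once."""

    def __init__(self, chunk_size: int = 4) -> None:
        self.chunk_size = chunk_size
        self.chunks: dict[str, bytes] = {}  # indexed lookup: hash -> chunk bytes

    def put_file(self, data: bytes) -> list[str]:
        manifest = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            key = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(key, chunk)  # reuse the chunk if already stored
            manifest.append(key)
        return manifest

    def get_file(self, manifest: list[str]) -> bytes:
        return b"".join(self.chunks[key] for key in manifest)


store = ChunkStore()
v1 = store.put_file(b"AAAABBBBCCCC")
v2 = store.put_file(b"AAAAXXXXCCCC")  # only the middle chunk changed
```

Two versions made of six chunks in total occupy only four stored chunks, and only the changed chunk would need to be transferred.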

On-demand hydration and sparse workspaces

Workspaces can stay lightweight by fetching file data only when needed, so you don't have to download everything up front.

Centralized service with caching

A service-backed architecture uses caching in front of durable storage to scale throughput for large teams and repositories.

Lightweight branches and fast switching

Branches are lightweight mutable references, so creating and switching branches is low-overhead without duplication of underlying data.

Lore’s repositories

Lore Library, Server & CLI

JavaScript SDK

Python SDK

C# SDK

Go SDK

Fully open source

At Epic, we believe a truly open ecosystem won’t be built by any one company, but collectively and collaboratively using open standards. That’s why Lore is fully open source and released under an MIT license. Let’s build the version control system of the future in the open, learning from each other’s needs and ideas from the start. Come join us!


Tech ai frontend

Anthropic ships major Claude Design overhaul with design system imports, code round-trips, and a fix for its token-burning problem

Anthropic's updated Claude Design platform now supports design system imports and code round-tripping to reduce token overhead.

Summary

What: The update transitions Claude Design from a prototype tool to an enterprise-ready platform, allowing users to import existing design systems from GitHub or design files to generate consistent UI at scale.
Why it matters: By integrating existing design tokens and codebases, Anthropic is trying to move beyond simple chat-based generation into reliable, production-ready frontend workflows.

Decoder

  • Round-tripping: The ability to modify generated code in a local IDE and sync those changes back to the AI environment, maintaining consistency.

Original Article

Anthropic has shipped an overhauled version of Claude Design that attempts to fix its token consumption issue. With the update, Claude Design has transformed from a prototype toy to an enterprise platform. Users can now bring design systems into Claude Design from a GitHub repository, design files, or raw uploads. It can output consistent branding at speed and scale.

Tech career ai productivity

You Got Faster. Your Company Didn't

Individual productivity gains from AI are often canceled out at the organizational level by the 'hidden' burden placed on reviewers to fact-check AI-generated output.

Summary

What: Engineers using AI to generate long-form documents or code save time themselves but shift the labor of verification, editing, and fact-checking onto their colleagues, often resulting in slower team-wide output.
Why it matters: The lack of trust in unverified AI-generated content necessitates a high-effort review process that is more taxing than reading human-authored work.
Takeaway: If an AI-generated document or pull request is too long or unclear, ask the author to cut it down to the core decisions and tradeoffs before you begin your review.

Original Article

You Got Faster. Your Company Didn’t.

“I would have written a shorter letter, but I did not have the time.” — Blaise Pascal

Because of AI, everyone on your team is more productive than they were a year ago, just ask them. So why isn’t the company itself faster?

I think I know why.

Let’s say an engineer needs to write a tech brief for a database migration. Two years ago, this would’ve cost him an entire afternoon: reading the code and some articles online, weighing the options, writing, deleting, rewriting. The result was short, and every word of it had survived contact with his brain.

Fast forward to today, and he pastes the context into a model and hits send. A few minutes later, the agent hands back a plan several times longer than anything he would’ve written by hand.

Well, he’s more productive now, right? A fraction of the time, many times the output. But what about everyone else? A handful of reviewers open a document several times longer than it needs to be, with that unmistakable AI smell on it.

And the length is not the biggest concern! Given that the doc was clearly AI-generated, every reviewer is now also fact-checking the thing. The brief says the current job processes events sequentially. Does it, though? It says the migration touches nine tables. Is it nine? When a colleague writes a sentence like that by hand, you trust it, because someone counted and put their name on the count. When a model writes it, and the author didn’t check, the sentence looks exactly the same. You can’t tell which claims he stands behind and which ones the model dreamed up, so you have to treat every single line as unverified. The reviewers end up doing the thinking the author skipped (except this time it arrives nicely formatted, and with confidence 🥲).

So every one of those reviews now takes longer than it used to. He saved himself the afternoon and quietly spent everyone else’s. The time was just transferred, and because a document has one writer and many readers, one person’s shortcut becomes everyone else’s problem.

You see, a document is supposed to be a service. The (implicit) deal is that the writer spends their time so the readers don’t have to. It’s why Pascal, in that quote at the top, was apologizing: the long letter is cheap for me and expensive for you, the short letter is expensive for me and cheap for you. At the workplace, I usually owe you the short one because there’s one of me and many of you. Compression, editing, and fact-checking are the work.

By the way, it’s not just documents… I’m also seeing the same pattern in pull requests, automated tests, and even decisions. We’re going faster by passing the slow part (the reading, the actual understanding) to whoever comes next. A Ponzi scheme?

Don’t get me wrong: by all means, do use AI. I do too, and I’m (probably) not going back. The thing is: the model is giving you many hours back, so please spend a bit of them editing!

I already have a rule for AI-written code: if I can’t explain the change, I can’t ship it. The same rule applies here: if you can’t defend a sentence in the document, the document isn’t really done, is it?

And if you’re on the receiving end of these, you’re allowed to push back and say: “This reads like an unedited draft. Can you cut it down to the decision, the tradeoffs, and what you need from me? Happy to review it then.”

So that’s why the company never speeds up, even when everyone in it does. The time that the engineer saved didn’t go anywhere good: it landed on everyone who had to read his document.

And again, he’s not lying when he says he’s faster. He is. So am I, most days. The speed is very much real for each of us individually. It’s just that when you add it up across the team, it points the wrong way; everybody’s faster, and the whole thing somehow moves slower.

Which makes me think we owe the people reading us a bit more than we’ve been giving them lately.

Tech ai llm infrastructure

Local Qwen isn't a worse Opus, it's a different tool

Local 27B LLMs like Qwen are not direct substitutes for frontier models like Claude, but they offer significant value for specific, air-gapped business workflows.

Summary

What: Founder Alex Ellis explains that while local models lack the general reasoning of frontier AI, they are effective for specialized tasks like customer support diagnostics and telemetry analysis, provided users avoid 'infinite loops' by properly tempering the model.
Why it matters: Business-critical workflows requiring data sovereignty and fixed costs make local hosting an essential operations strategy for lean teams, despite the overhead of managing hardware.
Takeaway: When running local models for production tasks, avoid open-ended agentic loops; instead, restrict models to bounded tasks like code analysis and system diagnostics.

Deep Dive

  • Cost/Benefit: Local models can be cost-effective for specific workflows but require expensive hardware (e.g., RTX 6000 Pro).
  • The Loop Problem: Local models frequently enter infinite recursion when task complexity exceeds their reasoning capacity.
  • Data Sovereignty: Local hosting allows for the analysis of sensitive customer diag-dumps without violating privacy agreements.
  • Model Tuning: Success with Qwen relies on specific llama.cpp parameters like temperature, quantization settings, and cache-type optimizations.
  • Operational Overhead: Managing local models requires building internal tooling for model routing, identity, and metering (e.g., the author's 'Toilgate' tool).
  • Hardware Reality: Consumer-grade 3090s are prone to failure and bandwidth constraints; enterprise hardware is often required for reliable, high-context performance.

Decoder

  • Quantization: Compressing model weights to reduce VRAM usage, often at the cost of reasoning depth.
  • Inference: The process of running a trained model to generate output.
  • Tokens: Basic units of text processed by LLMs; high token consumption in agentic loops leads to increased costs and potential model degradation.
  • Speculative Decoding: Using a smaller model to generate candidate tokens which a larger model then verifies, significantly increasing generation speed.

Original Article

Full article content is not available for inline reading.

Tech ai agents llm

Why we're bullish on loops

Industry leaders are pushing for 'loops' where agents prompt themselves to complete complex, long-running tasks, potentially automating iterative product improvements.

Summary

What: Engineers are moving from prompting agents for one-off code to building autonomous loops that include a goal, context (data/tools), and self-evaluation. Tools like Claude Code and various harness frameworks allow agents to manage tasks like fixing flaky tests or performing codebase migrations without continuous human guidance.
Why it matters: This signals a shift toward 'self-driving' products where agents handle repetitive maintenance and minor UX tweaks, allowing engineers to focus on higher-level strategic work.

Deep Dive

  • Define clear goals before initializing agentic loops to avoid 'slop' output.
  • Curate context dynamically rather than dumping information at the start.
  • Use test-driven development metrics or LLM-as-judge for internal loop evaluation.
  • Implement subagents to handle specific work, preventing context window degradation.
  • Utilize cloud execution to allow loops to run asynchronously on long-term tasks.

Decoder

  • LLM-as-judge: A technique where a secondary, often stronger, LLM evaluates the outputs of a primary LLM based on specific rubrics or tests.
  • MCP: Model Context Protocol, an open standard that allows AI assistants to connect to data sources and tools consistently.
  • Tokenmaxxing: A disparaging term for the tendency of AI models to consume massive amounts of tokens (data) unnecessarily to reach a solution.

Original Article

Why we're bullish on loops

WTF are loops, why is everyone arguing about them, and why do they actually matter?

When the creators of both OpenClaw and Claude Code speak, people listen. And last week Peter Steinberger and Boris Cherny were both talking about the same concept: loops.

Their argument? You shouldn’t be prompting agents to write code, but building loops that prompt themselves to write code, so agents can complete long running tasks and you can use multiple agents at once to go further, faster.

What’s needed to engineer a loop?

You need four things:

1. A goal

Agents are capable, but you need to scope the loop so they know what you want them to achieve. A loop without a goal is a slop cannon.

2. Context

Context is fuel and loops are often starved of it. Context can include tools, skills, analytics data, errors, memories – any information that helps the agent find work and complete the loop. It’s best to curate context and feed it throughout, rather than dump it all upfront. The agent needs to be able to fetch and react to new inputs.

3. Evaluation

This is how the agent checks itself. Tests, evals, metrics, LLM-as-judge, playgrounds. Test-driven development is so back, or maybe it never left? This is a big difference between loops and prompting: agents do the verification, not engineers.

4. An agent (obviously)

The most basic option is using something like Claude Code with a while true (aka Ralph) or using /goal. More complicated options are purpose-built harnesses and context systems – e.g. an agent on a cron that pulls signals from your product data and emits work to subagents, or a loop that codegens its own test suite to verify itself.
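The four ingredients above can be sketched as a small control loop. This is a minimal Python sketch under stated assumptions, not how Claude Code or any particular harness works: `run_loop`, `LoopResult`, and the `fetch_context`, `agent`, and `evaluate` callables are hypothetical names, and the agent here is a stub function standing in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    done: bool        # True if the evaluation passed
    iterations: int   # number of agent turns used
    output: str       # last output the agent produced


def run_loop(
    goal: str,
    fetch_context: Callable[[int], str],
    agent: Callable[[str, str, str], str],
    evaluate: Callable[[str], tuple[bool, str]],
    max_iterations: int = 5,
) -> LoopResult:
    """Re-prompt the agent until the evaluation passes or the budget runs out."""
    feedback = ""
    output = ""
    for i in range(1, max_iterations + 1):
        context = fetch_context(i)               # context is fetched every turn, not dumped upfront
        output = agent(goal, context, feedback)  # hypothetical agent call
        passed, feedback = evaluate(output)      # verification is done by code, not by an engineer
        if passed:
            return LoopResult(True, i, output)
    return LoopResult(False, max_iterations, output)


# Stub agent (assumption, for illustration): it only "fixes" things after seeing feedback.
def stub_agent(goal: str, context: str, feedback: str) -> str:
    return "patched" if feedback else "first attempt"


def stub_evaluate(output: str) -> tuple[bool, str]:
    return (output == "patched", "tests failed: expected a patch")


result = run_loop("get CI green", lambda i: f"diff v{i}", stub_agent, stub_evaluate)
print(result.done, result.iterations, result.output)  # True 2 patched
```

The `max_iterations` budget is the design choice worth noting: without it, a failing evaluation turns the loop into the unbounded `while true` variant.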

Good examples of loops include:

  • PR babysitter. The goal is to get a pull request to pass tests and “get CI green.” The context is the changes (diff) as well as the testing suite, and the evaluation is done by the CI.
  • Bug fixer. The goal is to fix the bug. The context is the bug report and error trace. The evaluation is the test suite, snapshots, logs.
  • Flaky test hunter. The goal is to kill flaky tests. The context is CI history and retry logs. The eval is consecutive green runs.
  • Performance autoresearcher. The goal is to beat a benchmark. The context is the system, metrics, and budget. The eval is whether it is faster, better, etc. on that metric. We recently used Karpathy’s autoresearcher loop and it fixed a 3-year-old bug in our query engine and increased performance by 11%.
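As an illustration of an evaluation step, the flaky test hunter's "consecutive green runs" check could look like the sketch below. The function names and the threshold of 10 runs are assumptions made for this example; the article does not specify either.

```python
def trailing_green_runs(history: list[bool]) -> int:
    """Count consecutive passing runs at the end of a CI history (oldest run first)."""
    streak = 0
    for passed in reversed(history):
        if not passed:
            break
        streak += 1
    return streak


def is_stable(history: list[bool], required: int = 10) -> bool:
    """Hypothetical stop condition: the test counts as fixed after `required` green runs in a row."""
    return trailing_green_runs(history) >= required


ci_history = [True, False, True, True, True]
print(trailing_green_runs(ci_history))      # 3
print(is_stable(ci_history, required=3))    # True
```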

Why is everyone talking about loops right now?

Because it works. Yes, Peter and Boris lit the fire, but they’re doing so because it’s a real thing. The real “why now” is new and improved capabilities:

  1. Models are better at long running tasks. METR finds that Opus 4.6 can complete 50% of tasks that take 12 hours, over 6x the 1 hour 40 minutes of Opus 4 from a year ago. Fable (RIP) pushed this even further.
  2. Stories about huge tasks completed. Stripe performed a codebase-wide migration in a day that would take a team two months by hand. Lovable found it can now one shot apps that previously took hundreds of prompts to build.
  3. Loops are built-in now. Claude Code shipped a /loop command, both it and Codex have automations, and there’s even a Ralph plugin for Claude Code.
  4. Subagents separate the loop from the work. The main loop can spin up subagents that do the work and report back, saving tokens and preventing degradation.
  5. Harnesses are maturing. Compaction keeps context windows from filling. Skills and MCP enable agents to use more tools. Cloud execution lets you kick off a loop and walk away.

Loops didn’t magic themselves into existence based on a couple of tweets. They’re an expression of real industry-wide progress.

Loops are not just a new AI thing

Critics deride loops as a ploy from OpenAI and Anthropic to get everyone tokenmaxxing, but we think there’s a greater goal here: self-driving products.

Rather than an engineer needing to prompt an agent to progress a project, the agent can prompt itself. The product improves itself without input. User problems get solved faster and yes, numbers go up.

And here’s the thing, product engineers already complete this loop manually by:

  1. Collecting data through analytics and talking to users
  2. Building and shipping improvements to their products based on that data
  3. Evaluating how that improvement performed to guide future development
  4. Repeating constantly

PostHog has helped product engineers complete this loop for years now, so we think we’re in a great position to help agents with it too. It’s why we’re betting on building features that help make your product self-driving, like our Slack app, PostHog Code, and Replay Vision. Yes, we’re a little AI-pilled.

There are, of course, limits. Loops aren’t about to eliminate all engineering work, but they can put the 1% gains on cruise control: the bugs, UX issues, paper cuts, and conversion tweaks. The things that drain engineering hours, but rarely need strategic input.

The more you can automate these tasks, the more time you can spend on more impactful and (frankly) interesting work. The “self” in self-driving doesn’t mean autonomy from the engineer, it’s autonomy from user instruction as the starting point.

Code was never the problem

The opposition to loops is easy to understand: it’s another shift in the way software is built. Being told they should be “designing loops” makes engineers feel like they’re being replaced. The work is increasingly abstracted away from writing code.

But the rise of product engineers already showed that writing code was only a small portion of the work. The direction, taste, and empathy of a product engineer remain critical for building successful products in a loop-driven future.

Words by Ian Vanagas whose favorite type of loop is froot.

Tech design frontend ai

Building a design system specced for engineers and agents

Design systems are no longer just for human teams, but essential 'API' layers for AI agents to prevent code-generation sprawl in production codebases.

Summary

What: Evil Martians helped dev-tools platform Currents consolidate 791 disparate files into a structured design system in seven weeks. They used AI to audit legacy UI inconsistencies and defined strict token groups using OKLCH colors, enabling both human engineers and LLMs like Claude to generate consistent, valid UI components.
Why it matters: As AI-assisted coding accelerates, UI/UX divergence happens at 10x speed; design systems are shifting from 'style guides' to 'machine-readable constraints' for automated systems.
Takeaway: If your team uses AI for feature development, audit your UI components for divergence; if you see the same component following different rules in three places, prioritize a unified design system immediately.

Deep Dive

  • Use a 'North Star' UI prototype in Figma to align stakeholders and LLMs on visual direction.
  • Automate auditing of legacy codebases using LLMs to map thousands of inconsistent icons and colors.
  • Reduce color palettes to functional groups (elevation, content, UI, border) to simplify agent decision-making.
  • Adopt OKLCH color spaces for AI-readability, as they allow programmatic palette extension based on hue, lightness, and chroma.
  • Map all legacy icon/color references to new tokens via a lookup table to ensure deterministic migrations.

Decoder

  • OKLCH: A perceptual color space that makes it easier to programmatically generate color variations (like shades or tints) that look consistent to the human eye.
  • SVGR: A tool that transforms SVG files into React components, simplifying icon management in frontend projects.
  • North Star UI: A design artifact representing the ideal, finalized version of a user interface, serving as a blueprint for ongoing development.

Original Article

Building a design system specced for engineers and agents

AI-assisted coding allows technical founders and lean engineering teams to try new languages and frameworks, write more code, and ship new designs. It’s the perfect solution for validating ideas, building PoCs and MVPs. However, as adoption grows, it’s time to drastically elevate the UX and UI.

Currents is a test observability platform for running, debugging, and analyzing Cypress and Playwright suites in CI. It grew with a small team of strong engineers who wrote high-quality code but had to make hundreds of design decisions on the spot, like choosing icon and font sizes, colors, and filter options.

Currents came to Evil Martians to improve their UI and standardize design decisions. In just seven weeks, our team ran an AI-assisted UI audit and built a design system that can be used by engineers or agents …without a designer in the room. This allows Currents to invest in distribution and gives the team a design foundation that grows with the product.

Why Currents hired Evil Martians

Andrew Goldis, CEO at Currents, wanted to improve the experience for the users and his team. “Every time we want to work on something new, we need to go back and decide whether we need to reuse components or introduce new ones. And then we need to invest additional time into polishing them or we just use old school components,” he said.

Since Andrew was operating with a lean team, he didn’t have the resources to define the necessary app design guidelines and standards internally. This is when we came in.

There were two business triggers that made hiring us urgent:

  1. AI was amplifying divergence. The Currents team was using Cursor and Claude for code generation, but without a design foundation, every AI-assisted PR risked adding new components. Andrew wanted the design system to be AI-readable for consistent future deliveries.
  2. GTM was being held back. In devtools, the way a product looks plays a big role in the purchasing decision. Currents had a steady user base, but the UI had certain challenges to address before the team could feature it confidently in ad campaigns, on the landing page, or in sales demos.

Part 0: designing the vision

For an interface and product to feel crafted, a designer needs to make hundreds of small (and not-so-small) decisions that may not be visible at first. This leads to a design system.

But a design system alone is hard to digest, so Arthur Objartel, the Evil Martians product designer working on this project, always starts by selling his vision.

In this case, he rebuilt key screens in Figma to use as a north star UI in order to show Andrew how the new design would look across key pages. This helps the client visualize the future product and commit to all the key design decisions right at the start. A north star UI also gives the designer, the developer, and the LLM a shared ground for making decisions.

Part 1: running an AI-assisted audit in week one

Before any design work, Evil Martians needed to understand the full picture, which was scattered around 791 files with design information. In the past, this would’ve taken us two to three weeks to manually inventory every icon, color, and font size. In this case, with LLM assistance, Arthur completed it in a third of the time.

This is what the audit showed:

  • Two icon libraries running in parallel (@geist-ui/react-icons and react-icons/vsc), plus 69 local custom icons with 323 resized usages. For example, the same Check icon was being used 19 times at six different sizes.
  • Two competing font systems with up to five sizes per screen.
  • Many hardcoded colors without clear guidance on when to use which. There were 236 unique colors with 1,413 uses spread across five color reference systems.
  • Inconsistent button types across the product, hurting design cohesion.
  • Three different filter components across views, which affected the user experience and predictability.
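To give a feel for the color part of such an audit, here is a minimal Python sketch that counts hardcoded hex colors across file contents. It is not the tooling Evil Martians used (the article describes an LLM-assisted audit); `audit_colors`, the regex, and the sample file contents are assumptions for illustration, and only 3- and 6-digit hex values are covered.

```python
import re
from collections import Counter

# Matches #rgb and #rrggbb only; rgb(), hsl(), and named colors are out of scope here.
HEX_COLOR = re.compile(r"#(?:[0-9a-fA-F]{6}|[0-9a-fA-F]{3})\b")


def audit_colors(sources: dict[str, str]) -> Counter:
    """Count usages of each hardcoded hex color, case-insensitively, across all files."""
    counts: Counter = Counter()
    for text in sources.values():
        for match in HEX_COLOR.findall(text):
            counts[match.lower()] += 1
    return counts


# Hypothetical file contents standing in for a real codebase.
sources = {
    "Button.tsx": "color: #FFF; background: #1a73e8;",
    "Card.tsx": "border: 1px solid #fff; color: #1A73E8; fill: #0f0;",
}
counts = audit_colors(sources)
print(len(counts), sum(counts.values()))  # 3 5  (unique colors, total usages)
```

Note that this sketch treats `#fff` and `#ffffff` as different colors; a real audit would likely normalize them before counting.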

Part 2: building a design system specced for engineers and AI

When a client runs into a problem like “we don’t know what colors to choose, what font to pick, and which button variant to use out of 10 different ones,” it’s always due to the lack of design guidelines or visual direction.

The solution is a design system. In Currents’ case, it’s deliberately small and consists of a set of Storybook docs and general usage guidelines for each of the system’s foundational parts: typography, icons, colors, spacing, and corner radius values. It also includes deep research on current usage and components location, a migration plan on how to swap previous tokens for new ones, and a new system description.

Typography: Innovator Grotesk

After comparing a dozen open-source faces, Innovator Grotesk won on one practical point: its metrics are almost identical to Inter, the team’s existing font. That meant Currents could switch immediately, with no layout overhaul and nothing to re-space. A new typeface usually costs weeks of reflow; here it was close to a drop-in.

To keep type consistent without a designer in the loop, the rule is simple: fewer sizes, and an obvious answer for which one to use. Arthur built the system from scratch with six sizes and five token groups, each with a dedicated job. The UI group is only for interactive components, so when an engineer or an agent needs a button label, there is exactly one correct token: ui.default.

Icons: a custom set, mapped to the old one with 90% AI accuracy

Open-source icon sets tend to look the same and lack personality. So we went with Figura One because it comes with stylish, pixel perfect icons that have a great set of metaphors.

The real question was adoption. Figura One had no public React library, which meant the team would have had to process every icon by hand before shipping it. Rather than hand Currents that work, Arthur built a local wrapper with SVGR and merged it into production with the frontend team’s sign-off. The icons went from a liability to a one-line import.

To make the migration mechanical instead of manual, we mapped the new set against all 191 legacy icons. The AI proposed a correct Figura One match for roughly 90% of them, and the rest were mapped by hand. The result is a lookup table an engineer or an agent can follow icon by icon, with no judgment calls left in the swap.
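A lookup-table migration of this kind can be sketched in a few lines of Python. This is an illustration, not the team's actual script: the new icon names (`CheckIcon`, `CloseIcon`) are hypothetical, and the sketch assumes legacy icons can be recognized by the `Vsc` prefix used by react-icons/vsc.

```python
import re

# Hypothetical lookup table: legacy icon name -> new icon name.
ICON_MAP = {
    "VscCheck": "CheckIcon",
    "VscClose": "CloseIcon",
}

# Assumption: legacy icons are identifiers that start with "Vsc".
LEGACY_ICON = re.compile(r"\bVsc[A-Z][A-Za-z0-9]*\b")


def migrate_icons(source: str, icon_map: dict[str, str]) -> tuple[str, list[str]]:
    """Swap mapped legacy icon names and report unmapped ones for manual review."""
    unmapped: list[str] = []

    def swap(match: re.Match) -> str:
        name = match.group(0)
        if name in icon_map:
            return icon_map[name]
        if name not in unmapped:
            unmapped.append(name)
        return name  # leave unknown icons untouched rather than guessing

    return LEGACY_ICON.sub(swap, source), unmapped


migrated, todo = migrate_icons("<VscCheck size={16} /> <VscBell />", ICON_MAP)
print(migrated)  # <CheckIcon size={16} /> <VscBell />
print(todo)      # ['VscBell']
```

Because every swap comes from the table, the same input always produces the same output, and anything missing from the table is surfaced instead of silently rewritten.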

Color: 1,413 usages collapsed into four token groups

The first challenge before creating a new color token system was to match the UI to the brand colors. The UI used blue as both its primary and its success color, yet the Currents brand color was green. This confused users, who usually expect successful actions to be green.

But making everything green in the interface would also be confusing, so Arthur solved it by picking different greens: cold green as an accent brand color and warm green for success indicators.

From there, the audit’s 236 colors and 1,413 usages collapsed into four groups, each with a clear job: elevation, content, UI, and border. Fewer colors, and an obvious answer for which one applies, the same principle as the type scale.

Every color is defined in OKLCH, and this is the part that makes the system genuinely AI-readable. Because OKLCH expresses a color as plain, human-readable lightness, chroma, and hue, an agent can extend the palette without guessing: hold the hue and lightness, step the chroma, and the new shade already belongs to the system. Ask it for “a border one step softer than border.default, same hue,” and you get a value that holds up against the rest of the palette instead of a one-off hex that drifts.
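The "hold the hue and lightness, step the chroma" rule can be sketched as a small Python helper. This is a hedged illustration: `step_chroma` is a hypothetical name, the color values are made up (the article does not give the actual value of border.default), and treating "softer" as lower chroma is an assumption. The parser handles only the plain `oklch(L C H)` form, without alpha or a `deg` unit.

```python
import re

# Plain "oklch(L C H)" only; lightness may be a number or a percentage.
OKLCH = re.compile(r"oklch\(\s*([\d.]+)(%?)\s+([\d.]+)\s+([\d.]+)\s*\)")


def step_chroma(color: str, delta: float) -> str:
    """Return an oklch() string with the same lightness and hue and an adjusted chroma."""
    m = OKLCH.fullmatch(color.strip())
    if m is None:
        raise ValueError(f"not a plain oklch() color: {color!r}")
    lightness, pct, chroma, hue = m.group(1), m.group(2), float(m.group(3)), m.group(4)
    new_chroma = max(0.0, round(chroma + delta, 4))  # chroma cannot go below zero
    return f"oklch({lightness}{pct} {new_chroma:g} {hue})"


border_default = "oklch(0.8 0.04 250)"       # hypothetical token value
print(step_chroma(border_default, -0.02))    # oklch(0.8 0.02 250)
```

Lightness and hue are passed through as the original strings, so the only thing that changes between the two tokens is the chroma component.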

Part 3: going the extra mile

At Evil Martians, we believe in adding value fast and tend to fix things as we work. The left sidebar was a navigation pain point for many users. For example, the theme switcher, docs links, help, and changelog buttons were sitting in the top-right of the product, disconnected from primary navigation. In agreement with the client, we changed the sidebar design to help users immediately and to lay the groundwork for future updates.

We also noticed the tool was using three different filtering options. When an app has different filtering options, users lose predictability and have to relearn how to navigate your interface each time.

To fix this without waiting for the full app redesign, we consolidated all previous filter implementations into one component that handles multi-select, ranges, dates, and presets with a single interaction model. To make the most of this functionality, we also added quick filters and presets that allow users to save common queries.

Results and next steps

I really like all the attention to details and the way Arthur delivers the assets. It’s next level! I’m not used to it and it’s very refreshing.

— Andrew Goldis, CEO of Currents

To summarize, in just seven weeks, Currents got access to:

  • An AI-assisted audit of every icon, color, and font size to understand the full picture
  • A design system living in Figma, Storybook, and GitHub, readable by AI agents and engineers
  • An icon, typography, and color migration map to simplify the adoption of new components
  • A new color system that is easy to understand and use
  • A north-star UI design of the most critical screens for engineers to have a reference point when applying changes

The implementation is now on Currents’ side. The team hired a front-end engineer who’s already implementing the system and, according to the client, “it’s looking great.” Arthur is still in close contact with the team, guiding them through the process and answering any ad-hoc questions. We’re also looking forward to seeing the reactivation of paid distribution and go-to-market, an improvement in developer experience, and a consistent design and predictable UX.

How to know if you need to invest in a design system in 2026

A design system is the type of thing that goes unnoticed when it works, but becomes extra visible when you don’t have one. With AI-code generation becoming the new norm and allowing engineers to produce more code faster, not having a design system becomes a liability. For example, if you’re now producing 10x the output, you’re also introducing inconsistencies at that pace.

But how do you know if this is what you need? Here are three signals that you’re past the decision point:

  • Your product is dense or technical and you don’t feel comfortable showing the UI or demoing the tool in sales conversations
  • Your team uses LLMs for coding new features, or will soon, and you don’t have designers on your team
  • You can name three places in your UI where the same components follow different rules

Putting together a design system has now become less expensive than before. AI-assisted work has dramatically cut the hours required to audit the state of your tool. However, the cost of building without a design foundation has gotten exponentially higher.

If at least two of the signals on the list are true for you, the question isn’t whether to invest in a design system but when.

Design ai agents

Four Ways We're Using Our MCP Server at Figma

Figma is expanding its Model Context Protocol (MCP) server integration to allow agents to generate and update assets across Slides, FigJam, and prototypes.

Summary

What: Figma’s MCP implementation now enables users to pull live data from Slack, Google Drive, Asana, Notion, and Hex to automate the creation of decks and workshops, while adding a new asset-download tool.
Why it matters: Figma is betting that the future of design workflows is not just drawing, but programmatically linking canvas elements to live operational data via AI agents.
Takeaway: Developers using Figma's MCP can now test the write-beta capabilities to automate asset exports and deck updates using their internal tool data.

Deep Dive

  • MCP Integration: Expands the Model Context Protocol to bridge design canvases with external business tools.
  • Cross-Tool Automation: Enables real-time syncing of content from Slack, Notion, and Hex into Figma/FigJam.
  • Asset Management: Introduced download_assets to automate exports (SVG, PDF, JPG, PNG).
  • Agentic Prototyping: Allows for generative design workflows where agents handle creation tasks based on prompts.
  • Write Beta: The platform has opened up write-back capabilities for broader agentic control.

Decoder

  • MCP (Model Context Protocol): An open standard for connecting AI models to live data sources, allowing agents to query and manipulate external systems securely.
  • Gulf of Execution: A concept in cognitive psychology describing the difficulty users face in determining how to interact with a system to achieve a goal.

Original Article

Full article content is not available for inline reading.

Design ai

What Figma Made Visible

Figma’s role in teaching design logic through manual friction is at risk as AI-driven automation replaces the struggle required to build true structural intuition.

Summary

What: Designer Murphy Trueman argues that the 'friction' in manual design workflows—debugging component relationships, spacing, and styles—is where professional craft and structural understanding are actually formed.
Why it matters: The industry is moving toward 'plausible-looking' designs generated via AI, which may create a generation of designers who understand tool operation but lack deep, system-level design knowledge.

Deep Dive

  • Design Friction as Pedagogy: Manual struggle with UI layouts teaches the underlying logic of component systems.
  • AI Abstraction: Automation of tedious tasks risks removing the 'learning moments' that occur when fixing errors.
  • Structural Superficiality: There is a risk that AI-generated systems look coherent on the surface but break when scaled or themed.
  • Disposability of UI: Speed of generation encourages a lack of critical attention to the structural integrity of a design.
  • Professional Shift: The author questions whether new practitioners will develop the same intuition without the historical experience of 'wrestling' with the software.

Original Article

What Figma made visible

And whether the next generation of practitioners will get the same thing.

What changed, coming back to design after a stint in front-end development, was Figma.

I'd been doing WordPress theming and custom PHP templates long enough to understand the logic of components without having a design tool that thought in them. Long enough, also, to know what I didn't like about front-end: a specific texture of problem where the work is blocked by something in a package you didn't write and can't fully see. NPM errors at 10pm with no obvious cause. Dependency conflicts with no clean resolution. I liked building things. I didn't like being stuck on the unblocking. Coming back to design was the right call.

The logic I'd built up in front-end didn't leave, though. A partial in a WordPress theme is a component in everything but name — you make a change in one file and it propagates, you build something once and reuse it, you start to feel the shape of a decision made upstream becoming a constraint downstream. But in a template file you're imagining the structure. In Figma, you're holding it.

You create a component, attach a style, and watch a change move through a file the moment you make it. The first time a colour style update spread across an entire file without me touching anything else, something clicked — not in how I used the tool, but in how I understood the work. That's where the engineering thinking I'd built up started talking to the design thinking I'd been developing, and the two stopped feeling like different careers.

That's where I fell into design systems properly. The connection between structure and output, the way a decision about a spacing scale becomes a constraint for every team that consumes it, the satisfaction of building something that holds together when someone else picks it up in a context you never anticipated. It was problem-solving I could see, and trace, and feel the edges of.

I've been thinking about that a lot lately, because I'm not sure I'd fall into it the same way if I were starting now.

The AI tooling in design tools is useful, and I've written about AI and design systems long enough to know better than to dismiss it. But what it's starting to replace isn't just work — it's the friction that makes you understand what you're doing. When you manually set a style and chase it through a file, or spend an hour figuring out why a spacing decision that looked fine in isolation is breaking a layout three screens over, you're learning the system through the failure. The problem is the lesson. The thing that's wrong is pointing at something real about how the pieces relate, and you come out the other side knowing something you didn't know going in, because the tool made you earn the answer.

AI assistance smooths that out. It fills in the gaps, suggests the fix, generates the variant, and the output is often fine — designs look considered, components get made. But the craft knowledge that used to accumulate through wrestling with a file doesn't accumulate the same way when the tool is doing the wrestling for you. The person on the other end may be producing good-looking work without developing much understanding of why the decisions behind it matter.

There's a version of this I'm willing to accept. Tedious work is tedious, and automating the parts that don't require judgement so there's more time for the parts that do is a reasonable trade. But there's a difference between automating the tedious and automating the effortful, and from where I sit, what's being abstracted away now isn't just repetition — it's the problem-solving. The moment where you're stuck on something small and the act of getting unstuck teaches you something about the system you wouldn't have learned any other way. That moment is disappearing, and I don't know what replaces it.

UI is starting to feel disposable in a way that unsettles me. Not as a complaint about quality. Structurally.

The pace of generation means nobody dwells in a screen long enough to notice whether it's actually considered or just plausible, and plausible is increasingly good enough because the bar for "does this look designed" keeps dropping when the tools do the designing. The decisions are still there, technically, but the attention that turns decisions into craft has somewhere else to be.

You can produce a lot of considered-looking work without any of it being considered.

What that produces, eventually, is a design system built fast — surface coherent, components present, tokens named — that frays the moment someone tries to theme it or extend it or hand it to a consuming team. Not because the decisions were wrong, but because nobody fully understood what they were deciding when they made them. The structure held the appearance of intention without the substance of it. I was seeing that pattern before AI tooling existed. The speed increase just makes it easier to arrive there faster, and harder to notice you're heading there at all.

Designers said it about Photoshop. Developers said it about frameworks. The complaint usually ages badly — the people making it tend to look, in retrospect, like they were defending familiarity rather than something real. I'm aware of that, and it's possible this is just what it feels like to watch a practice change and not be the person it's changing for.

But the specific thing that made Figma matter to me wasn't efficiency. It gave me a way to hold abstract structural ideas in my hands — the relationship between a component and a style, between a token and a decision, between a change made in one place and every instance that inherits from it, made visible, made editable, made something you could touch and understand by touching it. That's how I learned to think about the work.

Whether the next layer of tooling gives anyone that same contact with the underlying ideas, I don't know. The people coming into design systems practice now are learning in a different environment, with different feedback loops and different friction levels — that might produce designers who think just as well and just differently. Or it might produce designers who know how to operate tools that have already resolved the decisions for them. The industry can't tell yet either.

Design frontend react ai

Design Canvas that Writes Code (Website)

Lunagraph bridges the gap between design and production by allowing users to create UI directly in React code via an AI-powered visual canvas.

Summary

What: Lunagraph is a design tool that generates real HTML, CSS, and React code, utilizing Claude Code to allow developers and designers to modify components, manage state, and wire up logic directly within the canvas.
Why it matters: This represents an attempt to eliminate the 'handoff' problem by making the design interface itself a code editor, effectively turning visual design into a direct IDE plugin.
Takeaway: Test the public beta if you are looking to unify your design system and codebase without separate design software.

Deep Dive

  • Integrated canvas that outputs production-ready React, HTML, and CSS.
  • Uses Claude Code as an AI design partner for component refactoring and state management.
  • Supports local file system access to edit existing project repositories.
  • Enables live previews of components via an iframe connected to local development environments.
  • Encourages a 'no-handoff' workflow by treating the code as the source of truth for design.

Decoder

  • Claude Code: A CLI tool that allows AI models to interact directly with local codebases and filesystems to perform complex engineering tasks.
  • Handoff: The process where a designer delivers static prototypes or assets to a developer, which the developer must then interpret and convert into code.

Original Article

The design canvas you know, but it writes real code, powered by Claude Code.

Lunagraph lets you design and create UI using real HTML, CSS and React code. Stay consistent with zero handoff, and work across teams: designers, product, developers, and agents.

Sculptors don't hand off a sketch and hope someone else carves it right. They shape the stone themselves.

Software should work the same way.

The ones who agonize over every pixel should be shipping the final code for the design.

Design with the raw material. The code itself, not an abstraction of it.

The best design decisions come from working directly with what ships. No handoff gaps, no lost-in-translation moments — just designers shaping real components, real logic, real output.

Finally, design is code.

The deliverable isn't a design file. It's the code itself. No more translating pixels into tickets. What you craft is what ships.

export function NewComponent() {
  return (
    <div className="flex flex-col gap-4 w-fit p-6 bg-background">
      <h2 className="font-serif tracking-tight font-medium text-3xl text-foreground leading-tight">
        Finally, design is code.
      </h2>
      <p className="max-w-2xl leading-relaxed text-lg text-muted-foreground">
        The deliverable isn't a design file. It's the code itself.
        No more translating pixels into tickets.
      </p>
    </div>
  )
}

Work together with Claude Code, a creative design partner that carries the full picture.

Your documents, the canvas, your moodboard and inspiration, down to the codebase itself. Every decision informed by everything you've already built.

The full design-to-codebase round trip, in one platform.

The chat sees your canvas and your local codebase. Design on the canvas, implement straight into your repo, preview the result in a live iframe — then screenshot both to compare. No context switching, no handoff.

Join designers who write real code.

Design and ship in one canvas.

Design ai

Designing with Uncertainty: How AI Supercharges Probabilistic Thinking

Designers should stop treating AI predictions as facts and adopt a probabilistic mindset to build resilient, human-centric systems.

Summary

What: Pratik Joglekar argues that AI is a tool for probabilistic signaling, not deterministic truth, requiring designers to build explicit safeguards, human-in-the-loop overrides, and clear communication of uncertainty.
Why it matters: Transitioning from designing for deterministic outcomes to probabilistic ones is essential as AI integration moves from simple text generation to high-stakes product decision-making.
Takeaway: Identify one area in your current project where AI output is presented as fact and add a confidence indicator, an 'I think this is wrong' feedback mechanism, or a manual override.

Deep Dive

  • Treat AI outputs as weighted signals rather than definitive answers.
  • Avoid wrapping probabilistic AI systems in deterministic interfaces that imply false certainty.
  • Use human-in-the-loop (HITL) not just as a safety net, but as a refinement engine to improve models.
  • Implement resilient design by anticipating how AI models degrade and planning fallback paths.
  • Monitor second-order effects; short-term conversion gains can often hide long-term health risks.
  • Design for 'likelihood' rather than 'success' by modeling multiple possible user paths.
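
The takeaway above (confidence indicator, feedback path, manual override) can be sketched as a small routing function. This is a minimal illustration, not something taken from the article: the `Prediction` type, the `route` function, the action names, and both thresholds are hypothetical and would need tuning per product.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str
    confidence: float  # model-reported probability in [0, 1] (assumed to be available)


def route(prediction: Prediction,
          auto_threshold: float = 0.9,
          review_threshold: float = 0.6) -> dict:
    """Decide how to present an AI output instead of showing it as fact.

    Thresholds are illustrative assumptions, not recommended values.
    """
    if not 0.0 <= prediction.confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if prediction.confidence >= auto_threshold:
        action = "show"                # high confidence: present normally
    elif prediction.confidence >= review_threshold:
        action = "show_with_warning"   # medium: present, but flag the uncertainty
    else:
        action = "human_review"        # low: fall back to a person
    return {
        "label": prediction.label,
        "confidence_text": f"{prediction.confidence:.0%} confident",
        "action": action,
        "can_override": True,          # the user can always correct the system
    }
```

Keeping `can_override` true on every path reflects the article's point that human-in-the-loop is a refinement channel as well as a safety net.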

Decoder

  • Probabilistic thinking: A framework that views outcomes as ranges of likelihoods rather than binary true/false results.
  • Human-in-the-loop (HITL): A model where a human participates in the system's decision-making process to verify, override, or refine AI-generated outputs.
  • Confidence score: A quantitative metric provided by an AI system to indicate the probability that its prediction or classification is correct.

Original Article

Full article content is not available for inline reading.

Data ai llm research

Predicting model behavior before release by simulating deployment

OpenAI is moving from static safety benchmarks to deployment simulation, replaying real-world conversation logs through candidate models to predict production behavior.

Summary

What: OpenAI's new technique uses de-identified user data to simulate how a model reacts to various inputs before full release. This provides more accurate estimates of unwanted behaviors compared to traditional evaluation methods.
Why it matters: This signals a shift toward empirical, data-driven safety testing where labs increasingly rely on historical production traffic rather than synthetic or curated benchmarks to model risk.

Deep Dive

  • Methodology: Replays real conversation prefixes against candidate models to identify potential issues.
  • Benefit: Offers more realistic coverage than static benchmarks like MMLU or HumanEval.
  • Metric improvement: Provides better rate estimates for model failure modes by simulating production conditions.
  • Privacy: Uses de-identified datasets to ensure user confidentiality.

Decoder

  • Deployment Simulation: A testing approach where developers use historical interaction data to stress-test new model versions before release.
  • De-identified data: Information that has been processed to remove personally identifiable information, allowing it to be used for research without compromising user privacy.

Original Article

OpenAI's Deployment Simulation is a new pre-release safety technique that replays real (de-identified) user conversation prefixes from previous deployments and generates responses with a candidate model to predict how it will actually behave in production, providing far more realistic coverage and better rate estimates of undesired behaviors than traditional evaluations.
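
The replay-and-measure loop described above might look roughly like the sketch below. Everything here is an assumption made for illustration: `generate` stands in for the candidate model, `is_undesired` for whatever classifier or grader flags a response, and the Wilson score interval is one common way to put error bars on a rate. The source does not say how OpenAI computes its estimates.

```python
import math
from typing import Callable, Sequence, Tuple


def wilson_interval(hits: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 is roughly 95%)."""
    if n == 0:
        raise ValueError("need at least one sample")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def simulate_deployment(prefixes: Sequence[str],
                        generate: Callable[[str], str],
                        is_undesired: Callable[[str, str], bool]) -> dict:
    """Replay each conversation prefix through the candidate model and
    estimate how often the response is flagged as undesired."""
    hits = sum(1 for prefix in prefixes if is_undesired(prefix, generate(prefix)))
    n = len(prefixes)
    low, high = wilson_interval(hits, n)
    return {"n": n, "hits": hits, "rate": hits / n, "ci95": (low, high)}
```

Reporting an interval alongside the point estimate matters most for rare behaviors, where a handful of hits in a sample can otherwise look more precise than it is.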

Data enterprise backend

Building an AI Database for Agentic GTM Operations

Rippling rebuilt its go-to-market data stack into a centralized lakehouse, replacing fragmented warehouses with a unified intelligence layer for AI agents.

Summary

What: Rippling unified its sales data using a lakehouse architecture, incorporating ML-based entity resolution, a medallion data structure, and semantic search to feed AI agents via a natural-language interface called Genie.
Why it matters: This transition highlights the growing need for 'agent-ready' data infrastructure where raw telemetry is cleaned, resolved, and vectorized to serve as the long-term memory for autonomous workflows.

Deep Dive

  • Unified Architecture: Shifted from fragmented data warehouses to a single lakehouse for consistent insights.
  • Entity Resolution: Utilizes ML to deduplicate records across disparate third-party sources.
  • Medallion Architecture: Implements layered data processing (Bronze/Silver/Gold) to ensure quality and reliability.
  • Semantic Intelligence: Uses vector search to allow agents to query and understand business data via natural language.

Decoder

  • Lakehouse: An architecture that combines the low-cost, scalable storage of a data lake with the data management and performance features of a data warehouse.
  • Medallion Architecture: A data design pattern that organizes data into layers of increasing quality: Bronze (raw), Silver (cleansed), and Gold (business-ready).
  • Entity Resolution: The process of identifying different records that refer to the same real-world entity (e.g., merging two entries for the same customer).

Original Article

Rippling rebuilt its go-to-market data foundation into an AI Database on a lakehouse to power agentic operations, replacing fragmented warehouses and manual processes with a unified intelligence layer. Key elements include ML-based entity resolution across third-party sources, medallion architecture, semantic search over enriched conversation data, and a natural-language Genie interface used by AI agents.
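
To make the layering concrete, here is a toy Bronze to Silver to Gold pass over account records. It is a sketch under stated assumptions, not Rippling's pipeline: the field names are hypothetical, and a simple normalized-name key stands in for the ML-based entity resolution the article describes.

```python
import re
from collections import defaultdict


def entity_key(name: str) -> str:
    """Stand-in for ML entity resolution: a normalized company-name key."""
    lowered = re.sub(r"\b(inc|llc|corp|ltd)\b\.?", "", name.lower())
    return re.sub(r"[^a-z0-9]", "", lowered)


def to_silver(bronze_rows):
    """Silver: drop rows without a company name and attach a match key."""
    silver = []
    for row in bronze_rows:
        name = (row.get("company") or "").strip()
        if not name:
            continue  # unusable raw record
        silver.append({
            "source": row["source"],
            "company": name,
            "entity_key": entity_key(name),
            "employees": row.get("employees"),
        })
    return silver


def to_gold(silver_rows):
    """Gold: one business-ready record per resolved entity."""
    grouped = defaultdict(list)
    for row in silver_rows:
        grouped[row["entity_key"]].append(row)
    gold = []
    for key, rows in sorted(grouped.items()):
        counts = [r["employees"] for r in rows if r["employees"] is not None]
        gold.append({
            "entity_key": key,
            "company": rows[0]["company"],
            "sources": sorted({r["source"] for r in rows}),
            "employees": max(counts) if counts else None,
        })
    return gold
```

An agent would then query only the Gold records, which is the point of the layering: deduplication and cleanup happen once, upstream of every consumer.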

Data ai backend performance

From Scoring to Spelling: Rebuilding Ads Retrieval at Instacart

Instacart replaced its scoring-based BERT ads retrieval system with a generative model that predicts product recommendations token-by-token.

Summary

What: Instacart moved away from scoring every product ID via BERT, instead using a generative approach that outputs Instacart Semantic IDs. This resolved vocabulary constraints, structural drift, and cold-start issues for the ads platform.
Why it matters: This demonstrates a growing trend in retrieval systems where traditional scoring bottlenecks are bypassed by treating recommendation as a language generation task.

Deep Dive

  • Generation vs. Scoring: Eliminated the need to score the entire product catalog, significantly reducing latency.
  • Semantic IDs: Represents each product as a sequence of Semantic ID tokens that the model generates one token at a time, instead of assigning every product ID its own score.
  • Generalization: The generative approach allows for better session-based recommendations and handles new products more effectively.
  • Performance: Solves the issue of structural drift where older models failed to account for changing inventory or user trends.

Decoder

  • Retrieval: The phase of a recommendation system that quickly narrows down millions of items to a smaller set of candidates for ranking.
  • Cold-start problem: The challenge of recommending items (or users) that have little to no interaction history.
  • Structural drift: A phenomenon where a model's performance degrades over time because the underlying data distribution changes, causing old relationships to lose relevance.

Original Article

Instacart rebuilt its ads retrieval system by moving from a scoring-based BERT model (that scored every product ID) to a generative approach that “spells out” recommendations token-by-token using Instacart Semantic IDs. This shift solved vocabulary bottlenecks, cold-start problems, and structural drift, enabling better generalization, full catalog coverage, and stronger session understanding.
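
A toy version of "spelling out" a recommendation is sketched below: a beam search that builds an ID token by token and only follows prefixes that exist in the catalog. This illustrates the general technique, not Instacart's system. The readable string tokens, the `score_next` callable (a stand-in for the generative model's per-token log-probability), and the fixed ID length are all assumptions.

```python
from typing import Callable, Dict, List, Set, Tuple

SemanticId = Tuple[str, ...]


def build_prefix_index(catalog: Dict[str, SemanticId]) -> Dict[SemanticId, Set[str]]:
    """Map each valid ID prefix to the set of tokens that may follow it."""
    allowed: Dict[SemanticId, Set[str]] = {}
    for semantic_id in catalog.values():
        for i, token in enumerate(semantic_id):
            allowed.setdefault(semantic_id[:i], set()).add(token)
    return allowed


def generate_ids(score_next: Callable[[SemanticId, str], float],
                 allowed: Dict[SemanticId, Set[str]],
                 id_length: int,
                 beam_width: int = 2) -> List[SemanticId]:
    """Beam search over ID tokens, constrained to prefixes found in the catalog."""
    beams: List[Tuple[SemanticId, float]] = [((), 0.0)]
    for _ in range(id_length):
        candidates = []
        for prefix, score in beams:
            for token in sorted(allowed.get(prefix, ())):
                candidates.append((prefix + (token,), score + score_next(prefix, token)))
        # Highest total score first; ties broken by the ID itself for determinism.
        candidates.sort(key=lambda c: (-c[1], c[0]))
        beams = candidates[:beam_width]
    return [prefix for prefix, _ in beams]
```

Work per step scales with the beam width and the number of valid next tokens rather than with catalog size, which is the contrast with scoring every product ID. In this toy, adding a product only means adding its ID to the prefix index.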

Data database postgresql devops

pg_kpart version 1.0

The pg_kpart PostgreSQL extension enforces partition key usage by blocking any query that would otherwise trigger a full-hierarchy scan.

Summary

What: Maintained by Gilles Darold at HexaCluster, pg_kpart prevents I/O storms on large partitioned tables by rejecting queries that lack a usable predicate on the partition key. It supports whitelist/blacklist scoping and an audit mode.
Why it matters: It shifts partition management from an informal developer convention to a hard, database-level constraint, preventing performance degradation caused by accidental full table scans.
Takeaway: If you manage large PostgreSQL partitioned tables, install pg_kpart and enable audit mode to identify problematic application queries before enforcing blocking.

Deep Dive

  • Purpose: Rejects queries that fail to prune partitions.
  • Mechanism: Acts as a guardrail against I/O-intensive scans.
  • Control: Supports audit mode for non-blocking monitoring, and blacklisting/whitelisting for selective table application.
  • Integration: Uses custom SQLSTATE codes for application-level error handling.
  • Platform: Currently restricted to Linux environments.

Decoder

  • Partition pruning: An optimization where the database engine skips partitions that cannot contain matching rows, based on the query's filter on the partition key.

Original Article

pg_kpart version 1.0

Bangkok, Thailand - June 12, 2026

pg_kpart - Reject queries that scan all partitions without using the partition key

pg_kpart is a PostgreSQL extension that rejects queries which would scan every partition of a partitioned table without a usable predicate on the partition key. It prevents accidental full-hierarchy scans caused by missing WHERE/JOIN conditions on the partition key.

As a DBA, you have almost certainly run into queries that hit a partitioned table without using the partition key. On tables holding hundreds of millions or billions of rows, it is a disaster: PostgreSQL has no choice but to scan every partition, the server's I/O subsystem saturates, and overall performance collapses for everyone connected to the instance.

The rule it enforces is simple: if a query against a protected partitioned table cannot prune any partitions, it is not allowed to run. The author gets a clear error instead of a server-wide I/O storm, and the query has to be rewritten to filter on the partition key — which is exactly the access pattern partitioning is meant to encourage.
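
The rule can be modeled in a few lines. The sketch below is a toy Python model of the idea only: it is not pg_kpart's code, configuration, or error format, and every name in it (`Partition`, `check_query`, `FullScanRejected`, the `audit` flag) is hypothetical. Partitions are range-partitioned on an integer key, and a query is described only by the key range its predicate implies, or `None` when it has no predicate on the key.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass(frozen=True)
class Partition:
    name: str
    lower: int  # inclusive bound of the partition key
    upper: int  # exclusive bound of the partition key


class FullScanRejected(Exception):
    """Raised when a query would touch every partition."""


def prune(partitions: Sequence[Partition],
          key_range: Optional[Tuple[int, int]]) -> List[Partition]:
    """Return the partitions that may hold rows in [lo, hi); None means no key predicate."""
    if key_range is None:
        return list(partitions)
    lo, hi = key_range
    return [p for p in partitions if p.lower < hi and lo < p.upper]


def check_query(partitions: Sequence[Partition],
                key_range: Optional[Tuple[int, int]],
                audit: bool = False):
    """Reject the query if nothing was pruned; in audit mode, only report it."""
    survivors = prune(partitions, key_range)
    if len(survivors) == len(partitions):
        message = "query would scan all partitions"
        if audit:
            return survivors, message  # let it run, but record the violation
        raise FullScanRejected(message)
    return survivors, None
```

The audit path mirrors the rollout order the announcement suggests: observe which queries would be rejected first, then switch to blocking.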

pg_kpart turns "please always filter on the partition key" from a fragile convention that depends on every developer remembering it into a guarantee enforced by the database itself. With an audit mode to ease the rollout, blacklist/whitelist scoping to target exactly the tables that matter, and a dedicated SQLSTATE so applications can react to violations cleanly, it slots neatly into a production workflow. If you run large partitioned tables and you would rather not be at the mercy of the next query that forgets the partition key, this extension is for you.

See this blog post for more information: https://hexacluster.ai/blog/pg-kpart-postgresql-extension

Links & Credits

pg_kpart is an open project. Any contribution to build a better tool is welcome. Just send your ideas, feature requests, or patches using the GitHub tools.

Thanks to the developers who submitted patches and the users who reported bugs and feature requests. They are all cited in the ChangeLog file.

About pg_kpart

The objective of pg_kpart is to give PostgreSQL DBAs a tool for enforcing the correct use of PostgreSQL partitioning.

Tool created at HexaCluster Corp and maintained by Gilles Darold.

pg_kpart works on Linux and is available under the PostgreSQL license.

Data enterprise cloud

Introducing Lakehouse//RT: Real-Time Performance on a Unified Lakehouse

Databricks' new Lakehouse//RT engine aims to deliver sub-100ms query performance directly on unified lakehouse data without requiring separate serving layers.

Summary

What: Powered by the new 'Reyden' engine, Lakehouse//RT enables real-time analytics and BI by eliminating data movement between Delta Lake tables and serving systems, while maintaining centralized Unity Catalog governance.
Why it matters: This signals a major push to consolidate real-time serving back into the primary data platform, reducing the 'three-tax' problem of data duplication, fragmented governance, and complex pipeline engineering.
Takeaway: Contact your Databricks account team to evaluate Lakehouse//RT, which is currently in Beta for select read-only workloads, with usage discounted 30% through January 2027 as an introductory offer.

Deep Dive

  • Performance: Claims millisecond latency on complex joins and aggregations at scale.
  • Architecture: Unifies real-time serving with batch data; replaces side-car systems like Redis or specialized OLAP cubes.
  • Governance: Inherits existing Unity Catalog permissions without re-configuration.
  • Scaling: Utilizes incremental autoscaling per node rather than full-cluster replication.
  • Availability: Currently Beta for read-only workloads.

Decoder

  • Serving layer: A specialized, often proprietary database used to store a copy of data for fast read access, typically used in addition to a data lake.

Original Article

  • Databricks launches Lakehouse//RT, powered by new real-time engine Reyden, to provide millisecond query speeds directly on your lakehouse without data movement
  • Preview users have seen up to 16x better performance vs. real-time serving layers, with response times as low as 10ms on smaller datasets and sub-100ms performance on larger ones
  • This eliminates complex architectures and extra pipelines by unifying real-time serving with centralized Unity Catalog governance

When we introduced the lakehouse architecture, our vision was to create a single, unified platform for all your data needs by eliminating the divide between data lakes and data warehouses. We proved this was possible with Databricks Lakehouse, bringing diverse workloads in analytics, BI, AI, and ETL together on a single platform using open data, removing duplication and centralizing governance.

Now, we are unifying real-time serving with our core data platform. Today, this is most commonly accomplished by using a separate serving layer or specialized engine. This results in siloed data copies that add complexity, cost, and risk to your data architecture.

Databricks is pleased to announce that we are bringing millisecond performance directly to the lakehouse. We’re introducing Lakehouse//RT, Databricks’ new real-time data warehouse designed for operational analytics, BI and app serving, and observability workloads. Lakehouse//RT is powered by Reyden, a breakthrough new engine for real-time workloads that require immediate responsiveness at high concurrency.

Separate serving layers: a broken compromise

As organizations expand data access across users, applications, dashboards, and agents, demand for real-time responsiveness under high concurrency continues to grow. The traditional answer was to introduce a dedicated serving layer. While fast for reads, this approach requires you to copy data to a new layer, isolating it from the rest of your platform while introducing more complexity across your environment.

Copying your data into a separate serving layer isn't free. It costs you three times, before you've served a single query.

  1. You pay in duplication. You extract your data from open formats like Delta and Iceberg and copy it into proprietary storage no other engine can read. Now you own a second ingestion pipeline, a new set of failure modes every time a sync breaks, and fresh operational overhead every time the source data changes.
  2. You pay in governance. The security policies, access controls, and business logic you defined once in Unity Catalog don't follow the data into the serving layer. So you define them again, in a second place. The moment the two drift, you've got inconsistent rules, fragmented access, and a gap your security team has to explain.
  3. You pay in engineering. Someone owns that pipeline. Someone debugs the sync failures. Someone runs the second cluster. The engineers closest to your most latency-sensitive workloads end up spending their days on plumbing instead of product.

The kicker: after you've paid all three, the serving layer still can't run all your queries. The moment a query gets complex (e.g. joins, window functions) or the data gets big, it collapses.

Lakehouse//RT: Real-time performance, powered by Reyden

Lakehouse//RT is a new real-time warehouse that delivers millisecond performance at massive scale, without data movement. You can support real-time workloads while continuing to use the same open formats, governance model, and central data architecture already powering your analytics and AI.

Preview participants have seen up to 16x better performance vs. real-time serving layers, with response times as low as 10ms on smaller datasets and sub-100ms performance on larger ones. On standard analytical benchmarks, Lakehouse//RT delivers sub-100 millisecond latency at 12,000 queries per second.

Lakehouse//RT outperforms across benchmarks

This new approach means that Lakehouse//RT can maintain low latency, even at thousands of queries per second, on both big and small datasets, where other data warehouses or specialized real-time engines can spike in latency or even fail entirely.

Here is what that looks like across three dimensions:

1. Under load: It is easy to deliver low latency with a single query. The challenge comes when a dashboard or application is firing thousands of queries at the same time to the system. You don’t want your end users to open your analytical application and wait seconds or even minutes for it to load. We tested Lakehouse//RT against the leading alternatives on query latency as we push throughput from a handful of queries per second into the thousands. The alternatives all behave the same way. Latency holds for a while, then climbs, and then the engine stops responding altogether. Lakehouse//RT stays flat across the entire range, scaling to thousands of queries per second without sacrificing on query latency.

2. At scale: This test is based on TPC-H, a standard decision-support benchmark. We ran a suite of queries over a sales schema that combines large table scans, multi-table joins, and aggregations, which is the shape of everyday business reporting. We ran it from small datasets up to a terabyte, the path every dataset takes as usage and history accumulate. Lakehouse//RT keeps latency low as the data grows, and the chart shows how performance holds across scale factors. At large scale factors, 2 of the 3 alternatives we tested failed to run, further highlighting the inability of these real-time side stacks to handle meaningful data sizes.

3. On the hardest queries: This test is based on TPC-DS, a more demanding decision-support benchmark for data warehouses. We ran a suite of complex queries built from deep multi-table joins, subqueries, and window functions over a realistic warehouse schema, the kind of analytics an analyst writes when the question goes well beyond a simple lookup. Lakehouse//RT keeps latency low even as the queries get harder, and the chart shows the gap only widening, with one alternative running as much as 25 times slower. Once again, at the largest scale, that same alternative failed to finish at all, which is further proof that real-time side stacks built for simple lookups cannot handle the complex analytics businesses run every day.

Millisecond speed at scale, on one unified, well-governed platform

By unifying real-time performance with your central data platform, Lakehouse//RT eliminates architectural trade-offs to deliver three core benefits: real-time answers, streamlined architecture, and consistent governance.

Real-time answers

When it’s critical that you get the fastest, freshest insights, Lakehouse//RT delivers. Customers in demanding industries where every millisecond matters, no matter the number of concurrent queries, dramatically lower their time-to-insight with the real-time lakehouse.

"Meta Enterprise runs analytics for our own teams across supply chain, finance, and beyond - where analysts expect answers instantly, even under heavy concurrency on our largest tables. With Lakehouse//RT, our typical query results come back in 10s of milliseconds with data on the lake without a separate system alongside it." — Srikanth Sakhamuri, Data Engineering Leader at Meta
"SES, a space solutions company, helps governments protect, businesses grow, and people stay connected-no matter where they are. With integrated multi-orbit satellites and our global terrestrial network, we deliver resilient, seamless connectivity. Our operations dashboards run on billions of rows of live telemetry and demand answers in milliseconds at high concurrency. Lakehouse//RT delivers exactly that directly on our Databricks data - 20 times faster than our previous query times and at a fraction of the cost, as we no longer need to operate a separate serving layer to meet our latency requirements." — Dennis Rossberg, Senior Data Cloud Architect at SES
"Enverus is the energy industry's AI and data platform, built on 25+ years of proprietary intelligence with 2.7 petabytes of continuously updated data, 350 million+ courthouse records, and $500 billion+ in annual transactions covering the full energy value chain. This means our analytics have to stay interactive, even as analyst and embedded-app traffic scales. With Lakehouse//RT, queries return in 10s of milliseconds for some queries, and up to 100x faster on others than our specialized real-time engine. That performance means we can collapse our separate analytics stack into a single unified Lakehouse." — Paul Lamb, Director, Enterprise Analytics at Enverus

Simplified architecture

Instead of copying and moving data and building extra pipelines, teams can rely on a single, agile platform to get the compute power they need without proprietary tools. This means less complexity and system sprawl.

"Our platform serves hundreds of queries per second for real-time performance data across our entire client base, so consistency and latency directly impact customer experience. With Lakehouse//RT, we're seeing consistent sub-200 millisecond performance on our core dashboard queries. Being able to achieve that directly on governed lakehouse data dramatically simplifies our pipeline and serving architecture." — Kayvon Raphael, Senior Director of Engineering at Magnite
"Threat lookup requires consistently low latency, even as usage scales across users and agents. What we're seeing with Lakehouse//RT is millisecond performance on live data with 5x improvement in response time, which creates a path to run those workloads on our lakehouse instead of maintaining a separate serving system." — Chris Kopek, Head of Data Platforms, Cisco
"At Halcyon, our teams monitor security data across millions of endpoints, correlating disparate signals in order to identify critical threats within seconds. As our customers' security needs grew, so did the load on our systems. Lakehouse//RT delivered the performance and concurrency we needed. Our critical queries now run about 4x faster, directly on our Lakehouse, without a separate caching system." — Seagen Levites, Senior Director Quantitative Analysis at Halcyon AI

Strong, consistent governance

At the same time, governance remains centralized. Security policies, permissions, access controls, and business logic stay consistently defined and enforced with Unity Catalog. Your teams don’t have to duplicate rules or chase broken governance. You set it up once, and it works everywhere.

"Lakehouse//RT ran more than a third faster on average than our prior warehouse on our healthcare dataset, with 10× faster queries [on some workloads]. That translates directly to quicker information access and more decision time for our customers. We had considered a dedicated real-time system to augment our Lakehouse architecture, but Lakehouse//RT removed that need, giving us that speed natively with consistent governance." — Mehrshad Setayesh, SVP Engineering (Data, Platform, AI) at PointClickCare
"Bally’s is one of the industry’s largest global gaming and lottery technology groups with millions of transactions a day across ~60TB in Delta Lake under Unity Catalog. Our operations teams need answers in seconds, and to deliver that, we’d been running separate low-latency serving systems alongside the lakehouse. Lakehouse//RT eliminates that trade-off: 7x faster, sub-second performance on the same data, straight from our governed Delta tables. No copies, no extra clusters, no second system to secure. That simplicity is especially important in a highly regulated industry, where maintaining the highest standards of data governance, security, and privacy is fundamental to how we operate." — Mark Borg, Senior Vice President of Data at Bally’s
"Equilibrium Energy is reimagining how energy trading is done - AI agents working alongside human traders, on live data pulled from dozens of disparate sources, at the speeds the market actually requires. It's a workload most real-time architectures can't keep up with. Lakehouse//RT delivered up to 3.6x faster median latency than SQL Serverless on our portal queries, fast enough that traders can think with the data instead of waiting on it – running scenarios, exploring alongside AI agents, and making decisions in seconds. Keeping it all on a single platform – instead of stitching a separate real-time layer onto our stack – lets us move at this speed without sacrificing governance." — Tarek Rached, Director, Data Platform at Equilibrium Energy

Partners

"Deloitte's alliance with Databricks continues to build incredible momentum as we help organizations transform their data into strategic, AI-ready assets. The launch of Lakehouse//RT marks a significant leap forward, providing the real-time capabilities needed to fuel advanced analytics and accelerate time-to-value. We are excited to deepen our collaboration with Databricks and bring this latest innovation to our clients to drive measurable, impactful business outcomes." — Thomas Zipprich, Principal and Global Databricks Alliance Leader, Deloitte Consulting LLP

"As we see accelerating momentum in our partnership with Databricks with our new Business Group Launch, the enterprise demand for real-time data and AI has never been clearer. The launch of Lakehouse//RT delivers the speed and open architecture our clients need to drive intelligent business reinvention. We look forward to continuing our journey with Databricks to unlock new possibilities." — Jigyasa Singh, Global Databricks Business Group Lead, Accenture

"Sigma now connects directly to Lakehouse//RT, Agent Bricks, Genie Agents and Lakebase, so joint customers can get sub-second query performance at scale, explore billions of rows through a familiar spreadsheet interface, build agents that act on that data and manage the full agent workflow - memory, state and all - without ever leaving the governed environment they already trust. The hardest part of enterprise AI isn’t building the model. It’s making agents work on real business data, under real permissions, at scale. That’s exactly what Sigma and Databricks solve together." — Mike Palmer, CEO of Sigma

A new engine, and a new compute model

In addition to performance, simplicity, and governance benefits, Lakehouse//RT also takes the decision burden off your teams:

AUTO sizing. You no longer pick a t-shirt size. Databricks automatically determines the right baseline compute for your workload, so there is no guessing, and no cycle of sizing up when queries slow down or sizing back down to save cost.

Incremental autoscaling. Traditional warehouses handle more concurrency by spinning up whole copies of themselves, 2X, then 3X, then 4X. A small increase in demand can double your bill. Lakehouse//RT scales by adding and removing individual nodes as load changes, so you get exactly the capacity you need and pay for exactly that.
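
To see why the scaling unit matters, here is a minimal sketch comparing the two models described above. It assumes capacity is measured in nodes and that cost is proportional to the nodes running. The function names and node counts are hypothetical, not Databricks figures.

```typescript
// Hypothetical model: none of these numbers come from Databricks.
// They only illustrate how the size of the scaling step affects capacity paid for.

/** Whole-copy scaling: capacity grows in multiples of the base warehouse size. */
function nodesWithWholeCopies(requiredNodes: number, baseNodes: number): number {
  const copies = Math.max(1, Math.ceil(requiredNodes / baseNodes));
  return copies * baseNodes;
}

/** Incremental scaling: capacity grows one node at a time, never below the baseline. */
function nodesWithIncrementalScaling(requiredNodes: number, baseNodes: number): number {
  return Math.max(baseNodes, Math.ceil(requiredNodes));
}

const baseNodes = 8;     // assumed baseline size
const requiredNodes = 9; // demand rises slightly above the baseline

const wholeCopyNodes = nodesWithWholeCopies(requiredNodes, baseNodes);          // 16
const incrementalNodes = nodesWithIncrementalScaling(requiredNodes, baseNodes); // 9
```

Under these assumptions, one extra node of demand doubles the whole-copy footprint from 8 to 16 nodes, while the incremental model adds a single node.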

Bring your real-time workloads home

Databricks has long provided the scale and openness required for modern analytics and AI. Organizations no longer need to choose between low-latency performance and an open, unified data architecture. You don’t need a more fragmented stack. You need a more capable data warehouse.

Lakehouse//RT is now available in Beta for select read-only workloads, with more capabilities arriving in the coming months. Talk to your Databricks account team to get started and bring your real-time workloads onto the lakehouse. As an introductory offer, Lakehouse//RT usage is 30% off through January 2027. Once you're in, just pick Lakehouse//RT from the warehouse selector and you're off to the races.

Data ai opensource

datapitfalls (GitHub Repo)

The open-source 'datapitfalls' tool uses Claude to audit entire data reasoning chains—from initial questions to final visualizations—for common analytical errors.

Summary

What: Created by Ben Jones, the tool scans code, charts, and reports against the 'Avoiding Data Pitfalls' taxonomy, catching upstream issues like silent null drops, survivorship bias, and misleading visual encodings.
Why it matters: By moving beyond simple chart linting (pixels) to reasoning audits (methodology), it provides a framework for identifying systematic bias in data products before they reach decision-makers.
Takeaway: Install via `npm install datapitfalls` and set your `ANTHROPIC_API_KEY` to integrate automated reasoning checks into your CI/CD pipelines.

Deep Dive

  • Domains: Covers 8 categories including epistemic errors, technical trespasses, and graphical gaffes.
  • Input Modes: Supports chart images (Claude Vision), SQL/Python/R snippets, and document uploads (PDF/PowerPoint/Jupyter).
  • Logic: Uses Claude API as the reasoning engine to apply Ben Jones' taxonomy.
  • CI/CD: Features a --ci flag for pipeline integration to block commits with critical findings.
  • Library: ESM-only library with TypeScript support for custom programmatic auditing.

Decoder

  • Silent null drops: A common SQL error where a join or filter accidentally removes records containing null values, leading to skewed calculations.

Original Article

datapitfalls

Helping you steer clear of common blunders when working with data

avoidingdatapitfalls.com — run a scan right now, no install required.

What is this?

datapitfalls is an open-source tool that detects the common blunders in your data work that trip up even seasoned practitioners — and its pitfall taxonomy spans the entire data reasoning chain, from the question you start with to the chart you finish with, not just the final pixels.

Most "chart linters" stop at the pixels: they'll tell you your axis is truncated or your colors fail a contrast check. That's useful, but it's a sliver of where data work actually goes wrong. The most consequential mistakes happen upstream — in how a question is framed, how data is collected, how it's transformed, how it's analyzed, and how the results are finally interpreted and communicated. A perfectly formatted chart built on a cherry-picked timeframe is still misleading. A flawless SQL query that silently drops nulls still lies.

Its pitfall catalog spans the whole chain:

question formulation → data collection → transformation → analysis → visualization → interpretation → communication

It's powered by the Claude API and grounded in the pitfall taxonomy from the book Avoiding Data Pitfalls (Wiley) by Ben Jones. Give it a chart, a code snippet, a plain-English description of your analysis, or a whole report — and it returns a structured audit that names the pitfalls, explains why they matter, and tells you how to fix them.

Today you scan one piece at a time — a chart (or several together), a code snippet, a description, or a single document (a PDF report is read end-to-end, prose and charts alike). Detecting pitfalls across a connected, multi-stage workflow as one linked chain is on the roadmap.

This is not a style checker. It's a thinking partner for anyone who wants to work with data more honestly.

The 8 Pitfall Domains

datapitfalls organizes every pitfall into one of eight domains — the pitfall categories from Avoiding Data Pitfalls. Together they span the full arc of a data project.

  ┌─────────────────────────────────────────────────────────────────────┐
  │                                                                       │
  │   1.  EPISTEMIC ERRORS        How we think about and know things      │
  │       └─ confirmation bias · anchoring · the streetlight effect ·     │
  │          precision/accuracy confusion · correlation ≠ causation       │
  │                                                                       │
  │   2.  TECHNICAL TRESPASSES    How data breaks in the pipeline         │
  │       └─ silent null drops · join explosions · type coercion ·        │
  │          encoding issues · pipeline failures                          │
  │                                                                       │
  │   3.  MATHEMATICAL MISCUES    How the numbers go sideways             │
  │       └─ % vs. percentage points · index-number misuse ·              │
  │          compounding errors · denominator blindness                   │
  │                                                                       │
  │   4.  STATISTICAL SLIP-UPS    How inference misleads                  │
  │       └─ Simpson's paradox · base-rate neglect · regression to the    │
  │          mean · multiple comparisons · sampling & survivorship bias   │
  │                                                                       │
  │   5.  ANALYTICAL ABERRATIONS  How analysis distorts                   │
  │       └─ cherry-picked timeframes · inappropriate aggregation ·       │
  │          apples-to-oranges comparisons · missing context             │
  │                                                                       │
  │   6.  GRAPHICAL GAFFES        How charts deceive                      │
  │       └─ truncated axes · misleading encodings · accessibility        │
  │          failures · chartjunk · dual-axis tricks · 3D distortion      │
  │                                                                       │
  │   7.  DESIGN DANGERS          How presentation fails the audience     │
  │       └─ poor layout · missing titles/labels · cluttered dashboards · │
  │          ignoring audience needs · form over function                 │
  │                                                                       │
  │   8.  BIASED BASELINE         Who has a voice in the data             │
  │       └─ unheard voices · undervalued contributions ·                 │
  │          misattributed credit · non-representative sources            │
  │                                                                       │
  └─────────────────────────────────────────────────────────────────────┘
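
As one concrete instance of domain 3, the sketch below shows how the same change reads very differently in percentage points and in percent. The helper names and the 4% to 5% example are illustrative assumptions, not taken from the book or the tool.

```typescript
/** Absolute difference between two rates, in percentage points. */
function percentagePointChange(oldRatePct: number, newRatePct: number): number {
  return newRatePct - oldRatePct;
}

/** Relative change between two rates, as a percent of the old rate. */
function relativePercentChange(oldRatePct: number, newRatePct: number): number {
  return ((newRatePct - oldRatePct) / oldRatePct) * 100;
}

// Hypothetical conversion rate moving from 4% to 5%.
const pointChange = percentagePointChange(4, 5);   // 1 percentage point
const percentChange = relativePercentChange(4, 5); // 25 percent
```

Reporting this as "a 1% increase" or "a 25% increase" without saying which measure is meant is the kind of ambiguity the "% vs. percentage points" pitfall describes.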

Quick Start

datapitfalls is designed around four input modes, so you can scan your work at whatever stage you're in.

⚠️ Status: The datapitfalls detection engine is live, and you can use it two ways. The command line scans chart images, code snippets, and plain-English descriptions (modes 1–3 below), with a --ci exit code for pipelines. A web app does all of that in the browser and adds document upload — PDF read natively (so Claude reviews the prose and the charts and tables), Word .docx, PowerPoint .pptx decks, Jupyter notebooks, and code files — plus multi-chart scans that compare several charts at once. The web app is per-IP rate-limited, and your input is sent to the Claude API to run the scan but isn't stored by the app. Install the CLI and engine with npm install datapitfalls, or run the web app locally with cd web && npm run dev.

1. Scan a chart image

npx datapitfalls scan ./quarterly-revenue.png

Upload a chart and let Claude Vision flag truncated axes, misleading encodings, missing context, and accessibility issues.

2. Paste a code snippet (Python / SQL / R)

npx datapitfalls scan ./transform.sql

-- datapitfalls will flag the silent inner-join filtering below
SELECT u.id, o.total
FROM users u
JOIN orders o ON u.id = o.user_id;  -- users with no orders silently disappear
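
To make the disappearing rows visible, here is a small in-memory sketch of the same join. It is an illustration of inner versus left join semantics with made-up data, not part of datapitfalls.

```typescript
interface User { id: number }
interface Order { userId: number; total: number }
interface JoinedRow { id: number; total: number | null }

// Hypothetical data: user 2 has no orders.
const users: User[] = [{ id: 1 }, { id: 2 }, { id: 3 }];
const orders: Order[] = [
  { userId: 1, total: 40 },
  { userId: 1, total: 10 },
  { userId: 3, total: 25 },
];

/** Inner join: emits a row only when a matching order exists. */
function innerJoin(us: User[], os: Order[]): JoinedRow[] {
  const rows: JoinedRow[] = [];
  for (const u of us) {
    for (const o of os) {
      if (o.userId === u.id) rows.push({ id: u.id, total: o.total });
    }
  }
  return rows;
}

/** Left join: keeps every user, with total = null when there is no order. */
function leftJoin(us: User[], os: Order[]): JoinedRow[] {
  const rows: JoinedRow[] = [];
  for (const u of us) {
    const matches = os.filter((o) => o.userId === u.id);
    if (matches.length === 0) {
      rows.push({ id: u.id, total: null });
    } else {
      for (const o of matches) rows.push({ id: u.id, total: o.total });
    }
  }
  return rows;
}

const innerRows = innerJoin(users, orders); // 3 rows, user 2 absent
const leftRows = leftJoin(users, orders);   // 4 rows, user 2 present with total = null
const keptIds = new Set(innerRows.map((r) => r.id));
const droppedUserIds = users.map((u) => u.id).filter((id) => !keptIds.has(id)); // [2]
```

Comparing the set of IDs before and after a join, as `droppedUserIds` does, is one simple way to catch this kind of silent filtering.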

3. Describe an analysis in plain English

npx datapitfalls scan --text "We compared this year's signups to last year's, \
but only counted users who are still active today."

datapitfalls recognizes the survivorship bias hiding in that sentence.
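
A short numeric sketch shows why that sentence is a problem. The cohort figures are hypothetical and chosen so that real signups are flat while the "still active" counts differ only because the older cohort has had longer to churn.

```typescript
interface Cohort { signups: number; stillActive: number }

// Hypothetical figures: last year's cohort has had a year longer to churn.
const lastYear: Cohort = { signups: 1000, stillActive: 400 };
const thisYear: Cohort = { signups: 1000, stillActive: 800 };

/** Percent growth from a previous value to a current value. */
function growthPct(previous: number, current: number): number {
  return ((current - previous) / previous) * 100;
}

const trueGrowth = growthPct(lastYear.signups, thisYear.signups);             // 0
const survivorGrowth = growthPct(lastYear.stillActive, thisYear.stillActive); // 100
```

Counting only survivors reports 100% growth for a signup number that did not change at all.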

4. Upload a report — or several charts at once (web app)

In the web app, drop in a PDF report and Claude scans the whole thing in context — the written claims and the charts and tables on the page (Word .docx, PowerPoint .pptx decks, Jupyter notebooks, and code files work too). Or add several chart images together for a multi-chart scan that catches pitfalls across them: inconsistent scales, inconsistent encodings, and contradictory messages.

Installation

# Install as a project dependency
npm install datapitfalls

# …or run the CLI without installing
npx datapitfalls scan ./my-chart.png

You'll need a Claude API key from Anthropic. Set it in your environment:

export ANTHROPIC_API_KEY="sk-ant-..."

datapitfalls requires Node.js 18 or later.

Programmatic API

Beyond the CLI, datapitfalls is a library you can build on. The core is detectPitfalls(), which scans an artifact (or a whole analysis chain) and returns a structured report:

import { detectPitfalls, formatReport, hasBlockingFindings } from 'datapitfalls';

const report = await detectPitfalls(
  { kind: 'code', content: 'SELECT AVG(rate) FROM metrics;', language: 'SQL' },
  { apiKey: process.env.ANTHROPIC_API_KEY }
);

console.log(formatReport(report));
if (hasBlockingFindings(report)) process.exit(1);

The package is ESM-only and ships its own TypeScript types.
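
The README does not show the shape of the report, so the following is only a sketch of the kind of severity gate a check like hasBlockingFindings() might perform. The `Finding` type, the severity names, and `exitCodeFor` are assumptions made for illustration, not datapitfalls' actual types or API.

```typescript
// Hypothetical shapes: the real datapitfalls report type is not shown in the source.
type Severity = "info" | "warning" | "critical";

interface Finding {
  pitfall: string;
  severity: Severity;
}

const severityRank: Record<Severity, number> = { info: 0, warning: 1, critical: 2 };

/** Returns an exit code for a pipeline step: 1 if any finding meets the threshold, else 0. */
function exitCodeFor(findings: Finding[], threshold: Severity = "critical"): number {
  const blocking = findings.some(
    (f) => severityRank[f.severity] >= severityRank[threshold],
  );
  return blocking ? 1 : 0;
}

const sampleFindings: Finding[] = [
  { pitfall: "averaging rates without weights", severity: "warning" },
  { pitfall: "silent null drop", severity: "critical" },
];

const strictExit = exitCodeFor(sampleFindings);                                     // 1
const cleanExit = exitCodeFor([{ pitfall: "missing subtitle", severity: "info" }]); // 0
const warnExit = exitCodeFor([sampleFindings[0]], "warning");                       // 1
```

In a real pipeline, the package's own hasBlockingFindings() and the --ci flag are the documented entry points; this sketch only shows the idea of gating on severity.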

How It Works

   ┌──────────────┐     ┌──────────────────┐     ┌──────────────┐     ┌──────────────────┐
   │  Your input  │ ──▶ │  Pitfall         │ ──▶ │  Claude API  │ ──▶ │  Structured      │
   │              │     │  taxonomy lookup │     │  analysis    │     │  pitfall report  │
   │  chart       │     │                  │     │              │     │                  │
   │  code        │     │  retrieve the    │     │  reason over │     │  pitfalls found, │
   │  description │     │  relevant rules  │     │  input +     │     │  severity, why   │
   │  document    │     │  from 8 domains  │     │  taxonomy    │     │  it matters, fix │
   └──────────────┘     └──────────────────┘     └──────────────┘     └──────────────────┘
  1. You provide input — a chart, a code snippet, a description, or a document.
  2. Taxonomy lookup — datapitfalls pulls the relevant pitfall rules from its catalog of eight pitfall domains.
  3. Claude API analysis — Claude reasons over your input and the pitfall taxonomy, grounded in the knowledge from Avoiding Data Pitfalls.
  4. Structured pitfall report — you get back a clear, prioritized list of pitfalls: what was found, how severe it is, why it matters, and how to fix it.
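
Steps 2 and 3 above can be sketched roughly as follows. Only the eight domain names come from the README; the mapping from input kind to domains, the prompt wording, and the function names are assumptions for illustration, since the real selection logic is not shown.

```typescript
// Hypothetical sketch of steps 2 and 3. The mapping and prompt text are assumptions.
type InputKind = "chart" | "code" | "description" | "document";

const ALL_DOMAINS: string[] = [
  "Epistemic Errors",
  "Technical Trespasses",
  "Mathematical Miscues",
  "Statistical Slip-ups",
  "Analytical Aberrations",
  "Graphical Gaffes",
  "Design Dangers",
  "Biased Baseline",
];

// Assumed relevance mapping, for illustration only.
const domainsByKind: Record<InputKind, string[]> = {
  chart: ["Graphical Gaffes", "Design Dangers", "Analytical Aberrations"],
  code: ["Technical Trespasses", "Mathematical Miscues", "Statistical Slip-ups"],
  description: ["Epistemic Errors", "Statistical Slip-ups", "Analytical Aberrations"],
  document: ALL_DOMAINS,
};

/** Step 2: pick the domains to check for this kind of input. */
function lookupDomains(kind: InputKind): string[] {
  return domainsByKind[kind];
}

/** Step 3 (input side): assemble the text that would be sent to the model. */
function buildPrompt(kind: InputKind, content: string): string {
  const domains = lookupDomains(kind).join(", ");
  return `Audit the following ${kind} for pitfalls in these domains: ${domains}.\n\n${content}`;
}

const auditPrompt = buildPrompt("code", "SELECT AVG(rate) FROM metrics;");
```

Narrowing the taxonomy before calling the model keeps the prompt focused on rules that can apply to the input at hand.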

Based on the Book

datapitfalls is the software companion to Avoiding Data Pitfalls: How to Steer Clear of Common Blunders When Working with Data and Presenting Analysis and Visualizations by Ben Jones (Wiley, 2020).

The book distills two decades of teaching and practice into a map of where data work goes wrong — and that map is the foundation of this tool's pitfall taxonomy. Every rule datapitfalls applies traces back to a pitfall described in the book.

Contributing

This project gets better with every pair of eyes on it. Whether you've found a bug, want to propose a brand-new pitfall rule, or want to improve the docs — you're welcome here.

License

datapitfalls is released under the MIT License. © 2026 Ben Jones / Data Literacy, Inc.

Credits

Created by Ben Jones — author of Avoiding Data Pitfalls, founder of Data Literacy, Inc., and data visualization instructor at the University of Washington.

Built with Claude by Anthropic.

If datapitfalls helps you work with data more honestly, please ⭐ the repo.

Data ai

Probably (Tool)

Probably is a local, privacy-focused data agent that replaces LLM guesswork with deterministic mathematical validation for tabular data analysis.

Summary

What: Probably, a new application, performs natural-language data analysis on local files (CSV, JSON, Parquet) and connected databases (Snowflake, BigQuery, Postgres). It offloads reasoning to cloud-based LLMs while performing all mathematical calculations and data processing locally on the user's machine to ensure accuracy and privacy.
Why it matters: This signals a shift toward 'verifiable' data agents that use LLMs as reasoning interfaces while delegating computational heavy lifting to deterministic engines, directly addressing the hallucination risks associated with using standard LLMs for quantitative analysis.
Takeaway: If you need to analyze datasets without moving data to the cloud, try the public preview at https://www.probably.dev/.

Decoder

  • Deterministic: A system that produces the same output for the same input, ensuring mathematical results are reproducible rather than probabilistic or randomized.
  • Sycophancy: A tendency of AI models to agree with the user's incorrect premises or biases rather than reporting factual, data-supported results.

Original Article

THE VERIFIABLE DATA AGENT

Probably is a secure, local app for accurate data analysis using natural language.

Explores and reports quickly over billions of rows.

Determines verifiable answers to any data question, just ask.

Connects to any data source: local files or a massive data warehouse.

Builds understanding of your business context with every question.

Built Different

Refuses to Guess

Unlike pure LLMs, there's no guesswork or sycophancy. Reports are deterministically verified to ensure figures and statistics are present in your data.

Data Stays Private

Your sensitive data stays entirely on your machine or in your network. Cloud AI only powers reasoning.

Does real math

Mathematical tasks are delegated to a local, processor-optimized compute engine for lightning fast calculation - not an LLM.

Any Data Source

Supports local CSV, JSON, and Parquet, or connect directly to your warehouse: Snowflake, BigQuery, Postgres, and more.

Understands Data Quality

Spots outliers. Missing values. Inconsistent formats. Warns you before problems compound.

Builds Understanding

Learns your context and business over time. Smarter with every interaction.

For Every Role

Generate reports as a CEO, automate EDA as a Data Scientist, or debug pipelines as an Engineer. One tool. Any role. Any workflow.

Answers to all your questions

How is this different from ChatGPT or Claude?

Chatbots are designed to seem helpful and make you feel good. Most of the numbers they produce are made up. Probably is a purpose-built data agent capable of accurately analyzing large amounts of data via rigorous statistics. It will not hallucinate math and it will report only what the data supports.

Can I trust the answers for high-stakes decisions?

Accuracy and verifiability are foundational design constraints of this product. Every action the agent attempts is run through a series of deterministic validators. All outputs are checked letter for number against the actual computation traces executed by the engine. Any outputs that fail validation are rejected for revision. You can see this happening live in the product. Once the agent builds an artifact, that calculation is persisted and can be reproduced across any client in your organization.
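
Probably's validators are not described in detail, so the following is only a sketch of the general idea: every number that appears in generated text must match a value in the computation trace, or the text is rejected. The `TraceEntry` shape and function names are hypothetical.

```typescript
// Hypothetical validator: this is not Probably's implementation.
interface TraceEntry { label: string; value: number }

/** Pulls every numeric literal out of a piece of report text. */
function extractNumbers(text: string): number[] {
  const matches = text.match(/-?\d+(?:\.\d+)?/g);
  return matches === null ? [] : matches.map(Number);
}

/** Returns the numbers in the text that do not appear in the computation trace. */
function unverifiedNumbers(text: string, trace: TraceEntry[]): number[] {
  const computed = new Set(trace.map((t) => t.value));
  return extractNumbers(text).filter((n) => !computed.has(n));
}

const computationTrace: TraceEntry[] = [
  { label: "row_count", value: 1200 },
  { label: "mean_order_value", value: 48.5 },
];

// Every figure is backed by the trace, so nothing is flagged.
const verifiedCase = unverifiedNumbers("Across 1200 orders the mean value was 48.5.", computationTrace);

// 52 was never computed, so it is flagged for revision.
const unverifiedCase = unverifiedNumbers("Across 1200 orders the mean value was 52.", computationTrace);
```

A production validator would also need to handle rounding, units, and formatted numbers such as "1,200", which this sketch deliberately ignores.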

How much does it cost?

Probably is extremely affordable; you can use it on a pay-as-you-go basis or via a subscription depending on your needs.

What if I need an answer to a question right now?

No problem. Upload a data file or quickly connect to any data warehouse securely, ask your question, and get human-grade charts and reports within seconds. No waiting for the data team, and no need to learn another complex tool. Best of all, no need to teach yourself statistics.

Do I need my data team to set this up?

Start immediately with any CSV, Spreadsheet, JSON, or Parquet file that you have access to. Probably can handle as much data as you can fit on your hard drive or your data warehouse. No row limits, no size limits. When you're ready to connect to your data warehouse, just ask your team for credentials and you'll be up and running in minutes.

What types of files are supported?

CSV, JSON, Parquet, and spreadsheet files work out of the box with no setup required. You can also connect directly to Snowflake, BigQuery, Postgres, MySQL, MariaDB, ClickHouse, RisingWave, and more. Support for PDF and more coming soon.

What about data security and competitive intelligence?

All analysis happens locally on your machine and your files never leave your control. No cloud uploads, no vendor access to your competitive data. Not even we can see it. Cloud LLMs only power the reasoning—your data never leaves your device.

What if my question is too complex?

It won't be! Just ask in plain language. The agent handles complex analysis behind the scenes and gives you clear answers. If it can't answer based on your files, it tells you exactly why—no guessing, no assumptions. Probably never silently ignores, transforms, or modifies any of your data without explicit instruction.

AI enterprise

ChatGPT's market share slips below 50% for first time

ChatGPT's global market share dropped below 50% for the first time as users migrate to Google Gemini, Anthropic Claude, and xAI Grok.

Summary

What: Sensor Tower's 2026 report indicates ChatGPT holds 46.4% of the market, while Google Gemini captures 27.7% and Anthropic Claude reaches 10.3%. OpenAI's move into advertising and perceived alignment with the U.S. Department of Defense have spurred user churn, while Claude currently leads in paid subscription conversion rates at 13%.
Why it matters: The market is shifting from a 'growth-at-all-costs' phase to one focused on monetization and ecosystem stickiness, where users choose assistants based on platform integration and value alignment rather than just raw performance.

Deep Dive

  • ChatGPT dropped to 46.4% market share by May 2026.
  • Google Gemini rose to 27.7% share, largely driven by deeper ecosystem integration.
  • Anthropic Claude leads in subscription monetization, with 13% of its user base paying.
  • OpenAI began injecting ads into ChatGPT in February, reaching 17% of daily active users by May.
  • User migration is increasingly tied to non-technical factors like corporate partnerships and trust.
  • Total AI app spending is projected to exceed $4.2 billion in H1 2026.
  • Asia experienced the first regional decline in AI app downloads, dropping 3.3% in Q1 2026.

Decoder

  • Sensor Tower: A market intelligence firm that tracks mobile app performance and usage metrics.
  • ARPU (Average Revenue Per User): A key financial metric calculating the total revenue generated divided by the number of active users.
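
As a worked example of the ARPU definition, together with the paid conversion rate the report also cites, here is a minimal sketch. The revenue and user counts are hypothetical and are not Sensor Tower figures.

```typescript
/** Average revenue per user: total revenue divided by active users. */
function arpu(totalRevenue: number, activeUsers: number): number {
  if (activeUsers <= 0) throw new Error("activeUsers must be positive");
  return totalRevenue / activeUsers;
}

/** Share of active users who pay, in percent. */
function paidConversionPct(payingUsers: number, activeUsers: number): number {
  if (activeUsers <= 0) throw new Error("activeUsers must be positive");
  return (payingUsers * 100) / activeUsers;
}

// Hypothetical app: $2,000,000 in revenue, 500,000 active users, 65,000 of them paying.
const revenuePerUser = arpu(2_000_000, 500_000);          // 4 (dollars per user)
const conversion = paidConversionPct(65_000, 500_000);    // 13 (percent)
```

Note that ARPU averages over all active users, so a high conversion rate and a high ARPU can move independently of each other.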

Original Article

More than three and a half years after ChatGPT’s initial release, AI assistants are now used by millions of people worldwide, and the competitive landscape is changing fast. While OpenAI’s chatbot is still the most popular assistant globally, its market share has dipped below 50% for the first time as users are migrating between different assistants like Google’s Gemini, Anthropic’s Claude, and xAI’s Grok, according to analytics firm Sensor Tower’s State of AI Report for 2026.

ChatGPT’s growth has been impressive. It became the fastest app ever to reach 1 billion monthly users, as Sensor Tower reported this month. Notably, OpenAI counts weekly active users, and it last reported 900 million of them in February. The chatbot still remains the most popular AI assistant worldwide with over 1.1 billion monthly users, followed by Gemini with 662 million and Claude with 245 million.

Until January, ChatGPT commanded over 50% market share, but by May’s end, it had fallen to 46.4% thanks to the rise of Gemini (27.7%) and Claude (10.3%). Other assistants, including Grok, Perplexity, DeepSeek, and Meta AI, have less than 5% market share.

Sensor Tower’s State of AI Report also found that users are increasingly willing to switch between assistants. Specific events appear to accelerate that behavior: OpenAI’s deal with the U.S. Department of Defense (DoD) in February triggered a measurable spike in uninstalls, for example — suggesting brand trust and values alignment matter to users, not just features. While Gemini’s momentum is largely due to its integration with Google’s broader ecosystem of tools, Anthropic’s Claude has gained a strong reputation for productivity use cases and is closing in on ChatGPT’s user-retention rate.

In the first half of 2026, people are on pace to download nearly 2.3 billion AI apps and spend over $4.2 billion on them, according to Sensor Tower estimates. That compares to $1.83 billion in spending in H1 2025 — a jump that suggests the industry is shifting its focus from pure growth toward monetization. That said, both download and spend growth rates have decelerated, an indicator that the market may be maturing even as absolute numbers climb.

Regionally, Asia recorded the first download decline of 3.3% in Q1 2026, driven by dips in China and India. Despite leading globally in total downloads, Asia trails North America and Europe when it comes to in-app spending — a split that matters for companies deciding where to invest in premium features and monetization.

In the U.S., users are gravitating toward AI assistants for productivity tasks and spending more on premium features. Across platforms, average revenue per user has grown industry-wide, but Claude is standing out. Thirteen percent of Anthropic’s users are paying for a subscription plan — a conversion rate that leads the field and will be a metric worth watching for investors evaluating which AI businesses are building lasting revenue.

Sensor Tower estimates that the hours spent on AI apps will have increased from 17.2 billion hours in H1 2025 to roughly 36 billion hours in H1 2026. The top three assistants command 89% of time spent on AI assistant apps. Meanwhile, adjacent categories like AI companions or AI content-generation apps remain fragmented and wide open to competition, which represents both a risk and an opportunity depending on which players move first.

Ads and shopping

OpenAI started experimenting with ads in ChatGPT in February. According to Sensor Tower, the company has scaled the number of ads gradually, along with the share of users who see them. By May, an average of 17% of daily users were being served ads — a number to watch as ChatGPT’s monetization strategy evolves beyond subscriptions.

Software and shopping are the largest advertiser categories in ChatGPT so far, followed by media and entertainment and food and dining.

As ChatGPT deepens its shopping integrations, it is increasingly sending referral traffic to retailers like Target, Walmart, and Costco. Amazon, which has blocked ChatGPT’s web crawlers, has seen stagnant referral traffic from the platform as a result.

That creates an opening for others. Sites like Walmart have embedded their own AI assistants to help shoppers find products. While Amazon’s Rufus has seen flat user growth, Walmart’s Spark has been gaining ground. Sensor Tower also noted that Amazon shoppers who used Rufus spent more time in the app and converted at higher rates than those who didn’t, hinting that on-platform AI can meaningfully influence purchasing behavior when users actually engage with it.

AI llm

Kimi K2.7 Code vs Claude Fable 5: Landing Pages That Cost 94% Less

A side-by-side comparison shows Kimi K2.7 Code generates landing pages at 94% lower cost than Claude Fable 5.

Summary

What: In an experiment generating 12 landing pages, Kimi K2.7 Code proved 16 times cheaper than Anthropic's Claude Fable 5.
Why it matters: This highlights the significant cost-performance gap currently existing between specialized coding models and broader flagship reasoning models.

Original Article

Kimi K2.7 Code vs Claude Fable 5: Landing pages that cost 94% less

We ran an experiment where we had Kimi K2.7 Code and Claude Fable 5 each produce 12 landing pages for a side‑by‑side comparison. Overall, Kimi K2.7 Code cost about 94% less (16x less) than Fable 5 and...
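
The "94% less" and "16x less" figures describe roughly the same saving. The conversion between the two is simple arithmetic, sketched here with hypothetical helper names:

```typescript
/** Converts "X% less" into "how many times cheaper". */
function timesCheaper(percentLess: number): number {
  return 100 / (100 - percentLess);
}

/** Converts "N times cheaper" into the percent saved. */
function percentSaved(times: number): number {
  return (1 - 1 / times) * 100;
}

const costRatio = timesCheaper(94);   // 100 / 6, about 16.67
const savingPct = percentSaved(16);   // 93.75
```

So a 94% saving corresponds to a cost ratio of about 16.7x, and exactly 16x corresponds to a 93.75% saving, which rounds to the 94% quoted.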

AI career research

Noam Shazeer is joining OpenAI

Noam Shazeer, co-author of the seminal 'Attention Is All You Need' paper, has left Google to join OpenAI.

Summary

What: Noam Shazeer announced his departure from Google to work with the team at OpenAI. Shazeer is widely recognized for his foundational contributions to Transformer architecture and his work on Google's search algorithms.
Why it matters: This is a significant talent shift in the AI industry, as one of the original architects of Transformer technology moves to a primary competitor of his former employer.

Original Article

I’m excited to share that I’ll be joining OpenAI and look forward to working with the exceptional team there.

It was a difficult decision to move on. I’m incredibly proud of the amazing team at Google and everything we’ve built together. It has been an honor and a pleasure to work with all of you.

AI web startup

Replit is now available in Claude

Claude now integrates directly with Replit, enabling users to transition from design conversations to functional, deployed applications without context switching.

Summary

What: Anthropic's Claude can now hand off development tasks, such as backend setup or feature implementation, directly to Replit via a new connector, eliminating the need to manually copy code between the two platforms.
Why it matters: The integration aims to bridge the gap between AI-driven rapid prototyping and real-world execution by embedding actual development environments into conversational interfaces.
Takeaway: Start a project in Claude and use the Replit integration to push your generated designs into a live, editable environment.

Original Article

Replit is now available directly inside Claude, making it easier than ever to go from a conversation to a fully built, shipped product - without losing context, in one seamless workflow.

Design in Claude, Build in Replit

You can now design on-brand, beautiful apps in Claude Design using natural language. Once your design is ready, send it directly to Replit to continue building, refining, and shipping your app—all through natural language and in one seamless workflow.

No copy-pasting, no context switching, no friction.

Delegate Any Task to Replit

In addition, Claude can hand off any general development task to Replit via the official Replit Connector. Whether you're spinning up a backend, building out a feature, or iterating on an existing project, Claude and Replit work together so you can focus on what matters: bringing your idea to life.

Build What You Imagine

AI has made it easier to explore ideas. Now it's just as easy to act on them. With Claude and Replit working together, the gap between imagination and execution has never been smaller.

Try it today: start a project in Claude, send it to Replit, and let Replit take it from there.

AI research data

Introducing LifeSciBench

OpenAI launched LifeSciBench to evaluate AI performance on complex, end-to-end life sciences research workflows rather than isolated biology tasks.

Summary

What: LifeSciBench assesses models on scientific reasoning, experimental design, and evidence analysis. It aims to provide a more realistic metric for how AI can contribute to professional scientific discovery compared to standard academic benchmarks.
Why it matters: This highlights the shift toward domain-specific benchmarks that better reflect the multi-step, reasoning-heavy requirements of scientific research compared to general knowledge tests.

Original Article

OpenAI introduced LifeSciBench, an expert-judged benchmark that evaluates AI systems on end-to-end life sciences workflows such as evidence analysis, experimental design, scientific reasoning, and research communication rather than isolated biology tasks.

Tech policy ai security

Anthropic Employees Accuse Trump Administration of Targeting Them

Over 150 cybersecurity experts have signed an open letter protesting the Trump administration's ban on Anthropic's models.

Summary

What: Anthropic employees and security researchers argue that the Fable model is being unfairly targeted despite having built-in safeguards against cyber-offensive use.

Original Article

The Trump administration's ban on Anthropic's models has been criticized by cybersecurity experts and Anthropic employees as unfair. More than 150 cybersecurity experts have signed an open letter calling for the administration to lift the restrictions. Employees say that the multiple well-known protections built into the Fable model to prevent offensive cyber use are evidence that Anthropic is being unfairly targeted.

Tech ai research enterprise

Uneven Frontiers

AI will significantly accelerate drug discovery, but the physical constraints of clinical development ensure that the industry's primary bottlenecks remain.

Summary

What: Benjamin Liu argues that value in biopharma will coalesce around the slowest, physical parts of the process—recruiting patients and clinical trials—rather than the easily automated discovery phases.
Why it matters: This highlights a strategic shift where AI reduces costs for R&D, but the long-term competitive advantage remains with companies controlling the physical-world clinical supply chain.

Original Article

Uneven Frontiers

How AI will transform biopharma — and why the sequence of change matters

AI will transform pharma, but not evenly and not all at once. Some parts of drug discovery and development are already...

Tech startup enterprise space-tech

Musk's Next Move May Be a Megamerger of SpaceX and Tesla

Elon Musk is reportedly exploring a merger between SpaceX and Tesla, despite significant potential legal conflicts due to his controlling stake in both.

Summary

What: Elon Musk currently serves as the largest shareholder of Tesla and the controlling executive of SpaceX. The companies have historically shared resources and personnel, and a formal merger could draw scrutiny from regulators and shareholders.
Why it matters: This move would consolidate Musk's diverse engineering and energy ventures into a single corporate entity, effectively centralizing his control over both electric transportation and aerospace infrastructure.

Original Article

Many of Elon Musk's fans and investors expect him to merge SpaceX with Tesla. The two companies have long shared executives and other resources and are jointly developing multibillion-dollar projects. Musk controls SpaceX and is Tesla's largest shareholder, so such a deal could raise legal issues and prompt lawsuits. However, this is unlikely to stop Musk, who has donated hundreds of millions of dollars to Republican candidates, including President Donald Trump.

Tech hardware enterprise

Apple to Raise Prices Due to Memory Chip Crunch

Apple is set to increase consumer hardware prices due to a sharp rise in memory and storage chip costs caused by the ongoing AI infrastructure boom.

Summary

What: Memory and storage chip prices have quadrupled since 2025 as large-scale AI data center deployments consume available supply, forcing Apple to adjust product pricing ahead of upcoming launches.
Why it matters: The insatiable demand for high-bandwidth memory in AI clusters is creating a supply-chain bottleneck that is now directly impacting consumer electronics pricing.

Original Article

The surging costs of memory and storage chips are prompting Apple to raise prices on its products. Apple's next major product launch is likely to be in September, but price increases could come sooner. The AI infrastructure buildout is eating up supplies of memory and storage chips. Prices have quadrupled since last year, and they're expected to continue increasing into 2027.

Design hardware mobile

The foldable iPhone hasn't launched, but Apple is already planning its successor

Apple is reportedly planning a second-generation foldable iPhone for 2027, signaling a commitment to the category rather than a one-off experiment.

Summary

What: Industry reports indicate Apple is developing a 7.8-inch foldable device, likely targeting a permanent premium spot in their product lineup to follow initial foldable market entries.
Why it matters: Apple is likely observing the performance of Android foldables to ensure their own entry avoids the durability and software friction common in first-generation devices.

Original Article

Apple is reportedly already developing a second-generation foldable iPhone for 2027, suggesting the company views foldables as a long-term product category rather than an experiment. Expected to feature a wider, iPad-like design with a 7.8-inch inner display, the foldable iPhone reflects lessons learned from years of Android manufacturers refining the form factor and could become a permanent premium tier in Apple's lineup.

Design enterprise ai

AI is Not a New Marketing Problem. It is a New Brand Interface

The core challenge of AI in marketing is not a new technical problem, but the shift from human-centric to algorithm-centric brand interfaces.

Summary

What: Walker Smith explains that while AI SEO (GEO) and recommendation influence are extensions of existing marketing strategies, the future 'AI-to-AI' marketplace represents a departure from traditional human persuasion.
Why it matters: This highlights that while current AI marketing tactics are just 'old wine in a new bottle,' the inevitable shift toward autonomous purchasing agents will eventually require entirely new advertising primitives.

Decoder

  • GEO (Generative Engine Optimization): The practice of optimizing content to be indexed and prioritized by Large Language Models and AI-powered search engines.
  • Consideration set: The group of brands or products that a consumer considers when making a purchase decision.

Original Article

Perhaps the most interesting thing about AI is that it is new yet nothing new.

AI is certainly new to marketing, and it is bringing all sorts of new challenges with it, although it may not be as big as headlines would have us believe.

Last year, the CMO Survey of the American Marketing Association, conducted in partnership with Duke and Deloitte, found a mere 17.2% of marketers reporting the use of AI to optimize or automate marketing.

Perhaps the percentage this year will be a step-change higher, but the self-estimated three-year projection by marketers last year was only 44.2%.

Nonetheless, AI is breaking out all over. It is a new force to be reckoned with. But in many ways, AI is nothing but the same reckoning as before.

Indeed, this is typical of marketing innovations. Almost always, they are nothing but new ways of solving the same old problems. Which is definitely the case with AI.

Take search, for example.

Eight Oh Two, an SEO and PPC marketing agency, reported this year that 37% of brand-related searches now start with AI instead of a search engine.

This is a big shift, but it’s hardly anything new. There were headlines galore when Amazon became a viable competitor to Google for initial product searches.

The marketing issue then and now was shifting strategy and investments to meet consumers where they live. AI will require change, but the challenge itself is nothing new. Brands just have to adjust accordingly again.

This is true of every change being wrought by AI right now. Recommendations, in particular.

A recent study by location marketing platform Uberall found that 83% of restaurants are “invisible” to consumers in AI search because they don’t get mentioned when people ask AI for restaurants “nearby.”

But this, too, is nothing new. We have known for a long time about the importance of showing up on the first page of Google search results. That’s where the eyeballs are.

So, getting there is the task of getting in front of consumers. Broadly speaking, this is the necessity of getting into the consideration set, or at least the first step in doing so. That’s what SEO is all about.

For AI, it’s called GEO. Which is just a new name for an old challenge. It’s about crafting content and brand information that will get picked up by LLMs. LLMs are different, but what they do and the strategic challenge of influencing them are the same.

Influence is another thing about AI that is nothing new. Increasingly, people are relying on AI for product comparisons and purchasing recommendations.

In this sense, AI is the influencer of demand. But influencing influencers is a long-standing task in marketing — from children influencing mothers to friends influencing friends to content creators influencing followers on social media.

Marketers have been at this for so long that they have gotten very good at influencing the influencers. Marketers may need to go back to school on AI, but the basic idea is no different than before.

Much of this involves getting the attention of LLMs, and attention was the hot issue in marketing immediately before AI appeared on the scene.

In fact, attention is an issue as old as marketing itself and was the reason why the Advertising Research Foundation developed the original inverted pyramid back in 1961.

That funnel was not a purchase funnel, but a funnel for media-buying that began with the necessity of getting attention. All AI is doing is forcing marketers to solve the same old problem again. The answer will be different, but the issue is the same as always.

Marketers are also concerned that AI only cares about facts, not emotions, and that AI will operate independently of efforts by brands to enforce consistency of messaging with consumers. But these, too, are not new concerns.

It was concern about consistency that kept the Coca-Cola Company from using its flagship brand name with its first diet cola, thus naming it Tab instead.

It was a concern when Larry Light introduced the idea of brand journalism using many individual stories to deliver McDonald’s “I’m Lovin’ It” campaign.

It has been an ongoing concern as media have proliferated and fragmented. It is certainly an issue in managing the modern-day army of independent influencers that brands are using as gateways to consumers.

So, there is nothing new about this concern when it comes to AI. Brands will have to manage consistency differently for AI, but the issue of consistency is an age-old challenge in marketing.

In these ways and others, AI is not new at all. Yet, there is one thing about AI that is very different and new. In years past, humans have been the audience for awareness, attention, recommendations, consideration, influence and consistency. That’s true for AI right now, but this is sure to change.

The way I’ve described it before is to say that marketers will have to learn how to “advertise to algorithms.” Which is to say that AI is going to transform shoppers into agents as consumers delegate shopping and buying. Hence, much of the marketplace ahead will be AI-to-AI.

Everything about marketing in the past has been designed for human audiences. Every theoretical concept, every practical application, everything has been about persuading humans.

So far, AI is just another tool for persuading humans, and thus the same challenges as ever. But when consumers hand off shopping and buying to AI agents, humans will be completely out of the loop.

This is not an issue that marketers have had to face before. Humans have always been the target.

But algorithms or AI itself will soon be the target. That’s fundamentally new. All the other stuff is old wine in a new bottle. AI agents are a new vintage.

Until then, marketers would do well not to get ahead of themselves or allow themselves to be intimidated by the changes at work with AI.

The nature of today’s challenges is identical to the challenges of the past. Marketers have successfully met those challenges. There is no reason to suspect anything different now — as long as it’s people at the receiving end.

Data infrastructure cloud

Federated Query Platform for ML at Scale: Architecture and Multi-Tenancy

Guidewire deployed Apache Trino to create a federated query layer across Iceberg, Redshift, and S3, exposing gaps in S3 governance.

Summary

What: Guidewire implemented Trino for federated SQL querying across diverse storage backends (Iceberg, Redshift, OpenSearch). While ABAC (Attribute-Based Access Control) worked well, the team found that S3-level security lacked granular governance.
Why it matters: Engineering organizations attempting to centralize access to heterogeneous data stores often find that standard RBAC/ABAC tools fail to provide sufficient control at the object-storage level.

Decoder

  • Federated Query: A technique that allows users to run a single query across multiple independent data sources without moving the data into a central repository.
  • Iceberg: An open table format for large analytic datasets that supports schema evolution and partition evolution.
  • RBAC: Role-Based Access Control, a method of restricting system access to authorized users based on their assigned roles.

Original Article

Guidewire used Apache Trino to federate SQL across Iceberg, Redshift, OpenSearch, and customer S3 buckets, speeding exploratory ML workflows. Domain catalogs and Lake Formation ABAC plus Trino RBAC handled isolation, but Trino's S3 layer exposed a governance gap.
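
To make the federated-query idea concrete, here is a toy sketch in Python. It uses SQLite's ATTACH DATABASE as a stand-in for separate catalogs: two independent in-memory databases play the role of different sources, and a single SQL statement joins across them. The table names, schema names, and rows are hypothetical, and this is neither Guidewire's setup nor Trino's API.

```python
import sqlite3


def federated_join():
    # "main" stands in for one source (for example, a warehouse).
    conn = sqlite3.connect(":memory:")
    # Attach a second, independent in-memory database as another source.
    conn.execute("ATTACH DATABASE ':memory:' AS lake")

    # Hypothetical tables, one per source.
    conn.execute("CREATE TABLE main.policies (policy_id INTEGER, holder TEXT)")
    conn.execute("CREATE TABLE lake.claims (policy_id INTEGER, amount REAL)")
    conn.executemany(
        "INSERT INTO main.policies VALUES (?, ?)",
        [(1, "Ada"), (2, "Grace")],
    )
    conn.executemany(
        "INSERT INTO lake.claims VALUES (?, ?)",
        [(1, 120.0), (1, 30.0), (2, 75.5)],
    )

    # One query spans both sources; no data is copied between them first.
    rows = conn.execute(
        """
        SELECT p.holder, SUM(c.amount) AS total
        FROM main.policies AS p
        JOIN lake.claims AS c ON c.policy_id = p.policy_id
        GROUP BY p.holder
        ORDER BY p.holder
        """
    ).fetchall()
    conn.close()
    return rows
```

In a real federated engine the two schemas would be connectors to systems such as Iceberg or Redshift, and access control would have to be enforced consistently across every one of them, which is where the governance gap described above shows up.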

Data enterprise backend

The Identity Crisis: Why Entity Resolution Is the Missing Foundation of Every Data Product Stack

Identity resolution is the 'missing foundation' for reliable AI, as poor entity mapping leads to duplicate data and downstream model poisoning.

Summary

What: Modern data stacks require a robust entity resolution layer that bridges the gap between raw data and AI consumption. This involves using blocking, ML-based matching, and graph-based clustering to de-duplicate entities like customers across multiple sources.
Why it matters: As AI agents become the primary users of data, 'trash-in-trash-out' risks are magnified, making entity resolution a mandatory infrastructure layer rather than a niche data engineering task.

Deep Dive

  • Blocking: A pre-processing step used to narrow down candidate pairs for matching, reducing computational complexity.
  • Graph Clustering: Uses network relationships to group potentially identical entities that rule-based systems might miss.
  • Warehouse-native: The trend toward running matching processes directly within the data warehouse/lakehouse to maintain a single source of truth.
  • Risk: Inaccurate identity mapping causes phantom records, leading to incorrect agent decisions and poor model training performance.
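
The blocking, matching, and clustering steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the article's implementation: the blocking key (surname initial plus postal code), the 0.85 similarity threshold, and the record fields are all hypothetical, and a plain string-similarity score stands in for a rule-based or ML matcher.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first letter of the surname plus postal code.
    surname = record["name"].split()[-1]
    return (surname[0].lower(), record["zip"])


def similar(a, b, threshold=0.85):
    # String similarity stands in for a rule-based or ML matcher.
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold


def resolve(records):
    # 1) Blocking: only records sharing a key become candidate pairs.
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[blocking_key(rec)].append(idx)

    # 2) Matching, 3) clustering: union-find merges matched pairs so that
    #    chains of matches end up in one connected component.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for members in blocks.values():
        for i, j in combinations(members, 2):
            if similar(records[i], records[j]):
                parent[find(i)] = find(j)

    clusters = defaultdict(list)
    for idx in range(len(records)):
        clusters[find(idx)].append(idx)
    return sorted(clusters.values())


# Hypothetical sample records.
RECORDS = [
    {"name": "Jon Smith", "zip": "10001"},
    {"name": "John Smith", "zip": "10001"},
    {"name": "Jane Smyth", "zip": "10001"},
    {"name": "John Smith", "zip": "94105"},
]
```

Calling `resolve(RECORDS)` groups the first two records together and leaves the others alone. The last record is never compared with the first two because it falls in a different block, which shows the trade-off blocking makes: far fewer comparisons, at the risk of missing matches that a coarser key or a human review step would catch.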

Decoder

  • Entity Resolution: The task of determining whether two or more records in a database represent the same real-world entity.
  • MDM: Master Data Management; the process of creating and maintaining a single, consistent source of truth for critical enterprise data entities.
  • Model Poisoning: The introduction of bad, inaccurate, or malicious data into a training set, resulting in sub-optimal model behavior.

Original Article

Identity resolution and warehouse-native MDM are core infrastructure for trusted data products, AI, and compliance. At enterprise scale, local checks fall short, creating duplicate customers, phantom entities, and model-poisoning risk. The pattern combines blocking, rule-based and ML matching, graph clustering, and human review inside the warehouse.

Data infrastructure aws

Why We Moved from Hive-Style Data Lakes to Apache Iceberg?

Moving from Hive-style data lakes to Apache Iceberg enables efficient data pruning, schema evolution, and time travel.

Summary

What: The migration replaces a Hive-style layout (Parquet files on S3 cataloged in AWS Glue) with Apache Iceberg, following AWS's improved ecosystem support (including S3 Tables), to resolve performance limitations in query planning and partition management.
Why it matters: This transition highlights the shift toward open table formats that decouple physical storage from metadata, solving the scalability bottlenecks inherent in legacy Hive partition schemes.

Decoder

  • Hive-style data lake: A storage architecture where data is organized into folders representing partition keys, which becomes inefficient to query as the number of partitions and files scales.

Original Article

Moving from a Hive-style partitioned data lake (Parquet files on S3 + AWS Glue Catalog) to Apache Iceberg after AWS strengthened its ecosystem support (including S3 Tables) solved long-standing pain points with Hive-style architectures. Iceberg enables efficient pruning, seamless schema and partition evolution, time travel, and better query planning performance.
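
The difference in pruning can be sketched in Python. This is a simplified illustration, not Iceberg's actual metadata format: in the Hive-style case the partition value exists only in the directory name, so the planner lists paths and parses them; in the table-format case a manifest records per-file value ranges, so files can be skipped on a tracked column without any directory listing. All paths, column names, and ranges below are hypothetical.

```python
from dataclasses import dataclass

# Hive-style: partition values live only in the directory names.
HIVE_PATHS = [
    "events/dt=2024-01-01/part-0.parquet",
    "events/dt=2024-01-02/part-0.parquet",
    "events/dt=2024-01-03/part-0.parquet",
]


def hive_prune(paths, dt):
    # The planner must enumerate every path and parse the key out of it,
    # and it can only prune on the column encoded in the folder name.
    return [p for p in paths if f"/dt={dt}/" in p]


@dataclass
class FileEntry:
    # Simplified stand-in for a manifest entry with column bounds.
    path: str
    min_user_id: int
    max_user_id: int


MANIFEST = [
    FileEntry("data/a.parquet", 1, 100),
    FileEntry("data/b.parquet", 101, 200),
    FileEntry("data/c.parquet", 201, 300),
]


def manifest_prune(manifest, user_id):
    # Metadata-driven pruning: skip files whose recorded range
    # cannot contain the requested value.
    return [f.path for f in manifest
            if f.min_user_id <= user_id <= f.max_user_id]
```

Because the layout is described by metadata rather than by folder names, a table format can also change its partitioning scheme later without rewriting paths, which is the partition evolution the summary refers to.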

Data ai llm research

Fine-tuning a clinical AI model to frontier parity

Heidi AI matched a frontier LLM's clinical performance using a significantly smaller, fine-tuned model trained on proprietary clinician feedback.

Summary

What: Heidi AI is replacing the general-purpose frontier model in its Evidence product with a smaller, specialized model of its own, which has reached parity in blind side-by-side clinician evaluations.
Why it matters: This suggests that for domain-specific tasks, high-quality human preference data and a tight feedback loop can let a much smaller model match larger, general-purpose models.

Decoder

  • Frontier parity: Reaching the same level of performance as the most advanced (frontier) models currently available.

Original Article

Why bigger isn't always better in clinical AI

You don't need frontier scale to reach frontier quality. You need a reward signal that's yours alone, and a tight loop to learn from it. Six weeks ago, we started replacing the best frontier model running in Heidi Evidence with a model of our own, a fraction of its size. On blind side-by-side evaluation, it has already reached parity, to the point where clinicians can no longer tell which is which.

This post is about how we got there, what the result does and doesn't cover, and why we think the pattern generalizes beyond our own use at Heidi.

The signal only clinicians can give

Evidence is Heidi's clinical search product, free to use outside of a patient session. A clinician asks a question and gets an answer grounded in real sources. Evidence has answered more than 3.5 million questions since launch. It’s not the volume of questions that’s valuable; it's that Evidence answers are backed by something the general-purpose labs can't buy, a real clinician telling us which of two responses was the better one. That preference is the signal we train on.
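
One way to reason about "parity" on pairwise preferences is to check whether the smaller model's win rate is statistically indistinguishable from 50%. The Python sketch below is an assumption about how such a check could look, not Heidi's published method: it computes a Wilson score interval for the win rate and asks whether 0.5 falls inside it. The function names and the example counts are hypothetical, and ties are assumed to have been excluded beforehand.

```python
import math


def wilson_interval(wins, total, z=1.96):
    # Wilson score interval for a binomial proportion
    # (z = 1.96 gives roughly 95% coverage).
    if total == 0:
        raise ValueError("no comparisons recorded")
    p = wins / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half, center + half


def at_parity(wins, total):
    # "Parity" here means the interval for the win rate contains 0.5.
    low, high = wilson_interval(wins, total)
    return low <= 0.5 <= high
```

With 252 wins out of 500 hypothetical comparisons, `at_parity(252, 500)` holds, while 350 wins out of 500 does not. An interval that contains 0.5 is only as convincing as it is narrow, so the number of blind comparisons matters as much as the win rate itself.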

How we measure

Data enterprise infrastructure

How Data 360 Segmentation Processes a Quadrillion Records Across Arbitrary Customer Data Models

Salesforce engineers are tackling metadata bottlenecks as they scale Data 360 segmentation to process a quadrillion records per month.

Summary

What: The system processes Spark jobs for arbitrary customer schemas, with metadata bloat—sometimes reaching 500MB payloads—identified as the primary performance constraint when generating query plans.
Why it matters: This illustrates the extreme metadata challenges faced when managing massive multi-tenant data platforms where standard schema-heavy approaches fail.

Original Article

Salesforce's Data 360 segmentation processes 1 quadrillion records per month across arbitrary customer schemas, relationship graphs, and storage systems, while running about 3 million Spark jobs monthly. Metadata became the bottleneck, with some environments hitting 3,000 to 6,000 tables, 500+ MB metadata payloads, and billions of candidate query plans.

AI enterprise

ChatGPT Improves Scheduled Tasks and Retires Pulse

OpenAI has overhauled its task scheduling system in ChatGPT, replacing the 'Pulse' feature with a more reliable, dedicated Scheduled page.

Summary

What: The update provides faster, more reliable task management across web and mobile platforms for Go, Plus, Pro, Business, and Enterprise tiers.

Original Article

ChatGPT has introduced an improved task scheduling feature, enhancing speed and reliability. It is accessible via the new Scheduled page and available to Go, Plus, Pro, Business, and Enterprise users.

AI research

MolmoMotion: Language-guided 3D motion forecasting

Allen Institute for AI introduced MolmoMotion, a language-guided model capable of predicting 3D point trajectories in videos.

Summary

What: The research includes the MolmoMotion-1M dataset, which contains 3D point trajectory data from 1.16 million videos to improve motion forecasting accuracy.

Original Article

MolmoMotion, a new motion forecasting model, predicts future 3D point trajectories in videos from language instructions and initial object positions, outperforming existing methods. Its accompanying dataset, MolmoMotion-1M, contains 3D point trajectories from 1.16M videos, and the release includes a benchmark for accuracy testing.

Tech hardware mobile

Apple Prepares Second-Generation iPhone Air for Spring 2027

Apple is set to launch a second-generation iPhone Air in Spring 2027 featuring a dual-camera system and an A20 Pro processor.

Summary

What: The upcoming device retains the existing form factor but upgrades to ultrawide photography capabilities and improved battery efficiency using the A20 Pro chip.

Original Article

The second-generation iPhone Air is now in advanced testing. Apple plans to launch the device in Spring 2027. The device will retain its current look, but with a second rear camera for ultrawide-angle photography and an improved battery life. It will be powered by a version of the A20 Pro processor, the same chip coming to this Fall's iPhones.

Tech startup career

Mark Zuckerberg Orders His Employees to Start Having Fun Again After Brutal Layoffs Culled Their Colleagues

Meta employees are rejecting Mark Zuckerberg's call for an AI hackathon, citing burnout and job insecurity following the recent layoff of 8,000 workers.

Summary

What: Despite Mark Zuckerberg's attempt to boost morale, Meta employees are increasingly critical of the company's 'hot desk' policies and the workload shift following a 10% workforce reduction.
Why it matters: Management's attempts to simulate a pre-layoff 'hackathon culture' reveal a disconnect between executive leadership and the reality of a demoralized, overworked staff.

Original Article

Mark Zuckerberg attempted to lift employee spirits by promising to host a companywide AI hackathon in July, but workers told him that they were in no mood for such a thing.

Design enterprise

KFC partner with JKR for major global brand identity overhaul

KFC is undergoing a global brand identity overhaul in partnership with JKR, shifting toward a more unified and experience-driven visual system.

Summary

What: KFC and creative agency JKR are redesigning the brand's logo, typography, packaging, and restaurant interiors to replace fragmented regional styles with a consistent, immersive identity.
Why it matters: This transition illustrates the move from tactical, quick-service branding to creating high-touch, experience-focused environments designed to drive brand loyalty in a saturated market.

Original Article

KFC has unveiled a major global rebrand developed with JKR, refreshing everything from its logo, typography, packaging, and digital presence to restaurant interiors and staff uniforms. Built around a standardized version of its iconic bucket and a more expressive visual system, the redesign aims to strengthen global consistency, celebrate the brand's heritage, and transform KFC from a traditional quick-service chain into a more immersive, experience-driven destination through redesigned restaurants, cultural activations, and new customer touchpoints.

Design career

The irritating words and phrases I wish I could mute on Designer LinkedIn

Designer LinkedIn has devolved into a cycle of hyperbolic buzzwords and alarmist AI commentary that ignores actual craft in favor of engagement-bait.

Summary

What: Journalist Tom May criticizes the prevalence of hollow phrases like 'this changes everything' and 'adapt or die' on LinkedIn, noting that they contribute to the dilution of professional discourse.
Why it matters: The platform has incentivized performative, alarmist content, leading to a 'rhetorical landfill' that distracts from the core work of design.
Takeaway: Delete phrases like 'let that sink in' and 'this changes everything' from your professional communication to avoid sounding like typical engagement-farming content.

Original Article

LinkedIn's design community is increasingly criticized for favoring exaggerated claims, recycled AI commentary, and dramatic storytelling over clear, concise communication. Many common phrases and formatting habits create a false sense of urgency or insight, turning straightforward observations into attention-seeking content that often adds more noise than value.

Design web

Turn Product Screens Into Premium Visuals (Website)

Ultramock provides templates for rendering UI designs into professional 3D mockups and promotional videos.

Summary

What: Ultramock is a web-based tool for creating visual assets, mockups, and videos specifically for UI, websites, and mobile applications.

Original Article

Create premium, highly customizable videos, visuals, and 3D mockups that showcase UI designs, websites, and apps.

Design figma

Moodboard 3000 (Figma plugin)

Moodboard 3000 is a new Figma plugin that automates the layout of image collections into cohesive moodboards.

Summary

What: The plugin allows users to select images on a Figma canvas and automatically generate structured layouts for inspiration boards.

Original Article

Moodboard 3000 is an image layout engine for Figma. Select images on canvas, choose a layout, and generate a composed moodboard.

Design

How Hannah Li paints light, and the quiet moment just before something happens

Illustrator Hannah Li’s recent series, The Way Back, moves away from commercial publishing to focus on light, atmosphere, and the stillness of transient spaces.

Summary

What: Hannah Li, a New York-based illustrator, has pivoted from a successful publishing career to create a personal series exploring the quiet, suspended moments in train stations and street corners.

Original Article

Hannah Li's personal series, The Way Back, uses light, atmosphere, and everyday spaces to capture quiet moments of reflection, memory, and transition.

Design research

Optical Illusions and How They Work

Optical illusions demonstrate that the human brain actively constructs perception by filling sensory gaps rather than passively recording incoming data.

Summary

What: The American Museum of Natural History notes that optical illusions expose the brain's tendency to generate images and resolve gaps in sensory input to form a cohesive reality.
Why it matters: This underscores the inherent fallibility of human perception when interpreting visual data, a concept relevant to UI/UX design and data visualization where cognitive bias impacts how users interpret information.

Original Article

Optical illusions reveal that the brain doesn't passively receive sensory data — it actively constructs perception, sometimes filling in gaps or generating images that aren't there.

Design sustainability

Yuya Zhou's Genesis and the Design of Impermanence

Designer Yuya Zhou launched a collection of seven bioplastic lamps that are intentionally engineered to deform and degrade over time.

Summary

What: Presented at Milan Design Week 2026, the 'Genesis' series uses a composite of starch, gelatin, glycerin, and edible pigments that change shape and color based on environmental conditions.
Why it matters: This project challenges the industrial standard of designing for infinite durability, proposing a 'post-extractive' design philosophy where materials are treated as temporary, evolving participants rather than static, manufactured components.

Deep Dive

  • The collection includes seven pieces: four table lamps, two pendant lamps, and one floor lamp.
  • Materials are cast from a mixture of water, gelatin, starch, glycerin, and pigments.
  • Design goal is 'controlled instability,' where the material’s structural integrity varies intentionally.
  • The lamps were exhibited in the 'No space for waste' section of the Isola Design District.
  • Each piece functions as a 'non-repeatable outcome' due to its reaction to ambient humidity and temperature.
  • The series is arranged as a lifecycle, starting with fluid, light structures and progressing to denser, more complex forms.

Decoder

  • Bioplastic: A plastic material produced from renewable biomass sources such as vegetable fats, oils, corn starch, or microbiota.
  • Post-extractive material thinking: A design methodology focusing on materials that are circular and regenerative rather than those requiring permanent extraction of raw natural resources.

Original Article

Yuya Zhou's Genesis, a series of seven bioplastic light objects shown at Milan Design Week 2026, challenges the conventional permanence of design by using materials — starch, gelatin, glycerin, and edible pigments — that evolve, deform, and change over time.
