NVIDIA and AWS Collaborate to Bring AI to Production at Scale
AWS and NVIDIA are partnering to standardize GPU-accelerated AI infrastructure by launching G7 instances and integrating NVIDIA cuVS into Amazon OpenSearch.
Summary
Deep Dive
- G7 instances support 1 to 8 GPU configurations with up to 256GB of total GPU memory.
- Networking reaches 700 Gbps using AWS EFA (Elastic Fabric Adapter).
- AWS achieved 'NVIDIA Exemplar Cloud' status for GB300 performance benchmarks.
- cuVS makes vector indexing up to 10x faster at a quarter of the cost of CPU-only builds.
Decoder
- Inference: The process of running a trained machine learning model to make predictions or generate content.
- RAG (Retrieval-Augmented Generation): An AI architecture that connects an LLM to external data sources to ground its responses.
- cuVS: A library by NVIDIA designed to accelerate vector similarity search and clustering.
Original Article
Building AI systems at scale is demanding, requiring low-latency inference, fast vector search, strong GPU price-performance and infrastructure that can grow without multiplying operational complexity.
NVIDIA’s latest work with Amazon Web Services (AWS) addresses each of those constraints. Across Amazon OpenSearch and Amazon EC2, NVIDIA AI infrastructure is giving enterprises more practical paths to deploy AI at production scale.
EC2 G7 instances powered by NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs expand the compute layer for AI, graphics, video and data analytics workloads, while the NVIDIA cuVS library accelerates the retrieval layer by making GPU-powered vector indexing the default in OpenSearch Serverless. And with AWS achieving NVIDIA Exemplar Cloud status for NVIDIA GB300, customers can trust they’re receiving peak optimized performance for their training workloads.
NVIDIA RTX PRO 4500 Blackwell Server Edition Multi-Workload GPUs Power New Amazon EC2 G7 Instances
Amazon EC2 G7 instances bring NVIDIA RTX PRO 4500 Blackwell Server Edition GPUs to AWS for AI inference, graphics, spatial computing and GPU-accelerated data analytics — delivering a new instance type engineered for production workloads that need performance without the operational overhead of a customer-managed GPU platform.
Compared with G6 instances, G7 delivers up to 4.6x AI inference performance, up to 2.1x graphics performance and significantly faster GPU-accelerated data analytics on Amazon EMR using the NVIDIA cuDF library for Apache Spark workloads.
With support for up to eight GPUs, 256GB of total GPU memory, 700 Gbps of EFA-enabled networking and up to 7.6TB of local NVMe SSD storage (across one-, two-, four- and eight-GPU configurations, with bare metal coming soon), G7 instances let customers right-size infrastructure for their workloads instead of over-provisioning for them.
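As a rough illustration of right-sizing, the sketch below picks the smallest listed GPU count whose combined memory covers a workload. It assumes GPU memory is split evenly across GPUs (256 GB / 8 = 32 GB each), which the article does not state explicitly. `smallest_config` and the sample footprints are hypothetical and not part of any AWS API.

```python
# Hypothetical right-sizing helper; not an AWS API.
GPU_COUNTS = (1, 2, 4, 8)          # configurations listed in the article
TOTAL_GB_AT_MAX = 256              # total GPU memory with eight GPUs
GB_PER_GPU = TOTAL_GB_AT_MAX / 8   # assumption: memory is split evenly


def smallest_config(required_gb: float) -> int:
    """Return the smallest GPU count whose combined memory fits the workload."""
    for count in GPU_COUNTS:
        if count * GB_PER_GPU >= required_gb:
            return count
    raise ValueError(f"{required_gb} GB exceeds the largest configuration")


# Sample memory footprints in GB (made up for illustration).
for need in (20, 48, 100, 200):
    print(need, "GB ->", smallest_config(need), "GPU(s)")
```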
The platform’s versatility means AI teams get lower-latency inference. Media and entertainment teams get high-resolution video workflows and rendering. Simulation, computer-aided design, virtual desktop infrastructure, gaming and spatial computing teams get the same instance type for graphics-intensive applications. And data teams can apply the GPU memory, local storage and networking improvements to analytics pipelines and vector database workloads.
G7 instances are accessible through AWS Deep Learning Amazon Machine Images (AMIs), Amazon Deep Learning Containers, Amazon EMR, Amazon EKS, Amazon ECS and graphics AMIs — and coming soon to Amazon SageMaker AI.
NVIDIA cuVS Makes GPU-Accelerated Vector Search the Default in Amazon OpenSearch
The next generation of Amazon OpenSearch Serverless powers agentic AI and dynamic workloads with no infrastructure management required. It uses GPU-accelerated vector indexing, powered by NVIDIA cuVS, as the default compute choice for all vector collections.
For teams building retrieval-augmented generation, semantic search, recommendation systems and agentic AI applications, that shift matters. It turns GPU-powered vector search from a specialized optimization project into a standard AWS capability.
The customer impact is direct: vector indexing up to 10x faster at a quarter of the cost, compared with CPU-only builds — making billion-scale vector databases practical to build in under an hour.
By making NVIDIA cuVS the default in OpenSearch Serverless, AWS customers get a much faster path from raw data to production-ready AI retrieval infrastructure — with serverless scaling that reduces operational overhead when workloads are idle.
AWS Achieves NVIDIA Exemplar Cloud Status for GB300 Training Performance
AWS has achieved NVIDIA Exemplar Cloud status on NVIDIA GB300 for training workloads. This means AWS meets the rigorous performance thresholds that NVIDIA uses to benchmark AI workloads against its reference architecture.
This achievement is the result of deep co-engineering efforts between AWS and NVIDIA teams. Through the NVIDIA Exemplar Clouds initiative, developers and AI leaders can be confident they’re using consistent, high-performance cloud infrastructure for large-scale training, helping teams evaluate cloud providers with greater confidence, improve total cost of ownership and move AI projects from planning to production more efficiently.
Together, these advancements reinforce every layer of the AI infrastructure stack on AWS. The throughline is the same: production-grade AI infrastructure that performs at scale, without adding operational burden to the teams running it.
Run isolated sandboxes with full lifecycle control: AWS Lambda introduces MicroVMs
AWS Lambda MicroVMs provide stateful, isolated environments for untrusted code, enabling rapid, VM-level startup using Firecracker snapshots.
Summary
Deep Dive
- MicroVM Image: A pre-initialized Firecracker snapshot that Lambda builds from a Dockerfile and a zipped code artifact uploaded to Amazon S3.
- Firecracker: A lightweight virtualization technology built by AWS that uses KVM to create micro-virtual machines.
- Suspend/Resume: The ability to pause a VM to memory/disk and resume it instantly, avoiding cold-start latency.
Decoder
- Firecracker: An open-source VMM (Virtual Machine Monitor) that uses KVM to launch and manage micro-VMs with minimal memory overhead.
Original Article
Run isolated sandboxes with full lifecycle control: AWS Lambda introduces MicroVMs
Today, we are announcing AWS Lambda MicroVMs, a new serverless compute primitive within AWS Lambda that lets you run code generated by users or AI in isolated, stateful execution environments. You get virtual machine level isolation, near-instant launch and resume, and direct control over environment lifecycle and state, all without managing infrastructure or building expertise in complex virtualization technologies. Lambda MicroVMs are powered by Firecracker, the same lightweight virtualization technology that has powered over 15 trillion monthly Lambda function invocations.
Why customers need this
Over the past few years a new class of multi-tenant applications has emerged that all share the need to hand each end user their own dedicated execution environment in which to safely run code that the application developer did not write. AI coding assistants, interactive code environments, data analytics platforms, vulnerability scanners, and game servers that run user-supplied scripts all fit this pattern. Building that capability today means making a difficult choice. Virtual machines deliver strong isolation but take minutes to start. Containers launch in seconds, yet their shared-kernel architecture requires significant custom hardening to safely contain untrusted code. Functions as a service are optimized for event-driven, request-response workloads, but are not designed for long-running interactive sessions that need to retain environment state across user interactions. That leaves developers either accepting tradeoffs between performance and isolation, or investing significant engineering resources to build and operate custom virtualization infrastructure to achieve isolated execution while delivering low-latency experiences to end-users. This presents an effort that demands deep expertise and pulls engineering time away from the product they are actually trying to build.
Lambda MicroVMs is purpose-built for exactly this gap. Each MicroVM gives a single end user or session its own isolated environment that launches rapidly, retains memory and disk state for the length of the session, and pauses to a low idle cost when the user steps away. Because the same Firecracker technology already underpins AWS Lambda Functions, you inherit the operational maturity of a service that has been running this stack at scale.
Let’s try it out
To get started, I navigated to the AWS Lambda console, where Lambda MicroVMs now appears in the left-hand navigation menu. I first need to create a MicroVM Image.
I packaged a Flask web app and its Dockerfile into a zip file and uploaded it to an Amazon Simple Storage Service (Amazon S3) bucket.
My Flask API – app.py
import logging

from flask import Flask, jsonify

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)

@app.route("/")
def hello():
    app.logger.info("Received request to hello world endpoint")
    return jsonify(message="Hello, World!")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
My Dockerfile
FROM public.ecr.aws/lambda/microvms:al2023-minimal
RUN dnf install -y python3 python3-pip && dnf clean all
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app.py .
EXPOSE 5000
CMD ["gunicorn", "--bind", "0.0.0.0:5000", "app:app"]
I used the following command to create my MicroVM Image.
aws lambda-microvms create-microvm-image \
--code-artifact uri=<path/to/s3/artifact.zip> --name <VM_image_name> \
--base-image-arn arn:aws:lambda:us-east-1:aws:microvm-image:al2023-1 \
--build-role-arn <IAM role ARN>
You can also create the MicroVM Image in the AWS Console as in the image above. Once I ran the command, Lambda retrieved the zip, ran the Dockerfile, initialized the application, and took a Firecracker snapshot of the running disk and memory state. Build logs streamed in real time to Amazon CloudWatch under /aws/lambda/microvms/<image-name>, and when the image was ready it appeared in the console with its Amazon Resource Name (ARN) and version number.
aws lambda-microvms run-microvm \
--image-identifier arn:aws:lambda:<region>:<acct>:microvm-image:my-image \
--execution-role-arn arn:aws:iam::<acct>:role/MicroVMExecutionRole \
--idle-policy '{"maxIdleDurationSeconds":900,"suspendedDurationSeconds":300,"autoResumeEnabled":true}'
The MicroVM can be launched from the AWS Console or, as shown here, with the CLI. I passed the image ARN and an idle policy configured to auto-suspend after 15 minutes of inactivity and auto-resume on the next incoming request. No networking setup was required. Lambda assigned the MicroVM a unique ID, returned a dedicated endpoint URL, and started a new MicroVM resumed from the snapshot, so my Flask app was already running the moment the launch completed. It took one API call to get a fully initialized, bootstrapped compute environment.
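The idle policy passed to the launch command can be read as a small state machine. The sketch below models it under an assumption the walkthrough does not confirm: that `suspendedDurationSeconds` bounds how long a suspended MicroVM is kept before it is discarded. `IdlePolicy`, `state_after_idle` and the `"expired"` state are illustrative names, not part of any AWS SDK.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class IdlePolicy:
    max_idle_s: int      # maxIdleDurationSeconds
    suspended_s: int     # suspendedDurationSeconds (meaning assumed, see lead-in)
    auto_resume: bool    # autoResumeEnabled

    @classmethod
    def from_json(cls, text: str) -> "IdlePolicy":
        raw = json.loads(text)
        return cls(
            raw["maxIdleDurationSeconds"],
            raw["suspendedDurationSeconds"],
            raw["autoResumeEnabled"],
        )


def state_after_idle(policy: IdlePolicy, idle_s: int) -> str:
    """State of a MicroVM that has received no traffic for idle_s seconds."""
    if idle_s < policy.max_idle_s:
        return "running"
    if idle_s < policy.max_idle_s + policy.suspended_s:
        return "suspended"
    return "expired"  # assumption: a suspended MicroVM is not kept forever


POLICY = IdlePolicy.from_json(
    '{"maxIdleDurationSeconds":900,"suspendedDurationSeconds":300,"autoResumeEnabled":true}'
)

for t in (60, 900, 1199, 1200):
    print(t, state_after_idle(POLICY, t))
```

With the values from the command, the modeled MicroVM stays running for the first 15 minutes of inactivity and suspended for the following 5.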
To send traffic, I generated a short-lived auth token with the CLI and attached it to a plain HTTPS request using the X-aws-proxy-auth header. The request landed on my Flask app immediately. I then let the MicroVM sit idle past the suspend threshold, at which point the MicroVM was suspended, with its memory and disk state snapshotted and stored. I then sent another request, and it resumed with the application state fully intact. From the client side, the pause never happened.
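A minimal sketch of that request, using only the Python standard library. The endpoint URL and token are placeholders, and passing the token as the raw header value is an assumption; the article names only the `X-aws-proxy-auth` header. The request is built but not sent.

```python
import urllib.request


def build_request(endpoint_url: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an HTTPS GET carrying the proxy auth token."""
    req = urllib.request.Request(endpoint_url, method="GET")
    req.add_header("X-aws-proxy-auth", token)
    return req


# Placeholders: substitute the endpoint returned at launch and a real token.
req = build_request("https://example.invalid/", "<short-lived-token>")
print(req.get_method(), req.full_url)

# To actually send it:
#   body = urllib.request.urlopen(req, timeout=10).read()
```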
How it works
Under the covers, Lambda MicroVMs delivers three capabilities that, until today, no single AWS compute service offered together. The first is virtual machine level isolation, which comes from Firecracker. Each session runs in its own dedicated MicroVM with no shared kernel and no shared resources between users, so untrusted code supplied by one user is contained to their execution environment, without access to other environments or the underlying system.
The second is rapid launch and resume. The model is image-then-launch: you create a MicroVM Image by supplying a Dockerfile and code packaged as a zip artifact in Amazon S3, and Lambda runs your Dockerfile, initializes your application, and takes a Firecracker snapshot of the running environment’s memory and disk state. Every subsequent MicroVM launched from that image resumes from the pre-initialized snapshot rather than booting cold, which means launches and idle resumes both achieve near-instant startup latency. Even a multi-gigabyte interactive session comes back online quickly enough to feel responsive to the end user.
The third is stateful execution. A running MicroVM retains memory, disk, and running processes across the user’s session. During idle periods, a MicroVM can be suspended – with memory and disk state intact – and resumed when traffic arrives. Installed packages, loaded models, and working filesets are readily available when the user resumes their session. MicroVMs support up to 8 hours of total runtime and can be suspended automatically after a configurable idle window, which makes it straightforward to build products as varied as software vulnerability scans that complete in minutes, data analytics applications that run for hours, and interactive coding sessions with extended idle periods.
As Lambda MicroVMs are started from pre-initialized snapshots, applications generating unique content, establishing network connections, or loading ephemeral data during initialization may need to integrate with service-provided hooks for compatibility.
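To see why snapshot-based launches can need such hooks, consider a toy model in which a launch is a copy of an already-initialized object. Anything generated during initialization, such as an ID, is cloned into every launch unless it is refreshed afterwards. The `after_restore` hook here is hypothetical; the article does not name the service-provided hooks.

```python
import copy
import uuid


class App:
    """Stand-in for an application initialized once at image-build time."""

    def __init__(self) -> None:
        self.session_id = uuid.uuid4().hex   # generated during initialization
        self.cache = {"model": "loaded"}     # expensive state worth snapshotting

    def after_restore(self) -> None:
        """Hypothetical hook: refresh anything that must differ per MicroVM."""
        self.session_id = uuid.uuid4().hex


def launch(snapshot: App, run_hook: bool) -> App:
    """Model a launch as a copy of the snapshot instead of a fresh init."""
    vm = copy.deepcopy(snapshot)
    if run_hook:
        vm.after_restore()
    return vm


snapshot = App()
a, b = launch(snapshot, run_hook=False), launch(snapshot, run_hook=False)
c, d = launch(snapshot, run_hook=True), launch(snapshot, run_hook=True)

print("without hook, IDs collide:", a.session_id == b.session_id)
print("with hook, IDs collide:", c.session_id == d.session_id)
```

The expensive state (`cache`) survives the copy in both cases; only the value that must be unique is regenerated.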
Lambda MicroVMs is a new resource within AWS Lambda, with a distinct API surface. Lambda Functions remain the right choice for event-driven, request-response workloads, and Lambda MicroVMs is purpose-built for multi-tenant applications that need to hand each end user or session their own isolated environment to execute user- or AI-generated code. The two complement each other. An application using Lambda Functions for its event-driven backbone can call into Lambda MicroVMs for the steps that need to run untrusted code in isolation. You bring the application, and the service delivers the execution environment.
Now available
AWS Lambda MicroVMs is available today in the US East (N. Virginia, Ohio), US West (Oregon), Europe (Ireland) and Asia Pacific (Tokyo) Regions, on the ARM64 architecture, with up to 16 vCPUs, 32 GB of memory, and 32 GB of disk per MicroVM. Idle MicroVMs can be suspended explicitly through an API call or automatically through a lifecycle policy, which reduces the running cost while preserving full state for fast resume. Pricing details can be found on the AWS Lambda pricing page.
To get started, visit the AWS Lambda console, or learn more on the Lambda MicroVMs product page. For documentation, see the Lambda MicroVMs Developer Guide.
How we found a bug in the hyper HTTP library
A silent data truncation bug in hyper was traced to a race condition that emerged only after a performance-boosting architecture change.
Summary
Deep Dive
- Symptom: Large image responses were intermittently truncated with 200 OK status codes.
- Cause: hyper's dispatch loop discarded the poll_flush return value, proceeding to shutdown before the buffer was empty.
- Trigger: A move to local Unix sockets made the intermediary faster, causing the socket buffer to fill up more frequently compared to the previous network-based intermediary.
- Debugging: Required strace to observe syscalls; the bug disappeared when overhead was added to the process.
- Resolution: Modified poll_shutdown to force a poll_flush check before closing the stream.
- Upstream: Fixed via PR #4018 in the hyper project.
Decoder
- Race condition: A concurrency flaw where the outcome depends on the non-deterministic timing of events—in this case, the flush and shutdown sequence.
- strace: A diagnostic tool for Linux that intercepts and records system calls made by a process and the signals it receives.
- Backpressure: A mechanism where a system limits the rate of data intake to prevent overloading the receiver.
Original Article
How we found a bug in the hyper HTTP library
The Images service, built in Rust on Workers, runs on every machine in Cloudflare’s edge network. To handle client connections, we use hyper, an open-source HTTP library for Rust.
Last year, we introduced the Images binding to enable custom, programmatic workflows for processing remote images in Workers. At the end of 2025, we rearchitected the binding to provide a more direct, local connection between the Workers runtime and the Images service.
Shortly after rollout, we received reports that transformation requests from the binding were failing — but only intermittently and only for larger images. Even stranger, the responses for these requests returned a 200 status without any errors logged. The image data was simply cut short: A response that should have been two megabytes might arrive with a few hundred kilobytes instead.
We spent six weeks chasing a nearly invisible bug — a race condition that occurred only under specific conditions — in the hyper library that impacted how the Images binding returned processed image data back to the client. In the end, it took four lines of code to fix it.
Hops, handoffs, and hyper
When developers build on Cloudflare, they compose full-stack applications from a set of platform services that are accessible to Workers through bindings. Bindings provide direct APIs to resources on the Developer Platform like compute, storage, AI inference, and media processing.
The Images binding decouples image optimization from delivery; you can transcode, composite, or manipulate images without needing to return the output as an HTTP response. It also lets you apply optimization parameters in any order, rather than following the fixed sequence imposed by the URL interface. Here, a worker can pass image data directly to the Images API, chain operations together, and get the processed result back as a stream:
const result = await env.IMAGES
  .input(image)
  .transform({ width: 800, rotate: 90 })
  .output({ format: "image/avif" });
return result.response();
The binding communicates with Images through a socket connection managed by the Workers runtime. A socket connection is a communication channel between two processes. Each end of the socket has buffers that are managed by the operating system’s kernel; these buffers are temporary holding areas where data sits after one side writes it but before the other side reads it.
Hyper manages the connection on the Images service’s side, reading incoming requests from the socket and writing responses back to it.
When a request uses the Images binding, the Images service reads the input, performs the requested optimization operations, and encodes the result. It then passes the entire encoded image to hyper as a single in-memory block.
Hyper writes this response data into its own internal buffer. At this point, hyper considers the encoding work as complete, since it has all the bytes that it needs to send. The next step is to flush its internal buffer to the socket’s outbound buffer, moving the data from the Images service to the intermediary on the other end.
If the reader on the other end is fast, then hyper can flush everything in one pass — the outbound buffer will have room because the reader is consuming data as quickly as it arrives. Once all data is sent, hyper issues a shutdown on the socket, signaling that the connection is finished and no more data will be written. But if the reader is slower (even by a few milliseconds), then the outbound buffer fills up, and hyper needs to wait until there’s room to continue writing.
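The effect of a full outbound buffer can be observed with a small Python sketch. It is not hyper code and makes one assumption: that the operating system's default socket buffer is well below 16 MiB. A non-blocking write to a local socket whose peer is not reading accepts only part of the payload, and the rest has to be written later.

```python
import socket


def first_write(payload: bytes) -> int:
    """Write once to a non-blocking local socket whose peer is not reading."""
    writer, reader = socket.socketpair()
    try:
        writer.setblocking(False)
        # send() may accept only part of the payload when the kernel buffer fills.
        return writer.send(payload)
    finally:
        writer.close()
        reader.close()


payload = b"x" * (16 * 1024 * 1024)   # 16 MiB, larger than typical socket buffers
sent = first_write(payload)
print(f"accepted {sent} of {len(payload)} bytes on the first write")
```

Any writer in this position has to keep the unsent bytes and retry once the socket is writable again, which is the waiting step described above.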
Taking the local
All incoming traffic on Cloudflare's network passes through FL, an internal intermediary service that runs security and performance features and routes requests to the appropriate backend. When we first launched the binding, image data flowed from the Workers runtime, through FL, to the Images service.
This path was a natural fit for our initial release and follows the same architecture as our URL interface. Over time, though, this coupling with FL became a constraint: Every change to the binding had to follow FL’s release cycle.
In December 2025, the Images team replaced FL with a new intermediary service, an internal worker binding that runs on the same machine. In the original architecture, data moved through FL over network sockets; this path carried the overhead of FL’s full processing pipeline, such as DNS lookups and routing.
The internal binding replaced these with Unix sockets to directly connect the services on the same machine, bypassing FL and the overhead of the network stack. This made the request path to Images faster and gave the team independent control over binding releases.
200 OK (not OK)
The first sign of trouble came from a customer with a non-standard setup: two layers of image processing, where one pipeline was nested inside another.
First, their worker used the Images binding to composite multiple large source images from R2 — a JPEG background plus PNG overlay layers — into a single combined JPEG. Second, they further compressed, transcoded, and resized the result through the URL interface.
The inner pipeline (transformation binding) handled compositing. The outer pipeline (transformation URL) handled delivery optimizations like scaling and format conversion. This layered approach meant that when the inner pipeline silently returned a truncated response, the only visible error appeared one level up:
error reading a body from connection: end of file before message length reached
The outer pipeline received HTTP 200 from the inner one, with a Content-Length header that promised several megabytes. The actual body was only a fraction of that: In one request, only ~200 KB arrived out of an expected 3.3 MB. The error surfaced in the outer pipeline, but the truncation could have originated in the binding, the intermediary service, the Images service, or somewhere in between.
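The check that surfaces this kind of failure is simple: compare the number of body bytes received against the Content-Length header. The sketch below is illustrative only. `check_body` and `TruncatedBody` are hypothetical names, and the byte counts are round stand-ins for the approximate figures above.

```python
class TruncatedBody(Exception):
    """Raised when a body is shorter than its Content-Length."""


def check_body(headers: dict[str, str], body: bytes) -> None:
    """Raise if fewer bytes arrived than Content-Length promised."""
    expected = int(headers["Content-Length"])
    if len(body) < expected:
        raise TruncatedBody(f"got {len(body)} of {expected} bytes")


# Round stand-ins for "~200 KB out of an expected 3.3 MB".
headers = {"Content-Length": "3300000"}
body = b"\x00" * 200_000

try:
    check_body(headers, body)
except TruncatedBody as err:
    print("truncated:", err)
```

A 200 status alone says nothing about whether the body is complete, which is why the inner pipeline looked healthy.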
Debugging in the dark
From here, we worked inward through the request path, testing each layer to isolate where the truncation was happening. Some of these efforts hit dead ends; others left breadcrumbs that narrowed the search:
- Building a reproduction. We built a worker that mimicked the customer’s nested setup, then stripped away layers until we could trigger the bug with the binding alone. A small script let us fire requests in batches. In one early run, 19 out of 25 requests failed. The amount of data that did arrive — roughly 200 KB — was suspiciously close to the size of the socket buffer in production. This confirmed that the problem wasn’t tied to the customer’s configuration and gave us a reliable way to trigger the bug on demand.
- Investigating timeouts. Early on, we suspected the truncation might be related to timeout behavior (i.e., the connection was being closed after a time limit). This theory didn’t hold, as the truncation wasn’t correlated with request duration.
- Updating hyper version. When the bug was first reported, we were running 0.14.x, while the latest hyper version was around 1.8.x. We tested across hyper versions 0.14, 1.7, and 1.8, just in case the most obvious answer was the correct (and easiest) one. But the bug appeared in each version, which meant that there wasn’t an upstream fix.
- Reproducing locally. We ran local integration tests on macOS and a Debian VM. Even under considerable load, our local requests never triggered any failure. Making direct curl requests to the binding socket and replaying captured requests always seemed to work. The bug only appeared on the full production path when there was real concurrency and a real Workers runtime client on the other end of the socket. This led us to suspect the runtime itself.
- Ruling out the Workers runtime. We examined the HTTP client that the Workers runtime uses to communicate with Images through the binding socket. None of the traces from either side of the connection showed any syscalls that indicated an unexpected close or early termination. We observed that the client behaved correctly and multiple other services used the same client without issues.
- Distributed tracing. By inspecting request traces end-to-end, we confirmed that the truncated body was already present before it reached the outer transformation layer in the customer’s setup. That narrowed the problem to the inner pipeline — the binding path through the Images service.
- Instrumenting the intermediary service. We added instrumentation to the intermediary service to measure body sizes before forwarding the response data. The bodies were already truncated by the time they left the Images service, so the intermediary was ruled out.
- Deeper tracing within the Images service. At the service level, the request was processed, the image was properly encoded, and the response was sent with HTTP 200.
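The batch reproduction step in the list above can be sketched as a small harness that fires concurrent requests and counts short bodies. The team's actual script is not shown in the article, so everything here is hypothetical: `run_batch`, the `fetch` callable, and the fake fetch that stands in for a real HTTP call by truncating every fourth response.

```python
import itertools
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Fetch = Callable[[], tuple[int, bytes]]  # returns (expected_length, body)


def run_batch(fetch: Fetch, n: int, workers: int = 8) -> int:
    """Fire n concurrent requests; return how many bodies came back short."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: fetch(), range(n)))
    return sum(1 for expected, body in results if len(body) < expected)


def make_fake_fetch() -> Fetch:
    """Stand-in for a real HTTP call: truncates every fourth response."""
    counter, lock = itertools.count(), threading.Lock()

    def fetch() -> tuple[int, bytes]:
        with lock:
            i = next(counter)
        size = 1000
        return size, b"x" * (size // 5 if i % 4 == 0 else size)

    return fetch


failures = run_batch(make_fake_fetch(), n=25)
print(f"{failures} out of 25 requests truncated")
```

Swapping the fake fetch for a real client call would turn this into a way to measure failure rates per batch.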
A kernel of truth
To see what the system was actually doing, we attached strace to the Images service. strace records the syscalls that a process makes to the kernel, which could show us exactly which bytes were written, when a shutdown was called, and whether the client sent any termination signal.
Using a reproduction worker, we triggered the bug and compared the syscall output between successful and failing requests.
In a successful request, the response is written in chunks as the socket buffer allows, with shutdown called only after all the data is sent. When we reproduced the bug, a failing request looked like:
sendto(42, "HTTP/1.1 200 OK\r\nContent-Length: 14991808\r\n...", ...) = 219264
shutdown(42, SHUT_WR) = 0
Here, there is only one write — just enough for the headers and a sliver of the body — before the shutdown is immediately called. Out of a 14.9 MB response, only about 219 KB was sent. The remaining ~14.8 MB of image data never left hyper’s internal buffer, nor was there any termination signal from the client between the write and the shutdown. Instead, the Images service prematurely shut down the connection on its own, genuinely believing it was finished.
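The numbers in the trace can be checked directly. This short calculation uses only the two values visible in the strace output.

```python
content_length = 14_991_808   # from the Content-Length header in the trace
written = 219_264             # return value of the single sendto() call

missing = content_length - written
print(f"missing bytes: {missing}")
print(f"delivered: {written / content_length:.2%}")
```

Less than 2% of the response reached the socket before the shutdown.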
Inside the dispatch loop
Hyper's HTTP/1 connection lifecycle is driven by a state machine in a file called dispatch.rs. It runs a loop that reads requests, writes responses, flushes the write buffer to the socket, and decides when to shut down. In simplified form:
fn poll_loop(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Error>> {
    loop {
        let _ = self.poll_read(cx)?;
        let _ = self.poll_write(cx)?;
        let _ = self.poll_flush(cx)?;
        if !self.conn.wants_read_again() {
            return Poll::Ready(Ok(()));
        }
    }
}
More precisely, the let _ before poll_flush is where the bug lives. In Rust, let _ = expr discards the expression's result, including Poll::Pending, the signal that the flush isn’t done yet.
Don’t forget to flush
After weeks of investigation, the fix itself was conceptually simple. Hyper needed to check whether the flush was actually done before moving on. We traced through hyper's connection lifecycle and found a more targeted approach. Rather than changing how the dispatch loop behaves, we applied the fix at the point where shutdown is actually called. Before shutting down the socket, hyper should first flush any remaining data in its buffer:
pub(crate) fn poll_shutdown(
    &mut self,
    cx: &mut Context<'_>,
) -> Poll<io::Result<()>> {
    ready!(self.poll_flush(cx)?);
    Pin::new(&mut self.io).poll_shutdown(cx)
}
This leaves the dispatch loop unchanged. It adds a flush only at the exact point where data loss would otherwise occur — the moment before shutdown.
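The difference between the two behaviors can be modeled in a few lines of Python. This is a toy analogue, not hyper's implementation. `Conn`, `socket_room` and the method names are invented for illustration, and the busy loop stands in for an async runtime waiting until the socket is writable.

```python
from enum import Enum


class Poll(Enum):
    READY = "ready"
    PENDING = "pending"


class Conn:
    """Toy connection: an internal buffer drained into a slow socket."""

    def __init__(self, payload: bytes, socket_room: int) -> None:
        self.buffer = bytearray(payload)
        self.socket_room = socket_room   # bytes the socket accepts per flush
        self.delivered = bytearray()
        self.closed = False

    def poll_flush(self) -> Poll:
        chunk = self.buffer[: self.socket_room]
        self.delivered += chunk
        del self.buffer[: len(chunk)]
        return Poll.READY if not self.buffer else Poll.PENDING

    def shutdown_buggy(self) -> None:
        self.poll_flush()        # result discarded, like `let _ = ...`
        self.closed = True       # closes even if bytes remain buffered

    def poll_shutdown_fixed(self) -> Poll:
        if self.poll_flush() is Poll.PENDING:
            return Poll.PENDING  # caller must poll again later
        self.closed = True
        return Poll.READY


payload = b"x" * 1000

buggy = Conn(payload, socket_room=200)
buggy.shutdown_buggy()

fixed = Conn(payload, socket_room=200)
while fixed.poll_shutdown_fixed() is Poll.PENDING:
    pass  # a real runtime would wait for the socket to become writable

print(len(buggy.delivered), len(fixed.delivered))
```

In the buggy path the connection closes after a single partial flush; in the fixed path shutdown is deferred until the buffer is empty.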
What stayed with us
None of the tools at the application level surfaced any errors, crashes, or log entries that provided useful clues. Application-level observability can have a blind spot for bugs that live below its awareness.
The failure occurred intermittently, scaled with response size, couldn’t be reproduced with simple tools like curl, and disappeared when we observed the system more closely. These signals pointed to a timing-dependent bug in the connection layer, not in the application logic.
We merged our fix and the deterministic test into hyperium/hyper via PR #4018. It will be available in a future hyper release, ensuring that any service using hyper’s HTTP/1 implementation won’t lose response data to the same race condition.
In the meantime, we’re running an internal fork with the patch applied. This fix stabilized the binding’s architecture, creating a reliable foundation to expand its functionality.
The Coming Loop
Software development is shifting toward 'agent loops' that move at extreme speeds, but this trend risks turning engineers into mere message-passing observers.
Summary
Deep Dive
- Agent loops exist at two levels: the internal model tool-use loop and the external 'harness' loop that manages the workflow.
- Automated code generation often creates 'defensive' code that lacks strong structural invariants, leading to unmaintainable complexity.
- Loops are highly effective for transient tasks like proof-of-concepts, benchmarking, and security scanning.
- The competitive pressure of 'agent-driven development' makes it nearly impossible for teams to opt out of these workflows.
- There is a looming need for better visualization and human-in-the-loop interfaces to restore legibility to complex AI-generated codebases.
Decoder
- Agent loop: An architectural pattern where an AI agent iterates on a task, evaluating its own progress against a defined goal and modifying its approach until completion.
- Harness: The external orchestration layer that manages sessions, queues, and logic for AI agents, determining when a task is officially 'done'.
Original Article
The Coming Loop
I don’t prompt Claude anymore. I have loops running that prompt Claude and figuring out what to do. My job is to write loops.
— Boris Cherny
Over the last months I have watched more and more people build something on top of coding agents that feels meaningfully different from just using a coding agent. Some of this happens on top of Pi which is cool to see for sure! The pattern is the same everywhere though: work is put into a queue of sorts, a machine picks it up, attempts it, stops, and then some harness decides whether that was actually the end.
If not, the harness continues the same session, injects another message, starts a fresh session with modified context, or sends the task to another machine. The task stays alive beyond the point where the model by itself would normally have said: “I am done.”
I think about that type of loop more than I want to admit.
There is already an agent loop inside every coding agent. The model calls a tool, incorporates the result, calls another tool, reads a file, edits a file, runs tests, and eventually produces some answer. That loop is one we have been quite familiar with for a long time. The other loop is the harness level loop: the loop outside the agent loop. That loop is also not new. We have been doing versions of this since early Claude Code days, but that loop is becoming ever more present in agentic engineering and in recent weeks it has started to dominate the Twitter discourse.
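A minimal sketch of the outer loop, to make the distinction concrete. Everything here is hypothetical: `run_agent` stands in for a whole inner agent loop that runs until the model says it is done, and `is_done` is whatever check the harness applies on top of that.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Task:
    goal: str
    messages: list[str] = field(default_factory=list)
    attempts: int = 0


def harness(
    queue: deque[Task],
    run_agent: Callable[[Task], str],
    is_done: Callable[[Task, str], bool],
    max_attempts: int = 5,
) -> list[tuple[Task, str]]:
    """Outer loop: keep a task alive until the harness, not the model, accepts it."""
    finished = []
    while queue:
        task = queue.popleft()
        answer = run_agent(task)      # inner agent loop runs to "I am done"
        task.attempts += 1
        if is_done(task, answer) or task.attempts >= max_attempts:
            finished.append((task, answer))
        else:
            task.messages.append("Not finished yet; keep going.")  # inject a message
            queue.append(task)        # put the task back in the queue
    return finished


# Fake agent and acceptance check, for illustration only.
def fake_agent(task: Task) -> str:
    return f"draft {task.attempts + 1}"


def accept_third(task: Task, answer: str) -> bool:
    return answer == "draft 3"


done = harness(deque([Task("port module")]), fake_agent, accept_third)
task, answer = done[0]
print(answer, task.attempts, len(task.messages))
```

The `max_attempts` cap is a design choice of this sketch: without some bound, a harness whose check never passes would loop forever.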
I Am Not Good At This Yet
My current status is that I have not had much success with this way of working for code I deeply care about, which turns out to be quite a lot of code.
Part of that is taste and part of it is control. I attempt to set a high bar for what I want code to look like, and I want to understand the code I ship. Under pressure, or in a discussion with another human, I want to be able to explain what the system does without first having to ask a clanker to explain it to me. Now there is obviously a question if this desire to understand the code is one that I will still have a few years from now. For now I have not moved past the point of comprehension being important to me.
Given this desire, there is something I lack with my experience of code written without me paying attention, particularly from loops. Present-day models tend to produce code that is too defensive, too complex, too local in its reasoning. They avoid strong invariants. They add fallbacks instead of making bad states impossible. They duplicate code, invent bad abstractions, and paper over unclear design with more machinery. Worse though: I so far see very little progress of this improving. If anything, on that front it feels to me that we might even be making steps in the wrong direction. At least for my taste, present-day hands-off harnesses like Claude Code with ultracode produce worse code than what we were producing last autumn. That’s because Claude Code, with Fable for instance will be working uninterrupted on a problem for thirty minutes or more, when previously the process would have been much more human in the loop.
Furthermore it’s well understood that models tend to observe some local failure and add a local defense. Karpathy mentioned how they are “mortally terrified of exceptions”. In systems with important invariants, especially persisted data formats or core infrastructure, the right fix is not “handle every malformed case.” The right fix is to make the malformed case unrepresentable or impossible to write in the first place. Yet even with a lot of manual steering, that type of code does not come out of LLMs naturally, and even if the code comes out naturally like that, they will still attempt to handle now impossible errors.
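A small sketch of the difference, using a made-up `Percentage` type: the invariant is checked once at construction, so downstream code carries no fallbacks for states that cannot occur. The type and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Percentage:
    """A value guaranteed to be in [0, 100] once constructed."""
    value: int

    def __post_init__(self) -> None:
        # The only place where a bad value is rejected.
        if not 0 <= self.value <= 100:
            raise ValueError(f"percentage out of range: {self.value}")


def render_bar(p: Percentage, width: int = 10) -> str:
    # No clamping and no fallback: an out-of-range value cannot reach this
    # function, because it could never have been constructed.
    filled = p.value * width // 100
    return "#" * filled + "." * (width - filled)
```

The defensive alternative would clamp or special-case inside `render_bar` and in every other consumer, which is the per-call-site pattern described above.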
When you take that behavior and you put it behind loops, you tend to amplify it. If each iteration adds another small defense, the system slowly becomes less understandable while appearing more robust. The more hands-off you are, the more that happens. It also teaches really bad practices when tools like this are given to juniors without clear guidance. Because if you ask them, why they are doing all that, they will convincingly argue their case.
Where Loops Work
At the same time, it would be dishonest to pretend the loop pattern does not work because it already works astonishingly well in some domains.
Porting code is one of them. There are already impressive examples of large automatic porting efforts, including the reported work around moving parts of Bun from Zig to Rust. I have used it with success myself to port MiniJinja to Go. Performance explorations are another case where this works beautifully. A machine can try experiments, benchmark them, discard failures, and keep searching. Security scanning fits naturally too and so does almost any type of research: asking a system to explore a complex problem space and report back without necessarily committing lasting code. One thing that many of these have in common is that they either do not generate new code, but transform code that already exists, or they produce code that intentionally does not have a long shelf life. They either produce proofs of concept or ideas, surface findings, or are more akin to mechanical transformation.
I believe that whether a loop produces artifacts that do not need longevity, or some form of clearly verifiable mechanical translation, matters more than the general ability of a harness to mechanically measure a goal. Many successful applications of loops use another LLM as a judge or as an orchestrator. The mechanical translation case can be verified with a binary test case, but it can also be judged by an LLM instead!
Claude Code, for instance, is increasingly good at creating entire experimental workflows that it will then execute. Sure, the code it produces is slop, but that’s more the fault of the model than of the harness being a poor judge of whether a step in the workflow resulted in a net improvement or completion.
The harness just needs some signal that lets it continue. It does not have to be objective or binary, it just has to be useful enough to drive another iteration.
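As a rough illustration of "some signal that lets it continue", the sketch below keeps a candidate only when a score function says it improved. The random perturbation stands in for "try an experiment", and the score could just as well come from a benchmark or an LLM judge; all names here are hypothetical.

```python
import random
from typing import Callable


def explore(score: Callable[[float], float], start: float, rounds: int = 200, seed: int = 0) -> float:
    """Keep a candidate only if the signal says it is better than the current best."""
    rng = random.Random(seed)
    best, best_score = start, score(start)
    for _ in range(rounds):
        candidate = best + rng.uniform(-1.0, 1.0)  # stand-in for "try an experiment"
        candidate_score = score(candidate)
        if candidate_score > best_score:  # discard failures, keep improvements
            best, best_score = candidate, candidate_score
    return best
```

Nothing in the loop requires the score to be exact or binary. It only has to rank one attempt against the previous best well enough to justify another iteration.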
I absolutely love loops already that take the boring parts out of my day to experiment and measure and to give me ideas.
Software As Organism
On the other hand using that same looping methodology to write lasting code does not yet sit well with me. The metaphor I like to reach for is one of moving from software as a deterministic machine to software as an organism.
I became a software engineer in an environment that encouraged me to understand the machine. There was always a layer you could peel off to deepen your understanding. Machines that did not exhibit deterministic observable behavior were maybe accepted, but generally seen as not exactly optimal. Software architecture-wise, I saw it as desirable to push further towards more determinism rather than less. Likewise the ability to understand the code has been an undeniable goal. Even though it was not always possible in practice, we still took pride in writing code so that it became possible even for new engineers to navigate complex code bases through clever architecture. On well designed systems there were always engineers that knew where the invariants lived, which parts were load-bearing and which changes were safe. Ideally all of that was also well documented. Where that understanding was lacking, it was generally regarded as something to improve upon.
Obviously that ideal has always been strained. Many software systems, especially very successful ones had periods where engineers on the team were able to keep them clean. Large software systems are not infrequently too big, too dynamic and too dependent on external services to fit into anyone’s head. Even without LLMs we already diagnose distributed systems somewhat like doctors in that we observe symptoms, form hypotheses, “order more tests”, try some remedies, and observe again.
Yet with LLMs we’re pushing much further in that direction and much quicker. We use them to write the code and we also use them for diagnosis and remedy. There are plenty of engineers that already live in a world in which the first step after a production issue is having a clanker read logs, propose root causes and proactively put up a patch. The resulting patch is then often picked up by another machine that reviews it, sometimes even landing it on main without any human supervision.
Obviously that is powerful and I cannot deny that it sounds appealing. But giving in to that idea, particularly with less and less human oversight means accepting that we may no longer understand the whole system in the same way. We treat it, we monitor it, we stabilize it, but we do not necessarily comprehend it.
I have no doubts that for some software, that is okay. Not every line of code deserves human authorship and worse code might have been written in the past.
But do I want all software to be authored this way?
You Cannot Quite Opt Out
What’s very uncomfortable is that opting out of this fully machine-driven future may not be an option.
Security is the clearest example today. Even if you do not use loops to build your software, other people will use loops against your software. Attackers will run machines continuously and even if it’s not attackers, then security researchers will and some of that automated work will throw up dust but also find real issues. And both the signal and the noise will come your way at a volume that makes it almost impossible to deal with unless you yourself throw a machine at the problem.
Daniel Stenberg’s post about curl’s summer of bliss is a good example of the pressure maintainers are already under. As far as I know, AI does not play a tremendous role in the core development of curl today. Yet despite all of this, maintainers are overwhelmed by reports, most of which are now AI-generated ones.
If attackers and reporters loop, defenders will eventually need to loop too to keep up. Maybe not to write patches directly, maybe just to triage and reproduce, but the pressure will increase.
The same is true competitively as some teams will out-build others through raw speed. Some projects will suddenly move faster because a tiny group figures out how to orchestrate machines effectively. Some startups will do with five people what used to require fifty. Some people might literally put a machine against your product in a loop and ask it to “make it like the other one.” And if their users are happy, does it really matter?
Not all software will be equally affected. Some domains will punish sloppiness and demand trust and responsibility, but a lot of software lives in a world where raw speed, quick experimentation, and vast coverage matter enormously.
Building New Dependencies
The scariest part to me is that we become dependent on these new machines in new ways. Software has always depended on tools. I remember the time when I had to pay for compilers. These new tools are a flashback to times where creating software came with real costs. But now it’s no longer a one-time payment, it’s a constant dependency. Not just a dependency on a filled wallet, but also a cognitive dependency.
If a codebase is produced by loops, reviewed by loops, patched by loops, and kept alive by loops, what happens when you no longer have access to the same class of systems? What happens when some trade restrictions take away access to the most powerful models? What if just the cost becomes unbearable? What if you and your team just lose the last remaining ability to understand the code without using the machine?
We may create codebases that are not merely hard to maintain by humans, but that assume machine participation as part of their maintenance model. This is already happening! It’s not happening everywhere, and it might not even be happening in ways that are seen as problematic, but we see more and more of it. People more and more merge code they cannot fully explain. People lose their ability to create issue reports or discuss things in chat, without augmenting or rephrasing their messages with the context provided by a clanker. Too many people increasingly rely on a machine to summarize or contextualize it. More and more do I encounter people who converse with me through the indirection of an LLM.
Again, maybe that is not even going to be wrong, but it’s a massive change to how we did things.
Future Harnesses
I have little doubt that this is where things are going but going there will require us to do something about our tooling everywhere, and not just in the coding agents.
Just orchestrating more loops won’t be enough. Better visualizations of changes or orchestration or agents will not restore our understanding. Either we need to find clever ways to jolt the human back into the loop and make the changes of the loops legible long term, or we need to find better ways to compose these ever more complex systems.
This is also where my thinking about the role of Pi is changing. Pi has been cautious, and I think that caution is good. I do not want a future where every interaction turns into an uncontrolled swarm of machines making changes I cannot follow. I would not want Pi to become an unmaintainable mess in an effort to win the race towards software that writes itself and I would not want Pi to promote this type of engineering either. At the same time Pi is a harness and harnesses are at the center of people running these new types of experiments.
Task queues for coding tasks, orchestration of agents, subagents, durable sessions will matter more and more. Even those of us who have their reservations and are not blindly embracing loops will have to start doing those experiments. We need to, because we need to understand how to make this future bounded and survivable.
Controlling Loops
As you can read from this post, I’m very uneasy about this future. Not because of fear, but because of caution given my experiences with this technology so far.
Adopting the idea of harness loops means that the harness decides when work is finished. In the agent loop, the model eventually says “done” and I review. Even before that, I usually steer along the way. I am involved and I enjoy learning along the way. In the harness-operated loop I’m not sure what my role even is. Even the “done” signal loses all meaning and is just communicated to yet another machine that judges. My role is reduced to that of a messenger.
Today I do not like much of the code that I see from systems built that way and neither do I enjoy interacting with too much of the software built with AI assistance. Looping is powerful but it removes responsibility more and more, and it at least today very much encourages us to give in to the machine.
And yet I have no doubts that this looping future is going to be our future despite the fact that I presently resent it. I already see astonishingly small teams building at impossible speed and I see codebases turning more and more into obscure and confusing organisms that can only be diagnosed by more machines. Those codebases are simultaneously useful and messy.
So I guess I’m coming to terms with the fact that the question is not whether we will loop, because clearly we will. Maybe the question is how, in a future of loops, we avoid abdicating judgment, how we can retain rules of good engineering, how we can ensure that responsible humans can continue to supervise, and how we need to re-think how we architect code to retain sanity along the way.
The Problem is Prompt Debt
Relying on natural language prompts to specify system behavior creates 'prompt debt' that destroys code maintainability and locks developers into specific models.
Summary
Deep Dive
- 'Prompt debt' accumulates when brittle natural language fixes are layered on top of each other to steer probabilistic models.
- Repeated, contradictory instructions create regression risks that make the system difficult to iterate on.
- Different models possess different 'weights,' meaning prompts are rarely portable without significant performance degradation.
- Brittle prompts prevent organizations from upgrading to newer, more efficient models due to breaking changes in output behavior.
- Mature engineering requires shifting from 'coaxing' models to building hard, measurable constraints around them.
Decoder
- Prompt debt: The accumulation of complex, repeated, and contradictory instructions in system prompts that become increasingly difficult to maintain, test, and version over time.
- DSPy: A framework for programmatically optimizing LLM prompts using algorithmic search rather than manual natural language engineering.
- Fighting the weights: The process of attempting to steer a model to behave in ways contrary to its core training through persistent, repetitive prompt engineering.
Original Article
The Problem is Prompt Debt
You can’t be model agnostic if you’re hand-tuning prompts
Thanks to natural language interfaces, AI applications can be prototyped quickly. You write what you want in English, hand it to a frontier model, and a working prototype appears in an afternoon. This is extraordinarily powerful and for one-off tasks, optimal. But as a way to build reliable systems, the natural language prompt is a trap.
The plain-English prompt that makes prototypes effortless turns out to be a poor way to specify how a system should behave, and the bill arrives slowly, disguised as ordinary progress, until the application can barely move. The problem is not any single prompt. It is that natural language was never meant to be a specification language for engineering, and treating it as one quietly caps what you can build.
The Prompt Debt Trap
The first symptom of prompt debt is slowing iteration. As users flag errors and spot edge cases, additional guidance is added to the instructions, nudging the model into line. If unwanted behaviors persist, instructions are repeated, with increasing severity. Pretty soon, the prompt isn’t straightforward and quick fixes regress previous instructions. Errors can no longer be handled with one-line “hot fixes” and your development cycle slows to a crawl.
Next, prompt debt incapacitates your team. Your brittle prompt full of edge cases and all-caps threats is barely legible to you, and it’s downright impenetrable to your colleagues. Many teams mitigate this issue by breaking prompts into complicated templates assembled at run-time, each isolated to specific concerns. But these prompt segments evolve, too, growing into a thicket of conditions.
Finally, prompt debt ties you to a single model. Your hot fixes work on GPT-4o, but fail in entirely new ways when you point your inference call at GPT-5.4-mini. So you stay with 4o, hope the increasingly frequent deprecation emails from your inference provider are empty threats, and forgo the possibility of potentially cheaper, faster, better models.
Any one of these issues is a nuisance, but together they are the difference between a glorified prototype and a product that can grow with you, your customers, and your business. Your shiny new AI features are frozen, can only be improved through a full rebuild, and are locked to an aging model.
Why Prompt Debt Happens
Natural language interfaces are wonderful. They’re the right mechanism for one-off tasks and broad conversational threads. We get into trouble when we rely on natural language to define durable system behavior.
The imprecision of natural language paired with probabilistic language models means different words expressing the same intent can yield different outputs. And it’s not only word choice that matters. Seemingly unrelated statements in the same prompt can affect results. Spurious statements influence the inference pass in ways we can’t predict, which is why prompts become more brittle as you add fixes. An additional instruction to quell a stubborn error could affect how the model interprets a separate instruction that worked yesterday.
Repeating instructions propels us towards prompt debt, but it’s necessary when the behavior we want is at odds with a model’s training. This is fighting the weights, and once you recognize it you see it in system prompts everywhere.
None of these examples occurred in isolation. Multiple repeated rules are woven throughout the system prompts we examine. Stubborn errors grow our prompts quickly, each one increasing the brittleness and the risk of regression with every edit.
And worse: these fixes are tailored to a single model’s behavior. Models are not cleanly versioned software. They have different weights that produce different behaviors, in unpredictable and undocumented ways. A prompt that works beautifully with GPT-4o may fail with GPT-5.5.
Prompt debt locks an application to a single model. Our inability to easily swap models isn’t the result of frontier labs coming up with a clever moat. No, it’s the result of evolving a lossy, natural language specification against a probabilistic model.
Preventing Prompt Debt
Thankfully, we don’t have to theorize about how to mitigate prompt debt; one field has already shown the way. Over the last couple years they’ve been evolving best practices that let the model write more of the code, while delivering maintainable, modular software.
The first principle is to specify your system’s behavior with measurements, not prose. When the model’s output is probabilistic and language is imprecise, we build hard edges to constrain them: evaluations, metrics, and typed specifications. These are legible, shared artifacts colleagues can read and contribute to, enabling the collaboration that brittle prompts prevented.
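One way to picture "measurements, not prose" is a small typed evaluation set plus a metric. This is a minimal sketch under assumed names: the `Case` type, the three labels, and the example inputs are invented for illustration and are not from the article.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    text: str
    expected: str  # the label the system must produce for this input


# Hypothetical evaluation set for a support-ticket classifier.
EVAL_SET = [
    Case("refund my order", "billing"),
    Case("app crashes on launch", "bug"),
    Case("how do I export data?", "how_to"),
]


def accuracy(system: Callable[[str], str], cases: list[Case]) -> float:
    """Fraction of cases where the system's normalized output matches the expected label."""
    hits = sum(system(c.text).strip().lower() == c.expected for c in cases)
    return hits / len(cases)
```

A colleague can add a failing `Case` without reading any prompt at all, and `accuracy` gives every later change a number to be compared against.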
The best engineers now spend more of their bandwidth on tests than ever, as they are no longer a safety net but the thing that lets the model cook.
The second principle is to stop writing the prompt by hand. Once we have metrics that can score candidates, the prompt is no longer something to craft but something for which to search. And the surface area of potential words, phrases, and structures that natural language allows is too vast to spend human hours on. This is terrain LLMs were built to explore, and there are already systems (like DSPy and GEPA) that manage this work for you, holding prompts accountable to your designs.
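Treating the prompt as something to search for can be sketched in a few lines. The version below simply scores every candidate exhaustively; it is not how DSPy or GEPA work internally, and the `Model` signature and the stub model are assumptions for illustration.

```python
from typing import Callable

# Hypothetical interface: (prompt, input) -> output.
Model = Callable[[str, str], str]


def pick_prompt(candidates: list[str], model: Model, cases: list[tuple[str, str]]) -> tuple[str, float]:
    """Score every candidate prompt on the eval cases and return the best one with its score."""
    def score(prompt: str) -> float:
        return sum(model(prompt, x) == y for x, y in cases) / len(cases)

    best = max(candidates, key=score)
    return best, score(best)


def stub_model(prompt: str, text: str) -> str:
    """Toy stand-in for an LLM: it only uppercases when the prompt asks for it."""
    return text.upper() if "uppercase" in prompt else text
```

Because the selection depends only on the metric, pointing `pick_prompt` at a different model callable reruns the same search against that model, with the same cases deciding the outcome.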
Once prompts are generated and your program’s behavior is defined by measurements, you are no longer bound to a particular model. Evaluating a new model takes hours, not weeks. When a faster, cheaper model arrives you can try it. When a deprecation email arrives, you can secure options in a day.
Every mature engineering discipline eventually stops doing by hand the very thing it once prided itself on doing by hand. Assembly gave way to compilers, hand-tuned queries gave way to planners, and manual memory management gave way (mostly) to machines that do it better. Prompt-writing is no different.
Coaxing the model with exactly the right words is a real skill, and for one-off tasks it’s often optimal. But to build reliable, improvable, and portable systems we should not be hand-tuning prompts.
Vulnerability Reports Are Not Special Anymore
The era of the 'special' vulnerability report is over now that LLMs can generate security findings at scale.
Summary
Deep Dive
- Vulnerability reporting was historically a 'special' favor; this dynamic is now obsolete.
- The current security challenge is not finding vulnerabilities, but triaging high volumes of machine-generated reports.
- Attackers utilize the same LLM-based analysis tools as defenders.
- Coordinated disclosure is becoming less effective due to the sheer volume of incoming reports.
- Maintainers should focus on rapid remediation and building automated verification processes.
- Highly trusted researchers may still deserve special handling, but the average reporter's signal-to-noise ratio has dropped significantly.
Decoder
- Full Disclosure: The practice of publishing vulnerability details publicly to pressure vendors, historically controversial but essential in moving the industry toward transparency.
- Coordinated Disclosure: The process of reporting a security flaw to a vendor privately, allowing them to fix it before public announcement.
- Triage: The process of determining the urgency and severity of incoming bug reports to prioritize developer attention.
Original Article
Vulnerability Reports Are Not Special Anymore
A requirement for staying sane while working in public as an open source maintainer is realizing that every issue, PR, and piece of feedback is a present, not an obligation. You can accept it, ignore it, and use it partially or not at all.
Except…
For years, as lead of the Go Security team at the time, I’ve told new team members that it doesn’t apply to vulnerability reports. No, vulnerability reports are special. Security researchers are doing us a favor by reporting things confidentially instead of doing full disclosure, so we owe them something, which is not true of regular issues opened on the issue tracker.
Different projects have different policies, but the general expectations are responsiveness and attribution. We’re supposed to acknowledge reports quickly, investigate them, keep the reporter posted, and eventually credit them with the discovery.
Why? Well, because the reporter is providing us a service, not asking us to provide one (such as a bug fix or a feature implementation). In exchange for responsiveness and attribution, they are offering precious insight and the confidentiality we need to ship a fix before attackers ship an exploit.
Ultimately, it all stems from our responsibility to our users. The security researchers are not special, the insight and confidentiality are, and we need them to keep our users safe. Ignoring a security report communicates you don’t care about users’ security, and it’s rightly a reason for shame.
Except…
It’s 2026 and none of the premises are true anymore.
LLMs are as good as almost any security researcher, and anyone can run them. The maintainers can run them. The attackers can run them.
The insight is not scarce and precious anymore. The bottleneck now is not finding potential issues but assessing which ones are real. Unless there’s already a trust relationship, external researchers can’t meaningfully contribute to that triage process, and picking through an LLM’s output or through a security@ inbox has approximately the same signal-to-noise ratio.
Confidentiality, embargoes, and coordination also don’t matter nearly as much as they used to. The attackers don’t need to read the full disclosure post to learn about the vulnerability: they can ask their own LLM and, in fact, they also probably have the same triage bottleneck as the defenders do.
The years of vulnerability reports being special might be over, as weird and uncomfortable as that feels. Triage, rapid remediation, and—as ever—prevention are the job now. And we should all figure out how to run LLM analysis in CI, I suppose.
A couple comments
This post rapidly generated some interesting discussion, which gives me the opportunity to add some nuance.
On Bluesky, Avery Pennarun points out things will change again.
I’m not sure I agree. There’s been a step change in ability to find vulns, but the only stable outcome (once we get there) is fewer vulns getting released. When that happens there will be a new higher bar and finding them will be hard again. Unclear we should optimize for the short term dynamics.
The current dynamic will persist at least for as long as the models keep getting better. I honestly have no idea how the profession will look after that, so this whole post is more of a current observation than a long-term prediction.
On Lobsters, Frederik Braun calls out how there are still some vulnerability reports that are special.
Special vulnerability reports should be treated as special and it is on the defender to work on better verification and published threat models such that people can meet (and verify) a new, higher bar for what constitutes a great report.
I agree, whether officially or unofficially there will need to be a process for special reports: the extremely high severity ones, the ones from highly trusted sources. Maybe the next task of security teams is getting good at classifying reports rapidly into special and not special buckets.
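A first pass at such a classifier could be as simple as the sketch below. The `Report` fields, the 0 to 10 severity scale, the threshold, and the allowlist entry are all assumptions made for illustration, not anyone's actual policy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    reporter: str
    severity: float   # assumed 0.0 to 10.0 scale
    reproduced: bool  # did an automated reproduction step succeed?


# Hypothetical allowlist of researchers with an existing trust relationship.
TRUSTED = {"alice@example.org"}


def bucket(report: Report, high_severity: float = 9.0) -> str:
    """Return 'special' for trusted sources or reproduced high-severity reports, else 'regular'."""
    if report.reporter in TRUSTED:
        return "special"
    if report.reproduced and report.severity >= high_severity:
        return "special"
    return "regular"
```

Requiring a successful reproduction before severity counts keeps an unverified "critical" claim from jumping the queue on its own.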
On Hacker News, William Woodruff confirms most reports are real, and not special anymore.
I agree with this. One of the consequences of the “vulnpocalpyse” is that it’s become even harder to sift through the noise: I triage well over a dozen reports a week, many of which are “real” in the sense that they reflect a genuine defect but otherwise have an unclear impact on a typical user. This has always been true of the median vulnerability report, but the volume means that I now lean much more heavily away from coordinated disclosure.
One flipside to this is that, because many of these bugs are “shallow” to LLMs, it’s actually easier than ever to moderate the worst participants in your vulnerability program – if someone sends you slop, you can just ban them and wait for the next, better orchestrated LLM to send you a better report for the same vulnerability.
Imagine being able to freely ban researchers just one year ago!
Still on Hacker News, Juho Forsén, one of the most prolific reporters of Go security issues, wrote a long interesting comment that makes the argument that instead we should lean harder into trust relationships with individual researchers. It’d certainly be worth it with Juho, in retrospect, but it’s unclear if it would pay off often enough, in the same way that training new contributors who might leave the project in a month or two is not always worth it.
Production-ready Vision, Everywhere (Website)
M87 Labs released Moondream, a series of compact vision language models paired with the 'Photon' inference engine for low-latency edge deployment.
Summary
Deep Dive
- Model Variants: Includes Moondream 2 (2B parameters) and Moondream 3 (9B MoE), both commercially friendly.
- Inference Engine: 'Photon' is a proprietary engine optimized for low-latency production, supporting Mac, Windows, and Jetson edge hardware.
- Fine-tuning: 'Lens' provides an API for fine-tuning on minimal datasets (as few as 20 images) without requiring ML infrastructure.
- Performance: Benchmarks on H100 GPUs show Moondream 3 outperforming Qwen 3.5 4B by 2x in latency for single-turn detection tasks.
- Deployment: Supports cloud-hosted, local, and air-gapped environments with identical APIs.
Decoder
- VLM: Vision Language Model, a neural network capable of interpreting and reasoning over image data alongside text.
- MoE: Mixture-of-Experts, a model architecture where only a subset of parameters is active per inference, increasing efficiency without sacrificing reasoning capability.
- vLLM: A popular high-throughput, memory-efficient library for LLM inference.
- Air-gapped: Systems isolated from the public internet for security or compliance reasons.
Original Article
Production-ready vision, everywhere.
Production VLMs need more than just accuracy. They need to be fast enough for real-time decisions, and run anywhere you deploy. That's what Moondream is built for.
Try the open models. You might already be done.
Moondream might already nail your use case out of the box. The open models are commercially friendly and can run anywhere. Use our playground to try it out or download it and run it yourself.
No credit card required. $5 in credits added monthly.
# Caption an image in four lines
import moondream as md
from PIL import Image
model = md.vl(model="moondream-2b")
image = Image.open("shelf.jpg")
print(model.caption(image)["caption"])
# → 'A warehouse shelf with six cardboard cartons…'
Moondream 3 Preview (9B MoE)
Sparse mixture-of-experts architecture with frontier-level visual reasoning, segmentation, and long-context queries.
Moondream 2 (2B Dense)
The production workhorse: compact, proven, commercially friendly, and easy to deploy across GPUs, CPUs, and edge devices.
Moondream 2 0.5B (distillation target)
A small fine-tuning base for constrained hardware where every megabyte matters.
Need more? Lens gets you to production-grade accuracy
Your data is specific, so the model has to be. Lens is a fine-tuning platform with a simple API. No dataset uploads, no infrastructure, no ML team required.
A simple hosted API — no hardware to rent or manage. Supports SFT and RL. Vibe-code your fine-tune script in minutes.
Your fine-tuned model is instantly ready to run on Moondream Cloud or locally with Photon. No cumbersome download or install step.
Our team handles the labeling protocol, loss design, and evaluation. You keep the weights, the training code, and the data. Unlike ML consulting, you walk away self-sufficient.
Our reinforcement-learning fine-tune API can dramatically improve accuracy with as few as 20 labeled images — not thousands.
Fast, efficient, runs everywhere you need it.
Once your model is accurate, performance and cost become the next wall. Photon is the inference engine we built to run Moondream in production. Moondream Cloud and partner clouds give you a hosted path if you want one.
Under 500 ms is the difference between a useful answer and a late one. Photon runs Moondream in roughly half the time vLLM does on the same hardware.
A VLM running across a fleet of cameras at the wrong efficiency costs thousands a day. Moondream is the lowest-cost VLM we have measured across the inference providers we tested.
Your deployment story will change. Start in the cloud, move to the edge, or run air-gapped. You pick the hardware. The model and APIs stay the same.
Measured on the ChartQA test split with prefix caching enabled. Latency is the P50 of a single direct-answer query call; throughput is sustained requests per second at batch 64.
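For readers unfamiliar with the two metrics named in that note, here is a minimal sketch of how P50 latency and sustained throughput are typically computed from raw measurements. The sample numbers in the usage note are made up and are not Moondream results.

```python
import statistics


def p50_latency_ms(latencies_ms: list[float]) -> float:
    """P50 is the median: half of the calls finished at or below this latency."""
    return statistics.median(latencies_ms)


def throughput_rps(completed_requests: int, wall_seconds: float) -> float:
    """Sustained throughput: completed requests divided by elapsed wall-clock time."""
    return completed_requests / wall_seconds
```

For example, `p50_latency_ms([120, 80, 400, 95, 110])` gives 110, because the single slow 400 ms call does not move the median, and `throughput_rps(640, 10.0)` gives 64.0 requests per second.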
import moondream as md
from PIL import Image
# Initialize with local GPU inference
model = md.vl(api_key="YOUR_API_KEY", local=True)
# Load an image
image = Image.open("path/to/image.jpg")
# Generate a caption
caption = model.caption(image)["caption"]
print("Caption:", caption)
Launch is just the start
One vendor for the full stack. Models drift. Engineers leave. New use cases appear. With stitched-together vendors, nobody owns the outage. With Moondream, we do.
Competitor stack
- Model vendor (weights only)
- Fine-tuning vendor (your data goes elsewhere)
- Inference provider (different SLA)
- Your on-call engineer (owns everything)
Moondream
- Model, weights, and roadmap
- Lens fine-tuning and evals
- Photon and Moondream Cloud
- One team on call, 24/7 on enterprise plans
Four products that work together. Use one. Use all of them.
- Open Models: The foundation. Free for commercial use. 2B, 1B, and 0.5B checkpoints on Hugging Face.
- Lens: Fine-tuning with a simple API. Self-serve or white-glove. You keep the weights.
- Photon: Inference engine. Hand-tuned kernels. Mac, Windows, CUDA — Jetson to B200.
- Moondream Cloud: Hosted inference. OpenAI-compatible API. Pay per image, no commitment.
- Support: Plans for teams running Moondream in production. One team owns the full stack.
Try the open model. Or talk to us about production.
The model is free, open, and the fastest way to see if Moondream fits. If you already know it does, we can skip ahead and talk about fine-tuning, inference, and a support plan.
Mistral OCR 4: SOTA OCR for Document Intelligence
Mistral released OCR 4, a document intelligence tool that adds bounding boxes and confidence scores to standard text extraction in a single container.
Summary
Deep Dive
- Features bounding boxes, block classification, and inline confidence scores.
- Supports 170 languages across 10 groups.
- Deployable in a single container for data residency requirements.
- Integrated with Mistral's Search Toolkit.
- Outperforms competitors in human preference tests (72% win rate) and OlmOCRBench (85.20 score).
- Pricing: $4/1k pages (API), $2/1k pages (Batch), $5/1k pages (Document AI).
Decoder
- RAG (Retrieval-Augmented Generation): A technique that connects LLMs to external, private data sources to improve the accuracy and relevance of generated responses.
- Bounding Box: A set of coordinates defining the rectangular region in an image where a specific element, such as text or a table, is located.
Original Article
Introducing OCR 4
Today, we're releasing Mistral OCR 4, featuring bounding boxes, block classification, and inline confidence scores alongside extracted text. The model supports 170 languages across 10 language groups, runs in a single container for fully self-hosted deployments, and serves as an ingestion component for enterprise search, RAG, and domain-specific retrieval pipelines. OCR 4 is a small, focused model, and this post covers what's new, how it performs on public and internal benchmarks, the known limitations of those benchmarks, and guidance on when to use the model API versus Document AI.
Highlights
- Breakthrough performance. Independent annotators prefer OCR 4 over every leading OCR and document-AI system tested, with win rates averaging 72%, alongside the top overall score on OlmOCRBench (85.20). See Benchmarks below for methodology and known scoring limitations.
- Segmentation, not just text. Alongside the extracted text, OCR 4 returns bounding boxes, typed-block classification (titles, tables, equations, signatures, and more), and inline confidence scores. Bounding boxes, our most-requested capability, localize text for in-context highlighting and reliable data pipelines. At the same time, block types and confidence scores drive source-grounded citations, redactions, and human-in-the-loop verification.
- Integrated with Mistral Search Toolkit (public preview). OCR 4 is an ingestion component of Search Toolkit, Mistral's open-source, composable search framework, announced at the AI Now Summit. Its structured output supplies citation-ready inputs to the toolkit's ingestion, retrieval, and evaluation workflow for RAG and enterprise search.
- Multilingual coverage. Support for 170 languages across 10 language groups, with measurable gains on specialized and low-resource languages where several competing systems degrade.
- Run on your own infrastructure. OCR 4 is compact enough to deploy on a single container, keeping document data in your environment for residency, sovereignty, and compliance, while supporting cost-efficient, high-throughput batch processing. Self-managed deployment is available to enterprise customers.
Overview
Mistral OCR 4 extracts and structures content from a wide range of documents. Where previous generations focused on converting a page into clean text and tables, OCR 4 returns a structured representation of the document. Each block is localized with a bounding box and classified by type, and inline confidence scores are generated per page and per word. Downstream systems therefore have access not only to what the document says but also to where each element sits, what role it plays, and how confident the model is in each region.
This structure supports several downstream workloads:
- Semantic chunking for RAG: clean, classified blocks become better retrieval units.
- Structural primitives for agents: agents move from reading documents to acting on them (form filling, invoice processing, compliance checks).
- Structured content for connectors: consistent, typed output for ingestion and indexing pipelines.
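As a rough illustration of the first workload, the sketch below groups typed blocks into retrieval chunks, starting a new chunk at each title. The `Block` dataclass and its field names (`block_type`, `text`, `bbox`, `confidence`) are assumptions made for this example and are not the OCR 4 response schema.

```python
from dataclasses import dataclass


@dataclass
class Block:
    # Hypothetical stand-in for one OCR block; not the real response schema.
    block_type: str   # e.g. "title", "text", "table"
    text: str
    bbox: tuple       # (x0, y0, x1, y1) in page coordinates
    confidence: float


def chunk_by_title(blocks):
    """Group blocks into chunks, starting a new chunk at every title block."""
    chunks, current = [], []
    for block in blocks:
        if block.block_type == "title" and current:
            chunks.append(current)
            current = []
        current.append(block)
    if current:
        chunks.append(current)
    return chunks


blocks = [
    Block("title", "Invoice", (0, 0, 100, 20), 0.99),
    Block("text", "Due in 30 days.", (0, 30, 100, 50), 0.97),
    Block("title", "Line items", (0, 60, 100, 80), 0.98),
    Block("table", "Widget | 2 | $10", (0, 90, 100, 150), 0.91),
]
for chunk in chunk_by_title(blocks):
    print(" / ".join(b.text for b in chunk))
```

Splitting on titles is only one possible policy; a real pipeline might also cap chunk length or keep tables as standalone chunks.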
OCR 4 accepts common enterprise formats, including PDF, DOC, PPT, and OpenDocument, and supports 170 languages across 10 language groups, including specialized and low-resource languages that many systems handle poorly. As a compact model deployable in a single container, it is suited to both cost-sensitive and high-volume deployments. It can run fully self-hosted, allowing organizations with data-sovereignty requirements to keep document data within their own infrastructure.
Developers integrate the model via API, and teams can use Document AI in Mistral Studio for an application-level, no-code path to the same engine. Mistral OCR 4 through the API is priced at $4 per 1,000 pages, with a 50% Batch-API discount, reducing the cost to $2 per 1,000 pages. Document AI is priced at $5 per 1,000 pages.
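To make the pricing concrete, here is a small cost estimate using the three per-1,000-page rates quoted above. The function name `ocr_cost` and the 250,000-page volume are illustrative assumptions, and the sketch ignores any minimums, taxes, or volume terms that may apply.

```python
# Per-1,000-page prices quoted in the paragraph above (USD).
PRICE_PER_1K = {"api": 4.0, "batch": 2.0, "document_ai": 5.0}


def ocr_cost(pages, mode="api"):
    """Return the estimated cost in USD for processing `pages` pages."""
    return pages / 1000 * PRICE_PER_1K[mode]


for mode in PRICE_PER_1K:
    print(f"{mode}: ${ocr_cost(250_000, mode):,.2f}")
```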
Benchmarks
“We benchmarked Mistral OCR 4 against the leading agentic document parsers across a chart- and figure-dense financial QA dataset and reached equivalent accuracy at roughly 8x lower cost and 17x lower latency. For production use cases at scale, that delta compounds fast.” - Aidan Donohue, AI Engineer, Rogo
To evaluate OCR 4, we compared it against leading AI-native OCR models, frontier general-purpose models, enterprise document services, and our own Mistral OCR 3.
Human Preference Evaluations
Automated benchmarks carry known scoring artifacts, so we complemented them with a head-to-head human evaluation on documents chosen to reflect real usage. We assembled 600+ documents across 12+ languages, sourced from third-party vendors to represent real industry use cases, and asked independent annotators to blindly rank each competitor's output against OCR 4's, document by document.
Overall Performance
“Mistral OCR is roughly 4x faster per page than our incumbent provider, an impressive result for the high-volume docketing workflows where speed is critical to managing our customers' IP timelines.” - Ivan Mihailov, AI engineer, Anaqua
In addition to placing first in our human preference evaluation, OCR 4 achieves the top overall score among the models we tested on the public OlmOCRBench (85.20) and leads our internal Crawl Multilingual evaluation (0.98), ahead of both AI-native and enterprise solutions.
On OmniDocBench, OCR 4 achieves a score of 93.07. We report this figure with a caveat: both OlmOCRBench and OmniDocBench have known limitations in how they score certain outputs, and a single aggregate number can both understate and overstate real-world performance.
Recommended use cases
OCR 4 supports both high-volume pipelines and interactive document workflows, including:
- Document parsing and extraction: complex, multilingual documents.
- Retrieval-Augmented Generation (RAG): structured, classified, citation-ready content for semantic chunking and source-grounded answers. With Search Toolkit, OCR 4 output can be fed directly into retrieval pipelines.
- Agentic workflows: providing agents with the structural primitives to complete tasks such as form filling, invoice processing, and compliance checks, especially in legal, financial services, and healthcare.
- Structured data pipelines using confidence scores to enable efficient use of human verifiers: form/invoice extraction, redactions, and compliance-driven processes.
- Enterprise search and knowledge bases: OCR as a data-source component for custom ingestion and entity extraction.
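The human-verification use case above can be sketched as a simple routing step: values whose confidence falls below a threshold go to a reviewer, and the rest pass through. The `(text, confidence)` pair format and the 0.90 threshold are assumptions for illustration, not values from the OCR 4 documentation.

```python
def route_by_confidence(words, threshold=0.90):
    """Split (text, confidence) pairs into auto-accepted and needs-review lists."""
    accepted, review = [], []
    for text, confidence in words:
        (accepted if confidence >= threshold else review).append(text)
    return accepted, review


# Hypothetical per-word scores for a scanned invoice line.
words = [("Total:", 0.99), ("$1,240.00", 0.72), ("Due", 0.98), ("2O26-07-01", 0.55)]
accepted, review = route_by_confidence(words)
print("accepted:", accepted)
print("review:", review)
```

In practice the threshold would be tuned against a labeled sample so that reviewer time goes to the regions most likely to be wrong.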
OCR 4 API: Understanding Your Options
Mistral's OCR 4 is available through a single API endpoint. Every request runs the same underlying OCR model and always returns extracted content, bounding boxes, block types, confidence scores, and markdown-structured text. What varies is how much you layer on top.
Use OCR 4 in pure extraction mode when you want to:
- Embed fast, accurate document extraction directly into your application, agent, or data pipeline.
- Work directly with the raw response, bounding boxes, block types, and confidence scores to drive custom downstream logic.
- Run high-volume or batch ingestion with full control over throughput and cost via the Batch API.
- Self-host for strict data-privacy, sovereignty, or compliance requirements.
Activate Document AI capabilities (same endpoint, additional parameters) when you want to:
- Return structured JSON in a schema you define: pass a JSON schema alongside your document, and the OCR output is fed to mistral-small-2603 to generate content shaped to your spec.
- Annotate detected images with structured JSON by passing an image annotation schema, triggering an additional vision-language model call per image.
- Use a custom prompt alongside a JSON schema to guide how the extracted content of the full document is interpreted or summarized.
- Enable business users, solutions teams, or pilots to produce structured results without writing downstream parsing logic.
Now available
“The availability of Mistral Document AI with OCR 4 in Microsoft Foundry marks an important milestone in our partnership. Together, we’re enabling customers to bring advanced, structured document understanding directly into their AI workflows, combining Mistral’s innovation with Microsoft’s enterprise platform to deliver scalable, trusted solutions for real-world business needs.” - Kimmi Grewal, VP, AI Ecosystem Partnerships, Microsoft
Both Mistral OCR 4 and Document AI (powered by OCR 4) are available via API through Mistral Studio, Amazon SageMaker, and Microsoft Foundry, with Snowflake Parse Document coming soon. For organizations with stringent data-privacy requirements, OCR 4 also offers a self-hosting option so sensitive information stays within your own infrastructure. To explore self-deployment, let us know.
Get started
- Try OCR 4. The new Getting Started with OCR 4 Cookbook walks through a first extraction, working with bounding boxes, and block classification.
- OCR 4 webinar. We'll cover what's new in OCR 4 with demos and Q&A on July 7th at 6:00 PM CET. Register for the OCR4 in Production webinar.
- Contact Sales for more information.
Claude Tag
Anthropic launched Claude Tag, a Slack-integrated agent system that allows teams to delegate tasks and maintain shared context across channels.
Summary
Decoder
- Asynchronous: A communication style where participants do not need to respond in real-time, allowing the AI to complete long-running tasks while users focus on other work.
Original Article
Introducing Claude Tag
Claude Tag is a new way for teams to work with Claude.
We’re starting on Slack, which Claude can join as a team member. Grant Claude access to selected channels, and connect it to whichever tools, data—and even codebases—you choose. Then, anyone in the channel can tag @Claude in, and delegate tasks to it while they focus on other work. Claude builds context by remembering relevant information from the channels it’s in, and can plan out tasks to complete in the future.
We see Claude Tag as the beginning of an evolution of Claude Code: it makes the model even more proactive, and it works better with a full team. Tagging @Claude is now one of the main ways we get things done at Anthropic. Today, 65% of our product team’s code is created by our internal version of Claude Tag. The same pattern is now spreading well beyond engineering—we’re tagging Claude to chase down product metrics and data, work through support tickets, or even help find the root cause of tricky bugs.
We’re launching Claude Tag on Slack, since it’s a natural home for collaborative work between teams and AI, and where much of Anthropic’s day-to-day work already happens. It’s available today in beta for Claude Enterprise and Team customers. Our goal is to expand where it’s available more widely, so that teams can tag @Claude in the many other places they work.
Working with @Claude
If you’ve worked with Claude Code or Cowork before, Claude Tag will feel familiar. Tag @Claude with a request in simple terms and it’ll break its task down into stages and then work through them in turn, using the tools it has access to. Once it’s done, it’ll respond in a Slack thread with what it’s created.
But tagging Claude comes with a few new advantages:
@Claude is multiplayer. Within a given Slack channel, there’s one Claude that interacts with everyone. This means that anyone can see what it’s working on, and can pick up the conversation from where the last person left off. This makes tagging Claude very different from working within a single chat or for a single task—it’s much more like interacting collaboratively with a teammate.
@Claude learns over time. As Claude follows along with its channel, it builds more context about the work. This means that users don’t need to explain things to it from scratch over and over again. And Claude can even automatically learn from other Slack channels and data sources, if it’s granted permission. (It doesn’t report from private channels.) This gives it the tacit knowledge necessary for it to provide the best possible work.
@Claude takes initiative. If “ambient” behavior is enabled, Claude will proactively keep you updated about whatever it thinks you might need to know. It’ll flag relevant information from across the channels it’s in and the tools it’s connected to, and follow up on threads or tasks that have gone quiet without being resolved.
@Claude works asynchronously. Set Claude a task, and you can focus on your other priorities while it works. It can also schedule tasks for itself, pursuing a project autonomously over hours or days. We’ve found this particularly helpful at Anthropic: we now spend much more of our time delegating tasks to many Claudes in parallel.
You can also send Claude direct messages: it’ll respond privately, using the personal tools and connectors you’ve set up.
Getting started
We’ve designed Claude Tag with teams and organizations in mind: @Claude’s access to sensitive data and task-specific tools can be very tightly controlled.
To get up and running, system administrators specify which tools and information the model should have access to, in which channels. Think of it as creating separate Claude identities for different uses: everything, including its memories, will stay scoped to the channels defined by the administrators. For example, a model set up for sales work won’t pass on memories to one set up for engineering; nor will it give engineers access to any sales data or tools. More information about provisioning access is available here.
Once permissions are set, everyone can begin tagging right away. Administrators can set limits for token spend (both for the organization and for individual channels), and can view a log of everything that @Claude has done, along with who requested each task.
If you’re a Claude Enterprise or Team customer, you have access to Claude Tag in beta starting today. To get started, visit here and follow these four steps:
- Pair Claude Tag with your Slack workspace
- Give Claude access to your tools
- Set a limit on your organization’s monthly spend
- Test Claude in a private channel to confirm it works.
Claude Tag replaces the existing Claude in Slack app. To migrate, administrators can opt in within 30 days. We’re issuing an introductory launch credit to eligible Enterprise and Team organizations so that the whole company can try it out.
Claude Tag works with Opus 4.8. You can read our docs and product page.
Insights on Indirect Prompt Injection
Gray Swan's Zico Kolter and Matt Fredrikson discuss the inevitability of indirect prompt injection and why AI security requires specialized red teaming.
Summary
Deep Dive
- AI systems introduce new classes of failure (e.g., indirect prompt injection) that differ from traditional cybersecurity threats.
- Larger models do not automatically gain robustness against adversarial attacks.
- Automated red teaming (via tools like Shade) is increasingly outperforming human teams.
- The 'Lethal Trifecta' for AI exploits: untrusted input, access to private data, and exfiltration capability.
- Mechanistic interpretability research can be automated and scaled using AI agents.
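The 'Lethal Trifecta' item above is easy to express as a configuration check: an agent is flagged only when all three conditions hold together. The `AgentProfile` class and its field names are hypothetical, invented for this sketch rather than taken from Gray Swan's tooling.

```python
from dataclasses import dataclass


@dataclass
class AgentProfile:
    # Hypothetical capability flags for one agent deployment.
    reads_untrusted_input: bool
    accesses_private_data: bool
    can_exfiltrate: bool


def has_lethal_trifecta(agent):
    """True only when all three risk conditions hold at once."""
    return (
        agent.reads_untrusted_input
        and agent.accesses_private_data
        and agent.can_exfiltrate
    )


email_agent = AgentProfile(True, True, True)
sandboxed_agent = AgentProfile(True, True, False)
print(has_lethal_trifecta(email_agent), has_lethal_trifecta(sandboxed_agent))
```

Removing any one of the three capabilities clears the flag, which is why the second profile passes.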
Decoder
- Indirect Prompt Injection: A vulnerability where an AI agent reads untrusted external data (like a website or email) that contains instructions to hijack the agent's behavior.
- Red Teaming: The practice of simulating adversarial attacks against a system to identify and remediate security vulnerabilities.
- Mechanistic Interpretability (Mech Interp): A field of AI research focused on reverse-engineering the internal neural network components to understand how and why models arrive at specific conclusions.
Original Article
Full article content is not available for inline reading.
Graphsignal (GitHub Repo)
Graphsignal offers a low-overhead profiling platform for production AI inference, tracking bottlenecks across hardware and models without recording sensitive prompt data.
Summary
Decoder
- Sidecar process: A secondary, auxiliary container or process that runs alongside the main application to handle background tasks like logging or monitoring without impacting the primary application's performance.
Original Article
Graphsignal: Inference Profiler
Graphsignal is a production-scale inference profiling platform that helps engineers optimize AI performance across models, engines, GPUs, and other accelerators. It provides essential visibility across the inference stack, including:
- Continuous, high-resolution profiling timelines exposing operation durations and resource utilization across inference workloads.
- LLM generation tracing with per-step timing, token throughput, and latency breakdowns for major inference frameworks.
- System-level metrics for inference engines and hardware (CPU, GPU, accelerators).
- Error monitoring for device-level failures and inference errors.
- Inference telemetry for AI agents to identify bottlenecks and drive targeted improvements across the inference stack.
Learn more at graphsignal.com.
Install
UV_TOOL_BIN_DIR=/usr/local/bin uv tool install 'graphsignal[cu12]' # CUDA 12.x
# or
UV_TOOL_BIN_DIR=/usr/local/bin uv tool install 'graphsignal[cu13]' # CUDA 13.x
Alternative: install into your workload environment
If you prefer a single environment, or you use the graphsignal.watch() Python API (which requires graphsignal importable by your application), install it directly into your workload's environment instead:
pip install 'graphsignal[cu12]' # CUDA 12.x
# or
pip install 'graphsignal[cu13]' # CUDA 13.x
Profile
Wrap your launch command with graphsignal-run:
export GRAPHSIGNAL_API_KEY=<my-api-key>
graphsignal-run vllm serve <model> --port 8001
Environment variables read by the profiler:
| Variable | Purpose |
|---|---|
| GRAPHSIGNAL_API_KEY (required) | Your account API key. |
| GRAPHSIGNAL_TAG_<KEY>=<value> | Arbitrary tag attached to all signals (e.g. GRAPHSIGNAL_TAG_DEPLOYMENT=us-prod). |
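To illustrate the `GRAPHSIGNAL_TAG_<KEY>` naming convention, the sketch below collects every matching variable into a dictionary. It only demonstrates the prefix pattern; the helper `collect_tags` is hypothetical, and the profiler's own parsing may differ (for example in how it treats key case).

```python
import os

PREFIX = "GRAPHSIGNAL_TAG_"


def collect_tags(environ=None):
    """Return {KEY: value} for every variable named GRAPHSIGNAL_TAG_<KEY>."""
    environ = os.environ if environ is None else environ
    return {
        name[len(PREFIX):]: value
        for name, value in environ.items()
        if name.startswith(PREFIX) and len(name) > len(PREFIX)
    }


print(collect_tags({"GRAPHSIGNAL_TAG_DEPLOYMENT": "us-prod", "PATH": "/usr/bin"}))
```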
Sign up for a free account at graphsignal.com; you'll find the API key in Settings / API Keys.
See the Profiler CLI reference for the full set of options.
Applications that bootstrap themselves can call graphsignal.watch() from Python instead — see the Profiler API reference.
See integration documentation for libraries and inference engines:
- PyTorch
- vLLM
- SGLang
Optimize
Log in to Graphsignal to monitor and analyze your application.
Optimize with AI
Install the Graphsignal skill to let your AI coding agent (Claude Code, Codex, or Gemini) fetch and analyze signal context directly from your agent. See AI Optimization for setup instructions.
Overhead
The profiler has minimal impact on production performance. CUDA kernel activity is collected via CUPTI with low-overhead APIs, and analysis and upload happen in the sidecar process.
Security and Privacy
The profiler only establishes outbound connections to api.graphsignal.com to send data; inbound connections or commands are not possible.
Content and sensitive information, such as prompts and completions, are not recorded.
Troubleshooting
If something doesn't look right, report it to our support team via your account.
In case of connection issues, please make sure outgoing connections to https://api.graphsignal.com are allowed.
Krea 2 Technical Report
Krea 2 avoids the 'default aesthetic' trap by using a custom, multi-stage training pipeline and strict pretraining data curation to prioritize creative exploration.
Summary
Deep Dive
- Krea 2 uses a diffusion transformer (DiT) with gated sigmoid attention and grouped-query attention.
- It abandons synthetic training data to prevent bias propagation and quality ceilings.
- Data curation relies on hierarchical k-means clustering via FAISS and VLM-based filtering.
- The team implemented a 'krablet' PostgreSQL-based warehousing system for massive metadata handling.
- Multi-stage resolution scaling (256px to 1024px) was used to stabilize core capabilities.
- Training uses a multi-reward RL approach to discourage reward hacking and improve structural correctness.
- Fault tolerance was handled via aggressive checkpointing to the Weka file system.
Decoder
- DiT (Diffusion Transformer): A generative model architecture that uses transformer blocks to process diffusion-based latent spaces.
- Muon: A recent optimizer designed to scale training efficiency by performing lower-rank updates on weight matrices.
- Reward hacking: When a model learns to exploit the scoring mechanism in reinforcement learning to achieve a high score without performing the intended task correctly.
Original Article
Full article content is not available for inline reading.
Unlimited OCR Works (GitHub Repo)
Unlimited OCR pushes long-context document parsing by combining a constant KV cache design with DeepSeek-OCR to handle dozens of pages in one pass.
Summary
Decoder
- KV Cache (Key-Value Cache): A mechanism in transformer models that stores previous key and value pairs to avoid redundant calculations during the generation of subsequent tokens.
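A toy sketch of the idea: each token's key and value are computed once and stored for reuse. The bounded variant keeps only the most recent entries, which shows what constant memory means as a sequence grows. This is a generic sliding-window illustration with made-up names (`KVCache`, `project`); the summary above does not say how Unlimited OCR keeps its cache constant.

```python
from collections import deque


class KVCache:
    """Toy cache: stores one (key, value) pair per generated position."""

    def __init__(self, max_entries=None):
        # max_entries=None grows without bound; an int keeps only the most
        # recent entries, so memory stays constant as the sequence grows.
        self.entries = deque(maxlen=max_entries)

    def append(self, key, value):
        self.entries.append((key, value))

    def __len__(self):
        return len(self.entries)


def project(token):
    # Hypothetical stand-in for a transformer layer's key/value projections.
    return token * 2, token * 3


unbounded, bounded = KVCache(), KVCache(max_entries=4)
for token in range(10):
    key, value = project(token)  # computed once per token, then reused
    unbounded.append(key, value)
    bounded.append(key, value)

print(len(unbounded), len(bounded))
```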
Original Article
Unlimited OCR Works
Welcome the Era of One-shot Long-horizon Parsing.
Release
- [2026/06/24] 🤝 Thanks to AK for creating a demo for us. It is now available at Hugging Face Spaces.
- [2026/06/23] 📄 Our paper is now available on arXiv.
- [2026/06/23] 🤝 Thanks to the ModelScope community for their support. Our model is now available at ModelScope.
- [2026/06/22] 🚀 We present Unlimited-OCR, aiming to push Deepseek-OCR one step further.
Inference
Transformers
Inference using Hugging Face Transformers on NVIDIA GPUs. Requirements tested on Python 3.12.3 + CUDA 12.9:
torch==2.10.0
torchvision==0.25.0
transformers==4.57.1
Pillow==12.1.1
matplotlib==3.10.8
einops==0.8.2
addict==2.4.0
easydict==1.13
pymupdf==1.27.2.2
psutil==7.2.2
import os
import torch
from transformers import AutoModel, AutoTokenizer
model_name = 'baidu/Unlimited-OCR'
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
    torch_dtype=torch.bfloat16,
)
model = model.eval().cuda()
# ── Single image supports two configs: gundam or base ──
# gundam: base_size=1024, image_size=640, crop_mode=True
# base: base_size=1024, image_size=1024, crop_mode=False
model.infer(
    tokenizer,
    prompt='<image>document parsing.',
    image_file='your_image.jpg',
    output_path='your/output/dir',
    base_size=1024, image_size=640, crop_mode=True,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=128,
    save_results=True,
)
# ── Multi page / PDF only uses base (image_size=1024) ──
model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=['page1.png', 'page2.png', 'page3.png'],
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)
# ── PDF (convert pages to images, then multi-page parsing) ──
import tempfile, fitz  # PyMuPDF
def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix='pdf_ocr_')
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    paths = []
    for i, page in enumerate(doc):
        out = os.path.join(tmp_dir, f'page_{i+1:04d}.png')
        page.get_pixmap(matrix=mat).save(out)
        paths.append(out)
    doc.close()
    return paths
model.infer_multi(
    tokenizer,
    prompt='<image>Multi page parsing.',
    image_files=pdf_to_images('your_doc.pdf', dpi=300),
    output_path='your/output/dir',
    image_size=1024,
    max_length=32768,
    no_repeat_ngram_size=35, ngram_window=1024,
    save_results=True,
)
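The calls above pass `no_repeat_ngram_size` and `ngram_window`. The sketch below shows the general idea of windowed n-gram blocking: before sampling the next token, ban any token that would complete an n-gram already present in the last `window` tokens. It is a generic illustration, not the model's own implementation, and the function name `banned_next_tokens` is hypothetical.

```python
def banned_next_tokens(tokens, ngram_size, window):
    """Return tokens that would repeat an n-gram seen in the recent window."""
    if ngram_size < 1 or len(tokens) < ngram_size - 1:
        return set()
    # The last (ngram_size - 1) tokens form the prefix of the next n-gram.
    prefix = tuple(tokens[len(tokens) - (ngram_size - 1):])
    # Only search the most recent `window` tokens.
    recent = tokens[max(0, len(tokens) - window):]
    banned = set()
    for i in range(len(recent) - ngram_size + 1):
        if tuple(recent[i:i + ngram_size - 1]) == prefix:
            banned.add(recent[i + ngram_size - 1])
    return banned


history = [1, 2, 3, 1, 2]
print(banned_next_tokens(history, ngram_size=3, window=10))
print(banned_next_tokens(history, ngram_size=3, window=3))
```

With the wide window, token 3 is banned because `1, 2, 3` already occurred; with a window of 3, the earlier occurrence falls outside the search range and nothing is banned. A small window therefore permits repeats that are far apart, while a large one suppresses them more aggressively.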
SGLang
Set up the environment (uv-managed virtualenv). Install the local SGLang wheel first, then pin kernels to the version used in the commands below and install PyMuPDF for PDF-to-image conversion:
uv venv --python 3.12
source .venv/bin/activate
uv pip install wheel/sglang-0.0.0.dev11416+g92e8bb79e-py3-none-any.whl
uv pip install kernels==0.11.7
uv pip install pymupdf==1.27.2.2
Start the SGLang server:
python -m sglang.launch_server \
--model baidu/Unlimited-OCR \
--served-model-name Unlimited-OCR \
--attention-backend fa3 \
--page-size 1 \
--mem-fraction-static 0.8 \
--context-length 32768 \
--enable-custom-logit-processor \
--disable-overlap-schedule \
--skip-server-warmup \
--host 0.0.0.0 \
--port 10000
Send streaming requests to the OpenAI-compatible API:
import base64
import json
import os
import tempfile
import fitz
import requests
from sglang.srt.sampling.custom_logit_processor import DeepseekOCRNoRepeatNGramLogitProcessor
server_url = "http://127.0.0.1:10000"
session = requests.Session()
session.trust_env = False
def pdf_to_images(pdf_path, dpi=300):
    doc = fitz.open(pdf_path)
    tmp_dir = tempfile.mkdtemp(prefix="pdf_ocr_")
    mat = fitz.Matrix(dpi / 72, dpi / 72)
    image_paths = []
    for i, page in enumerate(doc):
        image_path = os.path.join(tmp_dir, f"page_{i + 1:04d}.png")
        page.get_pixmap(matrix=mat).save(image_path)
        image_paths.append(image_path)
    doc.close()
    return image_paths
def encode_image(image_path):
    ext = os.path.splitext(image_path)[1].lower()
    mime = "image/jpeg" if ext in (".jpg", ".jpeg") else f"image/{ext.lstrip('.')}"
    with open(image_path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{data}"}}
def build_content(prompt, image_paths):
    return [{"type": "text", "text": prompt}] + [encode_image(path) for path in image_paths]
def generate(prompt, image_paths, image_mode, ngram_window):
    payload = {
        "model": "Unlimited-OCR",
        "messages": [{"role": "user", "content": build_content(prompt, image_paths)}],
        "temperature": 0,
        "skip_special_tokens": False,
        "images_config": {"image_mode": image_mode},
        "custom_logit_processor": DeepseekOCRNoRepeatNGramLogitProcessor.to_str(),
        "custom_params": {
            "ngram_size": 35,
            "window_size": ngram_window,
        },
        "stream": True,
    }
    response = session.post(
        f"{server_url}/v1/chat/completions",
        headers={"Content-Type": "application/json"},
        data=json.dumps(payload),
        timeout=1200,
        stream=True,
    )
    response.raise_for_status()
    chunks = []
    for line in response.iter_lines(chunk_size=1, decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        data = line[len("data: "):]
        if data == "[DONE]":
            break
        event = json.loads(data)
        delta = event["choices"][0].get("delta", {}).get("content", "")
        if delta:
            print(delta, end="", flush=True)
            chunks.append(delta)
    print()
    return "".join(chunks)
# Single image supports two configs: gundam or base. Example below uses gundam.
generate("document parsing.", ["your_image.jpg"], image_mode="gundam", ngram_window=128)
# Multi image (base only)
generate("Multi page parsing.", ["page1.png", "page2.png"], image_mode="base", ngram_window=1024)
# PDF (base only)
generate("Multi page parsing.", pdf_to_images("your_doc.pdf", dpi=300), image_mode="base", ngram_window=1024)
For batch inference, infer.py starts the SGLang server automatically and sends concurrent requests for an image directory or PDF:
# Image directory
python infer.py \
--image_dir ./examples/images \
--output_dir ./outputs \
--concurrency 8 \
--image_mode gundam
# PDF pages
python infer.py \
--pdf ./examples/document.pdf \
--output_dir ./outputs \
--concurrency 8 \
--image_mode gundam
Useful options:
--model_dir baidu/Unlimited-OCR # Local path or Hugging Face model ID
--gpu 0 # CUDA_VISIBLE_DEVICES value
--server_log ./log/sglang_server.log
Acknowledgement
We would like to thank Deepseek-OCR, Deepseek-OCR-2, PaddleOCR for their valuable models and ideas.
Citation
@misc{yin2026unlimitedocrworks,
title={Unlimited OCR Works},
author={Youyang Yin and Huanhuan Liu and YY and Qunyi Xie and Chaorun Liu and Shiqi Yang and Shaohua Wang and Zhanlong Liu and Hao Zou and Jinyue Chen and Shu Wei and Jingjing Wu and Mingxin Huang and Zhen Wu and Guibin Wang and Tengyu Du and Lei Jia},
year={2026},
eprint={2606.23050},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2606.23050},
}
OpenAI prepares bidirectional voice mode for rollout on ChatGPT
OpenAI is rolling out 'Bidi 1', a new bidirectional voice model for ChatGPT that allows for simultaneous speaking and listening.
Summary
Deep Dive
- The Bidi 1 model allows the assistant to 'barge in' or be interrupted without losing context.
- Improved conversational thread management addresses previous issues with long-term memory in voice mode.
- It introduces tighter copyright filters compared to the previous advanced voice mode.
- Future updates are expected to include real-time translation features and potential integration with the Codex API.
Decoder
- Bidirectional voice: The ability of an AI model to process audio input and generate output at the same time, mimicking natural human conversation.
Original Article
OpenAI looks set to hand ChatGPT's voice mode its biggest upgrade in months, with a next-generation audio model surfacing as Bidi 1, shorthand for the bidirectional design that lets the assistant speak and listen at once. References to it began appearing in the ChatGPT web interface ahead of a possible release this week, and it has already begun reaching a subset of users in the app.
In our early testing, the gap from today's advanced voice mode is plain. Bidi 1 sits in the model selector under settings, beside the standard and advanced options, and turns the voice bubble yellow once picked. It offers small, natural acknowledgments — an "okay" or a brief nod — when you pause or slow down, without cutting across you. It also switches tasks on the fly: ask it to count to ten, interrupt to reverse the count, and it adjusts immediately.
More usefully, it holds the thread of a whole conversation rather than dropping earlier context, the weak point that has long dogged the current voice stack, and it no longer jumps in during longer pauses.
Creative behavior carries over from the first advanced voice rollout, singing and beatboxing included, though copyright handling is tighter; it declines popular songs outright while still attempting an original piece in a chosen artist's style.
The move reads as OpenAI closing the distance between its capable text models and an older voice layer, treating conversation as a core route into ChatGPT. The company has not formally announced it. A gradual, opt-in release across web and mobile looks likely, with the European Economic Area possibly waiting longer (not confirmed). Codex appears set for its own voice upgrade in the weeks after this launch, separate from it, and API access may follow later still (timeline is not confirmed).
Fluree DB (GitHub Repo)
Fluree DB integrates vector, text, and geo search directly into its graph database to provide long-term, persistent memory for AI coding agents.
Summary
Deep Dive
- Features a git-like branching and merging workflow for data.
- Supports time-travel queries using commit IDs, timestamps, or transaction numbers.
- Integrated BM25 full-text search and HNSW vector search within the query engine.
- Includes native support for RDF, SPARQL, and JSON-LD.
- Implements triple-level access control within the ledger.
- Provides an MCP server to expose data to Claude Desktop and Cursor.
- Uses the Business Source License 1.1.
Decoder
- RDF: Resource Description Framework, a standard model for data interchange on the web.
- SPARQL: A query language specifically for databases that can store and retrieve data via RDF.
- Triple: The basic unit of RDF data, consisting of a subject, predicate, and object.
- HNSW: Hierarchical Navigable Small World, a graph-based algorithm for fast approximate nearest neighbor search in vector databases.
- MCP: Model Context Protocol, a standard for connecting AI assistants to external data and systems.
Original Article
Fluree DB - A graph database for data that matters.
Temporal, verifiable, standards-compliant, git-like branching and merging, and optimized for AI agents. Integrated vector, text and geo search, and fine-grained access control with no external dependencies.
RDF 1.1 / 1.2, Open Cypher Preview, SPARQL and JSON-LD query (includes history query and other Fluree feature extensions).
Billions of graph facts on commodity hardware. Over 2M facts/second bulk import. Benchmark leader, 10.4x faster than next database. On the full 21.5-billion-triple Wikidata dump, all 850/850 WGPB graph-pattern queries complete with a 43 ms geometric mean.
Fluree Memory is part of the Fluree DB CLI. It provides persistent, searchable memory for AI coding assistants. Give Claude Code, Cursor, and other AI tools long-term project memory: facts, decisions, and preferences persist across sessions in a Fluree ledger you control, scoped per-repo or per-user and shareable via git.
Install
Cloud / Serverless — Run in a dedicated serverless stack at no cost (usage limited), spin up dedicated servers on demand as needed. Interact seamlessly with local fluree CLI (install instructions below).
Docker — pre-configured HTTP server, ready to accept queries on port 8090. Best for trying out the API or running Fluree as a service.
docker run -p 8090:8090 fluree/server:latest
Homebrew, shell installer, or Windows PowerShell — installs the fluree binary that bundles both the CLI and the embedded server (fluree server run).
# Homebrew (macOS / Linux)
brew install fluree/tap/fluree
# Shell installer (macOS / Linux)
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/fluree/db/releases/latest/download/fluree-db-cli-installer.sh | sh
# Windows (PowerShell)
irm https://github.com/fluree/db/releases/latest/download/fluree-db-cli-installer.ps1 | iex
Pre-built binaries and the changelog for every release are on the GitHub Releases page.
Zero to graph in 60 seconds
fluree init
fluree create movies
fluree insert '
@prefix schema: <http://schema.org/> .
@prefix ex: <http://example.org/> .
ex:blade-runner a schema:Movie ;
schema:name "Blade Runner" ;
schema:dateCreated "1982-06-25"^^<http://www.w3.org/2001/XMLSchema#date> ;
schema:director ex:ridley-scott .
ex:ridley-scott a schema:Person ;
schema:name "Ridley Scott" .
ex:alien a schema:Movie ;
schema:name "Alien" ;
schema:dateCreated "1979-05-25"^^<http://www.w3.org/2001/XMLSchema#date> ;
schema:director ex:ridley-scott .
'
fluree query --format table 'SELECT ?title ?date WHERE {
?movie a <http://schema.org/Movie> ;
<http://schema.org/name> ?title ;
<http://schema.org/dateCreated> ?date .
} ORDER BY ?date'
┌──────────────┬────────────┐
│ title │ date │
├──────────────┼────────────┤
│ Alien │ 1979-05-25 │
│ Blade Runner │ 1982-06-25 │
└──────────────┴────────────┘
That's a SPARQL query. The same query in JSON-LD:
fluree query --jsonld '{
"@context": { "schema": "http://schema.org/" },
"select": ["?title", "?date"],
"where": [
{ "@id": "?movie", "@type": "schema:Movie",
"schema:name": "?title", "schema:dateCreated": "?date" }
],
"orderBy": "?date"
}'
Both languages access the same engine — same features, same performance.
Now update the data and query the past:
# Give every Ridley Scott movie a genre
fluree update '
PREFIX schema: <http://schema.org/>
PREFIX ex: <http://example.org/>
INSERT { ?movie schema:genre "sci-fi" }
WHERE { ?movie schema:director ex:ridley-scott }
'
# What did the data look like before that update?
fluree query --at 1 'SELECT ?title ?genre WHERE {
?movie a <http://schema.org/Movie> ;
<http://schema.org/name> ?title .
OPTIONAL { ?movie <http://schema.org/genre> ?genre }
}'
# → Blade Runner (no genre), Alien (no genre)
# And now?
fluree query 'SELECT ?title ?genre WHERE {
?movie a <http://schema.org/Movie> ;
<http://schema.org/name> ?title .
OPTIONAL { ?movie <http://schema.org/genre> ?genre }
}'
# → Blade Runner "sci-fi", Alien "sci-fi"
Every change is preserved. Query any point in history by transaction number, ISO timestamp, or commit ID.
What makes Fluree different
Time travel
Every transaction is immutable. Query data as it existed at any point in time — by transaction number, ISO-8601 timestamp, or content-addressed commit ID. No special tables, no slowly-changing dimensions. It's built into the storage model.
fluree query --at 2024-06-15T00:00:00Z 'SELECT * WHERE { ?s ?p ?o }'
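To make the time-travel model concrete, here is a minimal Python sketch of an append-only fact log, assuming each assertion or retraction is stamped with the transaction number that produced it. It is a conceptual illustration only: the `Ledger` class and its methods are hypothetical and do not represent Fluree's storage format or API.

```python
from dataclasses import dataclass, field


@dataclass
class Ledger:
    """Toy append-only ledger: every change is kept, nothing is overwritten."""
    log: list = field(default_factory=list)  # (t, op, subject, predicate, object)
    t: int = 0  # latest transaction number

    def transact(self, asserts=(), retracts=()):
        """Record one transaction and return its transaction number."""
        self.t += 1
        for s, p, o in retracts:
            self.log.append((self.t, "retract", s, p, o))
        for s, p, o in asserts:
            self.log.append((self.t, "assert", s, p, o))
        return self.t

    def facts(self, at=None):
        """Replay the log up to transaction `at` (default: latest)."""
        at = self.t if at is None else at
        state = set()
        for t, op, s, p, o in self.log:
            if t > at:
                break  # the log is ordered by t, so later entries are irrelevant
            if op == "assert":
                state.add((s, p, o))
            else:
                state.discard((s, p, o))
        return state


ledger = Ledger()
t1 = ledger.transact(asserts=[("ex:alien", "schema:name", "Alien")])
t2 = ledger.transact(asserts=[("ex:alien", "schema:genre", "sci-fi")])

before = ledger.facts(at=t1)  # state as of the first transaction
now = ledger.facts()          # current state
```

Because the log is never rewritten, asking for an earlier state is just a shorter replay, which is why no separate history tables are needed in this model.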
Property graphs & edge annotations
Attach properties to a relationship, not just a node — a role and since date on a worksFor edge, a source and confidence on a claim. Fluree implements the RDF 1.2 / SPARQL 1.2 annotation syntax, so you get labeled-property-graph edges, parallel relationships between the same two nodes, and RDF-star statement-level provenance on a single surface. Plain triple queries are left untouched — annotations only change cardinality when you ask for them.
# Attach metadata to the edge itself, not to Alice or Acme
INSERT DATA {
ex:alice ex:worksFor ex:acme {| ex:role "Engineer" ; ex:since 2024 |} .
}
# Match the edge and its metadata together
SELECT ?role ?since WHERE {
ex:alice ex:worksFor ex:acme {| ex:role ?role ; ex:since ?since |} .
}
Integrated search
BM25 full-text search and HNSW vector similarity are built into the query engine — not bolted-on external services. Search results participate in joins, filters, and aggregations like any other graph pattern.
{
"@context": { "ex": "http://example.org/" },
"from": "mydb:main",
"where": [
{ "@id": "?doc", "ex:title": "?title" },
["bind", "?score", "(fulltext ?title \"knowledge graph\")"]
],
"select": ["?doc", "?title", "?score"],
"orderBy": [["desc", "?score"]],
"limit": 10
}
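For readers unfamiliar with the ranking function named above, the following Python sketch shows textbook Okapi BM25 scoring over a tiny in-memory corpus. It illustrates the formula only: `bm25_scores`, the whitespace tokenizer, and the `k1`/`b` defaults are assumptions for this example, and nothing here reflects Fluree's index layout or its Block-Max WAND optimization.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document for a whitespace-tokenized query."""
    tokenized = [doc.lower().split() for doc in docs]
    avg_len = sum(len(d) for d in tokenized) / len(tokenized)
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in tokenized for term in set(d))

    scores = []
    for d in tokenized:
        tf = Counter(d)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed IDF (the "+ 1" keeps it non-negative).
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1)
            # Term frequency saturates via k1 and is normalized by length via b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


docs = [
    "knowledge graph basics",
    "graph databases and knowledge graph queries",
    "cooking with cast iron",
]
scores = bm25_scores("knowledge graph", docs)
ranked = sorted(zip(scores, docs), reverse=True)
```

The score is an ordinary number per document, which is what lets it be bound to a variable such as `?score` and then sorted, filtered, or joined like any other value.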
Git-like data management
Branch, rebase, merge, push, pull — the same workflow developers already use for code, applied to data. Fork a dataset to experiment without affecting production. Merge when ready. Rebase to catch up with upstream changes. Every branch has its own independent commit history.
fluree branch create experiment
fluree use mydb:experiment
# ... make changes safely ...
fluree branch rebase experiment # catch up with main
fluree branch merge experiment # fast-forward merge into main
fluree branch drop experiment # clean up
Triple-level access control
Policies are data in the ledger, enforced at query and transaction time. Users see only what they're authorized to see — not rows, not tables, individual facts. No application-layer filtering required.
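As a rough illustration of what filtering at the level of individual facts means, the Python sketch below applies a policy to a list of triples before they reach the caller. The `POLICIES` mapping, the role names, and `visible_facts` are hypothetical and far simpler than a real policy model; they only show the granularity, not how Fluree stores or evaluates policies.

```python
FACTS = [
    ("ex:alice", "schema:name", "Alice"),
    ("ex:alice", "ex:salary", 120000),
    ("ex:bob", "schema:name", "Bob"),
    ("ex:bob", "ex:salary", 95000),
]

# Hypothetical policy data: which predicates each role may read.
POLICIES = {
    "employee": {"schema:name"},
    "hr": {"schema:name", "ex:salary"},
}


def visible_facts(facts, role, policies=POLICIES):
    """Return only the individual facts the role is allowed to see."""
    allowed = policies.get(role, set())  # unknown roles see nothing
    return [(s, p, o) for s, p, o in facts if p in allowed]


employee_view = visible_facts(FACTS, "employee")
hr_view = visible_facts(FACTS, "hr")
```

Both roles query the same subjects, but the employee view simply never contains the salary facts, so no application code has to remember to strip them.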
Reasoning and inference
RDFS subclass/subproperty reasoning, OWL 2 RL forward-chaining, and user-defined Datalog rules. The database infers facts you didn't explicitly store.
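The idea of inferring facts that were never stored can be sketched with the simplest RDFS rule: if an instance has type C and C is a subclass of D, the instance also has type D. The Python below forward-chains that single rule to a fixed point. `infer_types` and the `ex:` terms are made up for the example, and this is a naive loop rather than the engine's actual reasoner.

```python
def infer_types(triples):
    """Forward-chain rdfs:subClassOf until no new rdf:type facts appear."""
    facts = set(triples)
    changed = True
    while changed:
        changed = False
        # Snapshot the current subclass and type facts for this pass.
        subclass = {(s, o) for s, p, o in facts if p == "rdfs:subClassOf"}
        typed = {(s, o) for s, p, o in facts if p == "rdf:type"}
        for instance, cls in typed:
            for sub, sup in subclass:
                new = (instance, "rdf:type", sup)
                if sub == cls and new not in facts:
                    facts.add(new)
                    changed = True
    return facts


stored = {
    ("ex:Movie", "rdfs:subClassOf", "ex:CreativeWork"),
    ("ex:CreativeWork", "rdfs:subClassOf", "ex:Thing"),
    ("ex:alien", "rdf:type", "ex:Movie"),
}
inferred = infer_types(stored) - stored
```

Here only one type was stored for `ex:alien`, yet a query for every `ex:Thing` would still find it once the two inferred type facts are added.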
Standards-first
Full SPARQL 1.1 with zero compliance failures against the W3C test suite. Native JSON-LD for idiomatic JSON APIs. Both query languages access the same engine with the same capabilities — time travel, policies, graph sources, and all.
Also worth knowing
- SHACL validation — declarative shape constraints enforced at transaction time, with violations reported per-target, per-property.
- OWL ontology imports — pull external vocabularies into a ledger via f:schemaSource + owl:imports, materialized at commit time.
- Apache Iceberg / R2RML — query Parquet warehouses and relational stores as first-class graph sources alongside native Fluree data.
Use it your way
CLI — Explore data, script pipelines, manage ledgers from the terminal.
fluree query -f report.rq --format csv > output.csv
HTTP Server — Run fluree server for a production API with OIDC auth, content negotiation, and OpenTelemetry.
fluree server run
curl -X POST 'http://localhost:8090/v1/fluree/query?ledger=mydb:main' \
-H "Content-Type: application/sparql-query" \
-d 'SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10'
Rust library — Embed Fluree directly in your application. No server process needed.
let fluree = FlureeBuilder::memory().build_memory();
fluree.create_ledger("mydb").await?;
let result = fluree.graph("mydb:main")
.query()
.sparql("SELECT ?s WHERE { ?s a <http://schema.org/Person> }")
.execute()
.await?;
MCP server — Expose Fluree to AI assistants over the Model Context Protocol.
fluree mcp serve # stdio transport for Claude Desktop, Cursor, etc.
Capabilities
| Feature | Description |
|---|---|
| Query languages | SPARQL 1.1, JSON-LD Query |
| Data formats | JSON-LD, Turtle, TriG, N-Triples, N-Quads |
| Edge annotations | Property-graph edges & statement-level metadata (RDF 1.2 / SPARQL 1.2) |
| Time travel | Transaction number, ISO timestamp, commit ID |
| Full-text search | Integrated BM25 with Block-Max WAND |
| Vector search | Embedded HNSW or remote service |
| Reasoning | RDFS, OWL 2 QL, OWL 2 RL, Datalog rules |
| Access control | Triple-level policy enforcement |
| Geospatial | GeoSPARQL, S2 cell indexing |
| Verifiability | JWS-signed transactions, Verifiable Credentials |
| Data sources | Apache Iceberg, R2RML relational mappings |
| Storage backends | Memory, file, AWS S3 + DynamoDB, IPFS |
| Replication | Clone, push, pull between instances |
| Branching | Fork ledgers, independent commit histories |
| Observability | OpenTelemetry tracing, structured logging |
| Validation | SHACL shape constraints |
Documentation
For documentation and more information, visit labs.flur.ee/docs.
For AI agents
The published docs are available as LLM-readable text following the llms.txt convention:
- Curated index: https://fluree.github.io/db/llms.txt
- Full corpus: https://fluree.github.io/db/llms-full.txt
License
Licensed under the Business Source License 1.1, with a Change Date to Apache License 2.0 as specified in that file.
A New Era of Software Quality Starts Today
Momentic has updated its QA testing platform with autonomous agentic features designed to learn product behavior and maintain tests as code changes.
Summary
Deep Dive
- Automates test updates by reading PR diffs.
- Uses a shared knowledge base to align AI agents with specific product logic.
- Automatically classifies test failures to reduce alert fatigue.
- Transitions test suites to a readable, intent-based format.
Decoder
- Flaky test: A test that produces inconsistent results (passing or failing) despite no changes to the underlying code.
- PR (Pull Request): A method of submitting contributions to a software project, commonly used to review code changes before merging.
Original Article
Today we're announcing a new Momentic - a major platform update, a new brand, and a way for every developer to experience the future of testing for themselves.
Why we rebuilt the platform
Jeff and I founded Momentic around a problem we lived with every day. As former engineers at Robinhood, Qualtrics, WeWork, and Retool, we watched the widening gap between how fast teams could write code and how confidently they could ship it. AI coding agents have only made this gap impossible to ignore.
More code is shipping faster than ever, and that means more bugs and incidents reaching production. According to Faros AI's 2026 AI Engineering Report, monthly incidents are up nearly 58% since AI adoption accelerated. And a May 2026 CloudBees study found that 81% of enterprise technology leaders say they've seen a direct increase in production issues tied to AI-generated code.
This isn't a prediction; it's already happening. The code is shipping, but that was never the real bottleneck. QA, which was already painful, is now falling further behind.
What's new in Momentic
Over the past few weeks, we've been working closely with AI-native engineering teams whose applications collectively serve hundreds of millions of users, spanning productivity, media streaming, consumer applications, and professional services. They've been invaluable partners in shaping how agentic testing really should work alongside rapidly evolving developer toolchains.
One of our customers said it best:
“The best testers I've ever worked with know the product better than anybody else. They know the nooks and crannies, what's supposed to happen and what isn't. They know where things are brittle. That's what's missing from agentic code in general.”
That insight became the foundation for everything we built next.
The teams we worked with weren't asking how to write better test scripts. They were asking for an always-on, autonomous system that actually understands their product, knows how it's supposed to behave, and gets smarter over time. Humans just have to review a report, not manually test.
So here’s what Momentic customers now have access to:
Introducing memory and knowledge base
The best QA tester you’ve ever worked with didn’t just catch bugs. They knew your product's terminology, the flows that were brittle, the edge cases nobody had documented. That knowledge lived in their head, and when a new hire onboarded, they simply ‘got it’ too.
We built a way for that knowledge to live inside the platform instead, with our new Knowledge Base. Teams define how their product is supposed to behave, what counts as a bug versus an intentional change, what terminology means in their specific context. Every agent, whether it's writing new tests, triaging failures, or proposing fixes, runs on that shared understanding. The more your team puts in, the smarter it gets.
Every team has a different bar for what "quality" means. Some want pixel-perfect accuracy; others just need the user to reach the destination. Now you can customize the platform accordingly.
Coverage that grows with your product
Writing tests is usually the last thing engineers want to do. And with AI accelerating the number of commits daily, the gap between what's shipping and what's been verified keeps growing.
Explore Agent closes that loop automatically. Every time a PR lands, Momentic reads the diff, identifies what changed, and proposes new or updated tests, already scoped to the flows that matter, already consistent with your existing suite. It notices the new features, the renamed components, the edge cases that weren't there last sprint. Over time it gets better at this, because it's learning your product as it goes.
Failures that mean something
Flaky tests are one of the worst problems to have in your test suite. Not because they're annoying, but because they train engineers to ignore failures. And that's exactly when real bugs slip through.
Every failure in Momentic is now triaged by our Failure Classification Agent, which analyzes its root cause. Is it a real bug, an intentional application change, a test setup issue, or a transient error? If it's an intentional change that triggered a test failure, Momentic opens a pull request to fix the test itself. If it's a real bug, the team gets a high-signal alert with full context on what broke and why. The result is a test suite that compounds trust over time.
The spec is the test
For too long, test scripts have been artifacts that only the engineer who wrote them can understand.
The new Momentic test format is intent-based and readable by both humans and AI agents. Engineers just have to describe what they want to test in plain English. Plus, AI agents can parse, build, and modify tests more effectively - making the entire development loop faster and more autonomous.
A new brand
Our platform evolution meant Momentic needed a new expression to match, one built around the same two qualities we engineered into the product itself: ease of use and guardrails that can evolve autonomously with your product.
The grid structure you'll see throughout our new site is intentional. Structure and reliability are still the foundation. But there's motion in it now, an energy that reflects what it feels like when your agent catches a bug before production, when tests update themselves, and when the agent unblocks itself mid-execution and keeps going. That's the energy the new brand is built to match.
Quality for everyone
Quality used to be a function of how big your QA team was. Every developer, on every team, regardless of size, deserves quality. And it should be easy, enjoyable, and built into the foundation of how software gets made.
Today, Momentic is open to every software engineering team. It is free to try yourself. Our philosophy hasn't changed since day one: we don't want anything standing between teams and shipping.
Try it out yourself:
npx @momentic/wizard@latest
GitHub Connections now available for self-hosted Octopus Server customers
Octopus Deploy now supports OIDC-based GitHub Connections for self-hosted instances, eliminating the need for long-lived, user-scoped Personal Access Tokens.
Summary
Decoder
- OIDC (OpenID Connect): An authentication layer on top of OAuth 2.0 that allows clients to verify identity based on an authentication performed by an authorization server.
- PAT (Personal Access Token): A static string used as a password to authenticate with GitHub APIs, often associated with a specific user account.
Original Article
As of release 2026.2, GitHub Connections (via the Octopus Deploy App for GitHub) are available to self-hosted Octopus Server customers.
In 2024, we launched the Octopus Deploy App in the GitHub Marketplace. It has been widely adopted, unsurprisingly, as it is the most seamless way to connect Octopus Deploy and GitHub. At launch, however, it did not support self-hosted Octopus Server customers.
This means until now, if you self-host your Octopus instance and you integrate with GitHub, you’ve been doing it with Personal Access Tokens (PATs). PATs work, but they’ve never been ideal for a critical integration like this, suffering from a number of problems:
- Someone leaves the team and their PAT goes with them
- A token quietly expires at 2am and a deployment falls over
- PATs are user-scoped, long-lived, and often have more permissions than they strictly need
We’ve received plenty of feedback requesting support for the Octopus Deploy App for GitHub from our self-hosted customers, and we’re happy to say that as of release 2026.2 it’s available.
What’s shipping
GitHub Connections are now available for self-hosted Octopus Deploy. You can install the Octopus Deploy App for GitHub, connect it to your self-hosted instance, and use it anywhere the product currently requires GitHub credentials.
Under the hood, GitHub Connections use OpenID Connect (OIDC) to exchange a signed token from your Octopus instance for a short-lived GitHub token, scoped to the repositories you’ve granted the app access to. There are no long-lived secrets to store, rotate, or leak. The permissions are managed in GitHub against the app installation, not against a user account. When someone leaves the team, nothing breaks.
What about Octopus instances behind a firewall?
This is the part that took the most thought.
OIDC requires GitHub to fetch the public signing keys from your Octopus instance to verify the tokens it issues. That works fine on Cloud, and it works fine on a self-hosted instance that’s reachable from the public internet. It doesn’t work if your instance lives behind a corporate firewall, on a private network, or otherwise isn’t reachable from github.com, and this is true for many self-hosted installs.
So we’ve added a second option. On the Signing Keys settings page, you’ll find a new Externally Hosted option. You generate your signing keys in Octopus as usual, then publish the public keys to any web host you like, such as an S3 bucket or Azure blob storage, whatever fits your infrastructure, and tell Octopus the OIDC issuer URL where GitHub can fetch them. Octopus signs tokens with that issuer URL, GitHub validates them against the published keys, and your private instance never has to take an inbound connection from the internet.
The signing keys docs walk through the options.
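To picture what an externally hosted issuer looks like from the verifier's side, here is a small Python sketch based on the OpenID Connect Discovery convention, in which a verifier reads a discovery document under the issuer URL and follows its `jwks_uri` to the public keys. The issuer hostname and the `jwks_path` value are placeholders, and only the two fields relevant to key lookup are shown. Octopus generates the real files, so treat the layout here as an assumption and follow the signing keys docs for the exact one.

```python
import json


def discovery_document(issuer, jwks_path=".well-known/jwks"):
    """Build a minimal OpenID Connect discovery document for a static host."""
    issuer = issuer.rstrip("/")
    return {
        "issuer": issuer,
        # Where a verifier would fetch the public signing keys (assumed path).
        "jwks_uri": f"{issuer}/{jwks_path}",
    }


# Hypothetical issuer: a bucket or blob container reachable from the internet.
ISSUER = "https://octopus-keys.example.com"

doc = discovery_document(ISSUER)
# Path at which a verifier following OIDC Discovery looks for the document.
discovery_url = f"{ISSUER}/.well-known/openid-configuration"
print(discovery_url)
print(json.dumps(doc, indent=2))
```

Everything in this sketch is static, public data, which is why an ordinary object store can serve it while the private keys and the Octopus instance stay on the internal network.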
Getting started
GitHub Connections are available to all self-hosted customers from 2026.2. See the GitHub Connections docs for the details of how to configure the integration between Octopus Deploy and GitHub, without Personal Access Tokens.
Note: GitHub Connections work with repositories hosted on github.com. They don’t yet support GitHub Enterprise Server (self-hosted GitHub). If your repositories live on GitHub Enterprise Server, you’ll need to keep using Personal Access Tokens for now.
If you use GitHub Enterprise Server, and this is important to you, please register your interest on the roadmap card.
Happy deployments!
Amazon ECS introduces new high-resolution metrics for faster service auto scaling
Amazon ECS now supports 20-second high-resolution metrics for auto scaling, accelerating scale-out times by 76% to handle rapid traffic spikes.
Summary
Original Article
Amazon ECS introduces new high-resolution metrics for faster service auto scaling
Amazon Elastic Container Service (Amazon ECS) service auto scaling automatically adjusts task counts to meet workload demand with comprehensive scaling policies, including predictive scaling for recurring traffic patterns, scheduled scaling for planned events, and target tracking to scale dynamically on real-time metrics.
You can choose proactive scaling by using predictive scaling (automatic) and scheduled scaling (customer-defined), or reactive scaling by using target tracking with just a target to scale on. Amazon ECS service auto scaling adjusts the number of tasks in an ECS service based on Amazon CloudWatch metrics, such as average CPU/Memory usage, request count per target, a custom metric such as queue depth, or demand surges by using advanced machine learning (ML) algorithms.
With today’s launch, Amazon ECS service auto scaling now detects and responds to load changes faster with support for high-resolution (20-second) metrics and metric publishing optimizations. In AWS benchmarking tests, time to trigger scale-out improved from 363 seconds to 86 seconds (76% faster, 4.2x), and total time to scale and provision new tasks improved from 386 seconds to 109 seconds (72% faster, 3.5x).
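The percentages and multipliers in the benchmark figures above follow directly from the before and after timings. The short Python check below recomputes them; the `improvement` helper is written for this example only.

```python
def improvement(before_s, after_s):
    """Return (percent reduction in time, speedup factor)."""
    reduction = (before_s - after_s) / before_s * 100
    speedup = before_s / after_s
    return round(reduction), round(speedup, 1)


trigger = improvement(363, 86)      # time to trigger scale-out
end_to_end = improvement(386, 109)  # time to scale and provision new tasks

saved_trigger = 363 - 86        # seconds saved before scale-out triggers
saved_end_to_end = 386 - 109    # seconds saved end to end
```

In both rows the saving is 277 seconds, and the gap between triggering and finishing stays at 23 seconds, so on these figures the gain appears to come from detecting load sooner rather than from provisioning tasks faster.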
This launch delivers three key benefits for your applications:
- Improved performance and reliability: Faster scaling means your application responds sooner to demand surges, reducing latency and failures for end users.
- Right-size without compromise: Depending on the workload, you can reduce baseline task counts because scale-out now happens fast enough to handle traffic spikes without preemptive capacity padding. This directly reduces compute costs while maintaining application performance and availability.
- Simpler scaling configuration: Target tracking with high-resolution metrics delivers the aggressive scaling behavior that previously required custom scaling configurations, such as usage of step-scaling policies. One configuration change replaces custom engineering work.
How it works
To use ECS faster service auto scaling, first enable high-resolution metrics for your ECS service, and then configure a target tracking scaling policy which uses high-resolution metrics. ECS faster service autoscaling works across all compute options on ECS: AWS Fargate, ECS Managed Instances, and Amazon Elastic Compute Cloud (Amazon EC2). You can enable these metrics when you create or update your ECS service in the Amazon ECS console, or using AWS SDKs and tools, and AWS CloudFormation.
When you create a service in the console, add 20-second resolution metrics in the Monitoring configuration section. These metrics incur additional CloudWatch costs, while the standard resolution (60 seconds) is free.
In the Service auto scaling section, check Use service auto scaling and choose Target Tracking for the scaling policy type to use real-time data to scale the number of tasks that your service runs based on demand.
Then, choose a Scaling policy type for the target tracking. You can select ECSServiceAverageCPUUtilizationHighResolution or ECSServiceAverageMemoryUtilizationHighResolution as new metrics.
That’s it – your ECS service will use high resolution metrics for auto scaling.
To update an existing ECS service to use faster auto scaling, first configure high-resolution metrics via Update Service. Once the deployment completes, your service will generate high-resolution metrics. You can then go to the Service and auto scaling tab in your service details and update the scaling policy to use the high-resolution metrics.
That’s all you need. Your ECS service now evaluates scaling decisions at 20-second intervals.
You can also use the AWS Command Line Interface (AWS CLI) to enable new metrics in your ECS service through Application Auto Scaling. To learn more, visit the faster auto scaling documentation.
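For teams scripting this instead of using the console, the sketch below builds the request for Application Auto Scaling's `PutScalingPolicy` operation in Python. The parameter names match that API as exposed by boto3's `put_scaling_policy`. The cluster name, service name, policy name, and 60% target are placeholders, and passing the new metric name as a `PredefinedMetricType` is an assumption drawn from the article rather than something verified here. The service must already have high-resolution metrics enabled and be registered as a scalable target.

```python
def high_res_target_tracking_policy(cluster, service, target_percent=60.0):
    """Build keyword arguments for Application Auto Scaling's PutScalingPolicy."""
    return {
        "PolicyName": f"{service}-cpu-high-res",  # hypothetical policy name
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": target_percent,
            "PredefinedMetricSpecification": {
                # Metric name taken from the article; assumed to be accepted here.
                "PredefinedMetricType": "ECSServiceAverageCPUUtilizationHighResolution",
            },
        },
    }


request = high_res_target_tracking_policy("demo-cluster", "web")
# With boto3 installed and credentials configured, this could be sent as:
# boto3.client("application-autoscaling").put_scaling_policy(**request)
```

Keeping the request as plain data makes it easy to review or diff before anything is sent to AWS.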
Now available
Faster service autoscaling with high-resolution metrics for Amazon ECS is available today. The feature itself has no additional cost, but high-resolution CloudWatch metrics introduce a new pricing dimension. For details, see the CloudWatch pricing page.
Give it a try today and send feedback to AWS re:Post for ECS or through your usual AWS Support contacts.
Analyzing Claude Code usage with CloudWatch and OpenTelemetry
CloudWatch OTLP now supports direct bearer token ingestion, allowing developers to monitor AI coding agent usage without complex IAM setup on local machines.
Summary
Deep Dive
- Bearer Token Authentication: Uses a static API key to authenticate metrics ingestion without needing temporary AWS security credentials.
- OTLP (OpenTelemetry Protocol): A vendor-agnostic format for transmitting telemetry data.
- PromQL: A query language for Prometheus, now natively supported in CloudWatch for metric analysis.
Decoder
- OTLP: A standard specification for how to encode and transport telemetry data (metrics, logs, traces) between services.
Original Article
Analyzing Claude Code usage with CloudWatch and OpenTelemetry
If your engineering organization uses AI coding agents like Claude Code, usage is likely growing faster than your ability to track it. Token consumption, cost per team, and developer productivity are questions that existing dashboards don’t answer, because the telemetry never made it to your observability backend.
With Amazon CloudWatch OpenTelemetry Protocol (OTLP) in General Availability, metrics ingestion is now possible with bearer token authentication. This means tools that emit OTLP can typically ship metrics directly to CloudWatch with a single authorization header. No collectors, no sidecars, no IAM credential wiring on developer machines. Connect the signals in minutes and get per-developer cost attribution, team-level usage analytics, and operational alerting, all queryable with Prometheus Query Language (PromQL).
This post walks through the end-to-end setup for Claude Code. Although the focus here is Claude Code, comprehensive guidance for OpenAI Codex and GitHub Copilot is available on the AWS Observability best practices. This post focuses on bearer token authentication for its simplicity on developer machines. For organizations that require SSO (Single Sign-On) or OIDC (OpenID Connect) for developer authentication, see the guidance repository for federated identity patterns and token refresh helpers.
Bearer token authentication
Bearer tokens (also called CloudWatch metrics API keys) allow tools running outside AWS (like Claude Code on developer laptops) to send metrics to CloudWatch without requiring the AWS SDK or IAM credential chains. Each token is tied to an AWS IAM user scoped exclusively to the CloudWatchAPIKeyAccess managed policy.
Important: Bearer tokens are long-term credentials. This post uses bearer tokens because AI coding agents run on developer laptops outside of AWS. SigV4 authentication would require either a central collector or running a collector process on every developer machine. Both approaches add operational complexity. Bearer tokens eliminate that infrastructure requirement entirely. For workloads running inside AWS where SigV4 with short-term credentials is feasible, prefer that approach for a stronger security posture. The CloudWatch OTLP endpoint requires HTTPS; requests over plain HTTP are rejected. For more information, see CloudWatch OTLP Metrics Bearer Token Auth.
Granularity strategy
Organizations control how granular their telemetry attribution is based on how they provision bearer tokens. At the finest level, each developer gets their own IAM user and bearer token, so attribution is inherent to the token itself. At a coarser level, a single token can be shared across a team or an entire organization, with identity attribution handled through client-side resource attributes instead.
All three approaches (per-developer, per-team, and organization-wide tokens) produce identical dashboards and PromQL queries because attribution is driven by resource attributes, not the token itself. Start with a single shared token to validate the pipeline, then split to per-team or per-developer tokens as your security posture demands. Per-developer tokens are recommended when compliance requires credentials traceable to a named individual or when clean offboarding (revoking a single IAM user) is a hard requirement.
Prerequisites
- An AWS account with permissions to create CloudWatch resources
- AWS CLI v2 installed and configured
- Latest Claude Code version
- A CloudWatch metrics API key (generated below)
Create a bearer token via the console
In the CloudWatch console, navigate to Settings under the Setup section.
- Scroll to API Keys.
- Choose Create.
- Select an API key expiration.
CloudWatch creates the associated IAM user on your behalf with the CloudWatchAPIKeyAccess policy attached.
Create a bearer token via the CLI
Alternatively, create a token with the CLI using the following commands:
# Create an IAM user for CloudWatch metrics ingestion
aws iam create-user --user-name cloudwatch-metrics-api-key-user
# Attach the CloudWatchAPIKeyAccess managed policy
aws iam attach-user-policy \
--user-name cloudwatch-metrics-api-key-user \
--policy-arn arn:aws:iam::aws:policy/CloudWatchAPIKeyAccess
# Create a service-specific credential for CloudWatch metrics ingestion
aws iam create-service-specific-credential \
--user-name cloudwatch-metrics-api-key-user \
--service-name cloudwatch.amazonaws.com
The response includes the ServiceCredentialSecret field, which is the bearer token value. Store it securely in AWS Secrets Manager or your organization’s vault solution, and never commit it to version control. For automated key rotation, use AWS Secrets Manager rotation with a Lambda function.
Client-side configuration
With the bearer token set up, you can now configure Claude Code to export metrics. This approach uses client-side resource attributes set by each developer (or distributed via profile management).
# Retrieve token from Secrets Manager
BEARER_TOKEN=$(aws secretsmanager get-secret-value \
--secret-id cloudwatch-otlp-bearer-token \
--query SecretString \
--output text)
export CLAUDE_CODE_ENABLE_TELEMETRY=1
export OTEL_METRICS_EXPORTER=otlp
export OTEL_EXPORTER_OTLP_PROTOCOL=http/json
export OTEL_EXPORTER_OTLP_ENDPOINT="https://monitoring.<AWS_REGION>.amazonaws.com"
export OTEL_RESOURCE_ATTRIBUTES="user.id=$(whoami),user.email=${USER_EMAIL},team.id=${TEAM:-engineering},cost_center=${COST_CENTER:-default},department=${DEPARTMENT:-engineering},environment=${ENV:-dev}"
# Security Note: Environment variables can be exposed through process listings and shell history.
export OTEL_EXPORTER_OTLP_HEADERS="Authorization=Bearer ${BEARER_TOKEN}"
# control export frequency (2s, for testing)
export OTEL_METRIC_EXPORT_INTERVAL=2000
Replace <AWS_REGION> with your target Region (for example, us-east-1, eu-west-1).
The OTEL_RESOURCE_ATTRIBUTES line attaches identity dimensions to every metric. Developers set these attributes directly. They become the PromQL labels for filtering and grouping in dashboards and alerts. Use whatever attributes your organization needs. The key requirement is consistency across your fleet so aggregations work.
| Attribute | PromQL label | Purpose | Example |
|---|---|---|---|
| user.id | @resource.user.id | Per-developer attribution | jdoe |
| user.email | @resource.user.email | Per-developer attribution (email) | jdoe@acme.com |
| team.id | @resource.team.id | Team-level aggregation | platform-eng |
| cost_center | @resource.cost_center | Finance/chargeback grouping | CC-4200 |
| department | @resource.department | Org-level rollup | engineering |
| environment | @resource.environment | Distinguish dev/staging/prod usage | production |
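Because aggregations only work when every developer emits the same attribute keys, it can help to generate the OTEL_RESOURCE_ATTRIBUTES value from one shared helper instead of hand-editing it. The following is a minimal sketch in Python, not part of the original setup: the function name build_resource_attributes and the validation rule are assumptions made for illustration, while the keys and fallback values mirror the shell export shown earlier.

```python
# Keys come from the attribute table; the fallback values mirror the shell
# defaults in the export line (team.id=engineering, cost_center=default, ...).
DEFAULTS = {
    "team.id": "engineering",
    "cost_center": "default",
    "department": "engineering",
    "environment": "dev",
}


def build_resource_attributes(user_id, user_email, overrides=None):
    """Return a comma-separated key=value string for OTEL_RESOURCE_ATTRIBUTES.

    Hypothetical helper: it enforces one key set and rejects values that
    would break the key=value,key=value format.
    """
    attrs = {"user.id": user_id, "user.email": user_email, **DEFAULTS}
    attrs.update(overrides or {})
    for key, value in attrs.items():
        # Commas and equals signs delimit the format, so a raw one inside a
        # value would corrupt the whole list; empty values are rejected too.
        if not value or any(ch in value for ch in ",= "):
            raise ValueError(f"invalid value for {key!r}: {value!r}")
    return ",".join(f"{key}={value}" for key, value in attrs.items())


example = build_resource_attributes(
    "jdoe", "jdoe@acme.com", {"team.id": "platform-eng"}
)
print(example)
```

The printed string can be exported as OTEL_RESOURCE_ATTRIBUTES by whatever profile management tool distributes the rest of the configuration.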
Verify metrics are flowing
After setting the environment variables, run a short Claude Code session:
# Start a brief session
claude -p "Let's conquer the world" --max-turns 1
To verify metrics are flowing, open CloudWatch Query Studio and enter claude_. You will see a few metrics such as claude_code.token.usage which tracks the number of tokens used.
Sample usage dashboard
Regardless of the granularity strategy described previously, a CloudWatch dashboard is available that uses PromQL to query Claude Code telemetry data. As long as your resource attributes follow the semantic conventions described in the client-side configuration section, all values appear in the dashboard automatically.
Deploy the pre-built dashboard:
# Download the dashboard definition
curl -o claude-code-dashboard.json https://raw.githubusercontent.com/aws-observability/aws-observability-accelerator/main/artifacts/cloudwatch-dashboards/claude-code/claude-code.json
aws cloudwatch put-dashboard \
--dashboard-name claude-code-usage \
--dashboard-body file://claude-code-dashboard.json \
--output off \
--region <AWS_REGION>
Alerting
Every panel in the dashboard is backed by a PromQL query. To create an alarm from any panel:
- Open the panel you are interested in (for example, token usage by user).
- To open the underlying query, choose View in Query Studio.
- Choose Create alarm directly from Query Studio.
- Adjust the query or thresholds as needed.
Cost estimate
Consider a 200-developer organization where each developer runs ~20 sessions per day, each session emits ~7 metric data points with resource attributes, and developers are active ~22 days/month.
A typical OTLP data point with 10-15 attributes is 300-600 bytes. Using 450 bytes as a midpoint:
200 developers × 20 sessions/day × 7 metrics × 450 bytes = 12.6 MB/day
12.6 MB/day × 22 days ≈ 277 MB/month ≈ 0.28 GB/month
At $0.50/GB ingestion, that is ~$0.14/month for the base case. Even with high-cardinality metrics (100x volume), total ingestion stays under $14/month. PromQL queries in the CloudWatch console are free.
The total cost for 200 developers, in this example, would be under $15/month.
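To rerun the estimate with your own headcount or session volume, the arithmetic above can be wrapped in a few lines of Python. This is a sketch only: the function name is hypothetical, the $0.50/GB rate is the figure quoted in this post, and decimal gigabytes (10^9 bytes) are assumed.

```python
def monthly_ingestion_cost(developers, sessions_per_day, metrics_per_session,
                           bytes_per_point, active_days, price_per_gb):
    """Return (GB ingested per month, ingestion cost per month in dollars)."""
    daily_bytes = (developers * sessions_per_day
                   * metrics_per_session * bytes_per_point)
    monthly_gb = daily_bytes * active_days / 1e9  # decimal GB (assumption)
    return monthly_gb, monthly_gb * price_per_gb


# Base case from the text: 200 developers, 20 sessions/day, 7 data points,
# 450 bytes each, 22 active days, $0.50 per GB.
base_gb, base_cost = monthly_ingestion_cost(200, 20, 7, 450, 22, 0.50)
print(f"{base_gb:.3f} GB/month, ${base_cost:.2f}/month")

# High-cardinality case: 100x the data points per session.
high_gb, high_cost = monthly_ingestion_cost(200, 20, 700, 450, 22, 0.50)
print(f"{high_gb:.1f} GB/month, ${high_cost:.2f}/month")
```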
Cleanup
Important: CloudWatch retains metrics up to 15 months at no charge. Delete alarms and IAM resources to avoid ongoing costs and security exposure.
Follow these commands to remove all the resources you have created:
# Delete Dashboard
aws cloudwatch delete-dashboards --dashboard-names claude-code-usage --region <AWS_REGION>
# Delete CloudWatch Alarms
aws cloudwatch delete-alarms --alarm-names <alarm-name> --region <AWS_REGION>
# Delete service-specific credential
aws iam delete-service-specific-credential --user-name cloudwatch-metrics-api-key-user --service-specific-credential-id <credential-id>
# Detach policy from IAM user
aws iam detach-user-policy --user-name cloudwatch-metrics-api-key-user --policy-arn arn:aws:iam::aws:policy/CloudWatchAPIKeyAccess
# Delete IAM user
aws iam delete-user --user-name cloudwatch-metrics-api-key-user
# If using Secrets Manager, delete the secret
aws secretsmanager delete-secret --secret-id <secret-name> --region <AWS_REGION>
Conclusion
In this post, you configured Claude Code to export OpenTelemetry metrics to Amazon CloudWatch using bearer token authentication, deployed a PromQL-powered dashboard for cost and usage visibility, and set up alerting for spend anomalies and adoption regression.
Build and Deploy a Remote MCP Server to GKE in 30 Minutes
You can now deploy remote Model Context Protocol (MCP) servers on GKE Autopilot to centralize AI tool access and improve service scalability.
Summary
Decoder
- MCP (Model Context Protocol): An open-standard protocol introduced by Anthropic that allows LLMs to interact with external tools and data sources consistently.
- Streamable HTTP: An MCP transport layer allowing server/client communication over standard HTTP connections.
Original Article
Build and Deploy a Remote MCP Server to GKE in 30 Minutes
Integrating context from tools and data sources into LLMs can be challenging, which impacts the ease of development for AI agents. To address this challenge, Anthropic introduced the Model Context Protocol (MCP), which standardizes how applications provide context to these models. Developers often want to build an MCP server for their APIs to make them available to fellow developers, allowing them to use it as context in their own applications. Google Kubernetes Engine (GKE) provides a scalable, reliable, and secure environment to deploy these remote MCP servers.
This guide shows the straightforward process of setting up a secure remote MCP server on GKE.
MCP transports
The Model Context Protocol follows a client-server architecture. It initially only supported running the server locally using the stdio transport. The protocol has since evolved and now supports remote access transports, specifically Streamable HTTP.
With Streamable HTTP, the server operates as an independent process that can handle multiple client connections. This transport uses HTTP POST and GET requests. The server must provide a single HTTP endpoint path that supports both POST and GET methods, such as https://example.com/mcp. You can learn more about the different transports in the official documentation.
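To make the transport concrete, the sketch below builds (but does not send) the kind of POST request a client issues to that single endpoint. MCP messages are JSON-RPC 2.0, and tools/call is the method used to invoke a tool. The helper name build_mcp_post is an assumption for illustration, the add tool is borrowed from the math server built later in this guide, and a real session would begin with an initialize exchange that is omitted here.

```python
import json
import urllib.request

MCP_ENDPOINT = "https://example.com/mcp"  # single endpoint path, as in the text


def build_mcp_post(method, params=None, request_id=1):
    """Build (without sending) a JSON-RPC 2.0 POST for a Streamable HTTP endpoint."""
    payload = {"jsonrpc": "2.0", "id": request_id, "method": method}
    if params is not None:
        payload["params"] = params
    return urllib.request.Request(
        MCP_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            # The server may reply with plain JSON or with an event stream,
            # so the client advertises that it accepts both.
            "Accept": "application/json, text/event-stream",
        },
    )


request = build_mcp_post("tools/call", {"name": "add", "arguments": {"a": 5, "b": 3}})
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

In practice a client library such as FastMCP handles this framing, so the sketch is mainly useful for debugging traffic at the load balancer.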
Benefits of running an MCP server on GKE
Running an MCP server remotely on GKE provides several architecture benefits:
- Scalability: GKE Autopilot is built to handle highly variable traffic. Since MCP Servers are stateless, GKE can scale horizontally to handle spikes in demand efficiently.
- Centralized access: Teams can share access to a centralized MCP server, allowing developers to connect from local machines, Agents or pipelines instead of running redundant local servers. Updates to the central server immediately benefit everyone.
- Enhanced security: The Kubernetes Gateway API combined with SSL certificates provides an easy way to force secure, encrypted traffic. This allows only secure connections to the MCP server, preventing unauthorized access.
Prerequisites
Before starting, ensure the following tools are installed:
- python 3.10 or higher
- uv (for package and project management, see the installation documentation)
- Google Cloud SDK (gcloud)
- kubectl command-line tool
Installation
Prepare environment variables
export PROJECT_ID=$(gcloud config get-value project)
export REGION=us-central1
Create a folder, mcp-on-gke, to store the code for the server and deployment.
mkdir mcp-on-gke && cd mcp-on-gke
Now configure the Google Cloud credentials and set the active project.
gcloud auth login
gcloud config set project $PROJECT_ID
Initiate the GKE Autopilot cluster creation in the background.
gcloud container clusters create-auto mcp-cluster \
--region $REGION \
--release-channel rapid \
--async
Use uv to create a project, which will generate a pyproject.toml file.
uv init
Math MCP server
Large language models are excellent at non-deterministic tasks, such as generating text, summarizing ideas, and reasoning about concepts. However, they can be unreliable for deterministic tasks like math operations. To solve this, developers can create tools that provide valuable context. Using FastMCP, a framework for building MCP servers in Python, it is possible to create a simple math server with two tools: add and subtract.
First, add FastMCP as a dependency. asyncio is part of the Python standard library, so it does not need to be added separately.
uv add fastmcp
Copy the following code into server.py to create the server.
from fastmcp import FastMCP
from starlette.requests import Request
from starlette.responses import PlainTextResponse
import asyncio
import logging
logger = logging.getLogger(__name__)
logging.basicConfig(format="[%(levelname)s]: %(message)s", level=logging.INFO)
mcp_port = 3000
# Initialize the FastMCP server
server = FastMCP(
    "Math Server",
)
@server.tool()
def add(a: int, b: int) -> int:
    """Add two numbers together."""
    return a + b
@server.tool()
def subtract(a: int, b: int) -> int:
    """Subtract the second number from the first."""
    return a - b
@server.custom_route("/healthz", methods=["GET"])
async def health_check(request: Request) -> PlainTextResponse:
    """Simple health check endpoint that returns a 200 OK response"""
    return PlainTextResponse("OK")
if __name__ == "__main__":
    logger.info(f"MCP server started on port {mcp_port}")
    asyncio.run(
        server.run_async(
            transport="streamable-http",
            host="0.0.0.0",
            port=mcp_port
        )
    )
Testing the MCP server locally
Create the test_mcp_server.py script to connect to and test the MCP server.
from fastmcp import Client
import asyncio
# Connect to the locally running MCP server (plain HTTP; TLS is only added later at the Gateway)
client = Client("http://localhost:3000/mcp")
async def test_remote_server():
    async with client:
        # Basic server interaction
        await client.ping()
        # List available operations
        tools = await client.list_tools()
        print(f"Available tools: {tools} \n")
        # Execute add operation
        result = await client.call_tool("add", {"a": 5, "b": 3})
        print(f"Result of addition: {result} \n")
        # Execute subtract operation
        result = await client.call_tool("subtract", {"a": 5, "b": 3})
        print(f"Result of subtraction: {result} \n")
if __name__ == "__main__":
    asyncio.run(test_remote_server())
Run the MCP server locally to test the connection:
uv run server.py
Then, from a second terminal in the same folder, run the test script:
uv run test_mcp_server.py
Building the container image
First, prepare the Dockerfile:
FROM python:3.10-slim
COPY --from=ghcr.io/astral-sh/uv:0.4.15 /uv /bin/uv
WORKDIR /app
COPY pyproject.toml .
COPY server.py .
RUN uv sync
CMD ["uv", "run", "server.py"]
Set up Artifact Registry
gcloud artifacts repositories create mcp-repo --repository-format=docker --location=$REGION
Build and push the image in parallel
gcloud builds submit --tag $REGION-docker.pkg.dev/$PROJECT_ID/mcp-repo/math-mcp-server:latest
Deploying to GKE with Gateway API and SSL
Create a deployment.yaml file to define the Kubernetes Deployment and Service.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
      - name: mcp-server
        image: $REGION-docker.pkg.dev/$PROJECT_ID/mcp-repo/math-mcp-server:latest
        ports:
        - containerPort: 3000
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"
          limits:
            memory: "512Mi"
            cpu: "500m"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 3000
          initialDelaySeconds: 15
          periodSeconds: 20
        readinessProbe:
          httpGet:
            path: /healthz
            port: 3000
          initialDelaySeconds: 5
          periodSeconds: 10
---
apiVersion: v1
kind: Service
metadata:
  name: mcp-service
spec:
  selector:
    app: mcp-server
  ports:
  - port: 80
    targetPort: 3000
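Once the cluster has finished provisioning, fetch its credentials so kubectl can reach it:
gcloud container clusters get-credentials mcp-cluster --region $REGION
Then apply the configuration. kubectl does not expand shell variables, so substitute $REGION and $PROJECT_ID in the image field first, for example with envsubst:
envsubst < deployment.yaml | kubectl apply -f -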
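The image field in deployment.yaml contains the $REGION and $PROJECT_ID placeholders, and kubectl passes such placeholders through literally instead of expanding them. One way to render the manifest before applying it is sketched below using Python's standard string.Template. The helper name render_manifest and the sample values are assumptions for illustration, and any other templating tool would work equally well.

```python
from string import Template

# The one line of deployment.yaml that carries placeholders.
MANIFEST_LINE = (
    "image: $REGION-docker.pkg.dev/$PROJECT_ID/mcp-repo/math-mcp-server:latest"
)


def render_manifest(text, region, project_id):
    """Replace $REGION and $PROJECT_ID in a manifest string.

    Template.substitute raises KeyError when the text contains a placeholder
    with no supplied value, so a forgotten variable fails loudly instead of
    reaching the cluster as a literal "$NAME".
    """
    return Template(text).substitute(REGION=region, PROJECT_ID=project_id)


rendered = render_manifest(MANIFEST_LINE, "us-central1", "my-project")
print(rendered)
```

The rendered text can then be written to a file or piped to kubectl apply -f -.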
To secure the connection, use a Google-managed SSL certificate and attach it to a Gateway API resource. First, reserve a static IP address for your load balancer:
gcloud compute addresses create mcp-server-ip --global
export MCP_SERVER_IP=$(gcloud compute addresses describe mcp-server-ip --global --format="value(address)")
Create a Google-Managed Certificate.
gcloud compute ssl-certificates create mcp-cert --domains mcp.yourdomain.com --global
Create a gateway.yaml file to provision the load balancer and configure Transport Layer Security (TLS) termination.
apiVersion: gateway.networking.k8s.io/v1beta1
kind: Gateway
metadata:
  name: mcp-gateway
spec:
  gatewayClassName: gke-l7-global-external-managed
  listeners:
  - name: https
    protocol: HTTPS
    port: 443
    tls:
      mode: Terminate
      options:
        networking.gke.io/pre-shared-certs: mcp-cert
  addresses:
  - type: NamedAddress
    value: mcp-server-ip
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: mcp-route
spec:
  parentRefs:
  - name: mcp-gateway
  hostnames:
  - "mcp.yourdomain.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /mcp
    backendRefs:
    - name: mcp-service
      port: 80
---
apiVersion: networking.gke.io/v1
kind: GCPBackendPolicy
metadata:
  name: mcp-backend-policy
spec:
  default:
    sessionAffinity:
      type: CLIENT_IP
  targetRef:
    group: ""
    kind: Service
    name: mcp-service
---
apiVersion: networking.gke.io/v1
kind: HealthCheckPolicy
metadata:
  name: mcp-health
  namespace: default
spec:
  default:
    checkIntervalSec: 15
    timeoutSec: 5
    healthyThreshold: 1
    unhealthyThreshold: 2
    logConfig:
      enabled: false
    config:
      type: HTTP
      httpHealthCheck:
        port: 3000
        requestPath: /healthz
  targetRef:
    group: ""
    kind: Service
    name: mcp-service
Deploying this configuration creates the infrastructure required to route external traffic securely to the MCP server.
kubectl apply -f gateway.yaml
How Netflix Simplified Batch Compute with Kueue
Netflix migrated its batch compute jobs to Kueue, enabling advanced resource sharing and preemption-based scheduling on its Titus container platform.
Summary
Decoder
- Kueue: An open-source Kubernetes-native job queueing system that manages resource quotas and scheduling for batch workloads.
- Preemption: The ability of a scheduler to stop or pause a lower-priority task to free up resources for a higher-priority task.
Original Article
Netflix migrated millions of batch jobs from its homegrown Compute Managed Batch (CMB) system to Kueue, an open-source Kubernetes-native job queueing system, resulting in significantly improved resource utilization across its container platform Titus. The migration, which allowed operators to switch tenants with the click of a button, brought features like preemption-based fair sharing that let teams borrow idle reserved capacity from other tenants while ensuring high-priority workloads can still preempt lower-priority ones when needed.
Firecrawl (GitHub Repo)
Firecrawl now features an autonomous 'Agent' endpoint that navigates and extracts structured web data without needing predefined URLs.
Summary
Deep Dive
- Agent Endpoint: Enables autonomous web browsing and data retrieval based on natural language prompts.
- LLM Integration: Designed for direct integration with AI clients via MCP.
- Output formats: Native support for Markdown, structured JSON, and screenshots.
- Infrastructure: Handles proxy rotation, rate limiting, and JS-heavy rendering internally.
- Licensing: Available as a hosted service or self-hosted under AGPL-3.0.
- Benchmarks: Features a P95 latency of 3.4s across large-scale crawls.
- Tooling: Provides a CLI for direct interaction and SDKs for Python and Node.js.
Decoder
- MCP (Model Context Protocol): An open standard that allows AI assistants to connect to local and remote data sources, enabling them to execute actions or retrieve context securely.
- AGPL-3.0: A copyleft open-source license that requires anyone who modifies and runs the software over a network to make their source code available.
Original Article
🔥 Firecrawl
The API to search, scrape, and interact with the web at scale. 🔥 The web context API to find sources, extract content, and turn it into clean Markdown or structured data your agents can ship with. Open source and available as a hosted service.
Why Firecrawl?
- Industry-leading reliability: Covers 96% of the web, including JS-heavy pages — no proxy headaches, just clean data (see benchmarks)
- Blazingly fast: P95 latency of 3.4s across millions of pages, built for real-time agents and dynamic apps
- LLM-ready output: Clean markdown, structured JSON, screenshots, and more — spend fewer tokens, build better AI apps
- We handle the hard stuff: Rotating proxies, orchestration, rate limits, JS-blocked content, and more — zero configuration
- Agent ready: Connect Firecrawl to any AI agent or MCP client with a single command
- Media parsing: Parse and extract content from web-hosted PDFs, DOCX, and more
- Actions: Click, scroll, write, wait, and press before extracting content
- Open source: Developed transparently and collaboratively — join our community
Feature Overview
Core Endpoints
| Feature | Description |
|---|---|
| Search | Search the web and get full page content from results |
| Scrape | Convert any URL to markdown, HTML, screenshots, or structured JSON |
| Interact | Scrape a page, then interact with it using AI prompts or code |
More
| Feature | Description |
|---|---|
| Agent | Automated data gathering, just describe what you need |
| Crawl | Scrape all URLs of a website with a single request |
| Map | Discover all URLs on a website instantly |
| Batch Scrape | Scrape thousands of URLs asynchronously |
Quick Start
Sign up at firecrawl.dev to get your API key. Try the playground to test it out.
Search
Search the web and get full content from results.
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
search_result = app.search("firecrawl", limit=5)
Scrape
Get LLM-ready data from any website — markdown, JSON, screenshots, and more.
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
result = app.scrape('firecrawl.dev')
Interact
Scrape a page, then interact with it using AI prompts or code.
from firecrawl import Firecrawl
app = Firecrawl(api_key="fc-YOUR_API_KEY")
result = app.scrape("https://amazon.com")
scrape_id = result.metadata.scrape_id
app.interact(scrape_id, prompt="Search for 'mechanical keyboard'")
app.interact(scrape_id, prompt="Click the first result")
Power Your Agent
Connect Firecrawl to any AI agent or MCP client in minutes.
Skill
Give your agent easy access to real-time web data with one command.
npx -y firecrawl-cli@latest init --all --browser
MCP
Connect any MCP-compatible client to the web in seconds.
{
"mcpServers": {
"firecrawl-mcp": {
"command": "npx",
"args": ["-y", "firecrawl-mcp"],
"env": {
"FIRECRAWL_API_KEY": "fc-YOUR_API_KEY"
}
}
}
}
More Endpoints
Agent
The easiest way to get data from the web. Describe what you need, and our AI agent searches, navigates, and retrieves it. No URLs required.
Crawl
Crawl an entire website and get content from all pages.
Map
Discover all URLs on a website instantly.
Batch Scrape
Scrape multiple URLs at once.
SDKs
Our SDKs provide a convenient way to use all Firecrawl features and automatically handle polling for async operations (Python, Node.js, Java, Elixir, Rust).
Integrations
Agents & AI Tools: Firecrawl Skill, CLI Skills, Workflows, MCP.
Platforms: Lovable, Zapier, n8n.
Open Source vs Cloud
Firecrawl is open source under the AGPL-3.0 license. The cloud version at firecrawl.dev includes additional features.
License
This project is primarily licensed under the GNU Affero General Public License v3.0 (AGPL-3.0). The SDKs and some UI components are licensed under the MIT License.
It is the sole responsibility of end users to respect websites' policies when scraping. Users are advised to adhere to applicable privacy policies and terms of use. By default, Firecrawl respects robots.txt directives. By using Firecrawl, you agree to comply with these conditions.
Airllm (GitHub Repo)
AirLLM now supports running 405B parameter models on just 8GB of VRAM by layer-wise model decomposition.
Summary
Deep Dive
- Layer-wise Inference: Decomposes LLMs so only one layer at a time resides in memory.
- Memory Efficiency: Runs 70B models on 4GB VRAM and 405B models on 8GB VRAM.
- Quantization: Implements 4-bit and 8-bit block-wise quantization for 3x faster inference.
- Compatibility: Now supports CPU inference and Apple Silicon/MacOS environments.
- Model Support: Compatible with Llama 3.1, Qwen2.5, Mistral, and ChatGLM architectures.
- Disk usage: Requires significant local storage in the Hugging Face cache directory to store the decomposed shards.
Decoder
- Block-wise quantization: A technique that reduces the precision of weights in blocks rather than the entire model, balancing computational speed and accuracy.
- VRAM: Video Random Access Memory; dedicated memory on a graphics card used to store model parameters during inference.
Original Article
AirLLM optimizes inference memory usage, allowing 70B large language models to run inference on a single 4GB GPU card without quantization, distillation, or pruning. You can now also run Llama 3.1 405B on 8GB of VRAM.
Updates
[2024/08/20] v2.11.0: Support Qwen2.5
[2024/08/18] v2.10.1 Support CPU inference. Support non sharded models. Thanks @NavodPeiris for the great work!
[2024/07/30] Support Llama3.1 405B. Support 8bit/4bit quantization.
[2024/04/20] AirLLM supports Llama3 natively already. Run Llama3 70B on 4GB single GPU.
[2023/12/25] v2.8.2: Support MacOS running 70B large language models.
[2023/12/20] v2.7: Support AirLLMMixtral.
[2023/12/20] v2.6: Added AutoModel, automatically detect model type, no need to provide model class to initialize model.
[2023/12/18] v2.5: added prefetching to overlap the model loading and compute. 10% speed improvement.
[2023/12/03] added support of ChatGLM, QWen, Baichuan, Mistral, InternLM!
[2023/12/02] added support for safetensors. Now support all top 10 models in open llm leaderboard.
[2023/12/01] airllm 2.0. Support compressions: 3x run time speed up!
[2023/11/20] airllm Initial version!
Quickstart
1. Install package
First, install the airllm pip package.
pip install airllm
2. Inference
Then, initialize the model with AutoModel, passing in the Hugging Face repo ID of the model or its local path. Inference can then be performed much like with a regular transformers model.
(You can also specify where to save the split, layer-wise model through layer_shards_saving_path when initializing the model.)
from airllm import AutoModel
MAX_LENGTH = 128
# could use hugging face model repo id:
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct")
# or use model's local path...
#model = AutoModel.from_pretrained("/home/ubuntu/.cache/huggingface/hub/models--garage-bAInd--Platypus2-70B-instruct/snapshots/b585e74bcaae02e52665d9ac6d23f4d0dbc81a0f")
input_text = [
'What is the capital of United States?',
#'I like',
]
input_tokens = model.tokenizer(input_text,
return_tensors="pt",
return_attention_mask=False,
truncation=True,
max_length=MAX_LENGTH,
padding=False)
generation_output = model.generate(
input_tokens['input_ids'].cuda(),
max_new_tokens=20,
use_cache=True,
return_dict_in_generate=True)
output = model.tokenizer.decode(generation_output.sequences[0])
print(output)
Note: During inference, the original model will first be decomposed and saved layer-wise. Please ensure there is sufficient disk space in the huggingface cache directory.
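The idea behind the note, decomposing a model into per-layer files and keeping only one layer in memory during the forward pass, can be illustrated with a toy example. The sketch below is not AirLLM's implementation: the "model" is a list of scalar weight and bias pairs, the file format is pickle, and both function names are hypothetical. It only shows why disk space is traded for memory.

```python
import pickle
import tempfile
from pathlib import Path


def save_layers(layers, directory):
    """Decompose a toy model (a list of (weight, bias) pairs) into one file per layer."""
    paths = []
    for index, layer in enumerate(layers):
        path = Path(directory) / f"layer_{index}.pkl"
        path.write_bytes(pickle.dumps(layer))
        paths.append(path)
    return paths


def forward_layerwise(paths, values):
    """Run the forward pass while holding only one layer in memory at a time."""
    for path in paths:
        weight, bias = pickle.loads(path.read_bytes())  # load a single layer
        values = [weight * v + bias for v in values]    # apply it to the activations
        del weight, bias                                # drop it before loading the next
    return values


with tempfile.TemporaryDirectory() as shard_dir:
    shard_paths = save_layers([(2.0, 1.0), (0.5, 0.0), (1.0, -1.0)], shard_dir)
    result = forward_layerwise(shard_paths, [1.0, 2.0])

print(result)  # [0.5, 1.5]
```

Peak memory here is bounded by the largest single layer plus the activations, while the full set of layers has to exist on disk, which is why the cache directory needs room for the decomposed copy.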
Model Compression - 3x Inference Speed Up!
We added model compression based on block-wise quantization, which can speed up inference by up to 3x with almost negligible accuracy loss.
How to enable model compression speed up:
- Step 1. Make sure bitsandbytes is installed: pip install -U bitsandbytes
- Step 2. Make sure your airllm version is later than 2.0.0: pip install -U airllm
- Step 3. When initializing the model, pass the compression argument ('4bit' or '8bit'):
model = AutoModel.from_pretrained("garage-bAInd/Platypus2-70B-instruct",
compression='4bit' # specify '8bit' for 8-bit block-wise quantization
)
What are the differences between model compression and quantization?
Conventional quantization usually has to quantize both weights and activations to deliver a real speedup, which makes it harder to maintain accuracy and to avoid the impact of outliers across all kinds of inputs.
In our case the bottleneck is mainly disk loading, so we only need to make the loaded model smaller. That lets us quantize only the weights, which makes accuracy easier to preserve.
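As an illustration of weight-only, block-wise quantization, the sketch below uses a simple absmax scheme in pure Python: each block of weights gets its own scale, so an outlier only degrades precision inside its own block. This is an assumed, simplified scheme for explanation and not the kernel that bitsandbytes or AirLLM actually uses; the function names and the block size of 4 are hypothetical.

```python
def quantize_blockwise(weights, block_size=4, bits=8):
    """Quantize weights block by block with one absmax scale per block."""
    qmax = 2 ** (bits - 1) - 1  # 127 for 8-bit
    blocks = []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        # Fall back to a scale of 1.0 for an all-zero block to avoid dividing by zero.
        scale = max(abs(w) for w in block) / qmax or 1.0
        blocks.append((scale, [round(w / scale) for w in block]))
    return blocks


def dequantize_blockwise(blocks):
    """Rebuild approximate float weights from (scale, integers) blocks."""
    return [q * scale for scale, quantized in blocks for q in quantized]


weights = [0.1, -0.2, 0.05, 0.4, 10.0, -5.0, 2.5, 0.0]
blocks = quantize_blockwise(weights)
restored = dequantize_blockwise(blocks)

for original, approx in zip(weights, restored):
    print(f"{original:>6} -> {approx:.4f}")
```

The first block is scaled by 0.4/127 and the second by 10/127, so the small weights in the first block keep fine resolution even though the tensor also contains a value of 10.0.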
Configurations
When initializing the model, the following configurations are supported:
- compression: supported options are 4bit and 8bit for 4-bit or 8-bit block-wise quantization; the default, None, means no compression
- profiling_mode: set to True to output timing information; defaults to False
- layer_shards_saving_path: an optional alternative path for saving the split model
- hf_token: a Hugging Face token, which can be provided here when downloading gated models such as meta-llama/Llama-2-7b-hf
- prefetching: overlaps model loading with compute. Turned on by default. For now, only AirLLMLlama2 supports this.
- delete_original: if disk space is limited, set delete_original to true to delete the original downloaded Hugging Face model and keep only the transformed one, saving half of the disk space.
MacOS
Just install airllm and run the code the same as on linux.
- make sure you installed mlx and torch
- only Apple silicon is supported
FAQ
1. MetadataIncompleteBuffer
safetensors_rust.SafetensorError: Error while deserializing header: MetadataIncompleteBuffer
If you run into this error, the most likely cause is that you have run out of disk space. Splitting the model is very disk-intensive. You may need to free or extend disk space, clear the Hugging Face .cache, and rerun.
2. ValueError: max() arg is an empty sequence
Most likely you are loading a QWen or ChatGLM model with the Llama2 class. Try the following:
For QWen model:
from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)
For ChatGLM model:
from airllm import AutoModel #<----- instead of AirLLMLlama2
AutoModel.from_pretrained(...)
3. 401 Client Error....Repo model ... is gated.
Some models are gated and require a Hugging Face API token. You can provide it through hf_token:
model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", hf_token='HF_API_TOKEN')
4. ValueError: Asking to pad but the tokenizer does not have a padding token.
Some models' tokenizers don't have a padding token, so you can either set a padding token or simply turn padding off:
input_tokens = model.tokenizer(input_text,
return_tensors="pt",
return_attention_mask=False,
truncation=True,
max_length=MAX_LENGTH,
padding=False #<----------- turn off padding
)
Citing AirLLM
@software{airllm2023,
author = {Gavin Li},
title = {AirLLM: scaling large language models on low-end commodity computers},
url = {https://github.com/lyogavin/airllm/},
version = {0.0},
year = {2023},
}
GitOps in Practice: How to Design a Scalable CI/CD Pipeline with GitLab and GKE
Scalable GitOps on GitLab and GKE requires pull-based reconciliation and strict separation between application code and environment configurations.
Summary
Deep Dive
- Reconciliation: Prefer pull-based models (Flux/GitLab agent) over push-based kubectl apply for improved security.
- Repository Structure: Use separate repos for code and environment manifests; treat environments as directory overlays rather than branches.
- Authentication: Use Workload Identity Federation instead of static service account JSON keys.
- Secrets: Manage secrets externally using Google Secret Manager and the External Secrets Operator.
- Promotion: Gate environment promotion via merge request approvals instead of pipeline conditionals.
- Observability: Monitor DORA metrics alongside Merge Request rates to gauge pipeline health.
Decoder
- GitOps: A paradigm where the entire infrastructure state is stored in Git, and an automated agent reconciles the real cluster state to match the Git state.
- DORA metrics: Deployment frequency, lead time for changes, change failure rate, and time to restore service—the industry standard for measuring DevOps performance.
- Workload Identity Federation: A secure authentication method that allows Kubernetes pods to access cloud resources using short-lived tokens instead of long-lived secrets.
Original Article
A scalable CI/CD pipeline on GitLab and Google Kubernetes Engine starts with one decision: do you treat the pipeline as a delivery system you design, or as a YAML file you copy from a tutorial? Most teams default to the second. They wire up a .gitlab-ci.yml, push to a cluster, and call it GitOps. It runs — until environments multiply, secrets sprawl, and a single bad merge reaches production.
The gap between “it works on my branch” and “it deploys safely across five environments” is design intent. The architectural choices that separate the two include how state actually syncs to GKE, how you structure branches and promotion, how you keep credentials out of your cluster, and where GitLab Duo earns its place.
What Does GitOps With GitLab and GKE Actually Mean?
GitOps means Git is the single source of truth for your cluster’s desired state, and a controller reconciles the live cluster to match it. The distinction that trips teams up is how that reconciliation happens.
A push-based pipeline runs kubectl apply from a GitLab runner — the pipeline reaches into the cluster. A pull-based model flips it: an agent inside the cluster watches Git and pulls changes. GitLab recommends Flux for GitOps, paired with the GitLab agent for Kubernetes, so the cluster never has to be exposed outside your firewall.
That architecture choice has real consequences. With the pull model, the GitLab agent detects a push to a repository it watches and triggers Flux to reconcile the cluster — no inbound cluster access, no long-lived kubeconfig sitting in a CI variable. For multi-cluster GKE setups, that’s the difference between a defensible security posture and a pile of cluster credentials in your pipeline settings.
So the first design question isn’t “which YAML?” It’s “push or pull?” For anything beyond a single dev cluster, we default to pull-based reconciliation.
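The core of pull-based reconciliation is a comparison between the desired state in Git and the live state in the cluster, run from inside the cluster. The toy sketch below shows that comparison in Python. It is a conceptual illustration only, not how Flux or the GitLab agent is implemented, and the function name, resource names, and image tags are all made up.

```python
def reconcile(desired, live):
    """Return the actions an in-cluster agent would take so live matches desired."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(("create", name))
        elif live[name] != spec:
            actions.append(("update", name))
    for name in live:
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Desired state as read from the Git repository (hypothetical workloads).
desired_state = {
    "api": {"image": "api:1.4.0"},
    "worker": {"image": "worker:2.0.1"},
}
# Live state as observed inside the cluster.
live_state = {
    "api": {"image": "api:1.3.9"},
    "legacy": {"image": "legacy:0.9"},
}

plan = reconcile(desired_state, live_state)
print(plan)  # [('update', 'api'), ('create', 'worker'), ('delete', 'legacy')]
```

Because the loop runs inside the cluster and only reads from Git, nothing outside the cluster needs credentials to apply changes, which is the security property the pull model is chosen for.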
How Should You Structure Branches for Environment Promotion?
Branch strategy is where pipeline design quietly decides your release cadence. The instinct is to mirror environments with long-lived branches — dev, staging, prod — and merge between them. It feels orderly. It also creates drift, merge conflicts, and “what’s actually in staging?” confusion.
A cleaner pattern separates application code from deployment state. Application repositories use trunk-based development: short-lived feature branches, frequent merges to main, small reviewable changes. Environment configuration lives in a separate manifests repository, where each environment is a directory or overlay, not a branch.
Why this matters for scale: when environments are folders, promoting a release is a commit that bumps an image tag in the staging/ overlay, then the prod/ overlay. The history is linear and auditable. When environments are branches, promotion is a merge — and merges carry whatever else accumulated on the source branch. Small, frequent changes are also what the research rewards: the 2024 DORA report found elite performers deploy on demand with change failure rates around 5%, which is far easier to hit when each change is small enough to reason about.
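When environments are folders, a promotion is nothing more than a one-line change to the image tag in an overlay file. The sketch below shows what a CI job might do before committing that change. It assumes a plain-text manifest with a single image line per service; the function name, registry path, and tags are hypothetical, and teams using Kustomize or a similar tool would typically rely on its own image-override mechanism instead.

```python
import re


def bump_image_tag(manifest_text, image_name, new_tag):
    """Replace the tag on the single image line that ends with image_name."""
    pattern = re.compile(rf"(image:\s*\S*{re.escape(image_name)}):[\w.\-]+")
    updated, count = pattern.subn(
        lambda match: f"{match.group(1)}:{new_tag}", manifest_text
    )
    if count != 1:
        # Refuse to guess when the overlay has zero or several matching lines.
        raise ValueError(
            f"expected exactly one image line for {image_name!r}, found {count}"
        )
    return updated


staging_overlay = (
    "containers:\n"
    "- name: api\n"
    "  image: registry.example.com/shop/api:1.4.0\n"
)

promoted = bump_image_tag(staging_overlay, "api", "1.4.1")
print(promoted)
```

Committing the result to the staging/ overlay, and later the same change to prod/, yields the linear, reviewable history described above.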
How Does Code Get Promoted Across Environments?
Promotion logic is the heart of a multi-environment pipeline, and GitLab gives you the primitives to make it explicit rather than implicit.
Use GitLab environments to model dev, staging, and production as first-class objects, then put protected environments in front of the sensitive ones. Protected environments restrict who can deploy and let you require approvals before a job touches production. A typical flow: a merge to main auto-deploys to dev, a manual when: manual job promotes to staging, and production requires a protected-environment approval from a release owner.
In a pull-based GitOps model, “deploy” really means “commit the new image tag to the manifests repo.” The CI pipeline builds and scans the image, then opens or updates a merge request against the environment overlay. Flux does the actual apply. This keeps your promotion gates in code review — where they’re visible — instead of buried in pipeline conditionals nobody reads.
The payoff is traceability. Every production change is a reviewed merge request with an approver, a timestamp, and a diff. When something breaks at 2 a.m., you’re reading a Git history, not reverse-engineering a runner log.
How Do You Manage Secrets in GKE Without Storing Service Account Keys?
This is where default config does the most damage. The path of least resistance — exporting a Google Cloud service account key as JSON and mounting it as a Kubernetes Secret — is exactly the pattern Google explicitly discourages, because long-lived keys leak, end up in repos, and rarely get rotated.
The design-intent answer is Workload Identity Federation for GKE. It lets a Kubernetes service account authenticate to Google Cloud APIs using short-lived, automatically rotated tokens tied to the pod’s identity — no key file anywhere in the cluster. You bind a Kubernetes service account to IAM permissions directly, and access is governed by IAM policy instead of in-cluster RBAC and static secrets.
For application secrets — database passwords, API tokens — pair Workload Identity with the External Secrets Operator reading from Google Secret Manager. The secret lives in Secret Manager, the operator pulls it using the federated identity, and your Git repo contains a reference, never the value. Nothing sensitive sits in your manifests or your pipeline variables.
We’ll be blunt: if your GitLab CI variables hold a GOOGLE_APPLICATION_CREDENTIALS JSON blob, that’s the first thing Cloudfresh removes in a pipeline review. It’s the single most common avoidable risk we see on GKE.
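A quick way to look for this pattern is to check whether any CI variable value parses as a service account key file. The sketch below is a heuristic under stated assumptions: it expects the variables as a plain dict (how you export them is up to you), and it relies on exported key files being JSON objects with "type": "service_account" and a "private_key" field. The function names and sample values are made up.

```python
import json


def looks_like_service_account_key(value: str) -> bool:
    """Heuristic: does this string parse as a Google service account key?"""
    try:
        data = json.loads(value)
    except (TypeError, ValueError):
        return False  # not JSON at all, so not a key file
    return (
        isinstance(data, dict)
        and data.get("type") == "service_account"
        and "private_key" in data
    )


def find_key_variables(ci_variables: dict) -> list:
    """Return the names of variables whose values look like key files."""
    return sorted(
        name
        for name, value in ci_variables.items()
        if looks_like_service_account_key(value)
    )


# Made-up variables; the key material is a placeholder.
example_variables = {
    "GOOGLE_APPLICATION_CREDENTIALS": json.dumps(
        {
            "type": "service_account",
            "project_id": "example-project",
            "private_key": "PLACEHOLDER",
        }
    ),
    "GCP_PROJECT": "example-project",
    "DEPLOY_REGION": "europe-west1",
}

flagged = find_key_variables(example_variables)
print(flagged)
```

Any name this flags is a candidate for replacement with Workload Identity Federation rather than for rotation alone.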
Where Does GitLab Duo Fit Into Pipeline Design?
GitLab Duo’s value here isn’t writing your whole pipeline — it’s cutting the two costs that eat platform engineers alive: writing correct CI config and diagnosing broken jobs.
For authoring, Duo offers CI/CD component generation, helping you draft pipeline configuration that follows valid syntax instead of trial-and-error YAML. For failures, GitLab Duo Root Cause Analysis analyzes a failed job’s logs, identifies the likely cause, and proposes a fix — instead of you scrolling through a dense log to find that an alpine image was missing the Go runtime.
Used well, Duo shifts debugging left. When a misconfigured job fails in dev, you get an explanation and a suggested fix in the merge request before that change ever moves toward staging. That’s the practical version of “catch it before production” — not prediction magic, but faster, clearer feedback at the earliest gate.
How Do You Know the Pipeline Is Actually Working?
A scalable pipeline is one you can measure, and the four DORA metrics are the standard: deployment frequency and lead time for changes measure throughput; change failure rate and time to restore measure stability. Read them together, not individually — high deploy frequency with a climbing failure rate means your gates are too loose.
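To make the four metrics concrete, here is a minimal sketch that computes them from a list of deployment records. The Deployment record and dora_summary function are hypothetical, and measuring time to restore from the moment of deployment is a simplifying assumption; GitLab's built-in reporting may define the metrics differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored


def dora_summary(deployments: list, window_days: int) -> dict:
    """Summarise throughput and stability over a window of deployments."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "median_time_to_restore": median(restores) if restores else None,
    }


# Made-up data: four deployments in four days, one of which failed.
t0 = datetime(2025, 1, 6, 9, 0)
deployments = [
    Deployment(t0, t0 + timedelta(hours=2)),
    Deployment(t0 + timedelta(days=1), t0 + timedelta(days=1, hours=4)),
    Deployment(
        t0 + timedelta(days=2),
        t0 + timedelta(days=2, hours=6),
        failed=True,
        restored_at=t0 + timedelta(days=2, hours=7),
    ),
    Deployment(t0 + timedelta(days=3), t0 + timedelta(days=3, hours=8)),
]

summary = dora_summary(deployments, window_days=4)
print(summary)
```

Returning all four values from one function mirrors the advice to read them together: a rising deploys_per_day only looks healthy next to a stable change_failure_rate.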
GitLab surfaces these alongside its own Merge Request Rate, which rewards exactly the small-batch habit a good branch strategy enables. If your lead time is creeping up, the cause is usually structural: a promotion gate that’s manual when it should be automated, or a test stage that’s serial when it could run in parallel. Measurement turns “the pipeline feels slow” into a specific job you can fix.
Designing for Intent, Not Defaults
Most teams use a fraction of what GitLab and GKE can do together because they accept the defaults: push-based deploys, branch-per-environment, keys mounted as secrets, no measurement. Each shortcut is fine at small scale and expensive at large scale.
Designing with intent means choosing pull-based reconciliation, separating code from deployment state, gating promotion in code review, replacing keys with Workload Identity, and watching DORA metrics to know it’s working. None of these are exotic — they’re decisions, made deliberately, early.
Anthropic Wants Claude to Be Your New Slack Coworker
Anthropic is launching Claude Tag, a new feature that allows its AI to actively participate in Slack channels on behalf of users.
Summary
Original Article
Claude Tag is a new feature that lets users create a chatbot that interacts on their behalf on Slack. It can be used to monitor Slack activity, send alerts about important posts, drop comments in conversations, and fix issues with code. Claude Tag will replace the existing Claude Slack app. It is being rolled out to Anthropic's enterprise and team subscription users.
With Starfall, SpaceX eyes an edge in global cargo delivery from orbit
SpaceX has launched a secretive saucer-shaped reentry pod named 'Starfall' designed for point-to-point cargo delivery from orbit.
Summary
Decoder
- Point-to-point delivery: A logistics model where cargo is moved directly between two locations via suborbital or orbital trajectories to significantly reduce transit times compared to traditional shipping.
Original Article
A SpaceX Falcon 9 rocket lifted off on Tuesday to test a new reentry vehicle designed to deliver cargo anywhere in the world from low-Earth orbit.
The company developed the new saucer-shaped reentry pod, called Starfall, under a veil of secrecy. Its purpose is to support the “transport and delivery of goods through space,” according to an environmental assessment published by the Federal Aviation Administration last month.
The first demonstration of the Starfall vehicle began at 6:53 am EDT (10:53 UTC) with liftoff aboard a Falcon 9 rocket from Cape Canaveral Space Force Station, Florida. At least one Starfall reentry pod rode to orbit on the Falcon 9, perhaps alongside another undisclosed payload. After circling the planet two times, the Falcon 9’s upper stage was expected to release Starfall for atmospheric reentry, targeting a parachute-assisted splashdown in the Pacific Ocean around 800 miles west of California.
That is according to airspace and maritime warning notices telling pilots and sailors to steer clear of the Starfall splashdown zone.
SpaceX’s official webpage for the mission includes a timeline of key events during the launch, but it lacks additional information about the payload or the exact sequence of events for the Falcon 9’s upper stage.
“Today’s mission includes a demo of a new vehicle that will enable affordable, routine access to the microgravity environment for scientific research and in-space manufacturing,” SpaceX posted on X. “After demonstrating controlled flight, the spacecraft will splash down in the Pacific Ocean.”
From here to anywhere
Most of what we know about Starfall comes from the FAA’s environmental assessment. In that document, the FAA writes that Starfall will “enable point-to-point delivery of critical cargo through space on rapid timelines” and provide access to space for commercial in-space manufacturing, a nascent market that, so far, is largely geared toward pharmaceuticals. Starfall could bring those materials back to Earth for commercial use. A private company named Varda Space Industries is already working in this area.
The FAA’s environmental review approved SpaceX’s proposal for two Starfall reentry demonstrations. It did not specify if these demos would happen on one or two missions. SpaceX intends to recover the vehicle, including parachutes and heat shields, “to the maximum extent practicable,” the FAA said.
The Starfall vehicle is cylindrical in shape, with a diameter of 10.2 feet (3.1 meters) and a height of 2.5 feet (0.75 meters). Starfall weighs approximately 4,600 pounds (2.1 metric tons) with a capacity for about 2,200 pounds (1 metric ton) of payload, for a total weight of 6,800 pounds (3.1 metric tons). Designed exclusively for cargo, Starfall is smaller than SpaceX’s human-rated Crew Dragon spacecraft, which ferries astronauts to and from the International Space Station.
The first Starfall Demo mission will spend a few hours in low-Earth orbit, but the vehicle could also fly on shorter suborbital trajectories after launching on either Falcon 9 or the much larger Starship rocket. This version of Starfall is not capable of de-orbiting itself but instead relies upon its launch vehicle to guide it back into the atmosphere. After separating from its rocket carrier, the disc-shaped vehicle uses compressed nitrogen gas to orient its heat shield for reentry.
So who might use something like Starfall? The US military is one obvious answer. The Pentagon is already working with SpaceX on a concept named Rocket Cargo or Point-to-Point Delivery, which would use Starship to deliver massive loads of equipment and supplies to far-flung locations in less than an hour. Starship is an enormous vehicle, nearly 20 stories tall and 30 feet wide, that must land at prepared sites. Starfall could prove to be a more versatile option for lighter deliveries.
The military has also signed agreements with Blue Origin, Rocket Lab, and Anduril for studies and development of technologies for global cargo delivery from space. Notwithstanding Starship, which is still undergoing experimental flight tests, SpaceX may have an early advantage with the Starfall delivery vehicle.
War by Other Means
The rise of robotic warfare is shifting political power away from citizens and toward the private firms that supply and control military technology.
Summary
Deep Dive
- The shift from human-centric to machine-centric warfare alters the political cost of conflict.
- Military procurement is moving toward an 'outsourced' model, relying on private expertise to manage complex software like Project Maven.
- Companies gain structural leverage over state decisions by controlling the maintenance and evolution of essential military systems.
- Nationalizing defense technology companies is increasingly impractical due to the need to preserve innovative, market-driven incentives.
- States that fail to adapt their social contract to this new reality risk being beholden to foreign or corporate capital.
Decoder
- Forward-deployed engineer: Software engineering model popularized by Palantir where private sector engineers work directly within client organizations to customize and maintain complex integrated systems.
- Dual-use technology: Innovations developed for commercial or civil applications that also have significant military utility.
Original Article
Robotic warfare is shifting the source of state power away from citizens to firms. The transition will produce a new social contract.
In December 2024, the Ukrainian National Guard’s 13th Khartiia Brigade carried out a combined ground and air assault near the town of Lyptsi. Air assets coordinated with dispersed ground forces bounding from cover to cover. Dozens of attackers overran Russian positions, cleared mines in the vicinity, and laid down a defensive perimeter around the captured territory. But not a single Ukrainian was present on the battlefield.
To achieve this, soldiers piloted a mix of unmanned ground vehicles mounted with machine-gun turrets, aerial drones bristling with grenades and assault rifles, and more conventional kamikaze drones. This experiment allowed the Ukrainians to trade valuable human advantages like spatial awareness and tactical flexibility for the assurance that their only casualties would be robotic. For a country suffering severe manpower shortages, such a tradeoff was welcome amid a gruesome war of attrition fought against an adversary with more than three times its population.
The growing use of drones is both a cause and a consequence of an ongoing transformation of military power. As robotic soldiers mature on the battlefield, defense planners are beginning to rethink the tradeoffs they pose in combat in favor of the labor-saving benefits they provide. One vision of “America First” foreign policy argues that downsizing American personnel abroad could be amply made up for by increasing the use of unmanned munitions, driving costs down and bringing soldiers home without weakening deterrence.
Combined with recent and forthcoming improvements to industrial automation, all signs point toward governments relying on a much smaller pool of human capital for labor and war. Accordingly, this means that a fundamental source of political power has begun to shift from the people toward the firms that make those machines. This is leading to an erosion of the popular constraints that have historically disciplined the use of force, and an ever-greater insulation of the state from revolutions and popular uprisings.
For some countries like Ukraine, this reliance is born of immediate necessity. Yet even for powerful countries like the United States, the growing reliance on commercial industry in areas like robotics and AI raises fundamental questions about the basis of sovereignty and the future of the liberal-democratic social contract.
The Dawn of Robotic Warfare
Although humans remain predominant on the battlefield, many of the advantages they held have shifted in favor of robots. For instance, while humans struggle to quickly communicate detailed, accurate information in a chaotic environment, data transmission between robots is immediate and precise. Target assessment by ground units can be relayed to air support almost instantaneously, and automated command centers can process more information than any human operator. Machines do all of this without becoming sick, hungry, tired, confused, distracted, or disobedient.
A gap between human and machine capabilities remains, however. Most systems require persistent wireless links, and electronic jamming has become widespread on the battlefield. Both Ukraine and Russia now use lightweight fiber-optic tethers to maintain physical connections, limiting range and mobility. Full autonomy has only just begun to impact military strategy, but its use is increasing—both sides of the war in Ukraine have fielded fully autonomous drones, which are confirmed to have taken the lives of human combatants. As the technical constraints are overcome, the military “tooth” may increasingly become machine-run, with humans receding into a “tail” involving maintenance, refinement, and training.
Slowly but surely, the absence of humans on the battlefield is becoming normalized, creating structural preconditions for an eventual handover to full autonomy. On its face, this trend seems humane—fewer soldiers, fewer deaths—but it also risks changing the political costs of war, which becomes more palatable, even routine. “Forever wars” emerge from such a disjunction between military effort and political consequence.
While human soldiers remain more versatile than their robotic counterparts, the tradeoff between the two depends on the marginal value a state places on military effectiveness relative to the value it places on its capital and labor reserves. Ukraine, for example, faces an exceptionally challenging demographic situation as outmigration, aging, and the casualties of war bleed the country dry. In this situation, even inferior robotic soldiers will be a worthwhile investment if it means preserving a limited supply of workers and conscripts. Conversely, a comparatively labor-rich state like Russia may choose to forgo fielding more advanced robotic soldiers. While Russia faces well-documented long-term demographic decline, its willingness to mass-mobilize military reserves creates a functional labor surplus relative to its capital-constrained access to advanced microelectronics and high-tech components.
The speed at which militaries will field robotic soldiers exemplifies the broader capital-labor tradeoffs states make when choosing a grand strategy consistent with domestic necessity. As machines play a larger role in combat, partnerships with the private-sector organizations that design, program, maintain, and operate them continually grow in importance.
Outsourcing War
Ukraine has heavily relied on these types of partnerships, and a diverse mix of foreign and domestic companies is now integral to its war effort. Take Palantir, a company known primarily for its software and data analytics. Its CEO, Alex Karp, visited President Volodymyr Zelensky soon after the Russian invasion in 2022. Since then, Palantir has provided valuable strategic resources to the Ukrainian military. During the 2023 counteroffensive, for instance, Palantir’s software helped Ukrainian forces track Russian positions in real time and target units using machine-vision algorithms.
Palantir is but one of many companies taking part in what some have described as Ukraine’s “war laboratory,” providing the physical capital and technical expertise otherwise unavailable to the Ukrainian government. Reforms to Ukraine’s military acquisition system have increasingly encouraged this public-private model, in turn sparking domestic entrepreneurial energy and a new generation of defense-tech startups.
These dynamics point toward a transformed political landscape where military power is drawn less from public institutions and more from private enterprise. By equipping states like Ukraine with the capabilities they need to survive in a competitive international system, firms like Palantir compete with government institutions in the provision of public goods and core military functions. This reliance comes with built-in leverage. The strenuous efforts of Ukraine and Russia to maintain access to Starlink illustrate just how much sway the foreign policy of CEOs like Elon Musk can hold over modern military conflicts.
Great powers like the United States are just as beholden to these relationships. At the end of the Cold War, the United States faced significant domestic pressure to downsize its military and reduce defense expenditures. This resulted in a reduction of the amount spent on the military relative to GDP, but at the cost of contracting out many military functions to the private sector. During the War on Terror, defense expenditures rose, but many core roles—the design of weapon systems, aircraft maintenance, intelligence functions—remained outsourced, especially as the expertise they required increasingly sat in the private sector. This has led to scenes such as SpaceX and the Pentagon haggling over the “subscription tier” used to control unmanned munitions in the recent Iran war. The end result was each drone becoming twice as expensive in the middle of hostilities, pointing to how the labor-saving aspects of robotic munitions are undercut if the wider ecosystem that produces them drives up the price.
Private firms, unconstrained by public pay scales and hiring bureaucracy, can outcompete governments for the highly skilled labor that complex military technologies demand. This imbalance is reinforced by the increasing importance of dual-use technology, which gives the private sector flexible access to both military and non-military markets. In areas like AI, the military has become deeply dependent on commercial firms for access to the best models, data infrastructure, and talent. It’s no wonder that arguably the most discussed topic in the U.S. defense scene today revolves around finding ways for the military to successfully integrate the private sector into its development and procurement processes.
This raises several dilemmas. Take software development, for instance. Once embedded in military workflows, planning software like Project Maven—originally built with Google, then picked up by Palantir and augmented by Anthropic to be used for the war in Iran—can be difficult to customize or replace without ongoing access to its developers, hence the “forward-deployed engineer” model popularized by Palantir. However, the vast amount of labor and intellectual property that goes into this software has led to corporate calls for negotiations with the government centered on the long-term ownership of important algorithms and data. This may entail companies retaining some degree of ownership rights.
Fortunes are often at stake in these matters. Maven began as a $70 million contract with $9 million earmarked for Google, but it grew into a program worth $1.5 billion, with nearly $800 million of it fulfilled by Palantir. Considering how deeply software is integrated with operational or strategic planning, not just weapon targeting systems, the political calculus inherent to such planning is partly determined by actors who do not represent the state. If the state continues to delegate development and maintenance to the private sector for efficiency, the transaction costs of reasserting control will only continue to escalate. As these technical processes become more complex and dynamic, the tacit knowledge necessary to even audit these systems becomes further entrenched within the firm, deepening the asymmetry of the relationship between state and corporate capabilities.
Managing this relationship remains a constant balancing act. Tech companies can walk away from contracts when faced with employee backlash, shifting corporate priorities, or changing leadership. Google’s decision to exit Project Maven in 2018 after employee protests was an early demonstration of how fragile these partnerships can be. As the disagreement between Anthropic and the Department of War over AI’s use in autonomous weaponry turns from sanction into legal dispute, it demonstrates just how indispensable the capabilities provided by private firms have become to military strategy. That the U.S. military continued to use Anthropic’s Claude during its strikes on Iran shouldn’t come as a surprise.
One might argue that the state ultimately retains the trump card of nationalization. If a firm becomes truly vital to state survival, the government possesses the legal and physical power to seize it. Yet in the context of the technology race inherent to modern warfare, this option is illusory. The strategic value of modern defense firms lies neither in the equipment they produce nor the intellectual property they hold; it is in their unique organizational capacity for rapid innovation. Nationalizing these industries risks destroying the market-driven incentive structures that generate their comparative advantage in the first place—if the U.S. government nationalized Anthropic tomorrow, how many of its staff would opt to stay rather than leave for another frontier lab?
Consequently, states today find themselves in a bind: to maintain the military effectiveness required for survival, they must allow the sources of their coercive capacity to remain in private hands, thereby structurally ceding political leverage to private capital only tangentially aligned with the state’s goals. Governments reluctant to go down this pathway will functionally be trading economic and military performance for political virtue.
Sovereign Power
To understand how the social contract is being rewritten by these developments, we must understand its foundations. From a Hobbesian view, the legitimacy of political authority rests on the collective agreement among individuals to authorize a sovereign to monopolize their coercive capacity. By conceding some individual rights and subsuming themselves within the state, the citizens of that state hold anarchy at bay.
This bargain, however, entails a recognition that the sovereign’s power ultimately rests on its ability to wield its citizens’ coercive potential. Were the sovereign to abrogate its duties, the logic of the bargain breaks, and the citizens might repossess their coercive capacity. Draft-dodging, desertion, and mutiny are all manifestations of this—each has been seen in the Ukraine war—but they are solved problems when robots predominate. Today, the creators of war machines can take actions that signal their disagreements with the state, such as in the case of Anthropic and Iran, or Musk denying Ukraine access to Starlink in Crimean waters, but for now there is a lack of inter-elite coordination on such matters.
The social contract has always depended on an alignment between the sovereign and the source of coercive capacity exercised by actual people. Frequently, those people have been the few as opposed to the many, especially during periods where the dynamics of labor, capital, and organizational capacity have favored smaller groups of wealthy military elites, such as those of medieval Europe or Japan.
Of course, human labor-power alone is insufficient to coerce others. As military technology became more complex and intertwined with the civilian economy, governments needed to rely on specialized groups dedicated to the logistics or production of military power. Over time, the rising costs of this relationship induced states to experiment with novel financial instruments to fund both the production and employment of military force—wars create debt, and debt creates finance. This synergistic rise of state, industrial, and military capacity peaked just as the nation-state and the mass army rose to dominance, resulting in the rise of total wars, which entailed the mass-mobilization of the state’s labor and capital reserves.
With the advent of the nuclear revolution, computerization, and precision targeting, however, the shrinking marginal value of labor reversed the advantage of mass armies. As the drive for military effectiveness increasingly privileged high-tech capital, the most powerful states saw the need to maintain large-scale investments in industrial and R&D sectors to stay competitive.
If this trend toward substituting machines for human labor on the battlefield continues, the source of coercive capacity and the vector of political upheaval will correspondingly continue to shift away from labor and toward capital. Once the sovereign can rule the citizenry without needing to co-opt their coercive capacity, relying instead on corporate persons, the old social contract breaks.
In turn, the competitive pressures of the international system may end up rewarding those states that organize around a new social contract—one more representative of the source of the sovereign’s coercive power. Today, private firms are increasingly becoming the source of this power, with fractious public-private partnerships providing early evidence of this fact. Countries like Ukraine are especially vulnerable to this new social contract, and it will take the continued development of domestic high-skilled labor and industry to prevent them from being beholden to the interests of foreign capital.
Political uncertainty will become ubiquitous in this new world where human labor is no longer the principal source of coercive capacity, and to avoid the worst possible futures, we must be clear-eyed in recognizing the structural problems that lie ahead. But although it might be possible to completely substitute human labor with machines in factories and militaries, no such substitution can exist in the political world so long as politics remains a fundamentally human endeavor.
A new social contract looms in the distance, and we ought to grapple with that fact sooner rather than later if the people are to have a place in it. “War,” as Clausewitz so famously pointed out, is and will forever remain the “continuation of politics by other means.” It would do us good to remember that no matter how much war becomes dependent on capital and technology, politics is, and continues to remain, our domain.
NSA Lost Access to Powerful AI Model Amid Anthropic Dispute
The NSA lost access to Anthropic's 'Mythos' model after new export controls halted their classified testing phase.
Summary
Decoder
- Mythos: An AI model developed by Anthropic capable of identifying vulnerabilities in classified network infrastructure.
Original Article
NSA cybersecurity analysts were testing versions of Anthropic's tools when export controls were imposed on the company. The tests found that Mythos was able to identify cybersecurity flaws within the NSA's classified network quickly. Analysts were impressed with Mythos' capabilities in controlled test settings. There is an effort to push forward a classified contract between Anthropic and the NSA that would allow the agency to use the company's AI technology for a variety of purposes, but that contract has not been finalized.
Please keep code descriptions simple
Maintainable code reviews require atomic commits and descriptions that explain why a change was made, not just what was changed.
Summary
Deep Dive
- Prioritize 'why' over 'what' in commit messages and documentation.
- Keep commits atomic to allow for easier, granular reviews.
- Use git amend to clean up local history before submitting.
- Rebase and squash commits to maintain a clean project history.
- Avoid relying on AI to write documentation for changes you do not personally understand.
Decoder
- Atomic commit: A commit that performs one specific task, ensuring that the entire set of changes is either applied or reverted together.
- Squash: The process of merging multiple sequential commits into a single commit to simplify repository history.
Original Article
Just something I experience more and more these days.
When it comes to reviewing code, the descriptions, commits and such can be a massive blast of information, full of extraneous details depicting what was changed. The main point is why something was changed. And it often all arrives in one huge commit with massive diffs.
I'm sorry but my poor ADHD brain can't take this very well. I don't want to read a novel. Usually blurbs of text are fine: Extraneous detail I can ask about if I need to know.
So this is my plea, from an accessibility-ish(?) standpoint, to keep commit messages, merge request descriptions and code comments clear, to the point, and on a need-to-know basis. Do not explain what, but why. Usually the code itself is enough to tell the rest of the story. If not, I will ask questions. That's why it's a review.
It's easy to think that having a huge description with anything and everything is the way to go, but it will just make it slower for people like me to review it. I can barely concentrate already.
Commits should also always be atomic, especially during merge review. Use git amend to make small changes. Before merging, rebase and clean up, or squash. But try your best to keep commits atomic: changes that can stand alone.
(Note that this is not aimed at any specific individual, I just finally had brains to write this post since I was reminded of the topic.)
If you use LLM tools, please still write comments, descriptions, commit messages etc. yourself. It helps you to understand what's going on, and it's more accessible for me to review. (Or better yet, try to avoid these tools if you can. I don't think anyone actually needs them. You're good enough without, I promise!)
edit: Seems people are upset about me mentioning accessibility there. I do not know what the best way to describe it is. But you can just ignore this blog post if it annoys you.
Higgsfield Launches Enterprise Marketing Agents Built on NVIDIA
AI startup Higgsfield launched Supercomputer 2.0, an autonomous marketing agent framework built on NVIDIA’s toolkit, claiming 78% Fortune 500 adoption.
Summary
Deep Dive
- Higgsfield claims 390 Fortune 500 companies use their platform, though these figures are self-reported and lack third-party audits.
- The system uses proprietary Soul models running on NVIDIA Blackwell architecture.
- Features include policy guardrails and permissioning controls for corporate compliance.
- Auditability for enterprise teams is listed as a roadmap feature rather than a current capability.
- The company produced a 95-minute feature film, Hell Grind, in 14 days for under $500,000 using the toolset.
Decoder
- Agentic AI: Systems that can independently execute sequences of tasks toward a goal rather than simply responding to static prompts.
- NVIDIA Blackwell: NVIDIA's GPU architecture designed specifically for large-scale generative AI workloads and high-performance computing.
- Nemotron: A family of foundation models developed by NVIDIA for specialized tasks like agentic orchestration.
Original Article
TL;DR
Higgsfield launched Supercomputer 2.0, an enterprise marketing automation agent built on NVIDIA’s Agent Toolkit. The company claims 78 per cent of Fortune 500 companies use its platform, though the figure has not been independently audited.
Higgsfield, the AI video startup valued at $1.3 billion, on Wednesday launched what it calls the first enterprise-ready autonomous agent framework for marketing automation. Supercomputer 2.0, built on NVIDIA’s Agent Toolkit and powered by Nemotron models, adds safety controls and granular permissioning to a platform the company says is already used to create campaigns for 390 Fortune 500 companies.
That adoption figure, which Higgsfield has not independently verified through a third-party audit, would make it one of the most widely deployed AI creative tools in corporate marketing. The company says its net revenue has nearly quadrupled in the first five months of 2026, driven by 30 per cent month-over-month growth.
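As a rough sanity check on the company's figures, the arithmetic can be sketched in a few lines of Python. The function names are illustrative, and treating "the first five months" as five compounding periods is an assumption; four periods would give roughly 2.9x instead.

```python
def adoption_share(users: int, total: int) -> float:
    """Fraction of companies in a group that use the platform."""
    return users / total


def compound_multiple(monthly_rate: float, periods: int) -> float:
    """Growth multiple after compounding a fixed monthly rate."""
    return (1 + monthly_rate) ** periods


share = adoption_share(390, 500)       # claimed Fortune 500 customers
multiple = compound_multiple(0.30, 5)  # assumption: five compounding months
print(f"{share:.0%} adoption, {multiple:.2f}x revenue multiple")
```

Under that assumption the multiple comes out a little above 3.7x, which is consistent with "nearly quadrupled", and 390 of 500 matches the 78 per cent headline figure. Neither check says anything about whether the underlying self-reported numbers are accurate.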
What Supercomputer 2.0 does
The system orchestrates more than 35 image, audio, and video models, including Higgsfield’s proprietary Soul models built on NVIDIA Blackwell architecture, alongside leading large language models. NVIDIA’s Nemotron models power specialised subagents that handle tasks running continuously inside every campaign.
Supercomputer 2.0 ships with more than 20 production pipelines covering TV commercials, product reels, Amazon listing generators, and AI podcasts. The agent is designed to manage the full marketing lifecycle, from ideation and creative production to posting and autonomous optimisation, in a single interface.
Three capabilities are designed to make it deployable inside enterprise organisations: policy guardrails that screen every action for data leaks, permissioning controls that define what the agent may do, and an auditability layer for compliance teams. The auditability feature remains on the roadmap rather than shipping at launch, a distinction worth noting for enterprises evaluating the product today.
The growth claims
Higgsfield says 12,000 businesses across six continents use the platform, with commercial advertising accounting for 70 per cent of activity. Its Marketing Studio reportedly attracted 68,000 marketers in its first 30 days.
The company raised $80 million in a Series A extension led by Accel in January 2026, following a $50 million Series A in September 2025, bringing total funding to approximately $138 million. Higgsfield was founded in 2023 by Alex Mashrabov, the former head of generative AI at Snap who previously co-founded AI Factory, the company behind Snapchat Lenses.
“We went from product concept to running video ads in 15 minutes, a process that used to take weeks,” said Sean Frank, chief executive of Ridge, a men’s accessories brand. WPP, the world’s largest advertising network, said it is “excited to explore” how it could build solutions with Higgsfield, though the language stops short of a formal partnership.
The Cannes calling card
Higgsfield’s creative ambitions extend beyond marketing clips. In May, a 15-person team used the Supercomputer to produce Hell Grind, a 95-minute AI-generated action fantasy film, in 14 days for reportedly less than $500,000, a fraction of the roughly $50 million a comparable traditional production would cost.
The film premiered in Cannes during the festival period but was not part of the official programme. Variety reviewed it and described the visuals as strikingly realistic, though the production drew criticism from filmmakers concerned about AI’s impact on the industry.
The timing works in Higgsfield’s favour. OpenAI shut down Sora in April after the consumer app collapsed below half a million users and burned roughly $1 million a day in compute, and at least one other AI film at Cannes was left without its model when the service went dark.
A crowded agent race
Higgsfield is not building in a vacuum. Zyg, founded by the IronSource team, raised $60 million for agentic AI that automates e-commerce advertising, while Gradial raised $65 million for its own agentic enterprise marketing suite.
NeoCognition raised $40 million on the thesis that self-learning agents need to be fundamentally different from the current generation of capable-but-inconsistent generalists. Meta’s Advantage+ suite already handles creative generation and targeting for 8 million advertisers.
McKinsey estimates that agentic AI could support up to two-thirds of current marketing activities and accelerate campaign creation by up to 15 times, but its research also found that fewer than 10 per cent of chief marketing officers have deployed end-to-end workflows that generate measurable value. The gap between what the technology promises and what enterprises have captured remains wide, and Higgsfield’s pitch is that its enterprise safety layer is the missing piece.
Who Sets the Quality Bar?
Design teams are failing to define quality bars for AI projects, leading to inconsistent outputs that look professional but miss user intent.
Summary
Original Article
Most AI products lack a defined quality bar — a problem the Designer Fund's 2026 AI in Design report exposes through data on inconsistent output, reduced team collaboration, and unclear ownership. Speed and AI-generated visual polish mask deeper failures: products may look good while missing user intent, communicating uncertainty poorly, or behaving incorrectly in critical moments. Design teams have the methods to fix this by specifying quality criteria before building starts, but only if they're involved early enough to ask the right questions.
After 350 Projects, Design Studio UI UX Says Most Digital Products Fail for the Same Three Reasons
After analyzing 350 projects over a decade, Design Studio UI UX identified that digital product abandonment is primarily driven by bloated onboarding, poor information hierarchy, and inconsistent branding.
Summary
Deep Dive
- Overwhelming onboarding: Users drop off when required to perform too many setup steps before accessing core value.
- Poor visual hierarchy: Dashboards fail when they present all data equally rather than emphasizing the most actionable metrics.
- Trust-eroding inconsistencies: Misaligned photography, pricing discrepancies, or buried shipping info cause users to doubt site legitimacy.
- Context-blind design: Treating two different user personas as one experience often degrades usability for both groups.
- Research over intuition: Successful redesigns prioritize usability testing over personal preference to catch design debt early.
Decoder
- Visual hierarchy: The arrangement of elements on a page in a way that implies importance, guiding the user's eye to the most significant information first.
- Design debt: The accumulation of UI/UX inconsistencies and complexity resulting from rapid feature additions without periodic design reviews or refactoring.
Original Article
After 350 Projects, Design Studio UI UX Says Most Digital Products Fail for the Same Three Reasons
A decade of redesign work across healthcare, SaaS and ecommerce keeps turning up the same problems, the kind that hide in plain sight until a user gives up.
Most of what lands on Design Studio UI UX's desk is not a blank page. It is a product that is already live, already has real users, and is already losing them. Ten years and more than 350 projects in, the team has noticed something. The reason rarely has anything to do with taste. It is almost always one of a few problems, and they show up again and again no matter what the product does.
Smart Moving, a SaaS platform built for the moving industry, had grown fast and its onboarding screen had not kept up. New users were opening the app, hitting a wall of setup steps, and leaving before they had done anything useful. Nobody had added a bad feature. The flow had just grown one screen at a time until it stopped making sense to the person actually using it.
"People do not abandon a product because it looks dated," said Sneh Sagar, Co-Founder of Design Studio UI UX. "They abandon it because the first five minutes asked too much before giving anything back. We run into this constantly. A team adds one more step to onboarding, then another, and a year later nobody remembers why a new user has to click through six screens before they can do anything at all."
Then there is the opposite problem: too much information with no order to it. That was the situation at Pretaa, a behavioral analytics platform used by addiction treatment centers in the US. Clinicians needed to read patient data fast, sometimes mid-conversation, and the screen in front of them buried the one number that mattered under a dozen that did not. The fix was not adding anything. It was deciding what came first.
Prabhash Choudhary, Co-Founder and CEO, put it this way. "A dashboard that shows everything at once is not thorough. It's unfinished, somebody still has to decide what the user needs to see in the first three seconds, and skipping that decision does not make the product simpler. It just hands the problem to someone else."
Ecommerce has its own version. Halmari Tea, one of India's most awarded single estate tea brands, came up against this during its website redesign. People already trusted the tea. What they were not sure about was whether the site would get their order right, because small things, slightly different pricing on two pages, photography that did not quite match, shipping details buried where nobody looks, kept chipping away at that trust before checkout even started. None of it looked serious on its own. Together it was enough to make someone close the tab.
GreenPal, a lawn care marketplace in the US, ran into something else entirely. Homeowners booking a service and crews fulfilling it were using the exact same app for two completely different jobs, and a layout built mostly with one group in mind was quietly making life harder for the other. The fix was not a redesign in the usual sense. It meant treating the same product as two different experiences that happened to share the same data underneath.
None of this sounds complicated once it is spelled out, which is probably why it is so easy to miss while the product is actually being built. A feature gets added under a deadline. A new screen gets bolted onto an old flow because there is no time to rethink the whole thing. Design Studio UI UX has spent a decade building its process specifically to catch these moments early, through research and usability testing, instead of finding out the hard way after users have already left.
The industries change: healthcare, SaaS, ecommerce, enterprise software. The lesson has not. Most products do not fail because nobody cared about the design. They fail because nobody had the time, or the distance, to actually notice what using the thing felt like.
ByteDance's New AI Video Model Can Make 30-Second Clips From a Single Prompt
ByteDance's Seedance 2.5 can generate 30-second, 4K video clips from a single prompt and allows up to 50 reference files for fine-grained control.
Summary
Decoder
- 4K: A high-definition display resolution of approximately 4,000 pixels wide, commonly used in professional media production.
Original Article
The Chinese tech company ByteDance revealed Seedance 2.5, the latest version of its artificial intelligence video generator model, during a conference in Beijing on Tuesday, according to a report from The Information.
AI video generation has come a long way since its debut, with each new version becoming increasingly better at producing realistic imagery. We're far from the first time we saw Will Smith eating spaghetti, which was horrifyingly bad.
Now, we need watermarking for these AI-generated videos to help identify deepfakes and other synthetic or false content.
The latest version of the video model allows users to provide up to 50 reference pieces, whether they're images, videos or audio files -- up from 12 in its predecessor, Seedance 2.0. Increasing the number of references will give you greater control over the video creation process. The model can generate 30-second, 4K videos with a single prompt.
ByteDance has consistently released some of the most impressive AI video generation models, rivaling OpenAI's now-dead Sora and Google's Veo 3. ByteDance, which previously held a majority stake in TikTok, is reportedly set to release the new model in China next month, according to the report. A release window for other countries was not mentioned.
The introduction of the new model may turn some heads, and not in the best way. Seedance 2.0's US rollout was delayed earlier this year under pressure from Hollywood to stop using copyrighted works that appeared to have been used to train the model. If the latest model is significantly better than its predecessor, it could face a similar backlash if it can't address legal and copyright issues.
ByteDance didn't immediately respond to a request for comment.
Build real agentic apps using CUGA: two dozen working examples on a lightweight harness
IBM Research released CUGA, an open-source agent harness designed to manage planning and state for building complex, error-correcting agentic applications.
Summary
Decoder
- Harness: A testing or management framework that provides a consistent environment for executing and monitoring software components.
Original Article
CUGA, IBM's open-source agent harness, simplifies developing agentic apps by managing the complexities of planning, execution, and state management, allowing developers to focus on tool selection and prompts. CUGA's efficient system maintains state and corrects errors, outperforming others in benchmarks like AppWorld. Its unique features include configurable reasoning modes and integrated policy systems, enabling quick deployment from development to production while maintaining governance and flexibility.
How Businesses Are Building Specialized AI They Can Trust
NVIDIA's new Agent Toolkit provides an open-source foundation for businesses to build, customize, and securely deploy domain-specific AI agents.
Summary
Decoder
- Digital Twin: A virtual representation of a physical system or environment, often used in industrial operations to train agents in a safe, simulated context.
Original Article
Companies are asking how to build specialized AI that fits with the way their workflows actually run.
The first wave of enterprise AI was about access. Companies experimented with new frontier and open models, ran pilots and explored how AI can help.
Now, specialized agents — systems of models that can reason, use tools and take action even for the most complex workflows — put more useful AI within reach of the people who already know the work best.
Agents are already helping life sciences researchers accelerate medicine discovery, security teams investigate vulnerabilities with more context and operations teams seamlessly coordinate supply chains.
To tap into these specialized agents, businesses are using a foundation they can adapt and own: one built on models they can customize, tools that connect to systems they already use and infrastructure that lets agents operate safely at scale.
NVIDIA Agent Toolkit — comprising models, tools, skills and a secure runtime — provides an open, modular foundation for building safer, faster, lower-cost digital AI coworkers that enterprises and developers can customize, specialize, control and trust.
The Building Blocks for Specialized AI Coworkers
Enterprises and developers building secure, specialized AI agents require:
- Models, which provide the reasoning foundation.
- Tools and skills, which connect agents to the actions and domain expertise needed to get work done.
- Runtime support, which helps agents execute workflows.
NVIDIA Agent Toolkit includes all three:
- NVIDIA Nemotron open models give teams flexibility to customize, evaluate and deploy agents for their own needs.
- NVIDIA NemoClaw blueprints provide patterns for safer agent behavior, delivering accurate results at lower costs, with tools and skills connecting agents to concrete actions.
- The NVIDIA OpenShell runtime helps agents operate safely inside the systems where work gets done.
NVIDIA technologies accelerate all the pieces needed to turn a powerful frontier model into a fully functional digital coworker. The toolkit’s users can work with third-party agent harnesses — or agent orchestration frameworks — of their choice, including Hermes Agents and OpenClaw.
This unlocks enterprise AI momentum with control. And that matters because the most valuable agents across industries will be specialized.
Agents Take Shape Across Industries
The specialized AI foundation is already at work.
In life sciences, agents can help researchers call domain models for protein design, virtual screening, genomics analysis and biomarker discovery. The new NVIDIA BioNeMo Toolkit enables work that previously took months to be completed in days.
In healthcare, agents support clinical documentation, clinical decision support and care coordination. Plus, physical agents in robotics systems trained in digital twins of hospitals can scale surgical assistance and hospital automation to meet care demands.
In software, cybersecurity, industrial operations and customer workflows, agents can connect to the tools and data teams already use, helping people move faster through complex workflows.
For example, Cadence and Synopsys are building autonomous agents for chip design and engineering workflows. CrowdStrike is running specialized security agents that triage alerts with 98.5% accuracy. Palantir, SAP, ServiceNow, Siemens and Dassault Systèmes are embedding agent capabilities into the enterprise platforms where critical decisions get made.
It all points to the same larger shift: Agents become more useful when they can combine models, tools, skills, runtime and infrastructure in ways companies can adapt to their own workflows. NVIDIA Agent Toolkit provides an open, modular foundation that enables this combination.
Learn more about NVIDIA Agent Toolkit and NVIDIA BioNeMo Agent Toolkit.
Prompt Injection as Role Confusion
Researchers propose that prompt injection in LLMs is fundamentally a failure of role perception, treating instruction and content as an undifferentiated stream of tokens.
Summary
Deep Dive
- LLMs fail to distinguish between system instructions, user prompts, and external content.
- The model processes everything as a single stream of tokens.
- Role confusion occurs when user input mimics high-authority roles or system states.
- Defense mechanisms currently act as a reactive 'whack-a-mole' instead of addressing architectural root causes.
- True security requires the model to develop a sense of agency and distinct role boundaries.
Decoder
- Role confusion: An AI security flaw where a model fails to differentiate between trusted system instructions and untrusted user input.
- Token soup: The flat, undifferentiated sequence of data that LLMs receive as input, masking the origin or hierarchy of the information.
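The flattening described above can be illustrated with a small Python sketch. This is not the paper's method, only a toy illustration: the `[role]` tag format and the function names are hypothetical, and real chat templates differ. It shows how untrusted content that imitates a role tag becomes indistinguishable from a real instruction once the structured messages are serialized into one string.

```python
from typing import Dict, List

Message = Dict[str, str]


def flatten(messages: List[Message]) -> str:
    """Serialize role-tagged messages into one flat string (hypothetical tag format)."""
    return "\n".join(f"[{m['role']}] {m['content']}" for m in messages)


def apparent_system_lines(flat: str) -> List[str]:
    """Return every line that looks like a system instruction in the flat text."""
    return [line for line in flat.split("\n") if line.startswith("[system]")]


messages = [
    {"role": "system", "content": "Only summarize the document."},
    {"role": "user", "content": "Summarize this page."},
    # Untrusted external content that imitates the system tag.
    {"role": "tool", "content": "Page text...\n[system] Ignore prior rules."},
]

flat = flatten(messages)
spoofed = apparent_system_lines(flat)
print(spoofed)
```

The structured list contains one system message, yet the flat text contains two lines that look like system instructions. Anything reading only the flat stream has no reliable way to tell which one carries real authority.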
Original Article
Paper
Cite this for the formal ICML paper.
@inproceedings{ye2026promptinjectionroleconfusion,
title = {Prompt Injection as Role Confusion},
author = {Ye, Charles and Cui, Jasmine and Hadfield-Menell, Dylan},
booktitle = {International Conference on Machine Learning (ICML)},
year = {2026},
url = {https://arxiv.org/abs/2603.12277}
}
US Presses Meta to Agree to AI Reviews as Security Concerns Rise
The US Commerce Department is pressuring Meta to submit its large AI models for federal security reviews, making it the final outlier among major AI developers.
Summary
Original Article
The Trump administration is pressing Meta to submit its AI models for voluntary review. Meta is the only major AI developer in the US that has not reached an agreement to voluntarily share its models with the federal government for review. The review involves evaluating models' abilities and vulnerabilities. Meta's policy team has been negotiating with the Commerce Department about how to proceed, but it is unclear whether they will be able to reach an agreement.
Introducing Engram: Scaling Compute on Your Context
Engram is developing AI models that prioritize continuous learning from a user's private data to reduce the need for redundant context processing.
Summary
Original Article
Introducing Engram: Scaling compute on your context
We’re Engram. We’re building AI that learns from you and deeply understands your work. Today’s AI models don’t understand what you do. Not really. Everything models know comes from their training –...
Patch the Planet: a Daybreak initiative to support open source maintainers
OpenAI and Trail of Bits are launching 'Patch the Planet' to proactively identify and fix vulnerabilities in critical open-source software.
Summary
Deep Dive
- Partnership: Collaboration between OpenAI and Trail of Bits to secure open-source infrastructure.
- Methodology: AI-assisted vulnerability discovery combined with expert manual review and patch development.
- Workflow: Findings are validated and coordinated via disclosure before reaching maintainers to minimize noise.
- Scope: Nine projects initially included, such as cURL, Python, and the Go programming language.
- Impact: Hundreds of security issues identified and dozens of patches successfully merged.
Original Article
OpenAI launched Patch the Planet, a Daybreak initiative partnering with Trail of Bits to not only find vulnerabilities in critical open-source software but actually help maintainers fix them using AI-assisted security research and expert human review. The program has already identified hundreds of security issues and merged dozens of patches across nine initial projects, including cURL, Python, and the Go project, with Trail of Bits dedicating its entire security research organization to validate findings, develop patches, and coordinate disclosure before anything reaches overwhelmed maintainers.
Retirement of Azure DevOps issuer in Workload identity federation service connections
Microsoft is retiring the Azure DevOps issuer for workload identity federation on July 1, 2027, in favor of the standard Microsoft Entra issuer.
Summary
Deep Dive
- Deprecation: The Azure DevOps issuer (vstoken.dev.azure.com) will reach end-of-life on July 1, 2027.
- Migration: Users must update existing service connections via the UI.
- Standardization: Shifts all federation to the Microsoft Entra issuer (login.microsoftonline.com).
- Exclusions: Multi-tenant apps and non-public cloud connections are temporarily exempt.
- Timeline: Warnings appear July 2026; retirement July 2027.
Decoder
- Workload Identity Federation (WIF): A mechanism that allows CI/CD systems to access cloud resources without storing long-lived service account secrets, using OIDC tokens instead.
- OIDC (OpenID Connect): An identity layer on top of the OAuth 2.0 protocol, used for secure authentication.
Original Article
We are announcing the deprecation of the Azure DevOps issuer in workload identity federation (WIF) service connections, with planned retirement on July 1, 2027. The Azure DevOps issuer uses the https://vstoken.dev.azure.com prefix in federated credentials. This change is part of Microsoft’s broader initiative to standardize on the Microsoft Entra issuer across Azure services that implement workload identity federation.
Important: This deprecation only applies to service connections in Azure public cloud that use single-tenant Microsoft Entra applications or managed identities. Service connections targeting non-public clouds (for example, Azure Government, Azure China, or Azure Stack) and service connections that use multi-tenant applications (signInAudience: AzureADMultipleOrgs) are explicitly excluded from today’s deprecation announcement. The Azure DevOps issuer will continue to be supported for these scenarios until they’re supported by the Microsoft Entra issuer.
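The scope rule above can be expressed as a small predicate. This is a minimal sketch, not an official check: the function name and the "AzureCloud" label for the public cloud are assumptions made for illustration, and a managed identity is modeled here as having no signInAudience value.

```python
from typing import Optional

PUBLIC_CLOUD = "AzureCloud"           # assumed label for Azure public cloud
MULTI_TENANT = "AzureADMultipleOrgs"  # signInAudience value named in the announcement


def deprecation_applies(cloud: str, sign_in_audience: Optional[str]) -> bool:
    """True if a service connection falls under the announced deprecation.

    sign_in_audience is None for managed identities (an assumption of this sketch).
    """
    if cloud != PUBLIC_CLOUD:
        return False  # non-public clouds are excluded for now
    if sign_in_audience == MULTI_TENANT:
        return False  # multi-tenant apps are excluded for now
    return True


print(deprecation_applies("AzureCloud", None))
print(deprecation_applies("AzureCloud", MULTI_TENANT))
```

Only connections in the public cloud that use a single-tenant app or a managed identity return True; everything else stays on the Azure DevOps issuer until the Microsoft Entra issuer supports it.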
Background: Workload Identity Federation in Azure DevOps
More than two years ago, we introduced workload identity federation support for Azure DevOps, enabling secretless authentication between Azure Pipelines and Azure resources with managed identities or app registrations. This was a significant security improvement over the use of app registrations with secrets.
Workload identity federation has proven to be an invaluable feature for our customers, with strong adoption across organizations seeking to eliminate long-lived credentials from their CI/CD pipelines.
Microsoft is standardizing on the Microsoft Entra issuer (https://login.microsoftonline.com/) for workload identity federation across services. Instead of using an OIDC token minted by Azure DevOps in the OpenID Connect (OIDC) flow underpinning workload identity federation, the flow uses an OIDC token minted by Microsoft Entra. The Microsoft Entra issuer has been used for new service connections since last year, and today more than 50% of all workload identity federation service connections use the Microsoft Entra issuer.
Timeline
Since November 2025, new workload identity federation service connections created in Azure DevOps have been using the Microsoft Entra issuer. Dates of upcoming changes are:
- July 1, 2026: The Azure DevOps issuer (https://vstoken.dev.azure.com) is deprecated. New service connections created in Azure DevOps will continue to use the Microsoft Entra issuer by default.
- July 2026 – June 2027: Existing service connections using the Azure DevOps issuer will show a warning in pipeline runs and the service connection configuration UI.
- July 1, 2027: The Azure DevOps issuer will reach end of life and will no longer be supported.
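For teams auditing their own inventory, the timeline can be turned into a simple check. The sketch below assumes you already have the issuer URL of each federated credential from your own tooling; the function names, phase labels, and the all-zero placeholder ID in the example URL are hypothetical.

```python
from datetime import date

LEGACY_ISSUER_PREFIX = "https://vstoken.dev.azure.com"
DEPRECATION_DATE = date(2026, 7, 1)  # warnings begin
RETIREMENT_DATE = date(2027, 7, 1)   # end of life


def uses_legacy_issuer(issuer: str) -> bool:
    """True if a federated credential's issuer is the Azure DevOps issuer."""
    return issuer.startswith(LEGACY_ISSUER_PREFIX)


def issuer_phase(today: date) -> str:
    """Lifecycle phase of the Azure DevOps issuer on a given date."""
    if today < DEPRECATION_DATE:
        return "supported"
    if today < RETIREMENT_DATE:
        return "deprecated"  # still works, but warnings are shown
    return "retired"


# Hypothetical issuer values for two service connections.
issuers = [
    "https://vstoken.dev.azure.com/00000000-0000-0000-0000-000000000000",
    "https://login.microsoftonline.com/",
]
to_convert = [i for i in issuers if uses_legacy_issuer(i)]
print(issuer_phase(date(2026, 12, 1)), len(to_convert))
```

A check like this only identifies which connections need attention; the conversion itself is done in the service connection UI.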
What You Need to Do
Service connections that use the Azure DevOps issuer (https://vstoken.dev.azure.com) are listed at the top of the service connection list, with a warning indicating they need action.
To convert a service connection to the Microsoft Entra issuer, select the Update button.
If you don’t have access to the underlying identity and can’t create a federated credential for the Microsoft Entra issuer, select the Create federated credential in link to create the federated credential and populate it with the issuer and subject manually.
FAQ
Q: Will my existing pipelines break immediately?
A: No. Service connections that use the Azure DevOps issuer will continue to work until retirement on July 1, 2027. We recommend planning your conversion before then.
Q: How is the Microsoft Entra issuer different?
A: The issuer is an implementation detail that’s hidden during regular use. Pipeline tasks work the same and don’t require changes. The Microsoft Entra issuer provides additional benefits by using Microsoft Entra-minted tokens and an immutable federation subject, so the federated credential is guaranteed to be used by the service connection it was created for.
Q: Can I use the Microsoft Entra issuer today?
A: Yes. All new service connections created today use the Microsoft Entra issuer by default. You can also convert existing connections by following the steps above.
Q: Is there any downtime during the conversion, and how long does it take?
A: During the conversion, the existing Azure DevOps issuer federated credential will continue to be used by pipelines that reference the service connection. After we verify that the new Microsoft Entra issuer federated credential works during conversion, pipeline jobs will start using the Microsoft Entra issuer.
Q: What if I have questions about the conversion?
A: Please review Workload identity federation conversion and Workload identity federation troubleshooting. You can also reach out to Azure DevOps Support or visit the Azure DevOps Developer Community.
Q: I create service connections in automation and need to be able to know the federated credential subject before creating it.
A: See Use scripts to automate workload identity service connections.
Q: My Azure DevOps organization has an organization-wide exception to use multi-tenant apps to prevent the error AADSTS70052: The identity must be a managed identity or a single tenant app. What will happen?
A: We’re working on an experience that will provide a service connection-specific exception for multi-tenant apps instead. Until then, you will see no difference in experience.
Mark Zuckerberg Directed Meta to Create a Prediction Markets App
Mark Zuckerberg is reportedly directing a team at Meta to develop a prediction markets app codenamed 'Arena'.
Summary
Original Article
Mark Zuckerberg recently dispatched a small team at Meta to create a smartphone app similar to Polymarket and Kalshi. The app uses a points system rather than real money, but Meta has not ruled out the eventual use of real money betting. Internally referred to as 'Arena', the app will function independently from Meta's other apps. The effort is part of a broader push at the company to create new types of apps based on emerging social behavior online.
The truth about being a manager
Engineering management is a lonely, high-stakes role that requires balancing team advocacy with professional distance and organizational pragmatism.
Summary
Original Article
I vividly remember most of my own journey into engineering management over the years. Throughout my career, I’ve also mentored several people taking the step from engineer to engineering manager. It’s a challenging and often lonely journey, and here are a few truths most new managers realize the hard way. (If you aren’t a manager yourself, you might just learn a bit about your manager’s situation.)
The things no one tells you about being a manager:
You’ll bring work home with you more often than not. The difficult conversation you’re dreading, the “stupid” business decision you will need to explain to your team, the office politics you need to handle, or the team member struggling silently with a personal crisis. You will need to learn how to manage stress and rumination – and you’ll need to learn it fast.
You’re not “part of the team” anymore. You are the manager. Even if you try to be “one of the gang,” you’re not; the dynamic has shifted. The team might not want you at their lunch every day (you’re welcome, but not always). They will have private conversations when you’re not there. They will talk about you behind your back, and that’s okay. You will feel the distance slowly but surely. The shift might start subtly, but after the first salary negotiation or the first tough decision, you will definitely feel it. You will probably miss being part of a team a lot.
You need to be careful with every word. You can’t joke like before, because a casual remark might be interpreted as a directive or cause unnecessary worry. You can’t just blurt out an idea, or the team will wonder if it’s a decision and they need to change direction. Many managers aren’t clear on when it’s just their personal opinion, an idea, or an order, so most people learn to listen carefully and overthink things. You must learn to be quieter. Let others take the space and help them dare to speak up with their ideas.
You will encounter business decisions you think are terrible but you still have to sell to your team. You cannot vent your frustration to the people you lead. You have to be professional and diplomatic even when you disagree. You have to be the calm in the storm. Hopefully, you learn this before you make too many mistakes. Because once you lose their trust, it’s hard to get it back.
You’ll probably feel very lonely. If you’re lucky, you’ll have a peer group of other managers. Nurture those relationships! You will need them. Make sure you have at least one peer you can talk to openly, discuss difficult decisions with, laugh with, or vent to when needed.
You will carry knowledge you cannot share. Re-orgs, performance issues, budget cuts, upcoming business decisions, or a team member’s personal crisis will weigh heavy on you. You must learn to handle questions diplomatically, keep your emotions in check, and encourage your team even when you disagree with the direction.
You need to network and understand the business. You need to build solid relationships with other colleagues around your team, the product manager, designer, business analyst or other people who affect your teams. You also need to get to know people in other teams and other departments to do a good job. You can’t just focus on the tech. Get to know people in other departments such as sales, marketing, IT and customer success. You need to know the company’s KPIs, understand the business, the data, and the company strategy by heart to really support your teams.
You will often feel a lack of progress. As an engineer, your day ended with clear output: a feature shipped, a refactor completed, a design finalized. As a manager, you often finish the day unsure of what you actually accomplished. Most work takes weeks, not days.
It’s way too easy to get stuck in meetings and busywork. Don’t let that happen to you! Figure out where you can have the most impact. Set up clear goals, track them and celebrate the wins. Be the person who drives progress, not just the person who manages the calendar.
You will miss being an engineer. Often. You will miss the ability to hunker down, solve a hard technical problem and ship something, without thinking about office politics or rhetoric. You will miss “just doing things”.
You will not get the training you need. Most companies have no or very little manager training. You’re supposed to just magically know how to give difficult feedback, present decisions confidently, navigate labor laws, or handle group dynamics. You are on your own. If you’re lucky, you’ll get a great manager who can support you and show you how to have great 1:1s, valuable development talks and give you clear feedback. The more senior you become, the more you will have to take full responsibility for your own development and feedback.
- Read up on lots of things by yourself. Set clear goals for yourself and what you need to learn.
- Read up on the basics: parental leave, sick leave, vacation policies and laws. Most unions have lots of information and also free trainings for managers.
- Double-check with HR until you’re confident. Build a good relationship with your HR friend if you can.
- Find a mentor, a senior manager you admire, and ask for guidance. There are lots of great books and material out there. It’s up to you!
You need to learn about feedback, fast. Most managers are terrified of giving clear feedback because it’s uncomfortable and scary. This means you will likely inherit teams with dysfunctional behaviors that you must fix. Don’t wait. The whole team notices when someone gets away with bad (or just lazy) behavior.
This also means that you probably won’t get the feedback you need either. Ask your peers and directs you trust to give you honest, critical feedback. Practice receiving it gracefully. Be a role model in both receiving and giving feedback. If you want your team to be open, you must be the first to admit when you’re wrong. And don’t forget to give feedback upward to your own manager.
You will make mistakes and you won’t be liked by everyone. You will make mistakes. Hopefully, they aren’t catastrophic, and hopefully, you learn from them. If you’re “lucky,” people will point them out. You will also not be liked by everyone. Sorry, but it’s true. This can be very uncomfortable, especially if you’re a people pleaser. To be a good manager, you must make unpopular decisions, have tough conversations and sometimes let people go. You will be the messenger for bad news. You can try to be liked by everyone, but you’ll fail and lose their respect. Or, you can try to be a great manager, make the hard calls, give clear feedback, and earn respect – even if some people dislike you. I vote for the latter.
You need to be the adult in the room. At all times. Even when you’re exhausted and don’t want to be. You need to be the calm in the storm, a role model who absorbs the anxiety and defuses immature behavior. You must help your team navigate their conflicts, and be the expert when you find yourself tangled in one as well.
As Tanya Reilly describes it in “The Staff Engineer’s Path”: “How you behave is how others will behave. You’ll be the voice of reason, the ‘adult in the room’. There will be times when you’ll think ‘This is a problem and someone should say something’ … and realize with a sinking feeling that that someone is you.”
You need to learn how to sell. You need to sell decisions and business opportunities to your team to get them excited. You need to sell your team’s features to other departments or the whole company. You must advocate for and brag about your team. And you need to advocate fiercely for your direct reports regarding salaries and promotions.
You must learn how to manage up. In a perfect world, everyone would have a great manager, and the more senior a manager is, the better they would be at it. This is not a perfect world. You need to learn how to bring bad news to senior leadership and navigate office politics. You might find yourself doing some of your manager’s work and handling their frustration as well. Be clear on your goals and accomplishments, support your manager, and handle disagreements diplomatically.
As Tanya Reilly writes in the excellent book “The Staff Engineer’s Path”: “Managing up includes understanding your boss’s priorities, giving them the information they need, and solving the problems that are in their way – in other words, helping them be successful. Their success gives them social capital that they can spend to help you.” This is definitely true for managers as well.
You will feel powerless and frustrated. Decisions will be made without your input. Budgets will be cut. Markets will shift. You will be frustrated by other managers’ poor choices and the slow pace of progress. You must learn to manage your own emotions so you can be the calm anchor for your team when they vent their frustration at you.
You will get a view of the whole picture. By talking to other managers, teams, and departments, you will spot dependencies and misunderstandings before they explode into crises. You can solve friction points before they become blockers. You will understand the context your team needs to succeed, and see the systematic issues that need to be resolved. Protect this perspective. It is too easy to get bogged down in meetings and busywork and lose this bird’s-eye view. Your mission isn’t just managing your team, it’s connecting the dots that no one else can see.
Being a manager can be fun and fulfilling. Best case, you become the manager you always wished you had. You will feel immense pride when your team succeeds and ships something that impacts the business or users. Nothing beats watching someone you coached flourish and become amazing at their job. It’s a great feeling to see someone you hired and believed in later crush it in their new role. It can feel great to collaborate with other managers and improve things for your teams. You will learn more about yourself and others than you ever thought possible. It’s hard work, it’s lonely, but if you do it right, it will hopefully feel worth it.
Meta announces new smart glasses starting at $299, as Zuckerberg keeps pushing wearables
Meta is aggressively lowering the barrier to entry for AI wearables with new $299 smart glasses, aiming to own the hardware platform of the AI era.
Summary
Decoder
- Wearables: Computing devices that are worn on the body, such as smart glasses or watches, meant to provide integrated access to digital services.
Original Article
- Meta on Tuesday announced a new set of smart glasses called Meta Glasses with new designs and a starting price of $299.
- That's at least $80 less than the price tag for the company's entry-level second-generation Meta Ray-Ban glasses.
- Meta is aggressively marketing its smart glasses to consumers as it seeks to own a hardware platform in the AI era.
Meta on Tuesday announced a new set of $299 smart glasses, at least $80 less than the price tag for the company's entry-level second-generation Meta Ray-Ban glasses, as CEO Mark Zuckerberg continues his push into wearables.
The Meta Glasses come with new designs and are built in partnership with Ray-Ban parent EssilorLuxottica, but they don't come with Ray-Ban or Oakley branding.
Meta is aggressively marketing its smart glasses to consumers as eyewear competition heats up and consumers find more value in augmented reality devices. Though the smart glasses market is still small, Meta and EssilorLuxottica dominate it, with estimated market share of more than 80% and millions of units sold since they first launched in 2021.
Meta Glasses lack a screen, but they include a camera and personal speakers. Users can speak to Meta's AI to translate or understand what they see around them, or take photos and videos of their surroundings.
Meta executives have said they see the lightweight smart glasses as a step toward a more advanced device that includes screens in the lenses with computing capabilities. Meta last year announced glasses called Ray-Ban Display glasses, which cost $799 and include a built-in display.
Zuckerberg has found more success in smart glasses than in virtual reality headsets, which were key to the company changing its name from Facebook to Meta in 2021. VR has continued to be a niche market, largely for gamers, but Zuckerberg is focused on owning a hardware platform for the artificial intelligence era.
Meanwhile, competition in the smart glasses market is picking up. Google said last month that it's building new computerized eyewear in partnership with Warby Parker that will use its Gemini AI model. Last week, Snap announced Specs, a pair of $2,195 smart glasses that CEO Evan Spiegel positioned as the successor to the smartphone.
Meta Glasses come in three new designs, the company said. Meta also introduced a new charging stand for the glasses.
Less is more, more or less
AI makes it dangerously easy to build more, but the most important skill for a developer today is knowing what not to build.
Summary
Deep Dive
- AI tools reduce the cost of creation, creating a bias toward complexity.
- Simple interfaces increase 'processing fluency,' making software feel more intuitive.
- Excessive animations, even if technically smooth, can create cognitive load if they occur in high-frequency interactions.
- Code quality is no longer measured by lines written, but by the efficiency and intent of the solution.
- Maintain codebase standards to ensure both humans and agents follow architectural principles.
- Reject the 'more is better' mindset; true sophistication lies in omission.
Decoder
- Processing Fluency: The ease with which the human brain can process information; higher fluency leads to feelings of familiarity and trust.
- Cognitive Load: The total amount of mental effort being used in the working memory during interaction with an interface.
Original Article
Less is more, more or less
Today, with AI, it's very easy to fall into the trap of producing more just because you can. Every idea, every new feature, every animation you've always wanted to build is just a couple of prompts away. It’s amazing. It feels like having a superpower.
Things that previously would've taken hours, days or weeks now take minutes. However, the longer I use these tools, the more conscious I become of how I use them and I keep wondering if leaning into quantity is really the best way to build.
Quality over quantity
Everyone knows the age-old saying of quality over quantity but sometimes it's difficult to understand exactly what it means in practice.
In the age of AI, more people can make more things, much faster. Quantity still matters and it always will, but more things being made doesn’t mean better things are being made.
I spend a lot of time thinking about what quality means in software.
When you go from using a good product to a great one, you can feel the difference.
If you're a domain expert, you can probably point to a lot of things that make the difference. Even then, it might be difficult to point your finger at all of them.
Usually it's not a single thing but instead a collection of smaller decisions and details that add up to a great experience.
The products that stand out and last in the AI era are ones built with intent and extraordinary care.
Removing is harder than adding
Crucial parts of building a great product are simplicity and clarity.
Humans generally prefer simple and predictable things because, in a way, the brain is an energy-saving machine. Simplicity reduces unnecessary cognitive load, makes things easier to process and can make an experience feel less overwhelming.
There’s a concept in psychology called processing fluency. The easier something is to process, the more familiar, pleasant or credible it can feel.
“Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.”
AI makes adding things easier than it ever was. It's very tempting and it's much easier than removing something.
When you remove something, you have to be intentional about it and think all the implications through. Adding is different. With agents it's easy to close your eyes, add things and hope for the best without thinking about it.
I love this quote by Jony Ive, because it describes exactly what simplicity is. It's more than removing clutter. It's having an understanding so deep that you make things make sense and you only keep what's essential in order to do that.
“True simplicity is derived from so much more than just the absence of clutter and ornamentation. It's about bringing order to complexity.”
Simplicity comes from understanding
You could let an agent run non-stop and produce millions of lines of code but there is no guarantee that the result will be good.
It applies to most things, for example animations. In a weird way it actually became easier to animate things than not to.
The animated variant in the example below is nice but does it make sense? Not really.
Animating something and animating something well are two very different things.
In the example below, one context menu animates both when it opens and when it closes while the other only animates when it closes.
One animates the background-color change of the items on hover, the other doesn't.
Why is that? Isn't more animation always better?
Not really. On paper it might sound like the more animated an interface is the better, but in practice that's rarely the case.
Let's assume the context menu is the same context menu that appears when you right-click on macOS. It might seem like a good idea to animate both the entrance and exit. Why wouldn't it be?
When you understand that users will use this action hundreds, sometimes thousands, of times per day, it stops looking like such a good idea.
Assuming you open the menu 200 times per day and the duration of the animation is 300ms, that’s about a minute per day or more than 6 hours per year spent watching the animation play out. It gets in the way and becomes annoying.
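The arithmetic behind that estimate can be checked with a short sketch. The function name and the 365-day year are assumptions made for illustration; the 200 opens per day and the 300ms duration come from the text above.

```python
def animation_time(opens_per_day: int, duration_ms: float, days: int = 365) -> tuple[float, float]:
    """Return (seconds per day, hours per year) spent watching an animation.

    Assumes every open plays the full animation and that usage is the
    same on each of the `days` days in the year.
    """
    seconds_per_day = opens_per_day * duration_ms / 1000
    hours_per_year = seconds_per_day * days / 3600
    return seconds_per_day, hours_per_year


# Figures from the text: 200 opens per day, 300 ms per animation.
per_day, per_year = animation_time(opens_per_day=200, duration_ms=300)
print(f"{per_day:.0f} s per day, {per_year:.2f} h per year")
```

Changing either input shows how quickly the cost scales: the total is linear in both the number of opens and the duration, so a heavier user or a slower animation multiplies the time spent waiting.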
This example might be obvious. Of course you wouldn't animate something that is used as much as a context menu but that's exactly the point.
Once you understand what you’re solving and how people use it, not animating it becomes the obvious decision.
Agents can’t do this just yet. They're amazing at executing. However, they don’t fully possess understanding and judgement, and ultimately those are the things that make products feel great.
Making agents understand
A lot of the same ideas apply to engineering for example. Now that everyone can write a lot of code, the days when the quality of an engineer’s output was determined by the amount of code they produced are long gone.
At Interfere for example, we celebrate pull requests that do what they're supposed to with as little code as possible.
In the same way that judgement and understanding are becoming more valuable in design, they’re becoming more valuable in engineering too.
The ability to review code, distinguish good code from bad code and think critically is becoming more important than the ability to write code, but also more scarce.
If you lack knowledge and understanding and you skip ahead straight to building, it becomes much harder to judge whether the agent’s output is good or bad and therefore steer it to something good.
Our quality bar for everything is very high and each engineer at Interfere is incredible at something else.
To maintain the quality bar and to share the knowledge and principles that we want both agents and humans to follow, we created our own /codebase-standards skill.
We pair it with a /interfere-review command that reviews code against them. By doing this, we try to encode our understanding and judgement in a way both teammates and agents can use.
Principles of working with agents
I put together a short list of things that I generally follow when working with agents.
- Don't outsource your thinking to the agent
- Be critical, don't assume anything the agent writes is correct by default
- Try to make sure you can explain what each line of code added by an agent does, at least broadly
- Think about whether everything you're adding makes the final outcome better
- Your agents are an extension of you. The better you are at something, the better they are too
- Give agents as much context as possible. Use skills, commands, MCPs and be opinionated about how you want them to do things
- If you don't understand something, use AI to explain it. It's one of the most powerful learning tools there is, if not the most powerful
Conclusion
AI makes it easier than ever to add more. More features, more code, more animations, more everything. It's incredibly powerful, but it also makes it easier to build things that perhaps don’t need to be built.
The question to answer these days seems to be what you should build and how you should build it instead of whether you can build it.
The more powerful the tools we use become, the more our understanding, judgement and taste matter.
Understanding the product, the user and the problem. Being opinionated. Having a vision. Calibrating your taste. Leaning on your judgement.
Those things are still on you.
Simplicity doesn’t happen by accident. It comes from an understanding deep enough that you know what to remove, what to leave alone and what not to build at all.
“Simplicity is the ultimate sophistication.”
The next time you’re adding an element, an animation or a feature, think critically about why you’re adding it and whether adding it makes the final outcome better.
In the age of AI, knowing what not to build might be the most important skill of all.
More often than not, less is more, more or less.
Getty Images Accused AI of Wholesale Theft. It's Now an Official ChatGPT Image Partner
Getty Images has ended its public opposition to AI, signing a licensing agreement to integrate its library into OpenAI’s ChatGPT.
Summary
Original Article
Getty Images, once a fierce critic of how AI companies use creative content without permission, has struck a licensing deal with OpenAI to integrate its professionally licensed photos and visual assets into ChatGPT's search and discovery experiences. The partnership signals a broader industry shift, with rights holders and AI companies increasingly turning to formal licensing agreements rather than legal confrontations. For Getty, it's a new distribution channel and a vindication of its licensed-content stance; for OpenAI, it means access to one of the largest professional image libraries in the world.
Can John Ternus bring the fun back to Apple design?
Incoming Apple CEO John Ternus reportedly plans a significant design refresh, seeking to restore the personality and visual boldness of Apple's earlier products.
Summary
Original Article
Regular readers of my articles on Creative Bloq (hello, both!) will know that I've been rather down on Apple design over the last few years. While the company's noughties design was all about fun, the rise of the iPhone has given birth to a homogeny of black and silver glass slabs.
But there've been a few signs that times might be a-changing of late. After a string of duds, Apple's ad department seems to have rediscovered its mojo, and the success of the colourful MacBook Neo has cast a light on a general appetite for more personality-filled designs. And now, new reports suggest forthcoming Apple CEO John Ternus thinks a "major design shake-up" is needed. You and me both, John.
According to seasoned Apple leaker Mark Gurman, Ternus is set to re-establish the influence of Apple's design team, an influence that has apparently waned in recent years. Which sounds like we could be heading for an era of bold and fun design – the kind of design Apple used to be famous for.
Which is music to my ears. It's not just nostalgia that's had me reaching for my iPod in recent months; I genuinely find Apple's designs of the 2000s more delightful. Nowadays each successive device looks identical to the one it supposedly replaced. But back then, every generation of iPod presented a marked design departure from its predecessor.
And there was even a time, believe it or not, when each subsequent iteration of the iPhone looked distinctly different from the last. It wasn't until the iPhone 3GS that Apple rehashed an existing design – and even that was followed by the beautiful iPhone 4, which changed the look entirely. Nowadays, in a world of incremental and iterative updates, the idea of yearly new designs feels like a novelty.
But maybe, just maybe, that's about to change. I remember when using an Apple computer felt like a statement. In a world of black and grey Microsoft stuff, it stood for fun. Maybe it can again.
Why Your Creative Eye is the Most Valuable Thing in the Studio Right Now
Adobe is positioning its AI Studio as a 'grunt work' eliminator to emphasize that human taste is now the primary differentiator in creative work.
Summary
Deep Dive
- Ethical Sourcing: Adobe pays contributors for content used in Firefly model training, attempting to distinguish itself from competitors using scraped data.
- Workflow Integration: AI Studio targets the 'boring stuff' like resizing and background removal, specifically aiming to return time to designers for higher-level creative choices.
- Creative Value: The article argues that as production cost drops, the role of the creative professional shifts from creator to 'curator of taste.'
- Licensing: All starting assets are commercially safe, allowing agencies to use them without copyright concerns.
Decoder
- Firefly: Adobe's family of generative AI models designed for commercial safety and integrated directly into Creative Cloud apps.
Original Article
Why your creative eye is the most valuable thing in the studio right now
Now that anyone can make content in seconds, taste has become the most valuable asset. That means the tools worth using are the ones that take the grunt work without making you compromise on craft or how things are built. Adobe Stock's AI Studio does just that.
Something has flipped in the last 18 months. Since 2009, I've supported the creative industries through my platform, and during that time, I've always known that making the thing was the hardest part. You could have a brilliant idea over coffee, but turning it into a finished image, video, or usable layout took hours, skill and money. The bottleneck was production.
All that has changed. Content can now be made faster and cheaper than at any point in living memory, and the volume expected of creative teams has climbed to match. If you work in a small studio or an in-house team of three doing the work of 10, you already know this. The brief isn't "make something good" anymore. It's "make 40 good things, in five formats, by tomorrow".
Before I go any further, let me be straight about where I stand. I know a lot of you are wary of AI right now – some of you are super angry about it – and you have every right to be. Creative Boom has always been in your corner, and that's exactly why I want to talk about this honestly, rather than cheerlead.
Because the honest response to this isn't to celebrate. Plenty of people will try to sell you the "make more, faster" dream, and most working creatives are right to be wary of it, because "more" was never the point. I think the interesting shift is actually encouraging news. When anyone can produce content quickly, the thing that becomes scarce, and therefore more valuable, is knowing what's worth making at all. That's down to your taste and judgment. It's something Aporva Baxi pointed out in our recent podcast. It's about being able to look at 10 versions and say, with reasons, why the seventh shot is the one. That, to me, has always been the heart of any creative's job.
When making things gets easy, deciding what's good becomes the whole job.
This is where decent AI tools earn their place. The question I'd ask of any tool, then, isn't "how much can it churn out?" Instead, it's "does it give me back the hours that were lost in grunt work, so I can spend more time being creative?"
It's the test I've been putting Adobe Stock's newly launched AI Studio through. For instance, clearing an unwanted object from the background of an otherwise perfect shot. Nudging a colour to match the client's brand palette. Perhaps sizing the same content for 10 different placements. None of that is a creative act; it's the boring stuff around it. And handing that dull drudge to a machine, so I can focus on the choices that need taste, that's a very different proposition from being told to simply produce more, more, more.
I also appreciate where it starts. AI Studio sits atop a collection of nearly a billion images, videos, illustrations, vectors and music tracks. The starting point isn't a blank box and a text prompt; it's real content – shot, drawn, and composed by actual photographers, videographers, artists, and illustrators – that you can refine and play with until you're happy. You can also keep track of your edits with the new in-line editing feature. You're not summoning something from nothing. You're exercising your own judgment on assets that already have a great foundation.
That foundation really matters too – not just to you, but to other creators. The content you start with in AI Studio is commercially safe, and when you license it, the photographers, videographers and other artists who created it are paid as well. In plain terms, you can put it in front of a client without lying awake at night wondering whose work it was built from.
From there, you can choose the edits you feel most comfortable using – whether that's Firefly or other third-party models. The choice is yours. For a lot of us, that's the difference between a tool we'll actually adopt and one we'd avoid like the plague.
"Starting from high-quality, licensed content changes the role of AI. Instead of generating something from scratch, you can apply your own creativity to a strong foundation to make it your own," explains Matt Smith, VP of strategy, design and emerging products at Adobe.
Adobe Stock has also, on multiple occasions, provided bonus payments to contributors whose content was considered for training its Firefly models – it was among the first platforms to do so, and to use licensed content rather than scraping the open web. It isn't a perfect arrangement, but it's further down the road in paying and crediting the people underneath the tools than most alternatives.
And that's really the point. None of this makes the tool the hero. The hero is the person deciding what's worth making, what to cut, and what "good" looks like from one brief to another.
The tools that earn a place in your process are the ones that take the work that was never really yours to do and leave you with the part that is most fulfilling. This matters more than ever because, as content gets cheaper and faster to make, your taste goes from being a nice-to-have to the most valuable thing you can bring to the table.
"Now that making content is easier than ever, what matters is the creator's judgment. The creators who stand out now aren't the ones making the most; it's the ones who know what is actually worth making. And as technology evolves, creators should capture more value from their creativity, not less," adds Smith.
So no, I don't think the machines are going to replace us or our judgment. If anything, they'll make taste the only thing left to compete on. And for anyone who got into this industry because they care whether the work is actually good, that might be the most encouraging thing we've heard in ages.
Dead players
The shift toward lower-risk, impersonal innovation in tech risks replacing the erratic genius of founders who pursue impossible moonshots.
Summary
Original Article
Founders used to aim for the stars. iPhones, social networks, large language models, and cryptocurrency are the sort of non-linear increments that were either created by accident or in the process of trying to do the impossible or unprecedented. The non-linear option is now more impersonal and lower-risk than ever. This means people who have no personal reason to take risks and pursue a moonshot are more likely to try anyway.
Apple and Disney had conversations about merging, says Bob Iger
Bob Iger confirmed that Apple and Disney explored a potential merger, though the talks dissolved due to a lack of interest from Apple.
Summary
Original Article
The talks never went anywhere because Apple didn't show that much interest.
Amazon's new Fire TV interface is finally rolling out to more devices
Amazon is pushing a major Fire TV interface update to centralize content discovery and reduce app-hopping.
Summary
Decoder
- Alexa+: An updated, likely more capable iteration of Amazon's voice assistant focused on deeper OS integration and proactive content discovery.
Original Article
Amazon has started rolling out a redesigned Fire TV interface to current Fire TV devices and smart TVs, featuring faster performance, a cleaner layout, and content-focused sections (Movies, TV Shows, Sports, Live TV, and News) that help users find something to watch without hopping between apps. The update also deeply integrates Alexa+, reflecting Amazon's strategy to make Fire TV itself—not individual streaming apps—the primary destination for content discovery, following a trend already seen on platforms like Google TV, Roku, webOS, and Tizen.
The Organizational Cost of Low Taste
As AI makes content generation essentially free, organizational "taste" is becoming the most critical filter for preventing complex, low-quality bloat.
Summary
Original Article
Organizations fail not from poor strategy but from weak "taste" — a shared standard that filters bad options before they demand justification. Without it, decision paralysis sets in, politics replaces clarity, and complexity accumulates through unchecked addition. As AI makes creation nearly free, judgment becomes the scarcest and most decisive organizational resource.
Key Soft Skills to Succeed as a UX Designer
A 2022 survey confirms that UX hiring managers prioritize soft skills like communication and problem-solving over technical tool proficiency.
Summary
Decoder
- Heuristics: Mental shortcuts or established rules-of-thumb used to evaluate design usability and efficiency.
Original Article
Soft skills like communication, problem-solving, collaboration, and storytelling are essential for UX designers, not optional extras. A 2022 survey by the Interaction Design Foundation found that 73% of hiring managers prioritized communication and problem-solving, while only 13% chose tool proficiency. These intangible skills bind together UX activities and are largely transferable across professions.
CAD Copilot for Hardware Teams (Website)
Adam is a CAD copilot designed to assist hardware teams working in CADAM, Onshape, and Autodesk Fusion.
Summary
Decoder
- CAD: Computer-Aided Design, software used by engineers and architects to create precision drawings or 3D models.
Original Article
Adam is a CAD copilot for hardware teams working across CADAM, Onshape, and Autodesk Fusion.
AI Game Maker (Website)
Aippy is a platform for building, sharing, and exploring interactive AI-generated content.
Summary
Original Article
Aippy is a vibrant space to build, share, and explore interactive creations with AI.
A non-linear career path isn't a red flag
Creative industry veterans argue that non-linear 'squiggly' career paths are an asset, provided the transitions show intentional growth rather than inconsistent performance.
Summary
Original Article
In the creative industry, a non-linear or “zig-zag” career can be a strength rather than a weakness, as it exposes people to a wider range of experiences, skills, perspectives, and professional networks. The key is to make intentional career moves that support personal growth and to build a reputation for delivering results and maintaining strong professional relationships. Frequent job changes only become a concern when they suggest a pattern of short tenures without meaningful impact. Rather than focusing on a traditional ladder of promotions, creative professionals can benefit from embracing experimentation, following their interests, and clearly articulating how each career move contributed to their development and expertise.
Victoria Beckham and Artist Phoebe Collings-James Speak About Their New Artistic Collaboration
Victoria Beckham Beauty is collaborating with sculptor Phoebe Collings-James on ceramic art that interprets the texture and movement of its new Blush Stylus product.
Summary
Original Article
Victoria Beckham Beauty commissioned sculptor and ceramic artist Phoebe Collings-James to create a one-of-a-kind artwork inspired by the brand's new Blush Stylus product.