Devoured - April 23, 2026
Imagine every pixel on your screen, streamed live directly from a model (3 minute read)
Flipbook streams entire user interfaces pixel-by-pixel from an AI video model at 1080p 24fps, replacing HTML and layout engines with purely generative visuals.

What: Flipbook is a prototype that generates user interfaces entirely from an optimized AI video model, streaming live to browsers via websockets connected to serverless GPUs, with no HTML, CSS, or traditional layout code involved.
Why it matters: This represents a fundamentally different UI paradigm where models generate pixel-level output instead of structured markup, enabling interfaces that fluidly reshape themselves and make any region interactive without predefined buttons or components.
Takeaway: Try the early prototype at flipbook.page to experience model-generated interfaces, though the team notes it's slow and many demos are sped up.
Deep dive
  • The team optimized LTX Studio's video model heavily enough to stream live 1080p video at 24fps directly to browsers, a significant technical achievement for real-time generative rendering
  • Uses websockets to connect directly to Modal Labs serverless GPU infrastructure, enabling on-demand compute without managing servers
  • Since there's no rigid layout engine, illustrations automatically reshape themselves to fit any window size, moving beyond fixed responsive breakpoints
  • Any region of the generated image can become interactive, not just predefined clickable elements like traditional buttons or links
  • Currently optimized for visual explanations due to model limitations, but the team expects the use case range to expand as models become more accurate and stateful
  • The approach could eventually extend to traditionally structured UIs like coding environments, though this remains speculative
  • Represents an emerging "model-native" interface paradigm where AI generates every pixel rather than rendering structured markup
  • The demos shown in the announcement are sped up and edited, indicating current performance still has room for improvement
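To put the streaming claim in perspective, here is a rough back-of-envelope calculation, assuming uncompressed RGB frames (the actual stream is presumably compressed; nothing here reflects Flipbook's real encoding):

```python
# Why live 1080p at 24 fps is a demanding target for generative rendering.
WIDTH, HEIGHT, FPS = 1920, 1080, 24
BYTES_PER_PIXEL = 3  # uncompressed 8-bit RGB

raw_bytes_per_frame = WIDTH * HEIGHT * BYTES_PER_PIXEL   # ~6.2 MB per frame
raw_bytes_per_second = raw_bytes_per_frame * FPS         # ~149 MB/s uncompressed
frame_budget_ms = 1000 / FPS                             # ~41.7 ms to generate, encode, and send each frame
```

At roughly 42 ms per frame, the model, the encoder, and the network round-trip all have to fit inside a single frame's budget, which is why the optimization work on the video model is the core of the achievement.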
Decoder
  • Layout engine: Software that calculates where elements appear on screen (like browser rendering engines for HTML/CSS)
  • Serverless GPU: On-demand access to graphics processors for compute-intensive tasks without managing server infrastructure
  • Websockets: Protocol enabling real-time two-way communication between browser and server
  • Stateful models: AI models that maintain context and memory across interactions, rather than treating each request independently
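The pieces above (websockets carrying frames from serverless GPUs, paced at 24 fps) can be sketched as a minimal server-side loop. The header format, function names, and pacing logic here are illustrative assumptions; the announcement only says frames reach the browser over websockets:

```python
import asyncio
import struct
import time

FPS = 24
FRAME_INTERVAL = 1.0 / FPS  # ~41.7 ms per frame at 24 fps


def encode_frame_header(frame_index: int, payload_len: int) -> bytes:
    """Pack a tiny binary header (hypothetical wire format): frame index + payload length."""
    return struct.pack("!IQ", frame_index & 0xFFFFFFFF, payload_len)


async def stream_frames(send, render_frame, duration_s: float = 1.0):
    """Push rendered frames through `send` (e.g. a websocket's send coroutine),
    sleeping between frames to hold the target frame rate."""
    start = time.monotonic()
    frame_index = 0
    while time.monotonic() - start < duration_s:
        payload = render_frame(frame_index)  # stand-in for the model's encoded pixel output
        await send(encode_frame_header(frame_index, len(payload)) + payload)
        frame_index += 1
        # Sleep until the next frame slot rather than a fixed interval,
        # so render time doesn't accumulate into drift.
        next_deadline = start + frame_index * FRAME_INTERVAL
        await asyncio.sleep(max(0.0, next_deadline - time.monotonic()))
```

On the browser side, a websocket client would set `binaryType` and blit each decoded payload to a canvas; the deadline-based pacing (rather than `sleep(FRAME_INTERVal)`) is what keeps long render times from sliding the whole stream behind schedule.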
Original article

Imagine every pixel on your screen, streamed live directly from a model. No HTML, no layout engine, no code. Just exactly what you want to see.

@eddiejiao_obj, @drewocarr and I built a prototype to see how this could actually work, and set out to make it real. We're calling it Flipbook.

Because there's no strict layout engine, illustrations reshape themselves to fit your window. And any region of the image can become interactive, not just the parts someone decided to make a button.

To bring the imagery to life, we heavily optimized @LTXStudio's video model. Enough to stream live 1080p video at 24fps directly to your screen, connecting directly via websockets to @modal_labs serverless GPU infra.

Today, Flipbook is limited, so we designed it around visual explanations. As the models get more accurate and more stateful, the set of things worth doing this way will expand. Even ones you'd assume need structured UIs like coding.

All of this is live! It's early and slow. Many of the demos above are sped up/edited, but we can't wait to see what you think. Try it yourself at flipbook.page