Devoured - May 01, 2026
SMG: The Case for Disaggregating CPU from GPU in LLM Serving (16 minute read)

Shepherd Model Gateway (SMG) is a Rust-based LLM serving layer that eliminates Python's Global Interpreter Lock (GIL) bottleneck by moving all CPU-bound work off the GPU inference path and into Rust, achieving up to 3.5x throughput improvements in production.

What: SMG is an open-source model-routing gateway that disaggregates CPU-bound operations (tokenization, tool orchestration, multimodal processing, reasoning parsing) from GPU inference by running them in a pure Rust layer that communicates with inference engines via gRPC, supporting SGLang, vLLM, TensorRT-LLM, and cloud providers.
Why it matters: At large scale with fast GPUs like H100s, Python's single-threaded GIL becomes the CPU bottleneck: tokenization overhead leaves expensive GPUs idle, waiting for input. The project shows that moving these workloads to Rust removes the constraint, with benchmarks indicating that the advantage grows with concurrency and context length, exactly when it matters most in production. The architecture also lets the gateway and engine layers evolve independently.
Takeaway: If you're running production LLM serving with vLLM or SGLang, you can try SMG with `pip install smg --upgrade` to get gRPC-based serving with cache-aware routing and tool orchestration.
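
Since the original article describes SMG as fully OpenAI-compatible, a first smoke test could go through the standard OpenAI Python client. The listen address, port, and model name below are assumptions for illustration; consult SMG's docs for the actual launch command and flags.

```python
# Hypothetical smoke test against a locally running SMG gateway.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed SMG listen address
    api_key="unused",                     # local gateways often ignore the key
)

# Stream a completion through the gateway to whatever model the
# attached engines (e.g. vLLM or SGLang) are serving.
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # assumed model name
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in resp:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```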
Deep dive
  • SMG was created to solve a production problem at scale: Python's GIL puts a single-threaded ceiling on tokenization and detokenization, which becomes the bottleneck once GPUs are fast enough, leaving hundreds of thousands of dollars' worth of GPU hardware idle (a minimal illustration of this GIL ceiling follows the list)
  • The core architectural bet is disaggregation: GPUs should only do tensor math, while everything else (tokenization, tool orchestration, multimodal preprocessing, reasoning parsing, chat history) belongs in a dedicated Rust serving layer with zero GIL contention
  • The team rebuilt the entire serving pipeline around a native Rust gRPC data plane where the gateway sends preprocessed tokens to engines and receives generated tokens back, with all other processing happening gateway-side
  • SMG rewrote major components of Hugging Face's Python image processors from scratch in Rust to enable vision preprocessing with zero Python overhead, supporting eight vision model families (Llama-4 Vision, Qwen-VL, etc.)—claimed as an industry first
  • The gateway implements a two-level tokenizer cache (L0 exact-match for repeated prompts, L1 prefix-aware at special-token boundaries; sketched after the list) and includes fifteen model-specific parsers for extracting reasoning blocks and function calls from streaming tokens
  • MCP tool orchestration runs entirely in the gateway with a Universal Built-in Tools feature that turns any MCP server into native capabilities like FileSearch and WebSearch, letting you deploy Llama or Qwen with GPT-4-style tools
  • WASM middleware provides sandboxed extensibility for custom authentication, PII redaction, cost tracking, and compliance logging without forking the codebase—another claimed industry first
  • Benchmarks on H100s using NVIDIA GenAI-Perf across 8 models, 2 runtimes, and 1,082 comparison points show gRPC delivers ~8% more throughput at high concurrency (256), growing to 12.2% with long contexts (7,800 tokens)
  • The most dramatic result: Llama-3.3-70B-FP8 with 7,800-token inputs achieved 3.5x higher output throughput (1,150 tok/s vs 327 tok/s) because HTTP/JSON serialization became the dominant bottleneck, while gRPC uses compact binary encoding (a back-of-the-envelope size comparison follows the list)
  • The project includes eight routing policies, among them cache-aware routing rewritten from the ground up (10-12x faster, 99% memory reduction) that uses event-driven KV cache state streaming to reduce TTFT p99 by 28% in production (a toy version of the policy follows the list)
  • SMG supports five native agentic APIs (OpenAI Chat Completions and Responses, Anthropic Messages, Gemini Interactions, and Realtime WebSocket) as first-class implementations, not translation layers, and is the only open-source gateway supporting OpenAI's Responses API
  • Production adoption includes Google Cloud Platform, Oracle Cloud Infrastructure, Alibaba Cloud, and TogetherAI, with the gRPC protocol adopted upstream in vLLM (PR #36169) and NVIDIA TensorRT-LLM (five merged PRs)
  • The architecture is designed to compose with other infrastructure layers like NVIDIA Dynamo and llm-d rather than replace them, operating at the serving/protocol boundary while those projects handle engine optimization and cluster orchestration
  • The project shipped thirteen releases in six months and is fully modularized into standalone crates (smg-auth, smg-mesh, smg-mcp, smg-wasm, llm-tokenizer, llm-multimodal) with cross-platform support (Linux, Windows, macOS, x86, ARM) from a single Python wheel
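
To make the first Deep dive bullet concrete, here is a minimal, self-contained illustration of the GIL ceiling: CPU-bound work standing in for tokenization gains nothing from Python threads, which is exactly the constraint SMG's Rust layer removes. The workload is synthetic, not SMG code.

```python
# Synthetic demo of the GIL bottleneck: CPU-bound "tokenization"
# does not speed up across Python threads, because CPython lets
# only one thread execute bytecode at a time.
import time
from concurrent.futures import ThreadPoolExecutor

def fake_tokenize(text: str) -> int:
    # Stand-in for CPU-bound tokenizer work.
    return sum(ord(c) for c in text * 2000)

docs = ["a moderately long prompt for the model to process"] * 64

t0 = time.perf_counter()
for doc in docs:
    fake_tokenize(doc)
serial_s = time.perf_counter() - t0

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(fake_tokenize, docs))
threaded_s = time.perf_counter() - t0

# Expect threaded_s roughly equal to serial_s: 8 threads, ~1x speedup.
print(f"serial:   {serial_s:.2f}s")
print(f"threaded: {threaded_s:.2f}s")
```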
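The two-level tokenizer cache bullet can also be sketched in a few lines. This is a hypothetical reconstruction from the description alone (exact-match L0, prefix-aware L1 split at special-token boundaries); SMG's real implementation is in Rust and its structure is not documented here.

```python
# Hypothetical two-level tokenizer cache:
#   L0: exact prompt -> token IDs (repeated prompts)
#   L1: cached prefixes keyed at special-token boundaries, so a new
#       suffix only costs tokenizing the suffix.
from typing import Callable

class TwoLevelCache:
    def __init__(self, tokenize: Callable[[str], list[int]],
                 boundary: str = "<|eot_id|>"):  # assumed special token
        self.tokenize = tokenize
        self.boundary = boundary
        self.l0: dict[str, list[int]] = {}  # exact-match cache
        self.l1: dict[str, list[int]] = {}  # prefix cache at boundaries

    def encode(self, text: str) -> list[int]:
        if text in self.l0:                  # L0 hit: repeated prompt
            return self.l0[text]
        ids = None
        idx = text.rfind(self.boundary)
        while idx != -1:                     # L1: longest cached prefix
            prefix = text[: idx + len(self.boundary)]
            if prefix in self.l1:
                # Splitting at a special token keeps token boundaries
                # stable, so prefix IDs + suffix IDs match a full pass.
                ids = self.l1[prefix] + self.tokenize(text[len(prefix):])
                break
            idx = text.rfind(self.boundary, 0, idx)
        if ids is None:
            ids = self.tokenize(text)        # cold path: full tokenization
        self.l0[text] = ids
        last = text.rfind(self.boundary)
        if last != -1:                       # remember the longest prefix
            # (A real implementation would slice `ids` instead of
            # re-tokenizing the prefix.)
            prefix = text[: last + len(self.boundary)]
            self.l1.setdefault(prefix, self.tokenize(prefix))
        return ids
```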
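The 3.5x long-context result comes down to wire format. A quick size check with fabricated token IDs shows why text JSON loses to binary encoding on the hot path (gRPC's protobuf is comparably compact to the fixed-width packing used here):

```python
# Back-of-the-envelope size comparison for a 7,800-token payload.
import json
import struct

token_ids = list(range(100_000, 107_800))  # 7,800 fabricated token IDs

json_bytes = json.dumps({"input_ids": token_ids}).encode("utf-8")
bin_bytes = struct.pack(f"<{len(token_ids)}I", *token_ids)

print(f"JSON:   {len(json_bytes):,} bytes")  # ~7 text bytes per token
print(f"binary: {len(bin_bytes):,} bytes")   # 4 bytes per token
# Size is only part of the story: parsing JSON text on the hot path
# also burns CPU that binary decoding avoids.
```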
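Finally, the cache-aware routing policy can be approximated in miniature: send the request to the worker whose KV cache already holds the longest prefix of its tokens, otherwise to the least-loaded worker. The `min_match` threshold and set-of-prefixes bookkeeping are illustrative assumptions; per the bullet above, the real implementation tracks cache state via event streams and is far more memory-efficient.

```python
# Toy cache-aware router: longest matching cached prefix wins,
# else fall back to the least-loaded worker.
from dataclasses import dataclass, field

@dataclass
class Worker:
    name: str
    inflight: int = 0
    cached_prefixes: set[tuple[int, ...]] = field(default_factory=set)

def longest_cached_prefix(worker: Worker, tokens: list[int]) -> int:
    # Linear scan for clarity; a real router would use a radix tree.
    for n in range(len(tokens), 0, -1):
        if tuple(tokens[:n]) in worker.cached_prefixes:
            return n
    return 0

def route(workers: list[Worker], tokens: list[int],
          min_match: int = 32) -> Worker:
    # Prefer the longest cached prefix (ties broken by lighter load);
    # ignore matches too short to be worth overriding load balancing.
    match_len, best = max(
        ((longest_cached_prefix(w, tokens), w) for w in workers),
        key=lambda pair: (pair[0], -pair[1].inflight),
    )
    if match_len >= min_match:
        return best
    return min(workers, key=lambda w: w.inflight)

workers = [Worker("gpu-0"), Worker("gpu-1")]
workers[1].cached_prefixes.add(tuple(range(64)))
assert route(workers, list(range(128))).name == "gpu-1"
```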
Decoder
  • GIL (Global Interpreter Lock): Python's mechanism that allows only one thread to execute Python bytecode at a time, creating a single-threaded bottleneck for CPU-bound operations even on multi-core systems
  • gRPC: A high-performance RPC framework using HTTP/2 and Protocol Buffers for compact binary serialization, contrasting with text-based HTTP/JSON
  • Prefill-decode disaggregation: An architecture that separates the initial prompt processing phase (prefill) from the token generation phase (decode) across different GPU pools for optimization
  • MCP (Model Context Protocol): A protocol for tool orchestration in LLM systems, allowing models to invoke external tools and services
  • WASM (WebAssembly): A binary instruction format that enables sandboxed execution of code, used here for safe extensibility plugins
  • TTFT (Time To First Token): The latency from receiving a request to generating the first output token, a key performance metric for interactive LLM applications
  • SWIM protocol: Scalable Weakly-consistent Infection-style process group Membership protocol, used for distributed cluster membership and failure detection
  • CRDT (Conflict-free Replicated Data Type): Data structures that can be replicated across nodes and merged without conflicts, enabling eventually-consistent distributed state
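
For the CRDT entry, a grow-only counter is the textbook example and shows the conflict-free merge property in a few lines; this is a generic illustration, unrelated to SMG's internals.

```python
# G-Counter: the simplest CRDT. Each node increments only its own
# slot; merging takes the element-wise max, which is commutative,
# associative, and idempotent, so replicas converge in any order.
class GCounter:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

a, b = GCounter("node-a"), GCounter("node-b")
a.increment(3)
b.increment(5)
a.merge(b)
b.merge(a)
assert a.value() == b.value() == 8
```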
Original article

Shepherd Model Gateway (SMG) is a high-performance model-routing gateway for large-scale LLM deployments. It centralizes worker lifecycle management, balances traffic across HTTP/gRPC/OpenAI-compatible backends, and provides enterprise-ready control over history storage, MCP tooling, and privacy-sensitive workflows. SMG has full OpenAI and Anthropic API compatibility across SGLang, vLLM, TRT-LLM, OpenAI, Gemini, and more. This post discusses the underlying architecture behind the gateway.