Devoured - April 30, 2026
oLLM (GitHub Repo)

oLLM lets developers run massive language models with 100k+ token contexts on consumer GPUs by offloading weights and cache to SSD instead of keeping everything in expensive GPU memory.

What: oLLM is an open-source Python library built on PyTorch and Hugging Face Transformers that enables large-context LLM workloads on modest hardware. For example, it can run a 160GB model with 50k context on a $200 GPU with only 8GB VRAM by streaming model layers and KV cache from SSD on demand, without quantization.
Why it matters: This makes privacy-preserving local analysis of long documents, medical records, contracts, or logs accessible without cloud APIs or expensive hardware, using full-precision models instead of degraded quantized versions.
Takeaway: Install with `pip install --no-build-isolation ollm` and try running models like Llama-3.1-8B on 100k context with just 6-7GB VRAM—check the GitHub repo for examples including multimodal support.
Deep dive
  • oLLM achieves dramatic VRAM reduction by loading model layer weights from SSD directly to GPU one at a time rather than holding all weights in memory simultaneously (see the sketch after this list)
  • The library offloads KV cache (attention state that grows with context length) to SSD and loads it back to GPU on demand, avoiding the massive memory costs of long contexts
  • Example benchmarks: qwen3-next-80B (160GB model) with 50k context uses only 7.5GB GPU memory instead of 190GB, with 180GB on SSD
  • Llama-3.1-8B with 100k context runs in 6.6GB VRAM instead of 71GB by offloading 69GB to disk
  • The implementation uses FlashAttention-2 with online softmax to avoid materializing the full attention matrix, which would be huge for long contexts
  • MLP layers are chunked to handle large intermediate activations without memory spikes
  • No quantization is used—models run at full fp16/bf16 precision, avoiding quality degradation from compression
  • Recent updates added multimodal support including voxtral-small-24B for audio+text and gemma3-12B for image+text processing
  • AutoInference feature enables running any Llama3 or gemma3 model with PEFT adapter support for fine-tuned models
  • Performance varies by model: qwen3-next-80B achieves 1 token per 2 seconds, making it viable for offline batch processing
  • The library works across NVIDIA, AMD, and Apple Silicon GPUs, with optional kvikio and flash-attn dependencies for NVIDIA performance boosts
  • Target use cases include analyzing contracts, medical histories, compliance reports, large log files, and historical customer support chats entirely locally
  • Optional CPU offloading of some layers can provide additional speed improvements by balancing between GPU, CPU, and disk
  • Built on standard PyTorch and Hugging Face infrastructure, making it compatible with the existing ecosystem of models and tools
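To make the layer-streaming idea concrete, here is a minimal, illustrative sketch of the loop described in the first bullet. It is not oLLM's actual implementation: the toy linear "layers", file names, and sizes are stand-ins chosen so the snippet runs on its own.

import torch
import torch.nn as nn

device = "cuda:0" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device.startswith("cuda") else torch.float32
hidden, n_layers = 1024, 8

# Toy stand-in for a transformer stack: persist each "layer" to disk up front,
# then stream the layers back one at a time during the forward pass.
for i in range(n_layers):
    torch.save(nn.Linear(hidden, hidden).to(dtype).state_dict(), f"layer_{i:03d}.pt")

def forward_streaming(x):
    # Only one layer's weights occupy GPU (or CPU) memory at any time.
    layer = nn.Linear(hidden, hidden).to(dtype).to(device)
    for i in range(n_layers):
        state = torch.load(f"layer_{i:03d}.pt", map_location=device)  # SSD -> GPU for this layer only
        layer.load_state_dict(state)
        with torch.no_grad():
            x = layer(x)
        del state  # drop this layer's weights before loading the next
    return x

x = torch.randn(4, hidden, dtype=dtype, device=device)
print(forward_streaming(x).shape)  # torch.Size([4, 1024])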
Decoder
  • KV cache: Key-Value cache that stores attention layer states to avoid recomputing them; grows linearly with context length and becomes a major memory bottleneck for long contexts
  • VRAM: Video RAM on the GPU, the fast memory where model computations happen; much more expensive per GB than regular RAM or SSD storage
  • Quantization: Reducing model precision from 16-bit to 8-bit or 4-bit numbers to save memory, usually with some quality loss
  • FlashAttention: Optimized attention algorithm that computes attention scores in chunks without materializing the full attention matrix, dramatically reducing memory usage
  • MLP: Multi-Layer Perceptron, the feedforward neural network layers in transformers that can create large intermediate activations
  • PEFT: Parameter-Efficient Fine-Tuning, methods like LoRA that fine-tune models by adding small adapter layers instead of updating all weights
  • Offloading: Moving data from fast but limited GPU memory to slower but larger storage (CPU RAM or SSD) and loading it back only when needed
Original article

LLM Inference for Large-Context Offline Workloads

oLLM is a lightweight Python library for large-context LLM inference, built on top of Hugging Face Transformers and PyTorch. It enables running models like gpt-oss-20B, qwen3-next-80B or Llama-3.1-8B-Instruct on 100k context using a ~$200 consumer GPU with 8GB VRAM. No quantization is used, only fp16/bf16 precision.

Latest updates (1.0.3) 🔥

  • AutoInference with any Llama3 / gemma3 model + PEFT adapter support
  • kvikio and flash-attn are now optional, so there are no hardware restrictions beyond what HF Transformers supports
  • Multimodal voxtral-small-24B (audio+text) added. [sample with audio]
  • Multimodal gemma3-12B (image+text) added. [sample with image]
  • qwen3-next-80B (160GB model) added with ⚡️1tok/2s throughput (our fastest model so far)
  • gpt-oss-20B flash-attention-like implementation added to reduce VRAM usage
  • gpt-oss-20B chunked MLP added to reduce VRAM usage

Inference memory usage on an 8GB NVIDIA 3060 Ti:

| Model | Weights | Context length | KV cache | Baseline VRAM (no offload) | oLLM GPU VRAM | oLLM Disk (SSD) |
|---|---|---|---|---|---|---|
| qwen3-next-80B | 160 GB (bf16) | 50k | 20 GB | ~190 GB | ~7.5 GB | 180 GB |
| gpt-oss-20B | 13 GB (packed bf16) | 10k | 1.4 GB | ~40 GB | ~7.3 GB | 15 GB |
| gemma3-12B | 25 GB (bf16) | 50k | 18.5 GB | ~45 GB | ~6.7 GB | 43 GB |
| llama3-1B-chat | 2 GB (bf16) | 100k | 12.6 GB | ~16 GB | ~5 GB | 15 GB |
| llama3-3B-chat | 7 GB (bf16) | 100k | 34.1 GB | ~42 GB | ~5.3 GB | 42 GB |
| llama3-8B-chat | 16 GB (bf16) | 100k | 52.4 GB | ~71 GB | ~6.6 GB | 69 GB |

By "Baseline" we mean typical inference without any offloading

How do we achieve this:

  • Loading layer weights from SSD directly to GPU one by one
  • Offloading KV cache to SSD and loading back directly to GPU, no quantization or PagedAttention
  • Offloading layer weights to CPU if needed
  • FlashAttention-2 with online softmax. The full attention matrix is never materialized (see the sketch after this list).
  • Chunked MLP. Intermediate up-projection activations may get large, so we chunk the MLP as well.
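Below is a minimal sketch of the online-softmax idea behind the FlashAttention-2 bullet: scores are computed one key/value chunk at a time while a running max and normalizer are carried along, so the full q_len x kv_len score matrix never exists. This is illustrative PyTorch only, not oLLM's kernel.

import torch

def chunked_attention(q, k, v, chunk=1024):
    # q: (heads, q_len, d); k, v: (heads, kv_len, d)
    # Online softmax over KV chunks; the (q_len, kv_len) score matrix is never materialized.
    scale = q.shape[-1] ** -0.5
    m = q.new_full((*q.shape[:-1], 1), float("-inf"))  # running row max
    l = torch.zeros_like(m)                            # running softmax denominator
    acc = torch.zeros_like(q)                          # running numerator (weights @ V)
    for k_c, v_c in zip(k.split(chunk, dim=1), v.split(chunk, dim=1)):
        s = q @ k_c.transpose(-1, -2) * scale                  # scores for this chunk only
        m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
        p = torch.exp(s - m_new)
        correction = torch.exp(m - m_new)                      # rescale old stats to the new max
        l = l * correction + p.sum(dim=-1, keepdim=True)
        acc = acc * correction + p @ v_c
        m = m_new
    return acc / l

# Agrees with ordinary (fully materialized) attention on random inputs.
q, k, v = torch.randn(2, 4, 64), torch.randn(2, 4096, 64), torch.randn(2, 4096, 64)
ref = torch.softmax(q @ k.transpose(-1, -2) * 64 ** -0.5, dim=-1) @ v
print(torch.allclose(chunked_attention(q, k, v), ref, atol=1e-4))  # True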

Typical use cases include:

  • Analyze contracts, regulations, and compliance reports in one pass
  • Summarize or extract insights from massive patient histories or medical literature
  • Process very large log files or threat reports locally
  • Analyze historical chats to extract the most common issues/questions users have

Supported GPUs: NVIDIA (with additional performance benefits from kvikio and flash-attn), AMD, and Apple Silicon (MacBook).

Getting Started

It is recommended to create a venv or conda environment first:

python3 -m venv ollm_env
source ollm_env/bin/activate

Install oLLM with `pip install --no-build-isolation ollm` or from source:

git clone https://github.com/Mega4alik/ollm.git
cd ollm
pip install --no-build-isolation -e .

# for NVIDIA GPUs with CUDA (optional): speeds up inference
pip install kvikio-cu{cuda_version}  # e.g. kvikio-cu12

💡 Note: voxtral-small-24B requires additional pip dependencies: `pip install "mistral-common[audio]"` and `pip install librosa`

Check out the Troubleshooting section in case of any installation issues.

Example


from ollm import Inference, file_get_contents, TextStreamer
o = Inference("llama3-1B-chat", device="cuda:0", logging=True) #llama3-1B/3B/8B-chat, gpt-oss-20B, qwen3-next-80B
o.ini_model(models_dir="./models/", force_download=False)
o.offload_layers_to_cpu(layers_num=2) #(optional) offload some layers to CPU for speed boost
past_key_values = o.DiskCache(cache_dir="./kv_cache/") #set None if context is small
text_streamer = TextStreamer(o.tokenizer, skip_prompt=True, skip_special_tokens=False)

messages = [{"role":"system", "content":"You are helpful AI assistant"}, {"role":"user", "content":"List planets"}]
input_ids = o.tokenizer.apply_chat_template(messages, reasoning_effort="minimal", tokenize=True, add_generation_prompt=True, return_tensors="pt").to(o.device)
outputs = o.model.generate(input_ids=input_ids,  past_key_values=past_key_values, max_new_tokens=500, streamer=text_streamer).cpu()
answer = o.tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=False)
print(answer)

Or run the sample Python script with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True python example.py`.
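Building on the example above, here is a hedged sketch of the long-document use case. The file path and prompt are made up, and it assumes the file_get_contents helper imported above simply returns the file's text; tokenization and generation then proceed exactly as in the snippet.

# Hypothetical long-context variant of the example above (path and prompt are illustrative).
doc = file_get_contents("./docs/contract.txt")  # assumed to return the file contents as a string
messages = [
    {"role": "system", "content": "You are a contract analyst."},
    {"role": "user", "content": doc + "\n\nList every termination clause with its section number."},
]
# ...then apply_chat_template and o.model.generate as shown above, keeping DiskCache for the long context.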

# with AutoInference, you can run any Llama3/gemma3 model with PEFT adapter support
# pip install peft 
from ollm import AutoInference
o = AutoInference("./models/gemma3-12B", # any llama3 or gemma3 model
  adapter_dir="./myadapter/checkpoint-20", # PEFT adapter checkpoint if available
  device="cuda:0", multimodality=False, logging=True)
...

More samples

Roadmap

For visibility into what's coming next (subject to change):

  • Qwen3-Next quantized version
  • Qwen3-VL or alternative vision model
  • Qwen3-Next MultiTokenPrediction in R&D

Contact us

If there's a model you'd like to see supported, feel free to suggest it in the discussion — I'll do my best to make it happen.