Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model (2 minute read)
Qwen's new 27B-parameter model reportedly delivers flagship-level coding performance in a package roughly 14x smaller than their previous best model.
Deep dive
- Qwen3.6-27B achieves what the team claims is flagship-level coding performance while weighing in at 55.6GB on disk, down from 807GB for their previous best model (roughly 14.5x smaller)
- The previous generation Qwen3.5-397B-A17B was a mixture-of-experts architecture with 397B total parameters and 17B active, while the new model is a 27B dense architecture
- Simon Willison tested a 16.8GB quantized version (Q4_K_M) locally using llama-server and found it delivered impressive results on SVG generation tasks
- The quantized model generated a complex SVG image (a pelican riding a bicycle) at 25.57 tokens/s, producing 4,444 tokens in under 3 minutes on consumer hardware
- The testing methodology used a specific llama-server configuration with reasoning mode enabled and thinking preserved to maximize coding performance
- The model downloaded to a local cache (~17GB) on first run, making it practical for offline use once cached
- Performance remained strong even with aggressive 4-bit quantization, suggesting the model's capabilities hold up well under heavy compression
- SVG generation quality was described as "outstanding" for a 16.8GB local model, demonstrating practical coding assistance capabilities
- The breakthrough suggests that architectural improvements and training advances are enabling smaller dense models to compete with much larger sparse models
Decoder
- Dense model: A model where all parameters are used for every inference, unlike mixture-of-experts which activates only a subset
- MoE (Mixture of Experts): Architecture with many parameter groups where only some are active per request, allowing larger total size with lower active memory
- Quantized: Compressed model using lower precision numbers (like 4-bit instead of 16-bit) to reduce size while preserving most capability
- GGUF: The llama.cpp file format for packaging model weights, often quantized, for efficient local inference on CPUs and GPUs
- Agentic coding: AI systems that can autonomously plan, execute, and iterate on programming tasks rather than just generating code snippets
- Q4_K_M: A llama.cpp quantization scheme that stores most weights at roughly 4 bits of precision, with the "M" (medium) variant trading a little extra size for better quality; see the rough size arithmetic after this list
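For intuition on where these file sizes come from, here is a back-of-the-envelope sketch in Python. It assumes ~16 bits per weight for the full-precision release and an effective ~5 bits per weight for Q4_K_M (real files add metadata and some higher-precision tensors), so these are estimates rather than exact figures:

def size_gb(params_billions, bits_per_weight):
    # Approximate on-disk size in GB: parameters x bits per weight / 8 bits per byte
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

print(f"Qwen3.6-27B at 16-bit:          ~{size_gb(27, 16):.1f} GB (listed: 55.6 GB)")
print(f"Qwen3.6-27B at Q4_K_M (~5-bit): ~{size_gb(27, 5):.1f} GB (listed: 16.8 GB)")
print(f"Qwen3.5-397B total at 16-bit:   ~{size_gb(397, 16):.1f} GB (listed: 807 GB)")
print(f"Qwen3.5 17B active at 16-bit:   ~{size_gb(17, 16):.1f} GB touched per token")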
Original article
Qwen3.6-27B: Flagship-Level Coding in a 27B Dense Model
Big claims from Qwen about their latest open weight model:
Qwen3.6-27B delivers flagship-level agentic coding performance, surpassing the previous-generation open-source flagship Qwen3.5-397B-A17B (397B total / 17B active MoE) across all major coding benchmarks.
On Hugging Face, Qwen3.5-397B-A17B is 807GB; this new Qwen3.6-27B is 55.6GB.
I tried it out with the 16.8GB Unsloth Qwen3.6-27B-GGUF:Q4_K_M quantized version and llama-server, using this recipe by benob on Hacker News, after first installing llama-server with brew install llama.cpp:
llama-server \
-hf unsloth/Qwen3.6-27B-GGUF:Q4_K_M \
--no-mmproj \
--fit on \
-np 1 \
-c 65536 \
--cache-ram 4096 -ctxcp 2 \
--jinja \
--temp 0.6 \
--top-p 0.95 \
--top-k 20 \
--min-p 0.0 \
--presence-penalty 0.0 \
--repeat-penalty 1.0 \
--reasoning on \
--chat-template-kwargs '{"preserve_thinking": true}'
On first run that saved the ~17GB model to ~/.cache/huggingface/hub/models--unsloth--Qwen3.6-27B-GGUF.
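Once llama-server is up, it also exposes an OpenAI-compatible HTTP API. Here's a minimal sketch, assuming the default listen address of http://localhost:8080, that sends the pelican prompt used below with only Python's standard library:

import json
import urllib.request

# Build an OpenAI-style chat completion request for the local llama-server
payload = {
    "messages": [
        {"role": "user", "content": "Generate an SVG of a pelican riding a bicycle"}
    ],
    "temperature": 0.6,
    "top_p": 0.95,
}
req = urllib.request.Request(
    "http://localhost:8080/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# The response follows the standard OpenAI chat completions shape
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)
print(body["choices"][0]["message"]["content"])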
Here's the transcript for "Generate an SVG of a pelican riding a bicycle". This is an outstanding result for a 16.8GB local model:

Performance numbers reported by llama-server:
- Reading: 20 tokens, 0.4s, 54.32 tokens/s
- Generation: 4,444 tokens, 2min 53s, 25.57 tokens/s
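Those generation numbers check out; a quick sanity calculation (the small gap comes from the elapsed time being rounded to whole seconds):

elapsed_s = 2 * 60 + 53   # reported generation time: 2min 53s = 173s
print(4_444 / elapsed_s)  # ~25.7 tokens/s, close to the reported 25.57 tokens/s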
For good measure, here's "Generate an SVG of a NORTH VIRGINIA OPOSSUM ON AN E-SCOOTER" (run previously with GLM-5.1):

That one took 6,575 tokens, 4min 25s, 24.74 t/s.