Devoured - April 21, 2026
Qwen3.5-Omni Technical Report (32 minute read)

Qwen Team releases Qwen3.5-Omni, a multimodal model that scales to hundreds of billions of parameters, processes text, audio, and video with a 256k context length, and surpasses Gemini 3.1 Pro on key audio benchmarks.

What: Qwen3.5-Omni is a multimodal AI model using a Hybrid Attention Mixture-of-Experts architecture, trained on over 100 million hours of audio-visual content. It handles over 10 hours of audio and 400 seconds of 720P video, and supports speech generation in 10 languages.
Why it matters: The model demonstrates significant advances in multimodal AI, with novel capabilities like Audio-Visual Vibe Coding (generating code directly from audio-visual instructions) and ARIA, a mechanism that addresses long-standing streaming speech synthesis quality issues by dynamically aligning text and speech units.
Takeaway: Developers building multimodal AI applications can review the technical report to understand the architecture and benchmark performance of this Gemini competitor.
Deep dive
  • Achieves state-of-the-art results across 215 audio and audio-visual benchmarks, surpassing Gemini 3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding
  • Scales to hundreds of billions of parameters with 256k context length, enabling processing of over 10 hours of audio or 400 seconds of 720P video at 1 FPS
  • Uses Hybrid Attention Mixture-of-Experts framework for both Thinker (understanding/reasoning) and Talker (speech generation) components to enable efficient long-sequence inference
  • Introduces ARIA to address streaming speech synthesis instability caused by encoding efficiency discrepancies between text and speech tokenizers, improving prosody and naturalness with minimal latency impact (see the alignment sketch after this list)
  • Trained on massive heterogeneous datasets including text-vision pairs and over 100 million hours of audio-visual content
  • Supports multilingual understanding and speech generation across 10 languages with human-like emotional nuance in output
  • Demonstrates superior audio-visual grounding capabilities with script-level structured captions, precise temporal synchronization, and automated scene segmentation
  • Exhibits emergent Audio-Visual Vibe Coding capability, directly generating code from audio-visual instructions without intermediate text representation
  • Represents significant evolution over predecessor Qwen-Omni models in scale, capability, and performance
  • Model family includes Qwen3.5-Omni-plus variant that achieves the top benchmark results
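The ARIA bullet above can be made concrete with a small, heavily hypothetical sketch. The report does not publish ARIA's algorithm; the code below only illustrates the general idea of keeping a text stream and a speech-unit stream aligned when each text token maps to a variable number of speech units. Every name here (align_stream, speech_units_for) is invented for illustration and is not from the report.

```python
from typing import Iterable, Iterator, List, Tuple

# Hypothetical illustration only: the real ARIA mechanism is not specified in
# the report. The idea sketched here is that each text token expands into a
# variable number of speech units, so a streaming synthesizer should track the
# observed text/speech ratio rather than assume a fixed one.

def speech_units_for(text_token: str) -> List[int]:
    """Stand-in speech tokenizer: longer text tokens get more speech units."""
    n_units = max(1, len(text_token) // 2)
    return [hash((text_token, i)) % 4096 for i in range(n_units)]

def align_stream(text_tokens: Iterable[str]) -> Iterator[Tuple[str, List[int], float]]:
    """Emit (text_token, speech_units, running_ratio) triples.

    A fixed-ratio scheduler drifts when text and speech tokenizers have
    different encoding efficiencies; re-estimating the ratio as tokens stream
    in is one (hypothetical) way to keep the two streams synchronized.
    """
    emitted_text, emitted_speech = 0, 0
    for tok in text_tokens:
        units = speech_units_for(tok)
        emitted_text += 1
        emitted_speech += len(units)
        ratio = emitted_speech / emitted_text  # running speech-per-text estimate
        yield tok, units, ratio

for tok, units, ratio in align_stream(["Hel", "lo", ",", " world", "!"]):
    print(f"{tok!r}: {len(units)} speech units (running ratio {ratio:.2f})")
```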
Decoder
  • MoE (Mixture-of-Experts): Architecture using multiple specialized sub-models (experts) where only a subset activates for each input, improving efficiency at scale (see the routing sketch after this list)
  • ARIA: Dynamic alignment mechanism introduced in this work to synchronize text and speech units for better conversational speech stability and prosody
  • Audio-Visual Vibe Coding: Emergent capability where the model generates code directly from audio-visual instructions without text intermediary
  • Thinker and Talker: Architectural components where Thinker handles understanding/reasoning and Talker handles speech generation
  • 256k context length: Can process 256,000 tokens (roughly 192,000 words, or more than 10 hours of audio) in a single inference pass
  • SOTA: State-of-the-art, meaning best current performance on benchmark tasks
  • Omni-modality: Ability to process and understand multiple input modalities (text, audio, video) simultaneously
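To ground the MoE entry above, here is a minimal, generic sketch of top-k expert routing. It is not Qwen3.5-Omni's actual router (the report does not detail it), and the expert and gate shapes are made up for illustration.

```python
import numpy as np

# Generic top-k MoE routing sketch (not Qwen3.5-Omni's actual router).
# Only k of the n_experts feed-forward "experts" run for each token, which is
# how MoE layers add parameters without a proportional compute cost.

rng = np.random.default_rng(0)
d_model, n_experts, k = 64, 8, 2           # illustrative sizes
W_gate = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """x: (n_tokens, d_model) -> (n_tokens, d_model)."""
    logits = x @ W_gate                            # router score for each expert
    top = np.argsort(logits, axis=-1)[:, -k:]      # indices of the k best experts
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        scores = logits[t, top[t]]
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                   # softmax over the selected experts
        for w, e in zip(weights, top[t]):
            out[t] += w * (x[t] @ experts[e])      # weighted sum of expert outputs
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)  # (4, 64): same shape, but only k experts ran per token
```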
Original article

In this work, we present Qwen3.5-Omni, the latest advancement in the Qwen-Omni model family. Representing a significant evolution over its predecessor, Qwen3.5-Omni scales to hundreds of billions of parameters and supports a 256k context length. By leveraging a massive dataset comprising heterogeneous text-vision pairs and over 100 million hours of audio-visual content, the model demonstrates robust omni-modality capabilities. Qwen3.5-Omni-plus achieves SOTA results across 215 audio and audio-visual understanding, reasoning, and interaction subtasks and benchmarks, surpassing Gemini-3.1 Pro in key audio tasks and matching it in comprehensive audio-visual understanding. Architecturally, Qwen3.5-Omni employs a Hybrid Attention Mixture-of-Experts (MoE) framework for both Thinker and Talker, enabling efficient long-sequence inference. The model facilitates sophisticated interaction, supporting over 10 hours of audio understanding and 400 seconds of 720P video (at 1 FPS). To address the inherent instability and unnaturalness in streaming speech synthesis, often caused by encoding efficiency discrepancies between text and speech tokenizers, we introduce ARIA. ARIA dynamically aligns text and speech units, significantly enhancing the stability and prosody of conversational speech with minimal latency impact. Furthermore, Qwen3.5-Omni expands linguistic boundaries, supporting multilingual understanding and speech generation across 10 languages with human-like emotional nuance. Finally, Qwen3.5-Omni exhibits superior audio-visual grounding capabilities, generating script-level structured captions with precise temporal synchronization and automated scene segmentation. Remarkably, we observed the emergence of a new capability in omnimodal models: directly performing coding based on audio-visual instructions, which we call Audio-Visual Vibe Coding.
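As a rough sanity check on the context figures quoted above (256k tokens covering over 10 hours of audio, or 400 seconds of 720P video at 1 FPS), the back-of-the-envelope arithmetic below derives the implied per-second and per-frame token budgets. The report does not publish the actual tokenizer rates, so these are upper-bound estimates, not stated numbers.

```python
# Back-of-the-envelope context budgeting based only on figures quoted in the
# report's abstract; the derived per-second and per-frame rates are implied
# upper bounds, not published tokenizer rates.

CONTEXT_TOKENS = 256_000

audio_seconds = 10 * 3600              # "over 10 hours of audio"
print(f"audio budget: <= {CONTEXT_TOKENS / audio_seconds:.1f} tokens per second")

video_frames = 400 * 1                 # "400 seconds of 720P video (at 1 FPS)"
print(f"video budget: <= {CONTEXT_TOKENS / video_frames:.0f} tokens per frame")
```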