Speculative Decoding for RL Training (18 minute read)
Researchers report 1.8x higher rollout throughput in reinforcement learning post-training of large language models by applying speculative decoding to rollout generation, without changing the target model's output distribution.
Deep dive
- The paper addresses autoregressive rollout generation as the primary bottleneck in RL post-training for frontier language models
- Speculative decoding is implemented as a "lossless" acceleration method that preserves the target model's exact output distribution, unlike off-policy execution or lower-precision alternatives
- The implementation supports both synchronous and asynchronous RL pipelines in NeMo-RL with vLLM backend
- Multiple speculation mechanisms work with this approach: pretrained MTP heads, small external draft models, and techniques like Eagle3
- In a synchronous RL reasoning post-training workload at 8B parameter scale, speculative decoding improved rollout throughput by 1.8x
- High-fidelity performance simulations project up to 2.5x end-to-end training speedup when combining speculative decoding with asynchronous RL at 235B scale
- The approach enables deployment of state-of-the-art speculative decoding techniques that were traditionally only applied after the RL training phase
- The system integration demonstrates that speculative decoding benefits are realizable across different speculation mechanisms during active training
- This work provides a practical deployment path for production RL training systems facing rollout generation bottlenecks
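A rollout-only speedup does not translate one-to-one into faster training steps. For a synchronous pipeline, Amdahl's law gives a rough first estimate of the relationship. The sketch below is not the paper's performance simulator; the function name and the rollout shares it loops over are hypothetical, chosen only for illustration.

```python
def end_to_end_speedup(rollout_fraction, rollout_speedup):
    """Amdahl's-law estimate for one synchronous RL step.

    rollout_fraction: share of baseline step time spent generating rollouts
                      (a hypothetical input, not a figure from the paper).
    rollout_speedup:  factor by which rollout generation gets faster.
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be in [0, 1]")
    if rollout_speedup <= 0.0:
        raise ValueError("rollout_speedup must be positive")
    # Time left after the speedup, as a fraction of the baseline step time.
    remaining = (1.0 - rollout_fraction) + rollout_fraction / rollout_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Illustrative rollout shares only; real values depend on the workload.
    for fraction in (0.5, 0.7, 0.9):
        speedup = end_to_end_speedup(fraction, 1.8)
        print(f"rollout share {fraction:.0%}: {speedup:.2f}x end-to-end")
```

The estimate assumes rollout generation and the training update run back to back. Asynchronous pipelines overlap the two phases, so this formula does not describe them directly.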
Decoder
- Speculative decoding: A technique where a fast draft mechanism proposes candidate tokens that the larger target model verifies in parallel, speeding up generation while preserving the target model's exact output distribution
- RL rollouts: The process of generating sequences from a language model during reinforcement learning training, which are then scored and used to update the model
- RL post-training: Fine-tuning pre-trained language models using reinforcement learning methods (like RLHF) to improve alignment, reasoning, or other capabilities
- MTP heads: Multi-Token Prediction heads that predict multiple future tokens simultaneously, used as one form of draft mechanism for speculation
- Eagle3: A speculative decoding technique from the Eagle family, which trains a lightweight draft head on the target model's internal features to propose tokens
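The lossless property of speculative decoding comes from an accept/reject rule in the standard speculative sampling algorithm. Below is a minimal single-token sketch over a toy three-token vocabulary. The distributions and function names are made up for illustration, and this is not the NeMo-RL or vLLM implementation.

```python
import random


def speculative_step(p_target, q_draft, rng):
    """Draft one token, then verify it against the target distribution.

    p_target and q_draft are probability lists over the same token ids.
    Returns (token, accepted); the token is distributed according to p_target.
    """
    # The draft proposes a token x sampled from q.
    x = rng.choices(range(len(q_draft)), weights=q_draft)[0]
    # The target accepts x with probability min(1, p(x) / q(x)).
    if rng.random() < min(1.0, p_target[x] / q_draft[x]):
        return x, True
    # On rejection, resample from the residual max(0, p - q).
    # rng.choices normalizes the weights internally.
    residual = [max(0.0, p - q) for p, q in zip(p_target, q_draft)]
    x = rng.choices(range(len(residual)), weights=residual)[0]
    return x, False


def empirical_distribution(p_target, q_draft, n_samples, seed=0):
    """Run many steps; return (token frequencies, acceptance rate)."""
    rng = random.Random(seed)
    counts = [0] * len(p_target)
    accepted = 0
    for _ in range(n_samples):
        token, was_accepted = speculative_step(p_target, q_draft, rng)
        counts[token] += 1
        accepted += was_accepted
    return [c / n_samples for c in counts], accepted / n_samples


if __name__ == "__main__":
    p = [0.6, 0.3, 0.1]  # toy target distribution (hypothetical)
    q = [0.2, 0.5, 0.3]  # toy draft distribution (hypothetical)
    freqs, rate = empirical_distribution(p, q, 20000)
    print("sampled frequencies:", freqs)
    print("acceptance rate:", rate)
```

The sampled frequencies approach the target distribution no matter how poor the draft is. Only the acceptance rate changes: it approaches the sum of min(p, q) over tokens, which is 0.6 for these toy values, so a draft closer to the target is accepted more often.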
Original article
Accelerating RL Post-Training Rollouts via System-Integrated Speculative Decoding
Abstract
RL post-training of frontier language models is increasingly bottlenecked by autoregressive rollout generation, making rollout acceleration a central systems challenge. Many existing efficiency methods improve throughput by changing the rollout or optimization regime, for example, through off-policy execution, replay, or lower-precision generation. We study speculative decoding as a lossless acceleration primitive for RL rollouts that preserves the target model's output distribution. We implement speculative decoding in NeMo-RL with a vLLM backend, supporting both synchronous and asynchronous pipelines and enabling speculation during RL rollouts. This benefit is realizable across speculation mechanisms, such as pretrained MTP heads, small external draft models, or even techniques such as Eagle3, which are traditionally applied after the RL phase. This yields a deployment path for state-of-the-art speculative decoding inside RL training. In a reasoning post-training workload at 8B scale under synchronous RL, speculative decoding improves rollout throughput by 1.8x. Using a high-fidelity performance simulator, we project that combining speculative decoding with asynchronous RL yields up to 2.5x end-to-end training speedup at 235B scale.
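How much throughput speculation buys depends on how many drafted tokens the target accepts per verification pass. A common back-of-the-envelope model assumes each drafted token is accepted independently with the same probability. That is a simplification, and both the function name and the inputs below are hypothetical rather than taken from the paper.

```python
def expected_tokens_per_pass(acceptance_rate, draft_length):
    """Expected tokens emitted per target forward pass.

    Assumes each of the draft_length drafted tokens is accepted
    independently with probability acceptance_rate. The target always
    contributes one token itself: a correction after a rejection, or a
    bonus token when every draft is accepted.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_length < 0:
        raise ValueError("draft_length must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft is accepted, plus the bonus token.
        return float(draft_length + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_length.
    return (1.0 - acceptance_rate ** (draft_length + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    # Illustrative inputs only; measured acceptance rates vary by workload.
    for rate in (0.5, 0.7, 0.9):
        tokens = expected_tokens_per_pass(rate, 4)
        print(f"acceptance {rate:.1f}, draft length 4: {tokens:.2f} tokens/pass")
```

This counts tokens per target pass only. Actual speedup is lower because drafting and verification have their own cost.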