Expert Upcycling (GitHub Repo)
Amazon researchers open-sourced a method to expand Mixture-of-Experts language models during training by duplicating experts, cutting training costs by 32% while maintaining performance.
Deep dive
- Demonstrated on a 7B→13B parameter expansion (1B active) with 32→64 experts pre-trained on 380B tokens, matching fixed-size baseline quality (56.4 vs 56.7 avg accuracy across 11 benchmarks, 1.263 vs 1.267 validation loss)
- Cuts GPU hours by ~32% (27,888 vs 41,328) when training from scratch, or by ~67% when starting from an existing checkpoint
- Uses gradient-based importance scores to determine which experts to duplicate more frequently—high-utility experts receive more copies
- Router weights are extended with small bias perturbations to seed routing diversity among duplicate experts
- Stochastic gradient diversity and loss-free load balancing during continued pre-training break symmetry and drive specialization
- Top-K routing remains fixed throughout so per-token inference cost is unchanged
- Generalizes to full MoE architectures with 256→512 experts and TopK=8, achieving 93-95% gap closure across scales from 154M to 1B parameters
- Released under the CC-BY-NC-4.0 license (academic/research use only); integrates with NeMo/Megatron-LM via runtime monkey-patching, with no fork required
- Supports multiple duplication strategies including utility-based selection (gradient norm, saliency, Fisher information), exact copy, copy with noise, and SVD perturbation
- Includes 98 tests covering all methods, strategies, and integration scenarios
Decoder
- MoE (Mixture-of-Experts): Neural network architecture with multiple specialized sub-networks (experts) where a router selects which experts process each input
- Top-K routing: Only the K highest-scoring experts are activated for each token, keeping inference cost fixed regardless of total expert count (a toy example follows this list)
- Active parameters: The subset of model parameters actually used during inference, versus total parameters available in the model
- Continued pre-training (CPT): Resuming training on a modified model architecture to specialize duplicated components
- All-to-all communication: Distributed training pattern where data must be exchanged between all compute nodes, expensive at scale
- Gradient-based importance scores: Metrics like gradient norm or Fisher information that estimate how valuable each expert is for the task
- Load balancing: Ensuring experts receive roughly equal amounts of training data to prevent some from being underutilized
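To make the routing entries above concrete, here is a toy top-K routing step in PyTorch; the sizes and modules are invented for illustration and are not tied to this repository's implementation.

import torch
import torch.nn.functional as F

hidden, num_experts, top_k = 16, 8, 2
tokens = torch.randn(4, hidden)                           # 4 tokens
experts = [torch.nn.Linear(hidden, hidden) for _ in range(num_experts)]
router = torch.nn.Linear(hidden, num_experts)             # one score per expert

logits = router(tokens)                                   # (4, num_experts)
weights, chosen = torch.topk(F.softmax(logits, dim=-1), k=top_k, dim=-1)

out = torch.zeros_like(tokens)
for t in range(tokens.shape[0]):
    for w, e in zip(weights[t], chosen[t]):
        out[t] += w * experts[int(e)](tokens[t])          # only K experts run per token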
Original article
Expert Upcycling
Capacity expansion for Mixture-of-Experts models during continued pre-training.
Dwivedi et al., "Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts" (preprint).
Scaling laws show that MoE quality improves predictably with total expert count at fixed active computation, but training large MoEs from scratch is expensive — memory, gradients, and all-to-all communication all scale with total parameters. Expert upcycling sidesteps this by starting training with a smaller E-expert model and expanding to mE experts mid-training via the upcycling operator (sketched in code after the list):
- Expert replication — each expert is duplicated (high-utility experts receive more copies via gradient-based importance scores).
- Router extension — router weights are copied to new slots with small bias perturbations to seed routing diversity.
- Continued pre-training (CPT) — stochastic gradient diversity and loss-free load balancing break symmetry among duplicates, driving specialization.
Top-K routing is held fixed throughout, so per-token inference cost is unchanged.
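Below is a minimal, framework-agnostic sketch of these three steps, assuming toy single-matrix experts and a plain linear router; the function and its arguments are illustrative and are not the package's API.

import torch

def upcycle(expert_weights, router, grads, m=2, bias_noise=0.01):
    """Expand E experts to m*E experts and extend the router to match (illustrative)."""
    E = len(expert_weights)

    # 1) Expert replication: every expert keeps one copy; the remaining slots are
    #    handed out greedily to the experts with the largest gradient norms.
    importance = torch.tensor([g.norm().item() for g in grads])
    copies = torch.ones(E, dtype=torch.long)
    extra = m * E - E
    for idx in importance.argsort(descending=True).repeat(max(extra, 1))[:extra]:
        copies[idx] += 1

    new_experts, parents = [], []
    for e in range(E):
        new_experts += [expert_weights[e].clone() for _ in range(int(copies[e]))]
        parents += [e] * int(copies[e])

    # 2) Router extension: copy each parent expert's router row into the new slot and
    #    add a small bias perturbation so duplicates start with slightly different routing.
    new_router = torch.nn.Linear(router.in_features, m * E)
    with torch.no_grad():
        new_router.weight.copy_(router.weight[parents])
        new_router.bias.copy_(router.bias[parents] + bias_noise * torch.randn(m * E))

    # 3) Continued pre-training (not shown): stochastic gradients and load balancing
    #    break the symmetry between the copies so they specialize.
    return new_experts, new_router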
Figure 1: Overview of the expert upcycling procedure.
Key results on a 7B→13B total parameter (1B active) interleaved MoE, pre-trained on 380B tokens:
- The upcycled model (32→64 experts) matches the fixed-size 64-expert baseline across 11 downstream benchmarks (56.4 vs. 56.7 avg accuracy) and validation loss (1.263 vs. 1.267).
- Training cost drops by ~32% in GPU hours (27,888 vs. 41,328). When a pre-trained checkpoint already exists (e.g., from a prior training run or a public release), the pre-training cost is already paid and only the CPT phase is needed, bringing the savings to ~67%.
- Results generalize to full MoE architectures (256→512 experts, TopK=8) with 93–95% gap closure across scales from 154M to 1B total parameters.
Figure 2: GPU hours, validation loss, and downstream accuracy for the 7B→13B upcycled model vs. baselines.
Installation
Recommended: NeMo 2.x container
Start from the official NeMo container — PyTorch, Megatron-LM, Transformer Engine, NeMo, Lightning, and omegaconf are all pre-installed.
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 \
-v /path/to/expert-upcycling:/workspace/expert-upcycling \
-it nvcr.io/nvidia/nemo:24.09 bash
# Inside the container:
cd /workspace/expert-upcycling
pip install -e .
pip install dacite
Do not use pip install -e ".[nemo]" inside the container — it would conflict with the container's pre-installed NeMo.
From scratch (no NeMo container)
Install dependencies manually, then install the package with the relevant extras:
# Core only (torch + numpy):
pip install -e .
pip install dacite
# With Megatron-LM integration:
pip install -e ".[megatron]"
# Full NeMo entrypoint (installs NeMo, Lightning, omegaconf):
pip install -e ".[nemo]"
Quick Start
Option A: NeMo entrypoint (recommended)
Edit configs/upcycle.yaml to set your model dimensions, then run from the repo root:
# Single GPU
cd /workspace/expert-upcycling
python -m expert_upcycling.entrypoint \
--config-path=configs --config-name=upcycle \
resume.restore_config.path=/path/to/base/checkpoint
# Multi-GPU (e.g. 8 GPUs with tensor parallelism)
torchrun --nproc_per_node=8 -m expert_upcycling.entrypoint \
--config-path=configs --config-name=upcycle \
resume.restore_config.path=/path/to/base/checkpoint \
strategy.tensor_model_parallel_size=8
The callback fires on the first optimizer step, doubles the expert count, saves the upcycled checkpoint, and exits. The output path defaults to <input_checkpoint>-upcycled.
Option B: Patch existing training script
import expert_upcycling
expert_upcycling.apply_patches()
# Now TEGroupedMLP has .upcycle_experts() and TopKRouter has .upcycle_router()
# Call them during training at the desired transition point.
# Note: model is typically wrapped — unwrap to reach the decoder
# (strip up to two wrapper levels, e.g. DDP and a precision wrapper, each exposing .module):
inner = model
for attr in ("module", "module"):
    if hasattr(inner, attr):
        inner = getattr(inner, attr)

# optimizer, expert_cfg and router_cfg come from your training setup
for i, layer in enumerate(inner.decoder.layers):
    if hasattr(layer.mlp, 'experts'):      # MoE layer: expand the experts...
        selected = layer.mlp.experts.upcycle_experts(optimizer, i, expert_cfg)
        if hasattr(layer.mlp, 'router'):   # ...then extend the router for the selected copies
            layer.mlp.router.upcycle_router(router_cfg, selected)
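With this approach the transition point is up to you; a natural choice, mirroring the NeMo callback described above, is to run this loop once on the first optimizer step after restoring the base checkpoint and then keep training with the enlarged expert count.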
Option C: Use the model-level API
from expert_upcycling import perform_expert_upcycling
perform_expert_upcycling(
    model, optimizer,
    expert_cfg={"usefulness_metric": "gradient_norm", "selection_strategy": "greedy"},
    router_cfg={"method": "bias_only", "bias_noise_scale": 0.01},
)
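In this call, expert_cfg requests gradient-norm utility scores with greedy selection of which experts to duplicate, while router_cfg copies the router weights unchanged and perturbs only the bias (noise scale 0.01); the tables in the next section list the other available strategies.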
Upcycling Strategies
Expert duplication
| Strategy | Description |
|---|---|
| Utility-based (recommended) | Duplicate high-importance experts using gradient-based scores (weight norm, saliency, gradient squared, approx Fisher) |
| copy | Exact duplication (baseline) |
| copy_noise | Duplication + Gaussian noise |
| drop_upcycle | Re-initialize a fraction of columns |
| svd_perturb | SVD decomposition + perturbation |
| + 6 more | See expert_upcycling.config.UpcycleMethod |
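The simpler heuristics in this table boil down to small tensor operations; here is a rough sketch of three of them applied to one expert weight matrix (the function, its arguments, and the re-initialization scale are illustrative guesses, not the package's implementation):

import torch

def duplicate(weight, method="copy", noise_scale=0.01, drop_fraction=0.1):
    """Derive one new expert weight from an existing one (illustrative only)."""
    new = weight.clone()
    if method == "copy_noise":        # duplication + Gaussian noise
        new += noise_scale * torch.randn_like(new)
    elif method == "drop_upcycle":    # re-initialize a random fraction of columns
        cols = torch.rand(new.shape[1]) < drop_fraction
        new[:, cols] = torch.empty_like(new[:, cols]).normal_(std=0.02)
    # method == "copy": exact duplication, nothing else to do
    return new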
Router expansion
| Strategy | Description |
|---|---|
| bias_only (recommended) | Keep weights identical, add noise to bias |
| copy | Exact duplication |
| copy_noise | Duplication + noise |
| + 7 more | See expert_upcycling.config.RouterUpcycleMethod |
Architecture
This package treats Megatron-LM and NeMo as third-party dependencies — no fork required. Upcycling methods are injected at runtime via monkey-patching (a generic illustration of the pattern follows the layout below):
expert-upcycling/ # pip install -e .
├── expert_upcycling/
│ ├── config.py # All enums + dataclasses (no deps)
│ ├── expert_upcycler.py # Heuristic strategies (torch only)
│ ├── expert_selector.py # Utility-based selection (torch + numpy)
│ ├── router_upcycler.py # Router strategies (torch only)
│ ├── optimizer_utils.py # Optimizer state handling (torch only)
│ ├── patch.py # Monkey-patches onto Megatron-LM classes
│ ├── upcycle_model.py # Model traversal
│ └── entrypoint.py # NeMo launch script
├── configs/
│ └── upcycle.yaml # Example config
└── scripts/
└── run_upcycle.sh # Example launch script
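The patching itself is the standard runtime monkey-patching pattern: bind new methods onto classes imported from the third-party library. A generic illustration follows; the class and method names are stand-ins, not the repository's actual patch code.

# Stand-in for a class imported from Megatron-LM (e.g. a grouped-expert MLP).
class ThirdPartyMLP:
    def __init__(self, num_experts):
        self.num_experts = num_experts

def _upcycle_experts(self, factor=2):
    """New behavior injected at runtime (illustrative only)."""
    self.num_experts *= factor
    return self.num_experts

def apply_patches():
    # Bind the function as a method on the library class: every existing and
    # future instance now exposes .upcycle_experts(), with no fork of the library.
    ThirdPartyMLP.upcycle_experts = _upcycle_experts

apply_patches()
mlp = ThirdPartyMLP(num_experts=32)
assert mlp.upcycle_experts() == 64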
Running Tests
# CPU tests (no GPU, no Megatron install required)
python tests/test_comprehensive.py # 91 tests: all methods, all strategies
pytest tests/test_integration.py -v # 7 end-to-end integration tests
# GPU test (requires NeMo container + GPU)
python tests/test_entrypoint_gpu.py # real TEGroupedMLP + TopKRouter, 32->64 experts
Citation
@article{dwivedi2025expertupcycling,
title={Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts},
author={Dwivedi, Chaitanya and Gupta, Himanshu and Varshney, Neeraj and Jayarao, Pratik and Yin, Bing and Chilimbi, Trishul and Huang, Binxuan},
year={2026}
}
License
CC-BY-NC-4.0
This code is being released solely for academic and scientific reproducibility purposes, in support of the methods and findings described in the associated publication. Pull requests are not being accepted in order to maintain the code exactly as it was used in the paper.