Modular Post-Training (14 minute read)
AllenAI's BAR training method lets you add or upgrade specific capabilities in language models without expensive full retraining or losing existing skills.
Deep dive
- BAR addresses a fundamental problem in language model development: updating models after post-training typically requires either expensive full retraining or causes catastrophic forgetting of existing capabilities
- The approach evolved from FlexOlmo, which worked for pretraining by freezing shared layers and only training domain-specific FFN experts, but this recipe failed for post-training because behavioral shifts require updating attention layers, embeddings, and language modeling heads
- Stage 1 uses progressive unfreezing: mid-training freezes all shared layers (since knowledge lives in FFNs), SFT unfreezes embeddings and LM head (critical for new tokens), and RLVR unfreezes all parameters including attention to handle distributional shifts
- Each expert is structured as a two-expert MoE with one frozen "anchor" expert preserving base model FFN weights and one trainable expert, and trains on a mix of domain-specific plus general SFT data to prevent degradation of general capabilities
- Stage 2 merges experts by simply averaging shared parameters that diverged across expert runs, which surprisingly introduces little to no measurable performance loss despite independent modifications during training
- Stage 3 trains the router on just 5% of stratified SFT data with all experts and shared weights frozen, making this final stage fast and cheap
- On 19 benchmarks across 7 categories, BAR outperformed all baselines except full retraining from mid-training, beating post-training-only retraining 49.1 vs 47.8 overall with large gains in math (+7.8) and code (+4.7)
- Modular training's key structural advantage: late-stage RL on one domain can't degrade safety capabilities learned during earlier SFT stages in other domains because each pipeline is isolated
- Dense model merging after mid-training catastrophically fails (6.5 overall score) because mid-training causes enough divergence that naive weight averaging produces a nearly non-functional model
- Demonstrated modular upgrades work in practice: replacing a code expert with one trained on better data improved code by +16.5 points while other domains stayed unchanged, and adding RL to an existing math expert improved math by +13 points with minimal impact elsewhere
- The approach enables linear cost scaling versus monolithic retraining's quadratic scaling, critical for teams where different groups work on different capabilities on different timelines
- Training domain experts on only domain-specific data without general SFT data severely degrades general capabilities like instruction following despite strong in-domain performance
- Activating 4 of 5 experts at inference achieves nearly identical performance to using all 5, suggesting opportunities for more efficient routing strategies
Decoder
- MoE (Mixture-of-Experts): An architecture where multiple specialized neural network modules (experts) process inputs, with a router deciding which experts to activate for each input
- FFN (Feed-Forward Network): The layers in transformers that primarily store factual knowledge, as opposed to attention layers that handle relationships between tokens
- Post-training: Training stages after initial pretraining that teach models to follow instructions, reason, use tools, and behave safely
- SFT (Supervised Fine-Tuning): Training stage using labeled examples to teach specific behaviors like instruction following or function calling
- RLVR (Reinforcement Learning with Verified Rewards): RL training using verifiable correctness signals (like code execution or math verification) rather than human preference
- Mid-training: Intermediate training stage between pretraining and SFT, typically for domain knowledge acquisition
- FlexOlmo: AllenAI's earlier work on modular MoE-based pretraining that inspired BAR
- Catastrophic forgetting: When training on new tasks causes a model to lose performance on previously learned tasks
- BFCL (Berkeley Function Calling Leaderboard): Benchmark for evaluating how well models can call functions and use tools
- Dense model: Traditional neural network where all parameters are active for every input, versus sparse models like MoE where only subsets activate
Original article
Train separately, merge together: Modular post-training with mixture-of-experts
After pretraining, language models go through a series of mid- and post-training stages to become practically useful—learning to follow instructions, reason through problems, reliably call tools, and so on. But updating or extending a model following these stages is often challenging. The most reliable option, retraining from scratch with new capabilities included from the start, is expensive and requires full access to the original training setup. Training further on new data is cheaper, but it can cause the model to lose capabilities it already had. And because post-training typically involves multiple stages – each with its own data and objectives – adding new skills means rerunning or adjusting each stage to accommodate them without breaking what came before.
We present BAR (Branch-Adapt-Route), a recipe for modular post-training that sidesteps these issues. Rather than training a single model on all data at once, BAR trains independent domain experts – each through its own complete training pipeline – and composes them into a unified model via a mixture-of-experts (MoE) architecture. Each expert can be developed, upgraded, or replaced without touching the others.
We're releasing the recipe, a technical report, and the checkpoints used to validate the approach.
Background and motivation
Our earlier work on FlexOlmo showed that modular MoE-based training works well for pretraining: you can branch from a shared base, train domain-specific feed-forward network (FFN) experts while freezing all shared layers, and merge them back. But we found that this recipe doesn't transfer to post-training. The reason is intuitive in hindsight—pretraining primarily updates knowledge representations, which live largely in FFN layers. Post-training, on the other hand, introduces behavioral shifts such as new output formats, reasoning patterns, and safety constraints that require changes to shared parameters like attention layers, embeddings, and the language modeling head.
For example, when we tried the FlexOlmo approach directly during reinforcement learning with verified rewards (RLVR), the reward curve was completely flat; the model simply could not learn with all shared parameters frozen. This motivated us to develop a new recipe specifically for post-training.
How BAR works
BAR has three stages:
Stage 1: Independent expert training. Each domain expert is instantiated as a two-expert MoE: one frozen "anchor" expert that preserves the base model's FFN weights, and one trainable expert. Experts go through whichever training stages their domain requires. In our experiments, math and code go through mid-training, supervised fine-tuning (SFT), and RLVR; tool use and safety use SFT only.
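The branching step can be pictured as follows. This is a minimal sketch, not the released implementation: `branch_expert` is a hypothetical helper, and the small weight dicts stand in for real FFN tensors.

```python
# Sketch of Stage 1's two-expert layout: each FFN slot holds a frozen
# "anchor" snapshot of the base model's weights plus a trainable copy
# that starts from the same values. Names here are illustrative.
import copy

def branch_expert(base_ffn_weights):
    """Instantiate a two-expert MoE layer from a dense base FFN.

    Returns (anchor, trainable): the anchor preserves the base model's
    FFN weights and is never updated; the trainable copy starts identical
    and is the only expert updated during domain training.
    """
    anchor = copy.deepcopy(base_ffn_weights)     # frozen, never updated
    trainable = copy.deepcopy(base_ffn_weights)  # receives gradient updates
    return anchor, trainable

base = {"w_in": [0.1, 0.2], "w_out": [0.3, 0.4]}
anchor, trainable = branch_expert(base)

# Domain training only touches the trainable copy.
trainable["w_in"] = [w + 0.05 for w in trainable["w_in"]]

assert anchor == base       # anchor still matches the base model
assert trainable != anchor  # trainable expert has diverged
```

Keeping the anchor frozen is what lets the merged model fall back on base-model behavior for inputs outside an expert's domain.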
The key technical contribution is a progressive unfreezing schedule for shared parameters across stages:
- Mid-training: All shared layers frozen (same as pretraining, since knowledge acquisition is well-captured by FFN updates alone).
- SFT: Embedding layer and language modeling head unfrozen. This is necessary for domains that introduce new special tokens (e.g., function-calling formats for tool use). Without this, our tool use expert scored 20.3 on the Berkeley Function Calling Leaderboard (BFCL), the benchmark we used to evaluate tool-calling performance; with unfreezing, it reached 46.4.
- RLVR: All shared parameters unfrozen, including attention. RL induces distributional shifts that extend beyond what expert FFNs can accommodate.
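The schedule above reduces to a small lookup. This is an illustrative sketch of the unfreezing policy as described in the text; `trainable_groups` and the parameter-group labels are our own naming, not AllenAI's code.

```python
# Progressive unfreezing: which shared parameter groups receive gradients
# at each post-training stage. The domain FFN expert is always trainable;
# the frozen anchor expert never is.

def trainable_groups(stage):
    schedule = {
        # Knowledge acquisition is captured by FFN updates alone.
        "mid-training": set(),
        # New special tokens (e.g. function-calling formats) need
        # embedding and LM-head updates.
        "sft": {"embeddings", "lm_head"},
        # RL-induced distributional shifts also require attention updates.
        "rlvr": {"embeddings", "lm_head", "attention"},
    }
    return schedule[stage]

assert trainable_groups("mid-training") == set()
assert trainable_groups("sft") == {"embeddings", "lm_head"}
assert "attention" in trainable_groups("rlvr")
```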
Each expert also trains on a mixture of domain-specific and general SFT data. We found this is critical: domain-only SFT produces strong in-domain performance but severely degrades general capabilities like instruction following and knowledge.
Stage 2: Expert merging. After training, we merge all experts into a single MoE model. Shared parameters that diverged across expert runs (because they were unfrozen during SFT or RLVR) are simply averaged. We find this averaging introduces little to no measurable performance loss on domain-specific evaluations compared to any individual expert.
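The merge itself is plain element-wise averaging over the diverged shared parameters. A minimal sketch, assuming checkpoints are dicts of parameter lists; `average_shared` is a hypothetical helper, and domain FFN experts are carried over into the merged MoE unmerged.

```python
# Stage 2 sketch: average only the shared parameters (attention,
# embeddings, LM head) that were unfrozen during SFT or RLVR.

def average_shared(expert_checkpoints, shared_keys):
    """Element-wise average of shared parameters across expert runs."""
    n = len(expert_checkpoints)
    merged = {}
    for key in shared_keys:
        vecs = [ckpt[key] for ckpt in expert_checkpoints]
        merged[key] = [sum(vals) / n for vals in zip(*vecs)]
    return merged

math_expert = {"attention.0": [1.0, 2.0], "ffn.expert": [9.0]}
code_expert = {"attention.0": [3.0, 4.0], "ffn.expert": [7.0]}

merged = average_shared([math_expert, code_expert], ["attention.0"])
assert merged["attention.0"] == [2.0, 3.0]
# The per-domain "ffn.expert" weights are kept as separate experts
# in the merged MoE rather than averaged.
```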
Stage 3: Router training. Finally, we train the router of the MoE with all experts and shared weights frozen. We found that a stratified 5% sample of the SFT data is sufficient for effective routing, making this stage fast and cheap.
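Drawing the stratified sample just means taking 5% of each domain's SFT set so every domain stays proportionally represented. A sketch under that assumption; the helper and data layout are illustrative, not the released code.

```python
# Stratified 5% sample of SFT data for router training: sample within
# each domain separately so no domain is over- or under-represented.
import random

def stratified_sample(sft_by_domain, fraction=0.05, seed=0):
    rng = random.Random(seed)
    sample = []
    for domain, examples in sft_by_domain.items():
        k = max(1, int(len(examples) * fraction))  # at least one example
        sample.extend((domain, ex) for ex in rng.sample(examples, k))
    return sample

data = {"math": list(range(200)), "code": list(range(100)),
        "tool_use": list(range(40)), "safety": list(range(40))}
sample = stratified_sample(data)
assert sum(d == "math" for d, _ in sample) == 10  # 5% of 200
assert sum(d == "code" for d, _ in sample) == 5   # 5% of 100
```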
Strong performance across evals
Our models are all at least 7B in scale, training experts for math, code, tool use, and safety on top of a fully post-trained Olmo 2 base model. (We use Olmo 2 because our FlexOlmo architecture was built around it, and because it provides a useful testbed for exploring how newer datasets and post-training improvements can strengthen a model beyond its original release configuration.) We compare against six baselines across 19 benchmarks spanning 7 evaluation categories. All scores reported below are category-level averages (out of 100; higher is better). For per-benchmark breakdowns, please refer to our technical report.
A few things stand out:
On average, BAR outperforms all baselines that don't require rerunning mid-training from scratch. BAR beats retraining with post-training only overall (49.1 vs. 47.8), with particularly large gains in math (+7.8) and code (+4.7). We attribute this to a structural advantage of modular training: in a monolithic pipeline, late-stage RL on math and code can degrade safety capabilities learned during earlier SFT stages. Modular training avoids this entirely because each domain's pipeline is isolated.
Dense model merging after mid-training fails catastrophically. Mid-training causes models to diverge enough that naive weight averaging produces a nearly non-functional model—one that scores 6.5 overall on our benchmarks. Even without mid-training, merging trails BAR by a wide margin (36.9 vs. 49.1 overall).
BTX, a technique that trains each expert as a fully independent dense model, underperforms BAR (46.7 vs. 49.1 overall) despite using the same per-domain data and training stages. Training without shared parameters leads to greater divergence, making composition via routing more difficult.
Full retraining with mid-training remains the performance ceiling (50.5), but requires full access to the original pretraining checkpoint and reprocessing everything from scratch—impractical for most open-weight models, and expensive even with full access.
Modular upgrades
One of the most tangibly useful properties of BAR is that experts can be upgraded independently. We demonstrate two types of upgrades:
- Upgrading to newer data: Replacing a code expert with one trained on higher-quality data and RL improves code performance by +16.5 points in the combined model, while all other domains remain essentially unchanged.
- Adding a training stage: Taking an existing math expert and adding RL on top of its SFT improves math by +13 points in the combined model, again with minimal impact on other domains.
In both cases, only the affected expert and the lightweight router need retraining. In a monolithic pipeline, either of these upgrades would require retraining the full model across all domains. This gives BAR linear cost scaling for domain updates, compared to the effectively quadratic cost of monolithic retraining (each domain update requires reprocessing all domains).
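The scaling claim is simple arithmetic: over D domains each updated once, monolithic retraining reprocesses every domain per update (on the order of D² domain-runs), while BAR retrains one expert plus the lightweight router per update (on the order of D). A back-of-the-envelope sketch; the unit costs are illustrative, not measured.

```python
# Cost of applying one update to each of `num_domains` domains, in units
# of "one domain's training pipeline". Router cost is a small fraction.

def monolithic_cost(num_domains, updates_per_domain=1):
    # Every update reruns the full pipeline over all domains.
    return num_domains * updates_per_domain * num_domains

def modular_cost(num_domains, updates_per_domain=1, router_cost=0.05):
    # Every update retrains one expert plus the cheap router stage.
    return num_domains * updates_per_domain * (1 + router_cost)

assert monolithic_cost(4) == 16          # quadratic in domain count
assert abs(modular_cost(4) - 4.2) < 1e-9  # linear, plus router overhead
```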
What we learned
A few practical takeaways:
- Post-training needs more flexibility than pretraining. The FlexOlmo recipe of freezing all shared layers works for pretraining but breaks during post-training. Progressive unfreezing is essential, especially unfreezing attention during RL and embeddings/LM head for domains with new tokens.
- Domain-only SFT isn't enough. Training an expert on only its own domain data improves in-domain performance but destroys general capabilities. Mixing with general SFT data is critical.
- Weight averaging after unfreezing works surprisingly well. Despite each expert independently modifying shared parameters during SFT and RLVR, simply averaging the diverged parameters introduces little to no measurable degradation.
- Not every expert needs to be active. Activating 4 of 5 experts at inference time achieves nearly identical performance to using all 5, suggesting room for more efficient routing strategies.
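The last observation amounts to renormalizing the router's softmax over a subset of experts. A sketch of that idea, with generic expert names and made-up logits; the paper's router internals may differ.

```python
# Softmax routing restricted to an active expert subset: dropping a
# lightly-used expert redistributes its small weight over the rest.
import math

def route(logits, active):
    """Softmax over router logits, restricted to the active experts."""
    exps = {e: math.exp(v) for e, v in logits.items() if e in active}
    z = sum(exps.values())
    return {e: x / z for e, x in exps.items()}

logits = {"e0": 2.0, "e1": 1.0, "e2": 0.5, "e3": 0.2, "e4": 0.1}
full = route(logits, set(logits))
subset = route(logits, {"e0", "e1", "e2", "e3"})  # deactivate one expert

assert abs(sum(subset.values()) - 1.0) < 1e-9  # still a distribution
# Each remaining expert's weight shifts up only slightly.
assert all(subset[e] > full[e] for e in subset)
```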
Looking ahead
In practice, large-scale model development is already modular: different teams work on different capabilities, new datasets appear on different timelines, and the cost of rerunning an entire pipeline for a single domain improvement is hard to justify. BAR offers a recipe that aligns the training process with this reality.
Full retraining still sets the performance ceiling. But for teams iterating on individual capabilities, BAR provides a way to upgrade parts of a model independently, compose independently trained experts without degradation, and avoid the catastrophic forgetting that comes from running all domains through a single training sequence. One natural next step is starting from a natively sparse architecture rather than upcycling a dense model, which could improve both the efficiency and scalability of the modular approach.