Table of Contents
Mixture-of-Experts (MoE) Architectures: 2024–2025 Literature Review
Core Principles and Architecture of MoE
Key Advancements in MoE Architectures (2024–2025)
Performance Improvements, Efficiency Techniques, and Scaling Approaches
Applications of MoE in Large-Scale AI Models (LLMs and Beyond)
Comparative Analysis of Recent Open-Source MoE Models
Conclusion and Outlook
Core Principles and Architecture of MoE
Mixture-of-Experts (MoE) is a neural network architecture that increases model capacity by incorporating multiple expert sub-networks and activating only a subset for each input. In modern LLMs, MoE layers are typically inserted in place of the feed-forward network (FFN) within Transformer blocks ([2407.06204] A Survey on Mixture of Experts). Each MoE layer contains several experts (often independent FFNs) and a router (gating function) that selects top-k experts for each input token based on a gating score ([2407.06204] A Survey on Mixture of Experts). Only those k experts’ outputs are computed and combined (usually via a weighted sum), enabling sparse activation of parameters. This sparsity allows the model’s total parameter count to scale up significantly with minimal increase in computation per token ([2406.18219] A Closer Look into Mixture-of-Experts in Large Language Models) ([2407.06204] A Survey on Mixture of Experts). For example, activating 2 experts out of 8 in a layer means a token utilizes only a fraction of the model’s weights at a time, achieving a better performance-versus-compute trade-off than dense models of equivalent size ([2406.18219] A Closer Look into Mixture-of-Experts in Large Language Models).
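To make the mechanism concrete, here is a minimal sketch of a top-k MoE layer in PyTorch. It is illustrative only: the expert count, hidden sizes, and the simple softmax-over-selected-scores weighting are assumptions for the example, not the design of any specific model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Minimal sparse MoE layer: a router picks the top-k expert FFNs per token."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)  # gating scores
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- flatten batch/sequence beforehand
        scores = self.router(x)                              # (tokens, num_experts)
        topk_scores, topk_idx = scores.topk(self.k, dim=-1)  # keep the k best experts per token
        weights = F.softmax(topk_scores, dim=-1)             # mixing weights over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):
            idx = topk_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = idx == e                               # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

# Example: 10 tokens, model width 512; only 2 of 8 expert FFNs run per token.
layer = TopKMoELayer(d_model=512, d_hidden=2048)
y = layer(torch.randn(10, 512))
```

With num_experts=8 and k=2, each token touches only a quarter of the layer's expert parameters, which is exactly the sparse-activation trade-off described above.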
To train MoE models effectively, a routing strategy and load-balancing mechanism are crucial. Early MoE works introduced a noisy gating approach and an auxiliary loss to encourage the router to use all experts evenly, preventing any single expert from being overloaded ([2407.06204] A Survey on Mixture of Experts). The MoE layer is usually placed after self-attention because FFN sub-layers dominate the model’s computation at scale ([2407.06204] A Survey on Mixture of Experts). By routing tokens to different experts, MoEs can specialize parts of the model to subsets of the data. Increasing the number of experts or the number of chosen experts (k) generally improves model expressiveness ([2407.06204] A Survey on Mixture of Experts). In essence, the core principle of MoE is to scale model capacity (parameters) linearly while scaling computation sub-linearly, by letting each input determine which experts are computed ([2407.06204] A Survey on Mixture of Experts). This architecture has gained significant attention because it enables trillion-parameter models without a proportional increase in FLOPs per inference step.
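For the load-balancing mechanism mentioned above, a commonly used auxiliary loss (in the style of Switch Transformer/GShard) can be sketched as follows; the coefficient and normalization are illustrative choices rather than a prescription from any one paper.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits: torch.Tensor, topk_idx: torch.Tensor,
                        num_experts: int, alpha: float = 0.01) -> torch.Tensor:
    """Auxiliary loss that pushes the router toward uniform expert usage.

    router_logits: (tokens, num_experts) raw gating scores
    topk_idx:      (tokens, k) indices of the experts actually selected
    """
    probs = F.softmax(router_logits, dim=-1)
    mean_prob = probs.mean(dim=0)                              # average routing probability per expert
    # Fraction of token-slots dispatched to each expert
    dispatch = F.one_hot(topk_idx, num_experts).float().sum(dim=(0, 1))
    load_frac = dispatch / dispatch.sum()
    return alpha * num_experts * torch.sum(mean_prob * load_frac)
```

Because the dispatched fraction follows the routing probabilities, minimizing this product encourages both distributions to spread uniformly across experts.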
Key Advancements in MoE Architectures (2024–2025)
Adaptive and Improved Routing: Recent research has proposed novel routing mechanisms to enhance how experts are selected. A notable idea is using an LLM itself as the router. Liu & Lo (2025) introduce LLMoE, which replaces the traditional learned gating network with a pretrained LLM that reads input context (e.g. stock prices + news) and chooses experts accordingly ([2501.09636] LLM-Based Routing in Mixture of Experts: A Novel Framework for Trading). This approach infuses world knowledge into routing decisions, leading to more interpretable and context-aware expert selection and improved task performance ([2501.09636] LLM-Based Routing in Mixture of Experts: A Novel Framework for Trading). Other works analyze routing behavior to inform better designs. Lo et al. (2024) found that MoE routers tend to pick experts with larger output norms, and that expert diversity increases in higher layers ([2406.18219] A Closer Look into Mixture-of-Experts in Large Language Models). These insights suggest new routing strategies (e.g. norm-aware gating or varying expert allocation per layer) to boost utilization and modularity in MoE models. Another advancement is eliminating the auxiliary load-balancing loss: DeepSeek-V3 (2024) pioneers an auxiliary-loss-free strategy for load balancing, showing that careful architecture and initialization can maintain expert balance without extra losses (Paper page - DeepSeek-V3 Technical Report).
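The auxiliary-loss-free idea can be illustrated with a small sketch: a per-expert bias is added to the routing scores only when selecting the top-k experts, and is nudged up or down depending on each expert's recent load, so no gradient-based balancing loss is needed. This is a simplified illustration in the spirit of DeepSeek-V3's description; the update rule and step size below are assumptions, not the report's exact algorithm.

```python
import torch

def biased_topk_routing(scores: torch.Tensor, expert_bias: torch.Tensor,
                        k: int = 2, step: float = 1e-3):
    """Auxiliary-loss-free balancing sketch: bias the scores for expert *selection* only,
    then nudge each expert's bias based on whether it was over- or under-used."""
    num_experts = scores.shape[-1]
    # Bias influences which experts are picked, but not the mixing weights.
    _, topk_idx = (scores + expert_bias).topk(k, dim=-1)
    weights = torch.softmax(scores.gather(-1, topk_idx), dim=-1)

    # Update biases: lower for overloaded experts, raise for underloaded ones.
    load = torch.bincount(topk_idx.flatten(), minlength=num_experts).float()
    expert_bias -= step * torch.sign(load - load.mean())
    return topk_idx, weights, expert_bias
```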
Hybrid and Specialized Architectures: Researchers have begun blending MoE with other architectural innovations. Jamba (2024) is a hybrid Transformer-Mamba MoE architecture that interleaves standard Transformer layers with state-space model (“Mamba”) layers, inserting MoE in some layers to expand capacity ([2403.19887] Jamba: A Hybrid Transformer-Mamba Language Model). This design yielded a powerful LLM that fits on a single 80GB GPU yet achieved state-of-the-art results on both standard benchmarks and extremely long context tasks (up to 256K tokens) ([2403.19887] Jamba: A Hybrid Transformer-Mamba Language Model). The success of Jamba demonstrates that MoE can be combined with long-context architectures to handle massive sequences efficiently. Another example is OLMoE-1B-7B (2024), a 7B-parameter model that uses only 1B active parameters per token via MoE. The OLMoE team reported high specialization of experts emerging in their model – each expert focuses on distinct aspects of the data – which they quantify with new routing property metrics (OLMoE: Open Mixture-of-Experts Language Models | OpenReview). This confirms that MoE can partition knowledge among experts effectively at scale. We also see MoE applied at different network locations: while most MoE LLMs use experts for FFN layers, some works experiment with expert mixtures in attention layers or other components (though FFN experts remain dominant) ([2407.06204] A Survey on Mixture of Experts). Overall, 2024–2025 research has expanded MoE beyond a one-size-fits-all Transformer, exploring adaptive routers and hybrid layer designs to push the frontier of model capacity and specialization.
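As a toy illustration of such hybrid stacking, the sketch below builds a layer schedule that interleaves attention and state-space (Mamba-style) mixers and swaps the FFN for an MoE layer at a fixed interval. The ratios are arbitrary assumptions for illustration, not Jamba's published configuration.

```python
from dataclasses import dataclass

@dataclass
class HybridBlockConfig:
    """Illustrative layer schedule for a Transformer/state-space hybrid with MoE."""
    layers_per_block: int = 8   # one attention layer plus several SSM (Mamba-style) layers
    attention_every: int = 8    # how often an attention mixer appears
    moe_every: int = 2          # replace the dense FFN with an MoE layer every N layers

def layer_schedule(cfg: HybridBlockConfig, num_blocks: int = 4):
    schedule = []
    for layer in range(cfg.layers_per_block * num_blocks):
        mixer = "attention" if layer % cfg.attention_every == 0 else "mamba"
        ffn = "moe" if layer % cfg.moe_every == 0 else "dense_ffn"
        schedule.append((mixer, ffn))
    return schedule

print(layer_schedule(HybridBlockConfig())[:8])
```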
Performance Improvements, Efficiency Techniques, and Scaling Approaches
A primary appeal of MoE is the efficiency of scaling to huge model sizes, and recent studies focus on improving training/inference efficiency and stability for MoE LLMs. One line of work looks at efficient training strategies. Pan et al. (2024) propose DS-MoE (Dense Training, Sparse Inference), a hybrid framework that trains all experts densely (i.e. activates all experts during training) but uses sparse activation at inference ([2404.05567] Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models). By treating the model as dense in training, DS-MoE avoids the memory overhead of maintaining extra unused experts during backpropagation and achieves better parameter efficiency. Their 6B-parameter DS-MoE model matched the performance of a dense model of similar size while activating only ~30–40% of parameters at runtime ([2404.05567] Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models). In fact, DS-MoE-6B runs up to 1.86× faster than a dense 7B model (Mistral-7B) and ~1.5–1.7× faster than prior MoEs of comparable size ([2404.05567] Dense Training, Sparse Inference: Rethinking Training of Mixture-of-Experts Language Models), demonstrating significant speedups from this training approach.
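A minimal sketch of the dense-training/sparse-inference activation pattern follows: during training every expert is evaluated and mixed by its full softmax gate, while at inference only the top-k experts run per token. The routing loop and renormalization are illustrative; the paper's additional training ingredients are omitted.

```python
import torch
import torch.nn.functional as F

def moe_forward(x, router, experts, k=2, training=True):
    """Dense-training / sparse-inference activation pattern (illustrative sketch)."""
    scores = router(x)                               # (tokens, num_experts)
    gates = F.softmax(scores, dim=-1)
    if training:
        # Dense pass: all experts contribute, weighted by their gate.
        return sum(gates[:, e:e + 1] * expert(x) for e, expert in enumerate(experts))
    # Sparse pass: evaluate only the selected experts.
    topk_gates, topk_idx = gates.topk(k, dim=-1)
    topk_gates = topk_gates / topk_gates.sum(dim=-1, keepdim=True)  # renormalize over the chosen k
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topk_idx[:, slot] == e
            if mask.any():
                out[mask] += topk_gates[mask, slot:slot + 1] * expert(x[mask])
    return out
```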
Another efficiency innovation is post-training MoE conversion. Pei et al. (2025) introduce CMoE (Carved MoE), which carves out an MoE from a pretrained dense model without full re-training ([2502.04416] CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference). CMoE identifies groups of neurons in a dense LLM that fire infrequently (high sparsity) and assigns them to separate experts. It then inserts a lightweight router and performs brief fine-tuning. Remarkably, CMoE can transform a 7B dense model into a well-performing MoE in minutes, recovering performance with under an hour of fine-tuning ([2502.04416] CMoE: Fast Carving of Mixture-of-Experts for Efficient LLM Inference). This offers a practical path to obtain the efficiency of MoE after training a dense model, saving the cost of training from scratch.
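A rough sketch of the carving idea, under simplified assumptions: count how often each FFN hidden unit activates on calibration data, keep the most frequently firing units in an always-on (shared) group, and spread the rest across routed expert groups. The thresholds and round-robin assignment below are illustrative stand-ins for CMoE's actual construction and router initialization.

```python
import numpy as np

def carve_experts(activation_counts: np.ndarray, num_experts: int, shared_frac: float = 0.125):
    """Partition dense-FFN hidden units into a shared group plus routed expert groups.

    activation_counts: per-neuron count of non-zero activations on calibration data.
    Returns (shared_idx, expert_groups).
    """
    order = np.argsort(-activation_counts)            # most frequently firing neurons first
    n_shared = int(len(order) * shared_frac)
    shared_idx = order[:n_shared]                      # "hot" neurons stay always-on
    routed = order[n_shared:]
    # Spread the remaining neurons across experts in round-robin fashion.
    expert_groups = [routed[i::num_experts] for i in range(num_experts)]
    return shared_idx, expert_groups

shared, groups = carve_experts(np.random.poisson(5.0, size=11008), num_experts=8)
```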
Scaling to Extreme Model Sizes: MoE has enabled some of the largest models to date, and new techniques ensure such scaling remains feasible. DeepSeek-V3 (2024) is a milestone, with 671B total parameters (37B active per token) using a fine-grained MoE design (DeepSeekMoE) (Paper page - DeepSeek-V3 Technical Report). To train such a massive model, the authors employed Multi-Head Latent Attention (MLA) and other optimizations validated in an earlier version (DeepSeek-V2). They also introduced a multi-token prediction objective (predicting multiple tokens per step) to improve training efficiency. Impressively, DeepSeek-V3 was pretrained on 14.8 trillion tokens and required 2.788 million GPU hours on H800 hardware, yet maintained stable training (no loss spikes or restarts). Its successful training run underscores advances in MoE training stability at scale. On the other end of the scale, OLMoE’s strategy was to leverage an enormous training dataset (5 trillion tokens) to get the most out of a smaller MoE model (Paper page - OLMoE: Open Mixture-of-Experts Language Models). By doing so, OLMoE-1B-7B attained performance on par with much larger dense models while keeping active parameters low, illustrating a data-scaling trade-off: massive training data can compensate for fewer active parameters. Another noteworthy technique is Branch-Train-MiX (BTX) by Sukhbaatar et al. (2024), which trains several expert LLMs in parallel on different domains (e.g. code, math, etc.) and then merges them into a single MoE model ([2403.07816] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM). In BTX, each expert is first independently fine-tuned from a seed model (high-throughput parallel training), then their FFN weights are combined as MoE experts and a brief MoE fine-tuning is done to learn the routing ([2403.07816] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM). This approach achieved a strong accuracy–efficiency trade-off, effectively “sharing the load” of training across multiple experts and then uniting them ([2403.07816] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM). Techniques like DS-MoE, CMoE, and BTX reflect a broader 2024 trend: making MoE more resource-efficient – whether by clever training procedures or by leveraging existing models – to broaden its practicality.
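The merge step of a Branch-Train-MiX-style pipeline can be sketched as follows: FFN weights from each domain-tuned checkpoint become separate experts, while the remaining weights (attention, embeddings, norms) are averaged; a router is then added on top and the merged model is briefly fine-tuned. The state-dict layout and prefix below are hypothetical.

```python
import torch

def merge_checkpoints_into_moe(state_dicts, ffn_prefix="mlp."):
    """Sketch of Branch-Train-MiX-style merging (hypothetical state-dict layout).

    - Parameters under `ffn_prefix` are kept as separate experts, one per source model.
    - All other parameters are averaged across the domain-tuned checkpoints.
    A router must still be added on top and the merged model briefly fine-tuned.
    """
    merged, experts = {}, {}
    for name in state_dicts[0]:
        tensors = [sd[name] for sd in state_dicts]
        if ffn_prefix in name:
            for i, t in enumerate(tensors):
                experts[f"{name}.expert_{i}"] = t.clone()      # one expert copy per source model
        else:
            merged[name] = torch.stack(tensors).mean(dim=0)    # average the shared weights
    merged.update(experts)
    return merged
```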
Applications of MoE in Large-Scale AI Models (LLMs and Beyond)
The MoE paradigm has seen wide application in large-scale AI, most prominently in natural language tasks with LLMs. By sparsely activating experts, MoE LLMs can attain superior performance on a range of tasks without the prohibitive costs of dense super-models. For instance, MoE has been key to building competitive open-source LLMs that rival much larger closed models. DeepSeek-V3, with its 37B active parameters, achieves performance comparable to leading closed-source models on benchmarks like MMLU (e.g. ~88.5% accuracy) (The Open-Source Advantage in Large Language Models (LLMs)). This indicates MoE can push open models into the state-of-the-art range. Similarly, Mixtral 8×7B (Mistral AI, 2024) leverages MoE to outperform dense models far above its inference budget: although each token uses only ~13B parameters, Mixtral surpassed Meta’s dense Llama 2 70B on tasks such as math, code generation, and multilingual understanding ([2401.04088] Mixtral of Experts). Applications in multilingual and multi-task settings especially benefit from MoE, since different experts can capture different languages or skills. The Branch-Train-MiX approach explicitly built experts for distinct skill domains and combined them, yielding a single model proficient in all ([2403.07816] Branch-Train-MiX: Mixing Expert LLMs into a Mixture-of-Experts LLM). This modular training is useful for assembling specialists (e.g. a code expert, a reasoning expert) into one LLM.
Beyond general LLM benchmarks, MoE models have been applied to specialized domains and modalities. In the finance domain, researchers used MoE for stock trading models: recent work integrated textual news and price data, and even used an LLM-based router (LLMoE) to select experts, resulting in improved trading decisions ([2501.09636] LLM-Based Routing in Mixture of Experts: A Novel Framework for Trading). This showcases MoE’s flexibility in multimodal scenarios – different experts could handle numerical vs. textual analysis. MoEs have also been explored in embeddings and retrieval: Li & Zhou (2024) discovered that the router outputs (the gating weights) of an MoE LLM can serve directly as high-quality feature embeddings for sentences ([2410.10814] Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free). Without any fine-tuning, these MoE-based embeddings were robust to prompt wording and captured high-level semantics, improving performance on various embedding tasks ([2410.10814] Your Mixture-of-Experts LLM Is Secretly an Embedding Model For Free). This is an intriguing free byproduct of MoE architectures, hinting that MoE LLMs inherently learn rich representations in their routing mechanisms.
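The routing-weights-as-embeddings observation can be sketched as follows: run a sentence through an MoE LLM, collect the router's gate distribution at each MoE layer, and pool them into one fixed-size vector. The pooling choice (token-averaged gates, concatenated across layers) is an assumption for illustration; the attribute access assumes a Hugging Face-style Mixtral model that exposes `output_router_logits`, and other MoE stacks expose routing scores differently.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def routing_weight_embedding(model, tokenizer, text: str) -> torch.Tensor:
    """Build a sentence embedding from an MoE model's routing distributions."""
    inputs = tokenizer(text, return_tensors="pt")
    outputs = model(**inputs, output_router_logits=True)
    per_layer = []
    for logits in outputs.router_logits:           # one (tokens, num_experts) tensor per MoE layer
        gates = F.softmax(logits.float(), dim=-1)
        per_layer.append(gates.mean(dim=0))        # average the gate distribution over tokens
    return torch.cat(per_layer)                    # concatenate layers into one fixed-size vector
```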
In summary, MoE has enabled large-scale models in NLP to achieve new heights in performance and specialization. It also opens up new capabilities: from on-device intelligent models (as OLMoE demonstrates) to domain-expert systems and multitask specialists. Outside of NLP, MoE techniques are being adopted in other fields (e.g. vision and multimodal models) to manage model size and complexity ([2407.06204] A Survey on Mixture of Experts) ([2407.06204] A Survey on Mixture of Experts), although LLMs remain the showcase for recent MoE advances. As the above examples illustrate, MoE’s ability to allocate capacity where needed makes it a powerful paradigm for building adaptable, large-scale AI systems.
Comparative Analysis of Recent Open-Source MoE Models
Several open-source projects in 2024–2025 have implemented MoE in large language models, each with different design choices and goals. Below is a comparison of key open MoE models and their characteristics:
Mixtral 8×7B (2024) – An MoE variant of Mistral-7B by Mistral AI, with 8 experts per layer (total ≈47B parameters, 13B active per token). Mixtral’s sparse 13B inference outperforms dense 70B models like Llama 2 70B and even matches GPT-3.5 on many benchmarks ([2401.04088] Mixtral of Experts). Notably, it excels at math and coding tasks, highlighting how MoE can inject scale for complex skills. It was released under Apache 2.0 with 32k context length support, making it a powerful community model that is both high-performing and efficient ([2401.04088] Mixtral of Experts).
OLMoE-1B-7B (2024) – A fully open MoE LLM from the Allen Institute for AI and Contextual AI, with 7B total parameters (1B active). Trained on an unprecedented 5 trillion tokens, OLMoE achieves best-in-class performance for its size, even beating larger models like Llama 2 13B Chat (Paper page - OLMoE: Open Mixture-of-Experts Language Models). Its focus is on edge and low-latency deployment: by keeping only 1B parameters active, it can run on common devices (e.g. smartphones) while matching or surpassing much larger models on tasks like MMLU (Introducing OLMoE - fully open source Mixture of Experts LLM). OLMoE is also one of the most transparent releases, providing open weights, code, data, and training logs (OLMoE: Open Mixture-of-Experts Language Models | OpenReview).
DeepSeek-V3 (2024) – One of the largest MoE LLMs to date, with 671B parameters (37B activated per forward pass) (Paper page - DeepSeek-V3 Technical Report). Developed by DeepSeek-AI, this model uses a fine-grained MoE design (DeepSeekMoE, with shared and routed experts) to manage the huge expert count and forgoes the usual auxiliary routing loss in favor of a novel balancing strategy. DeepSeek-V3 was trained on 14.8T tokens and underwent supervised and reinforcement fine-tuning. The result is a model that outperforms all other open-source LLMs and is on par with top proprietary models in many evaluations. Its success demonstrates that MoE can scale to extreme model sizes effectively, given sufficient data and engineering. The project also released its checkpoints openly, contributing a new high-end baseline for the community.
Jamba (2024) – An open LLM from AI21 Labs with a hybrid MoE architecture. Jamba intermixes traditional Transformer layers with state-space model (Mamba) layers, and inserts MoE experts in some of these layers ([2403.19887] Jamba: A Hybrid Transformer-Mamba Language Model). This design yields high throughput and low memory usage. Impressively, Jamba attains state-of-the-art performance on both standard language benchmarks and long-context (256K token) tasks, all while running on a single 80GB GPU ([2403.19887] Jamba: A Hybrid Transformer-Mamba Language Model). Its open release (with permissive license) provides researchers a unique testbed that combines MoE scaling with long-context processing. Jamba underlines that MoE is compatible with other innovations (like extended context) to address multiple challenges simultaneously.
Each of these models showcases different strengths of MoE: Mixtral and DeepSeek emphasize raw performance gains by increasing capacity; OLMoE and Jamba emphasize efficiency — whether in deployment size or context length — without sacrificing capability. All are fully open-source, indicating a trend in 2024–2025 to democratize MoE research. With open implementations available, practitioners can study and build on these advances (e.g. using released code from Mixtral ([2401.04088] Mixtral of Experts) or OLMoE (OLMoE: Open Mixture-of-Experts Language Models | OpenReview)). The comparative takeaway is that MoE is a versatile architecture: it can be tuned for maximum scale (DeepSeek), cost-effective performance (OLMoE, Mixtral), or specialized tasks (Jamba for long context, BTX for multi-domain, etc.), all within the open research ecosystem.
Conclusion and Outlook
Mixture-of-Experts architectures have resurged as a key avenue for scaling AI models in 2024 and 2025. The core idea of sparsely activated expert networks has been validated in large language models, yielding impressive performance-to-compute ratios. Recent research has not only solidified the fundamentals of MoE (through surveys and analysis ([2406.18219] A Closer Look into Mixture-of-Experts in Large Language Models) ([2407.06204] A Survey on Mixture of Experts)) but also pushed the boundaries with new routing methods, training schemes, and open deployments. As summarized, MoE enables models that are larger, more specialized, yet more efficient than their dense counterparts. Open-source MoE LLMs now match or exceed closed models on many fronts, highlighting the power of community-driven innovation (Paper page - OLMoE: Open Mixture-of-Experts Language Models).
Looking forward, challenges remain in MoE research: balancing experts without auxiliary losses, reducing communication overhead in distributed MoE training, and extending MoE to diverse modalities and tasks. However, the trajectory is clear – key techniques like expert merging, dense-to-sparse training, and hybrid architectures are making MoEs more accessible and robust. Given the successes of 2024’s open MoE models, we can expect continued refinement of expert models in 2025. In large-scale AI, MoE stands as a promising path to scale further, train faster, and deploy smarter. The literature of the past two years provides a strong foundation and toolkit for the next generation of MoE-based AI systems.