Mixture of Experts (MoE)
Definition
Mixture of Experts (MoE) is a neural network architecture that replaces the dense feed-forward layers in transformers with a collection of N expert networks and a router (gating network). For each token, the router selects only k of the N experts (typically k=2 out of 8-64) to process that token. The total parameter count is therefore large (all N experts), but only roughly k/N of the expert parameters are activated per token (attention and embedding layers are shared), making compute cost proportional to the much smaller k rather than N. Mixtral-8x7B has 8 experts and activates 2 per token, providing the quality of a ~47B model at the inference cost of a ~13B model. GPT-4 is widely believed to be an MoE model, which would explain its combination of high quality and relatively fast inference.
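As a back-of-the-envelope illustration of the k/N arithmetic: the 46.7B total is Mixtral's published figure, but the split between shared and expert parameters below is an assumption chosen for illustration, not an exact breakdown.

```python
# Rough active-parameter arithmetic for a Mixtral-8x7B-style MoE model.
# shared_params is an assumed figure for illustration; real breakdowns differ.
n_experts = 8           # experts per MoE layer
k = 2                   # experts activated per token (top-k routing)
total_params = 46.7e9   # total parameters (all experts + shared layers)
shared_params = 1.6e9   # assumed attention/embedding params shared by all tokens

# Parameters belonging to a single expert's feed-forward block
expert_params = (total_params - shared_params) / n_experts

# Per token, only the shared layers plus k experts actually run
active_params = shared_params + k * expert_params
print(f"Active per token: {active_params / 1e9:.1f}B of {total_params / 1e9:.1f}B")
```

With these assumed numbers the active count lands near the ~13B figure commonly quoted for Mixtral, which is why its inference cost resembles a 13B dense model.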
Why It Matters
MoE enables scaling model capacity (total parameters) without proportionally scaling inference cost, which is the key to building larger, more capable models that remain economically deployable. For AI application builders, MoE models offer a practical quality-cost tradeoff: Mixtral-8x7B performs competitively with Llama 2 70B at roughly 3x lower inference compute. Understanding MoE also explains why parameter counts alone don't predict cost: a 47B MoE model runs at roughly the speed and cost of a 13B dense model, because only its active parameters are computed per token. For 99helpers, MoE models are particularly relevant when evaluating self-hosted open-source model options.
How It Works
An MoE layer works as follows: each input token passes through a router (a small linear layer followed by a softmax) that scores all N experts for that token. The top-k experts are selected, their outputs are computed, and the results are combined as a weighted sum (weighted by the router probabilities). Load balancing is critical: without it, the router may route nearly all tokens to one or two experts, effectively collapsing the model into a much smaller dense one. Auxiliary losses during training penalize imbalanced expert utilization. Expert specialization is emergent: different experts tend to specialize in different domains, syntactic structures, or languages without any explicit specialization objective. Serving MoE models requires keeping all expert weights in memory, even though only a fraction are active per token.
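The routing logic above can be sketched in a few lines of NumPy. This is a minimal single-token illustration, not a production implementation: dimensions are arbitrary, the experts are plain two-layer ReLU MLPs standing in for transformer FFN blocks, and the auxiliary load-balancing loss is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, k = 16, 32, 8, 2

# Router: a single linear layer producing one score per expert.
W_router = rng.normal(scale=0.1, size=(d_model, n_experts))

# Each expert: a two-layer MLP (stand-in for a transformer FFN block).
experts = [
    (rng.normal(scale=0.1, size=(d_model, d_ff)),
     rng.normal(scale=0.1, size=(d_ff, d_model)))
    for _ in range(n_experts)
]

def moe_layer(x):
    """x: (d_model,) vector for one token. Returns the MoE layer output."""
    logits = x @ W_router
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over N experts
    top_k = np.argsort(probs)[-k:]                # indices of the k best experts
    weights = probs[top_k] / probs[top_k].sum()   # renormalize over selected experts
    out = np.zeros(d_model)
    for w, idx in zip(weights, top_k):            # only k experts ever execute
        W1, W2 = experts[idx]
        out += w * (np.maximum(x @ W1, 0) @ W2)   # weighted sum of expert outputs
    return out

token = rng.normal(size=d_model)
y = moe_layer(token)
```

In real implementations the routing is batched over all tokens at once, and frameworks add expert-capacity limits and the load-balancing auxiliary loss on top of this skeleton.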
Mixture of Experts — Route Each Token to 2 of 8 Expert Networks
[Figure: the input token "Solve 2x + 4 = 12" passes to the gating network (router), which computes a softmax over the 8 experts and selects the top 2. The experts shown are: Expert 1 (math & reasoning, ACTIVE), Expert 2 (code generation), Expert 3 (language & style, ACTIVE), Expert 4 (factual recall), Expert 5 (instruction following), Expert 6 (summarization), Expert 7 (translation), and Expert 8 (common sense). The outputs of Experts 1 and 3 are combined as a weighted sum to produce "x = 4". Key figures: 2/8 experts activated per token (top-k routing); ~25% compute vs. dense, since only active experts run; the model shown is Mixtral 8×7B, with 46.7B parameters and ~12B active.]
Real-World Example
A 99helpers team evaluates Mixtral-8x7B (Mistral's MoE model) for their chatbot. Total parameters: 47B. Active parameters per token: ~13B. On their benchmark, Mixtral scores 82% accuracy, slightly below Llama-3-70B at 85%, but it is 3x faster and cheaper to self-host. They deploy Mixtral on 2 A100 80GB GPUs (enough memory to hold all 47B parameters) at a throughput of 120 tokens/second. A dense 70B model would require 4 A100s at half the throughput. For their target of 500 concurrent users with real-time response requirements, Mixtral's MoE architecture delivers sufficient quality at half the infrastructure cost.
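A quick sanity check of the per-GPU economics, using only the numbers from this scenario (the throughput figures are from the example above, not general benchmarks):

```python
# Throughput-per-GPU comparison from the scenario's figures.
moe_gpus, moe_tps = 2, 120       # Mixtral: 2 A100s at 120 tokens/second
dense_gpus, dense_tps = 4, 60    # dense 70B: 4 A100s at half the throughput

moe_per_gpu = moe_tps / moe_gpus        # 60 tokens/second per GPU
dense_per_gpu = dense_tps / dense_gpus  # 15 tokens/second per GPU
print(f"Mixtral delivers {moe_per_gpu / dense_per_gpu:.0f}x the tokens per GPU")
```

Under these numbers, every GPU serving Mixtral produces 4x the tokens of a GPU serving the dense 70B model, which is where the "half the infrastructure cost" conclusion comes from.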
Common Mistakes
- ✕ Comparing MoE and dense models by total parameter count—MoE models should be compared by active parameters per token for compute cost estimation.
- ✕ Assuming all MoE models use the same expert routing strategy—different implementations (top-1 vs top-2 routing, different load-balancing schemes) produce different quality-efficiency tradeoffs.
- ✕ Underestimating MoE serving infrastructure requirements—all expert weights must be loaded into GPU memory even though only a fraction are active per token, so an MoE model needs more memory than a dense model with the same active-parameter count.
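The last mistake is easy to quantify: weight memory scales with total parameters, not active ones. A rough estimate assuming fp16 weights (2 bytes per parameter) and ignoring KV-cache and activation overhead:

```python
# Weight-memory estimate: MoE serving needs room for ALL experts,
# even though only ~13B parameters are active per token.
BYTES_PER_PARAM = 2  # fp16 assumption; fp32 would double these figures

def weight_memory_gb(params_billions):
    """GB of GPU memory needed just for the model weights."""
    return params_billions * 1e9 * BYTES_PER_PARAM / 1e9

moe_total_b = 46.7    # Mixtral-8x7B total parameters (billions)
dense_13_b = 13.0     # dense model matching Mixtral's active-parameter count

print(f"Mixtral weights:   {weight_memory_gb(moe_total_b):.0f} GB")  # roughly 93 GB
print(f"Dense 13B weights: {weight_memory_gb(dense_13_b):.0f} GB")   # roughly 26 GB
```

Despite having similar per-token compute, the MoE model needs about 3.5x the weight memory of the 13B dense model, which is why the example deployment above uses two 80GB GPUs.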
Related Terms
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Transformer
The transformer is the neural network architecture underlying all modern LLMs, using self-attention mechanisms to process entire input sequences in parallel and capture long-range dependencies between words.
Model Parameters
Model parameters are the learned numerical weights of an LLM—billions of floating-point values encoding the model's knowledge and capabilities—whose count (e.g., 7B, 70B, 405B) is the primary indicator of model capacity.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
Open-Source LLM
An open-source LLM is a language model with publicly available weights that anyone can download, run locally, fine-tune, and deploy without per-query licensing fees, enabling private deployment and customization.