Mixture of Experts (MoE)
Definition
Mixture of Experts (MoE) is a neural network architecture that replaces the dense feed-forward layers in transformers with a collection of N expert networks and a router (gating network). For each token, the router selects only k of the N experts (typically k=2 out of 8-64) to process that token. The total parameter count is therefore large (all N experts), but only roughly k/N of the expert parameters are activated per token (attention and embedding layers are shared), making compute cost proportional to the much smaller k rather than N. Mixtral-8x7B has 8 experts and activates 2 per token, providing the quality of a ~47B model at the inference cost of a ~13B model. GPT-4 is widely believed to be an MoE model, which would explain its combination of high quality and relatively fast inference.
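As a back-of-the-envelope illustration of the k/N arithmetic: the 46.7B total is Mixtral's published figure, but the split between shared and expert parameters below is an assumption chosen for illustration, not an exact breakdown.

```python
# Rough active-parameter arithmetic for a Mixtral-8x7B-style MoE model.
# shared_params is an assumed figure for illustration; real breakdowns differ.
n_experts = 8           # experts per MoE layer
k = 2                   # experts activated per token (top-k routing)
total_params = 46.7e9   # total parameters (all experts + shared layers)
shared_params = 1.6e9   # assumed attention/embedding params shared by all tokens

# Parameters belonging to a single expert's feed-forward block
expert_params = (total_params - shared_params) / n_experts

# Per token, only the shared layers plus k experts actually run
active_params = shared_params + k * expert_params
print(f"Active per token: {active_params / 1e9:.1f}B of {total_params / 1e9:.1f}B")
```

With these assumed numbers the active count lands near the ~13B figure commonly quoted for Mixtral, which is why its inference cost resembles a 13B dense model.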
Why It Matters
MoE enables scaling model capacity (total parameters) without proportionally scaling inference cost, which is the key to building larger, more capable models that remain economically deployable. For AI application builders, MoE models offer a practical quality-cost tradeoff: Mixtral-8x7B performs competitively with Llama 2 70B at roughly 3x lower inference compute. Understanding MoE also explains why parameter counts alone don't predict cost: a 47B MoE model runs at roughly the speed and cost of a 13B dense model, because only its active parameters are computed per token. For 99helpers, MoE models are particularly relevant when evaluating self-hosted open-source model options.
How It Works
An MoE layer works as follows: each input token passes through a router (a small linear layer followed by a softmax) that scores all N experts for that token. The top-k experts are selected, their outputs are computed, and the results are combined as a weighted sum (weighted by the router probabilities). Load balancing is critical: without it, the router may route nearly all tokens to one or two experts, effectively collapsing the model into a much smaller dense one. Auxiliary losses during training penalize imbalanced expert utilization. Expert specialization is emergent: different experts tend to specialize in different domains, syntactic structures, or languages without any explicit specialization objective. Serving MoE models requires keeping all expert weights in memory, even though only a fraction are active per token.
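The routing logic above can be sketched in a few lines of NumPy. This is a minimal single-token illustration, not a production implementation: dimensions are arbitrary, the experts are plain two-layer ReLU MLPs standing in for transformer FFN blocks, and the auxiliary load-balancing loss is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, n_experts, k = 16, 32, 8, 2

# Router: a single linear layer producing one score per expert.
W_router = rng.normal(scale=0.1, size=(d_model, n_experts))

# Each expert: a two-layer MLP (stand-in for a transformer FFN block).
experts = [
    (rng.normal(scale=0.1, size=(d_model, d_ff)),
     rng.normal(scale=0.1, size=(d_ff, d_model)))
    for _ in range(n_experts)
]

def moe_layer(x):
    """x: (d_model,) vector for one token. Returns the MoE layer output."""
    logits = x @ W_router
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                          # softmax over N experts
    top_k = np.argsort(probs)[-k:]                # indices of the k best experts
    weights = probs[top_k] / probs[top_k].sum()   # renormalize over selected experts
    out = np.zeros(d_model)
    for w, idx in zip(weights, top_k):            # only k experts ever execute
        W1, W2 = experts[idx]
        out += w * (np.maximum(x @ W1, 0) @ W2)   # weighted sum of expert outputs
    return out

token = rng.normal(size=d_model)
y = moe_layer(token)
```

In real implementations the routing is batched over all tokens at once, and frameworks add expert-capacity limits and the load-balancing auxiliary loss on top of this skeleton.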
Mixture of Experts — Route Each Token to 2 of 8 Expert Networks
[Figure: the input token "Solve 2x + 4 = 12" passes to the gating network (router), which computes a softmax over the 8 experts and selects the top 2. The experts shown are: Expert 1 (math & reasoning, ACTIVE), Expert 2 (code generation), Expert 3 (language & style, ACTIVE), Expert 4 (factual recall), Expert 5 (instruction following), Expert 6 (summarization), Expert 7 (translation), and Expert 8 (common sense). The outputs of Experts 1 and 3 are combined as a weighted sum to produce "x = 4". Key figures: 2/8 experts activated per token (top-k routing); ~25% compute vs. dense, since only active experts run; the model shown is Mixtral 8×7B, with 46.7B parameters and ~12B active.]
Real-World Example
A 99helpers team evaluates Mixtral-8x7B (Mistral's MoE model) for their chatbot. Total parameters: 47B. Active parameters per token: ~13B. On their benchmark, Mixtral scores 82% accuracy, slightly below Llama-3-70B at 85%, but it is 3x faster and cheaper to self-host. They deploy Mixtral on 2 A100 80GB GPUs (enough memory to hold all 47B parameters) at a throughput of 120 tokens/second. A dense 70B model would require 4 A100s at half the throughput. For their target of 500 concurrent users with real-time response requirements, Mixtral's MoE architecture delivers sufficient quality at half the infrastructure cost.
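A quick sanity check of the per-GPU economics, using only the numbers from this scenario (the throughput figures are from the example above, not general benchmarks):

```python
# Throughput-per-GPU comparison from the scenario's figures.
moe_gpus, moe_tps = 2, 120       # Mixtral: 2 A100s at 120 tokens/second
dense_gpus, dense_tps = 4, 60    # dense 70B: 4 A100s at half the throughput

moe_per_gpu = moe_tps / moe_gpus        # 60 tokens/second per GPU
dense_per_gpu = dense_tps / dense_gpus  # 15 tokens/second per GPU
print(f"Mixtral delivers {moe_per_gpu / dense_per_gpu:.0f}x the tokens per GPU")
```

Under these numbers, every GPU serving Mixtral produces 4x the tokens of a GPU serving the dense 70B model, which is where the "half the infrastructure cost" conclusion comes from.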
Common Mistakes
- ✕ Comparing MoE and dense models by total parameter count—MoE models should be compared by active parameters per token for compute cost estimation.
- ✕ Assuming all MoE models use the same expert routing strategy—different implementations (top-1 vs top-2 routing, different load-balancing schemes) produce different quality-efficiency tradeoffs.
- ✕ Underestimating MoE serving infrastructure requirements—all expert weights must be loaded into GPU memory even though only a fraction are active per token, so an MoE model needs more memory than a dense model with the same active-parameter count.
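The last mistake is easy to quantify: weight memory scales with total parameters, not active ones. A rough estimate assuming fp16 weights (2 bytes per parameter) and ignoring KV-cache and activation overhead:

```python
# Weight-memory estimate: MoE serving needs room for ALL experts,
# even though only ~13B parameters are active per token.
BYTES_PER_PARAM = 2  # fp16 assumption; fp32 would double these figures

def weight_memory_gb(params_billions):
    """GB of GPU memory needed just for the model weights."""
    return params_billions * 1e9 * BYTES_PER_PARAM / 1e9

moe_total_b = 46.7    # Mixtral-8x7B total parameters (billions)
dense_13_b = 13.0     # dense model matching Mixtral's active-parameter count

print(f"Mixtral weights:   {weight_memory_gb(moe_total_b):.0f} GB")  # roughly 93 GB
print(f"Dense 13B weights: {weight_memory_gb(dense_13_b):.0f} GB")   # roughly 26 GB
```

Despite having similar per-token compute, the MoE model needs about 3.5x the weight memory of the 13B dense model, which is why the example deployment above uses two 80GB GPUs.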
Related Terms
Large Language Model (LLM)
A large language model is a neural network trained on vast amounts of text that learns to predict and generate human-like text, enabling tasks like answering questions, writing, translation, and code generation.
Transformer
The transformer is the neural network architecture underlying all modern LLMs, using self-attention mechanisms to process entire input sequences in parallel and capture long-range dependencies between words.
Model Parameters
Model parameters are the learned numerical weights of an LLM—billions of floating-point values encoding the model's knowledge and capabilities—whose count (e.g., 7B, 70B, 405B) is the primary indicator of model capacity.
LLM Inference
LLM inference is the process of running a trained model to generate a response for a given input, encompassing the forward pass computation, token generation, and the infrastructure required to serve predictions at scale.
Open-Source LLM
An open-source LLM is a language model with publicly available weights that anyone can download, run locally, fine-tune, and deploy without per-query licensing fees, enabling private deployment and customization.