Mixture of experts, visually
April 10, 2026
Most people hear "mixture of experts" and picture something complicated. It isn't. It's one of the cleaner architectural ideas in modern LLMs — and once you see it, you'll spot it everywhere.
The problem with dense models
A standard transformer (GPT-style) routes every token through every parameter, every time. If your model has 70 billion parameters, each token activates all 70 billion. That works, but it's expensive — and it means you're paying for capacity you don't need on most inputs.
Think about what a language model actually does. A token like Paris in a geography question needs different knowledge than Paris in a fashion-week recap. A dense model handles both with the same weights, all of them, every time, whether it needs them or not.
What MoE actually does
A mixture-of-experts model splits the feed-forward layer (the "storage" part of the transformer) into N separate expert networks. For each token, a small router network picks the top k experts — typically 2 out of maybe 8 or 64 — and routes only through those.
Token → Attention → Router → [Expert 3, Expert 7] → Output
↗ (64 experts total, only 2 active)
The result: the model stores roughly N/k times more feed-forward capacity than it uses on any one token, so you get the knowledge of a much larger dense model at the per-token compute cost of a much smaller one.
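To make the routing concrete, here's a minimal sketch of one MoE layer in PyTorch. Everything here is illustrative: the names (MoELayer, n_experts, top_k), the GELU experts, and the shapes are assumptions for the sketch, not details of any particular model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, d_model, d_ff, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # The router is tiny: one linear layer scoring every expert for each token.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is an ordinary feed-forward block.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                        # x: (n_tokens, d_model)
        scores = self.router(x)                  # (n_tokens, n_experts)
        weights, chosen = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # mixing weights over the k picks
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = chosen[:, k] == e         # tokens whose k-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out
```

In a real model this block replaces the single feed-forward layer inside each transformer block; attention is untouched, and every token still runs exactly top_k experts' worth of feed-forward compute.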
Why this matters in practice
GPT-4 is widely rumored to be an MoE model, and Mistral's Mixtral openly is one. Mixtral 8x7B has ~47B total parameters but activates only ~13B per token: roughly the compute cost of a 13B dense model with the knowledge of a much larger one.
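If you're curious where those numbers come from, here's a back-of-the-envelope count using Mixtral's published configuration (hidden size 4096, feed-forward size 14336, 32 layers, 8 experts, top-2 routing, grouped-query attention with 8 key/value heads of size 128, 32k vocabulary). Norms and router weights are small enough to ignore here.

```python
d, d_ff, n_layers, n_experts, top_k = 4096, 14336, 32, 8, 2
kv_dim, vocab = 8 * 128, 32000                 # grouped-query attention: 8 KV heads of 128

attn  = d*d + d*kv_dim + d*kv_dim + d*d        # q, k, v, o projections per layer
ffn   = 3 * d * d_ff                           # one expert: gate, up, down projections
embed = 2 * vocab * d                          # input + output embeddings (untied)

total  = embed + n_layers * (attn + n_experts * ffn)
active = embed + n_layers * (attn + top_k * ffn)
print(f"total  ≈ {total / 1e9:.1f}B parameters")   # ≈ 46.7B
print(f"active ≈ {active / 1e9:.1f}B per token")   # ≈ 12.9B
```

The experts dominate: attention and embeddings together come to under 2B parameters, and everything else is feed-forward weights that mostly sit idle on any given token.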
The tradeoff is memory: you have to load every expert onto your hardware even though only two run per forward pass. For Mixtral that's all ~47B parameters, roughly 94 GB of weights at 16-bit precision. MoE is great for inference throughput; it's hard to fit on a single consumer GPU.
The load-balancing problem
Here's where it gets interesting. If the router always picks the same two experts, the others never learn anything. Training MoE models therefore requires an auxiliary load-balancing loss that penalizes uneven expert usage. Without it you get expert collapse: a handful of experts absorb all the traffic, the rest become dead weight, and you've effectively paid for a large model while training a small one.
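To see what that auxiliary loss looks like, here's a sketch of the Switch Transformer formulation (variable names are mine): for each expert, multiply the fraction of tokens actually routed to it by the average probability the router assigned to it, then sum.

```python
import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits):
    """router_logits: (n_tokens, n_experts) raw router scores for one layer."""
    n_experts = router_logits.shape[-1]
    probs = F.softmax(router_logits, dim=-1)            # (n_tokens, n_experts)
    top1 = probs.argmax(dim=-1)                         # each token's favorite expert
    f = F.one_hot(top1, n_experts).float().mean(dim=0)  # fraction of tokens per expert
    P = probs.mean(dim=0)                               # average probability per expert
    # Minimized when both f and P are uniform; spikes when traffic concentrates.
    return n_experts * torch.sum(f * P)
```

This term gets added to the ordinary language-modeling loss with a small coefficient (the Switch Transformer paper uses 0.01), nudging the router to spread tokens around without dictating exactly how.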
This is an active research area. Google's Switch Transformer, Mistral's Mixtral, and DeepSeek's MoE paper all approach it slightly differently.
The one-line mental model
An MoE model is a dense model that learned to skip most of itself, selectively, based on what the token needs.
That's it. The router is the trick. Everything else follows.