LLMs · 8 min read

Mixture of experts, visually

April 10, 2026

Most people hear "mixture of experts" and picture something complicated. It isn't. It's one of the cleaner architectural ideas in modern LLMs — and once you see it, you'll spot it everywhere.

The problem with dense models

A standard transformer (GPT-style) routes every token through every parameter, every time. If your model has 70 billion parameters, each token activates all 70 billion. That works, but it's expensive — and it means you're paying for capacity you don't need on most inputs.

Think about what a language model actually does. A token like "Paris" in a geography question needs different knowledge than "Paris" in a fashion-week recap. A dense model handles both with the same weights, paying the full parameter bill either way.

What MoE actually does

A mixture-of-experts model splits the feed-forward layer (the "storage" part of the transformer) into N separate expert networks. For each token, a small router network picks the top k experts — typically 2 out of maybe 8 or 64 — and routes only through those.

Token → Attention → Router → [Expert 3, Expert 7] → Output
                           ↗ (64 experts total, only 2 active)

The result: the model has the knowledge capacity of a dense model 8–32× its active size, but each token pays only the compute cost of that small active slice.
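To make the routing concrete, here's a minimal sketch of an MoE feed-forward layer in PyTorch. Everything in it is illustrative — the sizes, the plain two-layer GELU experts, and the class name are placeholders, not any particular model's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    # Toy mixture-of-experts FFN: a router scores N experts per token,
    # the top-k are run, and their outputs are combined with softmaxed weights.
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                # x: (tokens, d_model)
        logits = self.router(x)                          # (tokens, n_experts)
        weights, chosen = logits.topk(self.k, dim=-1)    # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the chosen k
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_idx, slot = (chosen == i).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                                 # this expert got no tokens this batch
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

Each token only ever passes through its k chosen experts; the others contribute nothing to that token's output, which is where the compute savings come from.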

Why this matters in practice

GPT-4 is widely believed to be an MoE model, and Mixtral openly is one. Mixtral 8x7B has ~47B total parameters but activates only ~13B per token — roughly the cost of a 13B dense model with the knowledge of a much larger one.

The tradeoff is memory: you have to load all the experts onto your hardware even if only 2 are active per forward pass. MoE is great for inference throughput; it's harder to run on a single consumer GPU.
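You can back out roughly where those parameters live from the two numbers above. This is purely back-of-envelope arithmetic on the approximate figures quoted (real layer shapes differ a bit), assuming the only difference between total and active is how many expert FFNs a token touches:

# Approximate figures from above; 8 experts, top-2 routing.
total_params  = 47e9   # everything that has to sit in memory
active_params = 13e9   # shared weights + the 2 experts a token actually uses
n_experts, k  = 8, 2

# total  = shared + n_experts * per_expert
# active = shared + k * per_expert
per_expert = (total_params - active_params) / (n_experts - k)   # ≈ 5.7B
shared     = total_params - n_experts * per_expert               # ≈ 1.7B (attention, embeddings, ...)
print(f"per-expert ≈ {per_expert/1e9:.1f}B, shared ≈ {shared/1e9:.1f}B")

On these rough numbers, well over 90% of the memory footprint is expert weights — which is exactly why the keep-everything-loaded requirement bites.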

The load-balancing problem

Here's where it gets interesting. If the router always picks the same two experts, the others never learn anything. Training MoE models typically requires an auxiliary load-balancing loss that penalizes uneven expert usage. Without it, you get expert collapse: a handful of experts absorb all the traffic, the rest sit idle, and you've effectively trained a much smaller dense model.
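One common form of that loss, roughly in the spirit of the Switch Transformer's auxiliary term (the exact formulation varies by paper, and the function below is only a sketch): for each expert, multiply the fraction of tokens routed to it by the router's mean probability for it, and penalize the sum. The product is smallest when both are spread evenly across experts.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, chosen_experts, n_experts):
    # Sketch of a Switch-Transformer-style auxiliary loss.
    # router_logits:  (tokens, n_experts) raw router scores
    # chosen_experts: (tokens, k) expert indices picked for each token
    probs = F.softmax(router_logits, dim=-1)                # router's soft distribution per token
    one_hot = F.one_hot(chosen_experts, n_experts).float()  # (tokens, k, n_experts)
    frac_tokens = one_hot.sum(dim=(0, 1)) / one_hot.sum()   # f_i: share of routing slots per expert
    mean_prob = probs.mean(dim=0)                            # P_i: mean router probability per expert
    # n_experts * sum(f_i * P_i) equals 1 when usage is perfectly uniform, larger otherwise
    return n_experts * torch.sum(frac_tokens * mean_prob)

In training, a small multiple of this gets added to the language-modeling loss. Tune it too low and experts still collapse; too high and the router spreads tokens evenly even when it shouldn't.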

This is an active research area. Google's Switch Transformer, Mistral's Mixtral, and DeepSeek's MoE paper all approach it slightly differently.

The one-line mental model

An MoE model is a dense model that learned to skip most of itself, selectively, based on what the token needs.

That's it. The router is the trick. Everything else follows.