LLMs · 8 min read

Mixture of experts, visually

April 10, 2026

Most people hear "mixture of experts" and picture something complicated. It isn't. It's one of the cleaner architectural ideas in modern LLMs — and once you see it, you'll spot it everywhere.

The problem with dense models

A standard transformer (GPT-style) routes every token through every parameter, every time. If your model has 70 billion parameters, each token activates all 70 billion. That works, but it's expensive — and it means you're paying for capacity you don't need on most inputs.

Think about what a language model actually does. A token like "Paris" in a geography question needs different knowledge than "Paris" in a fashion-week recap. A dense model handles both with the same weights, paying the full parameter bill either way.

What MoE actually does

A mixture-of-experts model splits the feed-forward layer (the "storage" part of the transformer) into N separate expert networks. For each token, a small router network picks the top k experts — typically 2 out of maybe 8 or 64 — and routes only through those.

Token → Attention → Router → [Expert 3, Expert 7] → Output
                           ↗ (64 experts total, only 2 active)

The result: the model has the knowledge capacity of a dense model 8–32× its active size, but each token pays only the compute cost of that small active slice.
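To make the routing concrete, here's a minimal sketch of an MoE feed-forward layer in PyTorch. Everything in it is illustrative — the sizes, the plain two-layer GELU experts, and the class name are placeholders, not any particular model's implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    # Toy mixture-of-experts FFN: a router scores N experts per token,
    # the top-k are run, and their outputs are combined with softmaxed weights.
    def __init__(self, d_model=512, d_hidden=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                # x: (tokens, d_model)
        logits = self.router(x)                          # (tokens, n_experts)
        weights, chosen = logits.topk(self.k, dim=-1)    # keep only the top-k experts per token
        weights = F.softmax(weights, dim=-1)             # renormalize over the chosen k
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_idx, slot = (chosen == i).nonzero(as_tuple=True)
            if token_idx.numel() == 0:
                continue                                 # this expert got no tokens this batch
            out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

Each token only ever passes through its k chosen experts; the others contribute nothing to that token's output, which is where the compute savings come from.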

Why this matters in practice

GPT-4 is widely believed to be an MoE model, and Mixtral openly is one. Mixtral 8x7B has ~47B total parameters but activates only ~13B per token — roughly the cost of a 13B dense model with the knowledge of a much larger one.

The tradeoff is memory: you have to load all the experts onto your hardware even if only 2 are active per forward pass. MoE is great for inference throughput; it's harder to run on a single consumer GPU.
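You can back out roughly where those parameters live from the two numbers above. This is purely back-of-envelope arithmetic on the approximate figures quoted (real layer shapes differ a bit), assuming the only difference between total and active is how many expert FFNs a token touches:

# Approximate figures from above; 8 experts, top-2 routing.
total_params  = 47e9   # everything that has to sit in memory
active_params = 13e9   # shared weights + the 2 experts a token actually uses
n_experts, k  = 8, 2

# total  = shared + n_experts * per_expert
# active = shared + k * per_expert
per_expert = (total_params - active_params) / (n_experts - k)   # ≈ 5.7B
shared     = total_params - n_experts * per_expert               # ≈ 1.7B (attention, embeddings, ...)
print(f"per-expert ≈ {per_expert/1e9:.1f}B, shared ≈ {shared/1e9:.1f}B")

On these rough numbers, well over 90% of the memory footprint is expert weights — which is exactly why the keep-everything-loaded requirement bites.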

The load-balancing problem

Here's where it gets interesting. If the router always picks the same two experts, the others never learn anything. Training MoE models typically requires an auxiliary load-balancing loss that penalizes uneven expert usage. Without it, you get expert collapse: a handful of experts absorb all the traffic, the rest sit idle, and you've effectively trained a much smaller dense model.
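One common form of that loss, roughly in the spirit of the Switch Transformer's auxiliary term (the exact formulation varies by paper, and the function below is only a sketch): for each expert, multiply the fraction of tokens routed to it by the router's mean probability for it, and penalize the sum. The product is smallest when both are spread evenly across experts.

import torch
import torch.nn.functional as F

def load_balancing_loss(router_logits, chosen_experts, n_experts):
    # Sketch of a Switch-Transformer-style auxiliary loss.
    # router_logits:  (tokens, n_experts) raw router scores
    # chosen_experts: (tokens, k) expert indices picked for each token
    probs = F.softmax(router_logits, dim=-1)                # router's soft distribution per token
    one_hot = F.one_hot(chosen_experts, n_experts).float()  # (tokens, k, n_experts)
    frac_tokens = one_hot.sum(dim=(0, 1)) / one_hot.sum()   # f_i: share of routing slots per expert
    mean_prob = probs.mean(dim=0)                            # P_i: mean router probability per expert
    # n_experts * sum(f_i * P_i) equals 1 when usage is perfectly uniform, larger otherwise
    return n_experts * torch.sum(frac_tokens * mean_prob)

In training, a small multiple of this gets added to the language-modeling loss. Tune it too low and experts still collapse; too high and the router spreads tokens evenly even when it shouldn't.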

This is an active research area. Google's Switch Transformer, Mistral's Mixtral, and DeepSeek's MoE paper all approach it slightly differently.

The one-line mental model

An MoE model is a dense model that learned to skip most of itself, selectively, based on what the token needs.

That's it. The router is the trick. Everything else follows.