Interactive Workbook

What if your AI had
specialist doctors
instead of one GP?

Mixture of Experts explained from scratch. Play with it, break it, understand it. No ML background needed.

Chapter 01 The Problem

Regular AI uses
100% of its brain
for every single word

Imagine hiring 100 specialists but forcing all 100 of them to attend every single meeting, even when only one of them is actually useful. That's how traditional AI models work.

🏥 The Hospital Analogy
You walk into a hospital with a broken finger. In a regular AI model, every single doctor (the cardiologist, the neurologist, the oncologist, the psychiatrist) examines your finger at the same time, every time. Expensive, slow, and mostly pointless. Mixture of Experts sends you straight to the hand specialist. Done.
Traditional Dense Model: Wasteful
Compute used: 100%

Every neuron activates for every single token. Processing "the" uses the same compute as processing a complex equation.

Mixture of Experts: Efficient
Compute used: 20%

Only the relevant experts activate. The right 2 out of 8 fire, and only those 2 do any work.
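
To make the contrast concrete, here is a toy sketch that counts how many experts actually run for one token. It follows the workbook's analogy by treating the dense model as if all eight experts ran. The expert functions, `make_expert`, and the choice of experts 2 and 5 are all hypothetical stand-ins, not real neural networks or a real router.

```python
# Toy comparison: how many expert networks run per token.
# NUM_EXPERTS mirrors the "2 out of 8" example; the experts themselves
# are placeholder functions, not real neural networks.

NUM_EXPERTS = 8


def make_expert(index, call_log):
    """Return a hypothetical expert that records each time it runs."""
    def expert(x):
        call_log.append(index)
        return x * (index + 1)  # placeholder for real computation
    return expert


def dense_forward(x, experts):
    """Dense behaviour: every expert processes the token."""
    return sum(expert(x) for expert in experts) / len(experts)


def sparse_forward(x, experts, chosen):
    """Sparse behaviour: only the chosen experts process the token."""
    return sum(experts[i](x) for i in chosen) / len(chosen)


dense_log, sparse_log = [], []
dense_experts = [make_expert(i, dense_log) for i in range(NUM_EXPERTS)]
sparse_experts = [make_expert(i, sparse_log) for i in range(NUM_EXPERTS)]

dense_forward(1.0, dense_experts)
# Assume a router (covered in Chapter 03) picked experts 2 and 5.
sparse_forward(1.0, sparse_experts, chosen=[2, 5])

print(len(dense_log), "experts ran in the dense pass")
print(len(sparse_log), "experts ran in the sparse pass")
```

The six experts that were not chosen never execute, which is where the compute saving comes from.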

Chapter 02 Meet the Experts

Each expert specialises
in something different

Inside a MoE model, there are multiple "expert" neural networks. Each one develops its own specialty during training. Click each expert to see what it's good at.

🔢 Math Expert
Handles equations, calculations, numerical reasoning, and logical proofs. Activates for anything quantitative.
💻 Code Expert
Specialises in programming logic, syntax, debugging, and software architecture across languages.
🗣️ Language Expert
Masters grammar, tone, translation, literary devices, and nuances across human languages.
🔬 Science Expert
Covers physics, chemistry, biology, and scientific reasoning, from atoms to ecosystems.
📜 History Expert
Knows dates, events, civilisations, cause-and-effect in human history, and cultural context.
🎨 Creative Expert
Generates stories, poems, metaphors, and imaginative content. Thinks in narratives and emotion.
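💡
These specialties aren't programmed in. They emerge naturally during training. The model discovers on its own that it's more efficient to have some neurons focus on math while others focus on language. No human labels these experts. In real models the split is usually messier than these tidy labels suggest, often following token-level patterns rather than whole subjects.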
Chapter 03 The Router

The gatekeeper that decides
who handles what

The Router is a small neural network that sits at the entrance of every MoE layer. Its only job: look at the incoming token and decide which 1-3 experts should handle it. Click a question below to watch the routing happen live.

🎯 How the router actually works
The router multiplies the incoming token's vector by a learned weight matrix and applies a softmax function, producing a probability score for each expert. It picks the top-K experts (usually 2) with the highest scores, and their outputs are blended using those scores as weights. Think of it as the router asking: "Which experts are best suited to handle this?" and picking the two best matches.
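
The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function names `softmax` and `route`, the 3-dimensional token, and the weight values are all made up for the example, and renormalising the chosen scores so they sum to 1 is a common convention assumed here.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(token_vector, router_weights, top_k=2):
    """Score every expert for one token and keep the top_k.

    router_weights has one row per expert; each row has the same
    length as token_vector.
    """
    logits = [
        sum(w * x for w, x in zip(row, token_vector))
        for row in router_weights
    ]
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalise so the chosen experts' weights sum to 1
    # (an assumed convention, not something the workbook specifies).
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]


# Hypothetical numbers: a 3-dimensional token and 4 experts.
token = [1.0, 0.0, 2.0]
weights = [
    [0.1, 0.3, 0.0],  # expert 0
    [0.9, 0.1, 0.8],  # expert 1
    [0.0, 0.5, 0.1],  # expert 2
    [0.4, 0.2, 0.6],  # expert 3
]

result = route(token, weights)
for expert_id, gate in result:
    print(f"expert {expert_id} gets weight {gate:.2f}")
```

With these made-up numbers, experts 1 and 3 score highest, so only those two would process the token, and their outputs would be combined using the printed weights.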
Chapter 04 Sparse Activation

The magic number:
only 2 out of N fire

In the classic MoE setup, only 2 experts activate per token, regardless of how many experts exist in total (newer models often activate 8). Drag the sliders and watch what happens to compute cost and model capacity.

Total experts: 8
Active per token: 2
Params per expert: 7B

Total Parameters: 56B (model's full knowledge base)
Active per Token: 14B (actual compute used)
Efficiency Ratio: 4x (more knowledge, same compute)
Activation Rate: 25% (of parameters used per token)
⚡
This is why MoE is revolutionary: with the settings above, you get the knowledge of a 56B parameter model at the compute cost of a 14B model. The experts that don't activate cost no compute for that token; they just sit in memory, waiting. You scale knowledge without scaling compute.
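
The slider arithmetic in this chapter can be reproduced with a short back-of-envelope function. This is a sketch under a simplifying assumption that matches the sliders: every parameter lives inside an expert, so shared parts of a real model (such as attention layers) are ignored. The name `moe_stats` and its dictionary keys are illustrative only.

```python
def moe_stats(total_experts, active_per_token, params_per_expert_b):
    """Back-of-envelope MoE sizing, in billions of parameters.

    Simplification: all parameters live in the experts, so shared
    components such as attention layers are ignored.
    """
    if not 1 <= active_per_token <= total_experts:
        raise ValueError("active_per_token must be between 1 and total_experts")
    total_b = total_experts * params_per_expert_b
    active_b = active_per_token * params_per_expert_b
    return {
        "total_params_b": total_b,
        "active_params_b": active_b,
        "efficiency_ratio": total_b / active_b,
        "activation_rate_pct": 100 * active_b / total_b,
    }


# Default slider positions: 8 experts, 2 active, 7B parameters each.
stats = moe_stats(total_experts=8, active_per_token=2, params_per_expert_b=7)
print(stats)
```

Trying other values, for example `moe_stats(64, 2, 7)`, shows the key property: total parameters grow with the number of experts while the active parameters per token stay fixed.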
Chapter 05 Real World

Models you've heard of
that use MoE right now

MoE isn't theoretical. It is powering some of the most capable AI systems running today. Here's how they stack up.

DeepSeek V3
DeepSeek AI · Open Source · Dec 2024
Total experts: 256
Active per token: 8
Total parameters: 671B
Active parameters: 37B
Training cost: ~$5.6M

Kimi K2
Moonshot AI · Open Source · 2025
Total experts: 384
Active per token: 8
Total parameters: 1.04T
Active parameters: 32B
Strength: Agentic reasoning

Qwen3 235B
Alibaba · Open Source · 2025
Total experts: 128
Active per token: 8
Total parameters: 235B
Active parameters: 22B
Key feature: Thinking mode

Gemma 4 MoE
Google · Open Source · Apr 2025
Total experts: 128
Active per token: 8
Total parameters: 27B
Active parameters: 4B
Efficiency: 27B quality at 4B cost
🚀
The numbers do not lie: over 60% of open-source AI model releases in 2025 use MoE. DeepSeek V3 carries 671B parameters of knowledge but spends only 37B worth of compute per token. Kimi K2 crossed 1 trillion total parameters while keeping just 32B active. The architecture has gone from academic curiosity to a default choice for leading labs.
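
As a quick check on these figures, the sketch below computes what share of each model's parameters is active per token. The total and active parameter counts are copied from the model cards in this chapter; the names `MODELS` and `activation_rate` are just for this example.

```python
# (total, active) parameters in billions, copied from the model cards.
MODELS = {
    "DeepSeek V3": (671, 37),
    "Kimi K2": (1040, 32),
    "Qwen3 235B": (235, 22),
    "Gemma 4 MoE": (27, 4),
}


def activation_rate(total_b, active_b):
    """Percentage of parameters used for each token."""
    return 100 * active_b / total_b


rates = {name: activation_rate(t, a) for name, (t, a) in MODELS.items()}
for name, rate in rates.items():
    print(f"{name}: {rate:.1f}% of parameters active per token")
```

These rates come out higher than "active experts divided by total experts" would suggest, because real models also contain shared parameters that run for every token.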
Chapter 06 Test Yourself

Did it actually land?

Three questions. No tricks. Just checking if the concepts stuck.