Interactive Workbook

What if your AI had
specialist doctors
instead of one GP?

Mixture of Experts explained from scratch. Play with it, break it, understand it. No ML background needed.

Chapter 01 The Problem

Regular AI uses
100% of its brain
for every single word

Imagine hiring 100 specialists but forcing all 100 of them to attend every single meeting, even if only one of them is actually useful. That's how traditional AI models work.

🏥 The Hospital Analogy

You walk into a hospital with a broken finger. In a regular AI model, every single doctor (the cardiologist, the neurologist, the oncologist, the psychiatrist) examines your finger at the same time, every time. Expensive, slow, and mostly pointless. Mixture of Experts sends you straight to the hand specialist. Done.

Traditional Dense Model: Wasteful

Compute used: 100%

Every neuron activates for every single token. Processing "the" uses the same compute as processing a complex equation.

Mixture of Experts: Efficient

Compute used: 25%

Only the relevant experts activate. The right 2 out of 8 fire, and only those 2 do any work.

Chapter 02 Meet the Experts

Each expert specialises
in something different

Inside a MoE model, there are multiple "expert" neural networks. Each one develops its own specialty during training. Click each expert to see what it's good at.

🔢

Math Expert

Handles equations, calculations, numerical reasoning, and logical proofs. Activates for anything quantitative.

💻

Code Expert

Specialises in programming logic, syntax, debugging, and software architecture across languages.

🗣️

Language Expert

Masters grammar, tone, translation, literary devices, and nuances across human languages.

🔬

Science Expert

Covers physics, chemistry, biology, and scientific reasoning, from atoms to ecosystems.

📜

History Expert

Knows dates, events, civilisations, cause-and-effect in human history, and cultural context.

🎨

Creative Expert

Generates stories, poems, metaphors, and imaginative content. Thinks in narratives and emotion.

💡

These specialties aren't programmed in. They emerge naturally during training. The model discovers on its own that it's more efficient to have some neurons focus on math while others focus on language. No human labels these experts.

Chapter 03 The Router

The gatekeeper that decides
who handles what

The Router is a small neural network that sits at the entrance of every MoE layer. Its only job: look at the incoming token and decide which 1-3 experts should handle it. Click a question below to watch the routing happen live.

🎯 How the router actually works

The router multiplies the incoming token's vector by a learned weight matrix and applies a softmax function, producing a probability score for each expert. It then picks the top-K experts (usually K = 2) with the highest scores. Think of it as the router asking: "Which experts are most confident they can handle this?" and picking the two most confident ones.
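Here is a minimal sketch of that routing step in plain Python. The token vector and the weight values are made up for illustration (in a real model the weights are learned), and the function names `softmax` and `route` are hypothetical. Some real models also apply the softmax only after picking the top-K, so treat this as one possible ordering rather than the definitive one.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_vector, router_weights, top_k=2):
    """Return [(expert_index, probability), ...] for the top_k experts.

    router_weights holds one row of weights per expert. The values used
    below are invented; a trained model would have learned them.
    """
    # One raw score per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token_vector)) for row in router_weights]
    probs = softmax(logits)
    # Rank expert indices from highest to lowest probability.
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:top_k]]

# Toy example: a 3-dimensional token and 4 experts.
token = [1.0, 0.0, 2.0]
weights = [
    [0.5, 0.1, 0.2],   # expert 0
    [0.1, 0.9, 1.0],   # expert 1
    [0.0, 0.3, -0.5],  # expert 2
    [1.0, 0.2, 0.4],   # expert 3
]
chosen = route(token, weights, top_k=2)
print(chosen)  # experts 1 and 3 score highest for this token
```

The router itself is tiny compared with the experts: one row of weights per expert, which is why it adds almost nothing to the cost of a layer.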

Chapter 04 Sparse Activation

The magic number:
only 2 out of N fire

In many MoE models, only a small, fixed number of experts activate per token (2 in this example), regardless of how many experts exist in total. Drag the sliders and watch what happens to compute cost and model capacity.

Total experts: 8

Active per token: 2

Params per expert: 7B

Total Parameters

56B

Model's full knowledge base

Active per Token

14B

Actual compute used

Efficiency Ratio

4x

More knowledge, same compute

Activation Rate

25%

Of parameters used per token
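The numbers in these panels are simple arithmetic, sketched below. The function name `moe_stats` is hypothetical, and the calculation counts expert parameters only. Real models also have shared parts (attention layers, embeddings) that every token uses, so their published totals will not match this simplified formula exactly.

```python
def moe_stats(total_experts, active_per_token, params_per_expert_b):
    """Expert-only parameter counts in billions (shared layers are ignored)."""
    total_b = total_experts * params_per_expert_b
    active_b = active_per_token * params_per_expert_b
    return {
        "total_params_b": total_b,                        # full knowledge base
        "active_params_b": active_b,                      # used for one token
        "efficiency_ratio": total_b / active_b,           # knowledge per unit of compute
        "activation_rate_pct": 100 * active_b / total_b,  # share of params in use
    }

# The default slider values: 8 experts, 2 active, 7B parameters each.
stats = moe_stats(total_experts=8, active_per_token=2, params_per_expert_b=7)
print(stats)
# {'total_params_b': 56, 'active_params_b': 14, 'efficiency_ratio': 4.0, 'activation_rate_pct': 25.0}
```

Try `moe_stats(64, 2, 7)`: the total grows eightfold while the active count stays at 14B, which is the whole trick.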

⚡

This is why MoE is revolutionary: you get the knowledge of a 56B parameter model at the compute cost of a 14B model. The experts that don't activate cost no compute; they are just waiting in memory. You scale knowledge without scaling compute.
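To see that idle experts really do no work, here is a toy sketch of one MoE layer. Everything in it is an assumption made for illustration: the "experts" are just functions that multiply a number, the gate probabilities are hand-picked, and the names `moe_forward` and `make_expert` are hypothetical. Renormalising the chosen weights so they sum to 1 is a common choice, but not the only one.

```python
def moe_forward(x, experts, gate_probs, top_k=2):
    """Run only the top_k experts and blend their outputs by gate weight."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_probs[i], reverse=True)
    chosen = ranked[:top_k]
    # Renormalise so the chosen experts' weights sum to 1 (assumed here).
    weight_sum = sum(gate_probs[i] for i in chosen)
    output = 0.0
    for i in chosen:
        # Only the chosen experts are ever called; the rest are skipped.
        output += (gate_probs[i] / weight_sum) * experts[i](x)
    return output, chosen

calls = []  # records which experts actually ran

def make_expert(index, scale):
    """Build a stand-in expert that logs its own use and scales its input."""
    def expert(x):
        calls.append(index)
        return scale * x
    return expert

experts = [make_expert(i, scale=i + 1) for i in range(8)]
gate_probs = [0.05, 0.40, 0.05, 0.05, 0.30, 0.05, 0.05, 0.05]

output, chosen = moe_forward(10.0, experts, gate_probs, top_k=2)
print(chosen)         # [1, 4]
print(sorted(calls))  # [1, 4]: the other six experts never ran
```

All eight experts still have to sit in memory, though. What you save is computation, not storage.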

Chapter 05 Real World

Models you've heard of
that use MoE right now

MoE isn't theoretical. It is powering some of the most capable AI systems running today. Here's how they stack up.

DeepSeek V3

DeepSeek AI · Open Source · Dec 2024

Total experts: 256

Active per token: 8

Total parameters: 671B

Active parameters: 37B

Training cost: ~$5.6M

Kimi K2

Moonshot AI · Open Source · 2025

Total experts: 384

Active per token: 8

Total parameters: 1.04T

Active parameters: 32B

Strength: Agentic reasoning

Qwen3 235B

Alibaba · Open Source · 2025

Total experts: 128

Active per token: 8

Total parameters: 235B

Active parameters: 22B

Key feature: Thinking mode

Gemma 4 MoE

Google · Open Source · Apr 2025

Total experts: 128

Active per token: 8

Total parameters: 27B

Active parameters: 4B

Efficiency: 27B quality at 4B cost

🚀

The numbers do not lie: over 60% of open-source AI model releases in 2025 use MoE. DeepSeek V3 carries 671B parameters of knowledge but spends only 37B worth of compute per token. Kimi K2 crossed 1 trillion total parameters while keeping just 32B active. The architecture has gone from academic curiosity to the default choice for every serious lab on the planet.
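You can check the active share of each model yourself with the figures from the cards above. This is only a quick sketch: the function name `active_share` is hypothetical, and Kimi K2's 1.04T total is written as 1040B.

```python
# (total parameters, active parameters), both in billions, as listed above.
models = {
    "DeepSeek V3": (671, 37),
    "Kimi K2": (1040, 32),
    "Qwen3 235B": (235, 22),
    "Gemma 4 MoE": (27, 4),
}

def active_share(total_b, active_b):
    """Percentage of the model's parameters used for a single token."""
    return 100 * active_b / total_b

for name, (total_b, active_b) in models.items():
    print(f"{name}: {active_share(total_b, active_b):.1f}% active per token")
```

On these figures, every model listed uses less than 15% of its parameters for any one token.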
