Mixture of Experts explained from scratch. Play with it, break it, understand it. No ML background needed.
Imagine hiring 100 specialists but forcing all 100 of them to attend every single meeting even if only one of them is actually useful. That's how traditional AI models work.
In a dense model, every parameter takes part in processing every single token. Processing "the" uses the same compute as processing a token from a complex equation.
In a MoE model, only the relevant experts activate. For example, the right 2 out of 8 fire for each token, and only those 2 do any work.
Inside a MoE model, there are multiple "expert" neural networks. Each one develops its own specialty during training. Click each expert to see what it's good at.
The Router is a small neural network that sits at the entrance of every MoE layer. Its only job: look at the incoming token and decide which 1-3 experts should handle it. Click a question below to watch the routing happen live.
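To make the routing step concrete, here is a minimal sketch in plain Python. It assumes the router is a single linear layer followed by a softmax, and that the gate weights of the chosen experts are renormalized to sum to 1. That is one common design, not necessarily what any particular model does. The names `route`, `router_weights`, and `token_vec` are made up for illustration, and the weights are random rather than trained.

```python
import math
import random

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_vec, router_weights, top_k=2):
    """Return [(expert_index, gate_weight), ...] for the top_k experts."""
    # One score per expert: dot product of the token with that expert's row.
    logits = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    probs = softmax(logits)
    # Keep the k experts with the highest probability.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # Renormalize so the kept gate weights sum to 1 (an assumption of this sketch).
    kept = sum(probs[i] for i in chosen)
    return [(i, probs[i] / kept) for i in chosen]

# Hypothetical setup: 8 experts, 4-dimensional token vectors, random weights.
random.seed(0)
NUM_EXPERTS, DIM = 8, 4
router_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
token_vec = [random.gauss(0, 1) for _ in range(DIM)]

selection = route(token_vec, router_weights, top_k=2)
print(selection)  # two (expert_index, gate_weight) pairs
```

The layer's output for this token would then be the gate-weighted sum of the two chosen experts' outputs. The other six experts are never run.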
In many MoE models, only a small, fixed number of experts (often 2) activate per token, regardless of how many experts exist in total. Drag the sliders and watch what happens to compute cost and model capacity.
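The trade-off between compute cost and capacity comes down to simple arithmetic, sketched below. This is a rough back-of-the-envelope model, not a real parameter count: it assumes every expert is the same size, ignores the router's own (small) parameters, and uses a made-up figure of 1,000,000 parameters per expert. The function `moe_param_counts` is hypothetical.

```python
def moe_param_counts(num_experts, active_experts, params_per_expert, shared_params=0):
    """Return (total_params, active_params_per_token) for one simplified MoE layer."""
    if active_experts > num_experts:
        raise ValueError("cannot activate more experts than exist")
    # Capacity: every expert's parameters are stored in the model.
    total = shared_params + num_experts * params_per_expert
    # Compute cost per token: only the selected experts actually run.
    active = shared_params + active_experts * params_per_expert
    return total, active

# Grow the expert count while keeping 2 experts active per token.
for n in (8, 16, 64):
    total, active = moe_param_counts(n, 2, 1_000_000)
    print(f"{n} experts: total={total:,} active={active:,} ({active / total:.1%} used per token)")
```

Adding experts raises the total (capacity) while the active count (per-token compute) stays fixed, which is the effect the sliders show.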
MoE isn't just theoretical. It powers some of the most capable AI systems running today. Here's how they stack up.
Three questions. No tricks. Just checking if the concepts stuck.