Mixture of Experts (MoE) is a family of architectures in which a Transformer holds many parallel 'expert' sub-networks but routes each token through only a few of them. The model can therefore have an enormous total parameter count (roughly 47B in Mixtral 8x7B) while the number of parameters actually used per token stays much smaller (about 13B in that model): high capacity at a lower inference cost. The modern, large-scale form of the idea traces back to Shazeer et al.'s 2017 paper 'Outrageously Large Neural Networks', but it went mainstream in 2023-2024 thanks to Mixtral 8x7B from Mistral AI and persistent rumours that GPT-4 uses an MoE architecture. Training MoEs introduces practical challenges around routing imbalance and load distribution, which keeps it an active research area.
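A minimal sketch of the top-k routing idea in PyTorch. The class name, dimensions, and hyperparameters are illustrative assumptions, not Mixtral's actual implementation; real MoE training typically also adds an auxiliary load-balancing loss to counter the routing imbalance mentioned above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Toy sparse MoE layer: each token is routed to the top-k of num_experts FFNs."""

    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (num_tokens, d_model)
        logits = self.router(x)                 # (num_tokens, num_experts)
        weights, chosen = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalise over the chosen experts only
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_idx, slot = (chosen == e).nonzero(as_tuple=True)
            if token_idx.numel():               # run expert e only on tokens routed to it
                out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

moe = SparseMoE()
tokens = torch.randn(10, 64)                    # 10 tokens
print(moe(tokens).shape)                        # torch.Size([10, 64]); only 2 of 8 experts ran per token
```

All experts' weights exist in memory (the "total" parameter count), but each token's forward pass touches only the router plus its two chosen experts, which is where the cheaper per-token compute comes from.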
Glossary · Advanced · 2017
Mixture of Experts (MoE)
An architecture where only a subset of expert sub-networks activates per token, combining huge capacity with cheaper inference.
- EN (English term): Mixture of Experts (MoE)
- TR (Turkish term): Uzmanlar Karışımı (MoE)