a resource dump on mixture of experts for a project.
from what i understand so far, MoE has two main components: experts and gates. experts are specialized sub-networks, typically feedforward layers, each of which processes a portion of the input. gates use "gating functions": learned mechanisms that decide which experts should handle each input token. how? the gate computes a probability distribution over all experts (softmax) and selects the top-k experts for each token based on those probabilities. this selective routing lets MoE scale the parameter count without a proportional increase in compute at inference, since only a subset of experts is active per token (sparse MoE). the downsides? training can be tricky because it requires effective routing and load balancing across experts; without careful design, some experts end up over- or underutilized.
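to make the routing idea concrete, here's a minimal sketch of a top-k gated sparse MoE layer in PyTorch. it's not any specific repo's implementation (not the lucidrains or makeMoE code below); class names, sizes, and the `k=2` / `num_experts=8` defaults are illustrative assumptions.

```python
# minimal top-k gated sparse MoE sketch (illustrative, not a reference implementation)
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A plain feedforward sub-network (one 'expert')."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)


class TopKMoE(nn.Module):
    """Sparse MoE layer: a learned gate picks top-k experts per token."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)  # gating function
        self.experts = nn.ModuleList(
            Expert(d_model, d_hidden) for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq, d_model) -> flatten to a list of tokens
        tokens = x.reshape(-1, x.shape[-1])

        # probability distribution over all experts for each token
        probs = F.softmax(self.gate(tokens), dim=-1)          # (T, E)

        # keep only the top-k experts per token, renormalize their weights
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)     # (T, k)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # which (token, slot) pairs were routed to expert e
            token_ids, slot_ids = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # expert inactive for this batch -- the sparsity in "sparse MoE"
            expert_out = expert(tokens[token_ids])
            out[token_ids] += topk_probs[token_ids, slot_ids].unsqueeze(-1) * expert_out

        # average routing mass per expert; a load-balancing auxiliary loss
        # would push this toward uniform so no expert is over/underutilized
        load = probs.mean(dim=0)

        return out.reshape(x.shape), load
```

with `num_experts=8` and `k=2`, each token only passes through 2 of the 8 feedforward blocks, which is where the inference-time compute savings come from even though all 8 experts' parameters exist.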
understand
- Mixture of Experts Explained
- Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer
- A Visual Guide to Mixture of Experts (MoE)
papers
- LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
- Open Mixture-of-Experts Language Models
code
- PyTorch implementation of Sparsely-Gated MoE by lucidrains
- ST-MoE
- makeMoE: Implement a Sparse Mixture of Experts Language Model from Scratch
more