a resource dump on mixture of experts for a project.
from what i understand so far, MoE has two main components: experts and gates. experts are specialized sub-networks, typically feedforward layers, each of which processes a portion of the input. gates use "gating functions": learned mechanisms that decide which experts should handle each input token. how? the gate computes a probability distribution over all experts (softmax) and selects the top-k experts for each token based on those probabilities. this selective routing lets MoE scale the parameter count without a proportional increase in compute at inference, since only a subset of experts is active per token (sparse MoE). the downsides? training can be tricky because it requires effective routing and load balancing across experts; without careful design, some experts end up over- or underutilized.
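to make the routing idea concrete, here's a minimal sketch of a top-k gated sparse MoE layer in PyTorch. it's not any specific repo's implementation (not the lucidrains or makeMoE code below); class names, sizes, and the `k=2` / `num_experts=8` defaults are illustrative assumptions.

```python
# minimal top-k gated sparse MoE sketch (illustrative, not a reference implementation)
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """A plain feedforward sub-network (one 'expert')."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_model, d_hidden),
            nn.GELU(),
            nn.Linear(d_hidden, d_model),
        )

    def forward(self, x):
        return self.net(x)


class TopKMoE(nn.Module):
    """Sparse MoE layer: a learned gate picks top-k experts per token."""
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)  # gating function
        self.experts = nn.ModuleList(
            Expert(d_model, d_hidden) for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (batch, seq, d_model) -> flatten to a list of tokens
        tokens = x.reshape(-1, x.shape[-1])

        # probability distribution over all experts for each token
        probs = F.softmax(self.gate(tokens), dim=-1)          # (T, E)

        # keep only the top-k experts per token, renormalize their weights
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)     # (T, k)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            # which (token, slot) pairs were routed to expert e
            token_ids, slot_ids = (topk_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue  # expert inactive for this batch -- the sparsity in "sparse MoE"
            expert_out = expert(tokens[token_ids])
            out[token_ids] += topk_probs[token_ids, slot_ids].unsqueeze(-1) * expert_out

        # average routing mass per expert; a load-balancing auxiliary loss
        # would push this toward uniform so no expert is over/underutilized
        load = probs.mean(dim=0)

        return out.reshape(x.shape), load
```

with `num_experts=8` and `k=2`, each token only passes through 2 of the 8 feedforward blocks, which is where the inference-time compute savings come from even though all 8 experts' parameters exist.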
understand
- Mixture of Experts Explained
- Mistral / Mixtral Explained: Sliding Window Attention, Sparse Mixture of Experts, Rolling Buffer
- A Visual Guide to Mixture of Experts (MoE)
papers
- LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training
- Open Mixture-of-Experts Language Models
code
- PyTorch implementation of Sparsely-Gated MoE by lucidrains
- ST-MoE
- makeMoE: Implement a Sparse Mixture of Experts Language Model from Scratch
more