The following explanation of MoE models comes from the DeepSeekMoE paper.

A standard Transformer language model is constructed by stacking $L$ layers of standard Transformer blocks, where each block can be represented as follows:

$$\mathbf{u}_{1:T}^{l} = \operatorname{Self\text{-}Att}\!\left(\mathbf{h}_{1:T}^{l-1}\right) + \mathbf{h}_{1:T}^{l-1},$$

$$\mathbf{h}_{t}^{l} = \operatorname{FFN}\!\left(\mathbf{u}_{t}^{l}\right) + \mathbf{u}_{t}^{l},$$

where $T$ denotes the sequence length, $\operatorname{Self\text{-}Att}(\cdot)$ denotes the self-attention module, $\operatorname{FFN}(\cdot)$ denotes the Feed-Forward Network (FFN), $\mathbf{u}_{1:T}^{l}$ are the hidden states of all tokens after the $l$-th attention module, and $\mathbf{h}_{t}^{l}$ is the output hidden state of the $t$-th token after the $l$-th Transformer block. For brevity, we omit the layer normalisation in the above formulations.
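
To make the block equations concrete, here is a minimal sketch in PyTorch. The sizes and names (d_model, n_heads, d_ff) are illustrative assumptions, not taken from the paper, and layer normalisation is omitted to mirror the formulation above.

```python
# A minimal sketch of the block equations above, assuming PyTorch.
# Configuration values are illustrative, not from the paper.
import torch
import torch.nn as nn


class TransformerBlock(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_att = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # A standard FFN: two linear maps with a non-linearity in between.
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # u_{1:T}^l = Self-Att(h_{1:T}^{l-1}) + h_{1:T}^{l-1}
        u, _ = self.self_att(h, h, h)
        u = u + h
        # h_t^l = FFN(u_t^l) + u_t^l  (layer normalisation omitted, as in the text)
        return self.ffn(u) + u


# Usage: a batch of 2 sequences, T = 16 tokens, hidden size 512.
x = torch.randn(2, 16, 512)
y = TransformerBlock()(x)
print(y.shape)  # torch.Size([2, 16, 512])
```

The residual connections are written explicitly so the code lines up term by term with the two equations.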

A typical practice to construct an MoE language model is to substitute FFNs in a Transformer with MoE layers at specified intervals. An MoE layer is composed of multiple experts, where each expert is structurally identical to a standard FFN. Then, each token will be assigned to one or two experts. If the $l$-th FFN is substituted with an MoE layer, the computation for its output hidden state $\mathbf{h}_{t}^{l}$ is expressed as:

$$\mathbf{h}_{t}^{l} = \sum_{i=1}^{N} \left( g_{i,t} \operatorname{FFN}_{i}\!\left(\mathbf{u}_{t}^{l}\right) \right) + \mathbf{u}_{t}^{l},$$

$$g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}\!\left(\{ s_{j,t} \mid 1 \leq j \leq N \}, K\right), \\ 0, & \text{otherwise}, \end{cases}$$

$$s_{i,t} = \operatorname{Softmax}_{i}\!\left( {\mathbf{u}_{t}^{l}}^{\top} \mathbf{e}_{i}^{l} \right),$$

where $N$ denotes the total number of experts, $\operatorname{FFN}_{i}(\cdot)$ is the $i$-th expert FFN, $g_{i,t}$ denotes the gate value for the $i$-th expert, $s_{i,t}$ denotes the token-to-expert affinity, $\operatorname{Topk}(\cdot, K)$ denotes the set comprising the $K$ highest affinity scores among those calculated for the $t$-th token and all $N$ experts, and $\mathbf{e}_{i}^{l}$ is the centroid of the $i$-th expert in the $l$-th layer. Note that $g_{i,t}$ is sparse, indicating that only $K$ out of $N$ gate values are non-zero. This sparsity property ensures computational efficiency within an MoE layer, i.e., each token will be assigned to and computed in only $K$ experts. Also, in the above formulations, we omit the layer normalisation operation for brevity.
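
The gating computation can likewise be sketched in PyTorch. The sketch below loops over experts and evaluates each expert densely for readability, so it does not realise the computational savings that real MoE implementations obtain with sparse dispatch; the names and sizes (n_experts, top_k, d_model, d_ff) are illustrative assumptions.

```python
# A minimal sketch of the MoE layer above, assuming PyTorch. Experts are
# evaluated in a simple loop for clarity rather than with the batched,
# sparse dispatch a production implementation would use.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoELayer(nn.Module):
    def __init__(self, d_model: int = 512, d_ff: int = 2048,
                 n_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Each expert is structurally identical to a standard FFN.
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        # Expert centroids e_i^l, one row per expert.
        self.centroids = nn.Parameter(torch.randn(n_experts, d_model))

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # s_{i,t} = Softmax_i(u_t^T e_i^l): token-to-expert affinities.
        scores = F.softmax(u @ self.centroids.t(), dim=-1)        # (B, T, N)
        # Keep the K highest affinities per token; all other gates are zero.
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)   # (B, T, K)
        gates = torch.zeros_like(scores).scatter(-1, topk_idx, topk_scores)
        # h_t^l = sum_i g_{i,t} * FFN_i(u_t^l) + u_t^l  (residual connection).
        out = torch.zeros_like(u)
        for i, expert in enumerate(self.experts):
            g = gates[..., i]
            if g.any():  # skip experts no token was routed to
                out = out + g.unsqueeze(-1) * expert(u)
        return out + u


# Usage: each token is routed to top_k = 2 of 8 experts.
x = torch.randn(2, 16, 512)
y = MoELayer()(x)
print(y.shape)  # torch.Size([2, 16, 512])
```

Only the K selected gate values per token are non-zero, which is the sparsity property the text describes; here the gates simply zero out the contributions of unselected experts.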

References