DeepSeekMoE is a mixture-of-experts (MoE) architecture that involves two principal strategies:

  • finely segmenting the experts into $mN$ ones and activating $mK$ from them, allowing for a more flexible combination of activated experts
  • isolating $K_s$ experts as shared ones, aiming at capturing common knowledge and mitigating redundancy in routed experts

Fine-Grained Expert Segmentation

The idea here is that many smaller experts are better than a few large experts. The rationale is that a large expert is:

  • more likely to encounter a wide variety of knowledge
  • therefore more likely to learn to generalise across this wide variety of knowledge

With many small experts, this wide variety of knowledge can instead be decomposed across the experts, allowing each expert to retain a high level of specialisation.

DeepSeekMoE:

  1. splits each expert into $m$ smaller experts by reducing the FFN intermediate hidden dimension to $\tfrac{1}{m}$ times its original size
  2. increases the number of total experts $m$ times to $mN$, keeping the same memory cost, as illustrated in Figure 2(b) where the number of total experts is doubled
  3. increases the number of activated experts $m$ times to $mK$, keeping the same computational cost, as illustrated in Figure 2(b) where the number of activated experts is doubled (a small numeric check of these invariants follows below)
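
Below is a small numeric check of these invariants. The configuration values ($N = 16$, $K = 2$, $m = 4$, and the layer sizes) are illustrative toy numbers, not DeepSeekMoE's actual settings; the point is only that segmentation preserves the memory and compute budgets while greatly enlarging the space of possible expert combinations.

```python
from math import comb

# Toy configuration (illustrative values only).
d_model, d_ff = 1024, 4096   # model width and original FFN intermediate size
N, K = 16, 2                 # original number of experts and activated experts
m = 4                        # segmentation factor

params_per_expert  = 2 * d_model * d_ff          # up- and down-projection weights
params_per_segment = 2 * d_model * (d_ff // m)   # intermediate size reduced to 1/m

# Memory cost unchanged: mN smaller experts hold as many parameters as N large ones.
assert (m * N) * params_per_segment == N * params_per_expert

# Per-token compute unchanged: mK small activated experts match K large ones.
assert (m * K) * params_per_segment == K * params_per_expert

# Routing flexibility grows enormously: C(16, 2) = 120 vs. C(64, 8) = 4,426,165,368.
print(comb(N, K), comb(m * N, m * K))
```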

The output of an MoE layer with fine-grained expert segmentation can be expressed as:

$$
\mathbf{h}_t = \sum_{i=1}^{mN} g_{i,t} \, \operatorname{FFN}_i(\mathbf{u}_t) + \mathbf{u}_t
$$

$$
g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}\left(\{ s_{j,t} \mid 1 \le j \le mN \}, \, mK\right) \\ 0, & \text{otherwise} \end{cases}
$$

$$
s_{i,t} = \operatorname{Softmax}_i\!\left( \mathbf{u}_t^{\top} \mathbf{e}_i \right)
$$

where $\mathbf{u}_t$ is the hidden state of the $t$-th token entering the MoE layer, $\mathbf{e}_i$ is the learnable centroid of the $i$-th expert, $s_{i,t}$ is the token-to-expert affinity and $g_{i,t}$ the resulting gate value. The total number of expert parameters is equal to $N$ times the number of parameters in a standard FFN, and $mN$ denotes the total number of fine-grained experts. With the fine-grained expert segmentation strategy, the number of non-zero gates will also increase to $mK$.
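
As a concrete illustration of the equations above, here is a minimal single-device PyTorch sketch of the gating and expert mixture. The class and argument names are my own, the experts are plain two-layer FFNs, and the per-slot dispatch loop is kept deliberately naive so that it mirrors the equations; this is not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FineGrainedMoE(nn.Module):
    """Sketch of an MoE layer with fine-grained expert segmentation.

    num_experts plays the role of mN and top_k the role of mK in the equations
    above; each expert FFN uses an intermediate size of d_ff / m.
    """

    def __init__(self, d_model: int, d_ff: int, m: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        # e_i: one learnable centroid per expert, used for the token-to-expert affinity
        self.centroids = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff // m, bias=False),
                nn.GELU(),
                nn.Linear(d_ff // m, d_model, bias=False),
            )
            for _ in range(num_experts)
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (tokens, d_model);  s_{i,t} = Softmax_i(u_t^T e_i)
        scores = F.softmax(u @ self.centroids.t(), dim=-1)        # (tokens, mN)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)   # keep the mK largest gates

        out = torch.zeros_like(u)
        for slot in range(self.top_k):
            idx, gate = topk_idx[:, slot], topk_scores[:, slot:slot + 1]
            for e in idx.unique().tolist():
                mask = idx == e                                   # tokens whose slot-th pick is expert e
                out[mask] += gate[mask] * self.experts[e](u[mask])
        return out + u                                            # residual connection
```

In practice the dispatch would be batched or fused rather than looped, but the gate values match the formulation: the softmax affinities of the selected experts are used directly, without renormalisation.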

Shared Expert Isolation

Tokens assigned to different experts may require some common knowledge. In conventional routing, multiple experts might each acquire this common knowledge, resulting in redundancy in expert parameters. This redundancy can be mitigated by using dedicated shared experts to capture and consolidate the common knowledge.

To achieve this, $K_s$ experts are isolated to serve as shared experts. Each token will deterministically be assigned to these shared experts. The number of experts activated by the router is decreased by $K_s$ in order to maintain a constant computational cost, as illustrated in Figure 2(c) where the number of activated routed experts is decreased by the number of shared experts.

Therefore, an MoE layer in the complete DeepSeekMoE architecture is formulated as follows:

$$
\mathbf{h}_t = \sum_{i=1}^{K_s} \operatorname{FFN}_i(\mathbf{u}_t) + \sum_{i=K_s+1}^{mN} g_{i,t} \, \operatorname{FFN}_i(\mathbf{u}_t) + \mathbf{u}_t
$$

$$
g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \operatorname{Topk}\left(\{ s_{j,t} \mid K_s + 1 \le j \le mN \}, \, mK - K_s\right) \\ 0, & \text{otherwise} \end{cases}
$$

$$
s_{i,t} = \operatorname{Softmax}_i\!\left( \mathbf{u}_t^{\top} \mathbf{e}_i \right)
$$

Finally, in DeepSeekMoE the:

  • number of shared experts is $K_s$
  • total number of routed experts is $mN - K_s$
  • number of non-zero gates is $mK - K_s$
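
Extending the sketch above, a hypothetical DeepSeekMoELayer (my own naming, not the paper's code) adds the $K_s$ always-active shared experts on top of the routed mixture, with the router now only choosing among the routed experts:

```python
class DeepSeekMoELayer(FineGrainedMoE):
    """Sketch of the complete layer: K_s shared experts applied to every token,
    plus a routed mixture over the remaining experts (illustrative only)."""

    def __init__(self, d_model: int, d_ff: int, m: int,
                 num_routed: int, num_shared: int, top_k_routed: int):
        # num_routed   plays the role of mN - K_s (experts handled by the router)
        # top_k_routed plays the role of mK - K_s (non-zero gates per token)
        super().__init__(d_model, d_ff, m, num_routed, top_k_routed)
        self.shared = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff // m, bias=False),
                nn.GELU(),
                nn.Linear(d_ff // m, d_model, bias=False),
            )
            for _ in range(num_shared)
        )

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        routed = super().forward(u) - u                        # routed mixture without the residual
        shared = sum(expert(u) for expert in self.shared)      # deterministic shared-expert path
        return shared + routed + u                             # h_t = shared + routed + u_t
```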

Load Balance Consideration

There are two main routing considerations when training an MoE model:

  • If the same few experts are always chosen, the other experts are prevented from learning; this is called routing collapse
  • If experts are distributed across multiple devices, load imbalance can exacerbate computational bottlenecks

In practice, a:

  • small expert-level balance factor $\alpha_1$ is used to mitigate the risk of routing collapse
  • large device-level balance factor $\alpha_2$ is used to promote balanced computation across devices

Expert-Level Balance Loss

To mitigate the risk of routing collapse, an expert-level balance loss is added to the training objective:

$$
\mathcal{L}_{\mathrm{ExpBal}} = \alpha_1 \sum_{i=1}^{N'} f_i P_i
$$

$$
f_i = \frac{N'}{K' T} \sum_{t=1}^{T} \mathbb{1}(\text{token } t \text{ selects expert } i)
$$

$$
P_i = \frac{1}{T} \sum_{t=1}^{T} s_{i,t}
$$

where $\alpha_1$ is a hyperparameter called the expert-level balance factor, $N'$ is equal to $mN - K_s$ and $K'$ is equal to $mK - K_s$ for brevity, $T$ denotes the number of tokens in a sequence, and $\mathbb{1}(\cdot)$ denotes the indicator function.
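
A minimal sketch of this loss, assuming the router exposes the per-token softmax affinities over the routed experts and the selected top-$K'$ indices (the tensor shapes and argument names are my own):

```python
import torch
import torch.nn.functional as F


def expert_level_balance_loss(scores: torch.Tensor, topk_idx: torch.Tensor,
                              alpha1: float) -> torch.Tensor:
    """L_ExpBal = alpha1 * sum_i f_i * P_i over the N' routed experts.

    scores  : (T, N') softmax affinities s_{i,t} for the routed experts
    topk_idx: (T, K') indices of the K' routed experts selected per token
    """
    T, n_routed = scores.shape                                   # T tokens, N' routed experts
    k = topk_idx.shape[1]                                        # K' selections per token
    # f_i = N' / (K' T) * number of tokens that selected expert i
    selected = F.one_hot(topk_idx, n_routed).sum(dim=1).float()  # (T, N') 0/1 indicators
    f = selected.sum(dim=0) * n_routed / (k * T)
    # P_i = mean affinity assigned to expert i over the batch
    P = scores.mean(dim=0)
    return alpha1 * (f * P).sum()
```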

Device-Level Balance Loss

If we partition all the routed experts into $D$ groups $\{\mathcal{E}_1, \mathcal{E}_2, \ldots, \mathcal{E}_D\}$, and deploy each group on a single device, then the device-level balance loss is computed as follows:

$$
\mathcal{L}_{\mathrm{DevBal}} = \alpha_2 \sum_{i=1}^{D} f_i' P_i'
$$

$$
f_i' = \frac{1}{|\mathcal{E}_i|} \sum_{j \in \mathcal{E}_i} f_j
$$

$$
P_i' = \sum_{j \in \mathcal{E}_i} P_j
$$

where $\alpha_2$ is a hyperparameter called the device-level balance factor.
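
Reusing the per-expert $f_i$ and $P_i$ from the sketch above, a device-level variant could look like the following; the groups argument, holding the indices of the experts placed on each device, is my own convention:

```python
import torch


def device_level_balance_loss(f: torch.Tensor, P: torch.Tensor,
                              groups: list, alpha2: float) -> torch.Tensor:
    """L_DevBal = alpha2 * sum_i f'_i * P'_i over the D expert groups.

    f, P   : per-expert quantities from the expert-level loss, shape (N',)
    groups : D lists of expert indices, one per device (the partition E_1 .. E_D)
    """
    loss = torch.zeros((), dtype=P.dtype, device=P.device)
    for idx in groups:
        f_prime = f[idx].mean()   # f'_i: average f_j over the experts on device i
        P_prime = P[idx].sum()    # P'_i: total affinity mass routed to device i
        loss = loss + f_prime * P_prime
    return alpha2 * loss
```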

References