# MoE Align: Quantifying Expert Similarity Across Transformer Layers
## 🧠 Overview
Modern Mixture-of-Experts (MoE) transformers activate a subset of expert functions per token, enabling large-scale models with efficient inference. These expert layers are usually assumed to be depth-specific and non-interchangeable. This project explores whether experts across layers perform functionally similar transformations, making cross-layer expert reuse feasible.
## 💡 Core Idea
We propose a method to quantify the functional similarity between experts in different transformer layers by introducing a lightweight adapter. This adapter aligns the input distribution of one expert to another, allowing us to evaluate output similarity using Mean Squared Error (MSE).
If such an adapter achieves low MSE, it implies functional similarity (up to input transformation) between the two experts.
## 🧪 Problem Formulation
Let:
- $f_{\ell, e} : \mathbb{R}^d \rightarrow \mathbb{R}^d$ be the transformation implemented by expert $e$ in layer $\ell$
- $D_{\ell,e}$ be the empirical input distribution of expert $(\ell, e)$
- $A : \mathbb{R}^d \rightarrow \mathbb{R}^d$ be an adapter function
We aim to find:
$$f_{\ell_2, e_2}(A(x)) \approx f_{\ell_1, e_1}(x), \quad \forall x \sim D_{\ell_1, e_1}$$
The adapter is defined as:
$$A(x) = \mathrm{LayerNorm}(Wx + b), \quad W \in \mathbb{R}^{d \times d},\ b \in \mathbb{R}^d$$
The training objective is:
$$\mathcal{L}_{\mathrm{MSE}} = \mathbb{E}_{x \sim D_{\ell_1,e_1}} \left[ \left\| f_{\ell_2,e_2}(A(x)) - f_{\ell_1,e_1}(x) \right\|_2^2 \right]$$
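A minimal PyTorch sketch of the adapter and its training loop (names like `fit_adapter` are ours for illustration; full-batch optimization is assumed for brevity):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A(x) = LayerNorm(Wx + b), as defined above."""
    def __init__(self, d: int):
        super().__init__()
        self.linear = nn.Linear(d, d)   # W and b
        self.norm = nn.LayerNorm(d)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(self.linear(x))

def fit_adapter(expert_src, expert_tgt, inputs, d, steps=1000, lr=1e-3):
    """Fit A so that expert_tgt(A(x)) ~= expert_src(x) on inputs ~ D_{l1,e1}.

    expert_src / expert_tgt are the two expert FFNs; only the adapter is
    updated, since the optimizer holds only the adapter's parameters.
    """
    adapter = Adapter(d)
    opt = torch.optim.Adam(adapter.parameters(), lr=lr)
    with torch.no_grad():
        target = expert_src(inputs)         # f_{l1,e1}(x), fixed
    for _ in range(steps):
        opt.zero_grad()
        pred = expert_tgt(adapter(inputs))  # f_{l2,e2}(A(x))
        loss = nn.functional.mse_loss(pred, target)
        loss.backward()
        opt.step()
    return adapter, loss.item()
```

The returned final loss is the MSE alignment score reported in the results below.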
## ⚙️ Experimental Setup
- Model: Qwen1.5-MoE-A2.7B (decoder-only MoE transformer)
- Tokenization: QwenTokenizer
- Hardware: 1x NVIDIA A100 GPU
- Framework: HuggingFace Transformers
- Data: Subset of English Wikipedia
- Precision: float16
- Adapter: Linear + LayerNorm trained to minimize MSE
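To build each $D_{\ell,e}$, the hidden states entering each expert can be recorded with forward pre-hooks. A sketch, under the assumption that the installed Transformers version exposes Qwen2-MoE experts at `model.model.layers[i].mlp.experts[j]` (verify against your version):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "Qwen/Qwen1.5-MoE-A2.7B"
model = AutoModelForCausalLM.from_pretrained(
    name, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(name)

captured = {}  # (layer, expert) -> list of hidden-state tensors

def make_hook(layer_idx, expert_idx):
    def hook(module, args):
        # args[0] holds the hidden states of the tokens routed to this expert
        captured.setdefault((layer_idx, expert_idx), []).append(
            args[0].detach().float().cpu()
        )
    return hook

for l in (0, 1):                 # Layers 1 and 2 (0-indexed)
    for e in (0, 4, 8, 12):
        model.model.layers[l].mlp.experts[e].register_forward_pre_hook(
            make_hook(l, e)
        )

# Run Wikipedia text through the model; the hooks fill `captured` as a side effect.
batch = tokenizer("Example Wikipedia sentence goes here.", return_tensors="pt")
with torch.no_grad():
    model(**batch.to(model.device))
```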
## 📊 Results
### 🔍 Cross-Layer Expert Alignment
We train adapters from Layer 1 experts (rows) to Layer 2 experts (columns) and report example MSE alignment losses:
| L1 → L2 | E0 | E4 | E8 | E12 |
|---|---|---|---|---|
| E0 | 8.34 | 7.71 | 7.81 | 7.87 |
| E4 | 0.017 | 0.010 | 0.018 | 0.014 |
| E8 | 0.017 | 0.011 | 0.016 | 0.014 |
| E12 | 0.015 | 0.009 | 0.014 | 0.011 |
📌 Observation: Expert 0 at Layer 1 shows consistently high MSE with all Layer 2 experts, indicating functionally distinct behavior.
## 🧩 Additional Experiment: Layer Swapping
To test broader functional similarity, we swapped sparse-FFN blocks across layers in the Switch-Base-8 model and evaluated on MNLI:
| Swapped Layers | Accuracy |
|---|---|
| 1 ↔ 3 | 0.74 |
| 1 ↔ 5 | 0.728 |
| 1 ↔ 7 | 0.654 |
| 1 ↔ 9 | 0.378 |
| 1 ↔ 11 | 0.406 |
📌 Observation: Functional similarity decays rapidly with layer distance.
Swapping entire MoE blocks (router + experts) between layers confirms that closer layers tend to be more functionally aligned. Deep swaps cause routing distributions to diverge significantly.
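A sketch of how such a swap can be implemented in Hugging Face Transformers, assuming the standard Switch Transformers module layout; the block indexing and the MNLI evaluation pipeline are assumptions on our part:

```python
from transformers import SwitchTransformersForConditionalGeneration

model = SwitchTransformersForConditionalGeneration.from_pretrained(
    "google/switch-base-8"
)

def swap_ffn_blocks(model, i, j):
    """Exchange the whole FF sub-layer (router + experts) of encoder blocks i, j.

    In the Hugging Face Switch Transformers layout, `block[k].layer[-1]` is the
    feed-forward sub-layer, which is a sparse MoE on every other block.
    """
    blocks = model.encoder.block
    blocks[i].layer[-1], blocks[j].layer[-1] = blocks[j].layer[-1], blocks[i].layer[-1]

swap_ffn_blocks(model, 1, 3)  # corresponds to the "1 ↔ 3" row above
# ...then run the usual MNLI evaluation on the modified model.
```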
## 🔮 Future Work
- Extend adapter-based analysis across all layer pairs
- Measure perplexity degradation during expert swapping
- Leverage functional similarity for dynamic expert routing
- Apply results to sparsity and compression
- Evaluate generalization to out-of-distribution data
## 👥 Authors
## 📚 Resources
- 📄 Project Report
- 📑 Slides
## 📖 References
- Switch Transformer: https://arxiv.org/abs/2101.03961
- Universal Transformers: https://arxiv.org/abs/1807.03819
- MoE-UT: https://arxiv.org/abs/2405.16039
- Expert Merging: https://arxiv.org/abs/2310.01334
## 💡 Conclusion
Our findings suggest that many MoE experts are functionally redundant across layers and that functional specialization is unevenly distributed. These insights pave the way for more efficient and interpretable MoE transformer architectures.