🧠 Overview

Modern Mixture-of-Experts (MoE) transformers activate only a small subset of expert feed-forward networks per token, allowing parameter counts to grow without a proportional increase in inference cost. These expert layers are usually assumed to be depth-specific and non-interchangeable. This project explores whether experts at different layers perform functionally similar transformations, which would make cross-layer expert reuse feasible.

🔍 Core Idea

We propose a method to quantify the functional similarity between experts in different transformer layers by introducing a lightweight adapter. The adapter maps inputs drawn from one expert's input distribution into the input space of another expert, allowing us to compare the two experts' outputs using mean squared error (MSE).

If such an adapter achieves low MSE, it indicates that the two experts are functionally similar up to a learned transformation of their inputs.

🧪 Problem Formulation

Let:

  • $f_{\ell, e} : \mathbb{R}^d \rightarrow \mathbb{R}^d$ be the transformation implemented by expert $e$ in layer $\ell$
  • $D_{\ell,e}$ be the empirical input distribution of expert $(\ell, e)$
  • $A : \mathbb{R}^d \rightarrow \mathbb{R}^d$ be an adapter function

We aim to find an adapter $A$ such that:

$$ f_{\ell_2, e_2}(A(x)) \approx f_{\ell_1, e_1}(x), \quad \forall x \sim D_{\ell_1, e_1} $$

The adapter is defined as:

$$ A(x) = \text{LayerNorm}(Wx + b), \quad W \in \mathbb{R}^{d \times d},\; b \in \mathbb{R}^d $$

The training objective is:

$$ \mathcal{L}_{\text{MSE}} = \mathbb{E}_{x \sim D_{\ell_1,e_1}} \left[ \left\| f_{\ell_2,e_2}(A(x)) - f_{\ell_1,e_1}(x) \right\|_2^2 \right] $$
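
A minimal PyTorch sketch of this adapter and training objective, assuming frozen expert modules `expert_src` ($f_{\ell_1,e_1}$) and `expert_tgt` ($f_{\ell_2,e_2}$) plus a float32 tensor `inputs` of activations captured at the source expert's input are already available on the same device; all names are illustrative, not the project's released code:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """A(x) = LayerNorm(W x + b): a learned linear map followed by LayerNorm."""
    def __init__(self, d_model: int):
        super().__init__()
        self.linear = nn.Linear(d_model, d_model)  # W, b
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(self.linear(x))

def train_adapter(expert_src, expert_tgt, inputs, d_model, steps=1000, lr=1e-3):
    """Fit an adapter so that expert_tgt(A(x)) approximates expert_src(x) under MSE.

    `expert_src` / `expert_tgt` are frozen expert FFN modules; `inputs` is an
    (N, d_model) float32 tensor of activations captured at expert_src's input.
    """
    for p in list(expert_src.parameters()) + list(expert_tgt.parameters()):
        p.requires_grad_(False)                         # only the adapter is trained
    expert_dtype = next(expert_src.parameters()).dtype  # experts may run in float16

    adapter = Adapter(d_model).to(inputs.device)
    opt = torch.optim.Adam(adapter.parameters(), lr=lr)
    loss_fn = nn.MSELoss()

    for _ in range(steps):
        idx = torch.randint(0, inputs.size(0), (256,))  # random mini-batch of tokens
        x = inputs[idx]
        with torch.no_grad():
            target = expert_src(x.to(expert_dtype)).float()     # f_{l1,e1}(x)
        pred = expert_tgt(adapter(x).to(expert_dtype)).float()  # f_{l2,e2}(A(x))
        loss = loss_fn(pred, target)
        opt.zero_grad()
        loss.backward()
        opt.step()

    return adapter, loss.item()
```

The adapter is deliberately small (a single affine map plus LayerNorm), so a low final loss reflects genuine similarity between the two experts rather than the adapter's own capacity.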

⚙️ Experimental Setup

  • Model: Qwen1.5-MoE-A2.7B (decoder-only MoE transformer)
  • Tokenization: QwenTokenizer
  • Hardware: 1x NVIDIA A100 GPU
  • Framework: HuggingFace Transformers
  • Data: Subset of English Wikipedia
  • Precision: float16
  • Adapter: Linear + LayerNorm trained to minimize MSE on captured expert activations (activation capture sketched below)
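
A minimal sketch of this setup, showing how per-expert input activations could be captured with forward pre-hooks in HuggingFace Transformers. The module path `model.model.layers[i].mlp.experts[e]` follows the Qwen2-MoE implementation in transformers and is an assumption to verify against your library version; the example sentence stands in for the Wikipedia subset:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-MoE-A2.7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # matches the float16 setting above
    device_map="auto",          # requires `accelerate`; places the model on the A100
)
model.eval()

# Collect the inputs routed to one specific expert via a forward pre-hook.
captured = []

def make_hook(store):
    def hook(module, args):
        # args[0] holds the hidden states routed to this expert
        store.append(args[0].detach().float().cpu())
    return hook

layer_idx, expert_idx = 1, 0
expert = model.model.layers[layer_idx].mlp.experts[expert_idx]  # assumed module path
handle = expert.register_forward_pre_hook(make_hook(captured))

text = "The quick brown fox jumps over the lazy dog."  # stand-in for the Wikipedia subset
batch = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    model(**batch)
handle.remove()

if captured:
    expert_inputs = torch.cat(captured, dim=0)  # (num_routed_tokens, d_model)
    print(expert_inputs.shape)
else:
    print("No tokens were routed to this expert for this input.")
```

Running the hook over a larger slice of the dataset yields the empirical input distribution $D_{\ell,e}$ used to train the adapters.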

📊 Results

🔍 Cross-Layer Expert Alignment

(Figure: expert activations)

We compare outputs of Layer 1 and Layer 2 experts using learned adapters. Example MSE alignment losses:

| L1 expert → L2 expert | E0 | E4 | E8 | E12 |
|---|---|---|---|---|
| E0 | 8.34 | 7.71 | 7.81 | 7.87 |
| E4 | 0.017 | 0.010 | 0.018 | 0.014 |
| E8 | 0.017 | 0.011 | 0.016 | 0.014 |
| E12 | 0.015 | 0.009 | 0.014 | 0.011 |

📌 Observation: Expert 0 at Layer 1 shows consistently high MSE against all Layer 2 experts, indicating functionally distinct behavior.
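
The alignment matrix above can be assembled by training one adapter per expert pair. A sketch, reusing the hypothetical `train_adapter` and `capture_expert_inputs` helpers from the earlier sketches and taking the layer indices 1 and 2 literally (indexing conventions may differ):

```python
import torch

# Hypothetical helpers from the sketches above:
#   capture_expert_inputs(model, layer, expert) -> (N, d_model) float32 tensor of
#       inputs routed to that expert (hook-based capture, Experimental Setup sketch)
#   train_adapter(expert_src, expert_tgt, inputs, d_model) -> (adapter, final_mse)
#       (Problem Formulation sketch)

expert_ids = [0, 4, 8, 12]
d_model = model.config.hidden_size
mse = torch.zeros(len(expert_ids), len(expert_ids))

for i, e1 in enumerate(expert_ids):                      # rows: Layer 1 experts
    x = capture_expert_inputs(model, layer=1, expert=e1).to(model.device)
    src = model.model.layers[1].mlp.experts[e1]          # assumed module path
    for j, e2 in enumerate(expert_ids):                  # columns: Layer 2 experts
        tgt = model.model.layers[2].mlp.experts[e2]
        _, final_mse = train_adapter(src, tgt, x, d_model)
        mse[i, j] = final_mse

print(mse)  # compare with the alignment table above
```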

🧩 Additional Experiment: Layer Swapping

To test broader functional similarity, we swapped sparse-FFN blocks across layers in the Switch-Base-8 model and evaluated on MNLI:

| Swapped Layers | Accuracy (MNLI) |
|---|---|
| 1 ↔ 3 | 0.74 |
| 1 ↔ 5 | 0.728 |
| 1 ↔ 7 | 0.654 |
| 1 ↔ 9 | 0.378 |
| 1 ↔ 11 | 0.406 |

📌 Observation: MNLI accuracy drops sharply as the swapped layers grow farther apart (0.74 for 1 ↔ 3 vs. roughly 0.4 for 1 ↔ 9 and 1 ↔ 11), suggesting that functional similarity decays with layer distance.

(Figure: token routing heatmap)

Swapping entire MoE blocks (router + experts) between layers confirms that nearby layers tend to be more functionally aligned, while swaps between distant layers cause routing distributions to diverge significantly.
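
A minimal sketch of how such a block swap could be performed on the HuggingFace `google/switch-base-8` checkpoint. The assumption that the feed-forward sub-layer (router + experts in sparse blocks) is the last entry of each encoder block's `.layer` list follows the SwitchTransformers implementation in transformers and should be verified; the generation call is only a sanity check, not the MNLI evaluation:

```python
import torch
from transformers import AutoTokenizer, SwitchTransformersForConditionalGeneration

name = "google/switch-base-8"
tokenizer = AutoTokenizer.from_pretrained(name)
model = SwitchTransformersForConditionalGeneration.from_pretrained(name)
model.eval()

def swap_ffn_blocks(model, layer_a: int, layer_b: int):
    """Swap the feed-forward sub-layers of two encoder blocks.

    For sparse blocks this moves the router together with its experts.
    In switch-base-8, every other block is sparse (blocks 1, 3, 5, ...).
    """
    blocks = model.encoder.block
    ffn_a, ffn_b = blocks[layer_a].layer[-1], blocks[layer_b].layer[-1]
    blocks[layer_a].layer[-1] = ffn_b
    blocks[layer_b].layer[-1] = ffn_a

swap_ffn_blocks(model, 1, 3)  # corresponds to the "1 ↔ 3" row in the table above

# Sanity check: the swapped model should still produce sensible text.
inputs = tokenizer("The capital of France is <extra_id_0>.", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Because the router is swapped together with its experts, this probes the whole sparse block rather than individual experts, matching the experiment described above.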

🔮 Future Work

  • Extend adapter-based analysis across all layer pairs
  • Measure perplexity degradation during expert swapping
  • Leverage functional similarity for dynamic expert routing
  • Apply results to sparsity and compression
  • Evaluate generalization to out-of-distribution data

👥 Authors

📎 Resources

📚 References

  • Switch Transformer: https://arxiv.org/abs/2101.03961
  • Universal Transformers: https://arxiv.org/abs/1807.03819
  • MoE-UT: https://arxiv.org/abs/2405.16039
  • Expert Merging: https://arxiv.org/pdf/2310.01334

💡 Conclusion

Our findings suggest that many MoE experts in nearby layers are functionally redundant: with a lightweight learned adapter, one expert can closely mimic another, while a few experts (such as Expert 0 in Layer 1) remain functionally distinct and similarity decays with layer distance. These insights pave the way for more efficient and interpretable MoE transformer architectures.

Updated: