EMO : modèle MoE performant avec seulement 12,5% d'experts

Q: Comment EMO réduit-il les coûts de calcul tout en maintenant la performance ?

EMO utilise une approche modulaire qui permet de sélectionner un sous-ensemble d'experts sans compromettre la performance globale, réduisant ainsi les ressources nécessaires.

Researchers at the Allen Institute for AI and UC Berkeley have built EMO, a mixture-of-experts model that develops modular structures during pre-training. The model can be stripped down to a small fraction of its experts with barely any drop in performance. Mixture-of-experts (MoE) architectures are now standard in language models like DeepSeek-V4 or Qwen3.5. They activate only a handful of experts per token, which lets them scale to hundreds of billions of parameters without blowing up compute costs. But the full model still has to sit in memory because different tokens within a task call on different experts. If you only want to do math or code, you can't just load a slice of the model and call it a day. According to the paper, that's because experts in standard MoEs tend to latch onto shallow language patterns. They respond to things like prepositions, punctuation, or articles instead of higher-level domains like math or code. That makes it impossible to carve out a useful subset. Document boundaries as a training signal EMO tackles this with a simple trick. Instead of sorting training data into fixed domains like math or biology ahead of time—the way projects like BTX or Ai2's own FlexOlmo do—the authors use document boundaries. Tokens within a document usually belong to the same domain. EMO forces all tokens in a document to pick their active experts from a shared pool. The model decides which experts belong in that pool by averaging its router preferences across all tokens in a document and keeping the most frequently selected ones. EMO trains modularity as a first-order goal. You can select an arbitrary subset of experts for a given domain without hurting the full model's performance. | Image: Allen Institute Two adjustments were needed to keep training stable. First, the authors stopped calculating load balancing, which aims to spread work evenly across experts, locally per training batch. Instead, they compute it globally across many documents. Otherwise, the two training goals would fight each other: one bundles tokens within a document, and the other spreads them across as many experts as possible. Second, the researchers randomly vary the size of the document pool during training instead of fixing it. This teaches the model to work with expert subgroups of different sizes at inference time. A quarter of the experts, one percent performance loss The team trained a MoE with 1 billion active and 14 billion total parameters with 128 experts, eight active per token, on 1 trillion tokens from the OLMoE pre-training corpus. As a full model, EMO matches an identically trained standard MoE. The authors say it beats OLMoE despite using five times more data. The researchers then started removing experts to see how far they could go. With just 25 percent of them left (32 out of 128), EMO loses about one percentage point of absolute performance averaged across several benchmarks. At 12.5 percent (16 experts), the drop is around three points. A standard MoE collapses in the same setup, losing 10 to 15 percentage points and, in some cases, falling below the level of a dense model with the same number of active parameters. On the math benchmark GSM8K, subsets with just 12.5 percent of the experts match full-model performance again after fine-tuning. On the base models, EMO stays close to full performance even with only 16 out of 128 active experts (12.5 percent), while the standard MoE drops sharply. | Image: Allen Institute On the math task GSM8K, EMO holds steady at full-model level (12.0) even with 16 experts (12.2), while the standard MoE drops to 4.9 with half the experts and falls below random guessing in the smallest setting. | Image: Allen Institute Finding the right experts doesn't take much data, the authors say. A single few-shot example is enough to pick a subgroup that performs comparably to one selected on a full validation dataset. EMO works with both simple router-based selection and the more specialized Easy-EP method. Experts learn topics, not punctuation To understand what EMO actually learned, the researchers analyzed how the model distributes tokens to experts internally. For each token, they recorded the probability with which the router sends it to each expert. These patterns create a kind of fingerprint per token. They then grouped tokens with similar fingerprints into clusters. In a standard MoE (left), each token independently picks its top-k experts, so nearly all experts get used over the course of a document. EMO (right) defines a shared pool per document that all tokens must route through. This enforces consistent expert usage and drives domain specialization. | Image: Allen Institute The difference is clear-cut. In a standard MoE, expert clusters correspond to shallow linguistic categories: prepositions, proper names, definite articles. EMO's clusters map to actual topics: health, medical and wellness, US politics and elections, film, music, TV & book reviews. Tokens from the.

Accueil

Outils

Annuaire

Apprendre