
LLM Mixture of Experts Explained

Updated: Jan 31

Mixture of Experts is a technique in AI where a set of specialized models (experts) are collectively orchestrated by a gating mechanism to handle different parts of the input space, optimizing for performance and efficiency. It leverages the fact that an ensemble of weaker language models specializing in specific tasks can produce more accurate results, similar to traditional ML ensemble methods. However, it introduces a new concept of dynamic routing of the input in the process of generation. In this blog post, I will explain how OpenAI leveraged it to effectively combine eight different models under what is called GPT-4, and how Mixtral's architecture for this method was even more efficient.




The Mixture of Experts: Explained

Here's a surprising revelation: to build an LLM application, you will, of course, need an LLM. However, when you break down the functionalities of your LLM app, you'll find that, like many other applications, different components serve very distinct purposes. Some components may be tasked with retrieving relevant data from a database, others might be engineered to generate a "chat" experience, and some could be responsible for formatting or summarization. Much like traditional machine learning, where ensemble techniques such as boosting and bagging combine different models, Mixture of Experts in LLMs combines a set of transformer models that are trained differently, and learns to weight them differently, to build a more capable inference pipeline.


Room of Experts - each contributes to different topics

Mixture of Experts LLM (MoE)


In the context of LLMs, the concept of 'expertise' takes a unique form. Each model, or 'expert,' naturally develops a proficiency in different topics as it undergoes the training process. In this setup, the role of a 'coordinator,' which in a human context might be a person overseeing a team, is played by a Gating Network. This network has the crucial task of directing inputs to the appropriate models based on the topic at hand. Over time, the Gating Network improves its understanding of each model's strengths and fine-tunes its routing decisions accordingly.
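
To make the idea concrete, here is a minimal, illustrative sketch of a gating step in plain Python with NumPy: a linear gate scores each expert for a token and picks the top-scoring experts with normalized mixing weights. This is a toy illustration of the concept, not the routing code of any production model, and all names and dimensions are made up.

```python
# A toy illustration of the gating idea (not any production model's routing code):
# the gate scores every expert for a given token and sends it to the top-k.
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def route(token_embedding, gate_weights, k=2):
    """Return the indices of the k best experts for this token, plus mixing weights."""
    logits = gate_weights @ token_embedding       # one score per expert
    probs = softmax(logits)
    top_k = np.argsort(probs)[-k:][::-1]          # the k highest-scoring experts
    mix = probs[top_k] / probs[top_k].sum()       # renormalize over the chosen experts
    return top_k, mix

# Toy setup: 8 experts, 16-dimensional token embeddings, random gate weights
rng = np.random.default_rng(0)
gate_weights = rng.normal(size=(8, 16))
token = rng.normal(size=16)
print(route(token, gate_weights))                 # which experts this token goes to, and with what weights
```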


It's important to clarify, however, that despite the use of the term 'expert,' these LLM models don't possess expertise in the way we typically think of human specialists in fields of science or arts. Their 'expertise' resides in a complex, high-dimensional embedding space. The alignment of this expertise with our conventional, human-centric understanding of subjects can vary. The notion of categorizing these models into different domains of expertise is more of a conceptual tool to help us understand and navigate their diverse capabilities within the AI framework.


What makes MoE unique?


In traditional dense models, every input is processed by a single, densely packed neural network, akin to a generalist handling every problem. For complex problems, however, it becomes hard to train one generalist model capable of handling everything well, which is why the Mixture of Experts approach to LLMs is so valuable.


How GPT-4 implements Mixture of Experts 📄


On June 20th, George Hotz, the founder of self-driving startup Comma.ai, revealed that GPT-4 is not a single massive model, but rather a combination of 8 smaller models, each consisting of 220 billion parameters. This leak was later confirmed by Soumith Chintala, co-founder of PyTorch at Meta.

GPT-4 -> 8 x 220B params ≈ 1.76 trillion params

For context, GPT-3.5 has around 175B parameters. However, as we will see with Mixtral, calculating the total number of parameters in an MoE model is not so direct: only the FFN (feed-forward network) layers are replicated for each expert, while the other layers can be shared by all of them. This may significantly decrease the total parameter count of GPT-4. Regardless, the total should be somewhere between 1.2 and 1.7 trillion parameters.
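
As a purely hypothetical illustration of that arithmetic, suppose a quarter of each 220B expert (attention layers and embeddings) were shared and the rest were expert-specific FFN parameters. The split is invented for the example; OpenAI has published nothing about it.

```python
# Purely hypothetical arithmetic: how much the leaked "8 x 220B" could shrink once
# shared layers are counted only once. The 25% shared fraction is an assumption
# made up for this example -- OpenAI has published no such figure.
experts = 8
params_per_expert = 220e9               # leaked per-expert size
shared_fraction = 0.25                  # hypothetical share of attention + embedding params

shared = shared_fraction * params_per_expert                 # counted once
expert_specific = (1 - shared_fraction) * params_per_expert  # replicated per expert

total = shared + experts * expert_specific
print(f"~{total / 1e12:.2f}T parameters")  # ~1.38T with this made-up split, vs 1.76T if naively summed
```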


Why is GPT-4 becoming dumber and lazier?


Short answer:

  • Fewer and/or smaller expert models

  • Continuous aggressive RLHF

  • Distillation or Quantization of MoE models

GPT-4 being lazy

Recent reports of degradation in GPT-4's answer quality and of increased "laziness" may be directly connected to the fact that it is an MoE. Since OpenAI has been so focused on driving inference costs down, while also decreasing the price per token for users, they may be using fewer or smaller experts to build GPT-4.


Since each expert needs to be loaded into VRAM, occupying GPU memory even when only some of its layers are used at each step, the hardware requirements are immense. That is why a small reduction in the experts' size or number can have a big impact on costs, although performance may be affected as well.


The reigning theory is that this cost reduction, combined with more aggressive RLHF (Reinforcement Learning from Human Feedback), is causing the degradation in user experience and answer quality. The focus on RLHF is mainly to make GPT-4 more robust and useful for the company's products, but it makes the model less interesting for the everyday ChatGPT user.


That is the problem with the lack of transparency on OpenAI's side: we do not know what we are getting! We can only glean some insight from the leaks that have happened and might happen again.


Mistral 8x7B aka Mixtral explained 📄

Mixtral outperforms many large models while being efficient at inference. It employs a routing layer that decides which expert, or combination of experts, to use for each token, optimizing resource usage. It has 46.7B parameters in total but uses only about 12.9B per token.


Despite its impressive capabilities, Mixtral faces challenges like any other MoE model, particularly in training and data management.


The Architecture of Mixtral

Mixtral is a sparse mixture-of-experts (SMoE) network. At its core, it's a decoder-only model, a design choice that differentiates it from models that include both encoder and decoder.


It is composed of 8 expert models built on the Mistral-7B architecture. And the best part is that, like Mistral, Mixtral is fully open-source under an Apache 2.0 license.


The Expert Mechanism

The magic of Mixtral lies in how it handles its feedforward block. Here's where the 'experts' come into play. Mixtral doesn't rely on a single set of parameters; instead, it picks from eight distinct groups of parameters. This selection is dynamic and context-dependent.

  • Token Routing: For every token in the input, a router network chooses two of the eight expert groups. This dual selection allows for nuanced and context-rich processing of information.

  • Additive Output Combination: The outputs from these chosen experts are then combined additively, weighted by the router's probabilities, ensuring a rich blend of specialized knowledge (see the sketch below).
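
Below is a simplified, self-contained sketch of such a sparse MoE feedforward block in PyTorch, loosely following the description above (a linear router, top-2 selection, softmax mixing weights, additive combination). It is not Mistral's actual implementation: the expert here is a plain MLP rather than Mixtral's SwiGLU FFN, and the dimensions in the toy usage example are arbitrary.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    """One feedforward 'expert'. (Mixtral actually uses a SwiGLU-style FFN;
    a plain MLP is used here to keep the sketch short.)"""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden_dim), nn.SiLU(), nn.Linear(hidden_dim, dim))

    def forward(self, x):
        return self.net(x)

class SparseMoEBlock(nn.Module):
    """Replaces the dense feedforward block: a router picks the top-2 experts per
    token, and their outputs are summed, weighted by the router's probabilities."""
    def __init__(self, dim, hidden_dim, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(dim, hidden_dim) for _ in range(n_experts)])
        self.gate = nn.Linear(dim, n_experts, bias=False)   # the router
        self.top_k = top_k

    def forward(self, x):                                   # x: (n_tokens, dim)
        logits = self.gate(x)                               # (n_tokens, n_experts)
        top_logits, chosen = torch.topk(logits, self.top_k, dim=-1)
        mix = F.softmax(top_logits, dim=-1)                 # mixing weights over the chosen experts
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            token_idx, slot = torch.where(chosen == i)      # which tokens were routed to expert i
            if token_idx.numel() == 0:
                continue                                    # this expert sees no tokens in this batch
            out[token_idx] += mix[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
        return out

# Toy usage: route 4 tokens of width 32 through the block
block = SparseMoEBlock(dim=32, hidden_dim=64)
print(block(torch.randn(4, 32)).shape)                      # torch.Size([4, 32])
```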

Parameters and Efficiency

One might assume that having multiple experts would exponentially increase the parameter count. However, Mixtral balances this with efficiency:

  • Total Parameters: Mixtral boasts a total of 46.7 billion parameters, but the efficient use of these parameters is what sets it apart.

  • Parameters per Token: It uses only about 12.9 billion parameters per token. This ingenious approach means that Mixtral operates with the speed and cost of a 12.9 billion parameter model, despite its larger size.
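
A quick back-of-the-envelope check, using the commonly reported Mixtral 8x7B hyperparameters (hidden size 4096, FFN size 14336, 32 layers, 8 experts with top-2 routing, 32 query and 8 key-value heads, 32k vocabulary) and ignoring small terms like normalization layers and the router itself, lands very close to the published figures:

```python
# Rough parameter accounting for Mixtral 8x7B using its commonly reported
# hyperparameters; normalization layers and the tiny router are ignored.
dim, ffn, layers, n_experts, top_k, vocab = 4096, 14336, 32, 8, 2, 32000
kv_dim = dim // 4                                      # 8 KV heads of size 128 (grouped-query attention)

ffn_per_expert_per_layer = 3 * dim * ffn               # SwiGLU FFN: three weight matrices per expert
attn_per_layer = dim * (dim + kv_dim + kv_dim + dim)   # q, k, v, o projections
embeddings = 2 * vocab * dim                           # input embeddings + output head

total = layers * (n_experts * ffn_per_expert_per_layer + attn_per_layer) + embeddings
active = layers * (top_k * ffn_per_expert_per_layer + attn_per_layer) + embeddings

print(f"total  ~{total / 1e9:.1f}B")   # ~46.7B, close to the reported total
print(f"active ~{active / 1e9:.1f}B")  # ~12.9B, close to the reported per-token count
```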

Performance and Benchmarks

Mixtral's performance is a highlight of its story. It outperforms many existing large models, including Llama 2 70B and GPT-3.5, in various benchmarks. On Hugging Face's leaderboard it sits at one of the top spots, as seen here.

Mixtral is not just better in raw output quality but also in inference speed, which is about six times faster. In the paper in which the authors present Mixtral, they also provide comparisons with other models, as seen below.


Comparison of Mixtral with Llama 2 70B and GPT-3.5. Mixtral outperforms or matches Llama 2 70B and GPT-3.5 performance on most metrics.


Performance of Mixtral and different Llama models on a wide range of benchmarks. Mixtral outperforms or matches Llama 2 70B on all benchmarks.

LMSys Leaderboard. (From Dec 22, 2023). Mixtral is currently the best open-weights model by a large margin.

For any open-source fan or user, Mixtral is definitely encouraging, as it is able to compete with proprietary models as large as GPT-3.5 or Claude 2.1.


By the way, if you have seen records of Mixtral_34Bx2_MoE_60B and other variants like Mixtral_11Bx2_MoE_19B getting amazing results, just remember that despite having Mixtral in the name, they are not Mistral-based. Rather, they are Yi-based, so they are specialized only for English and Chinese output.




Training and Implementation Challenges

  • Dataset Size and Composition: Details about the dataset used for pretraining Mixtral are not fully disclosed. This includes its size, composition, and preprocessing methods.

  • VRAM Requirements: To run Mixtral effectively, a significant amount of VRAM is needed. Estimates suggest a requirement of at least 30 GB of VRAM, making high-end GPUs like the A100 or A6000 necessary.

  • Engineering Hurdles: Training a model like Mixtral on a single GPU, such as the A100, is feasible but requires a slew of engineering tricks. Techniques like 4-bit quantization and QLoRA are employed, particularly targeting the linear layers in the attention blocks. However, care must be taken not to target the MLP layers, as they don't interact well with PEFT due to their sparse nature.
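
As a hedged sketch of what such a setup might look like with the Hugging Face transformers, bitsandbytes, and peft libraries (the hyperparameter values here are illustrative choices for the example, not a recipe from the Mixtral authors):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "mistralai/Mixtral-8x7B-v0.1"

# Load the base model quantized to 4-bit (NF4) so it fits on a single large GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# LoRA adapters on the attention projections only; the sparse expert MLPs are
# deliberately left out of target_modules, as discussed in the bullet above.
lora_config = LoraConfig(
    r=16,                     # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only a tiny fraction of the 46.7B is trainable
```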


Technical Components of MoE 📄


1. Sparse MoE Layers: Unlike the dense feed-forward layers in conventional models, MoE uses sparse layers with a set number of experts (e.g., 8). These experts are neural networks in their own right, usually simple FFNs, though they can sometimes be MoEs themselves.

2. Gate Network or Router: This is the conductor of the orchestra, deciding which tasks or 'tokens' go to which expert. This routing is a crucial aspect of MoE's efficiency and effectiveness.

MoE Architecture - the Gating Network chooses which Expert to use

Advantages of MoE: Speed & Efficiency


- Pretraining Speed: Thanks to their sparse layers, MoE models are pretrained much faster than dense models with the same total number of parameters.


- Inference Speed: Despite their size, they offer faster inference, using only a fraction of their parameters at any given time.


- Lower Costs: Compared to a dense model with the same total number of parameters, MoE models are much cheaper to train and run inference on, due to the previous two points.


- Quality of Answers: By using experts for different topics, the overall model performs better, able to remember more information and to handle more niche scenarios.


Disadvantages of MoE: GPU needs & Hard to Train


- GPU VRAM Requirements: A catch is their high VRAM requirement, as all experts need to be loaded into memory even if only one or two are being used at any given time (see the back-of-the-envelope arithmetic after this list).


- Fine-tuning Difficulties: Historically, MoEs struggled with fine-tuning, often leading to overfitting, although recent advancements have made it considerably easier.


- Training and Inference Trade-offs: While offering faster inference, they require careful management of VRAM and computing resources.
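
As a back-of-the-envelope illustration of that trade-off, using Mixtral's published totals (the byte count assumes half-precision weights and ignores activation and KV-cache memory):

```python
# Why VRAM is the catch: memory scales with the *total* parameter count, while
# per-token compute scales with the *active* count. Figures for Mixtral 8x7B.
total_params, active_params = 46.7e9, 12.9e9
bytes_per_param = 2                                  # fp16 / bf16 weights

print(f"weights in VRAM  : ~{total_params * bytes_per_param / 1e9:.0f} GB")   # ~93 GB
print(f"compute per token: as if it were a ~{active_params / 1e9:.1f}B dense model")
```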


The Evolution of MoE

The MoE concept isn't new. It dates back to 1991, with significant milestones along the way:

- 1991: 📄 Jacobs, Jordan, Nowlan, and Hinton proposed Adaptive Mixtures of Local Experts for the first time.

- 2014: 📄 MoE was first applied to deep learning.

- 2017: 📄 Shazeer et al. (with Hinton among the authors) proposed the sparsely-gated MoE layer for large-scale models.

- 2020: 📄 Google's GShard experimented with MoE in giant transformers.

- 2022: 📄 Google's Switch Transformers addressed some of MoE's training and fine-tuning issues.


Open Source MoEs and Exciting Directions

Today, there are open-source projects for MoE training, like Megablocks, Fairseq, and OpenMoE. Additionally, models like Google's Switch Transformers and Meta's NLLB-MoE are pushing the boundaries. The fact that Mixtral is open-source with an Apache 2.0 license is amazing and gives us hope that AI can continue to be democratized.

Exciting directions include 📄:

- Distilling MoEs: Compressing them into smaller, dense models.

- Quantization of MoEs: Reducing the memory footprint, as seen in QMoE projects.

- Model Merging Techniques: Exploring different ways to combine experts efficiently.



The Bottom Line: Why MoE Matters

As we advance, MoE models will likely become more prevalent, pushing the boundaries of what's possible in AI and LLMs. The story of MoE is still being written, and each development is a new chapter in this exciting tale of technological progress.


My hunch is that many open-source models will start appearing as MoEs; for example, Llama 3 may indeed be a Mixture of Experts, which would allow for much higher performance.
