
What are Quantized LLMs?

Updated: Mar 30

Model quantization is a technique used to reduce the size of large neural networks, including large language models (LLMs), by modifying the precision of their weights. LLM quantization is enabled by empirical results showing that, while some operations related to neural network training and inference must leverage high precision, in many cases significantly lower precision (float16, for example) can be used instead, reducing the overall size of the model and allowing it to run on less powerful hardware with an acceptable reduction in capabilities and accuracy. In this blog post I will go over LLM quantization and cover the following points:

  1. Basics of quantization

  2. Advantages and disadvantages of quantized models

  3. Finding and using already quantized models

  4. Quantization techniques, including sample code


Llamas of different sizes (created with Midjourney)




What is Quantization?


The term quantization refers to the process of mapping continuous infinite values to a smaller set of discrete finite values. In the context of LLMs, it refers to the process of converting the weights of the model from higher precision data types to lower-precision ones.


Precision of Neural Networks

LLMs are essentially neural networks, computational models that are represented in the memory of the GPU or RAM as Tensors - multidimensional arrays of numbers. To store them, you can use different types: Float64, Float16, or even integers. The data type you choose will impact the number of "digits" that need to be used in memory and, of course, the memory size. The size of the variable can be referred to as precision, indicating how many "digits" are used to represent it in memory.
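
To make this concrete, here is a minimal sketch (using PyTorch, which the examples later in this post also rely on) showing how the chosen data type changes the memory footprint of the same tensor:

import torch

# The same 1,000 x 1,000 weight matrix stored at different precisions.
weights = torch.randn(1000, 1000)  # float32 by default: 4 bytes per value

print(weights.element_size() * weights.nelement())                     # ~4.0 MB
print(weights.half().element_size() * weights.nelement())              # float16: ~2.0 MB
print(weights.to(torch.bfloat16).element_size() * weights.nelement())  # bfloat16: ~2.0 MB
print(weights.to(torch.int8).element_size() * weights.nelement())      # int8: ~1.0 MB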

Generally, using high precision in neural networks is associated with better accuracy and more stable training; you can read about it here. Using high precision is also more computationally expensive, as it requires more, and more expensive, hardware. Research, done mostly by Google and Nvidia, into the possibility of using lower precision for some neural network operations showed that lower precision can indeed be leveraged for some training and inference operations.


Alongside the research, both companies developed hardware and frameworks to support lower-precision operations. For example, Nvidia's T4 accelerators are lower-precision GPUs with Tensor Cores technology that is significantly more efficient than the K80's. Google's TPUs introduced the concept of bfloat16, a special primitive data type optimized for neural networks. The fundamental idea behind lower precision is that neural networks don't always need the full range of 64-bit floats in order to perform well.


Source: https://cloud.google.com/tpu/docs/bfloat16?hl=en
The bfloat16 numerical format | Google
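
As a quick illustration (a small PyTorch sketch), you can inspect the numeric ranges of these formats directly: bfloat16 keeps roughly the same range as float32 by preserving its exponent bits, while float16 trades range for extra mantissa precision.

import torch

print(torch.finfo(torch.float32))   # max ~3.4e38, ~7 decimal digits of precision
print(torch.finfo(torch.bfloat16))  # max ~3.4e38, ~2-3 decimal digits of precision
print(torch.finfo(torch.float16))   # max = 65504, ~3 decimal digits of precision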

As neural networks became increasingly large, the ability to leverage lower precision had a significant impact on whether they could be used at all. With LLMs this became even more crucial.




For reference, an A100 GPU by Nvidia has 80 GB of memory in its most advanced version. In the table below you can see that a Llama2-70B model requires approximately 138 GB of memory, meaning that to host it you will need multiple A100s. Distributing a model over multiple GPUs means paying for more GPUs as well as overhead infrastructure. A quantized version, on the other hand, requires around 40 GB of memory, so it can easily fit into a single A100, reducing the cost of inference significantly. And this doesn't even account for the fact that, within the single A100, using quantized models would result in faster execution of most of the individual computation operations.

Model       | Original Size (FP16) | Quantized Size (INT4)
Llama2-7B   | 13.5 GB              | 3.9 GB
Llama2-13B  | 26.1 GB              | 7.3 GB
Llama2-70B  | 138 GB               | 40.7 GB


[Example of 4-bit quantization using llama.cpp, size may vary slightly depending on method]
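
These numbers can be roughly reproduced with a back-of-the-envelope calculation from the parameter count and the bits used per weight. A small sketch (actual quantized files are somewhat larger because scales and some non-quantized layers are stored alongside the low-bit weights):

def estimated_size_gb(num_params: float, bits_per_weight: float) -> float:
    # parameters * bits per weight, converted from bits to gigabytes
    return num_params * bits_per_weight / 8 / 1e9

print(estimated_size_gb(70e9, 16))  # ~140 GB in FP16, close to the 138 GB above
print(estimated_size_gb(70e9, 4))   # ~35 GB at 4 bits; format overhead brings it to ~40 GB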


How does quantization shrink models?

Quantization significantly decreases the model's size by reducing the number of bits required for each model weight. A typical scenario would be the reduction of the weights from FP16 (16-bit Floating-point) to INT4 (4-bit Integer). This allows for models to run on cheaper hardware and/or with higher speed. By reducing the precision of the weights, the overall quality of the LLM can also suffer some impact.
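
To give an intuition for what "reducing the number of bits" means in practice, here is a minimal sketch of naive symmetric ("absmax") quantization of a weight tensor to 4-bit integers. Real methods such as GPTQ, NF4 or the GGML k-quants are considerably more sophisticated (they quantize in small groups and compensate for the induced error), but the basic idea is the same:

import torch

def quantize_int4(weights: torch.Tensor):
    # Map the weights onto the INT4 range [-8, 7] using a single scale factor.
    scale = weights.abs().max() / 7
    q = torch.clamp(torch.round(weights / scale), -8, 7).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    # Approximate reconstruction of the original weights at inference time.
    return q.to(torch.float16) * scale

w = torch.randn(4, 4, dtype=torch.float16)
q, scale = quantize_int4(w)
print((w - dequantize(q, scale)).abs().max())  # the quantization error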


Studies show that this impact varies depending on the techniques used and that larger models suffer less from the change in precision. Larger models (over ~70B parameters) are able to maintain their capabilities even when converted to 4-bit, with some techniques, such as NF4, suggesting no impact on their performance. Therefore, 4-bit appears to be the best compromise between performance and size/speed for these larger models, while 6-bit or 8-bit might be better for smaller models.


The Two Types of LLM Quantization

The techniques for obtaining quantized models can be divided into two categories:

  1. Post-Training Quantization (PTQ): converting the weights of an already trained model to a lower precision without any retraining. Though straightforward and easy to implement, PTQ might degrade the model's performance slightly due to the loss of precision in the value of the weights.

  2. Quantization-Aware Training (QAT): Unlike PTQ, QAT integrates the weight conversion process during the training stage. This often results in superior model performance, but it's more computationally demanding. A widely used QAT technique is QLoRA.

This post will only focus on PTQ strategies and the key distinctions between them.



Larger Quantized Model vs Smaller non-Quantized


Acknowledging that reducing the precision will reduce the accuracy of the model, should you prefer a smaller full-precision model or a larger quantized model with a comparable inference cost? Although the ideal choice might vary due to diverse factors, recent research by Meta offers some insightful guidelines.


While we expect that reducing the precision would reduce accuracy, Meta researchers have demonstrated that, in some cases, the larger quantized model not only delivers superior performance but also allows for reduced latency and enhanced throughput. The same trend can be observed when comparing an 8-bit 13B model with a 16-bit 7B model. In essence, when comparing models with similar inference costs, the larger quantized models can outperform their smaller, non-quantized counterparts. This advantage becomes even more pronounced with larger networks, as they exhibit a smaller quality loss when quantized.


Where to find already Quantized models?


Fortunately, it is possible to find many versions of models that have already been quantized using GPTQ (some compatible with ExLlama), NF4 or GGML on the Hugging Face Hub. A quick glance reveals that a substantial chunk of these models has been quantized by TheBloke, an influential and respected figure in the LLM community. This user has published several versions of each model with different quantization methods, so you can choose the best fit for your particular use-case.


To easily experiment with these models, open up a Google Colab and make sure you change your runtime to a GPU (a free one is available). Start by installing the transformers library maintained by Hugging Face and all the other necessary libraries. Since we will be using a model quantized with AutoGPTQ, the respective libraries will also be required:

!pip install transformers
!pip install accelerate

# Due to using GPTQ
!pip install optimum
!pip install auto-gptq

You might need to restart the runtime so that the installs become available. Then simply load the already quantized model; in this case we are loading a Llama-2-7B-Chat model previously quantized with AutoGPTQ, as shown below:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "TheBloke/Llama-2-7b-Chat-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")
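
As a quick sanity check, here is a minimal generation sketch with the model loaded above (the Llama-2 chat prompt template is omitted for brevity):

prompt = "What is quantization in machine learning?"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))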

Quantizing any model


As highlighted earlier, a plethora of quantized models already reside on the Hugging Face Hub, eliminating the need to compress a model yourself in many scenarios. However, in some cases you may want to use models that are not yet quantized, or you may want to compress a model yourself. This can be achieved by using a dataset tailored to your specific domain.


To demonstrate how to easily quantize a model using AutoGPTQ along with the Transformers library, we will employ a streamlined variant of the AutoGPTQ interface found in Optimum, Hugging Face's solution for refining training and inference:


from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"

tokenizer = AutoTokenizer.from_pretrained(model_id)

quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", quantization_config=quantization_config)

Bear in mind that model compression can be time-consuming. For instance, a 175B model demands at least 4 GPU-hours, especially with expansive datasets like "c4". Notably, the number of bits used in the quantization process and the dataset can easily be modified through the parameters of GPTQConfig. Changing the dataset will impact how the quantization is done, so, if possible, use a dataset that resembles the data seen at inference time in order to maximize performance.
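
For example, GPTQConfig also accepts a list of raw text samples as the calibration dataset, so a domain-specific calibration set can be passed directly (a sketch with hypothetical samples):

# Hypothetical domain-specific calibration samples; in practice use a few
# hundred representative examples of the text the model will see at inference.
domain_samples = [
    "Quantization reduces the precision of model weights to save memory.",
    "GPTQ compresses weights to 4-bit integers with minimal accuracy loss.",
]

custom_config = GPTQConfig(bits=4, dataset=domain_samples, tokenizer=tokenizer)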


If you do end up quantizing a model, we encourage you to push it to the Hugging Face Hub as a contribution to the community! Here is how it is done:


from huggingface_hub import notebook_login

notebook_login()

model.push_to_hub("opt-125m-gptq-4bit")
tokenizer.push_to_hub("opt-125m-gptq-4bit")

Results got worse?

Sometimes the quality of the answers may decrease when quantizing smaller models or when using a more aggressive type of quantization. In those cases, before giving up on the quantized model, a deep dive into prompt engineering might be the solution! Sometimes the model just needs to be nudged in the right direction to maintain accuracy and quality.

Example of reducing hallucinations through advanced prompt engineering

If you want to learn more about prompt engineering and its advanced techniques through a practical guide with examples, I highly recommend looking at this blog post.


Noteworthy Techniques in Quantization



Several state-of-the-art methods have emerged in the arena of model quantization. Let's delve into some prominent ones:

  1. GPTQ: With implementation options such as AutoGPTQ, ExLlama and GPTQ-for-LLaMa, this method focuses mainly on GPU execution.

  2. NF4: Implemented in the bitsandbytes library, it works closely with the Hugging Face transformers library. It is primarily used by QLoRA methods and loads models in 4-bit precision for fine-tuning.

  3. GGML: This C library works closely with the llama.cpp library. It features a unique binary format for LLMs, allowing for fast loading and ease of reading. Notably, its recent shift to the GGUF format ensures future extensibility and compatibility.

Many quantization libraries support a number of different quantization strategies (e.g. 4-bit, 5-bit, and 8-bit quantization), each of which offers different trade-offs between efficiency and performance.


GPTQ: A Revolution in Model Quantization


The paper titled “GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers” introduced an exciting approach: GPTQ. Merging the name of the GPT model family with post-training quantization (PTQ), GPTQ offers an innovative solution for quantization. The key benefits include:

  1. Scalability: GPTQ has the capacity to compress large networks such as the GPT models with 175 billion parameters in about 4 GPU hours, cutting the bit width down to 3 or 4 bits per weight with very minimal degradation in accuracy. The paper states that as the size of the model increases, the difference in performance between FP16 and GPTQ decreases.

  2. Performance: This technique makes it feasible to run inference on a 175 billion-parameter model using a single GPU.

  3. Inference Speed: GPTQ models offer 3.25x speed-ups on high-end GPUs like NVIDIA A100 and a 4.5x speed increase on cost-effective ones like NVIDIA A6000, compared to FP16 models.

GPTQ can only quantize models into INT-based data types and is most commonly used to convert to INT4. Models quantized to 4-bit with GPTQ are compatible with ExLlama for GPU speed-ups. Hugging Face's AutoGPTQ integration uses ExLlama for acceleration by default.


ExLlama is a standalone implementation of Llama for use with 4-bit GPTQ weights, designed to be fast and memory-efficient on modern GPUs. It features much lower VRAM usage and much higher speeds because it does not rely on non-optimized transformers code. It is only recommended for more recent GPU hardware. A new version of this library, ExLlamaV2, is now available, although it is still in the initial stages of development.


The GPTQ quantization technique can be applied to many models to transform them into 3, 4 or 8-bit representations in a few simple steps. The most commonly used library for GPTQ quantization is AutoGPTQ, due to its integration with the transformers library. An example of how to quantize a model and how to use already quantized models with AutoGPTQ was shown in the sections above.


NF4 (4-bit NormalFloat) and bitsandbytes

The NormalFloat (NF) data type is an enhancement of the Quantile Quantization technique. It has shown better results than both 4-bit Integers and 4-bit Floats.


NF4 can also be coupled with Double Quantization (DQ) for higher compression while maintaining performance. DQ encompasses two quantization phases: the weights are quantized first, and the resulting FP32 quantization constants are then themselves used as inputs to a second quantization, yielding FP8 constants. This method avoids any performance drop while saving an average of about 0.37 bits per parameter (approximately 3 GB for a 65B model).


What's remarkable is that the recent integration of bitsandbytes with transformers, which incorporates findings from the QLoRA paper (including NF4 and DQ), shows virtually no reduction in performance with 4-bit quantization for both inference and training of large language models (LLMs). As previously stated, details about the QLoRA fine-tuning method will not be addressed in this article.


NF4 and Double Quantization can be leveraged using the bitsandbytes library which is integrated inside the transformers library. Here is an example of how to easily load and quantize any Hugging Face model:


!pip install bitsandbytes

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

nf4_config = BitsAndBytesConfig(
   load_in_4bit=True,
   bnb_4bit_quant_type="nf4",
   bnb_4bit_use_double_quant=True,
   bnb_4bit_compute_dtype=torch.bfloat16
)

model_name = "PY007/TinyLlama-1.1B-step-50K-105b"

tokenizer_nf4 = AutoTokenizer.from_pretrained(model_name)

model_nf4 = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=nf4_config)

Be aware that this process can take some time and that large models might consume high amounts of RAM, although this exact code was able to run on a free Google Colab GPU.
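
To check the savings, Hugging Face models expose get_memory_footprint(), which reports the memory taken up by the loaded weights:

print(f"NF4 model footprint: {model_nf4.get_memory_footprint() / 1e9:.2f} GB")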


Additionally, bitsandbytes allows users to load a model and distribute its weights between the CPU and GPU. Weights allocated to the CPU remain in float32 and aren't converted to 8-bit. This is designed for users aiming to run a large model by balancing it between the GPU and the CPU. More information on this topic is available here.
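
Here is a minimal sketch of that mixed placement, reusing the TinyLlama model from the example above. With device_map="auto", accelerate decides which layers stay on the GPU (quantized to 8-bit) and which spill over to the CPU (kept in float32):

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

offload_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_enable_fp32_cpu_offload=True,  # allow CPU layers to stay in float32
)

model_offloaded = AutoModelForCausalLM.from_pretrained(
    "PY007/TinyLlama-1.1B-step-50K-105b",
    quantization_config=offload_config,
    device_map="auto",  # split layers across GPU and CPU based on available memory
)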


GGML and llama.cpp: Allowing inference on CPU


In the constantly evolving ecosystem of machine learning, GGML has carved a niche for itself. GGML is a C library for machine learning (ML), where the "GG" refers to the initials of its originator (Georgi Gerganov).


The library's distinctive edge lay in its own binary format, which offered a method to distribute LLMs and set it apart from other standard formats. A noteworthy progression is the transition from the GGML format to GGUF, which supports the use of non-Llama models. The GGUF format was tailored to be extensible and future-proof, all while being much lighter on RAM requirements for quantization.


It was meticulously crafted to work seamlessly with the llama.cpp library, ensuring that practitioners can harness the power of LLMs efficiently. The main goal of the llama.cpp library is allowing the use of LLaMA models using 4-bit integer quantization on a MacBook.


One of GGML's primary functions was to facilitate the loading of GGML models and execute them on a CPU, although nowadays it also allows offloading certain layers onto a GPU. This enhancement not only accelerates inference but also provides a workaround for LLMs that are too large for the average GPU's VRAM.


Recent reports also available on the llama.cpp repository suggest that this technique enabled a 180B Falcon model to operate inference on a Mac M2 Ultra. This significant milestone showcases the power of quantization in making large language models accessible on consumer-available hardware.


Quantizing an LLM from Hugging Face to GGUF for inference on a CPU can require a few more lines of code, so here is a notebook showing how to do it step-by-step. Alternatively, one can use this script from llama.cpp to easily quantize any Hugging Face model.
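
For reference, the typical llama.cpp workflow looks roughly like the sketch below. Script and binary names change between llama.cpp versions, and "path/to/hf-model" is a placeholder, so check the repository's README for the current instructions:

!git clone https://github.com/ggerganov/llama.cpp
!cd llama.cpp && make

# 1. Convert the Hugging Face checkpoint into a single FP16 GGUF file
!python llama.cpp/convert.py path/to/hf-model --outfile model-f16.gguf

# 2. Quantize the GGUF file to 4-bit (q4_K_M is a common size/quality trade-off)
!./llama.cpp/quantize model-f16.gguf model-q4_K_M.gguf q4_K_M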



Final Thoughts


Quantization has revolutionized the way we perceive and utilize LLMs. By compacting colossal models like LLaMA-30B to fit everyday devices, it has essentially democratized access to artificial intelligence. This breakthrough ensures that users don't have to compromise performance for size, allowing for swift and efficient language processing even on consumer-grade hardware.


It's truly a testament to the ingenuity of the LLM community that solutions like these continue to emerge, bridging the gap between advanced AI and the everyday user. As technology continually evolves, it's exciting to imagine what the future holds.


