This project provides a tool for benchmarking and comparing different quantized versions of models in terms of their output and energy consumption.
Quantization is a process that reduces the number of bits used to represent a number. In the context of deep learning, quantization is a technique for performing computation and storing weights in lower precision. This results in a smaller model and faster computation, often with minimal impact on the model's accuracy.
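As a rough illustration of why this matters, the memory needed for a model's weights scales linearly with the bit width; the parameter count and bit widths below are illustrative only:
params = 7_000_000_000  # back-of-the-envelope estimate for a 7B-parameter model
for bits in (32, 16, 8, 4):
    gib = params * bits / 8 / 1024**3  # bits -> bytes -> GiB
    print(f"{bits:>2}-bit weights: ~{gib:.1f} GiB")
# Prints roughly 26.1, 13.0, 6.5, and 3.3 GiB respectively (weights only;
# real footprints also include activations, buffers, and framework overhead).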
The tool consists of several Python scripts that work together to load a model, quantize it to a specified bit count, and then generate responses from the model. The tool also measures the size of the model in memory, which can be used as a proxy for energy consumption.
- mistral_quantize.py: This script contains the load_model_quantized function, which loads a pre-trained model and quantizes it to a specified bit count. The function uses the BitsAndBytesConfig class from the transformers library to configure the quantization settings.
- Model.py: This script defines a Model class that handles interaction with the language model. The class has methods to generate model output based on user input (get_output) and to extract the latest response from the conversation history (get_latest_response).
- main function: The main function in the mistral_quantize.py script prompts the user for the number of bits for quantization, loads the model and tokenizer, and then enters a loop in which it repeatedly prompts the user for input, generates a response from the model, and prints it.
- UI: ui.py sets up a Gradio web UI that lets you prompt several quantized models at once and compare their outputs and energy consumption side by side, with a separate chatbot pane for each selected model (a minimal sketch follows this list).
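For a sense of what such a comparison interface involves, here is a minimal Gradio sketch that sends one prompt to two chat panes. The callback, model labels, and placeholder replies are illustrative assumptions, not the actual code in ui.py:
import gradio as gr

# Hypothetical callback: in the real script the replies would come from the
# quantized Model instances; here they are placeholders.
def respond(prompt, history_4bit, history_8bit):
    reply_4bit = f"[4-bit model reply to: {prompt}]"
    reply_8bit = f"[8-bit model reply to: {prompt}]"
    return (history_4bit + [(prompt, reply_4bit)],
            history_8bit + [(prompt, reply_8bit)],
            "")  # clear the prompt box after submitting

with gr.Blocks() as demo:
    with gr.Row():
        chat_4bit = gr.Chatbot(label="4-bit model")
        chat_8bit = gr.Chatbot(label="8-bit model")
    prompt_box = gr.Textbox(placeholder="Ask both models the same question")
    prompt_box.submit(respond,
                      inputs=[prompt_box, chat_4bit, chat_8bit],
                      outputs=[chat_4bit, chat_8bit, prompt_box])

demo.launch()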
Follow the steps below to set up the repo based on your environment:
conda create --prefix "C:\\Users\\rs659\\Desktop\\quantization-workbench\\wincondaprojenv" python=3.9
conda activate "C:\\Users\\rs659\\Desktop\\quantization-workbench\\wincondaprojenv"
pip install -r requirements.txt
pip install torch==2.1.0 torchvision==0.16.0 torchaudio==2.1.0 --index-url https://download.pytorch.org/whl/cu121
pip install -e git+https://github.com/casper-hansen/AutoAWQ_kernels@83d1f4b326a9067d0f94f089ef1bb47cf5377134#egg=autoawq_kernels
git clone https://github.com/casper-hansen/AutoAWQ_kernels
To launch the UI and start chatting with the different models simultaneously, simply run the UI script:
python -m ui
The tool works by first loading a pre-trained model using the load_model_quantized function. The user specifies the number of bits for quantization (4, 8, 16, or 32). If the model has been previously quantized and saved, it is loaded directly. Otherwise, the model is quantized according to the specified bit count using the BitsAndBytesConfig class, and then saved for future use.
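A minimal sketch of that loading path, assuming bitsandbytes-backed 4-/8-bit loading through transformers (the caching logic and exact parameters in mistral_quantize.py may differ):
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

def load_model_quantized_sketch(model_id, bits):
    # Pick a quantization config based on the requested bit count.
    if bits == 4:
        bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                        bnb_4bit_compute_dtype=torch.float16)
    elif bits == 8:
        bnb_config = BitsAndBytesConfig(load_in_8bit=True)
    else:
        bnb_config = None  # 16/32-bit: load without bitsandbytes quantization

    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=bnb_config,
        torch_dtype=torch.float16 if bits == 16 else None,
        device_map="auto",
    )
    device = next(model.parameters()).device
    return model, device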
The Model class is used to handle interaction with the language model. The get_output method appends the user input to the model context, tokenizes the input, generates a response from the model, and then appends the model output to the context. The get_latest_response method is used to extract the latest response from the conversation history.
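A condensed sketch of such a wrapper, using Mistral-style [INST] markers; the real Model class may manage the context and decoding settings differently:
class ModelSketch:
    # Illustrative chat wrapper: keeps a running context string and queries the model.

    def __init__(self, model, tokenizer):
        self.model = model
        self.tokenizer = tokenizer
        self.context = ""

    def get_output(self, user_input, max_new_tokens=256):
        # Append the user turn to the running context and tokenize it.
        self.context += f"[INST] {user_input} [/INST]"
        inputs = self.tokenizer(self.context, return_tensors="pt").to(self.model.device)

        # Generate a continuation and decode only the newly produced tokens.
        output_ids = self.model.generate(**inputs, max_new_tokens=max_new_tokens)
        new_tokens = output_ids[0][inputs["input_ids"].shape[-1]:]
        response = self.tokenizer.decode(new_tokens, skip_special_tokens=True)

        # Append the model turn so the next call sees the full conversation.
        self.context += response
        return response

    def get_latest_response(self):
        # The most recent model turn is whatever follows the last [/INST] marker.
        return self.context.rsplit("[/INST]", 1)[-1]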
The main function prompts the user to enter the number of bits for quantization, loads the model and tokenizer, and then enters a loop where it repeatedly prompts the user for input, generates a response from the model, and prints the response. The size of the model in memory is printed after the model is loaded and serves as a proxy for energy consumption.
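The in-memory size itself can be read straight from the loaded transformers model; a one-function sketch of that measurement (the exact reporting in main may differ):
def print_model_size(model):
    # get_memory_footprint() is provided by transformers' PreTrainedModel and
    # returns the size of the model's parameters and buffers in bytes.
    size_bytes = model.get_memory_footprint()
    print(f"Model size in memory: {size_bytes / 1024**3:.2f} GiB")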
Putting these pieces together, a typical end-to-end usage looks like this:
from scripts.mistral_quantize import load_model_quantized, load_tokenizer
from scripts.Model import Model  # Model class from Model.py; adjust the import path to your layout

model_id = "mistralai/Mistral-7B-Instruct-v0.2"
bit_input = 8  # Specify the number of bits for quantization (4, 8, 16, or 32)

# Load the quantized model and tokenizer
model, device = load_model_quantized(model_id, bit_input)
tokenizer = load_tokenizer(model_id)

# Create an instance of the Model class
model_instance = Model(model, tokenizer)

# Generate a response from the model
user_input = "Tell me about AI"
response = model_instance.get_output(user_input)
print("Model response:", response)

This will load the specified model, quantize it to 8 bits, and then generate a response to the input "Tell me about AI". The response is printed to the console, and the size of the model in memory is printed after the model is loaded.
This tool provides a simple and effective way to benchmark and compare different quantized versions of models. By measuring the size of the model in memory alongside the quality of the model's output, we can gain insight into the trade-offs between model size, computation speed, and output quality. This is particularly useful in resource-constrained environments where model size and computation speed are critical. Because the measurements are taken at run time, the tool can compare the footprint of these models in whatever context/environment they are deployed.
References:
- Hugging Face Transformers
- Quantization in PyTorch