Motivation.
- vLLM supports a large variety of quantization formats. This is hard to maintain and makes the codebase complex.
- Many mature frameworks (llm-compressor, modelopt, quark, torchao) have emerged that provide general-purpose implementations of various quantization schemes, and usage stats show limited usage of the older formats.
Proposed Change.
- Deprecate many of the legacy quantization formats.
Kept (see the usage sketch after this list):
- compressed-tensors
- quark
- awq.py (to be deprecated later; many existing models use it, but AutoAWQ is no longer maintained)
- bitsandbytes.py
- fp8.py
- mxfp4.py
- modelopt.py
- gguf.py
- gptq.py (to be deprecated later; many existing models use it, but AutoGPTQ is no longer maintained)
- torchao.py
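For the kept formats, the user-facing selection path stays the same. A minimal sketch, assuming the existing `quantization` argument on `vllm.LLM` keeps working as it does today; the model path is a placeholder:

```python
from vllm import LLM

# Kept backends remain selectable via the existing `quantization` argument.
# "path/to/awq-model" is a placeholder for any AWQ-quantized checkpoint.
llm = LLM(model="path/to/awq-model", quantization="awq")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```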
Proposed to be removed (per usage stats; a deprecation-warning sketch follows the list):
- auto_round.py
- awq_marlin.py (consolidate to awq.py)
- awq_triton.py (consolidate to awq.py)
- bitblas.py
- cpu_wna16.py
- deepspeedfp.py
- experts_int8.py
- fbgemm_fp8.py
- fp_quant.py
- gptq_bitblas.py
- gptq_marlin.py (consolidate to gptq.py)
- gptq_marlin_24.py
- hqq_marlin.py
- inc.py
- input_quant_fp8.py
- ipex_quant.py
- moe_wna16.py
- petit.py
- ptpc_fp8.py
- rtn.py
- tpu_int8.py
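As a possible transition step (not spelled out in this RFC, just a hedged sketch), each removed method could emit a deprecation warning for one release before deletion, pointing users at the consolidated backend. The helper below and its name are hypothetical:

```python
import warnings

def warn_deprecated_quantization(method: str, replacement: str | None = None) -> None:
    # Hypothetical helper: warn when a to-be-removed quantization method is
    # requested during the transition release, before the backend is deleted.
    msg = f"Quantization method '{method}' is deprecated and will be removed."
    if replacement is not None:
        msg += f" Use '{replacement}' instead."
    warnings.warn(msg, DeprecationWarning, stacklevel=2)

# e.g. awq_marlin / awq_triton requests would point at the consolidated awq backend
warn_deprecated_quantization("awq_marlin", replacement="awq")
```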
Feedback Period.
2 Weeks
CC List.
Any Other Things.
The goal is to clean up the codebase:
- reduce mental load
- reduce complexity of implementing features (e.g. FusedMoE refactor)
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.