## Purpose ##
* Support NVFP4A16 for `model_free_ptq`

First, reindex the checkpoint so that fused modules (qkv, gate_up) land in the same safetensors files:

```bash
llmcompressor.reindex_fused_weights \
unsloth/Kimi-K2-Thinking-BF16 \
Kimi-K2-Thinking-BF16-reindexed \
--num_workers=10
```
Then quantize with the `NVFP4A16` scheme:

```python
model_free_ptq(
    model_stub="Kimi-K2-Thinking-BF16-reindexed",
    save_directory="Kimi-K2-Thinking-BF16-NVFP4A16",
    scheme="NVFP4A16",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```
## Changes ##
* Restructure files
  * Move `validate_scheme` to `validate.py`
  * Move `find_safetensors_index_path`, `find_config_path`, and `find_safetensors_index_file` to `helpers.py`
  * Move `process_file` to `process.py`
* Break `calibrate_weights` into `calibrate_global_scale` and `calibrate_scale_zp`
* Add extra utility functions
  * `match_names_set_eager`
  * `invert_mapping`
* Add microscale/fused module utility functions (a sketch of these helpers follows this list)
  * `is_microscale_scheme`
  * `get_fused_names`
* Add `process_file_microscale_scheme` to separate the FP4 lifecycle from the regular lifecycle (this script should be very trustworthy; by separating these code paths, an FP8 user does not have to trust anything about FP4)
* Add the `llmcompressor.reindex_fused_weights` script, which reindexes a model's weights so that fused modules (qkv, gate_up) land in the same safetensors files
* Fix a [bug](https://github.com/vllm-project/llm-compressor/pull/1988/files#diff-8d11f284a49f6c4e559617aaf7750f3437a074cd526ee94dbefe86866f250a42R80-R82) where safetensors index metadata was not being saved correctly
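As a rough illustration of what the fused-module helpers do (a hypothetical sketch only; the actual implementations and signatures live in this PR and may differ), `get_fused_names` groups weights whose modules are fused at inference time, and `invert_mapping` flips that grouping so each member can look up its fused group:

```python
from collections import defaultdict

# Hypothetical sketch of the fused-module helpers; the fused suffixes and the
# exact signatures are illustrative, not the PR's actual code.
FUSED_SUFFIXES = {
    "qkv_proj": ("q_proj", "k_proj", "v_proj"),
    "gate_up_proj": ("gate_proj", "up_proj"),
}

def get_fused_names(weight_names: list[str]) -> dict[str, list[str]]:
    """Group weight names whose modules are fused at inference time (qkv, gate_up)."""
    groups = defaultdict(list)
    for name in weight_names:
        for fused, parts in FUSED_SUFFIXES.items():
            for part in parts:
                marker = f".{part}."
                if marker in name:
                    prefix = name.split(marker)[0]
                    groups[f"{prefix}.{fused}"].append(name)
    return dict(groups)

def invert_mapping(mapping: dict[str, list[str]]) -> dict[str, str]:
    """Invert a one-to-many mapping: fused name -> members becomes member -> fused name."""
    return {member: key for key, members in mapping.items() for member in members}

# Example: all q/k/v weights resolve to one fused group, so their NVFP4 global
# scales can be reduced to a single shared value during calibration.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.layers.0.self_attn.k_proj.weight",
    "model.layers.0.self_attn.v_proj.weight",
    "model.layers.0.mlp.gate_proj.weight",
    "model.layers.0.mlp.up_proj.weight",
]
fused = get_fused_names(names)    # {"model.layers.0.self_attn.qkv_proj": [...], ...}
members = invert_mapping(fused)   # {"model.layers.0.self_attn.q_proj.weight": "...qkv_proj", ...}
```

This grouping is also why the reindexing step matters: when every member of a fused group lives in the same safetensors file, a single worker can load the group together and fuse its global scales without cross-file coordination.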
## Testing ##
* Add NVFP4A16 to `test_model_free_ptq_matches_oneshot`
* Regression-tested a large Mistral model end-to-end with FP8_BLOCK
* Tested a large Mistral model end-to-end with NVFP4A16
## Mistral 3 ##
This branch was used to quantize Mistral 3:
1. Quantize to NVFP4A16
```python
from llmcompressor import model_free_ptq

model_free_ptq(
    "mistralai/Mistral-Large-3-675B-Instruct-2512",
    "Mistral-Large-3-675B-Instruct-2512-NVFP4A16",
    scheme="NVFP4A16",
    ignore=[
        "tok_embeddings",                # embeddings
        "re:patch_merger.*",             # patch merger
        "re:vision_encoder.*",           # vision tower
        "re:vision_language_adapter.*",  # vision adapter
        "re:.*attention$",               # sensitive to quantization
        "re:.*gate$",                    # sensitive to quantization
        "output",                        # lm head
    ],
    max_workers=10,  # 10 workers ≈ 52GB
    device="cuda:0",
)
```
2. Update the ignore list to use the vLLM checkpoint format (the correspondence with the step-1 names is sketched below)
```json
[
    "model.embed_tokens",
    "re:patch_merger.*",
    "re:vision_encoder.*",
    "re:vision_language_adapter.*",
    "lm_head",
    "re:.*self_attn.*",
    "re:.*gate$"
]
```
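For reference, here is a sketch of how the step-1 patterns (Mistral checkpoint format) line up with the step-2 patterns (vLLM checkpoint format). The dict is purely illustrative, assembled from the two lists above; it is not an official translation table:

```python
# Illustrative only: Mistral-format ignore patterns from step 1 mapped to the
# vLLM-checkpoint-format patterns used in step 2.
MISTRAL_TO_VLLM_IGNORE = {
    "tok_embeddings": "model.embed_tokens",                          # embeddings
    "output": "lm_head",                                             # lm head
    "re:.*attention$": "re:.*self_attn.*",                           # attention modules
    "re:.*gate$": "re:.*gate$",                                      # unchanged
    "re:patch_merger.*": "re:patch_merger.*",                        # unchanged
    "re:vision_encoder.*": "re:vision_encoder.*",                    # unchanged
    "re:vision_language_adapter.*": "re:vision_language_adapter.*",  # unchanged
}

vllm_ignore = list(MISTRAL_TO_VLLM_IGNORE.values())
```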
3. Add observers to the vLLM model definition and run calibration on 100 samples from ultrachat
4. Save the model checkpoint, making sure to reduce observed values across shards (a rough sketch of this reduction follows below)

For more information on how observers were added to vLLM, please reach out to @kylesayrs
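The observer changes in step 3 are vLLM-side modifications that are not part of this PR, hence the pointer above. As a purely hypothetical sketch of what "reduce values from shards" in step 4 could mean: with tensor parallelism, each rank observes only its shard, so observed statistics need to be reduced (e.g. by max) across ranks before the checkpoint is written:

```python
import torch
import torch.distributed as dist

# Hypothetical sketch only; not vLLM's API. Each tensor-parallel rank holds the
# absolute-max it observed for its shard; the quantization scale must be derived
# from the max across all ranks.
def reduce_observed_amax(observed_amax: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    reduced = {}
    for name, amax in observed_amax.items():
        amax = amax.clone()
        dist.all_reduce(amax, op=dist.ReduceOp.MAX)  # every rank ends with the global max
        reduced[name] = amax
    return reduced
```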
---------
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Co-authored-by: Dipika Sikka <dipikasikka1@gmail.com>
## README changes ##
This PR also extends `examples/model_free_ptq/README.md` (+50 lines). The relevant section now reads:

In `kimi_k2_thinking_fp8_block.py`, we call `model_free_ptq` by providing a `scheme` and `ignore` list, similar to how we provide recipes to `oneshot` calls. In the case of Kimi-K2 Thinking, we apply the `FP8_BLOCK` scheme and ignore layers that are incompatible with a block_size of 128 (specifically, `kv_a_proj_with_mqa` and `q_a_proj`).

In contrast to `oneshot`, we expect the model stub or path string to be passed in directly, as opposed to first being loaded through transformers. Once complete, the model is compressed using compressed-tensors and saved to `SAVE_DIR`.

To get started, simply call `model_free_ptq` with your desired model stub and save directory:

```python
model_free_ptq(
    model_stub="unsloth/Kimi-K2-Thinking-BF16",
    save_directory="Kimi-K2-Thinking-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```

# Quantizing models to NVFP4A16 / MXFP4A16

Using `model_free_ptq` to quantize models with microscale schemes (NVFP4/MXFP4) is the same as quantizing models with non-microscale schemes, except for one additional step: the safetensors in the model files must be reindexed to ensure that fused modules (qkv, gate_up) end up in the same safetensors files, which allows `model_free_ptq` to fuse global scales.

First, apply `llmcompressor.reindex_fused_weights` from the command line entrypoint.
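See the `llmcompressor.reindex_fused_weights` invocation in the Purpose section above for an example. As a quick sanity check (a hypothetical sketch, not part of the README), you can read the resulting `model.safetensors.index.json` and confirm that each fused group maps to a single shard; the q/k/v and gate/up suffixes below are illustrative and depend on the architecture:

```python
import json
import os
from collections import defaultdict

# Sketch: verify that fused weights share a shard after reindexing.
model_dir = "path/to/reindexed-model"  # e.g. the output directory of reindex_fused_weights

with open(os.path.join(model_dir, "model.safetensors.index.json")) as f:
    index = json.load(f)

assert "metadata" in index  # top-level index metadata (e.g. total_size) should survive reindexing
weight_map = index["weight_map"]  # tensor name -> shard filename

fused_suffixes = {"qkv": ("q_proj", "k_proj", "v_proj"), "gate_up": ("gate_proj", "up_proj")}
shards_per_group = defaultdict(set)
for name, shard in weight_map.items():
    for fused, parts in fused_suffixes.items():
        for part in parts:
            marker = f".{part}."
            if marker in name:
                shards_per_group[(name.split(marker)[0], fused)].add(shard)

split = {key: shards for key, shards in shards_per_group.items() if len(shards) > 1}
assert not split, f"fused groups split across shards: {split}"
```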