In `kimi_k2_thinking_fp8_block.py`, we call `model_free_ptq` by providing a `scheme` and an `ignore` list, similar to how we provide recipes to `oneshot` calls. In the case of Kimi-K2 Thinking, we apply the `FP8_BLOCK` scheme and ignore layers that are incompatible with a `block_size` of 128 (specifically, `kv_a_proj_with_mqa` and `q_a_proj`).
In contrast to `oneshot`, the model stub or path string is passed in directly, rather than a model first loaded through transformers. Once complete, the model is compressed using compressed-tensors and saved to `SAVE_DIR`.
To get started, simply call `model_free_ptq` with your desired model stub and save directory:
```python
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="unsloth/Kimi-K2-Thinking-BF16",
    save_directory="Kimi-K2-Thinking-FP8-BLOCK",
    scheme="FP8_BLOCK",
    ignore=[
        "re:.*gate$",
        "lm_head",
        "re:.*kv_a_proj_with_mqa$",
        "re:.*q_a_proj$",
        "model.embed_tokens",
    ],
    max_workers=15,
    device="cuda:0",
)
```
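
The resulting directory can then be served like any other compressed-tensors checkpoint. As a minimal usage sketch (assuming a vLLM installation with compressed-tensors support and hardware large enough to host the model; the directory name simply matches `save_directory` above):

```python
from vllm import LLM

# Load the compressed-tensors checkpoint written by model_free_ptq.
llm = LLM(model="Kimi-K2-Thinking-FP8-BLOCK")

outputs = llm.generate("The capital of France is")
print(outputs[0].outputs[0].text)
```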
# Quantizing models to NVFP4A16 / MXFP4A16
Using `model_free_ptq` to quantize models with microscale schemes (NVFP4/MXFP4) is the same as quantizing with non-microscale schemes, except for one additional step: the safetensors files of the model must be reindexed to guarantee that fused modules (qkv, gate_up) end up in the same safetensors file, which assists `model_free_ptq` in fusing global scales.
First, apply `llmcompressor.reindex_fused_weights` from the command-line entrypoint.
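
The exact arguments for that entrypoint are not shown here, so the snippet below is a hypothetical sketch rather than the documented interface: the reindex invocation and the parameter values are assumptions, and only `model_free_ptq`, the entrypoint name, and the scheme names from the heading are taken from this document.

```python
# Hypothetical sketch of the microscale workflow; names and arguments that do
# not appear in this README are assumptions, not the real interface.
#
# Step 1 (shell): reindex the safetensors shards so that fused modules
# (qkv, gate_up) land in the same file, along the lines of:
#   llmcompressor.reindex_fused_weights <model-stub> <reindexed-dir>
#
# Step 2: call model_free_ptq exactly as in the FP8_BLOCK example above,
# but with a microscale scheme string.
from llmcompressor import model_free_ptq

model_free_ptq(
    model_stub="<reindexed-dir>",                # output of the reindex step
    save_directory="Kimi-K2-Thinking-NVFP4A16",  # assumed name, for illustration
    scheme="NVFP4A16",                           # or "MXFP4A16"
    ignore=["lm_head", "model.embed_tokens"],    # adapt to your model
    device="cuda:0",
)
```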