I installed flash-attn but I am still getting a CUDA OOM error during quantization. Traceback:

Traceback (most recent call last):
  File "/workspace/qwen3/quant.py", line 18, in <module>
    model.quantize(calibration_dataset, batch_size=1)
  File "/venv/main/lib/python3.12/site-packages/gptqmodel/models/base.py", line 1013, in quantize
    result = module_looper.loop(                                                                                                                    
             ^^^^^^^^^^^^^^^^^^^                                          
  File "/venv/main/lib/python3.12/site-packages/gptqmodel/looper/module_looper.py", line 963, in loop
    return self._loop_impl(fail_safe=fail_safe, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)                                          
           ^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/gptqmodel/looper/module_looper.py", line 1204, in _loop_impl
    forward_outputs = self._run_forward_batches(
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/gptqmodel/looper/module_looper.py", line 406, in _run_forward_batches
    return self._run_forward_batches_single(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/gptqmodel/looper/module_looper.py", line 521, in _run_forward_batches_single
    module_output = module(*layer_input, **additional_inputs)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/transformers/modeling_layers.py", line 94, in __call__
    return super().__call__(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py", line 1535, in forward
    hidden_states, _ = self.self_attn(
                       ^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/transformers/utils/deprecation.py", line 172, in wrapped_func
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py", line 1462, in forward
    attn_output, attn_weights = attention_interface(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/transformers/integrations/sdpa_attention.py", line 96, in sdpa_attention_forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 37.10 GiB. GPU 0 has a total capacity of 94.97 GiB of which 32.70 GiB is free. Including non-PyTorch memory, this process has 62.26 GiB memory in use. Of the allocated memory 58.69 GiB is allocated by PyTorch, and 2.78 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
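
As a rough sanity check on the numbers in this message, here is a small sketch that compares the request against the memory that could be available and estimates the size of a dense attention-score tensor. The figures 37.10, 32.70, and 2.78 GiB come from the error above. The shape is an assumption: 32 heads and a key length of 17640 are borrowed from the batch_size != 1 error further down, and float32 (4 bytes per element) is a guess. I have not confirmed that this is the tensor SDPA tried to allocate here.

```python
GIB = 1024 ** 3


def attention_scores_gib(batch, heads, q_len, kv_len, bytes_per_elem):
    """Size in GiB of a dense [batch, heads, q_len, kv_len] score tensor."""
    return batch * heads * q_len * kv_len * bytes_per_elem / GIB


# Figures reported in the OOM message.
requested_gib = 37.10
free_gib = 32.70
reserved_unallocated_gib = 2.78

# Best case: every reserved-but-unallocated byte could be reused for this request.
best_case_available_gib = free_gib + reserved_unallocated_gib
fits_in_best_case = requested_gib <= best_case_available_gib

# Assumed shape (not confirmed): batch 1, 32 heads, 17640 x 17640 scores, float32.
estimate_gib = attention_scores_gib(1, 32, 17640, 17640, 4)

print(f"best-case available: {best_case_available_gib:.2f} GiB")
print(f"request fits in best case: {fits_in_best_case}")
print(f"estimated dense score tensor: {estimate_gib:.2f} GiB")
```

If the estimate is in the right ballpark, the allocation looks like a fully materialized attention matrix for a very long sequence rather than fragmentation, so expandable_segments alone would probably not be enough.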

Package versions:

gptqmodel 1077c9545f019e29e3eac6313dbf6ed71b4530a9
torch                    2.9.0+cu128
torchvision              0.24.0
transformers             4.57.1   
triton                   3.5.0  
flash-attn               2.8.3 
nvidia-cuda-runtime-cu12 12.8.90

GPU:

NVIDIA RTX PRO 6000 Blackwell Workstation
Driver Version: 575.57.08
CUDA Version: 12.9

Quant script:

from datasets import load_dataset
from gptqmodel import GPTQModel, QuantizeConfig

model_id = "Qwen/Qwen3-Omni-30B-A3B-Instruct"
quant_path = "Qwen3-Omni-30B-A3B-Instruct-GPTQ-4bit"

calibration_dataset = load_dataset(
    "allenai/c4",
    data_files="en/c4-train.00001-of-01024.json.gz",
    split="train"
).select(range(1024))["text"]

quant_config = QuantizeConfig(bits=4, group_size=128)

model = GPTQModel.load(model_id, quant_config)

# increase `batch_size` to match gpu/vram specs to speed up quantization
model.quantize(calibration_dataset, batch_size=1)

model.save(quant_path)

Also, when batch_size != 1, I get a shape error instead:

  File "/venv/main/lib/python3.12/site-packages/transformers/models/qwen3_omni_moe/modeling_qwen3_omni_moe.py", line 1462, in forward
    attn_output, attn_weights = attention_interface(
                                ^^^^^^^^^^^^^^^^^^^^
  File "/venv/main/lib/python3.12/site-packages/transformers/integrations/sdpa_attention.py", line 96, in sdpa_attention_forward
    attn_output = torch.nn.functional.scaled_dot_product_attention(
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: The expanded size of the tensor (17640) must match the existing size (8820) at non-singleton dimension 3.  Target sizes: [2, 32, 8820, 17640].  Tensor sizes: [2, 1, 8820, 8820]
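
To make the mismatch easier to see, here is a minimal pure-Python sketch of the expand rule that the error describes: a dimension can be expanded only if it already matches the target or has size 1. The helper name `first_expand_mismatch` is made up for illustration and is not part of torch. My guess (an assumption, not confirmed) is that the [2, 1, 8820, 8820] tensor is the attention mask, and that the key length 17640, which is exactly 2 x 8820, means the two samples ended up concatenated along the key dimension.

```python
def first_expand_mismatch(source, target):
    """Return the index of the first dimension where `source` cannot be
    expanded to `target`, or None if the expansion is valid.

    Sketch only: handles shapes of equal rank.
    """
    if len(source) != len(target):
        raise ValueError("this sketch only handles shapes of equal rank")
    for dim, (src, tgt) in enumerate(zip(source, target)):
        # A dimension is expandable if it matches or is a singleton.
        if src != tgt and src != 1:
            return dim
    return None


# Shapes copied from the RuntimeError above.
tensor_shape = [2, 1, 8820, 8820]
target_shape = [2, 32, 8820, 17640]

bad_dim = first_expand_mismatch(tensor_shape, target_shape)
print(f"first non-expandable dimension: {bad_dim}")
print(f"target / existing size at that dimension: {target_shape[bad_dim] / tensor_shape[bad_dim]}")
```

The head dimension (1 to 32) expands fine; only the last dimension fails, which matches "non-singleton dimension 3" in the message.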

[QUESTION] Qwen3 Omni VRAM memory leak #2081