Your current environment
GPU: H800
Docker image: vllm/vllm-openai:v0.10.2-x86_64
vllm=0.10.2, transformers=4.56.1, torch=2.8.0+cu128 (from the image)
autoawq=0.2.9 (installed manually)
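(I verified these versions inside the container with a quick check along these lines; just a sketch.)

import torch
import transformers
import vllm
import awq  # autoawq (assuming awq.__version__ is exposed)

print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("autoawq:", awq.__version__)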
🐛 Describe the bug
Hello vLLM developers,
I am using your vllm/vllm-openai:v0.10.2-x86_64 Docker image, deployed on a Linux server with 6 H800 GPUs. The model I am trying to serve is GLM-4.6-AWQ; its config.json contains the following quantization_config:
"quantization_config": {
"quant_method": "awq_marlin",
"bits": 4,
"group_size": 128,
"version": "gemm",
"zero_point": true,
"modules_to_not_convert": ["embed_tokens", "shared_experts", "shared_head", "lm_head"]
}
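For what it's worth, I also double-checked that this quantization_config is exactly what gets read from disk (a quick sketch; /data is the local model path):

import json

with open("/data/config.json") as f:
    cfg = json.load(f)

# Matches the block quoted above
print(cfg["quantization_config"]["quant_method"])            # awq_marlin
print(cfg["quantization_config"]["modules_to_not_convert"])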
After entering the Docker container, I ran the following command:
vllm serve \
/data \
--served-model-name glm46 \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 32 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 4 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
The server starts successfully:
(Apologies for the photo format, as our computers are offline.)
However, the output text is garbled (see the screenshot).
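The garbled text shows up for simple requests like the following (a minimal sketch using the OpenAI Python client; the actual prompts differ):

from openai import OpenAI

# Talk to the vllm serve endpoint started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="glm46",  # --served-model-name from the serve command
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=64,
    temperature=0,
)
print(resp.choices[0].message.content)  # comes back garbled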
I also tried loading the model in code using:
from vllm import LLM
model = LLM('/data', tensor_parallel_size=4)
but the output is still garbled.
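The generation call I run on top of that is roughly the following (prompt and sampling settings are just illustrative):

from vllm import SamplingParams

outputs = model.generate(
    ["Hello, who are you?"],
    SamplingParams(temperature=0, max_tokens=64),
)
print(outputs[0].outputs[0].text)  # garbled as well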
Next, I tried loading the model without vLLM, using Transformers and AutoAWQ:
from awq import AutoAWQForCausalLM

model_path = "/data"  # same local path as above
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,
    trust_remote_code=True,
    safetensors=True,
    device_map="auto",
)
but it fails with: glm4_moe awq quantization isn't supported yet.
I also tried using AutoModelForCausalLM.from_pretrained, whose output is shown in the screenshot as well.
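That attempt was roughly the following (a sketch; the exact arguments may differ slightly from what I ran):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/data", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "/data",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype="auto",
)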
For reference, my environment versions are:
transformers=4.56.1
vllm=0.10.2
torch=2.8.0+cu128
These are all from the Docker image, with only autoawq=0.2.9 installed manually.
By the way, I have also tried passing --chat-template, adding --quantization parameters, loading in bfloat16, etc., but nothing has worked. I have confirmed that the model files are not corrupted.
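For reference, the offline equivalents of those attempts look roughly like this (values mirror the CLI flags I tried; exact combinations varied):

from vllm import LLM

# Forcing the quantization backend explicitly
model = LLM('/data', tensor_parallel_size=4, quantization="awq_marlin")

# Or forcing the dtype
# model = LLM('/data', tensor_parallel_size=4, dtype="bfloat16")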
Could you please advise on how to correctly serve this AWQ model with vLLM or Transformers?
Thank you very much!
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.