Your current environment
GPU: H800
Docker image: vllm/vllm-openai:v0.10.2-x86_64
vllm=0.10.2, transformers=4.56.1, torch=2.8.0+cu128 (from the image)
autoawq=0.2.9 (installed manually)
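(I verified these versions inside the container with a quick check along these lines; just a sketch.)

import torch
import transformers
import vllm
import awq  # autoawq (assuming awq.__version__ is exposed)

print("vllm:", vllm.__version__)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("autoawq:", awq.__version__)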
🐛 Describe the bug
Hello vLLM developers,
I am using your vllm/vllm-openai:v0.10.2-x86_64 Docker image, deployed on a Linux server with 6 H800 GPUs. The model I am trying to serve is GLM-4.6-AWQ; its config.json contains the following quantization_config:
"quantization_config": {
"quant_method": "awq_marlin",
"bits": 4,
"group_size": 128,
"version": "gemm",
"zero_point": true,
"modules_to_not_convert": ["embed_tokens", "shared_experts", "shared_head", "lm_head"]
}
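For what it's worth, I also double-checked that this quantization_config is exactly what gets read from disk (a quick sketch; /data is the local model path):

import json

with open("/data/config.json") as f:
    cfg = json.load(f)

# Matches the block quoted above
print(cfg["quantization_config"]["quant_method"])            # awq_marlin
print(cfg["quantization_config"]["modules_to_not_convert"])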
After entering the Docker container, I ran the following command:
vllm serve \
/data \
--served-model-name glm46 \
--enable-auto-tool-choice \
--tool-call-parser glm45 \
--reasoning-parser glm45 \
--swap-space 16 \
--max-num-seqs 32 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9 \
--tensor-parallel-size 4 \
--trust-remote-code \
--host 0.0.0.0 \
--port 8000
The server starts successfully:
(Apologies for the photo format, as our computers are offline.)
However, the output text is garbled (see the screenshot).
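The garbled text shows up for simple requests like the following (a minimal sketch using the OpenAI Python client; the actual prompts differ):

from openai import OpenAI

# Talk to the vllm serve endpoint started above
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="glm46",  # --served-model-name from the serve command
    messages=[{"role": "user", "content": "Hello, who are you?"}],
    max_tokens=64,
    temperature=0,
)
print(resp.choices[0].message.content)  # comes back garbled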
I also tried loading the model in code using:
from vllm import LLM
model = LLM('/data', tensor_parallel_size=4)
but the output is still garbled.
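The generation call I run on top of that is roughly the following (prompt and sampling settings are just illustrative):

from vllm import SamplingParams

outputs = model.generate(
    ["Hello, who are you?"],
    SamplingParams(temperature=0, max_tokens=64),
)
print(outputs[0].outputs[0].text)  # garbled as well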
Next, I tried loading the model without vLLM, using Transformers and AutoAWQ:
from awq import AutoAWQForCausalLM

model_path = "/data"  # same local path as above
model = AutoAWQForCausalLM.from_quantized(
    model_path,
    fuse_layers=True,
    trust_remote_code=True,
    safetensors=True,
    device_map="auto",
)
but it fails with: glm4_moe awq quantization isn't supported yet.
I also tried using AutoModelForCausalLM.from_pretrained, whose output is shown in the screenshot as well.
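That attempt was roughly the following (a sketch; the exact arguments may differ slightly from what I ran):

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("/data", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "/data",
    device_map="auto",
    trust_remote_code=True,
    torch_dtype="auto",
)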
For reference, my environment versions are:
transformers=4.56.1
vllm=0.10.2
torch=2.8.0+cu128
These are all from the Docker image, with only autoawq=0.2.9 installed manually.
By the way, I have also tried passing --chat-template, adding --quantization parameters, loading in bfloat16, etc., but nothing has worked. I have confirmed that the model files are not corrupted.
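For reference, the offline equivalents of those attempts look roughly like this (values mirror the CLI flags I tried; exact combinations varied):

from vllm import LLM

# Forcing the quantization backend explicitly
model = LLM('/data', tensor_parallel_size=4, quantization="awq_marlin")

# Or forcing the dtype
# model = LLM('/data', tensor_parallel_size=4, dtype="bfloat16")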
Could you please advise on how to correctly serve this AWQ model with vLLM or Transformers?
Thank you very much!
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.