Description
I have been working with this repo, and it literally took me 12 hours to figure out why Gradio was not working while their own demo was working:
https://github.com/facebookresearch/sam-3d-objects
Here are my findings, written up as a bug report against Gradio 6.0.1.
Gradio Bug Report: CUDA GPU Inference Hangs When Using Queue
Summary
When running PyTorch CUDA inference inside a Gradio event handler with queue=True (the default), GPU operations hang indefinitely at certain CUDA kernel calls. The same code works perfectly when run directly (outside Gradio) or when queue=False is explicitly set.
Environment
| Component | Version/Details |
|---|---|
| OS | Windows 10 (10.0.26200) |
| GPU | NVIDIA GeForce RTX 5090 (Blackwell architecture) |
| CUDA Device Count | 2 |
| Gradio Version | 6.x |
| PyTorch | With CUDA support |
| Python | 3.x (venv) |
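For reference, these details can be collected programmatically; a minimal sketch (the values in the table above were read from the affected machine):

```python
import platform

import gradio as gr
import torch

# Print the environment details summarized in the table above.
print("OS:", platform.platform())
print("Python:", platform.python_version())
print("Gradio:", gr.__version__)
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("CUDA device count:", torch.cuda.device_count())
print("GPU:", torch.cuda.get_device_name(0))
```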
Steps to Reproduce
1. Create a Gradio app that runs PyTorch CUDA inference:

```python
import gradio as gr
import torch

# Load a PyTorch model to GPU at startup
model = load_my_model()  # Model is on CUDA

def process(input_data):
    # This hangs when queue=True (default)
    output = model(input_data)  # CUDA operations
    return output

with gr.Blocks() as demo:
    btn = gr.Button("Process")
    btn.click(fn=process, inputs=[...], outputs=[...])
    # Default: queue=True

demo.launch()
```

2. Click the button to trigger inference.
3. Observe the hang.

The function hangs at CUDA kernel execution with no GPU activity (0% GPU utilization in Task Manager / nvidia-smi).
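To pinpoint where the handler thread is blocked, a stack dump of all threads can be scheduled before triggering the event; a minimal sketch using Python's standard faulthandler module (the 60-second timeout is an arbitrary choice for this report):

```python
import sys
import faulthandler

# If the process is still running after 60 seconds, dump the stack
# traces of all threads to stderr so the blocked CUDA call is visible.
faulthandler.dump_traceback_later(timeout=60, repeat=False, file=sys.stderr)
```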
Expected Behavior
CUDA inference should execute on GPU and complete normally, same as when running the inference code directly in a Python script.
Actual Behavior
- ✅ Model loads correctly to GPU (logs confirm `device: cuda`)
- ✅ CUDA is available in the callback thread (verified with `torch.cuda.is_available()`)
- ✅ `torch.cuda.current_device()` returns the correct device (0)
- ✅ Basic CUDA operations work (e.g., `torch.randn(1000, 1000, device='cuda')`)
- ❌ Complex model inference hangs indefinitely (specifically at the condition embedder / DINO ViT forward pass)
- ❌ No GPU utilization visible - the process appears to be waiting/blocked, not falling back to CPU
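All of the ✅ checks above were run from inside the event handler; a minimal sketch of the same diagnostics, which could be dropped into any handler to reproduce these observations:

```python
import torch

def cuda_diagnostics():
    # Environment checks that all pass inside the Gradio callback.
    print("CUDA available:", torch.cuda.is_available())
    print("Device count:", torch.cuda.device_count())
    print("Current device:", torch.cuda.current_device())
    print("Device name:", torch.cuda.get_device_name(0))

    # A basic kernel launch and sync also complete without issue.
    x = torch.randn(1000, 1000, device="cuda")
    torch.cuda.synchronize()
    print("Basic CUDA op OK:", tuple(x.shape))
```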
Diagnostic Output
```
[DEBUG] CUDA available: True
[DEBUG] CUDA device count: 2
[DEBUG] Current CUDA device: 0
[DEBUG] CUDA device name: NVIDIA GeForce RTX 5090
Running inference with seed=42
Progress: Computing depth & pointmap (5.0%)
Progress: Preprocessing inputs (15.0%)
Progress: Sampling sparse structure (35.0%)
2025-12-02 19:28:39.697 | INFO | Running condition embedder ...
<--- HANGS HERE INDEFINITELY --->
```
Workaround / Fix
Setting queue=False on the event handler fixes the issue:
```python
btn.click(
    fn=process,
    inputs=[...],
    outputs=[...],
    queue=False,  # THIS FIXES IT
)
```

Working Example

```python
import gradio as gr

def process(input_data):
    output = model(input_data)
    return output

with gr.Blocks() as demo:
    btn = gr.Button("Process")
    btn.click(
        fn=process,
        inputs=[...],
        outputs=[...],
        queue=False,  # Bypass queue threading - fixes CUDA hang
    )

demo.launch()
```

Root Cause Analysis
The issue appears to be related to how Gradio's queue system handles CUDA operations:
With queue=True (default)
Gradio runs the callback in a thread pool worker. Something in this threading model causes CUDA operations to block/deadlock.
With queue=False
The callback runs synchronously (likely in the main thread or with different threading behavior), and CUDA works correctly.
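To check whether a plain worker thread is enough to trigger the hang outside Gradio, the same forward pass can be submitted to a thread pool directly; a minimal sketch, where `model` is assumed to be the already-loaded CUDA model and the dummy input stands in for the real preprocessed data:

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def run_inference():
    # Stand-in for the forward pass that hangs inside Gradio
    # (the condition embedder / DINO ViT call in this report).
    x = torch.randn(1, 3, 224, 224, device="cuda")
    with torch.no_grad():
        return model(x)

# Run the same call from a worker thread, similar to how Gradio's
# queue executes event handlers off the main thread.
with ThreadPoolExecutor(max_workers=1) as pool:
    result = pool.submit(run_inference).result()

torch.cuda.synchronize()
print("Inference from worker thread completed:", type(result))
```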
Possible Causes
- CUDA context not properly shared/accessible in queue worker threads
- Thread-local state issues with PyTorch CUDA
- Deadlock between queue management and CUDA synchronization primitives
- Specific issue with newer GPU architectures (Blackwell/RTX 50 series)
Additional Context
| Scenario | Result |
|---|---|
| Same inference code run via `python demo.py` (no Gradio) | ✅ Works perfectly |
| Model loading at startup (before Gradio takes over) | ✅ Works fine |
| Simple CUDA tensor operations in callback | ✅ Works |
| Complex model forward passes (Vision Transformers) | ❌ Hangs |
Key Observation
The exact same inference code works perfectly when run directly without Gradio. The only difference is the execution context (Gradio queue thread vs main script).
Impact
This is a critical issue for any Gradio app that needs to run GPU inference, as queue=True is the default behavior.
User Experience
- ❌ Complete application hang
- ❌ No error messages (silent failure)
- ❌ Difficult to diagnose (CUDA appears available, model appears loaded)
- ❌ Users may incorrectly assume their GPU/model is broken
Suggested Investigation Areas
- Queue Worker Thread Spawning - How queue worker threads are spawned and their relationship to the CUDA context
- CUDA Device Setting - Whether `torch.cuda.set_device()` needs to be called in worker threads (see the sketch after this list)
- CUDA Stream Synchronization - Potential deadlock in CUDA stream synchronization when called from queue threads
- Windows-Specific Behavior - Windows-specific threading behavior with CUDA
- New GPU Architecture - Blackwell/RTX 50 series specific issues
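For the device-setting hypothesis, a cheap experiment is to pin the CUDA device at the top of the handler; a minimal sketch, assuming device 0 is the GPU the model was loaded on:

```python
import torch

def process(input_data):
    # Explicitly select the CUDA device in whatever worker thread
    # Gradio runs this handler on, in case thread-local device state
    # differs from the main thread where the model was loaded.
    torch.cuda.set_device(0)
    with torch.no_grad():
        return model(input_data)
```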
Temporary Workaround for Users
Until this issue is fixed, users running CUDA inference in Gradio should:
```python
# Option 1: Disable queue per-event
btn.click(fn=my_cuda_function, ..., queue=False)

# Option 2: If using .then() chains, disable on all events
btn.click(..., queue=False).then(..., queue=False)
```

Note: Disabling the queue means requests won't be queued and the UI may be less responsive during long-running inference, but GPU inference will work correctly.
Related Information
- Gradio Version: 6.x
- PyTorch CUDA: Confirmed working outside Gradio
- GPU: RTX 5090 (Blackwell) - may also affect other GPUs
- Threading: Issue is specifically with queue=True threading model
The full code used is below:
```python
# Use the EXACT same import path as demo.py
import sys
sys.path.append("notebook")
from inference import Inference, load_image

import os
import tempfile
import numpy as np
from PIL import Image
import gradio as gr

# Global inference instance
inference = None

def load_model():
    global inference
    if inference is None:
        tag = "hf"
        config_path = f"checkpoints/{tag}/pipeline.yaml"
        print("Loading model...")
        inference = Inference(config_path, compile=False)
        print("Model loaded!")
    return inference

def process_images(input_image, mask_image, seed):
    global inference
    if input_image is None:
        raise gr.Error("Please upload an input image")
    if mask_image is None:
        raise gr.Error("Please upload a mask image")
    if inference is None:
        raise gr.Error("Model not loaded")

    # Convert to numpy
    if isinstance(input_image, Image.Image):
        input_image = np.array(input_image)
    if isinstance(mask_image, Image.Image):
        mask_image = np.array(mask_image)

    # Process mask like load_mask does
    mask = mask_image > 0
    if mask.ndim == 3:
        mask = mask[..., -1]
    image = input_image.astype(np.uint8)

    seed_value = int(seed) if seed else 42
    progress_messages = []

    def log_progress(message, fraction=None):
        if fraction is not None:
            progress_messages.append(f"[{fraction*100:.1f}%] {message}")
        else:
            progress_messages.append(message)
        print(f"Progress: {message}" + (f" ({fraction*100:.1f}%)" if fraction else ""))

    print(f"Running inference with seed={seed_value}")
    output = inference(image, mask, seed=seed_value, progress_callback=log_progress)

    output_path = tempfile.mktemp(suffix=".ply")
    output["gs"].save_ply(output_path)
    print(f"Saved to {output_path}")
    return output_path, "\n".join(progress_messages)

def create_app():
    with gr.Blocks(title="SAM 3D Objects") as demo:
        gr.Markdown("# SAM 3D Objects")
        with gr.Row():
            input_image = gr.Image(label="Input Image", type="numpy", height=350)
            mask_image = gr.Image(label="Mask Image", type="numpy", height=350)
        with gr.Row():
            seed_input = gr.Number(label="Seed", value=42, precision=0)
            process_btn = gr.Button("Process", variant="primary")
        with gr.Row():
            output_file = gr.File(label="Download PLY")
            progress_log = gr.Textbox(label="Progress", lines=8)
            model_viewer = gr.Model3D(label="3D Preview")

        process_btn.click(
            fn=process_images,
            inputs=[input_image, mask_image, seed_input],
            outputs=[output_file, progress_log],
            queue=False,  # Run synchronously, not in queue thread
        ).then(
            fn=lambda x: x,
            inputs=[output_file],
            outputs=[model_viewer],
            queue=False,
        )
    return demo

if __name__ == "__main__":
    print("Loading model...")
    load_model()
    print("Model ready!")
    app = create_app()
    # Disable queue to run in main thread (avoids threading issues with CUDA)
    app.launch(
        share=False,
        server_name="0.0.0.0",
        server_port=7860,
    )
```