Description
I have been working with this repo, and it literally took me 12 hours to figure out why Gradio was not working while their own demo was working:
https://github.com/facebookresearch/sam-3d-objects
Here are my findings, written up as a bug report against Gradio 6.0.1.
Gradio Bug Report: CUDA GPU Inference Hangs When Using Queue
Summary
When running PyTorch CUDA inference inside a Gradio event handler with queue=True (the default), GPU operations hang indefinitely at certain CUDA kernel calls. The same code works perfectly when run directly (outside Gradio) or when queue=False is explicitly set.
Environment
| Component | Version/Details |
|---|---|
| OS | Windows 10 (10.0.26200) |
| GPU | NVIDIA GeForce RTX 5090 (Blackwell architecture) |
| CUDA Device Count | 2 |
| Gradio Version | 6.x |
| PyTorch | With CUDA support |
| Python | 3.x (venv) |
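For reference, these details can be collected programmatically; a minimal sketch (the values in the table above were read from the affected machine):

```python
import platform

import gradio as gr
import torch

# Print the environment details summarized in the table above.
print("OS:", platform.platform())
print("Python:", platform.python_version())
print("Gradio:", gr.__version__)
print("PyTorch:", torch.__version__)
print("CUDA (build):", torch.version.cuda)
print("CUDA device count:", torch.cuda.device_count())
print("GPU:", torch.cuda.get_device_name(0))
```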
Steps to Reproduce
1. Create a Gradio app that runs PyTorch CUDA inference:

```python
import gradio as gr
import torch

# Load a PyTorch model to GPU at startup
model = load_my_model()  # Model is on CUDA

def process(input_data):
    # This hangs when queue=True (default)
    output = model(input_data)  # CUDA operations
    return output

with gr.Blocks() as demo:
    btn = gr.Button("Process")
    btn.click(fn=process, inputs=[...], outputs=[...])
    # Default: queue=True

demo.launch()
```

2. Click the button to trigger inference.
3. Observe the hang.

The function hangs at CUDA kernel execution with no GPU activity (0% GPU utilization in Task Manager / nvidia-smi).
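To pinpoint where the handler thread is blocked, a stack dump of all threads can be scheduled before triggering the event; a minimal sketch using Python's standard faulthandler module (the 60-second timeout is an arbitrary choice for this report):

```python
import sys
import faulthandler

# If the process is still running after 60 seconds, dump the stack
# traces of all threads to stderr so the blocked CUDA call is visible.
faulthandler.dump_traceback_later(timeout=60, repeat=False, file=sys.stderr)
```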
Expected Behavior
CUDA inference should execute on GPU and complete normally, same as when running the inference code directly in a Python script.
Actual Behavior
- ✅ Model loads correctly to GPU (logs confirm `device: cuda`)
- ✅ CUDA is available in the callback thread (verified with `torch.cuda.is_available()`)
- ✅ `torch.cuda.current_device()` returns the correct device (0)
- ✅ Basic CUDA operations work (e.g., `torch.randn(1000, 1000, device='cuda')`)
- ❌ Complex model inference hangs indefinitely (specifically at the condition embedder / DINO ViT forward pass)
- ❌ No GPU utilization visible - the process appears to be waiting/blocked, not falling back to CPU
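All of the ✅ checks above were run from inside the event handler; a minimal sketch of the same diagnostics, which could be dropped into any handler to reproduce these observations:

```python
import torch

def cuda_diagnostics():
    # Environment checks that all pass inside the Gradio callback.
    print("CUDA available:", torch.cuda.is_available())
    print("Device count:", torch.cuda.device_count())
    print("Current device:", torch.cuda.current_device())
    print("Device name:", torch.cuda.get_device_name(0))

    # A basic kernel launch and sync also complete without issue.
    x = torch.randn(1000, 1000, device="cuda")
    torch.cuda.synchronize()
    print("Basic CUDA op OK:", tuple(x.shape))
```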
Diagnostic Output
```
[DEBUG] CUDA available: True
[DEBUG] CUDA device count: 2
[DEBUG] Current CUDA device: 0
[DEBUG] CUDA device name: NVIDIA GeForce RTX 5090
Running inference with seed=42
Progress: Computing depth & pointmap (5.0%)
Progress: Preprocessing inputs (15.0%)
Progress: Sampling sparse structure (35.0%)
2025-12-02 19:28:39.697 | INFO | Running condition embedder ...
<--- HANGS HERE INDEFINITELY --->
```
Workaround / Fix
Setting queue=False on the event handler fixes the issue:
```python
btn.click(
    fn=process,
    inputs=[...],
    outputs=[...],
    queue=False,  # THIS FIXES IT
)
```

Working Example

```python
import gradio as gr

def process(input_data):
    output = model(input_data)
    return output

with gr.Blocks() as demo:
    btn = gr.Button("Process")
    btn.click(
        fn=process,
        inputs=[...],
        outputs=[...],
        queue=False,  # Bypass queue threading - fixes CUDA hang
    )

demo.launch()
```

Root Cause Analysis
The issue appears to be related to how Gradio's queue system handles CUDA operations:
With queue=True (default)
Gradio runs the callback in a thread pool worker. Something in this threading model causes CUDA operations to block/deadlock.
With queue=False
The callback runs synchronously (likely in the main thread or with different threading behavior), and CUDA works correctly.
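To check whether a plain worker thread is enough to trigger the hang outside Gradio, the same forward pass can be submitted to a thread pool directly; a minimal sketch, where `model` is assumed to be the already-loaded CUDA model and the dummy input stands in for the real preprocessed data:

```python
from concurrent.futures import ThreadPoolExecutor

import torch

def run_inference():
    # Stand-in for the forward pass that hangs inside Gradio
    # (the condition embedder / DINO ViT call in this report).
    x = torch.randn(1, 3, 224, 224, device="cuda")
    with torch.no_grad():
        return model(x)

# Run the same call from a worker thread, similar to how Gradio's
# queue executes event handlers off the main thread.
with ThreadPoolExecutor(max_workers=1) as pool:
    result = pool.submit(run_inference).result()

torch.cuda.synchronize()
print("Inference from worker thread completed:", type(result))
```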
Possible Causes
- CUDA context not properly shared/accessible in queue worker threads
- Thread-local state issues with PyTorch CUDA
- Deadlock between queue management and CUDA synchronization primitives
- Specific issue with newer GPU architectures (Blackwell/RTX 50 series)
Additional Context
| Scenario | Result |
|---|---|
| Same inference code run via `python demo.py` (no Gradio) | ✅ Works perfectly |
| Model loading at startup (before Gradio takes over) | ✅ Works fine |
| Simple CUDA tensor operations in callback | ✅ Works |
| Complex model forward passes (Vision Transformers) | ❌ Hangs |
Key Observation
The exact same inference code works perfectly when run directly without Gradio. The only difference is the execution context (Gradio queue thread vs main script).
Impact
This is a critical issue for any Gradio app that needs to run GPU inference, as queue=True is the default behavior.
User Experience
- ❌ Complete application hang
- ❌ No error messages (silent failure)
- ❌ Difficult to diagnose (CUDA appears available, model appears loaded)
- ❌ Users may incorrectly assume their GPU/model is broken
Suggested Investigation Areas
- Queue Worker Thread Spawning - How queue worker threads are spawned and their relationship to the CUDA context
- CUDA Device Setting - Whether `torch.cuda.set_device()` needs to be called in worker threads (see the sketch after this list)
- CUDA Stream Synchronization - Potential deadlock in CUDA stream synchronization when called from queue threads
- Windows-Specific Behavior - Windows-specific threading behavior with CUDA
- New GPU Architecture - Blackwell/RTX 50 series specific issues
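For the device-setting hypothesis, a cheap experiment is to pin the CUDA device at the top of the handler; a minimal sketch, assuming device 0 is the GPU the model was loaded on:

```python
import torch

def process(input_data):
    # Explicitly select the CUDA device in whatever worker thread
    # Gradio runs this handler on, in case thread-local device state
    # differs from the main thread where the model was loaded.
    torch.cuda.set_device(0)
    with torch.no_grad():
        return model(input_data)
```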
Temporary Workaround for Users
Until this issue is fixed, users running CUDA inference in Gradio should:
```python
# Option 1: Disable queue per-event
btn.click(fn=my_cuda_function, ..., queue=False)

# Option 2: If using .then() chains, disable on all events
btn.click(..., queue=False).then(..., queue=False)
```

Note: Disabling the queue means requests won't be queued and the UI may be less responsive during long-running inference, but GPU inference will work correctly.
Related Information
- Gradio Version: 6.x
- PyTorch CUDA: Confirmed working outside Gradio
- GPU: RTX 5090 (Blackwell) - may also affect other GPUs
- Threading: Issue is specifically with queue=True threading model
The full code used is below:
```python
# Use the EXACT same import path as demo.py
import sys
sys.path.append("notebook")
from inference import Inference, load_image

import os
import tempfile
import numpy as np
from PIL import Image
import gradio as gr

# Global inference instance
inference = None

def load_model():
    global inference
    if inference is None:
        tag = "hf"
        config_path = f"checkpoints/{tag}/pipeline.yaml"
        print("Loading model...")
        inference = Inference(config_path, compile=False)
        print("Model loaded!")
    return inference

def process_images(input_image, mask_image, seed):
    global inference
    if input_image is None:
        raise gr.Error("Please upload an input image")
    if mask_image is None:
        raise gr.Error("Please upload a mask image")
    if inference is None:
        raise gr.Error("Model not loaded")

    # Convert to numpy
    if isinstance(input_image, Image.Image):
        input_image = np.array(input_image)
    if isinstance(mask_image, Image.Image):
        mask_image = np.array(mask_image)

    # Process mask like load_mask does
    mask = mask_image > 0
    if mask.ndim == 3:
        mask = mask[..., -1]
    image = input_image.astype(np.uint8)

    seed_value = int(seed) if seed else 42
    progress_messages = []

    def log_progress(message, fraction=None):
        if fraction is not None:
            progress_messages.append(f"[{fraction*100:.1f}%] {message}")
        else:
            progress_messages.append(message)
        print(f"Progress: {message}" + (f" ({fraction*100:.1f}%)" if fraction else ""))

    print(f"Running inference with seed={seed_value}")
    output = inference(image, mask, seed=seed_value, progress_callback=log_progress)

    output_path = tempfile.mktemp(suffix=".ply")
    output["gs"].save_ply(output_path)
    print(f"Saved to {output_path}")
    return output_path, "\n".join(progress_messages)

def create_app():
    with gr.Blocks(title="SAM 3D Objects") as demo:
        gr.Markdown("# SAM 3D Objects")
        with gr.Row():
            input_image = gr.Image(label="Input Image", type="numpy", height=350)
            mask_image = gr.Image(label="Mask Image", type="numpy", height=350)
        with gr.Row():
            seed_input = gr.Number(label="Seed", value=42, precision=0)
            process_btn = gr.Button("Process", variant="primary")
        with gr.Row():
            output_file = gr.File(label="Download PLY")
            progress_log = gr.Textbox(label="Progress", lines=8)
            model_viewer = gr.Model3D(label="3D Preview")

        process_btn.click(
            fn=process_images,
            inputs=[input_image, mask_image, seed_input],
            outputs=[output_file, progress_log],
            queue=False,  # Run synchronously, not in queue thread
        ).then(
            fn=lambda x: x,
            inputs=[output_file],
            outputs=[model_viewer],
            queue=False,
        )
    return demo

if __name__ == "__main__":
    print("Loading model...")
    load_model()
    print("Model ready!")
    app = create_app()
    # Disable queue to run in main thread (avoids threading issues with CUDA)
    app.launch(
        share=False,
        server_name="0.0.0.0",
        server_port=7860,
    )
```