
Gradio Bug Report: CUDA GPU inference hangs when using queue (default behavior) #12492

@FurkanGozukara

Description


I have been working on this repo, and it literally took me 12 hours to figure out why Gradio was not working while their demo was:

https://github.com/facebookresearch/sam-3d-objects

Here are my findings, as a bug report for Gradio 6.0.1.

Gradio Bug Report: CUDA GPU Inference Hangs When Using Queue

Summary

When running PyTorch CUDA inference inside a Gradio event handler with queue=True (the default), GPU operations hang indefinitely at certain CUDA kernel calls. The same code works perfectly when run directly (outside Gradio) or when queue=False is explicitly set.


Environment

Component          Version / Details
OS                 Windows 10 (10.0.26200)
GPU                NVIDIA GeForce RTX 5090 (Blackwell architecture)
CUDA Device Count  2
Gradio Version     6.0.1
PyTorch            With CUDA support
Python             3.x (venv)

Steps to Reproduce

1. Create a Gradio app that runs PyTorch CUDA inference:

import gradio as gr
import torch

# Load a PyTorch model to GPU at startup
model = load_my_model()  # Model is on CUDA

def process(input_data):
    # This hangs when queue=True (default)
    output = model(input_data)  # CUDA operations
    return output

with gr.Blocks() as demo:
    btn = gr.Button("Process")
    btn.click(fn=process, inputs=[...], outputs=[...])
    # Default: queue=True

demo.launch()

2. Click the button to trigger inference

3. Observe the hang

The function hangs at CUDA kernel execution with no GPU activity (0% GPU utilization in Task Manager/nvidia-smi).


Expected Behavior

CUDA inference should execute on GPU and complete normally, same as when running the inference code directly in a Python script.


Actual Behavior

  • ✅ Model loads correctly to GPU (logs confirm device: cuda)
  • ✅ CUDA is available in the callback thread (verified with torch.cuda.is_available())
  • ✅ torch.cuda.current_device() returns the correct device (0)
  • ✅ Basic CUDA operations work (e.g., torch.randn(1000, 1000, device='cuda')) (see the probe sketch below)
  • ❌ Complex model inference hangs indefinitely (specifically at the condition embedder / DINO ViT forward pass)
  • ❌ No GPU utilization visible - the process appears to be waiting/blocked, not running on CPU
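
The checks in the list above can be reproduced with a few diagnostic lines at the top of the event handler. This is a minimal sketch of such a probe (the debug prints are illustrative, not part of the original app); model stands for the CUDA model loaded at startup:

import threading
import torch

def process(input_data):
    # Diagnostic probes run inside the Gradio callback
    print(f"[DEBUG] handler thread: {threading.current_thread().name}")
    print(f"[DEBUG] CUDA available: {torch.cuda.is_available()}")
    print(f"[DEBUG] current CUDA device: {torch.cuda.current_device()}")

    # Simple kernel launches complete normally in the callback thread...
    x = torch.randn(1000, 1000, device="cuda")
    torch.cuda.synchronize()
    print(f"[DEBUG] basic CUDA op OK, shape: {tuple(x.shape)}")

    # ...but the full model forward pass is where the hang occurs
    return model(input_data)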

Diagnostic Output

[DEBUG] CUDA available: True
[DEBUG] CUDA device count: 2
[DEBUG] Current CUDA device: 0
[DEBUG] CUDA device name: NVIDIA GeForce RTX 5090

Running inference with seed=42
Progress: Computing depth & pointmap (5.0%)
Progress: Preprocessing inputs (15.0%)
Progress: Sampling sparse structure (35.0%)
2025-12-02 19:28:39.697 | INFO | Running condition embedder ...
<--- HANGS HERE INDEFINITELY --->
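
To see exactly which call the worker thread is blocked in, a periodic stack dump can be armed before launching the app. This is a sketch using Python's standard faulthandler module (not part of the original report):

import faulthandler
import sys

# Dump every thread's traceback to stderr every 60 seconds.
# When the app hangs, the worker thread's frames show the exact call
# it is stuck in. Arm this once before demo.launch().
faulthandler.dump_traceback_later(timeout=60, repeat=True, file=sys.stderr)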

Workaround / Fix

Setting queue=False on the event handler fixes the issue:

btn.click(
    fn=process,
    inputs=[...],
    outputs=[...],
    queue=False,  # THIS FIXES IT
)

Working Example

import gradio as gr
import torch

# Load a PyTorch model to GPU at startup (same as above)
model = load_my_model()  # Model is on CUDA

def process(input_data):
    output = model(input_data)
    return output

with gr.Blocks() as demo:
    btn = gr.Button("Process")
    btn.click(
        fn=process,
        inputs=[...],
        outputs=[...],
        queue=False,  # Bypass queue threading - fixes CUDA hang
    )

demo.launch()

Root Cause Analysis

The issue appears to be related to how Gradio's queue system handles CUDA operations:

With queue=True (default)

Gradio runs the callback in a thread pool worker. Something in this threading model causes CUDA operations to block/deadlock.

With queue=False

The callback runs synchronously (likely in the main thread or with different threading behavior), and CUDA works correctly.
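
A minimal A/B test of the two paths is sketched below (a large matmul stands in for the real model here; per the report, only the full ViT forward pass reproduces the hang, so a real model call may be needed):

import threading
import gradio as gr
import torch

def cuda_job():
    # Log which thread the handler runs in so the two paths can be compared
    thread = threading.current_thread().name
    a = torch.randn(2048, 2048, device="cuda")
    b = torch.randn(2048, 2048, device="cuda")
    y = a @ b                 # stand-in for the model forward pass
    torch.cuda.synchronize()  # force the kernel to actually run
    return f"CUDA matmul finished in thread {thread}: sum={y.sum().item():.2f}"

with gr.Blocks() as demo:
    result = gr.Textbox(label="Result")
    gr.Button("Run (queued, default)").click(fn=cuda_job, outputs=result)
    gr.Button("Run (queue=False)").click(fn=cuda_job, outputs=result, queue=False)

demo.launch()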

Possible Causes

  1. CUDA context not properly shared/accessible in queue worker threads
  2. Thread-local state issues with PyTorch CUDA
  3. Deadlock between queue management and CUDA synchronization primitives
  4. Specific issue with newer GPU architectures (Blackwell/RTX 50 series)

Additional Context

Scenario                                                 Result
Same inference code run via python demo.py (no Gradio)  ✅ Works perfectly
Model loading at startup (before Gradio takes over)     ✅ Works fine
Simple CUDA tensor operations in the callback           ✅ Works
Complex model forward passes (Vision Transformers)      ❌ Hangs

Key Observation

The exact same inference code works perfectly when run directly without Gradio. The only difference is the execution context (Gradio queue thread vs main script).


Impact

This is a critical issue for any Gradio app that needs to run GPU inference, as queue=True is the default behavior.

User Experience

  • ❌ Complete application hang
  • ❌ No error messages (silent failure)
  • ❌ Difficult to diagnose (CUDA appears available, model appears loaded)
  • ❌ Users may incorrectly assume their GPU/model is broken

Suggested Investigation Areas

  1. Queue Worker Thread Spawning - How queue worker threads are spawned and their relationship to CUDA context
  2. CUDA Device Setting - Whether torch.cuda.set_device() needs to be called in worker threads (see the sketch after this list)
  3. CUDA Stream Synchronization - Potential deadlock in CUDA stream synchronization when called from queue threads
  4. Windows-Specific Behavior - Windows-specific threading behavior with CUDA
  5. New GPU Architecture - Blackwell/RTX 50 series specific issues
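
For item 2 above, a quick experiment (a sketch, not a confirmed fix) is to pin the CUDA device explicitly at the top of the event handler and force an early synchronization point, so any thread-local device/context problem surfaces before the model call:

import torch

def process(input_data):
    # Hypothesis check: bind this worker thread to device 0 before any model call.
    # If the hang disappears, thread-local CUDA device state is the culprit;
    # if it still hangs, the cause lies elsewhere (e.g. a synchronization deadlock).
    torch.cuda.set_device(0)
    torch.cuda.synchronize()
    return model(input_data)  # model: the CUDA model loaded at startup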

Temporary Workaround for Users

Until this issue is fixed, users running CUDA inference in Gradio should:

# Option 1: Disable queue per-event
btn.click(fn=my_cuda_function, ..., queue=False)

# Option 2: If using .then() chains, disable on all events
btn.click(..., queue=False).then(..., queue=False)

Note: Disabling queue means requests won't be queued and the UI may be less responsive during long-running inference, but GPU inference will work correctly.


Related Information

  • Gradio Version: 6.0.1
  • PyTorch CUDA: Confirmed working outside Gradio
  • GPU: RTX 5090 (Blackwell) - may also affect other GPUs
  • Threading: Issue is specifically with queue=True threading model

The full code used is below:

# Use the EXACT same import path as demo.py
import sys
sys.path.append("notebook")
from inference import Inference, load_image

import os
import tempfile
import numpy as np
from PIL import Image
import gradio as gr

# Global inference instance
inference = None


def load_model():
   global inference
   if inference is None:
       tag = "hf"
       config_path = f"checkpoints/{tag}/pipeline.yaml"
       print("Loading model...")
       inference = Inference(config_path, compile=False)
       print("Model loaded!")
   return inference


def process_images(input_image, mask_image, seed):
   global inference
   
   if input_image is None:
       raise gr.Error("Please upload an input image")
   if mask_image is None:
       raise gr.Error("Please upload a mask image")
   
   if inference is None:
       raise gr.Error("Model not loaded")
   
   # Convert to numpy
   if isinstance(input_image, Image.Image):
       input_image = np.array(input_image)
   if isinstance(mask_image, Image.Image):
       mask_image = np.array(mask_image)
   
   # Process mask like load_mask does
   mask = mask_image > 0
   if mask.ndim == 3:
       mask = mask[..., -1]
   
   image = input_image.astype(np.uint8)
   seed_value = int(seed) if seed else 42
   
   progress_messages = []
   def log_progress(message, fraction=None):
       if fraction is not None:
           progress_messages.append(f"[{fraction*100:.1f}%] {message}")
       else:
           progress_messages.append(message)
       print(f"Progress: {message}" + (f" ({fraction*100:.1f}%)" if fraction else ""))
   
   print(f"Running inference with seed={seed_value}")
   output = inference(image, mask, seed=seed_value, progress_callback=log_progress)
   
   output_path = tempfile.mktemp(suffix=".ply")
   output["gs"].save_ply(output_path)
   print(f"Saved to {output_path}")
   
   return output_path, "\n".join(progress_messages)


def create_app():
   with gr.Blocks(title="SAM 3D Objects") as demo:
       gr.Markdown("# SAM 3D Objects")
       
       with gr.Row():
           input_image = gr.Image(label="Input Image", type="numpy", height=350)
           mask_image = gr.Image(label="Mask Image", type="numpy", height=350)
       
       with gr.Row():
           seed_input = gr.Number(label="Seed", value=42, precision=0)
           process_btn = gr.Button("Process", variant="primary")
       
       with gr.Row():
           output_file = gr.File(label="Download PLY")
           progress_log = gr.Textbox(label="Progress", lines=8)
       
       model_viewer = gr.Model3D(label="3D Preview")
       
       process_btn.click(
           fn=process_images,
           inputs=[input_image, mask_image, seed_input],
           outputs=[output_file, progress_log],
           queue=False,  # Run synchronously, not in queue thread
       ).then(
           fn=lambda x: x,
           inputs=[output_file],
           outputs=[model_viewer],
           queue=False,
       )
   
   return demo


if __name__ == "__main__":
   print("Loading model...")
   load_model()
   print("Model ready!")
   
   app = create_app()
   
   # Disable queue to run in main thread (avoids threading issues with CUDA)
   app.launch(
       share=False,
       server_name="0.0.0.0",
       server_port=7860,
   )
