Conversation

@apbose apbose (Collaborator) commented Nov 14, 2025

This PR

  1. Adds rank-based logging for the distributed examples
  2. Corrects the fallback-to-PyTorch case for the NCCL converters
  3. Together with Changes to TRT-LLM download tool for multigpu distributed case #3830, provides utilities for running distributed tensor-parallel examples using torch.distributed
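For background, rank-based logging of the kind item 1 describes typically tags every record with the process rank that torchrun exports via the RANK environment variable. A minimal sketch (the function name and log format here are illustrative, not the PR's actual implementation):

```python
import logging
import os


def configure_rank_logging(level: int = logging.INFO) -> logging.Logger:
    """Return a logger whose records are prefixed with this process's rank."""
    # torchrun / torch.distributed launchers export RANK; default to 0
    # so the sketch also works in a single-process run.
    rank = int(os.environ.get("RANK", "0"))
    logger = logging.getLogger(f"torch_tensorrt.rank{rank}")
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter(f"[rank {rank}] %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(level)
    return logger
```

Each rank then emits lines like `[rank 1] INFO ...`, which keeps interleaved multi-GPU output attributable to a specific process.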

@meta-cla meta-cla bot added the cla signed label Nov 14, 2025
@github-actions github-actions bot added the component: tests, component: conversion, component: api [Python], and component: dynamo labels Nov 14, 2025
@apbose apbose changed the title Adding rank based logging for torch distributed examples. Also correc… Adding rank based logging for torch distributed examples Nov 14, 2025
@github-actions github-actions bot requested a review from narendasan November 14, 2025 00:05
@apbose apbose marked this pull request as draft November 14, 2025 00:05
@apbose apbose changed the title Adding rank based logging for torch distributed examples Adding rank based logging for torch distributed examples[WIP] Nov 14, 2025
@apbose apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 31666e3 to 52ae92a Compare November 25, 2025 22:59
@apbose apbose changed the title Adding rank based logging for torch distributed examples[WIP] Adding rank based logging for torch distributed examples Nov 26, 2025
@apbose apbose marked this pull request as ready for review November 26, 2025 00:28
@apbose apbose changed the base branch from main to abose/trt_llm_installation_dist November 26, 2025 00:28
return device_mesh, world_size, rank
# Set C++ TensorRT runtime log level based on most verbose handler
# this is similar to set_log_level()
cpp_level = min(file_level_int, console_level_int)
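For context on the min() above: Python logging levels are numerically ordered with lower meaning more verbose (DEBUG=10 < INFO=20 < WARNING=30), so taking the minimum of the file and console handler levels selects the most verbose one. A standalone illustration of the same idea over a logger's handlers (handler names are hypothetical):

```python
import logging


def most_verbose_handler_level(logger: logging.Logger) -> int:
    # Lower numeric values are more verbose, so min() selects the
    # chattiest handler, mirroring min(file_level_int, console_level_int).
    levels = [h.level for h in logger.handlers if h.level != logging.NOTSET]
    return min(levels, default=logging.WARNING)


log = logging.getLogger("demo")
file_like = logging.StreamHandler()  # stands in for a file handler
file_like.setLevel(logging.DEBUG)
console = logging.StreamHandler()
console.setLevel(logging.INFO)
log.addHandler(file_like)
log.addHandler(console)
# most_verbose_handler_level(log) returns logging.DEBUG (10)
```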
Collaborator

Don't we have an API that abstracts needing to detect whether the C++ runtime is available? If not, we should add one.

@apbose apbose (Collaborator, Author) commented Dec 2, 2025

I have added a function in _features.py for the above, and also moved all of this to logging.py. Let me know if that function placement works.
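A capability probe of that shape (the helper name and the import-based check below are assumptions for illustration, not the PR's actual _features.py code) might look like:

```python
def is_tensorrt_runtime_available() -> bool:
    # Hypothetical capability check: report whether the C++ TensorRT
    # runtime bindings can be imported, so callers can fall back to the
    # Python runtime instead of probing for the library themselves.
    try:
        import tensorrt  # noqa: F401

        return True
    except ImportError:
        return False
```

Centralizing the probe means call sites ask a single yes/no question instead of each re-implementing the detection logic.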

not is_platform_supported_for_trtllm(),
"Skipped on Windows, Jetson and CUDA13: NCCL backend is not supported.",
not is_distributed_nccl_available(),
"Skipped: NCCL backend is not available (Windows/Jetson not supported).",
Collaborator

Is it Jetson or just Orin?

Collaborator Author

Yeah, Orin. Changed it to Jetson Orin.

@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py	2025-12-02 00:37:46.920408+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py	2025-12-02 00:38:18.669710+00:00
@@ -148,11 +148,11 @@
            item,
            options.jetpack == "true",
            options.limit_pr_builds == "true",
        ):
            print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
-            filtered_includes.append(item) 
+            filtered_includes.append(item)
            distributed_includes.append(create_distributed_config(item))
        else:
            print(f"[DEBUG] FILTERED OUT", file=sys.stderr)

    # Debug: Show summary

@apbose apbose force-pushed the abose/trt_llm_installation_changes_debug branch from aa4183e to 2ea29e4 Compare December 2, 2025 15:33
@apbose apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 2ea29e4 to 6833fec Compare December 2, 2025 22:41
@apbose apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 6833fec to f8befae Compare December 2, 2025 23:49
@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py	2025-12-02 23:49:16.116928+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py	2025-12-02 23:49:48.815063+00:00
@@ -148,11 +148,11 @@
            item,
            options.jetpack == "true",
            options.limit_pr_builds == "true",
        ):
            print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
-            filtered_includes.append(item) 
+            filtered_includes.append(item)
            distributed_includes.append(create_distributed_config(item))
        else:
            print(f"[DEBUG] FILTERED OUT", file=sys.stderr)

    # Debug: Show summary
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-12-02 23:49:16.689930+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-12-02 23:50:00.341840+00:00
@@ -74,19 +74,19 @@
            try:
                size = os.path.getsize(path)
                shm_files.append((path, size))
            except OSError:
                shm_files.append((path, -1))
-        
+
        # Sort by size descending
        shm_files.sort(key=lambda x: x[1], reverse=True)
        for path, size in shm_files:
            if size >= 0:
                print(f"  {path}: {size / (1024 * 1024):.2f} MB")
            else:
                print(f"  {path}: <unable to get size>")
-        
+
        if not shm_files:
            print("  (no files found)")
    except Exception as e:
        print(f"  Error listing /dev/shm: {e}")

@apbose apbose force-pushed the abose/trt_llm_installation_changes_debug branch from f8befae to f40e84b Compare December 3, 2025 00:43
@github-actions github-actions bot left a comment

There are some changes that do not conform to Python style guidelines:

--- /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py	2025-12-03 00:44:03.183076+00:00
+++ /home/runner/work/TensorRT/TensorRT/.github/scripts/filter-matrix.py	2025-12-03 00:44:33.293930+00:00
@@ -148,11 +148,11 @@
            item,
            options.jetpack == "true",
            options.limit_pr_builds == "true",
        ):
            print(f"[DEBUG] passed filter - adding to build matrix", file=sys.stderr)
-            filtered_includes.append(item) 
+            filtered_includes.append(item)
            distributed_includes.append(create_distributed_config(item))
        else:
            print(f"[DEBUG] FILTERED OUT", file=sys.stderr)

    # Debug: Show summary
--- /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-12-03 00:44:03.634077+00:00
+++ /home/runner/work/TensorRT/TensorRT/tests/py/dynamo/distributed/test_nccl_ops.py	2025-12-03 00:44:44.650284+00:00
@@ -67,41 +67,39 @@

    # List ALL files in /dev/shm to see what's consuming space
    print("\nAll files in /dev/shm (including hidden):")
    try:
        import subprocess
+
        # Use ls -la to see all files including hidden ones
        result = subprocess.run(
-            ["ls", "-la", "/dev/shm"],
-            capture_output=True,
-            text=True,
-            timeout=5
+            ["ls", "-la", "/dev/shm"], capture_output=True, text=True, timeout=5
        )
        print(result.stdout)
-        
+
        # Also run du to see actual disk usage
        print("\nDisk usage breakdown (du -sh /dev/shm/*):")
        result = subprocess.run(
            ["du", "-sh", "/dev/shm/*"],
            capture_output=True,
            text=True,
            shell=False,
-            timeout=5
+            timeout=5,
        )
        # du with glob needs shell=True
        result = subprocess.run(
            "du -sh /dev/shm/* 2>/dev/null | head -20",
            capture_output=True,
            text=True,
            shell=True,
-            timeout=5
+            timeout=5,
        )
        print(result.stdout if result.stdout else "  (no output)")
-        
+
    except Exception as e:
        print(f"  Error listing /dev/shm: {e}")
-    
+
    # Also list using Python for comparison
    print("\nPython os.listdir():")
    try:
        shm_files = []
        for f in os.listdir("/dev/shm"):
@@ -109,25 +107,27 @@
            try:
                size = os.path.getsize(path)
                shm_files.append((path, size))
            except OSError:
                shm_files.append((path, -1))
-        
+
        # Sort by size descending
        shm_files.sort(key=lambda x: x[1], reverse=True)
        total_listed = 0
        for path, size in shm_files:
            if size >= 0:
                print(f"  {path}: {size / (1024 * 1024):.2f} MB")
                total_listed += size
            else:
                print(f"  {path}: <unable to get size>")
-        
+
        print(f"\nTotal from listed files: {total_listed / (1024 * 1024):.2f} MB")
        print(f"Reported used: {usage_before.get('used_mb', 'N/A')} MB")
-        print(f"DISCREPANCY: {usage_before.get('used_mb', 0) - total_listed / (1024 * 1024):.2f} MB unaccounted for!")
-        
+        print(
+            f"DISCREPANCY: {usage_before.get('used_mb', 0) - total_listed / (1024 * 1024):.2f} MB unaccounted for!"
+        )
+
        if not shm_files:
            print("  (no files found)")
    except Exception as e:
        print(f"  Error: {e}")

@@ -135,11 +135,11 @@
        "/dev/shm/nccl-*",
        "/dev/shm/torch_*",
        "/dev/shm/py_shared_memory_*",
        "/dev/shm/*multiprocessing*",
        "/dev/shm/vader_segment*",  # Open MPI shared memory
-        "/dev/shm/sem.*",           # POSIX semaphores
+        "/dev/shm/sem.*",  # POSIX semaphores
    ]

    total_files = 0
    total_bytes_freed = 0

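The debug output in the diff above tallies /dev/shm consumption by hand with ls, du, and os.listdir; the same numbers can be gathered compactly with shutil.disk_usage plus a sorted listing (a sketch, not the test file's code):

```python
import os
import shutil


def shm_usage_mb(path: str = "/dev/shm") -> dict:
    # Overall filesystem usage for the tmpfs mount.
    total, used, free = shutil.disk_usage(path)
    # Per-file sizes, largest first (mirrors the sorted listing in the test).
    files = []
    for name in os.listdir(path):
        full = os.path.join(path, name)
        try:
            files.append((full, os.path.getsize(full)))
        except OSError:
            files.append((full, -1))
    files.sort(key=lambda x: x[1], reverse=True)
    return {
        "total_mb": total / 2**20,
        "used_mb": used / 2**20,
        "free_mb": free / 2**20,
        "files": files,
    }
```

Comparing used_mb against the sum of listed file sizes surfaces the kind of unaccounted-for usage the DISCREPANCY line in the diff is hunting.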
@apbose apbose force-pushed the abose/trt_llm_installation_changes_debug branch from f40e84b to 99ded8c Compare December 3, 2025 05:23
@apbose apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 99ded8c to 6e91c4e Compare December 3, 2025 14:38
@apbose apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 6e91c4e to 3e42d12 Compare December 3, 2025 14:39
@apbose apbose force-pushed the abose/trt_llm_installation_changes_debug branch from 3e42d12 to 091c2e4 Compare December 3, 2025 16:26

Labels

cla signed, component: api [Python], component: conversion, component: dynamo, component: tests, component: torch_compile


3 participants