GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring
-
Updated
Feb 2, 2026 - Python
GPU Cluster Monitoring (GCM): Large-Scale AI Research Cluster Monitoring
Simulate NVIDIA GPUs for testing. 7 behavior profiles, scale to 1000+ GPUs, Docker-ready Prometheus exporter using DCGM
Complete security toolkit for enterprise NVIDIA GPU infrastructure. Includes NIST 800-53 controls, Zero Trust architecture, threat models, incident response playbooks, forensic scripts, and monitoring configurations for H100/A100/L40S and other datacenter GPUs.
Add a description, image, and links to the dcgm topic page so that developers can more easily learn about it.
To associate your repository with the dcgm topic, visit your repo's landing page and select "manage topics."