GPU Watchdog 🐩

Motivation • Overview • Preview • Usage • Quickstart • Customization • Gmail API Setup • FAQ • Directory Structure • Acknowledgements • Contact • Organizers


📌 Motivation

In shared GPU or multi-user HPC environments, you often run into situations like:

  • You are training, but the GPUs are fully occupied by others, and your job slows down sharply or even OOMs;
  • You want to quickly know who is using the GPUs on the current machine;
  • You want to automatically remind yourself when GPUs enter certain states (so you do not have to keep staring at nvidia-smi).

The goal of GPU Watchdog 🐩 is a Bash script with minimal dependencies that runs directly on clusters, periodically reads GPU process information, and notifies you by email when trigger rules fire.

🚀 Overview

GPU Watchdog is a lightweight GPU monitoring and notification tool:

  • Wraps nvidia-smi output via bin/mygpu.sh, listing per-process fields such as GPU=... PID=... ETIME=... (see the sketch after this list)
  • bin/gpu_watch.sh decides whether to alert based on rules, with de-duplication and cooldown to prevent spamming
  • Supports two notification methods:
    • ✅ Gmail API (recommended): uses HTTPS on port 443, which clusters are usually more likely to allow
    • ✅ SMTP (alternative): use when your cluster allows outbound SMTP
  • In auto mode it may also try local mail/mailx/sendmail, but whether that works depends on the cluster mail system; on clusters, prefer the Gmail API.
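
As a rough sketch of the idea only (this is not the actual bin/mygpu.sh; the uuid-to-index mapping, the CMD field name, and the exact formatting are assumptions), something like the following prints one GPU=... PID=... ETIME=... line per GPU compute process:

# Sketch only: map GPU UUIDs to indices, then print one line per GPU process.
declare -A IDX
while IFS=', ' read -r index uuid; do IDX[$uuid]=$index; done \
  < <(nvidia-smi --query-gpu=index,uuid --format=csv,noheader)

nvidia-smi --query-compute-apps=gpu_uuid,pid,process_name --format=csv,noheader |
while IFS=', ' read -r uuid pid cmd; do
  etime=$(ps -o etime= -p "$pid" | tr -d ' ')   # elapsed time of the process
  echo "GPU=${IDX[$uuid]} PID=$pid ETIME=$etime CMD=$cmd"
done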

👀 Preview

1) Terminal monitoring output

Terminal Preview

Contents include:

  • A list of GPU processes in use, including GPU id, pid, etime, program path, run command, and more

2) Email alert example

Email Preview

Contents include:

  • Timestamp
  • Trigger reason and predefined rules
  • Raw output (helpful for tracing and debugging)

Note: The above screenshots are for demonstration; paths are redacted. In actual runs, real paths will be printed.

🧭 Usage

Default monitoring logic (see bin/gpu_watch.sh):

  • Count the number of GPUs currently in use (deduplicated by GPU id appearing in per-process output)
  • Find the single GPU process with the longest runtime (ETIME)
  • Alert when any of the following conditions is met (see the sketch below):
    • GPU_COUNT > GPU_LIMIT
    • The longest ETIME exceeds the MAX_ETIME_MIN threshold (i.e., some process has been running for too long)

It also includes anti-spam mechanisms:

  • The same alert will not be sent again within COOLDOWN_MIN
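
A minimal sketch of the check described above, assuming the thresholds from config/gpu-watch.env and the GPU=... / ETIME=... fields shown by bin/mygpu.sh (the real bin/gpu_watch.sh additionally handles de-duplication, cooldown, and sending the email):

# Sketch only: count distinct GPUs and find the longest process runtime in minutes.
source config/gpu-watch.env                      # provides GPU_LIMIT, MAX_ETIME_MIN, ...
OUT=$(bash bin/mygpu.sh)
GPU_COUNT=$(grep -o 'GPU=[^ ]*' <<<"$OUT" | sort -u | wc -l)
# ETIME from ps looks like [[DD-]HH:]MM:SS; convert it to whole minutes.
MAX_ETIME=$(grep -o 'ETIME=[^ ]*' <<<"$OUT" | cut -d= -f2 |
  awk -F'[-:]' '{ m = (NF==4) ? $1*1440+$2*60+$3 : (NF==3) ? $1*60+$2 : $1
                  if (m > max) max = m } END { print max+0 }')
if (( GPU_COUNT > GPU_LIMIT )) || (( MAX_ETIME > MAX_ETIME_MIN )); then
  echo "would alert: GPU_COUNT=$GPU_COUNT, longest ETIME=${MAX_ETIME} min"
fi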

⚡ Quickstart

1) Get the code

git clone https://github.com/Yukyin/gpu-watchdog.git
cd gpu-watchdog

2) Prepare config files

cp config/gpu-watch.env.example  config/gpu-watch.env
cp config/notify.env.example     config/notify.env   # Gmail (recommended)
cp config/smtp.env.example       config/smtp.env     # SMTP (optional)

3) Run a Dry Run once (no email is sent; only prints subject and body)

bash bin/gpu_watch.sh --dry-run

Note: If no trigger condition is currently met, --dry-run may produce no output; this is normal. To see real-time usage, run bash bin/mygpu.sh.

4) Test mail (force send; skip de-dup and cooldown)

bash bin/gpu_watch.sh --test-mail

🕰️ Long-running modes

Three commonly used approaches:

Mode A: Temporary viewing

Refresh current GPU processes every 60 seconds:

watch -n 60 bash bin/mygpu.sh

Mode B: Resident loop (recommended)

Check once every 5 minutes; send an email when rules fire:

while true; do
  bash bin/gpu_watch.sh
  sleep 300
done

You can also keep it running long-term in tmux / screen.
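
For example, with tmux (screen works similarly; the session name gpuwatch is just an illustration):

tmux new -s gpuwatch       # start a named session, then run the loop above inside it
# detach with Ctrl-b d; the loop keeps running in the background
tmux attach -t gpuwatch    # re-attach later to check on it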

Mode C: crontab (production-friendly)

Run once every 5 minutes:

crontab -e

Add a line (replace <repo-dir> with your actual path):

*/5 * * * * cd <repo-dir> && bash bin/gpu_watch.sh >/dev/null 2>&1

Tip: If your cluster nodes reboot or interactive sessions get reclaimed, crontab is often more stable than while true loops.

🛠️ Customization

This project's configuration lives mainly in three files:

  • config/gpu-watch.env: trigger rules and the alert recipient
  • config/notify.env: notification method selection and Gmail API configuration
  • config/smtp.env: SMTP configuration (optional; required when using SMTP)

Environment-variable overrides are also supported, which is useful for reusing one codebase across multiple machines or multiple configuration sets.

A. config/gpu-watch.env

This file determines when to alert and whom to notify.

TO_EMAIL

  • Meaning: recipient email address for alert emails.
  • Example: TO_EMAIL="you@example.com"
  • Note: if empty, the script will error out and exit.

GPU_LIMIT

  • Meaning: alert when the number of "GPUs currently in use" is greater than this value.
  • Example: GPU_LIMIT=2
  • Explanation: the script extracts GPU=... from mygpu.sh output and deduplicates it to compute GPU_COUNT.

MAX_ETIME_MIN

  • Meaning: alert when the runtime (ETIME) of any GPU process exceeds this number of minutes.
  • Example: MAX_ETIME_MIN=360
  • Explanation: the script selects the maximum ETIME across all processes as MAX_ETIME and compares it to the threshold.

COOLDOWN_MIN

  • Meaning: cooldown time (minutes) for the same alert signature; no repeated sends within the cooldown window.
  • Example: COOLDOWN_MIN=60
  • Signature mechanism: SHA256 computed from trigger reasons and raw output, used for de-duplication.
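
Putting these together, a filled-in config/gpu-watch.env might look like this (values are the illustrative ones from above):

TO_EMAIL="you@example.com"
GPU_LIMIT=2          # alert when more than 2 GPUs are in use
MAX_ETIME_MIN=360    # alert when any GPU process runs longer than 6 hours
COOLDOWN_MIN=60      # do not resend the same alert within 1 hour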

B. config/notify.env

This file determines how email is sent.
Note: the recipient TO_EMAIL is not set here; it lives in gpu-watch.env.

NOTIFY_METHOD

  • Meaning: notification backend selection.
  • Values: gmail_api | smtp | auto
    • gmail_api: force Gmail API (recommended)
    • smtp: force SMTP (requires config/smtp.env to exist and be usable)
    • auto: automatic mode (the script tries available methods according to its own logic)
  • Example: NOTIFY_METHOD="gmail_api"

GMAIL_API_CREDENTIALS

  • Meaning: absolute path to the OAuth client credentials.json.
  • Example: GMAIL_API_CREDENTIALS="/ABS/PATH/to/credentials.json"

GMAIL_API_TOKEN

  • Meaning: absolute path to the generated token.json after authorization.
  • Example: GMAIL_API_TOKEN="/ABS/PATH/to/token.json"

FROM_EMAIL

  • Meaning: sender email address.
  • Example: FROM_EMAIL="you@example.com"
  • Suggestion: match the Gmail account associated with the token.
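
For the recommended Gmail API path, the resulting config/notify.env looks roughly like this (paths are placeholders):

NOTIFY_METHOD="gmail_api"
GMAIL_API_CREDENTIALS="/ABS/PATH/to/credentials.json"
GMAIL_API_TOKEN="/ABS/PATH/to/token.json"
FROM_EMAIL="you@example.com"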

C. config/smtp.env

Required only when NOTIFY_METHOD="smtp" or when the script takes the SMTP path.

SMTP_HOST

  • SMTP server host, e.g. smtp.example.com

SMTP_PORT

  • Commonly 587 (STARTTLS) or 465 (SSL)

SMTP_USER / SMTP_PASS

  • SMTP username and password (use an app-specific password if possible)

FROM_EMAIL

  • If not set, the script will default to SMTP_USER
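
A filled-in config/smtp.env then looks roughly like this (host and credentials are placeholders):

SMTP_HOST="smtp.example.com"
SMTP_PORT=587                   # or 465 for SSL
SMTP_USER="you@example.com"
SMTP_PASS="your-app-password"   # prefer an app-specific password
FROM_EMAIL="you@example.com"    # optional; defaults to SMTP_USER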

D. Optional: override config file paths with environment variables

Convenient for multiple configurations with the same codebase:

  • GPUWATCH_CONFIG=/path/to/gpu-watch.env
  • GPUWATCH_NOTIFY_CONFIG=/path/to/notify.env
  • GPUWATCH_SMTP_CONFIG=/path/to/smtp.env
  • GPUWATCH_CACHE=/path/to/cache
  • MYGPU=/path/to/mygpu.sh

Example:

GPUWATCH_CONFIG=/tmp/gpu-watch.env GPUWATCH_NOTIFY_CONFIG=/tmp/notify.env bash bin/gpu_watch.sh --dry-run

🔐 Gmail API Setup

Goal: obtain two files
secret/credentials.json and secret/token.json

The repository's secret/README.md also provides more detailed instructions.

1) Enable the Gmail API

In the Google Cloud Console (under the same project):

  • APIs & Services → Library → Gmail API → Enable

Terminal Preview

2) Create an OAuth Client ID

  • APIs & Services → Credentials → Create Credentials → OAuth client ID
  • Application type: Desktop app
  • Download the JSON and save it as secret/credentials.json. Do not click "Done" yet; make sure you download the file from this page first.

Terminal Preview

Terminal Preview

Terminal Preview

3) Add a test user in Google Auth Platform

  • Click "Audience" and "Add users"
  • Enter your Gmail address

Terminal Preview

4) Generate secret/token.json

Install dependencies:

python -m pip install google-api-python-client google-auth google-auth-oauthlib google-auth-httplib2

Run the following script on the cluster to generate the token (it will print a URL):

BASE="$(pwd)"
BASE="$BASE" python - <<'PY'
import os
from google_auth_oauthlib.flow import InstalledAppFlow

BASE = os.environ["BASE"]
creds = f"{BASE}/secret/credentials.json"
token = f"{BASE}/secret/token.json"
scopes = ["https://www.googleapis.com/auth/gmail.send"]

flow = InstalledAppFlow.from_client_secrets_file(creds, scopes=scopes)
creds_obj = flow.run_local_server(
    host="127.0.0.1",
    port=8765,
    open_browser=False,
    authorization_prompt_message="Open this URL in your browser:\n{url}\n",
    success_message="✅ Auth OK. You can close this tab.",
)

with open(token, "w") as f:
    f.write(creds_obj.to_json())
print("Wrote token:", token)
PY

If you need to open the authorization page in a local browser, a common approach is SSH port forwarding:

ssh -L 8765:127.0.0.1:8765 <user>@<cluster-host>
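
Once secret/token.json exists and config/notify.env points at both files, a forced test mail (see Quickstart) is a quick end-to-end check:

bash bin/gpu_watch.sh --test-mail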

🧩 FAQ

Q1: SMTP is blocked on the cluster. What should I do?

Prefer the Gmail API (HTTPS on port 443), which usually passes through firewalls more easily.

Q2: I only want to be reminded when too many GPUs are occupied

Set MAX_ETIME_MIN to a very large value and rely mainly on GPU_LIMIT.

Q3: I only want to be reminded when a process runs for too long

Set GPU_LIMIT to a very large value and rely mainly on MAX_ETIME_MIN.

Q4: Alerts are too frequent. What should I do?

Increase COOLDOWN_MIN. The same alert signature will not be resent during the cooldown window.

🗂️ Directory Structure

.
├── bin/
│   ├── gpu_watch.sh
│   ├── mygpu.sh
│   ├── send_gmail_api.py
│   └── send_smtp.py
├── config/
│   ├── gpu-watch.env.example
│   ├── notify.env.example
│   └── smtp.env.example
├── cache/                 # runtime de-dup/cooldown records (last_sent/last_sig)
├── secret/                # OAuth-related files (do not share casually)
│   └── README.md
├── LICENSE
├── README.md
└── README.zh-CN.md

🙏 Acknowledgements

  • NVIDIA nvidia-smi
  • Gmail API / OAuth2 authorization flow

❤️ Contact

  • If you have any questions or suggestions, please feel free to open a GitHub issue.
  • You can also reach out to Yuyan Chen.

👥 Organizers

Thanks to the following contributors for organizing this project.

Yukyin
