In shared GPU or multi-user HPC environments, you often run into situations like:
- You are training, but the GPUs are fully occupied by others, and your job slows down sharply or even OOMs;
- You want to quickly know who is using the GPUs on the current machine;
- You want to be notified automatically when GPUs enter certain states (so you do not have to keep staring at nvidia-smi).
The goal of GPU Watchdog is to provide a Bash script with minimal dependencies that runs directly on clusters, periodically reads GPU process information, and notifies you by email when trigger rules fire.
GPU Watchdog is a lightweight GPU monitoring and notification tool:
- Wraps nvidia-smi output via bin/mygpu.sh (lists per-process fields like GPU=... PID=... ETIME=...)
- bin/gpu_watch.sh decides whether to alert based on rules, with de-duplication and cooldown to prevent spamming
- Supports two notification methods:
  - Gmail API (recommended): uses HTTPS (port 443), which clusters are usually more likely to allow
  - SMTP (alternative): use when your cluster allows outbound SMTP
- In auto mode it may try local mail/mailx/sendmail, but whether that works depends on the cluster mail system; in cluster environments, try the Gmail API first.
The terminal output of bin/mygpu.sh includes:
- A list of GPU processes in use, including GPU id, pid, etime, program path, run command, and more
The alert email includes:
- Timestamp
- Trigger reason and predefined rules
- Raw output (helpful for tracing and debugging)
Note: The above screenshots are for demonstration; paths are redacted. In actual runs, real paths will be printed.
Default monitoring logic (see bin/gpu_watch.sh):
- Count the number of GPUs currently in use (deduplicated by GPU id appearing in per-process output)
- Find the single GPU process with the longest runtime (ETIME)
- Alert when any of the following conditions is met:
  - GPU_COUNT > GPU_LIMIT
  - The MAX_ETIME_MIN minute threshold is exceeded (i.e., a process has been running for too long)
It also includes anti-spam mechanisms:
- The same alert will not be sent again within COOLDOWN_MIN minutes
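The default rules above can be sketched in Python for illustration. This is not the code in bin/gpu_watch.sh (which is Bash); the function names are hypothetical, and the sketch assumes each line of mygpu.sh output carries GPU=... and ETIME=... fields with ETIME in the ps-style [[DD-]hh:]mm:ss format.

```python
import re


def etime_to_minutes(etime: str) -> int:
    """Convert a ps-style elapsed time ([[DD-]hh:]mm:ss) to whole minutes.

    Assumption: mygpu.sh reports ETIME in this format.
    """
    days = 0
    if "-" in etime:
        day_part, etime = etime.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in etime.split(":")]
    while len(parts) < 3:  # pad a missing hours field with zero
        parts.insert(0, 0)
    hours, minutes, _seconds = parts
    return days * 1440 + hours * 60 + minutes


def evaluate(lines, gpu_limit, max_etime_min):
    """Return the list of trigger reasons (empty if no rule fires)."""
    gpus = set()  # distinct GPU ids seen in per-process output
    longest = 0   # longest process runtime, in minutes
    for line in lines:
        gpu = re.search(r"\bGPU=(\S+)", line)
        etime = re.search(r"\bETIME=(\S+)", line)
        if gpu:
            gpus.add(gpu.group(1))
        if etime:
            longest = max(longest, etime_to_minutes(etime.group(1)))

    reasons = []
    if len(gpus) > gpu_limit:
        reasons.append(f"GPU_COUNT={len(gpus)} > GPU_LIMIT={gpu_limit}")
    if longest > max_etime_min:
        reasons.append(f"MAX_ETIME={longest}min > MAX_ETIME_MIN={max_etime_min}")
    return reasons
```

An empty list means nothing should be sent; a non-empty list would become the "trigger reason" part of the email.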
git clone https://github.com/Yukyin/gpu-watchdog.git
cd gpu-watchdog
cp config/gpu-watch.env.example config/gpu-watch.env
cp config/notify.env.example config/notify.env # Gmail (recommended)
cp config/smtp.env.example config/smtp.env # SMTP (optional)
bash bin/gpu_watch.sh --dry-run
Note: If no trigger condition is currently met, --dry-run may produce no output; this is normal. To see real-time usage, run bash bin/mygpu.sh.
bash bin/gpu_watch.sh --test-mail
Three commonly used approaches:
Refresh current GPU processes every 60 seconds:
watch -n 60 bash bin/mygpu.sh
Check once every 5 minutes; send an email when rules fire:
while true; do
  bash bin/gpu_watch.sh
  sleep 300
done
You can also keep it running long-term in tmux / screen.
Run once every 5 minutes:
crontab -e
Add a line (replace <repo-dir> with your actual path):
*/5 * * * * cd <repo-dir> && bash bin/gpu_watch.sh >/dev/null 2>&1
Tip: If your cluster nodes reboot or interactive sessions get reclaimed, crontab is often more stable than while true loops.
This project's configuration lives mainly in three files:
- config/gpu-watch.env: trigger rules and recipient
- config/notify.env: notification method selection and Gmail API configuration
- config/smtp.env: SMTP configuration (optional; required when using SMTP)
Environment-variable overrides are also supported, which is useful for reusing one codebase across multiple machines or multiple configuration sets.
This section determines: when to alert, and who to send to.
TO_EMAIL
- Meaning: recipient email address for alert emails.
- Example: TO_EMAIL="you@example.com"
- Note: if empty, the script will error out and exit.
GPU_LIMIT
- Meaning: alert when the number of "GPUs currently in use" is greater than this value.
- Example: GPU_LIMIT=2
- Explanation: the script extracts GPU=... from mygpu.sh output and deduplicates it to compute GPU_COUNT.
MAX_ETIME_MIN
- Meaning: alert when the runtime (ETIME) of any GPU process exceeds this number of minutes.
- Example: MAX_ETIME_MIN=360
- Explanation: the script selects the maximum ETIME across all processes as MAX_ETIME and compares it to the threshold.
COOLDOWN_MIN
- Meaning: cooldown time (minutes) for the same alert signature; no repeated sends within the cooldown window.
- Example: COOLDOWN_MIN=60
- Signature mechanism: SHA256 computed from trigger reasons and raw output, used for de-duplication.
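A minimal Python sketch of the signature-plus-cooldown idea follows. It is an illustration only: the function names, the separator used when joining reasons and raw output, and the way the last signature and send time are passed in are all assumptions, not the script's actual layout.

```python
import hashlib
import time


def alert_signature(reasons, raw_output):
    """SHA256 over the trigger reasons plus the raw output.

    Assumption: the exact way the two parts are concatenated is illustrative.
    """
    payload = "\n".join(reasons) + "\n---\n" + raw_output
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def should_send(sig, last_sig, last_sent, cooldown_min, now=None):
    """Decide whether an alert with signature `sig` should be sent.

    last_sig / last_sent are the previously recorded signature and the
    Unix timestamp of the last send (hypothetical cache contents).
    """
    now = time.time() if now is None else now
    if sig != last_sig:
        return True  # a different alert is never suppressed
    # Same alert: send again only once the cooldown window has elapsed.
    return (now - last_sent) >= cooldown_min * 60
```

With COOLDOWN_MIN=60, an identical alert seen 59 minutes after the last send is suppressed, while one seen 60 minutes later goes out again.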
This section determines: how to send email.
Note: the recipient TO_EMAIL is not here; it is in gpu-watch.env.
NOTIFY_METHOD
- Meaning: notification backend selection.
- Values: gmail_api | smtp | auto
  - gmail_api: force Gmail API (recommended)
  - smtp: force SMTP (requires config/smtp.env to exist and be usable)
  - auto: automatic mode (the script tries available methods according to its own logic)
- Example: NOTIFY_METHOD="gmail_api"
GMAIL_API_CREDENTIALS
- Meaning: absolute path to the OAuth client credentials.json.
- Example: GMAIL_API_CREDENTIALS="/ABS/PATH/to/credentials.json"
GMAIL_API_TOKEN
- Meaning: absolute path to the token.json generated after authorization.
- Example: GMAIL_API_TOKEN="/ABS/PATH/to/token.json"
FROM_EMAIL
- Meaning: sender email address.
- Example: FROM_EMAIL="you@example.com"
- Suggestion: match the Gmail account associated with the token.
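To show how these settings fit together, here is a hedged Python sketch of sending one message through the Gmail API. It is not the contents of bin/send_gmail_api.py; the function names are hypothetical. It assumes the token file was created with the gmail.send scope and that the google-api-python-client and google-auth packages are installed.

```python
import base64
from email.message import EmailMessage


def build_raw_message(sender, recipient, subject, body):
    """Build the request body the Gmail API expects: a base64url-encoded RFC 822 message."""
    msg = EmailMessage()
    msg["From"] = sender        # e.g. the value of FROM_EMAIL
    msg["To"] = recipient       # e.g. the value of TO_EMAIL
    msg["Subject"] = subject
    msg.set_content(body)
    return {"raw": base64.urlsafe_b64encode(msg.as_bytes()).decode("ascii")}


def send_via_gmail_api(token_path, payload):
    """Send a prepared payload using the token at token_path (e.g. GMAIL_API_TOKEN)."""
    # Imported lazily so that building a message does not require the Google libraries.
    from google.oauth2.credentials import Credentials
    from googleapiclient.discovery import build

    scopes = ["https://www.googleapis.com/auth/gmail.send"]
    creds = Credentials.from_authorized_user_file(token_path, scopes)
    service = build("gmail", "v1", credentials=creds)
    return service.users().messages().send(userId="me", body=payload).execute()
```

Keeping message construction separate from the network call makes the payload easy to inspect before anything is sent.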
Required only when NOTIFY_METHOD="smtp" or when the script takes the SMTP path.
- SMTP server host, e.g. smtp.example.com
- Commonly 587 (STARTTLS) or 465 (SSL)
- SMTP username and password (use an app-specific password if possible)
- If not set, the script will default to SMTP_USER
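The two common port choices correspond to two different connection styles. The sketch below illustrates that difference with Python's standard smtplib; it is not the contents of bin/send_smtp.py, the function names are hypothetical, and treating every port other than 465 as STARTTLS is a simplifying assumption.

```python
import smtplib
import ssl


def smtp_mode(port):
    """Pick the connection style from the port: 465 means implicit SSL, otherwise STARTTLS."""
    return "ssl" if int(port) == 465 else "starttls"


def send_via_smtp(host, port, user, password, msg):
    """Send an email.message.EmailMessage through an authenticated SMTP server."""
    context = ssl.create_default_context()
    if smtp_mode(port) == "ssl":
        # Port 465: the TLS handshake happens before any SMTP traffic.
        with smtplib.SMTP_SSL(host, int(port), context=context, timeout=30) as server:
            server.login(user, password)
            server.send_message(msg)
    else:
        # Port 587: connect in plain text, then upgrade with STARTTLS.
        with smtplib.SMTP(host, int(port), timeout=30) as server:
            server.starttls(context=context)
            server.login(user, password)
            server.send_message(msg)
```

If the connection times out on both ports, outbound SMTP is probably blocked on the cluster.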
Convenient for multiple configurations with the same codebase:
- GPUWATCH_CONFIG=/path/to/gpu-watch.env
- GPUWATCH_NOTIFY_CONFIG=/path/to/notify.env
- GPUWATCH_SMTP_CONFIG=/path/to/smtp.env
- GPUWATCH_CACHE=/path/to/cache
- MYGPU=/path/to/mygpu.sh
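The override pattern itself is simple: use the environment variable if it is set, otherwise fall back to a default. A small Python sketch of that lookup is shown here; the repo-relative defaults are an assumption based on the directory layout, and resolve_paths is a hypothetical helper, not part of the project.

```python
import os

# Assumed defaults, relative to the repository root.
DEFAULTS = {
    "GPUWATCH_CONFIG": "config/gpu-watch.env",
    "GPUWATCH_NOTIFY_CONFIG": "config/notify.env",
    "GPUWATCH_SMTP_CONFIG": "config/smtp.env",
    "GPUWATCH_CACHE": "cache",
    "MYGPU": "bin/mygpu.sh",
}


def resolve_paths(repo_dir, env=None):
    """Return the effective path for each setting.

    A non-empty environment variable wins; otherwise the default under repo_dir is used.
    """
    env = os.environ if env is None else env
    return {
        name: env.get(name) or f"{repo_dir}/{rel}"
        for name, rel in DEFAULTS.items()
    }
```

This is what lets one checkout serve several machines: each machine only needs its own set of exported variables.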
Example:
GPUWATCH_CONFIG=/tmp/gpu-watch.env GPUWATCH_NOTIFY_CONFIG=/tmp/notify.env bash bin/gpu_watch.sh --dry-run
Goal: obtain two files, secret/credentials.json and secret/token.json.
The repository's secret/README.md also provides more detailed instructions.
In the Google Cloud Console (under the same project):
- APIs & Services → Library → Gmail API → Enable
- APIs & Services → Credentials → Create Credentials → OAuth client ID
- Application type: Desktop app
- Download the JSON and save it as secret/credentials.json. Do not click "Done" yet; make sure you download on this page.
- Click "Audience" and "Add users"
- Enter your Gmail address
Install dependencies:
python -m pip install google-api-python-client google-auth google-auth-oauthlib google-auth-httplib2
Run the following script on the cluster to generate the token (it will print a URL):
BASE="$(pwd)"
BASE="$BASE" python - <<'PY'
import os
from google_auth_oauthlib.flow import InstalledAppFlow

BASE = os.environ["BASE"]
creds = f"{BASE}/secret/credentials.json"
token = f"{BASE}/secret/token.json"
# Send-only scope: the token cannot read mail.
scopes = ["https://www.googleapis.com/auth/gmail.send"]

flow = InstalledAppFlow.from_client_secrets_file(creds, scopes=scopes)
# Listen on 127.0.0.1:8765 and print the authorization URL instead of opening a browser.
creds_obj = flow.run_local_server(
    host="127.0.0.1",
    port=8765,
    open_browser=False,
    authorization_prompt_message="Open this URL in your browser:\n{url}\n",
    success_message="Auth OK. You can close this tab.",
)
with open(token, "w") as f:
    f.write(creds_obj.to_json())
print("Wrote token:", token)
PY
If you need to open the authorization page in a local browser, a common approach is SSH port forwarding:
ssh -L 8765:127.0.0.1:8765 <user>@<cluster-host>
Prefer Gmail API (HTTPS 443), which is usually easier to pass through firewalls.
To alert only on GPU count, set MAX_ETIME_MIN very large and rely mainly on GPU_LIMIT.
To alert only on long-running processes, set GPU_LIMIT very large and rely mainly on MAX_ETIME_MIN.
To receive fewer repeated emails, increase COOLDOWN_MIN. The same alert signature will not be resent during the cooldown window.
.
├── bin/
│   ├── gpu_watch.sh
│   ├── mygpu.sh
│   ├── send_gmail_api.py
│   └── send_smtp.py
├── config/
│   ├── gpu-watch.env.example
│   ├── notify.env.example
│   └── smtp.env.example
├── cache/ # runtime de-dup/cooldown records (last_sent/last_sig)
├── secret/ # OAuth-related files (do not share casually)
│   └── README.md
├── LICENSE
├── README.md
└── README.zh-CN.md
- NVIDIA nvidia-smi
- Gmail API / OAuth2 authorization flow
- If you have any questions or suggestions, please feel free to open a GitHub issue.
- You can also reach out to Yuyan Chen.
Thanks to the following contributors for organizing this project.






