29 commits
6ffd98c
Add SkyPilot integration for job launching
Dec 5, 2025
17536f3
Add workdir and file_mounts parameters to SkyPilotJob
Dec 5, 2025
b2344bf
fixes
Dec 5, 2025
6803cbf
Add SkyPilot integration example and documentation
Dec 5, 2025
a740ae1
Working example
romilbhardwaj Dec 5, 2025
efa313f
fix
romilbhardwaj Dec 8, 2025
35b6e0e
updates
romilbhardwaj Dec 8, 2025
6a75678
cleanup
romilbhardwaj Dec 8, 2025
fd310d3
cleanup
romilbhardwaj Dec 8, 2025
8511e5a
updates
romilbhardwaj Dec 8, 2025
20d36e8
Extract SkyPilotJob from monarch src
romilbhardwaj Dec 11, 2025
e23bd3f
remove stale changes
romilbhardwaj Dec 11, 2025
40f3a6a
Add DDP and titan examples
romilbhardwaj Dec 11, 2025
2132e3c
Update README.md
romilbhardwaj Dec 11, 2025
0b1e5fd
Clean up, add run_getting_started
romilbhardwaj Dec 12, 2025
3cda869
renaming
romilbhardwaj Dec 12, 2025
32ee2d3
Add DDP notebook
romilbhardwaj Dec 12, 2025
ca7014a
Readme updates
romilbhardwaj Dec 12, 2025
ffe74f5
Updates
romilbhardwaj Dec 12, 2025
8514504
fix mermaid doc
romilbhardwaj Dec 12, 2025
2d83527
Docs updates
romilbhardwaj Dec 13, 2025
4d6ed27
updates
romilbhardwaj Dec 13, 2025
3f3e890
updates
romilbhardwaj Dec 13, 2025
be7818e
Add notes on how to set resources and num nodes
romilbhardwaj Dec 13, 2025
0a443b3
fix ssh command
romilbhardwaj Dec 13, 2025
5fcf775
Update jupyter commands
romilbhardwaj Dec 13, 2025
56e7c8c
Add CPU-only support
romilbhardwaj Dec 13, 2025
236a01a
update docs
romilbhardwaj Dec 19, 2025
8207b92
review comments
romilbhardwaj Dec 19, 2025
1 change: 1 addition & 0 deletions docs/source/examples/README.rst
@@ -8,6 +8,7 @@ Examples
- :doc:`distributed_tensors.py <distributed_tensors>`: Shows how to dispatch tensors and tensor level operations to a distributed mesh of workers and GPUs
- :doc:`debugging.py <debugging>`: Shows how to use the Monarch debugger to debug a distributed program
- `Multinode Slurm Tutorial <https://docs.pytorch.org/tutorials/intermediate/monarch_distributed_tutorial.html>`_: Multinode distributed training tutorial using Monarch and Slurm to run an SPMD training job.
- `Running on Kubernetes using SkyPilot <https://github.com/pytorch-labs/monarch/tree/main/examples/skypilot>`_: Run Monarch on Kubernetes and cloud VMs via SkyPilot.

.. toctree::
:hidden:
4 changes: 2 additions & 2 deletions docs/source/examples/getting_started.py
@@ -145,8 +145,8 @@ def get_value(self) -> int:
# ==============
# When we created our processes before, we spawned them on `this_host()` -- the machine
# running the top-level script. For larger jobs, monarch controls many machines. How these
# machines are obtained depends on the scheduling system (slurm, kubernetes, etc), but these
# schedulers are typically encapsulated in a config file.
# machines are obtained depends on the scheduling system (Slurm, Kubernetes, SkyPilot, etc.),
# but these schedulers are typically encapsulated in a config file.

from monarch.actor import context, HostMesh, hosts_from_config

1 change: 1 addition & 0 deletions docs/source/index.md
@@ -82,3 +82,4 @@ We welcome contributions from the community! If you're interested in contributin
- [Demo notebook](https://github.com/meta-pytorch/monarch/blob/main/examples/presentation/presentation.ipynb)
- [DevX Pytorch tutorial](https://docs.pytorch.org/tutorials/intermediate/monarch_distributed_tutorial.html)
- [Lightning Monarch blog](https://lightning.ai/meta-ai/environments/large-scale-interactive-training-with-monarch)
- [Monarch on Kubernetes using SkyPilot](https://github.com/meta-pytorch/monarch/tree/main/examples/skypilot)
Contributor


We can drop this; I'll just link to the examples in index.rst.

304 changes: 304 additions & 0 deletions examples/skypilot/README.md
@@ -0,0 +1,304 @@
# Running Monarch on Kubernetes and cloud VMs via SkyPilot

This directory contains examples for running Monarch workloads on **Kubernetes and cloud VMs** via [SkyPilot](https://github.com/skypilot-org/skypilot).

## Overview

`SkyPilotJob` provisions cloud instances (or Kubernetes pods) and starts Monarch workers on them, so you can run distributed Monarch actors across multiple pods or VMs.

### Architecture

```mermaid
flowchart TB
subgraph laptop["💻 Your Laptop"]
user["$ sky launch monarch_getting_started.sky.yaml"]
end

subgraph k8s["☸️ Kubernetes Cluster"]
subgraph driver["Driver Pod"]
script["skypilot_getting_started.py"]
skyjob["SkyPilotJob"]
end

subgraph workers["Worker&nbsp;Pods&nbsp;(SkyPilot&nbsp;clusters)"]
subgraph w1["Worker Pod 0"]
mw1["Monarch Worker"]
end
subgraph w2["Worker Pod 1"]
mw2["Monarch Worker"]
end
end
end

user -->|"SkyPilot launches"| driver
script --> skyjob
skyjob -->|"provisioned via SkyPilot"| workers
skyjob <-->|"TCP :22222"| mw1
skyjob <-->|"TCP :22222"| mw2
mw1
mw2
```

**How it works:**
1. You run `sky launch` from your laptop to start the driver pod
2. The driver runs `skypilot_getting_started.py` which creates a `SkyPilotJob`
3. `SkyPilotJob` provisions GPU worker pods via SkyPilot
4. The driver connects to Monarch workers over TCP (port 22222)
5. Actors are spawned on each GPU and execute your distributed code
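
Step 4 can be pictured as the driver turning each worker's IP address into a `tcp://<ip>:22222` endpoint, the same form that appears in the example output in the Quickstart. The sketch below is illustrative only: `worker_endpoints` and `MONARCH_PORT` are hypothetical names, not part of `SkyPilotJob`.

```python
MONARCH_PORT = 22222  # port used throughout this README; the constant name is hypothetical


def worker_endpoints(ips: list[str], port: int = MONARCH_PORT) -> list[str]:
    """Build one TCP endpoint string per worker IP address."""
    return [f"tcp://{ip}:{port}" for ip in ips]


if __name__ == "__main__":
    # IPs copied from the example output; yours will differ.
    print(worker_endpoints(["10.0.4.22", "10.0.4.112"]))
```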

**Supported infra:**
- Kubernetes (any cluster)
- Hyperscalers: AWS, GCP, Azure
- Neoclouds: CoreWeave, Nebius, and [20+ other clouds](https://docs.skypilot.co/en/latest/getting-started/installation.html)

## Quickstart

Prerequisites: Install SkyPilot and verify GPUs are available.
<details>
<summary><strong>SkyPilot Installation</strong></summary>

```bash
# Install SkyPilot with your preferred backend
# (quotes keep shells such as zsh from expanding the brackets)
pip install "skypilot[kubernetes]"  # For Kubernetes
pip install "skypilot[aws]"         # For AWS
pip install "skypilot[gcp]"         # For GCP
pip install "skypilot[all]"         # For all clouds

# Verify SkyPilot setup
sky check

# Verify GPUs available
sky show-gpus --infra kubernetes
```

For more details, see the [SkyPilot documentation](https://docs.skypilot.co/en/latest/getting-started/installation.html).

</details>


Run this command from your local machine to run the getting started example:

```bash
sky launch monarch_getting_started.sky.yaml -c monarch-demo
```

<details>
<summary><strong>💡 Customizing the run (GPU count, CPU-only mode, etc.)</strong></summary>

Run `sky show-gpus --infra kubernetes` to see available GPUs in your cluster, then customize with environment variables:

```bash
# Custom GPU configuration
sky launch monarch_getting_started.sky.yaml -c monarch-demo \
--env NUM_HOSTS=4 \
--env GPUS_PER_HOST=8 \
--env ACCELERATOR="H100:8"

# CPU-only mode (no GPUs required)
sky launch monarch_getting_started.sky.yaml -c monarch-demo \
--env GPUS_PER_HOST=0 \
--env ACCELERATOR=none
```
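
In the commands above, the count in `ACCELERATOR` matches `GPUS_PER_HOST`. A minimal sketch of a pre-launch sanity check for that, assuming the same three environment variables; `parse_accelerator` and `check_config` are hypothetical helpers that are not part of the example scripts, and the defaults are placeholders:

```python
import os
from typing import Mapping, Optional, Tuple


def parse_accelerator(spec: str) -> Tuple[Optional[str], int]:
    """Split "H100:8" into ("H100", 8); "none" means CPU-only."""
    if spec.strip().lower() == "none":
        return None, 0
    name, _, count = spec.partition(":")
    # A bare name such as "H100" is treated as a single accelerator.
    return name, int(count) if count else 1


def check_config(env: Mapping[str, str]) -> dict:
    """Read the launch variables and reject a GPU-count mismatch."""
    num_hosts = int(env.get("NUM_HOSTS", "2"))          # placeholder default
    gpus_per_host = int(env.get("GPUS_PER_HOST", "0"))  # placeholder default
    name, count = parse_accelerator(env.get("ACCELERATOR", "none"))
    if count != gpus_per_host:
        raise ValueError(
            f"ACCELERATOR requests {count} GPU(s) per host, "
            f"but GPUS_PER_HOST={gpus_per_host}"
        )
    return {
        "num_hosts": num_hosts,
        "gpus_per_host": gpus_per_host,
        "accelerator": name,
        "total_gpus": num_hosts * gpus_per_host,
    }


if __name__ == "__main__":
    print(check_config(os.environ))
```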

</details>


On running `sky launch`, SkyPilot will:
1. Launch a Kubernetes pod
2. Install dependencies
3. Sync the example directory with the pod
4. Run `skypilot_getting_started.py` in the pod and stream the logs

<details>
<summary><strong>Example Output</strong></summary>

```
============================================================
Monarch Getting Started with SkyPilot
============================================================

Configuration:
Cloud: kubernetes
Hosts: 2
GPUs per host: 1
Accelerator: H200:1
Cluster name: monarch-skypilot-test

[1] Creating SkyPilot job...

[2] Launching cluster and starting Monarch workers...
No cached job found at path: .monarch/job_state.pkl
Applying current job
Launching SkyPilot cluster 'monarch-skypilot-test' with 2 nodes
Running on cluster: monarch-skypilot-test
SkyPilot cluster 'monarch-skypilot-test' launched successfully
Waiting for job 1 setup to complete (timeout=300s)...
Job 1 status: JobStatus.SETTING_UP (waited 5s)
Job 1 is now RUNNING (setup complete)
Saving job to cache at .monarch/job_state.pkl
Job has started, connecting to current state
Found 2 nodes ready
Connecting to workers for mesh 'trainers': ['tcp://10.0.4.22:22222', 'tcp://10.0.4.112:22222']
Monarch internal logs are being written to /tmp/sky/monarch_log.log; execution id sky_Dec-11_01:31_653
Waiting for host mesh 'trainers' to initialize...
Host mesh 'trainers' initialized successfully
Host mesh 'trainers' ready
Got host mesh with extent: {hosts: 2}

[3] Spawning processes on cloud hosts...
Process mesh extent: {hosts: 2, gpus: 1}

[4] Spawning Counter actors...

[5] Broadcasting increment to all counters...

[6] Getting counter values...
Counter values: ValueMesh({hosts: 2, gpus: 1}):
(({'hosts': 0/2, 'gpus': 0/1}, 3), ({'hosts': 1/2, 'gpus': 0/1}, 3))

[7] Spawning Trainer actors...

[8] Performing distributed training step...
({'hosts': 0/2, 'gpus': 0/1}, "Trainer {'hosts': 0/2, 'gpus': 0/1} taking a step.")
({'hosts': 1/2, 'gpus': 0/1}, "Trainer {'hosts': 1/2, 'gpus': 0/1} taking a step.")

[9] Getting trainer info...
({'hosts': 0/2, 'gpus': 0/1}, "Trainer at rank {'hosts': 0/2, 'gpus': 0/1}")
({'hosts': 1/2, 'gpus': 0/1}, "Trainer at rank {'hosts': 1/2, 'gpus': 0/1}")

============================================================
Success! Monarch actors ran on SkyPilot cluster!
============================================================

[10] Cleaning up SkyPilot cluster...
Tearing down SkyPilot cluster 'monarch-skypilot-test'
Cluster 'monarch-skypilot-test' terminated
Cluster terminated.
```
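
The `Waiting for job 1 setup to complete (timeout=300s)` and `(waited 5s)` lines suggest a poll-until-ready loop. A generic sketch of that pattern follows; it is not the actual `SkyPilotJob` code, and `wait_until` and its parameters are hypothetical:

```python
import time
from typing import Callable


def wait_until(
    is_ready: Callable[[], bool],
    timeout: float = 300.0,
    interval: float = 5.0,
) -> float:
    """Poll is_ready() until it returns True and return the seconds waited.

    Raises TimeoutError if the timeout elapses first.
    """
    start = time.monotonic()
    while True:
        if is_ready():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout:
            raise TimeoutError(f"not ready after {timeout:.0f}s")
        time.sleep(interval)
```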

</details>

When done, clean up with:
```bash
sky down monarch-demo
```


<details>
<summary><strong>Running from within the Kubernetes cluster</strong></summary>

If you are already in the Kubernetes cluster you'd like to run workers on, you can directly run `skypilot_getting_started.py`.

```bash
# With GPUs
python skypilot_getting_started.py --cloud kubernetes --num-hosts 2 --gpus-per-host 8 --accelerator "H200:8"

# CPU-only (no GPUs)
python skypilot_getting_started.py --cloud kubernetes --num-hosts 2 --gpus-per-host 0 --accelerator none
```
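
For orientation, here is a sketch of how a script could accept flags like those above with `argparse`. The flag names are taken from the commands shown; the defaults and the `build_parser` helper are assumptions, and the real `skypilot_getting_started.py` may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for the four flags used in the commands above (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Launch Monarch workers via SkyPilot (sketch)")
    parser.add_argument("--cloud", default="kubernetes")
    parser.add_argument("--num-hosts", type=int, default=2)
    parser.add_argument("--gpus-per-host", type=int, default=1)
    parser.add_argument("--accelerator", default="H200:1")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    cpu_only = args.gpus_per_host == 0
    print(f"{args.num_hosts} host(s) on {args.cloud}, cpu_only={cpu_only}")
```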

</details>

### Running the DDP Jupyter Notebook

To run the `skypilot_ddp.ipynb` notebook interactively, first launch a driver pod and then connect via SSH port forwarding:

```bash
# 1. Launch a driver pod (without running a script)
sky launch monarch_getting_started.sky.yaml -c monarch-demo

# 2. SSH into the pod with port forwarding for Jupyter
ssh monarch-demo -L 8888:localhost:8888

# 3. Inside the pod, start Jupyter Notebook
cd ~/sky_workdir
uv pip install --system jupyter
jupyter notebook --no-browser --port=8888 --ip=0.0.0.0 --NotebookApp.token='' --NotebookApp.password='' --allow-root
```

Then open http://localhost:8888 in your browser and open `skypilot_ddp.ipynb`.

When done, clean up with:
```bash
sky down monarch-demo
```

## SkyPilotJob Class

SkyPilotJob allows you to run Monarch on Kubernetes and cloud VMs via SkyPilot.

Example usage:

```python
import sky
from skypilot_job import SkyPilotJob
from monarch.actor import Actor, endpoint

class MyActor(Actor):
    @endpoint
    def hello(self) -> str:
        return "Hello from the cloud!"

# Create a SkyPilot job with 2 nodes
job = SkyPilotJob(
    meshes={"workers": 2},
    resources=sky.Resources(
        cloud=sky.Kubernetes(),  # or sky.AWS(), sky.GCP(), etc.
        accelerators="H100:1",
    ),
    cluster_name="my-monarch-cluster",
    idle_minutes_to_autostop=10,
    down_on_autostop=True,
)

# Launch and connect
state = job.state()
hosts = state.workers

# Spawn processes and actors
procs = hosts.spawn_procs(per_host={"gpus": 1})
actors = procs.spawn("my_actors", MyActor)

# Use your actors
results = actors.hello.call().get()
print(results)  # ["Hello from the cloud!", "Hello from the cloud!"]

# Clean up
job.kill()
```
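
A launched cluster keeps running until it is torn down or autostops, so it can help to make sure `job.kill()` runs even if the workload raises. A minimal sketch using only the `state()` and `kill()` calls shown above; the `running` helper is hypothetical and not part of `skypilot_job`:

```python
from contextlib import contextmanager


@contextmanager
def running(job):
    """Yield job.state() and always call job.kill() on exit, even on error."""
    try:
        yield job.state()
    finally:
        job.kill()


# Usage (assumes `job` is a SkyPilotJob built as in the example above):
#
#     with running(job) as state:
#         procs = state.workers.spawn_procs(per_host={"gpus": 1})
#         ...
```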

### Network Requirements

The client must have direct network connectivity to the worker nodes:
- **Kubernetes**: Run the client inside the same cluster (e.g., in a pod)
- **Cloud VMs**: Ensure security groups allow inbound traffic on port 22222
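
To check connectivity from the client before launching actors, a plain TCP probe is usually enough. A minimal sketch using only the standard library; `can_reach` is a hypothetical helper, and the worker IP in the usage comment is a placeholder:

```python
import socket


def can_reach(host: str, port: int = 22222, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Usage (replace the address with one of your worker IPs):
#
#     if not can_reach("10.0.4.22"):
#         print("Worker port 22222 is not reachable from this machine")
```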


### Default Image

By default, `SkyPilotJob` uses the `pytorch/pytorch:2.9.1-cuda12.8-cudnn9-runtime` Docker image which has compatible system libraries for `torchmonarch-nightly`.

## Troubleshooting tips

**Check SkyPilot setup:**
```bash
sky check
sky show-gpus
```

**View cluster logs:**
```bash
sky logs <cluster-name>
```

**SSH into a worker:**
```bash
ssh <cluster-name>
```

**Clean up clusters:**
```bash
sky down <cluster-name>
sky down --all # Remove all clusters
```

23 changes: 23 additions & 0 deletions examples/skypilot/__init__.py
@@ -0,0 +1,23 @@
"""
Contributor


I will let @colin2328 and @johnwhumphreys chime in as well, but IMO this package should probably go in a subdir here:

https://github.com/meta-pytorch/monarch/tree/49547b120c8902e4383b48072529b85c81187d78/python/monarch/_src/job

I think the package dir structure for independent contributors should have job/contrib or job/providers

WDYT?

Author


Agree, it seems like JobTrait implementations should be in the monarch package. Some more discussion here.

We do something similar for community contributed clouds in SkyPilot: they are included as a part of sky.clouds module.

However, for Monarch it introduces the overhead of building Monarch workers from source to keep client/worker versions in sync (discussion, example of extra build steps that a worker needs to run).

Any recommendations on how to add a new module to monarch.job while still letting workers use wheels from torchmonarch-nightly? Trying to avoid having workers build from src during dev/testing of this PR.

Contributor


Is this resolved now that we have moved the module outside in a separate dir?

SkyPilot integration for Monarch.

This is a standalone package that provides SkyPilotJob - a way to run Monarch
workloads on Kubernetes and cloud VMs via SkyPilot.

This package is separate from the main Monarch codebase to allow independent
iteration and to avoid chicken-and-egg problems with releases.

Usage:
    import sky
    from skypilot_job import SkyPilotJob

    job = SkyPilotJob(
        meshes={"workers": 2},
        resources=sky.Resources(cloud=sky.Kubernetes(), accelerators="H100:1"),
    )
    state = job.state()
"""

from .skypilot_job import SkyPilotJob

__all__ = ["SkyPilotJob"]
