Skip to content

Conversation

@romilbhardwaj
Copy link

Hi Monarch community 👋

This PR adds support for running Monarch workloads on Kubernetes and cloud VMs via SkyPilot.

Adds a new SkyPilotJob class that provisions Monarch workers on K8s (or any of the 20+ cloud providers supported by SkyPilot):

job = SkyPilotJob(
    meshes={"workers": 2},
    resources=sky.Resources(cloud=sky.Kubernetes(), accelerators="H200:1"), # SkyPilot resources object https://skypilot.readthedocs.io/en/latest/reference/yaml-spec.html#resources
    cluster_name="my-cluster",
    file_mounts={"/local/path": "/worker/path"}, # For any code/data sync
)

With this I was able to run the getting started example from Monarch docs on my multi-node Coreweave Kubernetes cluster with H200s. Full example in examples/skypilot_getting_started.py.

Notes:

  • Cold start: Currently builds Monarch from source on workers to keep client/worker versions in sync (~7-10 min). A pre-built container image would reduce this to ~30 seconds.
  • Networking: Assumes workers are in the same network as the client. If client-worker communication is not very chatty, SkyPilot's SSH tunneling could remove this requirement.

This is an early PR - I'm looking for feedback on the direction. I'm a maintainer of the SkyPilot project and was trying to run Monarch on my k8s cluster. Figured an integration with SkyPilot would be a good fit.

Happy to iterate on the implementation if this aligns with where you'd like Monarch to go.

Romil Bhardwaj and others added 10 commits December 5, 2025 06:49
Summary:
This update introduces two new optional parameters, `workdir` and `file_mounts`, to the `SkyPilotJob` class. The `workdir` parameter allows users to specify a local directory to sync with the cluster, while `file_mounts` enables additional file mounts by mapping remote paths to local paths. These enhancements improve the flexibility and usability of job configurations in SkyPilot.
- Add SKY_README.md with comprehensive documentation:
  - Architecture overview
  - Implementation details
  - Usage examples
  - Troubleshooting guide
  - Networking considerations for Kubernetes

- Add python/examples/skypilot_getting_started.py:
  - Example script demonstrating Monarch actors on SkyPilot
  - Supports multiple clouds (Kubernetes, AWS, GCP, Azure)
  - Configurable via command-line arguments

- Update skypilot.py:
  - Add host mesh initialization wait
  - Improve logging for debugging
  - Fix worker command environment variable setup
@meta-cla
Copy link

meta-cla bot commented Dec 8, 2025

Hi @romilbhardwaj!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

@romilbhardwaj romilbhardwaj marked this pull request as draft December 8, 2025 06:16
@johnwhumphreys
Copy link
Contributor

@romilbhardwaj - I just saw this; sorry for missing it all day. I'll check it out in the morning and circle back :).

@colin2328
Copy link
Contributor

very cool! is this ready for review? (if so push it out of draft state?)

sudo apt-get update

# Install system dependencies
sudo apt-get install -y \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

skypilot requires new system level dependencies?

Copy link
Author

@romilbhardwaj romilbhardwaj Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hey @colin2328, these are dependencies for monarch. SkyPilot's dependencies are handled by the SkyPilot runtime during setup.

They're needed because this PR currently builds Monarch from source on the workers to keep client/worker versions in sync during development. Once this is merged and a Monarch wheel is released to PyPI, the setup simplifies to just pip install torchmonarch on workers.

The current approach is a temporary workaround for the chicken-and-egg problem: SkyPilotJob lives in Monarch source, but there's no released wheel with it yet.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Question on architecture: How should new schedulers extend Monarch's JobTrait?

Should they:

  1. Live in Monarch source (python/monarch/_src/job/)
    • Pro: Clean import path, UX for users (from monarch.job import SkyPilotJob)
    • Con: Requires building from source during dev until wheel is released, may need version pinning after release?
  2. Live in a separate package (user code)
    • Pro: Don't need to rebuild monarch
    • Con: Different import pattern (from <userpkg> import SkyPilotJob?)
  3. Something else I'm missing.

Currently I have the SkyPilot implementation in python/monarch/_src/job/skypilot.py, but I can refactor it. Any guidance here would be appreciated 🙏

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc: @zdevito as this is an interesting question.

My intuition is that they should live in python/monarch/_src/job even though writing a job or changing the API will requires building from source for testing purposes. The testing cost is only paid by the developer, not the end user.

IIUC, keeping it in a separate package will still require installation on the workers while the code is in development. However it will not require building rust libraries which are probably the main source of the build latency.

It would be nice for JobTrait implementers to not require to build rust binaries while testing their code. I wonder if simply a cp of the new files into site-packages dir on the workers would work, though it seems like a hack.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Job implementations should be buildable on top of Monarch's public APIs. So, in general, I would favor the latter pattern, which also has good DevEx. I don't think we should accumulate many job implementations in core Monarch.

Note that this is distinct from the question of whether these are part of the Monarch "ecosystem" packages. In other words, we might maintain them in Monarch itself, but build and distribute them as separate packages.

We have previously talked about using pip flavors to bundle such dependencies for this purpose, e.g.,

$ pip install torchmonarch[skypilot]

etc.

@romilbhardwaj
Copy link
Author

Thanks @colin2328 @johnwhumphreys - I'll polish the PR and get it ready for review. Have a question about the architecture here, happy to do whatever aligns with Monarch's recommended design.

import os
import sys

# Set timeouts before importing monarch
Copy link
Contributor

@shayne-fletcher shayne-fletcher Dec 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

am working on enabling

from monarch.config import configure

configure(
  host_spawn_ready_timeout="300s",
  message_delivery_timeout="300s",
  mesh_proc_spawn_max_idle="300s"
)

to replace the use of env-vars here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @shayne-fletcher! That'll be super helpful for backends where cold starts can be > 30s.

@romilbhardwaj
Copy link
Author

Monarch build was slowing down worker launches, so I've refactored along the lines of option 2 in 20d36e8. Let me know if this is not the right direction :)

Tested on Coreweave k8s cluster with H200s. PR is still a draft, will finish up the examples (torchtitan and ddp) + readme and mark ready for review soon!

@colin2328
Copy link
Contributor

@romilbhardwaj "the author does not have a valid CLA on file." can you acept the CLA? and publish?

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Dec 11, 2025
@meta-cla
Copy link

meta-cla bot commented Dec 11, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

Usage:
# Run on Kubernetes:
python getting_started.py --cloud kubernetes --num-hosts 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/getting_started/skypilot_getting_started?

@@ -0,0 +1,23 @@
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will let @colin2328 and @johnwhumphreys chime in as well, but IMO this package should probably go in a subdir here:

https://github.com/meta-pytorch/monarch/tree/49547b120c8902e4383b48072529b85c81187d78/python/monarch/_src/job

I think the package dir structure for independent contributors should have job/contrib or job/providers

WDYT?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, it seems like JobTrait implementations should be in the monarch package. Some more discussion here.

We do something similar for community contributed clouds in SkyPilot: they are included as a part of sky.clouds module.

However for Monarch, it introduces the overhead of build monarch workers from source to keep client/worker versions in sync (discussion, example of extra build steps that a worker needs to run).

Any recommendations on how to add a new module to monarch.job while still letting workers use wheels from torchmonarch-nightly? Trying to avoid having workers build from src during dev/testing of this PR.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this resolved now that we have moved the module outside in a separate dir?

@meta-codesync
Copy link

meta-codesync bot commented Dec 11, 2025

@colin2328 has imported this pull request. If you are a Meta employee, you can view this in D88994552.

try:
import sky
except ImportError:
print("ERROR: SkyPilot is not installed. Run: pip install skypilot[kubernetes]")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add this as an optional dependency in pyproject.toml if we move skypilot_job.py from examples to the job dir

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed!

@ahmadsharif1
Copy link
Contributor

I am also wondering how to test this when we move this outside of the examples dir and into the main package.

@romilbhardwaj How does skypilot do end to end testing in its CI? Maybe we should have a minikube/kind test setup in our CI to test this if we expose this to users. We don't want silent breakages if users are going to use this.

@romilbhardwaj
Copy link
Author

Cleaned up the readme, added an architecture diagram and included a DDP notebook inspired by the slurm example.

Tested on coreweave k8s with H200s.

This PR is ready for review now! cc @ahmadsharif1 @colin2328 @johnwhumphreys

@romilbhardwaj romilbhardwaj marked this pull request as ready for review December 13, 2025 01:24
@romilbhardwaj
Copy link
Author

romilbhardwaj commented Dec 13, 2025

@romilbhardwaj How does skypilot do end to end testing in its CI? Maybe we should have a minikube/kind test setup in our CI to test this if we expose this to users. We don't want silent breakages if users are going to use this.

Great point. For our release testing, our CI spins up k8s clusters on EKS, GKE, CW and Nebius and runs our smoke test suite which includes model training examples from pytorch, ray, nemo and other popular frameworks.

For lightweight CPU-only tests, some parts of our CI uses kind (via sky local up) to create a local k8s cluster. If it is of interest, shouldn't be too hard to edit this example to run on CPUs only for the CI tests.

@romilbhardwaj
Copy link
Author

Added some flags to allow the example to run on CPU-only k8s clusters

Copy link
Contributor

@ahmadsharif1 ahmadsharif1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good for the most part. You can wait for Colin to take a look as well.

I think he wants this to be part of the release, so you can address them and merge if you want

@@ -0,0 +1,23 @@
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this resolved now that we have moved the module outside in a separate dir?

python -c "import monarch; print(f'Monarch installed: {monarch}')"
python -c "import sky; print(f'SkyPilot installed: {sky}')"
# Configure SkyPilot to use in-cluster Kubernetes context
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am curious as to what permissions the driver pod has. How can it launch worker pods using sky? Where are those permissions configured?

Copy link
Author

@romilbhardwaj romilbhardwaj Dec 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By default, SkyPilot automatically creates a service account with minimal permissions it needs and attaches it to the pod. You can find the permissions here: https://docs.skypilot.co/en/latest/cloud-setup/cloud-permissions/kubernetes.html

These permissions also allow SkyPilot running inside the driver pod to create worker pods.

Optionally, the user can override and specify their own custom service account SkyPilot config.

# Configure SkyPilot to use in-cluster Kubernetes context
# This allows the driver pod to launch nested SkyPilot clusters
unset SKYPILOT_IN_CLUSTER_CONTEXT_NAME
sky api start
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to view the YAML that skypilot itself generates and applies to kubernetes (that contains the permissions, etc.)? Just curious.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep! If you run sky launch with export SKYPILOT_DEBUG=1, the terminal output will include a long k8s manifest that will be applied. It's also stored in SkyPilot internal state db.

@@ -0,0 +1,269 @@
#!/usr/bin/env python3
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about this, but maybe rename this to example_driver.py or something more descriptive?

Comment on lines +224 to +225
logger.error(f"Failed to launch SkyPilot cluster: {e}")
raise RuntimeError(f"Failed to launch SkyPilot cluster: {e}") from e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if line 215 succeeds and line 222 fails? Do we still need to clean that up?

logger.info(f"SkyPilot cluster '{cluster_name}' launched successfully")

# Wait for the job to be RUNNING (setup complete, run started)
self._wait_for_job_running(cluster_name, job_id, timeout=JOB_TIMEOUT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, should you clean something up if this timeout fails if there is state on the kubernetes cluster

@@ -0,0 +1,457 @@
"""
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am also wondering if this package should be in a subdir outside of the notebooks, etc. I'll leave it to you

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that'd be a good idea, especially if we later want to support @mariusae's suggestion of using pip flavors like pip install torchmonarch[skypilot]

@romilbhardwaj
Copy link
Author

Thanks for the reviews @ahmadsharif1! I'll wait for a bit for @colin2328's feedback before working through the comments and doing a round of final testing.

Also I noticed the linter was failing - do I need to take any action? Couldn't find instructions to run the linter myself.

- [Demo notebook](https://github.com/meta-pytorch/monarch/blob/main/examples/presentation/presentation.ipynb)
- [DevX Pytorch tutorial](https://docs.pytorch.org/tutorials/intermediate/monarch_distributed_tutorial.html)
- [Lightning Monarch blog](https://lightning.ai/meta-ai/environments/large-scale-interactive-training-with-monarch)
- [Monarch on Kubernetes using Skypilot](https://github.com/meta-pytorch/monarch/tree/main/examples/skypilot)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can drop this, leave I'll just link to the examples in index.rst

@colin2328
Copy link
Contributor

colin2328 commented Dec 19, 2025

@romilbhardwaj , thanks. I'll fix the lints and push for you. please address the comments first, rest is fine to me.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. kubernetes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants