-
Notifications
You must be signed in to change notification settings - Fork 120
Add Kubernetes support via SkyPilot #2083
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Add Kubernetes support via SkyPilot #2083
Conversation
Summary: This update introduces two new optional parameters, `workdir` and `file_mounts`, to the `SkyPilotJob` class. The `workdir` parameter allows users to specify a local directory to sync with the cluster, while `file_mounts` enables additional file mounts by mapping remote paths to local paths. These enhancements improve the flexibility and usability of job configurations in SkyPilot.
- Add SKY_README.md with comprehensive documentation: - Architecture overview - Implementation details - Usage examples - Troubleshooting guide - Networking considerations for Kubernetes - Add python/examples/skypilot_getting_started.py: - Example script demonstrating Monarch actors on SkyPilot - Supports multiple clouds (Kubernetes, AWS, GCP, Azure) - Configurable via command-line arguments - Update skypilot.py: - Add host mesh initialization wait - Improve logging for debugging - Fix worker command environment variable setup
|
Hi @romilbhardwaj! Thank you for your pull request and welcome to our community. Action RequiredIn order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks! |
|
@romilbhardwaj - I just saw this; sorry for missing it all day. I'll check it out in the morning and circle back :). |
|
very cool! is this ready for review? (if so push it out of draft state?) |
python/monarch/_src/job/skypilot.py
Outdated
| sudo apt-get update | ||
|
|
||
| # Install system dependencies | ||
| sudo apt-get install -y \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
skypilot requires new system level dependencies?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hey @colin2328, these are dependencies for monarch. SkyPilot's dependencies are handled by the SkyPilot runtime during setup.
They're needed because this PR currently builds Monarch from source on the workers to keep client/worker versions in sync during development. Once this is merged and a Monarch wheel is released to PyPI, the setup simplifies to just pip install torchmonarch on workers.
The current approach is a temporary workaround for the chicken-and-egg problem: SkyPilotJob lives in Monarch source, but there's no released wheel with it yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question on architecture: How should new schedulers extend Monarch's JobTrait?
Should they:
- Live in Monarch source (python/monarch/_src/job/)
- Pro: Clean import path, UX for users (
from monarch.job import SkyPilotJob) - Con: Requires building from source during dev until wheel is released, may need version pinning after release?
- Pro: Clean import path, UX for users (
- Live in a separate package (user code)
- Pro: Don't need to rebuild monarch
- Con: Different import pattern (
from <userpkg> import SkyPilotJob?)
- Something else I'm missing.
Currently I have the SkyPilot implementation in python/monarch/_src/job/skypilot.py, but I can refactor it. Any guidance here would be appreciated 🙏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc: @zdevito as this is an interesting question.
My intuition is that they should live in python/monarch/_src/job even though writing a job or changing the API will requires building from source for testing purposes. The testing cost is only paid by the developer, not the end user.
IIUC, keeping it in a separate package will still require installation on the workers while the code is in development. However it will not require building rust libraries which are probably the main source of the build latency.
It would be nice for JobTrait implementers to not require to build rust binaries while testing their code. I wonder if simply a cp of the new files into site-packages dir on the workers would work, though it seems like a hack.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Job implementations should be buildable on top of Monarch's public APIs. So, in general, I would favor the latter pattern, which also has good DevEx. I don't think we should accumulate many job implementations in core Monarch.
Note that this is distinct from the question of whether these are part of the Monarch "ecosystem" packages. In other words, we might maintain them in Monarch itself, but build and distribute them as separate packages.
We have previously talked about using pip flavors to bundle such dependencies for this purpose, e.g.,
$ pip install torchmonarch[skypilot]
etc.
|
Thanks @colin2328 @johnwhumphreys - I'll polish the PR and get it ready for review. Have a question about the architecture here, happy to do whatever aligns with Monarch's recommended design. |
examples/skypilot/skypilot_ddp.py
Outdated
| import os | ||
| import sys | ||
|
|
||
| # Set timeouts before importing monarch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
am working on enabling
from monarch.config import configure
configure(
host_spawn_ready_timeout="300s",
message_delivery_timeout="300s",
mesh_proc_spawn_max_idle="300s"
)to replace the use of env-vars here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @shayne-fletcher! That'll be super helpful for backends where cold starts can be > 30s.
|
Monarch build was slowing down worker launches, so I've refactored along the lines of option 2 in 20d36e8. Let me know if this is not the right direction :) Tested on Coreweave k8s cluster with H200s. PR is still a draft, will finish up the examples (torchtitan and ddp) + readme and mark ready for review soon! |
|
@romilbhardwaj "the author does not have a valid CLA on file." can you acept the CLA? and publish? |
|
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks! |
| Usage: | ||
| # Run on Kubernetes: | ||
| python getting_started.py --cloud kubernetes --num-hosts 2 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/getting_started/skypilot_getting_started?
| @@ -0,0 +1,23 @@ | |||
| """ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will let @colin2328 and @johnwhumphreys chime in as well, but IMO this package should probably go in a subdir here:
I think the package dir structure for independent contributors should have job/contrib or job/providers
WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, it seems like JobTrait implementations should be in the monarch package. Some more discussion here.
We do something similar for community contributed clouds in SkyPilot: they are included as a part of sky.clouds module.
However for Monarch, it introduces the overhead of build monarch workers from source to keep client/worker versions in sync (discussion, example of extra build steps that a worker needs to run).
Any recommendations on how to add a new module to monarch.job while still letting workers use wheels from torchmonarch-nightly? Trying to avoid having workers build from src during dev/testing of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this resolved now that we have moved the module outside in a separate dir?
|
@colin2328 has imported this pull request. If you are a Meta employee, you can view this in D88994552. |
examples/skypilot/skypilot_ddp.py
Outdated
| try: | ||
| import sky | ||
| except ImportError: | ||
| print("ERROR: SkyPilot is not installed. Run: pip install skypilot[kubernetes]") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we should add this as an optional dependency in pyproject.toml if we move skypilot_job.py from examples to the job dir
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed!
|
I am also wondering how to test this when we move this outside of the examples dir and into the main package. @romilbhardwaj How does skypilot do end to end testing in its CI? Maybe we should have a minikube/kind test setup in our CI to test this if we expose this to users. We don't want silent breakages if users are going to use this. |
|
Cleaned up the readme, added an architecture diagram and included a DDP notebook inspired by the slurm example. Tested on coreweave k8s with H200s. This PR is ready for review now! cc @ahmadsharif1 @colin2328 @johnwhumphreys |
Great point. For our release testing, our CI spins up k8s clusters on EKS, GKE, CW and Nebius and runs our smoke test suite which includes model training examples from pytorch, ray, nemo and other popular frameworks. For lightweight CPU-only tests, some parts of our CI uses kind (via |
|
Added some flags to allow the example to run on CPU-only k8s clusters |
ahmadsharif1
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good for the most part. You can wait for Colin to take a look as well.
I think he wants this to be part of the release, so you can address them and merge if you want
| @@ -0,0 +1,23 @@ | |||
| """ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this resolved now that we have moved the module outside in a separate dir?
| python -c "import monarch; print(f'Monarch installed: {monarch}')" | ||
| python -c "import sky; print(f'SkyPilot installed: {sky}')" | ||
| # Configure SkyPilot to use in-cluster Kubernetes context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am curious as to what permissions the driver pod has. How can it launch worker pods using sky? Where are those permissions configured?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By default, SkyPilot automatically creates a service account with minimal permissions it needs and attaches it to the pod. You can find the permissions here: https://docs.skypilot.co/en/latest/cloud-setup/cloud-permissions/kubernetes.html
These permissions also allow SkyPilot running inside the driver pod to create worker pods.
Optionally, the user can override and specify their own custom service account SkyPilot config.
| # Configure SkyPilot to use in-cluster Kubernetes context | ||
| # This allows the driver pod to launch nested SkyPilot clusters | ||
| unset SKYPILOT_IN_CLUSTER_CONTEXT_NAME | ||
| sky api start |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to view the YAML that skypilot itself generates and applies to kubernetes (that contains the permissions, etc.)? Just curious.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep! If you run sky launch with export SKYPILOT_DEBUG=1, the terminal output will include a long k8s manifest that will be applied. It's also stored in SkyPilot internal state db.
| @@ -0,0 +1,269 @@ | |||
| #!/usr/bin/env python3 | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about this, but maybe rename this to example_driver.py or something more descriptive?
| logger.error(f"Failed to launch SkyPilot cluster: {e}") | ||
| raise RuntimeError(f"Failed to launch SkyPilot cluster: {e}") from e |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if line 215 succeeds and line 222 fails? Do we still need to clean that up?
| logger.info(f"SkyPilot cluster '{cluster_name}' launched successfully") | ||
|
|
||
| # Wait for the job to be RUNNING (setup complete, run started) | ||
| self._wait_for_job_running(cluster_name, job_id, timeout=JOB_TIMEOUT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, should you clean something up if this timeout fails if there is state on the kubernetes cluster
| @@ -0,0 +1,457 @@ | |||
| """ | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am also wondering if this package should be in a subdir outside of the notebooks, etc. I'll leave it to you
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that'd be a good idea, especially if we later want to support @mariusae's suggestion of using pip flavors like pip install torchmonarch[skypilot]
|
Thanks for the reviews @ahmadsharif1! I'll wait for a bit for @colin2328's feedback before working through the comments and doing a round of final testing. Also I noticed the linter was failing - do I need to take any action? Couldn't find instructions to run the linter myself. |
| - [Demo notebook](https://github.com/meta-pytorch/monarch/blob/main/examples/presentation/presentation.ipynb) | ||
| - [DevX Pytorch tutorial](https://docs.pytorch.org/tutorials/intermediate/monarch_distributed_tutorial.html) | ||
| - [Lightning Monarch blog](https://lightning.ai/meta-ai/environments/large-scale-interactive-training-with-monarch) | ||
| - [Monarch on Kubernetes using Skypilot](https://github.com/meta-pytorch/monarch/tree/main/examples/skypilot) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can drop this, leave I'll just link to the examples in index.rst
|
@romilbhardwaj , thanks. I'll fix the lints and push for you. please address the comments first, rest is fine to me. |
Hi Monarch community 👋
This PR adds support for running Monarch workloads on Kubernetes and cloud VMs via SkyPilot.
Adds a new
SkyPilotJobclass that provisions Monarch workers on K8s (or any of the 20+ cloud providers supported by SkyPilot):With this I was able to run the getting started example from Monarch docs on my multi-node Coreweave Kubernetes cluster with H200s. Full example in
examples/skypilot_getting_started.py.Notes:
This is an early PR - I'm looking for feedback on the direction. I'm a maintainer of the SkyPilot project and was trying to run Monarch on my k8s cluster. Figured an integration with SkyPilot would be a good fit.
Happy to iterate on the implementation if this aligns with where you'd like Monarch to go.