2 changes: 1 addition & 1 deletion .github/workflows/publish_release.yml
@@ -44,7 +44,7 @@ jobs:
# Setup Tensor Engine dependencies
setup_tensor_engine

cargo install --path monarch_hyperactor
cargo install --path monarch_hyperactor_bin

# Build wheel
export MONARCH_PACKAGE_NAME="torchmonarch"
8 changes: 5 additions & 3 deletions .github/workflows/test-gpu-python.yml
@@ -36,8 +36,8 @@ jobs:
# Source common setup functions
source scripts/common-setup.sh

# Setup test environment
setup_test_environment
# Setup conda environment
setup_conda_environment

# Setup Tensor Engine dependencies
setup_tensor_engine
@@ -52,13 +52,15 @@ jobs:
# Install the built wheel from artifact
install_wheel_from_artifact

# Install test dependencies (without triggering a build)
install_python_test_dependencies

# tests the type_assert statements in test_python_actor are correct
# pyre currently does not check these assertions
pyright python/tests/test_python_actors.py

# Run GPU Python tests split into 10 groups sequentially
# Each group runs separately with process cleanup in between
pip install pytest-split

# Run tests with test_actor_error disabled
run_test_groups 0
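
The body of `run_test_groups` lives in `scripts/common-setup.sh` and is not shown in this diff. As a rough sketch of what splitting into 10 sequential groups with `pytest-split` could look like, the snippet below builds one `pytest` invocation per group using that plugin's `--splits` and `--group` options. The helper name `split_commands` and the test directory default are assumptions for illustration, not code from this repository.

```python
import sys


def split_commands(test_dir: str, groups: int = 10) -> list[list[str]]:
    """Build one pytest invocation per pytest-split group.

    pytest-split numbers its groups starting at 1, so the range is 1..groups.
    """
    return [
        [
            sys.executable, "-m", "pytest", test_dir,
            "--splits", str(groups),
            "--group", str(group),
        ]
        for group in range(1, groups + 1)
    ]


# Print the commands instead of running them; a CI script could pass each
# argv to subprocess.run() and clean up leftover processes between groups.
for argv in split_commands("python/tests/"):
    print(" ".join(argv))
```

Running each group as its own process is what allows cleanup between groups, at the cost of repeated interpreter and plugin startup.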
2 changes: 1 addition & 1 deletion .github/workflows/wheels.yml
@@ -44,7 +44,7 @@ jobs:
# Setup Tensor Engine dependencies
setup_tensor_engine

cargo install --path monarch_hyperactor
cargo install --path monarch_hyperactor_bin --no-default-features

# Build wheel
export MONARCH_PACKAGE_NAME="torchmonarch-nightly"
1 change: 1 addition & 0 deletions .python-version
@@ -0,0 +1 @@
3.12
1 change: 1 addition & 0 deletions Cargo.toml
@@ -13,6 +13,7 @@ members = [
"monarch_conda",
"monarch_extension",
"monarch_hyperactor",
"monarch_hyperactor_bin",
"monarch_messages",
"monarch_perfetto_trace",
"monarch_rdma",
59 changes: 37 additions & 22 deletions README.md
@@ -3,12 +3,18 @@
**Monarch** is a distributed programming framework for PyTorch based on scalable
actor messaging. It provides:

1. Remote actors with scalable messaging: Actors are grouped into collections called meshes and messages can be broadcast to all members.
2. Fault tolerance through supervision trees: Actors and processes form a tree and failures propagate up the tree, providing good default error behavior and enabling fine-grained fault recovery.
3. Point-to-point RDMA transfers: cheap registration of any GPU or CPU memory in a process, with the one-sided transfers based on libibverbs
4. Distributed tensors: actors can work with tensor objects sharded across processes

Monarch code imperatively describes how to create processes and actors using a simple python API:
1. Remote actors with scalable messaging: Actors are grouped into collections
called meshes and messages can be broadcast to all members.
2. Fault tolerance through supervision trees: Actors and processes form a tree
and failures propagate up the tree, providing good default error behavior and
enabling fine-grained fault recovery.
3. Point-to-point RDMA transfers: cheap registration of any GPU or CPU memory in
a process, with the one-sided transfers based on libibverbs
4. Distributed tensors: actors can work with tensor objects sharded across
processes

Monarch code imperatively describes how to create processes and actors using a
simple python API:

```python
from monarch.actor import Actor, endpoint, this_host
@@ -33,8 +39,9 @@ fut = trainers.train.call(step=0)
fut.get()
```


The [introduction to monarch concepts](https://meta-pytorch.org/monarch/generated/examples/getting_started.html) provides an introduction to using these features.
The
[introduction to monarch concepts](https://meta-pytorch.org/monarch/generated/examples/getting_started.html)
provides an introduction to using these features.

> ⚠️ **Early Development Warning** Monarch is currently in an experimental
> stage. You should expect bugs, incomplete features, and APIs that may change
@@ -45,16 +52,21 @@ The [introduction to monarch concepts](https://meta-pytorch.org/monarch/generate

## 📖 Documentation

View Monarch's hosted documentation [at this link](https://meta-pytorch.org/monarch/).
View Monarch's hosted documentation
[at this link](https://meta-pytorch.org/monarch/).

## Installation
Note for running distributed tensors and RDMA, the local torch version must match the version that monarch was built with.
Stable and nightly distributions require libmxl and libibverbs (runtime).

Note for running distributed tensors and RDMA, the local torch version must
match the version that monarch was built with. Stable and nightly distributions
require libmxl and libibverbs (runtime).

## Fedora

`sudo dnf install -y libibverbs rdma-core libmlx5 libibverbs-devel rdma-core-devel`

## Ubuntu

`sudo apt install -y rdma-core libibverbs1 libmlx5-1 libibverbs-dev`

### Stable
@@ -64,14 +76,15 @@ Stable and nightly distributions require libmxl and libibverbs (runtime).
torchmonarch stable is built with the latest stable torch.

### Nightly

`pip install torchmonarch-nightly`

torchmonarch-nightly is built with torch nightly.

### Build and Install from Source

If you're building Monarch from source, you should be building it with the nightly PyTorch as well for ABI compatibility.

If you're building Monarch from source, you should be building it with the
nightly PyTorch as well for ABI compatibility.

#### On Fedora distributions

@@ -161,10 +174,11 @@ pip list | grep monarch

#### On non-CUDA machines

You can also build Monarch to run on non-CUDA machines, e.g. locally on a MacOS system.

Note that this does not support tensor engine, which is tied to CUDA and RDMA (via ibverbs).
You can also build Monarch to run on non-CUDA machines, e.g. locally on a MacOS
system.

Note that this does not support tensor engine, which is tied to CUDA and RDMA
(via ibverbs).

```sh

@@ -180,8 +194,6 @@ rustup default nightly
# Install build dependencies
pip install --pre torch --index-url https://download.pytorch.org/whl/nightly/cpu
pip install -r build-requirements.txt
# Install test dependencies
pip install -r python/tests/requirements.txt

# Build and install Monarch
USE_TENSOR_ENGINE=0 pip install --no-build-isolation .
@@ -192,10 +204,10 @@ USE_TENSOR_ENGINE=0 pip install --no-build-isolation -e .
pip list | grep monarch
```


## Running examples

Check out the `examples/` directory for demonstrations of how to use Monarch's APIs.
Check out the `examples/` directory for demonstrations of how to use Monarch's
APIs.

We'll be adding more examples as we stabilize and polish functionality!

@@ -205,6 +217,7 @@ We have both Rust and Python unit tests. Rust tests are run with `cargo-nextest`
and Python tests are run with `pytest`.

Rust tests:

```sh
# We use cargo-nextest to run our tests, as they can provide strong process isolation
# between every test.
@@ -213,12 +226,14 @@ Rust tests:
cargo install cargo-nextest --locked
cargo nextest run
```

cargo-nextest supports all of the filtering flags of "cargo test".

Python tests:

```sh
# Make sure to install test dependencies first
pip install -r python/tests/requirements.txt
# Install test dependencies if not already installed
pip install -e '.[test]'
# Run unit tests. consider -s for more verbose output
pytest python/tests/ -v -m "not oss_skip"
```
2 changes: 1 addition & 1 deletion monarch_extension/Cargo.toml
@@ -30,7 +30,7 @@ monarch_rdma_extension = { version = "0.0.0", path = "../monarch_rdma/extension"
monarch_tensor_worker = { version = "0.0.0", path = "../monarch_tensor_worker", optional = true }
nccl-sys = { path = "../nccl-sys", optional = true }
ndslice = { version = "0.0.0", path = "../ndslice" }
pyo3 = { version = "0.24", features = ["anyhow", "multiple-pymethods", "py-clone"] }
pyo3 = { version = "0.24", features = ["anyhow", "extension-module", "multiple-pymethods", "py-clone"] }
rdmaxcel-sys = { path = "../rdmaxcel-sys", optional = true }
serde = { version = "1.0.219", features = ["derive", "rc"] }
tokio = { version = "1.47.1", features = ["full", "test-util", "tracing"] }
15 changes: 2 additions & 13 deletions monarch_hyperactor/Cargo.toml
@@ -1,4 +1,4 @@
# @generated by autocargo from //monarch/monarch_hyperactor:[monarch_hyperactor,monarch_hyperactor_test_bootstrap,process_allocator-oss,test_monarch_hyperactor]
# @generated by autocargo from //monarch/monarch_hyperactor:[monarch_hyperactor,test_monarch_hyperactor]

[package]
name = "monarch_hyperactor"
@@ -7,15 +7,6 @@ authors = ["Meta"]
edition = "2021"
license = "BSD-3-Clause"

[[bin]]
name = "monarch_hyperactor_test_bootstrap"
path = "test/bootstrap.rs"
edition = "2024"

[[bin]]
name = "process_allocator"
edition = "2024"

[[test]]
name = "test_monarch_hyperactor"
path = "tests/lib.rs"
@@ -26,7 +17,6 @@ async-once-cell = "0.4.2"
async-trait = "0.1.86"
bincode = "1.3.3"
bytes = { version = "1.10", features = ["serde"] }
clap = { version = "4.5.42", features = ["derive", "env", "string", "unicode", "wrap_help"] }
erased-serde = "0.4.9"
fastrand = "2.1.1"
fbinit = { version = "0.2.0", git = "https://github.com/facebookexperimental/rust-shed.git", branch = "main" }
@@ -39,7 +29,6 @@ hyperactor_telemetry = { version = "0.0.0", path = "../hyperactor_telemetry" }
inventory = "0.3.21"
lazy_errors = "0.10.1"
lazy_static = "1.5"
libc = "0.2.139"
monarch_conda = { version = "0.0.0", path = "../monarch_conda" }
monarch_types = { version = "0.0.0", path = "../monarch_types" }
ndslice = { version = "0.0.0", path = "../ndslice" }
@@ -61,7 +50,7 @@ buck-resources = "1"
dir-diff = "0.3"

[features]
default = []
default = ["pyo3/extension-module"]
packaged_rsync = []

[lints]
31 changes: 31 additions & 0 deletions monarch_hyperactor_bin/Cargo.toml
@@ -0,0 +1,31 @@
# @generated by autocargo from //monarch/monarch_hyperactor_bin:[monarch_hyperactor_test_bootstrap,process_allocator-oss]

[package]
name = "monarch_monarch_hyperactor_bin"
version = "0.0.0"
authors = ["Meta"]
edition = "2021"
license = "BSD-3-Clause"

[[bin]]
name = "monarch_hyperactor_test_bootstrap"
path = "test/bootstrap.rs"
edition = "2024"

[[bin]]
name = "process_allocator"
edition = "2024"

[dependencies]
anyhow = "1.0.98"
clap = { version = "4.5.42", features = ["derive", "env", "string", "unicode", "wrap_help"] }
hyperactor = { version = "0.0.0", path = "../hyperactor" }
hyperactor_mesh = { version = "0.0.0", path = "../hyperactor_mesh" }
libc = "0.2.139"
monarch_hyperactor = { version = "0.0.0", path = "../monarch_hyperactor", default-features = false }
pyo3 = { version = "0.24", features = ["anyhow", "multiple-pymethods", "py-clone"] }
tokio = { version = "1.47.1", features = ["full", "test-util", "tracing"] }
tracing = { version = "0.1.41", features = ["attributes", "valuable"] }

[dev-dependencies]
ndslice = { version = "0.0.0", path = "../ndslice" }
5 changes: 4 additions & 1 deletion monarch_messages/Cargo.toml
@@ -14,10 +14,13 @@ enum-as-inner = "0.6.0"
hyperactor = { version = "0.0.0", path = "../hyperactor" }
monarch_types = { version = "0.0.0", path = "../monarch_types" }
ndslice = { version = "0.0.0", path = "../ndslice" }
pyo3 = { version = "0.24", features = ["anyhow", "multiple-pymethods", "py-clone"] }
pyo3 = { version = "0.24", features = ["anyhow", "extension-module", "multiple-pymethods", "py-clone"] }
serde = { version = "1.0.219", features = ["derive", "rc"] }
serde_bytes = "0.11"
thiserror = "2.0.12"
torch-sys-cuda = { version = "0.0.0", path = "../torch-sys-cuda" }
torch-sys2 = { version = "0.0.0", path = "../torch-sys2" }
tracing = { version = "0.1.41", features = ["attributes", "valuable"] }

[features]
default = []
5 changes: 4 additions & 1 deletion monarch_rdma/extension/Cargo.toml
@@ -18,7 +18,10 @@ hyperactor = { version = "0.0.0", path = "../../hyperactor" }
hyperactor_mesh = { version = "0.0.0", path = "../../hyperactor_mesh" }
monarch_hyperactor = { version = "0.0.0", path = "../../monarch_hyperactor" }
monarch_rdma = { version = "0.0.0", path = ".." }
pyo3 = { version = "0.24", features = ["anyhow", "multiple-pymethods", "py-clone"] }
pyo3 = { version = "0.24", features = ["anyhow", "extension-module", "multiple-pymethods", "py-clone"] }
serde = { version = "1.0.219", features = ["derive", "rc"] }
serde_json = { version = "1.0.140", features = ["alloc", "float_roundtrip", "raw_value", "unbounded_depth"] }
tracing = { version = "0.1.41", features = ["attributes", "valuable"] }

[features]
default = []
5 changes: 4 additions & 1 deletion monarch_tensor_worker/Cargo.toml
@@ -20,7 +20,7 @@ monarch_messages = { version = "0.0.0", path = "../monarch_messages" }
monarch_types = { version = "0.0.0", path = "../monarch_types" }
ndslice = { version = "0.0.0", path = "../ndslice" }
parking_lot = { version = "0.12.1", features = ["send_guard"] }
pyo3 = { version = "0.24", features = ["anyhow", "multiple-pymethods", "py-clone"] }
pyo3 = { version = "0.24", features = ["anyhow", "extension-module", "multiple-pymethods", "py-clone"] }
serde = { version = "1.0.219", features = ["derive", "rc"] }
sorted-vec = "0.8.3"
tokio = { version = "1.47.1", features = ["full", "test-util", "tracing"] }
@@ -32,3 +32,6 @@ tracing-subscriber = { version = "0.3.20", features = ["chrono", "env-filter", "
[dev-dependencies]
rand = { version = "0.8", features = ["small_rng"] }
timed_test = { version = "0.0.0", path = "../timed_test" }

[features]
default = []
5 changes: 4 additions & 1 deletion monarch_types/Cargo.toml
@@ -10,11 +10,14 @@ license = "BSD-3-Clause"
[dependencies]
derive_more = { version = "1.0.0", features = ["full"] }
hyperactor = { version = "0.0.0", path = "../hyperactor" }
pyo3 = { version = "0.24", features = ["anyhow", "multiple-pymethods", "py-clone"] }
pyo3 = { version = "0.24", features = ["anyhow", "extension-module", "multiple-pymethods", "py-clone"] }
serde = { version = "1.0.219", features = ["derive", "rc"] }
serde_bytes = "0.11"

[dev-dependencies]
anyhow = "1.0.98"
timed_test = { version = "0.0.0", path = "../timed_test" }
tokio = { version = "1.47.1", features = ["full", "test-util", "tracing"] }

[features]
default = []