madengine v2 with unified framework for local and distribution#57

Open
coketaste wants to merge 284 commits into main from coketaste/refactor-dis

@coketaste
Contributor

Motivation

Technical Details

Test Plan

Test Result

Submission Checklist

coketaste added 30 commits July 6, 2025 17:26
… node, run on GPU node; and legacy run with madengine
…ndent tests on CPU-only machines, while avoiding mock context failures on build-only nodes
…he multi-node arguments are properly stored in docker_env_vars, included in build_manifest.json, and will be available to the runtime containers with the --nproc_per_node value resolved based on the actual GPU count detected at runtime.
… update Ansible and k8s to work as infrastructure as code.
… structure that emphasizes its core strengths in MAD package integration and distributed model execution
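
The commit message above about multi-node arguments describes a two-phase flow: the arguments are recorded in `docker_env_vars` and `build_manifest.json` at build time, and `--nproc_per_node` is resolved from the GPU count detected at run time. A minimal sketch of that idea follows. It is not taken from the PR: the placeholder string, the `MULTI_NODE_ARGS` key, the function names, and the extra `--nnodes`/`--node_rank` arguments are all assumptions made for illustration, and the real manifest schema may differ.

```python
import json

# Hypothetical placeholder; the PR does not show the real marker or schema.
GPU_COUNT_PLACEHOLDER = "${DETECTED_GPU_COUNT}"


def build_manifest(nnodes: int, node_rank: int) -> dict:
    """Build-time step: record multi-node args without knowing the GPU count."""
    launcher_args = (
        f"--nnodes={nnodes} --node_rank={node_rank} "
        f"--nproc_per_node={GPU_COUNT_PLACEHOLDER}"
    )
    # "MULTI_NODE_ARGS" is an assumed key name, used only for this sketch.
    return {"docker_env_vars": {"MULTI_NODE_ARGS": launcher_args}}


def resolve_env_vars(manifest: dict, gpu_count: int) -> dict:
    """Run-time step: substitute the GPU count detected on the run node."""
    return {
        key: value.replace(GPU_COUNT_PLACEHOLDER, str(gpu_count))
        for key, value in manifest["docker_env_vars"].items()
    }


manifest = build_manifest(nnodes=2, node_rank=0)
# Round-trip through JSON to mimic writing and reading build_manifest.json.
serialized = json.dumps(manifest, indent=2)
resolved = resolve_env_vars(json.loads(serialized), gpu_count=8)
print(resolved["MULTI_NODE_ARGS"])
# --nnodes=2 --node_rank=0 --nproc_per_node=8
```

Deferring the substitution this way would let a build-only node without GPUs produce the manifest, while each run node fills in its own GPU count.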
@coketaste coketaste changed the title New madengine CLI with unified interface for local and distribution madengine v2 with unified interface for local and distribution Jan 8, 2026
@coketaste coketaste changed the title madengine v2 with unified interface for local and distribution madengine v2 with unified framework for local and distribution Jan 8, 2026
…r different bottleneck types

- Hardware counter definitions for compute, memory, and communication analysis
- Ready-to-use configuration files for single-GPU, multi-GPU, and multi-node setups
- Perfetto visualization support for timeline analysis
- Full custom command support via cmd and env_vars fields
- Automatic rocprof/rocprofv3 detection via existing wrapper script
- Comprehensive documentation with examples for every scenario
…rofiling with rocprof v3 and stacked tools chain
… updated the tables of execution results and performance results

2 participants