First of all, thank you for your amazing work on the nnScaler project. It has been incredibly inspiring, and I’ve been learning from this repository and using it in my own work.
I have a few suggestions and questions:
1. Performance Evaluation:
It would be great if you could provide sample code or a reference implementation showing how to evaluate computation and communication performance with the profiling tool.
2. Subset Parallelization Support:
Following the README, I have successfully implemented end-to-end parallelization of an entire model. However, I’m wondering whether it is possible to parallelize only a subset of the model and generate parallelized code for just that portion.
If this functionality already exists, could you provide documentation or examples on how to use it?