
Conversation

@Aishwarya-Tonpe commented on Aug 28, 2025

This PR adds support for deterministic training and reproducible logging to all PyTorch model benchmarks in SuperBench (BERT, GPT2, LLaMA, LSTM, CNN, Mixtral).

Deterministic mode: Makes model runs repeatable by fixing random seeds, turning off TF32, and forcing deterministic math operations (a minimal sketch of what this involves is shown after the option list below).
Log generation: Saves key metrics such as per-step loss and activation mean during training.
Log comparison: Lets you compare a new run against a previous one to check that the metrics match (a rough illustration of the comparison is sketched after the Usage steps below).
New command-line options:

--enable-determinism
--generate-log : boolean flag; when enabled, stores the comparison metrics (loss and activation mean) in the results file
--compare-log : path of the JSON results file against which the current run is compared
--check-frequency
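
For context, here is a minimal sketch of what deterministic mode typically involves in PyTorch; the function name and the exact seed/env-var choices are illustrative, not necessarily the PR's implementation:

```python
import os
import random

import numpy as np
import torch


def enable_determinism(seed: int = 42):
    """Illustrative setup for reproducible PyTorch runs (not this PR's exact code)."""
    # Fix all random seeds so weight init and data shuffling repeat across runs.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Turn off TF32 so matmuls/convolutions run in full fp32 precision.
    torch.backends.cuda.matmul.allow_tf32 = False
    torch.backends.cudnn.allow_tf32 = False
    # Force deterministic kernels; cuBLAS needs a workspace config for this.
    os.environ.setdefault('CUBLAS_WORKSPACE_CONFIG', ':4096:8')
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    torch.use_deterministic_algorithms(True)
```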

Changes -

Updated pytorch_base.py to handle deterministic settings, logging, and comparisons.
Added a new example script: pytorch_deterministic_example.py
Added a test file: test_pytorch_determinism_all.py to verify everything works as expected.

Usage -

Run with --enable-determinism --generate-log to create a reference log.
Run again with --compare-log to check if the new run matches the reference.
Make sure all parameters stay the same between runs.
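
As a rough illustration of what the offline comparison does (the file layout and metric key names here are assumptions, not the PR's actual schema):

```python
import json
import math


def compare_logs(reference_path: str, current_metrics: dict, rel_tol: float = 0.0) -> bool:
    """Compare per-step metrics of the current run against a reference results file.

    ``current_metrics`` is assumed to map metric names (e.g. 'loss_step_0') to
    floats; the real results.json layout may differ.
    """
    with open(reference_path) as f:
        reference = json.load(f)

    mismatches = []
    for name, value in current_metrics.items():
        ref_value = reference.get(name)
        # With rel_tol=0 this is an exact match; raise rel_tol to tolerate drift.
        if ref_value is None or not math.isclose(value, ref_value, rel_tol=rel_tol, abs_tol=0.0):
            mismatches.append((name, ref_value, value))

    for name, ref_value, value in mismatches:
        print(f'Mismatch in {name}: reference={ref_value}, current={value}')
    return not mismatches
```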

- Add _enable_deterministic_training() method to set all necessary seeds
- Add --deterministic and --random_seed command line arguments
- Integrate deterministic training in _create_model() and _generate_dataset()
- Add comprehensive unit tests for deterministic functionality
- Tests validate parameter parsing, functionality, and regression scenarios
- All tests pass and integrate with existing SuperBench test suite
…pass check_frequency to _is_finished in train/infer; add test capturing checksum log; stabilize fp32 loss path and small-dims determinism tests
…oss BERT/GPT2/CNN/LSTM/Mixtral; per-step fp32 loss logging; checksum logs; tests updated to strict/soft determinism pattern; add strict determinism CI guidance
…rings; fix GPT-2 params; soft vs strict checks stabilized
…sum tests with BERT pattern, improve docstrings and skip logic.
…/CNN/BERT/Mixtral with periodic fingerprints, per-step loss capture, TF32 off, SDPA math kernel; add model_log_utils; update examples and tests, add env gating for cuBLAS.
@Aishwarya-Tonpe requested a review from a team as a code owner on August 28, 2025 17:41

@Aishwarya-Tonpe please read the following Contributor License Agreement (CLA). If you agree with the CLA, please reply with the following information.

@microsoft-github-policy-service agree [company="{your company}"]

Options:

  • (default - no company specified) I have sole ownership of intellectual property rights to my Submissions and I am not making Submissions in the course of work for my employer.
@microsoft-github-policy-service agree
  • (when company given) I am making Submissions in the course of work for my employer (or my employer has intellectual property rights in my Submissions by contract or applicable law). I have permission from my employer to make Submissions and enter into this Agreement on behalf of my employer. By signing below, the defined term “You” includes me and my employer.
@microsoft-github-policy-service agree company="Microsoft"

Contributor License Agreement

@microsoft-github-policy-service agree company="Microsoft"

root and others added 29 commits December 8, 2025 22:21
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…es not need to be set explicitly before running the benchmarks

@abuccts (Member) left a comment

The metadata and compare log functions still seem to be unnecessary.

  • For the compare-log function, it just checks whether the loss etc. in each step are equal, which is a special case of result analysis. I think you can re-use the current result analysis module and write some yaml configs to perform this comparison, rather than writing new code to do it during the online benchmark run. Besides, there are several scenarios that the current compare-log function cannot cover:

    1. In large-scale training, the all-reduce usually produces accumulated errors due to different reduction orders among runs, so tolerating a range of differences is necessary in analysis/comparison, which can be easily configured in the yaml configs of the result analysis module.
    2. In validation, the results may need to be compared either to a baseline or to the results of other nodes. The current compare-log only performs a one-to-one comparison against pre-defined results, and cannot compare loss between different nodes in one run.
  • For metadata, all settings should already be included in the benchmark config. When users compare loss results from two runs, they should guarantee the configs used are the same, just as when comparing performance results. You may also write the necessary metadata into metrics so that result analysis can compare it as well.

Currently, all benchmarks in SuperBench only record the related metrics during each run in the benchmark module, the runner then collects all metrics after each run in the runner module, and analysis/comparison is performed offline after all benchmarks finish in the result analysis module.

Therefore, it would be better for determinism support in model benchmarks to follow the same process:

  1. Write the necessary results (e.g., loss, metadata) into metrics for each rank in the pytorch benchmark during each run (a rough sketch follows below).
  2. Rely on the existing results collection process in the runner module to collect results from each rank, rather than doing ad-hoc all-reduce/all-gather in the benchmark.
  3. Rely on the existing result analysis module to compare the results offline. If any comparison function is missing, it would be better to support it generally in result analysis so that determinism in micro-benchmarks can also re-use it in the future.

Besides, please fix the unit tests accordingly.
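
For illustration only, step 1 above might look roughly like the following; the result object and its add_metric call are hypothetical stand-ins for whatever metric-recording API the benchmark base class actually provides:

```python
def record_determinism_metrics(result, rank, step_losses, activation_means):
    """Record per-rank, per-step determinism data as ordinary metrics.

    ``result.add_metric`` is a placeholder, not SuperBench's real API; the point
    is that each rank only writes its own values, and the runner plus the result
    analysis module handle collection and comparison offline.
    """
    for step, loss in enumerate(step_losses):
        result.add_metric(f'loss_rank{rank}_step{step}', float(loss))
    for step, act_mean in enumerate(activation_means):
        result.add_metric(f'activation_mean_rank{rank}_step{step}', float(act_mean))
```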

Comment on lines +41 to +44
- `--enable-determinism`: Enables deterministic computation for reproducible results.
- `--deterministic_seed <seed>`: Sets the seed for reproducibility.
- `--generate_log` : Boolean flag that stores comparison metrics in the results file
- `--compare_log <results_file_path>`: Specifies the path to the reference file for comparison.

unify them to use either underscore or dash?

Comment on lines +229 to +230
def _save_consolidated_deterministic_results(self):
"""Gather deterministic data from all ranks and save to results-summary (rank 0 only).

All results from all ranks will be aggregated to the control node by the runner after the benchmarks finish, so I don't think this function is necessary.

Loads the reference results.json file and compares deterministic metrics
(loss, activation mean) per-rank to verify reproducibility.
"""
import torch.distributed as dist

why not import at the beginning

# Synchronize failure status across all ranks in distributed mode
if self._args.distributed_impl == DistributedImpl.DDP:
# Convert failure status to tensor for all_reduce
import torch

torch is already imported in this file; why import it again?

Comment on lines +310 to +311
failure_tensor = torch.tensor([1 if has_failure else 0], dtype=torch.int32, device='cuda')
dist.all_reduce(failure_tensor, op=dist.ReduceOp.MAX)

will this work for cpu mode?
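
(For reference, a device-agnostic variant of the failure sync might look like the sketch below; this assumes the process group backend can be gloo for CPU runs and is not code from this PR.)

```python
import torch
import torch.distributed as dist


def sync_failure_status(has_failure: bool) -> bool:
    """Return True on every rank if any rank reported a failure."""
    if not (dist.is_available() and dist.is_initialized()):
        return has_failure
    # NCCL requires CUDA tensors; gloo works with CPU tensors.
    device = 'cuda' if dist.get_backend() == 'nccl' else 'cpu'
    failure = torch.tensor([1 if has_failure else 0], dtype=torch.int32, device=device)
    dist.all_reduce(failure, op=dist.ReduceOp.MAX)
    return bool(failure.item())
```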

Comment on lines -194 to +198
if self._is_finished(curr_step, end, check_frequency):
return duration
if self._is_finished(curr_step, end):
return duration, self._finalize_periodic_logging(periodic)

This will change the behavior when the run is stopped by duration rather than by step number.
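
(For reference, one way to keep both stop conditions while still gating on check_frequency is sketched below; the attribute names num_steps, duration, and _start_time are assumptions about the base class, not the file's actual fields.)

```python
def _is_finished(self, curr_step, end_time, check_frequency=1):
    """Stop on step count or elapsed time, checked only at check_frequency boundaries.

    Gating on check_frequency keeps every rank stopping (and logging its
    periodic fingerprint) at exactly the same step. The attribute names used
    below are illustrative assumptions, not the benchmark's real fields.
    """
    if check_frequency > 1 and curr_step % check_frequency != 0:
        return False
    reached_steps = self._args.num_steps > 0 and curr_step >= self._args.num_steps
    reached_duration = self._args.duration > 0 and (end_time - self._start_time) >= self._args.duration
    return reached_steps or reached_duration
```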

Comment on lines -120 to +124
if self._is_finished(curr_step, end, check_frequency):
return duration
if self._is_finished(curr_step, end):
return duration, self._finalize_periodic_logging(periodic)

Same here: this will change the behavior when the run is stopped by duration rather than by step number.


Return:
The step-time list of every training step.
A tuple of (step_times_ms, info) of every training step.

missing one space in indent

Comment on lines -188 to +192
if self._is_finished(curr_step, end, check_frequency):
return duration
if self._is_finished(curr_step, end):
return duration, self._finalize_periodic_logging(periodic)

same

end = self._timer()
curr_step += 1
if curr_step > self._args.num_warmup:
# Save the step time of every training/inference step, unit is millisecond.

why remove this

Labels: benchmarks (SuperBench Benchmarks), model-benchmarks (Model Benchmark Test for SuperBench Benchmarks)
