[AIROCMLIR-43] tuningRunner improvements - Add state file for crash recovery and improve logging #2208

Open
mirza-halilcevic wants to merge 34 commits into develop from tuning-logging

@mirza-halilcevic commented Jan 18, 2026

Motivation

Improve crash recovery, informational output, and error reporting.

Technical Details

  • Introduce a state file that tracks each config's state. It is used to detect crashes and to skip configs that repeatedly fail or crash across multiple runs.
  • Use the Python logging module instead of print statements.
  • Introduce a --verbose flag for debug output. Keep --debug only for debug file generation.
  • Support reading configs from stdin.
  • Improve error reporting and let fatal exceptions propagate.
  • Save the elapsed tuning time to the output file and track the ETA.
  • Introduce a --timeout flag to set a timeout for the tuning-driver.
  • Add tuningSpace, commitId, timestamp, and durationSec as fields in the output.
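
As a rough illustration of how the logging and timeout items above could fit together, here is a minimal sketch. It is not the code from tuningRunner.py: the function names `setup_logging` and `run_with_timeout`, the logger name, and the returned status strings are all hypothetical, and the only library calls used are the standard `logging` and `subprocess` APIs.

```python
import logging
import subprocess

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("tuning_sketch")


def setup_logging(verbose):
    """Map a --verbose style boolean to DEBUG, otherwise INFO."""
    level = logging.DEBUG if verbose else logging.INFO
    # force=True (Python 3.8+) replaces any handlers configured earlier.
    logging.basicConfig(level=level, format="%(levelname)s: %(message)s", force=True)
    return level


def run_with_timeout(cmd, timeout_sec):
    """Run a command and return 'done', 'failed', or 'timeout' (assumed labels)."""
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_sec
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before raising TimeoutExpired.
        logger.error("timed out after %ss: %s", timeout_sec, cmd)
        return "timeout"
    if result.returncode != 0:
        logger.error("exit code %d: %s", result.returncode, result.stderr.strip())
        return "failed"
    logger.debug("stdout: %s", result.stdout.strip())
    return "done"
```

Returning a status instead of raising keeps a single slow or failing config from aborting the whole run, while genuinely fatal exceptions (anything other than `TimeoutExpired` here) still propagate to the caller.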

Test Plan

This branch was used to create the tuning databases from which the quick-tune lists in #2212 were generated.

Test Result

Submission Checklist

Copilot AI left a comment

Pull request overview

This PR enhances the tuning infrastructure with state file management for crash recovery and comprehensive logging improvements. The changes enable the tuner to track configuration states (running, failed, crashed, interrupted), persist them across runs, and recover gracefully from interruptions or crashes.
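
One way such a state file might work is sketched below, under stated assumptions. The file layout (a flat JSON object mapping a config string to a status), the helper names, and the "done" status are inventions of this sketch, not taken from the PR. The `retry_failed` parameter loosely mirrors the --retry-failed flag. The idea is that a config is marked "running" before it starts, so an entry still marked "running" at the next startup indicates a crash.

```python
import json
import os
import tempfile

# Hypothetical status names; "running", "failed", and "crashed" follow the
# description above, "done" is an assumption of this sketch.
SKIPPABLE = {"failed", "crashed"}


def load_state(path):
    """Load the JSON state file, treating leftover 'running' entries as crashes."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        state = json.load(f)
    # A config still marked "running" means the previous process died
    # before it could record a result.
    for config, status in state.items():
        if status == "running":
            state[config] = "crashed"
    return state


def save_state(path, state):
    """Write to a temp file and rename, so a crash mid-write keeps the old file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f, indent=2)
    os.replace(tmp_path, path)


def should_skip(state, config, retry_failed=False):
    """Skip finished configs, and failed/crashed ones unless a retry is requested."""
    status = state.get(config)
    if status == "done":
        return True
    if status in SKIPPABLE:
        return not retry_failed
    return False
```

The write-then-rename step matters because the state file is itself the crash-recovery mechanism: `os.replace` swaps the file in one step, so a reader sees either the old contents or the new ones.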

Changes:

  • Added JSON state file mechanism to track tuning progress and enable crash recovery
  • Introduced structured logging with color-coded output and tqdm integration
  • Enhanced error reporting with detailed context and formatted output
  • Added --retry-failed flag to selectively retry failed/crashed configs
  • Improved progress tracking with ETA estimation based on median completion times
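
The median-based ETA mentioned in the last item can be sketched as follows. This is a guess at the general approach, not the PR's implementation; `estimate_eta` and its parameters are hypothetical names.

```python
import statistics


def estimate_eta(durations, remaining):
    """Estimate seconds left as median completed duration times configs remaining.

    Returns None when no config has completed yet.
    """
    if not durations:
        return None
    return statistics.median(durations) * remaining


# Three configs finished in 2 s, 3 s, and 100 s; ten configs remain.
print(estimate_eta([2.0, 3.0, 100.0], 10))  # prints 30.0
```

A median is less sensitive than a mean to a few very slow configs, such as ones that hit a timeout. With the sample values above, a mean-based estimate would be 350 seconds instead of 30.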

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Summary per file (file: description):

  • mlir/utils/performance/tuningRunner.py: Core implementation of state management, logging infrastructure, and enhanced error handling
  • mlir/utils/performance/perfRunner.py: Simplified tuning database reader to handle variable column counts
  • mlir/utils/jenkins/Jenkinsfile.downstream: Removed --quiet flag from CI tuning commands
  • mlir/utils/jenkins/Jenkinsfile: Removed --quiet flag from fusion tuning commands
  • mlir/lib/Dialect/Rock/Tuning/RockTuningImpl.cpp: Removed obsolete comment about hidden warning

@mirza-halilcevic changed the title from "tuningRunner improvements - Add state file for crash recovery and improve logging" to "[AIROCMLIR-43] tuningRunner improvements - Add state file for crash recovery and improve logging" on Jan 23, 2026

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.


@dorde-antic left a comment

lgtm
