@salexspb salexspb commented Dec 30, 2025

This fixes a bug in _WindowShuffleDatasetIterator where the _init flag was not included in get_state()/set_state(), causing incorrect behavior when restoring checkpoints to a fresh iterator.

The bug manifested as:

  • When a checkpoint was restored to a fresh iterator, the _init flag remained True (its value at construction), regardless of its value when the checkpoint was taken
  • On the next window fill after restoration, _maybe_update_window_index() saw _init=True and did not increment window_index
  • As a result, the same window_index was used twice, which produced incorrect shuffle seeds and a data mismatch between the original and restored runs
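
The failure mode described above can be reproduced in a few lines. This is a minimal sketch under stated assumptions, not Grain's code: `ToyWindowCounter` and `fill_window` are hypothetical names, and the class models only the first-fill flag and the window counter, leaving out the parent iterator and the shuffling itself.

```python
class ToyWindowCounter:
    """Hypothetical stand-in that models only the _init flag and window_index."""

    def __init__(self):
        self.window_index = 0
        self._init = True  # True until the first window has been filled

    def _maybe_update_window_index(self):
        # The first fill keeps window_index at 0; every later fill increments it.
        if self._init:
            self._init = False
        else:
            self.window_index += 1

    def fill_window(self):
        self._maybe_update_window_index()
        return self.window_index

    def get_state(self):
        # Buggy on purpose: _init is not part of the state.
        return {"window_index": self.window_index}

    def set_state(self, state):
        # _init keeps whatever value the constructor gave it (True).
        self.window_index = state["window_index"]


original = ToyWindowCounter()
original.fill_window()            # fills window 0
state = original.get_state()      # checkpoint taken after the first fill

restored = ToyWindowCounter()     # fresh instance, so _init is True again
restored.set_state(state)

next_original = original.fill_window()  # moves on to window 1
next_restored = restored.fill_window()  # stays on window 0: the index is reused
```

In this toy, the original run advances to window 1 while the restored run fills window 0 a second time, which is the reuse of window_index that the bullets describe.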

The fix:

  • Include _init in the state dict returned by get_state()
  • Restore _init in set_state(), keeping backward compatibility with old checkpoints (_init defaults to False if the key is not present)
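
The two changes above might look roughly like this on a toy class. This is a sketch, not the actual diff: `ToyWindowCounterFixed`, `fill_window`, and the state key name `"init"` are assumptions made for illustration. The only behavior taken from the description is that the flag is saved, restored, and defaults to False when an old checkpoint lacks it.

```python
class ToyWindowCounterFixed:
    """Hypothetical stand-in that saves and restores the first-fill flag."""

    def __init__(self):
        self.window_index = 0
        self._init = True  # True until the first window has been filled

    def _maybe_update_window_index(self):
        if self._init:
            self._init = False
        else:
            self.window_index += 1

    def fill_window(self):
        self._maybe_update_window_index()
        return self.window_index

    def get_state(self):
        # The flag now travels with the checkpoint ("init" is an assumed key name).
        return {"window_index": self.window_index, "init": self._init}

    def set_state(self, state):
        self.window_index = state["window_index"]
        # Old checkpoints have no "init" key; fall back to False for them.
        self._init = state.get("init", False)


original = ToyWindowCounterFixed()
original.fill_window()                   # fills window 0
state = original.get_state()

restored = ToyWindowCounterFixed()       # fresh instance
restored.set_state(state)
next_original = original.fill_window()
next_restored = restored.fill_window()   # now advances in step with the original

legacy = ToyWindowCounterFixed()
legacy.set_state({"window_index": 3})    # checkpoint written before the fix
legacy_next = legacy.fill_window()       # treated as "first fill already happened"
```

One plausible reading of the False default is that a checkpoint written by older code was taken after iteration had started, so the first window had already been filled; that rationale is an inference, not something the description states.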

Added test test_checkpoint_restore_on_fresh_iterator that:

  • Creates an iterator and checkpoints partway through a window
  • Restores the checkpoint to a fresh iterator (not the same instance)
  • Verifies data and window_index match between original and restored runs
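
The shape of such a test can be sketched against a toy iterator. Everything here is hypothetical (`ToyWindowShuffleIterator`, `drain`, the state keys, and the use of `random.Random(seed + window_index)` as the per-window shuffle); it follows the three steps listed above but is not the test added in this PR.

```python
import random


class ToyWindowShuffleIterator:
    """Hypothetical window shuffle over a list; not Grain's implementation."""

    def __init__(self, data, window_size, seed):
        self._data = list(data)
        self._window_size = window_size
        self._seed = seed
        self._window_start = 0
        self._window = []
        self._pos = 0
        self.window_index = 0
        self._init = True  # True until the first window has been filled

    def __iter__(self):
        return self

    def __next__(self):
        if self._pos >= len(self._window):
            self._fill_next_window()
        value = self._window[self._pos]
        self._pos += 1
        return value

    def _fill_next_window(self):
        if self._init:
            self._init = False
        else:
            self.window_index += 1
            self._window_start += self._window_size
        self._load_window()
        if not self._window:
            raise StopIteration

    def _load_window(self):
        end = self._window_start + self._window_size
        window = self._data[self._window_start:end]
        # The shuffle seed depends on window_index, so a wrong index changes the order.
        random.Random(self._seed + self.window_index).shuffle(window)
        self._window, self._pos = window, 0

    def get_state(self):
        return {"window_start": self._window_start, "pos": self._pos,
                "window_index": self.window_index, "init": self._init}

    def set_state(self, state):
        self._window_start = state["window_start"]
        self.window_index = state["window_index"]
        self._init = state.get("init", False)
        if self._init:
            self._window, self._pos = [], 0   # nothing has been filled yet
        else:
            self._load_window()               # rebuild the current window
            self._pos = state["pos"]


def drain(it):
    """Collects (value, window_index) pairs until the iterator is exhausted."""
    return [(value, it.window_index) for value in it]


original = ToyWindowShuffleIterator(range(12), window_size=4, seed=42)
first_part = [next(original) for _ in range(6)]   # stop partway through window 1
checkpoint = original.get_state()
remaining_original = drain(original)

fresh = ToyWindowShuffleIterator(range(12), window_size=4, seed=42)
fresh.set_state(checkpoint)                       # restore into a new instance
remaining_restored = drain(fresh)
```

Comparing `remaining_restored` with `remaining_original` checks both the data and the window_index at every step, so a reused index would show up either as a different shuffle order or as a mismatched index.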

📚 Documentation preview 📚: https://google-grain--1179.org.readthedocs.build/


google-cla bot commented Dec 30, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.
