Feature: Enable appending to trajectory when using ts.optimize/ts.integrate (#361)
Conversation
orionarcher left a comment:
Also fixes a bug where, whenever a structure has converged, all other trajectories are reset by re-opening them in "w" mode: https://github.com/TorchSim/torch-sim/blob/main/torch_sim/runners.py#L483.
Can you say more here? This would be a pretty serious bug because it would mean all the optimization trajectories are wrong.
More generally, I think this PR needs to grapple with the fact that if we are appending, the trajectories we are appending to might have different past progress. In a batched setting, we can't just take the maximum progress of one as the progress of all. I realize that makes literally everything more annoying but such is the curse and blessing of TorchSim.
Unfortunately I believe this affects all optimization trajectories; see the runners.py line linked above.
I would rather push for a PR allowing a restart of the simulation (#43). In practice, the difference shows up when you run the same call multiple times.
@thomasloux happy to change it to this way of running. Just to make sure I understand. The proposal is to:
If that's the case, I think we have to do it like in the …
For MD simulations at least, there is no such thing; all systems evolve at the same time. I actually assumed it was the same for optimize, were it not for convergence criteria that can be met.
I don't have any strong opinion on optimization as I don't use it that much.
Thanks @thomasloux! I've updated the code to assume that for integrate() all trajectories start from the same time step, and then only integrate for n_steps - initial_step steps.
thomasloux left a comment:
That's exactly what I wanted for restart and ts.integrate! Just waiting for the tests to pass to approve.
Will take a look at the failures and hopefully fix them.
Tests pass locally for me now, let's see what happens in the CI.
thomasloux left a comment:
One part to clarify for ts.optimize, otherwise it's really nice work. I think in a later PR we should clarify the Trajectory objects. I'm not a fan of the mode 'r' vs 'a'; I would probably prefer an argument, either 'restart=True' or 'overwrite_if_exists=True'.
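For illustration, a minimal sketch of what such a flag-based interface could look like; the helper name and mapping are hypothetical and not part of this PR:

    # Hypothetical sketch, not part of this PR: map a user-facing boolean
    # onto the underlying file open mode so callers never pass "w"/"a"/"r".
    def resolve_mode(overwrite_if_exists: bool = False) -> str:
        """Return the open mode for a trajectory file."""
        # True  -> start a fresh file ("w" truncates existing data)
        # False -> keep and extend an existing file ("a")
        return "w" if overwrite_if_exists else "a"

    # e.g. TorchSimTrajectory(filename, mode=resolve_mode(overwrite_if_exists=False))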
thomasloux left a comment:
good for me now
I've updated the code to assume that for integrate() all trajectories start from the same time step, and then only integrate for n_steps - initial_step steps. Is that in line with what you had in mind?
Sorry for being a bit late on this, but I am going to push back on this implementation.
I think that n_steps should mean the number of steps you take, not the number of steps you reach. My strong prior is that n_steps is the number of steps taken; as a user, I'd find it quite strange if that weren't the case. What do you think?
I also don't think that if the steps don't match we should truncate all the trajectories to the same length. In my mind, we should detect the current step across all the trajectories, then progressively append to that step + i, where i is the number of steps taken by the integrator. Does that make sense?
I want to say I am quite appreciative of the PR and I think being able to append with the runners is a good feature. I am being nit picky because I think it's challenging to get the right behavior and expectations for this in a batched context.
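A rough sketch of the per-trajectory offset idea described above, purely to illustrate the intent; `reporter`, `step_fn`, and `state` are placeholders for the real runner internals, and it assumes report() could accept per-trajectory step numbers, which would itself be part of the change:

    # Illustrative pseudo-implementation only (names are placeholders).
    last_steps = reporter.last_step  # one entry per trajectory file

    for i in range(1, n_steps + 1):  # n_steps = number of steps *taken*
        state = step_fn(state)
        # Each trajectory is written at its own offset, so files with
        # different past progress stay internally consistent.
        current_steps = [last + i for last in last_steps]
        reporter.report(state, current_steps)  # assumes per-file step support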
Removed:

    self.trajectories = []
    for filename in self.filenames:
        self.trajectories.append(
            TorchSimTrajectory(
                filename=filename,
                metadata=self.metadata,
                **self.trajectory_kwargs,
            )
        )

Added:

    # Avoid wiping existing trajectory files when reopening them, hence
    # we set to "a" mode temporarily (read mode is unaffected).
    _mode = self.trajectory_kwargs.get("mode", "w")
    self.trajectory_kwargs["mode"] = "a" if _mode in ["a", "w"] else "r"
    self.trajectories = [
        TorchSimTrajectory(
            filename=filename,
            metadata=self.metadata,
            **self.trajectory_kwargs,
        )
        for filename in filenames
    ]
    # Restore original mode
    self.trajectory_kwargs["mode"] = _mode
Why shouldn't we overwrite them if mode is "w"? In my mind that's the intended behavior if we are in write mode: to overwrite all of the existing trajectories with new ones. This could be overridden by the user by setting `trajectory_kwargs={"mode": "a"}`.
If the feeling is that defaulting to "w" is too strong, the place to change it would be on line 137, where we set

    self.trajectory_kwargs["mode"] = self.trajectory_kwargs.get("mode", "w")

Separately, regarding `_mode = self.trajectory_kwargs.get("mode", "w")`: it's also not clear why we'd have "w" be the default original mode here.
Regardless of this change, I still think load_new_trajectories makes more sense as a name than reopen.
My argument is that the mode ("w" vs "a") is meant by the user only for the initial opening of the file, not the (potentially many) times we re-open the trajectory file during the optimization (e.g., when some of the trajectories finish). I don't think it is ever intentional to wipe all existing progress when that happens, even if the user specified "w" mode in the beginning.
Hence, in my PR I propose to use the user-specified mode only for the first time we open the trajectory files (in the constructor of the trajectory reporter), and afterwards avoid wiping the files by ensuring "a" mode whenever we re-open them during the optimization.
What is your alternative proposal to ensure that (i) we use the user-intended file mode initially, but (ii) don't erase the files during the optimization?
Setting aside whether it's intentional, I do think it should be possible to wipe the existing files when re-running, though I agree that appending is a better default. LAMMPS has an append mode when writing dump files that defaults to False, so this particular sharp edge isn't unusual.
I would propose that we set the default to "a" but wipe the files if "w" is set.
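A minimal sketch of that alternative default, hypothetical and distinct from what the PR currently does; the helper and its arguments are illustrative only:

    # Hypothetical helper: default to appending, honour an explicit "w" by
    # truncating exactly once, at construction time; never on re-open.
    def pick_mode(user_mode: str | None, first_open: bool) -> str:
        if first_open:
            return user_mode or "a"  # default changed from "w" to "a"
        return "a"                   # mid-run re-opens never wipe data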
I agree it should be possible to wipe existing files when re-running, and I think it's possible to do that with the existing code. In the constructor of Trajectory, we take the mode from the user-provided kwargs:
self.trajectory_kwargs["mode"] = self.trajectory_kwargs.get("mode", "w")
This means that if we specify "w" mode and there are already existing trajectory files, they will be wiped. Since reopen_trajectories is only called during optimization/dynamics, not at the start, I believe it should never wipe trajectories halfway through an active run (which is what happens without this bugfix).
So, if you agree that
- it is possible to wipe a trajectory by constructing it as `trajectory = Trajectory(..., trajectory_kwargs=dict(mode="w"))`, and that
- we never want to wipe trajectories during an active run,
perhaps the current version of the code is fine? Otherwise please let me know what part I don't yet understand. Thanks!
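As a usage illustration of the two points above (the exact reporter signature is an assumption based on this thread, so treat this as a sketch):

    # Fresh run: explicitly wipe any existing files once, at construction time.
    reporter = TrajectoryReporter(filenames, trajectory_kwargs={"mode": "w"})

    # Resumed run: keep whatever is already on disk and append to it.
    reporter = TrajectoryReporter(filenames, trajectory_kwargs={"mode": "a"})

    # Mid-run re-opens (reopen_trajectories) always force "a" internally,
    # so neither setting can wipe files halfway through an active run.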
Ah! It was I who was misunderstanding. Yeah, you are totally right, the trajectories will get wiped whenever a state converges. Temporarily switching into append mode is a very good catch!
    @property
    def last_step(self) -> list[int]:
        """Get the last logged step across all trajectory files.

        This is useful for resuming optimizations from where they left off.

        Returns:
            list[int]: The last step number for each trajectory, or 0 if
                no trajectories exist or all are empty
        """
        if not self.trajectories:
            return []
        last_steps = []
        for trajectory in self.trajectories:
            if trajectory._file.isopen:
                last_steps.append(trajectory.last_step)
            else:
                with TorchSimTrajectory(trajectory._file.filename, mode="r") as traj:
                    last_steps.append(traj.last_step)
        return last_steps
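For example, a resumed run might use the property above roughly like this (the reporter construction and surrounding runner code are assumptions, not part of the diff):

    # Sketch: query existing progress before appending.
    reporter = TrajectoryReporter(filenames, trajectory_kwargs={"mode": "a"})
    initial_steps = reporter.last_step  # e.g. [500, 500] for two files
    # With the integrate() semantics discussed in this thread, the run would
    # then only take n_steps - initial_step additional steps.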
Why not use trajectory.last_step here?
Because the file could be closed and then we cannot read the last step from it. We can of course also open the file inside the trajectory but it felt a bit odd that the trajectory would "open itself" inside last_step().
Got it!
Then perhaps let's just change the name to last_steps to indicate that it is a list
    @property
    def last_steps(self) -> list[int]:
Hey Orion, I think that both implementations make sense. I see two possibilities:
I think the 2nd option is more appealing than the first one. Also, we could easily obtain the first behaviour by adding an argument to the functions. The truncation is also a direct result of the fact that I pushed for a restart behaviour. At the moment an MD simulation from ts.integrate should always be in sync for all states in the batch, so it does not make sense to allow different indices in this PR. We could change that in another PR though.
Good framing. Both use cases are valid, so perhaps it's worth considering the left-out group: finishing the simulation if …
In the case of an interrupted simulation, it's likely that there will be some files that are complete, some that are incomplete, and some that have not started. We need a way to resume simulations that balances ease, sharp edges, and encouraging best practices. Here is what I propose:
Does this workflow feel too complicated? Let me know what y'all think.
Hi @orionarcher, thanks for the detailed suggestions. That makes sense to me; I'll incorporate these suggestions, though due to vacation I might only have time towards the end of the month. Let's also iron out the last two open discussions in this PR and then hopefully we're good to go :). Daniel
Summary
- Adds tests `test_optimize_append_to_trajectory` and `test_integrate_append_to_trajectory` to ensure trajectories are being appended to.
- Also fixes a bug where, whenever a structure has converged, all other trajectories are reset by re-opening them in "w" mode: https://github.com/TorchSim/torch-sim/blob/main/torch_sim/runners.py#L483.
- Closes #356.
Checklist
Before a pull request can be merged, the following items must be checked: