
Conversation

@danielzuegner (Contributor) commented Nov 25, 2025

Summary

  • Adds functionality to read the latest step per trajectory in the trajectory reporter and set the initial step accordingly.
  • Adds tests test_optimize_append_to_trajectory and test_integrate_append_to_trajectory to ensure trajectories are being appended to.
  • Also fixes a bug where, whenever a structure has converged, all other trajectories are reset by re-opening them in "w" mode: https://github.com/TorchSim/torch-sim/blob/main/torch_sim/runners.py#L483.

Closes #356.

Checklist

Before a pull request can be merged, the following items must be checked:

  • Doc strings have been added in the Google docstring format.
  • Run ruff on your code.
  • Tests have been added for any new functionality or bug fixes.

@orionarcher self-requested a review November 25, 2025 17:23
@orionarcher (Collaborator) left a comment

> Also fixes a bug where, whenever a structure has converged, all other trajectories are reset by re-opening them in "w" mode: https://github.com/TorchSim/torch-sim/blob/main/torch_sim/runners.py#L483.

Can you say more here? This would be a pretty serious bug because it would mean all the optimization trajectories are wrong.

More generally, I think this PR needs to grapple with the fact that if we are appending, the trajectories we are appending to might have different past progress. In a batched setting, we can't just take the maximum progress of one as the progress of all. I realize that makes literally everything more annoying but such is the curse and blessing of TorchSim.

@danielzuegner (Contributor, Author)

> Also fixes a bug where, whenever a structure has converged, all other trajectories are reset by re-opening them in "w" mode: https://github.com/TorchSim/torch-sim/blob/main/torch_sim/runners.py#L483.

> Can you say more here? This would be a pretty serious bug because it would mean all the optimization trajectories are wrong.

> More generally, I think this PR needs to grapple with the fact that if we are appending, the trajectories we are appending to might have different past progress. In a batched setting, we can't just take the maximum progress of one as the progress of all. I realize that makes literally everything more annoying but such is the curse and blessing of TorchSim.

Unfortunately I believe this affects all optimization trajectories. Looking at load_new_trajectories, we can see that the trajectories are being instantiated from scratch with self.trajectory_kwargs, i.e., write mode by default. While debugging my PR, I noticed that the longer trajectories are chopped at the start when another one finishes: I'd have steps [10, 15, 20, ...] in the trajectory instead of [5, 10, ...]. Happy to be proven wrong, but I think this is a bug that potentially affects all optimization trajectories currently.
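To make the failure mode concrete, here is a toy stand-in (plain Python dicts, no torch-sim APIs) for what happens to a still-running trajectory when all handles are rebuilt in "w" mode:

```python
# Toy illustration of the symptom, not torch-sim code: re-opening in "w" mode
# mid-run discards everything written so far.
stored_steps = {"traj_a.h5": [5, 10, 15], "traj_b.h5": [5, 10, 15]}

# The structure in traj_a converges, and the reporter rebuilds ALL handles in "w" mode:
for filename in stored_steps:
    stored_steps[filename] = []  # "w" truncates the file on open

# traj_b keeps running and writes steps 20, 25, ...
stored_steps["traj_b.h5"] += [20, 25]
print(stored_steps["traj_b.h5"])  # [20, 25] instead of [5, 10, 15, 20, 25]
```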

@thomasloux (Collaborator)

I would rather push for a PR allowing for a restart of a simulation (#43). In practice the difference is that, if you run ts.integrate(n_step=1000) multiple times, the first call will run the full length, but subsequent calls will finish immediately. You could add an argument to explicitly run for longer without having to read the trajectory yourself.
Note: whether for a real restart or for running longer, this doesn't change the fact that the TrajectoryReporter currently only saves the SimState, and not the extra variables of SimState subclasses. For instance, if you run an NVT Nosé-Hoover MD, this means you lose the state of the thermostat. So even "run for longer" is not technically right. This would be true as well for a (near-future) LBFGS optimization.
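For illustration, a minimal sketch of the bookkeeping difference between the two semantics (toy variables only; this is not the ts.integrate implementation):

```python
# Restart semantics proposed above: n_step is the target trajectory length.
n_step = 1000        # value passed on every call
steps_in_file = 0    # what the trajectory file already contains (0 on the first run)

steps_to_run = max(n_step - steps_in_file, 0)  # 1st call: 1000; repeated calls: 0

# An explicit opt-in flag (hypothetical) could restore "append n_step more" behavior.
extend = False
if extend:
    steps_to_run = n_step
```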

@danielzuegner (Contributor, Author)

> I would rather push for a PR allowing for a restart of a simulation (#43). In practice the difference is that, if you run ts.integrate(n_step=1000) multiple times, the first call will run the full length, but subsequent calls will finish immediately. You could add an argument to explicitly run for longer without having to read the trajectory yourself. Note: whether for a real restart or for running longer, this doesn't change the fact that the TrajectoryReporter currently only saves the SimState, and not the extra variables of SimState subclasses. For instance, if you run an NVT Nosé-Hoover MD, this means you lose the state of the thermostat. So even "run for longer" is not technically right. This would be true as well for a (near-future) LBFGS optimization.

@thomasloux happy to change it to this way of running. Just to make sure I understand. The proposal is to:

  • Detect the last time step per trajectory (say, [5, 10] for a toy example of two systems).
  • If n_steps=15, we run 10 more steps for the first one and 5 more steps for the second trajectory.

If that's the case, I think we have to do it like in the optimize() case, where convergence_tensor would be a tensor of bools indicating whether each trajectory has reached its terminal step yet. Is that understanding correct?
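For illustration, a toy sketch of the optimize()-style bookkeeping described above (the tensors and the loop are illustrative, not the runner's actual code):

```python
import torch

last_steps = torch.tensor([5, 10])  # last recorded step per trajectory (toy example)
n_steps = 15                        # terminal step every system should reach
current_step = last_steps.clone()

while not bool((current_step >= n_steps).all()):
    convergence_tensor = current_step >= n_steps  # True once a system is done
    # advance only the systems that have not yet reached their terminal step
    current_step = torch.where(convergence_tensor, current_step, current_step + 1)

print(current_step.tolist())  # [15, 15]
```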

@thomasloux (Collaborator)

For MD simulations at least, there is no such thing; all systems evolve at the same time. I actually assume it's the same for optimize, were it not for the convergence criteria that get met.
So in the case of ts.integrate, you would get n_step from the trajectory and run a for loop like range(n_step_traj, n_step_total) instead of the current PR's range(n_initial, n_initial + n_step), which then requires manually subtracting n_initial when you want to access kT values, for instance.
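A toy comparison of the two loop conventions (numbers and names are illustrative):

```python
n_step_traj, n_step_total = 400, 1000  # steps already in the file / desired total
n_initial, n_step = 400, 600

# Suggested here: keep global step indices, so per-step quantities (e.g. a kT
# schedule) can be indexed directly by `step`.
for step in range(n_step_traj, n_step_total):
    pass

# Current PR: offset indices; reading kT for this step then requires
# subtracting n_initial again.
for step in range(n_initial, n_initial + n_step):
    pass
```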

@thomasloux (Collaborator) commented Nov 26, 2025

I don't have a strong opinion on optimization, as I don't use it that much.

@danielzuegner (Contributor, Author)

> For MD simulations at least, there is no such thing; all systems evolve at the same time. I actually assume it's the same for optimize, were it not for the convergence criteria that get met. So in the case of ts.integrate, you would get n_step from the trajectory and run a for loop like range(n_step_traj, n_step_total) instead of the current PR's range(n_initial, n_initial + n_step), which then requires manually subtracting n_initial when you want to access kT values, for instance.

Thanks @thomasloux! I've updated the code to assume that for integrate() all trajectories start from the same time step, and then only integrate for n_steps - initial_step steps. Is that in line with what you had in mind?

@thomasloux (Collaborator) left a comment

That's exactly what I wanted for restart and ts.integrate! Just waiting for the tests to pass to approve.

@danielzuegner (Contributor, Author)

Will take a look at the failures and hopefully fix them

@danielzuegner (Contributor, Author)

Tests pass locally for me now, let's see what happens in the CI.

Daniel Zuegner added 2 commits November 27, 2025 13:45
@thomasloux (Collaborator) left a comment

One part to clarify for ts.optimize; otherwise it's really nice work. I think in a later PR we should clarify the Trajectory objects. I'm not a fan of the mode 'r' vs 'a'; I would probably prefer an explicit argument: either 'restart=True' or 'overwrite_if_exists=True'.

@thomasloux (Collaborator) left a comment

good for me now

@orionarcher (Collaborator) left a comment

> I've updated the code to assume that for integrate() all trajectories start from the same time step, and then only integrate for n_steps - initial_step steps. Is that in line with what you had in mind?

Sorry for being a bit late on this, but I am going to push back on this implementation.

I think that n_steps should mean the number of steps you take, not the number of steps you reach. My strong prior is that n_steps is the number of steps taken; as a user, I'd find it quite strange if that weren't the case. What do you think?

I also don't think that, if the steps don't match, we should truncate all the trajectories to the same length. In my mind, we should detect the current step across all the trajectories, then progressively append at that step + i, where i is the number of steps taken by the integrator. Does that make sense?
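A rough sketch of that append behavior (plain Python with illustrative names, not the runner code): each trajectory keeps its own offset, and new frames land at last_step + i.

```python
last_steps = [90, 40, 0]  # per-trajectory progress read from the files
n_steps = 10              # number of NEW steps the integrator will take

frames = {idx: [] for idx in range(len(last_steps))}
for i in range(1, n_steps + 1):
    # ... the integrator advances all batched systems by one step here ...
    for traj_idx, offset in enumerate(last_steps):
        frames[traj_idx].append(offset + i)  # step index recorded per trajectory

print(frames[1][:3])  # [41, 42, 43] -> frames appended after the existing step 40
```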

I want to say I am quite appreciative of the PR, and I think being able to append with the runners is a good feature. I am being nitpicky because I think it's challenging to get the right behavior and expectations for this in a batched context.

Comment on lines -173 to +205

```diff
-        self.trajectories = []
-        for filename in self.filenames:
-            self.trajectories.append(
-                TorchSimTrajectory(
-                    filename=filename,
-                    metadata=self.metadata,
-                    **self.trajectory_kwargs,
-                )
-            )
+        # Avoid wiping existing trajectory files when reopening them, hence
+        # we set to "a" mode temporarily (read mode is unaffected).
+        _mode = self.trajectory_kwargs.get("mode", "w")
+        self.trajectory_kwargs["mode"] = "a" if _mode in ["a", "w"] else "r"
+        self.trajectories = [
+            TorchSimTrajectory(
+                filename=filename,
+                metadata=self.metadata,
+                **self.trajectory_kwargs,
+            )
+            for filename in filenames
+        ]
+        # Restore original mode
+        self.trajectory_kwargs["mode"] = _mode
```
Collaborator

Why shouldn't we overwrite them if mode is "w"? In my mind that's the intended behavior if we are in write mode: to overwrite all of the existing trajectories with new ones. This could be changed by the user by setting `trajectory_kwargs={"mode": "a"}`.

If the feeling is that defaulting to "w" is too strong, the place to change it would be on line 137, where we set `self.trajectory_kwargs["mode"] = self.trajectory_kwargs.get("mode", "w")`.

Also, regarding `_mode = self.trajectory_kwargs.get("mode", "w")`: it's not clear to me why we'd have "w" be the default original mode here.

Setting this change aside, I still think load_new_trajectories makes more sense than reopen.

Contributor Author

My argument is that the mode ("w" vs "a") is meant by the user only for the initial opening of the file, not the (potentially many) times we re-open the trajectory file during the optimization (e.g., when some of the trajectories finish). I don't think it is ever intentional to wipe all existing progress when that happens, even if the user specified "w" mode in the beginning.

Hence, in my PR I propose to use the user-specified mode only for the first time we open the trajectory files (in the constructor of the trajectory reporter), and afterwards avoid wiping the files by ensuring "a" mode whenever we re-open them during the optimization.

What is your alternative proposal to ensure that (i) we use the user-intended file mode initially, but (ii) don't erase the files during the optimization?

Collaborator

Setting aside whether it's intentional, I do think it should be possible to wipe the existing files when re-running, though I agree that by default appending is better. LAMMPS has an append mode when writing dump files that is by default set to False, so this particular sharp edge isn't unusual.

I would propose that we set the default to "a" but wipe the files if "w" is set.
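For concreteness, a sketch of that alternative (hypothetical helper, not part of torch-sim): default to appending, but honor an explicit "w" by wiping the files once up front.

```python
import os

def resolve_reopen_mode(trajectory_kwargs: dict, filenames: list[str]) -> str:
    """Default to "a"; an explicit "w" wipes existing files once, then appends."""
    mode = trajectory_kwargs.get("mode", "a")  # "a" becomes the default
    if mode == "w":
        for filename in filenames:
            if os.path.exists(filename):
                os.remove(filename)  # honor the explicit wipe up front
        mode = "a"  # subsequent re-opens always append
    return mode
```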

Contributor Author

I agree it should be possible to wipe existing files when re-running, and I think it's possible to do that with the existing code. In the constructor of Trajectory, we take the mode from the user-provided kwargs:

 self.trajectory_kwargs["mode"] = self.trajectory_kwargs.get("mode", "w")

This means that if we specify "w" mode and there are already existing trajectory files, they will be wiped. Since reopen_trajectories is only called during optimization/dynamics, not at the start, I believe it should never wipe trajectories halfway through an active run (which is what currently happens before this bugfix).

So, if you agree that

  1. it is possible to wipe a trajectory by constructing it as `trajectory = Trajectory(..., trajectory_kwargs=dict(mode="w"))`, and that
  2. we never want to wipe trajectories during an active run,

perhaps the current version of the code is fine? Otherwise please let me know what part I don't yet understand. Thanks!

Collaborator

Ah! It was I who was misunderstanding. Yeah, you are totally right, the trajectories will get wiped whenever a state converges. Temporarily switching into append mode is a very good catch!

Comment on lines +343 to +362
```python
@property
def last_step(self) -> list[int]:
    """Get the last logged step across all trajectory files.

    This is useful for resuming optimizations from where they left off.

    Returns:
        list[int]: The last step number for each trajectory, or 0 if
            no trajectories exist or all are empty
    """
    if not self.trajectories:
        return []
    last_steps = []
    for trajectory in self.trajectories:
        if trajectory._file.isopen:
            last_steps.append(trajectory.last_step)
        else:
            with TorchSimTrajectory(trajectory._file.filename, mode="r") as traj:
                last_steps.append(traj.last_step)
    return last_steps
```
Collaborator

Why not use trajectory.last_step here?

Contributor Author

Because the file could be closed and then we cannot read the last step from it. We can of course also open the file inside the trajectory but it felt a bit odd that the trajectory would "open itself" inside last_step().

Collaborator

Got it!

Then perhaps let's just change the name to last_steps to indicate that it is a list

```python
@property
def last_steps(self) -> list[int]:
```

@thomasloux (Collaborator) commented Nov 29, 2025

Hey Orion, I think that both implementations make sense. I see two possibilities:

  • The user has a mode="a" in mind, so it makes sense for n_steps to be the number of steps to add and not the total number of steps in the trajectory.
  • The user wants to restart a trajectory (think of a failure or a slurm job that finished too early). Then I think it makes sense to run the same code and have TorchSim detect the traj file and start from it.

I think the 2nd option is more appealing than the first one. We could also easily obtain the first behaviour by adding an argument to the functions.

The truncation is also a direct result of the fact that I pushed for restart behaviour. At the moment an MD simulation from ts.integrate should always be in sync for all states in the batch, so it does not make sense to allow different indices in this PR. We could change that in another PR though.

@orionarcher (Collaborator) commented Nov 29, 2025

Good framing. Both use cases are valid, so perhaps it's worth considering the left-out group.

Finishing the simulation if n_steps = additional steps

```python
total_steps = 1000
steps_completed_before_interrupt = my_trajectory_reporter.last_step  # using the last_step property introduced in this PR
n_steps = total_steps - steps_completed_before_interrupt
integrate(..., n_steps=n_steps)
```

Appending 1000 steps if n_steps = total steps

```python
new_steps = 1000
current_number_of_steps = my_trajectory_reporter.last_step
n_steps = current_number_of_steps + new_steps
integrate(..., n_steps=n_steps)
```

Either is still accessible no matter what default is adopted and both are about the same amount of code. Ultimately, it comes down to which is more expected and which is more common. I am not sure which is more common, it very well might be restarting, but I feel that appending steps is more expected of an n_steps argument.

> At the moment an MD simulation from ts.integrate should always be in sync for all states in the batch, so it does not make sense to allow different indices in this PR. We could change that in another PR though.

Good point that separate indices are a distinct feature and could go elsewhere. This is a potentially dangerous default though. Imagine I have 100 simulations and I can fit 10 on a GPU at a time. My simulation proceeds nicely through the first 90 and then only makes it 1 step into the final 10. If I rerun the same script expecting to restart, I'll erase the progress on the first 90 simulations because the smallest last_step will be 1. Instead of truncating, perhaps we could enforce that all trajectories have the same last_step? That feels safer to me.
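Something like the following guard could encode that check (hypothetical helper, not existing torch-sim code):

```python
def require_consistent_last_step(last_steps: list[int]) -> int:
    """Refuse to append when trajectories disagree on their last step."""
    unique_steps = set(last_steps)
    if len(unique_steps) > 1:
        raise ValueError(
            f"Trajectories are at different steps {sorted(unique_steps)}; "
            "refusing to append rather than truncating existing progress."
        )
    return next(iter(unique_steps)) if unique_steps else 0
```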

> The user wants to restart a trajectory (think of a failure or a slurm job that finished too early). Then I think it makes sense to run the same code and have TorchSim detect the traj file and start from it.

This is going to be difficult when some simulations will run to completion and others will not begin. I think some modifications will need to be made to the script to weed out already completed runs.

OK, as I outline these challenges and think about it I am realizing that it's pretty hard to restart simulations in TorchSim right now and my proposal would keep it hard. Perhaps we can add some flag like restart=False in the kwargs that will nicely handle the complexities outlined above and allow a script to be rerun without worrying about any of this. That would require a bit of logic added to what we have now. I have to run right now but will think on this further.

@orionarcher (Collaborator)

In the case of an interrupted simulation, it's likely that there will be some files that are complete, some that are incomplete, and some that have not started. We need a way to resume simulations that balances ease, sharp edges, and encouraging best practices.

Here is what I propose:

  1. Require that all files in the TrajectoryReporter all have the same number of steps, either 0 (do not exist yet) or N. If a simulation is interrupted, it becomes the responsibility of the user to identify which systems have completed and discard those from the file list. A line in the script that discards completed files can always be included and will just do nothing if it's the first run.

  2. Provide a utility for resetting many systems to the same step. This can be called externally when needed, but I don't think it should be applied blindly internally, lest we wipe lots of progress. For the partially complete files these can be wiped to the same step and n_steps can be adjusted to account for their length.

  3. n_steps should be the number of new steps, not the total. I looked at OpenMM, LAMMPS, and ASE, and even when appending, n_steps is the number of new steps, not the total number.

Does this workflow feel too complicated? Let me know what y'all think.
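For concreteness, a rough sketch of the user-side filtering described in point 1 above (the helper and the import path are assumptions, not an existing torch-sim utility):

```python
import os

from torch_sim.trajectory import TorchSimTrajectory  # import path assumed

def discard_completed(filenames: list[str], target_steps: int) -> list[str]:
    """Keep only files that still need steps; a no-op on a fresh run."""
    remaining = []
    for filename in filenames:
        if not os.path.exists(filename):
            remaining.append(filename)  # not started yet
            continue
        with TorchSimTrajectory(filename, mode="r") as traj:
            if traj.last_step < target_steps:  # partially complete, keep running
                remaining.append(filename)
    return remaining
```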

@danielzuegner (Contributor, Author)

> In the case of an interrupted simulation, it's likely that there will be some files that are complete, some that are incomplete, and some that have not started. We need a way to resume simulations that balances ease, sharp edges, and encouraging best practices.
>
> Here is what I propose:
>
> 1. Require that all files in the TrajectoryReporter all have the same number of steps, either 0 (do not exist yet) or N. If a simulation is interrupted, it becomes the responsibility of the user to identify which systems have completed and discard those from the file list. A line in the script that discards completed files can always be included and will just do nothing if it's the first run.
> 2. Provide a utility for resetting many systems to the same step. This can be called externally when needed, but I don't think it should be applied blindly internally, lest we wipe lots of progress. For the partially complete files these can be wiped to the same step and n_steps can be adjusted to account for their length.
> 3. n_steps should be the number of new steps, not the total. I looked at OpenMM, LAMMPS, and ASE, and even when appending, n_steps is the number of new steps, not the total number.
>
> Does this workflow feel too complicated? Let me know what y'all think.

Hi @orionarcher,

Thanks for the detailed suggestions. That makes sense to me; I'll incorporate these suggestions, though due to vacation I might only have time towards the end of the month. Let's also iron out the last two open discussions in this PR and then hopefully we're good to go :).

Daniel

Daniel Zuegner added 2 commits December 10, 2025 16:09