
Conversation

@danielzuegner (Contributor) commented Nov 25, 2025

Summary

  • Adds functionality to read the latest step per trajectory in the trajectory reporter and set the initial step accordingly.
  • Adds tests test_optimize_append_to_trajectory and test_integrate_append_to_trajectory to ensure trajectories are being appended to.
  • Also fixes a bug where, whenever a structure has converged, all other trajectories are reset by re-opening them in "w" mode: https://github.com/TorchSim/torch-sim/blob/main/torch_sim/runners.py#L483.

Closes #356.

Checklist

Before a pull request can be merged, the following items must be checked:

  • Doc strings have been added in the Google docstring format.
  • Run ruff on your code.
  • Tests have been added for any new functionality or bug fixes.

@orionarcher self-requested a review November 25, 2025 17:23
@orionarcher (Collaborator) left a comment

> Also fixes a bug where, whenever a structure has converged, all other trajectories are reset by re-opening them in "w" mode: https://github.com/TorchSim/torch-sim/blob/main/torch_sim/runners.py#L483.

Can you say more here? This would be a pretty serious bug because it would mean all the optimization trajectories are wrong.

More generally, I think this PR needs to grapple with the fact that if we are appending, the trajectories we are appending to might have different past progress. In a batched setting, we can't just take the maximum progress of one as the progress of all. I realize that makes literally everything more annoying but such is the curse and blessing of TorchSim.

@danielzuegner (Contributor, Author)

> Also fixes a bug where, whenever a structure has converged, all other trajectories are reset by re-opening them in "w" mode: https://github.com/TorchSim/torch-sim/blob/main/torch_sim/runners.py#L483.

> Can you say more here? This would be a pretty serious bug because it would mean all the optimization trajectories are wrong.

> More generally, I think this PR needs to grapple with the fact that if we are appending, the trajectories we are appending to might have different past progress. In a batched setting, we can't just take the maximum progress of one as the progress of all. I realize that makes literally everything more annoying but such is the curse and blessing of TorchSim.

Unfortunately I believe this affects all optimization trajectories. Looking at load_new_trajectories, we can see that the trajectories are being instantiated from scratch with self.trajectory_kwargs, i.e., write mode by default. While debugging my PR, I noticed that the longer trajectories are chopped at the start when another one finishes: I'd have steps [10, 15, 20, ...] in the trajectory instead of [5, 10, ...]. Happy to be proven wrong, but I think this is a bug that potentially affects all optimization trajectories currently.
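To make the failure mode concrete, here is a toy stand-in (plain Python dicts, no torch-sim APIs) for what happens to a still-running trajectory when all handles are rebuilt in "w" mode:

```python
# Toy illustration of the symptom, not torch-sim code: re-opening in "w" mode
# mid-run discards everything written so far.
stored_steps = {"traj_a.h5": [5, 10, 15], "traj_b.h5": [5, 10, 15]}

# The structure in traj_a converges, and the reporter rebuilds ALL handles in "w" mode:
for filename in stored_steps:
    stored_steps[filename] = []  # "w" truncates the file on open

# traj_b keeps running and writes steps 20, 25, ...
stored_steps["traj_b.h5"] += [20, 25]
print(stored_steps["traj_b.h5"])  # [20, 25] instead of [5, 10, 15, 20, 25]
```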

@thomasloux (Collaborator)

I would rather push for a PR allowing for a restart of a simulation (#43). In practice the difference is that, if you run ts.integrate(n_step=1000) multiple times, the first call will run the full length, but subsequent calls will finish immediately. You could add an argument to explicitly run for longer without having to read the trajectory yourself.
Note: whether for a real restart or for running longer, this doesn't change the fact that the TrajectoryReporter currently only saves the SimState, and not the extra variables of SimState subclasses. For instance, if you run an NVT Nosé-Hoover MD, this means you lose the state of the thermostat. So even "run for longer" is not technically right. This would be true as well for a (near-future) LBFGS optimization.
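For illustration, a minimal sketch of the bookkeeping difference between the two semantics (toy variables only; this is not the ts.integrate implementation):

```python
# Restart semantics proposed above: n_step is the target trajectory length.
n_step = 1000        # value passed on every call
steps_in_file = 0    # what the trajectory file already contains (0 on the first run)

steps_to_run = max(n_step - steps_in_file, 0)  # 1st call: 1000; repeated calls: 0

# An explicit opt-in flag (hypothetical) could restore "append n_step more" behavior.
extend = False
if extend:
    steps_to_run = n_step
```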

@danielzuegner (Contributor, Author)

> I would rather push for a PR allowing for a restart of a simulation (#43). In practice the difference is that, if you run ts.integrate(n_step=1000) multiple times, the first call will run the full length, but subsequent calls will finish immediately. You could add an argument to explicitly run for longer without having to read the trajectory yourself. Note: whether for a real restart or for running longer, this doesn't change the fact that the TrajectoryReporter currently only saves the SimState, and not the extra variables of SimState subclasses. For instance, if you run an NVT Nosé-Hoover MD, this means you lose the state of the thermostat. So even "run for longer" is not technically right. This would be true as well for a (near-future) LBFGS optimization.

@thomasloux happy to change it to this way of running. Just to make sure I understand. The proposal is to:

  • Detect the last time step per trajectory (say, [5, 10] for a toy example of two systems).
  • If n_steps=15, we run 10 more steps for the first one and 5 more steps for the second trajectory.

If that's the case, I think we have to do it like in the optimize() case, where convergence_tensor would be a tensor of bools indicating whether each trajectory has reached its terminal step yet. Is that understanding correct?
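For illustration, a toy sketch of the optimize()-style bookkeeping described above (the tensors and the loop are illustrative, not the runner's actual code):

```python
import torch

last_steps = torch.tensor([5, 10])  # last recorded step per trajectory (toy example)
n_steps = 15                        # terminal step every system should reach
current_step = last_steps.clone()

while not bool((current_step >= n_steps).all()):
    convergence_tensor = current_step >= n_steps  # True once a system is done
    # advance only the systems that have not yet reached their terminal step
    current_step = torch.where(convergence_tensor, current_step, current_step + 1)

print(current_step.tolist())  # [15, 15]
```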

@thomasloux (Collaborator)

For MD simulations at least, there is no such thing; all systems evolve at the same time. I actually assume it's the same for optimize, were it not for the convergence criteria that get met.
So in the case of ts.integrate, you would get n_step from the trajectory and run a for loop like range(n_step_traj, n_step_total) instead of the current PR's range(n_initial, n_initial + n_step), which then requires manually subtracting n_initial when you want to access kT values, for instance.
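A toy comparison of the two loop conventions (numbers and names are illustrative):

```python
n_step_traj, n_step_total = 400, 1000  # steps already in the file / desired total
n_initial, n_step = 400, 600

# Suggested here: keep global step indices, so per-step quantities (e.g. a kT
# schedule) can be indexed directly by `step`.
for step in range(n_step_traj, n_step_total):
    pass

# Current PR: offset indices; reading kT for this step then requires
# subtracting n_initial again.
for step in range(n_initial, n_initial + n_step):
    pass
```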

@thomasloux (Collaborator) commented Nov 26, 2025

I don't have a strong opinion on optimization, as I don't use it that much.

@danielzuegner (Contributor, Author)

> For MD simulations at least, there is no such thing; all systems evolve at the same time. I actually assume it's the same for optimize, were it not for the convergence criteria that get met. So in the case of ts.integrate, you would get n_step from the trajectory and run a for loop like range(n_step_traj, n_step_total) instead of the current PR's range(n_initial, n_initial + n_step), which then requires manually subtracting n_initial when you want to access kT values, for instance.

Thanks @thomasloux! I've updated the code to assume that for integrate() all trajectories start from the same time step, and then only integrate for n_steps - initial_step steps. Is that in line with what you had in mind?

@thomasloux (Collaborator) left a comment

That's exactly what I wanted for restart and ts.integrate! Just waiting for the tests to pass to approve.

@danielzuegner (Contributor, Author)

Will take a look at the failures and hopefully fix them

@danielzuegner (Contributor, Author)

Tests pass locally for me now, let's see what happens in the CI.

Daniel Zuegner added 2 commits November 27, 2025 13:45
@thomasloux (Collaborator) left a comment

One part to clarify for ts.optimize; otherwise it's really nice work. I think in a later PR we should clarify the Trajectory objects. I'm not a fan of the mode 'r' vs 'a'; I would probably prefer an explicit argument: either 'restart=True' or 'overwrite_if_exists=True'.

@thomasloux (Collaborator) left a comment

good for me now

@orionarcher (Collaborator) left a comment

> I've updated the code to assume that for integrate() all trajectories start from the same time step, and then only integrate for n_steps - initial_step steps. Is that in line with what you had in mind?

Sorry for being a bit late on this, but I am going to push back on this implementation.

I think that n_steps should mean the number of steps you take, not the number of steps you reach. My strong prior is that n_steps is the number of steps taken; as a user, I'd find it quite strange if that weren't the case. What do you think?

I also don't think that, if the steps don't match, we should truncate all the trajectories to the same length. In my mind, we should detect the current step across all the trajectories, then progressively append at that step + i, where i is the number of steps taken by the integrator. Does that make sense?
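A rough sketch of that append behavior (plain Python with illustrative names, not the runner code): each trajectory keeps its own offset, and new frames land at last_step + i.

```python
last_steps = [90, 40, 0]  # per-trajectory progress read from the files
n_steps = 10              # number of NEW steps the integrator will take

frames = {idx: [] for idx in range(len(last_steps))}
for i in range(1, n_steps + 1):
    # ... the integrator advances all batched systems by one step here ...
    for traj_idx, offset in enumerate(last_steps):
        frames[traj_idx].append(offset + i)  # step index recorded per trajectory

print(frames[1][:3])  # [41, 42, 43] -> frames appended after the existing step 40
```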

I want to say I am quite appreciative of the PR, and I think being able to append with the runners is a good feature. I am being nitpicky because I think it's challenging to get the right behavior and expectations for this in a batched context.

Comment on lines -173 to +205

```diff
-        self.trajectories = []
-        for filename in self.filenames:
-            self.trajectories.append(
-                TorchSimTrajectory(
-                    filename=filename,
-                    metadata=self.metadata,
-                    **self.trajectory_kwargs,
-                )
-            )
+        # Avoid wiping existing trajectory files when reopening them, hence
+        # we set to "a" mode temporarily (read mode is unaffected).
+        _mode = self.trajectory_kwargs.get("mode", "w")
+        self.trajectory_kwargs["mode"] = "a" if _mode in ["a", "w"] else "r"
+        self.trajectories = [
+            TorchSimTrajectory(
+                filename=filename,
+                metadata=self.metadata,
+                **self.trajectory_kwargs,
+            )
+            for filename in filenames
+        ]
+        # Restore original mode
+        self.trajectory_kwargs["mode"] = _mode
```
Collaborator

Why shouldn't we overwrite them if mode is "w"? In my mind that's the intended behavior if we are in write mode: to overwrite all of the existing trajectories with new ones. This could be changed by the user by setting `trajectory_kwargs={"mode": "a"}`.

If the feeling is that defaulting to "w" is too strong, the place to change it would be on line 137, where we set `self.trajectory_kwargs["mode"] = self.trajectory_kwargs.get("mode", "w")`.

Also, regarding `_mode = self.trajectory_kwargs.get("mode", "w")`: it's not clear to me why we'd have "w" be the default original mode here.

Setting this change aside, I still think load_new_trajectories makes more sense than reopen.

Contributor Author

My argument is that the mode ("w" vs "a") is meant by the user only for the initial opening of the file, not the (potentially many) times we re-open the trajectory file during the optimization (e.g., when some of the trajectories finish). I don't think it is ever intentional to wipe all existing progress when that happens, even if the user specified "w" mode in the beginning.

Hence, in my PR I propose to use the user-specified mode only for the first time we open the trajectory files (in the constructor of the trajectory reporter), and afterwards avoid wiping the files by ensuring "a" mode whenever we re-open them during the optimization.

What is your alternative proposal to ensure that (i) we use the user-intended file mode initially, but (ii) don't erase the files during the optimization?

Collaborator

Setting aside whether it's intentional, I do think it should be possible to wipe the existing files when re-running, though I agree that by default appending is better. LAMMPS has an append mode when writing dump files that is by default set to False, so this particular sharp edge isn't unusual.

I would propose that we set the default to "a" but wipe the files if "w" is set.
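For concreteness, a sketch of that alternative (hypothetical helper, not part of torch-sim): default to appending, but honor an explicit "w" by wiping the files once up front.

```python
import os

def resolve_reopen_mode(trajectory_kwargs: dict, filenames: list[str]) -> str:
    """Default to "a"; an explicit "w" wipes existing files once, then appends."""
    mode = trajectory_kwargs.get("mode", "a")  # "a" becomes the default
    if mode == "w":
        for filename in filenames:
            if os.path.exists(filename):
                os.remove(filename)  # honor the explicit wipe up front
        mode = "a"  # subsequent re-opens always append
    return mode
```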

Contributor Author

I agree it should be possible to wipe existing files when re-running, and I think it's possible to do that with the existing code. In the constructor of Trajectory, we take the mode from the user-provided kwargs:

 self.trajectory_kwargs["mode"] = self.trajectory_kwargs.get("mode", "w")

This means that if we specify "w" mode and there are already existing trajectory files, they will be wiped. Since reopen_trajectories is only called during optimization/dynamics, not at the start, I believe it should never wipe trajectories halfway through an active run (which is what currently happens before this bugfix).

So, if you agree that

  1. it is possible to wipe a trajectory by constructing it as `trajectory = Trajectory(..., trajectory_kwargs=dict(mode="w"))`, and that
  2. we never want to wipe trajectories during an active run,

perhaps the current version of the code is fine? Otherwise please let me know what part I don't yet understand. Thanks!

Collaborator

Ah! It was I who was misunderstanding. Yeah, you are totally right, the trajectories will get wiped whenever a state converges. Temporarily switching into append mode is a very good catch!

Comment on lines +343 to +362
```python
@property
def last_step(self) -> list[int]:
    """Get the last logged step across all trajectory files.

    This is useful for resuming optimizations from where they left off.

    Returns:
        list[int]: The last step number for each trajectory, or 0 if
            no trajectories exist or all are empty
    """
    if not self.trajectories:
        return []
    last_steps = []
    for trajectory in self.trajectories:
        if trajectory._file.isopen:
            last_steps.append(trajectory.last_step)
        else:
            with TorchSimTrajectory(trajectory._file.filename, mode="r") as traj:
                last_steps.append(traj.last_step)
    return last_steps
```
Collaborator

Why not use trajectory.last_step here?

Contributor Author

Because the file could be closed and then we cannot read the last step from it. We can of course also open the file inside the trajectory but it felt a bit odd that the trajectory would "open itself" inside last_step().

Collaborator

Got it!

Then perhaps let's just change the name to last_steps to indicate that it is a list

```python
@property
def last_steps(self) -> list[int]:
```

@thomasloux (Collaborator) commented Nov 29, 2025

Hey Orion, I think that both implementations make sense. I see two possibilities:

  • The user has a mode="a" in mind, so it makes sense for n_steps to be the number of steps to add and not the total number of steps in the trajectory.
  • The user wants to restart a trajectory (think of a failure or a slurm job that finished too early). Then I think it makes sense to run the same code and have TorchSim detect the traj file and start from it.

I think the 2nd option is more appealing than the first one. We could also easily obtain the first behaviour by adding an argument to the functions.

The truncation is also a direct result of the fact that I pushed for restart behaviour. At the moment an MD simulation from ts.integrate should always be in sync for all states in the batch, so it does not make sense to allow different indices in this PR. We could change that in another PR though.

@orionarcher (Collaborator) commented Nov 29, 2025

Good framing. Both use cases are valid, so perhaps it's worth considering the left-out group.

Finishing the simulation if n_steps = additional steps

```python
total_steps = 1000
steps_completed_before_interrupt = my_trajectory_reporter.last_step  # using the last_step property introduced in this PR
n_steps = total_steps - steps_completed_before_interrupt
integrate(..., n_steps=n_steps)
```

Appending 1000 steps if n_steps = total steps

```python
new_steps = 1000
current_number_of_steps = my_trajectory_reporter.last_step
n_steps = current_number_of_steps + new_steps
integrate(..., n_steps=n_steps)
```

Either is still accessible no matter what default is adopted and both are about the same amount of code. Ultimately, it comes down to which is more expected and which is more common. I am not sure which is more common, it very well might be restarting, but I feel that appending steps is more expected of an n_steps argument.

> At the moment an MD simulation from ts.integrate should always be in sync for all states in the batch, so it does not make sense to allow different indices in this PR. We could change that in another PR though.

Good point that separate indices are a distinct feature and could go elsewhere. This is a potentially dangerous default though. Imagine I have 100 simulations and I can fit 10 on a GPU at a time. My simulation proceeds nicely through the first 90 and then only makes it 1 step into the final 10. If I rerun the same script expecting to restart, I'll erase the progress on the first 90 simulations because the smallest last_step will be 1. Instead of truncating, perhaps we could enforce that all trajectories have the same last_step? That feels safer to me.
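Something like the following guard could encode that check (hypothetical helper, not existing torch-sim code):

```python
def require_consistent_last_step(last_steps: list[int]) -> int:
    """Refuse to append when trajectories disagree on their last step."""
    unique_steps = set(last_steps)
    if len(unique_steps) > 1:
        raise ValueError(
            f"Trajectories are at different steps {sorted(unique_steps)}; "
            "refusing to append rather than truncating existing progress."
        )
    return next(iter(unique_steps)) if unique_steps else 0
```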

> The user wants to restart a trajectory (think of a failure or a slurm job that finished too early). Then I think it makes sense to run the same code and have TorchSim detect the traj file and start from it.

This is going to be difficult when some simulations will run to completion and others will not begin. I think some modifications will need to be made to the script to weed out already completed runs.

OK, as I outline these challenges and think about it I am realizing that it's pretty hard to restart simulations in TorchSim right now and my proposal would keep it hard. Perhaps we can add some flag like restart=False in the kwargs that will nicely handle the complexities outlined above and allow a script to be rerun without worrying about any of this. That would require a bit of logic added to what we have now. I have to run right now but will think on this further.

@orionarcher (Collaborator)

In the case of an interrupted simulation, it's likely that there will be some files that are complete, some that are incomplete, and some that have not started. We need a way to resume simulations that balances ease, sharp edges, and encouraging best practices.

Here is what I propose:

  1. Require that all files in the TrajectoryReporter all have the same number of steps, either 0 (do not exist yet) or N. If a simulation is interrupted, it becomes the responsibility of the user to identify which systems have completed and discard those from the file list. A line in the script that discards completed files can always be included and will just do nothing if it's the first run.

  2. Provide a utility for resetting many systems to the same step. This can be called externally when needed, but I don't think it should be applied blindly internally, lest we wipe lots of progress. For the partially complete files these can be wiped to the same step and n_steps can be adjusted to account for their length.

  3. n_steps should be the number of new steps, not the total. I looked at OpenMM, LAMMPS, and ASE, and even when appending, n_steps is the number of new steps, not the total number.

Does this workflow feel too complicated? Let me know what y'all think.
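For concreteness, a rough sketch of the user-side filtering described in point 1 above (the helper and the import path are assumptions, not an existing torch-sim utility):

```python
import os

from torch_sim.trajectory import TorchSimTrajectory  # import path assumed

def discard_completed(filenames: list[str], target_steps: int) -> list[str]:
    """Keep only files that still need steps; a no-op on a fresh run."""
    remaining = []
    for filename in filenames:
        if not os.path.exists(filename):
            remaining.append(filename)  # not started yet
            continue
        with TorchSimTrajectory(filename, mode="r") as traj:
            if traj.last_step < target_steps:  # partially complete, keep running
                remaining.append(filename)
    return remaining
```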

@danielzuegner (Contributor, Author)

> In the case of an interrupted simulation, it's likely that there will be some files that are complete, some that are incomplete, and some that have not started. We need a way to resume simulations that balances ease, sharp edges, and encouraging best practices.
>
> Here is what I propose:
>
> 1. Require that all files in the TrajectoryReporter all have the same number of steps, either 0 (do not exist yet) or N. If a simulation is interrupted, it becomes the responsibility of the user to identify which systems have completed and discard those from the file list. A line in the script that discards completed files can always be included and will just do nothing if it's the first run.
> 2. Provide a utility for resetting many systems to the same step. This can be called externally when needed, but I don't think it should be applied blindly internally, lest we wipe lots of progress. For the partially complete files these can be wiped to the same step and n_steps can be adjusted to account for their length.
> 3. n_steps should be the number of new steps, not the total. I looked at OpenMM, LAMMPS, and ASE, and even when appending, n_steps is the number of new steps, not the total number.
>
> Does this workflow feel too complicated? Let me know what y'all think.

Hi @orionarcher,

Thanks for the detailed suggestions. That makes sense to me; I'll incorporate these suggestions, though due to vacation I might only have time towards the end of the month. Let's also iron out the last two open discussions in this PR and then hopefully we're good to go :).

Daniel

Daniel Zuegner added 2 commits December 10, 2025 16:09