Skip to content

Conversation

@JanMaartenvanDoorn
Copy link
Member

@JanMaartenvanDoorn JanMaartenvanDoorn commented Dec 19, 2025

Changes proposed in this PR include:

@egordm egordm added feature New feature or request OpenSTEF 4.0 Work for OpenSTEF 4.0 labels Dec 19, 2025
Copy link
Collaborator

@egordm egordm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great changes. I have only a few nipicks about test and possibly moving some logic to timeseries dataset.

Comment on lines 187 to 222
def _infer_frequency(index: pd.DatetimeIndex) -> pd.Timedelta:
"""Infer the frequency of a pandas DatetimeIndex if the freq attribute is not set.

This method calculates the most common time difference between consecutive timestamps,
which is more permissive of missing chunks of data than the pandas infer_freq method.

Args:
index (pd.DatetimeIndex): The datetime index to infer the frequency from.

Returns:
pd.Timedelta: The inferred frequency as a pandas Timedelta.

Raises:
ValueError: If the index has fewer than 2 timestamps.
"""
minimum_required_length = 2
if len(index) < minimum_required_length:
raise ValueError("Cannot infer frequency from an index with fewer than 2 timestamps.")

# Calculate the differences between consecutive timestamps
deltas = index.to_series().diff().dropna()

# Find the most common difference
return deltas.mode().iloc[0]

def _frequency_matches(self, index: pd.DatetimeIndex) -> bool:
"""Check if the frequency of the input data matches the model frequency.

Args:
index (pd.DatetimeIndex): The input data to check.

Returns:
bool: True if the frequencies match, False otherwise.
"""
input_frequency = self._infer_frequency(index) if index.freq is None else index.freq
return input_frequency == self.frequency
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be nice to move this to TimeSeriesDataset. To have something like one function called validate_sample_interval that checks the data against the set sample interval. If user wants to be sure they can call it.

It would make it easier to test and median model code would be a lot simpler.

@egordm egordm mentioned this pull request Jan 9, 2026
3 tasks
Copy link
Collaborator

@egordm egordm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great. The only change I would suggest is to avoid actually inferring timedelta, it's something user can better give manually given how many edge cases there are for this. We can only do verification. This way we also don't need any changes in stef beam and such.

Comment on lines +131 to +134
inferred_freq = pd.Timedelta(
self._infer_frequency(data.index) if data.index.freq is None else data.index.freq # type: ignore
)
sample_interval = inferred_freq.to_pytimedelta()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh maybe we shouldn't infer frequency from the input data? Forcing user to specify it would be the best. The data may contain holes or similar throwing off the inference.

We should probably only have a validate function that optionally validates if the data uses the right sample interval for functions that are sensitive to this. Like the median model.

It probably also solves issues you have been having with failing doctests.

Comment on lines +303 to +310
lag_deltas = sorted(self.lags_to_time_deltas_.values())
lag_intervals = [(lag_deltas[i] - lag_deltas[i - 1]).total_seconds() for i in range(1, len(lag_deltas))]
if not all(interval == lag_intervals[0] for interval in lag_intervals):
msg = (
"Lag features are not evenly spaced. "
"Please ensure lag features are evenly spaced and match the data frequency."
)
raise ValueError(msg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually I like splitting chunks like this off into generic standalone functions cause they are easy to test in isolation but harder to test in model.

For example here we could have a function check_timedeltas_evenly_spaced which is easy to test and doesn't require extra test for the model.

This is more of a person preference. We if it's not worth it we can skip it here.


# Check that lag frequency matches data frequency
expected_lag_interval = lag_intervals[0]
if expected_lag_interval != self.frequency_.total_seconds():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of the operations can be done on timedelta structure the total seconds would be unnecessary. In this case it would be a simple comparison.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature New feature or request OpenSTEF 4.0 Work for OpenSTEF 4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants