Implement-median-model-v4 #793
base: release/v4.0.0
Conversation
egordm
left a comment
Great changes. I have only a few nitpicks about the tests, and possibly about moving some logic to the time series dataset.
packages/openstef-models/tests/unit/models/forecasting/test_median_forecaster.py
def _infer_frequency(index: pd.DatetimeIndex) -> pd.Timedelta:
    """Infer the frequency of a pandas DatetimeIndex if the freq attribute is not set.

    This method calculates the most common time difference between consecutive timestamps,
    which is more permissive of missing chunks of data than the pandas infer_freq method.

    Args:
        index (pd.DatetimeIndex): The datetime index to infer the frequency from.

    Returns:
        pd.Timedelta: The inferred frequency as a pandas Timedelta.

    Raises:
        ValueError: If the index has fewer than 2 timestamps.
    """
    minimum_required_length = 2
    if len(index) < minimum_required_length:
        raise ValueError("Cannot infer frequency from an index with fewer than 2 timestamps.")

    # Calculate the differences between consecutive timestamps
    deltas = index.to_series().diff().dropna()

    # Find the most common difference
    return deltas.mode().iloc[0]

def _frequency_matches(self, index: pd.DatetimeIndex) -> bool:
    """Check if the frequency of the input data matches the model frequency.

    Args:
        index (pd.DatetimeIndex): The input data to check.

    Returns:
        bool: True if the frequencies match, False otherwise.
    """
    input_frequency = self._infer_frequency(index) if index.freq is None else index.freq
    return input_frequency == self.frequency
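To illustrate why the mode-based approach is more permissive than pandas' built-in inference, here is a small standalone sketch (the example index is mine, not from the PR):

```python
import pandas as pd

# A 15-minute index with a missing chunk between 00:45 and 02:00.
index = pd.DatetimeIndex(
    list(pd.date_range("2025-01-01 00:00", periods=4, freq="15min"))
    + list(pd.date_range("2025-01-01 02:00", periods=4, freq="15min"))
)

# pandas' built-in inference gives up on the irregular index...
print(pd.infer_freq(index))  # None

# ...while the mode of the consecutive differences still recovers 15 minutes.
deltas = index.to_series().diff().dropna()
print(deltas.mode().iloc[0])  # 0 days 00:15:00
```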
Maybe it would be nice to move this to TimeSeriesDataset, as a single function called validate_sample_interval that checks the data against the set sample interval. If the user wants to be sure, they can call it.
It would make this easier to test, and the median model code would be a lot simpler.
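A hypothetical sketch of what such a validate_sample_interval could look like (the function name comes from this comment; the signature and body are my assumptions, not the actual TimeSeriesDataset API):

```python
import pandas as pd

def validate_sample_interval(index: pd.DatetimeIndex, sample_interval: pd.Timedelta) -> None:
    """Raise if the index deviates from the configured sample interval.

    Hypothetical helper matching the suggestion above; exact multiples of the
    interval are allowed so that missing chunks of data do not fail validation.
    """
    deltas = index.to_series().diff().dropna()
    offending = deltas[deltas % sample_interval != pd.Timedelta(0)]
    if not offending.empty:
        raise ValueError(
            f"Data does not conform to the sample interval {sample_interval}: "
            f"first offending timestamp is {offending.index[0]}."
        )

# An index with a gap still validates, since the 45-minute hole is a multiple of 15 minutes.
ok = pd.date_range("2025-01-01", periods=3, freq="15min").append(
    pd.date_range("2025-01-01 01:00", periods=3, freq="15min")
)
validate_sample_interval(ok, pd.Timedelta("15min"))
```

With this in the dataset, the median model only needs to call the validator instead of carrying its own inference logic.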
egordm
left a comment
Looks great. The only change I would suggest is to avoid actually inferring the timedelta; it's something the user can better provide manually, given how many edge cases there are. We can only do verification. This way we also don't need any changes in stef beam and such.
inferred_freq = pd.Timedelta(
    self._infer_frequency(data.index) if data.index.freq is None else data.index.freq  # type: ignore
)
sample_interval = inferred_freq.to_pytimedelta()
Oh, maybe we shouldn't infer the frequency from the input data? Forcing the user to specify it would be best; the data may contain holes or similar, throwing off the inference.
We should probably only have a validate function that optionally checks whether the data uses the right sample interval, for functions that are sensitive to this, like the median model.
It probably also solves the issues you have been having with failing doctests.
lag_deltas = sorted(self.lags_to_time_deltas_.values())
lag_intervals = [(lag_deltas[i] - lag_deltas[i - 1]).total_seconds() for i in range(1, len(lag_deltas))]
if not all(interval == lag_intervals[0] for interval in lag_intervals):
    msg = (
        "Lag features are not evenly spaced. "
        "Please ensure lag features are evenly spaced and match the data frequency."
    )
    raise ValueError(msg)
Usually I like splitting chunks like this off into generic standalone functions, because they are easy to test in isolation but harder to test inside the model.
For example, here we could have a function check_timedeltas_evenly_spaced, which is easy to test and doesn't require an extra test for the model.
This is more of a personal preference; if it's not worth it, we can skip it here.
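The suggested check_timedeltas_evenly_spaced could be sketched roughly as follows (the function name comes from the comment above; the body is my assumption):

```python
from datetime import timedelta

def check_timedeltas_evenly_spaced(deltas: list[timedelta]) -> bool:
    """Return True if the sorted timedeltas form an evenly spaced sequence.

    Sequences with fewer than three elements are trivially evenly spaced.
    """
    ordered = sorted(deltas)
    intervals = [b - a for a, b in zip(ordered, ordered[1:])]
    return all(interval == intervals[0] for interval in intervals)

print(check_timedeltas_evenly_spaced(
    [timedelta(minutes=15), timedelta(minutes=30), timedelta(minutes=45)]
))  # True
print(check_timedeltas_evenly_spaced(
    [timedelta(minutes=15), timedelta(minutes=30), timedelta(minutes=60)]
))  # False
```

A function like this can be unit-tested directly, so the model tests only need to cover that the error is raised, not the spacing logic itself.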
# Check that lag frequency matches data frequency
expected_lag_interval = lag_intervals[0]
if expected_lag_interval != self.frequency_.total_seconds():
Most of the operations can be done on the timedelta structure directly, so the total_seconds conversion would be unnecessary. In this case it would be a simple comparison.
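For instance, keeping the lag intervals as timedelta objects reduces the check to a direct comparison (variable names mirror the diff above, but this is only a sketch):

```python
from datetime import timedelta

lag_deltas = sorted([timedelta(minutes=15), timedelta(minutes=30), timedelta(minutes=45)])
# Keep the intervals as timedeltas instead of calling total_seconds() on each.
lag_intervals = [b - a for a, b in zip(lag_deltas, lag_deltas[1:])]

frequency_ = timedelta(minutes=15)
expected_lag_interval = lag_intervals[0]

# Timedeltas compare directly; no conversion to seconds needed.
print(expected_lag_interval == frequency_)  # True
```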
Changes proposed in this PR include: