
Conversation

@pavelkomarov (Collaborator) commented Oct 14, 2025

Added robustdiff to kalman_smooth, added tests for it, and added it to notebooks 1 and 2a (2b yet to do). Added it to optimization and played with it a long time; still toying. Also improved the way imports are done in the various `__init__.py` files with the `else` clause, which I didn't know about before.
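
A minimal sketch of the `try`/`except`/`else` import pattern mentioned above; the dependency and module names here are hypothetical, not the package's actual layout:

```python
# Hypothetical __init__.py fragment. The else clause runs only when the try
# block raised nothing, i.e. only when the optional dependency imported cleanly.
try:
	import cvxpy  # stand-in for an optional dependency
except ImportError:
	from warnings import warn
	warn("optional dependency missing; the methods that need it won't be exposed")
else:
	from ._some_module import some_method  # hypothetical conditional export
```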

Still need to:

  • rerun and finalize 2a notebook
  • reorganize, add robustdiff to, and rerun 2b notebook
  • add robustdiff to suggest_method and rerun notebook 3
  • set up and run experiments with robustdiff in notebook 4 to see whether this thing lives up to the name I've given it

```diff
 		{'q': (1e-10, 1e10),
-		 'r': (1e-10, 1e10)})
+		 'r': (1e-10, 1e10)}),
+	robustdiff: ({'order': {1, 2, 3}, # categorical
```
@pavelkomarov (Collaborator, Author):
Might also need huberM to be a hyperparameter. Maybe discrete, because we want to hit 0 on the dot.
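
A sketch of what that could look like, following the categorical-set convention in the snippet above. The particular huberM values, the `qr_ratio` bounds, and the placeholder `robustdiff` are all assumptions for illustration:

```python
# Hypothetical search-space entry; robustdiff is a stub so this runs standalone.
def robustdiff(*args, **kwargs): ...

method_params_and_bounds = {
    robustdiff: ({'order': {1, 2, 3},          # categorical, given as a set
                  'huberM': {0, 0.5, 1, 2}},   # discrete, so the search can hit 0 on the dot
                 {'qr_ratio': (1e-10, 1e10)})  # assumed continuous bounds
}
```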

```diff
-	(lineardiff, {'order':3, 'gamma':5, 'window_size':11, 'solver':'CLARABEL'}),
+	(lineardiff, [3, 5, 11], {'solver':'CLARABEL'}),
 	(rbfdiff, {'sigma':0.5, 'lmbd':0.001})
 ]
+diff_methods_and_params = [(robustdiff, {'order':3, 'qr_ratio':1e6})]
```
@pavelkomarov (Collaborator, Author):
Gotta remove this in the next commit. Should test all of them.

```diff
 from warnings import filterwarnings, warn
-from multiprocessing import Pool
+from multiprocessing import Pool, Manager
+from hashlib import sha1
```
@pavelkomarov (Collaborator, Author):
Apparently calling hash() on identical objects in different processes isn't guaranteed to return the same value, so I had to get fancier.
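
For the record, a standalone demonstration of the problem and the fix: Python randomizes string hashes per process (see PYTHONHASHSEED), while a hashlib digest of the underlying bytes is a pure function of its input.

```python
from hashlib import sha1

# hash() of a string varies between interpreter runs unless PYTHONHASHSEED is
# pinned, so it can't serve as a cache key shared across worker processes.
print(hash('spam'))                       # likely different in a fresh process

# A digest of the bytes is deterministic everywhere.
print(sha1('spam'.encode()).hexdigest())  # identical in every process
```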


```python
# Map from method -> (search_space, bounds_low_hi)
method_params_and_bounds = {
	spectraldiff: ({'even_extension': {True, False}, # give categorical params in a set
```
@pavelkomarov (Collaborator, Author):
I changed the order of a bunch of listings around this PR, to better match the taxonomy paper and readme.

"""
point_params = {k:(v if search_space_types[k] == float else
int(np.round(v))) for k,v in zip(search_space_types, point)} # point -> dict
key = sha1((''.join(f"{v:.3e}" for v in point) + # This hash is stable across processes. Takes bytes
@pavelkomarov (Collaborator, Author) commented Oct 15, 2025:
Hashing unhashable types reliably was more pain than I expected. It's still not totally perfect, because scientific notation isn't a totally reliable way to tell we've already queried somewhere very nearby. For instance, 0.000e+00 looks very different from 1.000e-05 as a string even though the values are only 1e-5 apart, while two points that both format as 1.100e+03 could be hiding a difference much greater than 1e-5.
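
A standalone illustration of both failure modes of the %.3e key, not from the PR:

```python
# Near-identical values that get DIFFERENT cache keys:
print(f"{0.0:.3e}")     # '0.000e+00'
print(f"{1e-5:.3e}")    # '1.000e-05' -- distinct strings, values only 1e-5 apart

# Clearly distinct values that COLLIDE to the same cache key:
print(f"{1100.0:.3e}")  # '1.100e+03'
print(f"{1100.4:.3e}")  # '1.100e+03' -- same string, values 0.4 apart
```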

```python
		int(np.round(v))) for k, v in zip(search_space_types, point)} # point -> dict
	key = sha1((''.join(f"{v:.3e}" for v in point) + # This hash is stable across processes. Takes bytes
		''.join(str(v) for k, v in sorted(categorical_params.items()))).encode()).digest()
	if key in cache: return cache[key] # short circuit if this hyperparam combo has already been queried, ~10% savings per #160
```
@pavelkomarov (Collaborator, Author) commented Oct 15, 2025:
I'm not sure I actually need to store and return the value. Could maybe just short-circuit by returning NaN, since there's a nanargmin at the end. But storing and returning a number is cheap.
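
A quick sketch of why returning NaN would also work, standalone rather than the PR's code: `np.nanargmin` skips NaN entries, so short-circuited repeats could never be selected as the winner.

```python
import numpy as np

losses = np.array([0.3, np.nan, 0.1, np.nan])  # NaN = already-queried points
print(np.nanargmin(losses))  # 2 -- NaNs are ignored when taking the argmin
```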

```diff
-	params, bounds = method_params_and_bounds[func]
-	params.update(search_space_updates) # for things not given, use defaults
+	search_space, bounds = method_params_and_bounds[func]
+	search_space.update(search_space_updates) # for things not given, use defaults
```
@pavelkomarov (Collaborator, Author):
Renamed this. It's not exactly the search space. It's a bunch of directions about how to make starting conditions. But it's close in spirit.

```diff
 	_minimize = partial(scipy.optimize.minimize, _obj_fun, method=opt_method, bounds=bounds, options={'maxiter':maxiter})
 	results += pool.map(_minimize, starting_points) # returns a bunch of OptimizeResult objects
+	with Manager() as manager:
+		cache = manager.dict() # cache answers to avoid expensive repeat queries
```
@pavelkomarov (Collaborator, Author):
What a great object somebody made. Super easy to use.
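
A self-contained toy of the pattern, simplified from what the PR does (the squaring stands in for the expensive objective):

```python
from multiprocessing import Pool, Manager
from functools import partial

def query(cache, x):
	"""Return the cached answer if present, else compute and store it."""
	if x in cache:
		return cache[x]
	cache[x] = x * x  # stand-in for the expensive objective evaluation
	return cache[x]

if __name__ == '__main__':
	with Manager() as manager:
		cache = manager.dict()  # one proxy dict, shared by every worker
		with Pool(2) as pool:
			print(pool.map(partial(query, cache), [3, 1, 3, 2, 1]))  # [9, 1, 9, 4, 1]
```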

@pavelkomarov merged commit 8eaea82 into master on Oct 15, 2025
1 of 2 checks passed
@pavelkomarov deleted the outlier-robust branch on October 15, 2025 at 01:54