@fkiraly (Collaborator) commented Dec 31, 2025

This PR is an experimental draft showcasing:

  • a lightweight packaging layer for objects of any kind, using python code serialization rather than "translation"
  • a new taxon, "model", corresponding to (uninstantiated) classes
  • an example implementation for tabular classifiers, adding records for two examples: XGBClassifier from xgboost, and AutoSklearnClassifier from auto-sklearn

Note: this currently does not use any backend - the idea is that the lightweight packaging layer can be easily serialized by a backend.

Usage pattern for end users

End users would interact with the layer as follows:

import openml

clf1 = openml.get("XGBClassifier")
clf2 = openml.get("AutoSklearnClassifier")

Either line will:

  • if the required soft dependencies (e.g., xgboost, auto-sklearn) are present, directly import the class from the package;
  • if the required soft dependencies are not present, raise an informative error message.

The strings live in a unique namespace for objects; for widely known packages they will usually (but not necessarily) correspond to class names.

This could be further extended to:

  • a deps utility that, for a given namespace ID, provides the required dependencies - the tags are already inspectable here
  • tracking dependencies at a more granular level for reproducibility, e.g., via pip freeze, recording the exact versions with which a benchmark featuring a model was carried out

Versioning inside openml is consciously avoided in this design, as versioning already happens in the dependencies which are tracked by the tag system.

The logic internally works through:

  • a registry lookup mechanism
  • a unified but openml-private materialize interface that obtains the class from the packaging layer

Usage patterns for estimator contributors

Currently, an extender would interact with this by manually adding a class in openml.models.classifiers; in the simplest case, these are thin wrappers pointing to a third-party location:

from openml.models.apis import _ModelPkgClassifier

class OpenmlPkg__AutoSklearnClassifier(_ModelPkgClassifier):
    _tags = {
        "pkg_id": "AutoSklearnClassifier",
        "python_dependencies": "auto-sklearn",
    }

    _obj = "autosklearn.classification.AutoSklearnClassifier"

But this class could also rely on a full python definition of a class itself, with _obj pointing to a python file, or with the method _materialize implementing explicit python code.
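The second alternative might look as follows; the base class here is a stand-in for `openml.models.apis._ModelPkgClassifier`, and the toy classifier is purely illustrative:

```python
# Sketch: instead of pointing _obj at a third-party class, the record
# carries the full class definition inside an overridden _materialize.
class _ModelPkgClassifier:  # stand-in for the real openml base class
    _tags = {}

    @classmethod
    def _materialize(cls):
        raise NotImplementedError

class OpenmlPkg__MajorityClassifier(_ModelPkgClassifier):
    _tags = {"pkg_id": "MajorityClassifier", "python_dependencies": None}

    @classmethod
    def _materialize(cls):
        # the entire class definition lives inside the record itself
        class MajorityClassifier:
            """Predicts the most frequent label seen during fit."""
            def fit(self, X, y):
                self.majority_ = max(set(y), key=list(y).count)
                return self
            def predict(self, X):
                return [self.majority_] * len(X)
        return MajorityClassifier
```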

This can also be automated further in the context of a database backend:

  • creating a dynamic packaging class from an in-memory object
  • calling this from a publish method that automatically extracts tags etc., similar to OpenMLFlow.publish
  • or applying a publish method to an entire package, crawling it and extracting pointer records to all sklearn estimators
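The first automation step could be as simple as a `type()` call; `make_pkg_class` and the `_BasePkg` internals shown here are hypothetical names for illustration:

```python
# Sketch: build a dynamic packaging class for an in-memory object.
class _BasePkg:
    _tags = {}
    _obj = None

    @classmethod
    def _materialize(cls):
        # for an in-memory object, materializing just returns it
        return cls._obj

def make_pkg_class(obj, pkg_id, dependencies=None):
    """Create a packaging record class for `obj` at runtime."""
    tags = {"pkg_id": pkg_id, "python_dependencies": dependencies}
    return type(
        f"OpenmlPkg__{pkg_id}", (_BasePkg,), {"_tags": tags, "_obj": obj}
    )
```

A publish method could then call this, extract `_tags`, and hand both off to the backend.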

Intended backend interaction pattern

A database backend would be inserted as follows.

for publish / populate

  • publish creates a dynamic packaging class inheriting from the new _BasePkg
  • calls serialize on it to obtain a string
  • calls get_tags to obtain a dict of packaging metadata
  • optionally, adds publish process metadata (author, timestamp, etc.)
  • publishes both to the database via a DB API call
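The publish steps above could be sketched as follows, using zlib compression of the record's python source as the serialization (matching the decompress-and-exec described for queries); `serialize`, `publish`, and the dict-as-database are all assumptions of this sketch:

```python
# Sketch of the publish flow: serialize a record's source, attach tags,
# and store both via a (here: dict-backed) DB API call.
import zlib

RECORD_SOURCE = '''
class OpenmlPkg__Example:
    _tags = {"pkg_id": "Example", "python_dependencies": None}
'''

def serialize(source):
    """Compress the record's python source into bytes for storage."""
    return zlib.compress(source.encode("utf-8"))

def publish(source, tags, db):
    """Store serialized source plus metadata; `db` stands in for the DB API."""
    db[tags["pkg_id"]] = {
        "blob": serialize(source),
        "tags": tags,
        # publish-process metadata (author, timestamp, ...) could go here
    }

fake_db = {}
publish(RECORD_SOURCE, {"pkg_id": "Example", "python_dependencies": None}, fake_db)
```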

for query

  • DB API call for the lookup mechanism
  • retrieve the previously published record, also via a DB API call
  • invert serialize by zlib decompression and exec to obtain a dynamic python packaging class
  • materialize from that class
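The query-side inversion can be sketched like this; `load_record` and the record's contents are illustrative, with `dict` standing in for the real wrapped class:

```python
# Sketch of the query flow: zlib-decompress the stored blob and exec it
# to recover the dynamic packaging class, then materialize from it.
import zlib

# a previously "published" record, as it would come back from the DB
blob = zlib.compress(b'''
class OpenmlPkg__Example:
    _tags = {"pkg_id": "Example"}

    @classmethod
    def _materialize(cls):
        return dict  # stand-in for the real wrapped class
''')

def load_record(blob):
    """Invert serialize: decompress and exec to rebuild the record class."""
    namespace = {}
    exec(zlib.decompress(blob).decode("utf-8"), namespace)
    return namespace["OpenmlPkg__Example"]

record = load_record(blob)
materialized = record._materialize()
```

Note that exec-ing stored source is a trust decision: this only makes sense if records come from a trusted backend.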
