duvidnn is a suite of Python tools for calculating confidence and information metrics for deep learning. It provides a higher-level framework for general-purpose, taxonomic, and chemistry-specific neural networks.
As a bonus, duvidnn provides an easy command-line interface for training and testing models.
You can install the precompiled version directly using pip.
$ pip install duvidnn
If you want to use duvidnn for chemistry machine learning and AI, use:
$ pip install duvidnn[chem]
For integrating taxonomic information with vectome, use:
$ pip install duvidnn[bio]
You can install both:
$ pip install duvidnn[bio,chem]
To install from source, clone the repository and cd into it, then run:
$ pip install -e .
duvidnn has a command-line interface for training and checkpointing the built-in models.
$ duvidnn --help
usage: duvidnn [-h] [--version] {hyperprep,train,predict,split,percentiles} ...
Calculating exact and approximate confidence and information metrics for deep learning on general purpose and chemistry tasks.
options:
  -h, --help            show this help message and exit
  --version, -v         show program's version number and exit

Sub-commands:
  {hyperprep,train,predict,split,percentiles}
                        Use these commands to specify the tool you want to use.
    hyperprep           Prepare inputs for hyperparameter search.
    train               Train a PyTorch model.
    predict             Make predictions and calculate uncertainty using a duvidnn checkpoint.
    split               Make chemical train-test-val splits on out-of-core datasets.
    percentiles         Add columns indicating whether rows are in a percentile.
In all cases, you can get further options with duvidnn <command> --help, for example:
$ duvidnn train --help
You can add columns to datasets annotating the top percentiles of named columns. This works even with extremely large datasets that don't fit in memory.
$ duvidnn percentiles \
hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train \
--columns clogp tpsa \
--percentiles 1 5 10 \
--output percentiles.parquet \
--plot percentiles-plot.png \
--structure smiles
In all cases, input data can be:
- a path to a local file in CSV, Parquet, Arrow, or HF Dataset format, or
- a remote dataset hosted on 🤗 Datasets, indicated by hf:// followed by the repository name.
There are utilities for scaffold splitting and (FAISS-approximated) spectral splitting of out-of-core datasets that don't fit in memory. Use --seed to make the split random but reproducible; otherwise a deterministic bin-packing algorithm is used.
$ duvidnn split hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train \
--train .7 \
--validation .15 \
--structure smiles \
--type faiss \
--seed 1 \
--output faiss.csv \
--plot faiss.png
To train:
$ duvidnn train -1 hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train \
-2 hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test \
--class fingerprint \
--structure smiles \
--ensemble-size 10 \
--epochs 10 \
--learning-rate 0.001 \
--output model.dv
Different model classes can be specified with the --class option (see the ModelBox registry below).
There is also a simple hyperparameter utility.
$ printf '{"model_class": "fingerprint", "use_2d": [true, false], "n_units": 16, "n_hidden": 3}' | duvidnn hyperprep -o hyperopt.json
This generates a file containing all combinations. It can be indexed (0-based)
with the -i <int> option to supply a specific training configuration like so:
$ duvidnn train \
-1 hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train \
-2 hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test \
-c hyperopt.json \
-i 0 \
--output model.dv
In this way, you can generate all the hyperparameter combinations, then systematically test them one by one (or in parallel using HPC or other methods).
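Conceptually, hyperprep expands each list-valued field into a Cartesian grid of configurations, along the lines of this sketch (illustrative only; the actual file format is handled by duvidnn):

from itertools import product

# scalar fields are written as single-element lists here for simplicity
config = {
    "model_class": ["fingerprint"],
    "use_2d": [True, False],
    "n_units": [16],
    "n_hidden": [3],
}
grid = [dict(zip(config, values)) for values in product(*config.values())]
# grid[0] -> {'model_class': 'fingerprint', 'use_2d': True, 'n_units': 16, 'n_hidden': 3}
# grid[1] -> {'model_class': 'fingerprint', 'use_2d': False, 'n_units': 16, 'n_hidden': 3}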
You can make predictions on datasets using duvidnn predict. Optionally, you can restrict prediction to a chunk of the
dataset using --start and --stop, which can be useful for parallelizing prediction across chunks.
When predicting, there is also the option to calculate uncertainty metrics like ensemble variance (--variance),
Tanimoto nearest-neighbor distance to the training set (--tanimoto, for chemistry models), doubtscore (--doubtscore),
and information sensitivity (--information-sensitivity).
$ duvidnn predict \
--test hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test \
--checkpoint model.dv \
--start 100 \
--stop 200 \
--variance \
--tanimoto \
--doubtscore \
-y clogp \
--output predictions.parquet
Outputs can be made in CSV, Parquet, Arrow, or HF Dataset format. This is inferred from the file extension of
the filename provided for --output.
Note that information sensitivity using default parameters can be very slow for large models with large training data, since it must calculate second-order parameter gradients for every training example. There are approximations which can speed it up substantially, at the cost of exactness:
- The --last-layer option gives the biggest speed-up, since it restricts the calculation to only the output layer of the model.
- Using --optimality assumes the model has been trained to an optimum (i.e. the gradient of the loss is zero).
- The --approx bekas option uses a fast approximation of second-order gradients.
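These trade-offs surface again in the Python API described below. As a sketch, and assuming information_sensitivity() accepts keyword arguments analogous to the CLI flags (last_layer and optimality are assumed names here):

# a sketch; last_layer and optimality are assumed keyword names
info_sens = modelbox.information_sensitivity(
    data="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test",
    approx="bekas",   # fast approximation of second-order gradients
    last_layer=True,  # restrict the calculation to the output layer
    optimality=True,  # assume training reached an optimum (zero loss gradient)
)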
duvidnn provides Python classes and functions for custom analysis.
The core of duvidnn is the ModelBox, which is a container for a trainable model and its training data.
These are connected because measures of confidence and information gain depend directly on the information
or evidence already seen by the model.
There are several ModelBox classes for specific deep learning architectures in PyTorch.
>>> from duvidnn.autoclass import MODELBOX_REGISTRY
>>> from pprint import pprint
>>> pprint(MODELBOX_REGISTRY)
{'bilinear': <class 'duvidnn.torch.modelbox.modelboxes.TorchBilinearModelBox'>,
'bilinear-fp': <class 'duvidnn.torch.modelbox.modelboxes.TorchBilinearFingerprintModelBox'>,
'chemprop': <class 'duvidnn.torch.modelbox.modelboxes.ChempropModelBox'>,
'cnn': <class 'duvidnn.torch.modelbox.modelboxes.TorchCNN2DModelBox'>,
'fingerprint': <class 'duvidnn.torch.modelbox.modelboxes.TorchFingerprintModelBox'>,
'mlp': <class 'duvidnn.torch.modelbox.modelboxes.TorchMLPModelBox'>}
The modelboxes chemprop, fingerprint, and bilinear-fp featurize SMILES representations of chemical
structures. The modelbox mlp is a general purpose multilayer perceptron.
You can set up your model with various training parameters.
from duvidnn.autoclass import AutoClass

modelbox = AutoClass(
    "fingerprint",
    n_units=16,
    n_hidden=2,
    ensemble_size=10,
    structure_column="smiles",
)
The internal neural network is instantiated on loading training data.
modelbox.load_training_data(
    data="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train",
    inputs="smiles",  # column name of the predictor values
    labels="clogp",   # column name of the values to predict
)
The data can be a remote 🤗 dataset, in which case it is automatically downloaded. The "@"
indicates the dataset configuration, and the ":" indicates the specific data split.
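For orientation, the hf:// address maps onto the 🤗 datasets loader roughly like this sketch (duvidnn handles the download and caching itself):

from datasets import load_dataset

# hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train corresponds to:
ds = load_dataset(
    "scbirlab/fang-2023-biogen-adme",  # repository name
    name="scaffold-split",             # the "@" part: dataset configuration
    split="train",                     # the ":" part: data split
)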
Alternatively, the training data can be a local CSV or TSV file, or in-memory Pandas dataframes or dictionaries.
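For example, an in-memory DataFrame can be passed directly (a sketch with toy values for illustration):

import pandas as pd

df = pd.DataFrame({
    "smiles": ["CCO", "c1ccccc1", "CC(=O)O"],  # toy structures
    "clogp": [-0.1, 2.1, -0.2],                # toy labels
})
modelbox.load_training_data(data=df, inputs="smiles", labels="clogp")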
With training data loaded, the model can be trained!
modelbox.train(
    val_filename="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test",
    epochs=10,
    batch_size=128,
)
The ModelBox.train() method uses PyTorch Lightning under the hood, so other options such as callbacks
for this framework should be accepted.
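For instance, passing a Lightning early-stopping callback might look like this (a sketch; it assumes train() forwards a callbacks argument to the Lightning Trainer, and that the validation loss is logged as "val_loss"):

from lightning.pytorch.callbacks import EarlyStopping

modelbox.train(
    val_filename="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test",
    epochs=50,
    batch_size=128,
    callbacks=[EarlyStopping(monitor="val_loss", patience=5)],  # assumed pass-through to the Trainer
)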
duvidnn provides a basic checkpointing mechanism to save model weights and training data for later reloading.
modelbox.save_checkpoint("checkpoint.dv")
modelbox.load_checkpoint("checkpoint.dv")
duvidnn ModelBoxes provide methods for evaluating predictions on new data.
predictions, metrics = modelbox.evaluate(
    data="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test",
)
duvidnn ModelBoxes also provide methods for calculating prediction variance of ensembles,
doubtscore, and information sensitivity.
doubtscore = modelbox.doubtscore(
    data="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test"
)
info_sens = modelbox.information_sensitivity(
    data="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:test",
    approx="bekas",  # approximate Hessian diagonals
    n=10,
)
To avoid storing large datasets in memory, duvidnn uses 🤗 datasets under the hood to cache data. Results can be instantiated in memory with a little effort. For example:
doubtscore = doubtscore.to_pandas()
See the 🤗 datasets documentation for more.
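Alternatively, you can keep a result as a 🤗 Dataset and just set its output format, for example so that indexing returns NumPy arrays:

# indexing info_sens now returns NumPy arrays instead of Python lists
info_sens = info_sens.with_format("numpy")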
Bringing a new PyTorch model to duvidnn is relatively straightforward. First, write your model,
adding the Lightning logic via LightningMixin:
from typing import Callable

from torch import nn
from torch.optim import Adam, Optimizer

from duvidnn.torch.models.utils.lt import LightningMixin
class SimpleMLP(nn.Module, LightningMixin):

    def __init__(
        self,
        n_input: int,
        n_units: int = 16,
        n_out: int = 1,
        activation: Callable = nn.SiLU,  # smooth activation to prevent vanishing gradients
        learning_rate: float = .01,
        optimizer: Optimizer = Adam,
        *args, **kwargs
    ):
        super().__init__(*args, **kwargs)
        self.n_input = n_input
        self.n_units = n_units
        self.activation = activation
        self.n_out = n_out
        self.model_layers = nn.Sequential(
            nn.Linear(self.n_input, self.n_units),
            self.activation(),
            nn.Linear(self.n_units, self.n_out),
        )
        # Lightning logic
        self._init_lightning(
            optimizer=optimizer,
            learning_rate=learning_rate,
            model_attr='model_layers',  # the attribute containing the model
        )

    def forward(self, x):
        return self.model_layers(x)
Then subclass duvidnn.torch.modelbox.TorchModelBoxBase and implement the create_model() method, which should
simply return your instantiated model. If you want to preprocess input data on the fly, then
add a preprocess_data() method which takes a data dictionary and returns a data dictionary.
from typing import Dict
from duvidnn.torch.modelbox import TorchModelBoxBase
import numpy as np
class MLPModelBox(TorchModelBoxBase):

    def __init__(self, *args, **kwargs):
        super().__init__()
        self._mlp_kwargs = kwargs

    def create_model(self, *args, **kwargs):
        self._model_config.update(kwargs)  # makes sure model checkpointing saves the keyword args
        return SimpleMLP(
            *args,
            n_input=self.input_shape[-1],  # defined on data loading
            n_out=self.output_shape[-1],
            **self._model_config,
            **self._mlp_kwargs,  # if init kwargs are relevant to model creation
        )

    # Define this method if your data needs preprocessing
    @staticmethod
    def preprocess_data(data: Dict[str, np.ndarray], _in_key, _out_key, **kwargs) -> Dict[str, np.ndarray]:
        # your_featurizer is a placeholder for your own featurization function
        return {
            _in_key: your_featurizer(data[_in_key]),
            _out_key: np.asarray(data[_out_key]),
        }
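Putting it together, usage might look like this sketch, assuming the custom box follows the same API as the built-in ModelBoxes (the input and label columns here are hypothetical choices for a purely numeric model):

modelbox = MLPModelBox(n_units=32)
modelbox.load_training_data(
    data="hf://scbirlab/fang-2023-biogen-adme@scaffold-split:train",
    inputs="tpsa",   # hypothetical numeric input column
    labels="clogp",  # hypothetical label column
)
modelbox.train(epochs=10, batch_size=128)
modelbox.save_checkpoint("custom-mlp.dv")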
If you want to build ModelBoxes based on a framework other than PyTorch, you can subclass the duvidnn.base.ModelBoxBase abstract class, making sure to implement its abstract methods.
Report bugs and feature requests at the issue tracker.
Full documentation is to come at ReadTheDocs.