A Multiple Instance Learning (MIL) pipeline for histopathology that evaluates different MIL architectures (ABMIL, CLAM, DSMIL, etc.) using features pre-extracted with foundation models (for example, `uni_v2`, `virchow2`).
The workflow is implemented in Nextflow DSL2 and uses containers (Wave/Singularity) to run both the Python components (MIL training and grid search) and the R components (visualizations).
- `main.nf` orchestrates the pipeline:
  - Reads the clinical/dataset file (`params.dataset`).
  - Reads the list of feature extractors from `params/feature_extractors.csv` (automatically loaded).
  - Reads the list of MIL architectures from `params/architectures.csv` (automatically loaded).
  - Uses `params.features_dir` to construct feature directory paths.
  - Launches:
    - `split_dataset`: splits the dataset into train/val/test folds for cross-validation at the case level.
    - `grid_search`: runs a grid search for each `feature_extractor × MIL architecture` combination with cross-validation.
    - `concat_results`: concatenates all test metrics into a single summary file.
    - `boxplot_auc`: generates a global performance boxplot (ROC AUC).
    - `roc_auc_curve`: generates ROC AUC curves for each configuration.
    - `heatmap_workflow`:
      - `select_best_config`: selects the best configuration based on validation AUC.
      - `predict`: generates attention scores and predictions for the best model.
      - `heatmap`: creates heatmap visualizations for top-k patches.
      - `convert_tiff`: converts heatmaps to TIFF format.
- `modules/grid_search.nf`
  - `process split_dataset`: runs `histomil-splits` to create train/val/test splits for cross-validation at the case level.
  - `process grid_search`: runs `histomil-grid` for each `feature_extractor × MIL architecture` combination and publishes:
    - `test_results_*.csv` (test set metrics per fold)
    - `predictions_*.csv` (test set predictions per fold)
  - `process concat_results`: concatenates all test metrics into a single `summary.csv` file.
- `modules/plots.nf`
  - `process boxplot_auc`: generates a boxplot comparing ROC AUC across all configurations using `bin/boxplot_auc.R`.
  - `process roc_auc_curve`: generates ROC AUC curves using `bin/roc_auc_curve.R`.
- `modules/heatmaps.nf`
  - `process select_best_config`: identifies the best hyperparameter configuration based on validation metrics.
  - `process predict`: runs `histomil-predict` to generate predictions and attention scores using the best model.
  - `process heatmap`: runs `histomil-heatmap` to visualize attention scores as heatmaps on slide images.
  - `process convert_tiff`: converts generated heatmap images to tiled BigTIFF format using `gdal_translate`.
- `bin/`
  - `boxplot_auc.R`: reads the `summary.csv` file and generates a ROC AUC `boxplot.png` comparing performance across feature extractors and MIL architectures.
  - `roc_auc_curve.R`: plots ROC curves for model predictions.
- Dataset file (`params.dataset`)
  - CSV with at least:
    - A `case_id` column to identify cases (patients) for case-level splitting.
    - A `slide_id` column to link samples with feature files.
    - A target column (specified by `params.target`, e.g., `target`, `ESR1`, `MKI67`).
  - Example structure:

    ```
    case_id,slide_id,target
    case_1,slide_1,0
    case_1,slide_2,0
    case_2,slide_3,1
    case_2,slide_4,1
    ...
    ```
- Feature extractors configuration (`params/feature_extractors.csv`)
  - CSV file automatically loaded by the pipeline (located in the `params/` directory).
  - Required columns:
    - `patch_encoder`: patch-level encoder name (e.g. `uni_v2`, `virchow2`).
    - `patch_size`: patch size in pixels (e.g. `256`, `224`).
    - `mag`: magnification level (e.g. `20`).
    - `overlap`: overlap in pixels (e.g. `0`).
  - Example:

    ```
    patch_encoder,patch_size,mag,overlap
    uni_v2,256,20,0
    virchow2,224,20,0
    ```
- MIL architectures configuration (`params/architectures.csv`)
  - CSV file automatically loaded by the pipeline (located in the `params/` directory).
  - Required column:
    - `architecture`: MIL architecture name (e.g. `abmil`, `clam`, `dsmil`, `dftd`, `ilra`, `rrt`, `transformer`, `transmil`, `wikg`).
  - Example:

    ```
    architecture
    abmil
    clam
    dsmil
    dftd
    ilra
    rrt
    transformer
    transmil
    wikg
    ```
- Features directory (`params.features_dir`)
  - Base directory path where feature directories are located.
  - Feature directories follow the pattern:
    `{features_dir}{mag}x_{patch_size}px_{overlap}px_overlap/features_{patch_encoder}/`
  - Each feature directory should contain one `.h5` file per slide (named `{slide_id}.h5`).
  - Each H5 file should contain:
    - `features`: array of shape `(num_patches, feature_dim)`
    - Optionally, `coords`: array of patch coordinates
- Slides directory (`params.slides_dir`)
  - Base directory path where WSI directories are located.
- Pipeline parameters (YAML files in `params/`)
  - The key parameters are:
    - `dataset`: path to the CSV with `case_id`, `slide_id`, and target columns.
    - `features_dir`: base directory path where feature directories are located.
    - `slides_dir`: base directory path where WSIs are located.
    - `outdir`: output directory for this run (default: `./results/`).
    - `target`: column name of the target variable (e.g., `target`, `ESR1`, `MKI67`).
    - `task`: `"classification"` (currently only classification is supported).
  - Example: HRR ER classification (`params/params_hrr_er.yml`):

    ```yaml
    dataset: '/path/to/class_dataset_er.csv'
    features_dir: "/path/to/features/base/directory/"
    slides_dir: "/path/to/slides/base/directory/"
    outdir: "./results_hrr_er/"
    target: "target"
    task: "classification"
    ```
All outputs are written under `params.outdir` (configured in the selected params file):

- Training results
  - `training/summary.csv` (concatenated test metrics from all feature extractors and MIL architectures).
  - `{feature_extractor}.{mil}/test_results_{feature_extractor}.{mil}.csv` with metrics per fold.
  - Classification metrics: `test_auc`, `test_acc`, `test_f1`, `test_precision`, `test_recall`.
- Predictions
  - `predictions/{feature_extractor}.{mil}/predictions_{feature_extractor}.{mil}_{fold}.csv` with `slide_id`, `y_true`, `y_pred`, `y_score` (probability for the positive class).
- Splits
  - `splits/{target}/dataset.csv` (processed dataset with `case_id`, `slide_id`, and label columns).
  - `splits_{fold}_bool.csv` (boolean splits for each fold with train/val/test columns).
  - `splits_{fold}_descriptor.csv` (summary statistics for each split).
- Plots
  - `plots/boxplot.png`: distribution of ROC AUC by `feature_extractor` and `mil` architecture.
  - `*.roc_auc.png`: ROC AUC curves for each configuration.
- Heatmaps
  - `heatmaps/{feature_extractor}.{mil}/attention_scores/`: H5 files containing attention scores.
  - `predictions.csv`: predictions for the best model.
  - `topk_patches/`:
    - `{slide_id}/heatmap_*.png`: attention heatmap overlay.
    - `{slide_id}/topk_patches/top_*.png`: highest-attention patches.
  - `tiff/`: converted BigTIFF heatmaps.
- Pipeline information
  - `pipeline_info/` (timeline, report, trace, DAG HTML) generated automatically by Nextflow.
- Nextflow ≥ 22.x
- Access to Singularity/Wave containers (configured in `nextflow.config`).
- A cluster with SLURM if using the `kutral` profile (the default in this repo).
- Load the environment where Nextflow and Singularity are available.
- Build the Singularity container for HistoMILTrainer. Navigate to the `singularity/` directory and build the container image:

  ```
  cd singularity/
  singularity build histomil.sif histomil.def
  ```

  This will create the `histomil.sif` image that Nextflow uses to run the pipeline processes.
- Configure feature extractors: ensure `params/feature_extractors.csv` exists and contains the feature extractor configurations you want to evaluate.
- Configure MIL architectures: ensure `params/architectures.csv` exists and contains the MIL architectures you want to evaluate.
- Choose or edit a params file in the `params/` directory:
  - Set `dataset`: path to your CSV with `case_id`, `slide_id`, and target columns.
  - Set `features_dir`: base directory where feature directories are located.
  - Set `target`: column name of the target variable (e.g., `target`, `ESR1`, `MKI67`).
  - Set `outdir`: output directory for this run.
  - Set `task`: `"classification"` (currently only classification is supported).
- Run the pipeline:

  ```
  # HRR ER classification
  nextflow run main.nf -profile kutral -params-file params/params_hrr_er.yml

  # MKI67 classification
  nextflow run main.nf -profile kutral -params-file params/params_mki67_class.yml
  ```

  For local execution (without SLURM), you can use the `local` profile defined in `nextflow.config`:

  ```
  nextflow run main.nf -profile local -params-file params/params_hrr_er.yml
  ```

The pipeline supports multiple state-of-the-art MIL architectures from MIL-Lab:
- ABMIL: Attention-based Multiple Instance Learning
- CLAM: Clustering-constrained Attention Multiple Instance Learning
- DSMIL: Dual-stream Multiple Instance Learning
- DFTD: Deep Feature-based Top-Down attention
- ILRA: Instance-Level Representation Aggregation
- RRT: Residual Regression Transformer
- Transformer: Transformer-based MIL
- TransMIL: Transformer-based Correlated Multiple Instance Learning
- WIKG: Weighted Instance Knowledge Graph
Each architecture can be configured via JSON files in `bin/HistoMILTrainer/configs/`. The pipeline uses 3-fold cross-validation by default (configurable in `grid_search.py`).
Note: CLAM automatically sets `batch_size` to 1 during training. Make sure MIL-Lab is properly installed and accessible in your Python path.
After running the pipeline, the output directory (`params.outdir`) will have the following structure:

```
results/
├── splits/                          # Train/val/test splits
│   ├── target/
│   │   ├── dataset.csv
│   │   ├── splits_0_bool.csv
│   │   ├── splits_0_descriptor.csv
│   │   └── ...
│   └── ...
├── training/                        # Training results
│   ├── summary.csv                  # Concatenated summary
│   ├── {feature_extractor}.{mil}/
│   │   └── test_results_{feature_extractor}.{mil}.csv
│   └── ...
├── predictions/                     # Test set predictions
│   ├── {feature_extractor}.{mil}/
│   │   ├── predictions_{feature_extractor}.{mil}_0.csv
│   │   ├── predictions_{feature_extractor}.{mil}_1.csv
│   │   └── ...
│   └── ...
├── plots/                           # Generated plots
│   ├── boxplot.png                  # ROC AUC comparison boxplot
│   └── *.roc_auc.png                # ROC curves
├── heatmaps/                        # Attention heatmaps and predictions
│   ├── {feature_extractor}.{mil}/
│   │   ├── attention_scores/
│   │   ├── predictions.csv
│   │   ├── topk_patches/
│   │   └── tiff/
│   └── ...
└── pipeline_info/                   # Nextflow execution reports
    ├── execution_report_*.html
    ├── execution_timeline_*.html
    ├── execution_trace_*.txt
    └── pipeline_dag_*.html
```
- Feature extractor configuration: make sure the `patch_encoder`, `patch_size`, `mag`, and `overlap` values in `params/feature_extractors.csv` match the directory structure in your `features_dir`.
- Case-level splitting: the pipeline splits data at the case level to prevent data leakage. Multiple slides from the same case will always be in the same split (train/val/test).
- Cross-validation: the pipeline uses 10-fold cross-validation by default. Each fold generates separate test metrics and predictions.
- Memory and GPU requirements: grid search processes can be memory- and GPU-intensive. The default configuration allocates 80 GB of memory, 16 CPUs, and 1 GPU for grid search processes. Adjust in `nextflow.config` if needed.
- Resume execution: Nextflow supports resuming failed runs. Use the `-resume` flag:

  ```
  nextflow run main.nf -profile kutral -params-file params/params_hrr_er.yml -resume
  ```

- Feature format: features should be pre-extracted and stored in H5 format. Each slide should have a corresponding `{slide_id}.h5` file containing the `features` array.
If you use this pipeline in your research, please cite:
- MIL-Lab: the repository containing the MIL architectures used in this pipeline.
  - Repository: https://github.com/mahmoodlab/MIL-Lab
  - Please cite the original MIL-Lab paper and the specific architecture papers you use.
- HistoMIL: the library used for training MIL architectures on histology data.
  - Repository: https://github.com/digenoma-lab/HistoMIL
  - Please cite the HistoMIL library if you use it in your research.
- This pipeline: if you use this Nextflow pipeline, please cite this repository.
Author: Gabriel Cabas
For questions or suggestions, please open an issue or pull request in this repository.
