ARUNA: Slice-based self-supervised imputation enables upscaling of sequencing-based DNA methylation assays
Whole-genome bisulfite sequencing (WGBS) provides near-comprehensive, base-resolution maps of DNA methylation, but its cost limits large-scale studies. Reduced representation bisulfite sequencing (RRBS) and related protocols offer cost-effective alternatives, but measure only a sparse subset of CpGs, creating substantial coverage mismatches across assays.
ARUNA is a self-supervised denoising convolutional autoencoder designed to upscale sparse, sequencing-based methylomes to whole-genome resolution. ARUNA operates on methylation slices: spatially stacked genomic windows that preserve local CpG correlation and cross-sample structure, allowing it to generalize across assays, donors, tissues, and datasets.
This repository provides:
- Core ARUNA model and data-processing code.
- Precomputed example datasets (GTEx chr21 subset).
- Example notebooks for data preparation, inference, and training.
- A pretrained model checkpoint for demonstration.
Contents:
- Installation
- Repository Structure
- Data and Preprocessing
- Running the Examples
- Notes on Reproducibility
- Citation
# Clone the repository:
git clone https://github.com/ylaboratory/ARUNA.git
cd ARUNA
# Create and activate a Python environment (Python ≥ 3.9 recommended):
conda create -n aruna python=3.11
conda activate aruna
# Install dependencies:
pip install -r requirements.txt
This repository is not currently intended to be installed as a published Python package. All scripts and notebooks assume execution from the repository root.
ARUNA/
├── aruna/ # Core model and data-processing code
├── checkpoints/ # Pretrained model checkpoints
├── configs/ # Example configuration YAML files
├── data/
│   ├── gtex_subset/ # Example GTEx WGBS input data (chr21 only)
│   └── metadata/ # Reference genome and RRBS metadata
├── notebooks/ # Example notebooks
├── scripts/ # Inference helpers
├── results/ # Output storage
└── requirements.txt
Large intermediate files and derived datasets are intentionally excluded from version control.
A reference WGBS preprocessing pipeline used to generate the input .cov files from FASTQ inputs is provided in: assets/bioinfopipe_pairedWGBS.sh.
ARUNA operates on a patch-centric representation of the methylome; the preprocessing pipeline also produces chrom-centric data.
Input data are BED-like, CpG-level methylation files (e.g. Bismark *.cov) produced by the initial bioinformatics pipeline and organized as:
data/gtex_subset/
└── <sample_id>/
    └── *.cpgMerged.CpG_report.merged_CpG_evidence.cov
An example GTEx subset (16 samples and chromosome 21 only) is included for demonstration.
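As a rough illustration of the input format, the sketch below parses Bismark-style coverage records using only the standard library. It assumes the usual six tab-separated columns of a Bismark coverage file (chromosome, start, end, percent methylation, methylated read count, unmethylated read count). The function name `read_bismark_cov` and the example coordinates are hypothetical and are not part of the ARUNA code base.

```python
import csv
import io


def read_bismark_cov(handle):
    """Yield (chrom, start, end, meth_fraction, depth) for each CpG record.

    Assumes six tab-separated columns per line:
    chrom, start, end, percent methylation, methylated count, unmethylated count.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        chrom, start, end, _pct, n_meth, n_unmeth = row[:6]
        n_meth, n_unmeth = int(n_meth), int(n_unmeth)
        depth = n_meth + n_unmeth
        if depth == 0:
            continue  # no reads, so no methylation estimate
        # Recompute the fraction from counts instead of trusting the rounded percent.
        yield chrom, int(start), int(end), n_meth / depth, depth


# Illustrative records only (made-up coordinates and counts).
example = (
    "chr21\t5010053\t5010054\t75.0\t3\t1\n"
    "chr21\t5010332\t5010333\t0.0\t0\t8\n"
)
records = list(read_bismark_cov(io.StringIO(example)))
```

For a real file, the same generator can be fed an open file handle, for example `open(path, newline="")`.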
Chrom-centric data:
- Shape: (num_cpgs, num_samples)
- Canonical CpG set derived from the reference genome (hg38)
- Stores fractional methylation and read depth
Patch-centric data:
- Fixed-size windows defined by the number of CpGs per patch
- Enables convolutional modeling of local methylation structure
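The relationship between the two representations can be sketched as follows. This is a minimal illustration, not the repository's implementation: the function name `to_patches`, the NaN padding of the final window, and the output axis order (patch, sample, CpG) are all assumptions made for the example.

```python
import numpy as np


def to_patches(chrom_matrix, cpgs_per_patch=128):
    """Split a (num_cpgs, num_samples) matrix into fixed-size CpG windows.

    Returns an array of shape (num_patches, num_samples, cpgs_per_patch).
    The last window is padded with NaN when num_cpgs is not a multiple
    of cpgs_per_patch (an assumption made for this sketch).
    """
    num_cpgs, num_samples = chrom_matrix.shape
    num_patches = -(-num_cpgs // cpgs_per_patch)  # ceiling division
    padded = np.full((num_patches * cpgs_per_patch, num_samples), np.nan)
    padded[:num_cpgs] = chrom_matrix
    # Group consecutive CpGs into windows, then put samples before CpGs so
    # each patch is a 2D (samples x CpGs) array.
    windows = padded.reshape(num_patches, cpgs_per_patch, num_samples)
    return windows.transpose(0, 2, 1)


# Toy chrom-centric matrix: 300 CpGs, 4 samples.
chrom = np.arange(300 * 4, dtype=float).reshape(300, 4)
patches = to_patches(chrom, cpgs_per_patch=128)
```

Counting window size in CpGs rather than base pairs gives every patch the same number of columns regardless of local CpG density, which is what a fixed-input convolutional model needs.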
Metadata is located under data/metadata/ and includes:
- Canonical CpG coordinates (hg38, 0-indexed)
- Chromosome lengths (hg38)
- Optional RRBS CpG observation probabilities for realistic RRBS-like missingness simulation
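One way per-CpG observation probabilities could drive an RRBS-like simulation is sketched below. It assumes each CpG is kept independently in each sample with its own probability; the function name `simulate_rrbs_like`, the array layout, and the independence assumption are illustrative and may differ from what the repository does.

```python
import numpy as np


def simulate_rrbs_like(meth, obs_prob, seed=0):
    """Hide entries of a (num_cpgs, num_samples) matrix using per-CpG probabilities.

    obs_prob[i] is the assumed probability that CpG i is observed in a sample.
    Returns the masked matrix (NaN where hidden) and the boolean observed mask.
    """
    rng = np.random.default_rng(seed)
    # Draw one uniform number per entry; keep the entry if it falls below
    # that CpG's observation probability.
    observed = rng.random(meth.shape) < obs_prob[:, None]
    masked = np.where(observed, meth, np.nan)
    return masked, observed


# Toy example: 3 CpGs, 5 samples; CpG 0 is always seen, CpG 2 never.
meth = np.full((3, 5), 0.5)
obs_prob = np.array([1.0, 0.5, 0.0])
masked, observed = simulate_rrbs_like(meth, obs_prob, seed=0)
```

Unlike uniform random masking, this keeps the same CpGs observed across most samples, mimicking the way RRBS repeatedly covers the same regions.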
The subsequent step converts sample-wise methylation files into chrom-centric and patch-centric formats.
Relevant code:
- aruna/process_dataset.py
- aruna/patch_metadata.py
Example notebook: notebooks/data_prep.ipynb
Running this notebook produces:
data/gtex/
├── chrom_centric/
│ ├── true/
│ ├── mcar_90/
│ └── rrbs_sim/
└── patch_centric/
    └── numCpg128/
        ├── true/
        ├── mcar_90/
        └── rrbs_sim/
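The mcar_90 directories presumably hold data with 90% of entries hidden completely at random; that reading of the name is an assumption. Under that assumption, a minimal sketch of such masking looks like this (`simulate_mcar` is a hypothetical helper, not a function from the repository):

```python
import numpy as np


def simulate_mcar(meth, missing_fraction=0.9, seed=0):
    """Hide entries completely at random.

    Every entry is hidden independently with probability missing_fraction,
    regardless of its genomic position or sample.
    Returns the masked matrix (NaN where hidden) and the boolean hidden mask.
    """
    rng = np.random.default_rng(seed)
    hidden = rng.random(meth.shape) < missing_fraction
    return np.where(hidden, np.nan, meth), hidden


# Toy matrix: 2000 CpGs, 16 samples.
meth = np.full((2000, 16), 0.5)
masked, hidden = simulate_mcar(meth, missing_fraction=0.9, seed=0)

# A fixed seed reproduces the same mask on every run.
_, hidden_again = simulate_mcar(meth, missing_fraction=0.9, seed=0)
```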
- A pretrained ARUNA model is provided in: checkpoints/trained_model.pth.
- You can run inference using: notebooks/example_infer.ipynb.
- Predicted methylation values and evaluation artifacts will be written to: results/.
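To give a sense of how imputed values might be scored, the sketch below compares predictions with ground truth at positions that were hidden from the model. The metrics (mean absolute error and Pearson correlation) and the function name `evaluate_imputation` are assumptions for illustration; the notebook may report different quantities.

```python
import numpy as np


def evaluate_imputation(truth, predicted, held_out):
    """Score predictions on held-out entries that have a known true value."""
    # Only positions that were hidden AND actually measured can be scored.
    valid = held_out & ~np.isnan(truth)
    t, p = truth[valid], predicted[valid]
    mae = float(np.mean(np.abs(t - p)))
    pearson = float(np.corrcoef(t, p)[0, 1])
    return {"n": int(valid.sum()), "mae": mae, "pearson": pearson}


# Toy example: 3 CpGs x 2 samples, one CpG unmeasured in the truth.
truth = np.array([[0.0, 0.5], [1.0, np.nan], [0.2, 0.8]])
predicted = np.array([[0.1, 0.6], [0.9, 0.3], [0.2, 0.8]])
held_out = np.array([[True, True], [True, True], [False, True]])
scores = evaluate_imputation(truth, predicted, held_out)
```

Restricting the score to held-out positions avoids rewarding the model for copying values it was given as input.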
An example training workflow is provided in: notebooks/example_train.ipynb.
This notebook demonstrates:
- Loading and batching training data.
- Initializing and training ARUNA from scratch on simulated sparse methylomes.
- Saving model checkpoints for downstream inference to: checkpoints/.
Note: Training is computationally intensive and was performed on GPU hardware for the experiments reported in the paper.
- Example data are restricted to chromosome 21 for tractability.
- Noise simulation for inference is performed once and saved to disk.
- Exact Python dependencies are specified in requirements.txt.
Citation: coming soon.
