GitHub - ylaboratory/ARUNA: ARUNA: Slice-based self-supervised imputation for upscaling DNA methylation sequencing assays

ARUNA: Slice-based self-supervised imputation enables upscaling of sequencing-based DNA methylation assays

Overview

Whole-genome bisulfite sequencing (WGBS) provides near-comprehensive, base-resolution maps of DNA methylation, but its cost limits large-scale studies. Reduced representation bisulfite sequencing (RRBS) and related protocols offer cost-effective alternatives, but measure only a sparse subset of CpGs, creating substantial coverage mismatches across assays.

ARUNA is a self-supervised denoising convolutional autoencoder designed to upscale sparse, sequencing-based methylomes to whole-genome resolution. ARUNA operates on methylation slices: spatially stacked genomic windows that preserve local CpG correlation and cross-sample structure, allowing it to generalize across assays, donors, tissues, and datasets.

This repository provides:

Core ARUNA model and data-processing code.
Precomputed example datasets (GTEx chr21 subset).
Example notebooks for data preparation, inference, and training.
A pretrained model checkpoint for demonstration.

Installation

# Clone the repository:
git clone https://github.com/ylaboratory/ARUNA.git
cd ARUNA

# Create and activate a Python environment (Python ≥ 3.9 recommended):
conda create -n aruna python=3.11
conda activate aruna

# Install dependencies:
pip install -r requirements.txt

This repository is not currently distributed as an installable Python package. All scripts and notebooks assume they are run from the repository root.

Repository Structure

ARUNA/
├── aruna/                  # Core model and data-processing code
├── assets/                 # WGBS preprocessing pipeline script
├── checkpoints/            # Pretrained model checkpoints
├── configs/                # Example configuration YAML files
├── data/
│   ├── gtex_subset/        # Example GTEx WGBS input data (chr21 only)
│   └── metadata/           # Reference genome and RRBS metadata
├── notebooks/              # Example notebooks
├── scripts/                # Inference helpers
├── results/                # Output storage
└── requirements.txt

Large intermediate files and derived datasets are intentionally excluded from version control.
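Because scripts and notebooks assume execution from the repository root, a quick check of the working directory can catch path errors early. The sketch below is illustrative only: `EXPECTED_DIRS`, `missing_repo_dirs`, and `looks_like_repo_root` are hypothetical names that are not part of ARUNA, and the directory list is a subset of the tree above.

```python
from pathlib import Path

# Hypothetical helper, not part of ARUNA: a subset of the top-level
# directories listed in the repository tree.
EXPECTED_DIRS = ("aruna", "checkpoints", "configs", "data", "notebooks", "scripts")


def missing_repo_dirs(root="."):
    """Return the expected top-level directories that are absent under `root`."""
    root = Path(root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


def looks_like_repo_root(root="."):
    """True when every expected directory exists directly under `root`."""
    return not missing_repo_dirs(root)
```

Calling `missing_repo_dirs()` at the top of a notebook lists whichever directories are missing from the current working directory.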

Data and Preprocessing

Bioinformatics pipeline

A reference WGBS preprocessing pipeline used to generate the input .cov files from FASTQ inputs is provided in: assets/bioinfopipe_pairedWGBS.sh.

Patch Creation

ARUNA operates on a patch-centric representation of the methylome. The preprocessing step also writes a chrom-centric representation of the same data.

Input data

Input data are BED-like, CpG-level methylation files (e.g. Bismark *.cov) produced by the bioinformatics pipeline above, organized as:

data/gtex_subset/
└── <sample_id>/
    └── *.cpgMerged.CpG_report.merged_CpG_evidence.cov

An example GTEx subset (16 samples and chromosome 21 only) is included for demonstration.
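For orientation, a Bismark coverage file is tab-separated with six columns: chromosome, start, end, methylation percentage, methylated read count, and unmethylated read count. The minimal sketch below parses that layout into fractional methylation and read depth. `parse_cov` is a hypothetical helper rather than ARUNA's loader (see aruna/process_dataset.py for the actual code), and the two example rows are made up.

```python
import csv
import io


def parse_cov(handle):
    """Yield (chrom, start, frac_meth, depth) from a Bismark-style coverage file.

    Hypothetical helper for illustration; not ARUNA's own parser.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        chrom, start, _end, _pct, n_meth, n_unmeth = row[:6]
        n_meth, n_unmeth = int(n_meth), int(n_unmeth)
        depth = n_meth + n_unmeth
        if depth == 0:
            continue  # no reads, so no methylation estimate
        # Recompute the fraction from counts instead of trusting the
        # rounded percentage column.
        yield chrom, int(start), n_meth / depth, depth


# Made-up rows in the six-column layout described above.
example = io.StringIO(
    "chr21\t5010053\t5010054\t75.0\t3\t1\n"
    "chr21\t5010329\t5010330\t0.0\t0\t8\n"
)
records = list(parse_cov(example))
```

To read a real file, pass an open file handle instead of the `io.StringIO` object.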

Chrom-centric

Shape: (num_cpgs, num_samples)
Canonical CpG set derived from the reference genome (hg38)
Stores fractional methylation and read depth
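The (num_cpgs, num_samples) layout can be pictured as one row per canonical CpG and one column per sample. The sketch below builds such a matrix with plain nested lists. It assumes unobserved CpGs are stored as NaN with depth 0, which the description above does not state, and `build_chrom_centric` is a hypothetical name.

```python
import math


def build_chrom_centric(canonical_positions, samples):
    """Return (sample_ids, frac, depth) for one chromosome.

    canonical_positions: sorted CpG start positions from the reference.
    samples: dict of sample_id -> {position: (frac_meth, depth)}.
    frac and depth are nested lists shaped (num_cpgs, num_samples).
    Assumption: unobserved CpGs get NaN methylation and depth 0.
    """
    sample_ids = sorted(samples)
    frac = [[math.nan] * len(sample_ids) for _ in canonical_positions]
    depth = [[0] * len(sample_ids) for _ in canonical_positions]
    for j, sid in enumerate(sample_ids):
        calls = samples[sid]
        for i, pos in enumerate(canonical_positions):
            if pos in calls:
                frac[i][j], depth[i][j] = calls[pos]
    return sample_ids, frac, depth
```

Anchoring every sample to the same canonical CpG set means row `i` refers to the same genomic position in every column.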

Patch-centric

Fixed-size windows defined by the number of CpGs per patch
Enables convolutional modeling of local methylation structure
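As a rough illustration of fixed-size windows, the sketch below cuts a chrom-centric matrix into consecutive patches of `num_cpg` rows. It assumes non-overlapping windows and NaN padding of the final patch. Neither choice is confirmed by the description above, and `make_patches` is a hypothetical name.

```python
import math


def make_patches(rows, num_cpg=128, pad_value=math.nan):
    """Split a (num_cpgs, num_samples) nested list into patches of `num_cpg` rows.

    Assumptions (not taken from ARUNA): windows do not overlap, and the
    last patch is padded with `pad_value` up to the full patch size.
    """
    if num_cpg <= 0:
        raise ValueError("num_cpg must be positive")
    if not rows:
        return []
    width = len(rows[0])
    patches = []
    for start in range(0, len(rows), num_cpg):
        patch = [list(r) for r in rows[start:start + num_cpg]]
        # Pad the trailing patch so every patch has the same shape.
        patch.extend([pad_value] * width for _ in range(num_cpg - len(patch)))
        patches.append(patch)
    return patches
```

Every patch then has the shape (num_cpg, num_samples), so patches can be stacked into a single batch.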

Reference metadata

Located under: data/metadata/

Includes:

Canonical CpG coordinates (hg38, 0-indexed)
Chromosome lengths (hg38)
Optional RRBS CpG observation probabilities for simulating realistic RRBS-like missingness
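One way to read "observation probabilities" is as a per-CpG chance that the site is measured in an RRBS experiment. Under that assumption, RRBS-like missingness can be simulated by keeping each CpG with its own probability and hiding the rest. The sketch below shows the idea for one sample. `rrbs_like_mask` is a hypothetical name, and ARUNA's actual simulation may differ.

```python
import math
import random


def rrbs_like_mask(column, obs_prob, seed=0):
    """Keep each CpG with its own observation probability; hide the rest as NaN.

    column: methylation values for one sample, one entry per CpG.
    obs_prob: per-CpG probability of being observed (assumed semantics).
    """
    if len(column) != len(obs_prob):
        raise ValueError("column and obs_prob must have the same length")
    rng = random.Random(seed)  # fixed seed keeps the simulation repeatable
    return [v if rng.random() < p else math.nan for v, p in zip(column, obs_prob)]
```

Unlike uniform masking, this keeps CpGs with high observation probability far more often than the rest, so the missingness pattern follows the supplied probabilities.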

Running the Examples

Data creation

This step converts sample-wise methylation files into chrom-centric and patch-centric formats.

Relevant code:

aruna/process_dataset.py
aruna/patch_metadata.py

Example notebook: notebooks/data_prep.ipynb

Running this notebook produces:

data/gtex/
├── chrom_centric/
│   ├── true/
│   ├── mcar_90/
│   └── rrbs_sim/
└── patch_centric/
    └── numCpg128/
        ├── true/
        ├── mcar_90/
        └── rrbs_sim/
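The directory name `mcar_90` suggests data with 90% of entries missing completely at random, although that reading is an assumption. A minimal sketch of such masking for one sample follows. `mcar_mask` is a hypothetical name, and hiding an exact count (rather than drawing each entry independently) is a simplification chosen here.

```python
import math
import random


def mcar_mask(values, missing_rate=0.9, seed=0):
    """Hide a fixed fraction of the observed entries, chosen uniformly at random.

    Assumption: `mcar_90` means 90% of observed values are removed
    independently of their position or value.
    """
    rng = random.Random(seed)
    observed = [i for i, v in enumerate(values) if not math.isnan(v)]
    n_hide = round(missing_rate * len(observed))
    hidden = set(rng.sample(observed, n_hide))
    return [math.nan if i in hidden else v for i, v in enumerate(values)]
```

The unmasked `true/` data can then serve as the reference when evaluating values predicted from the masked copy.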

Inference with a Pretrained Model

A pretrained ARUNA model is provided in: checkpoints/trained_model.pth.
You can run inference using: notebooks/example_infer.ipynb.
Predicted methylation values and evaluation artifacts will be written to: results/.

Training a Model

An example training workflow is provided in: notebooks/example_train.ipynb.

This notebook demonstrates:

Loading and batching training data.
Initializing and training ARUNA from scratch on simulated sparse methylomes.
Saving model checkpoints for downstream inference to: checkpoints/.

Note: Training is computationally intensive and was performed on GPU hardware for the experiments reported in the paper.
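Training on simulated sparse methylomes generally means hiding known values, asking the model to reconstruct them, and scoring the reconstruction where the truth is available. The sketch below illustrates that idea with a mean squared error over hidden positions. ARUNA's loss function is not stated here, so `masked_mse` is a generic, hypothetical example rather than the objective used in the notebook.

```python
import math


def masked_mse(pred, target, hidden):
    """Mean squared error over positions hidden from the model with a known target.

    pred: model outputs; target: true methylation (NaN where never measured);
    hidden: True where the input value was masked during corruption.
    Generic illustration only, not ARUNA's documented loss.
    """
    sq = [
        (p - t) ** 2
        for p, t, h in zip(pred, target, hidden)
        if h and not math.isnan(t)
    ]
    return sum(sq) / len(sq) if sq else 0.0
```

Positions that were never measured contribute nothing, since there is no ground truth to compare against.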

Reproducibility Notes

Example data are restricted to chromosome 21 for tractability.
Noise simulation for inference is performed once and saved to disk.
Exact Python dependencies are specified in requirements.txt.
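Simulating noise once and saving it means that every later run evaluates against the same corrupted inputs. A minimal sketch of that pattern is shown below. `load_or_create_mask` is a hypothetical name, and both the JSON format and the file path are assumptions for illustration rather than ARUNA's storage format.

```python
import json
import random
from pathlib import Path


def load_or_create_mask(path, n, missing_rate=0.9, seed=0):
    """Return a boolean 'hidden' mask of length n, generating it only once.

    If `path` already exists, the saved mask is reused and `seed` is ignored,
    so repeated runs see identical missingness.
    """
    path = Path(path)
    if path.exists():
        return json.loads(path.read_text())
    rng = random.Random(seed)
    mask = [rng.random() < missing_rate for _ in range(n)]
    path.write_text(json.dumps(mask))
    return mask
```

Deleting the saved file is then the only way to obtain a different mask.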

Citation

Coming soon.
