Genomic Elastic Net Boosting on GPU (GENBoostGPU)
GENBoostGPU provides a scalable framework for running elastic net regression with
boosting across thousands of CpG sites, leveraging GPU acceleration with RAPIDS cuML,
CuPy, and cuDF.
It supports SNP preprocessing, cis-window filtering, LD clumping, missing data
imputation, and phenotype integration — all optimized for large-scale epigenomics.
- Window-based orchestration:
  - run_windows_with_dask coordinates execution across one or more GPUs using Dask.
  - Handles batch scheduling of thousands of genomic windows.
- Single-window analysis:
  - run_single_window executes boosting elastic net on one genomic region.
  - Accepts pre-loaded arrays (CuPy) or file paths (PLINK, phenotype tables).
- GPU-accelerated boosting elastic net:
  - Iterative boosting with cuML ElasticNet and final Ridge refit.
  - Early stopping based on stability of variance explained.
- Automated SNP preprocessing:
  - Zero-variance SNP filtering
  - Missing genotype imputation
  - LD clumping (PLINK-like) with CuPy
  - Cis-window SNP filtering
- Hyperparameter optimization:
  - Optuna-based tuning of ElasticNet (alpha, l1_ratio)
  - Ridge regression tuning with delayed evaluation
  - Optional manual cross-validation for custom grids
- Scalability:
  - Dask orchestration for multiple GPUs (LocalCUDACluster)
  - Single-GPU fallback for smaller jobs
- Flexible outputs:
  - SNP betas, heritability estimates, variance explained
  - Window-level summary tables (.parquet)
  - Intermediate ridge/elastic net models for reproducibility
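
To make the SNP preprocessing steps listed above concrete, here is a minimal CPU-only sketch using only the Python standard library. It is not the package's implementation (which runs on the GPU with CuPy): the record layout (`id`, `pos`, `pval`, `dosages`), the function names, the `r2_max` threshold, and the order of the steps are all assumptions made for illustration. PLINK's clumping also restricts comparisons to a physical distance window, which is left out here for brevity.

```python
from statistics import mean, pvariance


def impute_mean(dosages):
    """Replace missing calls (None) with the mean of the observed dosages."""
    observed = [g for g in dosages if g is not None]
    fill = float(mean(observed)) if observed else 0.0
    return [fill if g is None else float(g) for g in dosages]


def r_squared(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return 0.0 if sxx == 0 or syy == 0 else (sxy * sxy) / (sxx * syy)


def preprocess(snps, start, end, r2_max=0.2):
    """Filter a list of SNP records (hypothetical dict layout) for one window."""
    # 1. Cis-window filter: keep SNPs whose position falls inside [start, end].
    in_window = [s for s in snps if start <= s["pos"] <= end]
    # 2. Impute missing genotypes on copies so the input records stay untouched.
    imputed = [{**s, "dosages": impute_mean(s["dosages"])} for s in in_window]
    # 3. Drop zero-variance SNPs, which carry no information for regression.
    variable = [s for s in imputed if pvariance(s["dosages"]) > 0]
    # 4. Greedy LD clumping: visit SNPs from most to least significant and keep
    #    one only if its r^2 with every already-kept SNP is below r2_max.
    kept = []
    for cand in sorted(variable, key=lambda s: s["pval"]):
        if all(r_squared(cand["dosages"], k["dosages"]) < r2_max for k in kept):
            kept.append(cand)
    return kept


# Toy data: rs2 duplicates rs1 after imputation, rs3 is constant,
# and rs5 lies outside the window.
snps = [
    {"id": "rs1", "pos": 100, "pval": 0.001, "dosages": [0, 1, 2, None]},
    {"id": "rs2", "pos": 200, "pval": 0.01, "dosages": [0, 1, 2, 1]},
    {"id": "rs3", "pos": 300, "pval": 0.05, "dosages": [1, 1, 1, 1]},
    {"id": "rs4", "pos": 400, "pval": 0.02, "dosages": [1, 0, 1, 2]},
    {"id": "rs5", "pos": 9000, "pval": 0.0001, "dosages": [0, 2, 1, 1]},
]
kept_ids = [s["id"] for s in preprocess(snps, start=0, end=1_000)]
print(kept_ids)
```

On this toy input only rs1 and rs4 survive: rs5 falls outside the window, rs3 has zero variance, and rs2 is in perfect LD with the more significant rs1.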
GENBoostGPU is available on PyPI.
It requires Python ≥3.10 and an NVIDIA GPU with CUDA 12.x.
pip install genboostgpu

For development (from source):
git clone https://github.com/heart-gen/GENBoostGPU.git
cd GENBoostGPU
poetry install

GENBoostGPU can be used either for large-scale orchestration (many genomic windows across one or more GPUs) or for single-window testing/debugging.
The simplest entry point is run_single_window, which takes either:
- File paths (PLINK genotypes + phenotype file + phenotype ID), or
- Pre-loaded CuPy arrays for genotypes and phenotypes.
from genboostgpu.vmr_runner import run_single_window

result = run_single_window(
    chrom=21,
    start=10_000,
    end=510_000,
    geno_path="data/chr21_subset.bed",
    pheno_path="data/phenotypes.tsv",
    pheno_id="pheno_379",
    outdir="results",
    n_iter=50,
    n_trials=10,
)
print(result)

Output is a Python dictionary, e.g.:
{
    "chrom": 21,
    "start": 10000,
    "end": 510000,
    "num_snps": 742,
    "final_r2": 0.34,
    "h2_unscaled": 0.29,
    "n_iter": 37
}

This produces:
- Window-level summary (Python dict)
- Saved results (.parquet, betas, heritability estimates) in results/
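
The example output reports n_iter of 37 although n_iter=50 was requested. That is what one would expect if the early-stopping rule ("stability of variance explained") ended the loop before the iteration cap, though the source does not say so explicitly. The sketch below shows one plausible form of such a rule; `patience`, `tol`, and both function names are hypothetical and are not taken from the package.

```python
def should_stop(r2_history, patience=5, tol=1e-4):
    """True once R^2 has moved by less than `tol` over the last `patience` steps."""
    if len(r2_history) < patience + 1:
        return False
    recent = r2_history[-(patience + 1):]
    return max(recent) - min(recent) < tol


def run_boosting(step, n_iter=50, patience=5, tol=1e-4):
    """Call step(i), which returns R^2 after iteration i, until stable or capped."""
    history = []
    for i in range(n_iter):
        history.append(step(i))
        if should_stop(history, patience, tol):
            break
    return history


def toy_curve(i):
    """Toy R^2 trajectory that saturates at 0.34 (stands in for a real fit)."""
    return 0.34 * (1 - 0.5 ** (i + 1))


history = run_boosting(toy_curve, n_iter=50)
print(len(history), round(history[-1], 2))
```

With this toy curve the loop ends well before the 50-iteration cap, so the number of iterations actually run is smaller than the n_iter that was requested.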
REGION=caudate python examples/vmr_test_caudate.py

Script outline (examples/vmr_test_caudate.py):
from genboostgpu.orchestration import run_windows_with_dask

# windows and error_regions are built earlier in the script (omitted from this outline)
df = run_windows_with_dask(
    windows, error_regions=error_regions,
    outdir="results", window_size=500_000,
    n_iter=100, n_trials=20, use_window=True,
    save=True, prefix="vmr"
)

This runs boosting elastic net across all VMR-defined windows for the chosen region.
NUM_SAMPLES=100 python examples/simu_test_100n.py

Script outline (examples/simu_test_100n.py):
from genboostgpu.orchestration import run_windows_with_dask

# windows is built earlier in the script (omitted from this outline)
df = run_windows_with_dask(
    windows, outdir="results", window_size=500_000,
    n_iter=100, n_trials=10, use_window=False,
    save=True, prefix="simu_100"
)

This runs boosting elastic net across synthetic SNP–phenotype pairs for benchmarking.
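
For readers who want to see what a synthetic SNP–phenotype pair can look like, the following standard-library sketch simulates genotype dosages and a phenotype with a chosen heritability. It is not the generator used by examples/simu_test_100n.py; the function name, the parameters, and the simulation model (independent SNPs, a handful of causal variants, Gaussian noise) are assumptions for illustration only.

```python
import random
from statistics import pvariance


def simulate_pair(n_samples=100, n_snps=50, n_causal=5, h2=0.3, seed=0):
    """Return (genotypes, phenotype, true_betas) for one synthetic window."""
    rng = random.Random(seed)
    mafs = [rng.uniform(0.05, 0.5) for _ in range(n_snps)]
    # Dosage = number of minor alleles across two independent allele draws.
    geno = [
        [(rng.random() < maf) + (rng.random() < maf) for maf in mafs]
        for _ in range(n_samples)
    ]
    # Only a few SNPs get a non-zero effect; the rest stay at exactly 0.0.
    beta = [0.0] * n_snps
    for j in rng.sample(range(n_snps), n_causal):
        beta[j] = rng.gauss(0.0, 1.0)
    genetic = [sum(b * g for b, g in zip(beta, row)) for row in geno]
    var_g = pvariance(genetic)
    # Choose the noise scale so that var_g / (var_g + var_e) equals h2.
    sd_e = (var_g * (1.0 - h2) / h2) ** 0.5
    pheno = [g + rng.gauss(0.0, sd_e) for g in genetic]
    return geno, pheno, beta


geno, pheno, beta = simulate_pair(n_samples=100)
print(len(geno), len(geno[0]), sum(b != 0.0 for b in beta))
```

Because the true effect sizes are returned alongside the data, estimated SNP betas and heritability can be compared against known values when benchmarking.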
The million-scale CpG pipeline example lives in examples/cpg_test_million.py. It expects per-chromosome CpG manifests, per-chromosome phenotype tables, and a PLINK genotype prefix.
Match the default templates used by examples/cpg_test_million.py:
data/
  cpg_manifests/
    cpg_manifest_chr{chrom}.parquet
  phenotypes/
    pheno_chr{chrom}.parquet
  genotypes/
    <plink_prefix>.bed
    <plink_prefix>.bim
    <plink_prefix>.fam
Concretely, the files should look like:
- data/cpg_manifests/cpg_manifest_chr{chrom}.parquet
- data/phenotypes/pheno_chr{chrom}.parquet
- data/genotypes/<plink_prefix>.bed/.bim/.fam
If your BSseq object already exists in memory (for example, as bs), save it first:
saveRDS(bs, "data/bsseq.rds")

If your sample identifiers live in pData(bs)$sample_id, pass that column name to the helper script via --sample-id-col sample_id.
Then run the repository helper script:
Rscript scripts/prepare_cpg_inputs.R --bsseq data/bsseq.rds --output data

Useful options:
- --sample-id-col sample_id when sample IDs are stored in a specific pData(bs) column.
- --validate-fam data/genotypes/genotypes.fam to ensure phenotype sample IDs match the PLINK .fam file.
- --no-smooth if the BSseq object is already smoothed or you do not want smoothing.
- --min-cov 1 sets the median coverage filter (e.g., 1 keeps loci with median coverage ≥ 1).
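
The kind of check that --validate-fam performs can also be done by hand. The sketch below relies on the standard PLINK .fam layout (whitespace-delimited, with the within-family sample ID in the second column). Whether the helper script matches on that column is an assumption, and the function names are hypothetical.

```python
def parse_fam_iids(lines):
    """Return the within-family IDs (second column) from PLINK .fam lines."""
    iids = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            iids.append(fields[1])
    return iids


def compare_sample_ids(pheno_ids, fam_iids):
    """Summarise the overlap between phenotype sample IDs and .fam sample IDs."""
    pheno, fam = set(pheno_ids), set(fam_iids)
    return {
        "shared": sorted(pheno & fam),
        "only_in_pheno": sorted(pheno - fam),
        "only_in_fam": sorted(fam - pheno),
    }


# Toy .fam content: FID IID father mother sex phenotype
fam_lines = [
    "FAM1 S001 0 0 1 -9",
    "FAM2 S002 0 0 2 -9",
    "FAM3 S003 0 0 1 -9",
]
report = compare_sample_ids(["S001", "S002", "S999"], parse_fam_iids(fam_lines))
print(report)
```

For a real file, pass an open file handle instead of the list, for example `parse_fam_iids(open("data/genotypes/genotypes.fam"))`. Any entries under only_in_pheno or only_in_fam point to samples that need to be reconciled before running the pipeline.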
The script writes per-chromosome manifests and phenotypes that match the pipeline defaults:
- data/cpg_manifests/cpg_manifest_chr1.parquet, etc.
- data/phenotypes/pheno_chr1.parquet, etc.
With the default output layout (--output data), you can run:
python examples/cpg_test_million.py --geno-path data/genotypes/genotypes

If you write to a different directory, override the templates:

python examples/cpg_test_million.py \
    --geno-path data/genotypes/genotypes \
    --cpg-manifest-template data/cpg_inputs/cpg_manifests/cpg_manifest_chr{chrom}.parquet \
    --pheno-template data/cpg_inputs/phenotypes/pheno_chr{chrom}.parquet

The defaults in examples/cpg_test_million.py assume data/cpg_manifests/ and data/phenotypes/, so either use --output data or pass template overrides.
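
Before launching a long run, it can help to confirm that the templates resolve to files that actually exist. The sketch below expands the {chrom} placeholder with str.format and lists any missing paths. The function names and the choice of chromosomes 1 to 22 are assumptions; the example script may resolve its templates differently.

```python
from pathlib import Path


def expand_templates(manifest_template, pheno_template, chroms=range(1, 23)):
    """Map each chromosome to its (manifest path, phenotype path) pair."""
    return {
        chrom: (
            Path(manifest_template.format(chrom=chrom)),
            Path(pheno_template.format(chrom=chrom)),
        )
        for chrom in chroms
    }


def missing_inputs(paths_by_chrom):
    """Return every expected input path that does not exist on disk."""
    return [
        path.as_posix()
        for pair in paths_by_chrom.values()
        for path in pair
        if not path.exists()
    ]


paths = expand_templates(
    "data/cpg_manifests/cpg_manifest_chr{chrom}.parquet",
    "data/phenotypes/pheno_chr{chrom}.parquet",
    chroms=[1, 2],
)
for path in missing_inputs(paths):
    print("missing:", path)
```

An empty result from missing_inputs means every per-chromosome manifest and phenotype table is where the templates say it should be.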
- On a single GPU: runs without a Dask cluster.
- On multiple GPUs: run_windows_with_dask automatically launches a LocalCUDACluster and distributes windows across devices.
If you use GENBoostGPU in your research, please cite:
Alexis Bennett and Kynon J.M. Benjamin. GENBoostGPU: GPU-accelerated elastic net boosting for large-scale epigenomics. DOI: 10.5281/zenodo.17238798
GENBoostGPU is licensed under the GPL-3.0 license. See the LICENSE file for details.