FastKM is a lightweight C++ tool for fast k-mer marker lookup in long reads using a minimal perfect hash function (MPHF) and a compact probabilistic fingerprint to control false positives. It scans FASTQ/FASTA long-read sequences and reports per-read marker statistics (counts, sd, coverage) across one or more k-mer databases (e.g., per-haplotype marker sets at multiple k sizes).
Given:
- a list of k-mer databases (one per file / k / label), and
- a gzipped long-read file,
FastKM:
- builds an MPHF index for each k-mer set (constant-time queries),
- scans each long read using a rolling hash (ntHash),
- checks both forward and reverse-complement k-mers,
- outputs a tabular matrix with per-read features per database:
  - n_* = number of marker hits
  - m_* = mean distance between consecutive hits (bp)
  - s_* = stddev of distances (bp)
  - cov_* = approximate span coverage (%) based on first/last hit positions
  - size_* = k-mer size
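The scan-and-summarize steps above can be sketched as follows. This is a minimal illustration, not FastKM's implementation: a `std::unordered_set` stands in for the MPHF plus fingerprint, plain substring extraction stands in for ntHash, and the names `ReadFeatures`, `revcomp`, `find_hits`, and `summarize` are hypothetical. The coverage formula (span from the first hit start to the last hit end, divided by read length) and the use of the population standard deviation are assumptions about how the reported columns might be computed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Per-read features for one database, mirroring the n_/m_/s_/cov_ columns.
struct ReadFeatures {
    std::size_t n = 0;  // number of marker hits
    double mean = 0.0;  // mean distance between consecutive hits (bp)
    double sd = 0.0;    // standard deviation of those distances (bp)
    double cov = 0.0;   // first-to-last hit span as % of read length
};

// Reverse complement of a DNA string; non-ACGT bases become 'N'.
std::string revcomp(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default:  c = 'N';
        }
    }
    return rc;
}

// Start positions of k-mers whose forward or reverse-complement
// form is present in `markers` (stand-in for the MPHF lookup).
std::vector<std::size_t> find_hits(const std::string& read, std::size_t k,
                                   const std::unordered_set<std::string>& markers) {
    assert(k > 0);
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i + k <= read.size(); ++i) {
        const std::string kmer = read.substr(i, k);
        if (markers.count(kmer) || markers.count(revcomp(kmer))) hits.push_back(i);
    }
    return hits;
}

// Summarize ascending hit positions into per-read features.
// Assumption: coverage = (last - first + k) / read_len, population sd.
ReadFeatures summarize(const std::vector<std::size_t>& hits, std::size_t k,
                       std::size_t read_len) {
    ReadFeatures f;
    f.n = hits.size();
    if (hits.empty() || read_len == 0) return f;
    const double span = static_cast<double>(hits.back() - hits.front() + k);
    f.cov = 100.0 * span / static_cast<double>(read_len);
    if (hits.size() < 2) return f;  // no distances with a single hit
    const double gaps = static_cast<double>(hits.size() - 1);
    double sum = 0.0;
    for (std::size_t i = 1; i < hits.size(); ++i) sum += static_cast<double>(hits[i] - hits[i - 1]);
    f.mean = sum / gaps;
    double sq = 0.0;
    for (std::size_t i = 1; i < hits.size(); ++i) {
        const double d = static_cast<double>(hits[i] - hits[i - 1]) - f.mean;
        sq += d * d;
    }
    f.sd = std::sqrt(sq / gaps);
    return f;
}
```

Calling `find_hits(read, k, markers)` and passing the result to `summarize(hits, k, read.size())` once per database would give one group of columns per read. A real implementation would update the forward and reverse-complement hashes in constant time per base rather than rebuilding each k-mer string.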
The k-mer database list is a text file where each line has 3 fields.
An example kmerdb.txt file has the following content:
/trio_data/unique-mers/uk15/hapA_only_kmers.txt 15 A
/trio_data/unique-mers/uk15/hapB_only_kmers.txt 15 B
/trio_data/unique-mers/uk18/hapA_only_kmers.txt 18 A
/trio_data/unique-mers/uk18/hapB_only_kmers.txt 18 B
/trio_data/unique-mers/uk21/hapA_only_kmers.txt 21 A
/trio_data/unique-mers/uk21/hapB_only_kmers.txt 21 B
/trio_data/unique-mers/uk24/hapA_only_kmers.txt 24 A
/trio_data/unique-mers/uk24/hapB_only_kmers.txt 24 B
The columns are:
- File with unique k-mers
- k-mer size
- haplotype
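As an illustration of the three-column layout, the sketch below reads such a list into a vector of records. It assumes whitespace-separated fields, as in the example file, and silently skips blank or malformed lines; the names `KmerDbEntry` and `parse_db_list` are hypothetical and are not taken from the FastKM source.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One line of the database list: k-mer file, k-mer size, haplotype label.
struct KmerDbEntry {
    std::string path;
    int k = 0;
    std::string label;
};

// Parse a whitespace-separated database list from any input stream.
// Lines that do not yield "path k label" with k > 0 are skipped.
std::vector<KmerDbEntry> parse_db_list(std::istream& in) {
    std::vector<KmerDbEntry> entries;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        KmerDbEntry e;
        if (fields >> e.path >> e.k >> e.label && e.k > 0) {
            entries.push_back(e);
        }
    }
    return entries;
}
```

Passing a `std::ifstream` opened on kmerdb.txt would yield one entry per database, which could then be grouped by k-mer size or haplotype label.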
Gzipped FASTQ/FASTA long-read input is supported via kseq + zlib. Reads shorter than 500 bp are skipped.
./FastKM kmerdb.txt long-reads.fastq.gz <number_of_cores>
If you use FastKM in academic work, please cite the associated repository and (if applicable) the manuscript where FastKM is described.
MIT License.
Maintainer: Alex Di Genova
Issues/feature requests: please open a GitHub issue in this repository.