ultra-fast matching of kmers using rolling and perfect hashing


digenoma-lab/FastKM


FastKM

FastKM is a lightweight C++ tool for fast k-mer marker lookup in long reads. It uses a minimal perfect hash function (MPHF) together with a compact probabilistic fingerprint to control false positives. It scans FASTQ/FASTA long-read sequences and reports per-read marker statistics (hit counts, mean and standard deviation of the distance between hits, and coverage) across one or more k-mer databases (e.g., per-haplotype marker sets at multiple k sizes).
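An MPHF maps every key, including keys that were never indexed, to some slot, which is why a stored fingerprint is needed to reject non-markers. The sketch below illustrates that idea only; it is not FastKM's implementation. The names `ToyIndex`, `build_toy_index`, `hash_kmer`, and `mix64`, the brute-force seed search, and the 16-bit fingerprint width are all assumptions chosen for illustration, and the seed search is only practical for a handful of distinct keys.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// splitmix64 finalizer, used here as a general-purpose 64-bit mixer.
static uint64_t mix64(uint64_t x) {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
}

// Seeded string hash: FNV-1a style accumulation followed by mix64.
static uint64_t hash_kmer(const std::string& s, uint64_t seed) {
    uint64_t h = 0xcbf29ce484222325ULL ^ mix64(seed);
    for (unsigned char c : s) { h ^= c; h *= 0x100000001b3ULL; }
    return mix64(h);
}

// Toy index (hypothetical): a seed that makes hash % n collision-free on the
// key set, plus one 16-bit fingerprint per slot.
struct ToyIndex {
    uint64_t seed = 0;
    std::vector<uint16_t> fingerprint;

    size_t slot(const std::string& k) const {
        return static_cast<size_t>(hash_kmer(k, seed) % fingerprint.size());
    }
    static uint16_t fingerprint_of(const std::string& k) {
        // Independent seed so the fingerprint is not tied to the slot hash.
        return static_cast<uint16_t>(hash_kmer(k, 0xF1F1F1F1ULL) >> 48);
    }
    // True for every indexed key; true for a foreign key only if its
    // fingerprint happens to collide (about 1 in 65536 with 16 bits).
    bool contains(const std::string& k) const {
        return fingerprint[slot(k)] == fingerprint_of(k);
    }
};

// Brute-force seed search; assumes a small set of distinct keys.
std::optional<ToyIndex> build_toy_index(const std::vector<std::string>& keys,
                                        uint64_t max_tries = 1000000) {
    const size_t n = keys.size();
    if (n == 0) return std::nullopt;
    for (uint64_t seed = 0; seed < max_tries; ++seed) {
        std::vector<bool> used(n, false);
        bool ok = true;
        for (const auto& k : keys) {
            const size_t s = static_cast<size_t>(hash_kmer(k, seed) % n);
            if (used[s]) { ok = false; break; }  // collision: try the next seed
            used[s] = true;
        }
        if (!ok) continue;
        ToyIndex idx;
        idx.seed = seed;
        idx.fingerprint.assign(n, 0);
        for (const auto& k : keys) idx.fingerprint[idx.slot(k)] = ToyIndex::fingerprint_of(k);
        return idx;
    }
    return std::nullopt;  // no collision-free seed found within max_tries
}
```

The design trade-off is memory versus error rate: each extra fingerprint bit halves the expected false-positive rate at the cost of one more bit per indexed k-mer.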

What it does

Given:

  1. a list of k-mer databases (one per file / k / label), and
  2. a gzipped long-read file,

FastKM:

  • builds an MPHF index for each k-mer set (constant-time queries),
  • scans each long read using a rolling hash (ntHash),
  • checks both forward and reverse-complement k-mers,
  • outputs a tabular matrix with per-read features per database:
    • n_* = number of marker hits
    • m_* = mean distance between consecutive hits (bp)
    • s_* = stddev of distances (bp)
    • cov_* = approximate span coverage (%) based on first/last hit positions
    • size_* = k-mer size
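The per-read features above can be sketched as follows. This is an illustration under stated assumptions, not FastKM's code: a `std::unordered_set` stands in for the MPHF index, k-mers are extracted with `substr` rather than a rolling hash such as ntHash, the standard deviation is the population form, and coverage is taken as the span from the first hit to the end of the last hit divided by the read length. The names `find_hits`, `summarize`, and `ReadStats` are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

// Reverse complement of a DNA string; non-ACGT characters become 'N'.
std::string reverse_complement(const std::string& s) {
    std::string rc(s.rbegin(), s.rend());
    for (char& c : rc) {
        switch (c) {
            case 'A': c = 'T'; break;
            case 'C': c = 'G'; break;
            case 'G': c = 'C'; break;
            case 'T': c = 'A'; break;
            default:  c = 'N'; break;
        }
    }
    return rc;
}

// Start positions where the forward or reverse-complement k-mer is a marker.
std::vector<size_t> find_hits(const std::string& read, size_t k,
                              const std::unordered_set<std::string>& markers) {
    std::vector<size_t> hits;
    if (k == 0 || read.size() < k) return hits;
    for (size_t i = 0; i + k <= read.size(); ++i) {
        const std::string kmer = read.substr(i, k);
        if (markers.count(kmer) || markers.count(reverse_complement(kmer))) {
            hits.push_back(i);
        }
    }
    return hits;
}

struct ReadStats {
    size_t n;          // number of marker hits
    double mean_dist;  // mean distance between consecutive hits (bp)
    double sd_dist;    // population standard deviation of those distances (bp)
    double cov_pct;    // assumed: (last - first + k) / read_len * 100
};

// Summarize sorted hit positions for one read and one database.
ReadStats summarize(const std::vector<size_t>& hits, size_t k, size_t read_len) {
    ReadStats st{hits.size(), 0.0, 0.0, 0.0};
    if (hits.empty() || read_len == 0) return st;
    st.cov_pct = 100.0 * static_cast<double>(hits.back() - hits.front() + k)
                 / static_cast<double>(read_len);
    if (hits.size() < 2) return st;  // no distances with a single hit
    const double gaps = static_cast<double>(hits.size() - 1);
    st.mean_dist = static_cast<double>(hits.back() - hits.front()) / gaps;
    double ss = 0.0;
    for (size_t i = 1; i < hits.size(); ++i) {
        const double d = static_cast<double>(hits[i] - hits[i - 1]) - st.mean_dist;
        ss += d * d;
    }
    st.sd_dist = std::sqrt(ss / gaps);
    return st;
}
```

For example, hits at positions 0, 10, and 30 with k = 5 on a 100 bp read give two distances (10 and 20), so the mean is 15 bp, the population standard deviation is 5 bp, and the span coverage under the assumed formula is 35%.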

Input formats

1) K-mer database list (argument 1)

A text file where each line has three whitespace-separated fields. For example, kmerdb.txt contains:

/trio_data/unique-mers/uk15/hapA_only_kmers.txt 15 A
/trio_data/unique-mers/uk15/hapB_only_kmers.txt 15 B
/trio_data/unique-mers/uk18/hapA_only_kmers.txt 18 A
/trio_data/unique-mers/uk18/hapB_only_kmers.txt 18 B
/trio_data/unique-mers/uk21/hapA_only_kmers.txt 21 A
/trio_data/unique-mers/uk21/hapB_only_kmers.txt 21 B
/trio_data/unique-mers/uk24/hapA_only_kmers.txt 24 A
/trio_data/unique-mers/uk24/hapB_only_kmers.txt 24 B

The columns are:

  1. Path to the file of unique k-mers
  2. k-mer size
  3. Haplotype label
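A minimal sketch of reading this list, assuming fields are separated by spaces or tabs as in the example above. The struct `KmerDbEntry` and the function `parse_db_list` are hypothetical names, and silently skipping blank or malformed lines is a choice made here for illustration, not documented FastKM behavior.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// One line of the database list: k-mer file path, k-mer size, haplotype label.
struct KmerDbEntry {
    std::string path;
    int k = 0;
    std::string label;
};

// Parse whitespace-separated "path k label" lines from any input stream.
std::vector<KmerDbEntry> parse_db_list(std::istream& in) {
    std::vector<KmerDbEntry> entries;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream fields(line);
        KmerDbEntry e;
        // Skip blank lines and lines without three well-formed fields.
        if (!(fields >> e.path >> e.k >> e.label)) continue;
        if (e.k <= 0) continue;  // a k-mer size must be positive
        entries.push_back(e);
    }
    return entries;
}
```

Taking a `std::istream&` rather than a file name lets the same function be exercised with a `std::ifstream` for real input or a `std::istringstream` in tests.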

2) Long reads file (argument 2)

  • Gzipped FASTQ and FASTA files are supported via kseq + zlib.
  • Reads shorter than 500 bp are skipped.

Run the code

./FastKM kmerdb.txt long-reads.fastq.gz <number_of_cores>

Citation

If you use FastKM in academic work, please cite the associated repository and (if applicable) the manuscript where FastKM is described.

License

MIT License.

Contact

Maintainer: Alex Di Genova

Issues/feature requests: please open a GitHub issue in this repository.
