Pipeline for processing bulk RNA-seq data from raw .fastq files to read counts.
The pipeline is contained within a single shell script and processes short read paired-end .fastq files from bulk RNA-sequencing and outputs raw read counts (un-normalized).
It is written for use on Linux, and is tested on Ubuntu so it should work on debian based systems at least.
The script assumes you are using a debian based linux distro, but it can be adapted for use on other versions (Fedora, Arch). The following programs must be installed otherwise the script will fail to run and exit with an error.
For security, it is safer to manually install the software via the below command(s). For convenience, the script can attempt to install the necessary software if allowed.
sudo apt install fastqc multiqc hisat2 samtools subread md5sum
fastqc
sudo apt install fastqc
multiqc
sudo apt install multiqc
hisat2
sudo apt install hisat2
samtools
sudo apt install samtools
featureCounts (function of subread)
sudo apt install subread
md5sum (it may come with your install)
sudo apt install md5sum
The available flags are defined below.
Example command:
bash RNAseq-pipeline_version.sh -f /path/to/fastqs -t 32 -5 30 -3 70 -i GRCm39
-f /path/to/fastqs [required] Designates the file path for the fastq files to be processed. This is also used as the main working directory and all output files will exist here.
-i <str> [required] Specific reference genome index to use. Path to index files and .gtf must be manually defined in the script.
-t <int> [optional] Number of threads to use. Must be integer value. If no input specified, then by default 80% of threads on the system will be used.
-5 <int> [optional] Number of bases to trim from the 5' end of sequences.
-3 <int> [optional] Number of bases to trim from the 3' end of sequences.
-q [optional] Run script in QC only mode. This runs fastqc and multiqc and then exits. Useful for deciding -5 and -3 flag trim settings.
-h Prints usage information.
The script expects a file named "md5.txt" in the same directory as the fastq files to be processed. If there is no file found, the script will offer to continue without verifiying. Verification of fastq file hashes is highly encouraged.
This section includes additional settings that can be manually modified if necessary. Typically, they will not need to be altered.
By default it is defined as "ILLUMINA". If another platform was used, it can be entered here for posterity.
HISAT2 offers 3 options for strandedness when aligning reads.
R or RF for RF/fr-firststrand stranded (dUTP)
F or FR for FR/fr-secondstrand stranded (Ligation)
No inclusion of the flag for unstranded
Most relatively recent sequencing data will be dUTP (firststranded), abbreviated as "RF".
If there is no available index available or you just want to build it yourself, the following commands will let you build an index using hisat2. Note: your machine must have at least 160 GB of memory available.
Run each command in sequence.
hisat2_extract_splice_sites.py nameofgtffile.gtf > genome.ss
hisat2_extract_exons.py nameofgtffile.gtf > genome.exon
hisat2-build -p <num of threads> --exon genome.exon --ss genome.ss Reference_genome_primary_assembly.fa Output_name_of_index