## Contents
- Overview
- Directory Structures
- Input Files
- Workflow Profile Configuration
- Output Files
- Additional Resources
- Credits
## Overview

This pipeline assumes an existing conda installation and is written as a Snakemake workflow. To install Snakemake with conda, run

```bash
conda env create -f workflow/envs/snakemake.yaml
conda activate snakemake
```

to create and activate a conda environment named `snakemake`. Once all the input files are ready, run the pipeline locally with

```bash
# by default, uses the workflow profile workflow/profiles/default
./run_pipeline.sh
```

or on a SLURM cluster with

```bash
./run_pipeline.sh --workflow-profile workflow/profiles/slurm
```

Other common usage notes:

- `run_pipeline.sh` passes any additional arguments to `snakemake`.
- To perform a dry run: `./run_pipeline.sh --dry-run`
- To force (re)execution of all rules regardless of past output: `./run_pipeline.sh --forceall`
- To remove intermediate files generated by the pipeline: `./run_pipeline.sh clean`
**Hardware**: This pipeline was developed to be run on a high performance computing (HPC) cluster, but it can also be run locally on a personal computer.
**Operating system**: Linux, macOS

- Lack of Windows support is due to our use of the Bioconda channel for creating the conda environments described in `workflow/envs/`. Bioconda currently does not support Windows. However, we have successfully run this pipeline on Windows Subsystem for Linux (WSL).
- While Bioconda supports macOS and has built an extensive repository of packages for x86_64-based Macs (i.e., with Intel processors; conda platform `osx-64`), only a limited number of packages (or versions of packages) have been built for newer ARM-based Macs (i.e., with M-series Apple Silicon processors; conda platform `osx-arm64`). Consequently, the versions specified by the environment YAML files described in `workflow/envs/` may not be available on Bioconda for the `osx-arm64` platform.
- To run this pipeline on an Apple Silicon Mac, we recommend using conda to create an environment targeting the `osx-64` platform and executing the pipeline programs under Rosetta translation (see the sketch after this list).
  - While macOS should automatically prompt to install Rosetta if needed, it can also be manually installed using the terminal command `/usr/sbin/softwareupdate --install-rosetta`.
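A minimal sketch of this setup, assuming a standard conda installation; `CONDA_SUBDIR` is conda's documented mechanism for overriding the target platform:

```bash
# Create the environment from osx-64 (Intel) packages; on Apple Silicon,
# Rosetta translates the x86_64 binaries at runtime.
CONDA_SUBDIR=osx-64 conda env create -f workflow/envs/snakemake.yaml
conda activate snakemake
# Persist the platform override for any future installs into this environment.
conda config --env --set subdir osx-64
```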
**Code interpreters and runtimes**: The pipeline relies on scripts written in Bash and Python and has been validated using the following versions:
- Bash: 4.2 through 5.2
- Python: 3.9+ (the `workflow/envs/` conda environment files currently use version 3.10)
**Packages**: Additional required third-party programs and packages are specified in conda environments described in `workflow/envs/`.
- Note: Other recent versions of the same software programs will likely work, but we have not tested all of them. Some specific requirements are discussed below.
- The version of `deeptools` used (3.5.2) requires a `matplotlib` version between 3.1.0 and 3.7.5, since later versions of `matplotlib` deprecate some functions used by `deeptools` version 3.5.2. Newer versions of `deeptools` have been updated to support the newer `matplotlib` APIs.
- NumPy version 2.x is not currently supported.
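If conda's solver nonetheless selects incompatible versions, the constraints can be stated explicitly. A hedged example (channel configuration assumed to match that used by `workflow/envs/`):

```bash
conda install -c conda-forge -c bioconda \
    'deeptools=3.5.2' 'matplotlib-base>=3.1.0,<=3.7.5' 'numpy<2'
```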
### Terms
- split-pool tag: the individual sequences that are added during a single round of split-pool barcoding (DPM, EVEN, ODD, TERM)
- split-pool barcode: the permutation of split-pool tags that uniquely identifies a cluster
- Antibody ID: a 9 nt sequence within the antibody oligo that uniquely identifies a type of antibody. In the pipeline, these are identified as a "BPM" tag.
- cluster: a set of reads sharing the same split-pool barcode. Antibody oligo reads in the cluster are used to assign genomic DNA reads to specific antibodies.
### Workflow

1. Define samples and paths to FASTQ files (use `fastq2json.py` or manually generate `samples.json`)
2. Split FASTQ files into chunks for parallel processing
3. Adapter trimming (Trim Galore!)
4. Barcode identification (splitcode)
5. Split DNA (DPM) and antibody ID oligo (BPM) reads [ChIP-DIP] or RNA (RPM) and antibody ID oligo (BPM) reads [SPIDR] into separate files
    - (ChIP-DIP) DNA read workflow:
        1. DPM trimming (Cutadapt)
        2. Genomic alignment (Bowtie 2)
        3. Chromosome renaming (e.g., to UCSC chromosome names) and filtering (e.g., remove non-canonical chromosomes)
        4. Masking (bedtools; use ENCODE blacklists)
    - (SPIDR) RNA read workflow:
        1. RPM trimming (Cutadapt)
        2. Repeat alignment (Bowtie 2)
        3. BAM to FASTQ conversion of unaligned reads (Samtools)
        4. Genomic alignment of unaligned reads (STAR)
        5. Chromosome renaming (e.g., to UCSC chromosome names) and filtering (e.g., remove non-canonical chromosomes)
6. Antibody oligo read workflow:
    2. FASTQ to BAM conversion (`fastq_to_bam.py`; referenced as step 6.ii in the FAQ below)
7. Cluster generation
8. Cluster assignment and antibody-specific BAM file creation
9. Antibody-specific bigWig file creation (deepTools `bamCoverage`)
10. Summary plots
    - Cluster size distributions
    - Maximum representation oligo ECDFs
11. Summary statistics
    - MultiQC (trimming, alignments)
    - Barcode identification efficiency
    - Cluster statistics
### Notes / FAQ
- How are antibody oligo PCR duplicates removed? In step 6.ii above (`fastq_to_bam.py`), oligo UMIs are converted to integers and used as the reference position (column 4) in the BAM file. When clusters are generated in step 7 (during generation of both the split clusters and the merged clusters), reads mapped to the same "position" (i.e., with the same UMI) are deduplicated.
- How are DNA or RNA sequence PCR duplicates removed? We assume that different DNA or RNA molecules within a single cluster are unlikely to have the exact same start and end positions. Therefore, we consider identical fragments (same start and end positions) in the same cluster to be PCR duplicates. When clusters are generated in step 7 (during generation of both the split clusters and the merged clusters), reads mapped to the same position are deduplicated. Note that for RNA, we are unable to distinguish whether reads that align to the same RNA transcript at different positions are fragments from the same or from different RNA transcripts. (A minimal illustration of this rule follows this list.)
- The pipeline is non-deterministic, but the output should be functionally equivalent between runs. Specifically, the non-determinism affects the order in which reads are sorted in BAM files (even after running `samtools sort`) and which representative read from a set of PCR duplicates is kept, none of which should affect downstream analyses (e.g., bigWig generation, peak calling, etc.). Non-determinism arises from the following causes of reads being processed in different orders:
  - If `num_chunks` is set to greater than 1 in `config.yaml`, then the order in which chunks of reads are processed can change between pipeline runs.
  - If more than 1 processor is available, then Bowtie 2 will align multiple reads in parallel (`-p` option), and the output order is non-deterministic. We do not use the `--reorder` flag to fix the output order for performance reasons.
  - If `samples.json` specifies multiple samples, and `merge_samples` is set to `true` in `config.yaml`, then the @RG records in the merged BAM file headers will have unique suffixes appended to their IDs. See the `samtools merge` documentation.
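A minimal illustration of the deduplication rule, assuming a hypothetical tab-delimited file `reads.tsv` with columns cluster barcode, chromosome, start, and end (this is not the pipeline's actual implementation, which deduplicates during cluster generation in step 7):

```bash
# Keep one representative line per (cluster barcode, chrom, start, end) tuple;
# identical tuples within a cluster are treated as PCR duplicates.
sort -t $'\t' -k1,1 -k2,2 -k3,3n -k4,4n -u reads.tsv > reads.dedup.tsv
```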
See also visualizations of the pipeline generated by Snakemake (commands below assume that Graphviz is installed):
- `images/dag.pdf`: `snakemake --dag | dot -Tpdf > images/dag.pdf`
- `images/filegraph.pdf`: `snakemake --filegraph | dot -Tpdf > images/filegraph.pdf`
- `images/rulegraph.pdf`: `snakemake --rulegraph | dot -Tpdf > images/rulegraph.pdf`
## Directory Structures

We will refer to 4 directories:
- **Working directory**: We follow the Snakemake documentation in using the term "working directory" to refer to either the directory in which you have invoked Snakemake or whatever was specified for `--directory` or the `workdir:` directive.
  - This is where Snakemake creates a `.snakemake` directory, within which it installs conda environments and keeps track of metadata regarding the pipeline (e.g., which steps have finished successfully, which steps failed, and which steps have yet to be run).
  - Rule commands are executed relative to this directory.
- **Workflow directory**: where the Snakefile and scripts reside
  - `envs/`
  - `scripts/`
  - `profiles/<profile_name>/config.yaml`: paths are interpreted relative to the working directory
  - `fastq2json.py`
  - `Snakefile`
- **Input directory**: where configuration and data files reside
  - `data/`
  - `config/`
    - `config.yaml`: paths are interpreted relative to the working directory
    - `config.txt` / `example_config.txt`
    - `format.txt` / `example_format.txt`
    - `samples.json` / `example_samples.json`: paths are interpreted relative to the working directory
  - `resources/`
  - `run_pipeline.sh`: the paths in the arguments `--snakefile <path to Snakefile>`, `--profile <path to profile>`, `--workflow-profile <path to workflow profile>`, or `--configfile <path to config.yaml>` are relative to where you run `run_pipeline.sh`
- **Output directory**: where to place this directory can be changed in `config.yaml` via the `output_dir` key
  - `alignments/`
  - `alignments_parts/`
  - `bigwigs/`
  - `clusters/`: cluster file and cluster statistics
  - `clusters_parts/`
  - `fastqs/`: adapter-trimmed reads with identified barcodes appended to the read name
    - `<sample>_R1.part_<###>.barcoded.fastq.gz`: read 1 of fully-barcoded templates
    - `<sample>_R1.part_<###>.unassigned.fastq.gz`: read 1 of templates without full barcodes (not matching `format.txt`)
    - `<sample>_R1.part_<###>.barcoded_bpm.fastq.gz`: read 1 of fully-barcoded templates with a BPM tag, corresponding to antibody ID oligonucleotides
    - `<sample>_R1.part_<###>.barcoded_dpm.fastq.gz`: read 1 of fully-barcoded templates with a DPM tag, corresponding to genomic DNA (ChIP-DIP)
    - `<sample>_R1.part_<###>.barcoded_rpm.fastq.gz`: read 1 of fully-barcoded templates with an RPM tag, corresponding to RNA (SPIDR)
    - `<sample>_R2.part_<###>.barcoded_rpm.fastq.gz`: read 2 of fully-barcoded templates with an RPM tag, corresponding to RNA (SPIDR)
  - `logs/`
  - `qc/`
  - `splitbams/`
  - `split_fastq/`: unprocessed reads split into chunks for parallel processing
  - `trimmed/`: adapter- and tag-trimmed reads
    - `<sample>_R<1|2>.part_<###>_val_<1|2>.fq.gz`: adapter-trimmed reads
    - `<sample>_R1.part_<###>.barcoded_rpm.RDtrim.fastq.gz`: read 1 of fully-barcoded templates with an RPM tag (corresponding to RNA) where the RPM tag has been trimmed and empty insert reads removed (SPIDR)
    - `<sample>_R2.part_<###>.barcoded_rpm.RDtrim.fastq.gz`: read 2 of fully-barcoded templates with an RPM tag (corresponding to RNA) where the RPM tag has been trimmed and empty insert reads removed (SPIDR)
    - `<sample>_R1.part_<###>.barcoded_dpm.RDtrim.fastq.gz`: read 1 of fully-barcoded templates with a DPM tag (corresponding to genomic DNA) where the DPM tag has been trimmed and DPM dimers removed (ChIP-DIP)
  - `barcode_identification_efficiency.txt`
  - `pipeline_counts.txt`
For reproducibility, we recommend keeping the workflow, input, and output directories together. In other words, the complete directory should look like this GitHub repository, with an extra output subdirectory (by default, `results/`) created upon running this pipeline. However, the workflow directory can also be kept separate and used repeatedly on different datasets.
- The working directory should stay with the input directory, so that the `.snakemake` folder containing the Snakemake pipeline metadata is paired with the configuration files.
- Update paths in `config.yaml`:
  - The `scripts_dir` key must point to the workflow scripts directory.
  - Assuming that the above directory structure is followed, the other paths can remain relative paths to configuration files and resource files in the input directory.
- Update `run_pipeline.sh`: add `--snakefile <path_to_Snakefile>` and `--workflow-profile <path_to_profile>` arguments to the Snakemake invocation to use the Snakefile and workflow profile from the workflow directory (see the sketch after this list).
- Run `run_pipeline.sh` from the input directory.
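A sketch of the modified invocation, assuming the workflow directory is kept at a hypothetical location `~/pipelines/chipdip/workflow`:

```bash
# Run from the input directory; the Snakefile and workflow profile are taken
# from the separately kept workflow directory.
./run_pipeline.sh \
    --snakefile ~/pipelines/chipdip/workflow/Snakefile \
    --workflow-profile ~/pipelines/chipdip/workflow/profiles/default
```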
## Input Files

Documentation for input files in this section is identical to that at `config/README.md`.

These files are located under `<input_directory>/config/`.
- `config.yaml`: Pipeline configuration - YAML file containing the processing settings and paths of required input files. Paths are specified relative to the working directory.
  - Required? Yes. Must be provided in the working directory, or specified via `--configfile <path_to_config.yaml>` when invoking Snakemake.
  - Required keys
    - `barcode_config`: path to barcode config file (e.g., `config.txt`)
    - `barcode_format`: path to barcode format file (e.g., `format.txt`)
    - `scripts_dir`: path to scripts folder in the workflow directory
    - `samples`: path to `samples.json` file
    - `bead_umi_length`: integer length of bead oligo UMIs
    - `num_tags`: number of split-and-pool barcode rounds (odd, even, term)
    - `experiment_type`: DNA or RNA workflow (ChIP-DIP or SPIDR)
  - Required keys for DNA workflow
    - `bowtie2_index_dna`: path to Bowtie 2 genome index
    - `cutadapt_dpm`: path to DPM sequences
  - Required keys for RNA workflow
    - `bowtie2_index_rna`: path to Bowtie 2 repeat genome index
    - `star_index_rna`: path to STAR transcriptome index
  - Optional keys: If these keys are omitted from `config.yaml` or set to `null`, then they will take on the default values indicated.
    - `output_dir` (default = `"results"`): path to create the output directory, within which all intermediate and output files are placed
    - `temp_dir` (default = `$TMPDIR` (if set) or `"/tmp"`): path to a temporary directory with sufficient free disk space, such as used by the `-T` option of GNU sort
    - `mask` (default = `null`): path to BED file of genomic regions to ignore, such as ENCODE blacklist regions; reads mapping to these regions are discarded. If `null`, no masking is performed.
    - `path_chrom_map` (default = `null`): path to chromosome name map file. If `null`, chromosome renaming and filtering are skipped, and the final BAM and/or bigWig files will use all chromosome names as-is from the Bowtie 2 or STAR index.
    - `num_chunks` (default = `2`): integer between 1 and 99 giving the number of chunks to split FASTQ files from each sample into for parallel processing
    - `generate_splitbams` (default = `false`): boolean value indicating whether to generate separate BAM files for each antibody target
    - `min_oligos` (default = `2`): integer giving the minimum count of deduplicated antibody oligo reads in a cluster for that cluster to be assigned to the corresponding antibody target; this criterion is intersected (AND) with the `proportion` and `max_size` criteria
    - `proportion` (default = `0.8`): float giving the minimum proportion of deduplicated antibody oligo reads in a cluster for that cluster to be assigned to the corresponding antibody target; this criterion is intersected (AND) with the `min_oligos` and `max_size` criteria
    - `max_size` (default = `10000`): integer giving the maximum count of deduplicated genomic DNA reads in a cluster for that cluster to be assigned to the corresponding antibody target; this criterion is intersected (AND) with the `min_oligos` and `proportion` criteria
    - `merge_samples` (default = `false`): boolean indicating whether to merge cluster files and target-specific BAM and bigWig files across samples
    - `binsize` (default = `false`): integer specifying bigWig bin size; set to `false` to skip bigWig generation. Only relevant if `generate_splitbams` is `true`.
    - `bigwig_normalization` (default = `"None"`): normalization strategy for calculating coverage from reads; passed to the `--normalizeUsing` argument of the `bamCoverage` command from the deepTools suite. As of version 3.5.2, deepTools `bamCoverage` supports `RPKM`, `CPM`, `BPM`, `RPGC`, or `None`. Only relevant if bigWig generation is requested (i.e., `generate_splitbams` is `true` and `binsize` is not `false`).
    - `email` (default = `null`): email to send error notifications to if errors are encountered during the pipeline. If `null`, no emails are sent.
  - Additional notes
    - `null` values can be specified explicitly (e.g., `format: null`) or implicitly (e.g., `format:`).
    - For keys `mask`, `path_chrom_map`, and `email`, an empty string `""` is treated identically to if the value is `null`.
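  For orientation, a minimal hypothetical `config.yaml` for a DNA (ChIP-DIP) run might look like the following (written here as a shell heredoc; all paths and values are placeholders, and omitted optional keys take the defaults above):

  ```bash
  cat > config/config.yaml <<'EOF'
  # Required keys
  barcode_config: config/config.txt
  barcode_format: config/format.txt
  scripts_dir: workflow/scripts
  samples: config/samples.json
  bead_umi_length: 8        # placeholder; set to your bead oligo UMI length
  num_tags: 6               # placeholder; number of split-and-pool rounds
  experiment_type: DNA      # DNA (ChIP-DIP) or RNA (SPIDR)

  # Required keys for the DNA workflow
  bowtie2_index_dna: resources/index_mm10/mm10
  cutadapt_dpm: resources/dpm96.fasta

  # Optional keys (all others take their default values)
  mask: resources/blacklist_mm10.bed
  path_chrom_map: config/chrom-map.txt
  generate_splitbams: true
  binsize: 50
  EOF
  ```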
- `samples.json`: Samples file - JSON file with the paths of FASTQ files (read1, read2) to process.
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `samples`
  - This can be prepared using `fastq2json.py --fastq_dir <path_to_directory_of_FASTQs>` or manually formatted as follows:

    ```json
    {
      "sample1": {
        "R1": [
          "<path_to_data>/sample1_run1_R1.fastq.gz",
          "<path_to_data>/sample1_run2_R1.fastq.gz"
        ],
        "R2": [
          "<path_to_data>/sample1_run1_R2.fastq.gz",
          "<path_to_data>/sample1_run2_R2.fastq.gz"
        ]
      },
      "sample2": {
        "R1": ["<path_to_data>/sample2_R1.fastq.gz"],
        "R2": ["<path_to_data>/sample2_R2.fastq.gz"]
      },
      ...
    }
    ```

  - Data assumptions:
    - FASTQ files are gzip-compressed.
    - Read names do not contain two consecutive colons (`::`). This is required because the pipeline adds `::` to the end of read names before adding barcode information; the string `::` is used as a delimiter in the pipeline to separate the original read name from the identified barcode.
  - If there are multiple FASTQ files per read orientation per sample (as shown for `sample1` in the example above), the pipeline will concatenate them and process them together as the same sample.
  - Each sample is processed independently, generating independent cluster and BAM files. Statistics used for quality assessment (barcode identification efficiency, cluster statistics, MultiQC report, cluster size distributions, splitbam statistics) are computed independently for each sample but reported together in aggregate files to enable quick quality comparison across samples.
  - The provided sample read files under the `data/` folder were simulated via a Google Colab notebook. The genomic DNA reads correspond to ChIP-seq peaks on chromosome 19 (mm10) for transcription factors MYC (simulated as corresponding to Antibody ID `BEAD_AB1-A1`) and TCF12 (simulated as corresponding to Antibody ID `BEAD_AB2-A2`).
  - Sample names (the keys of the samples JSON file) cannot contain any periods (`.`). This is enforced to simplify wildcard pattern matching in the Snakefile and to simplify implementation of `workflow/scripts/python/threshold_tag_and_split.py:label_bam_file()`.
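  For example, to generate `samples.json` from the example data shipped with the repository (the script location follows the workflow-directory listing above; adjust both paths to your layout):

  ```bash
  python workflow/fastq2json.py --fastq_dir data/
  ```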
- `config.txt`: Barcode config file - text file containing the sequences of split-pool tags and search parameters for tag finding
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `barcode_config`
  - Used by: splitcode software (Snakefile `rule barcode_id`), `scripts/python/fastq_to_bam.py` (Snakefile `rule fastq_to_bam`), and `scripts/python/barcode_identification_efficiency.py` (Snakefile `rule barcode_identification_efficiency`)
    - This file is also parsed in the Snakefile itself if `generate_splitbams` is set to `true` in `config.yaml`, to determine the set of antibody targets for which to generate individual demultiplexed BAM files (and bigWig files too, if requested).
  - Format: See https://splitcode.readthedocs.io/en/latest/example.html#config-file for a complete description of parameter options.
  - An example barcoding configuration file is annotated below:
    ```
    #Header Categories
    #groups: Each tag belongs to a specific group (e.g., BPM, DPM, RPM, R1_ODD, R2_EVEN, R3_ODD, R4_EVEN, R5_ODD, Y (a.k.a. TERM), UNUSED_EVEN, UNUSED_ODD)
    #id: The specific name of each tag (e.g., BEAD_A1, ROUND1_A1)
    #tags: The sequence of the tag
    #distances: Error tolerance of sequence matching. 0 indicates no mismatches, insertions, or deletions. We typically do not allow for mismatches for DPM, BPM, and Y. 2:1:2 indicates 2 mismatches, 1 insertion/deletion, and 2 total errors allowed. EVEN and ODD barcodes are designed to have a Hamming distance of 2 (2 mismatches).
    #locations: Read and position of tag, formatted as read:start:stop. For example, 0:0:9 indicates read 1 within positions 0-9 of the read (first 9 bp).
    #maxFindsG: Maximum number of times a specific group can be found (e.g., for a value of 1, once a BPM tag is found, no other ids from the BPM group are searched for)
    #next: Group and location to search for after the current id is found. A value of - indicates no additional tags are expected (e.g., following DPM or BPM on read 1). For read 2, the sequence of split-and-pool tags is indicated through next. The structure of read 2 after 6 rounds of barcoding is [Y][R5_ODD][R4_EVEN][R3_ODD][R2_EVEN][R1_ODD]. Accordingly, the next for Y is R5_ODD, the next for R5_ODD is R4_EVEN, etc. The group to search for is surrounded by two curly braces, and the position to search is indicated by the following number (e.g., {{R5_ODD}}6 indicates search for the R5_ODD group at least 6 bp after the current tag). For split-and-pool tags, we assume at least 6 bp between tags to account for sticky ends.
    #left (optional): Trim the identified sequence and everything to its left

    groups  id  tags  distances  locations  maxFindsG  next  left

    # DPM tags
    # 1. Group: DPM
    # 2. Id: must contain "DPM", such as "DPM<xxx>"; must NOT contain "BEAD"
    #    - Can only contain alphanumeric characters, underscores, and hyphens,
    #      i.e., must match the regular expression "[a-zA-Z0-9_\-]+"
    # 3. Tag sequence: (see resources/dpm96.fasta)
    # 4. Distances (error tolerance): acceptable Hamming distance between DPM tags (usually 0)
    # 5. Locations: first 8 bp of read 1
    # 6. MaxFindsG: 1, single DPM per read
    # 7. Next: - (no following tags on read 1)
    # 8. Left: 1 (trim DPM from read 1)
    DPM DPMBot6_1-A1 TGGGTGTT 0 0:0:8 1 - 1
    DPM DPMBot6_2-A2 TGACATGT 0 0:0:8 1 - 1
    ...

    # RPM tags
    # 1. Group: RPM
    # 2. Id: must contain "RPM", such as "RPM<xxx>"; must NOT contain "BEAD"
    #    - Can only contain alphanumeric characters, underscores, and hyphens
    # 3. Tag sequence: CTGACGCTAAGTGCTGAT
    # 4. Distances (error tolerance): insertions/deletions:mismatch:total (2:1:2)
    # 5. Locations: end of read 2 (for 6 rounds of full barcoding, after position 90 at a minimum)
    # 6. MaxFindsG: 1, single RPM per read
    # 7. Next: - (no following tags on read 2)
    # 8. Left: 0 (not trimmed; trimming occurs during cutadapt)
    RPM RPM CTGACGCTAAGTGCTGAT 2:1:2 1:90 1 - -
    ...

    # Antibody ID oligonucleotide tags
    # 1. Group: BPM
    # 2. Id: must start with "BEAD_"
    # 3. Tag sequences: see example config
    # 4. Distances: acceptable Hamming distance between BPM tags (usually 0)
    # 5. Locations: first 9 bp of read 1 (ChIP-DIP), positions 10-19 of read 1 (SPIDR)
    # 6. MaxFindsG: 1, single BPM per read
    # 7. Next: - (no following tags on read 1)
    # 8. Left: 1 (trim BPM from read 1)
    BPM BEAD_AB1-A1 GGAACAGTT 0 0:0:9 1 - 1
    BPM BEAD_AB2-A2 CGCCGAATT 0 0:0:9 1 - 1
    ...

    # Split-pool even/odd tags
    # 1. Group: R1_ODD, R2_EVEN, R3_ODD, R4_EVEN, R5_ODD
    # 2. Id: ROUND#<xxx>; must not contain "BEAD"; # of ROUND should match R# of group
    # 3. Tag sequences: see example config
    # 4. Distances: insertions/deletions:mismatch:total (2:1:2); Hamming distance between EVEN/ODD tags is 2
    # 5. Locations: read 2, staggered with at least 15 bp added each round (in reality, ODD/EVEN length is 24)
    # 6. MaxFindsG: 1, single tag for each round per read
    # 7. Next: {{previous round}}6; the spacer of 6 accounts for sticky ends
    # 8. Left: -
    R5_ODD ROUND5_F12 AGACTCAATTAGGCT 2:1:2 1:15 1 {{R4_EVEN}}6 -
    R3_ODD ROUND3_D12 GCTGACATAAGACCT 2:1:2 1:45 1 {{R2_EVEN}}6 -
    R1_ODD ROUND1_B11 CCTTCCTATGCTACT 2:1:2 1:75 1 {{RPM}}6 -
    ...
    R4_EVEN ROUND4_D12 ACTGTTCGACACGTC 2:1:2 1:30 1 {{R3_ODD}}6 -
    R2_EVEN ROUND2_B12 GCACACGATCATCTG 2:1:2 1:60 1 {{R1_ODD}}6 -

    # Split-pool term tags
    # 1. Group: Y
    # 2. Id: NYStgBot<xxx>; must not contain "BEAD"
    # 3. Tag sequences: see example config
    # 4. Distances: acceptable Hamming distance between Y tags (usually 0)
    # 5. Locations: start of read 2 (1:0:15)
    # 6. MaxFindsG: 1, single term tag per read
    # 7. Next: {{previous round}}6; the spacer of 6 accounts for sticky ends
    # 8. Left: -
    Y NYStgBot_1-A1 TATTATGGT 0 1:0:15 1 {{R5_ODD}}6 -
    ```
  - Notes regarding the entries in `example_config.txt`
    - Names: Each name ends with `#-Well` (for example, `4-A4`), where the `#` gives the row-major index of the tag in a 96-well plate, and `Well` denotes the corresponding row and column.
    - Sequences
      - The design of the DPM tags allows for 9 bp of unique sequence, but only 8 bp are used in the published SPRITE tag set (in bottom tags, the 9th bp is currently a constant `T`). `example_config.txt` therefore only includes the unique 8 bp sequences.
      - The design of the EVEN and ODD tags allows for 17 bp of unique sequence, but only 16 bp are used in the published SPRITE tag set (in bottom tags, the 17th bp is currently a constant `T`). `example_config.txt` further trims the 1st unique bp from the bottom tag, leaving only the middle 15 bp of unique bottom sequence.
      - The design of Y (terminal) tags allows for 9-12 bp of unique sequence.
- `format.txt`: Barcode format file - sequence of tag groups that form a complete barcode, one sequence per line
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `barcode_format`
  - Used by: splitcode software

    ```
    DPM,Y,R5_ODD,R4_EVEN,R3_ODD,R2_EVEN,R1_ODD
    BPM,Y,R5_ODD,R4_EVEN,R3_ODD,R2_EVEN,R1_ODD
    Y,R5_ODD,R4_EVEN,R3_ODD,R2_EVEN,R1_ODD,RPM
    ```
- `chrom_map.txt`: Chromosome name map - tab-delimited text file specifying which chromosomes from the Bowtie 2 index to keep and how to rename them (if at all)
  - Required? No, but necessary if using a blacklist mask that uses different chromosome names than used in the Bowtie 2 index.
  - `config.yaml` key to specify the path to this file: `path_chrom_map`
  - Used by: `scripts/python/rename_and_filter_chr.py` (Snakefile `rule rename_and_filter_chr`, `rule merge_mask`, and `rule effective_genome_size`)
  - Column 1 specifies chromosomes (following the naming convention used in the index) to keep.
    - The order of chromosomes provided here is maintained in the SAM/BAM file header, and consequently specifies the coordinate sorting order at the reference sequence level.
  - Column 2 specifies new chromosome names for the corresponding chromosomes in column 1.
  - The provided `chrom-map.txt` in this repository contains examples for retaining only canonical human or mouse chromosomes (i.e., excluding alternate loci, unlocalized sequences, and unplaced sequences) and renaming them to UCSC chromosome names (i.e., `chr1`, `chr2`, ..., `chrX`, `chrY`, `chrM`) as needed. The header of the provided file also includes more detailed documentation about the specific format requirements, such as allowed characters.
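  For reference, a short hypothetical fragment mapping Ensembl-style names (as used in some indices) to UCSC names; columns are tab-delimited:

  ```
  1	chr1
  2	chr2
  X	chrX
  MT	chrM
  ```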
These files are located under `<input_directory>/resources/`.
- `dpm96.fasta`: FASTA file containing the sequences of DPM tags
  - Required? Yes.
  - `config.yaml` key to specify the path to this file: `cutadapt_dpm`
  - Used by: `cutadapt` (Snakefile `rule cutadapt_dpm`)
  - Each of these sequences is 10 nt long, consisting of a unique 9 nt DPM_Bottom sequence as originally designed for SPRITE (technically, only the first 8 nt are unique, and the 9th nt is always a `T`), plus a `T` that is complementary to the 3' `A` added to a genomic DNA sequence via dA-tailing.
- `blacklist_hg38.bed`, `blacklist_mm10.bed`: blacklisted genomic regions for ChIP-seq data
  - Required? No, but highly recommended.
  - `config.yaml` key to specify the path to this file: `mask`
  - Used by: Snakefile `rule merge_mask`, whose output is used by `rule repeat_mask` and `rule effective_genome_size`
  - For human genome release hg38, we use ENCFF356LFX from ENCODE. For mouse genome release mm10, we use mm10-blacklist.v2.bed.gz. These BED files use UCSC chromosome names (e.g., `chr1`, `chr2`, ...). The pipeline performs chromosome name remapping (if specified) before this step.
    - Reference paper: Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019;9(1):9354. doi:10.1038/s41598-019-45839-z
    - Example code used to download them into the `resources/` directory:

      ```bash
      wget -O - https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz | zcat | sort -V -k1,3 > "resources/blacklist_hg38.bed"
      wget -O - https://github.com/Boyle-Lab/Blacklist/raw/master/lists/mm10-blacklist.v2.bed.gz | zcat | sort -V -k1,3 > "resources/blacklist_mm10.bed"
      ```
- `index_mm10/*.bt2`, `index_hg38/*.bt2`: Bowtie 2 genome index
  - Required? Yes, for DNA/ChIP-DIP.
  - `config.yaml` key to specify the path to the index: `bowtie2_index_dna`
  - Used by: Snakefile `rule bowtie2_align_dna`
  - If you do not have an existing Bowtie 2 index, you can download pre-built indices from the Bowtie 2 developers:

    ```bash
    # for human primary assembly hg38
    mkdir -p resources/index_hg38
    wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
    unzip -j -d resources/index_hg38 GRCh38_noalt_as.zip \*.bt2

    # for mouse primary assembly mm10
    mkdir -p resources/index_mm10
    wget https://genome-idx.s3.amazonaws.com/bt/mm10.zip
    unzip -j -d resources/index_mm10 mm10.zip \*.bt2
    ```

    This will create a set of files under `resources/index_hg38` or `resources/index_mm10`. If we want to use the `mm10` genome assembly, for example, the code above will populate `resources/index_mm10` with the following files: `mm10.1.bt2`, `mm10.2.bt2`, `mm10.3.bt2`, `mm10.4.bt2`, `mm10.rev.1.bt2`, `mm10.rev.2.bt2`. The path prefix to this index (as accepted by the `bowtie2 -x <bt2-idx>` argument) is therefore `resources/index_mm10/mm10`, which is set in the configuration file, `config.yaml`.

    Note that the pre-built indices linked above use UCSC chromosome names (`chr1`, `chr2`, ..., `chrX`, `chrY`, `chrM`). If your alignment indices use different chromosome names (e.g., Ensembl chromosome names are `1`, `2`, ..., `X`, `Y`, `MT`), update `chrom-map.txt` so that chromosome names in BAM files are converted to UCSC chromosome names. You can check the names of the reference sequences used to build the index with the command `bowtie2-inspect -n <bt2-idx>`.
- `index_ncRNA/*.bt2`: Bowtie 2 repeat transcriptome index
  - Required? Yes, for RNA/SPIDR.
  - `config.yaml` key to specify the path to the index: `bowtie2_index_rna`
  - Used by: Snakefile `rule bowtie2_align_rna`
  - Custom built from a FASTA file containing sequences of repeat RNAs (see the sketch after this list for one way to build such an index)
- `index_star_mm10/`, `index_star_hg38/`: STAR genome index
  - Required? Yes, for RNA/SPIDR.
  - `config.yaml` key to specify the path to the index: `star_index_rna`
  - Used by: Snakefile `rule star_align`
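One way to build the repeat transcriptome index mentioned above, assuming a hypothetical FASTA file `repeat_rnas.fasta` of repeat RNA sequences (`bowtie2-build` ships with Bowtie 2):

```bash
mkdir -p resources/index_ncRNA
# Builds index files with path prefix resources/index_ncRNA/ncRNA
bowtie2-build repeat_rnas.fasta resources/index_ncRNA/ncRNA
```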
## Workflow Profile Configuration

These files are located under `<workflow_directory>/profiles/`.
- `<profile>/config[.vX+].yaml`: Workflow profile config file
  - Required? Yes, if running on a compute cluster, such as a SLURM environment.
  - The path to this file is specified using the `--workflow-profile` argument to the `snakemake` program, i.e., in the `run_pipeline.sh` script.
  - The optional `.vX+` part of the filename denotes the minimum supported Snakemake major version `X`.
  - This file specifies workflow-default command line options for Snakemake as well as the resources available to each rule.
    - Paths are interpreted relative to the working directory. Global environment variables such as `$USER` can be used (see the Snakemake documentation).
    - Values for resource specifications can be set dynamically, including via Python code like `int(f"{threads}")`.
  - Two profiles are provided with this workflow. Key options are described below (an illustrative example follows this list).
    - `default`: for local execution
      - `cores`: maximum number of CPU cores to use in parallel
    - `slurm`: for execution on a SLURM server
      - `jobs`: maximum number of jobs to submit at a time
      - `latency-wait`: number of seconds to wait for output files of a job to be visible on the file system after the job finishes, to account for file system latency. This is particularly relevant for distributed computing environments (e.g., on an HPC cluster) with shared/networked file systems.
      - `default-resources`
        - Other potentially useful SLURM options that are not currently specified in the profile config (see the SLURM executor plugin documentation):
          - `slurm_partition: <string>`
          - `slurm_account: <string>`
      - `slurm-logdir`: directory to save SLURM log files; used by the SLURM executor plugin
        - SLURM standard output and standard error log files are combined into a single log file output to `<slurm-logdir>/<rule>/<wildcards>/<jobid>.log`, where `<wildcards>` is an underscore-delimited string of wildcards.
      - `slurm-delete-logfiles-older-than`: SLURM log files will be deleted after this many days; used by the SLURM executor plugin
      - Other potentially useful Snakemake options that are not currently specified in the profile config:
        - `max-threads: <int>`: the maximum number of threads available to any individual rule
        - `cores: <int>`: the maximum number of total cores to request from the cluster scheduler
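As an illustration only (the shipped profiles may differ), a minimal SLURM workflow profile using some of the options above might look like the following, written as a shell heredoc; all values are placeholders, and `config.v8+.yaml` assumes Snakemake major version 8 or later:

```bash
cat > workflow/profiles/slurm/config.v8+.yaml <<'EOF'
executor: slurm
jobs: 100
latency-wait: 60
slurm-logdir: logs/slurm
default-resources:
  slurm_partition: main   # placeholder partition name
  mem_mb: 4000
  runtime: 120            # minutes
EOF
```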
## Output Files

These files are generated in the output directory, which is specified by the `output_dir` key in the pipeline configuration file.
- Barcode identification efficiency (`barcode_identification_efficiency.txt`): A statistical summary of how many tags were found per read and the proportion of reads with a matching tag at each position.
- Pipeline counts (`pipeline_counts.txt`): A tree-like summary of how many reads remained at each step of the pipeline, produced per aliquot and in aggregate. This can be used to quickly view the proportion of reads corresponding to antibody oligos versus genomic DNA reads versus RNA reads, the proportion of properly barcoded reads, etc.
  - The 4 columns of numbers are as follows:
    1. the number of reads remaining after that step
    2. the proportion of reads remaining relative to the immediately previous step of the pipeline
    3. the proportion of reads remaining relative to the read type - antibody ID oligo (`bpm`), genomic DNA (`dpm`), or RNA (`rpm`)
    4. the proportion of reads remaining relative to the starting number of reads
  - A tabular version is saved to `qc/pipeline_counts.csv`.
- Cluster file (`clusters/<sample>.clusters`): Tab-delimited file, where each line represents a single cluster.
  - The first column is the cluster barcode.
  - The remainder of the line is a list of reads. DNA reads are formatted as `DPM[strand]_chr:start-end`, RNA reads are formatted as `RPM[strand]_chr:start-end` or `RPM[strand]_repeat:start-end`, and antibody oligo reads are formatted as `BPM[]_<AntibodyID>:<UMI>-0`.
- Cluster statistics (`clusters/cluster_statistics.txt`): The number of clusters and antibody oligo (BPM) or genomic DNA (DPM) reads per library.
- Cluster size distribution (`clusters/[BPM,DPM,RPM]_cluster_distribution.pdf`): The proportion of clusters that belong to each size category.
- Cluster size read distribution (`clusters/[BPM,DPM,RPM]_read_distribution.pdf`): The proportion of reads that belong to clusters of each size category.
  - This can be more useful than the number of clusters, since relatively few large clusters can contain many sequencing reads (i.e., a large fraction of the library), while many small clusters will contain few sequencing reads (i.e., a much smaller fraction of the library).
- Maximum representation oligo ECDF (`clusters/Max_representation_ecdf.pdf`): A plot showing the distribution of the proportion of antibody oligo (BPM) reads in each cluster that belong to the maximally represented Antibody ID in that cluster.
  - A successful experiment should have an ECDF close to a right angle. Deviations from this indicate that beads contain mixtures of Antibody IDs. Understanding the uniqueness of Antibody ID reads per cluster is important for choosing the thresholding parameter `proportion` for cluster assignment.
- Maximum representation oligo counts ECDF (`clusters/Max_representation_counts.pdf`): A plot showing the distribution of the number of antibody oligo (BPM) reads in each cluster that belong to the maximally represented Antibody ID in that cluster.
  - If clusters are nearly unique in Antibody ID composition, then this plot is a surrogate for the BPM size distribution. Understanding the number of Antibody ID reads per cluster is important for choosing the thresholding parameter `min_oligos` for cluster assignment.
- BAM files for individual antibodies (`splitbams/*.bam`)
  - Thresholding criteria (`min_oligos`, `proportion`, `max_size`) for assigning individual clusters to individual antibodies are set in `config.yaml`.
  - The "none" BAM file (`<sample>.[DNA,RNA].merged.labeled_none.bam`) contains DNA or RNA reads from clusters without antibody oligo reads.
  - The "ambiguous" BAM file (`<sample>.[DNA,RNA].merged.labeled_ambiguous.bam`) contains DNA or RNA reads from clusters that failed the `proportion` thresholding criterion.
  - The "uncertain" BAM file (`<sample>.[DNA,RNA].merged.labeled_uncertain.bam`) contains DNA or RNA reads from clusters that failed the `min_oligos` thresholding criterion.
  - The "filtered" BAM file (`<sample>.[DNA,RNA].merged.labeled_filtered.bam`) contains DNA or RNA reads from clusters that failed the `max_size` thresholding criterion.
- Read count summary for individual antibodies (`splitbams/splitbam_statistics.txt`): The number of reads contained within each individual BAM file assigned to individual antibodies.
- BigWig files for individual antibodies (`bigwigs/*.bw`)
  - BigWigs are generated using deepTools `bamCoverage` with the bin size set in `config.yaml` (a hedged example follows this list).
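A hedged approximation of the bigWig step for one antibody-specific BAM file (the actual rule's parameters may differ, and the BAM filename here is hypothetical):

```bash
bamCoverage \
    --bam splitbams/sample1.DNA.merged.labeled_BEAD_AB1-A1.bam \
    --outFileName bigwigs/sample1_BEAD_AB1-A1.bw \
    --binSize 50 \
    --normalizeUsing CPM
```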
## Additional Resources

Calculator for uniqueness of bead barcodes: Google Colab notebook (also available as a GitHub Gist). This calculator can be used to calculate one of two values:
- Given the number of beads and the number of possible unique barcodes, calculate the proportion of beads expected to be uniquely barcoded.
- Given the number of beads and a desired proportion of beads to be uniquely barcoded, calculate the number of barcodes required.
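The first calculation follows a standard occupancy argument: with `n` beads and `k` equally likely barcode permutations, a given bead is uniquely barcoded with probability (1 - 1/k)^(n-1). A quick command-line approximation of this (the notebook itself may use a different model):

```bash
# n beads, k possible barcode permutations:
# expected fraction of uniquely barcoded beads
n=1000000 k=100000000
awk -v n="$n" -v k="$k" 'BEGIN { printf "%.4f\n", exp((n - 1) * log(1 - 1/k)) }'
```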
## Credits

### Publications

- Perez AA\*, Goronzy IN\*, Blanco MR, Yeh BT, Guo JK, Lopes CS, Ettlin O, Burr A, Guttman M. ChIP-DIP maps binding of hundreds of proteins to DNA simultaneously and identifies diverse gene regulatory elements. Nat Genet. 2024 Dec;56(12):2827-2841. doi:10.1038/s41588-024-02000-5
- Wolin E\*, Guo JK\*, Blanco MR\*, Goronzy IN\*\*, Gorhe D\*\*, Dong W\*\*, Perez AA, Keskin A, Valenzuela E, Abdou AA, Urbinati CR, Kaufhold R, Rube HT, Brito Querido J, Guttman M, Jovanovic M. SPIDR enables multiplexed mapping of RNA-protein interactions and uncovers a mechanism for selective translational suppression upon cell stress. Cell. 2025 Sep 18;188(19):5384-5402.e25. doi:10.1016/j.cell.2025.06.042
### Developers
- Isabel Goronzy (@igoronzy): Designed pipeline. Merged DNA (chipdip) and RNA (spidr) pipelines into unified pipeline.
- Jimmy Guo: Wrote RNA processing steps.
- Benjamin Yeh (@bentyeh): Streamlined and documented the pipeline.
### Other contributors
- Andrew Perez (@HeyDrew64)
- Mario Blanco (@mrblanco)
- Mitchell Guttman (@mitchguttman)