bead-barcoding-pipeline

Contents

Overview
Directory Structures
Input Files
- Configuration Files
- Resource Files
Workflow Profile Configuration
Output Files
Additional Resources
Credits

Overview

Quick Start

This pipeline assumes an existing conda installation and is written as a Snakemake workflow. To install Snakemake with conda, run

conda env create -f workflow/envs/snakemake.yaml
conda activate snakemake

to create and activate a conda environment named snakemake. Once all the input files are ready, run the pipeline locally with

# by default, uses the workflow profile workflow/profiles/default
./run_pipeline.sh

or on SLURM cluster with

./run_pipeline.sh --workflow-profile workflow/profiles/slurm

Other common usage notes

run_pipeline.sh passes any additional arguments to snakemake.
- To perform a dry-run: ./run_pipeline.sh --dry-run
- To force (re)execution of all rules regardless of past output: ./run_pipeline.sh --forceall
- To remove intermediate files generated by the pipeline: ./run_pipeline.sh clean

System Requirements

Hardware: This pipeline was developed to be run on a high performance computing (HPC) cluster, but it can also be run locally on a personal computer.

Operating system: Linux, macOS

Lack of Windows support is due to our use of the Bioconda channel for creating the conda environments described in workflow/envs/. Bioconda currently does not support Windows. However, we have successfully run this pipeline on Windows Subsystem for Linux (WSL).
While Bioconda supports macOS and has built an extensive repository of packages for x86_64-based Macs (i.e., with Intel processors; conda platform osx-64), only a limited number of packages (or versions of packages) have been built for new ARM-based Macs (i.e., with M-series Apple Silicon processors; conda platform osx-arm64). Consequently, the versions specified by the environment YAML files described in workflow/envs/ may not be available on Bioconda for the osx-arm64 platform.
- To run this pipeline on an Apple Silicon Mac, we recommend using conda to create an environment targeting the osx-64 platform, and executing the pipeline programs under Rosetta translation:
  - While macOS should automatically prompt to install Rosetta if needed, it can also be manually installed using the terminal command /usr/sbin/softwareupdate --install-rosetta.

Code interpreters and runtimes: The pipeline relies on scripts written in Bash and Python and has been validated using the following versions:

Bash: 4.2 through 5.2
Python: 3.9+ (the workflow/envs/ conda environments file currently use version 3.10)

Packages: Additional required third-party programs and packages are specified in conda environments described in workflow/envs/.

Note: Other recent versions of the same software programs will likely work, but we have not tested all of them. Some specific requirements are discussed below.
- The version of deeptools used (3.5.2) requires a matplotlib version between 3.1.0 and 3.7.5, since later versions of matplotlib deprecate some functions used by deeptools version 3.5.2. Newer versions of deeptools have been updated to support the newer matplotlib APIs.
- NumPy version 2.x is not currently supported.

Pipeline

Terms

split-pool tag: the individual sequences that are added during a single round of split-pool barcoding (DPM, EVEN, ODD, TERM)
split-pool barcode: the permutation of split-pool tags that uniquely identifies a cluster
Antibody ID: a 9 nt sequence within the antibody oligo that uniquely identifies a type of antibody. In the pipeline, these are identified as a "BPM" tag.
cluster: a set of reads sharing the same split-pool barcode. Antibody oligo reads in the cluster are used to assign genomic DNA reads to specific antibodies.

Workflow

Define samples and paths to FASTQ files (fastq2json.py or manually generate samples.json)
Split FASTQ files into chunks for parallel processing
Adaptor trimming (Trim Galore!)
Barcode identification (Splitcode)
Split DNA (DPM) and antibody ID oligo (BPM) reads [ChIP-DIP] or RNA (RPM) and antibody ID oligo (BPM) [SPIDR] into separate files
(ChIP-DIP) DNA read workflow:
1. DPM trimming (Cutadapt)
2. Genomic alignment (Bowtie 2)
3. Chromosome renaming (e.g., to UCSC chromosome names) and filtering (e.g., remove non-canonical chromosomes)
4. Masking (bedtools; use ENCODE blacklists)
(SPIDR) RNA read workflow:
RPM triming (Cutadapt)
Repeat alignment (Bowtie 2)
Bam to fastq of unaligned reads (Samtools)
Genomic alignment of unaligned reads (STAR)
Chromosome renaming (e.g., to UCSC chromosome names) and filtering (e.g., remove non-canoical chromosomes)
Antibody oligo read workflow: 2. FASTQ to BAM conversion
Cluster generation
Cluster assignment and antibody specific BAM file creation
Antibody-specific bigWig file creation (deepTools bamCoverage)
Summary plots
1. Cluster size distributions
2. Maximum representation oligo ECDFs
Summary statistics
1. MultiQC (trimming, alignments)
2. Barcode identification efficiency
3. Cluster statistics

Notes / FAQ

How are antibody oligo PCR duplicates removed? In step 6.ii above (fastq_to_bam.py), oligo UMIs are converted to integers and used as the reference position (column 4) in the BAM file. When clusters are generated in step 7 (during both generation of the split clusters and merged clusters), reads mapped to the same "position" (e.g., with the same UMI) are deduplicated.
How are DNA or RNA sequence PCR duplicates removed? We assume that different DNA or RNA molecules within a single cluster are unlikely to have the exact same start and end positions. Therefore, we consider identical fragments (same start and end positions) in the same cluster to be PCR duplicates. When clusters are generated in step 7 (during both generation of the split clusters and merged clusters), reads mapped to the same position are deduplicated. Note that for RNA, we are unable to distinguish if reads that align to the same RNA transcript at different positions are fragments from the same or from different RNA transcripts.
The pipeline is non-deterministic, but the output should be functionally equivalent between runs. Specifically, the non-determinism affects the order in which reads are sorted in BAM files (even after running samtools sort), and which representative read from a set of PCR duplicates is kept, none of which should affect downstream analyses (e.g., bigWig generation, peak calling, etc.). Non-determinism arises from the following causes of reads being processed in different orders:
- If num_chunks is set to greater than 1 in config.yaml, then the order in which chunks of reads are processed can change between pipeline runs.
- If more than 1 processor is available, then Bowtie 2 will align multiple reads in parallel (-p option), and the output order is non-deterministic. We do not use the --reorder flag to fix the output order for performance reasons.
- If samples.json specifies multiple samples, and merge_sample is set to true in config.yaml, then the @RG records in the merged BAM file headers will have unique suffixes appended to their IDs. See samtools merge documentation.

See also visualizations of the pipeline generated by Snakemake (commands below assume that Graphviz is installed):

images/dag.pdf: snakemake --dag | dot -Tpdf > images/dag.pdf
images/filegraph.pdf: snakemake --filegraph | dot -Tpdf > images/filegraph.pdf
images/rulegraph.pdf: snakemake --rulegraph | dot -Tpdf > images/rulegraph.pdf

Directory Structures

We will refer to 4 directories:

Working directory: We follow the Snakemake documentation in using the term "working directory" to refer to

either the directory in which you have invoked Snakemake or whatever was specified for --directory or the workdir: directive
- This is where Snakemake creates a .snakemake directory within which it installs conda environments and keeps track of metadata regarding the pipeline (e.g., which steps have finished successfully, which steps failed, and which steps have yet to be run).
- Rule commands are executed relative to this directory.
Workflow directory: where the Snakefile and scripts reside
- envs/
- scripts/
- profiles/
  - <profile_name>/
    - config.yaml: paths are interpreted relative to the working directory.
- fastq2json.py
- Snakefile
Input directory: where configuration and data files reside
- data/
- config/
  - config.yaml: paths are interpreted relative to the working directory
  - config.txt / example_config.txt
  - format.txt / example_format.txt
  - samples.json / example_samples.json: paths are interpreted relative to the working directory
- resources/
- run_pipeline.sh: the paths in the arguments --snakefile <path to Snakefile>, --profile <path to profile>, --workflow-profile <path to workflow profile>, or --configfile <path to config.yaml> are relative to where you run run_pipeline.sh
Output directory: where to place this directory can be changed in config.yaml via the output_dir key.
- alignments/
- alignments_parts/
- bigwigs/
- clusters/: cluster file and cluster statistics
- clusters_parts/
- fastqs/: adapter-trimmed reads with identified barcodes appended to the read name
  - <sample>_R1.part_<###>.barcoded.fastq.gz: read 1 of fully-barcoded templates
  - <sample>_R1.part_<###>.unassigned.fastq.gz: read 1 of templates without full barcodes (not matching format.txt)
  - <sample>_R1.part_<###>.barcoded_bpm.fastq.gz: read 1 of fully-barcoded templates with a BPM tag, corresponding to antibody ID oligonucleotides
  - <sample>_R1.part_<###>.barcoded_dpm.fastq.gz: read 1 of fully-barcoded templates with a DPM tag, corresponding to genomic DNA (CHIP-DIP)
  - <sample>_R1.part_<###>.barcoded_rpm.fastq.gz: read 1 of fully-bacroded templates with a RPM tag, corresponding to RNA (SPIDR)
  - <sample>_R2.part_<###>.barcoded_rpm.fastq.gz: read 2 of fully-barcoded templates with a RPM tag, corresponding to RNA (SPIDR)
- logs/
- qc/
- splitbams/
- split_fastq/: unprocessed reads split into chunks for parallel processing
- trimmed/: adapter- and tag-trimmed reads
  - <sample>_R<1|2>.part_<###>_val_<1|2>.fq.gz: adapter-trimmed reads
  - <sample>_R1.part_<###>.barcoded_rpm.RDtrim.fastq.gz: read 1 of fully-barcoded templates with a RPM tag (corresponding to RNA) where the RPM tag has been trimmed amd empty insert reads removed (SPIDR)
  - <sample>_R2.part_<###>.barcoded_rpm.RDtrim.fastq.gz: read 2 of fully-barcoded templates with a RPM tag (corresponding to RNA) where the RPM tag has been trimmed amd empty insert reads removed (SPIDR)
  - <sample>_R1.part_<###>.barcoded_dpm.RDtrim.fastq.gz: read 1 of fully-barcoded templates with a DPM tag (corresponding to genomic DNA) where the DPM tag has been trimmed and DPM dimers removed (CHIP-DIP)
- barcode_identification_efficiency.txt
- pipeline_counts.txt

For reproducibility, we recommend keeping the workflow, input, and output directories together. In other words, the complete directory should look like this GitHub repository with an extra output subdirectory (by default, results/) created upon running this pipeline.

However, the workflow directory can also be kept separate and used repeatedly on different datasets.

The working directory should stay with the input directory, so that the .snakemake folder containing the Snakemake pipeline metadata is paired with the configuration files.
- Update paths in config.yaml
  - The scripts_dir key must point to the workflow scripts directory.
  - Assuming that the above directory structure is followed, the other paths can remain relative paths to configuration files and resource files in the input directory.
- Update run_pipeline.sh: add --snakefile <path_to_Snakefile> and --workflow-profile <path_to_profile> arguments to the Snakemake invocation to use the Snakefile and workflow profile from the workflow directory.
- Run run_pipeline.sh from the input directory.

Input Files

Documentation for input files in this section is identical to that at config/README.md.

Configuration Files

These files are located under <input_directory>/config/.

config.yaml: Pipeline configuration - YAML file containing the processing settings and paths of required input files. Paths are specified relative to the working directory.
- Required? Yes. Must be provided in the working directory, or specified via --configfile <path_to_config.yaml> when invoking Snakemake.
- Required keys
  - barcode_config: path to barcode config file (e.g., config.txt)
  - barcode_format: path to barcode format file (e.g., format.txt)
  - scripts_dir: path to scripts folder in the workflow directory
  - samples: path to samples.json file
  - bead_umi_length: integer length of bead oligo UMIs
  - num_tags: number of split-and-pool barcode rounds (odd, even, term)
  - experiment_type: DNA or RNA workflow (CHIP-DIP or SPIDR)
- Required keys for DNA workflow
  - bowtie2_index_dna: path to Bowtie 2 genome index
  - cutadapt_dpm: path to DPM sequences
- Required keys for RNA workflow
  - bowtie2_index_rna: path to Bowtie 2 repeat genome index
  - star_index_rna: path to STAR transcriptome index
- Optional keys: If these keys are omitted from config.yaml or set to null, then they will take on the default values indicated.
  - output_dir (default = "results"): path to create the output directory within which all intermediate and output files are placed.
  - temp_dir (default = $TMPDIR (if set) or "/tmp"): path to a temporary directory with sufficient free disk space, such as used by the -T option of GNU sort
  - mask (default = null): path to BED file of genomic regions to ignore, such as ENCODE blacklist regions; reads mapping to these regions are discarded. If null, no masking is performed.
  - path_chrom_map (default = null): path to chromosome name map file. If null, chromosome renaming and filtering are skipped, and the final BAM and/or bigWig files will use all chromosome names as-is from the Bowtie 2 or STAR index.
  - num_chunks (default = 2): integer between 1 and 99 giving the number of chunks to split FASTQ files from each sample into for parallel processing
  - generate_splitbams (default = false): boolean value indicating whether to generate separate BAM files for each antibody target
  - min_oligos (default = 2): integer giving the minimum count of deduplicated antibody oligo reads in a cluster for that cluster to be assigned to the corresponding antibody target; this criteria is intersected (AND) with the proportion and max_size criteria
  - proportion (default = 0.8): float giving the minimum proportion of deduplicated antibody oligo reads in a cluster for that cluster to be assigned to the corresponding antibody target; this criteria is intersected (AND) with the min_oligos and max_size criteria
  - max_size (default = 10000): integer giving the maximum count of deduplicated genomic DNA reads in a cluster for that cluster to be to be assigned to the corresponding antibody target; this criteria is intersected (AND) with the proportion and max_size criteria
  - merge_samples (default = false): boolean indicating whether to merge cluster files and target-specific BAM and bigWig files across samples
  - binsize (default = false): integer specifying bigWig binsize; set to false to skip bigWig generation. Only relevant if generate_splitbams is true.
  - bigwig_normalization (default = "None"): normalization strategy for calculating coverage from reads; passed to the --normalizeUsing argument for the bamCoverage command from the deepTools suite. As of version 3.5.2, deepTools bamCoverage currently supports RPKM, CPM, BPM, RPGC, or None. Only relevant if bigWig generation is requested (i.e., generate_splitbams is true and binsize is not false).
  - email (default = null): email to send error notifications to if errors are encountered during the pipeline. If null, no emails are sent.
- Additional notes
  - null values can be specified explicitly (e.g., format: null) or implicitly (e.g., format: ).
  - For keys mask, path_chrom_map, and email, an empty string "" is treated identically to if the value is null.
samples.json: Samples file - JSON file with the paths of FASTQ files (read1, read2) to process.
- Required? Yes.
- config.yaml key to specify the path to this file: samples
- This can be prepared using fastq2json.py --fastq_dir <path_to_directory_of_FASTQs> or manually formatted as follows:
```
{
   "sample1": {
     "R1": [
       "<path_to_data>/sample1_run1_R1.fastq.gz",
       "<path_to_data>/sample1_run2_R1.fastq.gz",
     ],
     "R2": [
       "<path_to_data>/sample1_run1_R2.fastq.gz",
       "<path_to_data>/sample1_run2_R2.fastq.gz",
     ]
   },
   "sample2": {
     "R1": [
       "<path_to_data>/sample2_R1.fastq.gz"
     ],
     "R2": [
       "<path_to_data>/sample2_R2.fastq.gz"
     ]
   },
   ...
}
```
- Data assumptions:
  - FASTQ files are gzip-compressed.
  - Read names do not contain two consecutive colons (::). This is required because the pipeline adds :: to the end of read names before adding barcode information; the string :: is used as a delimiter in the pipeline to separate the original read name from the identified barcode.
- If there are multiple FASTQ files per read orientation per sample (as shown for sample1 in the example above), the pipeline will concatenate them and process them together as the same sample.
- Each sample is processed independently, generating independent cluster and BAM files. Statistics used for quality assessment (barcode identification efficiency, cluster statistics, MultiQC report, cluster size distributions, splitbam statistics) are computed independently for each sample but reported together in aggregate files to enable quick quality comparison across samples.
- The provided sample read files under the data/ folder were simulated via a Google Colab notebook. The genomic DNA reads correspond to ChIP-seq peaks on chromosome 19 (mm10) for transcription factors MYC (simulated as corresponding to Antibody ID BEAD_AB1-A1) and TCF12 (simulated as corresponding to Antibody ID BEAD_AB2-A2).
- Sample names (the keys of the samples JSON file) cannot contain any periods (.). This is enforced to simplify wildcard pattern matching in the Snakefile and to simplify implementation of workflow/scripts/python/threshold_tag_and_split.py:label_bam_file().

config.txt: Barcode config file - text file containing the sequences of split-pool tags and search parameters for tag findings

Required? Yes.
config.yaml key to specify the path to this file: barcode_config
Used by: splitcode software (Snakefile rule barcode_id), scripts/python/fastq_to_bam.py (Snakefile rule fastq_to_bam), and scripts/python/barcode_identification_efficiency.py (Snakefile rule barcode_identification_efficiency).
- This file is also parsed in the Snakefile itself if generate_splitbams is set to true in config.yaml to determine the set of antibody targets for which to generate individual de-multiplexed BAM files (and bigWig file too, if requested).

Format: See (https://splitcode.readthedocs.io/en/latest/example.html#config-file) for complete description of parameter options

An example barcoding configuration file is annotated below:

#Header Categories
#groups: Each tag belongs to a specific group (e.g., BPM, DPM, RPM, R1_ODD, R2_EVEN, R3_ODD, R4_EVEN, R5_ODD, Y (aka. TERM), UNUSED_EVEN, UNUSED_ODD)
#id: The specific name of each tag (e.g., BEAD_A1, ROUND1_A1)
#tags: The sequence of the tag
#distances: Error tolerance of sequence matching. 0 indicates no mismatches, inserts or deltions. We typically do not allow for mismatches for DPM, BPM and Y. 2:1:2 indicates 2 mismatches, 1 insertion/deletion and 2 total errors allowed. EVEN and ODD barcodes are designed to have a hamming distance of 2 (2 mismatches)
#locations: read and position of tag, formatted as read:start:stop. For example 0:0:9 indicates read 1 within positions 0-9 of the read (first 9 bps)
#maxFindsG: Maximum number of times a specific group can be found (e.g., for a value of 1, once a BPM tag is found, no other ids from the BPM group are searched for)
#next: Group and location to search for after current id is found. A value of - indicates no additional tags are expected (e.g., following DPM or BPM on read 1). For R2, the sequencing of split-and-pool tags is indicated through next. The structure of R2 after 6 rounds of barcoding is [Y][R5_ODD][R4_EVEN][R3_ODD][R2_EVEN][R1_ODD]. Accordingly, the next for Y is R5_ODD, the next for R5_ODD is R4_EVEN, etc. The group to search for is surrounded in two curly braces and the position to search is indicated by the following number (e.g., {{R5_ODD}}6 indicates search for R5_ODD group at least 6 bp after the current tag). For split-and-pool tags, we assume at least 6bp between tags to account for sticky ends.
#left (optional): Trim sequence and everything to left of identified sequence
groups  id  tags  distances  locations  maxFIndsG  next  left


# DPM tags
# 1. Group: DPM
# 2. Id: must contain "DPM", such as "DPM<xxx>"; must NOT contain "BEAD"
#    - Can only contain alphanumeric characters, underscores, and hyphens,
#      i.e., must match the regular expression "[a-zA-Z0-9_\-]+"
# 3. Tag sequence: (see resources/dpm96.fasta)
# 4. Distnaces (error tolerance): acceptable Hamming distance between dpm tags (usually 0)
# 5. Locations: First 8 bp of read 1
# 6. MaxFindG: 1, single DPM per read
# 7. Next: - (no following tags on read 1)
# 8. Left: 1 (trim DPM from read 1)
DPM	DPMBot6_1-A1	TGGGTGTT	0  0:0:8  1  -  1
DPM	DPMBot6_2-A2	TGACATGT	0  0:0:8  1  -  1
...

# RPM tags
# 1. Group: RPM
# 2. Id: must contain "RPM", such as "RPM<xxx>"; must NOT contain "BEAD"
#    - Can only contain alphanumeric characters, underscores, and hyphens,
# 3. Tag sequence: CTGACGCTAAGTGCTGAT
# 4. Distnaces (error tolerance): insertions/deletions:mismatch:total (2:1:2)
# 5. Locations: End of read 2 (for 6 rounds of full barcoding, after position 90 at a minimum)
# 6. MaxFindG: 1, single RPM per read
# 7. Next: - (no following tags on read 2)
# 8. Left: 0 (not trimmed, trimming during cutadapt)
RPM	RPM CTGACGCTAAGTGCTGAT  2:1:2  1:90  1  -  -
...

# Antibody ID oligonucleotide tags
# 1. Group: BPM
# 2. Id: must start with "BEAD_"
# 3. Tag sequences: see example config
# 4. Distances: acceptable Hamming distance between bpm tags (usually 0)
# 5. Locations: First 9 bp of read 1 (CHIP-DIP), positions 10-19 of read 1 (SPIDR)
# 6. MaxFindG: 1, single BPM per read
# 7. Next: - (no following tags on read 1)
# 8. Left: 1 (trim BPM from read 1)
BPM	BEAD_AB1-A1	GGAACAGTT	0  0:0:9  1  -  1
BPM	BEAD_AB2-A2	CGCCGAATT	0  0:0:9  1  -  1
...

# Split-pool even/odd tags
# 1. Group: R1_ODD, R2_EVEN, R3_ODD, R4_EVEN, R5_ODD
# 2. Id: ROUND#<xxx>; must not contain "BEAD"; # of ROUND should match R# of group
# 3. Tag sequences: see example config
# 4. Distances:  insertions/deletions:mismatch:total (2:1:2), hamming distance between EVEN/ODD tags is 2
# 5. Locations: R2, staggered with at least 15 bp added each round (in reality ODD/EVEN length is 24)
# 6. MaxFindG: 1, single tag for each round per read
# 7. Next: {{Previous round}}6, SPACER of 6 accounts for sticky ends
# 8. Left: -

R5_ODD  ROUND5_F12      AGACTCAATTAGGCT 2:1:2   1:15    1       {{R4_EVEN}}6  -
R3_ODD  ROUND3_D12      GCTGACATAAGACCT 2:1:2   1:45    1       {{R2_EVEN}}6  -
R1_ODD  ROUND1_B11      CCTTCCTATGCTACT 2:1:2   1:75    1       {{RPM}}6
...
R4_EVEN ROUND4_D12      ACTGTTCGACACGTC 2:1:2   1:30    1       {{R3_ODD}}6
R2_EVEN ROUND2_B12      GCACACGATCATCTG 2:1:2   1:60    1       {{R1_ODD}}6

# Split-pool term tags
# 1. Group: Y
# 2. Id: NYStgBot<xxx>; must not contain "BEAD"
# 3. Tag sequences: see example config
# 4. Distances:  acceptable Hamming distance between Y tags (usually 0)
# 5. Locations: start of read 2 (1:0:15)
# 6. MaxFindG: 1, single term tag per read
# 7. Next: {{Previous round}}6, SPACER of 6 accounts for sticky ends
# 8. Left: -
Y       NYStgBot_1-A1   TATTATGGT       0       1:0:15  1       {{R5_ODD}}6  -

Notes regarding the entries in example_config.txt
- Names: Each name ends with #-Well (for example, 4-A4) where the # gives the row-major index of the tag in a 96-well plate, and Well denotes the corresponding row and column.
- Sequences
  - The design of a DPM tags allows for 9 bp of unique sequence, but only 8 bp are used in the published SPRITE tag set (in bottom tags, the 9th bp is currently a constant 'T'). example_config.txt therefore only includes the unique 8 bp sequences.
  - The design of EVEN and ODD tags allows for 17 bp of unique sequence, but only 16 bp are used in the published SPRITE tag set (in bottom tags, the 17th bp is currently a constant 'T'). example_config.txt further trims the 1st unique bp from the bottom tag, leaving only the middle 15 bp unique bottom sequence.
  - The design of Y (terminal) tags allows for 9-12 bp of unique sequence.

format.txt: Barcode format file - sequence of tag groups that form a complete barcode, one sequence per line
- Required? Yes
- config.yaml key to specify the path to this file: barcode_format
- Used by: Splitcode software
```
DPM,Y,R5_ODD,R4_EVEN,R3_ODD,R2_EVEN,R1_ODD
BPM,Y,R5_ODD,R4_EVEN,R3_ODD,R2_EVEN,R1_ODD
Y,R5_ODD,R4_EVEN,R3_ODD,R2_EVEN,R1_ODD,RPM
```
chrom_map.txt: Chromosome names map - tab-delimited text file specifying which chromosomes from the Bowtie 2 index to keep and how to rename them (if at all).
- Required? No, but necessary if using a blacklist mask that uses different chromosome names than used in the Bowtie 2 index.
- config.yaml key to specify the path to this file: path_chrom_map
- Used by: scripts/python/rename_and_filter_chr.py (Snakefile rule rename_and_filter_chr, rule merge_mask, and rule effective_genome_size)
- Column 1 specifies chromosomes (following naming convention used in the index) to keep.
  - The order of chromosomes provided here is maintained in the SAM/BAM file header, and consequently specifies the coordinate sorting order at the reference sequence level.
- Column 2 specifies new chromosome names for the corresponding chromosomes in column 1.
- The provided chrom-map.txt in this repository contains examples for retaining only canonical human or mouse chromosomes (i.e., excluding alternate loci, unlocalized sequences, and unplaced sequences) and renaming them to UCSC chromosome names (i.e., chr1, chr2, ..., chrX, chrY, chrM) as needed. The header of provided file also includes more detailed documentation about the specific format requirements, such as allowed characters.

Resource Files

These files are located under <input_directory>/resources/.

dpm96.fasta: FASTA file containing the sequences of DPM tags
- Required? Yes.
- config.yaml key to specify the path to this file: cutadapt_dpm
- Used by: cutadapt (Snakefile rule cutadapt_dpm)
- Each of these sequences are 10 nt long, consisting of a unique 9 nt DPM_Bottom sequences as originally designed for SPRITE (technically, only the first 8 nt are unique, and the 9th sequence is always a T), plus a T that is complementary to a 3' A added to a genomic DNA sequence via dA-tailing.

blacklist_hg38.bed, blacklist_mm10.bed: blacklisted genomic regions for ChIP-seq data
- Required? No, but highly recommended.
- config.yaml key to specify the path to this file: mask
- Used by: Snakefile rule merge_mask, whose output is used by rule repeat_mask and rule effective_genome_size
- For human genome release hg38, we use ENCFF356LFX from ENCODE. For mouse genome release mm10, we use mm10-blacklist.v2.bed.gz. These BED files use UCSC chromosome names (e.g., chr1, chr2, ...). The pipeline performs chromosome name remapping (if specified) before this step.
  - Reference paper: Amemiya HM, Kundaje A, Boyle AP. The ENCODE Blacklist: Identification of Problematic Regions of the Genome. Sci Rep. 2019;9(1):9354. doi:10.1038/s41598-019-45839-z
  - Example code used to download them into the resources/ directory:
```
wget -O - https://www.encodeproject.org/files/ENCFF356LFX/@@download/ENCFF356LFX.bed.gz |
    zcat |
    sort -V -k1,3 > "resources/blacklist_hg38.bed"

wget -O - https://github.com/Boyle-Lab/Blacklist/raw/master/lists/mm10-blacklist.v2.bed.gz |
    zcat |
    sort -V -k1,3 > "resources/blacklist_mm10.bed"
```
index_mm10/*.bt2, index_hg38/*.bt2: Bowtie 2 genome index
- Required? Yes for DNA/CHIP-DIP.
- config.yaml key to specify the path to the index: bowtie2_index_dna
- Used by: Snakefile rule bowtie2_align_dna
- If you do not have an existing Bowtie 2 index, you can download pre-built indices from the Bowtie 2 developers:
```
# for human primary assembly hg38
mkdir -p resources/index_hg38
wget https://genome-idx.s3.amazonaws.com/bt/GRCh38_noalt_as.zip
unzip -j -d resources/index_hg38 GRCh38_noalt_as.zip \*.bt2

# for mouse primary assembly mm10
mkdir -p resources/index_mm10
wget https://genome-idx.s3.amazonaws.com/bt/mm10.zip
unzip -j -d resources/index_mm10 mm10.zip \*.bt2
```
  This will create a set of files under resources/index_hg38 or resources/index_mm10. If we want to use the mm10 genome assembly, for example, the code above will populate resources/index_mm10 with the following files: mm10.1.bt2, mm10.2.bt2, mm10.3.bt2, mm10.4.bt2, mm10.rev.1.bt2, mm10.rev.2.bt2. The path prefix to this index (as accepted by the bowtie2 -x <bt2-idx> argument) is therefore resources/index_mm10/mm10, which is set in the configuration file, config.yaml.
  
  Note that the pre-built indices linked above use UCSC chromosome names (chr1, chr2, ..., chrX, chrY, chrM). If your alignment indices use different chromosome names (e.g., Ensembl chromosome names are 1, 2, ..., X, Y, MT), update chrom-map.txt such that chromosome names in BAM files are converted to UCSC chromosome names. You can check the names of the reference sequences used to build the index by using the command bowtie2-inspect -n <bt2-idx>.
index_ncRNA/*.bt2: Bowtie 2 repeat transcriptome index
- Required? Yes for RNA/SPIDR.
- config.yaml key to specify the path to the index: bowtie2_index_rna
- Used by: Snakefile rule bowtie2_align_rna Custom built from fasta file containing sequences for repeat RNAs
index_star_mm10/, index_star_hg38/: STAR genome index
- Required? Yes for RNA/SPIDR.
- config.yaml key to specify the path to the index: star_index_rna
- Used by: Snakefile rule star_align

Workflow Profile Configuration

These files are located under <workflow_directory>/profiles/.

<profile>/config[.vX+].yaml: Workflow profile config file

Required? Yes if running on a compute cluster, such as a SLURM environment.
The path to this file is specified using the --workflow-profile argument to the snakemake program -- i.e., in the run_pipeline.sh script.
The optional .vX+ part of the filename denotes the minimum supported Snakemake major version X.
This file specifies workflow-default command line options for Snakemake as well as the resources available to each rule.
- Paths are interpreted relative to the working directory. Global environment variables such as $USER can be used. [Snakemake documentation]
- Values for resource specifications can be dynamically set, including use of Python code like int(f"{threads}").
Two profiles are provided with this workflow. Key options are described below.
- default: for local execution
  - cores: maximum number of CPU cores to use in parallel
- slurm: for execution on a SLURM server
  - jobs: maximum number of jobs to submit at a time
  - latency-wait: number of seconds to wait for output files of a job to be visible on the file system after the job finished to account for file system latency. This is particularly relevant for distributed computing environments (e.g., on an HPC cluster) with shared/networked file systems.
  - default-resources
    - Other potentially useful SLURM options that are not currently specified in the profile config (see SLURM exectuor plugin documentation):
      - slurm_partition: <string>
      - slurm_account: <string>
  - slurm-logdir: directory to save SLURM log files; used by the SLURM executor plugin
    - SLURM standard output and standard error log files are combined into a single log file output to <slurm-logdir>/<rule>/<wildcards>/<jobid>.log where <wildcards> is an underscore-delimited string of wildcards.
  - slurm-delete-logfiles-older-than: SLURM log files will be deleted after this many days; used by the SLURM executor plugin
  - Other potentially useful Snakemake options that are not currently specified in the profile config:
    - max-threads: <int> - Define the maximum number of threads available to any individual rule.
    - cores: <int> - Define the maximum number of total cores to request from the cluster scheduler.

Output Files

These files are generated in the output directory, which is specified by the output_dir key in the pipeline configuration file.

Barcode identification efficiency (barcode_identification_efficiency.txt): A statistical summary of how many tags were found per read and the proportion of reads with a matching tag at each position.
Pipeline counts (pipeline_counts.txt): A tree-like summary of how many reads remained at each step of the pipeline, produced per aliquot and in aggregate. This can be used to quickly view the proportion of reads corresponding to antibody oligos versus genomic DNA reads versus RNA reads; the proportion of properly barcoded reads; etc.
- The 4 columns of numbers are as follows:
  1. The number of reads remaining after that step
  2. The proportion of reads remaining relative to the immediately previous step of the pipeline
  3. The proportion of reads remaining relative to the read type - antibody ID oligo (bpm), genomic DNA (dpm) or RNA (rpm)
  4. The proportion of reads remaining relative to the starting number of reads.
- A tabular version is saved to qc/pipeline_counts.csv
Cluster file (clusters/<sample>.clusters): Tab-delimited file, where each line represents a single cluster.
- The first column is the cluster barcode.
- The remainder of the line is a list of reads. DNA reads are formatted as DPM[strand]_chr:start-end, RNA reads are formatted as RPM[strand]_chr:start-end or RPM[strand]_repeat:start-end and antibody oligo reads are formatted as BPM[]_<AntibodyID>:<UMI>-0.
Cluster statistics (clusters/cluster_statistics.txt): The number of clusters and antibody oligo (BPM) or genomic DNA (DPM) reads per library.
Cluster size distribution (clusters/[BPM,DPM,RPM]_cluster_distribution.pdf): The proportion of clusters that belong to each size category.
Cluster size read distribution (clusters/[BPM,DPM,RPM]_read_distribution.pdf): The proportion of reads that belong to clusters of each size category.
- This can be more useful than the number of clusters since relatively few large clusters can contain many sequencing reads (i.e., a large fraction of the library) while many small clusters will contain few sequencing reads (i.e., a much smaller fraction of the library).
Maximum representation oligo eCDF (clusters/Max_representation_ecdf.pdf): A plot showing the distribution of proportion of antibody oligo (BPM) reads in each cluster that belong to the maximum represented Antibody ID in that cluster.
- A successful experiment should have an ECDF close to a right angle. Deviations from this indicate that beads contain mixtures of Antibody IDs. Understanding the uniqueness of Antibody ID reads per cluster is important for choosing the thresholding parameter proportion for cluster assignment.
Maximum representation oligo counts ECDF (clusters/Max_representation_counts.pdf): A plot showing the distribution of number of antibody oligo (BPM) reads in each cluster that belong to the maximum represented Antibody ID in that cluster.
- If clusters are nearly unique in Antibody ID composition, then this plot is a surrogate for BPM size distribution. Understanding the number of Antibody ID reads per cluster is important for choosing the thresholding parameters min_oligo for cluster assignment.
BAM files for individual antibodies (splitbams/*.bam)
- Thresholding criteria (min_oligos, proportion, max_size) for assigning individual clusters to individual antibodies are set in config.yaml.
- The "none" BAM file (<sample>.[DNA,RNA].merged.labeled_none.bam) contains DNA or RNA reads from clusters without antibody oligo reads.
- The "ambiguous" BAM file (<sample>.[DNA,RNA].merged.labeled_ambiguous.bam) contains DNA or RNA reads from clusters that failed the proportion thresholding criteria.
- The "uncertain" BAM file (<sample>.[DNA,RNA].merged.labeled_uncertain.bam) contains DNA or RNA reads from clusters that failed the min_oligo thresholding criteria.
- The "filtered" BAM file (<sample>.[DNA,RNA].merged.labeled_filtered.bam) contains DNA or RNA reads from clusters that failed the max_size thresholding criteria.
Read count summary for individual antibodies (splitbams/splitbam_statistics.txt)
- The number of read counts contained within each individual BAM file assigned to individual antibodies.
BigWig files for individual antibodies (bigwigs/*.bw)
- BigWigs are generated using deeptools bamCoverage with binsize set in config.yaml.

Additional Resources

Calculator for uniqueness of bead barcodes: Google Colab notebook (also available as GitHub Gist). This calculator can be used to calculate one of two values:

Given the number of beads and the number of possible unique barcodes, calculate the proportion of beads expected to be uniquely barcoded.
Given the number of beads and a desired proportion of beads to be uniquely barcoded, calculate the number of barcodes required.

Publications

Perez AA*, Goronzy IN*, Blanco MR, Yeh BT, Guo JK, Lopes CS, Ettlin O, Burr A, Guttman M. ChIP-DIP maps binding of hundreds of proteins to DNA simultaneously and identifies diverse gene regulatory elements. Nat Genet. 2024 Dec;56(12):2827–2841. doi:10.1038/s41588-024-02000-5.
Wolin E*, Guo JK*, Blanco MR*, Goronzy IN**, Gorhe D**, Dong W** Perez AA, Keskin A, Valenzuela E, Abdou AA, Urbinati CR, Kaufhold R, Rube HT, Brito Querido J, Guttman M, Jovanovic M. SPIDR enables multiplexed mapping of RNA-protein interactions and uncovers a mechanism for selective translational suppression upon cell stress. Cell. 2025 Sep 18;188(19):5384-5402.e25. doi: [10.1016/j.cell.2025.06.042]

Credits

Developers

Isabel Goronzy (@igoronzy): Designed pipeline. Merged DNA (chipdip) and RNA (spidr) pipelines into unified pipeline.
Jimmy Guo: Wrote RNA processing steps.
Benjamin Yeh (@bentyeh): Streamlined and documented the pipeline.

Other contributors

Andrew Perez (@HeyDrew64)
Mario Blanco (@mrblanco)
Mitchell Guttman (@mitchguttman)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

bead-barcoding-pipeline

Overview

Quick Start

System Requirements

Pipeline

Directory Structures

Input Files

Configuration Files

Resource Files

Workflow Profile Configuration

Output Files

Additional Resources

Credits

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
config		config
data		data
resources		resources
workflow		workflow
LICENSE		LICENSE
README.md		README.md
fastq2json.py		fastq2json.py
run_pipeline.sh		run_pipeline.sh

License

GuttmanLab/bead-barcoding-pipeline

Folders and files

Latest commit

History

Repository files navigation

bead-barcoding-pipeline

Overview

Quick Start

System Requirements

Pipeline

Directory Structures

Input Files

Configuration Files

Resource Files

Workflow Profile Configuration

Output Files

Additional Resources

Credits

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages