scbirlab/nf-reclaim is a Nextflow pipeline to identify bacterial orthologs and putative spectrum of activity of targets with known inhibitors.
Table of contents
- Processing steps
- Requirements
- Quick start
- Inputs
- Outputs
- Credit
- Issues, problems, suggestions
- Further help
scbirlab/nf-reclaim carries out the following steps:
- Fetch all high confidence targets with inhibitors from ChEMBL
- Fetch all protein sequences for targets from UniProt
- Fetch proteomes of query organisms from UniProt
- BLAST target protein sequences against query organism proteomes
- Annotate LOEUF genomic constraint scores for human targets
- Output inhibitors for targets meeting identity, coverage, and LOEUF cutoffs from ChEMBL
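The final filtering step can be pictured as a threshold check on each BLAST hit. The sketch below is illustrative only and is not the pipeline's code. The `Hit` fields and the `passes_cutoffs` function are hypothetical names. It assumes that identity and coverage are expressed as fractions, that cutoffs are inclusive, that a hit is kept when its LOEUF is at or above the cutoff, and that targets without a LOEUF score skip that check. The cutoff values are placeholders taken from the examples in this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hit:
    """One target-vs-proteome alignment (hypothetical field names)."""
    target: str
    identity: float          # fraction of identical residues, 0-1 (assumed scale)
    coverage: float          # fraction of the target covered, 0-1 (assumed scale)
    loeuf: Optional[float]   # LOEUF score; None when no human score is available


def passes_cutoffs(hit, min_identity=0.3, min_coverage=0.5, min_loeuf=0.515):
    """Return True if the hit meets every cutoff (assumed to be inclusive)."""
    if hit.identity < min_identity or hit.coverage < min_coverage:
        return False
    # Assumption: targets without a LOEUF score are not filtered on it.
    return hit.loeuf is None or hit.loeuf >= min_loeuf


# Made-up hits, purely for illustration.
hits = [
    Hit("target_A", identity=0.42, coverage=0.81, loeuf=0.90),
    Hit("target_B", identity=0.25, coverage=0.95, loeuf=0.90),  # identity too low
    Hit("target_C", identity=0.55, coverage=0.60, loeuf=0.20),  # LOEUF too low
    Hit("target_D", identity=0.35, coverage=0.70, loeuf=None),  # no human score
]
kept = [h.target for h in hits if passes_cutoffs(h)]
print(kept)
```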
In parallel, the pipeline fetches data to assess LOEUF cutoffs for toxicity:
- Fetch all inhibitors with cell IC50 or CC50 and target biochemical $K_i$ from ChEMBL
- Fetch all targets of these inhibitors from ChEMBL
- Filter down to inhibitors that have target biochemical $K_i$ from ChEMBL
- Output cell line IC50 paired with target biochemical $K_i$
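The pairing in the last step amounts to a join on the inhibitor. Here is a minimal sketch, assuming each record is a dict with hypothetical keys (`compound`, `cell_line`, `ic50_nM`, `target`, `ki_nM`) rather than the pipeline's real output columns. The values are made-up placeholders.

```python
from collections import defaultdict

# Made-up records, purely for illustration.
cell_records = [
    {"compound": "cmpd_1", "cell_line": "HEK293T", "ic50_nM": 5000.0},
    {"compound": "cmpd_2", "cell_line": "HCT116", "ic50_nM": 120.0},
    {"compound": "cmpd_3", "cell_line": "CHO", "ic50_nM": 900.0},
]
ki_records = [
    {"compound": "cmpd_1", "target": "target_A", "ki_nM": 10.0},
    {"compound": "cmpd_2", "target": "target_B", "ki_nM": 3.0},
]


def pair_ic50_with_ki(cell_records, ki_records):
    """Join cell-line IC50 rows to biochemical Ki rows on the compound ID."""
    ki_by_compound = defaultdict(list)
    for rec in ki_records:
        ki_by_compound[rec["compound"]].append(rec)
    paired = []
    for cell in cell_records:
        # Compounds with no biochemical Ki are dropped (the "filter down" step).
        for ki in ki_by_compound.get(cell["compound"], []):
            paired.append({**cell, **ki})
    return paired


paired = pair_ic50_with_ki(cell_records, ki_records)
for row in paired:
    print(row["compound"], row["cell_line"], row["ic50_nM"], row["target"], row["ki_nM"])
```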
You need Nextflow and either Anaconda, Singularity, or Docker to be installed.
If you're at the Crick or your shared cluster has Nextflow and Singularity already installed, try:
module load Nextflow Singularity
Otherwise, if it's your first time using Nextflow on your system, you can install it using conda:
conda install -c bioconda nextflow
You may need to set the NXF_HOME environment variable. For example,
mkdir -p ~/.nextflow
export NXF_HOME=~/.nextflow
To make this a permanent change, you can do something like the following:
mkdir -p ~/.nextflow
echo "export NXF_HOME=~/.nextflow" >> ~/.bash_profile
source ~/.bash_profile
The easiest way to get going is by specifying parameters on the command-line:
nextflow run scbirlab/nf-reclaim \
--organism_id 243273 \
--min_identity 0.3 \
--min_coverage 0.5 \
--min_pchembl 7.0
Here's what the flags mean:
- --organism_id: the Taxon ID of the organism, which you can find at NCBI or UniProt
- --min_identity (optional): minimum amino acid identity for orthology
- --min_coverage (optional): minimum coverage for orthology
- --min_pchembl (optional): minimum reported pChEMBL (potency) for inhibitors
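For intuition about `--min_pchembl`: pChEMBL is the negative base-10 logarithm of a molar potency value (for example an IC50 or $K_i$), so a cutoff of 7.0 corresponds to 100 nM or more potent. The helper names below are illustrative and are not part of the pipeline.

```python
import math


def pchembl_from_nM(value_nM):
    """Convert a potency in nanomolar to the pChEMBL scale (-log10 of molar)."""
    return -math.log10(value_nM * 1e-9)


def nM_from_pchembl(pchembl):
    """Convert a pChEMBL value back to a potency in nanomolar."""
    return 10 ** (9 - pchembl)


print(round(pchembl_from_nM(100.0), 2))  # potency of 100 nM on the pChEMBL scale
print(round(nM_from_pchembl(6.0), 2))    # a pChEMBL of 6.0 expressed in nM
```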
Other options are available.
scbirlab/nf-reclaim runs on a Singularity container engine by default to ensure software versions are consistent. If you have
Docker installed, you can use it instead by running with -with-docker, or if you have Conda you can run with -with-conda.
Make a sample sheet (see below) with columns representing the flags above, and,
optionally, a nextflow.config file in the directory where you want the pipeline to run.
Then simply run:
nextflow run scbirlab/nf-reclaim
If you want to run a particular tagged version of the pipeline, such as v0.0.2, you can do so using
nextflow run scbirlab/nf-reclaim -r v0.0.2
For help, use nextflow run scbirlab/nf-reclaim --help.
The first time you run the pipeline on your system, the software dependencies in environment.yml will be installed.
This may take several minutes.
The pipeline can be run with command-line arguments:
nextflow run scbirlab/nf-reclaim --organism_id <taxon ID>
The following parameters are required:
--organism_id Taxon ID for organism
# or if using sample sheet
--sample_sheet CSV listing Taxon ID for multiple organisms
The following parameters are optional and have defaults:
- min_identity = 35: minimum amino acid identity for orthology
- min_coverage = 0.7: minimum sequence coverage for orthology
- min_loeuf = 0.515: minimum genomic constraint for human targets
- min_pchembl = 6.0: minimum inhibitor pChEMBL
- gnomad_version = "4.1": which gnomAD version to use for LOEUF values
- tox_cell_lines = ["HCT116","HEK293T",...,"CHO"]: cell lines to fetch toxicity data
- outputs = "outputs": output directory
You can run multiple combinations in one command using a sample sheet. The sample sheet is a CSV file with one row per combination of parameters to run.
nextflow run scbirlab/nf-reclaim --sample_sheet path/to/sample-sheet.csv
Here is an example sample sheet for finding putative inhibitors of orthologous targets in Mycoplasma genitalium:
| organism_id | proteome_name |
|---|---|
| 243273 | "Mycoplasma genitalium" |
Further examples are in the test directory of this repository.
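If you prefer to generate the sample sheet programmatically, a sketch along these lines should work. It assumes the two column names shown in the example table above; the file name `sample-sheet.csv` is arbitrary.

```python
import csv

# One row per organism to run; values taken from the example table above.
rows = [
    {"organism_id": 243273, "proteome_name": "Mycoplasma genitalium"},
]

# newline="" lets the csv module control line endings itself.
with open("sample-sheet.csv", "w", newline="") as handle:
    writer = csv.DictWriter(handle, fieldnames=["organism_id", "proteome_name"])
    writer.writeheader()
    writer.writerows(rows)

# Read the file back to check what was written.
with open("sample-sheet.csv", newline="") as handle:
    print(handle.read())
```

The resulting file can then be passed to the pipeline with `--sample_sheet sample-sheet.csv`.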
For reproducibility, self-documentation, and to save typing,
parameters with the same names as the command line flags above can be
provided in a nextflow.config file in the working directory. For example:
params {
organism_id = "243273"
}
Or with a sample sheet:
params {
sample_sheet = "path/to/sample-sheet.csv"
}
Outputs are saved in the output directory set by the outputs parameter described above.
Add to the issue tracker.
Here are the pages of the software and databases used by this pipeline.
Databases:
- ChEMBL for inhibitors and targets
- UniProt for protein sequences
- NCBI GenBank for taxonomy
Software:
- diamond to BLAST many-against-many protein sequences.