Open
Commits
95 commits
ac4176d
update TSCC version with latest Skipper code
byee4 Jun 19, 2023
b4e3536
bedgraphtobigwig -> executable
byee4 Jun 20, 2023
6332383
add multiqc
byee4 Jul 2, 2023
c5fa28f
update profile for tscc/yeo queue
byee4 Sep 20, 2023
9cf716d
adds Skipper_ark.py and ark config
byee4 Oct 12, 2023
a8ebd71
add warnings to detect wrong GFF.
algaebrown Oct 12, 2023
d9983e1
disentangle config
algaebrown Oct 12, 2023
0616fad
add tscc 2.0 config
Oct 12, 2023
fbf62dc
modulize skipper and make PE end2end reuse rules
algaebrown Nov 1, 2023
d82f013
make input/output more explicit
algaebrown Dec 6, 2023
454e0ce
conda env for some rules
algaebrown Dec 6, 2023
f6f10a9
new rules for masking and ML
algaebrown Dec 6, 2023
0e1a991
adding new outputs to main
algaebrown Dec 6, 2023
2a1d2ec
scripts associated with new rules
algaebrown Dec 6, 2023
db1d333
remove unused stuffs
algaebrown Dec 6, 2023
7e4d164
incoporate containers;remove unused variables
algaebrown Dec 8, 2023
c297e83
add gitignore
algaebrown Dec 8, 2023
827f7a4
fix wrong docker and silent failure
algaebrown Dec 11, 2023
ef838e4
Merge pull request #19 from algaebrown/vbb63a25-tscc-charlene
byee4 Jan 2, 2024
664b366
update memory sys reqs and comment non-working modules
Jan 2, 2024
f8f7c84
somehow with these changes, scaled bigwig rule works
Jan 3, 2024
eacd4b6
update TSCC 2.0 Paths
algaebrown Jan 3, 2024
7ff2e00
add memory and time
algaebrown Jan 4, 2024
91ce2b3
fix both have no enriched windows
algaebrown Jan 4, 2024
a12dec7
bcftools conda env for variant rules
algaebrown Jan 4, 2024
28fa6b7
Fix running time and paths
algaebrown Jan 4, 2024
d43ffd5
uncomment rule outputs
Jan 10, 2024
309fb40
update memory resources
Jan 11, 2024
df3f533
fix some syntax issues
Jan 11, 2024
c13cd36
outputs vcf
Jan 12, 2024
6476030
fix output naming
algaebrown Jan 13, 2024
16ac966
1.99.0
May 3, 2024
65f34da
Merge pull request #30 from YeoLab/resources
byee4 May 3, 2024
0423242
adds a second round of trimming to remove adapter dimers
Aug 27, 2024
53602f6
update config with deep learning paths
algaebrown Sep 10, 2024
250cfa9
add rule to make coverage tracks
algaebrown Sep 10, 2024
c733dc0
rules to annotate finemapped windows and find regions that are both t…
algaebrown Sep 10, 2024
3e9908a
comment because they cause problem during rerun
algaebrown Sep 10, 2024
7f35699
count uniquely mapped reads
algaebrown Sep 10, 2024
f69b538
count uniquely mapped reads
algaebrown Sep 10, 2024
fa06d53
new deep learning code for training and variants
algaebrown Sep 10, 2024
93234c4
old gkmsvm code
algaebrown Sep 10, 2024
9d2faaf
old gkmsvm code but important for new model benchmark
algaebrown Sep 10, 2024
5f9f1a6
deep learning conda environment, sometimes has numba issue on TSCC.
algaebrown Sep 10, 2024
e374a81
popgen and disease stuffs (variant scripts)
algaebrown Sep 10, 2024
e70e27e
old gkmsvm code
algaebrown Sep 10, 2024
effc754
auxilliary code for fundemental eclip analysis
algaebrown Sep 10, 2024
f683c7e
utility code for gnomAD stuffs. MAPs and o/e scaling
algaebrown Sep 10, 2024
71e42f7
utility code for gnomAD and mutation rate model
algaebrown Sep 10, 2024
fad08fc
old utility code for gnomAD scaling o/e
algaebrown Sep 10, 2024
3dd9941
utility code to score any variant with trained deep learning model.
algaebrown Sep 10, 2024
c7fe396
utility gnomAD stuffs
algaebrown Sep 10, 2024
463a617
fundemental eclip analysis
algaebrown Sep 10, 2024
3915109
configs charlene trying to help others
algaebrown Sep 10, 2024
1d3af0d
profiles to run on single node, cpu or gpu clusters
algaebrown Sep 10, 2024
0c0c1a3
added ML and variant code;
algaebrown Sep 10, 2024
6975b9f
WIP; porting to snakemake8 for end-to-end. miscellenous utility
algaebrown Sep 10, 2024
5d1b5c4
Merge remote-tracking branch 'upstream/trim_better' into vbb63a25-tsc…
algaebrown Sep 10, 2024
893bc63
update readme with GPU instructions
algaebrown Sep 10, 2024
02e47c4
fix missing coverage rule
algaebrown Sep 10, 2024
a351238
remove unused gkmsvm rules
algaebrown Sep 10, 2024
016b805
fix bug. too little variant don't test.report N
algaebrown Sep 10, 2024
898958b
fix container and numba problem
algaebrown Sep 11, 2024
e2ec39f
fix empty file problem and numba problem and container
algaebrown Sep 11, 2024
5bcb403
miscellaneous
algaebrown Sep 11, 2024
602a115
Merge pull request #1 from algaebrown/cleanup-gkmsvm
algaebrown Sep 11, 2024
aeb1177
Merge pull request #32 from YeoLab/trim_better
algaebrown Sep 27, 2024
abe28a8
fix zarr path
algaebrown Oct 25, 2024
c956dbf
Merge pull request #35 from algaebrown/vbb63a25-tscc-charlene
algaebrown Nov 20, 2024
af95cd4
utility code to score variants
algaebrown Jan 16, 2025
5122ade
update until submission
algaebrown Apr 13, 2025
8a57c0b
update config
algaebrown Apr 13, 2025
2a77693
refactor outputs into two parts
algaebrown Apr 13, 2025
9f746b9
move param to resource
algaebrown Apr 13, 2025
3aa57c4
remove unused conda yaml
algaebrown Apr 13, 2025
e63e411
moving things that are not in main pipeline
algaebrown Apr 13, 2025
9ce9801
it seem to work on snakemake9, finger crossed for end-to-end
algaebrown Apr 13, 2025
15e8492
factor out vcf related params
algaebrown Apr 19, 2025
1d2fc52
output reproduciblity odds ratio
algaebrown Apr 19, 2025
3b68b4a
add nread in peak QC metric
algaebrown Apr 19, 2025
e10d9d4
set rerun-incomplete to true
algaebrown Apr 19, 2025
e6d740d
add popgen reference files
algaebrown Apr 22, 2025
2f9b8cf
update readme
algaebrown Apr 22, 2025
27acdac
isolate reference variables
algaebrown Apr 22, 2025
685e0c8
fix GPU rules
algaebrown Apr 22, 2025
2390a10
output reproducibility odds ratio
algaebrown Apr 22, 2025
d6f6429
add unset slurm ID
algaebrown Apr 22, 2025
cfe09b0
Merge branch 'dev/charlene/snakemake9' into vbb63a25-tscc-charlene
algaebrown Apr 22, 2025
69e01f0
include VEP envvars
byee4 Apr 23, 2025
02fc6bd
updates resource requirements to work with most encode data
byee4 May 17, 2025
a47c35a
Merge pull request #47 from algaebrown/vbb63a25-tscc-charlene
byee4 May 17, 2025
45853fb
Revert "Compatible with snakemake 9. Can run end-to-end with all the …
byee4 May 17, 2025
25136a2
Merge pull request #49 from YeoLab/revert-47-vbb63a25-tscc-charlene
byee4 May 17, 2025
01da75f
Merge branch 'update_resource_reqs' into modularized
byee4 May 17, 2025
2c463f5
move param to resource
byee4 Jun 17, 2025
3 changes: 3 additions & 0 deletions .gitignore
@@ -0,0 +1,3 @@
.snakemake
.ipynb_checkpoints
__pycache__
67 changes: 44 additions & 23 deletions README.md
@@ -32,26 +32,6 @@ For example, below are some commands for installing Miniconda and Snakemake.

`conda create -c conda-forge -c bioconda -n snakemake snakemake`

Skipper requires several R packages. In order to install the precise versions used in the manuscript, we have scripts to install the used versions of R and corresponding packages from source.

Use conda to create an environment for installing R:

`conda env create -f documents/rskipper.yml`

Use the get_R.sh script to complete installation of R. Expect the whole process to take around 4 hours. Provide your conda directory as the first argument and the directory you wish to install R as the second:

`bash -l tools/get_R.sh /home/eboyle/miniconda3 /projects/ps-yeolab3/eboyle/encode/pipeline/gran`

Alternatively, at least as of this writing, Skipper is compatible with the newest version of R and its packages. The required packages can be installed for an existing R installation as follows:

`install.packages(c("tidyverse", "VGAM", "viridis", "ggrepel", "RColorBrewer", "Rtsne", "ggupset", "ggdendro", "cowplot"))`

`if (!require("BiocManager", quietly = TRUE))`
`install.packages("BiocManager")`
`BiocManager::install(c("GenomicRanges","fgsea","rtracklayer"))`

Paths to locally installed versions can be supplied in the config file, described below.

<h2>Preparing to run Skipper</h2>
Skipper uses a Snakemake workflow. The `Skipper.py` file contains the rules necessary to process CLIP data from fastqs. Skipper also supports running on BAMs - note that Skipper's analysis of repetitive elements will assume that non-uniquely mapping reads are contained within the BAM files.

@@ -66,14 +46,16 @@ Numerous resources must be entered in the `Skipper_config.py` file:
| MANIFEST | Information on samples to run |
| GENOME | Samtools- and STAR-indexed fasta of genome for the sample of interest |
| STAR_DIR | Path to STAR reference for aligning sequencing reads |
| WORKDIR | Path to outputs |
| protocol | `ENCODE3` for paired-end data; `ENCODE4` for single-end data |


Other paths to help Skipper run must be entered:

| Path | Description |
| ----------- | ----------- |
| EXE_DIR | For convenience to point to stable locally installed software: it is added to PATH when Skipper runs |
| TOOL_DIR | Directory containing the tools from the GitHub repository |
| RBPNET_PATH | Directory containing the deep learning code [RBPNet](https://github.com/algaebrown/RBPNet/) |


Information about the CLIP library to be analyzed is also required:
@@ -137,11 +119,42 @@ Remember to load the Snakemake environment before running

Use the dry run function to confirm that Snakemake can parse all the information:

`snakemake -ns Skipper.py -j 1`
```
snakemake -kps Skipper.py \
--configfile $CONFIG \
--profile profiles/tscc2 -n
```
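
The dry run depends on `$CONFIG` pointing at a real config file, so a small preflight check can catch an unset or mistyped path before Snakemake starts. The following is a minimal sketch; the function name `check_config` is hypothetical and not part of Skipper:

```shell
# Hypothetical helper (not part of Skipper): verify that a config path
# is non-empty and points to a readable file before invoking Snakemake.
check_config() {
    config_path="$1"
    if [ -z "$config_path" ]; then
        echo "CONFIG is not set" >&2
        return 1
    fi
    if [ ! -r "$config_path" ]; then
        echo "Cannot read config file: $config_path" >&2
        return 1
    fi
    return 0
}
```

It could be chained in front of the dry run, for example `check_config "$CONFIG" && snakemake -kps Skipper.py --configfile "$CONFIG" --profile profiles/tscc2 -n`.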

Once Snakemake has confirmed DAG creation, submit the jobs using whatever high performance computing infrastructure options suit you:

`snakemake -kps Skipper.py -w 15 -j 30 --cluster "qsub -e {params.error_file} -o {params.out_file} -l walltime={params.run_time} -l nodes=1:ppn={threads} -q home-yeo"`
```
snakemake -kps Skipper.py \
--configfile $CONFIG \
--profile profiles/tscc2
```

Some deep learning rules benefit from a GPU. As a temporary solution, run the workflow in three stages:

```
# Run on CPU until data preparation
CONFIG=/tscc/nfs/home/hsher/projects/skipper/encode_configs/Skipper_pe_small_test.yaml

snakemake -kps Skipper.py \
--configfile $CONFIG \
--profile profiles/tscc2 --until rbpnet_prepare_data

# Perform training, validation and seqlet finding using GPU
snakemake -kps Skipper.py \
--configfile $CONFIG \
--profile profiles/tscc2_gpu --until rbpnet_seqlet

# Finish the remaining using CPU
snakemake -kps Skipper.py \
--configfile $CONFIG \
--profile profiles/tscc2

# Note: this is not end-to-end yet. Snakemake 8 can run it end-to-end, but that requires some refactoring.
```
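
The three stages above can be chained so that a later stage only starts if the previous one succeeded. This is a sketch under assumptions: the function names `run_stage` and `run_all_stages` and the `SNAKEMAKE` override variable are hypothetical, while the profiles and rule names are the ones used in the commands above.

```shell
# Hypothetical wrapper (not part of Skipper) around the three stages above.
# SNAKEMAKE can be overridden (e.g. SNAKEMAKE=echo) to preview the commands.
SNAKEMAKE="${SNAKEMAKE:-snakemake}"

# run_stage PROFILE [UNTIL_RULE]: run Skipper.py with the given profile,
# optionally stopping after the named rule. Expects CONFIG to be set.
run_stage() {
    profile="$1"
    until_rule="${2:-}"
    if [ -n "$until_rule" ]; then
        "$SNAKEMAKE" -kps Skipper.py --configfile "$CONFIG" \
            --profile "$profile" --until "$until_rule"
    else
        "$SNAKEMAKE" -kps Skipper.py --configfile "$CONFIG" \
            --profile "$profile"
    fi
}

# run_all_stages: CPU data preparation, GPU training through seqlet
# finding, then the remaining rules on CPU. Stops at the first stage
# that exits with a non-zero status.
run_all_stages() {
    run_stage profiles/tscc2 rbpnet_prepare_data &&
    run_stage profiles/tscc2_gpu rbpnet_seqlet &&
    run_stage profiles/tscc2
}
```

Setting `SNAKEMAKE=echo` before calling `run_all_stages` prints the three command lines instead of running them, which is a cheap way to check the arguments.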

Did Skipper terminate? Sometimes jobs fail - inspect any error output and rerun the same command if there is no apparent explanation such as uninstalled dependencies or a misformatted input file. Snakemake will try to pick up where it left off.
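
For failures that look transient, rerunning the same command can be automated with a bounded retry loop. This is only a sketch: `retry` is a hypothetical helper, not part of Skipper, and it is no substitute for reading the error output first.

```shell
# Hypothetical helper (not part of Skipper).
# retry MAX_ATTEMPTS COMMAND [ARGS...]: rerun COMMAND until it succeeds
# or MAX_ATTEMPTS is reached. Returns 0 on success, 1 otherwise.
retry() {
    max_attempts="$1"
    shift
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        "$@" && return 0
        echo "Attempt $attempt of $max_attempts failed" >&2
        attempt=$((attempt + 1))
    done
    return 1
}
```

For example, `retry 3 snakemake -kps Skipper.py --configfile "$CONFIG" --profile profiles/tscc2` would rerun the submission command up to three times.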

@@ -164,3 +177,11 @@ Skipper produces a lot of output. The `output/figures` directory contains figure
Annotated reproducible enriched windows can be accessed at `output/reproducible_enriched_windows/` and Homer motif output is at `output/homer/`

Example CLIP fastqs and processed data are available at GEO and SRA: `https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE213867`

## Common problems:
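1. Pulling Singularity images fails with "no space left on device". Point Singularity's temporary and cache directories at a location with more free space, for example: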
```
export SINGULARITY_TMPDIR=/tscc/lustre/ddn/scratch/hsher/singularity_tmp
export TMPDIR=/tscc/lustre/ddn/scratch/hsher/singularity_tmp
export SINGULARITY_CACHEDIR=/tscc/lustre/ddn/scratch/hsher/singularity_cache
```
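
The exports above assume the target directories already exist. A minimal sketch that creates them first and reports the free space on the chosen location; `setup_singularity_dirs` is a hypothetical helper, not part of Skipper, and the scratch path is whatever suits your system:

```shell
# Hypothetical helper (not part of Skipper): create the temporary and cache
# directories under a scratch location, then export the variables shown above.
setup_singularity_dirs() {
    scratch="$1"
    mkdir -p "$scratch/singularity_tmp" "$scratch/singularity_cache" || return 1
    export SINGULARITY_TMPDIR="$scratch/singularity_tmp"
    export TMPDIR="$scratch/singularity_tmp"
    export SINGULARITY_CACHEDIR="$scratch/singularity_cache"
    # Report free space so an undersized scratch area is visible up front.
    df -Pk "$scratch" | awk 'NR == 2 { print "Free KiB on scratch: " $4 }'
}
```

Call it in the same shell session that runs Snakemake, for example `setup_singularity_dirs /path/to/your/scratch`, so the exported variables are inherited by the jobs.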