3 changes: 3 additions & 0 deletions .gitignore
@@ -12,3 +12,6 @@ venv3/
.vscode/
.coverage
*.DS_Store

# Secrets
sa_token.json
18 changes: 7 additions & 11 deletions README.md
@@ -4,7 +4,7 @@

This is a tool for transforming and processing
[VCF](https://samtools.github.io/hts-specs/VCFv4.3.pdf) files in a scalable
manner based on [Apache Beam](https://beam.apache.org/) using
manner based on [Apache Beam](https://beam.apache.org/) using
[Dataflow](https://cloud.google.com/dataflow/) on Google Cloud Platform.

It can be used to directly load VCF files to
@@ -45,6 +45,7 @@ running `gcloud components update` (more details [here](https://cloud.google.com

Use the following command to get the latest version of Variant Transforms.
```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
docker pull gcr.io/cloud-lifesciences/gcp-variant-transforms
```

@@ -56,11 +57,8 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
to process your data, for example: `us-west1`. For more information please refer to
[Setting Regions](docs/setting_region.md).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Life
Sciences API to orchestrate job from. This is not where the data will be processed,
but where some operation metadata will be stored. This can be the same or different from
the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Batch to orchestrate the job from. This is not where the data will be processed,
but where some operation metadata will be stored. It can be the same as or different from the region chosen for Cloud Dataflow. If this is not set, the default value you have configured for `batch/location` in your gcloud CLI is used (see [Running jobs in a particular region](docs/setting_region.md#running-jobs-in-a-particular-region) for how to set it). See the list of [Currently Available Locations](https://cloud.google.com/batch/docs/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
project has write access to. It's used to store temporary files and logs
from the pipeline.
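
For illustration only, the parameters above might be exported as shell variables before building the command in the next section. All values below are placeholders, not defaults; the variable names simply match those used in the `docker run` example later in this document.

```bash
# Placeholder values -- substitute your own project, region, Batch location, and bucket.
GOOGLE_CLOUD_PROJECT=my-project-id
GOOGLE_CLOUD_REGION=us-west1
GOOGLE_CLOUD_LOCATION=us-central1
TEMP_LOCATION=gs://my-bucket/temp
```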
@@ -89,6 +87,7 @@ COMMAND="vcf_to_bq \
--job_name vcf-to-bigquery \
--runner DataflowRunner"

# TODO: The Docker image must be rebuilt and hosted elsewhere
docker run -v ~/.config:/root/.config \
gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
@@ -114,10 +113,10 @@ In addition to using the docker image, you may run the pipeline directly from
source. First install git, python, pip, and virtualenv:

```bash
sudo apt-get install -y git python3-pip python3-venv python3.7-venv python-dev build-essential
sudo apt-get install -y git python3-pip python3-venv python3.12-venv python-dev build-essential
```

Note that python 3.8 is not yet supported, so ensure you are using Python 3.7.
We encourage you to use Python 3.12, as we have upgraded and fully adapted the project to work well with this version.
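
As a quick sanity check (assuming `python3.12` is available on your PATH after the install step above), you can confirm the interpreter version before creating the virtualenv:

```bash
python3.12 --version   # expect output like: Python 3.12.x
```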

Run virtualenv, clone the repo, and install pip packages:

@@ -180,8 +179,6 @@ details.

## Additional topics

* [Understanding the BigQuery Variants Table
Schema](https://cloud.google.com/life-sciences/docs/how-tos/bigquery-variants-schema)
* [Loading multiple files](docs/multiple_files.md)
* [Variant merging](docs/variant_merging.md)
* [Handling large inputs](docs/large_inputs.md)
@@ -195,4 +192,3 @@

* [Development Guide](docs/development_guide.md)
* [Release process](docs/release.md)

4 changes: 2 additions & 2 deletions docker/batch_runner.sh
@@ -79,10 +79,10 @@ function main {
# If missing, we will try to find the default values.
google_cloud_project="${google_cloud_project:-$(gcloud config get-value project)}"
region="${region:-$(gcloud config get-value compute/region)}"
location="${location:-$(gcloud config get-value batch/location)}"
vt_docker_image="${vt_docker_image:-us-east1-docker.pkg.dev/variant-transform-dxt/dxt-public-variant-transform/batch-runner:latest}"

sdk_container_image="${sdk_container_image:-}"
location="${location:-}"
temp_location="${temp_location:-}"
subnetwork="${subnetwork:-}"
use_public_ips="${use_public_ips:-}"
@@ -113,7 +113,7 @@ function main {
# Build Dataflow required args based on `docker run ...` inputs.
df_required_args="--project ${google_cloud_project} --region ${region} --temp_location ${temp_location}"

# Build up optional args for pipelines-tools and Dataflow, if they are provided.
# Build up optional args for the Batch job and Dataflow, if they are provided.
pt_optional_args=""
df_optional_args=""

9 changes: 7 additions & 2 deletions docs/advanced_flags.md
@@ -8,6 +8,7 @@ Specify a subnetwork by using the `--subnetwork` flag and provide the name of th

Example:
```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
@@ -29,6 +30,7 @@ on the subnet.

Example:
```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
Expand All @@ -39,12 +41,14 @@ docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
```

## Custom Dataflow Runner Image
<!-- TODO: The Docker image must be rebuilt and hosted elsewhere -->
By default Variant Transforms uses a custom docker image to run the pipeline in: `gcr.io/cloud-lifesciences/variant-transforms-custom-runner:latest`.
This image contains all the necessary python/linux dependencies needed to run variant transforms so that they are not downloaded from the internet when the pipeline starts.

You can override which container is used by passing a `--sdk_container_image` as in the following example:

```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
@@ -58,6 +62,7 @@ docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
By default the dataflow workers will use the [default compute service account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account). You can override which service account to use with the `--service_account` flag as in the following example:

```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
Expand All @@ -68,5 +73,5 @@ docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
```

**Other Service Account Notes:**
- The [Life Sciences Service Account is not changable](https://cloud.google.com/life-sciences/docs/troubleshooting#missing_service_account)
- The [Dataflow Admin Service Account is not changable](https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#service_account)
- [Control access for a job using a custom service account](https://cloud.google.com/batch/docs/create-run-job-custom-service-account)
- The [Dataflow Admin Service Account is not changeable](https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#service_account)
9 changes: 3 additions & 6 deletions docs/bigquery_to_vcf.md
@@ -21,11 +21,8 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
to process your data, for example: `us-west1`. For more information please refer to
[Setting Regions](docs/setting_region.md).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Life
Sciences API to orchestrate job from. This is not where the data will be processed,
but where some operation metadata will be stored. This can be the same or different from
the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Batch to orchestrate the job from. This is not where the data will be processed, but where some operation metadata will be stored. It can be the same as or different from
the region chosen for Cloud Dataflow. If this is not set, the default value you have configured for `batch/location` in your gcloud CLI is used (see [Running jobs in a particular region](./setting_region.md#running-jobs-in-a-particular-region) for how to set it). See the list of [Currently Available Locations](https://cloud.google.com/batch/docs/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
project has write access to. It's used to store temporary files and logs
from the pipeline.
@@ -50,7 +47,7 @@ COMMAND="bq_to_vcf \
--output_file ${OUTPUT_FILE} \
--job_name bq-to-vcf \
--runner DataflowRunner"

# TODO: The Docker image must be rebuilt and hosted elsewhere
docker run -v ~/.config:/root/.config \
gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
6 changes: 3 additions & 3 deletions docs/development_guide.md
@@ -41,10 +41,10 @@ git remote add upstream git@github.com:googlegenomics/gcp-variant-transforms.git

#### Setup virtualenv

Ensure you are using Python 3.7 version, since Apache Beam does not support 3.8.
We encourage you to use Python 3.12, as we have upgraded and fully adapted the project to work well with this version.

```bash
sudo apt-get install python3-pip python3-venv python3.7-venv python-dev build-essential
sudo apt-get install python3-pip python3-venv python3.12-venv python-dev build-essential
python3 -m venv venv3
. venv3/bin/activate
```
@@ -102,7 +102,7 @@ checked into the git repository and can be imported into
File | Settings | Editor | Inspections.

Code inspections can be run from the Analyze menu. To speed up the inspection
process, you can go to File | Project Structure | Modules and only set the
process, you can go to File | Project Structure | Modules and only set the
gcp_variant_transforms as the Sources. You may exclude other folders, or specify
the inspection scope to be only Module 'gcp-variant-transforms' when running
the inspection. The result window can be accessed from View > Tool Windows.
1 change: 1 addition & 0 deletions docs/sample_queries/README.md
@@ -10,6 +10,7 @@ users to send us their queries so we can share them here with all other
researchers. Please feel free to [submit an issue](https://github.com/googlegenomics/gcp-variant-transforms/issues)
or contact us via our public mailing list
[gcp-life-sciences-discuss@googlegroups.com](mailto:gcp-life-sciences-discuss@googlegroups.com).
<!-- TODO: Change this email -->

## Genome Aggregation Database (gnomAD)

34 changes: 14 additions & 20 deletions docs/setting_region.md
@@ -13,31 +13,26 @@ are located in the same region:
* Your pipeline's temporary location set by `--temp_location` flag.
* Your output BigQuery dataset set by `--output_table` flag.
* Your Dataflow pipeline set by `--region` flag.
* Your Life Sciences API location set by `--location` flag.
* Your Cloud Batch location set by `--location` flag.
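
As an illustration only (placeholder names; `us-central1` is an arbitrary choice), picking one region and reusing it for all four settings keeps the data and metadata co-located:

```bash
# Illustrative placeholders: pick one region and reuse it everywhere.
REGION=us-central1
TEMP_LOCATION=gs://my-bucket/temp             # bucket created in ${REGION}
OUTPUT_TABLE=my-project:my_dataset.my_table   # BigQuery dataset created in ${REGION}
LOCATION=${REGION}                            # Cloud Batch location
```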

## Running jobs in a particular region
The Dataflow API [requires](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#configuring-pipelineoptions-for-execution-on-the-cloud-dataflow-service)
setting a [GCP
region](https://cloud.google.com/compute/docs/regions-zones/#available) via
`--region` flag to run.

When running from Docker, the Cloud Life Sciences API is used to spin up a
worker that launches and monitors the Dataflow job. Cloud Life Sciences API
is a [regionalized service](https://cloud.google.com/life-sciences/docs/concepts/locations)
that runs in multiple regions. This is set with the `--location` flag. The
Life Sciences API location is where metadata about the pipeline's progress
will be stored, and can be different from the region where the data is
processed. Note that Cloud Life Sciences API is not available in all regions,
and if this flag is left out, the metadata will be stored in us-central1. See
the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).

In addition to this requirment you might also
choose to run Variant Transforms in a specific region following your project’s
security and compliance requirements. For example, in order
to restrict your processing job to europe-west4 (Netherlands), set the region
and location as follows:
When running from Docker, Cloud Batch is used to provision a worker VM that runs a user-defined task, such as launching a Dataflow job. Cloud Batch is a regionalized service, and the region it runs in is set with the `--location` flag. The location determines where the Batch job is executed and where metadata about the job is stored. Note that Cloud Batch is not available in all regions. If `--location` is not set explicitly, job submission will fail unless you have already configured a default location in your gcloud CLI, in which case that default is used. You can set the default location with the following command:

```bash
gcloud config set batch/location YOUR_DEFAULT_LOCATION
```

See the list of [Currently Available Locations](https://cloud.google.com/batch/docs/locations).
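
As a quick check (assuming the gcloud CLI is installed and initialized), you can print the currently configured default:

```bash
gcloud config get-value batch/location
```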

In addition to this requirement, you might also choose to run Variant Transforms in a specific region following your project’s security and compliance requirements. For example, in order to restrict your processing job to europe-west4 (Netherlands), set the region and location as follows:

```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
Expand All @@ -49,7 +44,7 @@ docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
```

Note that values of `--project`, `--region`, and `--temp_location` flags will be automatically
passed as `COMMAND` inputs in [`piplines_runner.sh`](docker/pipelines_runner.sh).
passed as `COMMAND` inputs in [`batch_runner.sh`](docker/batch_runner.sh).

Instead of setting `--region` flag for each run, you can set your default region
using the following command. In that case, you will not need to set the `--region`
@@ -83,12 +78,11 @@ when you are [creating it](https://cloud.google.com/storage/docs/creating-bucket
When you create a bucket, you [permanently
define](https://cloud.google.com/storage/docs/moving-buckets#storage-create-bucket-console)
its name, its geographic location, and the project it is part of. For an existing bucket, you can check
[its information](https://cloud.google.com/storage/docs/getting-bucket-information) to find out
[its information](https://cloud.google.com/storage/docs/getting-bucket-information) to find out
about its geographic location.

## Setting BigQuery dataset region
## Setting BigQuery dataset region

You can choose the region for the BigQuery dataset at dataset creation time.

![BigQuery dataset region](images/bigquery_dataset_region.png)
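
If you prefer the command line over the console, the dataset region can equally be set at creation time with the `bq` tool; the dataset name below is a placeholder, and the location should match the rest of your pipeline:

```bash
# Placeholder dataset name; choose the same region as your bucket and Dataflow region.
bq --location=us-central1 mk --dataset my_dataset
```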

11 changes: 7 additions & 4 deletions docs/variant_annotation.md
@@ -78,9 +78,9 @@ minimum number of flags to enable this feature is `--run_annotation_pipeline`
and `--annotation_output_dir [GCS_PATH]` where `[GCS_PATH]` is a path in a GCS
bucket that your project owns.

Variant annotation will start a separate Cloud Life Sciences pipeline to run
the vep_runner. You can provide `--location` to specify the location to use
for Cloud Life Sciences API. If not provided, it will default to `us-central1`.
Variant annotation will start multiple separate Batch jobs to run
the vep_runner; the number of Batch jobs depends on both the size of your input files and the value you set for [`--number_of_runnables_per_job`](#details). You can provide `--location` to specify the location to use
for Cloud Batch. If not provided, it will default to `us-central1`.
The compute region will come from the `--region` flag passed from docker.
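
As a sketch only (the GCS path is a placeholder and `...` stands for the remaining `vcf_to_bq` flags shown elsewhere in these docs), the minimum annotation flags described above would be added to the command like this:

```bash
# Sketch: placeholder output path; other vcf_to_bq flags omitted.
COMMAND="vcf_to_bq ... \
  --run_annotation_pipeline \
  --annotation_output_dir gs://my-bucket/annotation_output"
```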


@@ -106,6 +106,8 @@ followed by `_vep_output.vcf`. Note that if this directory already exists, then
Variant Transforms fails. This is to prevent unintentional overwriting of old
annotated VCFs.

* `--number_of_runnables_per_job` The maximum number of runnables (e.g. VEP jobs) to create per job (default: 95). The Batch system only supports a maximum of 100 runnables per job, so this flag cannot be set higher than 95; this ensures that 5 runnables are always reserved for system cycles. For larger input files, it is recommended to set a smaller value for this flag to achieve faster processing speed.

* [`--shard_variants`](https://github.com/googlegenomics/gcp-variant-transforms/blob/master/gcp_variant_transforms/options/variant_transform_options.py#L290)
by default, the input files are sharded into smaller temporary VCF files before
running VEP annotation. If the input files are small, i.e., each VCF file
Expand All @@ -118,12 +120,14 @@ true. The default value should work for most cases. You may change this flag to
a smaller value if you have a dataset with a lot of samples. Notice that
pipeline may take longer to finish for smaller value of this flag.

<!-- TODO: The Docker image must be rebuilt and hosted elsewhere -->
* [`--vep_image_uri`](https://github.com/googlegenomics/gcp-variant-transforms/blob/c4659bba2cf577d64f15db5cd9f477d9ea2b51b0/gcp_variant_transforms/options/variant_transform_options.py#L196)
the docker image for VEP created using the
[Dockerfile in variant-annotation](https://github.com/googlegenomics/variant-annotation/tree/master/batch/vep)
GitHub repo. By default `gcr.io/cloud-lifesciences/vep:104` is used which is
a public image that Google maintains (VEP version 104).

<!-- TODO: The Docker image must be rebuilt and hosted elsewhere -->
* [`--vep_cache_path`](https://github.com/googlegenomics/gcp-variant-transforms/blob/c4659bba2cf577d64f15db5cd9f477d9ea2b51b0/gcp_variant_transforms/options/variant_transform_options.py#L200)
the GCS location that has the compressed version of VEP cache. This file can be
created using
@@ -227,4 +231,3 @@ public databases are also made available in BigQuery including
([table](https://bigquery.cloud.google.com/table/isb-cgc:genome_reference.Ensembl2Reactome?tab=details))
and [WikiPathways](https://www.wikipathways.org)
([table](https://bigquery.cloud.google.com/table/isb-cgc:QotM.WikiPathways_20170425_Annotated?tab=details)).

6 changes: 2 additions & 4 deletions docs/vcf_files_preprocessor.md
@@ -46,11 +46,9 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
to process your data, for example: `us-west1`. For more information please refer to
[Setting Regions](docs/setting_region.md).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Life
Sciences API to orchestrate job from. This is not where the data will be processed,
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Batch to orchestrate the job from. This is not where the data will be processed,
but where some operation metadata will be stored. This can be the same or different from
the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
the region chosen for Cloud Dataflow. If this is not set, the default value you have configured for `batch/location` in your gcloud CLI is used (see [Running jobs in a particular region](./setting_region.md#running-jobs-in-a-particular-region) for how to set it). See the list of [Currently Available Locations](https://cloud.google.com/batch/docs/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
project has write access to. It's used to store temporary files and logs
from the pipeline.