3 changes: 3 additions & 0 deletions .gitignore
@@ -12,3 +12,6 @@ venv3/
.vscode/
.coverage
*.DS_Store

# Secrets
sa_token.json
18 changes: 7 additions & 11 deletions README.md
@@ -4,7 +4,7 @@

This is a tool for transforming and processing
[VCF](https://samtools.github.io/hts-specs/VCFv4.3.pdf) files in a scalable
manner based on [Apache Beam](https://beam.apache.org/) using
manner based on [Apache Beam](https://beam.apache.org/) using
[Dataflow](https://cloud.google.com/dataflow/) on Google Cloud Platform.

It can be used to directly load VCF files to
@@ -45,6 +45,7 @@ running `gcloud components update` (more details [here](https://cloud.google.com

Use the following command to get the latest version of Variant Transforms.
```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
docker pull gcr.io/cloud-lifesciences/gcp-variant-transforms
```

@@ -56,11 +57,8 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
to process your data, for example: `us-west1`. For more information please refer to
[Setting Regions](docs/setting_region.md).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Life
Sciences API to orchestrate job from. This is not where the data will be processed,
but where some operation metadata will be stored. This can be the same or different from
the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Batch to orchestrate the job from. This is not where the data will be processed,
but where some operation metadata will be stored. It can be the same as or different from the region chosen for Cloud Dataflow. If this is not set, the default value you have configured for `batch/location` in your gcloud CLI is used (see [Running jobs in a particular region](docs/setting_region.md#running-jobs-in-a-particular-region) for how to set it). See the list of [Currently Available Locations](https://cloud.google.com/batch/docs/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
project has write access to. It's used to store temporary files and logs
from the pipeline.
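
For illustration only, the parameters above might be exported as shell variables before building the command in the next section. All values below are placeholders, not defaults; the variable names simply match those used in the `docker run` example later in this document.

```bash
# Placeholder values -- substitute your own project, region, Batch location, and bucket.
GOOGLE_CLOUD_PROJECT=my-project-id
GOOGLE_CLOUD_REGION=us-west1
GOOGLE_CLOUD_LOCATION=us-central1
TEMP_LOCATION=gs://my-bucket/temp
```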
@@ -89,6 +87,7 @@ COMMAND="vcf_to_bq \
--job_name vcf-to-bigquery \
--runner DataflowRunner"

# TODO: The Docker image must be rebuilt and hosted elsewhere
docker run -v ~/.config:/root/.config \
gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
@@ -114,10 +113,10 @@ In addition to using the docker image, you may run the pipeline directly from
source. First install git, python, pip, and virtualenv:

```bash
sudo apt-get install -y git python3-pip python3-venv python3.7-venv python-dev build-essential
sudo apt-get install -y git python3-pip python3-venv python3.12-venv python-dev build-essential
```

Note that python 3.8 is not yet supported, so ensure you are using Python 3.7.
We encourage you to use Python 3.12, as we have upgraded and fully adapted the project to work well with this version.
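
As a quick sanity check (assuming `python3.12` is available on your PATH after the install step above), you can confirm the interpreter version before creating the virtualenv:

```bash
python3.12 --version   # expect output like: Python 3.12.x
```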

Run virtualenv, clone the repo, and install pip packages:

@@ -180,8 +179,6 @@ details.

## Additional topics

* [Understanding the BigQuery Variants Table
Schema](https://cloud.google.com/life-sciences/docs/how-tos/bigquery-variants-schema)
* [Loading multiple files](docs/multiple_files.md)
* [Variant merging](docs/variant_merging.md)
* [Handling large inputs](docs/large_inputs.md)
@@ -195,4 +192,3 @@

* [Development Guide](docs/development_guide.md)
* [Release process](docs/release.md)

4 changes: 2 additions & 2 deletions docker/batch_runner.sh
@@ -79,10 +79,10 @@ function main {
# If missing, we will try to find the default values.
google_cloud_project="${google_cloud_project:-$(gcloud config get-value project)}"
region="${region:-$(gcloud config get-value compute/region)}"
location="${location:-$(gcloud config get-value batch/location)}"
vt_docker_image="${vt_docker_image:-us-east1-docker.pkg.dev/variant-transform-dxt/dxt-public-variant-transform/batch-runner:latest}"

sdk_container_image="${sdk_container_image:-}"
location="${location:-}"
temp_location="${temp_location:-}"
subnetwork="${subnetwork:-}"
use_public_ips="${use_public_ips:-}"
@@ -113,7 +113,7 @@ function main {
# Build Dataflow required args based on `docker run ...` inputs.
df_required_args="--project ${google_cloud_project} --region ${region} --temp_location ${temp_location}"

# Build up optional args for pipelines-tools and Dataflow, if they are provided.
# Build up optional args for the Batch job and Dataflow, if they are provided.
pt_optional_args=""
df_optional_args=""

9 changes: 7 additions & 2 deletions docs/advanced_flags.md
@@ -8,6 +8,7 @@ Specify a subnetwork by using the `--subnetwork` flag and provide the name of th

Example:
```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
@@ -29,6 +30,7 @@ on the subnet.

Example:
```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
Expand All @@ -39,12 +41,14 @@ docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
```

## Custom Dataflow Runner Image
<!-- TODO: The Docker image must be rebuilt and hosted elsewhere -->
By default Variant Transforms uses a custom docker image to run the pipeline in: `gcr.io/cloud-lifesciences/variant-transforms-custom-runner:latest`.
This image contains all the necessary python/linux dependencies needed to run variant transforms so that they are not downloaded from the internet when the pipeline starts.

You can override which container is used by passing a `--sdk_container_image` as in the following example:

```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
@@ -58,6 +62,7 @@ docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
By default the dataflow workers will use the [default compute service account](https://cloud.google.com/compute/docs/access/service-accounts#default_service_account). You can override which service account to use with the `--service_account` flag as in the following example:

```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
Expand All @@ -68,5 +73,5 @@ docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
```

**Other Service Account Notes:**
- The [Life Sciences Service Account is not changable](https://cloud.google.com/life-sciences/docs/troubleshooting#missing_service_account)
- The [Dataflow Admin Service Account is not changable](https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#service_account)
- [Control access for a job using a custom service account](https://cloud.google.com/batch/docs/create-run-job-custom-service-account)
- The [Dataflow Admin Service Account is not changeable](https://cloud.google.com/dataflow/docs/concepts/security-and-permissions#service_account)
9 changes: 3 additions & 6 deletions docs/bigquery_to_vcf.md
@@ -21,11 +21,8 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
to process your data, for example: `us-west1`. For more information please refer to
[Setting Regions](docs/setting_region.md).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Life
Sciences API to orchestrate job from. This is not where the data will be processed,
but where some operation metadata will be stored. This can be the same or different from
the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Batch to orchestrate the job from. This is not where the data will be processed, but where some operation metadata will be stored. It can be the same as or different from
the region chosen for Cloud Dataflow. If this is not set, the default value you have configured for `batch/location` in your gcloud CLI is used (see [Running jobs in a particular region](./setting_region.md#running-jobs-in-a-particular-region) for how to set it). See the list of [Currently Available Locations](https://cloud.google.com/batch/docs/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
project has write access to. It's used to store temporary files and logs
from the pipeline.
@@ -50,7 +47,7 @@ COMMAND="bq_to_vcf \
--output_file ${OUTPUT_FILE} \
--job_name bq-to-vcf \
--runner DataflowRunner"

# TODO: The Docker image must be rebuilt and hosted elsewhere
docker run -v ~/.config:/root/.config \
gcr.io/cloud-lifesciences/gcp-variant-transforms \
--project "${GOOGLE_CLOUD_PROJECT}" \
6 changes: 3 additions & 3 deletions docs/development_guide.md
@@ -41,10 +41,10 @@ git remote add upstream git@github.com:googlegenomics/gcp-variant-transforms.git

#### Setup virtualenv

Ensure you are using Python 3.7 version, since Apache Beam does not support 3.8.
We encourage you to use Python 3.12, as we have upgraded and fully adapted the project to work well with this version.

```bash
sudo apt-get install python3-pip python3-venv python3.7-venv python-dev build-essential
sudo apt-get install python3-pip python3-venv python3.12-venv python-dev build-essential
python3 -m venv venv3
. venv3/bin/activate
```
@@ -102,7 +102,7 @@ checked into the git repository and can be imported into
File | Settings | Editor | Inspections.

Code inspections can be run from the Analyze menu. To speed up the inspection
process, you can go to File | Project Structure | Modules and only set the
process, you can go to File | Project Structure | Modules and only set the
gcp_variant_transforms as the Sources. You may exclude other folders, or specify
the inspection scope to be only Module 'gcp-variant-transforms' when running
the inspection. The result window can be accessed from View > Tool Windows.
1 change: 1 addition & 0 deletions docs/sample_queries/README.md
@@ -10,6 +10,7 @@ users to send us their queries so we can share them here with all other
researchers. Please feel free to [submit an issue](https://github.com/googlegenomics/gcp-variant-transforms/issues)
or contact us via our public mailing list
[gcp-life-sciences-discuss@googlegroups.com](mailto:gcp-life-sciences-discuss@googlegroups.com).
<!-- TODO: Change this email -->

## Genome Aggregation Database (gnomAD)

34 changes: 14 additions & 20 deletions docs/setting_region.md
@@ -13,31 +13,26 @@ are located in the same region:
* Your pipeline's temporary location set by `--temp_location` flag.
* Your output BigQuery dataset set by `--output_table` flag.
* Your Dataflow pipeline set by `--region` flag.
* Your Life Sciences API location set by `--location` flag.
* Your Cloud Batch location set by `--location` flag.
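
As an illustration only (placeholder names; `us-central1` is an arbitrary choice), picking one region and reusing it for all four settings keeps the data and metadata co-located:

```bash
# Illustrative placeholders: pick one region and reuse it everywhere.
REGION=us-central1
TEMP_LOCATION=gs://my-bucket/temp             # bucket created in ${REGION}
OUTPUT_TABLE=my-project:my_dataset.my_table   # BigQuery dataset created in ${REGION}
LOCATION=${REGION}                            # Cloud Batch location
```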

## Running jobs in a particular region
The Dataflow API [requires](https://cloud.google.com/dataflow/docs/guides/specifying-exec-params#configuring-pipelineoptions-for-execution-on-the-cloud-dataflow-service)
setting a [GCP
region](https://cloud.google.com/compute/docs/regions-zones/#available) via
`--region` flag to run.

When running from Docker, the Cloud Life Sciences API is used to spin up a
worker that launches and monitors the Dataflow job. Cloud Life Sciences API
is a [regionalized service](https://cloud.google.com/life-sciences/docs/concepts/locations)
that runs in multiple regions. This is set with the `--location` flag. The
Life Sciences API location is where metadata about the pipeline's progress
will be stored, and can be different from the region where the data is
processed. Note that Cloud Life Sciences API is not available in all regions,
and if this flag is left out, the metadata will be stored in us-central1. See
the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).

In addition to this requirment you might also
choose to run Variant Transforms in a specific region following your project’s
security and compliance requirements. For example, in order
to restrict your processing job to europe-west4 (Netherlands), set the region
and location as follows:
When running from Docker, Cloud Batch is used to provision a worker VM that runs a user-defined task, such as launching a Dataflow job. Cloud Batch is a regionalized service, and the region it runs in is set with the `--location` flag. The location determines where the Batch job is executed and where metadata about the job is stored. Note that Cloud Batch is not available in all regions. If `--location` is not set explicitly, job submission will fail unless you have already configured a default location in your gcloud CLI, in which case that default is used. You can set the default location with the following command:

```bash
gcloud config set batch/location YOUR_DEFAULT_LOCATION
```

See the list of [Currently Available Locations](https://cloud.google.com/batch/docs/locations).
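
As a quick check (assuming the gcloud CLI is installed and initialized), you can print the currently configured default:

```bash
gcloud config get-value batch/location
```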

In addition to this requirement, you might also choose to run Variant Transforms in a specific region following your project’s security and compliance requirements. For example, in order to restrict your processing job to europe-west4 (Netherlands), set the region and location as follows:

```bash
# TODO: The Docker image must be rebuilt and hosted elsewhere
COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...

docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
Expand All @@ -49,7 +44,7 @@ docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
```

Note that values of `--project`, `--region`, and `--temp_location` flags will be automatically
passed as `COMMAND` inputs in [`piplines_runner.sh`](docker/pipelines_runner.sh).
passed as `COMMAND` inputs in [`batch_runner.sh`](docker/batch_runner.sh).

Instead of setting `--region` flag for each run, you can set your default region
using the following command. In that case, you will not need to set the `--region`
@@ -83,12 +78,11 @@ when you are [creating it](https://cloud.google.com/storage/docs/creating-bucket
When you create a bucket, you [permanently
define](https://cloud.google.com/storage/docs/moving-buckets#storage-create-bucket-console)
its name, its geographic location, and the project it is part of. For an existing bucket, you can check
[its information](https://cloud.google.com/storage/docs/getting-bucket-information) to find out
[its information](https://cloud.google.com/storage/docs/getting-bucket-information) to find out
about its geographic location.

## Setting BigQuery dataset region
## Setting BigQuery dataset region

You can choose the region for the BigQuery dataset at dataset creation time.

![BigQuery dataset region](images/bigquery_dataset_region.png)
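
If you prefer the command line over the console, the dataset region can equally be set at creation time with the `bq` tool; the dataset name below is a placeholder, and the location should match the rest of your pipeline:

```bash
# Placeholder dataset name; choose the same region as your bucket and Dataflow region.
bq --location=us-central1 mk --dataset my_dataset
```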

11 changes: 7 additions & 4 deletions docs/variant_annotation.md
@@ -78,9 +78,9 @@ minimum number of flags to enable this feature is `--run_annotation_pipeline`
and `--annotation_output_dir [GCS_PATH]` where `[GCS_PATH]` is a path in a GCS
bucket that your project owns.

Variant annotation will start a separate Cloud Life Sciences pipeline to run
the vep_runner. You can provide `--location` to specify the location to use
for Cloud Life Sciences API. If not provided, it will default to `us-central1`.
Variant annotation will start multiple separate Batch jobs to run
the vep_runner; the number of Batch jobs depends on both the size of your input files and the value you set for [`--number_of_runnables_per_job`](#details). You can provide `--location` to specify the location to use
for Cloud Batch. If not provided, it will default to `us-central1`.
The compute region will come from the `--region` flag passed from docker.
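
As a sketch only (the GCS path is a placeholder and `...` stands for the remaining `vcf_to_bq` flags shown elsewhere in these docs), the minimum annotation flags described above would be added to the command like this:

```bash
# Sketch: placeholder output path; other vcf_to_bq flags omitted.
COMMAND="vcf_to_bq ... \
  --run_annotation_pipeline \
  --annotation_output_dir gs://my-bucket/annotation_output"
```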


@@ -106,6 +106,8 @@ followed by `_vep_output.vcf`. Note that if this directory already exists, then
Variant Transforms fails. This is to prevent unintentional overwriting of old
annotated VCFs.

* `--number_of_runnables_per_job` The maximum number of runnables (e.g. VEP jobs) to create per job (default: 95). The Batch system only supports a maximum of 100 runnables per job, so this flag cannot be set higher than 95; this ensures that 5 runnables are always reserved for system cycles. For larger input files, it is recommended to set a smaller value for this flag to achieve faster processing speed.

* [`--shard_variants`](https://github.com/googlegenomics/gcp-variant-transforms/blob/master/gcp_variant_transforms/options/variant_transform_options.py#L290)
by default, the input files are sharded into smaller temporary VCF files before
running VEP annotation. If the input files are small, i.e., each VCF file
Expand All @@ -118,12 +120,14 @@ true. The default value should work for most cases. You may change this flag to
a smaller value if you have a dataset with a lot of samples. Notice that
pipeline may take longer to finish for smaller value of this flag.

<!-- TODO: The Docker image must be rebuilt and hosted elsewhere -->
* [`--vep_image_uri`](https://github.com/googlegenomics/gcp-variant-transforms/blob/c4659bba2cf577d64f15db5cd9f477d9ea2b51b0/gcp_variant_transforms/options/variant_transform_options.py#L196)
the docker image for VEP created using the
[Dockerfile in variant-annotation](https://github.com/googlegenomics/variant-annotation/tree/master/batch/vep)
GitHub repo. By default `gcr.io/cloud-lifesciences/vep:104` is used which is
a public image that Google maintains (VEP version 104).

<!-- TODO: The Docker image must be rebuilt and hosted elsewhere -->
* [`--vep_cache_path`](https://github.com/googlegenomics/gcp-variant-transforms/blob/c4659bba2cf577d64f15db5cd9f477d9ea2b51b0/gcp_variant_transforms/options/variant_transform_options.py#L200)
the GCS location that has the compressed version of VEP cache. This file can be
created using
@@ -227,4 +231,3 @@ public databases are also made available in BigQuery including
([table](https://bigquery.cloud.google.com/table/isb-cgc:genome_reference.Ensembl2Reactome?tab=details))
and [WikiPathways](https://www.wikipathways.org)
([table](https://bigquery.cloud.google.com/table/isb-cgc:QotM.WikiPathways_20170425_Annotated?tab=details)).

6 changes: 2 additions & 4 deletions docs/vcf_files_preprocessor.md
@@ -46,11 +46,9 @@ Run the script below and replace the following parameters:
* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
to process your data, for example: `us-west1`. For more information please refer to
[Setting Regions](docs/setting_region.md).
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Life
Sciences API to orchestrate job from. This is not where the data will be processed,
* `GOOGLE_CLOUD_LOCATION`: You may choose a geographic location for Cloud Batch to orchestrate the job from. This is not where the data will be processed,
but where some operation metadata will be stored. This can be the same or different from
the region chosen for Cloud Dataflow. If this is not set, the metadata will be stored in
us-central1. See the list of [Currently Available Locations](https://cloud.google.com/life-sciences/docs/concepts/locations).
the region chosen for Cloud Dataflow. If this is not set, the default value you have configured for `batch/location` in your gcloud CLI is used (see [Running jobs in a particular region](./setting_region.md#running-jobs-in-a-particular-region) for how to set it). See the list of [Currently Available Locations](https://cloud.google.com/batch/docs/locations).
* `TEMP_LOCATION`: This can be any folder in Google Cloud Storage that your
project has write access to. It's used to store temporary files and logs
from the pipeline.