Commit 8049c19

Enforce region (#527)
* Enforce providing --region flag instead of --zone. Also adding --region flag to all of our integration tests. Remove zone from all integration json files.
* First round of comments
* Second round of review
1 parent d993334 commit 8049c19

40 files changed: +111 -108 lines changed

README.md

Lines changed: 10 additions & 12 deletions

@@ -53,6 +53,9 @@ Run the script below and replace the following parameters:
 
 * `GOOGLE_CLOUD_PROJECT`: This is your project ID that contains the BigQuery
   dataset.
+* `GOOGLE_CLOUD_REGION`: You must choose a geographic region for Cloud Dataflow
+  to process your data, for example: `us-west1`. For more info about regions
+  please refer to [Setting Regions](docs/setting_region.md).
 * `INPUT_PATTERN`: A location in Google Cloud Storage where the
   VCF file are stored. You may specify a single file or provide a pattern to
   load multiple files at once. Please refer to the

@@ -69,6 +72,7 @@ Run the script below and replace the following parameters:
 #!/bin/bash
 # Parameters to replace:
 GOOGLE_CLOUD_PROJECT=GOOGLE_CLOUD_PROJECT
+GOOGLE_CLOUD_REGION=GOOGLE_CLOUD_REGION
 INPUT_PATTERN=gs://BUCKET/*.vcf
 OUTPUT_TABLE=GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE
 TEMP_LOCATION=gs://BUCKET/temp

@@ -83,15 +87,15 @@ COMMAND="vcf_to_bq \
 docker run -v ~/.config:/root/.config \
   gcr.io/cloud-lifesciences/gcp-variant-transforms \
   --project "${GOOGLE_CLOUD_PROJECT}" \
-  --zones us-west1-b \
+  --region "${GOOGLE_CLOUD_REGION}" \
   "${COMMAND}"
 ```
-The flags `--project` and `--zones` are optional, given that these properties
-are set in your local configuration. You may set the default project and zones
-using the following commands:
+Both `--project` and `--region` flags are needed unless their default values
+are set in your local `gcloud` configuration. You may set the default project
+and region using the following commands:
 ```bash
 gcloud config set project GOOGLE_CLOUD_PROJECT
-gcloud config set compute/zone ZONE
+gcloud config set compute/region REGION
 ```

@@ -143,13 +147,13 @@ python -m gcp_variant_transforms.vcf_to_bq \
   --input_pattern gs://BUCKET/*.vcf \
   --output_table GOOGLE_CLOUD_PROJECT:BIGQUERY_DATASET.BIGQUERY_TABLE \
   --project "${GOOGLE_CLOUD_PROJECT}" \
+  --region "${GOOGLE_CLOUD_REGION}" \
   --temp_location gs://BUCKET/temp \
   --job_name vcf-to-bigquery \
   --setup_file ./setup.py \
   --runner DataflowRunner
 ```
 
-
 ## Running VCF files preprocessor
 
 The VCF files preprocessor is used for validating the datasets such that the

@@ -165,12 +169,6 @@ The BigQuery to VCF pipeline is used to export variants in BigQuery to one VCF f
 Please refer to [BigQuery to VCF pipeline](docs/bigquery_to_vcf.md) for more
 details.
 
-## Running jobs in a particular region/zone
-
-You may need to constrain Cloud Dataflow job processing to a specific geographic
-region in support of your project’s security and compliance needs. See
-[Setting zone/region doc](docs/setting_zone_region.md).
-
 
 ## Additional topics
 
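Since both flags can now fall back to `gcloud` defaults, it may help to confirm what is configured locally. These are the same `gcloud config get-value` calls the runner script uses to resolve its defaults (see `docker/pipelines_runner.sh` below):

```bash
# Print the locally configured defaults; empty output means the
# corresponding flag must be passed explicitly.
gcloud config get-value project
gcloud config get-value compute/region
```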

docker/pipelines_runner.sh

Lines changed: 9 additions & 9 deletions

@@ -22,7 +22,7 @@ set -euo pipefail
 #################################################
 function parse_args {
   # getopt command is only for checking arguments.
-  getopt -o '' -l project:,temp_location:,docker_image:,zones: -- "$@"
+  getopt -o '' -l project:,temp_location:,docker_image:,region: -- "$@"
   while [[ "$#" -gt 0 ]]; do
     case "$1" in
       --project)

@@ -37,8 +37,8 @@ function parse_args {
         vt_docker_image="$2"
         ;;
 
-      --zones)
-        zones="$2"
+      --region)
+        region="$2"
         ;;
 
       *)

@@ -58,7 +58,7 @@ function main {
 
   google_cloud_project="${google_cloud_project:-$(gcloud config get-value project)}"
   vt_docker_image="${vt_docker_image:-gcr.io/cloud-lifesciences/gcp-variant-transforms:${COMMIT_SHA}}"
-  zones="${zones:-$(gcloud config get-value compute/zone)}"
+  region="${region:-$(gcloud config get-value compute/region)}"
   temp_location="${temp_location:-''}"
 
   if [[ -z "${google_cloud_project}" ]]; then

@@ -67,9 +67,9 @@ function main {
     exit 1
   fi
 
-  if [[ -z "${zones}" ]]; then
-    echo "Please set the zones using flags --zones."
-    echo "Or set default zone in your local client configuration using gcloud config set compute/zone ZONE."
+  if [[ -z "${region}" ]]; then
+    echo "Please set the region using flags --region."
+    echo "Or set default region in your local client configuration using gcloud config set compute/region REGION."
     exit 1
   fi
 

@@ -79,11 +79,11 @@ function main {
   fi
 
   pipelines --project "${google_cloud_project}" run \
-    --command "/opt/gcp_variant_transforms/bin/${command} --project ${google_cloud_project}" \
+    --command "/opt/gcp_variant_transforms/bin/${command} --project ${google_cloud_project} --region ${region}" \
     --output "${temp_location}"/runner_logs_$(date +%Y%m%d_%H%M%S).log \
     --wait \
    --scopes "https://www.googleapis.com/auth/cloud-platform" \
-    --zones "${zones}" \
+    --regions "${region}" \
     --image "${vt_docker_image}" \
     --pvm-attempts 0 \
     --attempts 1 \
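For reference, a hedged sketch of how the updated runner might be invoked after this change. The flag names come from the `getopt` line above; all values are placeholders, and how the pipeline command itself (the `${command}` variable used in `main`) is supplied is not shown in this diff:

```bash
# Hypothetical invocation; flag names are from the getopt line above,
# all values are placeholders. Flags left unset fall back to the local
# gcloud configuration (see the defaults in main).
./docker/pipelines_runner.sh \
  --project my-project \
  --region us-west1 \
  --temp_location gs://my-bucket/temp \
  --docker_image gcr.io/cloud-lifesciences/gcp-variant-transforms
```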

docs/bigquery_to_vcf.md

Lines changed: 1 addition & 1 deletion

@@ -44,7 +44,7 @@ COMMAND="bq_to_vcf \
 docker run -v ~/.config:/root/.config \
   gcr.io/cloud-lifesciences/gcp-variant-transforms \
   --project "${GOOGLE_CLOUD_PROJECT}" \
-  --zones us-west1-b \
+  --region us-west1 \
   "${COMMAND}"
 ```
 

docs/setting_region.md

Lines changed: 70 additions & 0 deletions

@@ -0,0 +1,70 @@
+# Setting GCP region
+
+## What to consider
+
+Google Cloud Platform services are available in [many
+locations](https://cloud.google.com/about/locations/) across the globe.
+You can minimize network latency and network transport costs by running your
+Dataflow job in the same region where its input bucket, output dataset, and
+temporary directory are located. More specifically, in order to run Variant
+Transforms most efficiently you should make sure all the following resources
+are located in the same region:
+* Your source bucket, set by the `--input_pattern` flag.
+* Your pipeline's temporary location, set by the `--temp_location` flag.
+* Your output BigQuery dataset, set by the `--output_table` flag.
+* Your Dataflow pipeline, set by the `--region` flag.
+
+## Running jobs in a particular region
+
+The Dataflow API [requires](https://beam.apache.org/blog/2019/08/22/beam-2.15.0.html)
+setting a [GCP
+region](https://cloud.google.com/compute/docs/regions-zones/#available) via
+the `--region` flag to run. In addition to this requirement, you might also
+choose to run Variant Transforms in a specific region to meet your project's
+security and compliance requirements. For example, in order
+to restrict your processing job to Europe, update the region as follows:
+
+```bash
+COMMAND="/opt/gcp_variant_transforms/bin/vcf_to_bq ...
+
+docker run gcr.io/cloud-lifesciences/gcp-variant-transforms \
+  --project "${GOOGLE_CLOUD_PROJECT}" \
+  --region "${GOOGLE_CLOUD_REGION}" \
+  "${COMMAND}"
+```
+
+Note that the values of the `--project` and `--region` flags will be automatically
+passed as `COMMAND` args in [`pipelines_runner.sh`](docker/pipelines_runner.sh).
+Alternatively, you can set your default region using the following command:
+
+```bash
+gcloud config set compute/region "europe-west1"
+```
+
+In this case you do not need to set the `--region` flag any more. For more
+information please refer to this [Cloud SDK page](https://cloud.google.com/sdk/gcloud/reference/config/set).
+
+If you are running Variant Transforms from GitHub, you just need to specify
+the region for the Dataflow API as below:
+
+```bash
+python -m gcp_variant_transforms.vcf_to_bq ... \
+  --project "${GOOGLE_CLOUD_PROJECT}" \
+  --region "${GOOGLE_CLOUD_REGION}" \
+```
+
+## Setting Google Cloud Storage bucket region
+
+You can choose your [GCS bucket's region](https://cloud.google.com/storage/docs/locations)
+when you are [creating it](https://cloud.google.com/storage/docs/creating-buckets#storage-create-bucket-console).
+When you create a bucket, you [permanently
+define](https://cloud.google.com/storage/docs/moving-buckets#storage-create-bucket-console)
+its name, its geographic location, and the project it is part of. For an
+existing bucket, you can check
+[its information](https://cloud.google.com/storage/docs/getting-bucket-information)
+to find out about its geographic location.
+
+## Setting BigQuery dataset region
+
+You can choose the region for the BigQuery dataset at dataset creation time.
+
+![BigQuery dataset region](images/bigquery_dataset_region.png)
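As a hedged companion to the new doc's GCS section, the bucket region can also be pinned at creation time with `gsutil` (the bucket name is a placeholder, not from this commit):

```bash
# Create a bucket in a specific region; as the doc notes, the location
# is fixed permanently at creation time. Bucket name is a placeholder.
gsutil mb -l europe-west1 gs://my-variant-bucket
```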

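Likewise, a hedged sketch of creating the BigQuery dataset in a chosen region with the `bq` CLI, mirroring the console screenshot referenced above (project and dataset names are placeholders):

```bash
# Create a dataset pinned to a region; project and dataset names are
# placeholders. A dataset's location cannot be changed after creation.
bq --location=europe-west1 mk --dataset my-project:my_dataset
```
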
docs/setting_zone_region.md

Lines changed: 0 additions & 34 deletions
This file was deleted.

docs/troubleshooting.md

Lines changed: 10 additions & 6 deletions

@@ -14,13 +14,17 @@ group or file a GitHub issue if you believe that there is a bug in the pipeline.
   [predefined machine types](https://cloud.google.com/compute/pricing#predefined_machine_types)
   for the full list.
 * Ensure you have enough [quota](https://cloud.google.com/compute/quotas) in the
-  zone/region running the pipeline. By default, the pipeline runs in the
-  `us-central1` region. You may change this by specifying `--region <region>`
-  or `--zone <zone>` when running the pipeline. You can check for quota issues
-  by navigating to the
-  [Compute Engine quotas page](https://console.cloud.google.com/iam-admin/quotas?service=compute.googleapis.com)
+  region running the pipeline. You need to [set a region](./setting_region.md)
+  for running the pipeline by specifying `--region <region>`. You can check for
+  quota issues by navigating to the [Compute Engine quotas page](https://console.cloud.google.com/iam-admin/quotas?service=compute.googleapis.com)
   while the pipeline is running, which shows saturated quotas at the top of the
-  page.
+  page (highlighted in red).
+* Ensure your source GCS bucket is located in the same region as where you are
+  running your Dataflow pipeline. According to [data
+  locality](https://cloud.google.com/dataflow/docs/concepts/regional-endpoints#data_locality)
+  guidelines the GCS bucket containing your VCF files as well as the temporary
+  directory of your pipeline should be located in the same region as your
+  Dataflow pipeline.
 * `gzip` and `bzip2` file formats cannot be sharded, which considerably slows
   down the pipeline. Consider decompressing the files prior to running the
   pipeline. You may use [dsub](https://github.com/googlegenomics/dsub) to write
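The bucket-region check described above can also be done from the command line; a hedged example with `gsutil` (bucket name is a placeholder):

```bash
# Print bucket metadata; the "Location constraint" field reports the
# bucket's region. Bucket name is a placeholder.
gsutil ls -L -b gs://my-variant-bucket
```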

docs/vcf_files_preprocessor.md

Lines changed: 1 addition & 1 deletion

@@ -83,7 +83,7 @@ COMMAND="vcf_to_bq_preprocess \
 docker run -v ~/.config:/root/.config \
   gcr.io/cloud-lifesciences/gcp-variant-transforms \
   --project "${GOOGLE_CLOUD_PROJECT}" \
-  --zones us-west1-b \
+  --region us-west1 \
   "${COMMAND}"
 ```
 

gcp_variant_transforms/testing/integration/bq_to_vcf_tests/no_options.json

Lines changed: 0 additions & 1 deletion

@@ -4,7 +4,6 @@
     "input_table": "gcp-variant-transforms-test:bq_to_vcf_integration_tests.4_0",
     "output_file_name": "bq_to_vcf_no_options.vcf",
     "runner": "DirectRunner",
-    "zones": ["us-west1-b"],
     "expected_output_file": "gcp_variant_transforms/testing/data/vcf/bq_to_vcf/expected_output/no_options.vcf"
   }
 ]

gcp_variant_transforms/testing/integration/bq_to_vcf_tests/option_allow_incompatible_schema.json

Lines changed: 0 additions & 1 deletion

@@ -5,7 +5,6 @@
     "output_file_name": "bq_to_vcf_option_allow_incompatible_schema.vcf",
     "allow_incompatible_schema": true,
     "runner": "DirectRunner",
-    "zones": ["us-west1-b"],
     "expected_output_file": "gcp_variant_transforms/testing/data/vcf/bq_to_vcf/expected_output/option_allow_incompatible_schema.vcf"
   }
 ]

gcp_variant_transforms/testing/integration/bq_to_vcf_tests/option_customized_export.json

Lines changed: 0 additions & 1 deletion

@@ -6,7 +6,6 @@
     "genomic_regions": "19:1234566-1234570 20:14369-17330",
     "call_names": "NA00001 NA00003",
     "runner": "DirectRunner",
-    "zones": ["us-west1-b"],
     "expected_output_file": "gcp_variant_transforms/testing/data/vcf/bq_to_vcf/expected_output/option_customized_export.vcf"
   }
 ]
