@@ -19,7 +19,7 @@ clause. This extra cost can add up to a significant amount, specially if a few
1919
2020We are offering three solutions for situations like this: one solution is using
2121BigQuery [ clustering] ( https://cloud.google.com/bigquery/docs/clustered-tables ) ,
22- another solution is using Variant Transforms' native partitioning , and a third
22+ another solution is using Variant Transforms' native sharding , and a third
2323hybrid solution.
2424
2525## Solution 1: BigQuery clustering
@@ -71,7 +71,7 @@ CLUSTER BY reference_name, start_position, end_position AS (
7171```
7272
7373Since clustering currently is only supported for partitioned tables, in this
74- query first we add a dummy ` DATE ` column to our table. By Partitioning table
74+ query first we add a dummy ` DATE ` column to our table. By partitioning table
7575using this ` DATE ` column, we are able to cluster it based on the values of
7676` reference_name, start_position, end_position ` columns.
7777
@@ -116,10 +116,10 @@ it has a few limitations:
116116 * If you append data to an existing clustered table it will become partially
117117 sorted. So you need to regularly re-cluster your table.
118118
119- ## Solution 2: Partitioning output table
119+ ## Solution 2: Sharding output table
120120
121121The second solution for reducing the cost of queries is to use Variant
122- Transforms' native partitioning . Variant transforms is able to split the output
122+ Transforms' native sharding . Variant transforms is able to split the output
123123table into several smaller tables, each containing variants of a specific region of a
124124genome. For example, you can have one output table per chromosome, in that
125125case the above query can be written as:
@@ -142,18 +142,18 @@ This solution has some limitations comparing to clustering. For example, you
142142will be charged for processing of a whole chromosome's table even if only a
143143small region is being processed. As an example, the previous query, will cost
144144152 GB or $0.74. This is significantly less than the original cost without
145- partitioning but it's more than clustering cost.
145+ sharding but it's more than clustering cost.
146146
147- You could define your partitions to be more fine grained and have multiple
147+ You could define your shards to be more fine grained and have multiple
148148tables per chromosome. However, you need to anticipate how your future queries
149- are going to be in order to optimize your output partitions . Since in many
149+ are going to be in order to optimize your output shards . Since in many
150150use cases it not obvious to anticipate future queries, we offer the
151151third solution as the most practical and cost effective solution.
152152
153153## Solution 3: Hybrid Solution
154154
155155This solution combines two previous solutions to offer the benefits of both.
156- Using the Variant Transforms' native partitioning , the output table will be
156+ Using the Variant Transforms' native sharding , the output table will be
157157split into several smaller tables (perhaps one table per chromosome) and then
158158each table will be clustered based on the ` start_position, end_position `
159159columns to further optimize them for running queries.
@@ -165,23 +165,23 @@ this hybrid solution.
165165
166166If you append new rows to your clustered table and your table gradually becomes
167167partially sorted, you still have a strict guarantee that your query cost will
168- be limited to the size of the partitioned table ($0.74 in this case).
169- Also, partitioning output table into several smaller tables reduces the initial
168+ be limited to the size of the sharded table ($0.74 in this case).
169+ Also, sharding output table into several smaller tables reduces the initial
170170clustering time significantly.
171171
172172In the following section we will explain how you could use Variant Transforms
173- to easily partition your output table to minimize the cost of your queries.
173+ to easily shard your output table to minimize the cost of your queries.
174174
175- ## Partition Config files
175+ ## Sharding Config files
176176
177- Solution #2 or #3 require a * partition config file* to specify the output
178- tables. The config file is set using the ` --partition_config_path ` flag and is
177+ Solution #2 or #3 require a * sharding config file* to specify the output
178+ tables. The config file is set using the ` --sharding_config_path ` flag and is
179179formatted as a [ ` YAML ` ] ( https://en.wikipedia.org/wiki/YAML ) file with a straight
180- forward structure. [ Here] ( https://github.com/googlegenomics/gcp-variant-transforms/blob/master/gcp_variant_transforms/data/partition_configs /homo_sapiens_default.yaml )
180+ forward structure. [ Here] ( https://github.com/googlegenomics/gcp-variant-transforms/blob/master/gcp_variant_transforms/data/sharding_configs /homo_sapiens_default.yaml )
181181you can find a config file that splits output table into 25 tables, one per
182- chromosome plus an extra [ residual partition ] ( #residual-partition ) . We
182+ chromosome plus an extra [ residual shard ] ( #residual-shard ) . We
183183recommend using this config file as default for human samples by adding:
184- ` --partition_config_path gcp_variant_transforms/data/partition_configs /homo_sapiens_default.yaml `
184+ ` --sharding_config_path gcp_variant_transforms/data/sharding_configs /homo_sapiens_default.yaml `
185185flag to your variant transforms command. Here is a snippet of that file:
186186
187187```
@@ -192,12 +192,12 @@ flag to your variant transforms command. Here is a snippet of that file:
192192 - "1"
193193```
194194
195- This defines a partition , named ` chr1 ` , that will include all variants whose
195+ This defines a shard , named ` chr1 ` , that will include all variants whose
196196` reference_name ` is equal to ` chr1 ` or ` 1 ` . Note that the ` reference_name `
197197string is * case-insensitive* , so if your variants have ` Chr1 ` or ` CHR1 ` they
198- will all be matched to this partition .
198+ will all be matched to this shard .
199199
200- The final output table name for this partition will have ` _chr1 `
200+ The final output table name for this shard will have ` _chr1 `
201201suffix. More precisely, if
202202` --output_table my-project:my_dataset.my_table `
203203is set, then the output table for chromosome 1
@@ -206,9 +206,9 @@ variants will be available at
206206suffix for your table names. Here, for simplicity, we used the same string
207207(` chr1 ` ) for both ` reference_name ` matching and table name suffix.
208208
209- As we mentioned earlier, partitioning can be done at a more fine grained level
209+ As we mentioned earlier, sharding can be done at a more fine grained level
210210and does not have to be limited to chromosomes. For example, the following
211- config defines two partitions that contain variants of chromosome X:
211+ config defines two shards that contain variants of chromosome X:
212212```
213213- partition:
214214 partition_name: "chrX_part1"
@@ -223,12 +223,12 @@ If the *start position* of a variant on chromosome X is less than `100,000,000`
223223it will be assigned to ` chrX_part1 ` table otherwise it will be assigned to
224224` chrX_part2 ` table.
225225
226- ### Residual Partition
227- All partitions defined in a config file follow the same principal, variants will
226+ ### Residual Shard
227+ All shards defined in a config file follow the same principal, variants will
228228be assigned to them based on their defined ` regions ` . The only exception is the
229- ` residual ` partition , this partition acts as * default partition * meaning that
230- all variants that were not assigned to any partition will end up in this
231- partition . For example consider the following config file:
229+ ` residual ` shard , this shard acts as * default shard * that
230+ all variants that were not assigned to any shard will end up in this
231+ shard . For example consider the following config file:
232232```
233233- partition:
234234 partition_name: "first_50M"
@@ -257,11 +257,11 @@ This config file splits all the variants into 3 tables:
257257 * All variants of ` chr1 ` , ` chr2 ` , and ` chr3 ` with start position ` >= 100M `
258258 * All variants of other chromosomes.
259259
260- Using the ` residual ` partition you can make sure your output tables will include
260+ Using the ` residual ` shard you can make sure your output tables will include
261261* all* input variants. However, if in your analysis you don't need the residual
262- variants, you can simply remove the last partition from your config file. In
262+ variants, you can simply remove the last shard from your config file. In
263263the case of previous example, you will have only 2 tables as output and variants
264- that did not match to those two partitions will be dropped from the final output.
264+ that did not match to those two shards will be dropped from the final output.
265265
266266This feature can be used more broadly for filtering out unwanted variants from
267267the output tables. Filtering reduces the cost of running Variant Transforms
0 commit comments