Commit 576b2dd

Rename 'partition' to 'sharding' (#540)
"Partitioning" in BigQuery terminology has a very clear meaning. We use the same word when Variant Transforms shards output tables, for example one output BQ table per chromosome. To avoid confusion, we rename this feature of Variant Transforms to "sharding".
1 parent 4eb3d7d commit 576b2dd

20 files changed: +714 / -700 lines

README.md

Lines changed: 1 addition & 1 deletion

@@ -179,7 +179,7 @@ details.
 * [Handling large inputs](docs/large_inputs.md)
 * [Appending data to existing tables](docs/data_append.md)
 * [Variant Annotation](docs/variant_annotation.md)
-* [Partitioning](docs/partitioning.md)
+* [Sharding](docs/sharding.md)
 * [Flattening the BigQuery table](docs/flattening_table.md)
 * [Troubleshooting](docs/troubleshooting.md)

docs/large_inputs.md

Lines changed: 4 additions & 4 deletions

@@ -132,18 +132,18 @@ loading large (>5TB) data to BigQuery at once.
 As a result, we recommend setting `--num_bigquery_write_shards 20` when loading
 any data that has more than 1 billion rows (after merging) or 1TB of final
 output. You may use a smaller number of write shards (e.g. 5) when using
-[partitioned output](#--partition_config_path) as each partition also acts as a
+[sharded output](#--sharding_config_path) as each partition also acts as a
 "shard". Note that using a larger value (e.g. 50) can cause BigQuery write to
 fail as there is a maximum limit on the number of concurrent writes per table.

-### `--partition_config_path`
+### `--sharding_config_path`

-Partitioning the output can save significant query costs once the data is in
+Sharding the output can save significant query costs once the data is in
 BigQuery. It can also optimize the cost/time of the pipeline (e.g. it natively
 shards the BigQuery output per partition and merging can also occur per
 partition).

 As a result, we recommend setting the partition config for very large data
-where possible. Please see the [documentation](partitioning.md) for more
+where possible. Please see the [documentation](sharding.md) for more
 details.

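For orientation, here is a hedged sketch of the kind of YAML file the renamed `--sharding_config_path` flag consumes. It is a hypothetical illustration assembled from the keys (`partition`, `partition_name`, `regions`) and the `chr1`/`1` matching rules described in the sharding documentation later in this commit; it is not a verbatim copy of the shipped `homo_sapiens_default.yaml`.

```yaml
# Hypothetical sketch of one sharding config entry; keys taken from the
# snippets in this commit, not copied from the shipped default file.
- partition:
    # Suffix for the output table name, e.g. my_table_chr1.
    partition_name: "chr1"
    # reference_name values (matched case-insensitively) routed to this shard.
    regions:
      - "chr1"
      - "1"
```

A file like this, passed via `--sharding_config_path`, would route all `chr1`/`1` variants into a single `_chr1`-suffixed output table.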
Lines changed: 30 additions & 30 deletions
@@ -19,7 +19,7 @@ clause. This extra cost can add up to a significant amount, specially if a few

 We are offering three solutions for situations like this: one solution is using
 BigQuery [clustering](https://cloud.google.com/bigquery/docs/clustered-tables),
-another solution is using Variant Transforms' native partitioning, and a third
+another solution is using Variant Transforms' native sharding, and a third
 hybrid solution.

 ## Solution 1: BigQuery clustering
@@ -71,7 +71,7 @@ CLUSTER BY reference_name, start_position, end_position AS (
 ```

 Since clustering currently is only supported for partitioned tables, in this
-query first we add a dummy `DATE` column to our table. By Partitioning table
+query first we add a dummy `DATE` column to our table. By partitioning table
 using this `DATE` column, we are able to cluster it based on the values of
 `reference_name, start_position, end_position` columns.
@@ -116,10 +116,10 @@ it has a few limitations:
 * If you append data to an existing clustered table it will become partially
 sorted. So you need to regularly re-cluster your table.

-## Solution 2: Partitioning output table
+## Solution 2: Sharding output table

 The second solution for reducing the cost of queries is to use Variant
-Transforms' native partitioning. Variant transforms is able to split the output
+Transforms' native sharding. Variant transforms is able to split the output
 table into several smaller tables, each containing variants of a specific region of a
 genome. For example, you can have one output table per chromosome, in that
 case the above query can be written as:
@@ -142,18 +142,18 @@ This solution has some limitations comparing to clustering. For example, you
 will be charged for processing of a whole chromosome's table even if only a
 small region is being processed. As an example, the previous query, will cost
 152 GB or $0.74. This is significantly less than the original cost without
-partitioning but it's more than clustering cost.
+sharding but it's more than clustering cost.

-You could define your partitions to be more fine grained and have multiple
+You could define your shards to be more fine grained and have multiple
 tables per chromosome. However, you need to anticipate how your future queries
-are going to be in order to optimize your output partitions. Since in many
+are going to be in order to optimize your output shards. Since in many
 use cases it not obvious to anticipate future queries, we offer the
 third solution as the most practical and cost effective solution.

 ## Solution 3: Hybrid Solution

 This solution combines two previous solutions to offer the benefits of both.
-Using the Variant Transforms' native partitioning, the output table will be
+Using the Variant Transforms' native sharding, the output table will be
 split into several smaller tables (perhaps one table per chromosome) and then
 each table will be clustered based on the `start_position, end_position`
 columns to further optimize them for running queries.
@@ -165,23 +165,23 @@ this hybrid solution.

 If you append new rows to your clustered table and your table gradually becomes
 partially sorted, you still have a strict guarantee that your query cost will
-be limited to the size of the partitioned table ($0.74 in this case).
-Also, partitioning output table into several smaller tables reduces the initial
+be limited to the size of the sharded table ($0.74 in this case).
+Also, sharding output table into several smaller tables reduces the initial
 clustering time significantly.

 In the following section we will explain how you could use Variant Transforms
-to easily partition your output table to minimize the cost of your queries.
+to easily shard your output table to minimize the cost of your queries.

-## Partition Config files
+## Sharding Config files

-Solution #2 or #3 require a *partition config file* to specify the output
-tables. The config file is set using the `--partition_config_path` flag and is
+Solution #2 or #3 require a *sharding config file* to specify the output
+tables. The config file is set using the `--sharding_config_path` flag and is
 formatted as a [`YAML`](https://en.wikipedia.org/wiki/YAML) file with a straight
-forward structure. [Here](https://github.com/googlegenomics/gcp-variant-transforms/blob/master/gcp_variant_transforms/data/partition_configs/homo_sapiens_default.yaml)
+forward structure. [Here](https://github.com/googlegenomics/gcp-variant-transforms/blob/master/gcp_variant_transforms/data/sharding_configs/homo_sapiens_default.yaml)
 you can find a config file that splits output table into 25 tables, one per
-chromosome plus an extra [residual partition](#residual-partition). We
+chromosome plus an extra [residual shard](#residual-shard). We
 recommend using this config file as default for human samples by adding:
-`--partition_config_path gcp_variant_transforms/data/partition_configs/homo_sapiens_default.yaml`
+`--sharding_config_path gcp_variant_transforms/data/sharding_configs/homo_sapiens_default.yaml`
 flag to your variant transforms command. Here is a snippet of that file:

 ```
@@ -192,12 +192,12 @@ flag to your variant transforms command. Here is a snippet of that file:
       - "1"
 ```

-This defines a partition, named `chr1`, that will include all variants whose
+This defines a shard, named `chr1`, that will include all variants whose
 `reference_name` is equal to `chr1` or `1`. Note that the `reference_name`
 string is *case-insensitive*, so if your variants have `Chr1` or `CHR1` they
-will all be matched to this partition.
+will all be matched to this shard.

-The final output table name for this partition will have `_chr1`
+The final output table name for this shard will have `_chr1`
 suffix. More precisely, if
 `--output_table my-project:my_dataset.my_table`
 is set, then the output table for chromosome 1
@@ -206,9 +206,9 @@ variants will be available at
 suffix for your table names. Here, for simplicity, we used the same string
 (`chr1`) for both `reference_name` matching and table name suffix.

-As we mentioned earlier, partitioning can be done at a more fine grained level
+As we mentioned earlier, sharding can be done at a more fine grained level
 and does not have to be limited to chromosomes. For example, the following
-config defines two partitions that contain variants of chromosome X:
+config defines two shards that contain variants of chromosome X:
 ```
 - partition:
     partition_name: "chrX_part1"
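The snippet in the hunk above is cut off by the diff view. A hedged sketch of what a full two-shard chromosome X config might look like follows; the `partition`/`partition_name`/`regions` keys come from this commit's own snippets, but the `chrX:start-end` range syntax is an assumption inferred from the surrounding prose (start positions below `100,000,000` go to `chrX_part1`), not something this diff shows.

```yaml
# Hypothetical sketch; the range syntax inside "regions" is assumed.
- partition:
    partition_name: "chrX_part1"
    regions:
      - "chrX:0-100,000,000"    # assumed notation for start position < 100M
- partition:
    partition_name: "chrX_part2"
    regions:
      - "chrX:100,000,000-"     # assumed notation for the remainder
```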
@@ -223,12 +223,12 @@ If the *start position* of a variant on chromosome X is less than `100,000,000`
 it will be assigned to `chrX_part1` table otherwise it will be assigned to
 `chrX_part2` table.

-### Residual Partition
-All partitions defined in a config file follow the same principal, variants will
+### Residual Shard
+All shards defined in a config file follow the same principal, variants will
 be assigned to them based on their defined `regions`. The only exception is the
-`residual` partition, this partition acts as *default partition* meaning that
-all variants that were not assigned to any partition will end up in this
-partition. For example consider the following config file:
+`residual` shard, this shard acts as *default shard* that
+all variants that were not assigned to any shard will end up in this
+shard. For example consider the following config file:
 ```
 - partition:
     partition_name: "first_50M"
@@ -257,11 +257,11 @@ This config file splits all the variants into 3 tables:
 * All variants of `chr1`, `chr2`, and `chr3` with start position `>= 100M`
 * All variants of other chromosomes.

-Using the `residual` partition you can make sure your output tables will include
+Using the `residual` shard you can make sure your output tables will include
 *all* input variants. However, if in your analysis you don't need the residual
-variants, you can simply remove the last partition from your config file. In
+variants, you can simply remove the last shard from your config file. In
 the case of previous example, you will have only 2 tables as output and variants
-that did not match to those two partitions will be dropped from the final output.
+that did not match to those two shards will be dropped from the final output.

 This feature can be used more broadly for filtering out unwanted variants from
 the output tables. Filtering reduces the cost of running Variant Transforms
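To illustrate the residual mechanism described in the hunks above: the `residual` shard presumably gets its own entry at the end of the config file, along the lines of the sketch below. The keys match this commit's snippets, but the exact way an entry is marked as residual is an assumption.

```yaml
# Hypothetical catch-all entry; how "residual" is flagged is assumed.
- partition:
    partition_name: "residual"
    regions:
      - "residual"   # assumed marker for variants not matched by any shard
```

Removing an entry like this would, per the docs above, drop unmatched variants from the final output rather than collecting them in a residual table.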

gcp_variant_transforms/data/partition_configs/homo_sapiens_default.yaml renamed to gcp_variant_transforms/data/sharding_configs/homo_sapiens_default.yaml

File renamed without changes.
