explore parallelization in operations using group_by() #21

@henry-ngo

Description

Our team has encountered a performance issue with package functions that call dplyr::group_by() followed by dplyr::summarise() on very large data frames. A common example is get_ds_rt().

This appears to be a general dplyr performance limitation rather than something specific to this package.

A workaround is to split the data frame into chunks and process one chunk at a time; the user can also run the chunks in parallel themselves, as in the sketch below.
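For reference, here is a minimal sketch of that chunked workaround, assuming a grouping column named `group_id` and a numeric column named `value`; the function and column names are illustrative, not part of the package's API:

```r
library(dplyr)
library(parallel)

summarise_in_chunks <- function(df, n_chunks = 4L, n_cores = 2L) {
  # Assign each group to a chunk so that all rows of a group stay together
  chunk_of_group <- df %>%
    distinct(group_id) %>%
    mutate(chunk = rep_len(seq_len(n_chunks), n()))

  chunks <- df %>%
    left_join(chunk_of_group, by = "group_id") %>%
    group_split(chunk, .keep = FALSE)

  # Process chunks in parallel (mclapply forks; on Windows use parLapply instead)
  results <- mclapply(chunks, function(chunk_df) {
    chunk_df %>%
      group_by(group_id) %>%
      summarise(value_mean = mean(value), .groups = "drop")
  }, mc.cores = n_cores)

  bind_rows(results)
}
```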

However, perhaps the package itself could support this? It could be off by default and implemented by adding an argument such as nthreads to these functions, letting the user specify how many threads they are willing to dedicate to the operation.

(e.g., https://multidplyr.tidyverse.org/, or other similar packages)
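As a rough sketch of what an opt-in nthreads argument could look like if multidplyr were adopted; `get_ds_rt_parallel()`, `group_id`, and `rt` are hypothetical stand-ins for the package's actual functions and columns, while `new_cluster()`, `partition()`, and `collect()` are the real multidplyr calls:

```r
library(dplyr)
library(multidplyr)

get_ds_rt_parallel <- function(df, nthreads = 1L) {
  if (nthreads <= 1L) {
    # Default path: current serial behaviour, no extra dependency exercised
    return(df %>%
             group_by(group_id) %>%
             summarise(rt_mean = mean(rt), .groups = "drop"))
  }
  cluster <- new_cluster(nthreads)  # spawn nthreads worker processes
  df %>%
    group_by(group_id) %>%
    partition(cluster) %>%          # scatter whole groups across workers
    summarise(rt_mean = mean(rt)) %>%
    collect()                       # gather partial results back
  # worker processes shut down when `cluster` is garbage collected
}
```

Partitioning has a one-time cost of copying data to the workers, so this only pays off when the summarise step is expensive relative to the transfer, which is why keeping it opt-in and off by default seems sensible.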
