Our team has encountered a performance issue when using package functions that use dplyr::group_by() followed by dplyr::summarise() on very large dataframes. A common example is get_ds_rt().
This appears to be a general dplyr limitation rather than something specific to this package; the same slowdown shows up in many group_by()/summarise() workflows.
A workaround is to split the data frame into chunks and run get_ds_rt() on one chunk at a time; the user could also parallelise this themselves (see the sketch below).
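
A minimal sketch of that chunking workaround, assuming get_ds_rt() accepts the data frame as its first argument and that grouping is by a single column (the column name and chunk count here are hypothetical):

```r
library(dplyr)
library(purrr)

# Split the data so all rows of a group stay in the same chunk,
# then run the expensive call per chunk and recombine.
run_in_chunks <- function(df, group_col, n_chunks = 10) {
  ids <- unique(df[[group_col]])
  chunk_of <- setNames(rep(seq_len(n_chunks), length.out = length(ids)), ids)
  df %>%
    mutate(.chunk = chunk_of[as.character(.data[[group_col]])]) %>%
    group_split(.chunk, .keep = FALSE) %>%
    map(get_ds_rt) %>%   # assumes get_ds_rt(df) is the full call; add other args as needed
    bind_rows()          # recombine the per-chunk results
}

# Usage (hypothetical column name):
# result <- run_in_chunks(big_df, "subject_id", n_chunks = 8)
```

The same list of chunks could be handed to parallel::mclapply() or furrr::future_map() instead of purrr::map() if the user wants to parallelise it themselves.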
However, perhaps there is a way to do this within the package? It could be off by default and implemented by adding an argument such as nthreads to these functions, letting users specify how many threads they are willing to devote to the operation (e.g. via https://multidplyr.tidyverse.org/ or a similar package). A rough sketch is below.
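
One way an nthreads argument might be wired up with multidplyr; the grouping column id, value column rt, and the summarise step are placeholders for whatever get_ds_rt() actually computes:

```r
library(dplyr)
library(multidplyr)

get_ds_rt_parallel <- function(df, nthreads = 1) {
  if (nthreads <= 1) {
    # Default path keeps the existing single-threaded behaviour.
    return(df %>% group_by(id) %>% summarise(mean_rt = mean(rt), .groups = "drop"))
  }
  cluster <- multidplyr::new_cluster(nthreads)
  df %>%
    group_by(id) %>%
    partition(cluster) %>%              # distribute groups across worker processes
    summarise(mean_rt = mean(rt)) %>%   # summarise runs on each worker
    collect() %>%                       # gather per-worker results back to the main session
    ungroup()
}
```

Keeping nthreads = 1 as the default would leave current behaviour untouched and make multidplyr an optional dependency (e.g. in Suggests).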