explore parallelization in operations using group_by() #21

@henry-ngo

Description

Our team has encountered a performance issue with package functions that call dplyr::group_by() followed by dplyr::summarise() on very large data frames. A common example is get_ds_rt().

This appears to be a general dplyr performance limitation rather than something specific to this package.

A workaround is to split the data frame into chunks and process one chunk at a time; the user can also run the chunks in parallel themselves, as in the sketch below.
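For reference, here is a minimal sketch of that chunked workaround, assuming a grouping column named `group_id` and a numeric column named `value`; the function and column names are illustrative, not part of the package's API:

```r
library(dplyr)
library(parallel)

summarise_in_chunks <- function(df, n_chunks = 4L, n_cores = 2L) {
  # Assign each group to a chunk so that all rows of a group stay together
  chunk_of_group <- df %>%
    distinct(group_id) %>%
    mutate(chunk = rep_len(seq_len(n_chunks), n()))

  chunks <- df %>%
    left_join(chunk_of_group, by = "group_id") %>%
    group_split(chunk, .keep = FALSE)

  # Process chunks in parallel (mclapply forks; on Windows use parLapply instead)
  results <- mclapply(chunks, function(chunk_df) {
    chunk_df %>%
      group_by(group_id) %>%
      summarise(value_mean = mean(value), .groups = "drop")
  }, mc.cores = n_cores)

  bind_rows(results)
}
```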

However, perhaps the package itself could support this? It could be off by default and implemented by adding an argument such as nthreads to these functions, letting the user specify how many threads they are willing to dedicate to the operation.

(e.g., https://multidplyr.tidyverse.org/, or other similar packages)
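As a rough sketch of what an opt-in nthreads argument could look like if multidplyr were adopted; `get_ds_rt_parallel()`, `group_id`, and `rt` are hypothetical stand-ins for the package's actual functions and columns, while `new_cluster()`, `partition()`, and `collect()` are the real multidplyr calls:

```r
library(dplyr)
library(multidplyr)

get_ds_rt_parallel <- function(df, nthreads = 1L) {
  if (nthreads <= 1L) {
    # Default path: current serial behaviour, no extra dependency exercised
    return(df %>%
             group_by(group_id) %>%
             summarise(rt_mean = mean(rt), .groups = "drop"))
  }
  cluster <- new_cluster(nthreads)  # spawn nthreads worker processes
  df %>%
    group_by(group_id) %>%
    partition(cluster) %>%          # scatter whole groups across workers
    summarise(rt_mean = mean(rt)) %>%
    collect()                       # gather partial results back
  # worker processes shut down when `cluster` is garbage collected
}
```

Partitioning has a one-time cost of copying data to the workers, so this only pays off when the summarise step is expensive relative to the transfer, which is why keeping it opt-in and off by default seems sensible.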
