
Categories missing with highly partitioned dask dataframes in PointsModel #1009

@jonas2612

Description

  1. Reproduce using the blobs dataset

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    from spatialdata.datasets import blobs
    
    s = blobs()
    tbl = next(iter(s.tables.values()))
    df = tbl.obs.copy()
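    n_cats = 15
    cats = pd.Index([f"G{i:04d}" for i in range(n_cats)], dtype="string")
    
    rng = np.random.default_rng(0)
    n = len(df)
    # Front block: rows drawn from at most the first 20 categories.
    # With n_cats = 15 the slice cats[:20] still covers every category;
    # raise n_cats above 20 to restrict the front block to a subset.
    k_front = min(10_000, n)
    front = rng.choice(cats[:20], size=k_front)
    # Back block: remaining rows drawn from all categories.
    back  = rng.choice(cats, size=n - k_front)
    df["gene"] = pd.Index(np.concatenate([front, back]), dtype="string")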
    
    ddf_many = dd.from_pandas(df, npartitions=217)
    
    # head(1) only computes the first partition, so only its categories are seen.
    c1 = ddf_many["gene"].astype(str).astype("category").head(1).cat.categories
    print("many partitions, categories seen via head(1):", len(c1))
    
    # as_known() computes the categories across all partitions.
    ddf_as_known = ddf_many["gene"].astype("category").cat.as_known()
    print("with .cat.as_known(), categories:", len(ddf_as_known._meta.cat.categories))

Describe the bug
The categories set at L888 of src/spatialdata/models/models.py do not include all categories present in the data. If the categories are determined per partition, those that occur only in other partitions are never registered, which inadvertently filters the points dataframe and drops features.
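To illustrate the failure mode in isolation, here is a minimal pandas-only sketch. It assumes the categories end up being taken from a subset of partitions, as described above; the two series are hypothetical stand-ins for partitions and are not the spatialdata code itself.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Hypothetical stand-ins for two partitions of a points dataframe.
part_a = pd.Series(["G0000", "G0001", "G0000"], dtype="category")
part_b = pd.Series(["G0002", "G0003"], dtype="category")

# Categories inferred from the first partition only miss G0002 and G0003.
first_only = part_a.cat.categories

# Applying those categories to the other partition turns unseen values into NaN.
coerced_b = part_b.cat.set_categories(first_only)
n_lost = int(coerced_b.isna().sum())

# The union over all partitions is what should be registered instead.
all_cats = union_categoricals([part_a, part_b]).categories

print(list(first_only), n_lost, list(all_cats))
```

Values outside the registered categories become missing, which matches the observed loss of genes.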

To Reproduce
See the example above on the blobs dataset, or use datasets from the Allen Brain Atlas (https://knowledge.brain-map.org/abcatlas#AQEBSzlKTjIzUDI0S1FDR0s5VTc1QQACSFNZWlBaVzE2NjlVODIxQldZUAADAAQBAAKEUL8fg4IJfwOFLj12hMQ92QQyTlFUSUU3VEFNUDhQUUFITzRQAAWBr6ZKgemsDoGggUeAktXoBgAHAAAFAAYBAQJGUzAwRFhWMFQ5UjFYOUZKNFFFAAN%2BAAAABAAACFZGT0ZZUEZRR1JLVURRVVozRkYACUxWREJKQVc4Qkk1WVNTMVFVQkcACgALAVRMT0tXQ0w5NVJVMDNEOVBFVEcAAjczR1ZURFhERUdFMjdNMlhKTVQAAwEEAQACIzAwMDAwMAADyAEABQEBAiMwMDAwMDAAA8gBAAAAAgEA). As far as we know, the error occurs on all datasets there.

According to the Dask documentation on categoricals (https://docs.dask.org/en/stable/dataframe-categoricals.html), one solution is to make the categories known with

    data[c] = data[c].cat.as_known()

and then register them with

    data[c] = data[c].cat.set_categories(data[c]._meta.cat.categories)

The second step can probably be skipped. Because as_known() has to compute the categories across all partitions, this implementation is a bit slower than the current one.

I'll open a pull request with these changes later.

Expected behavior
Registration of all categories within points-dataframe.

  • OS: Linux Ubuntu
  • Version 0.5.0

Additional context
Reading datasets from the Allen Brain Atlas with spatialdata_io.readers.merscope results in a transcript dataframe with ~20-30 genes instead of ~500 genes.
