Categories missing with highly partitioned dask dataframes in PointsModel

1. Reproduce using the [`blobs` dataset](https://spatialdata.scverse.org/en/stable/api/datasets.html#spatialdata.datasets.blobs)

    ```python
    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    from spatialdata.datasets import blobs
    
    s = blobs()
    tbl = next(iter(s.tables.values()))
    df = tbl.obs.copy()
    n_cats = 15
    cats = pd.Index([f"G{i:04d}" for i in range(n_cats)], dtype="string")
    
    rng = np.random.default_rng(0)
    n = len(df)
    k_front = min(10_000, n)
    front = rng.choice(cats[:20], size=k_front)
    back  = rng.choice(cats, size=n - k_front)
    df["gene"] = pd.Index(np.concatenate([front, back]), dtype="string")
    
    ddf_many = dd.from_pandas(df, npartitions=217)
    
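    # Categories inferred from the first partition only (this is all head(1) sees).
    c1 = ddf_many["gene"].astype(str).astype("category").head(1).cat.categories
    print("many partitions, categories seen via head(1):", len(c1))  # only those in the first partition
    
    # as_known() scans every partition, so all categories end up in the metadata.
    ddf_as_known = ddf_many["gene"].astype("category").cat.as_known()
    print("with .cat.as_known(), categories:", len(ddf_as_known._meta.cat.categories))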
    ```

**Describe the bug**
The categories set on L888 of `src/spatialdata/models/models.py` do not account for all categories present in the data. When categories are determined per partition, only those seen in the inspected partition are registered. Values belonging to any other category are then inadvertently filtered out of the points dataframe, which causes feature loss.
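
The feature loss can be illustrated with plain pandas. This is a minimal sketch, not the code from `models.py`: it assumes the loss comes from calling `set_categories` with an incomplete category list, and the gene names and `partial_categories` are hypothetical.

```python
import pandas as pd

# Hypothetical gene column containing three distinct categories.
genes = pd.Series(["G0001", "G0002", "G0003", "G0001"], dtype="category")

# Assumption: only the categories seen in the first partition were collected.
partial_categories = ["G0001"]

# Values outside the given categories are replaced by NaN.
restricted = genes.cat.set_categories(partial_categories)
n_lost = int(restricted.isna().sum())

print(restricted.tolist())
print("values lost:", n_lost)
```

Any later step that drops missing values would then silently remove the affected rows.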

**To Reproduce**
See the example above on the `blobs` dataset, or use datasets from the Allen Brain Atlas (https://knowledge.brain-map.org/abcatlas#AQEBSzlKTjIzUDI0S1FDR0s5VTc1QQACSFNZWlBaVzE2NjlVODIxQldZUAADAAQBAAKEUL8fg4IJfwOFLj12hMQ92QQyTlFUSUU3VEFNUDhQUUFITzRQAAWBr6ZKgemsDoGggUeAktXoBgAHAAAFAAYBAQJGUzAwRFhWMFQ5UjFYOUZKNFFFAAN%2BAAAABAAACFZGT0ZZUEZRR1JLVURRVVozRkYACUxWREJKQVc4Qkk1WVNTMVFVQkcACgALAVRMT0tXQ0w5NVJVMDNEOVBFVEcAAjczR1ZURFhERUdFMjdNMlhKTVQAAwEEAQACIzAwMDAwMAADyAEABQEBAiMwMDAwMDAAA8gBAAAAAgEA). As far as we know, the error occurs on all datasets there.

According to https://docs.dask.org/en/stable/dataframe-categoricals.html a solution could be to use
```python
data[c] = data[c].cat.as_known()
```
to make the categories known, and then, to register them,
```python
data[c] = data[c].cat.set_categories(data[c]._meta.cat.categories)
```
The last step could probably be skipped, though. This implementation is somewhat slower than the current one, because `as_known()` has to pass over the data to collect the categories.

I'll open a pull request with these changes later.

**Expected behavior**
All categories are registered in the points dataframe.

- OS: Linux Ubuntu
- Version 0.5.0

**Additional context**
Using `spatialdata_io.readers.merscope` with datasets from the Allen Brain Atlas results in a transcript dataframe with ~20-30 genes instead of ~500 genes.


Categories missing with highly partitioned dask dataframes in PointsModel #1009
