Description
Reproduce using the blobs dataset:

    import numpy as np
    import pandas as pd
    import dask.dataframe as dd
    from spatialdata.datasets import blobs

    s = blobs()
    tbl = next(iter(s.tables.values()))
    df = tbl.obs.copy()

    n_cats = 15
    cats = pd.Index([f"G{i:04d}" for i in range(n_cats)], dtype="string")
    rng = np.random.default_rng(0)
    n = len(df)
    k_front = min(10_000, n)
    front = rng.choice(cats[:20], size=k_front)
    back = rng.choice(cats, size=n - k_front)
    df["gene"] = pd.Index(np.concatenate([front, back]), dtype="string")

    ddf_many = dd.from_pandas(df, npartitions=217)
    c1 = ddf_many["gene"].astype(str).astype("category").head(1).cat.categories
    print("many partitions, categories seen via head(1):", len(c1))  # typically ~20

    ddf_as_known = ddf_many["gene"].astype("category").cat.as_known()
    print("with .cat.as_known(), categories:", len(ddf_as_known._meta.cat.categories))
Describe the bug
The categories set in L888 of src/spatialdata/models/models.py do not take all categories into account. If the categories are inferred per partition, only the categories present in the inspected partition are registered. Values that fall outside those categories are then dropped, which leads to an inadvertent filtering of the points dataframe and a loss of features.
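The mechanism can be illustrated with pandas alone. The sketch below is not the spatialdata code; it emulates a "first partition" by slicing a Series, and the gene labels, sizes, and variable names are all hypothetical. It shows that categories inferred from the first rows miss most labels, and that casting the full column to those categories turns the remaining values into missing values.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 500 gene labels, but the first 1,000 rows
# only contain the first 20 of them.
genes = [f"G{i:04d}" for i in range(500)]
rng = np.random.default_rng(0)
front = rng.choice(genes[:20], size=1_000)
back = rng.choice(genes, size=50_000)
s = pd.Series(np.concatenate([front, back]), name="gene")

# Emulate a single partition by slicing the first rows.
first_partition = s.iloc[:1_000].astype("category")
seen_first = set(first_partition.cat.categories)

# Categories inferred from the whole column.
seen_all = set(s.astype("category").cat.categories)

# Casting the whole column to the categories of the first partition
# replaces every label outside that set with NaN.
restricted = s.astype(pd.CategoricalDtype(first_partition.cat.categories))
n_lost = int(restricted.isna().sum())

print("categories from first partition:", len(seen_first))
print("categories from full column:", len(seen_all))
print("values lost after restricting:", n_lost)
```

Any later step that drops missing values would then silently remove the affected rows.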
To Reproduce
See the example above on the blobs dataset, or use datasets from the Allen Brain Atlas (https://knowledge.brain-map.org/abcatlas#AQEBSzlKTjIzUDI0S1FDR0s5VTc1QQACSFNZWlBaVzE2NjlVODIxQldZUAADAAQBAAKEUL8fg4IJfwOFLj12hMQ92QQyTlFUSUU3VEFNUDhQUUFITzRQAAWBr6ZKgemsDoGggUeAktXoBgAHAAAFAAYBAQJGUzAwRFhWMFQ5UjFYOUZKNFFFAAN%2BAAAABAAACFZGT0ZZUEZRR1JLVURRVVozRkYACUxWREJKQVc4Qkk1WVNTMVFVQkcACgALAVRMT0tXQ0w5NVJVMDNEOVBFVEcAAjczR1ZURFhERUdFMjdNMlhKTVQAAwEEAQACIzAwMDAwMAADyAEABQEBAiMwMDAwMDAAA8gBAAAAAgEA). As far as we know, the error occurs on all datasets there.
According to https://docs.dask.org/en/stable/dataframe-categoricals.html a solution could be to use

    data[c] = data[c].cat.as_known()

to make the categories visible, and then, for registration,

    data[c] = data[c].cat.set_categories(data[c]._meta.cat.categories)

The last step could probably be skipped. This implementation is a bit slower than the previous one.
I'll open a pull request with these changes later.
Expected behavior
All categories within the points dataframe are registered.
- OS: Linux Ubuntu
- Version 0.5.0
Additional context
Using spatialdata_io.readers.merscope with datasets from the Allen Brain Atlas results in a transcript dataframe with ~20-30 genes instead of ~500 genes.