@happymealinthebuilding

This pull request merges the marker gene finding branch into `develop`. It adds a foundational set of utility functions and a demonstrated workflow for identifying and working with marker genes: extracting top-ranked genes from clustering results, renaming clusters, and visualizing gene expression per cluster. These tools support ongoing and future work on cell type characterization and biomarker discovery.

happymealinthebuilding and others added 19 commits April 11, 2025 20:07
Implement marker gene expression visualization

Introduces the `visualize_marker_genes` function for visualizing
marker gene expression across cell clusters in an AnnData object.
This function:
- Accepts an `adata` object, a list/dict of `marker_genes`, and a
  `cluster_key` (defaulting to 'leiden') from `adata.obs` for grouping.
- Generates a dot plot using `sc.pl.dotplot` to show average expression
  and the percentage of cells expressing each marker gene per cluster.
- Additionally, creates a stacked violin plot using `sc.pl.stacked_violin`
  to display the distribution of marker gene expression within each cluster.
- Facilitates comprehensive visual identification and validation of
  potential marker genes.
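
A minimal sketch of what such a function could look like, based only on the description above. The early `KeyError` check and the deferred import are assumptions for illustration and are not confirmed by the commit; the two plotting calls use the standard `scanpy` signatures.

```python
def visualize_marker_genes(adata, marker_genes, cluster_key="leiden"):
    """Plot marker gene expression per cluster as a dot plot and a stacked violin plot."""
    # Assumption: fail early with a clear message if the grouping column is missing.
    if cluster_key not in adata.obs:
        raise KeyError(f"'{cluster_key}' not found in adata.obs")

    # Imported here so the check above does not require a plotting backend.
    import scanpy as sc

    # Dot plot: mean expression (color) and fraction of expressing cells (dot size).
    sc.pl.dotplot(adata, marker_genes, groupby=cluster_key)
    # Stacked violin: distribution of expression for each gene within each cluster.
    sc.pl.stacked_violin(adata, marker_genes, groupby=cluster_key)
```

Passing `marker_genes` as a dict that maps a label to a list of genes makes both plots group the genes under that label, which helps when each cell type has its own marker set.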
…lgorithm_new_cluster_names

Standardize cluster categories and names

Implements the `rename_clusters` function to update cluster labels
within an AnnData object. This function:
- Takes an `adata` object, a `cluster_algo` key naming a column in
  `adata.obs` (e.g., 'leiden' or 'louvain'), and a list of `new_cluster_names`.
- Validates that the `cluster_algo` key exists in `adata.obs` and that
  the number of `new_cluster_names` matches the existing number of clusters.
- Utilizes `adata.rename_categories()` to perform the renaming in-place.
- Ensures consistent and interpretable cluster labeling for downstream
  marker gene analysis.
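
The validation and renaming steps above can be sketched as follows. This is an illustrative reconstruction, not the committed code: the exception types and messages are assumptions, and the column is assumed to be categorical, as it is after clustering with `scanpy`.

```python
def rename_clusters(adata, cluster_algo, new_cluster_names):
    """Rename the categories of a clustering column in adata.obs in place."""
    # The clustering key must exist as a column in adata.obs.
    if cluster_algo not in adata.obs:
        raise KeyError(f"'{cluster_algo}' not found in adata.obs")

    # One new name is required for every existing cluster category.
    n_existing = len(adata.obs[cluster_algo].cat.categories)
    if len(new_cluster_names) != n_existing:
        raise ValueError(
            f"expected {n_existing} names, got {len(new_cluster_names)}"
        )

    # Names are applied in the order of the existing categories.
    adata.rename_categories(cluster_algo, new_cluster_names)
```

Because names are matched to categories by position, it is worth checking `adata.obs[cluster_algo].cat.categories` before choosing the order of `new_cluster_names`.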
Extract top N ranked marker genes and p-values

Implements the `extract_top_genes` function to retrieve top-ranked
genes and their associated p-values from pre-computed `rank_genes_groups`
results stored in an AnnData object. This function:
- Takes an `adata` object and an optional `n_top` parameter (defaulting to 5)
  to specify the number of top genes per group.
- Validates the presence of `adata.uns['rank_genes_groups']` before processing.
- Parses the `names` and `pvals` fields from the `rank_genes_groups` results.
- Returns a tuple containing two pandas DataFrames:
    1. A DataFrame of top gene names per cluster/group.
    2. A combined DataFrame with top gene names and their corresponding
       p-values for each cluster/group.
- Provides a structured and quantitative list of candidate marker genes.
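
One way the extraction described above might be written, assuming the usual `scanpy` layout in which `names` and `pvals` are record arrays with one field per group. The `<group>_names` / `<group>_pvals` column naming and the error message are assumptions for illustration.

```python
import pandas as pd


def extract_top_genes(adata, n_top=5):
    """Return (top gene names, names plus p-values) per group as DataFrames."""
    # rank_genes_groups must have been computed beforehand.
    if "rank_genes_groups" not in adata.uns:
        raise KeyError("adata.uns['rank_genes_groups'] not found")

    result = adata.uns["rank_genes_groups"]
    # Each field of the record array corresponds to one cluster/group.
    groups = result["names"].dtype.names

    # First table: one column of top gene names per group.
    top_names = pd.DataFrame({g: result["names"][g][:n_top] for g in groups})

    # Second table: gene names and p-values side by side for each group.
    combined = pd.DataFrame(
        {
            f"{g}_{field}": result[field][g][:n_top]
            for g in groups
            for field in ("names", "pvals")
        }
    )
    return top_names, combined
```

Typical use would be `top_names, combined = extract_top_genes(adata, n_top=10)` after running `sc.tl.rank_genes_groups`.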

5 participants