WIP: clinvar-to-tsv changes for new clinvar filter #109

stolpeo · 2025-12-18T13:09:08Z

Summary by CodeRabbit

Bug Fixes
- Fixed label formatting in conflicting classification output.
Refactor
- Enhanced pathogenicity data processing with improved aggregation of clinical assertions for conflicting germline classifications and revised initialization approach.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

coderabbitai · 2025-12-18T13:09:19Z

Walkthrough

The change modifies ClinVar pathogenicity handling by simplifying summary field initialization and introducing paranoid-pathogenicity processing for conflicting germline classifications that aggregates contributing pathogenicities with uncertain significance fallback.

Changes

Cohort / File(s)	Summary
ClinVar Pathogenicity Processing `rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py`	Replaces prior ClinVar summary handling with simplified initialization; introduces branching logic that processes conflicting germline classifications via aggregation of contributing pathogenicities ordered by PATHOGENICITIES value, with fallback to uncertain significance; retains token-based parsing for non-conflicting path; resets summary fields to empty when discontinued; removes trailing tabs from conflict labels

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35–50 minutes

Areas requiring extra attention:

Logic for detecting and handling conflicting germline classifications
Aggregation and ordering mechanism for contributing pathogenicities
Fallback behavior to uncertain significance when no pathogenicities are present
Verification that non-conflicting path token-based parsing remains functional
Testing coverage for both branching paths and edge cases

Poem

🐰 A rabbit hops through ClinVar's maze,
Two paths now gleam in structured ways,
Conflicting calls find clarity's touch,
Pathogenicity aggregates just right, not much,
With fallback grace to uncertain ground—
In parsed logic, true strength is found! ✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.
Title check	❓ Inconclusive	The title 'WIP: clinvar-to-tsv changes for new clinvar filter' is vague and uses the non-descriptive 'WIP' prefix without clarifying the actual change being made.	Replace the vague 'WIP' prefix and 'changes for new clinvar filter' with a specific description of the main change, e.g., 'Handle conflicting ClinVar pathogenicity classifications' or similar that reflects the core modification.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch fix-add-clinvar-conflicting

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py (2)
189-209: Consider extracting common token-matching logic and clarifying sorting behavior.

Two observations:

Code duplication: The token extraction and PATHOGENICITIES matching logic (lines 197-200) is similar to lines 167-170 in the conflicting branch. Consider extracting a helper function.

Sorting inconsistency: The conflicting case sorts pathogenicities by severity (most pathogenic first), but the non-conflicting case preserves the original order from the description. If this is intentional (e.g., respecting ClinVar's ordering), a brief comment would clarify. If not, consider applying consistent sorting.
🔎 Optional helper function extraction:
def extract_pathogenicities(tokens: typing.Iterable[str]) -> list[str]:
    """Extract known pathogenicity terms from tokens."""
    result = []
    for token in tokens:
        if token in PATHOGENICITIES:
            result.append(token)
    return result
Then use in both branches:
-                    for token in description_tokens:
-                        for key in PATHOGENICITIES.keys():
-                            if key == token:
-                                contributing_pathogenicities.add(key)
+                    contributing_pathogenicities.update(
+                        extract_pathogenicities(description_tokens)
+                    )
185-188: Consider handling unknown review status values.

If ClinVar introduces a new AggregateGermlineReviewStatus value not present in REVIEW_STATUS_LABELS or REVIEW_STATUS_STARS, the dictionary lookups will raise KeyError. Consider using .get() with a sensible default, or explicitly handling unknown statuses.
🔎 Example with defaults:
-        summary_paranoid_review_status_label = REVIEW_STATUS_LABELS[
-            germline_classification.review_status
-        ]
-        summary_paranoid_gold_stars = REVIEW_STATUS_STARS[germline_classification.review_status]
+        summary_paranoid_review_status_label = REVIEW_STATUS_LABELS.get(
+            germline_classification.review_status, "unknown"
+        )
+        summary_paranoid_gold_stars = REVIEW_STATUS_STARS.get(
+            germline_classification.review_status, 0
+        )

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f83d1b5 and 3d4e2a9.

📒 Files selected for processing (1)

rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py (3 hunks)

🔇 Additional comments (2)

rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py (2)

88-88: LGTM!

The trailing tab removal from the conflicting classifications label is a good fix that prevents formatting issues in the output.

147-188: LGTM!

The conflicting classification handling logic is well-structured:

Properly filters out low-quality submissions (unspecified, flagged, no criteria, etc.)

Correctly aggregates pathogenicities from clinical assertions using a set to avoid duplicates

Sorts by pathogenicity severity (most pathogenic first)

Has a sensible fallback to "uncertain significance" when no valid pathogenicities are found

coderabbitai · 2025-12-18T13:12:24Z

rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py

+    # Summary ClinVar fields are discontinued
+    summary_clinvar_pathogenicity = []
+    summary_clinvar_pathogenicity_label = ""
+    summary_clinvar_review_status_label = ""
+    summary_clinvar_gold_stars = 0


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Search for all usages of summary_clinvar fields in Python files rg -n "summary_clinvar" --type py

Repository: varfish-org/varfish-db-downloader

Length of output: 2247

🏁 Script executed:

# Read the context around the initialization lines (140-144) head -n 200 rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py | tail -n +120

Repository: varfish-org/varfish-db-downloader

Length of output: 4055

🏁 Script executed:

# Check the TSV output headers to see what fields are included sed -n '240,270p' rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py

Repository: varfish-org/varfish-db-downloader

Length of output: 797

🏁 Script executed:

# Check if there's any documentation or changelog about this change rg -n "paranoid\|discontinued" --type py --type md --type txt -i

Repository: varfish-org/varfish-db-downloader

Length of output: 59

Deprecation of summary_clinvar fields is intentional but document the transition.

The discontinuation of summary_clinvar_* fields (set to empty values at lines 140-144) is an intentional refactoring in favor of summary_paranoid_* fields, which are calculated using improved classification logic. Since both old and new fields are preserved in the TSV output, backward compatibility is maintained. However, add a clear comment or documentation note explaining that downstream consumers should migrate to the new summary_paranoid_* fields, as the legacy fields will remain empty.

🤖 Prompt for AI Agents

rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py around lines 140-144: the summary_clinvar_* variables are intentionally set to empty for a migration to summary_paranoid_* logic; add a short inline comment above these assignments stating that the legacy summary_clinvar_* fields are deprecated and intentionally left blank, that downstream consumers should migrate to summary_paranoid_* fields (which are the new authoritative values), and that both old and new columns remain in the TSV for backward compatibility; also add a brief note to the TSV output generation (or repository README) indicating the deprecation and recommended migration target so users know to stop relying on the empty legacy fields.

WIP: clinvar-to-tsv changes for new clinvar filter

3d4e2a9

coderabbitai bot reviewed Dec 18, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

WIP: clinvar-to-tsv changes for new clinvar filter #109

WIP: clinvar-to-tsv changes for new clinvar filter #109

Uh oh!

stolpeo commented Dec 18, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Dec 18, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

coderabbitai bot Dec 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

WIP: clinvar-to-tsv changes for new clinvar filter #109

Are you sure you want to change the base?

WIP: clinvar-to-tsv changes for new clinvar filter #109

Uh oh!

Conversation

stolpeo commented Dec 18, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Dec 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Dec 18, 2025

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

stolpeo commented Dec 18, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 18, 2025 •

edited

Loading