Skip to content

Conversation

@stolpeo
Copy link
Contributor

@stolpeo stolpeo commented Dec 18, 2025

Summary by CodeRabbit

  • Bug Fixes

    • Fixed label formatting in conflicting classification output.
  • Refactor

    • Enhanced pathogenicity data processing with improved aggregation of clinical assertions for conflicting germline classifications and revised initialization approach.

✏️ Tip: You can customize this high-level summary in your review settings.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 18, 2025

Walkthrough

The change modifies ClinVar pathogenicity handling by simplifying summary field initialization and introducing paranoid-pathogenicity processing for conflicting germline classifications that aggregates contributing pathogenicities with uncertain significance fallback.

Changes

Cohort / File(s) Summary
ClinVar Pathogenicity Processing
rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py
Replaces prior ClinVar summary handling with simplified initialization; introduces branching logic that processes conflicting germline classifications via aggregation of contributing pathogenicities ordered by PATHOGENICITIES value, with fallback to uncertain significance; retains token-based parsing for non-conflicting path; resets summary fields to empty when discontinued; removes trailing tabs from conflict labels

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~35–50 minutes

Areas requiring extra attention:

  • Logic for detecting and handling conflicting germline classifications
  • Aggregation and ordering mechanism for contributing pathogenicities
  • Fallback behavior to uncertain significance when no pathogenicities are present
  • Verification that non-conflicting path token-based parsing remains functional
  • Testing coverage for both branching paths and edge cases

Poem

🐰 A rabbit hops through ClinVar's maze,
Two paths now gleam in structured ways,
Conflicting calls find clarity's touch,
Pathogenicity aggregates just right, not much,
With fallback grace to uncertain ground—
In parsed logic, true strength is found!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'WIP: clinvar-to-tsv changes for new clinvar filter' is vague and uses the non-descriptive 'WIP' prefix without clarifying the actual change being made. Replace the vague 'WIP' prefix and 'changes for new clinvar filter' with a specific description of the main change, e.g., 'Handle conflicting ClinVar pathogenicity classifications' or similar that reflects the core modification.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix-add-clinvar-conflicting

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py (2)

189-209: Consider extracting common token-matching logic and clarifying sorting behavior.

Two observations:

  1. Code duplication: The token extraction and PATHOGENICITIES matching logic (lines 197-200) is similar to lines 167-170 in the conflicting branch. Consider extracting a helper function.

  2. Sorting inconsistency: The conflicting case sorts pathogenicities by severity (most pathogenic first), but the non-conflicting case preserves the original order from the description. If this is intentional (e.g., respecting ClinVar's ordering), a brief comment would clarify. If not, consider applying consistent sorting.

🔎 Optional helper function extraction:
def extract_pathogenicities(tokens: typing.Iterable[str]) -> list[str]:
    """Extract known pathogenicity terms from tokens."""
    result = []
    for token in tokens:
        if token in PATHOGENICITIES:
            result.append(token)
    return result

Then use in both branches:

-                    for token in description_tokens:
-                        for key in PATHOGENICITIES.keys():
-                            if key == token:
-                                contributing_pathogenicities.add(key)
+                    contributing_pathogenicities.update(
+                        extract_pathogenicities(description_tokens)
+                    )

185-188: Consider handling unknown review status values.

If ClinVar introduces a new AggregateGermlineReviewStatus value not present in REVIEW_STATUS_LABELS or REVIEW_STATUS_STARS, the dictionary lookups will raise KeyError. Consider using .get() with a sensible default, or explicitly handling unknown statuses.

🔎 Example with defaults:
-        summary_paranoid_review_status_label = REVIEW_STATUS_LABELS[
-            germline_classification.review_status
-        ]
-        summary_paranoid_gold_stars = REVIEW_STATUS_STARS[germline_classification.review_status]
+        summary_paranoid_review_status_label = REVIEW_STATUS_LABELS.get(
+            germline_classification.review_status, "unknown"
+        )
+        summary_paranoid_gold_stars = REVIEW_STATUS_STARS.get(
+            germline_classification.review_status, 0
+        )
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f83d1b5 and 3d4e2a9.

📒 Files selected for processing (1)
  • rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py (3 hunks)
🔇 Additional comments (2)
rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py (2)

88-88: LGTM!

The trailing tab removal from the conflicting classifications label is a good fix that prevents formatting issues in the output.


147-188: LGTM!

The conflicting classification handling logic is well-structured:

  • Properly filters out low-quality submissions (unspecified, flagged, no criteria, etc.)
  • Correctly aggregates pathogenicities from clinical assertions using a set to avoid duplicates
  • Sorts by pathogenicity severity (most pathogenic first)
  • Has a sensible fallback to "uncertain significance" when no valid pathogenicities are found

Comment on lines +140 to +144
# Summary ClinVar fields are discontinued
summary_clinvar_pathogenicity = []
summary_clinvar_pathogenicity_label = ""
summary_clinvar_review_status_label = ""
summary_clinvar_gold_stars = 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# Search for all usages of summary_clinvar fields in Python files
rg -n "summary_clinvar" --type py

Repository: varfish-org/varfish-db-downloader

Length of output: 2247


🏁 Script executed:

# Read the context around the initialization lines (140-144)
head -n 200 rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py | tail -n +120

Repository: varfish-org/varfish-db-downloader

Length of output: 4055


🏁 Script executed:

# Check the TSV output headers to see what fields are included
sed -n '240,270p' rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py

Repository: varfish-org/varfish-db-downloader

Length of output: 797


🏁 Script executed:

# Check if there's any documentation or changelog about this change
rg -n "paranoid\|discontinued" --type py --type md --type txt -i

Repository: varfish-org/varfish-db-downloader

Length of output: 59


Deprecation of summary_clinvar fields is intentional but document the transition.

The discontinuation of summary_clinvar_* fields (set to empty values at lines 140-144) is an intentional refactoring in favor of summary_paranoid_* fields, which are calculated using improved classification logic. Since both old and new fields are preserved in the TSV output, backward compatibility is maintained. However, add a clear comment or documentation note explaining that downstream consumers should migrate to the new summary_paranoid_* fields, as the legacy fields will remain empty.

🤖 Prompt for AI Agents
rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py around lines 140-144: the
summary_clinvar_* variables are intentionally set to empty for a migration to
summary_paranoid_* logic; add a short inline comment above these assignments
stating that the legacy summary_clinvar_* fields are deprecated and
intentionally left blank, that downstream consumers should migrate to
summary_paranoid_* fields (which are the new authoritative values), and that
both old and new columns remain in the TSV for backward compatibility; also add
a brief note to the TSV output generation (or repository README) indicating the
deprecation and recommended migration target so users know to stop relying on
the empty legacy fields.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants