-
Notifications
You must be signed in to change notification settings - Fork 3
WIP: clinvar-to-tsv changes for new clinvar filter #109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
WalkthroughThe change modifies ClinVar pathogenicity handling by simplifying summary field initialization and introducing paranoid-pathogenicity processing for conflicting germline classifications that aggregates contributing pathogenicities with uncertain significance fallback. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~35–50 minutes Areas requiring extra attention:
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py (2)
189-209: Consider extracting common token-matching logic and clarifying sorting behavior.Two observations:
Code duplication: The token extraction and PATHOGENICITIES matching logic (lines 197-200) is similar to lines 167-170 in the conflicting branch. Consider extracting a helper function.
Sorting inconsistency: The conflicting case sorts pathogenicities by severity (most pathogenic first), but the non-conflicting case preserves the original order from the description. If this is intentional (e.g., respecting ClinVar's ordering), a brief comment would clarify. If not, consider applying consistent sorting.
🔎 Optional helper function extraction:
def extract_pathogenicities(tokens: typing.Iterable[str]) -> list[str]: """Extract known pathogenicity terms from tokens.""" result = [] for token in tokens: if token in PATHOGENICITIES: result.append(token) return resultThen use in both branches:
- for token in description_tokens: - for key in PATHOGENICITIES.keys(): - if key == token: - contributing_pathogenicities.add(key) + contributing_pathogenicities.update( + extract_pathogenicities(description_tokens) + )
185-188: Consider handling unknown review status values.If ClinVar introduces a new
AggregateGermlineReviewStatusvalue not present inREVIEW_STATUS_LABELSorREVIEW_STATUS_STARS, the dictionary lookups will raiseKeyError. Consider using.get()with a sensible default, or explicitly handling unknown statuses.🔎 Example with defaults:
- summary_paranoid_review_status_label = REVIEW_STATUS_LABELS[ - germline_classification.review_status - ] - summary_paranoid_gold_stars = REVIEW_STATUS_STARS[germline_classification.review_status] + summary_paranoid_review_status_label = REVIEW_STATUS_LABELS.get( + germline_classification.review_status, "unknown" + ) + summary_paranoid_gold_stars = REVIEW_STATUS_STARS.get( + germline_classification.review_status, 0 + )
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py(3 hunks)
🔇 Additional comments (2)
rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py (2)
88-88: LGTM!The trailing tab removal from the conflicting classifications label is a good fix that prevents formatting issues in the output.
147-188: LGTM!The conflicting classification handling logic is well-structured:
- Properly filters out low-quality submissions (unspecified, flagged, no criteria, etc.)
- Correctly aggregates pathogenicities from clinical assertions using a set to avoid duplicates
- Sorts by pathogenicity severity (most pathogenic first)
- Has a sensible fallback to "uncertain significance" when no valid pathogenicities are found
| # Summary ClinVar fields are discontinued | ||
| summary_clinvar_pathogenicity = [] | ||
| summary_clinvar_pathogenicity_label = "" | ||
| summary_clinvar_review_status_label = "" | ||
| summary_clinvar_gold_stars = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🧩 Analysis chain
🏁 Script executed:
# Search for all usages of summary_clinvar fields in Python files
rg -n "summary_clinvar" --type pyRepository: varfish-org/varfish-db-downloader
Length of output: 2247
🏁 Script executed:
# Read the context around the initialization lines (140-144)
head -n 200 rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py | tail -n +120Repository: varfish-org/varfish-db-downloader
Length of output: 4055
🏁 Script executed:
# Check the TSV output headers to see what fields are included
sed -n '240,270p' rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.pyRepository: varfish-org/varfish-db-downloader
Length of output: 797
🏁 Script executed:
# Check if there's any documentation or changelog about this change
rg -n "paranoid\|discontinued" --type py --type md --type txt -iRepository: varfish-org/varfish-db-downloader
Length of output: 59
Deprecation of summary_clinvar fields is intentional but document the transition.
The discontinuation of summary_clinvar_* fields (set to empty values at lines 140-144) is an intentional refactoring in favor of summary_paranoid_* fields, which are calculated using improved classification logic. Since both old and new fields are preserved in the TSV output, backward compatibility is maintained. However, add a clear comment or documentation note explaining that downstream consumers should migrate to the new summary_paranoid_* fields, as the legacy fields will remain empty.
🤖 Prompt for AI Agents
rules/pre-mehari/snakefiles/scripts/clinvar-to-tsv.py around lines 140-144: the
summary_clinvar_* variables are intentionally set to empty for a migration to
summary_paranoid_* logic; add a short inline comment above these assignments
stating that the legacy summary_clinvar_* fields are deprecated and
intentionally left blank, that downstream consumers should migrate to
summary_paranoid_* fields (which are the new authoritative values), and that
both old and new columns remain in the TSV for backward compatibility; also add
a brief note to the TSV output generation (or repository README) indicating the
deprecation and recommended migration target so users know to stop relying on
the empty legacy fields.
Summary by CodeRabbit
Bug Fixes
Refactor
✏️ Tip: You can customize this high-level summary in your review settings.