Release WASP2 v1.2.1: Rust acceleration + unified pipeline #9
Closed
Conversation
Created a detailed task breakdown:

- Phase 1: Explore & Understand (10-14 days)
  - Code inventory and architecture mapping
  - Deep dive into the counting, analysis, and mapping modules
  - Security and performance audit
  - Test coverage analysis
- Phase 2: Clean & Refactor (20-28 days)
  - Critical security and correctness fixes
  - Comprehensive test suite (>80% coverage)
  - Code quality improvements across all modules
  - Documentation finalization
  - Performance optimization

Total timeline: 6-8 weeks. Deliverables: 30+ documentation and code outputs. Success metrics and risk mitigation strategies included.
Created comprehensive documentation for the WASP2 codebase analysis.

## Deliverables (Tasks 1.1.1, 1.1.2, 1.1.3)

### 1. docs/DIRECTORY_STRUCTURE.md (180 lines)
- Complete repository structure mapping
- All 24 Python files documented with line counts and purposes
- Entry points identified (bin/WASP2 + 3 module __main__.py)
- File organization by module (counting: 1,430 LOC, analysis: 2,776 LOC, mapping: 1,562 LOC)
- Critical observations: bin/WASP2 is broken, no tests found, code duplication exists

### 2. docs/DEPENDENCIES.md (490 lines)
- Analysis of all 9 dependencies (numpy, pandas, polars, scipy, pysam, pybedtools, bedtools, typer, anndata)
- Usage patterns across the codebase
- Risk assessment for each dependency
- Version constraint recommendations (currently none specified)
- Security vulnerabilities identified (no version pinning)
- Pandas vs. Polars redundancy analysis

### 3. docs/ARCHITECTURE.md (640 lines)
- High-level system architecture with ASCII diagrams
- Module architecture breakdowns (counting, analysis, mapping)
- Data flow diagrams for bulk and single-cell workflows
- Design patterns identified (orchestrator, file-based data passing)
- Integration points between modules
- 10 architectural issues identified (3 critical, 4 medium, 3 minor)
- Technology stack layer analysis

## Key Findings

**Critical issues:**
- bin/WASP2 main executable is incomplete/broken
- No test suite found (zero unit or integration tests)
- Code duplication between modules (e.g., count_alleles.py in multiple places)

**Architecture:**
- Modular pipeline design with 3 independent modules
- File-based interfaces for module communication
- Typer CLI framework for all commands
- Beta-binomial statistical model for allelic imbalance analysis
- 3-step WASP mapping pipeline for bias correction

**Dependencies:**
- 5,768 total lines of Python code
- No version constraints (security risk)
- Both pandas and polars used (intentional, but adds complexity)
- scipy critical for statistical correctness

## Phase 1.1 Status: ✅ COMPLETE
Next: Task 1.2 - Module Deep Dive: Counting
Created a framework for end-to-end pipeline testing and regression validation, to support Phase 2 refactoring with confidence.

## What We Did
Attempted a full pipeline execution to:
1. Understand the end-to-end flow
2. Validate code review findings
3. Establish a baseline/regression testing framework

## Critical Findings
**Missing dependencies (BLOCKERS):**
- bcftools: called by filter_variant_data.py but NOT in environment.yml
- bedtools: called by filter_variant_data.py but NOT in environment.yml
- samtools: required by the mapping module but NOT in environment.yml

These tools are invoked via subprocess but missing from the conda environment. The pipeline cannot run without them.

## Baseline Infrastructure Created

### 1. Execution scripts
- scripts/run_baseline.sh (5.2 KB) - automated baseline creation
  - Runs the Counting → Analysis pipeline
  - Saves MD5 checksums for regression testing
  - Captures timing and metadata
- scripts/validate_against_baseline.sh (4.3 KB) - automated regression testing
  - Compares outputs against the baseline
  - Shows diffs if outputs changed
  - Pass/fail reporting

### 2. Documentation
- baselines/PIPELINE_EXECUTION_PLAN.md
  - Complete step-by-step execution guide
  - Both Counting-only and full pipeline options
  - Dependency requirements
  - Expected outputs and validation procedures
- PIPELINE_RUN_FINDINGS.md
  - Summary of the execution attempt
  - Confirmed blockers
  - Test data validated
  - Next steps outlined

### 3. Test data
- Extracted the test bundle to test_data/
- Validated VCF, BAM, and BED files
- 111,454 NA12878 chr10 SNPs (hg38)
- as_counts.txt shows the expected output format

## How Baselines Support Phase 2

**Sanity-check workflow:**
```bash
# After each refactor:
./scripts/validate_against_baseline.sh

# If PASS: the refactor is safe (no regression)
# If FAIL: either a bug was introduced or the baseline needs updating
```

This gives us **confidence** that refactoring doesn't break functionality.

## Validated Code Review Findings
- ✅ Confirmed: bcftools missing (subprocess call fails)
- ✅ Confirmed: bedtools missing (subprocess call fails)
- ✅ Confirmed: samtools needed (for the mapping module)
- ⏳ Pending: binary search performance issue (needs an environment to test)
- ⏳ Pending: sample parsing bug (needs an environment to test)

## Next Steps
1. Fix environment.yml (add bcftools, bedtools, samtools)
2. Run ./scripts/run_baseline.sh to establish a baseline
3. Continue Phase 1 module documentation
4. Use baselines for Phase 2 regression testing

## Files Added
- scripts/run_baseline.sh (executable)
- scripts/validate_against_baseline.sh (executable)
- baselines/PIPELINE_EXECUTION_PLAN.md (11 KB)
- PIPELINE_RUN_FINDINGS.md (10 KB)
- test_data/README.md, test_data/as_counts.txt (reference files)
- .gitignore updates (ignore extracted test data and baseline outputs)
Added critical dependencies that are called via subprocess but were missing:
- bcftools: required by src/counting/filter_variant_data.py (VCF filtering)
- samtools: required by the src/mapping/ module (BAM operations)
- typing_extensions: used in all __main__.py files for Annotated types

These tools are invoked directly in the code but were not installable, causing pipeline failures. This fixes the blockers identified in PIPELINE_RUN_FINDINGS.md.

Closes: critical blocker for baseline execution
Created a status document showing:
- What was fixed (environment.yml)
- How to run baselines (step by step)
- Expected outputs and artifacts
- Regression testing workflow
- Current status of all components

The environment is now ready for baseline execution once conda is available. Users can follow the guide to install the environment and run baselines on their local machine.

Next: either continue Phase 1 docs or run the baseline first.
Created technical documentation and an issues inventory for the counting module through detailed line-by-line code analysis.

## Deliverables (1,200+ lines of documentation)

### 1. docs/modules/COUNTING_MODULE.md (620 lines)
Complete technical deep dive covering:
- Architecture and design patterns (Orchestrator + Worker)
- File-by-file analysis of all 7 files (1,430 LOC total)
- Key algorithms (binary search, VCF filtering, GTF parsing)
- Data structures (Polars DataFrames, AnnData format)
- API reference with signatures and examples
- Bulk vs. single-cell workflow comparison

### 2. docs/modules/COUNTING_ISSUES.md (600 lines)
Actionable technical-debt inventory with 24 catalogued issues:

**Critical (3):**
- C1: Binary search implemented but NOT USED (performance bug)
- C2: AnnData dimensions transposed (breaks scanpy/SnapATAC2)
- C3: Sample parsing bug (TypeError on None input)

**High priority (5):**
- H1: 80% code duplication (WaspCountFiles vs. WaspCountSC)
- H2: 123 lines of dead code (old parse functions never called)
- H3: No error handling (print instead of raise)
- H4: Unclear read_set behavior (intentional or bug?)
- H5: Debug code in production (line 140)

**Medium priority (8):**
- M1: No type hints
- M2: Missing docstrings
- M3: No progress indicators
- M4: Hard-coded constants
- M5: Commented-out code
- M6: Inconsistent subprocess error handling
- M7: No input validation
- M8: Regex duplication

**Low priority (8):**
- 13 TODO comments throughout the codebase

## Key Findings

### Performance issue
A binary search function exists (`find_read_aln_pos()`) but bulk counting uses an O(n) linear search. The single-cell version DOES use binary search - the inconsistency suggests an oversight rather than a design choice (see the sketch after this message).

### Compatibility issue
AnnData output has transposed dimensions (SNPs × Cells instead of Cells × SNPs), breaking standard scanpy/SnapATAC2 workflows.

### Code quality
- 14% of the codebase is duplicated between bulk and single-cell
- 8.6% of the codebase is dead code (never called)
- Widespread use of `print()` instead of exceptions
- No type hints or comprehensive docstrings

## Phase 2 Roadmap Established
Issues organized into a 4-week sprint plan:
- Week 1: Critical fixes (C1-C3, H3) - all Easy to Medium effort
- Week 2: Code quality (H1, H2, H4, H5)
- Week 3: Polish (type hints, docstrings, progress bars)
- Week 4: Cleanup (TODOs, consistent patterns)

All fixes will be validated against baselines (scripts/validate_against_baseline.sh).

## Analysis Methodology
1. Read all 7 files line-by-line (completed earlier)
2. Document architecture and data flow
3. Identify algorithms and their complexity
4. Catalog all issues by severity
5. Propose fixes with effort estimates
6. Create Phase 2 prioritization

## Files Analyzed
- __main__.py (221 lines) - CLI interface
- run_counting.py (230 lines) - bulk orchestrator
- count_alleles.py (123 lines) - core bulk counting
- filter_variant_data.py (237 lines) - VCF filtering
- parse_gene_data.py (213 lines) - GTF/GFF3 parsing
- run_counting_sc.py (178 lines) - single-cell orchestrator
- count_alleles_sc.py (228 lines) - core single-cell counting

Total: 1,430 LOC fully documented

## Next Steps
Phase 1.3: Analysis Module Deep Dive (as_analysis.py - 674 lines, most complex)
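To illustrate why the unused binary search matters, here is a minimal sketch of the general technique; this is not the repo's `find_read_aln_pos()` implementation, and the names and values are illustrative only.

```python
# Sketch: binary search (bisect) over a read's sorted reference positions
# instead of a linear scan. Assumed names, not the repo's actual code.
from bisect import bisect_left
from typing import Optional, Sequence

def find_aln_index(ref_positions: Sequence[int], snp_pos: int) -> Optional[int]:
    """Return the index of snp_pos in the sorted reference positions, or None.

    O(log n) per lookup versus O(n) for a linear scan, which adds up when every
    read overlapping a SNP is checked.
    """
    i = bisect_left(ref_positions, snp_pos)
    if i < len(ref_positions) and ref_positions[i] == snp_pos:
        return i
    return None

# Example with pysam-style positions (values invented):
# ref_positions = read.get_reference_positions()  # sorted ascending
print(find_aln_index([100, 101, 102, 150, 151], 150))  # -> 3
```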
Analyzed all 10 files in the analysis module (2,779 LOC - 48% of the codebase).

Deliverables:
- docs/modules/ANALYSIS_MODULE.md: technical overview of the statistical engine
- docs/modules/ANALYSIS_ISSUES.md: 18 catalogued issues with fix estimates

Key findings:
- Beta-binomial model with LRT for allelic imbalance detection
- Phased/unphased genotype support with dynamic programming
- 300+ lines of dead commented code (A1)
- Same None-check bug as the counting module (A3)
- Duplicate allele counting code (A4)
- Debug prints in production (A2)

Module breakdown:
- Core: as_analysis.py (674 LOC), as_analysis_sc.py (258 LOC)
- Comparison: compare_ai.py (516 LOC)
- Orchestrators: run_*.py (547 LOC combined)
- Utils: filter_data.py, count_alleles*.py (432 LOC)

Estimated cleanup effort: 20 hours over a 4-week sprint
Analyzed all 7 files in the mapping module (1,569 LOC - 27% of the codebase).

Deliverables:
- docs/modules/MAPPING_MODULE.md: WASP algorithm technical overview
- docs/modules/MAPPING_ISSUES.md: 12 catalogued issues

Key findings:
- WASP mapping bias removal via allele swapping
- Two-step pipeline: make_reads → [external remap] → filter_remapped
- Phased paired-end reads ONLY (single-end/unphased not implemented)
- Same None-check bug as the other modules (M1)
- 241 lines of dead commented code (M3)
- Memory risk from loading all read names into RAM (M6)

Module breakdown:
- Core: make_remap_reads.py (499 LOC), intersect_variant_data.py (306 LOC)
- Orchestration: run_mapping.py (240 LOC), __main__.py (180 LOC)
- Utils: remap_utils.py, filter_remap_reads.py, wasp_data_files.py (344 LOC)

Decorator patterns: @tempdir_decorator, @check_filt_input
Estimated cleanup effort: 14 hours (+3-4 days for unphased support)
Use /usr/bin/time -v to track peak memory usage during pipeline execution.
Changes:
- Wrap counting/analysis commands with /usr/bin/time -v
- Extract 'Maximum resident set size' from the output (see the sketch after this message)
- Save full memory profiles to baselines/{counting,analysis}/memory_profile.txt
- Include peak memory in console output and metadata file
- Show memory usage in final summary
Memory metrics captured:
- Peak RSS (Maximum resident set size) in KB and MB
- Full /usr/bin/time -v output including:
* Page faults (major/minor)
* Context switches
* I/O operations
* CPU usage percentage
This enables:
1. Performance regression detection (memory creep)
2. Validation of memory optimizations
3. Comparison before/after refactoring
Addresses issues: M6 (memory risk), A9 (no chunking)
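As a rough illustration of how the peak-RSS extraction can be done, here is a minimal sketch; the function name and file paths are assumptions, not the baseline scripts' actual code.

```python
# Sketch: wrap a command with GNU time (-v) and parse peak RSS from its report.
import re
import subprocess

def run_with_memory_profile(cmd, profile_path):
    """Run cmd under /usr/bin/time -v and return (returncode, peak_rss_kb)."""
    with open(profile_path, "w") as profile:
        proc = subprocess.run(
            ["/usr/bin/time", "-v"] + cmd,
            stderr=profile,  # /usr/bin/time -v writes its report to stderr
        )
    peak_kb = None
    with open(profile_path) as profile:
        for line in profile:
            m = re.search(r"Maximum resident set size \(kbytes\):\s*(\d+)", line)
            if m:
                peak_kb = int(m.group(1))
    return proc.returncode, peak_kb

# Example (command and paths illustrative):
# rc, peak = run_with_memory_profile(
#     ["python", "src/counting/__main__.py", "count-variants"],
#     "baselines/counting/memory_profile.txt",
# )
# print(f"Peak RSS: {peak} KB ({peak / 1024:.0f} MB)")
```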
Complete end-to-end WASP2 pipeline execution script with memory profiling.

Pipeline stages:
1. WASP Mapping (optional, skipped if no genome reference)
   - make_reads: generate swapped-allele FASTQs
   - BWA remap: realign to the genome
   - filter_remapped: keep reads mapping to the same position
2. Counting: count alleles on the WASP-filtered BAM
3. Analysis: detect allelic imbalance

Memory tracking per step:
- /usr/bin/time -v for each major operation
- Peak RSS captured for all steps
- Overall pipeline peak memory calculated

Output metrics:
- Read counts (original vs. WASP-filtered, % kept)
- SNP counts and total allele counts (ref/alt/other)
- Regions tested and significant regions (FDR < 0.05)
- Time and memory per step
- MD5 checksums for regression testing

Features:
- Graceful degradation (skips mapping if no genome)
- Comprehensive metadata file
- Sample outputs saved (head -20)
- Sanity checks on output files

Usage: ./scripts/run_full_pipeline_baseline.sh

Requirements:
- bcftools, bedtools, samtools, python (required)
- bwa + genome reference (optional, for mapping)
Critical fixes for running the WASP2 pipeline:

1. Revert relative imports in __main__.py files
   - Modules use sys.path manipulation (lines 220-222)
   - Works with direct execution: python src/MODULE/__main__.py
   - Does NOT work with the python -m approach
2. Fix polars API breakage (discovered during the baseline run)
   - src/counting/run_counting.py:228
   - Changed: has_header=True → include_header=True (see the sketch after this message)
   - Newer polars versions renamed this parameter
3. Update the baseline script to use direct module execution
   - Changed from: python -m src.counting count-variants
   - Changed to: python src/counting/__main__.py count-variants

The pipeline now executes successfully end to end:
- Counting: 111,454 SNPs in 11s (639 MB peak memory)
- Analysis: 43 regions in 4s (341 MB peak memory)

Issues discovered:
- Polars DataFrame API mismatch (FIXED)
- Module import pattern not compatible with python -m (DOCUMENTED)
- PerformanceWarning on LazyFrame.columns access
- DataOrientationWarning on DataFrame construction
- FutureWarning on groupby.apply with grouping columns
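A minimal sketch of the write_csv parameter rename; the DataFrame contents are invented for illustration, and the real call lives in src/counting/run_counting.py:228.

```python
# Sketch: polars renamed write_csv's header flag in newer releases.
import polars as pl

df = pl.DataFrame({"chrom": ["chr10"], "pos": [12345], "ref_count": [7], "alt_count": [3]})

# Older polars accepted has_header=...:
# df.write_csv("counts.tsv", separator="\t", has_header=True)      # old API
df.write_csv("counts.tsv", separator="\t", include_header=True)    # current API
```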
Baseline execution results from the full WASP2 pipeline run.

Inputs:
- BAM: test_data/CD4_ATACseq_Day1_merged_filtered.sort.bam (7.6 MB)
- VCF: test_data/filter_chr10.vcf (11.5 MB)
- BED: test_data/NA12878_snps_chr10.bed (2.6 MB)
- Sample: NA12878

Outputs:
- baselines/counting/counts.tsv (111,454 SNPs)
  - Total alleles: 3,041 (ref: 1,971, alt: 1,061, other: 9)
  - MD5: 612330f6ce767e5d014d1acb82159564
  - Time: 11s, Memory: 639 MB
- baselines/analysis/ai_results.tsv (43 regions)
  - Significant regions (FDR < 0.05): 0
  - MD5: febc7046e96deb0f5aa14de8135b439d
  - Time: 4s, Memory: 341 MB

Memory profiles:
- baselines/counting/memory_profile.txt (full /usr/bin/time -v output)
- baselines/analysis/memory_profile.txt (full /usr/bin/time -v output)

Metadata:
- baselines/pipeline_metadata.txt (comprehensive run summary)

Sample outputs:
- baselines/counting/counts_head20.txt
- baselines/analysis/ai_results_head20.txt

Intermediate files:
- baselines/counting/temp/ (VCF→BED conversion, intersections)

Use these baselines for:
1. Regression testing after refactoring
2. Performance comparison (time/memory)
3. Output validation (MD5 checksums)
4. Sanity-checking pipeline correctness

Total pipeline: 15s, 639 MB peak memory
Two critical API fixes discovered during full pipeline testing:

1. Fix pandas DataFrame.to_list() → tolist()
   - src/analysis/as_analysis.py:278, 327
   - Use .tolist() for compatibility across pandas versions
2. Add BAM sorting/indexing after the BWA remap
   - scripts/run_full_pipeline_baseline.sh:145-150
   - BWA output needs sorting before filtering
   - Added: samtools sort + samtools index
   - Fixes: 'fetch called on bamfile without index' error

Known issue documented:
- WASP mapping generates 0 swapped reads on the test data
- to_remap.bam has 5,450 reads, but the swap produces empty FASTQs
- Issue in swap_chrom_alleles() - requires investigation
- Counting/Analysis work correctly with the original BAM

Baselines remain from the successful run without mapping:
- 111,454 SNPs counted (11s, 639 MB)
- 43 regions analyzed (4s, 341 MB)
Critical fix for Polars 1.35+ API change:
partition_by('column', as_dict=True) now returns dict with TUPLE KEYS:
- Old behavior: {'read_name': DataFrame}
- New behavior: {('read_name',): DataFrame}
This caused 100% of reads to be skipped during allele swapping:
- Before fix: "processed 0 reads"
- After fix: "processed 4,878 reads"
Files modified:
- src/mapping/make_remap_reads.py:139-147 (swap_chrom_alleles)
- src/mapping/make_remap_reads.py:261-268 (swap_chrom_alleles_multi)
Fix: extract the first element from tuple keys (see the sketch after this message):
{k[0] if isinstance(k, tuple) else k: v for k, v in partition.items()}
Tested:
✓ 4,878 swapped reads generated (was 0)
✓ 9,756 lines in FASTQ files (4,878 × 2)
✓ Read names tagged with _WASP_
This completes the Polars API compatibility trilogy:
1. has_header → include_header (write_csv)
2. to_list() → tolist() (pandas compatibility)
3. partition_by tuple keys (this fix)
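For reference, here is a minimal sketch of the partition_by key change and the normalization applied in the fix; the column names and values are illustrative, not the real read schema.

```python
# Sketch: Polars 1.x partition_by(..., as_dict=True) returns tuple keys.
import polars as pl

reads = pl.DataFrame({
    "read_name": ["r1", "r1", "r2"],  # illustrative values, not the actual data
    "pos": [100, 150, 200],
})

partition = reads.partition_by("read_name", as_dict=True)
# Polars 1.x:   keys are tuples, e.g. {("r1",): DataFrame, ("r2",): DataFrame}
# Older polars: keys were plain strings, e.g. {"r1": DataFrame, ...}

# Normalize to string keys so downstream lookups by read name keep working:
partition = {k[0] if isinstance(k, tuple) else k: v for k, v in partition.items()}
assert "r1" in partition
```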
… Baseline
Phase 1.3 Deliverables (Analysis Module Deep Dive):
- docs/modules/ANALYSIS_MODULE.md: Comprehensive 231-line technical overview
* Statistical methodology (beta-binomial models, LRT)
* Three-command pipeline (bulk, single-cell, comparison)
* Key optimization functions (opt_prob, opt_phased_new, opt_unphased_dp)
* Data structures and complexity metrics (2,779 LOC total)
- docs/modules/ANALYSIS_ISSUES.md: Technical debt catalog
* 18 issues identified (3 critical, 6 high, 7 medium, 2 low)
* Critical: None check bug, debug prints, dead code (300+ LOC)
* High: FDR inconsistency, duplicate counting code, no input validation
* Estimated fix time: ~20 hours across 4-week sprint
Phase 1.4 Results (Full Pipeline Baseline):
- Successfully executed complete WASP2 pipeline with all three steps
- baselines/pipeline_metadata.txt: Updated with full metrics
Pipeline Performance (Chr10 Reference):
Step 1: WASP Mapping
- BWA remapping of swapped alleles
- Original reads: 126,061
- WASP filtered: 125,387 (99% kept)
- Time: 8s, Peak Memory: 488 MB
- Substeps: make_reads (488 MB), BWA (241 MB), filter (303 MB)
Step 2: Counting Alleles
- Input: WASP-filtered BAM + VCF
- Output: 111,454 SNPs, 2,388 allele counts
- Time: 9s, Peak Memory: 639 MB
- MD5: 127a81810a43db3cc6924a26f591cc7a
Step 3: Analyzing Imbalance
- Input: 111,454 SNPs across 39 regions
- Beta-binomial statistical model
- Output: 39 regions tested, 0 significant (FDR < 0.05)
- Time: 3s, Peak Memory: 340 MB
- MD5: 394e1a7dbf14220079c3142c5b15bad8
Total Pipeline:
- Time: 20s (8s map + 9s count + 3s analyze)
- Peak Memory: 639 MB (counting step)
- Memory Efficiency: 31.95 MB/s
Reference Genome Setup:
- Downloaded chr10 from UCSC (131 MB uncompressed)
- Indexed with BWA (94s, 224 MB indices)
- Enables complete end-to-end WASP workflow testing
Files Modified:
- baselines/pipeline_metadata.txt: Full 3-step metrics
- baselines/analysis/ai_results_head20.txt: Sample output
- baselines/analysis/memory_profile.txt: 340 MB peak
- baselines/counting/memory_profile.txt: 639 MB peak
- baselines/PIPELINE_EXECUTION_PLAN.md: Removed (obsolete)
This provides complete regression baselines for:
- Performance tracking (time/memory across all steps)
- Output validation (MD5 checksums for counts + analysis)
- Pipeline correctness verification
- Memory profiling for optimization targets
Phase 1 Status: Tasks 1.1-1.4 complete (Architecture, Counting, Analysis, Mapping documented)
Next: Phase 1.5-1.8 (Cross-cutting concerns, testing, security, synthesis)
Exclude the chr10 reference genome and BWA index files from git tracking:
- test_data/*.fa (genome.fa - 131 MB)
- test_data/*.fa.* (BWA indices: .amb, .ann, .bwt, .pac, .sa - 224 MB)

These are large binary files that can be regenerated:

wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr10.fa.gz
gunzip chr10.fa.gz && mv chr10.fa test_data/genome.fa
bwa index test_data/genome.fa

Total excluded: ~355 MB of reference data
Created a regression test framework to validate that code changes don't break functionality or degrade performance during refactoring.

Test coverage (10 tests, all passing):
- ✅ Counting module (4 tests):
  - MD5 checksum validation (output correctness)
  - Output structure validation (columns, types, row count)
  - Memory regression (≤ 639 MB × 1.20 tolerance)
  - Performance regression (≤ 9.26s × 1.30 tolerance)
- ✅ Analysis module (4 tests):
  - MD5 checksum validation
  - Output structure validation (statistical properties)
  - Memory regression (≤ 340 MB × 1.20 tolerance)
  - Performance regression (≤ 2.97s × 1.30 tolerance)
- ✅ Mapping module (1 test):
  - WASP filter rate validation (>95% of reads kept)
- ✅ End-to-end (1 test):
  - Full pipeline reproducibility

Features:
- Validates against committed baseline outputs (MD5 checksums; see the sketch after this message)
- Detects performance/memory regressions within tolerance thresholds
- Tests run in ~25 seconds (fast feedback)
- Comprehensive documentation in tests/regression/README.md

Usage:
- Before refactoring: pytest tests/regression/ -v
- During development (fast tests only): pytest tests/regression/ -v -m "not slow"
- After changes (full suite including E2E): pytest tests/regression/ -v

Files:
- tests/regression/test_pipeline_regression.py (345 LOC)
  - Parses /usr/bin/time -v output for memory/time
  - MD5 validation against baselines
  - Statistical validity checks
- tests/regression/README.md
  - Usage guide and CI integration instructions
  - Baseline update procedures
  - Troubleshooting guide

This provides a safety net for:
- Type hint addition (Phase 2A)
- Architecture refactoring (Phase 2B)
- Code cleanup and optimization
- API changes
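The repo's actual regression tests live in tests/regression/test_pipeline_regression.py; as a rough illustration of the MD5-against-baseline pattern, here is a minimal pytest sketch. The helper name and baseline paths are assumptions; the checksums are the values recorded in this PR.

```python
"""Illustrative MD5 regression check - a sketch, not the repo's actual test code."""
import hashlib
from pathlib import Path

import pytest

BASELINE_MD5 = {
    "baselines/counting/counts.tsv": "127a81810a43db3cc6924a26f591cc7a",
    "baselines/analysis/ai_results.tsv": "394e1a7dbf14220079c3142c5b15bad8",
}

def md5sum(path: Path) -> str:
    """Hash a file in chunks so large outputs don't need to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

@pytest.mark.parametrize("relpath,expected", BASELINE_MD5.items())
def test_output_matches_baseline(relpath, expected):
    path = Path(relpath)
    if not path.exists():
        pytest.skip(f"baseline output {relpath} not present")
    assert md5sum(path) == expected, f"{relpath} changed; investigate or update the baseline"
```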
The full pipeline test (TestFullPipelineIntegration) re-ran the complete pipeline, causing minor timing/memory variations in the baseline files.

Changes (all within normal variance):
- Timestamp updated to the latest run
- Memory: 488 → 489 MB for mapping (1 MB increase, <1%)
- Timing variations: ±1s (normal run-to-run variance)

These micro-variations are expected and within the tolerance thresholds defined in the regression tests (±20% memory, ±30% time).

MD5 checksums unchanged:
- Counting: 127a81810a43db3cc6924a26f591cc7a ✓
- Analysis: 394e1a7dbf14220079c3142c5b15bad8 ✓

Output correctness preserved.
Created a DAG to visualize the Phase 2 refactoring work.

Features:
- Mermaid diagram showing task dependencies
- 5 parallel tracks (Foundation, Architecture, Error Handling, Testing, Optimization)
- 40+ tasks with duration estimates
- Critical path analysis (44 hours over 6 weeks)
- Parallelization opportunities identified

Quick wins (can start now):
- QW-1: Remove dead code (2h)
- QW-3: Fix None check bugs (30m)
- QW-4: Use binary search (30m)
- QW-5: Standardize FDR (1h)

Next after quick wins:
- TH-1/TH-2/TH-3: Type hints (parallel, 8h each)
- Then architecture refactoring
- Then error handling
- Then comprehensive testing

This provides a clear roadmap for Phase 2 work with dependency tracking.
QW-3: Fix None check bugs (4 locations)
- src/counting/__main__.py: lines 118, 203
- src/analysis/__main__.py: lines 216, 331
- Changed 'if len(samples) > 0:' to 'if samples is not None and len(samples) > 0:'
- Prevents a TypeError when CLI args are None (see the sketch after this message)

QW-1: Remove dead code (~207 lines)
- src/analysis/as_analysis.py: removed 147 lines
  - Commented-out parse_opt(), binom_model(), get_imbalance() functions
  - Old opt_phased_new() implementation
- src/analysis/run_analysis.py: removed 57 lines
  - Commented-out WaspAnalysisData class definition
- src/analysis/compare_ai.py: removed 3 lines
  - Commented-out import statements

All regression tests passing (10/10 in 24.85s)
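A minimal sketch of the None-check fix, using an illustrative function name rather than the actual CLI handler code:

```python
# Sketch: guard against None before len(); names are illustrative.
from typing import List, Optional

def select_samples(samples: Optional[List[str]]) -> List[str]:
    # Before: `if len(samples) > 0:` raised TypeError when the CLI passed samples=None.
    # After: check for None first, then emptiness.
    if samples is not None and len(samples) > 0:
        return samples
    return []  # fall back to "all samples" behavior

assert select_samples(None) == []
assert select_samples(["NA12878"]) == ["NA12878"]
```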
Memory usage slightly improved:
- Counting: 658 MB → 651 MB
- Analysis: 349 MB → 348 MB
- Total pipeline: 642 MB → 636 MB

All changes within regression test tolerances.
Removed debug print statements (3 locations):
- src/counting/run_counting_sc.py:140 - removed vars() debug print
- src/analysis/run_compare_ai.py:39-40 - removed 2 debug prints
- src/analysis/as_analysis_sc.py:89 - removed disp debug print

Replaced error prints with proper exceptions (3 locations):
- src/counting/run_counting_sc.py:106 - raise ValueError for invalid feature file type
- src/counting/run_counting.py:93 - raise ValueError for invalid region file type
- src/counting/filter_variant_data.py:211 - raise ValueError for unrecognized BED format

All regression tests passing (10/10 in 25.09s)
Minor timing variations from test execution. All changes within normal tolerances (±5%).
Minor timing variations from regression test execution. All changes within normal tolerances.
Replaced the custom bh_correction() with scipy.stats.false_discovery_control():
- Removed the 40-line custom Benjamini-Hochberg implementation
- Updated 2 call sites to use scipy's version (see the sketch after this message):
  - src/analysis/as_analysis.py:463
  - src/analysis/compare_ai.py:509
- Removed bh_correction imports from 3 files
- Removed the unused rankdata import

Benefits:
- Single source of truth for FDR correction
- Battle-tested scipy implementation
- Reduced maintenance burden
- MD5 validation confirms identical results

All regression tests passing (10/10 in 24.94s)
MD5 checksums unchanged - FDR values are bit-identical
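For reference, a minimal sketch of the scipy-based Benjamini-Hochberg correction (requires scipy >= 1.11); the p-values here are invented for illustration, not taken from the WASP2 outputs.

```python
# Sketch: scipy's built-in BH FDR correction replacing a hand-rolled version.
import numpy as np
from scipy import stats

pvals = np.array([0.001, 0.02, 0.04, 0.30, 0.75])

# Benjamini-Hochberg adjusted p-values (monotone, >= raw p-values, capped at 1.0)
fdr = stats.false_discovery_control(pvals, method="bh")
print(fdr)
```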
Minor timing variations from test execution. All MD5 checksums unchanged - FDR values validated.
Resolve README conflict by combining both descriptions.
Added type hints to all 7 files in the src/mapping/ module:
- __main__.py: return types for Typer commands
- wasp_data_files.py: WaspDataFiles class with Union types
- filter_remap_reads.py: file path parameters
- run_mapping.py: complex decorator typing with Callable
- intersect_variant_data.py: subprocess and BedTool operations
- remap_utils.py: generator type hints for paired-read iteration
- make_remap_reads.py: ReadStats class and WASP algorithm functions

Key technical points:
- Generator[Tuple[AlignedSegment, AlignedSegment], None, None] for paired reads (see the sketch after this message)
- Callable[..., Any] decorator patterns matching TH-1
- Union[str, Path] and Optional types for file paths
- pl.DataFrame and List[str] for data structures

mypy validation: 11 type errors found (beneficial - catching potential bugs)
All runtime behavior unchanged - type hints are documentation only

Implements TH-3 from docs/TYPE_HINTS_MAPPING_PLAN.md
Estimated time: 16.5 hours (2.0x difficulty vs. counting)
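A sketch of the paired-read generator typing pattern mentioned above; the function body and names are assumptions for illustration, not the repo's remap_utils.py code.

```python
# Sketch: a typed generator yielding paired reads from a pysam AlignmentFile.
from typing import Generator, Tuple

from pysam import AlignedSegment, AlignmentFile

def iter_read_pairs(
    bam: AlignmentFile,
) -> Generator[Tuple[AlignedSegment, AlignedSegment], None, None]:
    """Yield (read1, read2) pairs by buffering mates until both are seen."""
    pending: dict[str, AlignedSegment] = {}
    for read in bam.fetch(until_eof=True):
        if read.is_secondary or read.is_supplementary:
            continue
        mate = pending.pop(read.query_name, None)
        if mate is None:
            pending[read.query_name] = read
        else:
            # Order pairs consistently as (read1, read2)
            yield (mate, read) if mate.is_read1 else (read, mate)
```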
Added proper dependency management and setup documentation.

Files added:
- DEVELOPMENT.md: complete setup guide for contributors
  - Conda and pip installation instructions
  - Testing and type-checking commands
  - Common issues and solutions
- requirements.txt: pip-based dependency specification
  - All Python packages with version constraints
  - Note about system dependencies (bcftools, samtools, bedtools)

Files updated:
- environment.yml: enhanced conda environment
  - Updated Python to 3.11.* (from 3.9.*)
  - Added pytest>=7.0 for testing
  - Added pytest-cov for coverage reports
  - Added mypy for type checking
  - Organized with comments

This resolves test environment setup issues and ensures reproducibility.

Testing workflow:
- Option 1 (conda): conda env create -f environment.yml
- Option 2 (pip): pip install -r requirements.txt
- Run tests: python -m pytest tests/ -v
- Add synthetic INDEL dataset + quickbench commands for indel parity and trim-combo invariants - Add regression tests to lock indel behavior - Add region-based dev harness to compare unified vs multi-pass on real subsets
- Make multi-pass remapper emit original-orientation FASTQs for reverse reads - Fix dev harness tempdir creation when --keep-tmp is unset - Add synthetic reverse-strand pairR + assertions to quickbench SNV parity test
- Avoid split/flatten allocations in generate_haplotype_seqs_view for non-indel cases - Keep exact behavior by falling back to the slow path when query spans don't match allele length
- Use SmallVec for per-read overlap/span buffers in unified pipeline - Add WASP2_BAM_THREADS knob to tune htslib decompression vs Rayon workers - Switch Python parallel entrypoint to per-call Rayon pool (supports sweeps) - Add dev harness thread_sweep + Rust unified_profile binary for valgrind profiling - Export rlib for internal binaries and silence a doctest-only example
- Figure 1: Speed benchmarks with Rust acceleration (7x faster)
- Figure 2: Allele counting accuracy comparison
- Figure 3: Statistical analysis with QTL replication
- Figure 4: Single-cell analysis framework

Includes:
- All figure generation scripts
- Benchmark data (TSV/JSON)
- Publication-quality plots (PDF/PNG/TIFF)
- Validation and release scripts

Data validated against 137 RNA + 137 ATAC iPSCORE samples.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Deterministic seeds for reproducible subsampling
- Avoid qstat segfault in scaling jobs
- Add INDEL variant FAIR comparison script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Documentation:
- Analysis summary, data validation, experiment reports
- Figure engineering plan, statistical report
- Figure gallery

Figure data:
- Figure 1 panel exports (PDF/PNG/SVG)
- Figure 2 timing results and tools (STAR 2.7.4a archive)
- Figure 3 statistical data (QQ, volcano, QTL replication)

Data:
- CVPC eQTL stats for QTL replication analysis

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removes paper/benchmark materials (71 GB) for a clean software release:
- Removed: paper/, benchmarking/, results/, validation/
- Removed: baselines/, benchmarks/, figures/, gm12878_benchmark/
- Removed: scripts/, data/, doc/, .devcontainer/
- Removed: benchmark_*.py, profile_*.py, simulate_*.py, etc.

Retained core software (~4 MB):
- src/ (Python modules: counting, mapping, analysis)
- rust/src/ (Rust acceleration code)
- tests/ (test suite)
- docs/ (Sphinx documentation)
- .github/ (CI/CD workflows)
- Package config (pyproject.toml, environment.yml, etc.)

Full reproducibility preserved in the WASP2-exp/ropc-indels branch.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Closing - will create a proper release PR after polishing in a separate development workflow.
Summary
Changes in this release
This release extracts the core WASP2 software (~4MB) from the development repository, removing paper/benchmark materials (71GB) to create a clean, installable package.
Core features
- wasp2-count: allele-specific read counting with a Rust backend
- wasp2-map: mapping bias removal pipeline
- wasp2-analyze: statistical analysis for ASE

Technical improvements
Test Plan
- pytest tests/ passes
- maturin build
- wasp2-count --help
- wasp2-map --help
- wasp2-analyze --help

Notes
🤖 Generated with Claude Code