
Conversation

@Jaureguy760
Collaborator

Summary

  • 7× faster BAM counting via Rust acceleration
  • Unified FASTQ/BAM pipeline for streamlined ASE analysis
  • Comprehensive ATAC-seq/RNA-seq support
  • Enhanced variant handling (SNVs)

Changes in this release

This release extracts the core WASP2 software (~4MB) from the development repository, removing paper/benchmark materials (71GB) to create a clean, installable package.

Core features

  • wasp2-count: Allele-specific read counting with Rust backend
  • wasp2-map: Mapping bias removal pipeline
  • wasp2-analyze: Statistical analysis for ASE

Technical improvements

  • Rust acceleration for BAM operations
  • Multi-threaded processing support
  • Improved memory efficiency

Test Plan

  • pytest tests/ passes
  • Rust builds with maturin build
  • CLI commands work:
    • wasp2-count --help
    • wasp2-map --help
    • wasp2-analyze --help
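The CLI checks above can be scripted; a minimal smoke-test sketch (the `cli_ok` helper is hypothetical, not part of WASP2):

```python
import subprocess

def cli_ok(cmd: list) -> bool:
    """Return True if the command runs and exits 0 (hypothetical helper)."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0

if __name__ == "__main__":
    # The wasp2-* entry points are assumed to be on PATH after installation.
    for tool in ("wasp2-count", "wasp2-map", "wasp2-analyze"):
        print(tool, "OK" if cli_ok([tool, "--help"]) else "MISSING")
```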

Notes

  • Full benchmarking and paper materials preserved in WASP2-exp repository
  • This represents 194 commits of development work since the last release

🤖 Generated with Claude Code

Created detailed task breakdown for:
- Phase 1: Explore & Understand (10-14 days)
  - Code inventory and architecture mapping
  - Deep dive into counting, analysis, and mapping modules
  - Security and performance audit
  - Test coverage analysis

- Phase 2: Clean & Refactor (20-28 days)
  - Critical security and correctness fixes
  - Comprehensive test suite (>80% coverage)
  - Code quality improvements across all modules
  - Documentation finalization
  - Performance optimization

Total timeline: 6-8 weeks
Deliverables: 30+ documentation and code outputs
Success metrics and risk mitigation strategies included
Created comprehensive documentation for WASP2 codebase analysis:

## Deliverables (Task 1.1.1, 1.1.2, 1.1.3):

### 1. docs/DIRECTORY_STRUCTURE.md (180 lines)
- Complete repository structure mapping
- All 24 Python files documented with line counts and purposes
- Entry points identified (bin/WASP2 + 3 module __main__.py)
- File organization by module (counting: 1,430 LOC, analysis: 2,776 LOC, mapping: 1,562 LOC)
- Critical observations: bin/WASP2 is broken, no tests found, code duplication exists

### 2. docs/DEPENDENCIES.md (490 lines)
- Analysis of all 9 dependencies (numpy, pandas, polars, scipy, pysam, pybedtools, bedtools, typer, anndata)
- Usage patterns across codebase
- Risk assessment for each dependency
- Version constraint recommendations (currently none specified)
- Security vulnerabilities identified (no version pinning)
- Pandas vs Polars redundancy analysis

### 3. docs/ARCHITECTURE.md (640 lines)
- High-level system architecture with ASCII diagrams
- Module architecture breakdowns (counting, analysis, mapping)
- Data flow diagrams for bulk and single-cell workflows
- Design patterns identified (orchestrator, file-based data passing)
- Integration points between modules
- 10 architectural issues identified (3 critical, 4 medium, 3 minor)
- Technology stack layer analysis

## Key Findings:

**Critical Issues:**
- bin/WASP2 main executable is incomplete/broken
- No test suite found (zero unit or integration tests)
- Code duplication between modules (e.g., count_alleles.py in multiple places)

**Architecture:**
- Modular pipeline design with 3 independent modules
- File-based interfaces for module communication
- Typer CLI framework for all commands
- Beta-binomial statistical model for allelic imbalance (AI) analysis
- 3-step WASP mapping pipeline for bias correction

**Dependencies:**
- 5,768 total lines of Python code
- No version constraints (security risk)
- Both pandas and polars used (intentional but adds complexity)
- scipy critical for statistical correctness

## Phase 1.1 Status: ✅ COMPLETE

Next: Task 1.2 - Module Deep Dive: Counting
Created comprehensive framework for end-to-end pipeline testing and
regression validation to support Phase 2 refactoring with confidence.

## What We Did:

Attempted full pipeline execution to:
1. Understand end-to-end flow
2. Validate code review findings
3. Establish baseline/regression testing framework

## Critical Findings:

**Missing Dependencies** (BLOCKERS):
- bcftools: Called by filter_variant_data.py but NOT in environment.yml
- bedtools: Called by filter_variant_data.py but NOT in environment.yml
- samtools: Required by mapping module but NOT in environment.yml

These tools are invoked via subprocess but missing from conda environment.
Pipeline cannot run without them.

## Baseline Infrastructure Created:

### 1. Execution Scripts:
- scripts/run_baseline.sh (5.2 KB)
  - Automated baseline creation
  - Runs Counting → Analysis pipeline
  - Saves MD5 checksums for regression testing
  - Captures timing and metadata

- scripts/validate_against_baseline.sh (4.3 KB)
  - Automated regression testing
  - Compares outputs against baseline
  - Shows diffs if outputs changed
  - Pass/fail reporting

### 2. Documentation:
- baselines/PIPELINE_EXECUTION_PLAN.md
  - Complete step-by-step execution guide
  - Both Counting-only and Full pipeline options
  - Dependency requirements
  - Expected outputs and validation procedures

- PIPELINE_RUN_FINDINGS.md
  - Summary of execution attempt
  - Confirmed blockers
  - Test data validated
  - Next steps outlined

### 3. Test Data:
- Extracted test bundle to test_data/
- Validated VCF, BAM, BED files
- 111,454 NA12878 chr10 SNPs (hg38)
- as_counts.txt shows expected output format

## How Baselines Support Phase 2:

**Sanity Check Workflow**:
```bash
# After each refactor:
./scripts/validate_against_baseline.sh

# If PASS: Refactor is safe (no regression)
# If FAIL: Either bug introduced or baseline needs update
```

This gives us **confidence** that refactoring doesn't break functionality.
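A Python equivalent of the checksum comparison the script performs might look like this (function names are illustrative, not WASP2 API):

```python
import hashlib
from pathlib import Path

def md5sum(path: Path) -> str:
    """Stream the file in chunks so large TSV/BAM outputs don't fill RAM."""
    h = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def matches_baseline(output: Path, expected_md5: str) -> bool:
    """True when the refactored pipeline reproduced the baseline exactly."""
    return md5sum(output) == expected_md5
```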

## Validated Code Review Findings:

✅ Confirmed: bcftools missing (subprocess call fails)
✅ Confirmed: bedtools missing (subprocess call fails)
✅ Confirmed: samtools needed (for mapping module)
⏳ Pending: Binary search performance issue (needs env to test)
⏳ Pending: Sample parsing bug (needs env to test)

## Next Steps:

1. Fix environment.yml (add bcftools, bedtools, samtools)
2. Run ./scripts/run_baseline.sh to establish baseline
3. Continue Phase 1 module documentation
4. Use baselines for Phase 2 regression testing

## Files Added:

- scripts/run_baseline.sh (executable)
- scripts/validate_against_baseline.sh (executable)
- baselines/PIPELINE_EXECUTION_PLAN.md (11 KB)
- PIPELINE_RUN_FINDINGS.md (10 KB)
- test_data/README.md, test_data/as_counts.txt (reference files)
- .gitignore updates (ignore extracted test data and baseline outputs)
Added critical external tools that are invoked via subprocess but were missing from the environment:
- bcftools: Required by src/counting/filter_variant_data.py (VCF filtering)
- samtools: Required by src/mapping/ module (BAM operations)
- typing_extensions: Used in all __main__.py files for Annotated types

These tools are invoked directly in the code but were not installable,
causing pipeline failures. This fixes the blockers identified in
PIPELINE_RUN_FINDINGS.md.

Closes: Critical blocker for baseline execution
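A preflight check along these lines would surface the missing tools at startup instead of deep inside a subprocess call (sketch, not existing WASP2 code):

```python
import shutil

# External tools the pipeline shells out to, per PIPELINE_RUN_FINDINGS.md.
REQUIRED_TOOLS = ("bcftools", "bedtools", "samtools")

def missing_tools(tools=REQUIRED_TOOLS) -> list:
    """Return the subset of external tools not found on PATH."""
    return [t for t in tools if shutil.which(t) is None]
```

Calling `missing_tools()` at pipeline entry and raising immediately on a non-empty result would have turned the blockers above into a clear error message.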
Created comprehensive status document showing:
- What was fixed (environment.yml)
- How to run baselines (step-by-step)
- Expected outputs and artifacts
- Regression testing workflow
- Current status of all components

The environment is now ready for baseline execution once conda is available.
User can follow the guide to install environment and run baselines on their
local machine.

Next: Either continue Phase 1 docs OR user runs baseline first
Created comprehensive technical documentation and issues inventory
for the counting module through detailed line-by-line code analysis.

## Deliverables (1,200+ lines of documentation):

### 1. docs/modules/COUNTING_MODULE.md (620 lines)
Complete technical deep dive covering:
- Architecture and design patterns (Orchestrator + Worker)
- File-by-file analysis of all 7 files (1,430 LOC total)
- Key algorithms (binary search, VCF filtering, GTF parsing)
- Data structures (Polars DataFrames, AnnData format)
- API reference with signatures and examples
- Bulk vs Single-cell workflow comparison

### 2. docs/modules/COUNTING_ISSUES.md (600 lines)
Actionable technical debt inventory with 24 catalogued issues:

**Critical (3)**:
- C1: Binary search implemented but NOT USED (performance bug)
- C2: AnnData dimensions transposed (breaks scanpy/SnapATAC2)
- C3: Sample parsing bug (TypeError on None input)

**High Priority (5)**:
- H1: 80% code duplication (WaspCountFiles vs WaspCountSC)
- H2: 123 lines of dead code (old parse functions never called)
- H3: No error handling (print instead of raise)
- H4: Unclear read_set behavior (intentional or bug?)
- H5: Debug code in production (line 140)

**Medium Priority (8)**:
- No type hints (M1)
- Missing docstrings (M2)
- No progress indicators (M3)
- Hard-coded constants (M4)
- Commented-out code (M5)
- Inconsistent subprocess error handling (M6)
- No input validation (M7)
- Regex duplication (M8)

**Low Priority (8)**:
- 13 TODO comments throughout codebase

## Key Findings:

### Performance Issue:
Binary search function exists (`find_read_aln_pos()`) but bulk counting
uses an O(n) linear search. The single-cell version DOES use binary search;
the inconsistency suggests an oversight rather than a design choice.
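A bisect-based lookup of the kind the single-cell path uses can be sketched as follows (names are illustrative; this is not WASP2's `find_read_aln_pos()` itself):

```python
import bisect

def snps_in_read(snp_positions, read_start, read_end):
    """Return SNP positions overlapping [read_start, read_end).

    snp_positions must be sorted; bisect gives O(log n) per read
    instead of the O(n) linear scan used in the bulk counting path.
    """
    lo = bisect.bisect_left(snp_positions, read_start)
    hi = bisect.bisect_left(snp_positions, read_end)
    return snp_positions[lo:hi]
```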

### Compatibility Issue:
AnnData output has transposed dimensions (SNPs × Cells instead of
Cells × SNPs), breaking standard scanpy/SnapATAC2 workflows.
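The fix is a transpose before constructing the output; illustrated with plain numpy shapes (the AnnData object itself also supports `adata.T`):

```python
import numpy as np

# Hypothetical counts matrix as produced today: rows = SNPs, cols = cells.
n_snps, n_cells = 1_000, 48
counts = np.zeros((n_snps, n_cells), dtype=np.int32)

# scanpy/SnapATAC2 expect observations (cells) on rows and variables
# (SNPs) on columns, so transpose before building the AnnData object.
counts_cells_by_snps = counts.T
```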

### Code Quality:
- 14% of codebase is duplicated between bulk and single-cell
- 8.6% of codebase is dead code (never called)
- Widespread use of `print()` instead of exceptions
- No type hints or comprehensive docstrings

## Phase 2 Roadmap Established:

Organized issues into 4-week sprint plan:
- Week 1: Critical fixes (C1-C3, H3) - all Easy to Medium effort
- Week 2: Code quality (H1, H2, H4, H5)
- Week 3: Polish (type hints, docstrings, progress bars)
- Week 4: Cleanup (TODOs, consistent patterns)

All fixes will be validated against baselines (scripts/validate_against_baseline.sh).

## Analysis Methodology:

1. Read all 7 files line-by-line (completed earlier)
2. Document architecture and data flow
3. Identify algorithms and their complexity
4. Catalog all issues by severity
5. Propose fixes with effort estimates
6. Create Phase 2 prioritization

## Files Analyzed:

- __main__.py (221 lines) - CLI interface
- run_counting.py (230 lines) - Bulk orchestrator
- count_alleles.py (123 lines) - Core bulk counting
- filter_variant_data.py (237 lines) - VCF filtering
- parse_gene_data.py (213 lines) - GTF/GFF3 parsing
- run_counting_sc.py (178 lines) - Single-cell orchestrator
- count_alleles_sc.py (228 lines) - Core single-cell counting

Total: 1,430 LOC fully documented

## Next Steps:

Phase 1.3: Analysis Module Deep Dive (as_analysis.py - 674 lines, most complex)
Analyzed all 10 files in analysis module (2,779 LOC - 48% of codebase).

Deliverables:
- docs/modules/ANALYSIS_MODULE.md: Technical overview of statistical engine
- docs/modules/ANALYSIS_ISSUES.md: 18 catalogued issues with fix estimates

Key findings:
- Beta-binomial model with LRT for allelic imbalance detection
- Phased/unphased genotype support with dynamic programming
- 300+ lines of dead commented code (A1)
- Same None-check bug as counting module (A3)
- Duplicate allele counting code (A4)
- Debug prints in production (A2)

Module breakdown:
- Core: as_analysis.py (674 LOC), as_analysis_sc.py (258 LOC)
- Comparison: compare_ai.py (516 LOC)
- Orchestrators: run_*.py (547 LOC combined)
- Utils: filter_data.py, count_alleles*.py (432 LOC)

Estimated cleanup effort: 20 hours over 4-week sprint
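The beta-binomial likelihood at the heart of the model can be sketched with `scipy.stats.betabinom` (region fitting and the LRT machinery are omitted; names are illustrative):

```python
import numpy as np
from scipy.stats import betabinom

def loglik(ref_counts, total_counts, a, b):
    """Sum of beta-binomial log-pmf across SNPs in a region."""
    return float(np.sum(betabinom.logpmf(ref_counts, total_counts, a, b)))

# Null model: balanced allele usage (a == b); alternative: a, b fit
# freely. LRT statistic = 2 * (loglik_alt - loglik_null), compared
# against a chi-square distribution.
```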
Analyzed all 7 files in mapping module (1,569 LOC - 27% of codebase).

Deliverables:
- docs/modules/MAPPING_MODULE.md: WASP algorithm technical overview
- docs/modules/MAPPING_ISSUES.md: 12 catalogued issues

Key findings:
- WASP mapping bias removal via allele swapping
- Two-step pipeline: make_reads → [external remap] → filter_remapped
- Phased paired-end reads ONLY (single-end/unphased not implemented)
- Same None-check bug as other modules (M1)
- 241 lines of dead commented code (M3)
- Memory risk loading all read names into RAM (M6)

Module breakdown:
- Core: make_remap_reads.py (499 LOC), intersect_variant_data.py (306 LOC)
- Orchestration: run_mapping.py (240 LOC), __main__.py (180 LOC)
- Utils: remap_utils.py, filter_remap_reads.py, wasp_data_files.py (344 LOC)

Decorator patterns: @tempdir_decorator, @check_filt_input

Estimated cleanup effort: 14 hours (+ 3-4 days for unphased support)
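The allele-swapping step at the core of the WASP algorithm, reduced to the SNV case, looks roughly like this (hypothetical helper; the real code works from aligned-pair coordinates and handles read pairs and indels):

```python
def swap_allele(seq: str, read_start: int, snp_pos: int,
                ref: str, alt: str) -> str:
    """Return seq with the base at snp_pos flipped to the other allele.

    Positions are 0-based and assume an ungapped alignment (no indels),
    which is why the real implementation uses aligned-pair coordinates.
    """
    offset = snp_pos - read_start
    if not 0 <= offset < len(seq):
        return seq  # SNP does not overlap this read
    base = seq[offset]
    other = alt if base == ref else ref
    return seq[:offset] + other + seq[offset + 1:]
```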
Use /usr/bin/time -v to track peak memory usage during pipeline execution.

Changes:
- Wrap counting/analysis commands with /usr/bin/time -v
- Extract 'Maximum resident set size' from output
- Save full memory profiles to baselines/{counting,analysis}/memory_profile.txt
- Include peak memory in console output and metadata file
- Show memory usage in final summary

Memory metrics captured:
- Peak RSS (Maximum resident set size) in KB and MB
- Full /usr/bin/time -v output including:
  * Page faults (major/minor)
  * Context switches
  * I/O operations
  * CPU usage percentage

This enables:
1. Performance regression detection (memory creep)
2. Validation of memory optimizations
3. Comparison before/after refactoring

Addresses issues: M6 (memory risk), A9 (no chunking)
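Extracting the peak RSS from the captured output is a one-regex job (sketch, mirroring what the scripts do with grep):

```python
import re

def peak_rss_kb(time_v_output: str):
    """Parse 'Maximum resident set size' from /usr/bin/time -v output.

    Returns the value in kbytes, or None if the line is absent.
    """
    m = re.search(r"Maximum resident set size \(kbytes\): (\d+)",
                  time_v_output)
    return int(m.group(1)) if m else None
```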
Complete end-to-end WASP2 pipeline execution script with memory profiling.

Pipeline stages:
1. WASP Mapping (optional, skipped if no genome ref)
   - make_reads: Generate swapped allele FASTQs
   - BWA remap: Realign to genome
   - filter_remapped: Keep reads mapping to same position
2. Counting: Count alleles on WASP-filtered BAM
3. Analysis: Detect allelic imbalance

Memory tracking per step:
- /usr/bin/time -v for each major operation
- Peak RSS captured for all steps
- Overall pipeline peak memory calculated

Output metrics:
- Read counts (original vs WASP-filtered, % kept)
- SNP counts and total allele counts (ref/alt/other)
- Regions tested and significant regions (FDR < 0.05)
- Time and memory per step
- MD5 checksums for regression testing

Features:
- Graceful degradation (skips mapping if no genome)
- Comprehensive metadata file
- Sample outputs saved (head -20)
- Sanity checks on output files

Usage:
  ./scripts/run_full_pipeline_baseline.sh

Requirements:
- bcftools, bedtools, samtools, python (required)
- bwa + genome reference (optional, for mapping)
Critical fixes for running WASP2 pipeline:

1. Revert relative imports in __main__.py files
   - Modules use sys.path manipulation (lines 220-222)
   - Works with direct execution: python src/MODULE/__main__.py
   - Does NOT work with python -m approach

2. Fix polars API breakage (discovered during baseline run)
   - src/counting/run_counting.py:228
   - Changed: has_header=True → include_header=True
   - Newer polars versions renamed this parameter

3. Update baseline script to use direct module execution
   - Changed from: python -m src.counting count-variants
   - Changed to: python src/counting/__main__.py count-variants

Pipeline successfully executes end-to-end:
- Counting: 111,454 SNPs in 11s (639 MB peak memory)
- Analysis: 43 regions in 4s (341 MB peak memory)

Issues discovered:
- Polars DataFrame API mismatch (FIXED)
- Module import pattern not compatible with python -m (DOCUMENTED)
- PerformanceWarning on LazyFrame.columns access
- DataOrientationWarning on DataFrame construction
- FutureWarning on groupby.apply with grouping columns
Baseline execution results from full WASP2 pipeline run:

Inputs:
- BAM: test_data/CD4_ATACseq_Day1_merged_filtered.sort.bam (7.6 MB)
- VCF: test_data/filter_chr10.vcf (11.5 MB)
- BED: test_data/NA12878_snps_chr10.bed (2.6 MB)
- Sample: NA12878

Outputs:
- baselines/counting/counts.tsv (111,454 SNPs)
  - Total alleles: 3,041 (ref: 1,971, alt: 1,061, other: 9)
  - MD5: 612330f6ce767e5d014d1acb82159564
  - Time: 11s, Memory: 639 MB

- baselines/analysis/ai_results.tsv (43 regions)
  - Significant regions (FDR < 0.05): 0
  - MD5: febc7046e96deb0f5aa14de8135b439d
  - Time: 4s, Memory: 341 MB

Memory profiles:
- baselines/counting/memory_profile.txt (full /usr/bin/time -v output)
- baselines/analysis/memory_profile.txt (full /usr/bin/time -v output)

Metadata:
- baselines/pipeline_metadata.txt (comprehensive run summary)

Sample outputs:
- baselines/counting/counts_head20.txt
- baselines/analysis/ai_results_head20.txt

Intermediate files:
- baselines/counting/temp/ (VCF BED conversion, intersections)

Use these baselines for:
1. Regression testing after refactoring
2. Performance comparison (time/memory)
3. Output validation (MD5 checksums)
4. Sanity checking pipeline correctness

Total pipeline: 15s, 639 MB peak memory
Two critical API fixes discovered during full pipeline testing:

1. Fix pandas DataFrame.to_list() → tolist()
   - src/analysis/as_analysis.py:278, 327
   - Newer pandas deprecated .to_list() method
   - Use .tolist() for compatibility

2. Add BAM sorting/indexing after BWA remap
   - scripts/run_full_pipeline_baseline.sh:145-150
   - BWA output needs sorting before filtering
   - Added: samtools sort + samtools index
   - Fixes: 'fetch called on bamfile without index' error

Known issue documented:
- WASP mapping generates 0 swapped reads on test data
- to_remap.bam has 5,450 reads but swap produces empty FASTQs
- Issue in swap_chrom_alleles() - requires investigation
- Counting/Analysis work correctly with original BAM

Baselines remain from successful run without mapping:
- 111,454 SNPs counted (11s, 639 MB)
- 43 regions analyzed (4s, 341 MB)
Critical fix for Polars 1.35+ API change:

partition_by('column', as_dict=True) now returns dict with TUPLE KEYS:
- Old behavior: {'read_name': DataFrame}
- New behavior: {('read_name',): DataFrame}

This caused 100% of reads to be skipped during allele swapping:
- Before fix: "processed 0 reads"
- After fix: "processed 4,878 reads"

Files modified:
- src/mapping/make_remap_reads.py:139-147 (swap_chrom_alleles)
- src/mapping/make_remap_reads.py:261-268 (swap_chrom_alleles_multi)

Fix: Extract first element from tuple keys
{k[0] if isinstance(k, tuple) else k: v for k, v in partition.items()}

Tested:
✓ 4,878 swapped reads generated (was 0)
✓ 9,756 lines in FASTQ files (4,878 × 2)
✓ Read names tagged with _WASP_

This completes the recent dataframe API compatibility fixes:
1. has_header → include_header (Polars write_csv)
2. to_list() → tolist() (pandas)
3. partition_by tuple keys (this fix, Polars)
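The tuple-key fix as a reusable helper, in pure Python (illustrative; the in-tree change inlines the comprehension):

```python
def normalize_partition_keys(partition: dict) -> dict:
    """Unwrap single-element tuple keys from partition_by(as_dict=True).

    Old polars: {'read1': df}; polars 1.35+: {('read1',): df}.
    Accepting either form lets the swap code see plain string keys.
    """
    return {k[0] if isinstance(k, tuple) else k: v
            for k, v in partition.items()}
```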
… Baseline

Phase 1.3 Deliverables (Analysis Module Deep Dive):
- docs/modules/ANALYSIS_MODULE.md: Comprehensive 231-line technical overview
  * Statistical methodology (beta-binomial models, LRT)
  * Three-command pipeline (bulk, single-cell, comparison)
  * Key optimization functions (opt_prob, opt_phased_new, opt_unphased_dp)
  * Data structures and complexity metrics (2,779 LOC total)

- docs/modules/ANALYSIS_ISSUES.md: Technical debt catalog
  * 18 issues identified (3 critical, 6 high, 7 medium, 2 low)
  * Critical: None check bug, debug prints, dead code (300+ LOC)
  * High: FDR inconsistency, duplicate counting code, no input validation
  * Estimated fix time: ~20 hours across 4-week sprint

Phase 1.4 Results (Full Pipeline Baseline):
- Successfully executed complete WASP2 pipeline with all three steps
- baselines/pipeline_metadata.txt: Updated with full metrics

Pipeline Performance (Chr10 Reference):
  Step 1: WASP Mapping
    - BWA remapping of swapped alleles
    - Original reads: 126,061
    - WASP filtered: 125,387 (99% kept)
    - Time: 8s, Peak Memory: 488 MB
    - Substeps: make_reads (488 MB), BWA (241 MB), filter (303 MB)

  Step 2: Counting Alleles
    - Input: WASP-filtered BAM + VCF
    - Output: 111,454 SNPs, 2,388 allele counts
    - Time: 9s, Peak Memory: 639 MB
    - MD5: 127a81810a43db3cc6924a26f591cc7a

  Step 3: Analyzing Imbalance
    - Input: 111,454 SNPs across 39 regions
    - Beta-binomial statistical model
    - Output: 39 regions tested, 0 significant (FDR < 0.05)
    - Time: 3s, Peak Memory: 340 MB
    - MD5: 394e1a7dbf14220079c3142c5b15bad8

  Total Pipeline:
    - Time: 20s (8s map + 9s count + 3s analyze)
    - Peak Memory: 639 MB (counting step)
    - Memory Efficiency: 31.95 MB/s

Reference Genome Setup:
  - Downloaded chr10 from UCSC (131 MB uncompressed)
  - Indexed with BWA (94s, 224 MB indices)
  - Enables complete end-to-end WASP workflow testing

Files Modified:
  - baselines/pipeline_metadata.txt: Full 3-step metrics
  - baselines/analysis/ai_results_head20.txt: Sample output
  - baselines/analysis/memory_profile.txt: 340 MB peak
  - baselines/counting/memory_profile.txt: 639 MB peak
  - baselines/PIPELINE_EXECUTION_PLAN.md: Removed (obsolete)

This provides complete regression baselines for:
  - Performance tracking (time/memory across all steps)
  - Output validation (MD5 checksums for counts + analysis)
  - Pipeline correctness verification
  - Memory profiling for optimization targets

Phase 1 Status: Tasks 1.1-1.4 complete (Architecture, Counting, Analysis, Mapping documented)
Next: Phase 1.5-1.8 (Cross-cutting concerns, testing, security, synthesis)
Exclude chr10 reference genome and BWA index files from git tracking:
- test_data/*.fa (genome.fa - 131 MB)
- test_data/*.fa.* (BWA indices: .amb, .ann, .bwt, .pac, .sa - 224 MB)

These are large binary files that can be regenerated:
  wget http://hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/chr10.fa.gz
  gunzip chr10.fa.gz && mv chr10.fa test_data/genome.fa
  bwa index test_data/genome.fa

Total excluded: ~355 MB of reference data
Created comprehensive test framework to validate code changes don't break
functionality or degrade performance during refactoring.

Test Coverage (10 tests, all passing):
---------------------------------------

✅ Counting Module (4 tests):
  - MD5 checksum validation (output correctness)
  - Output structure validation (columns, types, row count)
  - Memory regression (≤ 639 MB × 1.20 tolerance)
  - Performance regression (≤ 9.26s × 1.30 tolerance)

✅ Analysis Module (4 tests):
  - MD5 checksum validation
  - Output structure validation (statistical properties)
  - Memory regression (≤ 340 MB × 1.20 tolerance)
  - Performance regression (≤ 2.97s × 1.30 tolerance)

✅ Mapping Module (1 test):
  - WASP filter rate validation (>95% reads kept)

✅ End-to-End (1 test):
  - Full pipeline reproducibility
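The tolerance checks reduce to a one-line predicate (sketch; the baseline values come from the committed runs above):

```python
def within_tolerance(measured: float, baseline: float,
                     tolerance: float) -> bool:
    """True if a measured time/memory value stays within
    baseline * tolerance, the regression-test acceptance rule."""
    return measured <= baseline * tolerance

# Counting baseline: 639 MB peak memory, 20% tolerance -> 766.8 MB cap.
assert within_tolerance(700, 639, 1.20)
assert not within_tolerance(800, 639, 1.20)
```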

Features:
---------
- Validates against committed baseline outputs (MD5 checksums)
- Detects performance/memory regressions with tolerance
- Tests run in ~25 seconds (fast feedback)
- Comprehensive documentation in tests/regression/README.md

Usage:
------
# Before refactoring
pytest tests/regression/ -v

# During development (fast tests only)
pytest tests/regression/ -v -m "not slow"

# After changes (full suite including E2E)
pytest tests/regression/ -v

Files:
------
- tests/regression/test_pipeline_regression.py (345 LOC)
  * Parse /usr/bin/time -v output for memory/time
  * MD5 validation against baselines
  * Statistical validity checks

- tests/regression/README.md
  * Usage guide and CI integration instructions
  * Baseline update procedures
  * Troubleshooting guide

This provides safety net for:
- Type hint addition (Phase 2A)
- Architecture refactoring (Phase 2B)
- Code cleanup and optimization
- API changes
The full pipeline test (TestFullPipelineIntegration) re-ran the complete
pipeline, causing minor timing/memory variations in baseline files.

Changes (all within normal variance):
- Timestamp updated to latest run
- Memory: 488→489 MB mapping (1 MB increase, <1%)
- Timing variations: ±1s (normal run-to-run variance)

These micro-variations are expected and within tolerance thresholds
defined in regression tests (±20% memory, ±30% time).

MD5 checksums unchanged:
- Counting: 127a81810a43db3cc6924a26f591cc7a ✓
- Analysis: 394e1a7dbf14220079c3142c5b15bad8 ✓

Output correctness preserved.
Created comprehensive DAG to visualize Phase 2 refactoring work:

Features:
- Mermaid diagram showing task dependencies
- 5 parallel tracks (Foundation, Architecture, Error Handling, Testing, Optimization)
- 40+ tasks with duration estimates
- Critical path analysis (44 hours over 6 weeks)
- Parallelization opportunities identified

Quick Wins (can start now):
- QW-1: Remove dead code (2h)
- QW-3: Fix None check bugs (30m)
- QW-4: Use binary search (30m)
- QW-5: Standardize FDR (1h)

Next after Quick Wins:
- TH-1/TH-2/TH-3: Type hints (parallel, 8h each)
- Then architecture refactoring
- Then error handling
- Then comprehensive testing

This provides clear roadmap for Phase 2 work with dependency tracking.
QW-3: Fix None check bugs (4 locations)
- src/counting/__main__.py: lines 118, 203
- src/analysis/__main__.py: lines 216, 331
- Changed 'if len(samples) > 0:' to 'if samples is not None and len(samples) > 0:'
- Prevents TypeError when CLI args are None
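The guarded pattern, isolated (sketch): `len(None)` raises TypeError, so the None check must come first and short-circuit.

```python
def has_samples(samples) -> bool:
    """Safe emptiness check for an optional CLI argument.

    'if len(samples) > 0' raises TypeError when samples is None;
    checking None first short-circuits before len() is called.
    """
    return samples is not None and len(samples) > 0
```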

QW-1: Remove dead code (~207 lines)
- src/analysis/as_analysis.py: Removed 147 lines
  * Commented parse_opt(), binom_model(), get_imbalance() functions
  * Old opt_phased_new() implementation
- src/analysis/run_analysis.py: Removed 57 lines
  * Commented WaspAnalysisData class definition
- src/analysis/compare_ai.py: Removed 3 lines
  * Commented import statements

All regression tests passing (10/10 in 24.85s)
Memory usage slightly improved:
- Counting: 658 MB → 651 MB
- Analysis: 349 MB → 348 MB
- Total pipeline: 642 MB → 636 MB

All changes within regression test tolerances.
Removed debug print statements (3 locations):
- src/counting/run_counting_sc.py:140 - removed vars() debug print
- src/analysis/run_compare_ai.py:39-40 - removed 2 debug prints
- src/analysis/as_analysis_sc.py:89 - removed disp debug print

Replaced error prints with proper exceptions (3 locations):
- src/counting/run_counting_sc.py:106 - raise ValueError for invalid feature file type
- src/counting/run_counting.py:93 - raise ValueError for invalid region file type
- src/counting/filter_variant_data.py:211 - raise ValueError for unrecognized BED format

All regression tests passing (10/10 in 25.09s)
Minor timing variations from test execution.
All changes within normal tolerances (±5%).
Minor timing variations from regression test execution.
All changes within normal tolerances.
Replaced custom bh_correction() with scipy.stats.false_discovery_control():
- Removed 40-line custom Benjamini-Hochberg implementation
- Updated 2 call sites to use scipy's version:
  * src/analysis/as_analysis.py:463
  * src/analysis/compare_ai.py:509
- Removed bh_correction imports from 3 files
- Removed unused rankdata import

Benefits:
- Single source of truth for FDR correction
- Battle-tested scipy implementation
- Reduced maintenance burden
- MD5 validation confirms identical results

All regression tests passing (10/10 in 24.94s)
MD5 checksums unchanged - FDR values are bit-identical
Minor timing variations from test execution.
All MD5 checksums unchanged - FDR values validated.
Resolve README conflict by combining both descriptions.
Added type hints to all 7 files in src/mapping/ module:
- __main__.py: Return types for Typer commands
- wasp_data_files.py: WaspDataFiles class with Union types
- filter_remap_reads.py: File path parameters
- run_mapping.py: Complex decorator typing with Callable
- intersect_variant_data.py: subprocess and BedTool operations
- remap_utils.py: Generator type hints for paired read iteration
- make_remap_reads.py: ReadStats class and WASP algorithm functions

Key technical achievements:
- Generator[Tuple[AlignedSegment, AlignedSegment], None, None] for paired reads
- Callable[..., Any] decorator patterns matching TH-1
- Union[str, Path] and Optional types for file paths
- pl.DataFrame and List[str] for data structures

mypy validation: 11 type errors found (beneficial - catching potential bugs)
All runtime behavior unchanged - type hints are documentation only

Implements TH-3 from docs/TYPE_HINTS_MAPPING_PLAN.md
Estimated time: 16.5 hours (2.0x difficulty vs counting)
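The paired-read generator annotation can be illustrated with a toy iterator (strings stand in for pysam's AlignedSegment in this sketch):

```python
from typing import Generator, Tuple

Read = str  # stand-in for pysam.AlignedSegment

def iter_pairs(reads) -> Generator[Tuple[Read, Read], None, None]:
    """Yield (read1, read2) pairs, mirroring the remap_utils annotation
    Generator[Tuple[AlignedSegment, AlignedSegment], None, None]."""
    it = iter(reads)
    for r1 in it:
        r2 = next(it, None)
        if r2 is None:
            break  # unpaired trailing read is dropped in this sketch
        yield r1, r2
```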
Added proper dependency management and setup documentation:

Files added:
- DEVELOPMENT.md: Complete setup guide for contributors
  - Conda and pip installation instructions
  - Testing and type checking commands
  - Common issues and solutions

- requirements.txt: pip-based dependency specification
  - All Python packages with version constraints
  - Note about system dependencies (bcftools, samtools, bedtools)

Files updated:
- environment.yml: Enhanced conda environment
  - Updated Python to 3.11.* (from 3.9.*)
  - Added pytest>=7.0 for testing
  - Added pytest-cov for coverage reports
  - Added mypy for type checking
  - Organized with comments

This resolves test environment setup issues and ensures reproducibility.

Testing workflow:
- Option 1 (conda): conda env create -f environment.yml
- Option 2 (pip): pip install -r requirements.txt
- Run tests: python -m pytest tests/ -v
Jaureguy760 and others added 27 commits December 14, 2025 20:02
- Add synthetic INDEL dataset + quickbench commands for indel parity and trim-combo invariants
- Add regression tests to lock indel behavior
- Add region-based dev harness to compare unified vs multi-pass on real subsets
- Make multi-pass remapper emit original-orientation FASTQs for reverse reads

- Fix dev harness tempdir creation when --keep-tmp is unset

- Add synthetic reverse-strand pairR + assertions to quickbench SNV parity test
- Avoid split/flatten allocations in generate_haplotype_seqs_view for non-indel cases

- Keep exact behavior by falling back to the slow path when query spans don't match allele length
- Use SmallVec for per-read overlap/span buffers in unified pipeline
- Add WASP2_BAM_THREADS knob to tune htslib decompression vs Rayon workers
- Switch Python parallel entrypoint to per-call Rayon pool (supports sweeps)
- Add dev harness thread_sweep + Rust unified_profile binary for valgrind profiling
- Export rlib for internal binaries and silence a doctest-only example
- Figure 1: Speed benchmarks with Rust acceleration (7x faster)
- Figure 2: Allele counting accuracy comparison
- Figure 3: Statistical analysis with QTL replication
- Figure 4: Single-cell analysis framework

Includes:
- All figure generation scripts
- Benchmark data (TSV/JSON)
- Publication-quality plots (PDF/PNG/TIFF)
- Validation and release scripts

Data validated against 137 RNA + 137 ATAC iPSCORE samples.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Deterministic seeds for reproducible subsampling
- Avoid qstat segfault in scaling jobs
- Add INDEL variant FAIR comparison script

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Documentation:
- Analysis summary, data validation, experiment reports
- Figure engineering plan, statistical report
- Figure gallery

Figure data:
- Figure 1 panel exports (PDF/PNG/SVG)
- Figure 2 timing results and tools (STAR 2.7.4a archive)
- Figure 3 statistical data (QQ, volcano, QTL replication)

Data:
- CVPC eQTL stats for QTL replication analysis

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Removes paper/benchmark materials (71GB) for clean software release:
- Removed: paper/, benchmarking/, results/, validation/
- Removed: baselines/, benchmarks/, figures/, gm12878_benchmark/
- Removed: scripts/, data/, doc/, .devcontainer/
- Removed: benchmark_*.py, profile_*.py, simulate_*.py, etc.

Retained core software (~4MB):
- src/ (Python modules: counting, mapping, analysis)
- rust/src/ (Rust acceleration code)
- tests/ (Test suite)
- docs/ (Sphinx documentation)
- .github/ (CI/CD workflows)
- Package config (pyproject.toml, environment.yml, etc.)

Full reproducibility preserved in WASP2-exp/ropc-indels branch.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@Jaureguy760
Collaborator Author

Closing - will create proper release PR after polishing in a separate development workflow.

@Jaureguy760 Jaureguy760 deleted the release/v1.2.1-software branch January 22, 2026 10:08
2 participants