
@FranklinChen
Member

  1. Lazy loading so that batchalign --help does not take 20 seconds.
  2. Caching of various things to avoid repeated overhead.
  3. Parallelizing over files, with an optional --workers flag that defaults to the detected number of cores; some reorganization so that each worker's output is captured per process and does not interleave
  4. Alignment reduced from O(n^2) to O(n) space via a Hirschberg-style divide-and-conquer

- Move imports inside function calls to reduce startup time and memory usage
- Remove unused imports including multiprocessing, glob, traceback, and rich modules
- Consolidate import statements and remove redundant module references
- Maintain all functionality while improving performance through lazy loading
- Reduce initial module load time by importing only when needed in each command function
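The lazy-import pattern described above can be sketched as follows. This is a minimal illustration, not batchalign's actual code: the command name is hypothetical, and `difflib` merely stands in for a heavy dependency such as stanza or torch.

```python
def tag_command(text):
    """Hypothetical CLI subcommand: the heavy dependency is imported
    inside the function body, so `--help` never pays its load cost."""
    import difflib  # stand-in for a heavy library such as stanza
    return difflib.SequenceMatcher(None, text, text).ratio()
```

Because the import happens on first call rather than at module load, `batchalign --help` only pays for the modules the CLI framework itself needs; Python caches the module in `sys.modules` after the first call, so repeated calls pay no extra cost.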
…ove CLI options

- Introduce --workers option to control number of parallel processes for CLI
- Implement ProcessPoolExecutor for parallel file processing in dispatch module
- Add worker pipeline caching to avoid repeated initialization overhead
- Implement proper stdout/stderr capture and redirection per worker process
- Modify verbosity handling to adjust stanza logging levels appropriately
- Update morphotag pipeline to accept pre-initialized NLP pipeline instance
- Add pre-download of stanza resources to prevent interleaved downloads
- Limit workers for GPU-intensive tasks like transcription to prevent memory issues
- Replace manual regex operations with pre-compiled patterns for better performance
- Remove debug breakpoints and add graceful error handling for edge cases
- Update progress reporting to work correctly with parallel execution model
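The worker-pool and per-worker pipeline caching described above follow a common pattern, sketched below under stated assumptions: the function names are illustrative, and `str.upper` stands in for building a real NLP pipeline. Each worker process builds its pipeline once and reuses it for every file it handles.

```python
import os
from concurrent.futures import ProcessPoolExecutor

_PIPELINE = None  # cached once per worker process to avoid repeated setup

def _get_pipeline():
    """Build the (expensive) pipeline lazily, once per worker process."""
    global _PIPELINE
    if _PIPELINE is None:
        _PIPELINE = str.upper  # stand-in for real pipeline construction
    return _PIPELINE

def process_file(path):
    """Runs inside a worker; reuses that worker's cached pipeline."""
    return _get_pipeline()(path)

def run_all(paths, workers=None):
    # --workers semantics: explicit value if given, else detected core count
    with ProcessPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        return list(pool.map(process_file, paths))
```

For GPU-bound stages (e.g. transcription), `workers` would be clamped to a small value so concurrent models do not exhaust GPU memory, as the commit notes.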
…ve CLI processing order

- Replace full-matrix dynamic programming with Hirschberg-style divide-and-conquer algorithm for linear space complexity
- Reduce memory usage significantly for large sequence alignments while maintaining same edit-distance results
- Add size-based sorting in CLI dispatcher to process largest files first, preventing late stragglers in parallel processing
- Maintain backward compatibility with existing alignment API and match functions
- Optimize progress tracking and cleanup for alignment operations
…ontext handling

- Add tokenizer_context parameter to morphoanalyze function to maintain pre-parallelization behavior
- Implement NLP pipeline caching using lang/retokenize/mwt combinations to avoid redundant rebuilds
- Extract language conversion logic into _lang_alpha2 static method for better organization
- Add MWT signature hashing to support custom multi-word token configurations
- Introduce _build_nlp method to centralize pipeline construction with proper processor chaining
- Create _get_or_create_nlp method to manage cached pipeline instances efficiently
- Restore tokenizer post-processing functionality that was lost during parallelization changes
- Add validation to prevent code-switching documents with custom MWT lists
- Add progress callback mechanism to worker tasks to track pipeline processing stages
- Implement multiprocessing-safe progress queue to collect real-time progress updates
- Update progress display to show detailed processing stages instead of generic waiting/processing states
- Replace basic completion tracking with granular progress reporting from pipeline operations
- Add proper cleanup and resource management for progress tracking infrastructure
- Maintain backward compatibility with existing error handling and logging features
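The pipeline-cache design described above (keying on lang/retokenize/MWT signature) can be sketched as below. The helper names mirror those in the commit message (`_build_nlp`, `_get_or_create_nlp`, MWT signature hashing), but the bodies are illustrative stand-ins, not stanza's or batchalign's actual construction logic.

```python
import hashlib

_NLP_CACHE = {}  # (lang, retokenize, mwt_signature) -> pipeline instance

def _mwt_signature(mwt_list):
    """Stable hash of a custom multi-word-token list, or None if unset."""
    if not mwt_list:
        return None
    return hashlib.sha256("\n".join(sorted(mwt_list)).encode()).hexdigest()

def _build_nlp(lang, retokenize, mwt_list):
    # stand-in for constructing a real stanza pipeline with chained processors
    return {"lang": lang, "retokenize": retokenize, "mwt": mwt_list}

def get_or_create_nlp(lang, retokenize=False, mwt_list=None):
    """Return a cached pipeline for this configuration, building at most one."""
    key = (lang, retokenize, _mwt_signature(mwt_list))
    if key not in _NLP_CACHE:
        _NLP_CACHE[key] = _build_nlp(lang, retokenize, mwt_list)
    return _NLP_CACHE[key]
```

Hashing the sorted MWT list means two callers with the same custom tokens share one pipeline, while any change to the list forces a rebuild rather than silently reusing a stale configuration.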
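The multiprocessing-safe progress mechanism can be sketched as a producer/consumer queue; the function names here are hypothetical, and the stage names are placeholders. Workers push `(file, stage)` events, and the main process drains them to drive a live display instead of a generic "processing" spinner.

```python
import multiprocessing as mp

def make_progress_queue():
    """Queue that can be shared between the dispatcher and pool workers."""
    return mp.Manager().Queue()

def report(progress_queue, filename, stage):
    """Called inside a worker: push one (file, stage) progress event."""
    progress_queue.put((filename, stage))

def drain(progress_queue, n_files):
    """Main process: collect events until every file has reported 'done',
    so the display can show real pipeline stages as they happen."""
    events, done = [], set()
    while len(done) < n_files:
        filename, stage = progress_queue.get()
        events.append((filename, stage))
        if stage == "done":
            done.add(filename)
    return events
```

A `Manager().Queue()` (rather than a bare `multiprocessing.Queue`) is the usual choice with `ProcessPoolExecutor`, since manager proxies can be passed to pool workers safely; `drain` also gives a natural place for cleanup once all files finish.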
Copilot AI review requested due to automatic review settings January 11, 2026 02:09
@FranklinChen FranklinChen requested review from Jemoka and removed request for Copilot January 11, 2026 02:10
@Jemoka
Member

Jemoka commented Jan 11, 2026

Woah thanks! I will review this and get back to you. DP being in $O(n)$ seems a little suspicious without thinking deeply about this, but will read through and check.
I will get a review in by Wednesday EOD.

Thanks again!

@Jemoka Jemoka changed the base branch from master to feat/speed January 13, 2026 21:05
@Jemoka Jemoka merged commit 1ed2cf8 into TalkBank:feat/speed Jan 13, 2026
7 checks passed
