
@FranklinChen
Member

  1. Lazy loading so that batchalign --help does not take 20 seconds.
  2. Caching of various things to avoid repeated overhead.
  3. Parallelizing over files, with an optional --workers flag that defaults to the detected number of cores; some reorganization so that each worker's output is captured per process and does not interleave
  4. Alignment reduced from O(n^2) to O(n) space via a Hirschberg-style divide-and-conquer

- Move imports inside function calls to reduce startup time and memory usage
- Remove unused imports including multiprocessing, glob, traceback, and rich modules
- Consolidate import statements and remove redundant module references
- Maintain all functionality while improving performance through lazy loading
- Reduce initial module load time by importing only when needed in each command function
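The lazy-import pattern described above can be sketched as follows. This is a minimal illustration, not batchalign's actual code: the command name is hypothetical, and `difflib` merely stands in for a heavy dependency such as stanza or torch.

```python
def tag_command(text):
    """Hypothetical CLI subcommand: the heavy dependency is imported
    inside the function body, so `--help` never pays its load cost."""
    import difflib  # stand-in for a heavy library such as stanza
    return difflib.SequenceMatcher(None, text, text).ratio()
```

Because the import happens on first call rather than at module load, `batchalign --help` only pays for the modules the CLI framework itself needs; Python caches the module in `sys.modules` after the first call, so repeated calls pay no extra cost.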
…ove CLI options

- Introduce --workers option to control number of parallel processes for CLI
- Implement ProcessPoolExecutor for parallel file processing in dispatch module
- Add worker pipeline caching to avoid repeated initialization overhead
- Implement proper stdout/stderr capture and redirection per worker process
- Modify verbosity handling to adjust stanza logging levels appropriately
- Update morphotag pipeline to accept pre-initialized NLP pipeline instance
- Add pre-download of stanza resources to prevent interleaved downloads
- Limit workers for GPU-intensive tasks like transcription to prevent memory issues
- Replace manual regex operations with pre-compiled patterns for better performance
- Remove debug breakpoints and add graceful error handling for edge cases
- Update progress reporting to work correctly with parallel execution model
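The worker-pool and per-worker pipeline caching described above follow a common pattern, sketched below under stated assumptions: the function names are illustrative, and `str.upper` stands in for building a real NLP pipeline. Each worker process builds its pipeline once and reuses it for every file it handles.

```python
import os
from concurrent.futures import ProcessPoolExecutor

_PIPELINE = None  # cached once per worker process to avoid repeated setup

def _get_pipeline():
    """Build the (expensive) pipeline lazily, once per worker process."""
    global _PIPELINE
    if _PIPELINE is None:
        _PIPELINE = str.upper  # stand-in for real pipeline construction
    return _PIPELINE

def process_file(path):
    """Runs inside a worker; reuses that worker's cached pipeline."""
    return _get_pipeline()(path)

def run_all(paths, workers=None):
    # --workers semantics: explicit value if given, else detected core count
    with ProcessPoolExecutor(max_workers=workers or os.cpu_count()) as pool:
        return list(pool.map(process_file, paths))
```

For GPU-bound stages (e.g. transcription), `workers` would be clamped to a small value so concurrent models do not exhaust GPU memory, as the commit notes.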
…ve CLI processing order

- Replace full-matrix dynamic programming with Hirschberg-style divide-and-conquer algorithm for linear space complexity
- Reduce memory usage significantly for large sequence alignments while maintaining same edit-distance results
- Add size-based sorting in CLI dispatcher to process largest files first, preventing late stragglers in parallel processing
- Maintain backward compatibility with existing alignment API and match functions
- Optimize progress tracking and cleanup for alignment operations
…ontext handling

- Add tokenizer_context parameter to morphoanalyze function to maintain pre-parallelization behavior
- Implement NLP pipeline caching using lang/retokenize/mwt combinations to avoid redundant rebuilds
- Extract language conversion logic into _lang_alpha2 static method for better organization
- Add MWT signature hashing to support custom multi-word token configurations
- Introduce _build_nlp method to centralize pipeline construction with proper processor chaining
- Create _get_or_create_nlp method to manage cached pipeline instances efficiently
- Restore tokenizer post-processing functionality that was lost during parallelization changes
- Add validation to prevent code-switching documents with custom MWT lists
- Add progress callback mechanism to worker tasks to track pipeline processing stages
- Implement multiprocessing-safe progress queue to collect real-time progress updates
- Update progress display to show detailed processing stages instead of generic waiting/processing states
- Replace basic completion tracking with granular progress reporting from pipeline operations
- Add proper cleanup and resource management for progress tracking infrastructure
- Maintain backward compatibility with existing error handling and logging features
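The pipeline-cache design described above (keying on lang/retokenize/MWT signature) can be sketched as below. The helper names mirror those in the commit message (`_build_nlp`, `_get_or_create_nlp`, MWT signature hashing), but the bodies are illustrative stand-ins, not stanza's or batchalign's actual construction logic.

```python
import hashlib

_NLP_CACHE = {}  # (lang, retokenize, mwt_signature) -> pipeline instance

def _mwt_signature(mwt_list):
    """Stable hash of a custom multi-word-token list, or None if unset."""
    if not mwt_list:
        return None
    return hashlib.sha256("\n".join(sorted(mwt_list)).encode()).hexdigest()

def _build_nlp(lang, retokenize, mwt_list):
    # stand-in for constructing a real stanza pipeline with chained processors
    return {"lang": lang, "retokenize": retokenize, "mwt": mwt_list}

def get_or_create_nlp(lang, retokenize=False, mwt_list=None):
    """Return a cached pipeline for this configuration, building at most one."""
    key = (lang, retokenize, _mwt_signature(mwt_list))
    if key not in _NLP_CACHE:
        _NLP_CACHE[key] = _build_nlp(lang, retokenize, mwt_list)
    return _NLP_CACHE[key]
```

Hashing the sorted MWT list means two callers with the same custom tokens share one pipeline, while any change to the list forces a rebuild rather than silently reusing a stale configuration.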
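The multiprocessing-safe progress mechanism can be sketched as a producer/consumer queue; the function names here are hypothetical, and the stage names are placeholders. Workers push `(file, stage)` events, and the main process drains them to drive a live display instead of a generic "processing" spinner.

```python
import multiprocessing as mp

def make_progress_queue():
    """Queue that can be shared between the dispatcher and pool workers."""
    return mp.Manager().Queue()

def report(progress_queue, filename, stage):
    """Called inside a worker: push one (file, stage) progress event."""
    progress_queue.put((filename, stage))

def drain(progress_queue, n_files):
    """Main process: collect events until every file has reported 'done',
    so the display can show real pipeline stages as they happen."""
    events, done = [], set()
    while len(done) < n_files:
        filename, stage = progress_queue.get()
        events.append((filename, stage))
        if stage == "done":
            done.add(filename)
    return events
```

A `Manager().Queue()` (rather than a bare `multiprocessing.Queue`) is the usual choice with `ProcessPoolExecutor`, since manager proxies can be passed to pool workers safely; `drain` also gives a natural place for cleanup once all files finish.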
Copilot AI review requested due to automatic review settings January 11, 2026 02:09
@FranklinChen FranklinChen requested review from Jemoka and removed request for Copilot January 11, 2026 02:10
@Jemoka
Member

Jemoka commented Jan 11, 2026

Woah thanks! I will review this and get back to you. DP being in $O(n)$ seems a little suspicious without thinking deeply about this, but will read through and check.
I will get a review in by Wednesday EOD.

Thanks again!

@Jemoka Jemoka changed the base branch from master to feat/speed January 13, 2026 21:05
@Jemoka Jemoka merged commit 1ed2cf8 into TalkBank:feat/speed Jan 13, 2026
7 checks passed
