Skip to content

Conversation

Copy link

Copilot AI commented Jan 5, 2026

  • Explore the codebase to understand the structure
  • Identify performance bottlenecks
  • Optimize the get_script() function using binary search (O(n) → O(log n) with 776 ranges)
  • Pre-compute sorted data structures at code generation time (in genscripts/generate_text.py)
  • Add tests for the optimization
  • Run existing tests to verify correctness (223 tests pass)
  • Run mypy type checking (no issues)
  • Run code review
  • Run security checks (no vulnerabilities found)

Performance Improvement

File: rigour/text/scripts.py

Issue: The get_script() function was iterating linearly through 776 Unicode script ranges to find the script for a codepoint. This is O(n) complexity for each lookup.

Solution:

  1. Replaced linear search with binary search using Python's bisect module
  2. Pre-computed SORTED_RANGES and RANGE_STARTS at code generation time in genscripts/generate_text.py rather than at module import time

Benchmark Results:

  • Before: ~9.8μs per call
  • After: ~0.23μs per call
  • Speedup: ~42x faster

This optimization improves the performance of all functions that depend on script detection, including can_latinize(), can_latinize_cp(), and indirectly affects name processing operations.

Original prompt

Identify and suggest improvements to slow or inefficient code


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Co-authored-by: pudo <41628+pudo@users.noreply.github.com>
Copilot AI changed the title [WIP] Identify and suggest improvements to slow or inefficient code Optimize get_script() using binary search for 42x speedup Jan 5, 2026
Copilot AI requested a review from pudo January 5, 2026 20:30
@pudo
Copy link
Member

pudo commented Jan 5, 2026

@copilot I like this! Do we need to regenerate _SORTED_RANGES on each start or can we do this as the scripts data is computed in genscripts/generate_text.py?

Co-authored-by: pudo <41628+pudo@users.noreply.github.com>
Copy link
Author

Copilot AI commented Jan 5, 2026

@copilot I like this! Do we need to regenerate _SORTED_RANGES on each start or can we do this as the scripts data is computed in genscripts/generate_text.py?

Done in c908d6b. The SORTED_RANGES and RANGE_STARTS are now pre-computed in genscripts/generate_text.py and included in the generated rigour/data/text/scripts.py file, eliminating the runtime sorting overhead.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants