Release v1.19.2: Enhanced character encoding and spacing #293
Enhanced encode_html_entities() to use extended HTML entity names (e.g., &ccaron;, &umacr;, &edot;) instead of numeric references for Lithuanian and Eastern European characters.
Changes:
- Added extended_entities dictionary with 30+ character mappings
- Lithuanian: č → &ccaron;, ū → &umacr;, ė → &edot;
- Other Eastern European: š → &scaron;, ž → &zcaron;, etc.
- Updated tests to expect named entities
Example: "Vaitkevičiūtė" → "Vaitkevi&ccaron;i&umacr;t&edot;"
Addresses feedback from Guillaume Jacquemet
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Enhanced bioRxiv character encoding with named HTML entities
- Supports Lithuanian and Eastern European characters properly
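To make the difference between the two encodings concrete, here is a minimal sketch that encodes the example author name both ways and checks that each form decodes back to the original. It assumes the earlier behavior emitted decimal numeric references; `named_map` is a hypothetical three-entry stand-in for the PR's `extended_entities` dictionary, not the project's code.

```python
import html

name = "Vaitkevičiūtė"

# Numeric character references (assumed to be the earlier output style).
numeric = "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in name)

# Named entities, using only the three Lithuanian mappings listed in the commit.
# named_map is an illustrative stand-in, not the project's dictionary.
named_map = {"č": "&ccaron;", "ū": "&umacr;", "ė": "&edot;"}
named = "".join(named_map.get(c, c) for c in name)

# html.unescape understands HTML5 named references as well as numeric ones,
# so both encodings should round-trip to the same string.
assert html.unescape(numeric) == name
assert html.unescape(named) == name
```

Both forms are equivalent to an HTML5-aware parser; the change only matters for consumers, such as the bioRxiv import described in this PR, that handle the named form better.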
Changed from \enskip (0.5em) to \quad (1em) for more visible spacing after runin subsubsection titles. Addresses feedback that spacing was present but still too tight.
Before: \enskip = 0.5em spacing
After: \quad = 1em spacing (2x wider)
Addresses feedback from Guillaume Jacquemet
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Code Review: Release v1.19.2
I've reviewed PR #293 and overall the changes look solid. Here's my detailed feedback:
✅ Strengths
1. Enhanced HTML Entity Encoding
2. Test Coverage
3. LaTeX Spacing Improvement
4. Documentation
🔍 Code Quality Observations
1. Performance Consideration

    # At module level (after imports)
    _EXTENDED_ENTITIES = {
        "č": "ccaron",
        "Č": "Ccaron",
        # ... rest of mappings
    }

    # Build the full char-to-entity map once at module load
    _CHAR_TO_ENTITY = {}
    for char, entity_name in _EXTENDED_ENTITIES.items():
        _CHAR_TO_ENTITY[char] = f"&{entity_name};"
    for entity_name, codepoint in html.entities.name2codepoint.items():
        char = chr(codepoint)
        if ord(char) > 127 and char not in _CHAR_TO_ENTITY:
            _CHAR_TO_ENTITY[char] = f"&{entity_name};"

    def encode_html_entities(text: str) -> str:
        # Use pre-built _CHAR_TO_ENTITY

This would avoid rebuilding ~200+ dictionary entries for every author name processed.
2. Memory Efficiency

    def _encode_char(char):
        char_code = ord(char)
        if char_code <= 127:
            return char
        elif char in _CHAR_TO_ENTITY:
            return _CHAR_TO_ENTITY[char]
        else:
            return f"&#{char_code};"

    return "".join(_encode_char(char) for char in text)

3. Type Hints

    def encode_html_entities(text: str | None) -> str | None:

🔒 Security Analysis
Good security practices observed:
No security concerns identified.
🧪 Testing Recommendations
📝 Minor Suggestions
🎯 Verdict
Recommend: APPROVE with optional performance enhancements
The PR addresses real user feedback, improves bioRxiv compatibility, and maintains good test coverage. The code is well-documented and follows project conventions. The performance optimizations suggested above are optional - the current implementation is correct and will work fine for typical manuscript author lists (5-20 authors). Only optimize if profiling shows this as a bottleneck.
📋 Pre-merge Checklist
Before merging, verify:
Great work addressing the user feedback! 🎉
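The three suggestions in this review (a module-level lookup table, a per-character helper, and `None`-aware type hints) could fit together roughly as follows. This is a sketch under stated assumptions rather than the project's implementation: `_EXTENDED_ENTITIES` holds only a hypothetical six-entry subset of the PR's mapping, `_build_char_to_entity` is an invented helper name, and returning `None` for a `None` input is an assumed behavior that the review's type hint implies but does not spell out.

```python
from __future__ import annotations

import html.entities

# Hypothetical subset of the PR's extended_entities mapping (illustration only).
_EXTENDED_ENTITIES = {
    "č": "ccaron", "Č": "Ccaron",
    "ė": "edot", "Ė": "Edot",
    "ū": "umacr", "Ū": "Umacr",
}


def _build_char_to_entity() -> dict[str, str]:
    """Merge the extended names with the standard HTML 4 names, once."""
    table = {ch: f"&{name};" for ch, name in _EXTENDED_ENTITIES.items()}
    for name, codepoint in html.entities.name2codepoint.items():
        ch = chr(codepoint)
        # Leave ASCII untouched and let the extended names take priority.
        if codepoint > 127 and ch not in table:
            table[ch] = f"&{name};"
    return table


# Built a single time at import, not on every call.
_CHAR_TO_ENTITY = _build_char_to_entity()


def _encode_char(char: str) -> str:
    """Encode one character: ASCII as-is, known names, else numeric fallback."""
    if ord(char) <= 127:
        return char
    return _CHAR_TO_ENTITY.get(char, f"&#{ord(char)};")


def encode_html_entities(text: str | None) -> str | None:
    # Assumption: None passes through unchanged.
    if text is None:
        return None
    return "".join(_encode_char(c) for c in text)
```

With this layout the table is shared across calls, and any character missing from both the extended subset and the HTML 4 table still falls back to a numeric reference, matching the fallback branch in the review's `_encode_char`.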
Pull request overview
This PR enhances character encoding for bioRxiv submissions and improves LaTeX section spacing. The changes move from numeric HTML character references to named HTML entities for better compatibility with bioRxiv's TSV import system.
Changes:
- Added 30+ extended HTML entity mappings (Lithuanian, Polish, Turkish, Romanian characters)
- Increased LaTeX subsubsection spacing from \enskip (0.5em) to \quad (1em)
- Updated tests to validate named entity encoding instead of numeric references
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| uv.lock | Added PyPDF2 3.0.1 to dev dependencies; lock file revision changed |
| tests/unit/test_prepare_biorxiv.py | Updated test assertions to expect named entities instead of numeric references |
| src/tex/style/rxiv_maker_style.cls | Increased subsubsection spacing from \enskip to \quad |
| src/rxiv_maker/engines/operations/prepare_biorxiv.py | Added extended_entities dictionary with 30+ character mappings to named HTML entities |
| src/rxiv_maker/version.py | Version bumped to 1.19.2 |
| CHANGELOG.md | Added v1.19.2 release notes documenting both changes |
    extended_entities = {
        # Lithuanian and Eastern European
        "č": "ccaron",
        "Č": "Ccaron",  # c with caron
        "ė": "edot",
        "Ė": "Edot",  # e with dot above
        "ū": "umacr",
        "Ū": "Umacr",  # u with macron
        "ā": "amacr",
        "Ā": "Amacr",  # a with macron
        "ē": "emacr",
        "Ē": "Emacr",  # e with macron
        "ī": "imacr",
        "Ī": "Imacr",  # i with macron
        "ō": "omacr",
        "Ō": "Omacr",  # o with macron
        # Other common extended entities
        "ă": "abreve",
        "Ă": "Abreve",  # a with breve
        "ą": "aogon",
        "Ą": "Aogon",  # a with ogonek
        "ć": "cacute",
        "Ć": "Cacute",  # c with acute
        "ę": "eogon",
        "Ę": "Eogon",  # e with ogonek
        "ğ": "gbreve",
        "Ğ": "Gbreve",  # g with breve
        "İ": "Idot",  # I with dot above
        "ı": "inodot",  # i without dot
        "ł": "lstrok",
        "Ł": "Lstrok",  # l with stroke
        "ń": "nacute",
        "Ń": "Nacute",  # n with acute
        "œ": "oelig",
        "Œ": "OElig",  # oe ligature
        "ř": "rcaron",
        "Ř": "Rcaron",  # r with caron
        "ś": "sacute",
        "Ś": "Sacute",  # s with acute
        "š": "scaron",
        "Š": "Scaron",  # s with caron
        "ş": "scedil",
        "Ş": "Scedil",  # s with cedilla
        "ţ": "tcedil",
        "Ţ": "Tcedil",  # t with cedilla
        "ů": "uring",
        "Ů": "Uring",  # u with ring
        "ź": "zacute",
        "Ź": "Zacute",  # z with acute
        "ż": "zdot",
        "Ż": "Zdot",  # z with dot above
        "ž": "zcaron",
        "Ž": "Zcaron",  # z with caron
    }
The extended entities dictionary includes uppercase character mappings (e.g., "Č" → "Ccaron", "Ė" → "Edot", "Ū" → "Umacr") but there are no test cases validating these uppercase entities. Consider adding test coverage for uppercase extended characters to ensure they encode correctly, especially since HTML entity names are case-sensitive.
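One way to act on this comment is a round-trip check that decodes each entity name and compares the result with the character it is supposed to represent. The sketch below is self-contained and uses only the standard library; `CASE_PAIRS` copies three upper/lowercase pairs from the excerpt above, and `check_case_sensitive_entities` is a hypothetical helper name. It validates the entity names themselves and does not call the project's `encode_html_entities`; a real test in tests/unit/test_prepare_biorxiv.py would presumably feed the same characters to that function.

```python
from __future__ import annotations

import html

# Upper/lowercase pairs copied from the extended_entities excerpt.
CASE_PAIRS = {
    "Č": "Ccaron", "č": "ccaron",
    "Ė": "Edot", "ė": "edot",
    "Ū": "Umacr", "ū": "umacr",
}


def check_case_sensitive_entities(pairs: dict[str, str]) -> list[str]:
    """Return the characters whose entity name does not decode back to them."""
    failures = []
    for char, name in pairs.items():
        if html.unescape(f"&{name};") != char:
            failures.append(char)
    return failures


def test_uppercase_entities_round_trip() -> None:
    # Every name, upper and lower case, should decode to its own character.
    assert check_case_sensitive_entities(CASE_PAIRS) == []
    # Case matters: the uppercase name must not decode to the lowercase letter.
    assert html.unescape("&Ccaron;") != html.unescape("&ccaron;")
```

Because the check goes through `html.unescape`, a typo in the capitalization of a name (for example `CCaron`) would leave the reference undecoded and show up in the returned failure list.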
Release v1.19.2
Fixes
bioRxiv Character Encoding - Named HTML Entities
č → &ccaron;, ū → &umacr;, ė → &edot;
LaTeX Section Spacing - Increased Visibility
\enskip (0.5em) to \quad (1em)
Testing
Issues Fixed
Addresses character encoding and spacing feedback from Guillaume Jacquemet
🤖 Generated with Claude Code