Good question. If you scrutinize the algorithm described in the Unicode standard, it should be possible to start in the middle of the string and scan backwards for the last character from which we can determine the state. In most cases you shouldn't have to scan all the way to the beginning of the string.

The most rigorous version of this algorithm would be Unicode version-dependent, so ideally it should be implemented within utf8proc itself.

However, a simplified version of the algorithm would be to search backwards through the characters for one whose boundclass (a field of the utf8proc_property_t struct returned by utf8proc_get_property) is equal to UTF8PROC_BOUNDCLASS_OTHER, which is the most common value for characters…
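For illustration only, here is a rough sketch of that backward scan in C. It is not code from utf8proc: the names `scan_back_to_other`, `decode_at`, `is_other_fn`, and `toy_is_other` are hypothetical, and the input is assumed to be valid UTF-8. The classifier is passed in as a callback so the sketch stays self-contained; with utf8proc, the callback would presumably test `utf8proc_get_property(cp)->boundclass == UTF8PROC_BOUNDCLASS_OTHER`. The toy classifier below is only a stand-in that treats U+0300..U+036F (combining diacritics) as "not other".

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: nonzero if the codepoint's boundclass is "other".
   With utf8proc this could be implemented as
   utf8proc_get_property(cp)->boundclass == UTF8PROC_BOUNDCLASS_OTHER. */
typedef int (*is_other_fn)(int32_t cp);

/* Decode the codepoint whose lead byte is at s[i].
   Assumption: s is valid UTF-8 and i points at a lead byte. */
static int32_t decode_at(const unsigned char *s, size_t i) {
    unsigned char b = s[i];
    if (b < 0x80) return b;
    if ((b & 0xE0) == 0xC0)
        return ((b & 0x1F) << 6) | (s[i + 1] & 0x3F);
    if ((b & 0xF0) == 0xE0)
        return ((b & 0x0F) << 12) | ((s[i + 1] & 0x3F) << 6) | (s[i + 2] & 0x3F);
    return ((b & 0x07) << 18) | ((s[i + 1] & 0x3F) << 12) |
           ((s[i + 2] & 0x3F) << 6) | (s[i + 3] & 0x3F);
}

/* Starting at byte offset pos, walk backwards one codepoint at a time and
   return the byte offset of the nearest codepoint (at or before pos) that the
   callback classifies as "other". Falls back to offset 0 if none is found. */
static size_t scan_back_to_other(const unsigned char *s, size_t len,
                                 size_t pos, is_other_fn is_other) {
    if (len == 0) return 0;
    if (pos >= len) pos = len - 1;
    for (;;) {
        /* Step back over continuation bytes (10xxxxxx) to the lead byte. */
        while (pos > 0 && (s[pos] & 0xC0) == 0x80) pos--;
        if (pos == 0 || is_other(decode_at(s, pos))) return pos;
        pos--; /* move into the previous codepoint */
    }
}

/* Toy stand-in classifier (NOT the real Unicode data): everything except
   the combining diacritics U+0300..U+036F counts as "other". */
static int toy_is_other(int32_t cp) {
    return !(cp >= 0x0300 && cp <= 0x036F);
}
```

For the string "a", "e", U+0301, U+0302, "b", a scan that starts inside the second combining mark walks back past both marks and stops at the "e". What to do from that position (for example, replaying the stateful break check forward) depends on the rest of the algorithm, which this sketch does not cover.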

Answer selected by Old-Farmer