How to decide a grapheme break efficiently in mid-string? #314

Good question … if you scrutinize the algorithm described in the Unicode standard, it should be possible to start in the middle of the string and scan backwards for the last character from which we can determine the state … in most cases you shouldn't have to scan all the way to the beginning of the string.
The most rigorous version of this algorithm would depend on the Unicode version, so ideally it should be implemented within utf8proc itself.
However, a simplified version of the algorithm would be to search backwards through the characters for a `boundclass` (a field of the struct returned by `utf8proc_get_property`) that is equal to `UTF8PROC_BOUNDCLASS_OTHER`, which is the most common value for characters…
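
The backward search could be sketched roughly as follows. This is only an illustration under some assumptions: the string is already decoded into an array of code points, and the class lookup is passed in as a callback so the sketch compiles without utf8proc. The names `find_restart_index`, `boundclass_fn`, and `toy_boundclass` are hypothetical and not part of utf8proc.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: returns the grapheme boundclass of a code point.
 * With utf8proc, this would wrap utf8proc_get_property(c)->boundclass. */
typedef int (*boundclass_fn)(int32_t codepoint);

/* Scan backwards from index pos (inclusive; pos must be a valid index unless
 * it is 0) for the nearest code point whose boundclass equals other_class.
 * Returns that index, or 0 if none is found, in which case the caller has
 * to start from the beginning of the string. */
size_t find_restart_index(const int32_t *cps, size_t pos,
                          boundclass_fn get_class, int other_class)
{
    size_t i = pos;
    while (i > 0 && get_class(cps[i]) != other_class)
        i--;
    return i;
}

/* Toy stand-in classifier, for illustration only: treats U+0300..U+036F
 * (combining diacritical marks) as class 1 and everything else as class 0. */
enum { TOY_OTHER = 0, TOY_EXTEND = 1 };

int toy_boundclass(int32_t c)
{
    return (c >= 0x0300 && c <= 0x036F) ? TOY_EXTEND : TOY_OTHER;
}
```

With utf8proc, the callback would presumably return `utf8proc_get_property(c)->boundclass` and `other_class` would be `UTF8PROC_BOUNDCLASS_OTHER`. The returned index is only a place to restart a forward scan (for example with `utf8proc_grapheme_break_stateful`); it is not guaranteed to be a grapheme boundary itself.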