How to decide a grapheme break efficiently in mid-string? #314

Old-Farmer · 2025-11-16T20:09:33Z

Old-Farmer
Nov 16, 2025

Sorry if this is a naive question; I am quite new to Unicode. As the warning says: "utf8proc_grapheme_break_stateful must be called IN ORDER on ALL potential breaks in a string. However, it is safe to reset the state to zero after a grapheme break". So how do I recognize a break in the middle of a string? Do I have to start from the beginning of the string? I need to determine whether there is a break before a given offset.

After reading UAX #29, I found that only a few cases need the state. I think I can call the stateless version and decide whether there is a break based on the codepoint category. Is that correct?

Answered by stevengj

stevengj · 2025-11-16T23:00:33Z

stevengj
Nov 16, 2025
Maintainer

Good question … if you scrutinize the algorithm described in the Unicode standard, it should be possible to start in the middle of the string and scan backwards for the last character from which we can determine the state … in most cases you shouldn't have to scan all the way to the beginning of the string.

The most rigorous version of this algorithm would be Unicode version-dependent, so ideally it should be implemented within utf8proc itself.

However, a simplified version of the algorithm would be to search backwards through the characters for one whose boundclass (a field of the utf8proc_property_t struct returned by utf8proc_get_property) is equal to UTF8PROC_BOUNDCLASS_OTHER, which is the most common value, used for characters with no special break rules, and then start at that character with a reset state rather than at the beginning of the string.
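
A minimal sketch of that backward scan, under some assumptions: the text is already decoded into an array of codepoints, and the names `break_before`, `is_other_fn`, `break_fn` and the `toy_*` functions are hypothetical, not part of utf8proc. The two callbacks are meant to be thin wrappers around `utf8proc_get_property(cp)->boundclass == UTF8PROC_BOUNDCLASS_OTHER` and `utf8proc_grapheme_break_stateful(cp1, cp2, state)`. The toy stand-ins only exist so the sketch runs without the library; they know nothing beyond regional indicators and U+0301.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback types (not part of utf8proc). With the real library,
 * is_other would test utf8proc_get_property(cp)->boundclass ==
 * UTF8PROC_BOUNDCLASS_OTHER, and brk would forward to
 * utf8proc_grapheme_break_stateful(cp1, cp2, state). */
typedef bool (*is_other_fn)(int32_t cp);
typedef bool (*break_fn)(int32_t cp1, int32_t cp2, int32_t *state);

/* Is there a grapheme break between cps[idx - 1] and cps[idx]? */
static bool break_before(const int32_t *cps, size_t n, size_t idx,
                         is_other_fn is_other, break_fn brk) {
    if (idx == 0 || idx >= n) return true;  /* always break at text start/end */

    /* Scan backwards for a restart point: an "Other" codepoint or index 0. */
    size_t start = idx - 1;
    while (start > 0 && !is_other(cps[start])) start--;

    /* Replay every potential break, in order, from the restart point. */
    int32_t state = 0;
    bool is_break = true;
    for (size_t i = start + 1; i <= idx; i++)
        is_break = brk(cps[i - 1], cps[i], &state);
    return is_break;
}

/* Toy stand-ins (assumptions for illustration only). */
static bool toy_is_ri(int32_t cp) { return cp >= 0x1F1E6 && cp <= 0x1F1FF; }
static bool toy_is_other(int32_t cp) { return !toy_is_ri(cp) && cp != 0x0301; }

/* Toy stateful break: no break before U+0301, and regional indicators
 * join in pairs. state == 1 means cp1 already completed a pair. */
static bool toy_break(int32_t cp1, int32_t cp2, int32_t *state) {
    if (cp2 == 0x0301) { *state = 0; return false; }
    if (toy_is_ri(cp1) && toy_is_ri(cp2) && *state == 0) {
        *state = 1;  /* cp2 completes the pair started by cp1 */
        return false;
    }
    *state = 0;
    return true;
}
```

With four regional indicators in a row, the pairing depends on how many precede the offset, which is why a fresh state at an arbitrary position can give the wrong answer: in the toy model, calling `toy_break` on the second and third indicator with state 0 reports no break, while `break_before` restarts at the preceding ordinary character and reports the break between the two flags.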

2 replies

Old-Farmer Nov 17, 2025
Author

Thanks for your reply. Do you mean that a codepoint with boundclass UTF8PROC_BOUNDCLASS_OTHER is safe to pass as the codepoint1 argument with state 0?

stevengj Nov 17, 2025
Maintainer

Yes, I think so.
