accurate string length for UTF-8 std::strings - Rick Johnson - 02-01-2021

If a UTF-8 string contains a curled apostrophe, it's stored as "\xe2\x80\x99", so a string like "Joe’s" returns a length or size of 7, not 5. The same happens for special characters like en dashes and thin spaces. This is a problem because I end up allocating two extra [garbage] character spaces when setting a text range end offset. Could a function be added to CORE's Strings class to return a string length taking these extended characters into account?

Thanks!
-- Rick

RE: accurate string length for UTF-8 std::strings - Rick Johnson - 02-02-2021

In the meantime, I have a function that seems to give an accurate character count for strings including extended UTF-8 characters. I'm sure it can be improved, but it's enabled me to define text ranges to better handle text runs, and may be useful for others as well.

Code:
long charCount(std::string s){

RE: accurate string length for UTF-8 std::strings - garrett - 02-03-2021

This is because UTF-8 (the main/preferred string encoding in our libs, but this also applies to UTF-16) is multi-byte encoded. If you want a count of actual glyphs and not just the byte length, then convert the string to UTF-32 first (it is not a multi-byte encoding). See the hdi:core:: string namespace for functions to handle such a conversion.
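
For anyone landing on this thread: the body of charCount above did not survive in this export, so the following is only a minimal sketch of one common way to get such a count, not Rick's original function and not part of CORE. It counts Unicode code points by skipping UTF-8 continuation bytes (those of the form 10xxxxxx), which for well-formed input should match the length of the same text converted to UTF-32. The name utf8CodePointCount is hypothetical, and the sketch assumes the string is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a CORE function): returns the number of Unicode
// code points in a UTF-8 encoded std::string.
// Assumption: the input is well-formed UTF-8; no validation is done here.
std::size_t utf8CodePointCount(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s)
    {
        // Continuation bytes look like 10xxxxxx. Every code point has
        // exactly one byte that is NOT a continuation byte, so counting
        // those bytes counts code points.
        if ((byte & 0xC0) != 0x80)
        {
            ++count;
        }
    }
    return count;
}
```

With the string from the first post, std::string("Joe\xe2\x80\x99s").size() is 7 while utf8CodePointCount returns 5. Note that this counts code points rather than rendered glyphs, so a base letter followed by a combining accent still counts as two. It is also worth confirming which unit a given text range offset expects before relying on the result.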