accurate string length for UTF-8 std::strings

accurate string length for UTF-8 std::strings - Rick Johnson - 02-01-2021

If a UTF-8 string contains a curled apostrophe, it's stored as "\xe2\x80\x99" so a string like "Joe’s" returns a length or size of 7, not 5. The same happens for special characters like en dashes and thin spaces. This is a problem because I end up allocating two extra [garbage] character spaces when setting a text range end offset.

Could a function be added to CORE's Strings class to return a string length taking these extended characters into account?

Thanks! -- Rick

RE: accurate string length for UTF-8 std::strings - Rick Johnson - 02-02-2021

In the meantime, I have a function that seems to give an accurate character count for strings including extended UTF-8 characters. I'm sure it can be improved, but it's enabled me to define text ranges to better handle text runs, and may be useful for others as well.

Code:
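// Approximate character count for a UTF-8 string.
// Only corrects for three-byte sequences starting with E2 80 or E2 81
// (U+2000 to U+207F: curled quotes, dashes, thin spaces, etc.).
long charCount(const std::string& s){
    long len = s.length();
    // Each match is one three-byte character, so drop its two extra bytes.
    uint32_t e = hdi::core::strings::substrCount(s, "\xe2\x80");
    len -= (e*2);
    e = hdi::core::strings::substrCount(s, "\xe2\x81");
    len -= (e*2);
    return len;
}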

RE: accurate string length for UTF-8 std::strings - garrett - 02-03-2021

This is because UTF-8 (the main/preferred string encoding in our libs) is a variable-width encoding, so a single character can occupy more than one byte. The same applies to UTF-16, where a character can occupy more than one 16-bit unit. If you want a count of actual characters (code points) and not just the byte length, then convert the string to UTF-32 first (it is fixed-width, one unit per code point). See the hdi::core::strings namespace for functions to handle such a conversion.
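
For anyone who only needs the count and not the converted string, here is a minimal sketch in standard C++ that gives the same number a UTF-32 conversion would, assuming the input is valid UTF-8. The function name countCodePoints is made up for this example and is not part of CORE. It relies on the fact that every UTF-8 continuation byte has the bit pattern 10xxxxxx, so counting all the other bytes counts code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a CORE function): counts Unicode code points
// in a UTF-8 encoded std::string.
// Assumes the input is valid UTF-8; malformed input is not detected.
std::size_t countCodePoints(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        // Continuation bytes look like 10xxxxxx; any other byte starts
        // a new code point (ASCII or the lead byte of a sequence).
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

// Example: countCodePoints("Joe\xe2\x80\x99s") yields 5, while size() is 7.
```

Note that this counts code points, not what a user sees as one character: a base letter followed by a combining accent counts as two. It is also worth checking which unit a text range offset actually expects before relying on this count.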