Pattern matching and UTF-8

Started by llamazing, September 08, 2016, 02:47:52 AM

I'm working on a custom dialog manager script for the features I described in this topic. A few questions have arisen in the process regarding UTF-8 characters:

1) In the text of my dialogs.dat file, keywords are denoted by prefacing the keyword with the "@" character. My original concept was to iterate through the keywords in the text using the following: text:gmatch("@(%w+)") which works great for ASCII text and forces keywords to contain only alphanumeric characters. However, my limited testing suggests that it does not work as intended when UTF-8 characters are used.

It looks like a better solution is to use the following instead: text:gmatch("@([^%p%s]+)") which grabs all characters between the @ character and the first punctuation or space character (it only looks at one line at a time, so new-line characters are excluded).

Am I correct that pattern matching doesn't work with UTF-8 characters, and is my second proposed solution viable for when UTF-8 characters are used?

2) I also convert the text input of the player (in their native language) to lowercase with string.lower() so that case does not matter. My testing seems to indicate that string.lower() does not do anything for UTF-8 characters, but I see a note in the Lua manual for the string.lower() entry that it depends on the locale.

How does the locale affect string.lower() and is the locale set automatically by the Solarus engine? Or perhaps by the operating system?

3) I've noticed that string.len() doesn't work the way I want it to in regard to UTF-8 (multibyte) characters since it returns the number of bytes in the string, and not the number of actual characters. It's easy enough to write my own custom function to count the number of characters in a string, but I figure I may as well ask...

Are there any included Lua functions in the Solarus engine that return the number of characters in a string? It looks like Lua 5.3 might have support for that, but as far as I can tell, Solarus appears to be using Lua 5.1.

Your questions are more about Lua than about Solarus. Solarus does not set up anything special regarding UTF-8 and locales, so there is no difference from plain Lua.
Lua 5.1 supports UTF-8 strings, and everything is UTF-8 in Solarus dialog files and string files. However, most functions of the Lua string API are not really adapted to UTF-8 because they are just simple bindings to the C library.

1) I think you are correct. Patterns can work with UTF-8 characters, but maybe not the %w notation. Patterns work byte by byte, so for a multibyte character, none of its bytes alone matches the %w requirements. Your second solution looks good to me.
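2) The definition of what is an uppercase character depends on the locale, which is the one from the operating system. Also, string.lower converts one byte at a time, so it cannot correctly lowercase multibyte UTF-8 characters. So if you don't trust string.lower, you can write your own function.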
3) string.len is the number of bytes, yes. If you want to count the actual characters, you can make a loop and detect the non-ASCII ones.
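
To illustrate point 1, here is a minimal sketch comparing the two patterns from the question on a line that contains accented keywords. The helper name find_keywords and the sample line are made up for this example, and the results in the comments assume the default "C" locale, where bytes above 127 are neither alphanumeric, punctuation, nor whitespace.

```lua
-- Collect the keywords marked with "@" using the given Lua pattern.
-- find_keywords is a hypothetical helper, not part of Solarus or Lua.
local function find_keywords(text, pattern)
  local keywords = {}
  for keyword in text:gmatch(pattern) do
    keywords[#keywords + 1] = keyword
  end
  return keywords
end

-- Sample dialog line (invented for this example).
local line = "Talk to @Zelda about the @épée, then visit the @château."

-- %w tests one byte at a time, so it stops at the first byte of a
-- multibyte character. In the "C" locale this should give
-- { "Zelda", "ch" }: "@épée" is skipped and "château" is cut short.
local ascii_only = find_keywords(line, "@(%w+)")

-- [^%p%s] accepts any byte that is neither punctuation nor whitespace,
-- which includes the bytes of multibyte characters in the "C" locale.
-- This should give { "Zelda", "épée", "château" }.
local utf8_friendly = find_keywords(line, "@([^%p%s]+)")

print(table.concat(ascii_only, " | "))
print(table.concat(utf8_friendly, " | "))
```

Note that the second pattern still treats any ASCII punctuation as the end of a keyword, so a keyword cannot contain characters such as "-" or "'".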

Here is an example of how to detect a two-byte character in a UTF-8 string: https://github.com/christopho/zelda_roth_se/blob/dev/data/scripts/dialog_box.lua#L399
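
Independently of that script, a character count for point 3 could be sketched as follows. This is only one possible approach and it assumes the input is valid UTF-8: in UTF-8, continuation bytes are always in the range 0x80 to 0xBF, so counting every other byte counts the characters. The name utf8_len is hypothetical.

```lua
-- Count the characters of a UTF-8 string in Lua 5.1.
-- Every character has exactly one byte outside the continuation range
-- 0x80-0xBF (its first byte), so we count those bytes.
-- Assumes valid UTF-8 input; utf8_len is a hypothetical helper name.
local function utf8_len(s)
  local count = 0
  for i = 1, #s do
    local byte = s:byte(i)
    if byte < 0x80 or byte > 0xBF then
      count = count + 1
    end
  end
  return count
end

print(#"épée")           -- 6 bytes
print(utf8_len("épée"))  -- 4 characters
```

This counts code points, so a letter followed by a separate combining accent would count as two.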
More information here: http://lua-users.org/wiki/LuaUnicode
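
For point 2, a custom lowercase function could look something like the sketch below. It is not a general Unicode solution: lower_map is a hypothetical table that only covers the accented letters listed in it, and you would need to extend it for the languages your quest supports. ASCII letters are handled with an explicit [A-Z] range so that the result does not depend on the locale.

```lua
-- Hypothetical mapping from uppercase to lowercase for the non-ASCII
-- letters a quest needs. Extend it as required.
local lower_map = {
  ["É"] = "é", ["È"] = "è", ["À"] = "à",
  ["Ç"] = "ç", ["Ö"] = "ö", ["Ü"] = "ü",
}

-- Lowercase a UTF-8 string without relying on the locale.
-- utf8_lower is a hypothetical helper name.
local function utf8_lower(s)
  -- ASCII letters: the byte range A-Z is the same in every locale.
  s = s:gsub("[A-Z]", string.lower)
  -- Multibyte characters: a lead byte followed by its continuation
  -- bytes. Sequences missing from lower_map are left unchanged.
  s = s:gsub("[\194-\244][\128-\191]*", lower_map)
  return s
end

print(utf8_lower("ÉPÉE"))  -- épée
```

The player's input and the keywords from the dialog file would both go through this function before being compared.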