Audit usage of Glib::ustring

Using a very unscientific method (repeatedly grabbing a GDB backtrace while opening a ~10 MB file), a few main things popped up:

~50%: UTF-8 collate/comparison functions
~35%: Pango font functions (mostly GSUB reading: inbox#440)
~15%: other stuff

Let's assume this dodgy method does indeed reflect how much time is being spent in each of them, though obviously this could be very wrong.

Why are these UTF-8 functions being called then? Let's take a simple example:

const Glib::ustring foo = "Lorem ipsum dolor sit amet";
if (foo != "bar") {
    std::cout << "Is this fast?";
}

Now you're probably thinking, this is pretty fast, right? All we need to do is compare the string lengths, see that they're obviously not equal and move on.

This would be true if these were mere std::strings, but alas! Welcome to UTF-8. All of a sudden this becomes particularly expensive when we're dealing with Glib::ustrings, because you can't just count the number of bytes to find the number of characters!
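To make the bytes-versus-characters point concrete, here's a minimal sketch using only the standard library. The function name utf8_length is made up for illustration and this is not how Glib does it internally; it just assumes valid UTF-8 and counts every byte that isn't a continuation byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts UTF-8 code points by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8.
// This is an illustration only, not Glib's implementation.
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // lead byte or plain ASCII: starts a new code point
        }
    }
    return count;
}
```

For the UTF-8 encoding of "naïve", size() reports 6 bytes while utf8_length reports 5 code points, and finding that out means walking the whole string.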

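Glib::ustring's comparison operators perform a linguistically aware (locale-dependent collation) comparison instead of a byte-by-byte one, which is where the UTF-8 collate calls in the backtrace come from. Most of the time we don't need this.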

So is this a valid concern? Here's what I'd suggest, but I'm open to other ideas:

If you want to use == or compare but don't need Unicode comparison, convert it to std::string first.
If you're not going to be iterating through Unicode strings, consider whether you really need Glib::ustring.
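As a rough sketch of the first suggestion: the helper below is hypothetical (differs_bytewise is not an existing function) and only uses std::string so it stands alone. With an actual Glib::ustring foo, the idea would be to pass foo.raw(), which exposes the underlying std::string without copying.

```cpp
#include <string>

// Hypothetical helper: exact byte-wise inequality, with no collation or
// normalization involved. With a Glib::ustring `foo`, the first argument
// would be `foo.raw()` (assumption: the caller only needs exact matching).
bool differs_bytewise(const std::string& raw_bytes, const std::string& other) {
    return raw_bytes != other;
}

// Two encodings of "é" that render identically but differ as bytes.
const std::string precomposed = "\xC3\xA9";  // U+00E9
const std::string decomposed = "e\xCC\x81";  // U+0065 followed by U+0301
```

One thing to watch for: a byte-wise check treats precomposed and decomposed as different strings, so this only fits places where we're matching exact identifiers or keys rather than user-visible text.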

Edited Jun 13, 2019 by Qantas94Heavy
