Syntax Tests: Caret misaligned with multi-byte chars in UTF-8
While testing the new Hugo syntax I ran into strange behaviour in tests that contain multi-byte encoded characters. Basically, the test caret falls out of alignment with the line above it as soon as a multi-byte char is encountered.
My guess is that the test code assumes that each character in the line above maps to a single byte, and keeps counting columns without accounting for multi-byte characters in UTF-8 files.
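To illustrate the suspected mismatch, here is a minimal Python sketch (not the actual test code, which I haven't inspected). The helper names `byte_offset` and `column_drift` are made up for this example; they only show how a byte-based counter runs ahead of the visible column once a two-byte character has been passed.

```python
def byte_offset(line: str, column: int, encoding: str = "utf-8") -> int:
    """Return the byte offset at which the character at `column` starts."""
    return len(line[:column].encode(encoding))


def column_drift(line: str, column: int, encoding: str = "utf-8") -> int:
    """How far a byte-based counter has run ahead of the character column."""
    return byte_offset(line, column, encoding) - column


# Hypothetical test line: a backslash, the acute accent, then plain text.
sample = "\\\u00b4e rest"

for col, ch in enumerate(sample):
    print(col, repr(ch), "utf-8 drift:", column_drift(sample, col),
          "latin-1 drift:", column_drift(sample, col, "latin-1"))
```

Every column after the accent shows a drift of 1 in UTF-8 and 0 in ISO-8859-1, which would match carets being reported one column away from where they were written.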
If you look at this test file:
https://gitlab.com/tajmone/highlight-test-suite/blob/master/hugo/syntax_test_strings.hug#L134
You'll notice that I had to limit the actual carets when dealing with the acute accent ´
because I was getting error reports mentioning columns where there wasn't actually a caret for testing.
This acute accent business is an edge case, and I had to struggle a bit to match it in escape sequences (defined as Interpolation) because the character can be a single byte in ISO-8859-1 or two bytes in UTF-8.
So I had to create an ASCII test file too for this:
https://gitlab.com/tajmone/highlight-test-suite/blob/master/hugo/syntax_test_interpolation-ascii.hug
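For reference, the two encodings of the accent can be checked with a short Python sketch. The helper name `encoded_forms` is only for illustration. Note that the character is outside the 7-bit ASCII range, so the single-byte form is strictly speaking the ISO-8859-1 one.

```python
ACUTE = "\u00b4"  # the acute accent character (´)


def encoded_forms(ch: str) -> dict:
    """Encode one character in the two encodings discussed here."""
    return {enc: ch.encode(enc) for enc in ("utf-8", "latin-1")}


forms = encoded_forms(ACUTE)
print(forms["utf-8"].hex(), len(forms["utf-8"]))      # two bytes in UTF-8
print(forms["latin-1"].hex(), len(forms["latin-1"]))  # one byte in ISO-8859-1
print(ACUTE.isascii())  # the accent is not a 7-bit ASCII character
```

This is why a byte-oriented pattern needs one alternative for each encoding.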
In the syntax definition, I had to cover both the ASCII/ISO-8859-1 version of the accent and the UTF-8 version, because although Hugo sources are usually in ISO-8859-1, inside Asciidoctor documentation projects they'll be either pasted into UTF-8 documents, or included externally as UTF-8 converted files (because Asciidoctor doesn't support ISO-encoded files).
--[[
NOTE: The RegEx below defines the acute accent (´) char twice because, depending
on whether the source is in ASCII/ISO-8859-1 or UTF-8, its encoding will
differ (the former is the expected encoding for Hugo sources, but the
latter might be encountered in documentation projects). --]]
Interpolation = [=[ (?x)(\\(?:
\xC2\xB4[a-zA-Z] | # Acute accent (´) in UTF-8 docs will be $c2 $b4.
[`´~\^:][a-zA-Z] | # Note: acute accent in ASCII format also found here.