length function works incorrectly on UTF-8 systems
Modern Linux distributions use UTF-8 as their default encoding. Unfortunately, standard Free Pascal functions such as length or pos return byte counts and byte offsets, not character counts and character positions, when given a string variable containing UTF-8 encoded text.
Here’s a simple example:
program test;
var
  s: string;
begin
  readln(s);
  writeln(length(s));
end.
If you run this code on Linux Mint 22 with a UTF-8 locale (I'm using FPC 3.2.2 from the system repository) and enter a word in Russian, the result is twice the expected value: length counts bytes, and each Cyrillic letter takes two bytes in UTF-8. For example, the six-letter word привет yields 12.
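To make the byte/character difference concrete, here is a minimal sketch of counting UTF-8 code points by hand. Utf8CodePointCount is a hypothetical helper name of my own, not an RTL function and not necessarily how the linked repository does it. It assumes the input is valid UTF-8 and simply skips continuation bytes (those of the form 10xxxxxx).

```pascal
program utf8len;

{$mode objfpc}{$H+}

{ Hypothetical helper: counts UTF-8 code points in s.
  Assumes s contains valid UTF-8. Every code point has exactly one
  lead byte; continuation bytes match the bit pattern 10xxxxxx,
  so we count every byte that is NOT a continuation byte. }
function Utf8CodePointCount(const s: string): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  s: string;
begin
  ReadLn(s);
  { Length reports the number of bytes in the string }
  WriteLn('bytes:       ', Length(s));
  { The helper reports the number of code points }
  WriteLn('code points: ', Utf8CodePointCount(s));
end.
```

With привет as input, this should report 12 bytes and 6 code points. Note that code points are still not the same as user-perceived characters (combining marks, for instance, are separate code points), so this is only an approximation of what a beginner would expect from length.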
Any beginner in programming would naturally use the string type without delving into the differences between WideString, UnicodeString, and UTF8String. Therefore, if we want to lower the entry barrier for beginners, our standard library should support UTF-8 strings out of the box.
I’ve created a simple example demonstrating how this could work: https://github.com/unxed/utf8.pas