length function works incorrectly on UTF-8 systems

Modern Linux distributions use UTF-8 encoding by default. Unfortunately, standard Free Pascal functions such as length and pos count bytes rather than characters, so they return unexpected values when a string variable holds UTF-8 encoded text with non-ASCII characters.

Here’s a simple example:

program test;

var
  s: string;

begin
  readln(s);
  writeln(length(s));
end.

If you run this code on Linux Mint 22 with a UTF-8 locale (I'm using FPC 3.2.2 from the system repository) and enter a word in Russian, the result is twice the expected value: length counts bytes, and each Cyrillic letter occupies two bytes in UTF-8. For example, the 6-letter word "привет" gives 12.
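For comparison, here is a minimal sketch of how the character count could be obtained today without changing the library. Utf8CodePointCount is a hypothetical helper written for this illustration (it is not part of the RTL and not taken from the linked repository). It assumes the input is valid UTF-8, and it counts code points, not user-perceived characters, so combining marks are counted separately. UTF8Decode is the existing System unit function that converts to UnicodeString.

```pascal
program utf8len;

{$mode objfpc}{$H+}

// Hypothetical helper, not part of the RTL: counts UTF-8 code points
// by skipping continuation bytes, which have the bit pattern 10xxxxxx.
// Assumes s holds valid UTF-8.
function Utf8CodePointCount(const s: RawByteString): SizeInt;
var
  i: SizeInt;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if (Ord(s[i]) and $C0) <> $80 then
      Inc(Result);
end;

var
  s: string;

begin
  readln(s);
  // Length on a byte string reports the number of bytes.
  writeln('bytes:        ', Length(s));
  // Counting only lead bytes gives the number of code points.
  writeln('code points:  ', Utf8CodePointCount(s));
  // After decoding, Length reports UTF-16 code units instead.
  writeln('UTF-16 units: ', Length(UTF8Decode(s)));
end.
```

For Cyrillic input the last two numbers should agree, since those letters fit in a single UTF-16 code unit; characters outside the Basic Multilingual Plane would take two units.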

Any beginner in programming would naturally use the string type without delving into the differences between WideString, UnicodeString, and UTF8String. Therefore, if we want to lower the entry barrier for beginners, our standard library should support UTF-8 strings out of the box.

I’ve created a simple example demonstrating how this could work: https://github.com/unxed/utf8.pas

Edited Aug 22, 2024 by unxed
