UTF-8 is broken on Darwin/AArch64

Summary

On Darwin/AArch64 (at least), UTF-8 processing is broken. Specifically, TEncoding.UTF8.GetBytes() converts any non-ANSI characters to '?'. This happens in trunk and does not happen on stable.

System Information

  • Operating system: macOS 14.0 (23A344)
  • Processor architecture: M1 (AArch64)
  • Compiler version: Lazarus 3.1 (rev lazarus_3_0-15-g9bef988478) FPC 3.3.1 aarch64-darwin-cocoa
  • Device: MacBook M2 Pro

Steps to reproduce

Run this code:

procedure TForm1.Button1Click(Sender: TObject);
var
  s1, s2 : String;
  u1 : UnicodeString;
  b1 : TBytes;

  function DumpBytes (var Bytes; Len : integer) : string;
  var
     I : integer;
     v : byte;
  begin
     Result := '';
     for I := 0 to Len-1 do
     begin
       try
         v := TBytes(Bytes)[I];
       except
         v := 254;
       end;
       Result := Result + IntToHex(v) + ' ';
     end;
  end;

begin
  s1 := 'EKG PB R'''' 波持续时间(持续时长、时长、时间长度、时间、时间长短、为时、为期、历时、延续时间、持久时间、持续期) AVR 导联';
  u1 := s1;
  b1 := TEncoding.UTF8.GetBytes(u1);
  s2 := TEncoding.UTF8.GetString(b1);

  Memo1.Lines.Append('s1 : '+DumpBytes(s1, Length(s1)));
  Memo1.Lines.Append('u1 : '+DumpBytes(u1, Length(u1)*2));
  Memo1.Lines.Append('b1 : '+DumpBytes(b1, Length(b1)));
  Memo1.Lines.Append('s2 : '+DumpBytes(s2, Length(s2)));
end;

You'll see that the Unicode characters with code points > 255 have been converted to 3F ('?') by the GetBytes() routine.

Example Project

See utf8-project.zip

What is the current bug behavior?

Unicode characters with code points > 255 are converted to 3F ('?') by the GetBytes() routine.

What is the expected (correct) behavior?

Unicode characters should be converted to their correct UTF-8 representations.
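
To make the expected result concrete, here is a minimal console sketch (not part of the attached project) that encodes a single character from the test string, 波 (U+6CE2), and compares the result with its standard UTF-8 encoding, E6 B3 A2. The program name and the idea of reproducing this outside the LCL are assumptions; whether a console program hits the same code path as the form-based example has not been verified.

```pascal
program utf8check;  // hypothetical name, for illustration only
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}  // installs the iconv-based widestring manager
  SysUtils;
const
  // Standard UTF-8 encoding of U+6CE2
  Expected: array[0..2] of Byte = ($E6, $B3, $A2);
var
  u: UnicodeString;
  b: TBytes;
  i: Integer;
  ok: Boolean;
begin
  // Build the string from a code unit so the source file encoding does not matter
  u := WideChar($6CE2);
  b := TEncoding.UTF8.GetBytes(u);

  // Dump the bytes that GetBytes actually produced
  for i := 0 to High(b) do
    Write(IntToHex(b[i], 2), ' ');
  WriteLn;

  // Compare against the expected three-byte sequence
  ok := Length(b) = Length(Expected);
  if ok then
    for i := 0 to High(b) do
      if b[i] <> Expected[i] then
        ok := False;

  if ok then
    WriteLn('OK: correct UTF-8')
  else
    WriteLn('FAIL: unexpected bytes');
end.
```

A correct build should print E6 B3 A2 followed by OK. If the console program is affected in the same way as the form, it would print 3F and FAIL.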

Relevant logs and/or screenshots

The problem appears to be somewhere in cwstring.pp: iconv_wide2ansi is not initialised (it is -1), so the code falls back to DefaultUnicode2AnsiMove. Unlike most 'ansi' routines, that one does exactly what its name says and converts all characters to ANSI. I can't debug how iconv_wide2ansi is initialised, because no breakpoints ever fire for me.
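
Since breakpoints are not usable here, a print-based check may help narrow this down. The sketch below is a suggestion only: it prints DefaultSystemCodePage and compares TEncoding.UTF8.GetBytes() with UTF8Encode() on the same input. It assumes that UTF8Encode() encodes directly in the RTL rather than going through the widestring manager, which should be verified against the RTL source. If that assumption holds and only the first line shows 3F, the fault is more likely in the manager/iconv path than in the UTF-8 encoder itself.

```pascal
program utf8diag;  // hypothetical name, for illustration only
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}
  SysUtils;
var
  u: UnicodeString;
  viaEncoding: TBytes;
  viaRtl: RawByteString;
  i: Integer;
begin
  u := WideChar($6CE2);  // same test character, U+6CE2

  // The code page the RTL treats as the default for ANSI strings
  WriteLn('DefaultSystemCodePage = ', DefaultSystemCodePage);

  // Path reported as broken in this issue
  viaEncoding := TEncoding.UTF8.GetBytes(u);
  Write('TEncoding.UTF8.GetBytes: ');
  for i := 0 to High(viaEncoding) do
    Write(IntToHex(viaEncoding[i], 2), ' ');
  WriteLn;

  // Comparison path (assumed not to use the widestring manager)
  viaRtl := UTF8Encode(u);
  Write('UTF8Encode:              ');
  for i := 1 to Length(viaRtl) do
    Write(IntToHex(Ord(viaRtl[i]), 2), ' ');
  WriteLn;
end.
```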

Possible fixes

Edited by Grahame Grieve