UTF-8 is broken on Darwin/AArch64
Summary
On Darwin/AArch64 (at least), UTF-8 processing is broken. Specifically, TEncoding.UTF8.GetBytes() converts any non-ANSI character to '?'. This happens on trunk and does not happen on stable.
System Information
- Operating system: macOS 14.0 (23A344)
- Processor architecture: M1 (Aarch64)
- Compiler version: Lazarus 3.1 (rev lazarus_3_0-15-g9bef988478) FPC 3.3.1 aarch64-darwin-cocoa
- Device: MacBook M2 Pro
Steps to reproduce
Run this code:
procedure TForm1.Button1Click(Sender: TObject);
var
  s1, s2 : String;
  u1 : UnicodeString;
  b1 : TBytes;

  function DumpBytes (var Bytes; Len : integer) : string;
  var
    I : integer;
    v : byte;
  begin
    Result := '';
    for I := 0 to Len-1 do
    begin
      try
        v := TBytes(Bytes)[I];
      except
        v := 254;
      end;
      Result := Result + IntToHex(v) + ' ';
    end;
  end;

begin
  s1 := 'EKG PB R'''' 波持续时间(持续时长、时长、时间长度、时间、时间长短、为时、为期、历时、延续时间、持久时间、持续期) AVR 导联';
  u1 := s1;
  b1 := TEncoding.UTF8.GetBytes(u1);
  s2 := TEncoding.UTF8.GetString(b1);
  Memo1.Lines.Append('s1 : '+DumpBytes(s1, Length(s1)));
  Memo1.Lines.Append('u1 : '+DumpBytes(u1, Length(u1)*2));
  Memo1.Lines.Append('b1 : '+DumpBytes(b1, Length(b1)));
  Memo1.Lines.Append('s2 : '+DumpBytes(s2, Length(s2)));
end;
You'll see that the Unicode characters with code points > 255 have been converted to 3F ('?') by the GetBytes() routine.
Example Project
See utf8-project.zip
What is the current bug behavior?
Unicode characters with code points > 255 are converted to 3F ('?') by the GetBytes() routine.
What is the expected (correct) behavior?
Unicode characters should be converted to their correct UTF-8 representations.
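To make the expected result concrete, here is a minimal console sketch that encodes a single character from the test string, 波 (U+6CE2), whose UTF-8 encoding is the three bytes E6 B3 A2. The program name and the Utf8Hex helper are hypothetical and not part of the attached project. Listing cwstring in the uses clause is also an assumption about how a console program would be set up on Unix; the LCL project in this report may pull it in differently.

```pascal
program utf8expected;
{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF}  // assumption: installs the Unix widestring manager
  SysUtils;

// Hypothetical helper: encode U with TEncoding.UTF8 and return the bytes as hex.
function Utf8Hex(const U: UnicodeString): string;
var
  Encoded: TBytes;
  I: Integer;
begin
  Result := '';
  Encoded := TEncoding.UTF8.GetBytes(U);
  for I := 0 to High(Encoded) do
    Result := Result + IntToHex(Encoded[I], 2) + ' ';
  Result := Trim(Result);
end;

var
  U: UnicodeString;
begin
  // Build the character from its code point so the result does not
  // depend on the source file encoding.
  U := WideChar($6CE2);
  WriteLn('GetBytes : ', Utf8Hex(U));
  WriteLn('expected : E6 B3 A2');
end.
```

With the bug described here, the first line would be expected to show 3F instead of the three-byte sequence.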
Relevant logs and/or screenshots
The problem is somewhere in cwstring.pp: iconv_wide2ansi is not initialised (it is still -1), so the code calls DefaultUnicode2AnsiMove instead. That routine does exactly what its name says (unlike most 'ansi' routines) and converts every character to ANSI. I can't debug how iconv_wide2ansi gets initialised, because no breakpoints ever fire for me.
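One possible way to narrow this down without a debugger is to compare TEncoding.UTF8.GetBytes() with UTF8Encode() on the same input. The sketch below assumes that UTF8Encode() uses the RTL's built-in converter and does not go through the widestring manager callback that cwstring installs. If that assumption is wrong, both lines will show the same result and the comparison adds nothing. The program and helper names are hypothetical.

```pascal
program utf8compare;
{$mode objfpc}{$H+}

uses
  {$IFDEF UNIX}cwstring,{$ENDIF}  // assumption: same setup as the failing project
  SysUtils;

// Hypothetical helper: hex dump of a byte array.
function HexOfBytes(const B: TBytes): string;
var
  I: Integer;
begin
  Result := '';
  for I := 0 to High(B) do
    Result := Result + IntToHex(B[I], 2) + ' ';
  Result := Trim(Result);
end;

// Hypothetical helper: hex dump of a raw byte string.
function HexOfRaw(const S: RawByteString): string;
var
  I: Integer;
begin
  Result := '';
  for I := 1 to Length(S) do
    Result := Result + IntToHex(Ord(S[I]), 2) + ' ';
  Result := Trim(Result);
end;

var
  U: UnicodeString;
begin
  U := WideChar($6CE2);  // U+6CE2, taken from the test string
  // Path under suspicion in this report.
  WriteLn('TEncoding.UTF8.GetBytes : ', HexOfBytes(TEncoding.UTF8.GetBytes(U)));
  // Assumed to bypass the widestring manager (see the note above).
  WriteLn('UTF8Encode              : ', HexOfRaw(UTF8Encode(U)));
end.
```

If only the first line shows 3F, that would be consistent with the fallback to DefaultUnicode2AnsiMove described above. If both lines are wrong, the cause is probably elsewhere.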
Possible fixes
Edited by Grahame Grieve