TFixedFormatDataset: Incorrect field values from utf8-encoded data file
Summary
TFixedFormatDataset extracts incorrect field values from the lines of UTF-8 encoded data files. The data lines are split at positions that are four times the field sizes defined in the Schema.
System Information
- Operating system: Windows 11, but I think that the issue is independent of OS
- Processor architecture: x86-64, issue expected for other architectures as well
- Compiler version: 3.2.2, trunk (11d542cf)
- Device: Computer
Steps to reproduce
The attached zip file contains two data files: one is encoded with code page 1252, the other with UTF-8. Each file contains five fields, each 10 characters wide, as well as some dummy records. The test project loads each data file into a TFixedFormatDataset which is set up correctly for the encoding of the file.
In case of the cp1252 file, the file lines are split up correctly at the positions defined by the Schema.
In case of the UTF-8 file, the lines are split at positions which correspond to 4x the field size.
Example Project
See attachment: tfixedformatdataset_utf8.zip
- Project1 is a GUI program for Lazarus
- Project2 is a non-GUI program for FPC alone.
What is the current bug behavior?
The input lines of the UTF-8 encoded file are split at positions corresponding to four times the field size.
What is the expected (correct) behavior?
The input lines should be split at the string positions defined by the field sizes, like for the CP1252 encoded file.
Possible fixes
I don't have a fix ATM, but the issue must be related to the fact that TStringField returns a DataSize of essentially 4 times its Size for UTF-8:
function TStringField.GetDataSize: Integer;
begin
  case FCodePage of
    CP_UTF8: Result := 4*Size+1;
    else Result := Size+1;
  end;
end;
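To illustrate the suspected mechanism, here is a minimal stand-alone sketch. It does not use TFixedFormatDataset at all. It only assumes (unverified) that the split positions within a line are accumulated from DataSize-1 rather than from Size. DataSizeFor and PrintOffsets are hypothetical helpers written for this illustration; DataSizeFor mirrors the GetDataSize logic quoted above.

```pascal
program datasize_sketch;
{$mode objfpc}{$H+}

const
  // Local copies of the code page numbers, so the sketch is self-contained
  CP_UTF8 = 65001;
  CP_1252 = 1252;
  FieldCount = 5;   // five fields, as in the attached test files
  FieldSize = 10;   // each field is 10 characters wide

// Hypothetical helper: same arithmetic as TStringField.GetDataSize above
function DataSizeFor(ACodePage: Word; ASize: Integer): Integer;
begin
  case ACodePage of
    CP_UTF8: Result := 4*ASize + 1;
  else
    Result := ASize + 1;
  end;
end;

// Prints the 1-based start column of each field, ASSUMING the offsets
// are accumulated from DataSize-1 (an assumption, not confirmed in the
// TFixedFormatDataset source)
procedure PrintOffsets(const AName: String; ACodePage: Word);
var
  i, ofs: Integer;
begin
  ofs := 1;
  Write(AName, ': ');
  for i := 1 to FieldCount do
  begin
    Write(ofs, ' ');
    Inc(ofs, DataSizeFor(ACodePage, FieldSize) - 1);
  end;
  WriteLn;
end;

begin
  PrintOffsets('cp1252', CP_1252);
  PrintOffsets('utf8', CP_UTF8);
end.
```

Under this assumption the cp1252 fields start at columns 1, 11, 21, 31, 41, while the utf8 fields start at columns 1, 41, 81, 121, 161, which matches the 4x field size splitting observed with the UTF-8 file.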