TFixedFormatDataset: Incorrect field values from utf8-encoded data file

Summary

TFixedFormatDataset fails to extract the correct field values from the lines of UTF-8 encoded data files. The data lines are split at positions corresponding to four times the field size instead of at the field size itself.

System Information

Operating system: Windows 11, but I think that the issue is independent of OS
Processor architecture: x86-64, issue expected for other architectures as well
Compiler version: 3.2.2, trunk (11d542cf)
Device: Computer

Steps to reproduce

The attached zip file contains two data files: one is encoded with code page 1252, the other with UTF-8. Each file contains five fields, each 10 characters wide, and some dummy records. The test project loads each data file into a TFixedFormatDataset which is set up correctly for the encoding of the file.

In the case of the CP1252 file, the file lines are split correctly at the positions defined by the Schema.

In the case of the UTF-8 file, the lines are split at positions which correspond to 4x the field size.

Example Project

See attachment: tfixedformatdataset_utf8.zip

Project1 is a GUI program for Lazarus
Project2 is a non-GUI program for FPC alone.

What is the current bug behavior?

The input lines of the utf-8 encoded file are split at positions corresponding to four-fold field size.

What is the expected (correct) behavior?

The input lines should be split at the string positions defined by the field sizes, like for the CP1252 encoded file.
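
To illustrate what "split at the string positions" means for UTF-8 data, here is a minimal sketch that splits a line by characters rather than by bytes. It is not taken from the TFixedFormatDataset sources; the program name, the sample line and the two-field layout are made up for illustration, and only the field width of 10 matches the attached files. It decodes the line to a UnicodeString first, so positions count UTF-16 code units, which matches characters only within the Basic Multilingual Plane.

```pascal
program SplitByChars;
{$mode objfpc}{$H+}

const
  FieldSize = 10;  // characters per field, as in the attached test files

var
  Raw: RawByteString;    // UTF-8 bytes of one data line (hypothetical sample)
  Line: UnicodeString;   // the same line as UTF-16 code units
  Field: UnicodeString;
  i: Integer;
begin
  // "Zürich" padded to 10 characters, followed by "Bern" padded to 10 characters.
  // #$C3#$BC is the two-byte UTF-8 encoding of "ü".
  Raw := 'Z'#$C3#$BC'rich    Bern      ';
  Line := UTF8Decode(Raw);
  WriteLn('bytes: ', Length(Raw), ', characters: ', Length(Line));

  // Split at character positions, so a multi-byte character
  // does not shift the start of the following field.
  for i := 0 to (Length(Line) div FieldSize) - 1 do
  begin
    Field := Copy(Line, i * FieldSize + 1, FieldSize);
    WriteLn('field ', i + 1, ': ', Length(Field), ' characters, ',
            Length(UTF8Encode(Field)), ' bytes in UTF-8');
  end;
end.
```

Each field is 10 characters long even though the first one takes more bytes in UTF-8, which is the behavior the CP1252 file already gets for free because one character is one byte there.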

Possible fixes

I don't have a fix ATM, but the issue must be related to the fact that TStringField returns a DataSize that is essentially 4 times its Size when the code page is UTF-8:

function TStringField.GetDataSize: Integer;

begin
  case FCodePage of
    CP_UTF8: Result := 4*Size+1;
    else     Result :=   Size+1;
  end;
end;
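
The arithmetic behind this suspicion can be sketched as follows. This is only an illustration under the assumption that the dataset advances through the input line by the buffer width of each field; BufferWidth is a hypothetical helper that mirrors the formula in GetDataSize without the terminator byte, and is not part of the db unit.

```pascal
program OffsetDemo;
{$mode objfpc}{$H+}

const
  FieldCount = 5;   // five fields, as in the attached test files
  FieldSize  = 10;  // characters per field

// Hypothetical helper: mirrors TStringField.GetDataSize
// without the trailing terminator byte.
function BufferWidth(IsUtf8: Boolean; Size: Integer): Integer;
begin
  if IsUtf8 then
    Result := 4 * Size
  else
    Result := Size;
end;

var
  i: Integer;
begin
  // Print the 1-based start position of every field for both cases.
  for i := 0 to FieldCount - 1 do
    WriteLn('field ', i + 1,
            ': cp1252 start = ', i * BufferWidth(False, FieldSize) + 1,
            ', utf8 start = ', i * BufferWidth(True, FieldSize) + 1);
end.
```

With a field size of 10, the CP1252 starts advance in steps of 10 while the UTF-8 starts advance in steps of 40, which would match the 4x splitting seen in the test project.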

Edited Sep 02, 2024 by Werner Pamler