
Add utf-8 invalid replacement encoder to trace transformers

What does this MR do?

Adds the UTF-8 replacement encoder transform to the trace transformers, so that traces containing malformed UTF-8 sequences are normalized.

Why was this MR needed?

#27844 (closed)

What's the best way to test this MR?

Run the following job with a pwsh Runner (it's possible the problem only occurs on Windows):

malform:
  script: |
    "Hello, 世界" | Format-Hex

Format-Hex outputs both a hexadecimal representation and an ASCII representation. The rendered ASCII column produces malformed UTF-8 sequences that GitLab's trace processing cannot handle.

  • GitLab will render the log successfully on a build prior to the latest Trace Improvements MR. The previous trace implementation called bytes.Runes on each Write(). This was very expensive, but it had the benefit of replacing incorrect UTF-8 with the "Unicode replacement character".

  • Without this MR, GitLab will fail to process the log and display an error.

  • With this MR, the UTF-8 replacement encoder transform behaves identically to the previous implementation, replacing invalid runes with the "Unicode replacement character".

    Fortunately, the transformer is far less expensive than bytes.Runes, especially in terms of allocations. Performance is degraded slightly, but nowhere near the poor performance of the previous implementation.
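To illustrate the behaviour described above, here is a minimal standard-library sketch. It is not the transformer used in this MR. The function names `replaceWithRunes` and `replaceInvalid` are hypothetical: the first mirrors the old bytes.Runes approach, and the second shows a single-pass replacement that avoids the intermediate rune slice. Both substitute U+FFFD (the "Unicode replacement character") for each invalid byte.

```go
package main

import (
	"bytes"
	"fmt"
	"unicode/utf8"
)

// replaceWithRunes mirrors the old approach: decode the whole input into
// a []rune with bytes.Runes (invalid bytes become U+FFFD), then re-encode.
// This allocates a rune slice plus a new byte slice on every call.
func replaceWithRunes(p []byte) []byte {
	return []byte(string(bytes.Runes(p)))
}

// replaceInvalid (hypothetical helper) walks the input once, copying valid
// sequences unchanged and writing U+FFFD for each invalid byte.
func replaceInvalid(p []byte) []byte {
	out := make([]byte, 0, len(p))
	for len(p) > 0 {
		r, size := utf8.DecodeRune(p)
		if r == utf8.RuneError && size == 1 {
			// Invalid byte: emit the replacement character instead.
			out = append(out, "\uFFFD"...)
		} else {
			// Valid sequence (including a genuine U+FFFD): copy as-is.
			out = append(out, p[:size]...)
		}
		p = p[size:]
	}
	return out
}

func main() {
	// "世" is E4 B8 96 in UTF-8; dropping the last byte makes it malformed.
	malformed := []byte("Hello, \xe4\xb8")

	fmt.Println(utf8.Valid(malformed))
	fmt.Println(utf8.Valid(replaceWithRunes(malformed)))
	fmt.Println(utf8.Valid(replaceInvalid(malformed)))
}
```

A streaming transformer can apply the same substitution as bytes pass through Write(), which is where the allocation savings mentioned above would come from.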

What are the relevant issue numbers?

Closes #27844 (closed)
