Geo blob replication fails with HPE_USER llhttp callback error on Ubuntu 24.04 with kernel 6.17

Summary

All Geo blob replication fails on GET-provisioned Ubuntu 24.04 instances with both stock GitLab 18.10.1 and MR branch builds. Every blob replicator (PackageFile, Upload, ProjectComponentFile, etc.) fails with the same error:

Error downloading file: error reading from socket: 
Error Parsing data: HPE_USER Span callback error in on_header_field

The secondary's BlobDownloadService uses the http gem (which uses llhttp-ffi) to download files from the primary's internal Geo retrieve API. The HTTP response parser's FFI callbacks are corrupted, causing every download to fail.

Root Cause Analysis

The llhttp-ffi gem's LLHttp::Parser class defines callback wrapper methods via class_eval during gem loading (Rails boot):

CALLBACKS_WITH_DATA.each do |callback|
  class_eval(<<~RB)
    private def #{callback}(buffer, length)
      @delegate.#{callback}(buffer.get_bytes(0, length))
    end
  RB
end

These methods are then converted to FFI function pointers via method(:on_header_field).to_proc in initialize. In this environment, the Method#to_proc conversion produces corrupted FFI function pointers for methods defined during the Rails boot process.
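
The wrapper pattern can be sketched in plain Ruby without FFI. SketchParser and RecordingDelegate are hypothetical names, and a Ruby String stands in for the FFI buffer pointer, so this only illustrates how the wrappers are generated and turned into procs. It does not exercise the native callback path that fails.

```ruby
# Hypothetical stand-in for LLHttp::Parser; no FFI is involved here.
class SketchParser
  CALLBACKS_WITH_DATA = %i[on_header_field on_header_value].freeze

  # Generate one private wrapper per callback at class-definition time.
  CALLBACKS_WITH_DATA.each do |callback|
    class_eval(<<~RB, __FILE__, __LINE__ + 1)
      private def #{callback}(buffer, length)
        @delegate.#{callback}(buffer[0, length])
      end
    RB
  end

  attr_reader :callbacks

  def initialize(delegate)
    @delegate = delegate
    # Method#to_proc yields a lambda bound to this instance. In the real gem,
    # a proc like this is what gets converted to an FFI function pointer.
    @callbacks = CALLBACKS_WITH_DATA.to_h { |name| [name, method(name).to_proc] }
  end
end

# Hypothetical delegate that records what it receives.
class RecordingDelegate
  attr_reader :events

  def initialize
    @events = []
  end

  def on_header_field(data)
    @events << [:field, data]
  end

  def on_header_value(data)
    @events << [:value, data]
  end
end

delegate = RecordingDelegate.new
parser = SketchParser.new(delegate)
parser.callbacks[:on_header_field].call("server: nginx", 6)
parser.callbacks[:on_header_value].call("nginx\r\n", 5)
```

In pure Ruby the procs dispatch to the delegate as expected, which is consistent with the failure lying in the native callback layer rather than in the wrapper logic itself.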

Key evidence:

  • LLHttp::Delegate subclasses defined in a gitlab-rails runner script work correctly
  • The exact same class definition, when loaded during Rails boot (via initializers, Bundler.require, or load), produces broken callbacks
  • Using proc { @delegate.send(callback, ...) } instead of method(:x).to_proc works when called from post-boot code, but not when the proc is created during boot
  • ALL blob replicators are affected (PackageFile, Upload, etc.) — not specific to any one replicator
  • Stock GitLab 18.10.1 has the same issue as MR branch builds on this environment

Environment

Component        Version
OS               Ubuntu 24.04.4 LTS (Noble Numbat)
Kernel           6.17.0-1009-aws
AMI              ami-0ec10929233384c7f
Instance         c5.2xlarge (Intel Xeon Platinum 8275CL)
Omnibus libffi   3.2.1 (bundled as libffi.so.6)
System libffi    3.4.6 (libffi.so.8)
ffi gem          1.17.3
llhttp-ffi       0.4.0
http gem         5.1.1
Ruby             3.3.10
GitLab           18.10.1-ee (also reproduced on 18.10.0+rfbranch MR builds)
Provisioning     GitLab Environment Toolkit (GET)

Likely Cause

The most likely cause is an incompatibility between the omnibus-bundled libffi 3.2.1 and kernel 6.17's memory layout or security features. libffi 3.2.1 dates from 2014, and its closure/trampoline mechanism (which allocates executable memory for FFI callbacks) may not function correctly with modern kernel memory protections.

The system libffi 3.4.6 cannot be used as a drop-in replacement because the ABI changed between libffi 6 and libffi 8.
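
To check which libffi build a Ruby process has actually mapped, one option is to inspect /proc/self/maps from inside that process (for example from gitlab-rails runner, once the ffi gem has loaded). The helper name libffi_paths is hypothetical, and this is only a diagnostic sketch: an empty result could also mean libffi is statically linked into another shared object.

```ruby
# Return the distinct libffi shared-object paths found in a /proc/<pid>/maps dump.
def libffi_paths(maps_text)
  maps_text.each_line
           .map { |line| line.split(" ", 6)[5].to_s.strip } # 6th field is the path, if any
           .select { |path| File.basename(path).start_with?("libffi") }
           .uniq
end

# Linux only: print the libffi objects mapped into the current process.
if File.readable?("/proc/self/maps")
  puts libffi_paths(File.read("/proc/self/maps"))
end
```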

Note: Standard FFI callbacks (e.g., qsort with a Ruby comparator proc) work correctly in ALL contexts. The issue is specific to Method#to_proc conversions being passed as FFI callbacks, and only manifests after the full Rails boot process completes.

Reproduction

On a GET-provisioned Ubuntu 24.04 instance with kernel 6.17:

# This FAILS (from gitlab-rails runner or console):
require "http"
HTTP::Response::Parser.new << "HTTP/1.1 200 OK\r\nserver: nginx\r\n\r\n"
# => IOError: Error Parsing data: HPE_USER Span callback error in on_header_field

# This WORKS (same gitlab-rails runner session):
class FreshHandler < LLHttp::Delegate
  def on_header_field(f); end
  def on_header_value(v); end
  def on_headers_complete; end
  def on_body(b); end
  def on_message_complete; end
end
LLHttp::Parser.new(FreshHandler.new, type: :response) << "HTTP/1.1 200 OK\r\nserver: nginx\r\n\r\n"
# => OK

Impact

  • All Geo blob replication is broken on affected environments
  • Geo status shows 0 synced, all failed for every blob replicator type
  • Secondary sites fall back to proxying requests to the primary (functional, but it defeats the purpose of Geo replication)
  • Affects anyone using GET with Ubuntu 24.04 AMIs that ship kernel 6.17

Possible Fixes

As described above, the failure occurs when llhttp-ffi 0.4.0 runs on Ubuntu 24.04 with kernel 6.17. GitLab's Gemfile currently uses http 5.1.1, which locks llhttp-ffi to 0.4.x.

A newer version of the http gem is available that depends on llhttp-ffi "~> 0.5.0".

kubeclient and gitlab_quality-test_tooling constrain which version of http can be used. Both allow versions below 6.0, so they permit 5.3.1 (the latest stable 5.x).

The proper fix for this bug is therefore to upgrade http to 5.3.1.
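
The version reasoning above can be checked with RubyGems' own requirement matching. This is a minimal sketch: only the "< 6.0" upper bound and the "~> 0.5.0" requirement come from this issue, and the dependents' full requirement strings may include lower bounds that are not modelled here. The helper name satisfied? is hypothetical.

```ruby
require "rubygems"

# Upper bound this issue reports for kubeclient and gitlab_quality-test_tooling.
HTTP_UPPER_BOUND = Gem::Requirement.new("< 6.0")
# llhttp-ffi requirement this issue reports for the newer http release.
LLHTTP_REQUIREMENT = Gem::Requirement.new("~> 0.5.0")

# True when the given version string satisfies the requirement.
def satisfied?(requirement, version)
  requirement.satisfied_by?(Gem::Version.new(version))
end

puts satisfied?(HTTP_UPPER_BOUND, "5.3.1")   # => true  (target http version is allowed)
puts satisfied?(LLHTTP_REQUIREMENT, "0.4.0") # => false (currently locked llhttp-ffi is excluded)
```

After changing the Gemfile, running bundle update http and inspecting Gemfile.lock would confirm which llhttp-ffi version actually resolves.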

Context

Discovered while testing Geo replication for Packages::Debian::ProjectComponentFile (#593813, MR !228959). The ProjectComponentFile replicator code is correct — this environment issue blocks verification of ALL blob replicators.

Edited by Chloe Fons