Geo blob replication fails with HPE_USER llhttp callback error on Ubuntu 24.04 with kernel 6.17
Summary
All Geo blob replication fails on Ubuntu 24.04 instances provisioned with the GitLab Environment Toolkit (GET), with both stock GitLab 18.10.1 and MR branch builds. Every blob replicator (PackageFile, Upload, ProjectComponentFile, etc.) fails with the same error:
Error downloading file: error reading from socket:
Error Parsing data: HPE_USER Span callback error in on_header_field
The secondary's BlobDownloadService uses the http gem (which uses llhttp-ffi) to download files from the primary's internal Geo retrieve API. The HTTP response parser's FFI callbacks are corrupted, causing every download to fail.
Root Cause Analysis
The llhttp-ffi gem's LLHttp::Parser class defines callback wrapper methods via class_eval during gem loading (Rails boot):
CALLBACKS_WITH_DATA.each do |callback|
class_eval(<<~RB)
private def #{callback}(buffer, length)
@delegate.#{callback}(buffer.get_bytes(0, length))
end
RB
end
These methods are then converted to FFI function pointers via method(:on_header_field).to_proc in initialize. On this environment, the Method#to_proc conversion produces corrupted FFI function pointers for methods defined during the Rails boot process.
Key evidence:
- LLHttp::Delegate subclasses defined in a gitlab-rails runner script work correctly
- The exact same class definition, when loaded during Rails boot (via initializers, Bundler.require, or load), produces broken callbacks
- Using proc { @delegate.send(callback, ...) } instead of method(:x).to_proc works when called from post-boot code, but not when the proc is created during boot
- ALL blob replicators are affected (PackageFile, Upload, etc.); the failure is not specific to any one replicator
- Stock GitLab 18.10.1 has the same issue as MR branch builds on this environment
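For readers unfamiliar with the pattern, the two callback-creation strategies compared in the evidence above can be sketched in plain Ruby. This is only an illustration: it involves no FFI, so it does not reproduce the bug, and SketchParser and SketchDelegate are hypothetical names rather than classes from llhttp-ffi.

```ruby
# Plain-Ruby sketch of the callback pattern; no FFI is involved here.
# SketchDelegate and SketchParser are hypothetical, illustrative names.
class SketchDelegate
  attr_reader :fields

  def initialize
    @fields = []
  end

  def on_header_field(data)
    @fields << data
  end
end

class SketchParser
  CALLBACKS_WITH_DATA = %i[on_header_field].freeze

  # Same idea as the gem: generate one private wrapper method per callback.
  CALLBACKS_WITH_DATA.each do |callback|
    class_eval(<<~RB, __FILE__, __LINE__ + 1)
      private def #{callback}(data)
        @delegate.#{callback}(data)
      end
    RB
  end

  attr_reader :method_callback, :proc_callback

  def initialize(delegate)
    @delegate = delegate
    # Strategy 1: convert the generated method into a Proc (a lambda).
    @method_callback = method(:on_header_field).to_proc
    # Strategy 2: a plain proc that dispatches to the delegate directly.
    @proc_callback = proc { |data| @delegate.send(:on_header_field, data) }
  end
end

delegate = SketchDelegate.new
parser = SketchParser.new(delegate)
parser.method_callback.call("server")
parser.proc_callback.call("content-type")
```

In pure Ruby both strategies behave identically from the caller's point of view. The difference reported in this issue only appears once the resulting object is handed to FFI as a native callback.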
Environment
| Component | Version |
|---|---|
| OS | Ubuntu 24.04.4 LTS (Noble Numbat) |
| Kernel | 6.17.0-1009-aws |
| AMI | ami-0ec10929233384c7f |
| Instance | c5.2xlarge (Intel Xeon Platinum 8275CL) |
| Omnibus libffi | 3.2.1 (bundled as libffi.so.6) |
| System libffi | 3.4.6 (libffi.so.8) |
| ffi gem | 1.17.3 |
| llhttp-ffi | 0.4.0 |
| http gem | 5.1.1 |
| Ruby | 3.3.10 |
| GitLab | 18.10.1-ee (also reproduced on 18.10.0+rfbranch MR builds) |
| Provisioning | GitLab Environment Toolkit (GET) |
Likely Cause
Incompatibility between the omnibus-bundled libffi 3.2.1 and kernel 6.17's memory layout or security features. libffi 3.2.1 is from 2014 and its closure/trampoline mechanism (which allocates executable memory for FFI callbacks) may not function correctly with modern kernel memory protections.
The system libffi 3.4.6 cannot be used as a drop-in replacement because the ABI changed between libffi.so.6 and libffi.so.8.
Note: Standard FFI callbacks (e.g., qsort with a Ruby comparator proc) work correctly in ALL contexts. The issue is specific to Method#to_proc conversions being passed as FFI callbacks, and only manifests after the full Rails boot process completes.
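Since the hypothesis hinges on which libffi the Ruby process actually uses, one way to check is to inspect the process's memory mappings. The helper below is a minimal sketch, assuming a Linux /proc filesystem; mapped_libffi_paths is a hypothetical name, and the result is only meaningful when run inside the affected process (for example from gitlab-rails runner after the ffi gem has loaded).

```ruby
# Hypothetical helper: list the libffi shared objects mapped into a process.
# It takes the text of /proc/<pid>/maps so it can be exercised on any input.
def mapped_libffi_paths(maps_text)
  maps_text.each_line
           .map { |line| line.split(" ", 6)[5]&.strip } # 6th column is the pathname, if any
           .compact
           .select { |path| File.basename(path).start_with?("libffi") }
           .uniq
end

# Report the mappings of the current process when /proc is available.
if File.readable?("/proc/self/maps")
  puts mapped_libffi_paths(File.read("/proc/self/maps"))
end
```

If libffi is linked statically into the ffi gem's native extension, nothing named libffi will be listed, so an empty result does not by itself rule out the bundled copy.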
Reproduction
On a GET-provisioned Ubuntu 24.04 instance with kernel 6.17:
# This FAILS (from gitlab-rails runner or console):
require "http"
HTTP::Response::Parser.new << "HTTP/1.1 200 OK\r\nserver: nginx\r\n\r\n"
# => IOError: Error Parsing data: HPE_USER Span callback error in on_header_field
# This WORKS (same gitlab-rails runner session):
class FreshHandler < LLHttp::Delegate
def on_header_field(f); end
def on_header_value(v); end
def on_headers_complete; end
def on_body(b); end
def on_message_complete; end
end
LLHttp::Parser.new(FreshHandler.new, type: :response) << "HTTP/1.1 200 OK\r\nserver: nginx\r\n\r\n"
# => OK
Impact
- All Geo blob replication is broken on affected environments
- Geo status shows 0 synced, all failed for every blob replicator type
- Secondary sites fall back to proxying requests to primary (functional but defeats the purpose of Geo replication)
- Affects anyone using GET with Ubuntu 24.04 AMIs that ship kernel 6.17
Possible Fixes
As noted above, the failure occurs when llhttp-ffi 0.4.0 runs on the newest Ubuntu 24.04 kernel (6.17).
The current GitLab Gemfile uses http 5.1.1, which locks llhttp-ffi to 0.4.x.
A newer version of the http gem is available, which depends on llhttp-ffi "~> 0.5.0".
kubeclient and gitlab_quality-test_tooling constrain which version of http can be used. Both allow versions below 6.0, so they permit 5.3.1 (the latest stable 5.x).
The proper fix for this bug is therefore to upgrade http to 5.3.1.
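The version reasoning above can be double-checked with RubyGems' own requirement matching. This is a sketch under assumptions: the constraint strings below are paraphrased from the description in this issue ("< 6.0" for the dependents, "~> 0.4.0" as the assumed form of the 0.4.x lock), not copied from the actual gemspecs, which should be consulted before relying on the result.

```ruby
require "rubygems"

# Assumed constraint strings, paraphrased from this issue, not from gemspecs.
http_upper_bound = Gem::Requirement.new("< 6.0")    # kubeclient, gitlab_quality-test_tooling
llhttp_old_lock  = Gem::Requirement.new("~> 0.4.0") # assumed form of the 0.4.x lock
llhttp_new_lock  = Gem::Requirement.new("~> 0.5.0") # stated for the newer http gem

http_target = Gem::Version.new("5.3.1")

# Does the upper bound on http admit the proposed upgrade target?
upgrade_allowed = http_upper_bound.satisfied_by?(http_target)

# Can llhttp-ffi 0.5.0 be used while the old lock is in place?
old_lock_blocks_new = !llhttp_old_lock.satisfied_by?(Gem::Version.new("0.5.0"))

# Does the new lock move away from the affected 0.4.0 release?
new_lock_excludes_old = !llhttp_new_lock.satisfied_by?(Gem::Version.new("0.4.0"))

puts "http 5.3.1 allowed by dependents: #{upgrade_allowed}"
puts "llhttp-ffi 0.5.0 blocked by old lock: #{old_lock_blocks_new}"
puts "llhttp-ffi 0.4.0 excluded by new lock: #{new_lock_excludes_old}"
```

The check only covers version constraints. Whether llhttp-ffi 0.5.x actually avoids the callback corruption still needs to be verified on an affected instance.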
Context
Discovered while testing Geo replication for Packages::Debian::ProjectComponentFile (#593813, MR !228959). The ProjectComponentFile replicator code is correct — this environment issue blocks verification of ALL blob replicators.