Geo blob replication fails with HPE_USER llhttp callback error on Ubuntu 24.04 with kernel 6.17
## Summary

All Geo blob replication fails on GET-provisioned Ubuntu 24.04 instances with both stock GitLab 18.10.1 and MR branch builds. Every blob replicator (PackageFile, Upload, ProjectComponentFile, etc.) fails with the same error:

```
Error downloading file: error reading from socket: Error Parsing data: HPE_USER Span callback error in on_header_field
```

The secondary's `BlobDownloadService` uses the `http` gem (which uses `llhttp-ffi`) to download files from the primary's internal Geo retrieve API. The HTTP response parser's FFI callbacks are corrupted, causing every download to fail.

## Root Cause Analysis

The `llhttp-ffi` gem's `LLHttp::Parser` class defines callback wrapper methods via `class_eval` during gem loading (Rails boot):

```ruby
CALLBACKS_WITH_DATA.each do |callback|
  class_eval(<<~RB)
    private def #{callback}(buffer, length)
      @delegate.#{callback}(buffer.get_bytes(0, length))
    end
  RB
end
```

These methods are then converted to FFI function pointers via `method(:on_header_field).to_proc` in `initialize`. On this environment, the `Method#to_proc` conversion produces **corrupted FFI function pointers** for methods defined during the Rails boot process.

**Key evidence:**

- `LLHttp::Delegate` subclasses defined in a `gitlab-rails runner` script work correctly
- The exact same class definition, when loaded during Rails boot (via initializers, `Bundler.require`, or `load`), produces broken callbacks
- Using `proc { @delegate.send(callback, ...) }` instead of `method(:x).to_proc` works when called from post-boot code, but not when the proc is created during boot
- ALL blob replicators are affected (PackageFile, Upload, etc.), not specific to any one replicator
- Stock GitLab 18.10.1 has the same issue as MR branch builds on this environment

## Environment

| Component | Version |
|-----------|---------|
| OS | Ubuntu 24.04.4 LTS (Noble Numbat) |
| Kernel | **6.17.0-1009-aws** |
| AMI | ami-0ec10929233384c7f |
| Instance | c5.2xlarge (Intel Xeon Platinum 8275CL) |
| Omnibus libffi | **3.2.1** (bundled as `libffi.so.6`) |
| System libffi | 3.4.6 (`libffi.so.8`) |
| ffi gem | 1.17.3 |
| llhttp-ffi | 0.4.0 |
| http gem | 5.1.1 |
| Ruby | 3.3.10 |
| GitLab | 18.10.1-ee (also reproduced on 18.10.0+rfbranch MR builds) |
| Provisioning | GitLab Environment Toolkit (GET) |

## Likely Cause

Incompatibility between the omnibus-bundled **libffi 3.2.1** and **kernel 6.17**'s memory layout or security features. libffi 3.2.1 is from 2014 and its closure/trampoline mechanism (which allocates executable memory for FFI callbacks) may not function correctly with modern kernel memory protections. The system libffi 3.4.6 cannot be used as a drop-in replacement because the ABI changed between libffi 6 and libffi 8.

Note: Standard FFI callbacks (e.g., `qsort` with a Ruby comparator proc) work correctly in ALL contexts. The issue is specific to `Method#to_proc` conversions being passed as FFI callbacks, and only manifests after the full Rails boot process completes.

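To make the two callback styles compared in the analysis concrete, here is a minimal pure-Ruby sketch of `method(:x).to_proc` next to a plain `proc` that dispatches with `send`. It involves no FFI, so it does not reproduce the failure; `SketchParser` and `RecordingDelegate` are hypothetical names used only for illustration and are not part of `llhttp-ffi`.

```ruby
# Hypothetical stand-in for a parser that forwards data callbacks to a delegate.
# No FFI is involved, so this cannot trigger the HPE_USER error.
class SketchParser
  attr_reader :via_method, :via_proc

  def initialize(delegate)
    @delegate = delegate
    # Style 1: convert a (private) instance method into a Proc.
    @via_method = method(:on_header_field).to_proc
    # Style 2: a plain proc that dispatches to the delegate with send.
    @via_proc = proc { |data| @delegate.send(:on_header_field, data) }
  end

  private

  def on_header_field(data)
    @delegate.on_header_field(data)
  end
end

# Hypothetical delegate that records every header field it receives.
class RecordingDelegate
  attr_reader :fields

  def initialize
    @fields = []
  end

  def on_header_field(data)
    @fields << data
  end
end

delegate = RecordingDelegate.new
parser = SketchParser.new(delegate)

parser.via_method.call("server")
parser.via_proc.call("content-type")

puts delegate.fields.inspect      # both styles reach the delegate
puts parser.via_method.lambda?    # Method#to_proc yields a lambda
puts parser.via_proc.lambda?      # a plain proc is not a lambda
```

In plain Ruby both styles reach the delegate; the only visible difference is that `Method#to_proc` returns a lambda. Any divergence seen on the affected hosts therefore has to come from the layer that turns these objects into native callbacks, which this sketch does not exercise.
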
## Reproduction

On a GET-provisioned Ubuntu 24.04 instance with kernel 6.17:

```ruby
# This FAILS (from gitlab-rails runner or console):
require "http"
HTTP::Response::Parser.new << "HTTP/1.1 200 OK\r\nserver: nginx\r\n\r\n"
# => IOError: Error Parsing data: HPE_USER Span callback error in on_header_field

# This WORKS (same gitlab-rails runner session):
class FreshHandler < LLHttp::Delegate
  def on_header_field(f); end
  def on_header_value(v); end
  def on_headers_complete; end
  def on_body(b); end
  def on_message_complete; end
end

LLHttp::Parser.new(FreshHandler.new, type: :response) << "HTTP/1.1 200 OK\r\nserver: nginx\r\n\r\n"
# => OK
```

## Impact

- **All Geo blob replication is broken** on affected environments
- Geo status shows 0 synced, all failed for every blob replicator type
- Secondary sites fall back to proxying requests to primary (functional but defeats the purpose of Geo replication)
- Affects anyone using GET with Ubuntu 24.04 AMIs that ship kernel 6.17

## Possible Fixes

As mentioned above, this issue occurs when `llhttp-ffi` version `0.4.0` is used on the newest Ubuntu 24.04 kernel. The current GitLab Gemfile uses `http` `5.1.1`, which [locks](https://gitlab.com/gitlab-org/gitlab/-/blob/master/Gemfile.lock?ref_type=heads#L1051-1055) `llhttp-ffi` to `0.4.x`. A [newer version](https://github.com/httprb/http/blob/5-x-stable/http.gemspec) of `http` is available, which requires `llhttp-ffi` `"~> 0.5.0"`.

`kubeclient` and `gitlab_quality-test_tooling` constrain which version of `http` can be used. Both allow versions up to `< 6.0`, so they permit 5.3.1 (the latest stable 5.x).

**The proper fix for this bug is therefore to upgrade `http` to v5.3.1.**

## Context

Discovered while testing Geo replication for `Packages::Debian::ProjectComponentFile` (https://gitlab.com/gitlab-org/gitlab/-/work_items/593813, MR !228959). The ProjectComponentFile replicator code is correct; this environment issue blocks verification of ALL blob replicators.
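
The version reasoning under Possible Fixes can be sanity-checked with RubyGems' own requirement matching. The sketch below models only the constraints quoted in this issue (the `< 6.0` upper bound and the `~> 0.5.0` requirement); the real gemspecs may carry additional bounds that are not represented here, and the variable names are illustrative.

```ruby
require "rubygems"

# Constraints as quoted in this issue; lower bounds in the real gemspecs
# (if any) are an unknown and are not modeled here.
http_upper_bound = Gem::Requirement.new("< 6.0")     # cited for kubeclient and gitlab_quality-test_tooling
llhttp_required  = Gem::Requirement.new("~> 0.5.0")  # cited for the newer http release

target_http    = Gem::Version.new("5.3.1")  # proposed http version
current_llhttp = Gem::Version.new("0.4.0")  # llhttp-ffi version in the affected environment

# The proposed http version must fit under the dependents' upper bound.
http_allowed = http_upper_bound.satisfied_by?(target_http)

# The newer requirement should exclude the currently installed llhttp-ffi,
# which is what forces the llhttp-ffi upgrade.
llhttp_replaced = !llhttp_required.satisfied_by?(current_llhttp)

puts "http 5.3.1 satisfies '< 6.0': #{http_allowed}"
puts "llhttp-ffi 0.4.0 excluded by '~> 0.5.0': #{llhttp_replaced}"
```

Both checks are expected to print `true`, which matches the conclusion that upgrading `http` within the 5.x line is compatible with the quoted bounds and moves `llhttp-ffi` off `0.4.x`.
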