Geo BlobDownloader fails with status 34 on 18.10 — rugged 1.9.0 / llhttp-ffi symbol collision
## Summary

After upgrading a Geo deployment from 18.9.x to 18.10.3, **every** Geo blob replication attempt on secondary sites fails with:

> Replication failure: Non-success HTTP response status code 34

affecting **every** replicable type (`job_artifact`, `dependency_proxy_manifest`, `dependency_proxy_blob`, `upload`, `lfs_object`, `package_file`, ...). The number 34 is not a valid HTTP status; it matches the llhttp error code `HPE_INVALID_STATUS`.

After a multi-layer bisection we isolated the root cause: **`rugged.so` globally exports `libgit2`'s statically-embedded `llhttp_*` symbols (119 of them), which collide with `llhttp-ffi`'s `libllhttp-ext.so` at runtime.** Details and evidence below. Multiple independent Geo deployments are affected identically.

## Minimal reproducer (no GitLab, no Rails, no Sidekiq, no TLS)

```ruby
# reproducer.rb
require 'rugged'   # <-- loading rugged first is the only trigger
require 'http'
puts HTTP.get('http://127.0.0.1:8080/').status.code
# => 34   (against any well-formed HTTP response bytes; expected 200)
```

With a trivial Python raw-socket server (`server.py`) replying with a spec-compliant `HTTP/1.1 200 OK\r\n...\r\n\r\n<body>` response, the above script prints `34` (bug). Without the `require 'rugged'`, the same script against the same server prints `200` correctly.
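The `server.py` from the snippet is not reproduced here. For readers who want to stay in one language, the following is a minimal Ruby sketch of the same idea: a raw-socket server that answers every request with fixed, well-formed response bytes. The helper names `build_response` and `serve_once` are hypothetical, and the headers chosen are an assumption; the snippet's server may send different ones.

```ruby
require 'socket'

# Build a minimal, spec-compliant HTTP/1.1 response as raw bytes.
def build_response(body = "ok\n")
  "HTTP/1.1 200 OK\r\n" \
  "Content-Type: text/plain\r\n" \
  "Content-Length: #{body.bytesize}\r\n" \
  "Connection: close\r\n" \
  "\r\n" \
  "#{body}"
end

# Accept one connection, consume the request head, write the canned response.
def serve_once(server, body = "ok\n")
  client = server.accept
  # Read request lines until the blank line that ends the header block.
  while (line = client.gets)
    break if line == "\r\n"
  end
  client.write(build_response(body))
ensure
  client&.close
end

# Usage (blocks forever; port 8080 matches the reproducer URL):
#   server = TCPServer.new('127.0.0.1', 8080)
#   loop { serve_once(server) }
```

Because the server never touches an HTTP library, any parse error seen by the client cannot be attributed to server-side framing.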
**Full reproducer (client.rb, server.py, Gemfile, README.md) available as a GitLab snippet:** [gitlab.com/-/snippets/5982447](https://gitlab.com/-/snippets/5982447)

```
git clone https://gitlab.com/snippets/5982447.git
cd 5982447
bundle install
python3 server.py &
bundle exec ruby client.rb
```

Gem versions (as shipped by GitLab 18.10.3 Omnibus):

- `rugged` 1.9.0 (vendors libgit2 statically)
- `http` 5.3.1
- `llhttp-ffi` 0.5.1 (bundles llhttp C source 8.1.0)
- Ruby 3.3.10

## Evidence: symbol table collision

`nm -D --defined-only` on the two shared libraries on a failing Omnibus secondary:

```
$ nm -D --defined-only /opt/gitlab/.../rugged-1.9.0/rugged/rugged.so | grep -c llhttp
119    # all globally exported 'T'
$ nm -D --defined-only /opt/gitlab/.../llhttp-ffi-0.5.1/.../libllhttp-ext.so | grep -c llhttp
~40    # same symbol names, different offsets
```

`ldd rugged.so` shows libssl/libcrypto dynamically linked from `/opt/gitlab/embedded/lib`, but libgit2 is nowhere, confirming libgit2 is statically embedded. The additional 1,799 exported `git_*` symbols confirm the static-embed.

## In-stack reproduction (on a Geo 18.10.3 secondary)

```ruby
sudo gitlab-rails runner '
  rep = Geo::DependencyProxyManifestReplicator.new(model_record_id: <ID>)
  dl  = Gitlab::Geo::Replication::BlobDownloader.new(replicator: rep)
  puts dl.execute.inspect
'
```

Produces:

```
#<Gitlab::Geo::Replication::BlobDownloader::Result
  @success=false,
  @bytes_downloaded=0,
  @primary_missing_file=false,
  @reason="Non-success HTTP response status code 34",
  @extra_details={:status_code=>34, :reason=>nil, :url=>"..."}>
```

The `34` originates in `ee/lib/gitlab/geo/replication/blob_downloader.rb:281`, which stores `response.status.code` verbatim from http.rb.

## Raw wire-level status line (captured via raw `OpenSSL::SSLSocket` on the failing host)

```
Hex:   48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b 0d 0a
ASCII: H  T  T  P  /  1  .  1     2  0  0     O  K  \r \n
```

Textbook `HTTP/1.1 200 OK\r\n`. The server is innocent.
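The captured bytes can also be checked mechanically rather than by eye. The sketch below decodes the hex dump above and matches it against a deliberately loose status-line pattern. The names `decode_hex`, `parse_status_line` and `STATUS_LINE` are hypothetical helpers for illustration, and the regex is a simplification, not a full HTTP grammar.

```ruby
# Bytes captured on the failing host, copied from the hex dump above.
HEX = '48 54 54 50 2f 31 2e 31 20 32 30 30 20 4f 4b 0d 0a'

# Turn the space-separated hex pairs back into a binary string.
def decode_hex(hex)
  [hex.delete(' ')].pack('H*')
end

# Loose status-line check: version, three-digit code, reason phrase, CRLF.
STATUS_LINE = /\AHTTP\/(\d\.\d) (\d{3}) ([^\r\n]*)\r\n\z/

# Return the parsed fields, or nil if the bytes are not a status line.
def parse_status_line(bytes)
  m = STATUS_LINE.match(bytes)
  return nil unless m

  { version: m[1], code: m[2].to_i, reason: m[3] }
end

line = decode_hex(HEX)
p line                      # => "HTTP/1.1 200 OK\r\n"
p parse_status_line(line)   # parsed fields; the :code entry is 200
```

A two-digit code such as `34` cannot match this pattern at all, which is consistent with the value coming from the client-side parser rather than from the wire.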
## Verified fix

**Rebuilding rugged with `-Wl,--exclude-libs,ALL` resolves the bug completely.** Verified locally against the reproducer linked above, using the exact gem pins from Omnibus 18.10.3 (`rugged 1.9.0`, `http 5.3.1`, `llhttp-ffi 0.5.1`) on Ruby 3.4.8 (llhttp-ffi is a C binary whose symbol-interposition behaviour does not depend on the Ruby version):

```bash
bundle config build.rugged "--with-ldflags=-Wl,--exclude-libs,ALL"
gem uninstall rugged
bundle install                 # rebuilds rugged with the linker flag
bundle exec ruby client.rb     # now prints status.code: 200 instead of 34
```

Effect on `rugged.so`:

| metric                                 | default build | with `--exclude-libs,ALL` |
| -------------------------------------- | ------------- | ------------------------- |
| size                                   | 4,814,880 B   | 4,686,848 B               |
| globally exported `llhttp_*` (`T`)     | 119           | **0**                     |
| globally exported `git_*` (`T`)        | 1,799         | **0**                     |
| reproducer `HTTP.get(...).status.code` | **34**        | **200**                   |

This simultaneously confirms the root-cause theory and the fix direction: once libgit2's statically-embedded symbols are no longer exposed to the process-global symbol namespace, the dynamic-linker collision with `libllhttp-ext.so` cannot occur, and http.rb parses correctly on every response.

## Possible fix directions

A few places could address this, depending on where maintainers prefer to draw the boundary:

- **rugged `extconf.rb`**: pass `-Wl,--exclude-libs,ALL` to the linker when building `rugged.so` so libgit2's statically-embedded third-party symbols stay out of the process-global namespace. We verified this resolves the collision locally (table above).
- **libgit2 upstream**: build bundled third-party dependencies (llhttp, historically also http-parser, zlib, ...) with `-fvisibility=hidden` by default. Would protect every downstream consumer, not only rugged.
- **llhttp-ffi**: open `libllhttp-ext.so` with `RTLD_LOCAL` / `RTLD_DEEPBIND` and resolve symbols only within that handle.
  Would make llhttp-ffi robust against any gem statically embedding llhttp.

For operators stuck on 18.10.3 before a fix ships, interim options are (a) swapping `BlobDownloader#download_file` to `Net::HTTP` via a Rails initializer (Net::HTTP uses the Ruby stdlib parser and is unaffected), or (b) rebuilding the rugged gem with the linker flag above (requires package/Omnibus-level build customization).

## Workaround currently running on our deployments

We are running option (a) above: an internal Rails initializer that prepends `Gitlab::Geo::Replication::BlobDownloader#download_file` with a `Net::HTTP`-based implementation preserving the upstream one-hop manual-redirect-follow semantics. Replication resumed immediately on both affected sites. We consider this operations-grade only: it sidesteps rather than fixes the collision and is wiped on every package upgrade.

## Why 18.10 is when this surfaced (with evidence)

`Gemfile.lock` diff between the two relevant GitLab tags:

| GitLab tag    | rugged | bundled libgit2                                                          |
| ------------- | ------ | ------------------------------------------------------------------------ |
| `v18.9.0-ee`  | 1.6.3  | libgit2 1.6.x, **no bundled llhttp**                                     |
| `v18.9.5-ee`  | 1.6.3  | libgit2 1.6.x, **no bundled llhttp**                                     |
| `v18.10.3-ee` | 1.9.0  | libgit2 ~1.9.0 (submodule pin `338e6fb6`), **bundles llhttp statically** |

Sources:

- [`Gemfile.lock` @ v18.9.0-ee](https://gitlab.com/gitlab-org/gitlab/-/raw/v18.9.0-ee/Gemfile.lock)
- [`Gemfile.lock` @ v18.10.3-ee](https://gitlab.com/gitlab-org/gitlab/-/raw/v18.10.3-ee/Gemfile.lock)

libgit2 began bundling llhttp as a vendored builtin in [libgit2 PR #6713](https://github.com/libgit2/libgit2/pull/6713), merged 2024-04-23, landing in libgit2 1.8.0/1.8.1:

> *"Include llhttp as a bundled dependency with the aim to use it as our default http parser, removing the now-unmaintained Node.js http-parser."*

Every libgit2 ≥ 1.8.1 ships llhttp statically embedded.
rugged 1.9.0 vendors libgit2 ~1.9.0 and therefore exports llhttp's symbols from `rugged.so`. rugged 1.6.3 (libgit2 1.6.x) did not. That is the exact version boundary that matches "started failing with 18.10.3".
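
Whether a given `rugged.so` falls on the affected side of that boundary can be checked by intersecting its exported `llhttp_*` symbols with those of `libllhttp-ext.so`. The following is a minimal sketch that wraps the same `nm -D --defined-only` invocation used in the evidence section. The helper names are hypothetical, and it assumes GNU `nm` output of the form `<address> <type> <name>`.

```ruby
require 'open3'
require 'set'

# Extract llhttp_* names from `nm -D --defined-only` output.
# Only type 'T' (globally exported text symbols) is kept.
def exported_llhttp_symbols(nm_output)
  nm_output.each_line.filter_map do |line|
    _addr, type, name = line.split
    name if type == 'T' && name&.start_with?('llhttp_')
  end.to_set
end

# Run nm on a shared library and return its exported llhttp_* symbols.
def llhttp_symbols_in(path)
  out, status = Open3.capture2('nm', '-D', '--defined-only', path)
  raise "nm failed for #{path}" unless status.success?

  exported_llhttp_symbols(out)
end

# Symbols defined by both libraries: candidates for interposition.
def colliding_symbols(path_a, path_b)
  llhttp_symbols_in(path_a) & llhttp_symbols_in(path_b)
end
```

Calling `colliding_symbols` with the paths of `rugged.so` and `libllhttp-ext.so` on the host should return a non-empty set on an affected build and an empty set once rugged is rebuilt with `--exclude-libs,ALL`, in line with the export counts reported in the fix table.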