Duration and QueuedDuration float64 cannot unmarshal from string
## Bug Report

We're getting errors unmarshalling types like [PipelineEventBuild](https://gitlab.com/gitlab-org/api/client-go/-/blob/v1.46.0/event_webhook_types.go#L978) when [compression](https://gitlab.com/gitlab-org/gitlab/-/blob/master/lib/gitlab/sidekiq_middleware/size_limiter/validator.rb#L56) is enabled. It seems the compression changes the field to a JSON string instead of a JSON number during encoding.

# Analysis

# Pipeline Hook webhook: string-encoded `builds[].duration` / `builds[].queued_duration` fail to parse in `gitlab.com/gitlab-org/api/client-go`

## Summary

Pipeline Hook payloads whose `builds[].duration` and `builds[].queued_duration` are JSON strings (e.g. `"17.1"`) are rejected by every shipping version of `gitlab.com/gitlab-org/api/client-go`, from `v0.124.0` through the current `v2.20.1`, because the target fields on `PipelineEventBuild` are typed `float64`. The issue correlates with an upstream GitLab server configuration we cannot yet pin down, but it seems to relate to compression.

## Observed symptom

Operating a webhook intake that consumes `Pipeline Hook` events via `gitlab.ParseWebhook(...)`, we saw parse warnings. Every failing payload had `X-Gitlab-Event: Pipeline Hook`. The two canonical errors, with their occurrence counts:

| Count | Error |
|------:|-------|
| 1,253 | `json: cannot unmarshal string into Go struct field .builds.duration of type float64` |
| 1,038 | `json: cannot unmarshal string into Go struct field .builds.queued_duration of type float64` |

We detected this via a log search in our observability platform DataDog. Payloads tended to be large (often `> 100 KB`), suggesting the failures cluster on pipelines with many builds. We cannot confirm that size is the trigger, as opposed to something unrelated (configuration, feature flag, plugin) that merely correlates with pipeline size.
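The two errors can be reproduced locally without a GitLab instance. The following is a minimal sketch, not client-go code: `build` and `pipelineEvent` are hypothetical cut-down stand-ins for the real types, and `isStringIntoNumber` is a helper invented for this illustration. It only shows that `encoding/json` rejects a quoted number when the target field is `float64`.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// build is a hypothetical, cut-down stand-in for client-go's
// PipelineEventBuild; only the two affected fields are modelled.
type build struct {
	Duration       float64 `json:"duration"`
	QueuedDuration float64 `json:"queued_duration"`
}

// pipelineEvent is a hypothetical stand-in for the Pipeline Hook event type.
type pipelineEvent struct {
	Builds []build `json:"builds"`
}

// isStringIntoNumber reports whether decoding payload fails specifically
// because a JSON string was supplied where a numeric Go field was expected.
func isStringIntoNumber(payload string) bool {
	var ev pipelineEvent
	err := json.Unmarshal([]byte(payload), &ev)

	var typeErr *json.UnmarshalTypeError
	return errors.As(err, &typeErr) && typeErr.Value == "string"
}

func main() {
	numeric := `{"builds":[{"duration":17.1,"queued_duration":0.4}]}`
	quoted := `{"builds":[{"duration":"17.1","queued_duration":"0.4"}]}`

	fmt.Println("numeric payload rejected:", isStringIntoNumber(numeric))
	fmt.Println("quoted payload rejected: ", isStringIntoNumber(quoted))
}
```

Checking for `*json.UnmarshalTypeError` rather than matching the message text keeps the check stable, since the struct and field names embedded in the message differ between versions.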
**The warnings stopped when we changed an operational setting on the upstream GitLab instance.** We expect reverting that setting would re-trigger the failure. This is the core unknown for GitLab to investigate (see [Unanswered question](#unanswered-question-for-gitlab)).

## Affected client-go versions

We bisected across major releases by feeding each version a Pipeline Hook fixture with string-encoded `builds[0].duration` and `builds[0].queued_duration`:

| Version | Parse result | Error (truncated) |
|----------|--------------|-------------------|
| v0.124.0 | FAIL | `json: cannot unmarshal string into Go struct field .builds.duration of type float64` |
| v1.0.0 | FAIL | `json: cannot unmarshal string into Go struct field PipelineEventBuild.builds.duration of type float64` |
| v1.46.0 | FAIL | `json: cannot unmarshal string into Go struct field PipelineEventBuild.builds.duration of type float64` |
| v2.0.0 | FAIL | `json: cannot unmarshal string into Go struct field PipelineEventBuild.builds.duration of type float64` |
| v2.20.1 | FAIL | `json: cannot unmarshal string into Go struct field PipelineEventBuild.builds.duration of type float64` |

The v1.0.0-and-up error text differs from v0.124.0 because v1.0.0 extracted the `builds[]` element into a named `PipelineEventBuild` type; the *field typing* did not change.

`PipelineEventBuild.Duration` and `PipelineEventBuild.QueuedDuration` have been `float64` ever since they were introduced in commit **`55c02d97`** (2022-11-29, *"fix: add duration and queued_duration for builds in pipeline webhooks"*, by Johann Gyger). No subsequent commit relaxes the type.

## Root cause

There is a mismatch between what the GitLab server is emitting in the Pipeline Hook JSON payload and what client-go expects:

- Server: emits a JSON string for `builds[].duration` and `builds[].queued_duration` under some (unconfirmed) condition.
- Client: `gitlab.com/gitlab-org/api/client-go` v2.20.1 `event_webhook_types.go:986-987`:

  ```go
  type PipelineEventBuild struct {
      // ...
      Duration       float64 `json:"duration"`        // line 986
      QueuedDuration float64 `json:"queued_duration"` // line 987
      // ...
  }
  ```

Go's default `encoding/json` decoder refuses strings into numeric fields, so the entire event is dropped.

There is also a **sibling inconsistency worth fixing alongside this bug**. In the same file, `PipelineEventObjectAttributes` (the pipeline-level object) carries a `QueuedDuration` typed **`int64`**:

```go
type PipelineEventObjectAttributes struct {
    // ...
    Duration       int64 `json:"duration"`        // line 926
    QueuedDuration int64 `json:"queued_duration"` // line 927
    // ...
}
```

So the same JSON field name (`queued_duration`) is `int64` at the pipeline level and `float64` at the build level within the *same webhook payload*. That divergence is itself a latent bug, since one or the other is misrepresenting the wire type, and the answer is almost certainly `float64` on both sides, matching what the Ruby `CommitStatus#duration` / `#queued_duration` return.

## Suggested fix (client-go)

Add a tolerant unmarshaler that accepts both JSON numbers and JSON strings for the two confirmed-at-risk `PipelineEventBuild` fields. Concretely, an internal type with a custom `UnmarshalJSON`:

```go
// stringOrFloat64 is a float64 that also accepts a JSON string encoding.
// GitLab webhook payloads occasionally serialize duration fields as strings
// (e.g. "17.1" instead of 17.1); this type normalizes both.
// Requires the "strconv" import.
type stringOrFloat64 float64

func (s *stringOrFloat64) UnmarshalJSON(data []byte) error {
    // Treat JSON null as a no-op, as encoding/json does for a plain float64.
    if string(data) == "null" {
        return nil
    }
    // Strip surrounding quotes; an empty string decodes to 0.
    if len(data) >= 2 && data[0] == '"' && data[len(data)-1] == '"' {
        data = data[1 : len(data)-1]
        if len(data) == 0 {
            *s = 0
            return nil
        }
    }
    f, err := strconv.ParseFloat(string(data), 64)
    if err != nil {
        return err
    }
    *s = stringOrFloat64(f)
    return nil
}
```

The `null` check preserves current behavior: a plain `float64` field silently ignores `null`, whereas a custom unmarshaler is still invoked with `null` and would otherwise return a parse error.

Retype `PipelineEventBuild.Duration` and `.QueuedDuration` to `stringOrFloat64`.
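To check a tolerant type against the encodings a build entry might carry (number, quoted number, `null`), here is a self-contained sketch. It repeats the type so the block compiles on its own; `build` and `decodeBuild` are hypothetical names used only for this illustration, not client-go API. The `null` guard is included because `encoding/json` still calls `UnmarshalJSON` for a `null` value on a non-pointer field.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
)

// stringOrFloat64 accepts a JSON number, a quoted JSON number, or null.
type stringOrFloat64 float64

func (s *stringOrFloat64) UnmarshalJSON(data []byte) error {
	// JSON null is a no-op, matching what a plain float64 field does.
	if string(data) == "null" {
		return nil
	}
	// Strip surrounding quotes; an empty string decodes to 0.
	if len(data) >= 2 && data[0] == '"' && data[len(data)-1] == '"' {
		data = data[1 : len(data)-1]
		if len(data) == 0 {
			*s = 0
			return nil
		}
	}
	f, err := strconv.ParseFloat(string(data), 64)
	if err != nil {
		return err
	}
	*s = stringOrFloat64(f)
	return nil
}

// build is a hypothetical stand-in for the retyped PipelineEventBuild.
type build struct {
	Duration       stringOrFloat64 `json:"duration"`
	QueuedDuration stringOrFloat64 `json:"queued_duration"`
}

// decodeBuild decodes one build object and returns both fields as float64.
func decodeBuild(payload string) (float64, float64, error) {
	var b build
	if err := json.Unmarshal([]byte(payload), &b); err != nil {
		return 0, 0, err
	}
	return float64(b.Duration), float64(b.QueuedDuration), nil
}

func main() {
	payloads := []string{
		`{"duration":17.1,"queued_duration":0.4}`,     // native numbers
		`{"duration":"17.1","queued_duration":"0.4"}`, // string-encoded
		`{"duration":null,"queued_duration":null}`,    // null values
	}
	for _, p := range payloads {
		d, q, err := decodeBuild(p)
		fmt.Println(d, q, err)
	}
}
```

A non-numeric string such as `"abc"` still returns an error, so malformed payloads are not silently turned into zero.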
**API implication.** `stringOrFloat64` is a defined type with underlying type `float64`, so Go's assignability rules apply:

- Callers converting explicitly (`float64(build.Duration)`) continue to work.
- Callers comparing against an untyped constant (`build.Duration > 10.0`) continue to work, because the constant takes on the field's type.
- Callers assigning the field to a `float64` variable (`var d float64 = build.Duration`), passing it to a function that takes `float64`, or combining it with `float64` variables in arithmetic will no longer compile without an explicit conversion: `float64(build.Duration)`. This is the main source-level break.
- Because the type is unexported, callers outside the package cannot name it, so they cannot convert a `float64` variable to it when constructing a `PipelineEventBuild` (for example in tests). Exporting the type would avoid that.

I can commit fixes if so desired and you all give the go-ahead.

## Appendix: susceptible-fields catalog

Fields in `gitlab.com/gitlab-org/api/client-go` v2.20.1 that are declared `float64` (or `int64` where the JSON wire could plausibly be stringified under the same mechanism). Line numbers are against `event_webhook_types.go`, `jobs.go`, and `pipelines.go` in the v2.20.1 tag.

| Struct | Field | Type | File:Line | JSON tag | Status |
|---|---|---|---|---|---|
| `PipelineEventBuild` | `Duration` | `float64` | `event_webhook_types.go:986` | `duration` | **Confirmed failing in production** |
| `PipelineEventBuild` | `QueuedDuration` | `float64` | `event_webhook_types.go:987` | `queued_duration` | **Confirmed failing in production** |
| `PipelineEventObjectAttributes` | `Duration` | `int64` | `event_webhook_types.go:926` | `duration` | Same-shape risk; also a **type inconsistency** with `PipelineEventBuild.Duration` (float64 vs int64) |
| `PipelineEventObjectAttributes` | `QueuedDuration` | `int64` | `event_webhook_types.go:927` | `queued_duration` | Same-shape risk; **type inconsistency** with `PipelineEventBuild.QueuedDuration` (float64 vs int64), likely a latent bug |
| `BuildEvent` | `BuildDuration` | `float64` | `event_webhook_types.go:58` | `build_duration` | Same-shape risk |
| `JobEvent` | `BuildDuration` | `float64` | `event_webhook_types.go:488` | `build_duration` | Same-shape risk |
| `JobEvent` | `BuildQueuedDuration` | `float64` | `event_webhook_types.go:489` | `build_queued_duration` | Same-shape risk |
| `Job` (REST) | `Coverage` | `float64` | `jobs.go:142` | `coverage` | Same-shape risk |
| `Job` (REST) | `Duration` | `float64` | `jobs.go:148` | `duration` | Same-shape risk |
| `Job` (REST) | `QueuedDuration` | `float64` | `jobs.go:149` | `queued_duration` | Same-shape risk |
| `Bridge` (REST) | `Coverage` | `float64` | `jobs.go:213` | `coverage` | Same-shape risk |
| `Bridge` (REST) | `Duration` | `float64` | `jobs.go:219` | `duration` | Same-shape risk |
| `Bridge` (REST) | `QueuedDuration` | `float64` | `jobs.go:220` | `queued_duration` | Same-shape risk |
| `Pipeline` (REST) | `Duration` | `int64` | `pipelines.go:112` | `duration` | Same-shape risk |
| `Pipeline` (REST) | `QueuedDuration` | `int64` | `pipelines.go:113` | `queued_duration` | Same-shape risk |

Only the two `PipelineEventBuild` rows are **confirmed failing in production logs**. Everything else is "same-shape risk": if GitLab can emit `duration` / `queued_duration` as strings in one place, there's no structural reason it can't do so in another, and the receiving Go types are identical.

The `PipelineEventObjectAttributes.Duration` / `.QueuedDuration` vs. `PipelineEventBuild.Duration` / `.QueuedDuration` type divergence (`int64` vs `float64`) deserves its own small MR regardless of how the string-encoding issue is resolved.

## Additional Details

# additional digging done.

## Root cause suspected: `sidekiq_job_limiter_mode=compress` emits string-encoded `duration` / `queued_duration` in Pipeline Hook webhooks

### The setting

**`ApplicationSetting#sidekiq_job_limiter_mode`**, defined in `config/application_setting_columns/sidekiq_job_limiter_mode.yml`:

> "`track` or `compress`. Sets the behavior for [Sidekiq job size limits](../administration/settings/sidekiq_job_limits.md). Default: `compress`."

Companion setting: **`sidekiq_job_limiter_compression_threshold_bytes`**, which defaults to **100,000 bytes**.
This matches the observation that every failing webhook payload had `eventSizeBytes > 100 KB`.

### Where the two modes diverge

`lib/gitlab/sidekiq_middleware/size_limiter/validator.rb:76-93`:

- **`track`** mode: if payload exceeds size limit → report to Sentry, **schedule job unchanged**. Args stay as native Ruby objects through Sidekiq's stdlib `JSON.generate` / `JSON.parse` round-trip, so Floats are preserved as Floats.
- **`compress`** mode: if payload ≥ 100 KB threshold → `Validator#compress_if_necessary` calls `Compressor.compress`:
  - `lib/gitlab/sidekiq_middleware/size_limiter/compressor.rb:20-28`: `Base64.strict_encode64(Zlib::Deflate.deflate(job_args, 5))`; args replaced with a single-string `[base64_blob]`; `job['compressed'] = true`.
  - On worker pickup, `Compressor.decompress` (lines 30-36) runs `Gitlab::Json.load(Zlib::Inflate.inflate(Base64.strict_decode64(...)))`.
  - `Gitlab::Json.load` uses **Oj in `:rails` mode**, set globally at `config/initializers_before_autoloader/oj.rb:4`: `Oj.default_options = { mode: :rails }`.

### The mechanism

The trigger for string-encoded numerics is the `compress` path's use of **Oj `:rails` mode** to rehydrate args, instead of Sidekiq's stdlib `JSON.parse`. `track` mode never goes through Oj; Sidekiq's native `JSON.parse` loads Floats as Floats.

The `compress` path's round-trip through Oj `:rails` can produce `BigDecimal` instances (or values whose `#to_json` / `#as_json` emits quoted strings) for decimal numbers, depending on Oj's `bigdecimal_load` interaction with Rails. When the rehydrated hash is later re-encoded for the outgoing HTTP body by `Gitlab::Json::LimitedEncoder.encode`, which uses `Yajl::Encoder` (`gems/gitlab-utils/lib/gitlab/json.rb:286-299`), `BigDecimal` values serialize as JSON strings, producing exactly `"duration": "17.1"` instead of `"duration": 17.1`.

No explicit `Float → BigDecimal` cast exists in the application code, so the exact mechanism inside Oj/Yajl is the remaining puzzle piece.
However, the setting-level correlation is strong: the two modes are the only code paths that differ in how args are serialized, and only the `compress` path routes through Oj.

### Data flow

1. `PipelineHooksWorker` → `Ci::Pipelines::HookService#execute` → `WebHookService#async_execute`.
2. `app/services/web_hook_service.rb:136`:

   ```ruby
   WebHookWorker.perform_async(hook.id, data.deep_stringify_keys, hook_name.to_s, params)
   ```

   Here `data` contains `build.duration` / `build.queued_duration`, which are Ruby **Floats**, from `CommitStatus#duration = (end_time || Time.current) - start_time` at `app/models/concerns/ci/has_status.rb:161-165`. `deep_stringify_keys` only transforms keys; values stay Float.
3. Sidekiq size limiter runs. If `compress` + payload ≥ 100 KB → Deflate + Base64, args become `[base64_string]`.
4. Worker pickup → decompress via `Gitlab::Json.load` (Oj `:rails`).
5. `WebHookWorker#perform` (line 20) calls `Gitlab::WebHooks.prepare_data(data)`, which is a no-op for pipeline events (`lib/gitlab/web_hooks.rb:10-16`: just `with_indifferent_access`, no numeric normalization).
6. `WebHookService#make_request` (lines 154-156) encodes the body via `Gitlab::Json::LimitedEncoder.encode` → `Yajl::Encoder.encode` → outgoing JSON.
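Steps 3 and 4 can be modelled outside Rails to separate the compression itself from the JSON decode that follows it. The sketch below is a Go stand-in, not GitLab code: `compress`, `decompress`, and `roundTrip` are hypothetical helpers that mirror the Deflate (level 5) plus strict Base64 calls quoted above. It shows only that the byte-level round trip is lossless, which is consistent with the number-to-string change arising in the later parse or re-encode step rather than in compression.

```go
package main

import (
	"bytes"
	"compress/zlib"
	"encoding/base64"
	"fmt"
	"io"
)

// compress mirrors Base64.strict_encode64(Zlib::Deflate.deflate(args, 5)).
func compress(args []byte) (string, error) {
	var buf bytes.Buffer
	zw, err := zlib.NewWriterLevel(&buf, 5)
	if err != nil {
		return "", err
	}
	if _, err := zw.Write(args); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

// decompress mirrors Zlib::Inflate.inflate(Base64.strict_decode64(blob)).
func decompress(blob string) ([]byte, error) {
	raw, err := base64.StdEncoding.DecodeString(blob)
	if err != nil {
		return nil, err
	}
	zr, err := zlib.NewReader(bytes.NewReader(raw))
	if err != nil {
		return nil, err
	}
	defer zr.Close()
	return io.ReadAll(zr)
}

// roundTrip reports whether args survive compress + decompress byte for byte.
func roundTrip(args string) (bool, error) {
	blob, err := compress([]byte(args))
	if err != nil {
		return false, err
	}
	out, err := decompress(blob)
	if err != nil {
		return false, err
	}
	return bytes.Equal(out, []byte(args)), nil
}

func main() {
	args := `[0,{"builds":[{"duration":17.1,"queued_duration":0.4}]}]`
	same, err := roundTrip(args)
	fmt.Println("bytes identical after round trip:", same, "error:", err)
}
```

Since the serialized text comes back unchanged, any investigation on the server side can focus on how the decompressed JSON is parsed and how the resulting values are encoded again.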
### Verification on a live instance

```ruby
ApplicationSetting.current.sidekiq_job_limiter_mode
# => "track" or "compress"
ApplicationSetting.current.sidekiq_job_limiter_compression_threshold_bytes
# => 100000 (default)
```

Toggle to reproduce / stop:

```ruby
ApplicationSetting.current.update!(sidekiq_job_limiter_mode: 'compress') # re-trigger
ApplicationSetting.current.update!(sidekiq_job_limiter_mode: 'track')    # stop
```

End-to-end repro in a Rails console:

```ruby
# Pick a pipeline whose serialized webhook payload exceeds ~100 KB
data = Gitlab::DataBuilder::Pipeline.build(pipeline)
json_in = Sidekiq.dump_json([0, data.deep_stringify_keys])
blob = Base64.strict_encode64(Zlib::Deflate.deflate(json_in, 5))
round_tripped = Gitlab::Json.load(Zlib::Inflate.inflate(Base64.strict_decode64(blob)))
body = Gitlab::Json::LimitedEncoder.encode(round_tripped[1])
body.include?('"duration":"')
# => true proves the issue; the non-compress path produces "duration":17.1
```

# Additional Details

- GitLab Client Go Version: `v0.124.0` but Claude claims it affects all versions up to v2.20.1
- GitLab Instance Version: `18.0.6`
- Go Version: `go 1.25.9`
- License Tier: (unknown)