Investigate 377s of unaccounted wall-clock time in EnrichmentJob (span truncation)
Problem
A production `Billing::Usage::EnrichmentJob` Sentry transaction shows 382s of total wall-clock time, but only ~5s is covered by the 999 captured spans. The remaining ~377s is completely unaccounted for.
Sentry hit its 999-span limit during Phase 3 (per-event processing), which means Phase 4 (the ClickHouse INSERT of enriched events and failures) is entirely invisible, along with any remaining per-event processing.
What we know
| Phase | Spans | Duration | Status |
|---|---|---|---|
| Phase 1: ClickHouse fetch (raw events) | 1 | 384 ms | Captured |
| Phase 2: Preload (finders, orders) | 42 | 601 ms | Captured |
| Phase 3: Per-event processing | 956 | 3,942 ms | Partially captured (truncated at 999 spans) |
| Phase 4: Commit (ClickHouse INSERT) | 0 | Unknown | Not captured |
| Unaccounted | — | ~377s | — |
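The ~377s figure follows directly from the table: total wall-clock time minus the captured phase durations. A one-line sanity check (values taken from the trace above):

```ruby
# Gap between transaction wall-clock time and captured span time.
total_s     = 382.0                 # transaction duration
captured_ms = 384 + 601 + 3_942     # Phases 1-3, in milliseconds

unaccounted = total_s - captured_ms / 1000.0
puts unaccounted.round(1)           # => 377.1
```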
Possible causes
- **ClickHouse INSERT latency**: `insert_enriched_events_sync` builds a single large INSERT with ~100 rows of enriched event data. If ClickHouse is under load or the INSERT is large, this could take significant time.
- **CPU-bound event processing**: Each event goes through `enrich_event`, which calls:
  - `event.serializable_hash` — hash serialization
  - `.merge(strategy.build_enrich_event_data)` — hash merging
  - `.transform_keys(&:camelize)` — key transformation on every hash
  - `hash["metadata"] = hash["metadata"].to_json` — JSON serialization
  - `Billing::Conversion::GitlabUnitConverter.calculate_gitlab_credits` — pricing rule computation per LLM operation

  With 100+ events × multiple LLM operations each, this CPU work may be significant, especially with GC pressure from all the intermediate hash allocations.
- **Unleash feature flag HTTP calls**: `BillingEligibility::SubscriptionService` checks `Unleash.enabled?(:deprecate_free_dap_code_suggestions)` and `Unleash.enabled?(:enable_dedicated_billing_eligibility)` per event. If Unleash makes HTTP calls (rather than using a cached client), this could add latency. These calls may not produce Sentry spans.
- **Ruby GC pressure**: Processing 100+ large event hashes with multiple intermediate transformations creates significant garbage. Long GC pauses could account for elapsed time not visible in spans.
- **Truncated per-event processing**: Only part of Phase 3 was captured before the span limit was hit; the tail end of the per-event loop is also invisible.
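If GC pressure is a real contributor, it can be measured directly with the Ruby standard library. A minimal sketch: the `with_gc_stats` helper and the hash-transformation workload below are illustrative stand-ins for wrapping the actual enrichment loop, not the job's real code.

```ruby
require "json"

# Report elapsed time, object allocations, and GC activity for a block.
def with_gc_stats(label)
  GC::Profiler.enable
  allocated_before = GC.stat(:total_allocated_objects)
  gc_runs_before   = GC.stat(:count)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)

  result = yield

  elapsed   = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  allocated = GC.stat(:total_allocated_objects) - allocated_before
  gc_runs   = GC.stat(:count) - gc_runs_before
  gc_time   = GC::Profiler.total_time # seconds spent in GC since enable
  GC::Profiler.disable
  GC::Profiler.clear
  puts format("%s: elapsed=%.3fs allocated=%d gc_runs=%d gc_time=%.3fs",
              label, elapsed, allocated, gc_runs, gc_time)
  result
end

# Stand-in workload mimicking the per-event transformation chain
# (hash merge, key transform, JSON-serialize metadata):
events = Array.new(100) { { "metadata" => { "tokens" => 1 }, "op" => "chat" } }
with_gc_stats("enrichment_loop") do
  events.map do |event|
    hash = event.merge("credits" => 1).transform_keys(&:upcase)
    hash["METADATA"] = hash["METADATA"].to_json
    hash
  end
end
```

If `gc_time` is a large fraction of `elapsed` around the real loop, the GC-pressure hypothesis gains weight without needing a full profile.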
Proposed investigation
- Increase the Sentry transaction span limit for this job (if configurable) to capture the full trace
- Add manual Sentry spans around the commit phase (`insert_enriched_events_sync`, `insert_enrichment_failures`) and around key CPU-intensive operations
- Profile a staging job with `stackprof` (wall-clock mode) to identify where time is actually spent:

  ```ruby
  StackProf.run(mode: :wall, out: '/tmp/enrichment_profile.dump') { service.execute }
  ```

- Log wall-clock timestamps at phase boundaries (after fetch, after preload, after enrichment loop, after commit) to narrow down which phase consumes the ~377s
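The manual-span and phase-timestamp steps can share one helper. A hedged sketch, not the job's actual code: `instrument_phase` logs each phase's wall-clock duration and, when the sentry-ruby SDK is loaded, also wraps the phase in a manual child span via `Sentry.with_child_span` (part of sentry-ruby's public API). The phase names and `op` prefix are illustrative.

```ruby
# Log a phase's wall-clock duration; open a Sentry child span when available.
def instrument_phase(name)
  run = lambda do
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result  = yield
    elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
    puts format("enrichment_phase=%s duration_s=%.3f", name, elapsed)
    result
  end

  if defined?(Sentry) && Sentry.respond_to?(:with_child_span)
    Sentry.with_child_span(op: "enrichment.#{name}") { run.call }
  else
    run.call # no SDK loaded: still get the log line
  end
end

# Usage at each phase boundary, e.g. around the commit phase:
instrument_phase(:commit) { :inserted } # stand-in for insert_enriched_events_sync
```

Even if the span-limit change proves impossible, the log lines alone pin the ~377s to a phase.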
Relevant files
- `app/services/billing/usage/enrichment_service.rb:166-199` — `process_events` orchestrates all phases
- `app/services/billing/usage/enrichment/build_enriched_event_service.rb:40-55` — `enrich_event` hash construction
- `app/models/billing/usage/enriched.rb:160-177` — `insert_enriched_events_sync` ClickHouse INSERT
- `app/models/billing/conversion/gitlab_unit_converter.rb:14-19` — `convert_llm_operations` per event
- `app/jobs/billing/usage/enrichment_job.rb` — job entry point