Investigate 377s of unaccounted wall-clock time in EnrichmentJob (span truncation)

Problem

A production Billing::Usage::EnrichmentJob Sentry span shows 382s total wall-clock time, but only ~5s is covered by the 999 captured spans. The remaining ~377s is completely unaccounted for.

Sentry hit its 999-span limit during Phase 3 (per-event processing), which means Phase 4 (ClickHouse INSERT of enriched events and failures) is entirely invisible, along with any remaining per-event processing.

What we know

Phase Spans Duration Status
Phase 1: ClickHouse fetch (raw events) 1 384 ms Captured
Phase 2: Preload (finders, orders) 42 601 ms Captured
Phase 3: Per-event processing 956 3,942 ms Partially captured (truncated at 999 spans)
Phase 4: Commit (ClickHouse INSERT) 0 Unknown Not captured
Unaccounted — ~377s —

Possible causes

  1. ClickHouse INSERT latency: insert_enriched_events_sync builds a single large INSERT with ~100 rows of enriched event data. If ClickHouse is under load or the INSERT is large, this could take significant time.

  2. CPU-bound event processing: Each event goes through enrich_event which calls:

    • event.serializable_hash — hash serialization
    • .merge(strategy.build_enrich_event_data) — hash merging
    • .transform_keys(&:camelize) — key transformation on every hash
    • hash["metadata"] = hash["metadata"].to_json — JSON serialization
    • Billing::Conversion::GitlabUnitConverter.calculate_gitlab_credits — pricing rule computation per LLM operation

    With 100+ events × multiple LLM operations each, this CPU work may be significant, especially with GC pressure from all the intermediate hash allocations.

  3. Unleash feature flag HTTP calls: BillingEligibility::SubscriptionService checks Unleash.enabled?(:deprecate_free_dap_code_suggestions) and Unleash.enabled?(:enable_dedicated_billing_eligibility) per event. If Unleash makes HTTP calls (rather than using a cached client), this could add latency. These calls may not produce Sentry spans.

  4. Ruby GC pressure: Processing 100+ large event hashes with multiple intermediate transformations creates significant garbage. Long GC pauses could account for elapsed time not visible in spans.

  5. The remaining per-event processing that was truncated: Only part of Phase 3 was captured before the span limit was hit. The tail end of per-event processing is also invisible.

Proposed investigation

  1. Increase Sentry transaction span limit for this job (if configurable) to capture the full trace
  2. Add manual Sentry spans around the commit phase (insert_enriched_events_sync, insert_enrichment_failures) and key CPU-intensive operations
  3. Profile a staging job with stackprof (wall-clock mode) to identify where time is actually spent:
    StackProf.run(mode: :wall, out: '/tmp/enrichment_profile.dump') { service.execute }
  4. Log wall-clock timestamps at phase boundaries (after fetch, after preload, after enrichment loop, after commit) to narrow down which phase consumes the 377s

Relevant files

  • app/services/billing/usage/enrichment_service.rb:166-199 — process_events orchestrates all phases
  • app/services/billing/usage/enrichment/build_enriched_event_service.rb:40-55 — enrich_event hash construction
  • app/models/billing/usage/enriched.rb:160-177 — insert_enriched_events_sync ClickHouse INSERT
  • app/models/billing/conversion/gitlab_unit_converter.rb:14-19 — convert_llm_operations per event
  • app/jobs/billing/usage/enrichment_job.rb — job entry point
Assignee Loading
Time tracking Loading