Add granular instrumentation to PostReceive worker for performance analysis

Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.

Part of Improve PostReceive Worker Performance to Meet ... (&19159)

Summary

To effectively optimize the PostReceive worker performance, we need detailed instrumentation to identify which specific components are causing performance bottlenecks. Current logging provides high-level metrics but lacks granular visibility into individual operations.

Problem

The PostReceive worker performs multiple tasks, but we cannot determine which specific operations are slow:

  • Cache expiration operations
  • Repository operations
  • CI pipeline creation (partially addressed)
  • Event processing
  • Merge request updates
  • Lock acquisition and holding

From the performance data, we see three main contributors:

  • redis_duration_s: Up to 9.8 seconds
  • db_duration_s: Variable database query times
  • cpu_s: Up to 11.6 seconds of CPU time

Proposal

We could add extra logging behind a feature flag and enable it for only a small percentage of requests. This should help us gather more information without significantly increasing log size.

from thread

Add detailed instrumentation to measure execution time for each major component:

  1. Cache Operations

    • repository.expire_caches_for_tags timing
    • Branch name cache operations
    • Repository cache operations
  2. Lock Operations

    • Lock acquisition wait time
    • Lock hold duration by operation type
    • Lock contention analysis
  3. Database Operations

    • Query-level timing for major operations
    • Transaction duration breakdown
  4. Event Processing

    • Individual event creation timing
    • Bulk operation performance
  5. Integration Points

    • External validation calls
    • Gitaly operation breakdown
    • Redis operation categorization

Implementation Details

  • Add instrumentation behind a feature flag (detailed_post_receive_instrumentation)
  • Enable for a small percentage of requests - 1%
  • Use structured logging with consistent field naming
  • Include correlation IDs for tracing related operations
  • Try using #log_hash_metadata_on_job_done to collect information on the number of refs processed.

Acceptance Criteria

  • Detailed timing logs for all major PostReceive components
  • Feature flag implementation for controlled rollout
  • Structured logging format for easy analysis
  • Validation on staging environment
Edited by 🤖 GitLab Bot 🤖