Add granular instrumentation to PostReceive worker for performance analysis (#566545)


Part of gitlab-org&19159+

### Summary

To effectively optimize the `PostReceive` worker performance, we need detailed instrumentation to identify which specific components are causing performance bottlenecks. Current logging provides high-level metrics but lacks granular visibility into individual operations.

### Problem

The `PostReceive` worker performs multiple tasks, but we cannot determine which specific operations are slow:

- Cache expiration operations
- Repository operations
- CI pipeline creation (partially addressed)
- Event processing
- Merge request updates
- Lock acquisition and holding

From the performance data, we see three main contributors:

- `redis_duration_s`: Up to 9.8 seconds
- `db_duration_s`: Variable database query times
- `cpu_s`: Up to 11.6 seconds of CPU time

### Proposal

> We could add extra logging behind a feature flag and enable it for only a small percentage of requests. This should help us gather more information without significantly increasing log size.

[from thread](https://gitlab.com/gitlab-org/gitlab/-/issues/553426#note_2695522080)

Add detailed instrumentation to measure execution time for each major component:

1. **Cache Operations**
   - `repository.expire_caches_for_tags` timing
   - Branch name cache operations
   - Repository cache operations
2. **Lock Operations**
   - Lock acquisition wait time
   - Lock hold duration by operation type
   - Lock contention analysis
3. **Database Operations**
   - Query-level timing for major operations
   - Transaction duration breakdown
4. **Event Processing**
   - Individual event creation timing
   - Bulk operation performance
5. **Integration Points**
   - External validation calls
   - Gitaly operation breakdown
   - Redis operation categorization

### Implementation Details

- Add instrumentation behind a feature flag (`detailed_post_receive_instrumentation`)
- Enable for a small percentage of requests (1%)
- Use structured logging with consistent field naming
- Include correlation IDs for tracing related operations
- Try using [`#log_hash_metadata_on_job_done`](https://gitlab.com/gitlab-org/gitlab/-/blob/7da0ead0f3172c6777ac121158e9814af6893f72/app/workers/concerns/application_worker.rb#L40) to collect information on the number of refs processed.

### Acceptance Criteria

- [ ] Detailed timing logs for all major PostReceive components
- [ ] Feature flag implementation for controlled rollout
- [ ] Structured logging format for easy analysis
- [ ] Validation on staging environment
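
### Illustrative sketch

To make the per-component timing idea concrete, here is a minimal, self-contained Ruby sketch. It is not GitLab code. `ComponentTimer`, `sampled?`, and the `*_duration_s` / `*_count` field names are hypothetical and chosen only for illustration. The `sampled?` helper stands in for enabling the `detailed_post_receive_instrumentation` flag on roughly 1% of jobs; how the real flag check and rollout would work is left open here.

```ruby
require 'json'

# Hypothetical helper (not part of GitLab): accumulates wall-clock time per
# named component so one structured log entry can be emitted per job.
class ComponentTimer
  # `clock` is injectable so tests can supply deterministic timestamps.
  def initialize(enabled:, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @enabled = enabled
    @clock = clock
    @durations = Hash.new(0.0)
    @counts = Hash.new(0)
  end

  # Runs the block and, when enabled, records its duration under `component`.
  # The block's return value is passed through unchanged either way.
  def measure(component)
    return yield unless @enabled

    started_at = @clock.call
    begin
      yield
    ensure
      # Recorded even if the block raises, so failures still show up in timings.
      @durations[component] += @clock.call - started_at
      @counts[component] += 1
    end
  end

  # Flat hash with consistent field naming, e.g. "cache_expiration_duration_s".
  def to_log_hash
    @durations.each_with_object({}) do |(component, seconds), hash|
      hash["#{component}_duration_s"] = seconds.round(6)
      hash["#{component}_count"] = @counts[component]
    end
  end
end

# Assumption: a simple random sample standing in for a percentage-based
# feature flag rollout (1% of jobs by default).
def sampled?(rate: 0.01, random: Random.new)
  random.rand < rate
end

# Usage: wrap each major step of the job, then emit one structured entry.
timer = ComponentTimer.new(enabled: true)
timer.measure(:cache_expiration) { 1_000.times.sum }
timer.measure(:lock_wait) { :acquired }
puts JSON.generate(timer.to_log_hash)
```

In the worker itself, the resulting hash could plausibly be handed to `#log_hash_metadata_on_job_done` rather than printed, assuming that method accepts a flat hash of fields. Accumulating per component and logging once per job keeps the extra log volume to a single set of fields for each sampled job.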
