RFC: Process for Cross-Referencing Self-Managed Performance Issues with GitLab.com and GitLab Dedicated

Summary

This RFC proposes a standardized process for Support Engineers to cross-reference performance issues identified in self-managed customer environments against known issues in GitLab.com and GitLab Dedicated. This will enable more efficient issue resolution, reduce duplicate investigations, and help optimize all GitLab platforms simultaneously.

Motivation

Problem Statement

When customers report performance issues on self-managed instances, Support Engineers often investigate these issues in isolation without checking if similar problems have been identified or resolved in GitLab.com or GitLab Dedicated environments. This leads to:

  • Duplicated effort - Multiple investigations of the same underlying issue across different deployment types
  • Missed optimization opportunities - Performance improvements made for GitLab.com/Dedicated not being surfaced to self-managed customers
  • Slower resolution times - Reinventing solutions that may already exist
  • Inconsistent platform quality - Performance issues affecting one deployment type not being proactively addressed in others

Real-World Example

Recent performance issues with MergeRequests::Refresh::ApprovalWorker and UpdateMergeRequestsWorker were identified and tracked separately:

  • Issue #584087 - Slow performance caused by checking approvals on closed MRs
  • Issue #548046 - UpdateMergeRequestsWorker not meeting performance targets on GitLab Dedicated

These issues are related and belong to the same performance epic, but a self-managed customer experiencing similar symptoms might not immediately discover these existing investigations, especially since some may be marked confidential.

Additionally, such issues may receive a low resolution priority because their full impact across deployment types is not documented.

Proposal

Proposed Process

Much of this reflects what we already do informally, but documenting it would help newer colleagues and anyone just starting out in self-managed performance diagnostics.

These would be documented as guidelines, not rules.

1. Initial Performance Issue Identification

When a customer reports a performance issue on self-managed:

  • Document the symptoms (slow workers, high memory usage, query patterns, etc.)
  • Identify the specific components involved (workers, services, database queries)
  • Collect relevant metrics and logs

2. Cross-Reference Check

Perform a systematic check:

Search GitLab Issues:

  • Search gitlab-org/gitlab for related worker names, service names, or error patterns
  • Filter by labels: infradev, SLO::Missed, GitLab Dedicated, performance, bug::availability
  • Check for issues in the relevant epic (e.g., "Performance (Code Review)" for MR-related issues)
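The issue search above can also be scripted against the GitLab Issues REST API (`GET /projects/:id/issues` supports `search`, `labels`, and `state` parameters). A minimal sketch, assuming the label names and search term from the example above; adjust both to the case at hand:

```python
# Sketch: build a GitLab Issues API search URL for cross-referencing a
# worker name against known performance issues in gitlab-org/gitlab.
from urllib.parse import urlencode, quote

GITLAB_API = "https://gitlab.com/api/v4"
PROJECT = "gitlab-org/gitlab"  # project path, URL-encoded below


def issue_search_url(search_term, labels):
    """Return a GET URL for /projects/:id/issues filtered by labels and text."""
    params = urlencode({
        "search": search_term,       # matches issue title and description
        "labels": ",".join(labels),  # issues must carry all of these labels
        "state": "opened",
        "order_by": "updated_at",
    })
    return f"{GITLAB_API}/projects/{quote(PROJECT, safe='')}/issues?{params}"


url = issue_search_url("UpdateMergeRequestsWorker",
                       ["performance", "GitLab Dedicated"])
print(url)
```

Note that text search will not surface confidential issues unless the caller's token has access to them, which is one more reason to also consult the development teams directly.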

Check Infrastructure Trackers:

  • Look for incident reports in gitlab-com/gl-infra/gitlab-dedicated/incident-management

Consult with Development Teams:

  • Tag relevant group (e.g., @gitlab-org/create/code-review) if similar patterns are found
  • Reference any related epics or ongoing performance initiatives

3. Documentation and Linking

When a match is found:

  • Link the customer ticket to the relevant GitLab issue(s)
  • Add context about the self-managed environment (version, scale, configuration differences)
  • Add customer label to the GitLab issue if not already present
  • Document any workarounds or mitigations that worked for .com/Dedicated

When no match is found:

  • Search the Elasticsearch logs of GitLab.com and large GitLab Dedicated environments for similar patterns:
    • Worker names (e.g., MergeRequestResetApprovalsWorker, UpdateMergeRequestsWorker)
    • Service names and error patterns
    • Query patterns or performance signatures (high db_count, redis_calls, duration_s)
    • Similar symptoms: timeouts, N+1 queries, memory spikes
  • If similar patterns are found in .com/Dedicated logs:
    • Create a new issue in gitlab-org/gitlab documenting the problem
    • Include evidence from both self-managed customer environment and .com/Dedicated
    • Add relevant labels: customer, infradev, performance, deployment type labels
    • Tag appropriate engineering group and link to related epics if applicable
    • Reference the customer ticket (following confidentiality guidelines)
    • Example: MergeRequests::Refresh::ApprovalWorker exhibits... (gitlab-org/gitlab#585597 - closed)
  • If no patterns are found in .com/Dedicated:
    • Document this in the customer ticket as potentially self-managed specific
    • Consider if it's configuration-related or scale-related
    • Still create a GitLab issue if the problem is significant and reproducible
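The log search described above can be expressed as an Elasticsearch bool query. A minimal sketch: the `duration_s` and `db_count` fields come from GitLab's structured Sidekiq JSON logs, but the `json.` field prefix, the `.keyword` suffixes, and the thresholds are illustrative assumptions that depend on the local index mapping:

```python
# Sketch: an Elasticsearch bool query for finding slow executions of a
# given Sidekiq worker in structured logs. Thresholds and field prefixes
# are illustrative; adapt to the actual index mapping.
import json


def slow_worker_query(worker_class, min_duration_s=10, min_db_count=100):
    """Match completed jobs of one worker exceeding duration or DB-call thresholds."""
    return {
        "query": {
            "bool": {
                "filter": [
                    {"term": {"json.class.keyword": worker_class}},
                    {"term": {"json.job_status.keyword": "done"}},
                ],
                # At least one performance signature must match.
                "should": [
                    {"range": {"json.duration_s": {"gte": min_duration_s}}},
                    {"range": {"json.db_count": {"gte": min_db_count}}},
                ],
                "minimum_should_match": 1,
            }
        },
        "sort": [{"json.duration_s": {"order": "desc"}}],
        "size": 50,
    }


q = slow_worker_query("MergeRequests::Refresh::ApprovalWorker")
print(json.dumps(q, indent=2))
```

Saving queries like this as shared templates (per the Tools and Resources section) would let any engineer swap in a worker name and compare customer symptoms against .com/Dedicated behavior quickly.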

4. Feedback Loop

  • If the self-managed issue reveals new information, contribute it to the existing GitLab issue
  • If no existing issue is found but the problem is significant, create a new issue and cross-reference it with infrastructure teams
  • Track resolution and ensure fixes are validated across all deployment types

Tools and Resources

Proposed Resources:

  • Curated list of common performance epics and their associated issues
  • Search query templates for common performance patterns
  • Slack channels for quick consultation (#g_create_code-review, #f_gitlab_dedicated, etc.)
  • Dashboard or wiki page tracking active performance investigations

Success Metrics

  • Reduction in time-to-resolution for performance issues
  • Increase in customer issues linked to existing GitLab.com/Dedicated issues
  • Number of self-managed insights that contribute to platform-wide improvements
  • Customer satisfaction scores for performance-related tickets

Alternatives Considered

  1. Status Quo - Continue investigating issues independently

    • Pros: No process change required
    • Cons: Continues current inefficiencies
  2. Automated Matching System - Build tooling to automatically match customer issues to known problems

    • Pros: Fully automated, no manual effort
    • Cons: Complex to build, may have false positives, requires significant engineering investment

Open Questions

  1. Should this process be mandatory for all performance issues or only those meeting certain severity/impact thresholds?
  2. What's the best way to maintain a curated list of active performance investigations?
  3. Should we create a specific label (e.g., self-managed::performance) to make cross-referencing easier?
  4. How do we ensure this process doesn't add significant overhead to ticket handling?

Feedback Requested

  • Is this process valuable for your workflow?
  • What challenges do you foresee in implementation?
  • What tools or resources would make this easier?
Edited by Alvin Gounder