Investigate 10 RPS performance spikes in git clone testing

📜 Summary

While building the first pass of the xk6-exec POC, we uncovered a critical performance concern: clone times degraded roughly 3x at just 10 RPS (30s → 80s) and roughly 12x at 80 RPS (30s → 350s), both far below the infrastructure's rated 200 RPS capacity. As a result, test runs could not reach their target load levels.
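
As a quick sanity check on the slowdown figures quoted above, the following sketch recomputes them from the clone times in this summary. The only inputs are the 30s baseline and the 80s / 350s observations; the names `BASELINE_S`, `observed`, and `slowdown` are illustrative and not part of any test harness.

```python
# Clone times quoted in the summary, in seconds.
BASELINE_S = 30                # clone time before degradation
observed = {10: 80, 80: 350}   # request rate (RPS) -> observed clone time (s)


def slowdown(clone_s, baseline_s=BASELINE_S):
    """Return how many times slower a clone is than the baseline."""
    return clone_s / baseline_s


if __name__ == "__main__":
    for rps, clone_s in observed.items():
        print(f"{rps} RPS: {clone_s}s is {slowdown(clone_s):.1f}x the {BASELINE_S}s baseline")
```

The exact ratios come out slightly under the rounded 3x and 12x figures (about 2.7x and 11.7x), which does not change the conclusion.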

We need to investigate the root cause to determine whether this is a test design issue, an environment configuration problem, or an application bottleneck.

In the 60s_10vuser results, the Gitaly graphs showed a bursty pattern: spikes of heavy traffic followed by lulls of light load. At the 60s_80Vuser level, the spikes grew much higher and were separated by long stretches of no traffic.
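
One possible way to put numbers on the spike-and-lull pattern described above, so that runs can be compared, is sketched below. It assumes per-interval request counts have been exported from the Gitaly dashboards by some means. The function `burst_profile`, the `lull_fraction` threshold, and the sample values are all hypothetical and are not measurements from these runs.

```python
from statistics import mean


def burst_profile(counts, lull_fraction=0.1):
    """Summarize burstiness of a series of per-interval request counts.

    peak_to_mean: highest interval divided by the average interval.
    lull_share:   fraction of intervals at or below lull_fraction * average.
    """
    avg = mean(counts)
    lull_threshold = avg * lull_fraction
    lulls = sum(1 for c in counts if c <= lull_threshold)
    return {
        "peak_to_mean": max(counts) / avg if avg else 0.0,
        "lull_share": lulls / len(counts),
    }


if __name__ == "__main__":
    # Made-up counts purely to show the shape of the output.
    sample = [0, 0, 120, 95, 0, 0, 0, 110, 5, 0]
    print(burst_profile(sample))
```

A perfectly steady load gives a peak-to-mean ratio of 1 and a lull share of 0, so higher values in either field would indicate the kind of bursty traffic seen in the graphs.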

🥅 Goal

  • Identify the root cause(s) of the performance spikes
  • Gather sufficient data/evidence to inform a fix (by us or other teams)
  • Determine next steps and ownership for resolution

🏁 Exit Criteria

  • Extended load tests executed with adjusted parameters (longer duration, varied ramp-up)
  • Grafana metrics analyzed (Gitaly queue depth, network bandwidth, resource utilization)
  • Findings documented including:
    • Root cause hypothesis with supporting evidence
    • Whether the issue is test design, environment config, or application-level
    • Recommended fix and which team(s) should implement it

🚫 Non-Goals

  • Implementing the fix (expected to be a multi-team effort)
  • Achieving target load levels during this investigation
Edited by Andy Hohenner