Investigate 10 RPS performance spikes in git clone testing

📜 Summary

While building the first pass of the xk6-exec POC, we uncovered a critical performance concern: clone times rose roughly 3x at just 10 RPS (30s → 80s) and roughly 12x at 80 RPS (→ 350s), both far below the infrastructure's rated 200 RPS capacity. This degradation prevented test runs from reaching target load levels.
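To make the figures above easier to compare, here is a small sketch that recomputes the slowdown factors and the share of rated capacity at each load level. It assumes the 30s figure is the unloaded baseline clone time; the function and constant names are illustrative only.

```python
# Figures quoted in the summary; 30s is assumed to be the unloaded baseline.
BASELINE_CLONE_S = 30
RATED_CAPACITY_RPS = 200

# Load level (RPS) -> observed clone time (seconds)
observed = {10: 80, 80: 350}


def slowdown(clone_s, baseline_s=BASELINE_CLONE_S):
    """How many times slower a clone is than the baseline."""
    return clone_s / baseline_s


def capacity_share(rps, rated_rps=RATED_CAPACITY_RPS):
    """Fraction of the rated capacity that a given load level represents."""
    return rps / rated_rps


for rps, clone_s in observed.items():
    print(
        f"{rps} RPS: {slowdown(clone_s):.1f}x slower "
        f"at {capacity_share(rps):.0%} of rated capacity"
    )
```

With these inputs the slowdown is about 2.7x at 5% of rated capacity and about 11.7x at 40%, which is why the degradation looks out of proportion to the load.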

We need to investigate the root cause to determine whether this is a test design issue, an environment configuration problem, or an application bottleneck.

In the 60s_10vuser results, the Gitaly graphs showed a spiky pattern: bursts of heavy traffic followed by lulls of light load. At the 60s_80Vuser level, the spikes grew much higher and were separated by long stretches with no traffic at all.
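One way to put numbers on the spike-and-lull pattern, rather than judging it by eye, is sketched below. It assumes the Gitaly request counts can be exported as a list of per-interval samples. The sample values are invented for illustration and are not measured data, and the lull threshold is an arbitrary choice.

```python
from statistics import mean, pstdev


def burstiness(samples):
    """Coefficient of variation: population std dev divided by the mean.

    Near 0 for flat traffic, growing as traffic gets spikier.
    """
    avg = mean(samples)
    return pstdev(samples) / avg if avg else 0.0


def longest_lull(samples, threshold):
    """Length of the longest run of consecutive samples at or below threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value <= threshold else 0
        longest = max(longest, current)
    return longest


# Hypothetical per-interval request counts, NOT taken from the real graphs.
steady = [10, 11, 9, 10, 10, 11, 9, 10]
spiky = [40, 0, 0, 0, 38, 0, 0, 2]

for name, series in (("steady", steady), ("spiky", spiky)):
    print(name, round(burstiness(series), 2), longest_lull(series, threshold=2))
```

Both series carry the same total traffic, so the difference in the two metrics comes purely from how the requests are distributed over time.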

🥅 Goal

- Identify the root cause(s) of the performance spikes
- Gather sufficient data/evidence to inform a fix (by us or other teams)
- Determine next steps and ownership for resolution

🏁 Exit Criteria

- Extended load tests executed with adjusted parameters (longer duration, varied ramp-up)
- Grafana metrics analyzed (Gitaly queue depth, network bandwidth, resource utilization)
- Findings documented, including:
  - Root cause hypothesis with supporting evidence
  - Whether the issue is test design, environment config, or application-level
  - Recommended fix and which team(s) should implement it
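The "adjusted parameters" in the first criterion could be enumerated as a small run matrix, sketched below. The `duration` and `target` field names mirror the shape of k6's `stages` option. The issue does not say how the POC actually configures its runs, so the ramp and hold values, the run naming, and the helper names are all hypothetical. Whether `target` means virtual users or an arrival rate depends on the executor chosen.

```python
from itertools import product


def stages(target, ramp_s, hold_s):
    """Two stages: ramp up to the target, then hold it."""
    return [
        {"duration": f"{ramp_s}s", "target": target},  # ramp-up
        {"duration": f"{hold_s}s", "target": target},  # sustained load
    ]


def build_matrix(targets, ramp_seconds, hold_seconds):
    """One named run per combination of target, ramp-up time, and hold time."""
    runs = {}
    for target, ramp_s, hold_s in product(targets, ramp_seconds, hold_seconds):
        name = f"ramp{ramp_s}s_hold{hold_s}s_{target}"
        runs[name] = stages(target, ramp_s, hold_s)
    return runs


# Targets 10 and 80 match the load levels discussed in this issue;
# the ramp and hold durations are placeholder values.
matrix = build_matrix(targets=[10, 80], ramp_seconds=[30, 120], hold_seconds=[300])

for name, run_stages in matrix.items():
    print(name, run_stages)
```

Holding the target fixed while varying only the ramp-up time should help separate ramp-related effects from steady-state ones.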

🚫 Non-Goals

- Implementing the fix (expected to be a multi-team effort)
- Achieving target load levels during this investigation

Edited Dec 09, 2025 by Andy Hohenner