Investigate 10 RPS performance spikes in git clone testing
📜 Summary
While building the first pass of the xk6-exec POC, we uncovered a critical performance concern: clone times degraded roughly 3x at just 10 RPS (30s → 80s) and roughly 12x at 80 RPS (→ 350s), far below the infrastructure's rated 200 RPS capacity. This prevented test runs from reaching their target load levels.
We need to investigate the root cause to determine whether this is a test design issue, environment configuration problem, or application bottleneck.
In the 60s_10vuser results we noticed that the Gitaly graphs showed spikes: bursts of heavy traffic followed by lulls of light load. At the 60s_80Vuser level, the spikes grew much larger, with long stretches of no traffic between them.
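For context, a minimal sketch of what a clone test of this shape might look like, assuming the grafana/xk6-exec extension's `exec.command` API; the repository URL, clone path, and custom metric name are placeholders, not the actual POC script:

```javascript
// Hypothetical xk6-exec clone test (assumes the k6/x/exec module from
// grafana/xk6-exec is compiled into the k6 binary).
import exec from 'k6/x/exec';
import { Trend } from 'k6/metrics';

// Custom trend so clone time shows up alongside the built-in k6 metrics.
const cloneDuration = new Trend('git_clone_duration', true);

export default function () {
  const start = Date.now();
  // Clone into a per-VU, per-iteration directory so parallel runs don't
  // collide on disk; URL and path are placeholders.
  exec.command('git', [
    'clone', '--depth', '1',
    'https://gitlab.example.com/perf/test-repo.git',
    `/tmp/clone-${__VU}-${__ITER}`,
  ]);
  cloneDuration.add(Date.now() - start);
}
```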
🥅 Goal
- Identify the root cause(s) of the performance spikes
- Gather sufficient data/evidence to inform a fix (by us or other teams)
- Determine next steps and ownership for resolution
🏁 Exit Criteria
- Extended load tests executed with adjusted parameters (longer duration, varied ramp-up); see the sketch after this list
- Grafana metrics analyzed (Gitaly queue depth, network bandwidth, resource utilization)
- Findings documented, including:
  - Root cause hypothesis with supporting evidence
  - Whether the issue is test design, environment config, or application-level
  - Recommended fix and which team(s) should implement it
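As a starting point for the adjusted runs, an illustrative k6 scenario configuration with a longer duration and a staged ramp-up; the executor choice and stage values below are assumptions for discussion, not the agreed test parameters:

```javascript
// Illustrative options for the extended runs: ramp gradually, then hold
// at 10 RPS long enough to see whether the spike/lull pattern persists
// before pushing toward 80 RPS.
export const options = {
  scenarios: {
    clone_ramp: {
      executor: 'ramping-arrival-rate',
      startRate: 1,
      timeUnit: '1s',
      preAllocatedVUs: 100,
      stages: [
        { target: 10, duration: '5m' },  // slow ramp to 10 RPS
        { target: 10, duration: '15m' }, // hold: do spikes persist?
        { target: 80, duration: '10m' }, // gradual ramp toward 80 RPS
      ],
    },
  },
};
```

An arrival-rate executor keeps request rate (RPS) as the driven variable rather than VU count, which matches how the 10 RPS and 80 RPS targets are expressed in this issue.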
🚫 Non-Goals
- Implementing the fix (expected to be a multi-team effort)
- Achieving target load levels during this investigation