Improve performance of List Project Issues API under load to meet target
Summary
The Issue List API has been shown to perform poorly in several performance tests.
On several environments it returns a higher-than-target TTFB P90 of up to 900ms. On further analysis, the endpoint generally has high Rails CPU usage that can sometimes "trip" over on certain environment makeups, which in turn significantly impacts performance. It appears to share this trait with the Issue Details API.
In synthetic testing these scenarios arise mainly because the endpoint's CPU usage is high and close to the "tripping" edge. This is much more common on real environments, where other processes are also running, so the task here is to optimise the endpoint generally to give it much more headroom.
These were tested against the gitlabhq project with 6617 issues. The example URL of the request on Staging can be found on the Current Test Details page under the test name.
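For reference, a request against this endpoint can be sketched as below. The List Project Issues API takes a URL-encoded project path; the base URL, project path, and `per_page` value here are illustrative placeholders, not the exact parameters used by the tests.

```ruby
require "uri"

# Hypothetical sketch of building the List Project Issues request.
# The Staging host and project path are assumptions for illustration.
base    = "https://staging.example.com/api/v4"
project = URI.encode_www_form_component("gitlabhq/gitlabhq") # "/" must be URL-encoded
params  = URI.encode_www_form(per_page: 100)

url = "#{base}/projects/#{project}/issues?#{params}"
# Send with: curl --header "PRIVATE-TOKEN: <token>" "<url>"
```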
Additionally, this appears to have gotten slightly worse recently: over the last month we've seen the endpoint trip more often, suggesting something may have changed.
Details
Across our test environments we see times ranging from 250ms to 900ms. On one of the most impacted environments, 25k, it posts a TTFB P90 of around 840ms, and the metrics show it's the Rails processing of the data that's struggling:
* Environment: 25k
* Environment Version: 16.1.0-pre `87fd9f75025`
* Option: 60s_500rps
* Date: 2023-05-22
* Run Time: 1h 39m 11.67s (Start: 04:48:16 UTC, End: 06:27:27 UTC)
* GPT Version: v2.12.2
NAME | RPS | RPS RESULT | TTFB AVG | TTFB P90 | REQ STATUS | RESULT
---------------------------------------------------------|-------|----------------------|-----------|----------------------|----------------|---------
api_v4_projects_issues | 500/s | 488.17/s (>400.00/s) | 531.27ms | 837.79ms (<500ms) | 100.00% (>99%) | FAILED¹²
Note that 80% CPU here is the max for the Puma workers on the box, as selected automatically by Omnibus; some cores are left for other processes.
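For clarity on how the TTFB P90 figures above are derived: a P90 is the value below which 90% of the sampled request timings fall. A minimal sketch of that calculation (with linear interpolation between ranks, and illustrative sample values that are not taken from the test runs):

```ruby
# Sketch of a percentile calculation over raw TTFB samples (milliseconds).
# Sample values below are illustrative only.
def percentile(samples, p)
  sorted = samples.sort
  rank   = (p / 100.0) * (sorted.length - 1)
  lower  = sorted[rank.floor]
  upper  = sorted[rank.ceil]
  lower + (upper - lower) * (rank - rank.floor)
end

ttfb_ms = [412.0, 530.0, 498.0, 905.0, 640.0, 377.0, 820.0, 455.0, 701.0, 560.0]
p90 = percentile(ttfb_ms, 90) # ≈ 828.5
```

This also shows why the P90 (837.79ms on 25k) sits well above the average (531.27ms): a minority of slow responses dominates the tail.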
Conversely, on one of the environments where it performs better, 10k, it still takes up a lot of CPU, just not as much:
* Environment: 10k
* Environment Version: 16.1.0-pre `87fd9f75025`
* Option: 60s_200rps
* Date: 2023-05-25
* Run Time: 1h 38m 22.08s (Start: 04:44:50 UTC, End: 06:23:12 UTC)
* GPT Version: v2.12.2
NAME | RPS | RPS RESULT | TTFB AVG | TTFB P90 | REQ STATUS | RESULT
---------------------------------------------------------|-------|----------------------|-----------|-----------------------|----------------|--------
api_v4_projects_issues | 200/s | 196.21/s (>160.00/s) | 237.66ms | 274.94ms (<500ms) | 100.00% (>99%) | Passed¹
Thankfully, even at its worst the impact isn't too severe, so this is a severity4 based on our targets.