Branches List API performance degrades notably under load
Summary
With our increased performance testing efforts we've started to identify slow areas in GitLab and raise them accordingly.
One such area is the Branches API, specifically when listing all branches. We've found that, while the API performs fine for single requests, its performance degrades notably under load.
For reference, the test lists the branches of the gitlab-ce project (~2150) at the throughput the environment is expected to handle. In this case we test against our reference 10k user environment, which is expected to handle a throughput of 200 requests per second (RPS).
When testing the API at a low 2 RPS we get a P95 response time of 780.85ms:
Environment: 10k
Scenario: 20s_2rps
Version: 12.0.2-ee ef76b54fc1e
NAME | RESULT | DURATION | P95 | RPS_COUNT | RPS_MEAN
------------------------------------|--------|----------|----------|-----------|-----------
api_v4_projects_repository_branches | Passed | 20.0s | 780.85ms | 37 | 1.849993/s
When increasing this to 200 RPS, though, the response time rises sharply and varies wildly between runs. In one test run we saw a P95 of 5817.13ms, roughly a 645% increase (about 7.4 times the 2 RPS figure):
NAME | RESULT | DURATION | P95 | RPS_COUNT | RPS_MEAN
------------------------------------|--------|----------|-----------|-----------|------------
api_v4_projects_repository_branches | Passed | 30.0s | 5817.13ms | 878 | 29.266581/s
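As a quick check on the figures in the two result tables above, the following Python sketch recomputes the P95 increase and the achieved throughput. The variable names are illustrative, the values are copied from the tables, and it assumes RPS_MEAN is roughly RPS_COUNT divided by DURATION.

```python
# Values copied from the two result tables above.
p95_low_ms = 780.85    # P95 in the 2 RPS scenario
p95_high_ms = 5817.13  # P95 in one run of the 200 RPS scenario

# Relative increase of the P95 response time, as a percentage.
increase_pct = (p95_high_ms - p95_low_ms) / p95_low_ms * 100
slowdown_factor = p95_high_ms / p95_low_ms

# Achieved throughput, assuming RPS_MEAN ~= RPS_COUNT / DURATION.
achieved_rps = 878 / 30.0

print(f"P95 increase: {increase_pct:.0f}% ({slowdown_factor:.1f}x)")
print(f"Achieved throughput: {achieved_rps:.1f} RPS")
```

Note that the mean rate reported in the second table is well below the 200 RPS the scenario was aiming for.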
Examining the logs and metrics for the environment, it appears that at scale the endpoint causes significant resource spikes (CPU, memory). We also noticed high amounts of garbage collection (GC).
A snapshot of the monitoring dashboard can be found here - https://snapshot.raintank.io/dashboard/snapshot/xAYmbpGqDXd4gRM4xe5POogMgjKS9401
Additionally we noticed two other points of information that may be useful:
- When running the test at 200 RPS the API threw numerous 504 error codes (200: 2808, 504: 1868)
- Prometheus monitoring dropped out for the gitlab-rails jobs when the API was being load tested, suggesting the webservers became unresponsive during the load test.
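To put the status code counts above in proportion, this minimal Python sketch computes the share of requests that returned 504. The counts come from the list above; the dictionary name is illustrative.

```python
# Status code counts reported for the 200 RPS run.
status_counts = {200: 2808, 504: 1868}

total_requests = sum(status_counts.values())
error_rate_pct = status_counts[504] / total_requests * 100

print(f"{status_counts[504]} of {total_requests} requests returned 504 "
      f"({error_rate_pct:.1f}%)")
```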
Based on what's been gathered above, we presume that this API isn't as performant or efficient as other APIs and that it should be investigated.
Steps to reproduce
- To test against your own environment, import the gitlabhq project from GitHub or via this tarball into an up-to-date GitLab environment.
- Once the import has completed you should be able to see the issue by loading the API, e.g. http://10k.testbed.gitlab.net/api/v4/projects/qa-perf-testing%2Fgitlabhq/repository/branches
- To run a load test against a local environment you'll need to use the performance project as follows:
  - Check out the performance project and read through its README for reference.
  - Edit the test scenario, artillery/scenarios/quarantined/api_v4_projects_repository_branches.yml, to run at the desired RPS by increasing the value 20 found against rampTo and arrivalRate to the value required. E.g. for an RPS of 40, each 20 in the file below would become 40:

    artillery/scenarios/quarantined/api_v4_projects_repository_branches.yml example:

        config:
          defaults:
            headers:
              PRIVATE-TOKEN: "{{ $processEnvironment.ACCESS_TOKEN }}"
              Accept: "application/json"
          ensure:
            maxErrorRate: 1
          plugins:
            expect: {}
          phases:
            - duration: 2
              arrivalRate: 2
              rampTo: 20
              name: "Warm up"
            - duration: 10
              arrivalRate: 20
              name: "Load"
        scenarios:
          - flow:
              - get:
                  url: /api/v4/projects/{{PROJECT_GROUP}}%2F{{PROJECT_NAME}}/repository/branches
                  expect:
                    - statusCode: 200

  - If you want to test against the 10k environment, note that it has a slightly different config. You'll need to increase the values to hit 200 RPS (we have it set at 20 by default as the Docker-based runners can't handle this throughput and would need to be run at scale; on a local machine this hasn't been found to be an issue). For example:

    artillery/environments/10k.testbed.gitlab.net.yml example:

        config:
          target: http://10k.testbed.gitlab.net
          variables:
            PROJECT_GROUP: qa-perf-testing
            PROJECT_NAME: gitlabhq
            PROJECT_COMMIT_SHA: 0a99e022
            PROJECT_BRANCH: 10-0-stable
            PROJECT_FILE_PATH: qa%2Fqa%2Erb
            PROJECT_MR_COMMITS_IID: 10495
            PROJECT_MR_NOTES_IID: 6946
            PROJECT_SIGNED_COMMIT_SHA: 6526e91f
          phases:
            - duration: 5
              arrivalRate: 1
              rampTo: 20
              name: "Warm up"
            - duration: 15
              arrivalRate: 20
              rampTo: 200
              name: "Ramp Up"
            - duration: 45
              arrivalRate: 200
              name: "Full Load"

- Notice that the tested environment's performance degrades notably once this endpoint is tested under load.
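Two parts of the steps above can be sketched in Python: how the project path is URL-encoded into the endpoint URL, and roughly how many requests the 10k phases generate. The function names are hypothetical, and the request estimate assumes each rampTo phase ramps linearly, which only approximates how the load tool spaces arrivals.

```python
from urllib.parse import quote


def branches_url(base_url: str, project_path: str) -> str:
    """Build the list-branches URL, URL-encoding the project path."""
    # safe="" makes quote() also encode "/" as %2F.
    encoded = quote(project_path, safe="")
    return f"{base_url}/api/v4/projects/{encoded}/repository/branches"


def estimated_requests(phases: list) -> float:
    """Approximate total arrivals, treating each ramp as linear (assumption)."""
    total = 0.0
    for phase in phases:
        start = phase["arrivalRate"]
        end = phase.get("rampTo", start)  # constant rate when no rampTo
        total += (start + end) / 2 * phase["duration"]
    return total


# Phases copied from the 10k environment config above.
phases_10k = [
    {"duration": 5, "arrivalRate": 1, "rampTo": 20},
    {"duration": 15, "arrivalRate": 20, "rampTo": 200},
    {"duration": 45, "arrivalRate": 200},
]

print(branches_url("http://10k.testbed.gitlab.net", "qa-perf-testing/gitlabhq"))
print(estimated_requests(phases_10k))
```

Under that linear-ramp assumption, almost all of the requests fall in the 45 second "Full Load" phase, so results from the 10k config mostly reflect sustained 200 RPS behavior.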
What is the current bug behavior?
Performance of the environment degrades notably when the Branches API is tested under load.
What is the expected correct behavior?
The Branches API doesn't notably degrade the environment's performance when used at approved loads.