Branches List API performance degrades notably under load
### Summary
As part of our increased performance testing efforts, we've started identifying slow areas in GitLab and raising them accordingly.
One such area is the Branches API, specifically when trying to [list all branches](https://docs.gitlab.com/ee/api/branches.html#list-repository-branches). We've found that, while the API performs fine for single requests, its performance degrades notably under load.
For reference, the test lists the branches of the gitlab-ce project (~2150 branches) at the throughput the environment is expected to handle. In this case we test against our reference 10k user environment, which is expected to handle 200 requests per second (RPS).
When testing the API at a low 2 RPS we get a response time of **780.85ms**:
```
Environment: 10k
Scenario: 20s_2rps
Version: 12.0.2-ee ef76b54fc1e
NAME                                | RESULT | DURATION | P95      | RPS_COUNT | RPS_MEAN
------------------------------------|--------|----------|----------|-----------|-----------
api_v4_projects_repository_branches | Passed | 20.0s    | 780.85ms | 37        | 1.849993/s
```
When increasing this to 200 RPS, though, the response time increases sharply and varies wildly between runs. In one test run we saw **5817.13ms**, around a 645% increase:
```
NAME                                                 | RESULT | DURATION | P95        | RPS_COUNT | RPS_MEAN
-----------------------------------------------------|--------|----------|------------|-----------|-------------
api_v4_projects_repository_branches                  | Passed | 30.0s    | 5817.13ms  | 878       | 29.266581/s
```
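As a quick check of the figures above, the relative increase between the two P95 values can be worked out directly. This is only a sketch of the arithmetic; the function name is illustrative and not part of any test tooling.

```python
def percent_increase(baseline_ms: float, loaded_ms: float) -> float:
    """Return the relative increase from baseline_ms to loaded_ms, in percent."""
    return (loaded_ms - baseline_ms) / baseline_ms * 100.0


# P95 values taken from the two test runs reported above.
low_load_p95 = 780.85
high_load_p95 = 5817.13

increase = percent_increase(low_load_p95, high_load_p95)
ratio = high_load_p95 / low_load_p95
print(f"increase: {increase:.0f}%, ratio: {ratio:.2f}x")
```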
Examining the logs and metrics for the environment, it appears that at scale the endpoint causes significant CPU and memory spikes. We also noticed a high amount of garbage collection (GC):

**A snapshot of the above board can be found here - https://snapshot.raintank.io/dashboard/snapshot/xAYmbpGqDXd4gRM4xe5POogMgjKS9401**
Additionally we noticed two other points of information that may be useful:
* When running the test at 200 RPS the API threw numerous 504 error codes (200: 2808, 504: 1868)
* Prometheus monitoring dropped out for the `gitlab-rails` jobs when the API was being load tested, suggesting the webservers became unresponsive during the load test.
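To put the status code counts from the first point in perspective, the share of failed requests can be computed as below. This is a minimal sketch; `error_rate` is a hypothetical helper and treats any non-2xx status as a failure.

```python
from typing import Dict


def error_rate(status_counts: Dict[int, int]) -> float:
    """Return the fraction of responses whose status code is not 2xx."""
    total = sum(status_counts.values())
    if total == 0:
        return 0.0
    failed = sum(n for code, n in status_counts.items() if not 200 <= code < 300)
    return failed / total


# Response counts reported for the 200 RPS run above.
counts = {200: 2808, 504: 1868}
rate = error_rate(counts)
print(f"{rate:.1%} of {sum(counts.values())} requests failed")
```

In other words, roughly two in five requests timed out at the gateway during that run.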
Based on what's been gathered above, it's presumed that this API isn't as performant or efficient as other APIs and should be investigated.
### Steps to reproduce
1. To test against your own environment, import the gitlabhq project from GitHub or [via this tarball](https://drive.google.com/file/d/1GoQZETqQ4Ns8iBS7TD-9Zj19r-JS-DoQ/view?usp=sharing) into an up-to-date GitLab environment.
1. Once the import has completed you should be able to start seeing the issue by loading the API, e.g. `http://10k.testbed.gitlab.net/api/v4/projects/qa-perf-testing%2Fgitlabhq/repository/branches`
1. To run a load test against a local environment you'll need to use the [performance](https://gitlab.com/gitlab-org/quality/performance) project as follows:
1. Check out the [performance](https://gitlab.com/gitlab-org/quality/performance) project and read through its README for reference.
1. Edit the test scenario, `artillery/scenarios/quarantined/api_v4_projects_repository_branches.yml`, to run at the desired RPS by changing the value `20` set for `rampTo` and `arrivalRate` to the value required (e.g. `40` for 40 RPS). With the default of 20 RPS the file looks like:
<p>
<details>
<summary><code>artillery/scenarios/quarantined/api_v4_projects_repository_branches.yml</code> example</summary>
```
config:
  defaults:
    headers:
      PRIVATE-TOKEN: "{{ $processEnvironment.ACCESS_TOKEN }}"
      Accept: "application/json"
  ensure:
    maxErrorRate: 1
  plugins:
    expect: {}
  phases:
    - duration: 2
      arrivalRate: 2
      rampTo: 20
      name: "Warm up"
    - duration: 10
      arrivalRate: 20
      name: "Load"
scenarios:
  - flow:
      - get:
          url: /api/v4/projects/{{PROJECT_GROUP}}%2F{{PROJECT_NAME}}/repository/branches
          expect:
            - statusCode: 200
```
</details>
</p>
* To test against the 10k environment, note that it uses a slightly different config. You'll need to increase the values to reach 200 RPS. The default is 20 because the Docker-based runners can't generate this throughput and would need to be run at scale; on a local machine this hasn't been found to be an issue. For example:
<p>
<details>
<summary><code>artillery/environments/10k.testbed.gitlab.net.yml</code> example</summary>
```
config:
  target: http://10k.testbed.gitlab.net
  variables:
    PROJECT_GROUP: qa-perf-testing
    PROJECT_NAME: gitlabhq
    PROJECT_COMMIT_SHA: 0a99e022
    PROJECT_BRANCH: 10-0-stable
    PROJECT_FILE_PATH: qa%2Fqa%2Erb
    PROJECT_MR_COMMITS_IID: 10495
    PROJECT_MR_NOTES_IID: 6946
    PROJECT_SIGNED_COMMIT_SHA: 6526e91f
  phases:
    - duration: 5
      arrivalRate: 1
      rampTo: 20
      name: "Warm up"
    - duration: 15
      arrivalRate: 20
      rampTo: 200
      name: "Ramp Up"
    - duration: 45
      arrivalRate: 200
      name: "Full Load"
```
</details>
</p>
1. Notice that the tested environment's performance degrades notably once this endpoint is tested under load
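For a rough cross-check without artillery, the endpoint can also be probed with a small script. The sketch below is not the tooling used for the results in this issue: it uses a fixed pool of concurrent workers rather than a fixed arrival rate, so its numbers are not directly comparable, and `p95`, `timed_get`, `probe` and `main` are hypothetical helper names. It assumes an `ACCESS_TOKEN` environment variable, as the scenario file does.

```python
import os
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def p95(samples: List[float]) -> float:
    """Return the 95th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n), 1-based
    return ordered[rank - 1]


def timed_get(url: str, token: str) -> float:
    """Issue one GET and return the elapsed time in milliseconds.

    urlopen raises urllib.error.HTTPError for responses such as 504,
    so a failing request aborts the probe instead of being timed.
    """
    request = urllib.request.Request(
        url, headers={"PRIVATE-TOKEN": token, "Accept": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=60) as response:
        response.read()
    return (time.perf_counter() - start) * 1000.0


def probe(fetch: Callable[[], float], requests: int, workers: int) -> float:
    """Call fetch `requests` times across `workers` threads; return the P95."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        durations = list(pool.map(lambda _: fetch(), range(requests)))
    return p95(durations)


def main() -> None:
    # Endpoint from the reproduction steps above; adjust for your environment.
    url = (
        "http://10k.testbed.gitlab.net/api/v4/projects/"
        "qa-perf-testing%2Fgitlabhq/repository/branches"
    )
    token = os.environ["ACCESS_TOKEN"]
    result = probe(lambda: timed_get(url, token), requests=40, workers=8)
    print(f"P95: {result:.2f}ms")
```

Calling `main()` runs 40 requests over 8 threads; raising `workers` increases the concurrent pressure on the endpoint.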
### What is the current *bug* behavior?
The environment's performance degrades notably when the Branches API is tested under load.
### What is the expected *correct* behavior?
The Branches API doesn't notably degrade the environment's performance when used at approved loads.
## Feature Flag Information
The first iteration of caching to improve this endpoint can be enabled using the following feature flag:
```
:merged_branch_names_redis_caching
```
This will enable caching of `#merged_branch_names` on the `Repository` model. Cached values will clear on `Repository` updates and also every 10 minutes, so that if the flag is disabled the cache will be empty within 10 minutes (in case of excessive load).
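The caching behavior described above (values cleared on repository updates and also expiring after 10 minutes) can be illustrated with a small in-memory sketch. This is not GitLab's implementation, which lives on the `Repository` model and uses Redis; the class and method names below are hypothetical and only model the two invalidation rules.

```python
import time
from typing import Callable, Dict, FrozenSet, Iterable, Tuple


class MergedBranchNamesCache:
    """Hypothetical in-memory stand-in for the Redis-backed cache."""

    def __init__(
        self,
        compute: Callable[[str], Iterable[str]],
        ttl_seconds: float = 600.0,  # 10 minutes, as described above
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._compute = compute
        self._ttl = ttl_seconds
        self._clock = clock
        # repo_id -> (expiry timestamp, cached branch names)
        self._entries: Dict[str, Tuple[float, FrozenSet[str]]] = {}

    def merged_branch_names(self, repo_id: str) -> FrozenSet[str]:
        """Return cached names, recomputing if missing or past the TTL."""
        now = self._clock()
        entry = self._entries.get(repo_id)
        if entry is not None and now < entry[0]:
            return entry[1]
        names = frozenset(self._compute(repo_id))
        self._entries[repo_id] = (now + self._ttl, names)
        return names

    def expire(self, repo_id: str) -> None:
        """Drop the cached value; call this when the repository is updated."""
        self._entries.pop(repo_id, None)
```

Because every entry carries its own expiry, nothing stale can be served for longer than the TTL even if `expire` is never called.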