Improve performance of Search (Advanced) with Blobs scope under load into next tier

While testing Advanced Search (both API and Web), one specific area was found to have its own performance issue and to be significantly over our new performance target of 500ms.

█ Results summary

* Environment:                10k
* Environment Version:        12.10.0-pre `f55fca82ae2`
* Option:                     60s_200rps
* Date:                       2020-04-17
* Run Time:                   3m 50.92s (Start: 15:34:49 UTC, End: 15:38:40 UTC)
* GPT Version:                v1.2.6

NAME                   | RPS   | RPS RESULT         | TTFB AVG  | TTFB P90              | REQ STATUS      | RESULT
-----------------------|-------|--------------------|-----------|-----------------------|-----------------|-----------------
api_v4_search_global   | 200/s | 18.08/s (>48.00/s) | 9166.39ms | 10577.00ms (<25000ms) | 100.00% (>9.5%) | FAILED³
api_v4_search_groups   | 200/s | 17.7/s (>48.00/s)  | 9422.17ms | 10278.74ms (<25000ms) | 100.00% (>9.5%) | FAILED³
api_v4_search_projects | 200/s | 22.87/s (>48.00/s) | 7464.47ms | 8057.70ms (<25000ms)  | 100.00% (>9.5%) | FAILED³

The Search API has generally been identified as performing badly, and a general issue has already been raised - #214482 (closed). However, as part of that investigation it was noticed that while the Blobs scope performed as badly as the other scopes at all 3 search levels, it was actually behaving differently behind the scenes.

Specifically, with Blobs it seems that Elasticsearch is the bottleneck: its CPU takes a massive hit whenever blobs are searched.

This is different from the other scopes tested, which showed Postgres as the bottleneck.

Some notes on the testing:

- The tests were run against our 10k Reference Architecture, which is the barometer we judge and raise performance issues against.
  - The 10k environment's test data is 2 imported copies of our gitlab-foss project (sanitised). Each project has been imported into its own Gitaly node, and specifically there are 3608 MRs and 6724 Issues in each. It can be downloaded here - https://gitlab.com/gitlab-org/quality/performance-data/-/blob/master/projects_export/gitlabhq_export.tar.gz.
  - The Elasticsearch node is a GCP VM of type n1-standard-4 (4 vCPU, 15 GB RAM).
- The term being searched for with blobs is `test`.
- Testing between the 3 levels had to be staggered a little, as running them one after another straight away led to further slowdown and/or 500 errors because Elasticsearch was still recovering.
- The same slowdown was also observed when calling the search from the Web UI, since ultimately it's calling Elasticsearch in the same manner.
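
For reference, the three search levels map to three API requests. The following is a minimal sketch, assuming the standard GitLab Search API paths (`/api/v4/search`, `/api/v4/groups/:id/search` and `/api/v4/projects/:id/search`) with the `scope` and `search` parameters. The base URL and the group and project identifiers are placeholders, not values from the 10k environment, and `blob_search_urls` is a hypothetical helper.

```python
from urllib.parse import quote, urlencode


def blob_search_urls(base_url, group_id, project_id, term="test"):
    """Build the blobs-scope search URL for each of the three search levels."""
    query = urlencode({"scope": "blobs", "search": term})
    api = base_url.rstrip("/") + "/api/v4"
    # Group and project paths must be URL-encoded when used in place of numeric IDs.
    group = quote(str(group_id), safe="")
    project = quote(str(project_id), safe="")
    return {
        "global": f"{api}/search?{query}",
        "groups": f"{api}/groups/{group}/search?{query}",
        "projects": f"{api}/projects/{project}/search?{query}",
    }


# Placeholder host, group and project (assumptions for illustration only).
urls = blob_search_urls("https://gitlab.example.com", "my-group", "my-group/my-project")
for level, url in urls.items():
    print(level, url)
```

Requests against a real instance would also need authentication (for example a `PRIVATE-TOKEN` header), which is left out here.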

Also of note: Quality was actually testing Blob Search for Projects with Basic Search enabled before looking to enable Elasticsearch. With Basic Search the endpoint performed well and comfortably within our targets:

* Environment:                10k
* Environment Version:        12.10.0-pre `818364b79d7`
* Option:                     60s_200rps
* Date:                       2020-04-09
* Run Time:                   51m 50.76s (Start: 01:24:32 UTC, End: 02:16:22 UTC)
* GPT Version:                v1.2.6

NAME                                                     | RPS   | RPS RESULT           | TTFB AVG  | TTFB P90            | REQ STATUS     | RESULT
---------------------------------------------------------|-------|----------------------|-----------|---------------------|----------------|-------
api_v4_projects_project_search_blobs                     | 200/s | 193.38/s (>160.00/s) | 81.99ms   | 82.61ms (<500ms)    | 100.00% (>95%) | Passed

As per our performance targets, these endpoints are performing badly enough to fall into the ~S1 tier. The task, then, is to improve the endpoint's performance into the next tier (~S2 <9000ms). The Projects level is already beneath that number, but the expectation is that any improvements would benefit all three levels of the endpoint.
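
To make the gap concrete, the sketch below computes how far each level's TTFB average from the 2020-04-17 run sits above the ~S2 limit and above the 500ms target. It assumes the tier is judged on TTFB AVG rather than TTFB P90, which the issue does not state, and `reduction_needed` is a hypothetical helper.

```python
S2_TTFB_LIMIT_MS = 9000  # "~S2 <9000ms" from the issue text
TARGET_TTFB_MS = 500     # the new performance target

# TTFB AVG values copied from the 2020-04-17 results table.
ttfb_avg_ms = {
    "api_v4_search_global": 9166.39,
    "api_v4_search_groups": 9422.17,
    "api_v4_search_projects": 7464.47,
}


def reduction_needed(ttfb_ms, limit_ms):
    """Return how many milliseconds ttfb_ms exceeds limit_ms by (0 if it does not)."""
    return max(0.0, ttfb_ms - limit_ms)


for name, ttfb in ttfb_avg_ms.items():
    over_s2 = reduction_needed(ttfb, S2_TTFB_LIMIT_MS)
    over_target = reduction_needed(ttfb, TARGET_TTFB_MS)
    print(f"{name}: {over_s2:.2f}ms over the S2 limit, {over_target:.2f}ms over the target")
```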

Developer notes

Some ideas for seeing where bottlenecks might be:

- Local workstation noise may affect results. Try to get a stable baseline from a different performance test that doesn't rely on Elasticsearch to understand if testing locally can be reliable.
- Request profiling to see where time is spent: https://docs.gitlab.com/ee/administration/monitoring/performance/request_profiling.html
- Disable the search result redaction and see if performance improves.
- Remove some other fields in the API response entities that may be fetching more data and see if anything improves.
- Start pulling out some parts of the search query (e.g. permission checks) sent to Elasticsearch to see if a simpler query performs better.
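
One way to judge whether a local baseline is stable is to time the same call repeatedly and look at the spread. This is a rough sketch, not part of GPT: `time_calls` and `summarize` are hypothetical helpers, and the summed range is only a stand-in workload. In practice `fn` would issue a request to an endpoint that does not rely on Elasticsearch.

```python
import statistics
import time


def time_calls(fn, runs=20):
    """Call fn() `runs` times and return each call's duration in milliseconds."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        durations.append((time.perf_counter() - start) * 1000)
    return durations


def summarize(durations_ms):
    """Return the mean, nearest-rank P90 and coefficient of variation of the timings."""
    ordered = sorted(durations_ms)
    # Nearest-rank P90: ceil(0.9 * n), computed with integer arithmetic.
    rank = (9 * len(ordered) + 9) // 10
    mean = statistics.mean(ordered)
    cv = statistics.pstdev(ordered) / mean if mean else 0.0
    return {"mean_ms": mean, "p90_ms": ordered[rank - 1], "cv": cv}


# Stand-in workload (assumption); replace with a real request when testing locally.
stats = summarize(time_calls(lambda: sum(range(10000))))
print(stats)
```

A coefficient of variation that stays small across repeated runs suggests the workstation is quiet enough for before-and-after comparisons, while a large or fluctuating one suggests local noise would mask any improvement.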

Edited Apr 27, 2020 by Dylan Griffith (ex GitLab)