Improve performance of Search API (Advanced) under load into next tier
The Search API with Advanced Search enabled has been found to perform really poorly when tested at scale with it being over our performance target of 500ms.
This is the case for all of it's 3 levels - Global, Groups and Projects. The assumption is that the three levels are ultimately using the same logic behind the scenes so one Issue is being raised at this time but can be broken down as required.
Quality has been working recently to expand performance test coverage to include Advanced Search via Elasticsearch on our reference architecture test environments as well as testing it..
With Advanced Search enabled performance testing found that the API endpoints specifically had a significant performance issue:
Results Summary:
* Environment: 10k
* Environment Version: 12.10.0-pre `71c834efe31`
* Option: 60s_200rps
* Date: 2020-04-16
* Run Time: 3m 22.69s (Start: 16:25:38 UTC, End: 16:29:01 UTC)
* GPT Version: v1.2.6
NAME | RPS | RPS RESULT | TTFB AVG | TTFB P90 | REQ STATUS | RESULT
-----------------------|-------|--------------------|------------|-----------------------|-----------------|----------------
api_v4_search_global | 200/s | 79.65/s (>48.00/s) | 11974.11ms | 17003.58ms (<25000ms) | 94.69% (>9.5%) | Passed
api_v4_search_groups | 200/s | 58.68/s (>48.00/s) | 11417.23ms | 18499.31ms (<25000ms) | 100.00% (>9.5%) | Passed
api_v4_search_projects | 200/s | 68.9/s (>48.00/s) | 6908.80ms | 13211.30ms (<25000ms) | 100.00% (>9.5%) | Passed
Detailed results
█ API - Global Search
data_received...................: 93 MB 1.6 MB/s
data_sent.......................: 1.0 MB 17 kB/s
group_duration..................: avg=14236.45ms min=708.55ms med=16789.56ms max=18559.90ms p(90)=17720.29ms p(95)=17886.37ms
http_req_blocked................: avg=0.72ms min=0.00ms med=0.52ms max=1002.66ms p(90)=0.62ms p(95)=0.70ms
http_req_connecting.............: avg=0.65ms min=0.00ms med=0.43ms max=1002.59ms p(90)=0.53ms p(95)=0.60ms
http_req_duration...............: avg=11974.63ms min=31.39ms med=13237.90ms max=18461.78ms p(90)=17004.99ms p(95)=17271.70ms
http_req_receiving..............: avg=0.45ms min=0.04ms med=0.13ms max=51.66ms p(90)=1.25ms p(95)=1.41ms
http_req_sending................: avg=0.07ms min=0.01ms med=0.07ms max=1.41ms p(90)=0.09ms p(95)=0.10ms
http_req_tls_handshaking........: avg=0.00ms min=0.00ms med=0.00ms max=0.00ms p(90)=0.00ms p(95)=0.00ms
http_req_waiting................: avg=11974.11ms min=31.19ms med=13237.69ms max=18461.20ms p(90)=17003.58ms p(95)=17271.44ms
✓ { endpoint:blobs }............: avg=12707.65ms min=589.69ms med=14551.07ms max=18356.13ms p(90)=17219.15ms p(95)=17494.87ms
✓ { endpoint:commits }..........: avg=12506.79ms min=247.51ms med=14648.24ms max=17763.46ms p(90)=16896.01ms p(95)=17151.34ms
✓ { endpoint:issues }...........: avg=12825.85ms min=558.53ms med=15235.12ms max=18240.46ms p(90)=17229.82ms p(95)=17435.78ms
✓ { endpoint:merge_requests }...: avg=12918.47ms min=676.77ms med=15311.66ms max=18461.20ms p(90)=17391.88ms p(95)=17597.13ms
✓ { endpoint:milestones }.......: avg=12379.72ms min=79.11ms med=14761.05ms max=17656.42ms p(90)=16746.65ms p(95)=16977.56ms
✓ { endpoint:projects }.........: avg=12526.72ms min=119.62ms med=15190.05ms max=17687.90ms p(90)=16748.14ms p(95)=16930.75ms
✓ { endpoint:users }............: avg=8298.67ms min=31.19ms med=10539.07ms max=12639.17ms p(90)=11797.77ms p(95)=11981.41ms
✓ http_reqs.......................: 4779 79.649845/s
✓ { endpoint:blobs }............: 667 11.116645/s
✓ { endpoint:commits }..........: 681 11.349978/s
✓ { endpoint:issues }...........: 677 11.283311/s
✓ { endpoint:merge_requests }...: 666 11.099978/s
✓ { endpoint:milestones }.......: 679 11.316645/s
✓ { endpoint:projects }.........: 675 11.249978/s
✓ { endpoint:users }............: 733 12.216643/s
iteration_duration..............: avg=14214.87ms min=72.31ms med=16789.23ms max=18559.93ms p(90)=17719.90ms p(95)=17886.25ms
iterations......................: 654 10.899979/s
✓ successful_requests.............: 94.69% ✓ 4335 ✗ 243
vus.............................: 1 min=1 max=200
vus_max.........................: 200 min=200 max=200
█ API - Group Search
data_received...................: 76 MB 1.3 MB/s
data_sent.......................: 862 kB 14 kB/s
group_duration..................: avg=17835.65ms min=5751.17ms med=17373.96ms max=30981.22ms p(90)=24858.96ms p(95)=27541.53ms
http_req_blocked................: avg=0.55ms min=0.00ms med=0.56ms max=24.22ms p(90)=0.66ms p(95)=0.76ms
http_req_connecting.............: avg=0.48ms min=0.00ms med=0.47ms max=24.11ms p(90)=0.57ms p(95)=0.66ms
http_req_duration...............: avg=11417.74ms min=573.83ms med=10217.67ms max=29999.51ms p(90)=18499.43ms p(95)=21503.03ms
http_req_receiving..............: avg=0.44ms min=0.05ms med=0.16ms max=9.37ms p(90)=1.23ms p(95)=1.39ms
http_req_sending................: avg=0.07ms min=0.02ms med=0.07ms max=0.48ms p(90)=0.10ms p(95)=0.11ms
http_req_tls_handshaking........: avg=0.00ms min=0.00ms med=0.00ms max=0.00ms p(90)=0.00ms p(95)=0.00ms
http_req_waiting................: avg=11417.23ms min=573.63ms med=10217.49ms max=29998.81ms p(90)=18499.31ms p(95)=21502.00ms
✓ { endpoint:commits }..........: avg=9581.47ms min=2674.84ms med=9012.74ms max=21527.69ms p(90)=15041.66ms p(95)=17867.59ms
✓ { endpoint:issues }...........: avg=16061.85ms min=4167.90ms med=15474.78ms max=29154.93ms p(90)=23185.39ms p(95)=24793.70ms
✓ { endpoint:merge_requests }...: avg=17478.69ms min=5739.88ms med=17262.29ms max=29998.81ms p(90)=24194.73ms p(95)=26859.02ms
✓ { endpoint:milestones }.......: avg=7943.49ms min=797.08ms med=7299.13ms max=19660.94ms p(90)=13509.44ms p(95)=16494.89ms
✓ { endpoint:projects }.........: avg=7831.71ms min=675.98ms med=7216.08ms max=19758.95ms p(90)=13552.16ms p(95)=15999.05ms
✓ { endpoint:users }............: avg=11049.06ms min=573.63ms med=10853.37ms max=23765.19ms p(90)=17645.64ms p(95)=19981.77ms
✓ http_reqs.......................: 3521 58.683258/s
✓ { endpoint:commits }..........: 611 10.18332/s
✓ { endpoint:issues }...........: 543 9.049988/s
✓ { endpoint:merge_requests }...: 532 8.866655/s
✓ { endpoint:milestones }.......: 621 10.349987/s
✓ { endpoint:projects }.........: 625 10.416653/s
✓ { endpoint:users }............: 588 9.799988/s
iteration_duration..............: avg=17803.85ms min=960.34ms med=17367.43ms max=30981.28ms p(90)=24832.59ms p(95)=27538.56ms
iterations......................: 529 8.816655/s
✓ successful_requests.............: 100.00% ✓ 3174 ✗ 0
vus.............................: 1 min=1 max=200
vus_max.........................: 200 min=200 max=200
█ API - Project Search
data_received...................: 91 MB 1.5 MB/s
data_sent.......................: 981 kB 16 kB/s
group_duration..................: avg=13020.79ms min=8461.26ms med=13286.77ms max=17235.54ms p(90)=14543.55ms p(95)=15243.22ms
http_req_blocked................: avg=0.68ms min=0.00ms med=0.52ms max=857.96ms p(90)=0.63ms p(95)=0.69ms
http_req_connecting.............: avg=0.41ms min=0.00ms med=0.44ms max=23.93ms p(90)=0.53ms p(95)=0.59ms
http_req_duration...............: avg=6909.30ms min=198.85ms med=5468.45ms max=17228.79ms p(90)=13212.36ms p(95)=13793.05ms
http_req_receiving..............: avg=0.43ms min=0.05ms med=0.39ms max=12.44ms p(90)=0.98ms p(95)=1.06ms
http_req_sending................: avg=0.07ms min=0.02ms med=0.07ms max=2.03ms p(90)=0.10ms p(95)=0.11ms
http_req_tls_handshaking........: avg=0.00ms min=0.00ms med=0.00ms max=0.00ms p(90)=0.00ms p(95)=0.00ms
http_req_waiting................: avg=6908.80ms min=198.66ms med=5467.99ms max=17228.14ms p(90)=13211.30ms p(95)=13792.47ms
✓ { endpoint:commits }..........: avg=5396.01ms min=1238.89ms med=5646.13ms max=9275.65ms p(90)=6737.16ms p(95)=7373.78ms
✓ { endpoint:issues }...........: avg=10908.93ms min=6507.85ms med=11101.48ms max=15386.18ms p(90)=12482.17ms p(95)=13095.13ms
✓ { endpoint:merge_requests }...: avg=13000.86ms min=8454.13ms med=13264.80ms max=17228.14ms p(90)=14535.82ms p(95)=15225.50ms
✓ { endpoint:milestones }.......: avg=3578.55ms min=314.06ms med=3846.76ms max=7358.59ms p(90)=4900.50ms p(95)=5362.33ms
✓ { endpoint:users }............: avg=3262.30ms min=198.66ms med=3525.94ms max=7031.63ms p(90)=4491.66ms p(95)=4967.65ms
✓ http_reqs.......................: 4134 68.899908/s
✓ { endpoint:commits }..........: 863 14.383314/s
✓ { endpoint:issues }...........: 762 12.699983/s
✓ { endpoint:merge_requests }...: 732 12.199984/s
✓ { endpoint:milestones }.......: 888 14.79998/s
✓ { endpoint:users }............: 888 14.79998/s
iteration_duration..............: avg=13003.51ms min=323.71ms med=13285.27ms max=17235.61ms p(90)=14541.11ms p(95)=15237.89ms
iterations......................: 732 12.199984/s
✓ successful_requests.............: 100.00% ✓ 3660 ✗ 0
vus.............................: 1 min=1 max=200
vus_max.........................: 200 min=200 max=200
Some notes on the testing:
- The tests were ran against our 10k Reference Architecture, which is the barometer we judge and raise performance issues against.
- The 10k environment's test data is 2 imported copies of our
gitlab-fossproject (sanitised). Each project has been imported into its own Gitaly node and specifically there's 3608 MRs and 6724 Issues in each. It can be downloaded here - https://gitlab.com/gitlab-org/quality/performance-data/-/blob/master/projects_export/gitlabhq_export.tar.gz. - Elasticsearch node is a GCP VM of type
n1-standard-4(4 vCPU, 15GB Ram)
- The 10k environment's test data is 2 imported copies of our
- There are 3 API tests being run, each covering a Search level. On each they are searching each of the following scopes:
-
projects-gitlab(except theprojectslevel) -
issues-broken -
commits-fix -
merge_requests-fix -
milestones-2.6 -
users-root
-
The environment metrics for when the test has been run show Postgres is the bottleneck here surprisingly and not Elasticsearch. The database looks to have significant CPU spikes when searching is done via API:
Additionally sometimes the endpoints start to fail with 500 errors due to the load. This is especially the case on our bigger environments, such as our 50k environment, the endpoint also starts to fail with 500 errors for around 25% of requests at scale.
As per our performance targets this endpoint is performing bad enough to fall into ~S1 tier. The task then is to improve the endpoint's performance into the next tier (~S2 <9000ms, additional issues can be opened to reduce into further tiers as an iterative approach).
