Improve performance of Search API (Advanced) under load into next tier

The Search API with Advanced Search enabled has been found to perform really poorly when tested at scale with it being over our performance target of 500ms.

This is the case for all of it's 3 levels - Global, Groups and Projects. The assumption is that the three levels are ultimately using the same logic behind the scenes so one Issue is being raised at this time but can be broken down as required.

Quality has been working recently to expand performance test coverage to include Advanced Search via Elasticsearch on our reference architecture test environments as well as testing it..

With Advanced Search enabled performance testing found that the API endpoints specifically had a significant performance issue:

Results Summary:

* Environment:                10k
* Environment Version:        12.10.0-pre `71c834efe31`
* Option:                     60s_200rps
* Date:                       2020-04-16
* Run Time:                   3m 22.69s (Start: 16:25:38 UTC, End: 16:29:01 UTC)
* GPT Version:                v1.2.6

NAME                   | RPS   | RPS RESULT         | TTFB AVG   | TTFB P90              | REQ STATUS      | RESULT
-----------------------|-------|--------------------|------------|-----------------------|-----------------|----------------
api_v4_search_global   | 200/s | 79.65/s (>48.00/s) | 11974.11ms | 17003.58ms (<25000ms) | 94.69% (>9.5%)  | Passed
api_v4_search_groups   | 200/s | 58.68/s (>48.00/s) | 11417.23ms | 18499.31ms (<25000ms) | 100.00% (>9.5%) | Passed
api_v4_search_projects | 200/s | 68.9/s (>48.00/s)  | 6908.80ms  | 13211.30ms (<25000ms) | 100.00% (>9.5%) | Passed

Detailed results
    █ API - Global Search

    data_received...................: 93 MB  1.6 MB/s
    data_sent.......................: 1.0 MB 17 kB/s
    group_duration..................: avg=14236.45ms min=708.55ms med=16789.56ms max=18559.90ms p(90)=17720.29ms p(95)=17886.37ms
    http_req_blocked................: avg=0.72ms     min=0.00ms   med=0.52ms     max=1002.66ms  p(90)=0.62ms     p(95)=0.70ms
    http_req_connecting.............: avg=0.65ms     min=0.00ms   med=0.43ms     max=1002.59ms  p(90)=0.53ms     p(95)=0.60ms
    http_req_duration...............: avg=11974.63ms min=31.39ms  med=13237.90ms max=18461.78ms p(90)=17004.99ms p(95)=17271.70ms
    http_req_receiving..............: avg=0.45ms     min=0.04ms   med=0.13ms     max=51.66ms    p(90)=1.25ms     p(95)=1.41ms
    http_req_sending................: avg=0.07ms     min=0.01ms   med=0.07ms     max=1.41ms     p(90)=0.09ms     p(95)=0.10ms
    http_req_tls_handshaking........: avg=0.00ms     min=0.00ms   med=0.00ms     max=0.00ms     p(90)=0.00ms     p(95)=0.00ms
    http_req_waiting................: avg=11974.11ms min=31.19ms  med=13237.69ms max=18461.20ms p(90)=17003.58ms p(95)=17271.44ms
    ✓ { endpoint:blobs }............: avg=12707.65ms min=589.69ms med=14551.07ms max=18356.13ms p(90)=17219.15ms p(95)=17494.87ms
    ✓ { endpoint:commits }..........: avg=12506.79ms min=247.51ms med=14648.24ms max=17763.46ms p(90)=16896.01ms p(95)=17151.34ms
    ✓ { endpoint:issues }...........: avg=12825.85ms min=558.53ms med=15235.12ms max=18240.46ms p(90)=17229.82ms p(95)=17435.78ms
    ✓ { endpoint:merge_requests }...: avg=12918.47ms min=676.77ms med=15311.66ms max=18461.20ms p(90)=17391.88ms p(95)=17597.13ms
    ✓ { endpoint:milestones }.......: avg=12379.72ms min=79.11ms  med=14761.05ms max=17656.42ms p(90)=16746.65ms p(95)=16977.56ms
    ✓ { endpoint:projects }.........: avg=12526.72ms min=119.62ms med=15190.05ms max=17687.90ms p(90)=16748.14ms p(95)=16930.75ms
    ✓ { endpoint:users }............: avg=8298.67ms  min=31.19ms  med=10539.07ms max=12639.17ms p(90)=11797.77ms p(95)=11981.41ms
  ✓ http_reqs.......................: 4779   79.649845/s
    ✓ { endpoint:blobs }............: 667    11.116645/s
    ✓ { endpoint:commits }..........: 681    11.349978/s
    ✓ { endpoint:issues }...........: 677    11.283311/s
    ✓ { endpoint:merge_requests }...: 666    11.099978/s
    ✓ { endpoint:milestones }.......: 679    11.316645/s
    ✓ { endpoint:projects }.........: 675    11.249978/s
    ✓ { endpoint:users }............: 733    12.216643/s
    iteration_duration..............: avg=14214.87ms min=72.31ms  med=16789.23ms max=18559.93ms p(90)=17719.90ms p(95)=17886.25ms
    iterations......................: 654    10.899979/s
  ✓ successful_requests.............: 94.69% ✓ 4335  ✗ 243
    vus.............................: 1      min=1   max=200
    vus_max.........................: 200    min=200 max=200
    █ API - Group Search

    data_received...................: 76 MB   1.3 MB/s
    data_sent.......................: 862 kB  14 kB/s
    group_duration..................: avg=17835.65ms min=5751.17ms med=17373.96ms max=30981.22ms p(90)=24858.96ms p(95)=27541.53ms
    http_req_blocked................: avg=0.55ms     min=0.00ms    med=0.56ms     max=24.22ms    p(90)=0.66ms     p(95)=0.76ms
    http_req_connecting.............: avg=0.48ms     min=0.00ms    med=0.47ms     max=24.11ms    p(90)=0.57ms     p(95)=0.66ms
    http_req_duration...............: avg=11417.74ms min=573.83ms  med=10217.67ms max=29999.51ms p(90)=18499.43ms p(95)=21503.03ms
    http_req_receiving..............: avg=0.44ms     min=0.05ms    med=0.16ms     max=9.37ms     p(90)=1.23ms     p(95)=1.39ms
    http_req_sending................: avg=0.07ms     min=0.02ms    med=0.07ms     max=0.48ms     p(90)=0.10ms     p(95)=0.11ms
    http_req_tls_handshaking........: avg=0.00ms     min=0.00ms    med=0.00ms     max=0.00ms     p(90)=0.00ms     p(95)=0.00ms
    http_req_waiting................: avg=11417.23ms min=573.63ms  med=10217.49ms max=29998.81ms p(90)=18499.31ms p(95)=21502.00ms
    ✓ { endpoint:commits }..........: avg=9581.47ms  min=2674.84ms med=9012.74ms  max=21527.69ms p(90)=15041.66ms p(95)=17867.59ms
    ✓ { endpoint:issues }...........: avg=16061.85ms min=4167.90ms med=15474.78ms max=29154.93ms p(90)=23185.39ms p(95)=24793.70ms
    ✓ { endpoint:merge_requests }...: avg=17478.69ms min=5739.88ms med=17262.29ms max=29998.81ms p(90)=24194.73ms p(95)=26859.02ms
    ✓ { endpoint:milestones }.......: avg=7943.49ms  min=797.08ms  med=7299.13ms  max=19660.94ms p(90)=13509.44ms p(95)=16494.89ms
    ✓ { endpoint:projects }.........: avg=7831.71ms  min=675.98ms  med=7216.08ms  max=19758.95ms p(90)=13552.16ms p(95)=15999.05ms
    ✓ { endpoint:users }............: avg=11049.06ms min=573.63ms  med=10853.37ms max=23765.19ms p(90)=17645.64ms p(95)=19981.77ms
  ✓ http_reqs.......................: 3521    58.683258/s
    ✓ { endpoint:commits }..........: 611     10.18332/s
    ✓ { endpoint:issues }...........: 543     9.049988/s
    ✓ { endpoint:merge_requests }...: 532     8.866655/s
    ✓ { endpoint:milestones }.......: 621     10.349987/s
    ✓ { endpoint:projects }.........: 625     10.416653/s
    ✓ { endpoint:users }............: 588     9.799988/s
    iteration_duration..............: avg=17803.85ms min=960.34ms  med=17367.43ms max=30981.28ms p(90)=24832.59ms p(95)=27538.56ms
    iterations......................: 529     8.816655/s
  ✓ successful_requests.............: 100.00% ✓ 3174  ✗ 0
    vus.............................: 1       min=1   max=200
    vus_max.........................: 200     min=200 max=200
    █ API - Project Search

    data_received...................: 91 MB   1.5 MB/s
    data_sent.......................: 981 kB  16 kB/s
    group_duration..................: avg=13020.79ms min=8461.26ms med=13286.77ms max=17235.54ms p(90)=14543.55ms p(95)=15243.22ms
    http_req_blocked................: avg=0.68ms     min=0.00ms    med=0.52ms     max=857.96ms   p(90)=0.63ms     p(95)=0.69ms
    http_req_connecting.............: avg=0.41ms     min=0.00ms    med=0.44ms     max=23.93ms    p(90)=0.53ms     p(95)=0.59ms
    http_req_duration...............: avg=6909.30ms  min=198.85ms  med=5468.45ms  max=17228.79ms p(90)=13212.36ms p(95)=13793.05ms
    http_req_receiving..............: avg=0.43ms     min=0.05ms    med=0.39ms     max=12.44ms    p(90)=0.98ms     p(95)=1.06ms
    http_req_sending................: avg=0.07ms     min=0.02ms    med=0.07ms     max=2.03ms     p(90)=0.10ms     p(95)=0.11ms
    http_req_tls_handshaking........: avg=0.00ms     min=0.00ms    med=0.00ms     max=0.00ms     p(90)=0.00ms     p(95)=0.00ms
    http_req_waiting................: avg=6908.80ms  min=198.66ms  med=5467.99ms  max=17228.14ms p(90)=13211.30ms p(95)=13792.47ms
    ✓ { endpoint:commits }..........: avg=5396.01ms  min=1238.89ms med=5646.13ms  max=9275.65ms  p(90)=6737.16ms  p(95)=7373.78ms
    ✓ { endpoint:issues }...........: avg=10908.93ms min=6507.85ms med=11101.48ms max=15386.18ms p(90)=12482.17ms p(95)=13095.13ms
    ✓ { endpoint:merge_requests }...: avg=13000.86ms min=8454.13ms med=13264.80ms max=17228.14ms p(90)=14535.82ms p(95)=15225.50ms
    ✓ { endpoint:milestones }.......: avg=3578.55ms  min=314.06ms  med=3846.76ms  max=7358.59ms  p(90)=4900.50ms  p(95)=5362.33ms
    ✓ { endpoint:users }............: avg=3262.30ms  min=198.66ms  med=3525.94ms  max=7031.63ms  p(90)=4491.66ms  p(95)=4967.65ms
  ✓ http_reqs.......................: 4134    68.899908/s
    ✓ { endpoint:commits }..........: 863     14.383314/s
    ✓ { endpoint:issues }...........: 762     12.699983/s
    ✓ { endpoint:merge_requests }...: 732     12.199984/s
    ✓ { endpoint:milestones }.......: 888     14.79998/s
    ✓ { endpoint:users }............: 888     14.79998/s
    iteration_duration..............: avg=13003.51ms min=323.71ms  med=13285.27ms max=17235.61ms p(90)=14541.11ms p(95)=15237.89ms
    iterations......................: 732     12.199984/s
  ✓ successful_requests.............: 100.00% ✓ 3660  ✗ 0
    vus.............................: 1       min=1   max=200
    vus_max.........................: 200     min=200 max=200

Some notes on the testing:

  • The tests were ran against our 10k Reference Architecture, which is the barometer we judge and raise performance issues against.
  • There are 3 API tests being run, each covering a Search level. On each they are searching each of the following scopes:
    • projects - gitlab (except the projects level)
    • issues - broken
    • commits - fix
    • merge_requests - fix
    • milestones - 2.6
    • users - root

The environment metrics for when the test has been run show Postgres is the bottleneck here surprisingly and not Elasticsearch. The database looks to have significant CPU spikes when searching is done via API:

Screenshot_2020-04-16_at_17.57.25

Additionally sometimes the endpoints start to fail with 500 errors due to the load. This is especially the case on our bigger environments, such as our 50k environment, the endpoint also starts to fail with 500 errors for around 25% of requests at scale.

As per our performance targets this endpoint is performing bad enough to fall into ~S1 tier. The task then is to improve the endpoint's performance into the next tier (~S2 <9000ms, additional issues can be opened to reduce into further tiers as an iterative approach).

Edited by Grant Young