Investigate performance degradation of the index action of Groups::ChildrenController and improve its performance on large environments such as 50k

The TTFB (Time to First Byte) of Groups::ChildrenController#index degraded significantly on the 50k environment: it went up from 250ms to 3800ms:

* Environment:                50k
* Environment Version:        14.8.0-pre `ffc662fac2b`
* Option:                     60s_1000rps
* Date:                       2022-02-21
* Run Time:                   1h 30m 48.38s (Start: 04:39:10 UTC, End: 06:09:59 UTC)
* GPT Version:                v2.10.0

NAME                         | RPS    | RPS RESULT           | TTFB AVG   | TTFB P90              | REQ STATUS     | RESULT  
-----------------------------|--------|----------------------|------------|-----------------------|----------------|---------
web_group                    | 100/s  | 64.17/s (>80.00/s)   | 1542.31ms  | 3730.82ms (<400ms)    | 100.00% (>99%) | FAILED¹²

 █ Web - Groups Page
     data_received...................: 201 MB  3.3 MB/s
     data_sent.......................: 420 kB  6.8 kB/s
     group_duration..................: avg=2997.61ms min=1483.39ms med=2758.06ms max=6849.06ms p(90)=4939.20ms p(95)=5444.09ms
     http_req_blocked................: avg=0.05ms    min=0.00ms    med=0.00ms    max=18.95ms   p(90)=0.01ms    p(95)=0.54ms   
     http_req_connecting.............: avg=0.04ms    min=0.00ms    med=0.00ms    max=15.57ms   p(90)=0.00ms    p(95)=0.46ms   
     http_req_duration...............: avg=1607.92ms min=185.00ms  med=1186.83ms max=6818.37ms p(90)=4024.42ms p(95)=4923.53ms
       { expected_response:true }....: avg=1607.92ms min=185.00ms  med=1186.83ms max=6818.37ms p(90)=4024.42ms p(95)=4923.53ms
     http_req_failed.................: 0.00%   ✓ 0         ✗ 3762 
     http_req_receiving..............: avg=0.83ms    min=0.08ms    med=0.88ms    max=39.37ms   p(90)=1.36ms    p(95)=1.45ms   
     http_req_sending................: avg=0.03ms    min=0.01ms    med=0.02ms    max=0.34ms    p(90)=0.04ms    p(95)=0.06ms   
     http_req_tls_handshaking........: avg=0.00ms    min=0.00ms    med=0.00ms    max=0.00ms    p(90)=0.00ms    p(95)=0.00ms   
     http_req_waiting................: avg=1607.07ms min=184.50ms  med=1185.88ms max=6816.94ms p(90)=4023.15ms p(95)=4922.30ms
     ✗ { endpoint:/ }................: avg=2976.30ms min=1477.02ms med=2744.71ms max=6816.94ms p(90)=4922.38ms p(95)=5421.31ms
     ✓ { endpoint:/children.json }...: avg=237.84ms  min=184.50ms  med=221.21ms  max=894.74ms  p(90)=277.50ms  p(95)=307.73ms 
   ✗ http_reqs.......................: 3762    61.260086/s
     ✗ { endpoint:/ }................: 1881    30.630043/s
     ✗ { endpoint:/children.json }...: 1881    30.630043/s
     iteration_duration..............: avg=2996.05ms min=0.16ms    med=2757.98ms max=6849.08ms p(90)=4938.25ms p(95)=5443.63ms
     iterations......................: 1881    30.630043/s
   ✓ successful_requests.............: 100.00% ✓ 3762      ✗ 0    
     vus.............................: 6       min=6       max=100
     vus_max.........................: 100     min=100     max=100
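For quick manual triage outside of GPT, the TTFB of the two endpoints above can be spot-checked with a few lines of Python. This is a minimal sketch, not the k6 test itself; the base URL, the `GPT_TOKEN` variable and the exact paths (the group page for the `gpt` group and its children.json listing) are assumptions and may need adjusting to the environment's actual routes:

```python
# Minimal TTFB spot check for the two endpoints exercised by web_group.
# Not the GPT/k6 test itself; GPT_URL, GPT_TOKEN and the paths below are
# placeholders and may need adjusting for the target environment.
import os
import requests

base_url = os.environ.get("GPT_URL", "http://50k.testbed.gitlab.net")
token = os.environ.get("GPT_TOKEN")
headers = {"PRIVATE-TOKEN": token} if token else {}

endpoints = [
    "/gpt",                         # group page for the gpt group (the "/" endpoint above)
    "/groups/gpt/-/children.json",  # children listing fetched by that page
]

for path in endpoints:
    # With stream=True the call returns once the headers arrive, so `elapsed`
    # approximates time to first byte rather than full download time.
    resp = requests.get(base_url + path, headers=headers, stream=True, timeout=30)
    ttfb_ms = resp.elapsed.total_seconds() * 1000
    print(f"{path}: status={resp.status_code} ttfb~{ttfb_ms:.0f}ms")
    resp.close()
```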

Looking at the server metrics during these tests, all PG nodes go up to 100% CPU utilization:

[Screenshot Screenshot_2022-02-21_at_20.49.28: PG nodes CPU utilization]
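One way to see what is keeping the replicas busy while CPU is pegged is to inspect pg_stat_statements on one of the PG nodes. Below is a minimal sketch, assuming the extension is enabled and PostgreSQL 13+ column names (older versions use total_time/mean_time); the connection parameters are placeholders, not values from this issue:

```python
# List the most expensive normalised queries on a Postgres replica.
# Assumes pg_stat_statements is enabled; host/dbname/user are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="replica.example.internal",
    dbname="gitlabhq_production",
    user="gitlab-psql",
)

TOP_QUERIES = """
    SELECT calls,
           round(total_exec_time::numeric, 1) AS total_ms,
           round(mean_exec_time::numeric, 1)  AS mean_ms,
           left(query, 120)                   AS query_sample
      FROM pg_stat_statements
     ORDER BY total_exec_time DESC
     LIMIT 10
"""

with conn, conn.cursor() as cur:
    cur.execute(TOP_QUERIES)
    for calls, total_ms, mean_ms, query_sample in cur.fetchall():
        print(f"{total_ms:>12}ms total | {mean_ms:>9}ms avg | {calls:>8} calls | {query_sample}")
```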

Below is an example of /groups/gpt/-/issues (web_group_issues) request data from Rails - note the `"db_duration_s":13.67939`:

{"method":"GET","path":"/groups/gpt/-/issues","format":"html","controller":"GroupsController","action":"issues","status":200,"time":"2022-02-21
T17:49:39.716Z","params":[{"key":"id","value":"gpt"}],"correlation_id":"01FWEPE1W70BM83X87Z8M06HW4","meta.root_namespace":"gpt","meta.client_id
":"ip/10.142.0.2","meta.caller_id":"GroupsController#issues","meta.remote_ip":"10.142.0.2","meta.feature_category":"team_planning","remote_ip":
"10.142.0.2","ua":"GPT/2.10.0","queue_duration_s":0.022863,"request_urgency":"default","target_duration_s":1,"redis_calls":33,"redis_duration_s
":0.012899,"redis_read_bytes":4601,"redis_write_bytes":2028,"redis_cache_calls":33,"redis_cache_duration_s":0.012899,"redis_cache_read_bytes":4
601,"redis_cache_write_bytes":2028,"db_count":40,"db_write_count":0,"db_cached_count":7,"db_replica_count":39,"db_primary_count":1,"db_replica_
cached_count":7,"db_primary_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":1,"db_replica_wal_cached_count":0,"db_primary_wal_c
ached_count":0,"db_replica_duration_s":13.692,"db_primary_duration_s":0.001,"cpu_s":0.253262,"mem_objects":123606,"mem_bytes":9748663,"mem_mall
ocs":26576,"mem_total_bytes":14692903,"pid":3042,"db_duration_s":13.67939,"view_duration_s":0.17305,"duration_s":13.91057}

🚧 Worth mentioning that on the 10k environment these tests all pass - http://10k.testbed.gitlab.net/-/grafana/d/J0ysCtCWz/gpt-test-results?orgId=1&from=now-30d&to=now&var-test_name=web_group_issues&var-test_name=web_group&var-test_name=web_group_merge_requests

🚧 The web_group, web_group_issues and web_group_merge_requests tests started to fail at the same time, on 2022-02-14; see also #334439

[Screenshot Screenshot_2022-02-21_at_20.41.35]

Test Details

Testing was done on our 50k Reference Architecture environment with our lab-condition GitLab Performance Tool pipeline. The group being tested has 1000 subgroups with 10 projects each, and also contains a copy of gitlabhq (the tarball can be found here). GitLab Performance Tool test information is listed on the Current test details page.
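For reference, the horizontal layout described above (1000 subgroups with 10 projects each under one parent group) can be approximated with the GitLab REST API. This is only an illustration of the test data shape, not the tooling actually used to seed the environment; the URL, token, parent group id and naming scheme are placeholders:

```python
# Illustration of the test data shape: 1000 subgroups, each with 10 projects,
# under one parent group. GITLAB_URL, GITLAB_TOKEN and parent_group_id are
# placeholders; this is not the tooling used to seed the 50k environment.
import os
import requests

gitlab_url = os.environ.get("GITLAB_URL", "http://50k.testbed.gitlab.net")
headers = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}
parent_group_id = 42  # id of the top-level group being tested (placeholder)

for g in range(1, 1001):
    group = requests.post(
        f"{gitlab_url}/api/v4/groups",
        headers=headers,
        json={"name": f"subgroup-{g}", "path": f"subgroup-{g}", "parent_id": parent_group_id},
    )
    group.raise_for_status()
    group_id = group.json()["id"]

    for p in range(1, 11):
        requests.post(
            f"{gitlab_url}/api/v4/projects",
            headers=headers,
            json={"name": f"project-{p}", "namespace_id": group_id},
        ).raise_for_status()
```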

The latest GitLab Performance pipeline results can always be found here; full server metrics are available via the Metrics Dashboard link on that page.

As per our performance targets, these endpoints' TTFB metric is above the 2000 ms target, which corresponds to severity2. The task is to investigate why performance has degraded and improve it.
