Investigate performance degradation of the Groups::ChildrenController#index action and improve its performance on large environments such as 50k
The TTFB (Time to First Byte) of Groups::ChildrenController#index degraded significantly on the 50k environment: it went up from 250ms to 3800ms:
* Environment: 50k
* Environment Version: 14.8.0-pre `ffc662fac2b`
* Option: 60s_1000rps
* Date: 2022-02-21
* Run Time: 1h 30m 48.38s (Start: 04:39:10 UTC, End: 06:09:59 UTC)
* GPT Version: v2.10.0
| NAME | RPS | RPS RESULT | TTFB AVG | TTFB P90 | REQ STATUS | RESULT |
|------|-----|------------|----------|----------|------------|--------|
| web_group | 100/s | 64.17/s (>80.00/s) | 1542.31ms | 3730.82ms (<400ms) | 100.00% (>99%) | FAILED¹² |
█ Web - Groups Page
data_received...................: 201 MB 3.3 MB/s
data_sent.......................: 420 kB 6.8 kB/s
group_duration..................: avg=2997.61ms min=1483.39ms med=2758.06ms max=6849.06ms p(90)=4939.20ms p(95)=5444.09ms
http_req_blocked................: avg=0.05ms min=0.00ms med=0.00ms max=18.95ms p(90)=0.01ms p(95)=0.54ms
http_req_connecting.............: avg=0.04ms min=0.00ms med=0.00ms max=15.57ms p(90)=0.00ms p(95)=0.46ms
http_req_duration...............: avg=1607.92ms min=185.00ms med=1186.83ms max=6818.37ms p(90)=4024.42ms p(95)=4923.53ms
{ expected_response:true }....: avg=1607.92ms min=185.00ms med=1186.83ms max=6818.37ms p(90)=4024.42ms p(95)=4923.53ms
http_req_failed.................: 0.00% ✓ 0 ✗ 3762
http_req_receiving..............: avg=0.83ms min=0.08ms med=0.88ms max=39.37ms p(90)=1.36ms p(95)=1.45ms
http_req_sending................: avg=0.03ms min=0.01ms med=0.02ms max=0.34ms p(90)=0.04ms p(95)=0.06ms
http_req_tls_handshaking........: avg=0.00ms min=0.00ms med=0.00ms max=0.00ms p(90)=0.00ms p(95)=0.00ms
http_req_waiting................: avg=1607.07ms min=184.50ms med=1185.88ms max=6816.94ms p(90)=4023.15ms p(95)=4922.30ms
✗ { endpoint:/ }................: avg=2976.30ms min=1477.02ms med=2744.71ms max=6816.94ms p(90)=4922.38ms p(95)=5421.31ms
✓ { endpoint:/children.json }...: avg=237.84ms min=184.50ms med=221.21ms max=894.74ms p(90)=277.50ms p(95)=307.73ms
✗ http_reqs.......................: 3762 61.260086/s
✗ { endpoint:/ }................: 1881 30.630043/s
✗ { endpoint:/children.json }...: 1881 30.630043/s
iteration_duration..............: avg=2996.05ms min=0.16ms med=2757.98ms max=6849.08ms p(90)=4938.25ms p(95)=5443.63ms
iterations......................: 1881 30.630043/s
✓ successful_requests.............: 100.00% ✓ 3762 ✗ 0
vus.............................: 6 min=6 max=100
vus_max.........................: 100 min=100 max=100
- Metrics: http://50k.testbed.gitlab.net/-/grafana/d/J0ysCtCWz/gpt-test-results?orgId=1&from=now-30d&to=now&var-test_name=web_group_issues&var-test_name=web_group&var-test_name=web_group_merge_requests
- GitLab changes between the 2022-01-31 (latest successful test run) and 2022-02-14 results: b09a2f02...cfe9ec8e
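Since the regression window maps onto that commit range, one way to narrow it down is to list the commits in the range that touched the code behind this endpoint. This is only a sketch: the file paths below are assumptions about where the relevant controller and finder live, not confirmed culprits.

```shell
# In a gitlab-org/gitlab checkout, list commits in the suspect range that
# touched the controller and the finder presumed to back
# Groups::ChildrenController#index (paths are assumptions).
git log --oneline b09a2f02...cfe9ec8e -- \
  app/controllers/groups/children_controller.rb \
  app/finders/group_descendants_finder.rb
```

If nothing obvious turns up there, widening the same command to `app/finders` and `app/models` should still keep the list reviewable.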
Looking at the server metrics during these tests, all PG nodes go up to 100% CPU utilization.
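To see which statements are driving that load, pg_stat_statements (assuming it is enabled on these nodes) can rank queries by cumulative time. A minimal sketch, run on a database node with the bundled PostgreSQL; note that on PostgreSQL 12 the columns are total_time/mean_time rather than total_exec_time/mean_exec_time:

```shell
# Rank queries by cumulative execution time on a PostgreSQL node.
# Assumes pg_stat_statements is loaded; adjust column names for PG 12.
sudo gitlab-psql -c "
  SELECT calls,
         round(total_exec_time::numeric, 0) AS total_ms,
         round(mean_exec_time::numeric, 1)  AS mean_ms,
         left(query, 120)                   AS query
  FROM pg_stat_statements
  ORDER BY total_exec_time DESC
  LIMIT 10;"
```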
Below is an example of /groups/gpt/-/issues (web_group_issues) request data from Rails - note the `db_duration_s` of 13.67939:
{"method":"GET","path":"/groups/gpt/-/issues","format":"html","controller":"GroupsController","action":"issues","status":200,"time":"2022-02-21
T17:49:39.716Z","params":[{"key":"id","value":"gpt"}],"correlation_id":"01FWEPE1W70BM83X87Z8M06HW4","meta.root_namespace":"gpt","meta.client_id
":"ip/10.142.0.2","meta.caller_id":"GroupsController#issues","meta.remote_ip":"10.142.0.2","meta.feature_category":"team_planning","remote_ip":
"10.142.0.2","ua":"GPT/2.10.0","queue_duration_s":0.022863,"request_urgency":"default","target_duration_s":1,"redis_calls":33,"redis_duration_s
":0.012899,"redis_read_bytes":4601,"redis_write_bytes":2028,"redis_cache_calls":33,"redis_cache_duration_s":0.012899,"redis_cache_read_bytes":4
601,"redis_cache_write_bytes":2028,"db_count":40,"db_write_count":0,"db_cached_count":7,"db_replica_count":39,"db_primary_count":1,"db_replica_
cached_count":7,"db_primary_cached_count":0,"db_replica_wal_count":0,"db_primary_wal_count":1,"db_replica_wal_cached_count":0,"db_primary_wal_c
ached_count":0,"db_replica_duration_s":13.692,"db_primary_duration_s":0.001,"cpu_s":0.253262,"mem_objects":123606,"mem_bytes":9748663,"mem_mall
ocs":26576,"mem_total_bytes":14692903,"pid":3042,"db_duration_s":13.67939,"view_duration_s":0.17305,"duration_s":13.91057}
🚧 Worth mentioning that on the 10k environment these tests all pass (a quick curl comparison is sketched below) - http://10k.testbed.gitlab.net/-/grafana/d/J0ysCtCWz/gpt-test-results?orgId=1&from=now-30d&to=now&var-test_name=web_group_issues&var-test_name=web_group&var-test_name=web_group_merge_requests
🚧 The `web_group`, `web_group_issues` and `web_group_merge_requests` tests started to fail at the same time on 2022-02-14, see also #334439
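For a quick manual comparison of the two environments outside of GPT, the TTFB of the two endpoints hit by the web_group test can be sampled with curl. The group path and URL scheme here are assumptions based on the test setup described below, and authentication may be required if the group is not public:

```shell
# Sample TTFB for the group page and its children.json endpoint on both environments.
for host in 10k.testbed.gitlab.net 50k.testbed.gitlab.net; do
  for path in "/gpt" "/groups/gpt/-/children.json"; do
    ttfb=$(curl -s -o /dev/null -w '%{time_starttransfer}' "http://${host}${path}")
    echo "${host}${path}: ${ttfb}s"
  done
done
```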
Test Details
Testing was done on our 50k Reference Architecture environment under lab conditions with the GitLab Performance Tool pipeline. The group being tested has 1000 subgroups with 10 projects each, and also has a copy of gitlabhq (the tarball can be found here). Information about the GitLab Performance Tool tests is listed on the Current test details page.
The latest GitLab Performance Tool pipeline results can always be found here. Full server metrics for each run are available via the Metrics Dashboard link on that page.
As per our performance targets, these endpoints' TTFB is above the 2000 ms target, which corresponds to severity2. The task is to investigate why performance has degraded and improve it.
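One way to start the investigation is to reproduce a single request server-side and then read its `db_count`/`db_duration_s` entry from the structured log, or time it directly. A minimal sketch using a Rails integration session via `gitlab-rails runner` (group path assumed; a private group will redirect to the sign-in page, so use a public group or authenticate the session):

```shell
# Issue one children.json request in-process on a Rails node and time it.
sudo gitlab-rails runner '
  require "benchmark"
  session = ActionDispatch::Integration::Session.new(Rails.application)
  elapsed = Benchmark.realtime { session.get("/groups/gpt/-/children.json") }
  puts "status=#{session.response.status} elapsed=#{elapsed.round(2)}s"
'
```

The corresponding entry in production_json.log then shows whether the time for this specific action is going into `db_duration_s`, and whether it lands on the replicas or the primary.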

