
Investigate performance degradation of GroupsController due to ban_user_feature_flag on large environments like 50k

The TTFB (Time to First Byte) of several GroupsController web endpoints in performance tests has degraded significantly on our largest environment, 50k:

* Environment:                50k
* Environment Version:        14.8.0-pre `ffc662fac2b`
* Option:                     60s_1000rps
* Date:                       2022-02-21
* Run Time:                   1h 30m 48.38s (Start: 04:39:10 UTC, End: 06:09:59 UTC)
* GPT Version:                v2.10.0

NAME                                                     | RPS    | RPS RESULT           | TTFB AVG   | TTFB P90              | REQ STATUS     | RESULT  
---------------------------------------------------------|--------|----------------------|------------|-----------------------|----------------|---------
web_group                                                | 100/s  | 64.17/s (>80.00/s)   | 1542.31ms  | 3730.82ms (<400ms)    | 100.00% (>99%) | FAILED¹²
web_group_issues                                         | 100/s  | 6.13/s (>80.00/s)    | 12958.83ms | 22459.75ms (<500ms)   | 100.00% (>99%) | FAILED¹²
web_group_merge_requests                                 | 100/s  | 29.39/s (>80.00/s)   | 3025.91ms  | 5091.28ms (<500ms)    | 100.00% (>99%) | FAILED¹²
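To make the failures above easier to read, here is a minimal sketch that re-checks each row against the thresholds shown in parentheses in the table. It assumes those parenthesised values are the pass criteria (RPS above 80/s, TTFB P90 below the listed limit, request success above 99%). The `Result` class and `failed_checks` function are hypothetical names used only for illustration and are not part of GitLab Performance Tool.

```python
from dataclasses import dataclass


@dataclass
class Result:
    """One row of the results table (hypothetical structure, not a GPT type)."""
    name: str
    rps: float                # achieved requests per second
    rps_min: float            # threshold from the table, e.g. ">80.00/s"
    ttfb_p90_ms: float        # measured TTFB P90
    ttfb_p90_max_ms: float    # threshold from the table, e.g. "<400ms"
    success_pct: float        # percentage of successful requests
    success_min_pct: float = 99.0  # threshold from the table, ">99%"


def failed_checks(r: Result) -> list[str]:
    """Return the names of the thresholds this row does not meet."""
    failed = []
    if not r.rps > r.rps_min:
        failed.append("rps")
    if not r.ttfb_p90_ms < r.ttfb_p90_max_ms:
        failed.append("ttfb_p90")
    if not r.success_pct > r.success_min_pct:
        failed.append("req_status")
    return failed


# Values copied from the table above.
RESULTS = [
    Result("web_group", 64.17, 80.0, 3730.82, 400.0, 100.0),
    Result("web_group_issues", 6.13, 80.0, 22459.75, 500.0, 100.0),
    Result("web_group_merge_requests", 29.39, 80.0, 5091.28, 500.0, 100.0),
]

for r in RESULTS:
    print(r.name, failed_checks(r) or "PASSED")
```

Under these assumptions, every row misses both the RPS and the TTFB P90 thresholds while the request success rate stays at 100%, which is consistent with the requests being slow rather than erroring.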

Screenshot_2022-02-23_at_16.49.14

Looking at the server metrics during the test, all PG nodes go up to 100% CPU utilization:

Screenshot_2022-02-21_at_20.49.28

🚧 Worth mentioning that on 10k environments these tests all pass - http://10k.testbed.gitlab.net/-/grafana/d/J0ysCtCWz/gpt-test-results?orgId=1&from=now-30d&to=now&var-test_name=web_group_issues&var-test_name=web_group&var-test_name=web_group_merge_requests

🚧 web_group, web_group_issues and web_group_merge_requests tests started to fail at the same time on 2022-02-14, see also #334439

After some investigation, this looks to be due to the ban_user_feature_flag being enabled by default.

Test Details

Testing was done on our 50k Reference Architecture environment with our lab-condition GitLab Performance Tool pipeline. The group being tested has 1000 subgroups with 10 projects each and also has a copy of gitlabhq (tarball can be found here). Information about the GitLab Performance Tool tests is listed on the Current test details page.

The latest GitLab Performance pipeline results can always be found here. Full Server Metrics are available via the Metrics Dashboard link on that page.

As per our performance targets, this endpoint's TTFB metric is above the target of 9000 ms, which is severity1. The task is to investigate why performance has degraded and improve it.

Edited by Grant Young