Skip to content

GraphQL endpoint for merge request getState requested every few seconds

ZD: https://gitlab.zendesk.com/agent/tickets/462493

The customer upgraded from PostgreSQL 12 to 13 in GitLab 15.11.13 to prepare for GitLab 16.2+. However, as soon as they upgraded the database, CPU load on puma increased to over 80%+, when previously it was under 30%.

In the SOS logs from 20231021151659, fast-stats showed that GraphqlController#execute and Projects::MergeRequests::ContentController#widget dominated the request rate, with the former being the most problematic:

CONTROLLER                                                   COUNT     RPS    P99_ms    P95_ms   MEDIAN_ms    MAX_ms    MIN_ms      SCORE     %FAIL
GraphqlController#execute                                     3047   17.82    1991.1    1179.1       409.9    8144.7      22.9  6066930.5      0.00
Projects::MergeRequests::ContentController#widget             2932   17.15     456.0     302.6        82.5    1438.1      31.2  1337124.2      0.00
Repositories::GitHttpController#info_refs                      215    1.26     437.0     325.2        50.3     878.5      10.5    93957.3      0.00
Projects::MergeRequests::ContentController#cached_widget        38    0.22    1068.1     978.6       408.4    1071.7     153.0    40586.2      0.00
Repositories::GitHttpController#git_upload_pack                 50    0.29     508.4     430.5       159.8     544.7      53.1    25419.5      0.00
Projects::MergeRequestsController#ci_environments_status        19    0.11     563.2     406.5       158.3     602.4      51.1    10700.9      0.00
ApplicationCable::Connection#connect                            27    0.16     243.2     207.4         9.4     255.0       3.5     6566.4      0.00
GraphqlChannel#subscribe                                        22    0.13     197.8     165.6       140.6     206.3      95.7     4352.1      0.00
ApplicationCable::Connection#disconnect                         26    0.15      80.6      28.1         0.1      97.2       0.1     2094.8      0.00
MetricsController#index                                          6    0.04     327.3     312.7       234.5     331.0     220.6     1964.0      0.00
HealthController#readiness                                     110    0.64      15.2       8.0         0.9      18.9       0.7     1671.2      0.00
GraphqlChannel#unsubscribe                                      18    0.11      53.5      27.9         4.2      60.0       0.6      963.6      0.00
UploadsController#show                                           2    0.01      47.6      47.3        44.9      47.6      42.1       95.1      0.00
PwaController#manifest                                           1    0.01      23.9      23.9        23.9      23.9      23.9       23.9      0.00

There were about 86 users total involved, and it seemed based on the User-Agent (ua) that most of the requests were coming from opened Chrome browsers:

% cat production_json.log | grep Graphql | jq .ua | sort | uniq -c | sort -r
2189 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
 480 "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/118.0.0.0 Safari/537.36"
 106 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Safari/537.36"
  86 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.6 Safari/605.1.15"
  84 "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36"
  76 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36"
  41 "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/118.0"
  19 "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:109.0) Gecko/20100101 Firefox/118.0"
   6 "python-requests/2.28.1"

This makes me suspect that the user's browsers were left open, and for some reason the client was requesting these merge request endpoints quite frequently.

If I take the user with the most requests as an example, we see 4 different merge requests were related:

% cat production_json.log | grep Graphql | grep REDACTED_user | jq . | grep variables | grep iid | sort | uniq -c | sort -r
  59       "variables": "{\"projectPath\"=>\"REDACTED/REDACTED\", \"iid\"=>\"16983\"}",
  28       "variables": "{\"projectPath\"=>\"REDACTED/REDACTED\", \"iid\"=>\"17039\"}",
  24       "variables": "{\"projectPath\"=>\"REDACTED/REDACTED\", \"iid\"=>\"13477\"}",
  19       "variables": "{\"projectPath\"=>\"REDACTED/REDACTED\", \"iid\"=>\"13314\"}",

Look at the timestamps:

 % cat production_json.log| grep Graphql | grep REDACTED_user | grep 16983 | jq .time
"2023-10-21T22:17:37.741Z"
"2023-10-21T22:17:40.918Z"
"2023-10-21T22:17:46.716Z"
"2023-10-21T22:17:46.799Z"
"2023-10-21T22:18:01.905Z"
"2023-10-21T22:18:03.820Z"
"2023-10-21T22:18:09.792Z"
"2023-10-21T22:18:15.752Z"
"2023-10-21T22:18:16.729Z"
"2023-10-21T22:18:21.721Z"
"2023-10-21T22:18:22.760Z"
"2023-10-21T22:18:24.807Z"
"2023-10-21T22:18:31.840Z"
"2023-10-21T22:18:36.729Z"
"2023-10-21T22:18:39.735Z"
"2023-10-21T22:18:41.812Z"
"2023-10-21T22:18:43.076Z"
"2023-10-21T22:18:43.800Z"
"2023-10-21T22:18:44.767Z"
"2023-10-21T22:18:50.721Z"
"2023-10-21T22:18:51.813Z"
"2023-10-21T22:18:58.067Z"
"2023-10-21T22:18:59.234Z"
"2023-10-21T22:19:11.044Z"
"2023-10-21T22:19:11.112Z"
"2023-10-21T22:19:11.827Z"
"2023-10-21T22:19:14.365Z"
"2023-10-21T22:19:15.362Z"
"2023-10-21T22:19:16.419Z"
"2023-10-21T22:19:17.658Z"
"2023-10-21T22:19:19.438Z"
"2023-10-21T22:19:21.650Z"
"2023-10-21T22:19:23.900Z"
"2023-10-21T22:19:25.639Z"
"2023-10-21T22:19:31.366Z"
"2023-10-21T22:19:32.360Z"
"2023-10-21T22:19:35.432Z"
"2023-10-21T22:19:37.566Z"
"2023-10-21T22:19:39.310Z"
"2023-10-21T22:19:43.380Z"
"2023-10-21T22:19:43.895Z"
"2023-10-21T22:19:47.518Z"
"2023-10-21T22:19:49.823Z"
"2023-10-21T22:19:51.234Z"
"2023-10-21T22:19:52.248Z"
"2023-10-21T22:19:52.936Z"
"2023-10-21T22:19:52.978Z"
"2023-10-21T22:19:59.957Z"
"2023-10-21T22:20:01.875Z"
"2023-10-21T22:20:02.765Z"
"2023-10-21T22:20:06.334Z"
"2023-10-21T22:20:06.873Z"
"2023-10-21T22:20:08.080Z"
"2023-10-21T22:20:09.777Z"
"2023-10-21T22:20:11.553Z"
"2023-10-21T22:20:19.083Z"
"2023-10-21T22:20:19.209Z"
"2023-10-21T22:20:20.916Z"
"2023-10-21T22:20:21.102Z"

I thought the frontend code was supposed to request this once, and then poll with increasingly long intervals (https://gitlab.com/gitlab-org/gitlab/-/blob/e3bb1d633d6b59c808234c8c07defecea9998142/app/assets/javascripts/vue_merge_request_widget/mr_widget_options.vue#L188-199, !108860 (merged)). In my testing, I saw that requests were spaced out every few minutes rather than every second.

I thought that the real-time improvements in #372514 (closed) would have reduced the need for polling, but I suspect something went wrong here.

@thomasrandolph Do you have any idea why the frontend might not be increasing its poll interval here?

Edited by 🤖 GitLab Bot 🤖