Improve websockets infrastructure to support 4x active connections
The `realtime_labels` feature adds a new WebSocket connection to the MR page (and to issue boards). The FF rollout issue is [Feature flag] Rollout of `realtime_labels` (gitlab-org/gitlab#357370 - closed).
This is expected to add 4x the current active WebSocket connections.
Upon rolling out to 25% and then 50%, we saw `kube_container_memory_component` and `kube_pool_max_nodes_component` saturate:

https://dashboards.gitlab.net/d/websockets-main/websockets-overview?from=now-7d&to=now&var-environment=gprd&var-stage=main&orgId=1&var-PROMETHEUS_DS=Global&viewPanel=1217942947
The FF has been rolled back to 10% for now.
WebSockets are deployed on separate infrastructure from the main production deployment. Each pod has a `gitlab-workhorse` and a `webservice` container. These seem to saturate memory before CPU, so the pods are possibly not being auto-scaled. The deployment is very small, currently on the order of 3-4 pods.
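One way to confirm whether autoscaling is in play is to check for an HPA on the deployment and see which metric it targets. A sketch (the namespace and deployment name are assumptions inferred from the pod names below; this needs cluster access to run):

```shell
# List any HPAs in the gitlab namespace; if none targets the websockets
# deployment, the pod count is static and memory pressure won't trigger scaling.
kubectl get hpa -n gitlab

# Inspect the deployment's replica count and per-container resource settings.
kubectl describe deployment -n gitlab gitlab-webservice-websockets
```

If an HPA exists but only targets CPU utilization, that would be consistent with memory saturating first while the pod count stays flat.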
@igorwwwwwwwwwwwwwwwwwwww investigated this in gitlab-org/gitlab#357370 (comment 910124397):
I took a look at memory attribution by container:
```
➜  ~ k top -n gitlab pod --containers --selector=type=websockets
POD                                            NAME               CPU(cores)   MEMORY(bytes)
gitlab-webservice-websockets-5b8c5b886-6n92v   gitlab-workhorse   143m         1037Mi
gitlab-webservice-websockets-5b8c5b886-6n92v   webservice         875m         4957Mi
gitlab-webservice-websockets-5b8c5b886-rm2g9   gitlab-workhorse   169m         1045Mi
gitlab-webservice-websockets-5b8c5b886-rm2g9   webservice         856m         4684Mi
gitlab-webservice-websockets-5b8c5b886-vnlwh   gitlab-workhorse   217m         1043Mi
gitlab-webservice-websockets-5b8c5b886-vnlwh   webservice         761m         4772Mi
```
We can see workhorse using about 1 GiB. Ruby is using up to 5 GiB.
Cross referencing with requests and limits:
```
➜  ~ k get pod -n gitlab gitlab-webservice-websockets-5b8c5b886-6n92v -o json | jq -c '.spec.containers[]|[.name, .resources]'
["webservice",{"limits":{"memory":"6G"},"requests":{"cpu":"4","memory":"5G"}}]
["gitlab-workhorse",{"limits":{"memory":"4G"},"requests":{"cpu":"600m","memory":"1G"}}]
```
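As a quick sanity check on the headroom, note the mixed units: the limit is a decimal `6G`, while `kubectl top` reports binary `Mi`. Using the highest `webservice` reading above:

```shell
# webservice: 4957Mi used vs. a 6G (decimal) limit
used_bytes=$((4957 * 1024 * 1024))
limit_bytes=$((6 * 1000 * 1000 * 1000))
echo "$(( used_bytes * 100 / limit_bytes ))% of limit"  # → 86% of limit
```

So the busiest `webservice` container is already at roughly 86% of its memory limit, while `gitlab-workhorse` sits at about 26% of its 4G limit.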
Both are high, but Ruby (`webservice`) is the one close to hitting its limit, so that is the container whose limit we may need to bump.
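For illustration only, the change could look like the following patch. On GitLab.com this would actually land through the Helm chart values rather than a live `kubectl patch`, and the target value of `8G` is an assumption, not a vetted number:

```shell
# Sketch only: raise the webservice container memory limit from 6G to 8G.
# Container index 0 is webservice, per the jq output above.
kubectl patch deployment -n gitlab gitlab-webservice-websockets --type=json \
  -p='[{"op": "replace",
        "path": "/spec/template/spec/containers/0/resources/limits/memory",
        "value": "8G"}]'
```

Whatever the final number, the request should likely be raised alongside the limit so the scheduler accounts for the real footprint when placing pods on the node pool.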