Investigation followup: The loadbalancer SLI of the websockets service (`main` stage) has an error rate violating SLO
Summary
We have been getting multiple alerts for Websockets load balancing. While this SLI has alerted before, we saw an increase in May, which has resulted in a silence until we figure out what is causing the issue.
https://log.gprd.gitlab.net/goto/23a11390-e8bb-11ec-8656-f5f2137823ba
We believe these errors primarily come from ActionCable, which is used by the browser for real-time updates in the sidebar. Because clients retry when this service errors, the errors are not noticeable to clients, but it is odd that we have suddenly started to see more.
- Around April 7th, there was a fix so that Workhorse now sees the TERM signal and shuts down when pods are terminated (see the Go sketch after this list)
- In gitlab-com/gl-infra/k8s-workloads/gitlab-com!1810 (merged) and gitlab-com/gl-infra/k8s-workloads/gitlab-com!1834 (merged) we removed the long blackout window for Puma
- With gitlab-com/gl-infra/k8s-workloads/gitlab-com!1855 (merged) we increased the minimum replicas to help reduce the spiky scaling patterns we see for Websockets in both canary and main
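For context on the first item, the pattern the April 7th fix enables is a standard Go graceful shutdown: catch TERM and drain in-flight requests within the pod's termination grace period. A minimal sketch follows; the listen address and 30-second timeout are placeholders, not Workhorse's actual configuration:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8181"}

	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Block until Kubernetes delivers TERM to the container on pod deletion.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Drain in-flight requests within the pod's termination grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```

Note that Go's `http.Server.Shutdown` does not close or wait for hijacked connections such as WebSockets, which is relevant to the termination question at the end of this issue.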
It does look like these changes have improved our LoadBalancer SLI.
These error spikes are accompanied by RPS spikes, as seen on this graph.
A couple of investigations remain before we remove the silence and close this out:
- What is causing these RPS spikes? Are they deployment-related, i.e., a surge of client reconnections?
- In gitlab-org/gitlab#363096 (comment 980570509) we discussed how WebSocket connections are closed. Now that both Workhorse and Puma shut down at the same time (since April 7th, when the Workhorse signaling issue was resolved), should we delay Workhorse termination so that connections are shut down properly? (One possible shape of that delay is sketched below.)
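If we do delay Workhorse termination, one shape it could take is: keep a registry of open connections, wait long enough for Puma to finish its own shutdown, then send each client a proper close frame so ActionCable reconnects cleanly instead of seeing an abrupt TCP reset. The `connTracker` type and the delay value below are hypothetical, and the sketch uses the gorilla/websocket API rather than Workhorse's internals:

```go
package drain

import (
	"sync"
	"time"

	"github.com/gorilla/websocket"
)

// connTracker is a hypothetical registry of open WebSocket connections;
// Workhorse's real connection handling differs.
type connTracker struct {
	mu    sync.Mutex
	conns map[*websocket.Conn]struct{}
}

func newConnTracker() *connTracker {
	return &connTracker{conns: make(map[*websocket.Conn]struct{})}
}

func (t *connTracker) add(c *websocket.Conn) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.conns[c] = struct{}{}
}

// drainAfter waits `delay` (giving Puma time to finish shutting down),
// then sends each client a close frame before dropping the connection.
func (t *connTracker) drainAfter(delay time.Duration) {
	time.Sleep(delay)
	t.mu.Lock()
	defer t.mu.Unlock()
	for c := range t.conns {
		msg := websocket.FormatCloseMessage(websocket.CloseGoingAway, "shutting down")
		_ = c.WriteControl(websocket.CloseMessage, msg, time.Now().Add(5*time.Second))
		_ = c.Close()
		delete(t.conns, c)
	}
}
```

A code-free alternative would be a `preStop` hook on the Workhorse container that sleeps before Kubernetes delivers TERM, staggering Workhorse's shutdown behind Puma's.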