Investigation followup: The loadbalancer SLI of the websockets service (`main` stage) has an error rate violating SLO
Summary
We have been getting multiple alerts for Websockets load balancing. While this SLI has alerted before, we saw an increase in May, which has resulted in a silence until we figure out what is causing the issue.
https://log.gprd.gitlab.net/goto/23a11390-e8bb-11ec-8656-f5f2137823ba
We believe these errors primarily come from ActionCable, which is used by the browser for real-time updates in the sidebar. Because clients retry when this service errors, the errors are not noticeable to clients, but it is odd that we have suddenly started to see more.
- Around April 7th, there was a fix so that Workhorse now sees the TERM signal and shuts down when pods are terminated (see the Go sketch after this list)
- In gitlab-com/gl-infra/k8s-workloads/gitlab-com!1810 (merged) and gitlab-com/gl-infra/k8s-workloads/gitlab-com!1834 (merged) we removed the long blackout window for Puma
- With gitlab-com/gl-infra/k8s-workloads/gitlab-com!1855 (merged) we increased the minimum replicas to help reduce the spiky scaling patterns we see for Websockets in both canary and main
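For context on the first item, the pattern the April 7th fix enables is a standard Go graceful shutdown: catch TERM and drain in-flight requests within the pod's termination grace period. A minimal sketch follows; the listen address and 30-second timeout are placeholders, not Workhorse's actual configuration:

```go
package main

import (
	"context"
	"log"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	srv := &http.Server{Addr: ":8181"}

	go func() {
		if err := srv.ListenAndServe(); err != http.ErrServerClosed {
			log.Fatalf("listen: %v", err)
		}
	}()

	// Block until Kubernetes delivers TERM to the container on pod deletion.
	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM)
	<-stop

	// Drain in-flight requests within the pod's termination grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		log.Printf("shutdown: %v", err)
	}
}
```

Note that Go's `http.Server.Shutdown` does not close or wait for hijacked connections such as WebSockets, which is relevant to the termination question at the end of this issue.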
It does look like these changes have improved our LoadBalancer SLI.
These error spikes are accompanied by RPS spikes, as seen on this graph.
A couple of investigations remain before we remove the silence and close this out:
- What is causing these RPS spikes? Are they deployment-related, i.e., a surge of client reconnections?
- In gitlab-org/gitlab#363096 (comment 980570509) we discussed how WebSocket connections are closed. Now that both Workhorse and Puma shut down at the same time (since April 7th, when the Workhorse signaling issue was resolved), should we delay Workhorse termination so that connections are shut down properly? (One possible shape of that delay is sketched below.)
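If we do delay Workhorse termination, one shape it could take is: keep a registry of open connections, wait long enough for Puma to finish its own shutdown, then send each client a proper close frame so ActionCable reconnects cleanly instead of seeing an abrupt TCP reset. The `connTracker` type and the delay value below are hypothetical, and the sketch uses the gorilla/websocket API rather than Workhorse's internals:

```go
package drain

import (
	"sync"
	"time"

	"github.com/gorilla/websocket"
)

// connTracker is a hypothetical registry of open WebSocket connections;
// Workhorse's real connection handling differs.
type connTracker struct {
	mu    sync.Mutex
	conns map[*websocket.Conn]struct{}
}

func newConnTracker() *connTracker {
	return &connTracker{conns: make(map[*websocket.Conn]struct{})}
}

func (t *connTracker) add(c *websocket.Conn) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.conns[c] = struct{}{}
}

// drainAfter waits `delay` (giving Puma time to finish shutting down),
// then sends each client a close frame before dropping the connection.
func (t *connTracker) drainAfter(delay time.Duration) {
	time.Sleep(delay)
	t.mu.Lock()
	defer t.mu.Unlock()
	for c := range t.conns {
		msg := websocket.FormatCloseMessage(websocket.CloseGoingAway, "shutting down")
		_ = c.WriteControl(websocket.CloseMessage, msg, time.Now().Add(5*time.Second))
		_ = c.Close()
		delete(t.conns, c)
	}
}
```

A code-free alternative would be a `preStop` hook on the Workhorse container that sleeps before Kubernetes delivers TERM, staggering Workhorse's shutdown behind Puma's.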