Request for help scaling WebSockets

Scaling Request

The feature/improvement we'd like some assistance with is: WebSockets

The epic and relevant issues are:

Websockets (interactive terminal and Actioncable) on Kubernetes (for reference)
Readiness review (for reference)
Implementing Prometheus metrics to populate service dashboard (for reference)
Enable on GitLab.com (for reference)
Investigating higher memory usage in Workhorse with Action Cable (in progress)
OKR with our goals for this quarter

The reason we're asking for a scaling review on this item is:

We currently service, at peak, about 50,000 open WebSocket connections in production on GitLab.com. These originate from the issue view page.

Traffic is split at the HAProxy layer and WebSocket connections are routed to isolated infrastructure in K8s. This isolation keeps the risk to the rest of the web fleet relatively low.

We seem to comfortably handle this traffic with our existing infrastructure of two replicas per cluster. Each pod has a Workhorse and Webservice container (see this Grafana dash).

There is a known memory issue with Workhorse (linked above). It doesn't currently affect our ability to service this traffic, but it is a possible scaling concern for the future.

We'd now like to expand WebSocket support to other parts of the application, starting with the merge request view page. This page receives roughly 2x the traffic of the issue view page, so serving both pages would need roughly 3x the capacity we have now.
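To make the 3x figure explicit, here is a rough back-of-the-envelope sketch. It assumes that open WebSocket connections scale in proportion to page traffic, which we have not verified; the function names are illustrative only.

```python
# Figures taken from this request.
PEAK_ISSUE_CONNECTIONS = 50_000  # peak open connections from the issue view page
MR_TRAFFIC_RATIO = 2.0           # merge request page gets roughly 2x issue page traffic


def capacity_multiplier(mr_ratio: float) -> float:
    """Capacity needed relative to today: existing issue traffic (1x) plus MR traffic."""
    return 1.0 + mr_ratio


def projected_peak_connections(issue_peak: int, mr_ratio: float) -> int:
    """Projected peak connections once both pages use WebSockets.

    Assumption: connection count is proportional to page traffic.
    """
    return round(issue_peak * capacity_multiplier(mr_ratio))


if __name__ == "__main__":
    print(capacity_multiplier(MR_TRAFFIC_RATIO))
    print(projected_peak_connections(PEAK_ISSUE_CONNECTIONS, MR_TRAFFIC_RATIO))
```

If the proportionality assumption holds, this works out to roughly 150,000 peak connections at full rollout.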

Some scalability questions:

Assuming we don't fix the memory issue in Workhorse, memory usage appears to scale linearly with the number of connections. It currently sits at around 1GiB per container. At what size do we need more containers?
We implemented Prometheus metrics. Can we have help building out a service dashboard and alerts?
We don't currently know how close to capacity we are with the current number of replicas. How should we approach scaling the Kubernetes deployment we have to handle the new traffic? (Note: we can roll out a new feature by percentage using a feature flag.)
Any other scaling concerns?
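As a starting point for the memory and rollout questions above, here is a rough sketch of how Workhorse memory per container might grow as the feature flag percentage increases. It assumes memory stays linear in connection count, that the replica count is unchanged, and that merge request traffic adds 2x the current connections at full rollout. The target memory per container (`target_gib`) is a hypothetical value for illustration, not an agreed limit.

```python
# Figures taken from this request.
CURRENT_MEMORY_GIB = 1.0  # approximate Workhorse memory per container today
MR_TRAFFIC_RATIO = 2.0    # merge request page gets roughly 2x issue page traffic


def memory_per_container_gib(rollout_fraction: float,
                             current_gib: float = CURRENT_MEMORY_GIB,
                             mr_ratio: float = MR_TRAFFIC_RATIO) -> float:
    """Estimated memory per container at a given rollout, with replicas unchanged.

    Assumption: memory is linear in the number of open connections.
    """
    if not 0.0 <= rollout_fraction <= 1.0:
        raise ValueError("rollout_fraction must be between 0 and 1")
    return current_gib * (1.0 + mr_ratio * rollout_fraction)


def replica_multiplier(rollout_fraction: float, target_gib: float) -> float:
    """Factor by which to grow the replica count to stay at or below target_gib.

    target_gib is a hypothetical per-container memory target.
    """
    return max(1.0, memory_per_container_gib(rollout_fraction) / target_gib)


if __name__ == "__main__":
    for pct in (0, 25, 50, 100):
        fraction = pct / 100
        print(pct,
              memory_per_container_gib(fraction),
              replica_multiplier(fraction, target_gib=1.5))
```

Under these assumptions, memory per container would reach about 3GiB at 100% rollout if we add no replicas, so the percentage rollout gives us several checkpoints at which to compare the estimate against the Prometheus metrics.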

I will aim to resolve questions around cost separately from this request.

In particular, we are concerned about:

We're hoping to release this as part of milestone: %14.0 (Proof of concept already built)

/cc @rnienaber