Document bottlenecks of our Action Cable deployment
We currently serve Action Cable traffic from a dedicated deployment, `websockets`. With the need for real-time behavior across GitLab increasing, these workloads are not sufficiently documented or understood in terms of potential performance and scalability bottlenecks. This means it is unclear:
- How to decide whether a newly rolled out feature that uses Action Cable (especially GraphQL subscriptions, our biggest use case by far) is successful and healthy.
- Which tunables exist to address any performance bottlenecks we may discover as we increase adoption.
A key point to understand is that `websockets` instances neither behave nor are configured like ordinary Puma servers. Puma is only the "door keeper": it parses the HTTP upgrade request coming from Apollo; Action Cable then hijacks the TCP stream from Puma and processes work on its own multi-threaded worker pool. Once that has happened, Puma is completely out of the picture.
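The hand-off can be illustrated with Rack's socket-hijacking API, which is the mechanism Action Cable builds on. The env hash and socket pair below are simplified stand-ins for Puma's request handling, not actual GitLab code:

```ruby
require 'socket'

# Simplified model of Rack "full hijacking": the app takes ownership of
# the raw client socket, after which the web server stops managing it.
client, server_io = UNIXSocket.pair

env = {}
env['rack.hijack'] = -> { env['rack.hijack_io'] = server_io }

# Conceptually, this is what Action Cable does inside its Rack endpoint
# after validating the HTTP upgrade request:
env['rack.hijack'].call
io = env['rack.hijack_io']

# From here on, frames are read and written on the raw socket by Action
# Cable's own threads; Puma no longer sees this connection.
io.write('hijacked')
client.read(8) # => "hijacked"
```

This is why Puma-centric tuning (worker/thread counts, request timeouts) stops applying once the upgrade succeeds: the connection simply is not Puma's anymore.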
This implies, for example, that the worker thread pool must be tuned differently (we use the `ACTION_CABLE_WORKER_POOL_SIZE` environment variable for this). On top of that, Action Cable runs a separate internal thread pool to dispatch events (such as stream or periodic callbacks, but also actual websocket data). Both of these thread pools can experience congestion, especially since:
- websocket connections are stateful and long-lived, and more data may be transferred per "request" (conversation?) than over a short-lived HTTP request, reducing the number of worker threads available to process other events
- there is no load balancing; payloads are routed through pub/sub and processed wherever someone is listening, so a single `websockets` instance can run up pending tasks if it happens to manage subscribers who all listen for expensive workloads
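The congestion mode described above can be sketched with a toy worker pool in plain Ruby (illustrative names only; the real Action Cable pool is built on concurrent-ruby): a fixed number of threads drain a shared queue, so a few slow tasks make all other pending work pile up behind them.

```ruby
# Toy model of a fixed-size worker pool draining a shared task queue.
class WorkerPool
  def initialize(size)
    @queue = Queue.new
    @threads = Array.new(size) do
      Thread.new { loop { @queue.pop.call } }
    end
  end

  def post(&task)
    @queue << task
  end

  # Analogous to the action_cable_pool_pending_tasks metric.
  def pending_tasks
    @queue.size
  end
end

pool = WorkerPool.new(2) # cf. ACTION_CABLE_WORKER_POOL_SIZE
done = Queue.new

# Two expensive "events" occupy both workers...
2.times { pool.post { sleep 0.2; done << :slow } }
sleep 0.05 # let the workers pick them up

# ...so cheap events now back up in the queue instead of running.
5.times { pool.post { done << :fast } }
backlog = pool.pending_tasks # 5: all cheap tasks are still pending
7.times { done.pop }         # drain once the slow tasks finish
```

Because the queue is shared and unbounded, a handful of subscribers on expensive streams is enough to delay every other event on that instance, which is exactly the failure mode the missing load balancing exposes.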
While CPU utilization currently looks acceptable when measured across the entire `websockets` fleet, there are bursts of event backlogging during which fleet utilization reaches 90%, with individual Puma processes at 175% CPU utilization:
This correlated with a spike in `action_cable_pool_pending_tasks`, i.e. events stuck in the internal task queue:
I think we should spend some time better understanding and defining how these resources should be managed and configured, and document the results somewhere discoverable.
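As a starting point for that documentation, the worker pool tunable mentioned above maps onto Rails' standard Action Cable configuration roughly like this (a sketch; the initializer path is illustrative, and the fallback of 4 is Rails' own default):

```ruby
# config/initializers/action_cable.rb (illustrative location)
Rails.application.configure do
  # Action Cable's worker pool executes channel callbacks and streamed
  # broadcasts; this sketch sizes it from the ACTION_CABLE_WORKER_POOL_SIZE
  # environment variable the issue mentions.
  config.action_cable.worker_pool_size =
    ENV.fetch('ACTION_CABLE_WORKER_POOL_SIZE', 4).to_i
end
```

Documenting which symptoms (e.g. sustained `action_cable_pool_pending_tasks` growth) should trigger raising this value, versus adding instances, would answer most of the open questions above.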

