Investigate increased memory usage with ActionCable
- Infra epic: gitlab-com/gl-infra&355 (closed)
When we enabled ActionCable in the most recent rollout, we saw significant increases in memory use, in both Workhorse and Puma. An overview of a few select metrics can be seen in this dashboard for the respective time frame. The metrics are:
- Active connection count by process (Action Cable / Puma). This is a measure of how many clients are connected to a single Puma process at any given point in time.
- Pending tasks (Action Cable / Puma). This is a measure of saturation: pending tasks means that there was backlogging in the AC thread pool and client requests had to wait.
- Ruby process PSS (Action Cable / Puma). This is a measure of how much memory each Action Cable process was consuming (accounting for shared memory, so it's a better measure than RSS). Note that there are 4 Puma workers per pod.
- Workhorse process RSS. This is a measure of how much memory WH was consuming. Note that there is 1 WH process per pod. Shared memory is not that interesting here because it's comparatively small and RSS is a good estimate of physical memory use here.
- Workhorse WS file descriptors. This is a measure of socket connections to serve WS traffic.
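To make the PSS metric above concrete: on Linux, per-process PSS can be read from `/proc/<pid>/smaps_rollup`. The sketch below (a hedged illustration, not the actual exporter code behind the dashboard; the function name `pssKB` and the sample input are mine) parses the `Pss:` field from that file's contents:

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// pssKB extracts the Pss value (in kB) from the contents of a
// /proc/<pid>/smaps_rollup file. Field names follow the Linux
// procfs format: "Pss:            <value> kB".
func pssKB(smapsRollup string) (int64, error) {
	sc := bufio.NewScanner(strings.NewReader(smapsRollup))
	for sc.Scan() {
		line := sc.Text()
		if strings.HasPrefix(line, "Pss:") {
			fields := strings.Fields(line)
			if len(fields) < 2 {
				return 0, fmt.Errorf("malformed Pss line: %q", line)
			}
			return strconv.ParseInt(fields[1], 10, 64)
		}
	}
	return 0, fmt.Errorf("no Pss line found")
}

func main() {
	// Hypothetical sample roughly matching the peak reading below.
	sample := "Rss:            2048000 kB\nPss:            1982464 kB\nShared_Clean:      64000 kB\n"
	kb, err := pssKB(sample)
	if err != nil {
		panic(err)
	}
	fmt.Printf("PSS: %.2f GB\n", float64(kb)/(1024*1024)) // prints "PSS: 1.89 GB"
}
```

Unlike RSS, PSS divides each shared page's size by the number of processes mapping it, which is why it is the fairer per-worker measure when 4 Puma workers share copy-on-write memory.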
The readings topped out as follows (again, these are per process not cluster wide):
- 3551 connections
- 14 pending tasks
- 1.89GB of Puma worker PSS (for comparison: `web` is fairly steady at ~1GB)
- ~2.9GB of Workhorse RSS (for comparison: `web` fluctuates anywhere between 50-200MB)
- 37120 Workhorse WS FDs
A container summary overview shows that overall connection count peaked at 31531 connections for the cluster.
We found that memory use increases steadily with connection count, most likely because, contrary to our initial assumption, all WebSocket traffic is proxied through Workhorse as well, not just the protocol upgrade.
We should investigate options to reduce memory use, or to stop proxying this traffic altogether, since proxying it is not necessary.