RFC: Memory impact of websockets support
GitLab is looking to add real-time updates to the issue and MR sidebars. The current technical proposal is to leverage websockets, specifically ActionCable, to implement this feature. The Real Time working group is tasked with prototyping such a solution. (There is no technical design doc yet outlining the rationale and discarded options; we hope to have this in place soon.) As the memory team, we were asked to advise on memory-specific concerns.
The latest proposal is to run a dedicated ActionCable server via Puma; there is a POC already implementing this for issues (issue, MR). The decision to run a dedicated server was made out of fault-tolerance and availability considerations, since running it in-process would mean interfering with ordinary request processing (not reiterating the details here; there was a discussion around this here), but to summarize (a setup sketch follows the list):
- Running it as a separate service eases deployment and running at scale
- Allows us to start a decoupled service, similar to how we run Sidekiq, that is single-purpose and can be optimised/fine-tuned
- We can rollout the service without affecting any other services
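For reference, the standard Rails pattern for such a standalone cable server is a dedicated rackup file run under its own Puma instance. A minimal sketch based on the stock Rails approach (file path and port are illustrative; the POC linked above may wire this up differently):

```ruby
# cable/config.ru -- rackup file for a dedicated ActionCable server.
# Boots the full Rails app so channels have access to models etc.,
# then serves only websocket traffic.
require_relative "../config/environment"
Rails.application.eager_load!

run ActionCable.server
```

This is started independently of the web fleet, e.g. `bundle exec puma -p 28080 cable/config.ru`, which is what enables the decoupled deployment and rollout properties listed above. Note that booting the full Rails environment is also exactly why this process is not cheap.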
Assuming we continue down this path, it means that as of now we would pay not only the cost of an extra Puma process, but additionally the cost of pulling in ActionCable on top of that, which appears to have a substantial footprint:
```
rails-web         : {"timestamp":"2020-04-09T20:13:55.834Z","pid":25687,"message":"PumaWorkerKiller: Consuming 686.09375 mb with master and 1 workers."}
rails-actioncable : {"timestamp":"2020-04-09T20:14:04.238Z","pid":25684,"message":"PumaWorkerKiller: Consuming 1000.15625 mb with master and 1 workers."}
```

(reported via Slack)
i.e. there is an immediate penalty of roughly 300MB (1000MB vs. 686MB) just for running ActionCable, and that's before the app has even started to do anything meaningful.
The goal is to find an optimal trade-off between availability/performance and memory consumption for:
- gitlab.com (web users)
- Omnibus (self-managed users)
- GDK/compose-kit (developers)
There might be different strategies we can employ to cut down on memory use for each of these.
Some initial ideas/comments:
- Degrade the feature gracefully to no-realtime, meaning that by simply not running the extra process, you get the "normal" issue/MR page behavior
- Consider running ActionCable in-process for some setups (a non-option for .com, but perhaps Omnibus and certainly GDK; see the routes sketch after this list)
- Lazy-load required classes in the ActionCable process, so that only classes that will actually be used reside in memory
- Find/define a "feature boundary" for the issues feature, so that we can load only that, but not anything non-realtime related
- Set strict requirements around what an ActionCable service can do; otherwise we open the floodgates to new features being developed that leverage real-time, which will make memory consumption worse
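To illustrate the in-process option mentioned above: Rails can mount the cable server directly into the main application, so no extra process is needed at all. A sketch of the stock Rails mechanism (not GitLab-specific code):

```ruby
# config/routes.rb -- serve websocket connections from the same Puma
# workers that handle ordinary HTTP requests. This avoids the extra
# process entirely, at the cost of websockets competing with regular
# requests for worker capacity (the fault-tolerance concern above).
Rails.application.routes.draw do
  mount ActionCable.server => '/cable'
end
```

Combined with graceful degradation, this would let low-resource setups choose between no real-time, in-process real-time, and a dedicated server.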
Action items
- Issue 1: Understand the ActionCable memory breakdown. The aim is to look for memory-saving opportunities in ActionCable specifically, such as not loading the server code if it's not required. We should also understand why the memory requirements are so much higher compared to a Puma instance not running ActionCable (also: compare to Unicorn). This looks at the problem from a technical, i.e. horizontal, perspective. (See the measurement sketch after this list.)
- Issue 2: Understand the memory breakdown of issues (the feature). The aim is to find the set of classes that are necessary to run just the real-time issue sidebar feature and try to identify what a boundary could look like (to inform future breakdowns as well). This looks at the problem from a domain, i.e. vertical, perspective.
- Issue 3: Understand why we didn't detect the memory increase caused by ActionCable in CI. I find it alarming that we have a sophisticated pipeline deployed just for this and it didn't signal this change. This should be addressed and fits well into our proactive work roadmap.
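For Issue 1, the RSS/PSS/USS split used in the outcome below can be read straight from the kernel's per-process accounting. A minimal sketch (Linux-only; the helper name and pids are hypothetical):

```ruby
# Reads /proc/<pid>/smaps_rollup and returns the memory breakdown in MB.
# RSS counts shared pages in full for every process; PSS divides them
# proportionally; USS (private pages) is what a process holds exclusively.
def memory_breakdown(pid)
  rollup = File.read("/proc/#{pid}/smaps_rollup")
  kb = rollup.scan(/^(\w+):\s+(\d+) kB/).to_h { |key, val| [key, val.to_i] }

  uss = kb["Private_Clean"] + kb["Private_Dirty"]
  {
    rss_mb:    kb["Rss"] / 1024.0,
    pss_mb:    kb["Pss"] / 1024.0,
    uss_mb:    uss / 1024.0,
    shared_mb: (kb["Rss"] - uss) / 1024.0
  }
end

# Compare a plain web worker against an ActionCable worker, e.g.:
#   memory_breakdown(25687) # rails-web
#   memory_breakdown(25684) # rails-actioncable
```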
Outcome
First, there is ActionCable itself (the gem or library). Based on the profiles pulled for gitlab-org/gitlab#214787 (closed), the size it adds is on the order of kilobytes, not megabytes. This is probably not worth investigating further; it is safe to pull it in on every node.
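This can be double-checked by measuring what the bare `require` allocates, for instance with the memory_profiler gem (a sketch; run it in a fresh, non-Rails process so ActionCable isn't already loaded):

```ruby
# Measures allocations caused by loading the action_cable gem alone,
# i.e. the library cost without any server or channels running.
require "memory_profiler"

report = MemoryProfiler.report do
  require "action_cable"
end

report.pretty_print(scale_bytes: true)
```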
Next, there is the server that services ActionCable clients, which can run in different topologies:
- Single-node install: Assuming we stick with a separate process for ActionCable one way or another (i.e. we do not use its in-process mode), then:
  - If we were to stop here and not spend any more time on this, then gitlab-org/gitlab#214787 (closed) suggests that running a separate ActionCable server with 1 master + 1 worker would come down to about 1.2GB in terms of RSS. This is the "nominal" memory consumed from the perspective of the app. In PSS terms, however (i.e. in "real" terms, accounting for OS optimizations like memory page sharing), it would come down to about 740MB, of which about 440MB is shared memory; the rest (called USS) is exclusive to each process, i.e. about 150MB per process. In other words, the total physical memory required for `n` processes can be approximated by the equation `real_mem = n * USS + shared`, i.e. `real_mem = n * 150MB + 440MB` (as a sanity check, `n = 2` recovers the 740MB measured for 1 master + 1 worker; see also the calculator sketch at the end of this section). Some caveats here:
    - Our real-time feature code is currently almost non-existent, i.e. it doesn't utilize much of the GitLab code yet. Doing that will increase memory use due to loading more code and gems.
    - Copy-on-write effects tend to deteriorate over time as processes mature, so the PSS metric tends to go up: USS grows while shared memory shrinks.
    - We might be able to share even more memory than we do currently by forking both cable and web processes from the same parent after front-loading the application code. This is still being explored, but it suggests at least another 100-200MB in real savings.
    - We could look into compartmentalizing our application more, so that we don't have to load gems and app code upfront that are never needed anyway; I think this is likely to be a fairly massive endeavor and should not be done as part of the real-time work. It would likely require its own working group. A smaller step might be to refactor our initializer chain to be more specific to what we're running (gitlab-org/gitlab#215318). I still think that this is a lot of work, and quite risky too, and it is very difficult to estimate upfront what it would buy us in terms of memory saved.
    - For very low resource setups like a Raspberry Pi, we could switch to single-process mode, so we don't even have to pay for the extra worker as we do in a 1+N topology. This increases the complexity of our configuration matrix, which we said early on we're not keen on, but it remains an option.
- Multi-node install: This would be relevant for .com; most of what I wrote under single-node applies, except that the POC from gitlab-org/gitlab!30144 (closed) is probably not something we would do in prod, especially not with our move to k8s, where each server would run in its own pod. So here we would always pay for 1 master + N workers.
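To make the topology trade-offs concrete, here is the approximation from the single-node section as a small calculator (150MB USS per process and 440MB shared are the measured values from gitlab-org/gitlab#214787 and will shift as real feature code is loaded; the topologies are illustrative):

```ruby
# real_mem = n * USS + shared, with measured per-process USS and shared pool
USS_MB    = 150
SHARED_MB = 440

def real_mem_mb(procs)
  procs * USS_MB + SHARED_MB
end

real_mem_mb(1)     # single-process mode  =>  590MB (e.g. Raspberry Pi)
real_mem_mb(2)     # 1 master + 1 worker  =>  740MB (single-node default)
real_mem_mb(1 + 4) # 1 master + 4 workers => 1190MB (multi-node/k8s pod)
```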