Consider writing our own ActionCableSubscriptions

I ran against a number of issues with graphql-ruby's (gqlr) ActionCableSubscriptions implementation in both #402999 (closed) and #402717 (closed). This class is graphql-ruby's transport adapter for Action Cable (it also supports delivery via Pusher and Ably).

The upstream implementation is generally not designed to be extended easily and also has some quirks or design choices that work against us:

  1. It does not allow for custom event payloads. When triggering a sub, gqlr serializes the object (which will become the gql query root) to a JSON string and broadcasts it via ActionCable::Server. However, it provides no hooks for us to send additional meta-data about this event, which is necessary for uses cases like database load balancing.
  2. We double-encode event JSON. Because the gqlr serializer produces a JSON string, and because Action Cable also encodes payloads as JSON strings, we end up double-encoding the event, which causes unnecessary CPU and GC time. Related to 1 and 3.
  3. It sends events as strings not hashes. The way Action Cable was designed is to be Ruby-friendly, with both senders (broadcasters) and receivers (stream_from handlers) passing and receiving Hashes. However, gqlr sends the initial event as a JSON encoded string instead. This makes it hard to work with even if there was a hook to intercept it (see point 1) because you have to either hack its serializer to inject your own payload before it is serialized, or you have to de-serialize it first, inject data, then re-serialize it before passing it on to Action Cable. My suspicion is it does that since the serializer is shared with other push impls, but we don't use or need those.
  4. It only allows using a custom AC coder for receivers. This is probably just a bug/oversight and could easily be fixed upstream, but while gqlr accepts a custom Action Cable Coder in its initializer, it is only passed down to AC when receiving messages, not when sending via broadcast. This makes it impossible to hook into AC's message formatting system as far as GraphQL subs are concerned.
  5. It provides insufficient observability hooks. Our backend performs a lot of work when a broadcast from a GraphQL sub is received, most importantly executing GQL queries in response to it. However, this happens in a nested stream_from hook that is called by Action Cable and runs execute_update, which itself is not instrumented. If for example, we wanted to run this in a Labkit::Context, we have to extend or patch these implementations.
  6. It holds every subscribed Query in process memory. When a subscriber registers, it sends a query string that gqlr instantiates into a Query instance and maps this instance to a unique subscriber ID. This happens for every subscriber (of which we have users * tabs * views_per_tab many), making this very memory inefficient. I have yet to measure how much memory this costs us exactly (it's largely a function of query string size) but it's redundant and wasteful.

As this module is fairly small in size, I think we should explore building our own module that implements the GraphQL::Subscriptions type on top of Action Cable.

Decision log

  • While looking into these issues, I realized that many of them boil down to the same issues, and that we can get 80% there without a full rewrite. I think the better approach will be to:
    • Extend/broaden the existing ActionCableWithLoadBalancing module by hooking into execute_update and execute_all
    • This should allow us to solve the following problems now:
      • Use database load balancing when running queries
      • Start query instrumentation via InstrumentationHelper
      • Inject any other custom payloads we may need going forward
    • It won't address (for now):
      • Double-encoding of payloads
      • Customizing payloads at the Action Cable instead of GraphQL level
      • Optimizing query storage or any other performance related aspects
  • I think having the 80% solution more quickly will also inform better what is left to do, and how to best go about it
  • My plan is to perform the following steps:
    • MR1: Remove feature toggle for load balancing: !119076 (merged)
    • MR2: Refactor ActionCableWithLoadBalancing to not use a custom serializer anymore (!118945 (merged))
    • MR3: Introduce a light-weight instrumentation storage facade around RequestStore: !119765 (merged)
    • MR4: Instrument queries via execute_update: !118614 (closed)
Edited by Matthias Käppler