Action Cable and GraphQL logs are inconsistent or incomplete

Original problem description

We currently support two different ways of running GraphQL queries:

  1. via HTTP POST requests to /api/graphql. These are handled by Action Pack / Action Controller in GraphqlController#execute and since they are ordinary HTTP requests to Rails, they have all the expected telemetry written to production_json
  2. via websockets and Action Cable, particularly GraphqlChannel

The logs we have for 2 are currently confusing and incomplete:

  • We log actions invoked on GraphqlChannel by a websocket client (i.e. Apollo) in log stream 1), but:
    • They use "Action Pack" language like the controller: GraphqlChannel and action: subscribe, which suggests that these are Rails HTTP requests, which they are not. We should use a new set of log labels to clearly discern between HTTP and websockets traffic.
    • They only log actions invoked on a channel i.e. incoming traffic from the client. However, if a subscription fires, it will run GraphQL queries and push the results back to the clients; these are currently not present in this log stream.
  • We log all GraphQL queries in graphql_json. This is a completely different log stream unrelated to a request/response interaction or even websockets in general; they simply log any GraphQL query we run via a tracer plugin. This log stream does include queries firing in response to a sub, but telemetry is incomplete because it does not integrate with our SLI mappers or InstrumentationHelper

We should move to an approach where traffic logs are split by Action Cable vs Action Pack, with both having full telemetry as expected from any form of request (push or pull). We may still want to maintain graphql_json, but I now wonder if we should reduce that to a GraphQL slow query log if we manage to log all queries anyway in production_json.

Decision log

  • After exploring a somewhat complicated approach that tried to achieve complete parity with how web and push transactions are instrumented and logged in !116294 (closed), we decided to take this back to the drawing board due to scalability concerns
  • A second approach is to not solve all problems at once, but make a smaller change that focuses on the most immediate paint point, which is missing instrumentation from queries that run as part of a push event. This is being implemented in !118614 (closed)
Edited by Matthias Käppler