Action Cable and GraphQL logs are inconsistent or incomplete
Original problem description
We currently support two different ways of running GraphQL queries:
- via HTTP POST requests to
/api/graphql
. These are handled by Action Pack / Action Controller inGraphqlController#execute
and since they are ordinary HTTP requests to Rails, they have all the expected telemetry written toproduction_json
- via websockets and Action Cable, particularly
GraphqlChannel
The logs we have for 2 are currently confusing and incomplete:
- We log actions invoked on
GraphqlChannel
by a websocket client (i.e. Apollo) in log stream 1), but:- They use "Action Pack" language like the
controller: GraphqlChannel
andaction: subscribe
, which suggests that these are Rails HTTP requests, which they are not. We should use a new set of log labels to clearly discern between HTTP and websockets traffic. - They only log actions invoked on a channel i.e. incoming traffic from the client. However, if a subscription fires, it will run GraphQL queries and push the results back to the clients; these are currently not present in this log stream.
- They use "Action Pack" language like the
- We log all GraphQL queries in
graphql_json
. This is a completely different log stream unrelated to a request/response interaction or even websockets in general; they simply log any GraphQL query we run via a tracer plugin. This log stream does include queries firing in response to a sub, but telemetry is incomplete because it does not integrate with our SLI mappers orInstrumentationHelper
We should move to an approach where traffic logs are split by Action Cable vs Action Pack, with both having full telemetry as expected from any form of request (push or pull). We may still want to maintain graphql_json
, but I now wonder if we should reduce that to a GraphQL slow query log if we manage to log all queries anyway in production_json
.
Decision log
- After exploring a somewhat complicated approach that tried to achieve complete parity with how web and push transactions are instrumented and logged in !116294 (closed), we decided to take this back to the drawing board due to scalability concerns
- A second approach is to not solve all problems at once, but make a smaller change that focuses on the most immediate paint point, which is missing instrumentation from queries that run as part of a push event. This is being implemented in !118614 (closed)
Edited by Matthias Käppler