Skip to content

Instrument GQL queries triggered from subscriptions

Matthias Käppler requested to merge 402717-instrument-push-queries into master

What does this MR do and why?

We do not currently have any standard telemetry in logs for GraphQL queries that execute in response to a push event coming from a GraphQL subscription. We only have this for queries that execute within a web controller transaction i.e. an ordinary HTTP request-response conversation. GraphQL subscriptions use Action Cable (a websockets server) to service these payloads, so we do not benefit from all the standard request instrumentation we have for HTTP requests. This means we are currently flying blind when it comes to understanding performance characteristics of GraphQL subscription queries since they only have a single duration_s field, with none of the breakdowns we provide for HTTP requests (db_duration_s, gitaly_s, etc.): https://log.gprd.gitlab.net/goto/0e1cbf20-ea58-11ed-a017-0d32180b1390

This MR is a rewrite of !116294 (closed), which was a more exhaustive attempt at bringing parity to instrumenting websockets and HTTP requests, but it was deemed too expensive in terms of additional log volume it would create. Instead, I narrowed this change to solve for a more specific problem, which is to not instrument websocket traffic in general, but only GraphQL query executions specifically, which is our primary use case for Action Cable at GitLab currently. So think of it as a "90% there" solution.

Specifically, this MR:

  • Builds on !118945 (merged) to enrich GraphQL subscription event payloads with a correlation ID that is passed from the sending node (the publisher of the event) to the receiving node (the consumer of the event, where the GraphQL query will execute). This allows us to correlate query executions with the trigger that caused them.
  • Uses our existing InstrumentationHelper to start an instrumentation "session" on the node that receives the event. This only happens in response to parsing an event received through Action Cable.
  • Merges these instrumentation data into the log payload used by the GraphQL query tracer, which emits log events to graphql_json.log. Again, this only happens if the session was not for a query that ran as part of a Rack request, in which case we already have this kind of instrumentation in the ordinary Rails request logs written by lograge.

Screenshots or screen recordings

If I change an issue's epic link, two things happen:

  1. We send a GraphQL mutation to Rails to change the epic. This is an HTTP post.
  2. We push this change back into the frontend via Action Cable. This runs a GraphQL query.

Step 1 is already fully instrumented via the request log, because it's just an HTTP request. This happens on the producer node:

{
  "method": "POST",
  "path": "/api/graphql",
  "controller": "GraphqlController",
  "action": "execute",
  "status": 200,
  "correlation_id": "01GZJZ4Q22R9QS529F5JE2AN01",
  "db_count": 138,
  "db_duration_s": 0.26659,
  "duration_s": 0.89589
  ...
}

Note that we already have timing break downs such as db_duration_s etc.

This also generates a second log event, which is emitted by the GraphQL tracer:

{
  "severity": "INFO",
  "time": "2023-05-04T08:47:59.429Z",
  "correlation_id": "01GZJZ4Q22R9QS529F5JE2AN01",
  "trace_type": "execute_query",
  "duration_s": 0.8774472879995301,
  "operation_name": "projectIssueEpicMutation",
  ...
}

Note how this doesn't have timing break downs -- that's OK, because it has a correlation_id that matches the HTTP request, which does have this data, so we don't need to duplicate it here.

However, this leaves the problem of instrumentation for the GraphQL query that runs in response of this mutation generating a push event. Previously, this had neither a matching correlation ID nor any instrumentation added. This happens on the receiver node. Now it does:

{
  "severity": "INFO",
  "time": "2023-05-04T08:47:59.296Z",
  "correlation_id": "01GZJZ4Q22R9QS529F5JE2AN01",
  "trace_type": "execute_query",
  "duration_s": 0.03722739800286945,
  "operation_name": "issuableEpicUpdated",
  "db_count": 21,
  "db_replica_duration_s": 0.064,
  ...
}

Note as well that its correlation ID matches that of the mutation; on master, this correlation ID would be random, and never match anything else.

How to set up and validate locally

  • Go to a page that uses GraphQL subs, e.g. the single issue page
  • Change labels, the assignee, or the epic link in the issue sidebar (these all trigger GraphQL subscription events)
  • look at outputs from log/development_json and log/graphql_json to see if the data exists and matches

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #402717

Edited by Matthias Käppler

Merge request reports