Graphql query error SLI

Description

When a GraphQL query of a batched request fails with an unexpected exception (causing a 500), we want to attribute the failure to the right query. We also want to create a sli metric for query failures. This will likely all be handled within the MetricsTracer (introduced in this MR)

Things to investigate:

  • What happens in the MetricsTracer when 1 query fails and the others in a batch would pass?

Context

Identified in this comment

One thing I haven't thought of before (my bad): we currently track errors and apdex separately. Take at the (tiny bit complicated RequestRackMiddleware: We only record the performance (which is what this SLI is) if the query was successful.

Is there a way we can check if the data[:query] was successful here? Will this get triggered when the query raised an exception, and does the kind of exception matter?

I'm happy to iterate on this, what do you think of the following:

  1. Rename this method to record_apdex_if_needed
  2. Skip incrementing the counters if the query wasn't successful (but looking into how those exceptions are handled).
  3. Create an issue for adding an error rate SLI or success rate for GraphQL queries.

For 3, the SLO definition framework only supports error rates, so I think choosing error rates here will be the fastest way forward over there. So we might have to add support for this inside Gitlab::Metrics::Sli. I'll create that issue and crosslink.

And the reply:

This is a really good question!

So, if there's an unknown exception thrown (which will result in a 500), then yield will blow up and we don't ever hit increment_query_sli. But! If one of these Gitlab::Graphql::Errors are thrown, then we will still increment_query_sli... From what I can tell, these Gitlab::Graphql::Errors more or less represent 400 statuses. Based on record_duration_for_status? I'm guessing that we actually want to record apdex for 400's... So I think we're good here 🤔

WDYT?

Edited by Paul Slaughter