Graphql query error SLI
Description
When a GraphQL query of a batched request fails with an unexpected exception (causing a 500), we want to attribute the failure to the right query. We also want to create a sli metric for query failures. This will likely all be handled within the MetricsTracer
(introduced in this MR)
Things to investigate:
- What happens in the
MetricsTracer
when 1 query fails and the others in a batch would pass?
Context
Identified in this comment
One thing I haven't thought of before (my bad): we currently track errors and apdex separately. Take at the (tiny bit complicated
RequestRackMiddleware
: We only record the performance (which is what this SLI is) if the query was successful.Is there a way we can check if the
data[:query]
was successful here? Will this get triggered when the query raised an exception, and does the kind of exception matter?I'm happy to iterate on this, what do you think of the following:
- Rename this method to
record_apdex_if_needed
- Skip incrementing the counters if the query wasn't successful (but looking into how those exceptions are handled).
- Create an issue for adding an error rate SLI or success rate for GraphQL queries.
For 3, the SLO definition framework only supports
error
rates, so I think choosing error rates here will be the fastest way forward over there. So we might have to add support for this insideGitlab::Metrics::Sli
. I'll create that issue and crosslink.
And the reply:
This is a really good question!
So, if there's an unknown exception thrown (which will result in a
500
), thenyield
will blow up and we don't ever hitincrement_query_sli
. But! If one of theseGitlab::Graphql::Errors
are thrown, then we will stillincrement_query_sli
... From what I can tell, theseGitlab::Graphql::Errors
more or less represent 400 statuses. Based onrecord_duration_for_status?
I'm guessing that we actually want to record apdex for 400's... So I think we're good here🤔 WDYT?