GraphQL subscription `#unauthorized!` raises an ExecutionError
Summary
While investigating https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/incident-management/-/issues/436 for a GitLab Dedicated customer, we realized that `#unauthorized!` in GraphQL subscriptions raises a GraphQL::ExecutionError.
That error is counted in our SLIs, which contributes to SLO violations and pages on-call engineers. Normally we do not treat things like lacking authorization to perform an action, or incorrect input, as server-side errors. They are client-side errors that we don't need to be paged for, as there is no action we can take to fix them.
Impact
- This error pages the GitLab Dedicated on-call engineer, who must investigate each page and rule out this issue
- We are currently using a silence to avoid that wasted time, but the silence could mask real errors
Recommendation
Instead of raising an execution error, we should ensure that the error raised is one that is treated as a client-side error. I was hoping to find a way to explain exactly how (and may update the issue if I do later on), but we are also seeing errors in the logs that do NOT contribute to the SLI, like these:
{
  "message": "Field 'featureFlags' doesn't exist on type 'Metadata'",
  "locations": [
    {
      "line": 4,
      "column": 7
    }
  ],
  "path": [
    "query featureFlagsEnabled",
    "metadata",
    "featureFlags"
  ],
  "extensions": {
    "code": "undefinedField",
    "typeName": "Metadata",
    "fieldName": "featureFlags"
  }
},
{
  "message": "Variable $names is declared by featureFlagsEnabled but not used",
  "locations": [
    {
      "line": 2,
      "column": 3
    }
  ],
  "path": [
    "query featureFlagsEnabled"
  ],
  "extensions": {
    "code": "variableNotUsed",
    "variableName": "names"
  }
}
So whatever exception is used to emit the above errors does not contribute to the SLI. Conversely, whatever increments the Prometheus metric gitlab_sli_graphql_query_error_total will page us (see Verification below).
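As a rough illustration of the distinction being asked for, the sketch below counts only non-client errors toward an error total. It is not GitLab's implementation: `ExecutionError`, `ClientSideError`, `UnauthorizedError`, `ErrorSli`, and the `graphql:exampleQuery` endpoint are hypothetical stand-ins, and the in-memory hash only plays the role of gitlab_sli_graphql_query_error_total.

```ruby
# Hypothetical stand-ins for illustration only; these are NOT the real
# graphql-ruby or GitLab classes.
class ExecutionError < StandardError; end     # plays the role of a server-side error
class ClientSideError < StandardError; end    # assumed base class for client-side errors
class UnauthorizedError < ClientSideError; end

# Hypothetical in-memory counter playing the role of
# gitlab_sli_graphql_query_error_total, keyed by endpoint_id.
class ErrorSli
  attr_reader :counts

  def initialize
    @counts = Hash.new(0)
  end

  # Runs the block for one operation. Every failure is still returned to the
  # caller as an error payload, but only non-client errors are counted.
  def track(endpoint_id)
    result = yield
    { data: result }
  rescue ClientSideError => e
    # Client-side failure: reported to the caller, not counted toward the SLI.
    { errors: [{ message: e.message }] }
  rescue StandardError => e
    # Anything else is treated as a server-side failure and counted.
    @counts[endpoint_id] += 1
    { errors: [{ message: e.message }] }
  end
end

sli = ErrorSli.new
sli.track("graphql:getStateSubscription") { raise UnauthorizedError, "unauthorized" }
sli.track("graphql:exampleQuery") { raise ExecutionError, "boom" }
# At this point only "graphql:exampleQuery" has been counted.
```

The point of the sketch is that the client still receives an error in both cases; the only difference is whether the failure feeds the metric that pages.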
Verification
Unfortunately, I'm not sure how well these recommendations map to GitLab.com, but here is a PromQL query used to get errors by endpoint:
sum by (endpoint_id) (
  rate(gitlab_sli_graphql_query_error_total{endpoint_id!="graphql:unknown",job="gitlab-rails"}[$__interval])
)
> 0
This query doesn't directly show the unauthorized errors, but they are occurring on the following operations:
- graphql:approvalRulesApprovalStateUpdatedEE
- graphql:diffGeneratedSubscription
- graphql:getState
- graphql:getStateSubscription
- graphql:getTitleSubscription
- graphql:issuableLabelsUpdatedEE
- graphql:mergeChecksSubscrption
- graphql:mergeRequestApprovalStateUpdatedEE
- graphql:mergeRequestPrepared
- graphql:mergeRequestReviewersUpdated
- graphql:readyToMergeSubscription
If these errors did NOT page and did not contribute to SLIs, we would expect those operations to no longer appear in the results of the query above (since their error rate would be 0).
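That expectation could be checked mechanically. The sketch below assumes the query results have already been fetched and parsed into a Hash of endpoint_id to error rate; the `still_erroring` helper and the `rates` input shape are hypothetical, and the operation names are copied verbatim from the list above.

```ruby
# Operation names copied verbatim from the list in this issue.
UNAUTHORIZED_SUBSCRIPTION_ENDPOINTS = %w[
  graphql:approvalRulesApprovalStateUpdatedEE
  graphql:diffGeneratedSubscription
  graphql:getState
  graphql:getStateSubscription
  graphql:getTitleSubscription
  graphql:issuableLabelsUpdatedEE
  graphql:mergeChecksSubscrption
  graphql:mergeRequestApprovalStateUpdatedEE
  graphql:mergeRequestPrepared
  graphql:mergeRequestReviewersUpdated
  graphql:readyToMergeSubscription
].freeze

# Hypothetical helper: `rates` is assumed to be a Hash of
# endpoint_id => error rate built from the PromQL query results.
# Returns the listed operations that still show a non-zero error rate.
def still_erroring(rates)
  UNAUTHORIZED_SUBSCRIPTION_ENDPOINTS.select { |id| rates.fetch(id, 0) > 0 }
end
```

An empty return value would match the expectation; any remaining names are operations that still contribute to the SLI, though not necessarily because of unauthorized errors.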