Change reporting of Service Desk Sidekiq Failures
Summary
Error budgets for Certify dropped substantially on 28th September because of a change in the way Sidekiq metrics are collected (see gitlab-com/runbooks!5024 (merged)).
Detail
EmailReceiver errors are attributed to Service Desk, which belongs to Certify. Therefore, an increase in errors in this component spends error budget belonging to Certify.
There was no measurable increase in the number of exceptions created by this component on the 28th. However, it appears that exception types that were not previously counted are now being counted.
Many exceptions thrown by the EmailReceiver (and ServiceDeskReceiver) are legitimate and not within our control to reduce, for example:
- UserNotAuthorizedError: the token used (in a reply) no longer belongs to an authorized user
- NoteableNotFoundError: the thread the email is in reply to has been deleted
- ProjectNotFound: the project has been moved, deleted, or never existed
- EmailTooLarge: the email is too large to process
We really only want to count runtime exceptions related to code changes.
Common exceptions seen in production are visible in this chart:
Generally, exceptions that are caused by users are caught by the FailureHandler, which may serve as a useful list.
Proposal
Make a change application-side to exclude exceptions of specific types from being counted in Error Budgets.
The change can be made directly in the EmailReceiver.
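As a rough illustration of the application-side option, the sketch below rescues a fixed list of user-caused error types so that they do not escape the job, while any other exception still propagates. It assumes that a job only counts against the error budget when an exception escapes it. Every name here (SketchEmail, SketchEmailReceiver, USER_CAUSED_ERRORS, handled) is a hypothetical stand-in, not the real EmailReceiver code.

```ruby
# Hypothetical stand-ins for the real error classes; only the short names
# are taken from the list of examples in this issue.
module SketchEmail
  ProcessingError = Class.new(StandardError)
  UserNotAuthorizedError = Class.new(ProcessingError)
  ProjectNotFound = Class.new(ProcessingError)
  EmailTooLarge = Class.new(ProcessingError)
end

# Hypothetical receiver, not the real EmailReceiver worker.
class SketchEmailReceiver
  # Errors caused by the sender or by deleted data rather than by code changes.
  USER_CAUSED_ERRORS = [
    SketchEmail::UserNotAuthorizedError,
    SketchEmail::ProjectNotFound,
    SketchEmail::EmailTooLarge
  ].freeze

  attr_reader :handled

  def initialize
    @handled = []
  end

  # Runs the given block. User-caused errors are recorded and swallowed, so the
  # job finishes normally. Any other exception propagates to the caller and
  # would still be counted as a failure.
  def perform
    yield
    :ok
  rescue *USER_CAUSED_ERRORS => e
    @handled << e.class
    :handled
  end
end
```

Keeping the excluded types in one constant makes the list easy to compare against whatever the FailureHandler already treats as user-caused.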
It could also be made on the measurement side by updating the metrics catalog, where Gitlab::Email::AutoGeneratedEmailError is already excluded.