Add Watch for /oauth/token 400 spike
Context
In gitlab-org/gitlab#565912 (comment 2812574236), we added two new Elastic Watchers for OAuth 4xx error spikes.
For convenience, we added them via UX so we can easily tweak and disable them if needed.
They have worked fine without false positives for more than a week, even though the threshold is relatively low, close to the value where it may start producing false positives (refer to the linked issue thread for more details).
At this point, it's worth turning them into infra-as-code to comply with our alerting guidelines. Keeping them in code provides a paper trail of changes, visibility of the watchers for SREs, protection from accidental deletion via the UX, and all the other benefits that come with that approach.
Also, this MR will be a good place to discuss the criteria and adjust the approach, if needed.
The change
- Adds Watch for /oauth/token endpoint 400s spike (from UX: https://log.gprd.gitlab.net/app/management/insightsAndAlerting/watcher/watches/watch/authn_oauth_token_400_spike/edit)
- Adds Watch for oauth/token/info endpoint 401s spike (from UX: https://log.gprd.gitlab.net/app/management/insightsAndAlerting/watcher/watches/watch/authn_oauth_token_info_401_spike/edit)
I am keeping them separate. While these two behave rather similarly, we are still iterating, so it will be easier to update (or maybe even disable) them individually if needed. Other error codes and other endpoints show different behaviour patterns, so we will need to find new criteria for them.
Notes
Eventually, we will need to delete the Watches from UX (if I understand correctly).
But we also can keep them both until the first "real" alert (if such occurs) to make sure the "codified" one works.
I think the trickiest aspect of watchers-as-code is that they are really hard to "test" (you need to redeploy to adjust the criteria temporarily, or come up with some other funky way), while the UX allows for fast iteration and easy verification (e.g. I can quickly add my work email as a receiver).
Therefore, I don't have 100% confidence that this MR works, so I need to rely on the SREs' careful review.
Because of that, keeping the UX Watcher may be OK?
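One possible way to gain some confidence without redeploying might be Elasticsearch's Execute Watch API (`POST _watcher/watch/<watch_id>/_execute`), which can run a stored watch on demand with simulated actions and a substituted input. The sketch below only builds such a request and does not send it. It assumes the codified watch keeps the same ID as the UX one, and the `alternative_input` payload shape is hypothetical, since it depends on how the watch's input and condition are actually defined.

```python
import json
from urllib.parse import quote


def build_execute_watch_request(watch_id, alternative_input=None, simulate_actions=True):
    """Build (method, path, body) for Elasticsearch's Execute Watch API.

    Nothing is sent over the network; the caller decides how to submit it
    (Kibana Dev Tools, curl, or an Elasticsearch client).
    """
    body = {
        # Evaluate the real condition instead of bypassing it.
        "ignore_condition": False,
        # Do not store this test run in the watch history.
        "record_execution": False,
    }
    if simulate_actions:
        # "simulate" evaluates every action without actually running it,
        # so no Slack message or email is delivered.
        body["action_modes"] = {"_all": "simulate"}
    if alternative_input is not None:
        # Replaces the watch's own input payload for this run only.
        body["alternative_input"] = alternative_input
    path = f"/_watcher/watch/{quote(watch_id, safe='')}/_execute"
    return "POST", path, body


# Watch ID taken from the UX URL above; the payload is a HYPOTHETICAL shape
# meant to push the condition over its threshold.
method, path, body = build_execute_watch_request(
    "authn_oauth_token_400_spike",
    alternative_input={"hits": {"total": 100000}},
)
print(method, path)
print(json.dumps(body, indent=2))
```

Pasting the printed path and body into Kibana Dev Tools would show whether the condition is met and what each action would have sent, without changing the deployed thresholds.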
Screenshots
This is how it looks in Slack (with the threshold artificially lowered to trigger it from the UX):
Issue
Related to gitlab-org/gitlab#577798