Add Watch for /oauth/token 400 spike

Context

In gitlab-org/gitlab#565912 (comment 2812574236), we added two new Elastic Watchers for OAuth 4xx error spikes.

For convenience, we added them via the UI so that we could easily tweak and disable them if needed.

They have worked fine without false positives for more than a week, and the threshold is set relatively low, close to the value at which it may start producing false positives (refer to the linked issue thread for more details).

At this point, it's worth turning them into infra-as-code to comply with our alerting guidelines. Keeping them in code provides a paper trail of changes, visibility of the watchers for SREs, protection from accidental deletion via the UI, and all the other benefits that come with that approach.

Also, this MR will be a good place to discuss the criteria and adjust the approach, if needed.

The change

I am keeping them separate. Although the two behave rather similarly, we are still iterating, so it will be easier to update (or maybe even disable) them individually if needed. Other error codes and other endpoints show different behaviour patterns, so we will need to find new criteria for them.

🗒️ The criteria for the alert are explained in gitlab-org/gitlab#565912 (comment 2812574236).
🗨️ I have very little experience with alerting and monitoring best practices, so I am happy to discuss and iterate!

Notes

Eventually, we will need to delete the Watches from the UI (if I understand correctly).
But we can also keep both until the first "real" alert (if one occurs) to make sure the "codified" one works.
I think the trickiest aspect of watchers-as-code is that they are really hard to "test" (you need to redeploy to adjust the criteria temporarily, or come up with some other funky workaround), while the UI allows for fast iteration and easy verification (e.g. I can quickly add my work email as a receiver).
Therefore, I don't have 100% confidence that this MR works, so I need to rely on the SREs' careful review 👀

Because of that, keeping the UI Watcher may be OK?
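
One possible way to narrow the testing gap, offered only as a sketch and not something this MR does: Elasticsearch's Watcher has an Execute Watch API (`POST _watcher/watch/_execute`) that accepts an inline watch, an `alternative_input` payload in place of the real search result, and `action_modes` to simulate actions. Assuming that API is reachable on our cluster (I have not verified this), a request body could be built as below. The watch itself is hypothetical: the index pattern, interval, threshold, and action are placeholders, not the values from this MR.

```python
import json


def build_execute_request(watch: dict, simulated_hits: int) -> dict:
    """Build a request body for POST _watcher/watch/_execute with an inline watch."""
    return {
        # The watch definition under test, passed inline instead of by ID.
        "watch": watch,
        # Replaces the real search result, so the condition sees this payload.
        "alternative_input": {"hits": {"total": simulated_hits}},
        # Simulate every action so that no real notification is sent.
        "action_modes": {"_all": "force_simulate"},
    }


# Hypothetical watch: all values below are placeholders for illustration only.
example_watch = {
    "trigger": {"schedule": {"interval": "5m"}},
    "input": {
        "search": {
            "request": {
                "indices": ["example-index-*"],
                "body": {"query": {"match_all": {}}},
            }
        }
    },
    "condition": {"compare": {"ctx.payload.hits.total": {"gte": 100}}},
    "actions": {"log_only": {"logging": {"text": "simulated OAuth 400 spike"}}},
}

if __name__ == "__main__":
    body = build_execute_request(example_watch, simulated_hits=500)
    print(json.dumps(body, indent=2))
```

Sending this body would evaluate the condition against the fake hit count without waiting for a real spike and without redeploying, though it does not prove that the deployed watcher's real query and Slack action work end to end.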

Screenshots

This is how it looks in Slack (with an artificially lowered threshold to trigger it from the UI):

Screenshot_2025-10-21_at_11.02.24

Issue

Related to gitlab-org/gitlab#577798

Edited by Aleksei Lipniagov
