`authorized_projects` latency SLO has unclear resolution
This alert fired a lot this week:
- 2019-12-27: https://gitlab.slack.com/archives/CD6HFD1L0/p1577406914071800
- 2019-12-26: https://gitlab.slack.com/archives/CD6HFD1L0/p1577396579071300 / https://gitlab.slack.com/archives/CD6HFD1L0/p1577391434070200 / https://gitlab.slack.com/archives/CD6HFD1L0/p1577388854068000 / https://gitlab.slack.com/archives/CD6HFD1L0/p1577382584066600
- 2019-12-24: https://gitlab.slack.com/archives/CD6HFD1L0/p1577205497060200
It wasn't clear to @msmiley whether this queue should have some manual action taken, or whether we should just let it run: https://gitlab.slack.com/archives/C101F3796/p1577424418310800 We should let it run, because if we don't, users may end up with inconsistent access levels - access to things they shouldn't have, or no access to things they need.
From the alert link, I'm not sure how useful the current threshold is: https://dashboards.gitlab.net/d/alerts-worker_apdex_violation/alerts-worker-apdex-violation-alert?orgId=1&from=now-7d&to=now&panelId=2&tz=UTC&var-environment=gprd&var-queue=authorized_projects&var-threshold=1
(The peak queue latency over that week was under 10 seconds.)
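For comparison, the queue's live latency is easy to check by hand. A minimal sketch using Sidekiq's standard queue API (`Sidekiq::Queue#latency` reports the age, in seconds, of the oldest job in the queue, which is effectively what the apdex threshold measures):

```ruby
# From a Rails console on a node with Sidekiq loaded:
queue = Sidekiq::Queue.new('authorized_projects')

puts "size:    #{queue.size}"
puts "latency: #{queue.latency.round(2)}s"
```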
From https://log.gitlab.net/goto/de9c854d556558f0c616cadc9b96a04e we can see that we had many more of these jobs than usual during those periods.
Unfortunately, it's hard to tell from that what causes these jobs, because the job's only argument is the user. The worker can run in a number of cases:

- a user is added to or removed from a group or project;
- a group has a new project added;
- a group has a link added;
- a project is shared with a group;
- etc.

In our logging and metrics, these all show up the same way. Note also that when an individual user is added or removed, that's one job, but when a project is added to a group, we schedule a job for each user in that group.
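To make that concrete, here's a rough sketch of how the two cases end up looking identical (call sites are simplified, and `bulk_perform_async` is GitLab's bulk-scheduling helper as I understand it):

```ruby
# Case 1: a single user is added to or removed from a project - one job.
AuthorizedProjectsWorker.perform_async(user.id)

# Case 2: a project is added to a group - one identical-looking job per
# group member. In logging and metrics these are indistinguishable from
# the single-user case, so a queue spike has no obvious cause.
AuthorizedProjectsWorker.bulk_perform_async(
  group.members.map { |member| [member.user_id] }
)
```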
Also, we have some cases where we are especially latency-sensitive: access removal. We handle those with our JobWaiter class, which you can see in the logs (but not in an easily-filterable way).
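Roughly, the waited-on pattern looks like this (a sketch of JobWaiter as I understand it; the variable names are illustrative):

```ruby
# The caller creates a waiter and passes its key as each job's second
# argument, then blocks until the workers report back. This is why access
# removal is latency-sensitive: something is actively waiting on the queue.
waiter = JobWaiter.new(user_ids.size)

user_ids.each do |user_id|
  # The waiter key as a second argument marks this as a 'waited-on' job.
  AuthorizedProjectsWorker.perform_async(user_id, waiter.key)
end

waiter.wait(10) # block for up to 10 seconds

# Each worker calls JobWaiter.notify(waiter_key, jid) when it finishes,
# which is the log line you can see but can't easily filter on.
```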
How can we make this alert actually useful to SREs? I think we could consider:
- Updating the Sidekiq Queue Out of Control runbook - or some other documentation - to have queue-based information, not just task-based information. If the `authorized_projects` queue is large, the current runbook tells you how to either kill jobs or spin up more processes (and that latter part may not match our current Sidekiq configuration!). But how do you know which to do?
- Splitting the 'waited-on' jobs (those with a second argument) into their own namespaced queue. We can still process them in the same queue, but we also add a way of distinguishing the two cases.
- Somehow adding metadata to the Sidekiq jobs about what the cause is. For instance, if we're updating a user's authorised projects because a project was added to the group, that information isn't critical to the worker, but it might be useful in determining why the queue grew so much.
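For that last idea, a hypothetical sketch (the `source` argument and its values are invented for illustration; they're not part of the current worker, and in practice this would have to coexist with the JobWaiter key argument mentioned above):

```ruby
class AuthorizedProjectsWorker
  include ApplicationWorker

  # source is purely informational, e.g. 'member_added',
  # 'project_added_to_group', 'group_link_created'. The worker ignores it
  # apart from logging, so existing call sites keep working.
  def perform(user_id, source = nil)
    Sidekiq.logger.info("refreshing authorized projects user_id=#{user_id} source=#{source}")

    User.find_by(id: user_id)&.refresh_authorized_projects
  end
end

# Call sites annotate the cause:
AuthorizedProjectsWorker.perform_async(user.id, 'project_added_to_group')
```

That would let us group the logs by cause when the queue grows, without changing the worker's behaviour.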