Improve AuthorizedProjectsWorker when a project has a large number of members
AuthorizedProjectsWorker (authorized_projects
) is responsible for refreshing a user's permissions when their access changes. Some example cases where this happens (non-exhaustive):
- User is added to / removed from a project / group.
- A group a user is a member of has a new project added.
- A project a user is a member of is moved.
- A group is shared with a group.
- A project is shared with a group.
Any time those happen, this worker is scheduled for all affected users. The worker itself only runs for a single user, and recomputes their entire permissions set. This worker is very fast, very easy to reason about, and is known to work well after some initial bumps when we introduced it.
Problem
We have too many jobs in this queue. Even though the job is correct, fast, and well-structured, the sheer queue size makes it hard for it to be attributed as latency-sensitive, which it definitely should be (see below).
#125 (closed) demonstrates this, as do gitlab-org/gitlab#20126 (closed), https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8826, and many more.
Complication: waited-on jobs
Because this worker deals with user access, we want to make sure it's actually done what it's supposed to do. That means that if you remove someone from a project, you don't want to find out that they actually had access for another five minutes while we waited for the worker to run.
To try to handle this tension, we have the WaitableWorker concern. This allows a web request to schedule some Sidekiq jobs and then wait for their completion before returning. By default it waits for 10 seconds, and by default we wait when we refresh access.
Waiting on these jobs adds a second argument to the worker, which makes it hard to deduplicate (#42 (comment 266872492)).
Proposals
Track job waiter timeouts
In #96 (closed), I propose adding some logging that lets us filter by waited-on jobs. I now think that's wrong; or at least, less useful - more useful would be to know how often we complete all waited-on jobs before the timeout.
Based on the numbers in #125 (closed), we often have 20% of jobs failing to complete in 10 seconds, which is definitely going to time out (the default timeout is 10s). If we're hitting that timeout, we might want to reconsider waiting on these jobs at all.
Perform more jobs inline
Currently, we say that if we get three or fewer jobs to wait on, we don't actually use Sidekiq at all, we just do the work immediately: https://gitlab.com/gitlab-org/gitlab/-/blob/v12.7.0-ee/app/workers/concerns/waitable_worker.rb#L9-10
We could increase this limit, but we'd need it to be customisable by worker: for AuthorizedProjectsWorker an increase may be fine, but for other works it might not. (I think the GitHub import workers also use this concern.)
Batch jobs together
As a corollary to the above, we could consider chunking jobs together. We could have a BatchAuthorizedProjectsWorker that takes, say, 100 user IDs, and performs them all in a single job. The considerations here are:
- Batch size: too small and it's not effective, too big and the worker might be too slow to be marked as latency-sensitive.
- Error handling: we'd need to track which - if any - user IDs we failed to update and schedule a new job to handle those, but we'd need to be careful to not just create an infinite loop (as it would be a new job, not retrying an existing one).
Specialise jobs for specific cases
When a project is added to a group with 1,000 members, we'll refresh each user's projects individually. This is true now, and under the proposal above. We could also consider not refreshing everything, but instead (say) a single bulk insert to add N users to 1 project. We'd need to do this on a case-by-case basis, depending on the context we're refreshing the authorizations in.
I put this proposal last because I consider it the biggest risk. We had several issues with the correctness of this job when it was first introduced, and I'm not aware of any since.