Batch propagation of instance-wide services to groups and projects
The implementation of propagating services from the instance level to groups and projects in !40717 (merged) currently starts a single job that propagates a service to all groups and projects on the instance.
As discussed in the merge request, this causes several problems:
- The queries using anti-joins can take a long time when only a few projects are missing the integration, because most rows have to be scanned before the limit is reached
- Creating and deleting integrations for an unbounded number of projects/groups can take a long time, causing the job to time out and never finish
To work around this, we've suggested batching the processing of projects and groups.
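As a rough illustration of the batching idea (this is a plain-Ruby sketch, not the actual MR code): instead of one job covering every record, a parent job could slice the ID space into fixed-size ranges and enqueue one worker per range. The `batch_ranges` helper below is hypothetical; in Rails this would typically use `each_batch`.

```ruby
# Hypothetical sketch: split an ID range into fixed-size batches so each
# scheduled job only ever touches a bounded number of rows.
BATCH_SIZE = 100

def batch_ranges(min_id, max_id, batch_size)
  (min_id..max_id).step(batch_size).map do |start|
    (start..[start + batch_size - 1, max_id].min)
  end
end

# Each resulting range would be passed to its own Sidekiq worker.
batch_ranges(1, 250, BATCH_SIZE)
```

Because every job's workload is bounded by the batch size, no single job risks exceeding the Sidekiq runtime budget.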
The following discussions from !40717 (merged) should be addressed:
> @arturoherrero @sabrams The anti-join plan looks fine if we can find relevant 100 records quickly enough. I'd expect though that this has varying timings when most namespaces have a corresponding integration already. I'd suggest to check if this still works out in the worst-case (all namespaces have a corresponding integration) and what the timings are then. Perhaps you can use database labs for this and create records in `services` for all namespaces?
>
> It's all not too bad, I think, given this works on two indexes. I'd still suggest to check if that's not causing too much trouble.
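The worst case flagged above can be illustrated with a plain-Ruby stand-in for the anti-join (the real query runs in PostgreSQL; the names here are illustrative): when almost every namespace already has a service record, finding the first 100 without one forces a scan over nearly the entire set.

```ruby
require 'set'

# Illustrative worst case: 1,000 namespaces, of which all but two already
# have a corresponding service record.
namespaces = (1..1_000).to_a
with_integration = (1..998).to_set

# The "anti-join": namespaces with no matching service. Even though we only
# want 100 results, nearly every row must be visited to find the two hits.
missing = namespaces.lazy.reject { |id| with_integration.include?(id) }.first(100)
```

The `LIMIT` bounds the result set, not the number of rows scanned, which is why the timing varies with how many namespaces already have the integration.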
- @reprazent started a discussion: (+7 comments)

> This is not really new for this MR, but should we parallelize this for both projects and groups (so scheduling a few jobs that each insert/update a few batches of services)?
>
> I know we're not likely to start using this on GitLab.com (or are we?), but for huge instances this could mean an unbounded number of inserts/updates in a single job, which might cause the job to exceed the 5m maximum runtime: https://docs.gitlab.com/ee/development/sidekiq_style_guide.html#job-urgency.
>
> Do you have an idea of how large the instances are this feature is intended for?
- @abrandl started a discussion:

> @arturoherrero Just calling this out specifically in case that isn't clear already - this `LIMIT` doesn't solve the anti-join performance problem. Worst case, there are no records at all qualifying for the anti-join - which means we'd still scan all groups to determine this (which takes a long time).
>
> The batch approach we discussed would take a fixed number of `Service` records and determine if any of these qualify for the anti-join. We would have to iterate all batches for `services` explicitly and worst case - no records qualify for any of those queries. However, the number of records being checked in each of those queries is constant (we control the number of `Service` records, i.e. the batch size) - which means we can expect the same runtime for all batches, irrespective of the situation.
>
> This is different from what we currently have here, because currently, we only control the maximum size of the result set - not the underlying `services` records being scanned.
>
> That is unless I'm completely misreading things here, of course. 😄
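The constant-work-per-batch property described in that discussion can be sketched in plain Ruby (the method and data below are hypothetical, standing in for batched SQL queries): each query inspects exactly one fixed-size batch, so every iteration does a bounded amount of work even when no batch yields any qualifying records.

```ruby
# Sketch: restrict the "anti-join" to one batch of ids at a time.
# Each slice checks at most batch_size records, so the per-query cost is
# constant regardless of how many records end up qualifying overall.
def qualifying_in_batches(service_ids, existing_ids, batch_size)
  service_ids.each_slice(batch_size).flat_map do |batch|
    batch - existing_ids # per-batch anti-join: ids with no existing record
  end
end
```

Worst case, every batch returns an empty set, but we still issue a predictable number of equally sized queries rather than one unbounded scan.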