Enqueuer loop mode
🔥 Problem
From our observations in production, Enqueuer jobs seem to have a high chance of hitting the exclusive lease key guard and, as a result, end up doing nothing.
Because it's a global Redis key, only one Enqueuer job execution at a time can be in the steps that pick up a repo and start its migration.
It seems that, due to the high concurrency we see, several jobs pile up at the same moment but only one of them actually does the work, so only one repo is picked up.
That could explain why we don't see the number of in-flight pre-imports hitting max capacity.
In #361445 (closed), we tried to scope the Redis key to the repo that had been picked up. Due to the high concurrency, this failed miserably.
🚒 Alternative solution
With this issue, we want to tackle things differently: let each execution pick up and start as many migrations as <max_capacity> allows.
It's basically taking the x executions that we have today and compacting them into a single one. In that regard, the Enqueuer job will simply loop on available repositories until one of the following happens:
- <max_capacity> is reached.
- The related feature flag is disabled (yes, this change is going to be gated behind a feature flag).
- A timeout is reached.
  - This is an additional safety net to ensure that we don't end up in an ad vitam aeternam (endless) loop.
One nice thing with this is that we can keep the global exclusive lease key. We know that it works well with the level of concurrency we're facing.
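To make the three exit conditions concrete, here is a minimal Ruby sketch of what such a loop could look like. It is not the actual Enqueuer implementation: the class name, the constructor arguments, and the in-memory list of repositories are all hypothetical stand-ins, and the sketch assumes it runs while the global exclusive lease is already held.

```ruby
# Hypothetical sketch: none of these names come from the real codebase.
# Assumes the caller already holds the global exclusive lease.
class EnqueuerLoopSketch
  def initialize(max_capacity:, timeout:, repositories:, inflight: 0,
                 feature_enabled: -> { true },
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @max_capacity = max_capacity        # stand-in for <max_capacity>
    @timeout = timeout                  # safety-net duration, in seconds
    @repositories = repositories.dup    # stand-in for "available repositories"
    @inflight = inflight                # migrations already running
    @feature_enabled = feature_enabled  # stand-in for the feature flag check
    @clock = clock
  end

  # Returns the repositories whose migration was started in this execution.
  def run
    started = []
    deadline = @clock.call + @timeout

    loop do
      break if @inflight >= @max_capacity  # <max_capacity> reached
      break unless @feature_enabled.call   # feature flag disabled
      break if @clock.call >= deadline     # timeout reached

      repository = @repositories.shift
      break if repository.nil?             # nothing left to pick up

      started << repository                # the real job would start the migration here
      @inflight += 1
    end

    started
  end
end
```

The feature flag and the clock are checked on every iteration, so disabling the flag or hitting the timeout stops a running execution at the next repository rather than only preventing new executions.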
🤔 Downsides
The main downside is that a Ruby loop that starts many migrations could cause a small burst of requests to the container registry.
Given that a registry node can reject the migration start request, and Rails will simply re-execute the request to hit a different node, we could be triggering 2 * <max_capacity> requests in a very short amount of time. At the time of this writing, <max_capacity> is 25, so we could fire 50 requests. Given the number of registry nodes available, I think this is still a reasonable number.
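The worst case above is simple arithmetic, sketched here in Ruby. The method name and the `attempts_per_migration` parameter are hypothetical; the value 2 assumes exactly one retry after a rejected start request, as described above.

```ruby
# Hypothetical helper: upper bound on registry requests fired by one execution.
# attempts_per_migration = 2 assumes the initial request plus a single retry.
def worst_case_requests(max_capacity, attempts_per_migration: 2)
  max_capacity * attempts_per_migration
end

worst_case_requests(25) # => 50
```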
🔮 Follow ups
We could optimize the loop by not looping on the existing execution path but simply:
- Get as many aborted migrations as possible (max current capacity).
- If there are still any free slots, get as many next migrations as possible.
- Loop on this set of repositories.
That's basically building the target repositories only once instead of fetching them one by one on each iteration.
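The steps above could be sketched as follows. This is only an illustration: `build_targets` and its arguments are hypothetical, plain arrays stand in for whatever queries would return aborted and next migrations, and `free_slots` is assumed to be <max_capacity> minus the migrations already in flight.

```ruby
# Hypothetical sketch: build the list of target repositories once,
# aborted migrations first, then next migrations for any slots left.
def build_targets(aborted:, pending:, free_slots:)
  free_slots = [free_slots, 0].max  # guard against a negative slot count

  targets = aborted.first(free_slots)
  remaining = free_slots - targets.size
  targets += pending.first(remaining) if remaining.positive?
  targets
end

# The loop then iterates over this fixed set instead of fetching one by one.
build_targets(aborted: %w[a b], pending: %w[c d e], free_slots: 4)
# => ["a", "b", "c", "d"]
```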