Optimize the opportunistic migrator with learnings from staging
The opportunistic migrator is a simple migration mechanism that triggers migrations on all read transactions (limited to running a single migration per Gitaly server). Whenever a write transaction comes in, we abort any ongoing migration for that repository.
This works well, but there are some improvements we can make based on what we learned in staging:
A lot of the migrations that fail are due to non-existent repositories
From the captured metrics, we can see a large number of migration failures, specifically of the 'migration_error' kind, which means these migrations failed not because of an incoming write transaction but due to other issues.
Looking into the logs makes the cause evident: we trigger migrations on all read transactions, and some of these incoming requests are against non-existent repositories.
We should stop emitting these log entries and recording these values in the metrics. We could:
- Capture `storage.ErrRepositoryNotFound` and drop it from logs and metrics
- Capture `storage.ErrRepositoryNotFound` and label it accordingly
I think it would be better to label it and filter it out in Grafana and Elastic rather than skipping the data entirely.
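As a minimal Go sketch of the second option: the `recordMigrationFailure` helper, the counter's `kind` label values, and the import path are assumptions for illustration, not the actual Gitaly code.

```go
package migration

import (
	"errors"

	"github.com/prometheus/client_golang/prometheus"

	"gitlab.com/gitlab-org/gitaly/v16/internal/gitaly/storage" // assumed import path
)

// recordMigrationFailure is a hypothetical helper: instead of counting every
// failure under the generic "migration_error" kind, not-found failures get
// their own label so Grafana and Elastic can filter them out.
func recordMigrationFailure(err error, failures *prometheus.CounterVec) {
	kind := "migration_error"
	if errors.Is(err, storage.ErrRepositoryNotFound) {
		kind = "repository_not_found"
	}
	failures.WithLabelValues(kind).Inc()
}
```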
We only trigger migrations on read transactions
As noted, we only trigger migrations on read transactions. This doesn't work well for write-heavy repositories: incoming writes always cancel any ongoing migration, and since we only trigger migrations on reads, we end up waiting for the next read transaction.
Since we use a middleware to trigger the migrations, a simple solution is to also trigger the migration for incoming write transactions, but only when exiting the middleware. This still ensures we don't conflict with the incoming write, while also working for repositories that skew towards writes.
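A rough sketch of what that could look like as a gRPC unary interceptor; the `Migrator` type and its methods here are hypothetical stand-ins, not Gitaly's actual API:

```go
package migration

import (
	"context"

	"google.golang.org/grpc"
)

// Migrator stands in for the opportunistic migrator; the method names below
// are assumptions for illustration.
type Migrator struct{}

func (m *Migrator) isReadRPC(fullMethod string) bool                           { return true }
func (m *Migrator) maybeMigrate(ctx context.Context, req interface{})          {}
func (m *Migrator) abortOngoingMigration(ctx context.Context, req interface{}) {}

// UnaryInterceptor triggers migrations before read RPCs (as today) and after
// write RPCs, so write-heavy repositories still get migration attempts
// without a migration ever conflicting with the write being handled.
func (m *Migrator) UnaryInterceptor() grpc.UnaryServerInterceptor {
	return func(ctx context.Context, req interface{}, info *grpc.UnaryServerInfo, handler grpc.UnaryHandler) (interface{}, error) {
		if m.isReadRPC(info.FullMethod) {
			m.maybeMigrate(ctx, req)
			return handler(ctx, req)
		}

		// Write path: cancel any ongoing migration, serve the RPC, and only
		// trigger a new migration on the way out of the middleware.
		m.abortOngoingMigration(ctx, req)
		resp, err := handler(ctx, req)
		if err == nil {
			m.maybeMigrate(ctx, req)
		}
		return resp, err
	}
}
```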
We store migration information in memory
The migration state is currently stored in memory. This state tracks whether a repository has already been migrated, whether there were previous attempts, and whether the migration is in a timeout period. Since it is in-memory, it is reset every time the Gitaly server restarts, which is quite often given that the server restarts on every deployment. We should therefore move this information to disk, perhaps into the key-value store used by transactions.
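A minimal sketch of what persisting that state could look like, assuming a simple key-value interface; the `KV` interface, key layout, and field names are all illustrative, not the transaction KV's actual API:

```go
package migration

import (
	"encoding/json"
	"time"
)

// KV is a minimal stand-in for the transaction key-value store; the actual
// interface in Gitaly may differ.
type KV interface {
	Get(key []byte) ([]byte, error)
	Set(key, value []byte) error
}

// migrationState is the per-repository information that is currently kept in
// memory and lost on every restart.
type migrationState struct {
	Migrated  bool      `json:"migrated"`
	Attempts  int       `json:"attempts"`
	NextRetry time.Time `json:"next_retry"`
}

func stateKey(relativePath string) []byte {
	return []byte("migration/" + relativePath)
}

func loadState(kv KV, relativePath string) (migrationState, error) {
	var state migrationState
	value, err := kv.Get(stateKey(relativePath))
	if err != nil {
		return state, err
	}
	if value == nil {
		// No recorded state yet, e.g. the repository was never migrated.
		return state, nil
	}
	return state, json.Unmarshal(value, &state)
}

func saveState(kv KV, relativePath string, state migrationState) error {
	value, err := json.Marshal(state)
	if err != nil {
		return err
	}
	return kv.Set(stateKey(relativePath), value)
}
```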

