FDW queries for finding unsynced files & projects slow down as number of synced objects increases
The Geo tracking database contains two tables:
- file_registry (tracks rows in ci_artifacts, lfs_objects and uploads tables)
- project_registry (tracks rows in projects table)
When we backfill files or projects, we need to find rows in the right hand side that are not present in the left-hand side. To do this, we currently have to do one of:
- Ship all
x.id
values into the tracking database - Ship all
x_registry.x_id
values into the main database.
We do the former, which means we get good performance when there are few objects synced, but performance decreases as the number of rows increases.
Currently, we can either do the cross-database pluck ourselves (legacy) or use FDW (which makes the queries shorter to view and saves about 25% of time compared to legacy). This is driven from the Geo::RepositoryShardSyncWorker
and Geo::FileDownloadDispatchWorker
backfill jobs, in batches of about 1000.
Proposal
We can get around the need to do a cross-database pluck at all by storing a row in the registry for every possible value of the object.
Projects, uploads, etc, all have a monotonically increasing ID and we don't (generally) re-use these IDs. So we can:
- Take
max(projects.id)
- Generate, in order, 1000 integers between 0 and that number that don't exist in
project_registry.project_id
- Unconditionally fill the
project_registry
with rows for those thousand entries, whether a matchingproject
exists or not.
This lets us keep a record of projects that don't exist, and so don't need to be backfilled, in the project_registry table (and the same for file_registry), which significantly speeds up finding the next batch of 1000. Performance should be fairly constant every time the backfill workers run, as the size of the dataset they're working on doesn't increase - it's always just one integer from the main DB, and up to a thousand integers that don't exist in the tracking DB's (indexed) x_id
columns.
I think we should also split up the file_registry
into its three composites. This allows us to remove the file_type
column and keeps the tables smaller. We can give each its own backfill worker as well.