FDW queries for finding unsynced files & projects slow down as number of synced objects increases
The Geo tracking database contains two tables:
- file_registry (tracks rows in ci_artifacts, lfs_objects and uploads tables)
- project_registry (tracks rows in projects table)
When we backfill files or projects, we need to find rows in the right hand side that are not present in the left-hand side. To do this, we currently have to do one of:
- Ship all
x.idvalues into the tracking database
- Ship all
x_registry.x_idvalues into the main database.
We do the former, which means we get good performance when there are few objects synced, but performance decreases as the number of rows increases.
Currently, we can either do the cross-database pluck ourselves (legacy) or use FDW (which makes the queries shorter to view and saves about 25% of time compared to legacy). This is driven from the
Geo::FileDownloadDispatchWorker backfill jobs, in batches of about 1000.
We can get around the need to do a cross-database pluck at all by storing a row in the registry for every possible value of the object.
Projects, uploads, etc, all have a monotonically increasing ID and we don't (generally) re-use these IDs. So we can:
- Generate, in order, 1000 integers between 0 and that number that don't exist in
- Unconditionally fill the
project_registrywith rows for those thousand entries, whether a matching
projectexists or not.
This lets us keep a record of projects that don't exist, and so don't need to be backfilled, in the project_registry table (and the same for file_registry), which significantly speeds up finding the next batch of 1000. Performance should be fairly constant every time the backfill workers run, as the size of the dataset they're working on doesn't increase - it's always just one integer from the main DB, and up to a thousand integers that don't exist in the tracking DB's (indexed)
I think we should also split up the
file_registry into its three composites. This allows us to remove the
file_type column and keeps the tables smaller. We can give each its own backfill worker as well.