FDW queries for finding unsynced files & projects slow down as number of synced objects increases

The Geo tracking database contains two tables:

file_registry (tracks rows in ci_artifacts, lfs_objects and uploads tables)
project_registry (tracks rows in projects table)

When we backfill files or projects, we need to find rows in the right hand side that are not present in the left-hand side. To do this, we currently have to do one of:

Ship all x.id values into the tracking database
Ship all x_registry.x_id values into the main database.

We do the former, which means we get good performance when there are few objects synced, but performance decreases as the number of rows increases.

Currently, we can either do the cross-database pluck ourselves (legacy) or use FDW (which makes the queries shorter to view and saves about 25% of time compared to legacy). This is driven from the Geo::RepositoryShardSyncWorker and Geo::FileDownloadDispatchWorker backfill jobs, in batches of about 1000.

Proposal

We can get around the need to do a cross-database pluck at all by storing a row in the registry for every possible value of the object.

Projects, uploads, etc, all have a monotonically increasing ID and we don't (generally) re-use these IDs. So we can:

Take max(projects.id)
Generate, in order, 1000 integers between 0 and that number that don't exist in project_registry.project_id
Unconditionally fill the project_registry with rows for those thousand entries, whether a matching project exists or not.

This lets us keep a record of projects that don't exist, and so don't need to be backfilled, in the project_registry table (and the same for file_registry), which significantly speeds up finding the next batch of 1000. Performance should be fairly constant every time the backfill workers run, as the size of the dataset they're working on doesn't increase - it's always just one integer from the main DB, and up to a thousand integers that don't exist in the tracking DB's (indexed) x_id columns.

I think we should also split up the file_registry into its three composites. This allows us to remove the file_type column and keeps the tables smaller. We can give each its own backfill worker as well.

/cc @brodock @to1ne @stanhu

Edited Nov 14, 2019 by Toon Claes