
Potentially gigantic cross-database ID pluck in Geo

Summary

https://gitlab.com/gitlab-org/gitlab-ee/blob/master/app/workers/geo/repository_sync_worker.rb#L27

      current_node.projects
                  .where.not(id: Geo::ProjectRegistry.synced.pluck(:project_id))
                  .order(last_repository_updated_at: :desc)
                  .limit(db_retrieve_batch_size)
                  .pluck(:id)

current_node is a GeoNode, which maps to a table in the main database, as does the projects relation, which boils down to Project.all.

Geo::ProjectRegistry is in the separate, read-write, per-geo-secondary database. So we have to pluck the IDs here. But what happens when a couple of million projects have been synced? What sort of query do we generate?
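A rough, self-contained illustration (plain Ruby, with hypothetical ID counts) of why this is worrying: `where.not(id: [...])` interpolates every plucked ID into the statement text as a NOT IN list, so the SQL grows linearly with the number of synced projects.

```ruby
# Stand-in for Geo::ProjectRegistry.synced.pluck(:project_id) with
# two million synced projects (hypothetical number).
synced_ids = (1..2_000_000).to_a

# Roughly the statement ActiveRecord generates for
# where.not(id: synced_ids): every ID lands in the SQL text itself.
sql = "SELECT \"projects\".\"id\" FROM \"projects\" " \
      "WHERE \"projects\".\"id\" NOT IN (#{synced_ids.join(', ')})"

sql.bytesize  # well over 10 MB of SQL text for 2 million IDs
```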

Steps to reproduce

Sync a few million projects, try to sync a few more.

What is the current bug behavior?

Postgres will probably fail with a "query too large" error of some sort, since every synced project ID is interpolated into a NOT IN list in the statement text. Even before any hard limit is hit, parsing and planning a multi-megabyte statement is expensive.

What is the expected correct behavior?

We need to be able to scale to several million rows in the project_registry table. I'm pretty sure this query doesn't.

Possible fixes

Could we use where(last_repository_updated_at: nil) instead?
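If that proposal holds — i.e. a project with a nil last_repository_updated_at is exactly a never-synced one, which is an assumption worth validating — the filter runs entirely in the main database and no registry IDs ever enter the statement. A plain-Ruby stand-in to illustrate the shape (Project here is a hypothetical struct, not the real model):

```ruby
Project = Struct.new(:id, :last_repository_updated_at)

# In-memory stand-in for the proposed relation:
#   current_node.projects
#               .where(last_repository_updated_at: nil)
#               .limit(batch_size)
#               .pluck(:id)
# No cross-database pluck is needed, so the statement size stays constant
# no matter how many projects are already synced.
def unsynced_ids(projects, batch_size)
  projects.select { |p| p.last_repository_updated_at.nil? }
          .first(batch_size)
          .map(&:id)
end
```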