[meta] Ensure Geo replication can handle GitLab.com-scale update load
So far, our testing of Geo at GitLab.com scale has focused on backfill and basic correctness: can we successfully replicate all instance state from a static primary to a static secondary? When we create, update, or delete a single repository, file, or record, does the change appear correctly on the secondary?
We need to get some assurance that Geo's replication architecture will handle updates at scale.
## Investigating current replication requirements
(For almost every number mentioned, I think we should be interested in both average and peak rates. We might also want some measure of variance on the average, so we can get a feel for how "spiky" the demand is; see the sketch below.)
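For instance, given per-minute event counts exported from logs or monitoring, the three figures per metric could be computed like this (a minimal sketch, not existing tooling):

```python
# Minimal sketch of the summary statistics we'd want for each measured
# rate: average, peak, and a "spikiness" measure. Assumes we can export
# per-minute event counts from logs or Prometheus; not GitLab code.
import statistics

def summarize(samples_per_minute):
    """Summarize a series of per-minute event counts."""
    mean = statistics.mean(samples_per_minute)
    peak = max(samples_per_minute)
    stdev = statistics.pstdev(samples_per_minute)
    # Coefficient of variation: stdev relative to the mean. High means
    # the demand is spiky; low means it is steady.
    spikiness = stdev / mean if mean else float("inf")
    return {"mean": mean, "peak": peak, "spikiness": spikiness}

print(summarize([120, 95, 110, 640, 130, 105]))  # one spiky minute
```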
Events communicated by the Geo event logs ("repository" normally means "project"):
- Repository updated ("git push", creating commits in UI/API)
- Repository deleted (UI/API action, namespace removal sends an event per project)
- Repository renamed (UI/API action, namespace removal sends an event per project)
- List of selective sync namespaces changed (ignore)
- Repository created (UI/API action) (why do we need this?)
- Repository migrated to hashed storage (V1 or V2) (ignore)
- LFS object deleted
- CI artifact deleted
We don't send events for these actions:
- LFS object added
- CI artifact added
- Upload added
- Upload removed
However, we may have an interest in these anyway, as backfill causes them to be replicated by the secondary.
## Numbers we want to collect for GitLab.com
The event log depends on postgresql replication.
- Rate of data transfer for current postgresql replication (see the sketch after this list)
- Replication lag for current postgresql replication
- Rate of `git push` (+ UI/API) actions
- Rate of data transfer for `git push` (+ UI/API) actions
- Rate of project creations, renames and deletions
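For the first two numbers, a first pass can query postgresql directly. A minimal sketch, assuming psycopg2 and direct connections to both nodes; note that on postgresql 9.x the WAL functions are named `pg_current_xlog_location`/`pg_xlog_location_diff` rather than the 10+ names used below:

```python
# Hedged sketch: replication lag sampled on the secondary, WAL generation
# rate sampled on the primary. Hosts and dbname are placeholder values.
import time
import psycopg2

primary = psycopg2.connect("host=geo-primary dbname=gitlabhq_production")
secondary = psycopg2.connect("host=geo-secondary dbname=gitlabhq_production")

# Lag: time since the last transaction replayed on the secondary.
with secondary.cursor() as cur:
    cur.execute("SELECT now() - pg_last_xact_replay_timestamp()")
    print("replication lag:", cur.fetchone()[0])

def current_wal_lsn(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT pg_current_wal_lsn()")  # pg >= 10 naming
        return cur.fetchone()[0]

# Transfer rate: bytes of WAL the primary generated over one minute.
# This bounds what each secondary must receive to keep up.
start = current_wal_lsn(primary)
time.sleep(60)
end = current_wal_lsn(primary)
with primary.cursor() as cur:
    cur.execute("SELECT pg_wal_lsn_diff(%s::pg_lsn, %s::pg_lsn)", (end, start))
    print("WAL bytes/minute:", cur.fetchone()[0])
```

`pg_stat_replication` on the primary also exposes per-secondary sent/replay positions, if we want lag in bytes rather than seconds.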
## Numbers we may not want (if we depend on object storage, we might be able to ignore them)
- Rate of LFS uploads (a rough estimation sketch follows this list)
- Rate of data transfer for LFS uploads
- Peak rate of LFS object deletions
  - This happens in bulk via `RemoveUnreferencedLfsObjectsWorker`
- Rate of artifact uploads
- Rate of data transfer for artifact uploads
- Rate of artifact removals
  - Some (most?) removals are in bulk via `ExpireBuildArtifactsWorker` and `ExpireBuildInstanceArtifactsWorker`, so we may only want peak rates here.
- Rate of uploads
- Rate of data transfer for uploads
- Rate of upload removals
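If we do end up needing these numbers, first-pass estimates for the upload rates can come from the primary's database rather than new instrumentation. A hedged sketch, assuming the `lfs_objects` and `uploads` tables both have `size` and `created_at` columns (worth verifying against the current schema):

```python
# Hedged sketch: estimate hourly upload counts and byte rates from the
# primary's database. Table and column names are assumptions to verify.
import psycopg2

QUERY = """
    SELECT count(*) AS per_hour, coalesce(sum(size), 0) AS bytes_per_hour
    FROM {table}
    WHERE created_at > now() - interval '1 hour'
"""

conn = psycopg2.connect("host=geo-primary dbname=gitlabhq_production")
with conn.cursor() as cur:
    for table in ("lfs_objects", "uploads"):
        cur.execute(QUERY.format(table=table))
        count, total_bytes = cur.fetchone()
        print(f"{table}: {count} rows/hour, {total_bytes} bytes/hour")
```

Deletion rates can't be sampled this way, since the rows are gone; those would probably have to come from Sidekiq metrics for the workers named above.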
Adding these events to the log cursor will increase postgresql replication load, but hopefully this will be marginal compared to the rest of the database. Once we have numbers, we can make an estimate.
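To show the shape of that estimate, here's the arithmetic with entirely made-up inputs (every figure below is invented, not a measurement):

```python
# Back-of-envelope estimate of the extra WAL traffic from new event types.
# All numbers are invented for illustration; replace with measurements.
events_per_second = 50          # e.g. LFS/artifact/upload adds and removes
bytes_per_event_row = 200       # rough WAL cost of one event-log insert
baseline_wal_bytes_per_second = 5_000_000  # measured as sketched above

extra = events_per_second * bytes_per_event_row
overhead = extra / baseline_wal_bytes_per_second
print(f"extra WAL: {extra} B/s ({overhead:.2%} of baseline)")
# With these made-up inputs: 10000 B/s, 0.20% of baseline -> marginal.
```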
Once backfill is complete, we can (naively) assume that git data replication load for each secondary will exactly match the `git push` load on the primary. It's a reasonable first-order approximation, and is more likely to over-state the load than under-state it: several pushes to the same repository in quick succession may be picked up by a single `git fetch` on the secondary.
## Investigating current replication capacity
This is a fairly exploratory issue. We need to start with a primary and a fully-replicated secondary, apply a sustained period of database and filesystem writes to the primary (creating new issues, uploading files and LFS objects, renaming and updating repositories and wikis, etc), and observe the replication process in action.
We need to either shadow GitLab.com traffic (I'm not sure this is possible) or get some numbers from GitLab.com telling us what rate and mix of updates we should be sending to our testbed primary, then generate the load ourselves with, e.g., https://gitlab.com/gitlab-org/gitlab-ee/issues/3117#note_47093268
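For the generate-it-ourselves option, a minimal sketch of what a driver could look like, assuming a `requests`-based client against the REST API (host, token, project id, payload sizes, and rates below are all invented):

```python
# Hedged sketch of a load generator hitting a testbed primary through the
# REST API. Git push load would be driven separately via git itself.
import os
import time
import random
import requests

API = "https://primary.example.com/api/v4"
HEADERS = {"PRIVATE-TOKEN": "<testbed-token>"}
PROJECT = 42  # hypothetical testbed project id

def create_issue():
    requests.post(f"{API}/projects/{PROJECT}/issues", headers=HEADERS,
                  data={"title": f"load test {time.time()}"})

def upload_file():
    requests.post(f"{API}/projects/{PROJECT}/uploads", headers=HEADERS,
                  files={"file": ("blob.bin", os.urandom(64 * 1024))})

# Weight the mix to match whatever ratios we measure on GitLab.com.
ACTIONS = [create_issue, create_issue, upload_file]

while True:
    random.choice(ACTIONS)()
    time.sleep(random.expovariate(10))  # ~10 actions/second on average
```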
Some important questions:
- How does postgresql replication lag change? What rate of sustained database updates can we maintain before we start falling behind? Can we add hardware to scale this?
- The Geo log cursor operates by adding and removing events to various tables on the primary. Does this generate substantial additional load on postgresql? How many events/second can we enqueue before we start to affect postgresql replication negatively?
- The Geo log cursor is a daemon on the secondary that processes those enqueued events. How many events/second can it handle without falling behind? What are its resource requirements while doing so? Can it keep up with ordinary and exceptional GitLab.com traffic?
- The secondary is notified of changes to repositories on the primary, and it enqueues an unbounded number of `git fetch` operations in response to log cursor events. Is this sustainable at GitLab.com scale? Should we apply concurrency limits (see the sketch after this list)?
- Once the updates to the primary have finished and the secondary claims to be synchronized again, is the secondary actually in a consistent state? Have unexpected race conditions removed or broken repositories or files? Did any events get missed? etc.
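On the concurrency-limit question, one obvious shape for a fix is a bounded worker pool between the log cursor and `git fetch`. A hypothetical sketch, not Geo's actual implementation:

```python
# Hedged sketch of one possible concurrency limit: a bounded pool draining
# a set of repositories to fetch, so a burst of log cursor events can't
# spawn an unbounded number of simultaneous `git fetch` processes.
# run_fetch() and the queue source are hypothetical stand-ins, not Geo code.
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_FETCHES = 10  # tuning knob; the right value is an open question

def run_fetch(repo_path):
    # Fetch one repository's updates from the primary.
    subprocess.run(["git", "-C", repo_path, "fetch", "--all"], check=False)

def drain(pending_repo_paths):
    # De-duplicate first: many events for one repo need only one fetch.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_FETCHES) as pool:
        for path in set(pending_repo_paths):
            pool.submit(run_fetch, path)
```

The right value for the limit is exactly the kind of number this load testing should give us.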
I reckon this could do with a GCP Migration label /cc @andrewn