[meta] Ensure Geo replication can handle GitLab.com-scale update load
So far, our testing of Geo at GitLab.com scale has focused on backfill and basic correctness: can we successfully replicate all instance state from a static primary to a static secondary? When we create, update, or delete a single repository, file, or record, does the change appear correctly on the secondary?
We need to get some assurance that Geo's replication architecture will handle updates at scale.
## Investigating current replication requirements
(For almost every number mentioned, I think we should be interested in both average and peak rates. We might also want some measure of variance on the average, so we can get a feel for how "spiky" the demand is; see the sketch below.)
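For instance, given per-minute event counts exported from logs or monitoring, the three figures per metric could be computed like this (a minimal sketch, not existing tooling):

```python
# Minimal sketch of the summary statistics we'd want for each measured
# rate: average, peak, and a "spikiness" measure. Assumes we can export
# per-minute event counts from logs or Prometheus; not GitLab code.
import statistics

def summarize(samples_per_minute):
    """Summarize a series of per-minute event counts."""
    mean = statistics.mean(samples_per_minute)
    peak = max(samples_per_minute)
    stdev = statistics.pstdev(samples_per_minute)
    # Coefficient of variation: stdev relative to the mean. High means
    # the demand is spiky; low means it is steady.
    spikiness = stdev / mean if mean else float("inf")
    return {"mean": mean, "peak": peak, "spikiness": spikiness}

print(summarize([120, 95, 110, 640, 130, 105]))  # one spiky minute
```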
Events communicated by the Geo event logs ("repository" normally means "project"):
- Repository updated ("git push", creating commits in UI/API)
- Repository deleted (UI/API action, namespace removal sends an event per project)
- Repository renamed (UI/API action, namespace removal sends an event per project)
- List of selective sync namespaces changed (ignore)
- Repository created (UI/API action) (why do we need this?)
- Repository migrated to hashed storage (V1 or V2) (ignore)
- LFS object deleted
- CI artifact deleted
We don't send events for these actions:
- LFS object added
- CI artifact added
- Upload added
- Upload removed
However, we may have an interest in these anyway, as backfill causes them to be replicated by the secondary.
## Numbers we want to collect for GitLab.com
The event log depends on postgresql replication.
- Rate of data transfer for current postgresql replication (see the sketch after this list)
- Replication lag for current postgresql replication
- Rate of `git push` (+ UI/API) actions
- Rate of data transfer for `git push` (+ UI/API) actions
- Rate of project creations, renames and deletions
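For the first two numbers, a first pass can query postgresql directly. A minimal sketch, assuming psycopg2 and direct connections to both nodes; note that on postgresql 9.x the WAL functions are named `pg_current_xlog_location`/`pg_xlog_location_diff` rather than the 10+ names used below:

```python
# Hedged sketch: replication lag sampled on the secondary, WAL generation
# rate sampled on the primary. Hosts and dbname are placeholder values.
import time
import psycopg2

primary = psycopg2.connect("host=geo-primary dbname=gitlabhq_production")
secondary = psycopg2.connect("host=geo-secondary dbname=gitlabhq_production")

# Lag: time since the last transaction replayed on the secondary.
with secondary.cursor() as cur:
    cur.execute("SELECT now() - pg_last_xact_replay_timestamp()")
    print("replication lag:", cur.fetchone()[0])

def current_wal_lsn(conn):
    with conn.cursor() as cur:
        cur.execute("SELECT pg_current_wal_lsn()")  # pg >= 10 naming
        return cur.fetchone()[0]

# Transfer rate: bytes of WAL the primary generated over one minute.
# This bounds what each secondary must receive to keep up.
start = current_wal_lsn(primary)
time.sleep(60)
end = current_wal_lsn(primary)
with primary.cursor() as cur:
    cur.execute("SELECT pg_wal_lsn_diff(%s::pg_lsn, %s::pg_lsn)", (end, start))
    print("WAL bytes/minute:", cur.fetchone()[0])
```

`pg_stat_replication` on the primary also exposes per-secondary sent/replay positions, if we want lag in bytes rather than seconds.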
## Numbers we may not want (if we depend on object storage, we might be able to ignore them)
- Rate of LFS uploads (a rough estimation sketch follows this list)
- Rate of data transfer for LFS uploads
- Peak rate of LFS object deletions
  - This happens in bulk via `RemoveUnreferencedLfsObjectsWorker`
- Rate of artifact uploads
- Rate of data transfer for artifact uploads
- Rate of artifact removals
  - Some (most?) removals are in bulk via `ExpireBuildArtifactsWorker` and `ExpireBuildInstanceArtifactsWorker`, so we may only want peak rates here.
- Rate of uploads
- Rate of data transfer for uploads
- Rate of upload removals
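If we do end up needing these numbers, first-pass estimates for the upload rates can come from the primary's database rather than new instrumentation. A hedged sketch, assuming the `lfs_objects` and `uploads` tables both have `size` and `created_at` columns (worth verifying against the current schema):

```python
# Hedged sketch: estimate hourly upload counts and byte rates from the
# primary's database. Table and column names are assumptions to verify.
import psycopg2

QUERY = """
    SELECT count(*) AS per_hour, coalesce(sum(size), 0) AS bytes_per_hour
    FROM {table}
    WHERE created_at > now() - interval '1 hour'
"""

conn = psycopg2.connect("host=geo-primary dbname=gitlabhq_production")
with conn.cursor() as cur:
    for table in ("lfs_objects", "uploads"):
        cur.execute(QUERY.format(table=table))
        count, total_bytes = cur.fetchone()
        print(f"{table}: {count} rows/hour, {total_bytes} bytes/hour")
```

Deletion rates can't be sampled this way, since the rows are gone; those would probably have to come from Sidekiq metrics for the workers named above.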
Adding these events to the log cursor will increase postgresql replication load, but hopefully this will be marginal compared to the rest of the database. Once we have numbers, we can make an estimate.
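To show the shape of that estimate, here's the arithmetic with entirely made-up inputs (every figure below is invented, not a measurement):

```python
# Back-of-envelope estimate of the extra WAL traffic from new event types.
# All numbers are invented for illustration; replace with measurements.
events_per_second = 50          # e.g. LFS/artifact/upload adds and removes
bytes_per_event_row = 200       # rough WAL cost of one event-log insert
baseline_wal_bytes_per_second = 5_000_000  # measured as sketched above

extra = events_per_second * bytes_per_event_row
overhead = extra / baseline_wal_bytes_per_second
print(f"extra WAL: {extra} B/s ({overhead:.2%} of baseline)")
# With these made-up inputs: 10000 B/s, 0.20% of baseline -> marginal.
```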
Once backfill is complete, we can (naively) assume that git data replication load for each secondary will exactly match the `git push` load on the primary. It's a reasonable first-order approximation, and is more likely to over-state the load than under-state it: several pushes to the same repository in quick succession may be picked up by a single `git fetch` on the secondary.
## Investigating current replication capacity
This is a fairly exploratory issue. We need to start with a primary and a fully-replicated secondary, apply a sustained period of database and filesystem writes to the primary (creating new issues, uploading files and LFS objects, renaming and updating repositories and wikis, etc), and observe the replication process in action.
We need to either shadow GitLab.com traffic (I'm not sure this is possible) or get some numbers from GitLab.com telling us what rate and mix of updates we should be sending to our testbed primary, then generate the load ourselves with, e.g., https://gitlab.com/gitlab-org/gitlab-ee/issues/3117#note_47093268
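For the generate-it-ourselves option, a minimal sketch of what a driver could look like, assuming a `requests`-based client against the REST API (host, token, project id, payload sizes, and rates below are all invented):

```python
# Hedged sketch of a load generator hitting a testbed primary through the
# REST API. Git push load would be driven separately via git itself.
import os
import time
import random
import requests

API = "https://primary.example.com/api/v4"
HEADERS = {"PRIVATE-TOKEN": "<testbed-token>"}
PROJECT = 42  # hypothetical testbed project id

def create_issue():
    requests.post(f"{API}/projects/{PROJECT}/issues", headers=HEADERS,
                  data={"title": f"load test {time.time()}"})

def upload_file():
    requests.post(f"{API}/projects/{PROJECT}/uploads", headers=HEADERS,
                  files={"file": ("blob.bin", os.urandom(64 * 1024))})

# Weight the mix to match whatever ratios we measure on GitLab.com.
ACTIONS = [create_issue, create_issue, upload_file]

while True:
    random.choice(ACTIONS)()
    time.sleep(random.expovariate(10))  # ~10 actions/second on average
```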
Some important questions:
- How does postgresql replication lag change? What rate of sustained database updates can we maintain before we start falling behind? Can we add hardware to scale this?
- The Geo log cursor operates by adding and removing events to various tables on the primary. Does this generate substantial additional load on postgresql? How many events/second can we enqueue before we start to affect postgresql replication negatively?
- The Geo log cursor is a daemon on the secondary that processes those enqueued events. How many events/second can it handle without falling behind? What are its resource requirements while doing so? Can it keep up with ordinary and exceptional GitLab.com traffic?
- The secondary is notified of changes to repositories on the primary, and it enqueues an unbounded number of `git fetch` operations in response to log cursor events. Is this sustainable at GitLab.com scale? Should we apply concurrency limits (see the sketch after this list)?
- Once the updates to the primary have finished and the secondary claims to be synchronized again, is the secondary actually in a consistent state? Have unexpected race conditions removed or broken repositories or files? Did any events get missed? etc.
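On the concurrency-limit question, one obvious shape for a fix is a bounded worker pool between the log cursor and `git fetch`. A hypothetical sketch, not Geo's actual implementation:

```python
# Hedged sketch of one possible concurrency limit: a bounded pool draining
# a set of repositories to fetch, so a burst of log cursor events can't
# spawn an unbounded number of simultaneous `git fetch` processes.
# run_fetch() and the queue source are hypothetical stand-ins, not Geo code.
import subprocess
from concurrent.futures import ThreadPoolExecutor

MAX_CONCURRENT_FETCHES = 10  # tuning knob; the right value is an open question

def run_fetch(repo_path):
    # Fetch one repository's updates from the primary.
    subprocess.run(["git", "-C", repo_path, "fetch", "--all"], check=False)

def drain(pending_repo_paths):
    # De-duplicate first: many events for one repo need only one fetch.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENT_FETCHES) as pool:
        for path in set(pending_repo_paths):
            pool.submit(run_fetch, path)
```

The right value for the limit is exactly the kind of number this load testing should give us.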
I reckon this could do with a GCP Migration label /cc @andrewn