Geo: Add replication lag documentation
Problem to solve
Geo replication is asynchronous, so it is possible for a particular secondary to lag significantly behind the primary.
We have added warnings and notices in various places to inform Geo users about this lag, e.g. "Current replication lag: 9 seconds".
We should add a document that says what "replication lag" is, what the implications are, and what can be done about it. We will need to consider users and sysadmins separately, as they will have different needs.
After we have this document, we can link to it from these messages.
Further details
Proposal
Some ideas as a starting point:
- Add a "Replication lag" document under GitLab Docs > Administrator Docs > Replication (Geo)
- Add a brief "Overview" section
- Add a "What is measured?" section, which should mention database replication and events, and note that we currently do not include the time it takes to download files after an event is processed.
- Add an "Is replication lag too high?" section, which should mention that the floor for replication lag is set by the connection between the primary and secondary. For example, if you ping your US node from your China node and see 1500 ms, that is probably your minimum possible lag.
- Add an initial iteration of "Reducing replication lag" for sysadmins, with the understanding that there is a variety of possible root causes
- Add a "Git pull is out-of-sync with git push" section for users. Since secondary pushes are automatically forwarded to the primary, they are always up-to-date. But secondary pulls are affected by replication lag. A user can pull directly from the primary if they need to. (One issue: I am not sure how much git skill we can assume of a developer.)
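To make the "Is replication lag too high?" idea concrete, here is a minimal sketch of the comparison a sysadmin might do by hand. It assumes, as suggested above, that the ping round-trip time between nodes is a rough lower bound on lag. The function names are hypothetical and are not part of GitLab or Geo.

```python
def lag_floor_seconds(ping_rtt_ms: float) -> float:
    """Convert a measured ping round-trip time (ms) into a rough lag floor (s).

    Assumption: the network round trip approximates the minimum possible lag.
    """
    if ping_rtt_ms < 0:
        raise ValueError("round-trip time cannot be negative")
    return ping_rtt_ms / 1000.0


def lag_above_floor(reported_lag_s: float, ping_rtt_ms: float) -> float:
    """Return the seconds of reported lag that the network floor does not explain."""
    return max(0.0, reported_lag_s - lag_floor_seconds(ping_rtt_ms))


# Values taken from the examples in this issue: a 1500 ms ping and a 9 second lag.
print(lag_floor_seconds(1500))   # 1.5
print(lag_above_floor(9, 1500))  # 7.5
```

In this example, 7.5 of the 9 seconds are not explained by the connection alone, which would be the portion worth investigating under "Reducing replication lag".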
Who can address the issue
Geo team member