This merge request adds documentation on how to investigate and troubleshoot database replication lag in GitLab Geo. It describes how to check the lag at each stage of the replication process and explains what the `write_lag`, `flush_lag`, and `replay_lag` values mean. It also lists possible causes of high lag values, such as network performance issues, disk I/O problems, long-running transactions, or resource saturation. The goal is to help engineers identify and resolve replication lag issues.
What does this MR do and why?
Over the course of a long-running ticket, and in correspondence with members of the Geo and DB teams, a lot of useful information was shared about investigating potential causes of database replication lag.
This MR updates the Geo docs to cover troubleshooting the different types of database replication lag, and gives a common cause for a high value of each.
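As a rough illustration of the idea, the sketch below flags which lag values are high and pairs each with a common cause. It is not taken from the docs change itself: the pairing of each column with a cause, the `flag_high_lag` helper, and the 30-second threshold are all assumptions made for this example. The lag values are assumed to arrive as `timedelta` objects, which is how PostgreSQL `interval` columns from `pg_stat_replication` are typically represented in Python.

```python
from datetime import timedelta

# Assumed pairing of each pg_stat_replication lag column with a common cause.
# This mapping is illustrative; confirm it against the Geo troubleshooting docs.
LIKELY_CAUSES = {
    "write_lag": "network performance between the primary and secondary",
    "flush_lag": "disk I/O on the secondary",
    "replay_lag": "long-running transactions or resource saturation",
}


def flag_high_lag(lags, threshold=timedelta(seconds=30)):
    """Return (column, likely cause) pairs for every lag above the threshold.

    `lags` maps column names to timedelta values; missing or None values
    (PostgreSQL reports NULL when there is no recent activity) are skipped.
    The 30-second default is a hypothetical threshold, not a recommendation.
    """
    findings = []
    for column, cause in LIKELY_CAUSES.items():
        value = lags.get(column)
        if value is not None and value > threshold:
            findings.append((column, cause))
    return findings


# Hypothetical sample values, not real measurements.
sample = {
    "write_lag": timedelta(seconds=1),
    "flush_lag": timedelta(seconds=2),
    "replay_lag": timedelta(minutes=5),
}

for column, cause in flag_high_lag(sample):
    print(f"{column} is high; check {cause}")
```

With the sample values, only `replay_lag` exceeds the threshold, so the investigation would start with long-running transactions or resource saturation rather than the network or disk.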
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
I have evaluated the MR acceptance checklist for this MR.