Handle outdated replicas in the DB load balancer
This MR extends the database load balancer so it can handle replicas that are lagging behind too much. See the commit message for more details. In short:
If a replica lags behind by more than 60 seconds (by default) _and_ by more than 8 MB, we stop reading from that secondary. We check this roughly every 30 seconds. There is no central state; everything is kept in memory per process. A replica is used again once it is back in sync.
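The offline/online logic described above can be sketched roughly as follows. This is an illustrative sketch, not the actual implementation; the `Replica` class, method names, and keyword arguments are assumptions made for the example, while the thresholds match the defaults mentioned above.

```ruby
# Illustrative sketch of the per-replica health check. A replica is
# taken out of rotation only when it lags behind on BOTH time and
# size, the state lives purely in memory, and it is re-evaluated
# roughly every 30 seconds.
class Replica
  LAG_TIME_THRESHOLD = 60               # seconds (default)
  LAG_SIZE_THRESHOLD = 8 * 1024 * 1024  # 8 MB
  CHECK_INTERVAL     = 30               # seconds between checks

  def initialize
    @online          = true
    @last_checked_at = Time.at(0)
  end

  # lag_time is the replication delay in seconds, lag_size the delay
  # in bytes. `now` is injectable to make the interval testable.
  def online?(lag_time:, lag_size:, now: Time.now)
    # Reuse the cached in-memory state between checks.
    return @online if now - @last_checked_at < CHECK_INTERVAL

    @last_checked_at = now

    # Offline only when both thresholds are exceeded; back online
    # (and thus used again) once the replica is in sync.
    @online = !(lag_time > LAG_TIME_THRESHOLD &&
                lag_size > LAG_SIZE_THRESHOLD)
  end
end
```

Because the state is per process and in memory, every Rails/Sidekiq process performs its own checks independently, which is exactly the duplication the Redis TODO below would remove.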
TODO

- Store the "online" state and timestamp in Redis so multiple processes won't perform the same work. Skipping this for now, as performing this work in memory is much easier.
- Prevent a thundering herd when all replicas are offline by gradually redirecting traffic. Skipping this since I don't consider it very useful: if a primary can't handle all the traffic, gradually redirecting it won't help, as the primary will eventually succumb anyway. This would only give a false sense of confidence that the primary won't go down.
  - https://gitlab.com/gitlab-com/infrastructure/issues/2480 combined with this MR might be a much better solution.
  - This would also require some form of central coordination, which adds a lot of complexity.
- Randomly adjust the check interval per host per request to reduce the likelihood of all processes checking at once.
- Test using a real replica.
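The "randomly adjust the check interval" TODO could look something like the sketch below. The method name, base interval, and ±20% jitter fraction are assumptions for illustration only; the idea is simply to spread the check moments out across processes.

```ruby
# Hypothetical jitter for the per-host check interval: shift the base
# interval by a random amount (here up to +/-20%) so that many
# processes don't all re-check their replicas at the same moment.
BASE_CHECK_INTERVAL = 30 # seconds, matching the default above

def next_check_interval(base: BASE_CHECK_INTERVAL, jitter_fraction: 0.2)
  delta = base * jitter_fraction
  base + rand(-delta..delta)
end
```

With a 30 second base this yields an interval between 24 and 36 seconds, recomputed per host per request.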
Does this MR meet the acceptance criteria?

- Changelog entry added, if necessary
- Documentation created/updated
- Tests added for this feature/bug
- Review
  - Has been reviewed by Backend
  - Has been reviewed by Database
- Conforms to the merge request performance guides
- Conforms to the style guides
- Squashed related commits together
What are the relevant issue numbers?
Edited by Yorick Peterse