
Handle outdated replicas in the DB load balancer

Yorick Peterse requested to merge load-balancing-handle-outdated-replicas into master

This MR extends the database load balancer so it can handle replicas that lag too far behind. See the commit message for more details. In short:

If a replica lags more than 60 seconds behind (by default) and is also more than 8 MB behind, we stop reading from that secondary. We check this roughly every 30 seconds. There is no central state; everything is tracked in memory. A replica is used again once it is back in sync.
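The rule above can be sketched roughly as follows. This is a minimal illustration in Python, not the code in this MR: the names `ReplicaStatus`, `is_caught_up`, and `refresh` are hypothetical, the lag measurement is passed in as a callable, and 8 MB is assumed to mean 8 * 1024 * 1024 bytes.

```python
from dataclasses import dataclass

MAX_LAG_SECONDS = 60             # default time threshold from the description
MAX_LAG_BYTES = 8 * 1024 * 1024  # assumption: "8 MB" read as 8 MiB
CHECK_INTERVAL = 30              # seconds between status checks


@dataclass
class ReplicaStatus:
    """In-memory, per-process state for one replica (hypothetical name)."""
    online: bool = True
    last_checked_at: float = float("-inf")


def is_caught_up(lag_seconds: float, lag_bytes: int) -> bool:
    # A replica is taken out of rotation only when BOTH thresholds are exceeded.
    return lag_seconds <= MAX_LAG_SECONDS or lag_bytes <= MAX_LAG_BYTES


def refresh(status: ReplicaStatus, measure_lag, now: float) -> bool:
    """Re-measure lag if the check interval has elapsed; return online state.

    `measure_lag` is a hypothetical callable returning (lag_seconds, lag_bytes).
    """
    if now - status.last_checked_at < CHECK_INTERVAL:
        return status.online  # reuse the cached result until the next check
    lag_seconds, lag_bytes = measure_lag()
    status.online = is_caught_up(lag_seconds, lag_bytes)
    status.last_checked_at = now
    return status.online
```

Because the state lives in each process, every process performs its own checks; that is the trade-off the first TODO item below refers to.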

TODO

  • Store the "online" state and timestamp in Redis so multiple processes won't perform the same work
    • Skipping this for now as performing this work in-memory is much easier.
  • Prevent a thundering herd when all replicas are offline by gradually redirecting traffic
    • Skipping this since it's not something I consider very useful. If a primary can't handle all traffic, then gradually redirecting traffic won't work, as the primary will eventually succumb anyway. This would give the false impression that the primary wouldn't go down.
    • https://gitlab.com/gitlab-com/infrastructure/issues/2480 might be a much better solution combined with this MR
    • This also requires some form of central coordination, which adds a lot of complexity.
  • Randomly adjust the check interval per host per request to reduce the likelihood of all processes checking at once
  • Test using a real replica
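One possible way to approach the TODO item about randomly adjusting the check interval is to add jitter to the base interval each time a check is scheduled. The sketch below is an assumption about how that could look, not what this MR implements; `jittered_interval` and the `spread` parameter are hypothetical.

```python
import random

BASE_INTERVAL = 30.0  # seconds, the rough check interval from the description


def jittered_interval(base: float = BASE_INTERVAL,
                      spread: float = 0.5,
                      rng: random.Random = random.Random()) -> float:
    """Return the base interval plus a random extra of up to base * spread.

    With the defaults the result falls between 30 and 45 seconds, so
    processes that started together drift apart instead of all checking
    a replica at the same moment.
    """
    return base + rng.uniform(0.0, base * spread)
```

Drawing a new interval per host after every check keeps the checks spread out without any coordination between processes.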

Does this MR meet the acceptance criteria?

What are the relevant issue numbers?

https://gitlab.com/gitlab-org/gitlab-ee/issues/2197

Edited by Yorick Peterse
