We have previously discussed moving load balancing away from the application to a load balancing proxy, but several points were raised showing that our own load balancing code is still necessary, for example to ensure that reads go to the primary immediately after a write, along with other special logic.
We have seen in several high-severity incidents, such as production#4874 (closed) and production#4820 (closed), that losing a replica causes a large spike of errors, typically for less than 5 minutes. This spike, while short, can be quite disruptive.
For the case where a replica goes offline, should we decrease the discovery interval from the default of 60 seconds, so we can detect it sooner?
Should we push more health logic into Consul? For example, the application currently checks for replication delay, but this could be handled by the service's health status instead. We could also incorporate GCP maintenance events (https://cloud.google.com/compute/docs/storing-retrieving-metadata#maintenanceevents), which come with roughly a 60-second lead time, to mark the replica as unhealthy and remove it from the pool.
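A rough sketch of what that could look like, assuming we poll the GCE instance metadata server directly; the health-check wiring around it is hypothetical:

```ruby
require 'net/http'

# Hypothetical health check: ask the GCE metadata server whether a maintenance
# event is scheduled for this instance. "NONE" means nothing is pending; any
# other value gives us roughly 60 seconds to drain the replica from the pool.
MAINTENANCE_URI = URI(
  'http://metadata.google.internal/computeMetadata/v1/instance/maintenance-event'
)

def maintenance_pending?
  request = Net::HTTP::Get.new(MAINTENANCE_URI)
  request['Metadata-Flavor'] = 'Google' # header required by the metadata server

  response = Net::HTTP.start(MAINTENANCE_URI.host, MAINTENANCE_URI.port) do |http|
    http.request(request)
  end

  response.is_a?(Net::HTTPSuccess) && response.body.strip != 'NONE'
end
```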
@Finotto would love to get your input on this. As it stands, this issue has outstanding questions that are better answered by a DBRE. Happy to partner with you on the implementation if needed.
@amoter Thank you for sharing this proposal with me. I think we need to consider some findings from our research over the last months into the degradation caused by database load on the secondaries:
The root cause we identified is a long-running transaction on the primary that keeps an old txid (transaction ID) open (that is, the secondaries keep a reference to that txid from the primary), generating an overflow of the subtransaction SLRU cache that starts to degrade performance on the secondaries; a diagnostic sketch for spotting such transactions follows below. To read more about this, we have this nice blog post from Stan: https://about.gitlab.com/blog/2021/09/29/why-we-spent-the-last-month-eliminating-postgresql-subtransactions/
We patched one node, patroni-v12-09, with a custom parameter to increase the subtransaction SLRU cache, and it seems to be working nicely.
We have already taken some corrective actions to fix this problem, but it seems that some corner cases are still occurring.
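For context, a minimal sketch of the kind of diagnostic mentioned above, assuming plain `pg_stat_activity` on the primary; the five-minute threshold is only illustrative:

```ruby
# Hypothetical diagnostic: list transactions that have been open for more than
# five minutes on the primary, since these pin old txids and can overflow the
# subtransaction SLRU cache that degrades the secondaries.
LONG_RUNNING_SQL = <<~SQL
  SELECT pid, usename, state, xact_start, now() - xact_start AS duration, query
  FROM pg_stat_activity
  WHERE xact_start IS NOT NULL
    AND now() - xact_start > interval '5 minutes'
  ORDER BY xact_start
SQL

ActiveRecord::Base.connection.select_all(LONG_RUNNING_SQL).each do |row|
  puts "#{row['pid']} open for #{row['duration']}: #{row['query']}"
end
```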
These are some activities related to the proposal; please feel free to schedule a coffee chat and we can talk about it.
@Finotto The changes we made were purely to support multiple databases. The strategy for selecting which replicas to read from remains unchanged.
I think a starting point is to determine what metrics are/were useful for identifying the past incidents. Then we have to decide under what conditions we should stop using a replica. One metric could be database load, with the load balancer shifting queries away from an overloaded replica.
The requirement here is that obtaining these metrics should be cheap and easy, meaning they are preferably stored in the replica's database itself.
Another change we could make is to increase the frequency with which a replica's status is checked. Right now the interval defaults to a random time between 60 and 120 seconds, but that might be too high for GitLab.com.
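As a rough illustration of both ideas (the names, thresholds, and interval bounds below are assumptions, not the load balancer's actual API), the check could look something like:

```ruby
# Hypothetical replica check: query a cheap, replica-local metric (replication
# lag) and re-run the check on a shorter, randomised interval than the current
# 60-120 seconds.
LAG_SQL = <<~SQL
  SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag_seconds
SQL

CHECK_INTERVAL = (15..30) # illustrative; today this is effectively 60..120

def replica_usable?(connection, max_lag_seconds: 60)
  lag = connection.select_value(LAG_SQL).to_f
  lag <= max_lag_seconds
end

def next_check_delay
  rand(CHECK_INTERVAL)
end
```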
Basically the current "algorithm" is optimised largely for code simplicity, not for obtaining 15 nines of uptime. We have plenty of options there to improve this, provided the necessary metrics are easily obtainable.
Another option is to introduce a form of shared state between the processes that use load balancing. Right now every process decides for itself which replicas to pick. This can result in all processes picking the same replica, leading to an uneven distribution of load. This is mitigated somewhat by randomising the order of replicas stored in memory, but it's not a perfect solution.
The shared state would live in Redis and would track data such as how many users (in terms of processes) a replica has. When picking a replica at the start of a request, the process would pick the replica with the fewest users. In addition, we could use Redis to track the online state of a replica. That way, if one process detects that a replica is offline, all other processes immediately disregard that replica.
Implementation-wise, I don't think this is actually all that difficult. Basically we just have a counter per DB host to track the number of "users", and a separate "is the host online" flag, at least as a starting point. Redis in turn is fast enough that we can afford to check and modify this state at the start of every request.
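A minimal sketch of that starting point, assuming a plain Redis connection; the class and key names are hypothetical, not GitLab's actual load balancer code:

```ruby
require 'redis'

# Hypothetical shared state for the load balancer: a per-host "users" counter
# and a per-host online flag, both visible to every Rails process.
class SharedReplicaState
  def initialize(redis = Redis.new)
    @redis = redis
  end

  def mark_online(host, online)
    @redis.hset('db:replica:online', host, online ? '1' : '0')
  end

  # Hosts we have no data for are treated as online.
  def online?(host)
    @redis.hget('db:replica:online', host) != '0'
  end

  # Pick the online replica with the fewest users and claim it. Callers are
  # expected to call #checkin once the request finishes.
  def checkout(hosts)
    host = hosts
      .select { |h| online?(h) }
      .min_by { |h| @redis.hget('db:replica:users', h).to_i }

    @redis.hincrby('db:replica:users', host, 1) if host
    host
  end

  def checkin(host)
    @redis.hincrby('db:replica:users', host, -1)
  end
end
```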
This also opens the door for more advanced load balancing strategies. For example, we could have an external monitoring system modify this Redis data based on various metrics. This way we don't need all sorts of (potentially) expensive checks in the Rails code.
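Continuing the hypothetical sketch above, an external monitor would only need to flip the shared flags:

```ruby
# Hypothetical external monitor: something outside Rails (an exporter, a cron
# job, etc.) evaluates its own metrics and updates the shared state so every
# process stops using an unhealthy replica at its next checkout.
state = SharedReplicaState.new

{ 'db-replica-01' => true, 'db-replica-02' => false }.each do |host, healthy|
  state.mark_online(host, healthy)
end
```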
I think in the original MR that introduced the load balancer I suggested something like this as a good option for the future. I guess that future is now.
@Finotto @yorickpeterse To give a bit more context on my question: I'm the DRI for the CA squad (epics 562/621) and we are reviewing outstanding corrective actions. This issue strikes me as beyond the scope of the CA squad to take on, because it seems to be epic-level work and should at the very least be done in partnership with, or at the direction of, the DBRE team.
Do you feel that the core reliability concerns this issue captures are being addressed in other issues or epics? If not, I propose creating a new epic/issue and closing this one. Thoughts?
@amoter I do agree that this work is epic-level and is not a Corrective Action. For the moment, I suggest we keep this issue in the backlog and raise it in future quarters' goals/projects. I also agree we should keep the database team involved in the planning; perhaps we change the priority due to the sharding goals.
The label severity4 has been applied due to a missing severity:: label.
To ensure the necessary information is captured in this issue for a ::"corrective action", please apply the corrective_action issue template, which includes a quick action to apply the default severity label.