Capacity assessment for our main production DB cluster
(Related to https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/9094)
We are taking up this capacity planning effort again. The main driver is the DB sharding conversation: we are discussing how to shard our main DB cluster in production, which could take some time to reach agreement, implementation, and rollout. Meanwhile, we'd like to be sure that our DB will not hit a wall any time soon, impacting gitlab.com.
Some relevant information we'd want to take into account (some of it comes from the above issue):
- We know there is still some vertical growth available on this cluster, particularly on our leader node (read/write), which is involved in many of the operations that eventually reach our replicas as well. Faster CPUs, more memory: there is still some headroom.
- Questions we wanted to answer at that time (3 months ago):
- How much more load could we add to the primary database without performance degradation?
- We are reaching 60k connections per second. How much could we grow, safely?
- Should we increase disk or memory to optimize performance? Any suggestions on the `shared_buffers` setup?
- Estimate what percentage of the load is related to the `merge_requests` table.
- Measure the hot set of data we currently keep in memory. How often are we writing to disk? (`pg_buffercache` and the usage/LRU count.)
- Evaluate architectures with a higher number of vCPUs.
Edited by Jose Finotto