Discussion: What do we need to do to handle scaling of DB Connections

During the rollback test on Production (jump to 01:14:49 for the relevant discussion), we noticed that DB connections increase by ~25% when we bring new pods online before scaling old pods down during deployments. During the rollback, there were times when we were using 80% of the available DB connections.

As we move the API and soon the Web over to Kubernetes, we'll be scaling more pods. This, along with increased traffic, makes it likely that we'll run out of DB connections at some point.
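To make the risk concrete, here's a minimal sketch of the deploy-time arithmetic, assuming a surge-style rolling update where new pods come up before old ones terminate. All numbers (pod count, per-pod pool size, connection limit) are hypothetical placeholders, not our actual production values.

```python
# Hypothetical deploy-time connection arithmetic. All values below are
# illustrative placeholders, not actual production configuration.
import math

PODS = 40                 # steady-state pod count (hypothetical)
POOL_SIZE = 10            # DB connections held per pod (hypothetical)
MAX_CONNECTIONS = 625     # DB server connection limit (hypothetical)
SURGE = 0.25              # fraction of extra pods during a rolling deploy

steady = PODS * POOL_SIZE
during_deploy = math.ceil(PODS * (1 + SURGE)) * POOL_SIZE

print(f"steady-state utilization: {steady / MAX_CONNECTIONS:.0%}")   # 64%
print(f"deploy-time utilization:  {during_deploy / MAX_CONNECTIONS:.0%}")  # 80%

# With these placeholder numbers, a deploy pushes utilization from 64% to
# 80%, matching the shape of what we observed. Adding pods raises both
# figures proportionally, so headroom disappears quickly as we scale.
```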

Let's use this issue to decide how to avoid this, and to think about whether we need to add any additional checks when scaling.
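As a starting point for discussion, here's a minimal sketch of what such a pre-scale check could look like, assuming the database is PostgreSQL. The DSN, threshold, pool size, and function name are all hypothetical, not an existing check in our tooling.

```python
# Hypothetical pre-scale guard: refuse to add pods if the projected
# connection count would exceed a safety threshold. Assumes PostgreSQL;
# the DSN, threshold, and pool size are illustrative placeholders.
import psycopg2

DSN = "dbname=gitlabhq_production"  # hypothetical connection string
THRESHOLD = 0.85                    # refuse to scale past 85% utilization
POOL_SIZE = 10                      # connections each new pod will open (hypothetical)

def safe_to_add_pods(new_pods: int) -> bool:
    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            cur.execute("SELECT count(*) FROM pg_stat_activity")
            in_use = cur.fetchone()[0]
            cur.execute("SHOW max_connections")
            limit = int(cur.fetchone()[0])
    projected = in_use + new_pods * POOL_SIZE
    return projected / limit <= THRESHOLD
```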

cc/ @gitlab-org/delivery

Summary of findings as of 2021-05-27

This was a defect in our saturation metric, rather than an actual capacity problem.

The saturation metric represents the worst outlier node's sustained utilization rather than fleet-wide utilization. Follow-up should focus on improving the utilization metric; there is no need at this time to add capacity to the DB connection pool.
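A minimal sketch of the difference between the two aggregations, with hypothetical per-node numbers: reporting only the worst node's utilization can signal saturation even when the fleet as a whole has ample headroom.

```python
# Worst-outlier vs. fleet-wide utilization. Per-node numbers are
# hypothetical; the point is the aggregation, not the values.
nodes = [
    # (connections_in_use, connection_capacity) per node
    (80, 100),   # hot outlier node: 80% utilized
    (30, 100),
    (25, 100),
    (20, 100),
]

worst_node = max(used / cap for used, cap in nodes)
fleet_wide = sum(used for used, _ in nodes) / sum(cap for _, cap in nodes)

print(f"worst-outlier saturation: {worst_node:.0%}")   # 80%
print(f"fleet-wide utilization:   {fleet_wide:.0%}")   # 39%
```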

For more details, see summary notes here: #1077 (comment 587300443)
