Implement a mechanism to roll out "Phase 4" of decomposition gradually
## Problem
As part of our rollout of decomposition we want to de-risk phase 4 as much as possible. Phase 4 is a big change to GitLab.com architecture as it will change the way the application handles write connections to the primary database. It is the first step where the application will switch between using 2 different write connections depending on the tables being accessed.
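For context, switching between 2 write connections builds on Rails' (6.1+) multiple-database support. A minimal sketch, assuming a `ci:` entry exists in `database.yml` alongside `main:`; the class names are illustrative, not the exact GitLab implementation:

```ruby
# Sketch of Rails multiple-database routing inside a Rails app.
# Assumes database.yml defines both `main:` and `ci:` databases.
class ApplicationRecord < ActiveRecord::Base
  self.abstract_class = true
  connects_to database: { writing: :main } # writes via PGBouncerMain
end

module Ci
  class ApplicationRecord < ::ApplicationRecord
    self.abstract_class = true
    connects_to database: { writing: :ci } # writes via PGBouncerCi
  end

  # CI tables inherit the CI write connection.
  class Build < ApplicationRecord
  end
end
```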
## Solution
There are 2 ways we are de-risking this:
- The 2 write connections will go to different PGBouncer instances, but those PGBouncer instances will still point to the same primary write database. This makes rollback very safe, as all data is still written to the same place. Catching connection problems well before we reach our migration point of no return is the intent of "Phase 4"
- We plan to roll it out based on a percentage of requests/jobs in the GitLab Rails application, so that we can roll it out and roll it back very quickly
(1) is handled by our rollout plan, but (2) is still to be implemented, and that is what this issue describes; a rough sketch of the mechanism follows.
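A minimal sketch of what the percentage knob could look like in Rails, assuming a runtime setting for the rollout percentage (the setting name and classes here are assumptions, not existing GitLab code):

```ruby
# Illustrative sketch only: X is the rollout percentage (0-100),
# read from an assumed runtime setting so it can be tweaked live.
ROLLOUT_PERCENTAGE = ENV.fetch('CI_DB_WRITE_ROLLOUT_PERCENTAGE', '0').to_i

# Pick the write connection per request/job: X% of CI writes go to
# the new PGBouncerCi endpoint, the rest stay on PGBouncerMain.
def ci_write_connection
  if Random.rand(100) < ROLLOUT_PERCENTAGE
    Ci::ApplicationRecord.connection # via PGBouncerCi
  else
    ApplicationRecord.connection     # via PGBouncerMain
  end
end
```

Sampling per request/job (rather than per process) is what would let a rollback take effect immediately, without restarting the fleet.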
## Technical problem to solve for connections
The gradual rollout will mean our DB connection architecture looks as follows:
```mermaid
graph LR;
  PostgresMain[(PostgresMain - Limit K=500 max_connections)]
  GitLabRails-->|100-X % of CI queries|PGBouncerMain
  GitLabRails-->|X% of CI queries|PGBouncerCi
  PGBouncerMain-->|Limit N=270 pool_size|PostgresMain
  PGBouncerCi-->|Limit M pool_size|PostgresMain
```
We will want to gradually increase `X` from 0 to 100. This presents a problem, because the number of connections to the `PostgresMain` DB will change with `X`. We assume it has some initial limit of `K` connections, and that this limit is deliberately just high enough to handle the current connections from `PGBouncerMain`. If we keep the limit at the current value `K`, then we will eventually run out of connections as we increase `X`. If we increase it to `2K`, then we won't run out of connections, but our final system will have `2K` connections, which may overload the primary. We could reduce the `N` and `M` limits in PGBouncer to `K/2` (or half their current value), but then the initial deployment (where `X=0%`) will have too few primary database connections, and `GitLabRails` will get timeouts waiting for PGBouncer to hand out a free connection, since we assume all current parameters are well tuned.

We need to manage the rollout so that there is no point in time where the primary has too many open connections, and no point in time where we have too few. The overall load (total writes from `GitLabRails`) won't change during the rollout, so there should be a good/safe way to do this; we just need to plan it out. Our original plan only involved live tweaking of `X` (a Rails configuration parameter), but this may not be sufficient if we also need PGBouncer nodes to drop connections as new connections are opened on another PGBouncer node.
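To make the dilemma concrete, a small worked sketch in Ruby (`K` and `N` come from the diagram above; the two naive strategies are the ones just described):

```ruby
# Illustrative arithmetic for the two naive strategies above.
K = 500 # current Postgres max_connections
N = 270 # current total PGBouncerMain pool_size (well tuned)

# Strategy A: keep the main pool at N and let PGBouncerCi grow to
# its own N-sized pool. The peak total exceeds K, so max_connections
# would have to grow toward 2K, which may overload the primary.
peak = N + N
puts "A: peak #{peak} connections > K=#{K}" if peak > K

# Strategy B: pre-shrink both pools to K/2 so their sum never
# exceeds K. But at X=0% the main pool is below its tuned size,
# so Rails queues waiting on PGBouncer for a free connection.
shrunk = K / 2
puts "B: main pool #{shrunk} < tuned size #{N}" if shrunk < N
```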
## Stats

- `max_connections` for Postgres primary: 500
- `pool_size` for PGBouncer Web/API hosts: 50
- `pool_size` for PGBouncer Sidekiq hosts: 40
- `sv_active + sv_idle + sv_used` during busy times: ``
- `sv_active + sv_idle + sv_used` during quiet times: ``
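The two missing `sv_*` figures can be pulled from the PGBouncer admin console with `SHOW POOLS`. A sketch using the Ruby `pg` gem (host, port, and user are placeholders; the admin console accepts only simple queries, which `exec` uses):

```ruby
# Sketch: sum server-side connection states from PGBouncer's admin
# console. Host/user are placeholders; requires the `pg` gem.
require 'pg'

conn = PG.connect(host: 'pgbouncer-ci-01.internal', port: 6432,
                  dbname: 'pgbouncer', user: 'pgbouncer-admin')

# SHOW POOLS lists per-pool server connection counts; their sum is
# the number of open connections this PGBouncer holds to Postgres.
conn.exec('SHOW POOLS').each do |row|
  total = %w[sv_active sv_idle sv_used].sum { |col| row[col].to_i }
  puts "#{row['database']}/#{row['user']}: #{total} server connections"
end

conn.close
```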
## Current migration plan
- Do all the below steps on staging or some other environment with load to verify PGBouncer behaves sensibly (i.e. it closes connections in a reasonably timely manner)
- During quiet working hours (~08:00 UTC):
  - Set the pool size limit of the new PGBouncer hosts to `1` for Web/API and Sidekiq
  - Enable 1% of CI related queries to go to the new PGBouncer hosts
  - Observe the new CI PGBouncer hosts start accepting writes and opening connections to Postgres. They should not exceed a total of 6 connections across all 6 hosts
- WAIT 24 hours to be sure nothing is wrong
- At non-peak hours:
  - Set `reserve_pool=0` for the main PGBouncer hosts
  - Set the pool size limit of the new PGBouncer hosts to `5` for Web/API and Sidekiq
  - Enable 15% of CI related queries to go to the new PGBouncer hosts
  - Observe the new CI PGBouncer hosts start accepting writes and opening connections to Postgres. They should not exceed a total of 30 connections across all 6 hosts
  - Decrease the total connections on the old PGBouncer hosts by 5 (assuming the data suggests this is fine)
  - Observe that some connections are now closed on the main PGBouncer hosts after a few minutes (or that the number open was already below the new limit)
- Use metrics from the above steps to figure out the next increment sizes (how much % to enable and how many connections to move); a sanity-check sketch of the expected ceilings follows
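As a sanity check on the plan above, a small sketch of the expected connection ceilings at each step (the pool sizes and 6-host count come from the plan; treating the sum of pool limits as a hard ceiling is a simplifying assumption):

```ruby
# Sanity-check sketch: upper bounds on primary connections per step.
# Values come from the plan above; summing pool limits is a
# simplifying assumption (reserve pools etc. are ignored).
K = 500 # Postgres primary max_connections

steps = [
  # label,            main pool total, CI pool total (hosts * pool_size)
  ['X=0%  baseline',  270,     0],
  ['X=1%  step one',  270,     6 * 1],
  ['X=15% step two',  270 - 5, 6 * 5],
]

steps.each do |label, main, ci|
  total = main + ci
  flag = total <= K ? 'ok' : 'OVER LIMIT'
  puts format('%-16s main=%3d ci=%2d total=%3d / %d %s',
              label, main, ci, total, K, flag)
end
```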