WIP: Split sidekiq and production pools PgBouncer RW layer

Production Change - Criticality 1 C1

Change Objective PgBouncer endpoint addition (and migration) for sidekiq as a frontend pool for RW
Change Type Frontend PgBouncers Pool split (Sidekiq and Production)
Services Impacted ~"Service:Postgres" ~"Service:Pgbouncer"
Change Team Members SRE - Ongres/ @hphilipps /
Change Severity C1
Buddy check or tested in staging @emanuel_ongres @gerardo.herzig @kadaffy
Schedule of the change -
Duration of the change

General

Due to PgBouncer's single-core architecture and its inability to manage pools in an isolated manner (that is, all pools share the same queue), it is necessary to split the two pool families (production and sidekiq) into their own dedicated pools/services.

This issue scopes the change to the external (RW) PgBouncer frontend nodes, which sit behind an iLB that balances traffic across the nodes and issues health checks against the PgBouncer port. This requires additional resources on GCP (an iLB for sidekiq) -- cc/ @ahmadsherif.

Currently, for the RW pool, each of the 2 active nodes has the following pool setup:

  • Pool sizes of 50 (production) and 75 (sidekiq), plus the geo pool.
  • 4096 clients allowed to connect (of which sidekiq has higher priority on both nodes).

There is no need for a failover in order to execute this change, as the RW pool resides in a different layer. However, some additional operational steps should be reviewed and executed.

By splitting services, we also increase each service's capacity in terms of allowed clients.

A different min_pool_size can be set for each PgBouncer service once they are split:

Recommended values: min_pool_size = 20 (production) and min_pool_size = 10 (sidekiq).
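As a sketch, the recommended values would land in each service's ini file (paths follow the layout used elsewhere in this issue; the production path is an assumption):

```ini
; /var/opt/gitlab/pgbouncer/pgbouncer.ini (production service)
min_pool_size = 20

; /var/opt/gitlab/pgbouncer/pgbouncer_sidekiq.ini (sidekiq service)
min_pool_size = 10
```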

Motivation

The motivation behind this re-architecture of the frontend PgBouncer is to:

  • Prevent large incoming operations from sidekiq from impacting production operations.
  • Distribute CPU processing across the other cores of the instance (4-core/8-thread size).
  • Allow deploying different pool sizes for each pool depending on the accumulated operations in sidekiq; that is, control the maximum number of active connections from sidekiq to avoid unnecessary concurrency.
  • Enhance the observability of each pool's behavior.

Plan

  • Define an additional PgBouncer service on each PgBouncer node ( * ):
    • pgbouncer-01-db-gprd.c.gitlab-production.internal (current pgbouncer standby)
      • This node has no /usr/bin/python -m SimpleHTTPServer 8010 running.
    • pgbouncer-02-db-gprd.c.gitlab-production.internal
    • pgbouncer-03-db-gprd.c.gitlab-production.internal

( * )

/var/opt/gitlab/pgbouncer/databases_sidekiq.ini

[databases]
gitlabhq_production_sidekiq = host=master.patroni.service.consul port=5432 pool_size=50 auth_user=pgbouncer dbname=gitlabhq_production

PgBouncer pgbouncer_sidekiq.ini (variables that differ from pgbouncer.ini)

listen_port = 6433
unix_socket_dir = /var/opt/gitlab/pgbouncer_sidekiq
max_client_conn = 4096
default_pool_size = 20
min_pool_size = 10

pidfile = /var/opt/gitlab/pgbouncer_sidekiq.pid
%include /var/opt/gitlab/pgbouncer/databases_sidekiq.ini

/etc/systemd/system/pgbouncer_sidekiq.service

[Unit]
Description=pgbouncer_sidekiq

[Service]
Environment=
ExecStart=/usr/local/bin/pgbouncer /var/opt/gitlab/pgbouncer/pgbouncer_sidekiq.ini
ExecReload=/bin/kill -HUP $MAINPID
KillSignal=TERM
User=gitlab-psql
WorkingDirectory=/
Restart=on-failure
LimitNOFILE=4096

[Install]
WantedBy=multi-user.target

/etc/systemd/system/pgbouncer_sidekiq.service.d/override.conf

[Service]
LimitAS=infinity
LimitCORE=0
LimitCPU=infinity
LimitDATA=infinity
LimitFSIZE=infinity
LimitLOCKS=infinity
LimitMEMLOCK=65536
LimitMSGQUEUE=819200
LimitNICE=0
LimitNOFILE=50000
LimitNPROC=infinity
LimitRSS=infinity
LimitRTPRIO=0
LimitSIGPENDING=62793
LimitSTACK=10485760
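With the unit file and override above in place, bringing up the new service is a short sequence (a sketch; assumes listen_addr is inherited from the base pgbouncer.ini so the service accepts TCP connections):

```shell
# Pick up the new unit and its override, then start the sidekiq PgBouncer
sudo systemctl daemon-reload
sudo systemctl enable --now pgbouncer_sidekiq

# Confirm it is listening on the dedicated port
ss -ltnp | grep 6433

# Smoke-test a connection through the new pool
psql -h 127.0.0.1 -p 6433 -U pgbouncer -d gitlabhq_production_sidekiq -c 'SELECT 1;'
```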

GCP steps:

  • Configure additional GCP Health Check for sidekiq Pgbouncer services. cc/ @ahmadsherif

  • Add an iLB for sidekiq -- listed first, as this can be done at any stage

    • SimpleHTTPServer check can be reused for Health checks
    • Forwarding rules should point to 6433 port.
  • Redirect traffic from the sidekiq nodes.

    • @Finotto Please add here whoever team is involved in such change.
  • Wait until other pools drain sidekiq traffic.

    • OnGres @gerardo.herzig: please collect here the links of the necessary dashboards or scripts to do so.
  • Change databases.ini to remove the sidekiq pool from what is now the production pool (highlighting the change of the pool size):

[databases]
gitlabhq_production = host=master.patroni.service.consul port=5432 pool_size=75 auth_user=pgbouncer
gitlabhq_geo_production = auth_user=pgbouncer
  • Remove the sidekiq pool from the pgbouncer service in favor of the new service
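The drain and reload steps above could be checked from the PgBouncer admin console (a sketch; host/port and pool names assumed from the configs in this issue):

```shell
# Watch the sidekiq pool drain on each node; repeat until
# cl_active/sv_active reach zero before removing the pool
psql -h localhost -p 6432 -U pgbouncer -d pgbouncer -c 'SHOW POOLS;' | grep sidekiq

# After editing databases.ini to drop the sidekiq entry, reload the
# main PgBouncer without restarting it
psql -h localhost -p 6432 -U pgbouncer -d pgbouncer -c 'RELOAD;'
```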

( * )

Currently, the HTTP check runs on port 8010 (https://console.cloud.google.com/compute/healthChecks/details/gprd-pgbouncer-http?project=gitlab-production), served by the script shown at (***).

(***)

root@pgbouncer-03-db-gprd.c.gitlab-production.internal:/var/opt/gitlab# ps 18754
  PID TTY      STAT   TIME COMMAND
18754 ?        S      2:28 /usr/bin/python -m SimpleHTTPServer 8010
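The endpoint can be probed directly to confirm what the GCP health check sees (a sketch; run on a PgBouncer node):

```shell
# The iLB health check expects an HTTP 200 from this port
curl -sS -o /dev/null -w '%{http_code}\n' http://localhost:8010/
```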

Production iLB / endpoint: https://console.cloud.google.com/net-services/loadbalancing/details/internal/us-east1/gprd-pgbouncer-regional?project=gitlab-production i.gprd-gcp-tcp-lb-internal-pgbouncer.il4.us-east1.lb.gitlab-production.internal 10.217.4.5:6432

Finalize

  • Monitor that traffic flows as usual.
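A hypothetical spot check via the admin console of both services (ports per the configs above):

```shell
# Per-pool traffic counters for the production service (port 6432)
psql -h localhost -p 6432 -U pgbouncer -d pgbouncer -c 'SHOW STATS;'
# ...and for the new sidekiq service (port 6433)
psql -h localhost -p 6433 -U pgbouncer -d pgbouncer -c 'SHOW STATS;'
```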

Changes

-- Updated through production#1373 (closed) Additionally, it is important to highlight the present imbalance between the production and sidekiq pools. Of the overall maximum of 250 active connections across all active PgBouncers:

  • 150 are for Sidekiq
  • 100 are for Production

This configuration exposes production to large operations (usually coming from Sidekiq) under heavy concurrency. It is recommended to switch the pool sizes (that is, the active connections), ending up with 150 for production (75 per node) and 100 for sidekiq (50 per node).
