Downtime caused by running "gitlab-ctl reconfigure" on pgbouncer hosts

At roughly 16:15 UTC we changed server_idle_timeout for pgbouncer from 90 seconds to 65 seconds. This was committed in chef-repo commits 77ccf2ff0312cf5730e1a2baaf1e88436f0666cc and 3e666d75b3d2d61b6064ad6cbb84e01e6465a2fb, followed by running sudo chef-client and sudo gitlab-ctl reconfigure on the following hosts:

  • 10.66.4.102 pgbouncer-02
  • 10.66.1.102 postgres-02
  • 10.66.1.103 postgres-03
  • 10.66.1.104 postgres-04
  • 10.66.1.101 postgres-01
  • 10.66.4.101 pgbouncer-01
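
A sketch of the per-host rollout described above (the two commands were run in this order on each host):

# Apply the updated chef-repo role attributes on the host
sudo chef-client

# Regenerate the Omnibus-managed service configuration from the new attributes
sudo gitlab-ctl reconfigure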

The diff of these changes:

commit 3e666d75b3d2d61b6064ad6cbb84e01e6465a2fb (HEAD -> master)
Author: Yorick Peterse <yorickpeterse@gmail.com>
Date:   Mon Mar 19 17:18:23 2018 +0100

    Reduce server_idle_timeout to 65 seconds

diff --git a/roles/gitlab-base-db-pgbouncer.json b/roles/gitlab-base-db-pgbouncer.json
index 5630d8e0..d3b8aff2 100644
--- a/roles/gitlab-base-db-pgbouncer.json
+++ b/roles/gitlab-base-db-pgbouncer.json
@@ -35,13 +35,14 @@
           "reserve_pool_timeout": 3,
           "max_client_conn": 2048,
           "pool_mode": "transaction",
-          "server_idle_timeout": 90,
+          "server_idle_timeout": 65,
           "enable": true,
           "databases": {
             "gitlabhq_production": {
+
             },
             "gitlabhq_production_sidekiq": {
-              "dbname":"gitlabhq_production",
+              "dbname": "gitlabhq_production",
               "pool_size": "150"
             }
           },
@@ -99,4 +100,4 @@
     "role[gitlab-base-db]",
     "recipe[gitlab-mtail::pgbouncer]"
   ]
-}
+}
\ No newline at end of file

commit 77ccf2ff0312cf5730e1a2baaf1e88436f0666cc
Author: Yorick Peterse <yorickpeterse@gmail.com>
Date:   Mon Mar 19 17:17:51 2018 +0100

    Reduce server_idle_timeout to 65 seconds

diff --git a/roles/gitlab-base-db-postgres.json b/roles/gitlab-base-db-postgres.json
index 8467bc6f..e5cbe898 100644
--- a/roles/gitlab-base-db-postgres.json
+++ b/roles/gitlab-base-db-postgres.json
@@ -374,7 +374,7 @@
           "reserve_pool_timeout": 3,
           "max_client_conn": 2048,
           "pool_mode": "transaction",
-          "server_idle_timeout": 90,
+          "server_idle_timeout": 65,
           "databases": {
             "gitlabhq_production": {
               "host": "127.0.0.1",

A change to this setting only requires a reload of pgbouncer, which can be done without incurring any downtime (and we have this set up in Omnibus).
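
As an illustration of such a zero-downtime reload, pgbouncer can be told to re-read its configuration through its admin console without dropping established connections. This is only a sketch; the port (6432) and admin user (pgbouncer) are the usual defaults and may differ in our setup:

# Ask pgbouncer to re-read its configuration files; RELOAD does not
# terminate existing client or server connections
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'RELOAD;'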

Unfortunately, the reconfigure appeared to have generated an incorrect configuration file, leaving pgbouncer unable to connect to our databases. This was resolved when @jtevnan restarted consul on pgbouncer-01, which forced another reconfigure and generated the correct configuration.
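
One way to confirm that pgbouncer ended up with a sane configuration again is to ask it which database targets it is actually using (same assumptions about port and admin user as above):

# List the database definitions pgbouncer currently has loaded; after the consul
# restart these should again point at the correct backend
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'SHOW DATABASES;'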

Timeline

Date: 2018-03-19

  • 16:15 UTC: changes applied, pgbouncer started reloading
  • 16:15 - 16:20 UTC: GitLab.com became unresponsive
  • 16:34 UTC: @jtevnan restarted consul, resolving the problem

Incident Analysis

  • How was the incident detected?
    • GitLab.com became unresponsive
    • Alerts fired because no transactions were being processed on the databases (a sketch of that kind of check follows after this list)
  • Is there anything that could have been done to improve the time to detection?
    • No, we detected the issue the moment it occurred
  • How was the root cause discovered?
    • Mostly by looking at our logs per https://gitlab.com/gitlab-com/infrastructure/issues/3876#note_64118085
  • Was this incident triggered by a change?
    • Yes, an otherwise harmless change somehow triggered a separate process that broke our configuration
  • Was there an existing issue that would have either prevented this incident or reduced the impact?
    • Not that I am aware of
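
For reference, the kind of signal those alerts are based on can be approximated by sampling transaction counters on a database node; this is only a sketch run directly against PostgreSQL, not the actual alerting rule in our monitoring stack:

# Total committed/rolled-back transactions across all databases; if this number
# stops increasing between samples, no transactions are being processed
sudo gitlab-psql -c "SELECT sum(xact_commit + xact_rollback) AS transactions FROM pg_stat_database;"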

Root Cause Analysis

  1. Why did we have downtime? We applied a change to pgbouncer that should only have required a zero-downtime reload.
  2. Why did this change cause downtime? Details can be found in https://gitlab.com/gitlab-com/infrastructure/issues/3876#note_64118085, but it is not yet entirely clear what happened.

What went well

The problem was noticed shortly after it surfaced, and the team jumped into a meeting almost immediately; in other words, the response time was very short.

What can be improved

  • Using the root cause analysis, explain what things can be improved.

Corrective actions

  • https://gitlab.com/gitlab-com/infrastructure/issues/3884

Guidelines

  • Blameless Postmortems Guideline
  • 5 whys