Downtime caused by running "gitlab-ctl reconfigure" on pgbouncer hosts

At roughly 16:15 UTC we changed server_idle_timeout for pgbouncer from 90 seconds to 65 seconds. This was committed in chef-repo commits 77ccf2ff0312cf5730e1a2baaf1e88436f0666cc and 3e666d75b3d2d61b6064ad6cbb84e01e6465a2fb, followed by running sudo chef-client and sudo gitlab-ctl reconfigure on the following hosts:

  • 10.66.4.102 pgbouncer-02
  • 10.66.1.102 postgres-02
  • 10.66.1.103 postgres-03
  • 10.66.1.104 postgres-04
  • 10.66.1.101 postgres-01
  • 10.66.4.101 pgbouncer-01
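
A sketch of the per-host rollout described above (the two commands were run in this order on each host):

# Apply the updated chef-repo role attributes on the host
sudo chef-client

# Regenerate the Omnibus-managed service configuration from the new attributes
sudo gitlab-ctl reconfigure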

The diff of these changes:

commit 3e666d75b3d2d61b6064ad6cbb84e01e6465a2fb (HEAD -> master)
Author: Yorick Peterse <yorickpeterse@gmail.com>
Date:   Mon Mar 19 17:18:23 2018 +0100

    Reduce server_idle_timeout to 65 seconds

diff --git a/roles/gitlab-base-db-pgbouncer.json b/roles/gitlab-base-db-pgbouncer.json
index 5630d8e0..d3b8aff2 100644
--- a/roles/gitlab-base-db-pgbouncer.json
+++ b/roles/gitlab-base-db-pgbouncer.json
@@ -35,13 +35,14 @@
           "reserve_pool_timeout": 3,
           "max_client_conn": 2048,
           "pool_mode": "transaction",
-          "server_idle_timeout": 90,
+          "server_idle_timeout": 65,
           "enable": true,
           "databases": {
             "gitlabhq_production": {
+
             },
             "gitlabhq_production_sidekiq": {
-              "dbname":"gitlabhq_production",
+              "dbname": "gitlabhq_production",
               "pool_size": "150"
             }
           },
@@ -99,4 +100,4 @@
     "role[gitlab-base-db]",
     "recipe[gitlab-mtail::pgbouncer]"
   ]
-}
+}
\ No newline at end of file

commit 77ccf2ff0312cf5730e1a2baaf1e88436f0666cc
Author: Yorick Peterse <yorickpeterse@gmail.com>
Date:   Mon Mar 19 17:17:51 2018 +0100

    Reduce server_idle_timeout to 65 seconds

diff --git a/roles/gitlab-base-db-postgres.json b/roles/gitlab-base-db-postgres.json
index 8467bc6f..e5cbe898 100644
--- a/roles/gitlab-base-db-postgres.json
+++ b/roles/gitlab-base-db-postgres.json
@@ -374,7 +374,7 @@
           "reserve_pool_timeout": 3,
           "max_client_conn": 2048,
           "pool_mode": "transaction",
-          "server_idle_timeout": 90,
+          "server_idle_timeout": 65,
           "databases": {
             "gitlabhq_production": {
               "host": "127.0.0.1",

A change to this setting only requires a reload of pgbouncer, which can be done without incurring any downtime (and we have this set up in Omnibus).
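
As an illustration of such a zero-downtime reload, pgbouncer can be told to re-read its configuration through its admin console without dropping established connections. This is only a sketch; the port (6432) and admin user (pgbouncer) are the usual defaults and may differ in our setup:

# Ask pgbouncer to re-read its configuration files; RELOAD does not
# terminate existing client or server connections
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'RELOAD;'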

Unfortunately, the reconfigure appeared to have generated an incorrect configuration file, leaving pgbouncer unable to connect to our databases. This was resolved when @jtevnan restarted consul on pgbouncer-01, which forced another reconfigure and generated the correct configuration.
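
One way to confirm that pgbouncer ended up with a sane configuration again is to ask it which database targets it is actually using (same assumptions about port and admin user as above):

# List the database definitions pgbouncer currently has loaded; after the consul
# restart these should again point at the correct backend
psql -h 127.0.0.1 -p 6432 -U pgbouncer pgbouncer -c 'SHOW DATABASES;'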

Timeline

Date: 2018-03-19

  • 16:15 UTC: changes applied, pgbouncer started reloading
  • 16:15 - 16:20 UTC: GitLab.com became unresponsive
  • 16:34 UTC: @jtevnan restarted consul, resolving the problem

Incident Analysis

  • How was the incident detected?
    • GitLab.com became unresponsive
    • Alerts fired because no transactions were being processed on the databases (a sketch of that kind of check follows after this list)
  • Is there anything that could have been done to improve the time to detection?
    • No, we detected the issue the moment it occurred
  • How was the root cause discovered?
    • Mostly by looking at our logs per https://gitlab.com/gitlab-com/infrastructure/issues/3876#note_64118085
  • Was this incident triggered by a change?
    • Yes, an otherwise harmless change somehow triggered a separate process that broke our configuration
  • Was there an existing issue that would have either prevented this incident or reduced the impact?
    • Not that I am aware of
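
For reference, the kind of signal those alerts are based on can be approximated by sampling transaction counters on a database node; this is only a sketch run directly against PostgreSQL, not the actual alerting rule in our monitoring stack:

# Total committed/rolled-back transactions across all databases; if this number
# stops increasing between samples, no transactions are being processed
sudo gitlab-psql -c "SELECT sum(xact_commit + xact_rollback) AS transactions FROM pg_stat_database;"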

Root Cause Analysis

  1. Why did we have downtime? We applied a change to pgbouncer that should only have required a zero-downtime reload.
  2. Why did this change cause downtime? Details can be found in https://gitlab.com/gitlab-com/infrastructure/issues/3876#note_64118085, but it is not yet entirely clear what happened.

What went well

The problem was noticed shortly after it surfaced, and the team jumped into a meeting almost immediately; in other words, the response time was very short.

What can be improved

  • Using the root cause analysis, explain what things can be improved.

Corrective actions

  • https://gitlab.com/gitlab-com/infrastructure/issues/3884

Guidelines

  • Blameless Postmortems Guideline
  • 5 whys