2020-10-24: Increased backend errors - GitLab.com Down
Summary
Current State: GitLab.com is up.
Database load spiked, which dramatically increased error rates and degraded response times. Automatic re-indexing had been enabled over the weekend, and re-indexing a functional index on the routes table left that index without statistics, which resulted in much less efficient queries against the table. A manual re-analyze of the table fixed the issue, and automatic re-indexing has been disabled.
Outage from 2020-10-24 09:16 to 10:25 UTC.
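For context, the re-indexing approach rebuilds the index concurrently and then swaps names, the step the 09:14 timeline entry refers to. The sketch below is a minimal illustration of that pattern and the `ANALYZE` step it was missing; the exact index definition (`LOWER(path)` on `routes`) and statement sequence are assumptions, not the cron job's actual implementation:

```sql
-- Minimal sketch of the build-and-swap re-indexing pattern (assumed index
-- definition and statement order).
-- 1. Build a replacement index without blocking writes.
CREATE INDEX CONCURRENTLY index_on_routes_lower_path_new
  ON routes (LOWER(path));

-- 2. Drop the old index and take over its name.
DROP INDEX CONCURRENTLY index_on_routes_lower_path;
ALTER INDEX index_on_routes_lower_path_new
  RENAME TO index_on_routes_lower_path;

-- 3. For an expression ("functional") index, planner statistics are gathered
--    on the index itself, so the freshly built index has none until the table
--    is analyzed. Skipping this step is what degraded the query plans.
ANALYZE routes;
```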
Timeline
All times UTC.
2020-10-24
- 09:12 - starting to re-index `index_on_routes_lower_path`
- 09:14 - swapped the old and new index by name and deleted the old index. This left the statistics for the index missing and made queries slow.
- 09:15 - starting to get alerts and pages all over the place
- 09:19 - hphilipps declares incident in Slack using the `/incident declare` command, paging IMOC and CMOC.
- 09:32 - Escalated to IMOC again.
- 09:32 - graphs https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1
- 09:46 - looking at https://log.gprd.gitlab.net/goto/50cd8288ecda46c79ddf7b605795cbeb - looks like a hug of death, not an attack: a user is deleting a large number of tags
- 09:51 - started by trying to block the particular URL deleting the tags, but now blocking the entire identified network that is sending the troublesome requests. (This later turned out to be unrelated.)
- 10:05 - DB load is low, but pg_bouncer_sync_replica_pool saturation is 100%
  - Looking into pg_bouncer: https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1
  - jwtController on the main stage has a lot of timeouts - looking at blocking it (this will affect the registry)
- 10:19 - the block on jwt_auth appears to have made a difference.
  - Things in pg_bouncer/patroni appear to be recovering.
- 10:21 - removing the initial project and IP blocks as they did not appear to make a difference. Leaving the block on jwt_auth in place.
- 10:54 - suspecting automatic re-indexing to have caused the issue, disabling the feature flag: `/chatops run feature set database_reindexing false`
- 11:08 - found that re-collecting statistics on the routes table (a manual `ANALYZE`) changed the query plan - see the sketch after this timeline
- 11:12 - removed the temporary blocks of request paths that were related to the issues.
- 11:20 - all traffic is operating normally
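To make the 11:08 finding concrete, a plan check before and after the re-analyze shows the difference. The query below is only illustrative (the actual slow query was identified by OnGres and is not reproduced here), and it assumes the index covers `LOWER(path)`:

```sql
-- Illustrative plan check: before the manual ANALYZE the planner had no
-- statistics for the lower(path) expression and chose a much less efficient
-- plan; afterwards it went back to using index_on_routes_lower_path.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM routes
WHERE LOWER(path) = LOWER('gitlab-org/gitlab');
```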
Incident Review
Summary
An automatic DB re-indexing cron job, which had been enabled that weekend, led to missing table statistics for an important index, which made all queries related to that index very inefficient and slowed down the whole database. This caused an apdex drop to around 50% for most services between 09:16 and 10:25 UTC (69 minutes), and for the container registry service until 11:20 (124 minutes).
- Service(s) affected: All services depending on the DB
- Team attribution: @gitlab-org/database-team
- Minutes downtime or degradation: 69 minutes (124 minutes for container registry)
Metrics
For Container Registry: (graphs omitted)
Customer Impact
- Who was impacted by this incident?
- All users
- What was the customer experience during the incident?
- Very slow responses or 500 errors
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
- API requests dropped to 50% (Grafana), web requests dropped by 75% (Grafana), CI job requests dropped by 50%, and job durations increased significantly.
Incident Response Analysis
- How was the event detected?
- The EOC got paged for increased backend error rates and a failing Pingdom check.
- How could detection time be improved?
- Detection worked well.
- How did we reach the point where we knew how to mitigate the impact?
- After checking CloudFlare traffic and whether we had had a DB failover, we spent time looking for abusive requests and identifying the responsible controllers. The pgbouncer dashboard's controller panel and OnGres' analysis of the dominating slow queries helped us identify and block the `/jwt/auth` endpoint as the main producer of DB load as a first mitigation. As @abrandl had informed the EOC about enabling the re-indexing cron job the prior day (and it was mentioned in the on-call handover), we started to suspect a correlation and looked into it. Checking the re-indexing cron job logs, @ahegyi and @jacobvosmaer-gitlab found that the index `index_on_routes_lower_path` had just been re-indexed when the incident started, and OnGres confirmed that the query plan for the slow query stopped using the index because of missing statistics, which made those queries very inefficient (see the sketch after this list). OnGres finally fixed that by running `ANALYZE` on the routes table, and we disabled the re-index feature flag to prevent further events.
- How could time to mitigation be improved?
- Better documentation on how to page OnGres.
- Improvements and training on how to interpret slow queries and connect them to controllers and endpoints.
- Alert if only a few queries are dominating the DB request times.
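The "missing statistics" diagnosis can also be cross-checked from the database side. This is a sketch of one possible check, not a record of the exact commands OnGres ran: statistics for an expression index are exposed in `pg_stats` under the index's own name, so an empty result immediately after the swap means the planner has no estimates for the expression.

```sql
-- Statistics for an expression ("functional") index are keyed by the index
-- name in pg_stats, not by the table name. No rows here right after the
-- index swap means the planner has no estimates for lower(path).
SELECT tablename, attname, n_distinct, null_frac
FROM pg_stats
WHERE tablename = 'index_on_routes_lower_path';
```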
Post Incident Analysis
- How was the root cause diagnosed?
- See above.
- How could time to diagnosis be improved?
- See above.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
- Yes, #2849 (closed)
5 Whys
- GitLab.com was down, why?
- DB requests were very slow.
- Why were DB requests so slow?
- Because a functional index was missing statistics, leading to a bad query plan.
- Why was the index missing statistics?
- Because an automatic re-indexing job re-created the index and swapped the new index with the old one by name, and for functional indices this leads to missing statistics unless `ANALYZE` is run immediately afterwards.
- Why didn't we run `ANALYZE` during the automated re-indexing?
- Because for normal indices (the majority) this isn't needed - it is only needed for functional indices, of which we have very few - and this isn't documented in the Postgres documentation, so it was overlooked by all reviewers (a minimal illustration follows this section).
- Why didn't we catch this in tests?
- The re-indexing job had been running in staging without issues for a few days already, but as indices are selected randomly, there is a high chance it had not yet hit a functional index, and going through all indices would have taken many weeks. We all assumed that if it worked for some indices, it would also work for the rest.
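As a minimal, self-contained illustration of the normal vs. functional index distinction discussed above (hypothetical table, not GitLab's schema): a plain column index relies on the table's column statistics, which survive any rebuild, while an expression index needs statistics gathered on the expression itself, which are empty after a rebuild until the next `ANALYZE`.

```sql
-- Hypothetical demonstration table; not part of the GitLab schema.
CREATE TABLE paths_demo (path text);

-- Plain column index: the planner uses the statistics on paths_demo.path,
-- which are unaffected by rebuilding or swapping the index.
CREATE INDEX paths_demo_path_idx ON paths_demo (path);

-- Expression index: the planner needs statistics on lower(path), which are
-- stored against the index itself and start out empty after (re)creation
-- until the table is analyzed again.
CREATE INDEX paths_demo_lower_path_idx ON paths_demo (LOWER(path));
ANALYZE paths_demo;
```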
Lessons Learned
- Technical: Functional ("expression") indexes require explicitly building statistics, which is a relevant learning for both reindexing gitlab-org/gitlab#272997 (closed) and regular database migrations gitlab-org/gitlab#272992.
Corrective Actions
- Update change management to increase the criticality of SQL scripts (via cron or manually) from C4 to C2: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11723
- Make it easier to find documentation on how to page OnGres: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11770
- Improvements and training on how to interpret slow queries and connect them to controllers and endpoints: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11787
- Have the DB re-index cron job add annotations to Grafana when it has finished a re-index on a table: gitlab-org/gitlab#273198 (closed)
- Alert if a small set of queries dominates Postgres: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11794
- Include statements in the Elastic Postgres logs for better debugging: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11795
- Production-like test environment for DB performance regressions: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11797
- For discussion: when enabling a new higher-risk change (DB or new feature), set up a shadow rotation: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11802
- Remove extra and temporary index in staging to better align environments: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11868