Post-mortem for 2018-03-12 incident: Commenting on Issues, MRs was failing
Context
Original issue was: https://gitlab.com/gitlab-org/gitlab-ce/issues/44146#note_63178920
All inserts into the `notes` table were blocking for 15s, then hitting the statement timeout and throwing an error, causing all comments on issues and MRs to fail. This was the case for about 90 minutes, between 2018-03-12 10:32:56 and 12:01:27 UTC.
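As a rough illustration of how this presented at the SQL level (the column list and literal values below are illustrative only, not taken from the incident):

```sql
-- Illustrative reproduction of the failure mode, not a transcript from the incident.
-- With a ~15s statement timeout, any insert that blocks on the contended lock for
-- longer than that is cancelled by Postgres.
SET statement_timeout = '15s';

INSERT INTO notes (noteable_type, note, created_at, updated_at)  -- columns illustrative
VALUES ('Issue', 'example comment', now(), now());
-- ERROR:  canceling statement due to statement timeout
```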
Timeline
On 2018-03-12 (all times UTC):
- 10:28 UTC - Deploy started
- 10:32:51 - `CREATE INDEX CONCURRENTLY "index_taggings_on_tag_id" ON "taggings" ("tag_id")` begins
- 10:33:11 - First insert into `notes` times out (it would have started at 10:32:56); no inserts succeed after this until resolution
- 10:33:41 - `CREATE INDEX CONCURRENTLY "index_taggings_on_tag_id" ON "taggings" ("tag_id")` ends
- 10:48:24 - @smcgivern on Slack: "we're seeing a lot of statement timeouts when inserting notes at the moment", observed on Sentry
- 11:04:00 - "We are currently investigating an increase of database locks that may cause inserting comments to timeout" - https://twitter.com/gitlabstatus/status/973152719191068673
- 11:05:00 - @abrandl observed on a Grafana dashboard that locks and errors (statement timeouts, in this case) increased around 10:30 UTC
- 11:13:00 - @_stark observed log entries on the database server of processes waiting for a lock on `index_notes_on_note_trigram`
- 11:31:00 - @smcgivern: "do we need to call? it's been an hour now"
- 11:34:00 - @smcgivern starts a Zoom session
- 11:34:00 - Discussion focuses on vacuum being the cause
- 11:53:00 - @abrandl tries `ALTER SYSTEM SET statement_timeout TO '30s';`
- 11:54:00 - @smcgivern observes that no notes have been created since 10:32
- 12:02:00 - @abrandl inserts into `notes` manually with the statement timeout disabled (see the sketch below)
- 12:09:00 - @smcgivern observed that the problem had resolved
- 12:14:00 - "We have identified and fixed the increase in DB locks - commenting can continue!" - https://twitter.com/gitlabstatus/status/973170270168059904
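The 12:02 step above amounted to running an insert from a session with the statement timeout disabled, so the insert (and the lock-holding cleanup it was waiting on) could run to completion. A minimal sketch, assuming a psql session on the primary; the column list is illustrative:

```sql
-- Sketch of the manual insert at 12:02, assuming a psql session on the primary.
SET statement_timeout = 0;   -- 0 disables the timeout for this session only

INSERT INTO notes (noteable_type, note, created_at, updated_at)  -- columns illustrative
VALUES ('Issue', 'manual test note', now(), now());

RESET statement_timeout;     -- return to the configured default
```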
Incident Analysis
- The original detection was via Sentry. Our monitoring did pick up the problem as early as 10:33:45 and would have alerted at 10:39:00, but it seems to have been misconfigured and the alert didn't appear anywhere.
- The locking on the trigram index was identified quickly, but we were focused on identifying a problematic query or vacuum holding the lock.
Root Cause Analysis
The locking contention was caused by the trigram index locking to serialize a cleanup of an internal data structure (the GIN index's pending-insertions list; see the release note below). Because the statement timeout cancelled each insert, this cleanup never completed, and the next insert then attempted to take the same lock.
Somehow the `CREATE INDEX CONCURRENTLY` triggered this behaviour, but the mechanism for this is still unclear.
The root cause of the locking contention was a bug in Postgres that is fixed in 9.6.7, which has been out since 2018-02-08; tracking point releases of Postgres would therefore have avoided the problem.
From https://www.postgresql.org/docs/9.6/static/release-9-6-7.html:
> Ensure that vacuum will always clean up the pending-insertions list of a GIN index (Masahiko Sawada)
>
> This is necessary to ensure that dead index entries get removed. The old code got it backwards, allowing vacuum to skip the cleanup if some other process were running cleanup concurrently, thus risking invalid entries being left behind in the index.
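For reference, the pending-insertions list of a GIN index can be inspected and flushed explicitly in Postgres 9.6. This was not done during the incident; it is only a hedged sketch of how the structure involved can be examined, and it assumes the `pgstattuple` extension can be installed:

```sql
-- Inspect the pending list of the trigram GIN index (requires pgstattuple).
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatginindex('index_notes_on_note_trigram');
-- Returns: version, pending_pages, pending_tuples

-- Flush the pending list explicitly rather than leaving it to inserts or vacuum
-- (function available since Postgres 9.6).
SELECT gin_clean_pending_list('index_notes_on_note_trigram');
```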
What went well
- Information available in Grafana dashboards and Postgres logs did surface the statement timeouts and locking contention quickly.
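The lock-wait entries in the Postgres logs are emitted when `log_lock_waits` is enabled and a session has waited longer than `deadlock_timeout`. A hedged sketch of the relevant settings (`log_lock_waits` is off by default, and the incident hosts' actual configuration is not recorded here):

```sql
-- Settings that produce "still waiting for ... lock" log entries.
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET deadlock_timeout = '1s';  -- log after a session waits this long
SELECT pg_reload_conf();
```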
What can be improved
- Our alerting failed to actually send an alert anywhere, despite catching the problem quickly.
- There wasn't a clear decision about who was "incident commander", so communication and decision-making were slowed.
- We didn't start a post-mortem promptly.
- We were overconfident in our analysis of how important the bug fixes in Postgres point releases were. We should track point releases more closely in case a bug fix is more relevant to GitLab than we understand.
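As a small starting point for that tracking, the version actually running can be checked directly on the database servers and compared against the latest point release from the release announcements (a hedged sketch; the comparison itself would need to be scripted externally):

```sql
-- Check what the database servers are actually running.
SELECT version();        -- full version string, e.g. "PostgreSQL 9.6.x on ..."
SHOW server_version;     -- just the version number
```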
Corrective actions
See "Related issues" below