Post-mortem for 2018-03-12 incident: Commenting on Issues, MRs was failing
Context
Original issue was: https://gitlab.com/gitlab-org/gitlab-ce/issues/44146#note_63178920
All inserts into the `notes` table were blocking for 15s, then hitting the statement timeout and throwing an error, causing all comments on issues and MRs to fail. This was the case for about 90 minutes, between 2018-03-12 10:32:56 and 12:01:27 UTC.
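As a rough illustration of how this presented at the SQL level (the column list and literal values below are illustrative only, not taken from the incident):

```sql
-- Illustrative reproduction of the failure mode, not a transcript from the incident.
-- With a ~15s statement timeout, any insert that blocks on the contended lock for
-- longer than that is cancelled by Postgres.
SET statement_timeout = '15s';

INSERT INTO notes (noteable_type, note, created_at, updated_at)  -- columns illustrative
VALUES ('Issue', 'example comment', now(), now());
-- ERROR:  canceling statement due to statement timeout
```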
Timeline
On 2018-03-12 (all times UTC):
- 10:28 UTC - Deploy started
- 10:32:51 - `CREATE INDEX CONCURRENTLY "index_taggings_on_tag_id" ON "taggings" ("tag_id")` begins
- 10:33:11 - First insert into `notes` times out (it would have started at 10:32:56); no inserts succeed after this until resolution
- 10:33:41 - `CREATE INDEX CONCURRENTLY "index_taggings_on_tag_id" ON "taggings" ("tag_id")` ends
- 10:48:24 - @smcgivern on Slack: "we're seeing a lot of statement timeouts when inserting notes at the moment", observed on Sentry
- 11:04:00 - "We are currently investigating an increase of database locks that may cause inserting comments to timeout" - https://twitter.com/gitlabstatus/status/973152719191068673
- 11:05:00 - @abrandl observed on a Grafana dashboard that locks and errors (statement timeouts, in this case) increased around 10:30 UTC
- 11:13:00 - @_stark observed log entries on the database server of processes waiting for a lock on `index_notes_on_note_trigram`
- 11:31:00 - @smcgivern: "do we need to call? it's been an hour now"
- 11:34:00 - @smcgivern starts a Zoom session
- 11:34:00 - Discussion focuses on vacuum being the cause
- 11:53:00 - @abrandl tries `ALTER SYSTEM SET statement_timeout TO '30s';`
- 11:54:00 - @smcgivern observes that no notes have been created since 10:32
- 12:02:00 - @abrandl inserts into `notes` manually with the statement timeout disabled (see the sketch below)
- 12:09:00 - @smcgivern observed that the problem had resolved
- 12:14:00 - "We have identified and fixed the increase in DB locks - commenting can continue!" - https://twitter.com/gitlabstatus/status/973170270168059904
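The 12:02 step above amounted to running an insert from a session with the statement timeout disabled, so the insert (and the lock-holding cleanup it was waiting on) could run to completion. A minimal sketch, assuming a psql session on the primary; the column list is illustrative:

```sql
-- Sketch of the manual insert at 12:02, assuming a psql session on the primary.
SET statement_timeout = 0;   -- 0 disables the timeout for this session only

INSERT INTO notes (noteable_type, note, created_at, updated_at)  -- columns illustrative
VALUES ('Issue', 'manual test note', now(), now());

RESET statement_timeout;     -- return to the configured default
```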
Incident Analysis
- The original detection was via Sentry. Our monitoring did pick up the problem as early as 10:33:45 and would have alerted at 10:39:00, but it seems to have been misconfigured and the alert didn't appear anywhere.
- The locking on the trigram index was identified quickly, but we were focused on identifying a problematic query or vacuum holding the lock.
Root Cause Analysis
The locking contention was caused by the trigram index locking to serialize a cleanup of an internal data structure (the GIN index's pending-insertions list; see the release note below). Because the statement timeout cancelled each insert, this cleanup never completed, and the next insert then attempted to take the same lock.
Somehow the `CREATE INDEX CONCURRENTLY` triggered this behaviour, but the mechanism for this is still unclear.
The root cause of the locking contention was a bug in Postgres that is fixed in 9.6.7, which has been out since 2018-02-08; tracking point releases of Postgres would therefore have avoided the problem.
From https://www.postgresql.org/docs/9.6/static/release-9-6-7.html:
> Ensure that vacuum will always clean up the pending-insertions list of a GIN index (Masahiko Sawada)
>
> This is necessary to ensure that dead index entries get removed. The old code got it backwards, allowing vacuum to skip the cleanup if some other process were running cleanup concurrently, thus risking invalid entries being left behind in the index.
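For reference, the pending-insertions list of a GIN index can be inspected and flushed explicitly in Postgres 9.6. This was not done during the incident; it is only a hedged sketch of how the structure involved can be examined, and it assumes the `pgstattuple` extension can be installed:

```sql
-- Inspect the pending list of the trigram GIN index (requires pgstattuple).
CREATE EXTENSION IF NOT EXISTS pgstattuple;
SELECT * FROM pgstatginindex('index_notes_on_note_trigram');
-- Returns: version, pending_pages, pending_tuples

-- Flush the pending list explicitly rather than leaving it to inserts or vacuum
-- (function available since Postgres 9.6).
SELECT gin_clean_pending_list('index_notes_on_note_trigram');
```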
What went well
- Information available in Grafana dashboards and Postgres logs did surface the statement timeouts and locking contention quickly.
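The lock-wait entries in the Postgres logs are emitted when `log_lock_waits` is enabled and a session has waited longer than `deadlock_timeout`. A hedged sketch of the relevant settings (`log_lock_waits` is off by default, and the incident hosts' actual configuration is not recorded here):

```sql
-- Settings that produce "still waiting for ... lock" log entries.
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET deadlock_timeout = '1s';  -- log after a session waits this long
SELECT pg_reload_conf();
```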
What can be improved
- Our alerting failed to actually send an alert anywhere, despite catching the problem quickly.
- There wasn't a clear decision about who was "incident commander", so communication and decision-making were slowed.
- We didn't start a post-mortem promptly.
- We were overconfident in our analysis of how important the bug fixes in Postgres point releases were. We should track point releases more closely in case a bug fix is more relevant to GitLab than we understand.
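As a small starting point for that tracking, the version actually running can be checked directly on the database servers and compared against the latest point release from the release announcements (a hedged sketch; the comparison itself would need to be scripted externally):

```sql
-- Check what the database servers are actually running.
SELECT version();        -- full version string, e.g. "PostgreSQL 9.6.x on ..."
SHOW server_version;     -- just the version number
```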
Corrective actions
See "Related issues" below