Inserting notes is timing out
See https://sentry.gitlap.com/gitlab/gitlabcom/issues/149882/ (2,000 events in 20 minutes), https://sentry.gitlap.com/gitlab/gitlabcom/issues/149886/ (200 events in 20 minutes), and more at https://sentry.gitlap.com/gitlab/gitlabcom/.
Previously we had this infrequently, but not enough to be really noticeable: https://gitlab.com/gitlab-org/gitlab-ce/issues/40523#note_48617965
10:28 UTC
Migrations from deploy started and finished at 10:30 UTC
11:05 UTC
@abrandl pointed out that locks and errors (statement timeouts, in this case) increased around 10:30 UTC:
https://performance.gitlab.net/dashboard/db/postgres-stats?orgId=1
We started deploying 10.6.0 RC3 around that time, but the changes don't look related at first glance: https://gitlab.com/gitlab-org/gitlab-ee/compare/v10.6.0-rc2-ee...v10.6.0-rc3-ee
11:13 UTC
@_stark:
2018-03-12_11:12:56.62981 postgres-01 postgresql: 2018-03-12 11:12:56 GMT [1052]: [523-1] STATEMENT: INSERT INTO "notes" ("noteable_id", "noteable_type", "project_id", "author_id", "note", "system", "discussion_id", "created_at", "updated_at", "note_html", "cached_markdown_version") VALUES (9677991, 'Issue', 4757047, 480756, 'changed the description', 't', 'e1eb65d04c828ad740125a686d4a2f2bdd83d0ea', '2018-03-12 11:12:41.757584', '2018-03-12 11:12:41.757584', '<p dir="auto">changed the description</p>', 3) RETURNING "id" 2018-03-12_11:12:56.67597 postgres-01 postgresql: 2018-03-12 11:12:56 GMT [15130]: [218-1] LOG: process 15130 still waiting for ExclusiveLock on page 0 of relation 34377 of database 16385 after 5000.163 ms 2018-03-12_11:12:56.67608 postgres-01 postgresql: 2018-03-12 11:12:56 GMT [15130]: [219-1] DETAIL: Process holding the lock: 1052. Wait queue: 18288, 117975, 10776, 3455, 18292, 8209, 4311, 127503, 11941, 19714, 18290, 122834, 5911, 21506, 8195, 30737, 28373, 31925, 3585, 22108, 4421, 16895, 31058, 31937, 10858, 18555, 28181, 31059, 28270, 8210, 32031, 18554, 32040, 20517, 125843, 4573, 15773, 31534, 8037, 31921, 16971, 30980, 126838, 14933, 4292, 31521, 4310, 14318, 30738, 31561, 4601, 30634, 15130, 18293, 119073, 18628, 18733, 119957, 119593, 9816, 11008, 19169, 22114, 3548, 26723, 18294, 31061, 123398, 11309, 28668, 31924, 8250, 31927, 11007, 18651, 31057, 18297, 3885, 31926, 117569, 124545, 8735, 31953.That's waiting for a lock on
index_notes_on_note_trigram
We have currently 94 of those pending insert into
notesqueries
@_stark:
It's a unicorn transaction blocking them
11:30 UTC
We realise that no notes have been created since 10:33 UTC. The insert count continues to increase:
During the call:
- Autovacuum frequently ran on
notestable. Considered disabling it but didn't touch it. During the periods with no vacuum running, inserts were still timing out. - We try bumping the statement timeout to 30s. No effect.
- @_stark experiments with rebuilding the trigram index on staging, to see how long it would take to recreate and drop in production.
Then:
@abrandl disables the statement timeout for his session, to try inserting https://sentry.gitlap.com/gitlab/gitlabcom/issues/149882/events/5253294/ manually. It takes 32s.
After rolling this back, inserts appear to have gone back to their usual speed.
Relation to production deploy
We saw the increase in locks about 30s before one of the create index statements from the deploy finished:
2018-03-12_10:33:11.58951 postgres-01 postgresql: 2018-03-12 10:33:11 GMT [93234]: [1-1] LOG: process 93234 still waiting for ExclusiveLock on page 0 of relation 34377 of database 16385 after 5000.109 ms
...
2018-03-12_10:33:11.98286 postgres-01 postgresql: 2018-03-12 10:33:11 GMT [119770]: [2-1] ERROR: canceling statement due to statement timeout
...
2018-03-12_10:33:31.50056 postgres-01 postgresql: 2018-03-12 10:33:31 GMT [3224]: [3-1] STATEMENT: CREATE INDEX CONCURRENTLY "index_taggings_on_tag_id" ON "taggings" ("tag_id" )
...
2018-03-12_10:33:41.34177 postgres-01 postgresql: 2018-03-12 10:33:41 GMT [3224]: [6-1] LOG: duration: 50056.631 ms statement: CREATE INDEX CONCURRENTLY "index_taggings_on_tag_id" ON "taggings" ("tag_id" )
This means that the CREATE INDEX CONCURRENTLY began at 10:32:51 and the statement which first timed out (which never logged about waiting on a lock for >5s so it seems it was active for over 15s) started at about 10:32:56.
All other statements subsequently were stuck waiting on the lock for page 0 of the GIN trigram index which indicates they wanted to flush the pending inserts list. It seems clear that the first insert that timed out was trying to flush the pending insert list and that's why it took 15s. However this is supposed to be a non-blocking lock and an operation that can run concurrently with inserts so this appears to not be operating as intended.
Summary
Between 10:32:56 and 12:01:27, no notes (comments) were inserted on GitLab.com. We believe the problem was resolved while we were investigating the root cause, because once a single insert succeeded, it flushed the list of those waiting for the lock.
We don't know the root cause yet.

