Inserting notes is timing out

See https://sentry.gitlap.com/gitlab/gitlabcom/issues/149882/ (2,000 events in 20 minutes), https://sentry.gitlap.com/gitlab/gitlabcom/issues/149886/ (200 events in 20 minutes), and more at https://sentry.gitlap.com/gitlab/gitlabcom/.

Previously we had this infrequently, but not enough to be really noticeable: https://gitlab.com/gitlab-org/gitlab-ce/issues/40523#note_48617965

10:28 UTC

Migrations from deploy started and finished at 10:30 UTC

11:05 UTC

@abrandl pointed out that locks and errors (statement timeouts, in this case) increased around 10:30 UTC:

image

https://performance.gitlab.net/dashboard/db/postgres-stats?orgId=1

We started deploying 10.6.0 RC3 around that time, but the changes don't look related at first glance: https://gitlab.com/gitlab-org/gitlab-ee/compare/v10.6.0-rc2-ee...v10.6.0-rc3-ee

11:13 UTC

@_stark:

2018-03-12_11:12:56.62981 postgres-01 postgresql: 2018-03-12 11:12:56 GMT [1052]: [523-1] STATEMENT:  INSERT INTO "notes" ("noteable_id", "noteable_type", "project_id", "author_id", "note", "system", "discussion_id", "created_at", "updated_at", "note_html", "cached_markdown_version") VALUES (9677991, 'Issue', 4757047, 480756, 'changed the description', 't', 'e1eb65d04c828ad740125a686d4a2f2bdd83d0ea', '2018-03-12 11:12:41.757584', '2018-03-12 11:12:41.757584', '<p dir="auto">changed the description</p>', 3) RETURNING "id"
2018-03-12_11:12:56.67597 postgres-01 postgresql: 2018-03-12 11:12:56 GMT [15130]: [218-1] LOG:  process 15130 still waiting for ExclusiveLock on page 0 of relation 34377 of database 16385 after 5000.163 ms
2018-03-12_11:12:56.67608 postgres-01 postgresql: 2018-03-12 11:12:56 GMT [15130]: [219-1] DETAIL:  Process holding the lock: 1052. Wait queue: 18288, 117975, 10776, 3455, 18292, 8209, 4311, 127503, 11941, 19714, 18290, 122834, 5911, 21506, 8195, 30737, 28373, 31925, 3585, 22108, 4421, 16895, 31058, 31937, 10858, 18555, 28181, 31059, 28270, 8210, 32031, 18554, 32040, 20517, 125843, 4573, 15773, 31534, 8037, 31921, 16971, 30980, 126838, 14933, 4292, 31521, 4310, 14318, 30738, 31561, 4601, 30634, 15130, 18293, 119073, 18628, 18733, 119957, 119593, 9816, 11008, 19169, 22114, 3548, 26723, 18294, 31061, 123398, 11309, 28668, 31924, 8250, 31927, 11007, 18651, 31057, 18297, 3885, 31926, 117569, 124545, 8735, 31953.

That's waiting for a lock on index_notes_on_note_trigram

@abrandl:

We have currently 94 of those pending insert into notes queries

@_stark:

It's a unicorn transaction blocking them

11:30 UTC

We realise that no notes have been created since 10:33 UTC. The insert count continues to increase:

image

During the call:

  1. Autovacuum frequently ran on notes table. Considered disabling it but didn't touch it. During the periods with no vacuum running, inserts were still timing out.
  2. We try bumping the statement timeout to 30s. No effect.
  3. @_stark experiments with rebuilding the trigram index on staging, to see how long it would take to recreate and drop in production.

Then:

@abrandl disables the statement timeout for his session, to try inserting https://sentry.gitlap.com/gitlab/gitlabcom/issues/149882/events/5253294/ manually. It takes 32s.

After rolling this back, inserts appear to have gone back to their usual speed.

Relation to production deploy

We saw the increase in locks about 30s before one of the create index statements from the deploy finished:

2018-03-12_10:33:11.58951 postgres-01 postgresql: 2018-03-12 10:33:11 GMT [93234]: [1-1] LOG:  process 93234 still waiting for ExclusiveLock on page 0 of relation 34377 of database 16385 after 5000.109 ms
...
2018-03-12_10:33:11.98286 postgres-01 postgresql: 2018-03-12 10:33:11 GMT [119770]: [2-1] ERROR:  canceling statement due to statement timeout
...
2018-03-12_10:33:31.50056 postgres-01 postgresql: 2018-03-12 10:33:31 GMT [3224]: [3-1] STATEMENT:  CREATE  INDEX CONCURRENTLY "index_taggings_on_tag_id" ON "taggings"  ("tag_id" )
...
2018-03-12_10:33:41.34177 postgres-01 postgresql: 2018-03-12 10:33:41 GMT [3224]: [6-1] LOG:  duration: 50056.631 ms  statement: CREATE  INDEX CONCURRENTLY "index_taggings_on_tag_id" ON "taggings"  ("tag_id" )

This means that the CREATE INDEX CONCURRENTLY began at 10:32:51 and the statement which first timed out (which never logged about waiting on a lock for >5s so it seems it was active for over 15s) started at about 10:32:56.

All other statements subsequently were stuck waiting on the lock for page 0 of the GIN trigram index which indicates they wanted to flush the pending inserts list. It seems clear that the first insert that timed out was trying to flush the pending insert list and that's why it took 15s. However this is supposed to be a non-blocking lock and an operation that can run concurrently with inserts so this appears to not be operating as intended.

Summary

Between 10:32:56 and 12:01:27, no notes (comments) were inserted on GitLab.com. We believe the problem was resolved while we were investigating the root cause, because once a single insert succeeded, it flushed the list of those waiting for the lock.

We don't know the root cause yet.

Edited Mar 12, 2018 by Gregory Stark
Assignee Loading
Time tracking Loading