database deadlocks updating ci_builds and deployments tables
Summary
Reported in a support ticket, a customer's pipeline (internal links) misbehaves because when jobs fail, the updates to the database can cause a deadlock.
They appear to see two different deadlocks; one involving the ci_builds and deployments tables, the other just involving multiple updates to ci_builds, which I've raised as a separate issue.
Assigned this to ~"group::release" because it involves the table deployments where as the other one could affect any pipeline.
Steps to reproduce
Example Project
Customer's project involved long pipelines, and lots of jobs tagged with environment.
When an earlier job fails, a lot of jobs would then get cancelled, and this has the effect of cancelling as lot of deployments as well.
I created a simplified version of this to try to reproduce it, unsuccessfully - the pipeline was set to run frequently on a schedule, but no deadlocks occurred.
What is the current bug behavior?
@pbair in groupdatabase looked into this:
When a build is failed:
[51] pry(main)> b.drop!
(0.3ms) BEGIN
Deployment Update (0.5ms) UPDATE "deployments" SET "updated_at" = '2020-12-15 18:41:15.967273', "status" = 3,
"finished_at" = '2020-12-15 18:41:15.964140' WHERE "deployments"."id" = 34
Ci::Build Update (0.5ms) UPDATE "ci_builds" SET "status" = 'failed', "finished_at" = '2020-12-15 18:41:15.960797',
"updated_at" = '2020-12-15 18:41:15.970301', "processed" = FALSE, "lock_version" = 4 WHERE "ci_builds"."id" = 14
AND "ci_builds"."lock_version" = 3
(0.7ms) COMMIT
while when the build is skipped/canceled, it instead does:
[63] pry(main)> b.skip
(0.2ms) BEGIN
Ci::Build Update (0.7ms) UPDATE "ci_builds" SET "status" = 'skipped', "updated_at" = '2020-12-15 19:12:03.676005',
"processed" = FALSE, "lock_version" = 8 WHERE "ci_builds"."id" = 14 AND "ci_builds"."lock_version" = 7
Deployment Update (0.4ms) UPDATE "deployments" SET "updated_at" = '2020-12-15 19:12:03.680070', "status" = 4,
"finished_at" = '2020-12-15 19:12:03.679611' WHERE "deployments"."id" = 34
(0.7ms) COMMIT
What is the expected correct behavior?
Transactions should be sequenced the same way to avoid deadlocks.
When running sql statement logging, I couldn't see why the same records would be updated, on either table. So, I'm missing an element of what's needed to reproduce this.
Ignoring that, and just looking at the deadlock errors, it seems like a job failing is triggering updates along the following lines:
job A fails -> lock CI record -> lock deployment record -> commit
|--deadlock--|
commit <- lock CI record <- lock deployment record <- job B gets skipped <- Job A fails
Relevant logs and/or screenshots
DETAIL: Process 3169 waits for ShareLock on transaction 155867330; blocked by process 18902.
Process 18902 waits for ShareLock on transaction 155867333; blocked by process 3169.
Process 3169: UPDATE "deployments" SET "status" = 4, "finished_at" = '2020-11-24 11:02:30.254323',
"updated_at" = '2020-11-24 11:02:30.257483' WHERE "deployments"."id" = 583723
Process 18902: UPDATE "ci_builds" SET "status" = 'failed', "finished_at" = '2020-11-24 11:02:30.104164',
"failure_reason" = 13, "processed" = FALSE, "updated_at" = '2020-11-24 11:02:30.335325',
"lock_version" = 1 WHERE "ci_builds"."id" = 7539594 AND "ci_builds"."lock_version" = 0
DETAIL: Process 7375 waits for ShareLock on transaction 155878589; blocked by process 30906.
Process 30906 waits for ShareLock on transaction 155878588; blocked by process 7375.
Process 7375: UPDATE "ci_builds" SET "status" = 'failed', "finished_at" = '2020-11-24 11:12:05.142284',
"failure_reason" = 13, "processed" = FALSE, "updated_at" = '2020-11-24 11:12:05.168003',
"lock_version" = 1 WHERE "ci_builds"."id" = 7539613 AND "ci_builds"."lock_version" = 0
Process 30906: UPDATE "deployments" SET "status" = 4, "finished_at" = '2020-11-24 11:12:05.183737',
"updated_at" = '2020-11-24 11:12:05.189358' WHERE "deployments"."id" = 583739
DETAIL: Process 4972 waits for ShareLock on transaction 167932300; blocked by process 12852.
Process 12852 waits for ShareLock on transaction 167932299; blocked by process 4972.
Process 4972: UPDATE "ci_builds" SET "status" = 'failed', "finished_at" = '2020-12-03 19:03:05.693157',
"failure_reason" = 13, "processed" = FALSE, "updated_at" = '2020-12-03 19:03:05.707250',
"lock_version" = 1 WHERE "ci_builds"."id" = 8823085 AND "ci_builds"."lock_version" = 0
Process 12852: UPDATE "deployments" SET "status" = 4, "finished_at" = '2020-12-03 19:03:05.794159',
"updated_at" = '2020-12-03 19:03:05.803635' WHERE "deployments"."id" = 652133
Output of checks
This bug happens on GitLab.com
Results of GitLab environment info
- 13.5.3
- AWS RDS PostgreSQL 11.8