database deadlocks updating ci_builds and deployments tables

Summary

Reported in a support ticket, a customer's pipeline (internal links) misbehaves because when jobs fail, the updates to the database can cause a deadlock.

They appear to see two different deadlocks; one involving the ci_builds and deployments tables, the other just involving multiple updates to ci_builds, which I've raised as a separate issue.

Assigned this to ~"group::release" because it involves the table deployments where as the other one could affect any pipeline.

Steps to reproduce

Example Project

Customer's project involved long pipelines, and lots of jobs tagged with environment.

When an earlier job fails, a lot of jobs would then get cancelled, and this has the effect of cancelling as lot of deployments as well.

I created a simplified version of this to try to reproduce it, unsuccessfully - the pipeline was set to run frequently on a schedule, but no deadlocks occurred.

gitlab-ci.yml

What is the current bug behavior?

@pbair in groupdatabase looked into this:

When a build is failed:

[51] pry(main)> b.drop!
   (0.3ms)  BEGIN
  Deployment Update (0.5ms)  UPDATE "deployments" SET "updated_at" = '2020-12-15 18:41:15.967273', "status" = 3, 
         "finished_at" = '2020-12-15 18:41:15.964140' WHERE "deployments"."id" = 34
  Ci::Build Update (0.5ms)  UPDATE "ci_builds" SET "status" = 'failed', "finished_at" = '2020-12-15 18:41:15.960797', 
         "updated_at" = '2020-12-15 18:41:15.970301', "processed" = FALSE, "lock_version" = 4 WHERE "ci_builds"."id" = 14 
         AND "ci_builds"."lock_version" = 3
   (0.7ms)  COMMIT

while when the build is skipped/canceled, it instead does:

[63] pry(main)> b.skip
   (0.2ms)  BEGIN
  Ci::Build Update (0.7ms)  UPDATE "ci_builds" SET "status" = 'skipped', "updated_at" = '2020-12-15 19:12:03.676005', 
          "processed" = FALSE, "lock_version" = 8 WHERE "ci_builds"."id" = 14 AND "ci_builds"."lock_version" = 7
  Deployment Update (0.4ms)  UPDATE "deployments" SET "updated_at" = '2020-12-15 19:12:03.680070', "status" = 4, 
          "finished_at" = '2020-12-15 19:12:03.679611' WHERE "deployments"."id" = 34
   (0.7ms)  COMMIT

What is the expected correct behavior?

Transactions should be sequenced the same way to avoid deadlocks.

When running sql statement logging, I couldn't see why the same records would be updated, on either table. So, I'm missing an element of what's needed to reproduce this.

Ignoring that, and just looking at the deadlock errors, it seems like a job failing is triggering updates along the following lines:

job A fails -> lock CI record -> lock deployment record -> commit
                       |--deadlock--|
     commit <- lock CI record <- lock deployment record <- job B gets skipped <- Job A fails

Relevant logs and/or screenshots

DETAIL: Process 3169 waits for ShareLock on transaction 155867330; blocked by process 18902.
Process 18902 waits for ShareLock on transaction 155867333; blocked by process 3169.
Process 3169: UPDATE "deployments" SET "status" = 4, "finished_at" = '2020-11-24 11:02:30.254323', 
    "updated_at" = '2020-11-24 11:02:30.257483' WHERE "deployments"."id" = 583723
Process 18902: UPDATE "ci_builds" SET "status" = 'failed', "finished_at" = '2020-11-24 11:02:30.104164', 
    "failure_reason" = 13, "processed" = FALSE, "updated_at" = '2020-11-24 11:02:30.335325', 
    "lock_version" = 1 WHERE "ci_builds"."id" = 7539594 AND "ci_builds"."lock_version" = 0
DETAIL: Process 7375 waits for ShareLock on transaction 155878589; blocked by process 30906.
Process 30906 waits for ShareLock on transaction 155878588; blocked by process 7375.
Process 7375: UPDATE "ci_builds" SET "status" = 'failed', "finished_at" = '2020-11-24 11:12:05.142284', 
    "failure_reason" = 13, "processed" = FALSE, "updated_at" = '2020-11-24 11:12:05.168003', 
    "lock_version" = 1 WHERE "ci_builds"."id" = 7539613 AND "ci_builds"."lock_version" = 0
Process 30906: UPDATE "deployments" SET "status" = 4, "finished_at" = '2020-11-24 11:12:05.183737', 
    "updated_at" = '2020-11-24 11:12:05.189358' WHERE "deployments"."id" = 583739
DETAIL: Process 4972 waits for ShareLock on transaction 167932300; blocked by process 12852.
Process 12852 waits for ShareLock on transaction 167932299; blocked by process 4972.
Process 4972: UPDATE "ci_builds" SET "status" = 'failed', "finished_at" = '2020-12-03 19:03:05.693157', 
    "failure_reason" = 13, "processed" = FALSE, "updated_at" = '2020-12-03 19:03:05.707250', 
    "lock_version" = 1 WHERE "ci_builds"."id" = 8823085 AND "ci_builds"."lock_version" = 0
Process 12852: UPDATE "deployments" SET "status" = 4, "finished_at" = '2020-12-03 19:03:05.794159', 
    "updated_at" = '2020-12-03 19:03:05.803635' WHERE "deployments"."id" = 652133

Output of checks

This bug happens on GitLab.com

Results of GitLab environment info

  • 13.5.3
  • AWS RDS PostgreSQL 11.8

Results of GitLab application Check

Possible fixes

Edited by Ben Prescott (ex-GitLab)