Running a number of PipelineUpdateWorkers can cause many blocking queries
We've seen on GitLab.com that on occasion we get a lot of `SELECT FOR UPDATE` queries on the `ci_commits` table that appear to accumulate and block on one another, leading to 502 timeouts and graphs like these:
Today @ayufan ran a test:

- On dev: ran 1000 queries of `PipelineUpdateWorker.perform_async(<some pipeline ID>)` (reproduction sketched below). No problems.
- On GitLab.com: repeated the same experiment. We saw lots of blocking queries.
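For reference, the reproduction amounts to enqueueing the worker many times for the same pipeline from a Rails console (the pipeline ID below is a placeholder):

```ruby
# Enqueue 1000 status updates for a single pipeline; on GitLab.com this was
# enough to pile up blocking SELECT FOR UPDATE queries on ci_commits.
pipeline_id = 12345 # placeholder: any pipeline ID with recent activity
1000.times { PipelineUpdateWorker.perform_async(pipeline_id) }
```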
Often the 5-minute statement timeout hits, freeing these blocked queries. This is reflected in the Sidekiq graphs for `PipelineUpdateWorker`:
The kicker is that `PipelineUpdateWorker` just does a simple state transition update in the DB (e.g. `pending` -> `running`):
```ruby
def update_status
  with_lock do
    case latest_builds_status
    when 'pending' then enqueue
    when 'running' then run
    when 'success' then succeed
    when 'failed' then drop
    when 'canceled' then cancel
    when 'skipped' then skip
    end
  end
end
```
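For anyone unfamiliar with `with_lock`: roughly speaking it behaves like the sketch below (simplified from ActiveRecord's pessimistic locking, not the exact Rails source). This is where the `SELECT ... FOR UPDATE` on `ci_commits` comes from: every worker updating the same pipeline opens a transaction and then serializes on that row lock until commit.

```ruby
# Simplified sketch of ActiveRecord's with_lock (pessimistic locking):
# it opens a transaction and re-reads the row with FOR UPDATE, so any
# other transaction touching the same ci_commits row blocks until commit.
def with_lock(lock = true)
  transaction do
    lock!(lock) # SELECT ... FROM ci_commits WHERE id = ? FOR UPDATE
    yield
  end
end
```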
Looking at Sidekiq TTIN traces, it appears that threads are blocked on the `with_lock` call. If we look at the `pg_locks` view, it appears there is some tuple lock that's not being released:
```
gitlabhq_production=# select * from pg_locks where pid = 34805;
   locktype    | database | relation |  page  | tuple | virtualxid | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |        mode         | granted | fastpath
---------------+----------+----------+--------+-------+------------+---------------+---------+-------+----------+--------------------+-------+---------------------+---------+----------
 relation      |    16385 |    34092 |        |       |            |               |         |       |          | 19/784908          | 34805 | AccessShareLock     | t       | t
 relation      |    16385 |    33272 |        |       |            |               |         |       |          | 19/784908          | 34805 | RowShareLock        | t       | t
 virtualxid    |          |          |        |       | 19/784908  |               |         |       |          | 19/784908          | 34805 | ExclusiveLock       | t       | t
 transactionid |          |          |        |       |            |       4822878 |         |       |          | 19/784908          | 34805 | ExclusiveLock       | t       | f
 tuple         |    16385 |    33272 | 105613 |     7 |            |               |         |       |          | 19/784908          | 34805 | AccessExclusiveLock | f       | f
(5 rows)
```
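To cross-check which backends are stuck waiting and what they are running, something like the following can be run from a Rails console (a sketch; it assumes PostgreSQL earlier than 9.6, where `pg_stat_activity` still exposes the boolean `waiting` column):

```ruby
# List backends currently waiting on a lock, oldest transaction first,
# along with the query they are blocked on.
ActiveRecord::Base.connection.select_all(<<-SQL).to_a
  SELECT pid, now() - xact_start AS xact_age, query
  FROM pg_stat_activity
  WHERE waiting
  ORDER BY xact_age DESC
SQL
```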
We're not exactly sure how this could happen. Is there some deadlock occurring? Is there a race condition when multiple hosts attempt to grab the same lock? Is some Sidekiq thread not properly releasing the lock?
Even though we don't fully understand the problem, here are the merge requests that should help:
- Implement optimistic locking (rough sketch below): https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/7040
- Run only one pipeline and project when scheduled multiple times: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/7005
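For context on the first MR: optimistic locking in ActiveRecord relies on an integer `lock_version` column and raises `ActiveRecord::StaleObjectError` on conflicting writes, instead of making every worker queue up behind a `SELECT FOR UPDATE`. The sketch below only illustrates the general shape of that approach; it is hypothetical and not necessarily what MR 7040 actually does.

```ruby
# Hypothetical optimistic-locking variant of update_status, assuming
# ci_commits gains an integer lock_version column. Conflicting updates
# raise StaleObjectError; we reload and retry instead of blocking in the DB.
def update_status
  retries ||= 0
  case latest_builds_status
  when 'pending'  then enqueue
  when 'running'  then run
  when 'success'  then succeed
  when 'failed'   then drop
  when 'canceled' then cancel
  when 'skipped'  then skip
  end
rescue ActiveRecord::StaleObjectError
  raise if (retries += 1) > 3
  reload
  retry
end
```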
We should also consider lowering the default statement timeout of 5 minutes.
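For a quick experiment, the timeout can be lowered for a single database session from a Rails console (a sketch; the 30-second value is arbitrary, and a permanent change would belong in the PostgreSQL configuration instead):

```ruby
# Lower the statement timeout for the current database session only.
ActiveRecord::Base.connection.execute("SET statement_timeout = '30s'")
```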