Create leases in pending, clear completed_timestamp in mark_worker_started

Jeremiah Bonney requested to merge jbonney4/job-assigner-deadlock into master

Before raising this MR, consider whether the following are required, and complete if so:

  • Unit tests
  • Metrics
  • Documentation update(s)

Description

This PR addresses an issue in the JobAssigner where jobs without a lease but with worker_start/worker_completed timestamps could cause a deadlock. assign_n_leases selects a batch of jobs with SELECT FOR UPDATE in a session and calls job.create_lease inside that session. That in turn calls job.update_lease_state, which clears the worker_start/worker_completed timestamps and attempts to persist the change to the database using a different session. The result is a deadlock: the inner transaction waits for a lock held by the outer transaction, which cannot finish until the inner call returns. This doesn't happen all the time only because these fields are typically empty when a job doesn't have a lease, but a single job left in an inconsistent state by connectivity issues is enough to hit this code path.
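For reviewers who want to see the lock conflict in isolation, here is a minimal sketch. It is not the JobAssigner code: it uses the standard-library sqlite3 module, with BEGIN IMMEDIATE standing in for SELECT FOR UPDATE, and the table layout, job name and the function demo_lock_conflict are all made up for illustration. The outer connection holds the write lock, a second connection tries to clear the timestamp and times out, and the same write issued on the outer connection goes through.

```python
import os
import sqlite3
import tempfile


def demo_lock_conflict():
    """Return (error seen by the inner connection, final worker_start value)."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "jobs.db")
        # isolation_level=None: we issue BEGIN/COMMIT ourselves.
        # timeout=0.2: give up quickly instead of waiting on the lock.
        outer = sqlite3.connect(path, isolation_level=None, timeout=0.2)
        inner = sqlite3.connect(path, isolation_level=None, timeout=0.2)
        try:
            # Hypothetical schema: a job with a stale worker_start timestamp.
            outer.execute(
                "CREATE TABLE jobs (name TEXT PRIMARY KEY, worker_start TEXT)"
            )
            outer.execute("INSERT INTO jobs VALUES ('job-1', '2024-01-01')")

            # Outer "session": take the write lock and select the job
            # (assumption: stand-in for SELECT ... FOR UPDATE).
            outer.execute("BEGIN IMMEDIATE")
            outer.execute("SELECT name FROM jobs").fetchall()

            # Inner "session": a different connection tries to clear the
            # timestamp while the outer transaction is still open.
            inner_error = None
            try:
                inner.execute(
                    "UPDATE jobs SET worker_start = NULL WHERE name = 'job-1'"
                )
            except sqlite3.OperationalError as exc:
                inner_error = str(exc)

            # The same write on the connection that holds the lock succeeds.
            outer.execute(
                "UPDATE jobs SET worker_start = NULL WHERE name = 'job-1'"
            )
            outer.execute("COMMIT")

            final = inner.execute(
                "SELECT worker_start FROM jobs WHERE name = 'job-1'"
            ).fetchone()[0]
            return inner_error, final
        finally:
            inner.close()
            outer.close()


if __name__ == "__main__":
    error, worker_start = demo_lock_conflict()
    print("inner connection error:", error)
    print("worker_start after outer write:", worker_start)
```

In this sketch the inner write fails after a short timeout because the example sets one. In the scenario described above, the outer transaction is itself blocked on the inner call, so neither side can make progress until a lock timeout fires.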

We have a longer-standing issue to rework how we schedule and handle jobs so that transactions are used properly, and removing the in-memory scheduler was the first step towards that. This PR addresses only one very specific issue, which is debilitating when it happens.

I've also added a test to verify this behavior is fixed with these changes.

Validation

The new test test_job_lease_creation_with_preexisting_worker_timestamps covers this behavior. Running it against the old code fails due to lock timeouts; with the new version it passes.