
Add a timeout on acquiring the operations lock

Marios Hadjimichael requested to merge marios/timeout-on-operation-lock into master

We have discovered a rare edge case with the SQL scheduler, where a series of SQL-related failures* can leave specific jobs in a bad state (with new requests always being deduplicated to the problematic job):

  • Job is in "queued" stage (2) and "assigned == false"
  • worker_completed_timestamp and result also populated for said job

Although updating the stage/assigned fields failed, the job did in fact complete (and may even have a successful result); the scheduler, however, is confused and keeps reassigning the job thinking it is still queued, with the rest of buildgrid unable to detect or fix this inconsistent state*.
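For illustration, the kind of query that would surface jobs stuck in this state looks roughly like the sketch below. The connection URL and the table/column names (jobs, stage, assigned, worker_completed_timestamp, result) are assumptions derived from the bullets above, not BuildGrid's actual schema.

```python
from sqlalchemy import create_engine, text

# Assumed connection URL, for illustration only.
engine = create_engine("postgresql:///buildgrid")

# Hypothetical table/column names based on the description above.
STUCK_JOBS_QUERY = text(
    """
    SELECT name
    FROM jobs
    WHERE stage = 2                            -- "queued"
      AND assigned = FALSE
      AND worker_completed_timestamp IS NOT NULL
      AND result IS NOT NULL
    """
)

with engine.connect() as conn:
    for row in conn.execute(STUCK_JOBS_QUERY):
        print(f"inconsistent job: {row.name}")
```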

This inconsistency can result in deadlocked SQL queries being executed while holding the operations_lock in buildgrid (e.g. when creating a new "operation" for the deduplicated job), which blocks all threads/requests trying to modify any aspect of operations.

Furthermore, this can cause the server to stop handling any gRPC requests at all: if a client streaming operation updates (e.g. recc) closes its connection (e.g. due to an inactivity timeout, which is likely since SQL is deadlocked), the peer_disconnect callback gets stuck.

More specifically, when a client streaming operation updates disconnects, unregister_job_operation_peer() gets called, which itself waits on the operations lock (forever). This happens on the gRPC 'router' thread, as documented in this diagram, so subsequent gRPC requests never get assigned to a gRPC executor thread to handle them.
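Schematically, the blocking pattern is roughly the following. This is a simplified, self-contained simulation: time.sleep() stands in for the deadlocked SQL query, and the function bodies are assumptions based on the description above, not BuildGrid's actual code.

```python
import threading
import time

operations_lock = threading.Lock()

def create_operation_for_deduplicated_job():
    # Request thread: holds the operations lock across a SQL call that
    # never returns (simulated here with a long sleep).
    with operations_lock:
        time.sleep(3600)

def unregister_job_operation_peer():
    # gRPC 'router' thread (peer_disconnect callback): blocks forever here,
    # so no further gRPC requests get dispatched to executor threads.
    with operations_lock:
        pass

# This script hangs by design, illustrating the deadlock.
threading.Thread(target=create_operation_for_deduplicated_job, daemon=True).start()
time.sleep(0.1)  # let the request thread grab the lock first
unregister_job_operation_peer()  # hangs indefinitely
```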

* More details on the series of events resulting in this database inconsistency to follow.

This MR proactively ensures that nothing waits forever to acquire the operations lock: acquisition eventually fails if the lock cannot be obtained.
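In spirit, the change follows roughly the pattern below. This is a minimal sketch, not the actual diff; the timeout value, helper name, and exception type are assumptions.

```python
import threading
from contextlib import contextmanager

operations_lock = threading.Lock()

# Assumed timeout for illustration; the MR's actual default may differ.
OPERATIONS_LOCK_TIMEOUT_SECONDS = 30

@contextmanager
def acquire_operations_lock(timeout=OPERATIONS_LOCK_TIMEOUT_SECONDS):
    # Fail fast instead of waiting forever: acquire() returns False on timeout.
    if not operations_lock.acquire(timeout=timeout):
        raise TimeoutError(
            f"Could not acquire operations lock within {timeout}s; giving up"
        )
    try:
        yield
    finally:
        operations_lock.release()

# Callers then replace `with operations_lock:` with:
# with acquire_operations_lock():
#     ...  # mutate operations
```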

Before raising this MR, consider whether the following are required, and complete if so:

  • Unit tests
