Guard against race conditions when checking worker expiry

Jeremiah Bonney requested to merge jbonney/jobAssigner_race_conditions into master

Description

When assigning work to a waiting worker, the job assigner checks that the worker's TTL hasn't expired before it tries to assign a lease. However, the current check is vulnerable to a time-of-check to time-of-use (TOCTOU) race condition. If lease assignment starts just before a worker expires, it's possible to get into a state where the JobAssigner thinks the lease was assigned to the worker, but the worker itself gets no lease from wait_for_work(). The result is jobs that wait forever to be executed: because they are marked as assigned, they are never queued onto any other worker.

This PR aims to prevent this by guarding updates to self._lease with a lock, so that wait_for_work can't run in the middle of maybe_assign_lease. This should ensure that either the lease is assigned to the worker and wait_for_work returns it, or the lease is not assigned because the deadline has expired.
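
For illustration only, the locking idea can be sketched roughly as follows. This is not the code in this MR: the WaitingWorker class, its constructor, and the _closed, _assigned, and _deadline attributes are hypothetical names chosen for the sketch. Only maybe_assign_lease, wait_for_work, and self._lease come from the description above. The sketch assumes a single lock shared by both methods, so the expiry check and the write to self._lease happen atomically with respect to the worker reading the lease.

```python
import threading
import time
from typing import Optional


class WaitingWorker:
    """Hypothetical stand-in for a worker waiting on a lease."""

    def __init__(self, ttl_seconds: float) -> None:
        # Assumed: expiry is tracked as a monotonic-clock deadline.
        self._deadline = time.monotonic() + ttl_seconds
        self._lease: Optional[str] = None
        self._closed = False  # set once wait_for_work has read the lease
        self._assigned = threading.Event()
        self._lock = threading.Lock()

    def maybe_assign_lease(self, lease: str) -> bool:
        """Assign the lease only if the worker can still receive it."""
        with self._lock:
            expired = time.monotonic() >= self._deadline
            if self._closed or expired or self._lease is not None:
                return False
            # Check and write happen under one lock, so wait_for_work
            # cannot read self._lease between them.
            self._lease = lease
            self._assigned.set()
            return True

    def wait_for_work(self) -> Optional[str]:
        """Block until a lease arrives or the deadline passes."""
        remaining = self._deadline - time.monotonic()
        if remaining > 0:
            self._assigned.wait(timeout=remaining)
        with self._lock:
            # After this point any later assignment attempt is rejected.
            self._closed = True
            return self._lease
```

Under these assumptions, maybe_assign_lease returns True only when wait_for_work will also return that lease, which is the invariant the description is after. The wait on the event happens outside the lock so that the assigner is never blocked for the length of the worker's TTL.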

Validation

I've run all the unit tests, which pass, but I haven't yet been able to reliably reproduce this race in a test setting.