Praefect's health check updates can deadlock during an upgrade

Praefect's health check update query inserts records in random order, which can cause a deadlock. HealthManager stores the health checking clients in a map. The map is then iterated over to run the health checks for each Gitaly node. As map iteration order is random in Go, the order of the health check result slice is also random. This random order then determines the order of the inserts to the database. If two Praefects upsert the same records concurrently in different orders, a deadlock can occur. This can happen during an upgrade, as both the old and the new process have the same identity and will try to work on the same records.

We should ensure the inserts are always done in the same order, which guarantees that one of the queries manages to acquire all of the locks. Postgres does notice when a deadlock occurs and resolves it by aborting one of the queries.
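
A minimal sketch of the ordering step, assuming the results can be sorted by storage name before they are handed to the query. The `healthResult` type and `sortedResults` function are hypothetical names for illustration, not Praefect's actual code, and the query itself would still need to process the rows in the order it is given:

```go
package main

import (
	"fmt"
	"sort"
)

// healthResult is a hypothetical stand-in for one health check outcome.
type healthResult struct {
	Storage string // name of the Gitaly node that was checked
	Healthy bool
}

// sortedResults collects the results from a map keyed by storage name and
// returns them ordered by that name. Go randomizes map iteration order, so
// the explicit sort is what makes the resulting slice deterministic.
func sortedResults(checks map[string]bool) []healthResult {
	results := make([]healthResult, 0, len(checks))
	for storage, healthy := range checks {
		results = append(results, healthResult{Storage: storage, Healthy: healthy})
	}
	sort.Slice(results, func(i, j int) bool {
		return results[i].Storage < results[j].Storage
	})
	return results
}

func main() {
	// Assumed example data: three Gitaly nodes with their check outcomes.
	checks := map[string]bool{"gitaly-3": true, "gitaly-1": false, "gitaly-2": true}
	for _, r := range sortedResults(checks) {
		fmt.Println(r.Storage, r.Healthy)
	}
}
```

With every process sorting by the same key, two concurrent writers request the row locks in the same sequence, so one of them waits for the other instead of each holding a lock the other needs.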

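Example flow with psql. The statements of the two clients are interleaved: each client runs its first UPDATE before either runs its second, so client 1's second UPDATE blocks until the deadlock is detected in client 2: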

Client 1:

praefect_development=# begin;
BEGIN
praefect_development=# UPDATE node_status SET last_contact_attempt_at = NOW() where id = 1;
UPDATE 1
praefect_development=# UPDATE node_status SET last_contact_attempt_at = NOW() where id = 2;
UPDATE 1
praefect_development=# 

Client 2:

praefect_development=# begin;
BEGIN
praefect_development=# UPDATE node_status SET last_contact_attempt_at = NOW() where id = 2;
UPDATE 1
praefect_development=# UPDATE node_status SET last_contact_attempt_at = NOW() where id = 1;
ERROR:  deadlock detected
DETAIL:  Process 52732 waits for ShareLock on transaction 73191; blocked by process 52696.
Process 52696 waits for ShareLock on transaction 73192; blocked by process 52732.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2,7) in relation "node_status"
Edited by Sami Hiltunen