On SIGHUP, parent can live well beyond grace period and stops responding to signals

I don't have much information to help reproduce this, apart from the production incident report on gitlab.com: gitlab-com/gl-infra/production#2452 (closed).

The logs from this gitaly shard should be visible in Kibana.

To summarize the incident: we observed a gitaly parent process remain alive for over an hour after gitlab-ctl hup gitaly was issued. There appears to be a race condition in the interprocess communication logic that is meant to facilitate gitaly zero-downtime upgrades.

  • gitaly-wrapper appeared to be watching the child
    • gitlab-ctl hup gitaly failed, because the child would not fork until the parent exited
  • Both the parent and child appeared to be successfully serving requests
  • The parent did not respond to SIGTERM or SIGINT
  • SIGKILL'ing the parent allowed a subsequent gitlab-ctl hup gitaly to succeed
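
For reference, the behavior we expected from the parent can be sketched roughly as follows. This is a minimal illustration in Python, not gitaly's actual code (gitaly is written in Go): the 60-second grace period, the function names, and the `drained` event are all hypothetical. The point is only that the wait for in-flight work is bounded by a deadline, so the process exits even if the drain never completes.

```python
import os
import signal
import threading

# Hypothetical value; the real grace period is whatever gitaly is configured with.
GRACE_PERIOD_SECONDS = 60.0


def drain_or_timeout(drained: threading.Event, grace_period: float) -> bool:
    """Wait for in-flight work to finish, but never longer than grace_period.

    Returns True if the drain completed in time, False if the deadline expired.
    """
    return drained.wait(timeout=grace_period)


def install_sighup_handler(drained: threading.Event, grace_period: float) -> None:
    """On SIGHUP, start a bounded drain and exit when it ends or times out."""

    def on_sighup(signum, frame):
        def shutdown():
            clean = drain_or_timeout(drained, grace_period)
            # Exit unconditionally: 0 after a clean drain, 1 after a forced exit.
            os._exit(0 if clean else 1)

        # Keep the handler itself short; do the waiting in a separate thread.
        threading.Thread(target=shutdown, daemon=True).start()

    signal.signal(signal.SIGHUP, on_sighup)


if __name__ == "__main__":
    # A real server would set this event once its last in-flight request finishes.
    drained = threading.Event()
    install_sighup_handler(drained, GRACE_PERIOD_SECONDS)
    while True:
        signal.pause()  # sleep until a signal arrives
```

In the incident, the parent behaved as if no such deadline existed: it kept running (and serving requests) for over an hour after the HUP.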

I know this is a bit sparse; let me know if you need any more info.
