gitlab-ctl restart unicorn does not always restart unicorn
ZD: https://gitlab.zendesk.com/agent/tickets/93745, https://gitlab.zendesk.com/agent/tickets/93217
I was on calls twice this week trying to help customers upgrade to GitLab 10.6.x. Running gitlab-ctl restart unicorn didn't actually kill the unicorn master process. We had to kill the unicorn process by hand.
@namhokim and I poked at the GitHost instance that was running into this. Running gitlab-ctl restart unicorn appears to kill the gitlab-unicorn-wrapper script, but if that script hasn't properly adopted the running master, then the restart just kills the script and not the unicorn master process itself.
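One way to confirm this symptom is to compare the unicorn master PID before and after the restart. The following is only a rough sketch: the helper name master_was_replaced is hypothetical, and the pid file path in the commented usage is an assumption about the default Omnibus layout that may differ on your install.

```shell
# Hypothetical helper: succeed only if the master PID changed across a restart.
# $1 = PID read before the restart, $2 = PID read after it.
master_was_replaced() {
    old_pid=$1
    new_pid=$2
    # An empty value means the pid file could not be read; report "unknown".
    if [ -z "$old_pid" ] || [ -z "$new_pid" ]; then
        return 2
    fi
    # An unchanged PID means the old master survived the restart.
    [ "$old_pid" != "$new_pid" ]
}

# Usage (assumed pid file location, adjust as needed):
#   PIDFILE=/opt/gitlab/var/unicorn/unicorn.pid
#   old=$(cat "$PIDFILE")
#   gitlab-ctl restart unicorn
#   new=$(cat "$PIDFILE")
#   master_was_replaced "$old" "$new" || echo "unicorn master was not replaced"
```

In the case described here, the master stayed at PID 8331 across the restart, so a check like this would have flagged it.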
Here are the logs from /var/log/gitlab/unicorn/current:
2018-03-30_11:17:45.02970 forwarding TERM to unicorn master 8331
2018-03-30_11:17:45.02994 forwarding CONT to unicorn master 8331
2018-03-30_11:17:45.03009 wrapper for unicorn master 8331 exiting
2018-03-30_11:17:45.06008 adopted existing unicorn master 8331
2018-03-30_16:03:00.83713 Received TERM from runit, sending to process group (-PID)
2018-03-30_16:03:00.83858 Terminated
2018-03-30_16:03:00.83859 forwarding TERM to unicorn master 8331
2018-03-30_16:03:00.83892 wrapper for unicorn master 8331 exiting
2018-03-30_16:03:00.86902 adopted existing unicorn master 8331
As you can see, PID 8331 is the right process ID, which we killed manually at 16:03. Running gitlab-ctl restart unicorn was issuing the kill to PID 30280, which was the gitlab-unicorn-process, but we never see an "adopted existing unicorn master" message from https://gitlab.com/gitlab-org/omnibus-gitlab/blob/b75ef40a28856e04cc2c4f4989de6cd504927e00/files/gitlab-scripts/gitlab-unicorn-wrapper#L55.
My guess is that something in the wrapper script got stuck, and it failed to do anything. Perhaps kill -0 sent to the unicorn process hung. I did see these warnings before unicorn started up after we killed it, which may be irrelevant but makes me wonder if unicorn was blocked on some DB connection:
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
FATAL: terminating connection due to administrator command
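For reference on the kill -0 guess above: kill -0 delivers no signal and only reports whether the target PID exists and can be signalled by the caller. A minimal sketch of that kind of liveness check follows. The function name pid_is_alive is hypothetical and this is not the wrapper's actual code.

```shell
# Hypothetical helper: succeed if a process with the given PID exists and
# the caller is allowed to signal it. Signal 0 is never delivered.
pid_is_alive() {
    kill -0 "$1" 2>/dev/null
}

# Usage, with the master PID from the logs above:
#   pid_is_alive 8331 && echo "old unicorn master is still running"
```

Running a check like this against the old master PID right after gitlab-ctl restart unicorn would show whether the master outlived the wrapper.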