sidekiq-cluster service might get left behind in runit
Summary
The purpose of this issue is to document potential fixes, since the code to handle this in chef was removed in 14.0. It probably doesn't make sense to add this to 15.x documentation.
GitLab servers running both sidekiq and sidekiq-cluster have been observed more than once. GitLab team members can refer to one of the tickets.
This issue covers how to remove the sidekiq-cluster service if it shouldn't be running.
Steps to reproduce
- An installation defined the following in %13.12 or earlier:

      sidekiq_cluster['enable'] = true
      sidekiq_cluster['queue_groups'] = ['*']

  - With no other changes from default, this will run both sidekiq and sidekiq-cluster.
- It appears that it was possible for /opt/gitlab/sv/sidekiq-cluster/run to get removed without the service being undefined via ruby_block[disable sidekiq-cluster] action run. Simulate this with:

      mv /opt/gitlab/sv/sidekiq-cluster/run /opt/gitlab/sv/sidekiq-cluster/run-foo
      gitlab-ctl restart

- After that, disable won't run because enabled? returns false: it depends only on the presence of the run file.

      def enabled?
        ::File.exist?(::File.join(service_dir_name, 'run'))
      end
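
The effect of the missing run file can be reproduced away from a real installation. The following is a minimal sketch, assuming a scratch directory in place of /opt/gitlab/sv; `service_enabled?` and `simulate_lost_run_file` are hypothetical names that only mirror the `enabled?` check quoted above, they are not part of the cookbook.

```ruby
require 'tmpdir'
require 'fileutils'

# Hypothetical stand-in for the cookbook's enabled? check quoted above:
# the service counts as enabled only while its `run` file exists.
def service_enabled?(service_dir_name)
  ::File.exist?(::File.join(service_dir_name, 'run'))
end

# Simulate the reported state in a scratch directory (not /opt/gitlab).
def simulate_lost_run_file
  Dir.mktmpdir do |root|
    service_dir = File.join(root, 'sidekiq-cluster')
    FileUtils.mkdir_p(service_dir)
    FileUtils.touch(File.join(service_dir, 'run'))
    before = service_enabled?(service_dir)

    # Same move as the reproduction step: run -> run-foo.
    FileUtils.mv(File.join(service_dir, 'run'), File.join(service_dir, 'run-foo'))
    after = service_enabled?(service_dir)

    # The directory itself is still there, which is why the service lingers.
    { before: before, after: after, dir_still_present: File.directory?(service_dir) }
  end
end

puts simulate_lost_run_file
```

The check flips from true to false as soon as run is renamed, even though the service directory is still present, which matches the observation that the disable step is skipped.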
What is the current bug behavior?
# gitlab-ctl status
run: gitaly: (pid 6832) 455s; run: log: (pid 1377) 6278s
run: gitlab-workhorse: (pid 6848) 455s; run: log: (pid 1381) 6278s
run: logrotate: (pid 6854) 455s; run: log: (pid 1376) 6278s
run: nginx: (pid 6867) 454s; run: log: (pid 1384) 6278s
run: postgresql: (pid 6872) 454s; run: log: (pid 1380) 6278s
run: puma: (pid 6881) 453s; run: log: (pid 1370) 6278s
run: redis: (pid 6886) 453s; run: log: (pid 1372) 6278s
run: sidekiq: (pid 6892) 452s; run: log: (pid 1373) 6278s
down: sidekiq-cluster: 0s, normally up, want up; run: log: (pid 6031) 714s
What is the expected correct behavior?
# gitlab-ctl status
run: gitaly: (pid 6832) 455s; run: log: (pid 1377) 6278s
run: gitlab-workhorse: (pid 6848) 455s; run: log: (pid 1381) 6278s
run: logrotate: (pid 6854) 455s; run: log: (pid 1376) 6278s
run: nginx: (pid 6867) 454s; run: log: (pid 1384) 6278s
run: postgresql: (pid 6872) 454s; run: log: (pid 1380) 6278s
run: puma: (pid 6881) 453s; run: log: (pid 1370) 6278s
run: redis: (pid 6886) 453s; run: log: (pid 1372) 6278s
run: sidekiq: (pid 6892) 452s; run: log: (pid 1373) 6278s
Potential fixes
13.12 or earlier
- Remove any sidekiq_cluster definitions from gitlab.rb
- Replace with sidekiq['queue_groups'] = ['*'], sidekiq['cluster'] = true etc (Review 13.x docs for how to do this)
  - Assuming that this issue is of interest because this step has already been done, but sidekiq-cluster is persisting.
- Run gitlab-ctl reconfigure
- Chef should remove the sidekiq-cluster service if sidekiq is enabled.
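If it doesn't, then up to 13.12 inclusive, use Chef to fix this as follows: recreate the run file so that enabled? should return true again and the disable action can run on the next reconfigure. Note: setting sidekiq_cluster['enable'] = false doesn't fix it.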
# run as root
install -d -m 0755 --owner=root --group=root /opt/gitlab/sv/sidekiq-cluster
install -m 0755 --owner=root --group=root /dev/null /opt/gitlab/sv/sidekiq-cluster/run
gitlab-ctl reconfigure
If neither approach fixes it, use the 14.0+ fix.
- If reconfigure puts sidekiq-cluster back, then this ought to be because gitlab.rb has it enabled.
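
To check whether gitlab.rb still enables the service, the file can be scanned for uncommented sidekiq_cluster settings. This is a rough sketch, not an official tool: `active_sidekiq_cluster_settings` is a hypothetical helper, and it works on the file contents as a string so it can be tried without touching a real installation.

```ruby
# Return the uncommented lines of a gitlab.rb-style config that still set
# sidekiq_cluster[...]. Commented-out lines start with '#' and are skipped.
def active_sidekiq_cluster_settings(config_text)
  config_text.each_line.map(&:strip).select do |line|
    line.start_with?('sidekiq_cluster[')
  end
end

# Sample contents, made up for illustration.
sample = <<~CONFIG
  # sidekiq_cluster['enable'] = false
  sidekiq_cluster['enable'] = true
  sidekiq_cluster['queue_groups'] = ['*']
  sidekiq['queue_groups'] = ['*']
CONFIG

puts active_sidekiq_cluster_settings(sample)
```

On a host, the same helper could be given `File.read('/etc/gitlab/gitlab.rb')`, assuming the configuration lives at the usual Omnibus location. An empty result suggests the leftover service is not coming from gitlab.rb.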
14.0 or later
gitlab-ctl stop
systemctl stop gitlab-runsvdir.service
rm -rf /opt/gitlab/sv/sidekiq-cluster
systemctl start gitlab-runsvdir.service
# not strictly needed, but it's always best to know the install is in sync
# it is needed if you run this on 13.12 or earlier because the other fix didn't work
gitlab-ctl reconfigure
- Tested on 13.12, as adding this service to later versions (for testing purposes) is non-trivial!
- Fix doesn't depend on the presence (or absence) of Chef recipes etc., so this should work on later releases as well.
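
After either fix, the result can be confirmed by checking that sidekiq-cluster no longer appears in the status output. The sketch below assumes each status line starts with a state (run or down) followed by the service name, as in the output shown in this issue; `parse_status` and `sidekiq_cluster_left_behind?` are hypothetical helper names.

```ruby
# Parse `gitlab-ctl status`-style output into { service name => state }.
# Lines that don't look like "<state>: <service>: ..." are ignored.
def parse_status(output)
  output.each_line.each_with_object({}) do |line, services|
    match = line.match(/\A\s*(run|down): ([^:]+):/)
    services[match[2]] = match[1] if match
  end
end

# True while runit still knows about the sidekiq-cluster service.
def sidekiq_cluster_left_behind?(output)
  parse_status(output).key?('sidekiq-cluster')
end

# Two lines taken from the bug output in this issue.
buggy = <<~STATUS
  run: sidekiq: (pid 6892) 452s; run: log: (pid 1373) 6278s
  down: sidekiq-cluster: 0s, normally up, want up; run: log: (pid 6031) 714s
STATUS

puts sidekiq_cluster_left_behind?(buggy)
```

The helpers only look at text, so the captured output of gitlab-ctl status can be passed in directly on a real host.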
Discussion
- From the comments, the chef/Omnibus code is being defensive. If we shut down GitLab entirely, we shouldn't need to be.
- block { disable_service } in /opt/gitlab/embedded/cookbooks/runit/libraries/provider_runit_service.rb calls

      def disable_service
        Chef::Log.debug("Attempting to disable runit service with: #{new_resource.sv_bin} #{sv_args}down #{service_dir_name}")
        shell_out("#{new_resource.sv_bin} #{sv_args}down #{service_dir_name}")
        FileUtils.rm(service_dir_name)

- /var/log/gitlab/sidekiq-cluster could also be cleaned up.
- Chef also does: FileUtils.rm(::File.join(sv_dir_name, 'supervise', 'ok')) but I can't see that sv_dir_name is a different location from service_dir_name, and the ok file shouldn't get recreated with everything shut down.