sidekiq-cluster service might get left behind in runit

Summary

The purpose of this issue is to document potential fixes, since the code to handle this in Chef was removed in 14.0. It probably doesn't make sense to add this to the 15.x documentation.

GitLab servers running both sidekiq and sidekiq-cluster have been observed more than once. GitLab team members can refer to one of the tickets.

This issue covers how to remove the sidekiq-cluster service when it shouldn't be running.

Steps to reproduce

  1. An installation defined the following in GitLab 13.12 or earlier:

    sidekiq_cluster['enable'] = true
    sidekiq_cluster['queue_groups'] = ['*']
    • With no other changes from default, this will run both sidekiq and sidekiq-cluster.
  2. It appears that it was possible for /opt/gitlab/sv/sidekiq-cluster/run to get removed without the service being undefined via ruby_block[disable sidekiq-cluster] action run. Simulate this with:

    mv /opt/gitlab/sv/sidekiq-cluster/run /opt/gitlab/sv/sidekiq-cluster/run-foo
    gitlab-ctl restart
  3. After that, disable won't run because enabled? returns false: enabled? depends only on the presence of the run file.

        def enabled?
          ::File.exist?(::File.join(service_dir_name, 'run'))
        end
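
The check above can be mirrored in shell to spot service directories that Chef would treat as disabled. This is a minimal sketch, not part of Omnibus: the function names `service_enabled` and `list_orphaned_services` are hypothetical, and it assumes the same rule as `enabled?`, namely that a service counts as enabled only when its directory contains a `run` file.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if the service directory has a 'run'
# file, which is the same test the Ruby enabled? method performs.
service_enabled() {
    [ -e "$1/run" ]
}

# Hypothetical helper: print the name of every service directory under $1
# that has no 'run' file, i.e. one that Chef would not try to disable.
list_orphaned_services() {
    for dir in "$1"/*/; do
        [ -d "$dir" ] || continue      # skip the unexpanded glob when $1 is empty
        dir=${dir%/}                   # drop the trailing slash
        service_enabled "$dir" || basename "$dir"
    done
}
```

After the simulation in step 2, `list_orphaned_services /opt/gitlab/sv` would be expected to print `sidekiq-cluster`.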

What is the current bug behavior?

# gitlab-ctl status
run: gitaly: (pid 6832) 455s; run: log: (pid 1377) 6278s
run: gitlab-workhorse: (pid 6848) 455s; run: log: (pid 1381) 6278s
run: logrotate: (pid 6854) 455s; run: log: (pid 1376) 6278s
run: nginx: (pid 6867) 454s; run: log: (pid 1384) 6278s
run: postgresql: (pid 6872) 454s; run: log: (pid 1380) 6278s
run: puma: (pid 6881) 453s; run: log: (pid 1370) 6278s
run: redis: (pid 6886) 453s; run: log: (pid 1372) 6278s
run: sidekiq: (pid 6892) 452s; run: log: (pid 1373) 6278s
down: sidekiq-cluster: 0s, normally up, want up; run: log: (pid 6031) 714s
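
To pick out affected services from output like the above without reading it by eye, something along these lines may help. It is only a sketch: `down_services` is a hypothetical name, and it assumes the status lines keep the `state: name: details` layout shown here.

```shell
#!/bin/sh
# Hypothetical helper: read `gitlab-ctl status` style output on stdin and
# print the name of each service whose line starts with "down".
down_services() {
    awk -F': ' '$1 == "down" { print $2 }'
}
```

Usage would be `gitlab-ctl status | down_services`.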

What is the expected correct behavior?

# gitlab-ctl status
run: gitaly: (pid 6832) 455s; run: log: (pid 1377) 6278s
run: gitlab-workhorse: (pid 6848) 455s; run: log: (pid 1381) 6278s
run: logrotate: (pid 6854) 455s; run: log: (pid 1376) 6278s
run: nginx: (pid 6867) 454s; run: log: (pid 1384) 6278s
run: postgresql: (pid 6872) 454s; run: log: (pid 1380) 6278s
run: puma: (pid 6881) 453s; run: log: (pid 1370) 6278s
run: redis: (pid 6886) 453s; run: log: (pid 1372) 6278s
run: sidekiq: (pid 6892) 452s; run: log: (pid 1373) 6278s

Potential fixes

13.12 or earlier

  1. Remove any sidekiq_cluster definitions from gitlab.rb
  2. Replace with sidekiq['queue_groups'] = ['*'], sidekiq['cluster'] = true etc (Review 13.x docs for how to do this)
    • Assuming that this issue is of interest because this step has already been done, but sidekiq-cluster is persisting.
  3. Run gitlab-ctl reconfigure
  4. Chef should remove the sidekiq-cluster service if sidekiq is enabled.

If it doesn't, then up to 13.12 inclusive, use Chef to fix this as follows. Note: sidekiq_cluster['enable'] = false doesn't fix it.

# run as root
install -d -m 0755 --owner=root --group=root /opt/gitlab/sv/sidekiq-cluster
install -m 0755 --owner=root --group=root /dev/null /opt/gitlab/sv/sidekiq-cluster/run
gitlab-ctl reconfigure

If neither approach fixes it, use the 14.0+ fix.

  • If reconfigure puts sidekiq-cluster back, the most likely cause is that gitlab.rb still has it enabled.
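
A quick way to confirm whether gitlab.rb still carries any active sidekiq_cluster settings is to search for uncommented lines. This is a sketch under assumptions: the function name `active_sidekiq_cluster_settings` is hypothetical, and the usual Omnibus location `/etc/gitlab/gitlab.rb` is assumed in the usage note.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) every uncommented line in
# the given file that sets a sidekiq_cluster[...] option.
# Exits non-zero when no such line exists, as grep does.
active_sidekiq_cluster_settings() {
    grep -nE '^[[:space:]]*sidekiq_cluster\[' "$1"
}
```

For example, `active_sidekiq_cluster_settings /etc/gitlab/gitlab.rb` should print nothing once step 1 above has been completed.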

14.0 or later

gitlab-ctl stop
systemctl stop gitlab-runsvdir.service
rm -rf /opt/gitlab/sv/sidekiq-cluster
systemctl start gitlab-runsvdir.service
# not strictly needed, but it is always best to know the install is in sync
# needed if you run this on 13.12 or earlier because the other fix didn't work
gitlab-ctl reconfigure
  • Tested on 13.12, as adding this service to later versions (for testing purposes) is non-trivial!
    • Fix doesn't depend on the presence (or absence) of Chef recipes etc., so this should work on later releases as well.
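
Because the sequence above stops every GitLab service and deletes a directory, it may be worth wrapping it so it can be previewed first. The following is a sketch only: `run_step`, `remove_leftover_service` and the `DRY_RUN` variable are hypothetical names, and the commands themselves are exactly the ones listed above.

```shell
#!/bin/sh
# Hypothetical wrapper: with DRY_RUN=1 (the default) only print each command;
# with DRY_RUN=0 execute it.
run_step() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Hypothetical helper: apply the 14.0+ fix to the service directory in $1,
# e.g. /opt/gitlab/sv/sidekiq-cluster. Must be run as root when DRY_RUN=0.
remove_leftover_service() {
    sv_dir=$1
    if [ ! -d "$sv_dir" ]; then
        echo "nothing to do: $sv_dir does not exist"
        return 0
    fi
    # Stop at the first failing step rather than continuing half-done.
    run_step gitlab-ctl stop &&
    run_step systemctl stop gitlab-runsvdir.service &&
    run_step rm -rf "$sv_dir" &&
    run_step systemctl start gitlab-runsvdir.service &&
    run_step gitlab-ctl reconfigure
}
```

Running `remove_leftover_service /opt/gitlab/sv/sidekiq-cluster` previews the steps; `DRY_RUN=0` in the environment performs them.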

Discussion

  • From the comments, the Chef/Omnibus code is being defensive. If we shut down GitLab entirely, we shouldn't need to be.
  • block { disable_service } in /opt/gitlab/embedded/cookbooks/runit/libraries/provider_runit_service.rb calls
    def disable_service
      Chef::Log.debug("Attempting to disable runit service with: #{new_resource.sv_bin} #{sv_args}down #{service_dir_name}")
      shell_out("#{new_resource.sv_bin} #{sv_args}down #{service_dir_name}")
      FileUtils.rm(service_dir_name)
  • /var/log/gitlab/sidekiq-cluster could also be cleaned up.
  • Chef also does FileUtils.rm(::File.join(sv_dir_name, 'supervise', 'ok')), but I can't see that sv_dir_name is a different location from service_dir_name, and the ok file shouldn't get recreated with everything shut down.
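
Whether two directory names refer to the same location can be checked by resolving symlinks on both. This is a sketch only: `same_location` is a hypothetical name, it relies on `readlink -f` as provided by GNU coreutils, and the `/opt/gitlab/service/sidekiq-cluster` path in the usage note is an assumption about what service_dir_name expands to, not something confirmed in this issue.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if both paths exist and resolve to the
# same location once symlinks are followed.
same_location() {
    [ -e "$1" ] && [ -e "$2" ] &&
        [ "$(readlink -f "$1")" = "$(readlink -f "$2")" ]
}
```

For example: `same_location /opt/gitlab/service/sidekiq-cluster /opt/gitlab/sv/sidekiq-cluster && echo same`.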

Edited by Ben Prescott (ex-GitLab)