Race condition in runit_service 'geo-logcursor' during gitlab-ctl reconfigure in container environments

Summary

There is a race condition in the GitLab Omnibus reconfigure phase, specifically in the execution of the runit_service 'geo-logcursor' block inside:

files/gitlab-cookbooks/gitlab-ee/recipes/geo-logcursor.rb

During containerized GitLab startup (e.g. Docker), the runit_service call attempts to invoke sv restart geo-logcursor before runsvdir has spawned runsv for the new service.

This causes a transient but real failure:

fail: geo-logcursor: runsv not running

However, the service is correctly picked up by runsvdir a few seconds later, and everything works from that point forward. If the runit_service block in geo-logcursor.rb is commented out, the issue does not occur, because runsvdir has time to detect the symlink and spawn runsv before any sv command is invoked.


Relevant code

execute 'restart geo-logcursor' do
  command '/opt/gitlab/bin/gitlab-ctl restart geo-logcursor'
  action :nothing
  dependent_services.map { |svc| subscribes :run, "runit_service[#{svc}]" }
  notifies :restart, "runit_service[puma]" if omnibus_helper.should_notify?('puma')
end

Steps to reproduce

  1. Deploy GitLab EE in a Docker container.

  2. Enable geo-logcursor in gitlab.rb.

  3. Ensure gitlab-ctl reconfigure runs during container startup.

  4. Observe logs during reconfigure.

  5. See the error:

    fail: geo-logcursor: runsv not running

  6. Comment out the runit_service block for geo-logcursor.

  7. Restart container; the error no longer appears, and the service starts normally.


Analysis

  • runit_service creates the service directory and symlink.
  • Then it calls sv restart or similar commands.
  • But runsvdir requires time to detect the new service and spawn runsv.
  • sv fails if it runs before runsv is present and listening on the FIFOs.
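
The last point can be observed directly. The following is a minimal Ruby sketch, assuming Linux FIFO semantics, of a probe that distinguishes the two states: a non-blocking write-open of supervise/ok fails with ENXIO while no process holds the FIFO open for reading, which appears to be the condition sv reports as "runsv not running". The helper name runsv_attached? is hypothetical and is not part of Omnibus or runit.

```ruby
# Hypothetical helper: returns true when some process (normally runsv)
# has the service's supervise/ok FIFO open for reading.
def runsv_attached?(service_dir)
  ok_fifo = File.join(service_dir, 'supervise', 'ok')
  # A non-blocking write-open of a FIFO with no reader raises ENXIO.
  File.open(ok_fifo, File::WRONLY | File::NONBLOCK) { true }
rescue Errno::ENXIO, Errno::ENOENT
  # ENXIO: the FIFO exists but nobody is reading it yet.
  # ENOENT: the supervise directory has not been created yet.
  false
end
```

Calling runsv_attached?('/opt/gitlab/service/geo-logcursor') right after the symlink is created would be expected to return false during the race window and true once runsvdir has spawned runsv.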

Suggested fix

  • Add a wait mechanism inside runit_service to detect when runsv has attached to supervise/ok.
  • For example, poll (e.g. with lsof) until /opt/gitlab/service/<svc>/supervise/ok has been opened by runsv, and only then execute sv.
  • Alternatively, introduce retries for sv commands when the specific error runsv not running is returned.
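
Both suggestions can be sketched in a few lines of Ruby. This is only an illustration under stated assumptions, not the cookbook's actual implementation: wait_until and run_with_retry are hypothetical helper names, the timeouts are arbitrary, and the readiness probe is left to the caller as a block.

```ruby
require 'open3'

# Hypothetical helper: poll the given block until it returns true,
# or give up once `timeout` seconds have elapsed.
def wait_until(timeout: 30, interval: 0.5)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  until yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
  true
end

# Hypothetical helper: run a command and retry it only while it fails
# with the specific "runsv not running" message. Any other failure is
# returned immediately so real errors are not hidden.
def run_with_retry(*command, attempts: 5, delay: 1)
  output = ''
  status = nil
  attempts.times do |attempt|
    output, status = Open3.capture2e(*command)
    break if status.success? || !output.include?('runsv not running')
    sleep delay if attempt < attempts - 1
  end
  [output, status]
end
```

For example, run_with_retry('/opt/gitlab/embedded/bin/sv', 'restart', '/opt/gitlab/service/geo-logcursor') would retry up to five times, one second apart. The block passed to wait_until could shell out to lsof as suggested, or attempt a non-blocking write-open of supervise/ok. Retrying on the exact message keeps the change narrow, at the cost of depending on sv's output text.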

Workaround

  • Wrap /opt/gitlab/embedded/bin/sv to delay execution until runsv is detected.
  • Or comment out the runit_service block (viable in custom Omnibus builds only).
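
A rough sketch of the first workaround follows, written in Ruby. Everything specific in it is an assumption for illustration: the real binary is presumed to have been moved to /opt/gitlab/embedded/bin/sv.real (a hypothetical path), services are presumed to be passed by name or absolute path, and only direct invocations of the form sv [-v] [-w sec] command services are covered. The function names are hypothetical as well.

```ruby
# Hypothetical locations; adjust to the actual installation.
REAL_SV = '/opt/gitlab/embedded/bin/sv.real'
SERVICE_ROOT = '/opt/gitlab/service'

# Extract service directories from `sv [-v] [-w sec] command services...`.
def service_dirs(args, root: SERVICE_ROOT)
  rest = args.dup
  while rest.first&.start_with?('-')
    flag = rest.shift
    rest.shift if flag == '-w' # -w takes a separate value
  end
  rest.shift # drop the command (restart, status, ...)
  rest.map { |svc| svc.start_with?('/') ? svc : File.join(root, svc) }
end

# True when a reader (normally runsv) holds supervise/ok open.
def reader_present?(dir)
  File.open(File.join(dir, 'supervise', 'ok'), File::WRONLY | File::NONBLOCK) { true }
rescue Errno::ENXIO, Errno::ENOENT
  false
end

# Wait up to `timeout` seconds per service, then run the real sv in any
# case, so genuine failures are still reported by sv itself.
def wrapped_sv(args, real_sv: REAL_SV, root: SERVICE_ROOT, timeout: 10, interval: 0.2)
  service_dirs(args, root: root).each do |dir|
    deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
    sleep interval until reader_present?(dir) ||
                         Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
  end
  system(real_sv, *args) ? 0 : ($?.exitstatus || 1)
end
```

A wrapper script installed in place of sv would then end with exit wrapped_sv(ARGV). Because the wait is bounded and the real sv always runs afterwards, a service that never comes up still fails with sv's own error rather than hanging reconfigure.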

Impact

  • Causes transient but misleading startup failures.
  • Breaks container startup idempotency.
  • Introduces fragility into CI/CD automation.

Please consider

Please consider making runit_service more resilient to runsv startup timing. This is especially relevant in Docker-based or ephemeral deployments where gitlab-ctl reconfigure is executed automatically as part of the container boot process.


Edited by Michael Kazakov