Race condition in `runit_service 'geo-logcursor'` during `gitlab-ctl reconfigure` in container environments

Summary

There is a race condition in the GitLab Omnibus reconfigure phase, specifically in the execution of the runit_service 'geo-logcursor' block inside:

files/gitlab-cookbooks/gitlab-ee/recipes/geo-logcursor.rb

During containerized GitLab startup (e.g. Docker), the runit_service call attempts to invoke sv restart geo-logcursor before runsvdir has spawned runsv for the new service.

This causes a transient but real failure:

fail: geo-logcursor: runsv not running

However, the service is correctly picked up by runsvdir a few seconds later, and everything works from that point forward. If the runit_service block in geo-logcursor.rb is commented out, the issue does not occur, because runsvdir has time to detect the symlink and spawn runsv before any sv command is invoked.

Relevant code

execute 'restart geo-logcursor' do
  command '/opt/gitlab/bin/gitlab-ctl restart geo-logcursor'
  action :nothing
  dependent_services.map { |svc| subscribes :run, "runit_service[#{svc}]" }
  notifies :restart, "runit_service[puma]" if omnibus_helper.should_notify?('puma')
end

Steps to reproduce

1. Deploy GitLab EE in a Docker container.
2. Enable geo-logcursor in gitlab.rb.
3. Ensure gitlab-ctl reconfigure runs during container startup.
4. Observe logs during reconfigure.
5. See the error:
```
fail: geo-logcursor: runsv not running
```
6. Comment out the runit_service block for geo-logcursor.
7. Restart the container; the error no longer appears, and the service starts normally.

Analysis

runit_service creates the service directory and the symlink under /opt/gitlab/service.
It then immediately calls sv restart or a similar command.
runsvdir, however, needs time to detect the new symlink and spawn runsv for it.
sv fails if it runs before runsv is present and listening on the FIFOs in the service's supervise/ directory.
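
The failure mode described above can be reproduced without GitLab at all. The sketch below is only an illustration: it assumes that runsv keeps supervise/ok open for reading and that sv probes it with a non-blocking open for writing, which fails with ENXIO on Linux while the FIFO has no reader. The helper name `supervise_state` and its return symbols are hypothetical and not part of Omnibus or runit.

```ruby
# Hypothetical helper: classify a runit service directory by probing
# supervise/ok the way the analysis above describes.
# Assumption: runsv holds supervise/ok open for reading, so a non-blocking
# open for writing only succeeds once runsv has started.
def supervise_state(service_dir)
  ok_fifo = File.join(service_dir, 'supervise', 'ok')
  # Open and immediately close; only the success or failure matters.
  File.open(ok_fifo, File::WRONLY | File::NONBLOCK) {}
  :runsv_running
rescue Errno::ENOENT
  # The service directory or FIFO has not been created yet.
  :not_created
rescue Errno::ENXIO
  # The FIFO exists but nothing is reading it: the window in which
  # sv would report "runsv not running".
  :runsv_not_running
end
```

Calling `supervise_state('/opt/gitlab/service/geo-logcursor')` right after the symlink is created would, under these assumptions, return `:runsv_not_running` (or `:not_created`) until runsvdir has spawned runsv.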

Suggested fix

Add a wait mechanism inside runit_service to detect when runsv has attached to supervise/ok.
For example, poll with lsof until /opt/gitlab/service/<svc>/supervise/ok is held open by runsv, and only then execute sv.
Alternatively, retry the sv command when it fails with the specific error "runsv not running".
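
Both suggestions could look roughly like the sketch below. This is not how runit_service is implemented today; `wait_for_runsv` and `run_with_retry` are hypothetical names, and the timeouts and attempt counts are arbitrary. Instead of shelling out to lsof, the wait helper probes supervise/ok with a non-blocking open for writing, on the assumption that this fails with ENXIO until runsv is reading the FIFO.

```ruby
require 'open3'

# Hypothetical helper: block until something (normally runsv) has
# <service_dir>/supervise/ok open for reading, or until the timeout expires.
# Returns true if a reader was detected, false otherwise.
def wait_for_runsv(service_dir, timeout: 10.0, interval: 0.2)
  ok_fifo = File.join(service_dir, 'supervise', 'ok')
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    begin
      File.open(ok_fifo, File::WRONLY | File::NONBLOCK) {}
      return true
    rescue Errno::ENOENT, Errno::ENXIO
      # FIFO missing or no reader yet: runsv has not attached.
    end
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline
    sleep interval
  end
end

# Hypothetical helper: run a command and retry it only when it fails with
# the transient "runsv not running" message. Any other failure is returned
# immediately. Returns [combined_output, Process::Status].
def run_with_retry(*cmd, attempts: 5, delay: 1.0)
  output = ''
  status = nil
  attempts.times do |i|
    output, status = Open3.capture2e(*cmd)
    transient = !status.success? && output.include?('runsv not running')
    break unless transient
    sleep delay if i < attempts - 1
  end
  [output, status]
end
```

A caller might use `wait_for_runsv('/opt/gitlab/service/geo-logcursor')` before invoking sv, or wrap the sv call itself in `run_with_retry`. Probing the FIFO directly avoids a dependency on lsof being installed in the container, at the cost of relying on the FIFO behaviour assumed above.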

Workaround

Wrap /opt/gitlab/embedded/bin/sv to delay execution until runsv is detected.
Or comment out the runit_service block (viable in custom Omnibus builds only).

Impact

Causes transient but misleading startup failures.
Breaks container startup idempotency.
Introduces fragility into CI/CD automation.

Please consider

Please consider making runit_service more resilient to runsv startup timing. This is especially relevant in Docker-based or ephemeral deployments where gitlab-ctl reconfigure is executed automatically as part of the container boot process.

Edited Aug 14, 2025 by Michael Kazakov
