Race condition in runit_service 'geo-logcursor' during gitlab-ctl reconfigure in container environments
Summary
There is a race condition in the GitLab Omnibus reconfigure phase, specifically in the execution of the runit_service 'geo-logcursor' block inside:
files/gitlab-cookbooks/gitlab-ee/recipes/geo-logcursor.rb
During containerized GitLab startup (e.g. Docker), the runit_service call attempts to invoke sv restart geo-logcursor before runsvdir has spawned runsv for the new service.
This causes a transient but real failure:
fail: geo-logcursor: runsv not running
However, the service is correctly picked up by runsvdir a few seconds later, and everything works from that point forward.
If the runit_service block in geo-logcursor.rb is commented out, the issue does not occur, because runsvdir has time to detect the symlink and spawn runsv before any sv command is invoked.
Relevant code
execute 'restart geo-logcursor' do
  command '/opt/gitlab/bin/gitlab-ctl restart geo-logcursor'
  action :nothing
  dependent_services.map { |svc| subscribes :run, "runit_service[#{svc}]" }
  notifies :restart, "runit_service[puma]" if omnibus_helper.should_notify?('puma')
end
Steps to reproduce
- Deploy GitLab EE in a Docker container.
- Enable geo-logcursor in gitlab.rb.
- Ensure gitlab-ctl reconfigure runs during container startup.
- Observe logs during reconfigure.
- See the error: fail: geo-logcursor: runsv not running
- Comment out the runit_service block for geo-logcursor.
- Restart container; the error no longer appears, and the service starts normally.
Analysis
- runit_service creates the service directory and symlink.
- Then it calls sv restart or similar commands.
- But runsvdir requires time to detect the new service and spawn runsv.
- sv fails if it runs before runsv is present and listening on the FIFOs.
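The window described above can be observed directly. The following is a rough sketch, not code from Omnibus: the helper name runsv_state is hypothetical, and it assumes runsv keeps supervise/ok open for reading while it supervises a service, so that a non-blocking write-open fails with ENXIO until runsv has attached.

```ruby
# Sketch: classify whether runsv is attached to a service's supervise/ok FIFO.
# Assumption: runsv holds supervise/ok open for reading while supervising.
# runsv_state is a hypothetical helper name, not part of Omnibus or runit.
def runsv_state(service_dir)
  ok_fifo = File.join(service_dir, 'supervise', 'ok')
  # A non-blocking write-open of a FIFO fails with ENXIO when no reader exists.
  File.open(ok_fifo, File::WRONLY | File::NONBLOCK) { :listening }
rescue Errno::ENXIO
  :runsv_not_running # FIFO exists, but nothing is reading it yet
rescue Errno::ENOENT
  :not_created       # supervise/ok has not been created yet
end
```

Calling runsv_state('/opt/gitlab/service/geo-logcursor') repeatedly during reconfigure should, if the analysis is right, show :not_created or :runsv_not_running for a short time before switching to :listening.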
Suggested fix
- Add a wait mechanism inside runit_service to detect when runsv has attached to supervise/ok.
- For example, poll for lsof /opt/gitlab/service/<svc>/supervise/ok being opened by runsv before executing sv.
- Alternatively, introduce retries for sv commands when the specific error runsv not running is returned.
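The retry alternative could look roughly like the sketch below. It is an illustration only: sv_with_retry and its attempts, delay, and runner parameters are hypothetical names, and it assumes sv exits non-zero and prints the message quoted in this report when runsv is not yet attached. Any other failure is returned immediately without retrying.

```ruby
require 'open3'

RUNSV_NOT_RUNNING = 'runsv not running'

# Hypothetical helper: run `sv <action> <service>` and retry only when the
# output contains the specific "runsv not running" error.
# `runner` is injectable so the logic can be exercised without runit.
def sv_with_retry(action, service, attempts: 10, delay: 1, runner: nil)
  runner ||= ->(cmd) { Open3.capture2e(*cmd) }
  cmd = ['/opt/gitlab/embedded/bin/sv', action, service]
  output = status = nil

  attempts.times do |i|
    output, status = runner.call(cmd)
    return [output, status] if status.success?
    # Any other failure is a real error and should surface immediately.
    break unless output.include?(RUNSV_NOT_RUNNING)

    sleep(delay) if i < attempts - 1
  end

  [output, status]
end
```

Limiting the retry to this one message keeps genuine failures visible while absorbing the short window before runsvdir spawns runsv.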
Workaround
- Wrap /opt/gitlab/embedded/bin/sv to delay execution until runsv is detected.
- Or comment out the runit_service block (viable in custom Omnibus builds only).
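A wrapper along the lines of the first workaround might be built around a bounded wait like the one sketched here. This is an assumption-laden illustration: wait_for_runsv is a hypothetical name, the sv.real path in the usage comment is invented for the example, and detection uses a non-blocking open of supervise/ok in place of lsof, on the assumption that runsv holds that FIFO open for reading.

```ruby
# Hypothetical helper: block until something (assumed to be runsv) is reading
# <service_dir>/supervise/ok, or until `timeout` seconds have passed.
# Returns true when a reader was detected, false on timeout.
def wait_for_runsv(service_dir, timeout: 30, interval: 0.25)
  ok_fifo = File.join(service_dir, 'supervise', 'ok')
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout

  loop do
    begin
      # Succeeds only when a reader is attached to the FIFO.
      File.open(ok_fifo, File::WRONLY | File::NONBLOCK) {}
      return true
    rescue Errno::ENXIO, Errno::ENOENT
      # No reader yet, or supervise/ok not created yet: keep polling.
    end
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep(interval)
  end
end

# Usage inside a hypothetical wrapper installed in place of sv
# (sv.real is an invented name for the renamed original binary, and taking
# the last argument as the service name is a simplification):
#   wait_for_runsv(File.join('/opt/gitlab/service', ARGV.last))
#   exec('/opt/gitlab/embedded/bin/sv.real', *ARGV)
```

The timeout keeps the wrapper from hanging when a service has been removed or runsvdir itself is not running; in that case the real sv still runs and reports its usual error.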
Impact
- Causes transient but misleading startup failures.
- Breaks container startup idempotency.
- Introduces fragility into CI/CD automation.
Please consider
Please consider making runit_service more resilient to runsv startup timing. This is especially relevant in Docker-based or ephemeral deployments where gitlab-ctl reconfigure is executed automatically as part of the container boot process.