GitLab restore can fail owing to gitlab-healthcheck - Omnibus container deployment using docker swarm
Summary
A customer reported issues restoring GitLab, with the process failing part way through. GitLab team members can read more in the ticket. We spent some time trying to work out why the Gitaly element of the restore appeared to be failing.
It was determined that Docker Swarm was was killing the container, owing to the health check mechanism flagging it as unhealthy. Healthcheck had to be disabled through the docker-compose file.
healthcheck:
test: ["NONE"]
To disable any default healthcheck set by the image, you can use disable: true. This is equivalent to specifying test: ["NONE"].
healthcheck: disable: true
Potential fixes
- The health check could detect that a restore is in progress.
- Documented procedure for disabling the health check during restores, eg:
- Via docker compose
- Maybe some way via
reconfigure
to disable deployment of the health check, as it currently appears to be automatic if nginx or workhorse are deployed.
Steps to reproduce
- Follow the restore process, including stopping Sidekiq and Puma.
- Ensure the restore will exceed the thresholds defined by gitlab-healthcheck
What is the current bug behavior?
It's expected that GitLab will not respond during a restore, because Rails is shut down.
The parts of the puzzle seem to be
- health check is enabled if nginx or workhorse are deployed
- healthcheck runs a curl
- the URL is the
/help
page
As the health check is failing, the container will show as not healthy.
What is the expected correct behavior?
A procedure exists for reliably restoring GitLab when deploying using Docker Swarm
Relevant logs
Relevant logs
Details of package version
Ticket raised against 13.11.4, but code links above are from %15.0