GitLab restore can fail owing to gitlab-healthcheck - Omnibus container deployment using docker swarm

Summary

A customer reported issues restoring GitLab, with the process failing part way through. GitLab team members can read more in the ticket. We spent some time trying to work out why the Gitaly element of the restore appeared to be failing.

It was determined that Docker Swarm was was killing the container, owing to the health check mechanism flagging it as unhealthy. Healthcheck had to be disabled through the docker-compose file.

healthcheck:
test: ["NONE"]

Docker compose docs ref

To disable any default healthcheck set by the image, you can use disable: true. This is equivalent to specifying test: ["NONE"].

healthcheck:
 disable: true

Potential fixes

  • The health check could detect that a restore is in progress.
  • Documented procedure for disabling the health check during restores, eg:
    • Via docker compose
    • Maybe some way via reconfigure to disable deployment of the health check, as it currently appears to be automatic if nginx or workhorse are deployed.

Steps to reproduce

  • Follow the restore process, including stopping Sidekiq and Puma.
  • Ensure the restore will exceed the thresholds defined by gitlab-healthcheck

What is the current bug behavior?

It's expected that GitLab will not respond during a restore, because Rails is shut down.

The parts of the puzzle seem to be

As the health check is failing, the container will show as not healthy.

What is the expected correct behavior?

A procedure exists for reliably restoring GitLab when deploying using Docker Swarm

Relevant logs

Relevant logs

Details of package version

Ticket raised against 13.11.4, but code links above are from %15.0

Edited by Ben Prescott (ex-GitLab)