Zero Downtime Upgrades are unsafe for the Gitaly Service in Dedicated
Problem Statement
During a Gitaly version upgrade on the Dedicated product, specifically when the upgrade changes the version of Git,
we suffer approximately 15 minutes of downtime. The exact duration depends on how long the GitLab Environment Toolkit (GET) takes to run through its playbooks.
- Gitaly is running
- Gitaly is upgraded but not restarted. This is currently expected behavior by the design of GET, and it introduces an outage scenario.
- Gitaly is subject to `gitlab-ctl reconfigure`. This restarts the Gitaly process, which resolves the outage scenario.
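The sequence above can be sketched as a small state model. This is only an illustration of the failure window, not how GET or Gitaly is implemented: the `GitalyNode` type, the helper functions, and the Git version strings are all hypothetical, and the model assumes the outage lasts exactly as long as the running Gitaly process and the installed Git version disagree.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GitalyNode:
    """Hypothetical model of one Gitaly node during an upgrade."""
    running_git_version: str    # Git version the live Gitaly process started with
    installed_git_version: str  # Git version currently installed on disk


def upgrade_package(node: GitalyNode, new_git_version: str) -> GitalyNode:
    # Step 2: the package is upgraded, but the Gitaly process is NOT restarted.
    return GitalyNode(node.running_git_version, new_git_version)


def reconfigure(node: GitalyNode) -> GitalyNode:
    # Step 3: per the description above, reconfigure restarts Gitaly,
    # so the running process picks up the installed Git version.
    return GitalyNode(node.installed_git_version, node.installed_git_version)


def in_outage_window(node: GitalyNode) -> bool:
    # Assumption: the node is unhealthy while the two versions differ.
    return node.running_git_version != node.installed_git_version


# Version numbers are placeholders chosen for illustration only.
node = GitalyNode("2.41.0", "2.41.0")            # Gitaly is running
after_upgrade = upgrade_package(node, "2.42.0")  # upgraded, not restarted
after_reconfigure = reconfigure(after_upgrade)   # restarted by reconfigure

for label, state in [("running", node),
                     ("upgraded", after_upgrade),
                     ("reconfigured", after_reconfigure)]:
    print(label, in_outage_window(state))
```

In this model the downtime is the time spent between `upgrade_package` and `reconfigure`, which is why the length of the GET playbook run determines the length of the outage.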
Investigative work completed in issue: https://gitlab.com/gitlab-com/gl-infra/gitlab-dedicated/team/-/issues/4450
Solutions
- Investigate changes to GET to reduce the potential for downtime. Note that any change will need to be safe for customers of GET beyond the Dedicated Team.
- Upgrade Instrumentor as appropriate
- Test/Validate