Delay time between gprd-cny, gprd-alpha, and gprd-beta
Overview
When we roll out a configuration change in production, it goes through the following stages: gprd-cny -> gprd && gprd-us-east1-b (alpha) -> gprd-us-east1-c && gprd-us-east1-b (beta), as we see below.
Each stage starts immediately after the previous stage succeeds, so if gprd-cny starts causing problems, we may not notice in time and the pipeline starts rolling out the change to gprd, as we saw in production#6736 (closed).
Proposal
Use delayed jobs to run gprd-us-east1-b and gprd X minutes after gprd-cny has been successfully rolled out, and add another delay for the beta stage. With delayed jobs we can still start the job immediately by pressing the play button:
"To stop the active timer of a delayed job, select Unschedule. This job can no longer be scheduled to run automatically. You can, however, execute the job manually. To start a delayed job immediately, select Play. Soon GitLab Runner starts the job."
https://docs.gitlab.com/ee/ci/jobs/job_control.html#run-a-job-after-a-delay
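As a rough illustration of the proposal, the delays could look something like the sketch below. The stage names, job names, `./deploy.sh` script, and the 30-minute value (standing in for "X minutes") are all assumptions made for illustration; only `when: delayed` and `start_in` are taken from the GitLab CI documentation linked above.

```yaml
# Minimal sketch, not the real deployment pipeline.
# Stage names, job names, and ./deploy.sh are hypothetical.
stages:
  - cny
  - alpha
  - beta

gprd-cny:
  stage: cny
  script:
    - ./deploy.sh gprd-cny

# Alpha jobs: the timer starts once the cny stage has succeeded.
gprd:
  stage: alpha
  when: delayed
  start_in: 30 minutes   # placeholder for "X minutes"
  script:
    - ./deploy.sh gprd

gprd-us-east1-b:
  stage: alpha
  when: delayed
  start_in: 30 minutes   # placeholder for "X minutes"
  script:
    - ./deploy.sh gprd-us-east1-b

# Beta job: a second delay after the alpha stage has succeeded.
gprd-us-east1-c:
  stage: beta
  when: delayed
  start_in: 30 minutes   # placeholder; the beta delay could differ
  script:
    - ./deploy.sh gprd-us-east1-c
```

While a timer is counting down, the job can be unscheduled to stop the rollout or played to skip the wait, which is what would give SREs the window to react.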
Pros:
- Changes are rolled out more slowly, giving SREs time to react and stop a rollout
Cons:
- Slower pipelines, which might hurt deployments (maybe we can do this only for config changes?)
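If the delay should apply only to config changes, one possible approach is to attach it through `rules`, so the job is delayed when certain files change and otherwise runs as usual. This is only a sketch: the `config/**/*` path, the job name, and `./deploy.sh` are hypothetical, and how a "config change" would actually be detected in this pipeline is an open question.

```yaml
# Minimal sketch: delay the job only when (hypothetical) config paths change.
gprd:
  stage: alpha
  script:
    - ./deploy.sh gprd          # hypothetical deploy script
  rules:
    - changes:
        - config/**/*           # hypothetical path for config changes
      when: delayed
      start_in: 30 minutes      # placeholder for "X minutes"
    - when: on_success          # all other changes run without a delay
```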
Note: We've done something similar in GitLab-helmfiles in gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!549 (merged).