# 2025-06-26: Renew CA certificate for Consul in gstg

## Production Change

### Change Summary

The Consul CA certificate in `gstg` expires on 2025-08-06 and needs to be renewed:

```
$ vault kv get -field certificate k8s/env/gstg/ns/consul/tls | openssl x509 -noout -dates
notBefore=Aug  7 01:02:37 2020 GMT
notAfter=Aug  6 01:02:37 2025 GMT
```
The client certificate used on VMs also expires on 2025-08-03 and needs to be renewed:

```
pguinoiseau@console-01-sv-gstg.c.gitlab-staging-1.internal:~$ openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates
notBefore=Aug  4 20:53:22 2022 GMT
notAfter=Aug  3 20:53:22 2025 GMT
```
**Note:** The server certificate and the Kubernetes client certificate are not a concern, as they are generated automatically from the CA: the former on each Helm deployment, the latter on pod init.
Given that the CA is self-signed, we can extend its expiration date by generating a new certificate from the current one while reusing the same private key, which causes no cluster disruption during rollout.

We still need to be careful during the rollout: a restart of the Consul clients in Kubernetes can cause an outage if the Rails app fails to resolve the current database leader instances via DNS, and a restart of the Consul clients on the database instances can trigger a failover if Patroni is not paused.
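Because the private key is unchanged, agents that trust the old CA keep trusting the renewed one. A minimal sanity check (a sketch; the file names `tls.crt` and `tls-renewed.crt` are illustrative, not from the change plan): the public key digests of the two certificates must be identical.

```shell
# Sketch: confirm the renewed CA carries the same public key as the current one.
# tls.crt = current CA, tls-renewed.crt = renewed CA (hypothetical file names).
openssl x509 -in tls.crt -noout -pubkey | openssl sha256
openssl x509 -in tls-renewed.crt -noout -pubkey | openssl sha256
# Both digests must match; a mismatch means a new key was generated by mistake.
```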
Upon successful execution of this change we will proceed to also rotate the CA certificate in `gprd` using the same procedure.
Issue: production-engineering#25974 (closed)
### Change Details

- **Services Impacted** - ~"Service::Consul" ~"Service::Patroni" ~"Service::GitLab Rails"
- **Change Technician** - @pguinoiseau
- **Change Reviewer** - @jcstephenson @bshah11
- **Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM)** - 2025-06-26 03:00
- **Time tracking** - 120 minutes
- **Downtime Component** - none, but Patroni will be paused for a few minutes
### Set Maintenance Mode in GitLab

If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
### Detailed steps for the change

#### Pre-execution steps

- [ ] Make sure all tasks in the Change Technician checklist are done.
- [ ] For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] The SRE on-call provided approval with the ~eoc_approved label on the issue.
- [ ] For C1, C2, or ~"blocks deployments" change issues, Release Managers have been informed prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents that are ~severity1 or ~severity2.
- [ ] If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
- [ ] Generate the new CA certificate with a 5-year expiry using the current private key, and store it in Vault:

  ```
  vault kv get -field certificate k8s/env/gstg/ns/consul/tls > tls.crt
  vault kv get -field key k8s/env/gstg/ns/consul/tls > tls.key
  openssl x509 -in tls.crt -signkey tls.key -days 1825 | vault kv patch k8s/env/gstg/ns/consul/tls certificate=-
  ```
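  As a quick read-back (a sketch mirroring the expiry check from the summary), the updated secret should now report a `notAfter` roughly five years out:

  ```shell
  # Sketch: confirm the renewed certificate stored in Vault has the new expiry.
  vault kv get -field certificate k8s/env/gstg/ns/consul/tls | openssl x509 -noout -dates
  ```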
- [ ] Create the secret with the new CA in Kubernetes: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8157 (merged)
- [ ] Set up the DBRE toolkit locally:

  ```
  test -d db-migration || git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git
  cd db-migration
  git checkout master
  git pull
  python3 -m venv ansible
  source ansible/bin/activate
  python3 -m pip install --upgrade pip
  python3 -m pip install ansible
  ansible --version
  ```
- [ ] Check Ansible SSH access to the Patroni VMs:

  ```
  cd dbre-toolkit
  ansible -i inventory/gstg-ci.yml -m ping all
  ansible -i inventory/gstg-main.yml -m ping all
  ansible -i inventory/gstg-registry.yml -m ping all
  ansible -i inventory/gstg-sec.yml -m ping all
  ```
#### Change steps - steps to take to execute the change

Estimated Time to Complete (mins) - 60 minutes
- [ ] Set label ~"change::in-progress": `/label ~change::in-progress`
- [ ] Pause all Patroni clusters:

  ```
  ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl pause
  ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl pause
  ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl pause
  ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl pause
  ```
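  Each paused cluster should now report maintenance mode; a quick spot check (a sketch; the exact banner wording can vary between Patroni versions):

  ```shell
  # Sketch: a paused cluster prints a maintenance mode notice in its status output.
  ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
  # Expect a "Maintenance mode: on" banner alongside the member table.
  ```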
- [ ] Update the cluster CA in Kubernetes:
  - [ ] Merge gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8110 (merged)

    Helm will generate a new server certificate using the new CA certificate, and the servers will restart automatically, but the clients won't. New clients (for new nodes) will use the new CA.
  - [ ] Verify that the cluster and clients are up and healthy:

    ```
    kubectl config use-context gke_gitlab-staging-1_us-east1_gstg-gitlab-gke
    kubectl --namespace consul get pods -l component=server
    kubectl --namespace consul get pods -l component=client
    kubectl --namespace consul exec consul-gl-consul-server-0 --container consul -- consul operator raft list-peers
    kubectl --namespace consul exec consul-gl-consul-server-0 --container consul -- consul members
    ```
  - [ ] Restart a single GKE node pool to rotate the Consul clients (and everything else) without causing outages, and verify that the Consul clients are up and healthy:

    ```
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-3 --async
    kubectl get nodes --selector cloud.google.com/gke-nodepool=generic-3 --output name | cut -d/ -f2 | xargs -n1 -I% kubectl --namespace consul get pods --selector component=client --field-selector spec.nodeName=%
    ```
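    One way to watch the rotation converge (a sketch, not part of the formal plan) is to poll until no client pod is left in a non-Running state:

    ```shell
    # Sketch: count consul client pods not yet Running; should reach 0 as the pool rolls.
    watch -n 30 "kubectl --namespace consul get pods -l component=client --no-headers | grep -cv Running"
    ```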
  - [ ] Restart all GKE node pools to rotate the Consul clients (and everything else) without causing outages:

    ```
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-2 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-3 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool redis-pubsub-0 --async
    # gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-3 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-4 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-mem-2 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-shared-storage-ssd --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool redis-registry-cache-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-spot-1 --async
    ```

    This can continue in the background while proceeding to the next steps.
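    To keep an eye on those background upgrades, one option (a sketch; the filter fields are assumed from the GKE operations API) is:

    ```shell
    # Sketch: list node-pool upgrade operations still running in the project.
    gcloud container operations list --project gitlab-staging-1 \
      --filter 'operationType=UPGRADE_NODES AND status=RUNNING'
    ```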
- [ ] Update the Consul client certificate on VMs:
  - [ ] Disable `chef-client` on all Consul-enabled VMs:

    ```
    knife ssh 'chef_environment:gstg AND recipes:gitlab_consul\:\:agent' 'sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19980'
    ```
  - [ ] Generate a new client certificate for VMs valid for 5 years, and store it in Vault:

    ```
    vault kv get -field certificate k8s/env/gstg/ns/consul/tls > tls.crt
    vault kv get -field key k8s/env/gstg/ns/consul/tls > tls.key
    consul tls cert create -client -ca tls.crt -key tls.key -days 1825 -dc east-us-2
    vault kv patch chef/env/gstg/cookbook/gitlab-consul/client ca_certificate=@tls.crt certificate=@east-us-2-client-consul-0.pem private_key=@east-us-2-client-consul-0-key.pem
    ```
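    The freshly generated client certificate can also be sanity-checked against the renewed CA (a sketch using the file names produced by the step above):

    ```shell
    # Sketch: the new client certificate must chain to the renewed CA
    # and carry the new 5-year expiry.
    openssl verify -CAfile tls.crt east-us-2-client-consul-0.pem
    openssl x509 -in east-us-2-client-consul-0.pem -noout -dates
    ```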
  - [ ] SSH into a VM and test a Consul client update and restart:

    ```
    $ ssh blackbox-01-inf-gstg.c.gitlab-staging-1.internal
    pguinoiseau@blackbox-01-inf-gstg.c.gitlab-staging-1.internal:~$ sudo chef-client-enable
    pguinoiseau@blackbox-01-inf-gstg.c.gitlab-staging-1.internal:~$ sudo chef-client
    pguinoiseau@blackbox-01-inf-gstg.c.gitlab-staging-1.internal:~$ sudo openssl x509 -in /etc/consul/ssl/certs/chain.crt -noout -dates
    pguinoiseau@blackbox-01-inf-gstg.c.gitlab-staging-1.internal:~$ sudo openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates
    pguinoiseau@blackbox-01-inf-gstg.c.gitlab-staging-1.internal:~$ sudo systemctl restart consul.service
    ```
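    After the restart, it is worth confirming the agent came back cleanly; a sketch of one way to check from the same host:

    ```shell
    # Sketch: the unit should be active, and recent logs should be free of TLS errors.
    sudo systemctl is-active consul.service
    sudo journalctl -u consul.service --since '5 minutes ago' | grep -iE 'error|tls' || echo 'no TLS errors logged'
    ```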
  - [ ] Update and restart the Consul client on the `registry` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-registry-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-registry.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-registry_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-registry-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `sec` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-sec-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-sec.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-sec_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-sec-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `ci` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-ci-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-ci.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-ci_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-ci-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `main` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-main-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-main.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-main_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-main-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on all remaining Consul-enabled VMs:

    ```
    knife ssh 'chef_environment:gstg AND recipes:gitlab_consul\:\:agent AND NOT recipes:gitlab-patroni\:\:consul' 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
    ```
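    Once the fleet has rolled, a final sweep for stragglers (a sketch reusing the server pod from the earlier verification steps) can confirm every member is alive:

    ```shell
    # Sketch: print any Consul member whose status is not "alive" (empty output is good).
    kubectl --namespace consul exec consul-gl-consul-server-0 --container consul -- consul members | awk 'NR > 1 && $3 != "alive"'
    ```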
- [ ] Party!
- [ ] Set label ~"change::complete": `/label ~change::complete`
### Rollback

#### Rollback steps - steps to be taken in the event of a need to roll back this change

Estimated Time to Complete (mins) - 60 minutes
- [ ] Revert the cluster CA in Kubernetes:
  - [ ] Revert gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8110 (merged)
  - [ ] Verify that the cluster and clients are up and healthy:

    ```
    kubectl config use-context gke_gitlab-staging-1_us-east1_gstg-gitlab-gke
    kubectl --namespace consul get pods -l component=server
    kubectl --namespace consul get pods -l component=client
    kubectl --namespace consul exec consul-gl-consul-server-0 --container consul -- consul operator raft list-peers
    kubectl --namespace consul exec consul-gl-consul-server-0 --container consul -- consul members
    ```
  - [ ] Restart all GKE node pools to rotate the Consul clients (and everything else) without causing outages:

    ```
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-2 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-3 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool redis-pubsub-0 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-3 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-4 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-mem-2 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-shared-storage-ssd --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool redis-registry-cache-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-spot-1 --async
    ```

    This can continue in the background while proceeding to the next steps.
- [ ] Revert the Consul client certificate on VMs:
  - [ ] Create a new version of the `chef/env/gstg/cookbook/gitlab-consul/client` secret in Vault from the n-1 version, as sketched below.
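    A sketch of one way to do that, assuming the secret lives in a KV v2 mount (`<n-1>` stands for the actual previous version number, to be looked up first):

    ```shell
    # Sketch: read the previous version of the secret and write it back as a new version.
    vault kv get -version=<n-1> -format=json chef/env/gstg/cookbook/gitlab-consul/client \
      | jq '.data.data' > previous.json
    vault kv put chef/env/gstg/cookbook/gitlab-consul/client @previous.json
    ```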
  - [ ] Update and restart the Consul client on the `registry` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-registry-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-registry.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-registry_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-registry-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `sec` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-sec-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-sec.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-sec_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-sec-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `ci` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-ci-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-ci.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-ci_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-ci-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `main` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-main-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-main.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-main_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-main-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on all remaining Consul-enabled VMs:

    ```
    knife ssh 'chef_environment:gstg AND recipes:gitlab_consul\:\:agent AND NOT recipes:gitlab-patroni\:\:consul' 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
    ```
- [ ] Set label ~"change::aborted": `/label ~change::aborted`
### Monitoring

#### Key metrics to observe

- Metric: Consul Raft status
  - Location: https://dashboards.gitlab.net/d/consul-main/consul3a-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gstg&var-environment=gstg&var-stage=main
  - What changes to this metric should prompt a rollback: unhealthy status and zero failure tolerance
- Metric: Everything
  - Location: https://dashboards.gitlab.net/d/general-triage/general3a-platform-triage?orgId=1&from=now-6h%2Fm&to=now%2Fm&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gstg&var-environment=gstg&var-stage=main
  - What changes to this metric should prompt a rollback: Apdex drop or increased error rate of any service
### Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
### Change Technician checklist

- [ ] The change plan is technically accurate.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] The change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] The change execution window respects the Production Change Lock periods.
- [ ] For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- [ ] For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue. Mention @gitlab-org/saas-platforms/inframanagers in this issue to request approval and provide visibility to all infrastructure managers.
- [ ] For C1, C2, or ~"blocks deployments" change issues, confirm with Release Managers that the change does not overlap or hinder any release process. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)