# 2025-07-10: Renew CA certificate for Consul in gprd

Production Change

## Change Summary
The Consul CA certificate in gprd is expiring on 2025-08-13 and needs to be renewed:

```shell
$ vault kv get -field certificate k8s/env/gprd/ns/consul/tls | openssl x509 -noout -dates
notBefore=Aug 14 00:08:18 2020 GMT
notAfter=Aug 13 00:08:18 2025 GMT
```
The client certificate used on VMs also expires on 2025-08-08 and needs to be renewed:

```shell
pguinoiseau@console-01-sv-gprd.c.gitlab-production.internal:~$ openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates
notBefore=Aug 9 22:41:47 2022 GMT
notAfter=Aug 8 22:41:47 2025 GMT
```
**Note:** The server certificate and the Kubernetes client certificate are not a concern, however, as both are generated automatically from the CA: the former on each Helm deployment, the latter on pod init.
Given that the CA is self-signed, we can extend its expiration date by generating a new certificate from the current one and reusing the same private key, which doesn't cause any cluster disruption during rollout.
We still need to be careful during the rollout: a restart of the Consul clients in Kubernetes can cause an outage if the Rails app fails to resolve the current database leader instances via DNS, and a restart of the Consul clients on the database instances can trigger a failover if Patroni is not paused.
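As a sanity check before and after the renewal, the following sketch (not part of the original plan; it reuses the Vault paths referenced later in this issue) confirms that the certificate and private key in Vault still belong to the same key pair:

```shell
# Compare the public key embedded in the CA certificate with the one derived
# from the stored private key; the two digests should be identical.
vault kv get -field certificate k8s/env/gprd/ns/consul/tls | openssl x509 -noout -pubkey | sha256sum
vault kv get -field key k8s/env/gprd/ns/consul/tls | openssl pkey -pubout | sha256sum
```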
This change was successfully executed in gstg: #19980 (closed)

Issue: production-engineering#25974 (closed)
## Change Details

- **Services Impacted** - Service::Consul, Service::Patroni, Service::GitLab Rails
- **Change Technician** - @pguinoiseau
- **Change Reviewer** - @jcstephenson @bshah11
- **Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM)** - 2025-07-10 01:00
- **Time tracking** - 6 hours
- **Downtime Component** - none, but the Patroni clusters will be paused
## Set Maintenance Mode in GitLab

If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
## Detailed steps for the change

### Pre-execution steps
- Make sure all tasks in the Change Technician checklist are done
- For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- The SRE on-call provided approval with the eoc_approved label on the issue.
- For C1, C2, or blocks deployments change issues, Release managers have been informed prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity1 or severity2
- A DBRE is available during execution of the change
- Generate the new CA certificate with 5 years expiry using the current private key, and store it in Vault:

  ```shell
  vault kv get -field certificate k8s/env/gprd/ns/consul/tls > tls.crt
  vault kv get -field key k8s/env/gprd/ns/consul/tls > tls.key
  openssl x509 -in tls.crt -signkey tls.key -days 1825 | vault kv patch k8s/env/gprd/ns/consul/tls certificate=-
  ```
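  Afterwards, the new expiry can be spot-checked with the same command used in the change summary (an optional verification, not part of the original plan):

  ```shell
  # notAfter should now be roughly 5 years out
  vault kv get -field certificate k8s/env/gprd/ns/consul/tls | openssl x509 -noout -dates
  ```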
- Create the secret with the new CA in Kubernetes: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8237 (merged)
- Setup the DBRE toolkit locally:

  ```shell
  test -d db-migration || git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git
  cd db-migration
  git checkout master
  git pull
  python3 -m venv ansible
  source ansible/bin/activate
  python3 -m pip install --upgrade pip
  python3 -m pip install ansible
  ansible --version
  ```
- Check Ansible SSH access to the Patroni VMs:

  ```shell
  cd dbre-toolkit
  ansible -i inventory/gprd-ci.yml -m ping all
  ansible -i inventory/gprd-main.yml -m ping all
  ansible -i inventory/gprd-registry.yml -m ping all
  ansible -i inventory/gprd-sec.yml -m ping all
  ```
### Change steps - steps to take to execute the change

Estimated Time to Complete (mins) - 6 hours
- Set label ~change::in-progress: `/label ~change::in-progress`
- Pause all Patroni clusters:

  ```shell
  ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
  ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
  ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
  ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
  ```
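  Optionally, confirm that each cluster actually reports maintenance mode before proceeding (a sanity check that is not part of the original plan; `gitlab-patronictl list` should print a maintenance-mode banner while the cluster is paused):

  ```shell
  # Each cluster should show "Maintenance mode: on" in the list output
  ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
  ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
  ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
  ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
  ```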
- Update the cluster CA in Kubernetes:
  - Merge gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8238 (merged)

    **Note:** Helm will generate a new server certificate using the new CA certificate, and the servers will restart automatically, but the clients won't. New clients (for new nodes) will use the new CA.
  - Rollout the server update to the full statefulset:

    ```shell
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=server
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=server
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":4}}}}'
    ```
  - Verify that the cluster and clients are up and healthy:

    ```shell
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=server
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=client
    kubectl --context gke_gitlab-production_us-east1_gitlab-3okls --namespace consul get pods -l component=client
    kubectl --context gke_gitlab-production_us-east1-b_gprd-us-east1-b --namespace consul get pods -l component=client
    kubectl --context gke_gitlab-production_us-east1-c_gprd-us-east1-c --namespace consul get pods -l component=client
    kubectl --context gke_gitlab-production_us-east1-d_gprd-us-east1-d --namespace consul get pods -l component=client
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul exec consul-gl-consul-server-0 --container consul -- consul operator raft list-peers
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul exec consul-gl-consul-server-0 --container consul -- consul members
    ```
  - Restart a single GKE node pool to rotate the Consul clients (and everything else) without causing outages, and verify that the Consul clients are up and healthy:

    ```shell
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke cordon -l pool=generic-3
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=generic-3
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke get nodes --selector cloud.google.com/gke-nodepool=generic-3 --output name | cut -d/ -f2 | xargs -n1 -I% kubectl --namespace consul get pods --selector component=client --field-selector spec.nodeName=%
    ```
  - Restart all GKE node pools to rotate the Consul clients (and everything else) without causing outages:

    ```fish
    # (fish shell syntax)
    set project gitlab-production; set location us-east1; set cluster gprd-gitlab-gke; set context gke_{$project}_{$location}_{$cluster}
    for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
        kubectl --context $context cordon -l pool=$node_pool
        kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
    end
    ```

    ```fish
    # (fish shell syntax)
    set project gitlab-production; set location us-east1; set cluster gitlab-3okls; set context gke_{$project}_{$location}_{$cluster}
    for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
        kubectl --context $context cordon -l pool=$node_pool
        kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
    end
    ```

    ```fish
    # (fish shell syntax)
    set project gitlab-production; set location us-east1-b; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
    for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
        kubectl --context $context cordon -l pool=$node_pool
        kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
    end
    ```

    ```fish
    # (fish shell syntax)
    set project gitlab-production; set location us-east1-c; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
    for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
        kubectl --context $context cordon -l pool=$node_pool
        kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
    end
    ```

    ```fish
    # (fish shell syntax)
    set project gitlab-production; set location us-east1-d; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
    for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
        kubectl --context $context cordon -l pool=$node_pool
        kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
    end
    ```

    **Note:** This can continue in the background while proceeding to the next steps.
- Update the Consul client certificate on VMs:
  - Disable `chef-client` on all Consul-enabled VMs:

    ```shell
    knife ssh 'chef_environment:gprd AND recipes:gitlab_consul\:\:agent' 'sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20084'
    ```
  - Generate a new client certificate for VMs valid for 5 years, and store it in Vault:

    ```shell
    vault kv get -field certificate k8s/env/gprd/ns/consul/tls > tls.crt
    vault kv get -field key k8s/env/gprd/ns/consul/tls > tls.key
    consul tls cert create -client -ca tls.crt -key tls.key -days 1825 -dc east-us-2
    vault kv patch chef/env/gprd/cookbook/gitlab-consul/client ca_certificate=@tls.crt certificate=@east-us-2-client-consul-0.pem private_key=@east-us-2-client-consul-0-key.pem
    ```
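    Optionally, before patching Vault, check the freshly generated client certificate against the renewed CA (a sanity check that is not part of the original plan):

    ```shell
    # The new client certificate should be valid for ~5 years and verify against the renewed CA
    openssl x509 -in east-us-2-client-consul-0.pem -noout -dates -issuer
    openssl verify -CAfile tls.crt east-us-2-client-consul-0.pem
    ```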
  - SSH into a VM and test a Consul client update and restart:

    ```shell
    $ ssh blackbox-01-inf-gprd.c.gitlab-production.internal
    pguinoiseau@blackbox-01-inf-gprd.c.gitlab-production.internal:~$ sudo chef-client-enable
    pguinoiseau@blackbox-01-inf-gprd.c.gitlab-production.internal:~$ sudo chef-client
    pguinoiseau@blackbox-01-inf-gprd.c.gitlab-production.internal:~$ sudo openssl x509 -in /etc/consul/ssl/certs/chain.crt -noout -dates
    pguinoiseau@blackbox-01-inf-gprd.c.gitlab-production.internal:~$ sudo openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates
    pguinoiseau@blackbox-01-inf-gprd.c.gitlab-production.internal:~$ sudo systemctl restart consul.service
    ```
  - Update and restart the Consul client on the `registry` Patroni cluster:
    - Make sure the cluster is still paused, and identify the replica nodes:

      ```shell
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Update Consul on each replica node and verify the cluster status after each one:

      ⚠️ Skip the `knife node run_list` commands for the existing maintenance replica node

      ```shell
      knife node run_list add patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      # Note: Skip this line if it is the maintenance replica
      knife node run_list remove patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Unpause the cluster and trigger a zero-downtime failover:

      ```shell
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gprd-registry.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-registry_$(date +%Y%m%d).log
      ```

    - Update Consul on the ex-leader/now-replica node and verify the cluster status:

      ```shell
      knife node run_list add patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```
  - Update and restart the Consul client on the `sec` Patroni cluster:
    - Make sure the cluster is still paused, and identify the replica nodes:

      ```shell
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Update Consul on each replica node and verify the cluster status after each one:

      ⚠️ Skip the `knife node run_list` commands for the existing maintenance replica node

      ```shell
      knife node run_list add patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Unpause the cluster and trigger a zero-downtime failover:

      ```shell
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gprd-sec.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-sec_$(date +%Y%m%d).log
      ```

    - Update Consul on the ex-leader/now-replica node and verify the cluster status:

      ```shell
      knife node run_list add patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```
  - Update and restart the Consul client on the `ci` Patroni cluster:
    - Make sure the cluster is still paused, and identify the replica nodes:

      ```shell
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Update Consul on each replica node and verify the cluster status after each one:

      ⚠️ Skip the `knife node run_list` commands for the existing maintenance replica node

      ```shell
      knife node run_list add patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Unpause the cluster and trigger a zero-downtime failover:

      ```shell
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gprd-ci.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-ci_$(date +%Y%m%d).log
      ```

    - Update Consul on the ex-leader/now-replica node and verify the cluster status:

      ```shell
      knife node run_list add patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```
  - Update and restart the Consul client on the `main` Patroni cluster:
    - Make sure the cluster is still paused, and identify the replica nodes:

      ```shell
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Update Consul on each replica node and verify the cluster status after each one:

      ⚠️ Skip the `knife node run_list` commands for the existing maintenance replica node

      ```shell
      knife node run_list add patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Unpause the cluster and trigger a zero-downtime failover:

      ```shell
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gprd-main.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-main_$(date +%Y%m%d).log
      ```

    - Update Consul on the ex-leader/now-replica node and verify the cluster status:

      ```shell
      knife node run_list add patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```
  - Update and restart the Consul client on all remaining Consul-enabled VMs:

    ```shell
    knife ssh -C 20 'chef_environment:gprd AND recipes:gitlab_consul\:\:agent AND NOT recipes:gitlab-patroni\:\:consul' 'sudo chef-client-enable && sudo chef-client >/dev/null && sudo systemctl restart consul.service'
    ```
  - Verify that the CA has been updated and the client has been restarted on all VMs:

    ```shell
    knife ssh -C 20 'chef_environment:gprd AND recipes:gitlab_consul\:\:agent' 'openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates | grep notAfter | tr "\n" ";"; systemctl status consul.service | grep Active:' | sort
    ```
- Party!
- Set label ~change::complete: `/label ~change::complete`
## Rollback

**Note:** A rollback is highly unlikely to be necessary past a successful deployment of the new CA to the Consul servers, Kubernetes clients and first VM.
### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - up to 6 hours
- Revert the cluster CA in Kubernetes:
  - Revert gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8238 (merged)
  - Rollout the server update to the full statefulset:

    ```shell
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=server
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=server
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":4}}}}'
    ```
  - Verify that the cluster and clients are up and healthy:

    ```shell
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=server
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=client
    kubectl --context gke_gitlab-production_us-east1_gitlab-3okls --namespace consul get pods -l component=client
    kubectl --context gke_gitlab-production_us-east1-b_gprd-us-east1-b --namespace consul get pods -l component=client
    kubectl --context gke_gitlab-production_us-east1-c_gprd-us-east1-c --namespace consul get pods -l component=client
    kubectl --context gke_gitlab-production_us-east1-d_gprd-us-east1-d --namespace consul get pods -l component=client
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul exec consul-gl-consul-server-0 --container consul -- consul operator raft list-peers
    kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul exec consul-gl-consul-server-0 --container consul -- consul members
    ```
  - Restart all GKE node pools to rotate the Consul clients (and everything else) without causing outages:

    ```fish
    # (fish shell syntax)
    set project gitlab-production; set location us-east1; set cluster gprd-gitlab-gke; set context gke_{$project}_{$location}_{$cluster}
    for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
        kubectl --context $context cordon -l pool=$node_pool
        kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
    end
    ```

    ```fish
    # (fish shell syntax)
    set project gitlab-production; set location us-east1; set cluster gitlab-3okls; set context gke_{$project}_{$location}_{$cluster}
    for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
        kubectl --context $context cordon -l pool=$node_pool
        kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
    end
    ```

    ```fish
    # (fish shell syntax)
    set project gitlab-production; set location us-east1-b; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
    for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
        kubectl --context $context cordon -l pool=$node_pool
        kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
    end
    ```

    ```fish
    # (fish shell syntax)
    set project gitlab-production; set location us-east1-c; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
    for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
        kubectl --context $context cordon -l pool=$node_pool
        kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
    end
    ```

    ```fish
    # (fish shell syntax)
    set project gitlab-production; set location us-east1-d; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
    for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
        kubectl --context $context cordon -l pool=$node_pool
        kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
    end
    ```

    This can continue in the background while proceeding to the next steps.
- Revert the Consul client certificate on VMs:
  - Create a new version of the `chef/env/gprd/cookbook/gitlab-consul/client` secret in Vault from the n-1 version
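    One way to do this, sketched here under the assumption that the secret lives in a KV v2 engine and that the n-1 version number is confirmed first:

    ```shell
    # List the secret's versions to identify the previous (n-1) one
    vault kv metadata get chef/env/gprd/cookbook/gitlab-consul/client
    # Write that version's data back as a new version (replace N with the actual n-1 version number)
    vault kv rollback -version=N chef/env/gprd/cookbook/gitlab-consul/client
    ```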
  - Update and restart the Consul client on the `registry` Patroni cluster:
    - Make sure the cluster is still paused, and identify the replica nodes:

      ```shell
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Update Consul on each replica node and verify the cluster status after each one:

      ⚠️ Skip the `knife node run_list` commands for the existing maintenance replica node

      ```shell
      knife node run_list add patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Unpause the cluster and trigger a zero-downtime failover:

      ```shell
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gprd-registry.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-registry_$(date +%Y%m%d).log
      ```

    - Update Consul on the ex-leader/now-replica node and verify the cluster status:

      ```shell
      knife node run_list add patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```
  - Update and restart the Consul client on the `sec` Patroni cluster:
    - Make sure the cluster is still paused, and identify the replica nodes:

      ```shell
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Update Consul on each replica node and verify the cluster status after each one:

      ⚠️ Skip the `knife node run_list` commands for the existing maintenance replica node

      ```shell
      knife node run_list add patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Unpause the cluster and trigger a zero-downtime failover:

      ```shell
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gprd-sec.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-sec_$(date +%Y%m%d).log
      ```

    - Update Consul on the ex-leader/now-replica node and verify the cluster status:

      ```shell
      knife node run_list add patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```
  - Update and restart the Consul client on the `ci` Patroni cluster:
    - Make sure the cluster is still paused, and identify the replica nodes:

      ```shell
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Update Consul on each replica node and verify the cluster status after each one:

      ⚠️ Skip the `knife node run_list` commands for the existing maintenance replica node

      ```shell
      knife node run_list add patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Unpause the cluster and trigger a zero-downtime failover:

      ```shell
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gprd-ci.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-ci_$(date +%Y%m%d).log
      ```

    - Update Consul on the ex-leader/now-replica node and verify the cluster status:

      ```shell
      knife node run_list add patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```
  - Update and restart the Consul client on the `main` Patroni cluster:
    - Make sure the cluster is still paused, and identify the replica nodes:

      ```shell
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Update Consul on each replica node and verify the cluster status after each one:

      ⚠️ Skip the `knife node run_list` commands for the existing maintenance replica node

      ```shell
      knife node run_list add patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```

    - Unpause the cluster and trigger a zero-downtime failover:

      ```shell
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gprd-main.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-main_$(date +%Y%m%d).log
      ```

    - Update Consul on the ex-leader/now-replica node and verify the cluster status:

      ```shell
      knife node run_list add patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      knife node run_list remove patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
      ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
      ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      ```
  - Update and restart the Consul client on all remaining Consul-enabled VMs:

    ```shell
    knife ssh -C 20 'chef_environment:gprd AND recipes:gitlab_consul\:\:agent AND NOT recipes:gitlab-patroni\:\:consul' 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
    ```
  - Verify that the CA has been updated and the client has been restarted on all VMs:

    ```shell
    knife ssh -C 20 'chef_environment:gprd AND recipes:gitlab_consul\:\:agent' 'openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates | grep notAfter | tr "\n" ";"; systemctl status consul.service | grep Active:' | sort
    ```
- Set label ~change::aborted: `/label ~change::aborted`
## Monitoring

### Key metrics to observe

- Metric: Consul Raft status
  - Location: https://dashboards.gitlab.net/d/consul-main/consul3a-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-stage=main
  - What changes to this metric should prompt a rollback: Unhealthy status and zero failure tolerance (a CLI spot-check is sketched after this list)
- Metric: Everything
  - Location: https://dashboards.gitlab.net/d/general-triage/general3a-platform-triage?from=now-6h%2Fm&orgId=1&timezone=utc&to=now%2Fm&var-PROMETHEUS_DS=mimir-gitlab-gprd&var-environment=gprd&var-stage=main
  - What changes to this metric should prompt a rollback: Apdex drop or increased error rate of any service
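In addition to the dashboard, the Raft health can be spot-checked from the CLI; a sketch, assuming the deployed Consul version supports the `operator autopilot state` subcommand:

```shell
# Failure tolerance should stay above 0 and all servers should report as healthy
kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul \
  exec consul-gl-consul-server-0 --container consul -- consul operator autopilot state
```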
## Change Reviewer checklist

- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels blocks deployments and/or blocks feature-flags are applied as necessary.
## Change Technician checklist

- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue. Mention @gitlab-org/saas-platforms/inframanagers in this issue to request approval and provide visibility to all infrastructure managers.
- For C1, C2, or blocks deployments change issues, confirm with Release managers that the change does not overlap or hinder any release process. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)