# 2025-06-26: Renew CA certificate for Consul in gstg

## Production Change

### Change Summary

The Consul CA certificate in `gstg` expires on 2025-08-06 and needs to be renewed:

```
$ vault kv get -field certificate k8s/env/gstg/ns/consul/tls | openssl x509 -noout -dates
notBefore=Aug  7 01:02:37 2020 GMT
notAfter=Aug  6 01:02:37 2025 GMT
```
The client certificate used on VMs also expires on 2025-08-03 and needs to be renewed:

```
pguinoiseau@console-01-sv-gstg.c.gitlab-staging-1.internal:~$ openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates
notBefore=Aug  4 20:53:22 2022 GMT
notAfter=Aug  3 20:53:22 2025 GMT
```
**Note:** The server certificate and the Kubernetes client certificate are not a concern, as they are generated automatically from the CA: the former on each Helm deployment, the latter on pod init.
Given that the CA is self-signed, we can extend its expiration date by generating a new certificate from the current one while reusing the same private key, which causes no cluster disruption during rollout.

We still need to be careful during the rollout: a restart of the Consul clients in Kubernetes can cause an outage if the Rails app fails to resolve the current database leader instances via DNS, and a restart of the Consul clients on the database instances can trigger a failover if Patroni is not paused.
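Because the private key is unchanged, agents that trust the old CA keep trusting the renewed one. A minimal sanity check (a sketch; the file names `tls.crt` and `tls-renewed.crt` are illustrative, not from the change plan): the public key digests of the two certificates must be identical.

```shell
# Sketch: confirm the renewed CA carries the same public key as the current one.
# tls.crt = current CA, tls-renewed.crt = renewed CA (hypothetical file names).
openssl x509 -in tls.crt -noout -pubkey | openssl sha256
openssl x509 -in tls-renewed.crt -noout -pubkey | openssl sha256
# Both digests must match; a mismatch means a new key was generated by mistake.
```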
Upon successful execution of this change we will proceed to also rotate the CA certificate in `gprd` using the same procedure.
Issue: production-engineering#25974 (closed)
### Change Details

- **Services Impacted** - ~"Service::Consul" ~"Service::Patroni" ~"Service::GitLab Rails"
- **Change Technician** - @pguinoiseau
- **Change Reviewer** - @jcstephenson @bshah11
- **Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM)** - 2025-06-26 03:00
- **Time tracking** - 120 minutes
- **Downtime Component** - none, but Patroni will be paused for a few minutes
### Set Maintenance Mode in GitLab

If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.
### Detailed steps for the change

#### Pre-execution steps

- [ ] Make sure all tasks in the Change Technician checklist are done.
- [ ] For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
- [ ] The SRE on-call provided approval with the ~eoc_approved label on the issue.
- [ ] For C1, C2, or ~"blocks deployments" change issues, Release Managers have been informed prior to the change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
- [ ] There are currently no active incidents that are ~severity1 or ~severity2.
- [ ] If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
- [ ] Generate the new CA certificate with a 5-year expiry using the current private key, and store it in Vault:

  ```
  vault kv get -field certificate k8s/env/gstg/ns/consul/tls > tls.crt
  vault kv get -field key k8s/env/gstg/ns/consul/tls > tls.key
  openssl x509 -in tls.crt -signkey tls.key -days 1825 | vault kv patch k8s/env/gstg/ns/consul/tls certificate=-
  ```
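  As a quick read-back (a sketch mirroring the expiry check from the summary), the updated secret should now report a `notAfter` roughly five years out:

  ```shell
  # Sketch: confirm the renewed certificate stored in Vault has the new expiry.
  vault kv get -field certificate k8s/env/gstg/ns/consul/tls | openssl x509 -noout -dates
  ```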
- [ ] Create the secret with the new CA in Kubernetes: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8157 (merged)
- [ ] Set up the DBRE toolkit locally:

  ```
  test -d db-migration || git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git
  cd db-migration
  git checkout master
  git pull
  python3 -m venv ansible
  source ansible/bin/activate
  python3 -m pip install --upgrade pip
  python3 -m pip install ansible
  ansible --version
  ```
- [ ] Check Ansible SSH access to the Patroni VMs:

  ```
  cd dbre-toolkit
  ansible -i inventory/gstg-ci.yml -m ping all
  ansible -i inventory/gstg-main.yml -m ping all
  ansible -i inventory/gstg-registry.yml -m ping all
  ansible -i inventory/gstg-sec.yml -m ping all
  ```
#### Change steps - steps to take to execute the change

Estimated Time to Complete (mins) - 60 minutes
- [ ] Set label ~"change::in-progress": `/label ~change::in-progress`
- [ ] Pause all Patroni clusters:

  ```
  ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl pause
  ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl pause
  ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl pause
  ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl pause
  ```
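  Each paused cluster should now report maintenance mode; a quick spot check (a sketch; the exact banner wording can vary between Patroni versions):

  ```shell
  # Sketch: a paused cluster prints a maintenance mode notice in its status output.
  ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
  # Expect a "Maintenance mode: on" banner alongside the member table.
  ```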
- [ ] Update the cluster CA in Kubernetes:
  - [ ] Merge gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8110 (merged)

    Helm will generate a new server certificate using the new CA certificate, and the servers will restart automatically, but the clients won't. New clients (for new nodes) will use the new CA.
  - [ ] Verify that the cluster and clients are up and healthy:

    ```
    kubectl config use-context gke_gitlab-staging-1_us-east1_gstg-gitlab-gke
    kubectl --namespace consul get pods -l component=server
    kubectl --namespace consul get pods -l component=client
    kubectl --namespace consul exec consul-gl-consul-server-0 --container consul -- consul operator raft list-peers
    kubectl --namespace consul exec consul-gl-consul-server-0 --container consul -- consul members
    ```
  - [ ] Restart a single GKE node pool to rotate the Consul clients (and everything else) without causing outages, and verify that the Consul clients are up and healthy:

    ```
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-3 --async
    kubectl get nodes --selector cloud.google.com/gke-nodepool=generic-3 --output name | cut -d/ -f2 | xargs -n1 -I% kubectl --namespace consul get pods --selector component=client --field-selector spec.nodeName=%
    ```
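    One way to watch the rotation converge (a sketch, not part of the formal plan) is to poll until no client pod is left in a non-Running state:

    ```shell
    # Sketch: count consul client pods not yet Running; should reach 0 as the pool rolls.
    watch -n 30 "kubectl --namespace consul get pods -l component=client --no-headers | grep -cv Running"
    ```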
  - [ ] Restart all GKE node pools to rotate the Consul clients (and everything else) without causing outages:

    ```
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-2 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-3 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool redis-pubsub-0 --async
    # gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-3 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-4 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-mem-2 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-shared-storage-ssd --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool redis-registry-cache-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-spot-1 --async
    ```

    This can continue in the background while proceeding to the next steps.
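    To keep an eye on those background upgrades, one option (a sketch; the filter fields are assumed from the GKE operations API) is:

    ```shell
    # Sketch: list node-pool upgrade operations still running in the project.
    gcloud container operations list --project gitlab-staging-1 \
      --filter 'operationType=UPGRADE_NODES AND status=RUNNING'
    ```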
- [ ] Update the Consul client certificate on VMs:
  - [ ] Disable `chef-client` on all Consul-enabled VMs:

    ```
    knife ssh 'chef_environment:gstg AND recipes:gitlab_consul\:\:agent' 'sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/19980'
    ```
  - [ ] Generate a new client certificate for VMs valid for 5 years, and store it in Vault:

    ```
    vault kv get -field certificate k8s/env/gstg/ns/consul/tls > tls.crt
    vault kv get -field key k8s/env/gstg/ns/consul/tls > tls.key
    consul tls cert create -client -ca tls.crt -key tls.key -days 1825 -dc east-us-2
    vault kv patch chef/env/gstg/cookbook/gitlab-consul/client ca_certificate=@tls.crt certificate=@east-us-2-client-consul-0.pem private_key=@east-us-2-client-consul-0-key.pem
    ```
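    The freshly generated client certificate can also be sanity-checked against the renewed CA (a sketch using the file names produced by the step above):

    ```shell
    # Sketch: the new client certificate must chain to the renewed CA
    # and carry the new 5-year expiry.
    openssl verify -CAfile tls.crt east-us-2-client-consul-0.pem
    openssl x509 -in east-us-2-client-consul-0.pem -noout -dates
    ```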
  - [ ] SSH into a VM and test a Consul client update and restart:

    ```
    $ ssh blackbox-01-inf-gstg.c.gitlab-staging-1.internal
    pguinoiseau@blackbox-01-inf-gstg.c.gitlab-staging-1.internal:~$ sudo chef-client-enable
    pguinoiseau@blackbox-01-inf-gstg.c.gitlab-staging-1.internal:~$ sudo chef-client
    pguinoiseau@blackbox-01-inf-gstg.c.gitlab-staging-1.internal:~$ sudo openssl x509 -in /etc/consul/ssl/certs/chain.crt -noout -dates
    pguinoiseau@blackbox-01-inf-gstg.c.gitlab-staging-1.internal:~$ sudo openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates
    pguinoiseau@blackbox-01-inf-gstg.c.gitlab-staging-1.internal:~$ sudo systemctl restart consul.service
    ```
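    After the restart, it is worth confirming the agent came back cleanly; a sketch of one way to check from the same host:

    ```shell
    # Sketch: the unit should be active, and recent logs should be free of TLS errors.
    sudo systemctl is-active consul.service
    sudo journalctl -u consul.service --since '5 minutes ago' | grep -iE 'error|tls' || echo 'no TLS errors logged'
    ```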
  - [ ] Update and restart the Consul client on the `registry` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-registry-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-registry.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-registry_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-registry-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `sec` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-sec-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-sec.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-sec_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-sec-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `ci` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-ci-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-ci.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-ci_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-ci-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `main` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-main-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-main.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-main_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-main-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on all remaining Consul-enabled VMs:

    ```
    knife ssh 'chef_environment:gstg AND recipes:gitlab_consul\:\:agent AND NOT recipes:gitlab-patroni\:\:consul' 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
    ```
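    Once the fleet has rolled, a final sweep for stragglers (a sketch reusing the server pod from the earlier verification steps) can confirm every member is alive:

    ```shell
    # Sketch: print any Consul member whose status is not "alive" (empty output is good).
    kubectl --namespace consul exec consul-gl-consul-server-0 --container consul -- consul members | awk 'NR > 1 && $3 != "alive"'
    ```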
- [ ] Party!
- [ ] Set label ~"change::complete": `/label ~change::complete`
### Rollback

#### Rollback steps - steps to be taken in the event of a need to roll back this change

Estimated Time to Complete (mins) - 60 minutes
- [ ] Revert the cluster CA in Kubernetes:
  - [ ] Revert gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8110 (merged)
  - [ ] Verify that the cluster and clients are up and healthy:

    ```
    kubectl config use-context gke_gitlab-staging-1_us-east1_gstg-gitlab-gke
    kubectl --namespace consul get pods -l component=server
    kubectl --namespace consul get pods -l component=client
    kubectl --namespace consul exec consul-gl-consul-server-0 --container consul -- consul operator raft list-peers
    kubectl --namespace consul exec consul-gl-consul-server-0 --container consul -- consul members
    ```
  - [ ] Restart all GKE node pools to rotate the Consul clients (and everything else) without causing outages:

    ```
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-2 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-3 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gitlab-36dv2 --project gitlab-staging-1 --location us-east1 --node-pool redis-pubsub-0 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-3 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-4 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-mem-2 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-shared-storage-ssd --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-gitlab-gke --project gitlab-staging-1 --location us-east1 --node-pool redis-registry-cache-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-b --project gitlab-staging-1 --location us-east1-b --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-c --project gitlab-staging-1 --location us-east1-c --node-pool generic-spot-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-2 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-mem-1 --async
    gcloud container clusters upgrade gstg-us-east1-d --project gitlab-staging-1 --location us-east1-d --node-pool generic-spot-1 --async
    ```

    This can continue in the background while proceeding to the next steps.
- [ ] Revert the Consul client certificate on VMs:
  - [ ] Create a new version of the `chef/env/gstg/cookbook/gitlab-consul/client` secret in Vault from the n-1 version, as sketched below.
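    A sketch of one way to do that, assuming the secret lives in a KV v2 mount (`<n-1>` stands for the actual previous version number, to be looked up first):

    ```shell
    # Sketch: read the previous version of the secret and write it back as a new version.
    vault kv get -version=<n-1> -format=json chef/env/gstg/cookbook/gitlab-consul/client \
      | jq '.data.data' > previous.json
    vault kv put chef/env/gstg/cookbook/gitlab-consul/client @previous.json
    ```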
  - [ ] Update and restart the Consul client on the `registry` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-registry-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-registry.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-registry_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-registry-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-registry-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `sec` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-sec-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-sec.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-sec_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-sec-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-sec-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `ci` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-ci-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-ci.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-ci_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-ci-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-ci-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on the `main` Patroni cluster:
    - [ ] Identify the replica nodes:

      ```
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Update Consul on each replica node and verify the cluster status after each one:

      ```
      ssh patroni-main-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
    - [ ] Unpause the cluster and trigger a zero-downtime failover:

      ```
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl resume
      cd db-migration/dbre-toolkit
      ansible-playbook -i inventory/gstg-main.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gstg-main_$(date +%Y%m%d).log
      ```
    - [ ] Update Consul on the ex-leader (now replica) node and verify the cluster status:

      ```
      ssh patroni-main-v16-...-db-gstg.c.gitlab-staging-1.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
      ssh patroni-main-v16-01-db-gstg.c.gitlab-staging-1.internal sudo gitlab-patronictl list
      ```
  - [ ] Update and restart the Consul client on all remaining Consul-enabled VMs:

    ```
    knife ssh 'chef_environment:gstg AND recipes:gitlab_consul\:\:agent AND NOT recipes:gitlab-patroni\:\:consul' 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
    ```
- [ ] Set label ~"change::aborted": `/label ~change::aborted`
### Monitoring

#### Key metrics to observe

- Metric: Consul Raft status
  - Location: https://dashboards.gitlab.net/d/consul-main/consul3a-overview?orgId=1&from=now-6h%2Fm&to=now%2Fm&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gstg&var-environment=gstg&var-stage=main
  - What changes to this metric should prompt a rollback: unhealthy status and zero failure tolerance
- Metric: Everything
  - Location: https://dashboards.gitlab.net/d/general-triage/general3a-platform-triage?orgId=1&from=now-6h%2Fm&to=now%2Fm&timezone=utc&var-PROMETHEUS_DS=mimir-gitlab-gstg&var-environment=gstg&var-stage=main
  - What changes to this metric should prompt a rollback: Apdex drop or increased error rate of any service
### Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
### Change Technician checklist

- [ ] The change plan is technically accurate.
- [ ] This Change Issue is linked to the appropriate Issue and/or Epic.
- [ ] The change has been tested in staging and results noted in a comment on this issue.
- [ ] A dry-run has been conducted and results noted in a comment on this issue.
- [ ] The change execution window respects the Production Change Lock periods.
- [ ] For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- [ ] For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue. Mention @gitlab-org/saas-platforms/inframanagers in this issue to request approval and provide visibility to all infrastructure managers.
- [ ] For C1, C2, or ~"blocks deployments" change issues, confirm with Release Managers that the change does not overlap or hinder any release process. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)