
2025-07-10: Renew CA certificate for Consul in gprd

Production Change

Change Summary

The Consul CA certificate in gprd is expiring on 2025-08-13 and needs to be renewed:

$ vault kv get -field certificate k8s/env/gprd/ns/consul/tls | openssl x509 -noout -dates
notBefore=Aug 14 00:08:18 2020 GMT
notAfter=Aug 13 00:08:18 2025 GMT

The client certificate used on VMs also expires on 2025-08-08 and needs to be renewed:

pguinoiseau@console-01-sv-gprd.c.gitlab-production.internal:~$ openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates
notBefore=Aug  9 22:41:47 2022 GMT
notAfter=Aug  8 22:41:47 2025 GMT

Note

The server certificate and the Kubernetes client certificate are not a concern, however: both are generated automatically from the CA, the former on each Helm deployment and the latter on pod init.

Given that the CA is self-signed, we can extend its expiration date by generating a new certificate from the current one and reusing the same private key, which doesn't cause any cluster disruption during rollout.

We still need to be careful during the rollout: restarting the Consul clients in Kubernetes can cause an outage if the Rails app can no longer resolve the current database leader instances via DNS, and restarting the Consul clients on the database instances can trigger a failover if Patroni is not paused.

This change was successfully executed in gstg: #19980 (closed)

Issue: production-engineering#25974 (closed)

Change Details

  1. Services Impacted - Service::Consul Service::Patroni Service::GitLab Rails
  2. Change Technician - @pguinoiseau
  3. Change Reviewer - @jcstephenson @bshah11
  4. Scheduled Date and Time (UTC in format YYYY-MM-DD HH:MM) - 2025-07-10 01:00
  5. Time tracking - 6 hours
  6. Downtime Component - none, but the Patroni clusters will be paused

Set Maintenance Mode in GitLab

If your change involves scheduled maintenance, add a step to set and unset maintenance mode per our runbooks. This will make sure SLA calculations adjust for the maintenance period.

Detailed steps for the change

Pre-execution steps

  • Make sure all tasks in Change Technician checklist are done

  • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)

    • The SRE on-call provided approval with the eoc_approved label on the issue.
  • For C1, C2, or blocks deployments change issues, Release managers have been informed prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)

  • There are currently no active incidents that are severity::1 or severity::2

  • A DBRE is available during the execution of the change

  • Generate the new CA certificate with a 5-year expiry using the current private key, and store it in Vault:

    vault kv get -field certificate k8s/env/gprd/ns/consul/tls > tls.crt
    vault kv get -field key k8s/env/gprd/ns/consul/tls > tls.key
    openssl x509 -in tls.crt -signkey tls.key -days 1825 | vault kv patch k8s/env/gprd/ns/consul/tls certificate=-
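
    Before rolling anything out, it is worth sanity-checking the renewed certificate in Vault: it should carry the new expiry and still match the existing private key (a minimal sketch reusing the same secret path):

    # (bash syntax)
    vault kv get -field certificate k8s/env/gprd/ns/consul/tls | openssl x509 -noout -subject -dates
    # The public key derived from the certificate must be identical to the one derived from the private key
    diff <(vault kv get -field certificate k8s/env/gprd/ns/consul/tls | openssl x509 -noout -pubkey) \
         <(vault kv get -field key k8s/env/gprd/ns/consul/tls | openssl pkey -pubout) && echo OK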
  • Create the secret with the new CA in Kubernetes: gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8237 (merged)

  • Set up the DBRE toolkit locally:

    test -d db-migration || git clone https://gitlab.com/gitlab-com/gl-infra/db-migration.git
    cd db-migration
    git checkout master
    git pull
    python3 -m venv ansible
    source ansible/bin/activate
    python3 -m pip install --upgrade pip
    python3 -m pip install ansible
    ansible --version
  • Check Ansible SSH access to the Patroni VMs:

    cd dbre-toolkit
    ansible -i inventory/gprd-ci.yml -m ping all
    ansible -i inventory/gprd-main.yml -m ping all
    ansible -i inventory/gprd-registry.yml -m ping all
    ansible -i inventory/gprd-sec.yml -m ping all

Change steps - steps to take to execute the change

Estimated Time to Complete (mins) - 6 hours

  • Set label change::in-progress /label ~change::in-progress

  • Pause all Patroni clusters:

    ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
    ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
    ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
    ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
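
    Each cluster should now report that maintenance mode is on; this can be confirmed before touching Consul (same entry hosts as above):

    ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
    ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
    ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
    ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list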
  • Update the cluster CA in Kubernetes:

    • Merge gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8238 (merged)

      Note

      Helm will generate a new server certificate using the new CA certificate, and the servers will restart automatically but the clients won't. New clients (for new nodes) will use the new CA.

    • Roll out the server update to the full statefulset:

      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=server
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=server
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":4}}}}'
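
      The `partition` value controls which server pods are rolled: only ordinals greater than or equal to the partition are updated, so lowering it first to 2 and then to 0 restarts the servers in two batches (highest ordinals first). The final patch restores the partition to its usual value (assumed here to be the chart's normal setting) so that later Helm applies do not restart servers automatically. Progress can also be followed with (a sketch):

      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul rollout status statefulset consul-gl-consul-server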
    • Verify that the cluster and clients are up and healthy:

      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke   --namespace consul get pods -l component=server
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke   --namespace consul get pods -l component=client
      kubectl --context gke_gitlab-production_us-east1_gitlab-3okls      --namespace consul get pods -l component=client
      kubectl --context gke_gitlab-production_us-east1-b_gprd-us-east1-b --namespace consul get pods -l component=client
      kubectl --context gke_gitlab-production_us-east1-c_gprd-us-east1-c --namespace consul get pods -l component=client
      kubectl --context gke_gitlab-production_us-east1-d_gprd-us-east1-d --namespace consul get pods -l component=client
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke   --namespace consul exec consul-gl-consul-server-0 --container consul -- consul operator raft list-peers
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke   --namespace consul exec consul-gl-consul-server-0 --container consul -- consul members
    • Restart a single GKE node pool to rotate the Consul clients (and everything else) without causing outages, and verify that the Consul clients are up and healthy:

      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke cordon -l pool=generic-3
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=generic-3
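      # List the Consul client pod running on each node of the generic-3 pool and confirm each is Running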
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke get nodes --selector cloud.google.com/gke-nodepool=generic-3 --output name | cut -d/ -f2 | xargs -n1 -I%  kubectl --namespace consul get pods --selector component=client --field-selector spec.nodeName=%
    • Restart all GKE node pools to rotate the Consul clients (and everything else) without causing outages:

      # (fish shell syntax)
      set project gitlab-production; set location us-east1; set cluster gprd-gitlab-gke; set context gke_{$project}_{$location}_{$cluster}
      for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
          kubectl --context $context cordon -l pool=$node_pool
          kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
      end
      # (fish shell syntax)
      set project gitlab-production; set location us-east1; set cluster gitlab-3okls; set context gke_{$project}_{$location}_{$cluster}
      for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
          kubectl --context $context cordon -l pool=$node_pool
          kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
      end
      # (fish shell syntax)
      set project gitlab-production; set location us-east1-b; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
      for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
          kubectl --context $context cordon -l pool=$node_pool
          kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
      end
      # (fish shell syntax)
      set project gitlab-production; set location us-east1-c; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
      for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
          kubectl --context $context cordon -l pool=$node_pool
          kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
      end
      # (fish shell syntax)
      set project gitlab-production; set location us-east1-d; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
      for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
          kubectl --context $context cordon -l pool=$node_pool
          kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
      end

      Note

      This can continue in the background while proceeding to the next steps.
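
      Drain progress and Consul client health can be watched per cluster while this runs, for example (a sketch, fish shell syntax, reusing the variables set above):

      kubectl --context $context get nodes --label-columns cloud.google.com/gke-nodepool
      kubectl --context $context --namespace consul get pods -l component=client -o wide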

  • Update the Consul client certificate on VMs:

    • Disable chef-client on all Consul-enabled VMs:

      knife ssh 'chef_environment:gprd AND recipes:gitlab_consul\:\:agent' 'sudo chef-client-disable https://gitlab.com/gitlab-com/gl-infra/production/-/issues/20084'
    • Generate a new client certificate for VMs valid for 5 years, and store it in Vault:

      vault kv get -field certificate k8s/env/gprd/ns/consul/tls > tls.crt
      vault kv get -field key k8s/env/gprd/ns/consul/tls > tls.key
      consul tls cert create -client -ca tls.crt -key tls.key -days 1825 -dc east-us-2
      vault kv patch chef/env/gprd/cookbook/gitlab-consul/client ca_certificate=@tls.crt certificate=@east-us-2-client-consul-0.pem private_key=@east-us-2-client-consul-0-key.pem
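
      Before Chef distributes it, the new client certificate can be checked against the CA and its expiry confirmed (a sketch using the files generated above):

      openssl verify -CAfile tls.crt east-us-2-client-consul-0.pem
      openssl x509 -in east-us-2-client-consul-0.pem -noout -dates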
    • SSH into a VM and test a Consul client update and restart:

      $ ssh blackbox-01-inf-gprd.c.gitlab-production.internal
      pguinoiseau@blackbox-01-inf-gprd.c.gitlab-production.internal:~$ sudo chef-client-enable
      pguinoiseau@blackbox-01-inf-gprd.c.gitlab-production.internal:~$ sudo chef-client
      pguinoiseau@blackbox-01-inf-gprd.c.gitlab-production.internal:~$ sudo openssl x509 -in /etc/consul/ssl/certs/chain.crt -noout -dates
      pguinoiseau@blackbox-01-inf-gprd.c.gitlab-production.internal:~$ sudo openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates
      pguinoiseau@blackbox-01-inf-gprd.c.gitlab-production.internal:~$ sudo systemctl restart consul.service
    • Update and restart the Consul client on the registry Patroni cluster:

      • Make sure the cluster is still paused, and identify the replica nodes:

        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Update Consul on each replica node and verify the cluster status after each one:

        ⚠️ Skip the knife node run_list commands for the existing maintenance replica node

        knife node run_list add patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        # Note: Skip this line if it is the maintenance replica
        knife node run_list remove patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
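
        Before moving to the next replica, it is also worth confirming that the agent on the node just updated restarted cleanly and is serving the renewed certificate (a sketch; substitute the node name):

        ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'sudo systemctl is-active consul.service && sudo openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -enddate'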
      • Unpause the cluster and trigger a zero-downtime failover:

        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
        cd db-migration/dbre-toolkit
        ansible-playbook -i inventory/gprd-registry.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-registry_$(date +%Y%m%d).log
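
        The new leader can be confirmed before updating the old one (same command as earlier):

        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list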
      • Update Consul on ex-leader/now-replica node and verify the cluster status:

        knife node run_list add patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
    • Update and restart the Consul client on the sec Patroni cluster:

      • Make sure the cluster is still paused, and identify the replica nodes:

        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Update Consul on each replica node and verify the cluster status after each one:

        ⚠️ Skip the knife node run_list commands for the existing maintenance replica node

        knife node run_list add patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Unpause the cluster and trigger a zero-downtime failover:

        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
        cd db-migration/dbre-toolkit
        ansible-playbook -i inventory/gprd-sec.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-sec_$(date +%Y%m%d).log
      • Update Consul on ex-leader/now-replica node and verify the cluster status:

        knife node run_list add patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
    • Update and restart the Consul client on the ci Patroni cluster:

      • Make sure the cluster is still paused, and identify the replica nodes:

        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Update Consul on each replica node and verify the cluster status after each one:

        ⚠️ Skip the knife node run_list commands for the existing maintenance replica node

        knife node run_list add patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Unpause the cluster and trigger a zero-downtime failover:

        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
        cd db-migration/dbre-toolkit
        ansible-playbook -i inventory/gprd-ci.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-ci_$(date +%Y%m%d).log
      • Update Consul on ex-leader/now-replica node and verify the cluster status:

        knife node run_list add patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
    • Update and restart the Consul client on the main Patroni cluster:

      • Make sure the cluster is still paused, and identify the replica nodes:

        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Update Consul on each replica node and verify the cluster status after each one:

        ⚠️ Skip the knife node run_list commands for the existing maintenance replica node

        knife node run_list add patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Unpause the cluster and trigger a zero-downtime failover:

        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
        cd db-migration/dbre-toolkit
        ansible-playbook -i inventory/gprd-main.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-main_$(date +%Y%m%d).log
      • Update Consul on ex-leader/now-replica node and verify the cluster status:

        knife node run_list add patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
    • Update and restart the Consul client on all remaining Consul-enabled VMs:

      knife ssh -C 20 'chef_environment:gprd AND recipes:gitlab_consul\:\:agent AND NOT recipes:gitlab-patroni\:\:consul' 'sudo chef-client-enable && sudo chef-client >/dev/null && sudo systemctl restart consul.service'
    • Verify that the client certificate has been updated and the client has been restarted on all VMs:

      knife ssh -C 20 'chef_environment:gprd AND recipes:gitlab_consul\:\:agent' 'openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates | grep notAfter | tr "\n" ";"; systemctl status consul.service | grep Active:' | sort
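
      Every line should show the new notAfter date and an active Consul service; any host still presenting a certificate that expires in 2025 needs another chef-client run. Stragglers can be listed with (a sketch, assuming the renewed certificates now expire in 2030):

      knife ssh -C 20 'chef_environment:gprd AND recipes:gitlab_consul\:\:agent' 'openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -enddate' | grep 2025 || true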
  • Party!

  • Set label change::complete /label ~change::complete

Rollback

Note

A rollback is highly unlikely to be necessary past a successful deployment of the new CA to the Consul servers, Kubernetes clients and first VM.

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - up to 6 hours

  • Revert the cluster CA in Kubernetes:

    • Revert gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles!8238 (merged)

    • Roll out the server update to the full statefulset:

      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":2}}}}'
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=server
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":0}}}}'
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul get pods -l component=server
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke --namespace consul patch statefulset consul-gl-consul-server --patch '{"spec":{"updateStrategy":{"rollingUpdate":{"partition":4}}}}'
    • Verify that the cluster and clients are up and healthy:

      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke   --namespace consul get pods -l component=server
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke   --namespace consul get pods -l component=client
      kubectl --context gke_gitlab-production_us-east1_gitlab-3okls      --namespace consul get pods -l component=client
      kubectl --context gke_gitlab-production_us-east1-b_gprd-us-east1-b --namespace consul get pods -l component=client
      kubectl --context gke_gitlab-production_us-east1-c_gprd-us-east1-c --namespace consul get pods -l component=client
      kubectl --context gke_gitlab-production_us-east1-d_gprd-us-east1-d --namespace consul get pods -l component=client
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke   --namespace consul exec consul-gl-consul-server-0 --container consul -- consul operator raft list-peers
      kubectl --context gke_gitlab-production_us-east1_gprd-gitlab-gke   --namespace consul exec consul-gl-consul-server-0 --container consul -- consul members
    • Restart all GKE node pools to rotate the Consul clients (and everything else) without causing outages:

      # (fish shell syntax)
      set project gitlab-production; set location us-east1; set cluster gprd-gitlab-gke; set context gke_{$project}_{$location}_{$cluster}
      for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
          kubectl --context $context cordon -l pool=$node_pool
          kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
      end
      # (fish shell syntax)
      set project gitlab-production; set location us-east1; set cluster gitlab-3okls; set context gke_{$project}_{$location}_{$cluster}
      for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
          kubectl --context $context cordon -l pool=$node_pool
          kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
      end
      # (fish shell syntax)
      set project gitlab-production; set location us-east1-b; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
      for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
          kubectl --context $context cordon -l pool=$node_pool
          kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
      end
      # (fish shell syntax)
      set project gitlab-production; set location us-east1-c; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
      for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
          kubectl --context $context cordon -l pool=$node_pool
          kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
      end
      # (fish shell syntax)
      set project gitlab-production; set location us-east1-d; set cluster gprd-{$location}; set context gke_{$project}_{$location}_{$cluster}
      for node_pool in (gcloud container node-pools list --project $project --location $location --cluster $cluster --format 'value(name)')
          kubectl --context $context cordon -l pool=$node_pool
          kubectl --context $context drain --grace-period=300 --ignore-daemonsets=true --delete-emptydir-data=true -l pool=$node_pool
      end

      This can continue in the background while proceeding to the next steps.

  • Revert the Consul client certificate on VMs:

    • Create a new version of the chef/env/gprd/cookbook/gitlab-consul/client secret in Vault from the n-1 version
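
      For example (a sketch, bash syntax; check the current version first and substitute it, as the exact version number is not known in advance):

      vault kv get -format=json chef/env/gprd/cookbook/gitlab-consul/client | jq '.data.metadata.version'   # current version, n
      vault kv get -format=json -version=<n-1> chef/env/gprd/cookbook/gitlab-consul/client | jq '.data.data' | vault kv put chef/env/gprd/cookbook/gitlab-consul/client -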

    • Update and restart the Consul client on the registry Patroni cluster:

      • Make sure the cluster is still paused, and identify the replica nodes:

        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Update Consul on each replica node and verify the cluster status after each one:

        ⚠️ Skip the knife node run_list commands for the existing maintenance replica node

        knife node run_list add patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Unpause the cluster and trigger a zero-downtime failover:

        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
        cd db-migration/dbre-toolkit
        ansible-playbook -i inventory/gprd-registry.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-registry_$(date +%Y%m%d).log
      • Update Consul on ex-leader/now-replica node and verify the cluster status:

        knife node run_list add patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-registry-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-registry-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-registry-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
    • Update and restart the Consul client on the sec Patroni cluster:

      • Make sure the cluster is still paused, and identify the replica nodes:

        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Update Consul on each replica node and verify the cluster status after each one:

        ⚠️ Skip the knife node run_list commands for the existing maintenance replica node

        knife node run_list add patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Unpause the cluster and trigger a zero-downtime failover:

        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
        cd db-migration/dbre-toolkit
        ansible-playbook -i inventory/gprd-sec.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-sec_$(date +%Y%m%d).log
      • Update Consul on ex-leader/now-replica node and verify the cluster status:

        knife node run_list add patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-sec-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-sec-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-sec-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
    • Update and restart the Consul client on the ci Patroni cluster:

      • Make sure the cluster is still paused, and identify the replica nodes:

        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list

      • Update Consul on each replica node and verify the cluster status after each one:

        ⚠️ Skip the knife node run_list commands for the existing maintenance replica node

        knife node run_list add patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Unpause the cluster and trigger a zero-downtime failover:

        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
        cd db-migration/dbre-toolkit
        ansible-playbook -i inventory/gprd-ci.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-ci_$(date +%Y%m%d).log
      • Update Consul on ex-leader/now-replica node and verify the cluster status:

        knife node run_list add patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-ci-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-ci-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-ci-v16-01-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
    • Update and restart the Consul client on the main Patroni cluster:

      • Make sure the cluster is still paused, and identify the replica nodes:

        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl pause
        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Update Consul on each replica node and verify the cluster status after each one:

        ⚠️ Skip the knife node run_list commands for the existing maintenance replica node

        knife node run_list add patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
      • Unpause the cluster and trigger a zero-downtime failover:

        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl resume
        cd db-migration/dbre-toolkit
        ansible-playbook -i inventory/gprd-main.yml switchover_patroni_leader.yml 2>&1 | ts | tee -a ansible_switchover_patroni_leader_gprd-main_$(date +%Y%m%d).log
      • Update Consul on ex-leader/now-replica node and verify the cluster status:

        knife node run_list add patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
        knife node run_list remove patroni-main-v16-...-db-gprd.c.gitlab-production.internal 'role[gprd-base-db-patroni-maintenance]'
        ssh patroni-main-v16-...-db-gprd.c.gitlab-production.internal sudo chef-client
        ssh patroni-main-v16-101-db-gprd.c.gitlab-production.internal sudo gitlab-patronictl list
    • Update and restart the Consul client on all remaining Consul-enabled VMs:

      knife ssh -C 20 'chef_environment:gprd AND recipes:gitlab_consul\:\:agent AND NOT recipes:gitlab-patroni\:\:consul' 'sudo chef-client-enable && sudo chef-client && sudo systemctl restart consul.service'
    • Verify that the client certificate has been updated and the client has been restarted on all VMs:

      knife ssh -C 20 'chef_environment:gprd AND recipes:gitlab_consul\:\:agent' 'openssl x509 -in /etc/consul/ssl/certs/consul.crt -noout -dates | grep notAfter | tr "\n" ";"; systemctl status consul.service | grep Active:' | sort
  • Set label change::aborted /label ~change::aborted

Monitoring

Key metrics to observe

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
    • The labels blocks deployments and/or blocks feature-flags are applied as necessary.

Change Technician checklist

  • The change plan is technically accurate.
  • This Change Issue is linked to the appropriate Issue and/or Epic
  • Change has been tested in staging and results noted in a comment on this issue.
  • A dry-run has been conducted and results noted in a comment on this issue.
  • The change execution window respects the Production Change Lock periods.
  • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue. Mention @gitlab-org/saas-platforms/inframanagers in this issue to request approval and provide visibility to all infrastructure managers.
  • For C1, C2, or blocks deployments change issues, confirm with Release managers that the change does not overlap or hinder any release process (In #production channel, mention @release-managers and this issue and await their acknowledgment.)