2022-08-09 - [gprd] Rotate consul certificates

Production Change

Change Summary

Rotate consul certificates in production.

Execution schedule: 2022-08-09 22:00 UTC

These certificates will expire on: Aug 14 00:08:37 2022 GMT

They will be replaced with new certificates that do not expire until Aug 14 00:00:00 2025 GMT.
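The remaining validity of the deployed certificate can be confirmed on any consul node with `openssl x509 -enddate -noout -in /etc/consul/ssl/certs/consul.crt`. As a sanity check on the window above (a sketch; GNU `date` syntax assumed), the two quoted expiry timestamps are roughly three years apart:

```shell
# Compare the two notAfter timestamps quoted above (GNU date syntax assumed).
old_expiry=$(date -ud 'Aug 14 00:08:37 2022' +%s)
new_expiry=$(date -ud 'Aug 14 00:00:00 2025' +%s)
echo "extra days of validity: $(( (new_expiry - old_expiry) / 86400 ))"
# → extra days of validity: 1095
```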

This change management plan is based on this documentation: https://support.hashicorp.com/hc/en-us/articles/360048494533-Rotating-Consul-TLS-Certificates

Fulfills: Consul certificate expires 2022-08-14: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15493

This plan was mostly plagiarized from: #2437 (closed)

Change Details

  1. Services Impacted - ~"Service::Consul"
  2. Change Technician - @nnelson
  3. Change Reviewer - @ggillies @skarbek @msmiley
  4. Time tracking - 120 minutes
  5. Downtime Component - No downtime

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete - 90 minutes

  • Complete the Change Technician checklist.
  • Set label ~"change::in-progress":
    /label ~"change::in-progress"
  • Log onto https://alerts.gitlab.net and add a silence for alertname="ChefClientStale" and env="gprd".
  • In your browser, navigate to the triage dashboard and continuously monitor the HAProxy 5xx responses metric during the change management plan execution.
  • Disable chef-client on all production nodes
    bundle exec knife ssh -C5 '(recipes:gitlab_consul\:\:cluster OR recipes:gitlab_consul\:\:agent) AND environment:gprd' 'sudo -n chef-client-disable "Rotating consul certificates: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7558"'
  • Download the current copy of the certificate authority certificates and private keys from GCS
    mkdir -p /tmp/consul_certs
    bin/gkms-vault-cat gitlab-consul gprd-client > /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-client.json
    bin/gkms-vault-cat gitlab-consul gprd-cluster > /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json
    jq 'keys' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-client.json
    jq 'keys' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json
    jq -r '.ca_certificate' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json > /tmp/consul_certs/consul_ca_cert.pem
    jq -r '.ca_key' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json > /tmp/consul_certs/consul_ca_key.pem
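  • Sanity check (optional; a sketch): confirm the extracted CA certificate and private key are actually a matching pair before minting new certs. `openssl pkey` handles both RSA and EC keys.

```shell
# Extract the public key from each side and compare; any difference means the
# wrong key material was downloaded and the generated certs would be unusable.
diff <(openssl x509 -noout -pubkey -in /tmp/consul_certs/consul_ca_cert.pem) \
     <(openssl pkey -pubout -in /tmp/consul_certs/consul_ca_key.pem) \
  && echo "CA cert and key match"
```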
  • Upload the ca cert and key files to a production consul system.
    scp /tmp/consul_certs/consul_ca_cert.pem consul-01-inf-gprd.c.gitlab-production.internal:~/
    scp /tmp/consul_certs/consul_ca_key.pem consul-01-inf-gprd.c.gitlab-production.internal:~/
  • Backup the current copy of the gkms encrypted certificates and private keys
    gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak
    gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak
  • Generate new consul keys
    ssh consul-01-inf-gprd.c.gitlab-production.internal
    consul tls cert create -client -dc=east-us-2 -days $(( 3*365 )) -ca ./consul_ca_cert.pem -key ./consul_ca_key.pem
    consul tls cert create -server -dc=east-us-2 -days $(( 3*365 )) -ca ./consul_ca_cert.pem -key ./consul_ca_key.pem
    
    # Inspect the results
    ls -lt
    openssl x509 -noout -in east-us-2-client-consul-0.pem -text
    openssl x509 -noout -in east-us-2-server-consul-0.pem -text
    
    # Compare the generated server certificate to the existing server certificate
    diff -U3 <( openssl x509 -noout -in /etc/consul/ssl/certs/consul.crt -text )  <( openssl x509 -noout -in east-us-2-server-consul-0.pem -text )
    
    # Create a tar file to submit to 1Password
    tar -cvzf gprd-consul-server-and-certs-and-keys.tar.gz east-us-2-*.pem
  • In another shell session, download the tarball
    scp consul-01-inf-gprd.c.gitlab-production.internal:~/gprd-consul-server-and-certs-and-keys.tar.gz ./
  • Upload the tarball to 1Password.
  • Delete the CA material from the home directory of your user on the consul system used to generate the renewed certificates:
    ssh consul-01-inf-gprd.c.gitlab-production.internal
    ls -l ~/consul_ca_*.pem
    rm -i ~/consul_ca_*.pem
  • Use gkms-vault-edit to update the certificates and private keys
    # get the new contents of private_key
    cat east-us-2-server-consul-0-key.pem | awk '{ printf("%s\\n", $0) }';echo
    # get the contents of certificate
    cat east-us-2-server-consul-0.pem | awk '{ printf("%s\\n", $0) }';echo
    
    # change cert and private key
    ./bin/gkms-vault-edit gitlab-consul gprd-cluster
    
    # get the new contents of private_key
    cat east-us-2-client-consul-0-key.pem | awk '{ printf("%s\\n", $0) }';echo
    # get the contents of certificate
    cat east-us-2-client-consul-0.pem | awk '{ printf("%s\\n", $0) }';echo
    
    # change cert and private key
    ./bin/gkms-vault-edit gitlab-consul gprd-client
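  • Note on the `awk` one-liner above: it replaces each real newline with a literal `\n`, so the multi-line PEM can be pasted into a single JSON string value in the vault file. A local demonstration with a fake PEM:

```shell
# Each physical line is re-emitted followed by a literal backslash-n; the
# result is one line, ready to paste into a JSON string value.
printf -- '-----BEGIN FAKE-----\nabc123\n-----END FAKE-----\n' \
  | awk '{ printf("%s\\n", $0) }'; echo
# → -----BEGIN FAKE-----\nabc123\n-----END FAKE-----\n
```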
  • Determine non-leader consul nodes.
    rm -f follower_consul_members.txt
    ssh console-ro-01-sv-gprd.c.gitlab-production.internal -- "consul operator raft list-peers | grep follower | awk '{ print \$1 }'" > follower_consul_members.txt
  • Determine the leader consul node.
    # Nels notes: It is not clear to me why this command was being executed on every node matching this search criteria.
    # Original: bundle exec knife ssh -C5 'recipes:gitlab_consul\:\:cluster AND environment:gprd' 'sudo -n consul operator raft list-peers'
    # Record the leader node
    LEADER=$(ssh console-ro-01-sv-gprd.c.gitlab-production.internal -- "consul operator raft list-peers | grep leader | awk '{ print \$1 }'")
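  • The leader extraction above is a plain grep/awk over the raft peers table; a sketch against sample output (node names hypothetical):

```shell
# Hypothetical `consul operator raft list-peers` output; run the real command
# on a consul node as shown above.
sample='Node    ID    Address        State     Voter
node-a  id-a  10.0.0.1:8300  follower  true
node-b  id-b  10.0.0.2:8300  leader    true
node-c  id-c  10.0.0.3:8300  follower  true'
LEADER=$(printf '%s\n' "$sample" | grep leader | awk '{ print $1 }')
echo "$LEADER"
# → node-b
```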
  • Open a terminal on the consul leader node and run the following to continuously monitor that consul is returning valid records. Keep this running throughout the change.
    ssh "${LEADER}.c.gitlab-production.internal"
    watch 'dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY'
  • One at a time, restart consul on all the consul server nodes, leaving the leader node last. It is expected that restarting the consul.service on the leader will cause an election to be called by the consul server cluster members.
    cat follower_consul_members.txt
    # For each follower node listed in follower_consul_members.txt, set
    # follower=<node name> and repeat the steps below:
    echo "follower node: ${follower}"
    ssh "${follower}.c.gitlab-production.internal"
    chef-client-enable
    sudo chef-client
    sudo systemctl restart consul
    consul operator raft list-peers
    # confirm the raft peer list is healthy
    consul members | grep `hostname`
    # confirm node is active
    dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
    # confirm this returns IP addresses for all replicas
    
    # Watch the consul logs for TLS errors for at least 30 seconds before moving on to the next node.
    sudo journalctl -u consul.service -n50 -f
    sudo journalctl -u consul.service -n100 | grep -e 'heartbeat' -e 'VerifyIncoming'
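  • The second `journalctl` filter surfaces heartbeat failures and `VerifyIncoming` TLS rejections; an illustration against sample (hypothetical) log lines:

```shell
# Only the two problem lines survive the filter; the healthy line is dropped.
printf '%s\n' \
  'consul[1]: memberlist: failed heartbeat to node-a' \
  'consul[1]: rpc error: remote error: tls: bad certificate (VerifyIncoming)' \
  'consul[1]: agent: Synced node info' \
  | grep -e 'heartbeat' -e 'VerifyIncoming'
```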
  • Test that an apply of new certs on console-ro-01-sv-gprd.c.gitlab-production.internal works ok
    ssh console-ro-01-sv-gprd.c.gitlab-production.internal
    chef-client-enable
    sudo chef-client
    sudo systemctl restart consul
    consul members | grep `hostname`
    # confirm node is active
    dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
    # confirm this returns IP addresses for all replicas
  • Slowly roll out across all nodes by re-enabling chef-client and running chef, which will reload consul; alternatively, wait 30 minutes until the entire fleet has finished converging its configuration.
    bundle exec knife ssh -C5 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'chef-client-enable'
    # This only needs to run for about 25 minutes; by then, the remaining systems will begin running `chef-client` on their automatic cadence:
    bundle exec knife ssh -C2 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo -n chef-client'
  • Wait 35 minutes.
  • Confirm everything is now running chef again
    bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'
  • Restart the consul service on all systems running in client agent mode.
    bundle exec knife ssh -C4 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo systemctl restart consul.service'
  • Confirm that everything is still working:
    ssh consul-01-inf-gprd.c.gitlab-production.internal
    sudo chef-client
    consul operator raft list-peers
    # confirm the raft peer list is healthy
    consul members | grep `hostname`
    # confirm node is active
    dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
    # confirm this returns IP addresses for all replicas
  • Remove Alert Silence from https://alerts.gitlab.net
  • Remove backup copy of old secrets
    # Nels note: It is not clear that this is immediately required. Consider deferring until it is clear that a roll-back is not necessary.
    gsutil rm gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak
    gsutil rm gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak
  • Set label ~"change::complete":
    /label ~"change::complete"

Rollback

Rollback steps - steps to be taken in the event of a need to roll back this change

Estimated Time to Complete - 30 minutes

  • Roll back to the old certificates for consul
    gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc
    gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc
  • Go onto any consul-cluster node that is known to have the new certs, and run chef-client to apply the old certs; for example, consul-01-inf-gprd.c.gitlab-production.internal:
    ssh consul-01-inf-gprd.c.gitlab-production.internal
    sudo chef-client
    consul operator raft list-peers
    # confirm the raft peer list is healthy
    consul members | grep `hostname`
    # confirm node is active
    dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
    # confirm this returns IP addresses for all replicas
  • Slowly roll back across any nodes by running chef, which will reload consul
    bundle exec knife ssh -C2 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo chef-client'
  • Confirm everything is now running chef again
    bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'
  • Remove Alert Silence from https://alerts.gitlab.net
  • Set label ~"change::aborted":
    /label ~"change::aborted"

Monitoring

Key metrics to observe

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
    • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are ~"severity::1" or ~"severity::2"
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.