Rotate/replace consul certificates in `gprd`

Production Change - Criticality 1 (C1)

| Change Component | Description |
|------------------|-------------|
| Change Objective | Rotate/replace consul certificates in `gprd` |
| Change Type | Operation |
| Services Impacted | Everything |
| Change Team Members | @ggillies |
| Change Criticality | C1 |
| Change Reviewer | A colleague who will review the change |
| Tested in staging | https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10668#note_382652506 |
| Dry-run output | N/A |
| Due Date | Date and time in UTC timezone for the execution of the change; if possible, add the local timezone of the engineer executing the change |
| Time tracking | 2 hours |
| Downtime Component | N/A |

Detailed steps for the change

The following plan is based on https://support.hashicorp.com/hc/en-us/articles/360048494533-Rotating-Consul-TLS-Certificates

  • Confirm with the EOC that there is no reason for this CR not to proceed (active incidents, etc.)
  • Ensure this issue is labelled as in progress, to block deploys
  • Log onto https://alerts.gitlab.net and add a silence for alertname="ChefClientStale" and env="gprd"
  • Open the triage dashboard and monitor it continuously throughout the change request
  • Confirm that TLS verification is disabled in both directions (verify_incoming and verify_outgoing)
$ bundle exec knife ssh -C5 '(recipes:gitlab_consul\:\:cluster OR recipes:gitlab_consul\:\:agent) AND environment:gprd' 'sudo -n grep verify /etc/consul/consul.json'
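For reference, the check above greps the TLS verification settings out of `/etc/consul/consul.json`. While verification is disabled, the relevant fragment should look roughly like this (a sketch of just these two keys, not the full config):

```json
{
  "verify_incoming": false,
  "verify_outgoing": false
}
```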
  • Disable chef-client on all production nodes
$ bundle exec knife ssh -C5 '(recipes:gitlab_consul\:\:cluster OR recipes:gitlab_consul\:\:agent) AND environment:gprd' 'sudo -n systemctl stop chef-client'
  • Back up the current copies of the gkms-encrypted certificates and private keys
$ gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak
$ gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak
  • Generate new consul keys
$ consul tls ca create
$ consul tls cert create -server -dc=east-us-2 -days 730
$ consul tls cert create -client -dc=east-us-2 -days 730
# Create a tar file of the generated certs and keys to store in 1Password
$ tar -czf gprd-consul-certs.tar.gz *.pem
  • Use gkms-vault-edit to update the certificates and private keys
# get the new contents of private_key
$ cat east-us-2-server-consul-0-key.pem | awk '{ printf("%s\\n", $0) }'; echo
# get the contents of certificate
$ cat east-us-2-server-consul-0.pem | awk '{ printf("%s\\n", $0) }'; echo
# get the contents of ca_certificate
$ cat consul-agent-ca.pem | awk '{ printf("%s\\n", $0) }'; echo
# get the contents of ca_key
$ cat consul-agent-ca-key.pem | awk '{ printf("%s\\n", $0) }'; echo
# change certs and private keys
$ ./bin/gkms-vault-edit gitlab-consul gprd-cluster

# get the new contents of private_key
$ cat east-us-2-client-consul-0-key.pem | awk '{ printf("%s\\n", $0) }'; echo
# get the contents of certificate
$ cat east-us-2-client-consul-0.pem | awk '{ printf("%s\\n", $0) }'; echo
# get the contents of ca_certificate
$ cat consul-agent-ca.pem | awk '{ printf("%s\\n", $0) }'; echo
# change certs and private keys
$ ./bin/gkms-vault-edit gitlab-consul gprd-client
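The `awk` one-liner used above turns a multi-line PEM file into a single line with literal `\n` escapes, which is the form the JSON in the gkms vault expects. A quick self-contained demo on throwaway input (file contents invented for illustration):

```shell
# Two fake "PEM" lines stand in for a real certificate
printf 'AAAA\nBBBB\n' > /tmp/demo.pem
# awk prints each line followed by a literal backslash-n, ready to paste into JSON
cat /tmp/demo.pem | awk '{ printf("%s\\n", $0) }'; echo
# Output: AAAA\nBBBB\n
rm /tmp/demo.pem
```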
  • Determine non-leader consul nodes
$ bundle exec knife ssh -C5 'recipes:gitlab_consul\:\:cluster AND environment:gprd' 'sudo -n consul operator raft list-peers'
# note down the leader node
  • Open a terminal on the consul leader node and run the following to continuously monitor that consul is returning valid records. Keep monitoring this throughout the change.
$ ssh consul-0${LEADER}-inf-gprd.c.gitlab-production.internal
$ watch 'dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY'
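A healthy response contains an A record for every database replica. A hypothetical example of what the answer section might look like (addresses invented for illustration):

```
;; ANSWER SECTION:
db-replica.service.consul. 0 IN A 10.220.16.101
db-replica.service.consul. 0 IN A 10.220.16.102
```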
  • One at a time, restart consul on all the consul server nodes, leaving the leader node last
$ ssh consul-0${NODE}-inf-gprd.c.gitlab-production.internal
$ sudo su -
$ chef-client
$ systemctl restart consul
$ consul operator raft list-peers
# confirm new node is ok
$ consul members | grep `hostname`
# confirm node is active
$ dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
# confirm this returns IP addresses for all replicas
  • Test that applying the new certs on web-01-sv-gprd.c.gitlab-production.internal works correctly
$ ssh web-01-sv-gprd.c.gitlab-production.internal
$ sudo su -
$ chef-client
$ systemctl restart consul
$ consul members | grep `hostname`
# confirm node is active
$ dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
# confirm this returns IP addresses for all replicas
  • Slowly roll out across all nodes by running chef, which will reload consul and enable chef-client again
$ bundle exec knife ssh -C2 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo chef-client'
  • Confirm everything is now running chef again
$ bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'
  • Enable TLS verification on all consul nodes via node attributes
$ cd chef-repo
$ for i in `bundle exec knife status '(recipes:gitlab_consul\:\:cluster OR recipes:gitlab_consul\:\:agent) AND environment:gprd' | awk '{ print $4 }' | sed -e 's/,//g'`;do bundle exec knife node attribute set $i consul.config '{"verify_incoming": true, "verify_outgoing": true}';done
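The pipeline inside the loop above extracts the node name (the fourth whitespace-separated field) from each `knife status` line and strips the trailing comma. A self-contained demo on a single hypothetical line of `knife status` output:

```shell
# One invented line in the usual `knife status` format:
# "<age>, <node name>, <platform>, <ip>"
sample='18 minutes ago, consul-01-inf-gprd.c.gitlab-production.internal, ubuntu 16.04, 10.0.0.1'
# $4 is the node name with a trailing comma; sed strips the comma
echo "$sample" | awk '{ print $4 }' | sed -e 's/,//g'
# -> consul-01-inf-gprd.c.gitlab-production.internal
```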
  • Wait for gradual chef rollout and confirm that everything is still working
$ ssh consul-01-inf-gprd.c.gitlab-production.internal
$ sudo su -
$ chef-client
$ consul operator raft list-peers
# confirm new node is ok
$ consul members | grep `hostname`
# confirm node is active
$ dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
# confirm this returns IP addresses for all replicas
$ gsutil rm gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak
$ gsutil rm gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak

Rollback steps

  • Roll back to the old certificates for consul
$ gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc
$ gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc
  • Log onto any consul cluster nodes that are known to have the new certs, and run chef-client to apply the old certs
$ ssh consul-0${NODE}-inf-gprd.c.gitlab-production.internal
$ sudo su -
$ chef-client
$ consul operator raft list-peers
# confirm new node is ok
$ consul members | grep `hostname`
# confirm node is active
$ dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
# confirm this returns IP addresses for all replicas
  • Slowly roll back across any nodes by running chef, which will reload consul
$ bundle exec knife ssh -C2 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo chef-client'
  • Confirm everything is now running chef again
$ bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'

Monitoring

Key metrics to observe

Summary of infrastructure changes

  • Does this change introduce new compute instances?
  • Does this change re-size any existing compute instances?
  • Does this change introduce any additional usage of tooling like Elasticsearch, CDNs, Cloudflare, etc.?

Changes checklist

  • Detailed steps and rollback steps have been filled prior to commencing work
  • SRE on-call has been informed prior to change being rolled out
  • There are currently no active incidents
Edited by Craig Barrett