2022-08-09 - [gprd] Rotate consul certificates
Production Change
Change Summary
Rotate consul certificates in production.
Execution schedule: 2022-08-09 22:00 UTC
These certificates will expire on: Aug 14 00:08:37 2022 GMT
Replace them with new ones which do not expire until Aug 14 00:00:00 2025 GMT.
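Expiry dates like these can be confirmed before and after the rotation with `openssl x509 -enddate` and `-checkend` against any of the PEM files. A self-contained sketch, using a throwaway certificate as a stand-in for the real `/etc/consul/ssl/certs/consul.crt`:

```shell
# Generate a throwaway self-signed certificate valid for 30 days
# (a stand-in for the real /etc/consul/ssl/certs/consul.crt).
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" \
  -keyout /tmp/demo_key.pem -out /tmp/demo_cert.pem -days 30 2>/dev/null
# Print the expiry date (the "will expire on" value above comes from this field)
openssl x509 -noout -enddate -in /tmp/demo_cert.pem
# Exit 0 only if the certificate is still valid 7 days from now
openssl x509 -noout -checkend $(( 7*24*3600 )) -in /tmp/demo_cert.pem \
  && echo "certificate valid for at least 7 more days"
```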
This change management plan is based on this documentation: https://support.hashicorp.com/hc/en-us/articles/360048494533-Rotating-Consul-TLS-Certificates
This plan was mostly plagiarized from: #2437 (closed)
Change Details
- Services Impacted - ServiceConsul
- Change Technician - @nnelson
- Change Reviewer - @ggillies @skarbek @msmiley
- Time tracking - 120 minutes
- Downtime Component - No downtime
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete - 90 minutes
- Complete the Change Technician checklist.
- Set label ~"change::in-progress": `/label ~"change::in-progress"`
- Log onto https://alerts.gitlab.net and add a silence for `alertname="ChefClientStale"` and `env="gprd"`.
- In your browser, navigate to the triage dashboard and continuously monitor the HAProxy 5xx responses metric during the change management plan execution.
- Disable `chef-client` on all production nodes:

  ```shell
  bundle exec knife ssh -C5 '(recipes:gitlab_consul\:\:cluster OR recipes:gitlab_consul\:\:agent) AND environment:gprd' \
    'sudo -n chef-client-disable "Rotating consul certificates: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7558"'
  ```
- Download the current copy of the certificate authority certificate and private key from GCS:

  ```shell
  mkdir -p /tmp/consul_certs
  bin/gkms-vault-cat gitlab-consul gprd-client > /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-client.json
  bin/gkms-vault-cat gitlab-consul gprd-cluster > /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json
  jq 'keys' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-client.json
  jq 'keys' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json
  jq -r '.ca_certificate' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json > /tmp/consul_certs/consul_ca_cert.pem
  jq -r '.ca_key' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json > /tmp/consul_certs/consul_ca_key.pem
  ```
- Upload the CA cert and key files to a production consul system:

  ```shell
  scp /tmp/consul_certs/consul_ca_cert.pem consul-01-inf-gprd.c.gitlab-production.internal:~/
  scp /tmp/consul_certs/consul_ca_key.pem consul-01-inf-gprd.c.gitlab-production.internal:~/
  ```
- Back up the current copy of the gkms-encrypted certificates and private keys:

  ```shell
  gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak
  gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak
  ```
- Generate new consul certificates:

  ```shell
  ssh consul-01-inf-gprd.c.gitlab-production.internal
  consul tls cert create -client -dc=east-us-2 -days $(( 3*365 )) -ca ./consul_ca_cert.pem -key ./consul_ca_key.pem
  consul tls cert create -server -dc=east-us-2 -days $(( 3*365 )) -ca ./consul_ca_cert.pem -key ./consul_ca_key.pem
  # Inspect the results
  ls -lt
  openssl x509 -noout -in east-us-2-client-consul-0.pem -text
  openssl x509 -noout -in east-us-2-server-consul-0.pem -text
  # Compare the generated server certificate to the existing server certificate
  diff -U3 <( openssl x509 -noout -in /etc/consul/ssl/certs/consul.crt -text ) \
           <( openssl x509 -noout -in east-us-2-server-consul-0.pem -text )
  # Create a tar file to upload to 1Password
  tar -cvzf gprd-consul-certs.tar.gz east-us-2-*.pem
  ```
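Before distributing the renewed certificates, it is worth confirming that they chain to the existing CA; agents still trusting the old CA would otherwise reject them. A self-contained sketch of that check with plain openssl, using throwaway files in `/tmp` rather than the real CA material:

```shell
# Throwaway CA, standing in for consul_ca_cert.pem / consul_ca_key.pem
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo-ca" \
  -keyout /tmp/demo_ca_key.pem -out /tmp/demo_ca_cert.pem -days 1095 2>/dev/null
# Leaf key + CSR, then sign with the CA (roughly what `consul tls cert create` does)
openssl req -newkey rsa:2048 -nodes -subj "/CN=server.east-us-2.consul" \
  -keyout /tmp/demo_leaf_key.pem -out /tmp/demo_leaf.csr 2>/dev/null
openssl x509 -req -in /tmp/demo_leaf.csr -CA /tmp/demo_ca_cert.pem \
  -CAkey /tmp/demo_ca_key.pem -CAcreateserial -days 1095 \
  -out /tmp/demo_leaf_cert.pem 2>/dev/null
# A renewed cert must verify against the *existing* CA certificate
openssl verify -CAfile /tmp/demo_ca_cert.pem /tmp/demo_leaf_cert.pem
```

In the real procedure, the same `openssl verify -CAfile` check can be run with `consul_ca_cert.pem` against the freshly generated `east-us-2-*-consul-0.pem` files.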
- In another shell session, download the tarball:

  ```shell
  scp consul-01-inf-gprd.c.gitlab-production.internal:~/gprd-consul-certs.tar.gz ./
  ```
- Upload the tarball to 1Password.
- Delete the CA material from your user's home directory on the consul system used to generate the renewed certificates:

  ```shell
  ssh consul-01-inf-gprd.c.gitlab-production.internal
  ls -l ~/consul_ca_*.pem
  rm -i ~/consul_ca_*.pem
  ```
- Use `gkms-vault-edit` to update the certificates and private keys:

  ```shell
  # Get the new contents of the private key
  awk '{ printf("%s\\n", $0) }' east-us-2-server-consul-0-key.pem; echo
  # Get the contents of the certificate
  awk '{ printf("%s\\n", $0) }' east-us-2-server-consul-0.pem; echo
  # Change cert and private key
  ./bin/gkms-vault-edit gitlab-consul gprd-cluster
  # Get the new contents of the private key
  awk '{ printf("%s\\n", $0) }' east-us-2-client-consul-0-key.pem; echo
  # Get the contents of the certificate
  awk '{ printf("%s\\n", $0) }' east-us-2-client-consul-0.pem; echo
  # Change cert and private key
  ./bin/gkms-vault-edit gitlab-consul gprd-client
  ```
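The `awk '{ printf("%s\\n", $0) }'` filter above collapses a multi-line PEM file into a single line with literal `\n` escapes, which is the form the JSON string values in the gkms vault expect. A quick demonstration on dummy input:

```shell
# Turns two physical lines into one line containing literal "\n" escapes,
# matching how the PEM blobs are stored as JSON string values.
printf 'line1\nline2\n' | awk '{ printf("%s\\n", $0) }'; echo
# prints: line1\nline2\n
```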
- Determine the non-leader consul nodes:

  ```shell
  rm -f follower_consul_members.txt
  ssh console-ro-01-sv-gprd.c.gitlab-production.internal -- \
    "consul operator raft list-peers | grep follower | awk '{ print \$1 }'" > follower_consul_members.txt
  ```
- Determine the leader consul node and record it:

  ```shell
  # Note: It is not clear why the original plan ran this on every node
  # matching the search criteria:
  #   bundle exec knife ssh -C5 'recipes:gitlab_consul\:\:cluster AND environment:gprd' 'sudo -n consul operator raft list-peers'
  # Record the leader node
  LEADER=$(ssh console-ro-01-sv-gprd.c.gitlab-production.internal -- \
    "consul operator raft list-peers | grep leader | awk '{ print \$1 }'")
  ```
- Open a terminal on the consul leader node and run the following to continuously monitor that consul is returning valid records. Keep monitoring this throughout the change:

  ```shell
  ssh "${LEADER}.c.gitlab-production.internal"
  watch 'dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY'
  ```
- One at a time, restart consul on all the consul server nodes, leaving the leader node until last. Restarting `consul.service` on the leader is expected to trigger a leadership election among the consul server cluster members.

  ```shell
  cat follower_consul_members.txt
  # for follower in $(cat follower_consul_members.txt); do
  echo "follower node: ${follower}"
  ssh ${follower}.c.gitlab-production.internal
  chef-client-enable
  sudo chef-client
  sudo systemctl restart consul
  consul operator raft list-peers   # confirm the restarted node rejoined the cluster
  consul members | grep `hostname`  # confirm the node is active
  dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY  # confirm this returns IP addresses for all replicas
  # Watch the consul logs for TLS errors for at least 30 seconds
  # before moving on to the next node.
  sudo journalctl -u consul.service -n50 -f
  sudo journalctl -u consul.service -n100 | grep -e 'heartbeat' -e 'VerifyIncoming'
  ```
- Test that applying the new certs on console-ro-01-sv-gprd.c.gitlab-production.internal works:

  ```shell
  ssh console-ro-01-sv-gprd.c.gitlab-production.internal
  chef-client-enable
  sudo chef-client
  sudo systemctl restart consul
  consul members | grep `hostname`  # confirm the node is active
  dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY  # confirm this returns IP addresses for all replicas
  ```
- Slowly roll out across all nodes by running chef, which will reload consul and re-enable chef-client; alternatively, wait 30 minutes until the entire fleet has finished converging its configuration:

  ```shell
  bundle exec knife ssh -C5 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'chef-client-enable'
  # This only needs to run for about 25 minutes. By then, the rest of the
  # systems will begin running `chef-client` on their automatic cadence:
  bundle exec knife ssh -C2 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo -n chef-client'
  ```
- Wait 35 minutes.
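Rather than waiting blind, the 35 minutes can be spent polling for convergence. A sketch of such a loop; `check_converged` here is a hypothetical placeholder for a wrapper around the `knife status` query in the next step that exits 0 once no node reports a stale chef run:

```shell
# check_converged is a hypothetical stand-in; substitute something like
#   bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'
# parsed for stale nodes. The placeholder succeeds immediately so this
# sketch is runnable as-is.
check_converged() { true; }
deadline=$(( $(date +%s) + 35*60 ))
until check_converged; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "timed out waiting for convergence"
    exit 1
  fi
  sleep 60
done
echo "fleet converged"
```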
- Confirm everything is running chef again:

  ```shell
  bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'
  ```
- Restart the consul service on all systems running in client agent mode:

  ```shell
  bundle exec knife ssh -C4 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo systemctl restart consul.service'
  ```
- Confirm that everything is still working:

  ```shell
  ssh consul-01-inf-gprd.c.gitlab-production.internal
  sudo chef-client
  consul operator raft list-peers   # confirm the node is ok
  consul members | grep `hostname`  # confirm the node is active
  dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY  # confirm this returns IP addresses for all replicas
  ```
- Remove the alert silence from https://alerts.gitlab.net.
- Remove the backup copies of the old secrets:

  ```shell
  # Note: It is not clear that this is immediately required. Consider
  # deferring until it is clear that a rollback will not be necessary.
  gsutil rm gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak
  gsutil rm gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak
  ```
- Set label ~"change::complete": `/label ~"change::complete"`
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete - 30 minutes
- Roll back to the old certificates for consul:

  ```shell
  gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc
  gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc
  ```
- On any consul-cluster nodes known to have the new certs, run chef-client to apply the old certs; for example, on consul-01-inf-gprd.c.gitlab-production.internal:

  ```shell
  ssh consul-01-inf-gprd.c.gitlab-production.internal
  sudo chef-client
  consul operator raft list-peers   # confirm the node is ok
  consul members | grep `hostname`  # confirm the node is active
  dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY  # confirm this returns IP addresses for all replicas
  ```
- Slowly roll back across any affected nodes by running chef, which will reload consul:

  ```shell
  bundle exec knife ssh -C2 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo chef-client'
  ```
- Confirm everything is running chef again:

  ```shell
  bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'
  ```
- Remove the alert silence from https://alerts.gitlab.net.
- Set label ~"change::aborted": `/label ~"change::aborted"`
Monitoring
Key metrics to observe
- Metric: HAProxy 5xx responses
  - Location: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage
  - What changes to this metric should prompt a rollback: Any elevation of errors above 100, sustained for longer than 2 minutes.
Change Reviewer checklist
Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the ~"eoc_approved" label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity1 or severity2.
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Edited by Nels Nelson