2022-08-09 - [gprd] Rotate consul certificates
Production Change
Change Summary
Rotate consul certificates in production.
Execution schedule: 2022-08-09 22:00 UTC
These certificates will expire on: Aug 14 00:08:37 2022 GMT
Replace them with new ones which do not expire until Aug 14 00:00:00 2025 GMT.
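Expiry dates like these can be confirmed before and after the rotation with `openssl x509 -enddate` and `-checkend` against any of the PEM files. A self-contained sketch, using a throwaway certificate as a stand-in for the real `/etc/consul/ssl/certs/consul.crt`:

```shell
# Generate a throwaway self-signed certificate valid for 30 days
# (a stand-in for the real /etc/consul/ssl/certs/consul.crt).
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" \
  -keyout /tmp/demo_key.pem -out /tmp/demo_cert.pem -days 30 2>/dev/null
# Print the expiry date (the "will expire on" value above comes from this field)
openssl x509 -noout -enddate -in /tmp/demo_cert.pem
# Exit 0 only if the certificate is still valid 7 days from now
openssl x509 -noout -checkend $(( 7*24*3600 )) -in /tmp/demo_cert.pem \
  && echo "certificate valid for at least 7 more days"
```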
This change management plan is based on this documentation: https://support.hashicorp.com/hc/en-us/articles/360048494533-Rotating-Consul-TLS-Certificates
This plan was mostly plagiarized from: #2437 (closed)
Change Details
- Services Impacted - ServiceConsul
- Change Technician - @nnelson
- Change Reviewer - @ggillies @skarbek @msmiley
- Time tracking - 120 minutes
- Downtime Component - No downtime
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete - 90 minutes
- Complete the Change Technician checklist.
- Set label ~"change::in-progress": `/label ~"change::in-progress"`
- Log onto https://alerts.gitlab.net and add a silence for `alertname="ChefClientStale"` and `env="gprd"`.
- In your browser, navigate to the triage dashboard and continuously monitor the HAProxy 5xx responses metric during the change management plan execution.
- Disable `chef-client` on all production nodes:

  ```shell
  bundle exec knife ssh -C5 '(recipes:gitlab_consul\:\:cluster OR recipes:gitlab_consul\:\:agent) AND environment:gprd' \
    'sudo -n chef-client-disable "Rotating consul certificates: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7558"'
  ```
- Download the current copy of the certificate authority certificate and private key from GCS:

  ```shell
  mkdir -p /tmp/consul_certs
  bin/gkms-vault-cat gitlab-consul gprd-client > /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-client.json
  bin/gkms-vault-cat gitlab-consul gprd-cluster > /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json
  jq 'keys' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-client.json
  jq 'keys' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json
  jq -r '.ca_certificate' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json > /tmp/consul_certs/consul_ca_cert.pem
  jq -r '.ca_key' /tmp/consul_certs/gkms-vault.gitlab-consul.gprd-cluster.json > /tmp/consul_certs/consul_ca_key.pem
  ```
- Upload the CA cert and key files to a production consul system:

  ```shell
  scp /tmp/consul_certs/consul_ca_cert.pem consul-01-inf-gprd.c.gitlab-production.internal:~/
  scp /tmp/consul_certs/consul_ca_key.pem consul-01-inf-gprd.c.gitlab-production.internal:~/
  ```
- Back up the current copy of the gkms-encrypted certificates and private keys:

  ```shell
  gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak
  gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak
  ```
- Generate new consul certificates:

  ```shell
  ssh consul-01-inf-gprd.c.gitlab-production.internal
  consul tls cert create -client -dc=east-us-2 -days $(( 3*365 )) -ca ./consul_ca_cert.pem -key ./consul_ca_key.pem
  consul tls cert create -server -dc=east-us-2 -days $(( 3*365 )) -ca ./consul_ca_cert.pem -key ./consul_ca_key.pem
  # Inspect the results
  ls -lt
  openssl x509 -noout -in east-us-2-client-consul-0.pem -text
  openssl x509 -noout -in east-us-2-server-consul-0.pem -text
  # Compare the generated server certificate to the existing server certificate
  diff -U3 <( openssl x509 -noout -in /etc/consul/ssl/certs/consul.crt -text ) \
           <( openssl x509 -noout -in east-us-2-server-consul-0.pem -text )
  # Create a tar file to upload to 1Password
  tar -cvzf gprd-consul-certs.tar.gz east-us-2-*.pem
  ```
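Before distributing the renewed certificates, it is worth confirming that they chain to the existing CA; agents still trusting the old CA would otherwise reject them. A self-contained sketch of that check with plain openssl, using throwaway files in `/tmp` rather than the real CA material:

```shell
# Throwaway CA, standing in for consul_ca_cert.pem / consul_ca_key.pem
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo-ca" \
  -keyout /tmp/demo_ca_key.pem -out /tmp/demo_ca_cert.pem -days 1095 2>/dev/null
# Leaf key + CSR, then sign with the CA (roughly what `consul tls cert create` does)
openssl req -newkey rsa:2048 -nodes -subj "/CN=server.east-us-2.consul" \
  -keyout /tmp/demo_leaf_key.pem -out /tmp/demo_leaf.csr 2>/dev/null
openssl x509 -req -in /tmp/demo_leaf.csr -CA /tmp/demo_ca_cert.pem \
  -CAkey /tmp/demo_ca_key.pem -CAcreateserial -days 1095 \
  -out /tmp/demo_leaf_cert.pem 2>/dev/null
# A renewed cert must verify against the *existing* CA certificate
openssl verify -CAfile /tmp/demo_ca_cert.pem /tmp/demo_leaf_cert.pem
```

In the real procedure, the same `openssl verify -CAfile` check can be run with `consul_ca_cert.pem` against the freshly generated `east-us-2-*-consul-0.pem` files.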
- In another shell session, download the tarball:

  ```shell
  scp consul-01-inf-gprd.c.gitlab-production.internal:~/gprd-consul-certs.tar.gz ./
  ```
- Upload the tarball to 1Password.
- Delete the CA material from your user's home directory on the consul system used to generate the renewed certificates:

  ```shell
  ssh consul-01-inf-gprd.c.gitlab-production.internal
  ls -l ~/consul_ca_*.pem
  rm -i ~/consul_ca_*.pem
  ```
- Use `gkms-vault-edit` to update the certificates and private keys:

  ```shell
  # Get the new contents of the private key
  awk '{ printf("%s\\n", $0) }' east-us-2-server-consul-0-key.pem; echo
  # Get the contents of the certificate
  awk '{ printf("%s\\n", $0) }' east-us-2-server-consul-0.pem; echo
  # Change cert and private key
  ./bin/gkms-vault-edit gitlab-consul gprd-cluster
  # Get the new contents of the private key
  awk '{ printf("%s\\n", $0) }' east-us-2-client-consul-0-key.pem; echo
  # Get the contents of the certificate
  awk '{ printf("%s\\n", $0) }' east-us-2-client-consul-0.pem; echo
  # Change cert and private key
  ./bin/gkms-vault-edit gitlab-consul gprd-client
  ```
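The `awk '{ printf("%s\\n", $0) }'` filter above collapses a multi-line PEM file into a single line with literal `\n` escapes, which is the form the JSON string values in the gkms vault expect. A quick demonstration on dummy input:

```shell
# Turns two physical lines into one line containing literal "\n" escapes,
# matching how the PEM blobs are stored as JSON string values.
printf 'line1\nline2\n' | awk '{ printf("%s\\n", $0) }'; echo
# prints: line1\nline2\n
```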
- Determine the non-leader consul nodes:

  ```shell
  rm -f follower_consul_members.txt
  ssh console-ro-01-sv-gprd.c.gitlab-production.internal -- \
    "consul operator raft list-peers | grep follower | awk '{ print \$1 }'" > follower_consul_members.txt
  ```
- Determine the leader consul node and record it:

  ```shell
  # Note: It is not clear why the original plan ran this on every node
  # matching the search criteria:
  #   bundle exec knife ssh -C5 'recipes:gitlab_consul\:\:cluster AND environment:gprd' 'sudo -n consul operator raft list-peers'
  # Record the leader node
  LEADER=$(ssh console-ro-01-sv-gprd.c.gitlab-production.internal -- \
    "consul operator raft list-peers | grep leader | awk '{ print \$1 }'")
  ```
- Open a terminal on the consul leader node and run the following to continuously monitor that consul is returning valid records. Keep monitoring this throughout the change:

  ```shell
  ssh "${LEADER}.c.gitlab-production.internal"
  watch 'dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY'
  ```
- One at a time, restart consul on all the consul server nodes, leaving the leader node until last. Restarting `consul.service` on the leader is expected to trigger a leadership election among the consul server cluster members.

  ```shell
  cat follower_consul_members.txt
  # for follower in $(cat follower_consul_members.txt); do
  echo "follower node: ${follower}"
  ssh ${follower}.c.gitlab-production.internal
  chef-client-enable
  sudo chef-client
  sudo systemctl restart consul
  consul operator raft list-peers   # confirm the restarted node rejoined the cluster
  consul members | grep `hostname`  # confirm the node is active
  dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY  # confirm this returns IP addresses for all replicas
  # Watch the consul logs for TLS errors for at least 30 seconds
  # before moving on to the next node.
  sudo journalctl -u consul.service -n50 -f
  sudo journalctl -u consul.service -n100 | grep -e 'heartbeat' -e 'VerifyIncoming'
  ```
- Test that applying the new certs on console-ro-01-sv-gprd.c.gitlab-production.internal works:

  ```shell
  ssh console-ro-01-sv-gprd.c.gitlab-production.internal
  chef-client-enable
  sudo chef-client
  sudo systemctl restart consul
  consul members | grep `hostname`  # confirm the node is active
  dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY  # confirm this returns IP addresses for all replicas
  ```
- Slowly roll out across all nodes by running chef, which will reload consul and re-enable chef-client; alternatively, wait 30 minutes until the entire fleet has finished converging its configuration:

  ```shell
  bundle exec knife ssh -C5 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'chef-client-enable'
  # This only needs to run for about 25 minutes. By then, the rest of the
  # systems will begin running `chef-client` on their automatic cadence:
  bundle exec knife ssh -C2 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo -n chef-client'
  ```
- Wait 35 minutes.
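Rather than waiting blind, the 35 minutes can be spent polling for convergence. A sketch of such a loop; `check_converged` here is a hypothetical placeholder for a wrapper around the `knife status` query in the next step that exits 0 once no node reports a stale chef run:

```shell
# check_converged is a hypothetical stand-in; substitute something like
#   bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'
# parsed for stale nodes. The placeholder succeeds immediately so this
# sketch is runnable as-is.
check_converged() { true; }
deadline=$(( $(date +%s) + 35*60 ))
until check_converged; do
  if [ "$(date +%s)" -ge "$deadline" ]; then
    echo "timed out waiting for convergence"
    exit 1
  fi
  sleep 60
done
echo "fleet converged"
```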
- Confirm everything is running chef again:

  ```shell
  bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'
  ```
- Restart the consul service on all systems running in client agent mode:

  ```shell
  bundle exec knife ssh -C4 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo systemctl restart consul.service'
  ```
- Confirm that everything is still working:

  ```shell
  ssh consul-01-inf-gprd.c.gitlab-production.internal
  sudo chef-client
  consul operator raft list-peers   # confirm the node is ok
  consul members | grep `hostname`  # confirm the node is active
  dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY  # confirm this returns IP addresses for all replicas
  ```
- Remove the alert silence from https://alerts.gitlab.net.
- Remove the backup copies of the old secrets:

  ```shell
  # Note: It is not clear that this is immediately required. Consider
  # deferring until it is clear that a rollback will not be necessary.
  gsutil rm gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak
  gsutil rm gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak
  ```
- Set label ~"change::complete": `/label ~"change::complete"`
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete - 30 minutes
- Roll back to the old certificates for consul:

  ```shell
  gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc
  gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc
  ```
- On any consul-cluster nodes known to have the new certs, run chef-client to apply the old certs; for example, on consul-01-inf-gprd.c.gitlab-production.internal:

  ```shell
  ssh consul-01-inf-gprd.c.gitlab-production.internal
  sudo chef-client
  consul operator raft list-peers   # confirm the node is ok
  consul members | grep `hostname`  # confirm the node is active
  dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY  # confirm this returns IP addresses for all replicas
  ```
- Slowly roll back across any affected nodes by running chef, which will reload consul:

  ```shell
  bundle exec knife ssh -C2 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo chef-client'
  ```
- Confirm everything is running chef again:

  ```shell
  bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'
  ```
- Remove the alert silence from https://alerts.gitlab.net.
- Set label ~"change::aborted": `/label ~"change::aborted"`
Monitoring
Key metrics to observe
- Metric: HAProxy 5xx responses
  - Location: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage
  - What changes to this metric should prompt a rollback: Any elevation of errors above 100, sustained for longer than 2 minutes.
Change Reviewer checklist
Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the ~"eoc_approved" label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity1 or severity2.
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.
Edited by Nels Nelson