Rotate/replace consul certificates in `gprd`
Production Change - Criticality 1 C1
| Change Component | Description |
|---|---|
| Change Objective | Rotate/replace consul certificates in gprd |
| Change Type | Operation |
| Services Impacted | Everything |
| Change Team Members | @ggillies |
| Change Criticality | C1 |
| Change Reviewer | A colleague who will review the change |
| Tested in staging | https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10668#note_382652506 |
| Dry-run output | N/A |
| Due Date | Date and time in UTC timezone for the execution of the change, if possible add the local timezone of the engineer executing the change |
| Time tracking | 2 hours |
| Downtime Component | N/A |
Detailed steps for the change
The following plan is based off https://support.hashicorp.com/hc/en-us/articles/360048494533-Rotating-Consul-TLS-Certificates
- Confirm with EOC that there is no reason for this CR to not proceed (incidents, etc)
- Ensure this issue is labelled as in progress to block deploys
- Log onto https://alerts.gitlab.net and add a silence for alertname="ChefClientStale" and env="gprd"
- Open the triage dashboard and monitor it continuously during the change request
- Confirm TLS validation both ways is disabled
$ bundle exec knife ssh -C5 '(recipes:gitlab_consul\:\:cluster OR recipes:gitlab_consul\:\:agent) AND environment:gprd' 'sudo -n grep verify /etc/consul/consul.json'
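The grep output from the command above can be checked mechanically rather than by eye. The following is a minimal sketch, assuming each output line carries a setting in the form `"verify_incoming": false` as written in `consul.json`; the function name `verify_all_disabled` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads the knife ssh output on stdin and fails if any
# verify_* setting is still true. Assumes the settings appear as JSON
# booleans, e.g. "verify_incoming": false (whitespace around ':' optional).
verify_all_disabled() {
  local enabled
  enabled=$(grep -E '"verify_[a-z_]+"[[:space:]]*:[[:space:]]*true' || true)
  if [ -n "$enabled" ]; then
    echo "TLS verification still enabled on:" >&2
    echo "$enabled" >&2
    return 1
  fi
  echo "TLS verification is disabled on all reporting nodes"
}
```

To use it, pipe the output of the knife command above into `verify_all_disabled`; a non-zero exit status means at least one node still verifies TLS and the rotation should not proceed.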
- Disable chef-client on all production nodes
$ bundle exec knife ssh -C5 '(recipes:gitlab_consul\:\:cluster OR recipes:gitlab_consul\:\:agent) AND environment:gprd' 'sudo -n systemctl stop chef-client'
- Back up the current copy of the gkms encrypted certificates and private keys
$ gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak
$ gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak
- Generate new consul keys
$ consul tls ca create
$ consul tls cert create -server -dc=east-us-2 -days 730
$ consul tls cert create -client -dc=east-us-2 -days 730
# Create tar file to submit to 1password
$ tar -czf gprd-consul-certs.tar.gz *.pem
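Before uploading the new certificates, it may be worth confirming that they have the expected lifetime and chain to the new CA. This is a sketch only, assuming the `openssl` CLI is available on the workstation; the helper names are hypothetical and the file names are the ones produced by the commands above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers built on the openssl CLI.

# Succeeds if the certificate is still valid <days> days from now.
cert_valid_for_days() {
  local cert=$1 days=$2
  openssl x509 -in "$cert" -noout -checkend "$((days * 86400))" >/dev/null
}

# Succeeds if the certificate verifies against the given CA certificate.
cert_signed_by() {
  local ca=$1 cert=$2
  openssl verify -CAfile "$ca" "$cert" >/dev/null 2>&1
}
```

For example, `cert_valid_for_days east-us-2-server-consul-0.pem 700 && cert_signed_by consul-agent-ca.pem east-us-2-server-consul-0.pem` should succeed for a certificate created with `-days 730`; the 700-day threshold is an arbitrary illustration.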
- Use gkms-vault-edit to update the certificates and private keys
# get the new contents of private_key
$ cat east-us-2-server-consul-0-key.pem | awk '{ printf("%s\\n", $0) }';echo
# get the contents of certificate
$ cat east-us-2-server-consul-0.pem | awk '{ printf("%s\\n", $0) }';echo
# get the contents of ca_certificate
$ cat consul-agent-ca.pem | awk '{ printf("%s\\n", $0) }';echo
# get the contents of ca_key
$ cat consul-agent-ca-key.pem | awk '{ printf("%s\\n", $0) }';echo
# change certs and private keys
$ ./bin/gkms-vault-edit gitlab-consul gprd-cluster
# get the new contents of private_key
$ cat east-us-2-client-consul-0-key.pem | awk '{ printf("%s\\n", $0) }';echo
# get the contents of certificate
$ cat east-us-2-client-consul-0.pem | awk '{ printf("%s\\n", $0) }';echo
# get the contents of ca_certificate
$ cat consul-agent-ca.pem | awk '{ printf("%s\\n", $0) }';echo
# change certs and private keys
$ ./bin/gkms-vault-edit gitlab-consul gprd-client
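The awk one-liners above flatten each PEM file into a single line in which every newline becomes a literal `\n`, which is the form that can be pasted as a string value in the vault. A minimal sketch of the same transformation plus a round-trip check follows; the function names are hypothetical, and the check assumes the pasted value contains no backslashes other than those `\n` separators.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the awk program is the one used in the steps above.

# Print a PEM file as one line, with each newline replaced by a literal \n.
pem_to_escaped() {
  awk '{ printf("%s\\n", $0) }' "$1"
  echo
}

# Succeeds if an escaped string decodes back to the content of the PEM file.
escaped_matches_pem() {
  local escaped=$1 pem=$2
  [ "$(printf '%b' "$escaped")" = "$(cat "$pem")" ]
}
```

Running `escaped_matches_pem "$(pem_to_escaped consul-agent-ca.pem)" consul-agent-ca.pem` before pasting gives some assurance that nothing was truncated while copying.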
- Determine non-leader consul nodes
$ bundle exec knife ssh -C5 'recipes:gitlab_consul\:\:cluster AND environment:gprd' 'sudo -n consul operator raft list-peers'
# note down leader node
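Picking the leader out of the peer table can also be scripted. The sketch below assumes the raw table printed by `consul operator raft list-peers` on a single node (a header row, then one row per peer with the node name first and the state, `leader` or `follower`, in a later column). It does not handle the per-host prefix that `knife ssh` adds, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: read `consul operator raft list-peers` output on stdin.

# Print the name of the node whose state is "leader".
raft_leader() {
  awk 'NR > 1 { for (i = 2; i <= NF; i++) if ($i == "leader") print $1 }'
}

# Print the names of all nodes whose state is "follower".
raft_followers() {
  awk 'NR > 1 { for (i = 2; i <= NF; i++) if ($i == "follower") print $1 }'
}
```

On a consul server node, `sudo consul operator raft list-peers | raft_followers` would list the nodes to restart first, and `raft_leader` the node to leave until last.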
- Open a terminal on the consul leader node and run the following to continuously monitor that consul is returning valid records. Keep monitoring this throughout the change.
$ ssh consul-0${LEADER}-inf-gprd.c.gitlab-production.internal
$ watch 'dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY'
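As a complement to watching the dig output by eye, the number of returned records can be checked against a threshold. This is a sketch under stated assumptions: the helper name is hypothetical, and the minimum count is a value you choose based on how many replicas you expect.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `dig +short` output on stdin, prints the number
# of non-empty answer lines, and succeeds only if there are at least <min>.
dns_has_min_answers() {
  local min=$1 count
  count=$(grep -c . || true)
  echo "answers: $count"
  [ "$count" -ge "$min" ]
}
```

For example, `dig +short @127.0.0.1 -p 8600 db-replica.service.consul. ANY | dns_has_min_answers 1` fails as soon as consul stops returning any record for the service.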
- One at a time, restart consul on all the consul server nodes, leaving the leader node last
$ ssh consul-0${NODE}-inf-gprd.c.gitlab-production.internal
$ sudo su -
$ chef-client
$ systemctl restart consul
$ consul operator raft list-peers
# confirm new node is ok
$ consul members | grep `hostname`
# confirm node is active
$ dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
# confirm this returns IP addresses for all replicas
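The "confirm node is active" check above can be made explicit instead of relying on reading the grep output. A minimal sketch, assuming the default `consul members` table layout with the node name in column 1 and the status in column 3; `member_alive` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `consul members` output on stdin and succeeds
# only if <node> is listed with status "alive".
member_alive() {
  local node=$1
  awk -v n="$node" '$1 == n && $3 == "alive" { found = 1 } END { exit !found }'
}
```

Usage would be `consul members | member_alive "$(hostname)"`, assuming the consul node name matches the output of `hostname`. Unlike the plain grep, this also fails when the node is listed but in a `failed` or `left` state.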
- Test that applying the new certs on web-01-sv-gprd.c.gitlab-production.internal works ok
$ ssh web-01-sv-gprd.c.gitlab-production.internal
$ sudo su -
$ chef-client
$ systemctl restart consul
$ consul members | grep `hostname`
# confirm node is active
$ dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
# confirm this returns IP addresses for all replicas
- Slowly roll out across all nodes by running chef, which will reload consul and enable chef-client again
$ bundle exec knife ssh -C2 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo chef-client'
- Confirm everything is now running chef again
$ bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'
- Enable ssl verification via MR
$ cd chef-repo
$ for i in `bundle exec knife status '(recipes:gitlab_consul\:\:cluster OR recipes:gitlab_consul\:\:agent) AND environment:gprd' | awk '{ print $4 }' | sed -e 's/,//g'`;do bundle exec knife node attribute set $i consul.config '{"verify_incoming": true, "verify_outgoing": true}';done
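Because the loop above changes attributes on every matching node, a dry run that only prints the commands can be a useful first pass. The sketch below assumes `knife status` lines of the form `<N> <unit> ago, <node>, ...`, so that the node name is the fourth whitespace-separated field, as the loop above also assumes; both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for a dry run of the attribute change.

# Extract node names from `knife status` output read on stdin.
status_to_nodes() {
  awk '{ print $4 }' | sed -e 's/,//g'
}

# Print (do not run) the knife command for each node name read on stdin.
print_verify_commands() {
  local node
  while read -r node; do
    [ -n "$node" ] || continue
    printf "bundle exec knife node attribute set %s consul.config '%s'\n" \
      "$node" '{"verify_incoming": true, "verify_outgoing": true}'
  done
}
```

Piping the `knife status` query from the step above through `status_to_nodes | print_verify_commands` shows exactly which nodes would be touched before anything is changed.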
- Wait for gradual chef rollout and confirm that everything is still working
$ ssh consul-01-inf-gprd.c.gitlab-production.internal
$ sudo su -
$ chef-client
$ consul operator raft list-peers
# confirm new node is ok
$ consul members | grep `hostname`
# confirm node is active
$ dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
# confirm this returns IP addresses for all replicas
- Remove Alert Silence from https://alerts.gitlab.net
- Remove backup copy of old secrets
$ gsutil rm gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak
$ gsutil rm gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak
Rollback steps
- Rollback to the old certificates for consul
$ gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc.bak gs://gitlab-gprd-secrets/gitlab-consul/gprd-client.enc
$ gsutil cp gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc.bak gs://gitlab-gprd-secrets/gitlab-consul/gprd-cluster.enc
- Go onto any consul-cluster nodes that are known to have new certs, and run chef-client to apply the old certs
$ ssh consul-0${NODE}-inf-gprd.c.gitlab-production.internal
$ sudo su -
$ chef-client
$ consul operator raft list-peers
# confirm new node is ok
$ consul members | grep `hostname`
# confirm node is active
$ dig @127.0.0.1 -p 8600 db-replica.service.consul. ANY
# confirm this returns IP addresses for all replicas
- Slowly roll back across any nodes by running chef, which will reload consul
$ bundle exec knife ssh -C2 'recipes:gitlab_consul\:\:agent AND environment:gprd' 'sudo chef-client'
- Confirm everything is now running chef again
$ bundle exec knife status 'recipes:gitlab_consul\:\:agent AND environment:gprd'
- Remove Alert Silence from https://alerts.gitlab.net
Monitoring
Key metrics to observe
- Metric: haproxy 5xx
- Location: https://dashboards.gitlab.net/d/RZmbBr7mk/gitlab-triage?orgId=1&refresh=30s
- What changes to this metric should prompt a rollback: An increase that is uncharacteristic for the time at which this change request is being carried out
Summary of infrastructure changes
- Does this change introduce new compute instances?
- Does this change re-size any existing compute instances?
- Does this change introduce any additional usage of tooling like Elastic Search, CDNs, Cloudflare, etc?
Changes checklist
- Detailed steps and rollback steps have been filled prior to commencing work
- SRE on-call has been informed prior to change being rolled out
- There are currently no active incidents
Edited by Craig Barrett