Renew Consul TLS certificate
Overview
The TLS certificate for Consul expires in 6 months:
❯ vault kv get -field certificate k8s/env/gprd/ns/consul/tls | openssl x509 -noout -dates
notBefore=Aug 14 00:08:18 2020 GMT
notAfter=Aug 13 00:08:18 2025 GMT
We need to renew it before then. Ideally, we should also manage it with cert-manager instead of renewing it manually, and add proper monitoring for its expiration.
This also needs to be done in db-benchmarking and gstg first.
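For the monitoring part, a minimal sketch of an expiry check built on `openssl x509 -checkend` could look like the following. The helper name `cert_expires_within` and the 180-day threshold are illustrative assumptions, not an existing script or an agreed alerting policy.

```shell
#!/bin/sh
# Hypothetical helper: reads a PEM certificate on stdin and succeeds (exit 0)
# when the certificate expires within the given number of days.
cert_expires_within() {
  days=$1
  # -checkend exits 0 when the certificate is still valid after N seconds,
  # and non-zero when it will have expired by then. An unreadable certificate
  # also makes openssl exit non-zero, so it is reported as "expires soon".
  if openssl x509 -noout -checkend "$((days * 86400))" >/dev/null 2>&1; then
    return 1
  fi
  return 0
}

# Example usage (threshold is an assumption):
#   vault kv get -field certificate k8s/env/gprd/ns/consul/tls \
#     | cert_expires_within 180 && echo "Consul TLS certificate expires soon"
```

Treating an unreadable certificate as "expires soon" is a deliberate choice here, so that a broken secret raises the same alarm as an expiring one.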
External secret definition: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-helmfiles/-/blob/master/releases/consul/values-secrets/values.yaml.gotmpl?ref_type=heads#L6
Secret path: https://vault.gitlab.net/ui/vault/secrets/k8s/kv/env%2Fgprd%2Fns%2Fconsul%2Ftls
More context in: #16947 (closed)
Actions needed
Renewing the certificate for the cluster should be fairly straightforward:
- renew the CA certificate
- store it in k8s/env/gprd/ns/consul/tls
- bump the secret version in gitlab-helmfiles and apply
- the chart should generate a new server certificate from the new CA and the Consul cluster should restart gracefully; clients should still be able to connect to it
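A rough sketch of what the first step could look like, assuming the CA is self-signed and we keep its existing private key (neither is confirmed here). Under that assumption, certificates issued by the old CA certificate should still verify against the renewed one, which is what would let existing clients keep connecting. The helper names and file names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: re-issue a self-signed CA certificate with new validity
# dates while keeping the same private key and subject.
renew_ca_cert() {
  old_cert=$1   # current CA certificate (PEM)
  ca_key=$2     # existing CA private key (PEM)
  days=$3       # validity of the renewed certificate, in days
  new_cert=$4   # output path for the renewed CA certificate
  openssl x509 -in "$old_cert" -signkey "$ca_key" -days "$days" -out "$new_cert"
}

# Hypothetical helper: succeed when a certificate chains to the given CA file.
cert_chains_to() {
  ca_file=$1
  cert_file=$2
  openssl verify -CAfile "$ca_file" "$cert_file" >/dev/null 2>&1
}

# Example usage (file names and validity period are placeholders):
#   renew_ca_cert ca.pem ca-key.pem 3650 ca-new.pem
#   cert_chains_to ca-new.pem existing-client.pem && echo "still trusted"
```

If the CA key were rotated instead, the second check would fail for existing certificates, and both CAs would need to be trusted during the transition.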
Renewing the client certificates is where we could cause an outage if we are not careful: Rails clients could be unable to connect to Patroni for a few seconds, and the Patroni cluster could go out of sync.
We don't need to do anything for the Consul clients in k8s, as they generate their own client certificate on init. We might still want to restart them to be safe, in case the pods outlive the old certificate, which would invalidate their client certificates. The safest option would be to trigger a node pool rotation for all node pools in all clusters, so that the nodes are drained before Consul shuts down and new nodes start with a fresh Consul client using the new certificate.
For the Consul clients in VMs:
- generate a new cert from the CA
- store it in the Cookbook secrets in Vault
- pause the failover mechanism for the Patroni clusters
- restart Consul on the Patroni VMs via knife
- unpause the failover mechanism for the Patroni clusters
- restart Consul on the remaining VMs via knife (none of them use it, so there is no risk)
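The ordering of the pause, restart and unpause steps above could be sketched as a dry-run script like the one below. It does not cover generating the certificate or storing it in Vault. The knife search queries, the restart command, and the use of `patronictl pause` / `patronictl resume` as the way to pause failover are all assumptions for illustration; the real roles and tooling may differ.

```shell
#!/bin/sh
# Sketch of the rollout order for the Consul clients on VMs.
# All queries and commands below are placeholders, not the real setup.
DRY_RUN=${DRY_RUN:-1}
PATRONI_QUERY=${PATRONI_QUERY:-'roles:example-patroni'}        # hypothetical
OTHER_QUERY=${OTHER_QUERY:-'NOT roles:example-patroni'}        # hypothetical
RESTART_CMD=${RESTART_CMD:-'sudo systemctl restart consul'}    # assumes a systemd unit named "consul"

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# Arguments: the names of the Patroni clusters to pause and resume.
rollout_vm_clients() {
  # 1. Pause automatic failover so a Consul restart cannot trigger one.
  for cluster in "$@"; do run patronictl pause "$cluster"; done
  # 2. Restart Consul on the Patroni VMs.
  run knife ssh "$PATRONI_QUERY" "$RESTART_CMD"
  # 3. Re-enable automatic failover.
  for cluster in "$@"; do run patronictl resume "$cluster"; done
  # 4. Restart Consul everywhere else.
  run knife ssh "$OTHER_QUERY" "$RESTART_CMD"
}

# Example usage (cluster name is a placeholder):
#   rollout_vm_clients example-cluster
```

Keeping dry-run as the default means the script only prints the planned commands unless `DRY_RUN=0` is set explicitly, which makes it easier to review the order per environment before touching gprd.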
Then we need to document the certificate renewal procedure in the runbook.
We should also look at disabling Consul on all non-Patroni VMs (and console, and deploy), because it's not useful for anything on them. It also produces a lot of errors from the Gitaly nodes, which are unable to register because of name conflicts.
Exit criteria
For each environment in db-benchmarking, gstg, gprd:
- The Consul cluster is using a certificate with an expiration date 5+ years in the future
- The Consul clients on VMs are using a certificate with an expiration date 5+ years in the future
- The Consul runbooks contain documentation about how to renew the certificates
- An alert exists for the Consul certificate expiration date