consul: SSL certs expired on Aug 3
Summary
The SSL certificate used by consul on all hosts in gstg and gprd expired on August 3, 2019.
Timeline
- 2019-08-07 2pm - While working on https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7066, we observe that a patroni host does not automatically re-join the consul cluster after a restart.
- 2019-08-07 2:37pm - Escalation on Slack (https://gitlab.slack.com/archives/CB3LSMEJV/p1565188642498400?thread_ts=1565186678.494500&cid=CB3LSMEJV), paging the EOC and declaring an incident in #1037.
- 2019-08-07 3:14pm - Paging OnGres: no response, the PagerDuty alert escalates to @ansdval after 20 minutes.
- 2019-08-07 3:? - We give up looking for the CA keys and decide to go with a new CA instead.
- 2019-08-07 4:30pm - We're about to generate a new pair of keys
Details
# Example for web roles:
knife ssh 'roles:gprd-base-fe-web' "sudo cat /etc/consul/ssl/certs/consul.crt | openssl x509 -noout -dates"
...
notBefore=Aug 3 06:06:25 2017 GMT
notAfter=Aug 3 06:06:25 2019 GMT
...
@abrandl ran into this while trying to re-join a database instance in staging, where patroni failed to talk to consul:
Aug 7 13:59:41 patroni-06-db-gstg consul[6595]: 2019/08/07 13:59:41 [ERR] consul: "Catalog.NodeServices" RPC failed to server 10.224.4.6:8300: rpc error getting client: failed to get conn: remote error: tls: bad certificate
Aug 7 13:59:41 patroni-06-db-gstg consul[6595]: 2019/08/07 13:59:41 [ERR] agent: failed to sync remote state: rpc error getting client: failed to get conn: remote error: tls: bad certificate
Aug 7 13:59:41 patroni-06-db-gstg consul[6595]: consul: "Catalog.NodeServices" RPC failed to server 10.224.4.6:8300: rpc error getting client: failed to get conn: remote error: tls: bad certificate
As a result, patroni could not be started on that database instance. I'm not yet clear about the impact across the fleet, but it looks like it could be severe.
We should replace the certificate and start monitoring its validity (assuming we still want these internal certs to expire after a while).
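One way the validity monitoring suggested above could look is a small check built on `openssl x509 -checkend`, which reports whether a certificate will still be valid a given number of seconds from now. This is only a sketch: the helper name `cert_expires_within` and the 30-day threshold in the usage note are assumptions for illustration, not something already in place.

```shell
#!/bin/sh
# Sketch of an expiry check. cert_expires_within is a hypothetical helper
# name chosen for illustration; it is not an existing tool.
#
# Usage: cert_expires_within <cert_file> <days>
# Returns 0 if the certificate expires within <days> days (or has expired),
#         1 if it is still valid beyond that window,
#         2 if the certificate file cannot be read.
cert_expires_within() {
    cert_file="$1"
    days="$2"

    [ -r "$cert_file" ] || return 2

    seconds=$((days * 86400))

    # `openssl x509 -checkend N` exits 0 when the certificate is still
    # valid N seconds from now, and non-zero otherwise.
    if openssl x509 -checkend "$seconds" -noout -in "$cert_file" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}
```

For example, `cert_expires_within /etc/consul/ssl/certs/consul.crt 30 && echo "consul cert expires within 30 days"` could be run per host (for instance through the same `knife ssh` invocation shown above) or wired into whatever alerting we already use.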
Edited by AnthonySandoval