consul: SSL certs expired on Aug 3
Incident issue: production#1037 (closed)
Impact & Metrics
Detection & Response
Timeline UTC
- 2019-08-07 2pm - We observe that a patroni host does not automatically re-join the consul cluster after a restart.
- 2019-08-07 2:37pm - Escalation on Slack, paging EOC and declaring an incident in production#1037 (closed)
- 2019-08-07 3:14pm - Paging OnGres, no response. The PagerDuty alert escalates to @ansdval after 20 minutes.
- 2019-08-07 3:? - We give up on looking for the CA keys and decide to go with a new CA instead.
- 2019-08-07 4:30pm - We're about to generate a new pair of keys
Root Cause Analysis
The purpose of this document is to understand the reasons that caused an incident, and to create mechanisms to prevent it from recurring in the future. A root cause can never be a person; the write-up has to refer to the system and the context rather than the specific actors.
Follow the "5 whys" in a blameless manner as the core of the post mortem.
For this it is necessary to start with the incident and question why it happened. Keep iterating, asking "why?" 5 times. While it's not a hard rule that it has to be 5 times, it helps the questions dig deeper toward the actual root cause.
Keep in mind that one "why?" may have more than one answer; consider following the different branches.
What went well
What can be improved
Corrective actions
- A newer version of consul would have allowed us to just reload the server to use the new certificate (without restarting it).
- There is no monitoring in place for checking the validity of the consul cert.
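
A certificate validity check of the kind missing here could look roughly like the following. This is a minimal sketch, not the monitoring that was actually put in place: the function names, the 30-day warning threshold, and the example timestamp are all assumptions made for illustration. It only covers the expiry arithmetic; how the `notAfter` value is collected from the consul hosts is left open.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the number of days from `now` until the certificate expires.

    `not_after` is expected in the format used by
    ssl.SSLSocket.getpeercert()["notAfter"], e.g. "Aug  3 12:00:00 2019 GMT".
    `now` must be a timezone-aware datetime.
    """
    # cert_time_to_seconds() parses the certificate time as UTC epoch seconds.
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def cert_status(not_after: str, now: datetime, warn_days: int = 30) -> str:
    """Classify a certificate as "ok", "expiring" or "expired".

    warn_days is a hypothetical alerting threshold, not a value from
    the incident.
    """
    days = days_until_expiry(not_after, now)
    if days < 0:
        return "expired"
    if days < warn_days:
        return "expiring"
    return "ok"


if __name__ == "__main__":
    # Hypothetical expiry timestamp; the real cert's exact time is not recorded here.
    example_not_after = "Aug  3 12:00:00 2019 GMT"
    print(cert_status(example_not_after, datetime.now(timezone.utc)))
```

The `not_after` string can come from `getpeercert()` on a verified TLS connection, or from the output of `openssl x509 -enddate -noout` with the `notAfter=` prefix stripped. Alerting on "expiring" rather than only on "expired" leaves time to rotate the certificate before hosts start failing to join the cluster.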