RCA - Renewal of the certificate failed due to old integrations in VersionApp
Please note: if the incident relates to sensitive data or is security related, consider labeling this issue with security and marking it confidential.
Summary
Related issue gitlab-com/gl-infra/reliability-sav#10 (moved)
- version.gitlab.com certificate expires in 4 days
- If it expires, we will no longer receive Service Ping data, which would be an S1 incident.
- Renewal of the certificate failed due to old integrations in VersionApp
- The certificate was renewed before expiration.
Service(s) affected: VersionApp, Service Ping collection in VersionApp
Team attribution: ~"group::product intelligence"
Minutes downtime or degradation: This was fixed before degradation
Impact & Metrics
Start with the following:
Question | Answer |
---|---|
What was the impact | (i.e. service outage, sub-service brown-out, exposure of sensitive data, ...) |
Who was impacted | (i.e. external customers, internal customers, specific teams, ...) |
How did this impact customers | (i.e. preventing them from doing X, incorrect display of Y, ...) |
How many attempts made to access | |
How many customers affected | |
How many customers tried to access | |
Include any additional metrics that are of relevance.
Provide any relevant graphs that could help understand the impact of the incident and its dynamics.
Detection & Response
Start with the following:
Question | Answer |
---|---|
When was the incident detected? | 2022-03-01 UTC |
How was the incident detected? | @devin detected it while trying to renew the certificate for another issue |
Did alarming work as expected? | Yes, 3 days before expiry: gitlab-com/gl-infra/production#6464 (closed). This was 1 day after the problem had been identified manually by @devin. |
How long did it take from the start of the incident to its detection? | |
How long did it take from detection to remediation? | 2 days |
What steps were taken to remediate? | Connected with ~team::Reliability in gitlab-com/gl-infra/reliability-sav#10 (moved) |
Were there any issues with the response? | (i.e. bastion host used to access the service was not available, relevant team member wasn't page-able, ...) |
MR Checklist
Consider these questions if a code change introduced the issue.
Question | Answer |
---|---|
Was the MR acceptance checklist marked as reviewed in the MR? | |
Should the checklist be updated to help reduce chances of future recurrences? If so, who is the DRI to do so? | |
Timeline
2022-03-01
- 00:01 UTC - We received a message in Slack that the certificate was going to expire on 2022-03-04. An attempt to renew it failed due to old integrations in VersionApp.
- 13:50 UTC - Issue created to collaborate with ~team::Reliability to fix the problem
- 17:39 UTC - A first plan to fix the issue was added gitlab-com/gl-infra/reliability-sav#10 (comment 859020739)
2022-03-02
- 15:26 UTC A detailed plan to fix the problem was added gitlab-com/gl-infra/reliability-sav#10 (comment 860287366) and VersionApp staging was migrated to Helm 3
- 18:47 UTC staging was updated gitlab-com/gl-infra/reliability-sav#10 (comment 860535022)
- 20:47 UTC production was updated gitlab-com/gl-infra/reliability-sav#10 (comment 860638107)
Root Cause Analysis
The purpose of this document is to understand the reasons that caused an incident, and to create mechanisms to prevent it from recurring in the future. A root cause can never be a person, the way of writing has to refer to the system and the context rather than the specific actors.
Follow the "5 whys" in a blameless manner as the core of the root-cause analysis.
For this, it is necessary to start with the incident and question why it happened. Keep iterating, asking "why?" 5 times. While it's not a hard rule that it has to be 5 times, it helps the questions go deeper toward the actual root cause.
Keep in mind that one "why?" may have more than one answer; consider following the different branches.
Example of the usage of "5 whys"
Renewal of the certificate failed. (the problem)
- Why? - Old integrations in VersionApp.
- Why? - Issue not addressed https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14897
- Why? - There are no dedicated engineers for VersionApp maintenance and infrastructure.
What went well
Start with the following:
- Cross-team collaboration was great; the teams involved were Product Intelligence, Infrastructure and Reliability.
- Issue was fixed before becoming an incident.
- ...
What can be improved
- Have an alert system to get notified earlier if certificates are expiring
- Improve VersionApp maintenance, keep the application up to date
- ...
- ...
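As a rough illustration of the earlier-alerting idea in this list, the sketch below computes how many days remain before a certificate's `notAfter` date and compares that to a warning threshold. It is only a minimal sketch, not the alerting that is actually in place for VersionApp: the 30-day threshold and the helper names (`days_until_expiry`, `fetch_not_after`, `needs_renewal`) are assumptions made for illustration.

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical threshold: warn when fewer than this many days remain.
WARN_DAYS = 30


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Return whole days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by SSLSocket.getpeercert(),
    e.g. "Mar  4 00:00:00 2022 GMT". `now` must be timezone-aware.
    """
    expires_epoch = ssl.cert_time_to_seconds(not_after)
    expires = datetime.fromtimestamp(expires_epoch, tz=timezone.utc)
    return (expires - now).days


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Open a TLS connection and return the server certificate's notAfter.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def needs_renewal(not_after: str, now: datetime, warn_days: int = WARN_DAYS) -> bool:
    """True when the certificate expires within `warn_days` days."""
    return days_until_expiry(not_after, now) <= warn_days
```

A scheduled job could call `needs_renewal(fetch_not_after("version.gitlab.com"), datetime.now(timezone.utc))` and post to a chat channel when it returns `True`. With a threshold measured in weeks rather than days, a failed renewal attempt would leave more time to fix whatever is blocking it.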
Corrective actions
- List issues that have been created as corrective actions from this incident.
- For each issue, include the following:
- Issue labeled as corrective action.
- Include an estimated date of completion of the corrective action.
- Include the named individual who owns the delivery of the corrective action.
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15373
- First phase must be completed before %15.0.
- DRI for first phase: @hfyngvason. Will perform the initial migration to the GitLab agent in collaboration with the Reliability team and ~"group::product intelligence"
- ...