2020-12-21: Unable to deploy to Kubernetes Infrastructure

Summary

More information will be added as we investigate the issue.

Timeline

All times UTC.

2020-12-21

17:34 - deployment to gprd fails - https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/393051
17:45 - discovered the deployment failure was specific to the Kubernetes infrastructure
18:52 - jskarbek declares incident in Slack.
22:55 - all remediation is complete, deployments are successful - #3228 (comment 471802763)

Corrective Actions

Keep up-to-date on all tools utilized for Kubernetes infrastructure - delivery#1438 (closed)
Open an issue with the helm-git plugin to address its use of the init command, which will be broken for anyone else still using Helm 2 - https://github.com/aslafy-z/helm-git/pull/140

Incident Review

Summary

Service(s) affected: Deployments
Team attribution: teamDelivery
Time to detection: 10 minutes
Minutes downtime or degradation: 5 hours

Metrics

N/A

Customer Impact

Who was impacted by this incident? (i.e. external customers, internal customers)
1. No one
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
1. N/A
How many customers were affected?
1. N/A
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
1. N/A

What were the root causes?

"5 Whys"

When Delivery started migrating VMs to the Kubernetes infrastructure, Helm was at version 2. While we did an okay job keeping up with updates for that version, switching to version 3 was blocked by various issues. Kubernetes prevents some fields from being updated, and changes in how Helm manages objects prevented our initial upgrade to Helm version 3. The upgrade was later tabled with a status of blocked while we searched for a solution. A solution was never found, and other priorities took precedence in order to meet quarterly OKRs, primarily the drive to migrate as many of our stateless services as possible using an unmodified Helm chart. As a result, we were forced to rely on Helm version 2 beyond its EOL in November.

Helm's transition into the CNCF meant that the original location of the stable repo eventually changed ownership, and subsequently the URL of the stable repo changed as well. With Helm version 2 deprecated in November, it is logical for this transition to occur. On December 21st, the old location of the repo started to return HTTP 403, which prevented the helm init command from working properly. While we do not use the stable repository anywhere, we use the helm-git plugin to pull the GitLab Helm Chart at a specific revision. This plugin runs the helm init command as part of its setup. That setup now always fails because Helm is unable to fetch the necessary repository information, which in turn prevents our deployment mechanism from completing.
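
A failure of this kind (a chart repository that starts answering with HTTP 403) could in principle be caught by a preflight probe before a deployment job reaches the plugin setup step. The following is a minimal Python sketch, not part of our deployment tooling. The function names and the example URL are hypothetical, and it assumes the repository serves its index as index.yaml directly under the repository URL.

```python
"""Preflight probe for a Helm chart repository index (illustrative sketch)."""
import urllib.error
import urllib.request
from typing import Callable

# Assumption: the repository index lives at <repo-url>/index.yaml.
INDEX_FILE = "index.yaml"


def fetch_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to `url`.

    Network-level errors (DNS failure, refused connection) are not caught
    here and propagate to the caller as urllib.error.URLError.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is carried on the exception.
        return err.code


def repo_is_usable(repo_url: str, fetch: Callable[[str], int] = fetch_status) -> bool:
    """Return True only if the repository index answers with HTTP 200."""
    index_url = repo_url.rstrip("/") + "/" + INDEX_FILE
    return fetch(index_url) == 200


def preflight(repo_url: str, fetch: Callable[[str], int] = fetch_status) -> str:
    """Produce a one-line result suitable for a CI job log."""
    if repo_is_usable(repo_url, fetch):
        return f"ok: {repo_url} is serving {INDEX_FILE}"
    return f"fail: {repo_url} did not return HTTP 200 for {INDEX_FILE}"


if __name__ == "__main__":
    # Hypothetical URL and a stubbed fetcher that simulates the HTTP 403
    # response described in this incident, so no network access is needed.
    print(preflight("https://charts.example.invalid/stable", fetch=lambda _url: 403))
```

The fetcher is passed in as a parameter so the check can be exercised without network access. In a real job the default fetch_status would be used, and a "fail" result would stop the pipeline with a clearer message than the plugin's setup error.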

Helm Deprecation for version 2

Incident Response Analysis

How was the incident detected?
1. Notification sent to #announcements
How could detection time be improved?
1. No
How was the root cause diagnosed?
1. Observing log output of the deployment jobs that were marked as failed
How could time to diagnosis be improved?
1. Looking at the log output more carefully
How did we reach the point where we knew how to mitigate the impact?
1. Source code dives into open source software that we consume
How could time to mitigation be improved?
1. No
What went well?
1. Contribution has been made to the helm-git plugin
2. We received excellent assistance from the Distribution team

Post Incident Analysis

Did we have other events in the past with the same root cause?
1. No
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
1. Yes - &370 (closed)
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
1. No - This was out of our control due to deprecation of an old version of software being utilized

Lessons Learned

We need to stay up-to-date with the various tools that we do not build ourselves.

Guidelines

Blameless RCA Guideline

Resources

If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)

Incident Review Stakeholders

Edited Dec 22, 2020 by John Skarbek