2020-12-21: Unable to deploy to Kubernetes Infrastructure
Summary
More information will be added as we investigate the issue.
Timeline
All times UTC.
2020-12-21
- 17:34 - deployment to gprd fails - https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/pipelines/393051
- 17:45 - discovered the deployment failure was specific to the Kubernetes infrastructure
- 18:52 - jskarbek declares incident in Slack.
- 22:55 - all remediation is complete, deployments are successful - #3228 (comment 471802763)
Corrective Actions
- Keep up-to-date on all tools utilized for Kubernetes infrastructure - delivery#1438 (closed)
- Open an issue with the helm-git plugin to address the init command, as it will be broken for others who are using Helm 2 - https://github.com/aslafy-z/helm-git/pull/140
Incident Review
Summary
- Service(s) affected: Deployments
- Team attribution: teamDelivery
- Time to detection: 10 minutes
- Minutes downtime or degradation: 5 hours
Metrics
N/A
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - No one
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - N/A
- How many customers were affected?
  - N/A
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - N/A
What were the root causes?
When Delivery started migrating VMs to Kubernetes infrastructure, Helm was at version 2. While we did an okay job keeping up with updates for that version, switching to version 3 was blocked by various issues. Kubernetes prevents some fields from being updated, and changes to how Helm manages objects prevented our initial upgrade to Helm version 3. That work was later tabled with a status of blocked while we searched for a solution. A solution was never found, and other priorities ended up taking precedence in order to meet quarterly OKRs, primarily the drive to migrate as many of our stateless services as possible using an unmodified Helm chart. As a result, we were forced to rely on Helm version 2 beyond its EOL in November.
Helm's transition into the CNCF meant that the original location of the stable repo eventually changed ownership, and the URL of the stable repo subsequently changed as well. With Helm version 2 deprecated in November, it is logical for this transition to occur. On December 21st, the old location of the repo started to return HTTP 403, which prevented the helm init command from working properly. While we do not use the stable repository anywhere, we use a plugin, helm-git, to pull the GitLab Helm Chart at a specific revision. This plugin runs the helm init command as part of its setup. That setup now fails because Helm is unable to get the necessary repository information, which in turn prevents our deployment mechanism from completing.
Helm Deprecation for version 2
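The failure mode described above can be sketched in a few lines. This is an illustrative sketch only, not the fix that was applied during the incident and not how helm-git invokes Helm internally. It assumes the relocated stable repository lives at https://charts.helm.sh/stable and that the Helm 2 client is initialised with the --client-only and --stable-repo-url flags. The function names are hypothetical.

```python
from typing import List
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

# Assumption: the relocated stable chart repository for Helm.
NEW_STABLE_REPO_URL = "https://charts.helm.sh/stable"


def repo_index_status(repo_url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of <repo_url>/index.yaml, or 0 if unreachable.

    `helm init` needs this index file; a 403 here is the kind of response
    that caused the setup step to fail.
    """
    request = Request(repo_url.rstrip("/") + "/index.yaml", method="HEAD")
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # Non-2xx responses (for example 403) are raised as HTTPError.
        return err.code
    except URLError:
        # DNS failure, refused connection, timeout, and similar.
        return 0


def helm2_init_command(stable_repo_url: str = NEW_STABLE_REPO_URL) -> List[str]:
    """Build a client-only Helm 2 init command that uses the given stable repo URL."""
    return ["helm", "init", "--client-only", "--stable-repo-url", stable_repo_url]
```

A deployment job could call repo_index_status() as a preflight check and fail early with a clear message, instead of surfacing the error from deep inside a plugin's setup step. The list returned by helm2_init_command() can be passed to subprocess.run() on a machine that has the Helm 2 client installed.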
Incident Response Analysis
- How was the incident detected?
  - Notification sent to #announcements
- How could detection time be improved?
  - No
- How was the root cause diagnosed?
  - Observing log output of the deployment jobs that were marked as failed
- How could time to diagnosis be improved?
  - Looking at the log output more carefully
- How did we reach the point where we knew how to mitigate the impact?
  - Source code dives into open source software that we consume
- How could time to mitigation be improved?
  - No
- What went well?
  - Contribution has been made to the helm-git plugin
  - We received excellent assistance from the Distribution team
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Yes - &370 (closed)
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No - This was out of our control due to deprecation of an old version of software being utilized
Lessons Learned
- We need to stay up-to-date with the various tools that we do not build ourselves.
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)