2020-06-16: Unable to deploy to production due to missing configuration
Summary
A change to utilize a dedicated secret object for a new configuration item, ci_jwt_signing_key, landed on the Staging environment. The deployment to staging failed on our Kubernetes infrastructure because the application attempted to create a secret that had not been preconfigured, and Kubernetes mounts secrets as a read-only volume. After further discussion, it was decided that because the GitLab.com infrastructure was not properly prepared for this change, we would risk an outage if it were deployed to .com. No one in Infrastructure knew about this change, and because secret tokens need to be managed centrally across the entire infrastructure, each GitLab server would have ended up with its own dedicated key, preventing users from seeing data on different servers, as data encrypted by one server could not be decrypted by another.
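For context on the failure mode: in our Kubernetes deployments, secrets are projected into the pods as read-only volumes, so the application cannot create or write key material at runtime. A minimal sketch of that mount, using illustrative resource and path names rather than the chart's actual values:

```yaml
# Sketch of how a chart-managed secret reaches a Rails pod.
# Because the volume is mounted readOnly, any attempt by the
# application to create or modify the key from inside the pod fails.
apiVersion: v1
kind: Pod
metadata:
  name: webservice-example          # illustrative name
spec:
  containers:
    - name: webservice
      image: registry.example.com/gitlab/webservice:latest  # illustrative
      volumeMounts:
        - name: ci-jwt-signing-key
          mountPath: /srv/gitlab/config/secrets/ci_jwt_signing_key
          readOnly: true            # writes from the application are rejected
  volumes:
    - name: ci-jwt-signing-key
      secret:
        secretName: gitlab-ci-jwt-signing-key  # must exist before the deploy
```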
Unable to deploy to production due to missing configuration
Kubernetes deployments are unable to continue because a recently introduced configuration item was not implemented in a way that is backward compatible with Kubernetes deployments.
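The backward-compatible path is for the secret object to exist before the deploy runs, rather than being created by the application. A sketch of such a preconfigured secret, using assumed names for illustration only:

```yaml
# Hypothetical precreated secret holding the signing key. Applied by
# Infrastructure before the application deploy, so the application
# only ever reads it and never needs write access.
apiVersion: v1
kind: Secret
metadata:
  name: gitlab-ci-jwt-signing-key   # assumed name; must match the mount
  namespace: gitlab
type: Opaque
stringData:
  # Key generated out of band, e.g.:
  #   openssl genrsa -out ci_jwt_signing_key.pem 2048
  ci_jwt_signing_key: |
    -----BEGIN RSA PRIVATE KEY-----
    ...key material elided...
    -----END RSA PRIVATE KEY-----
```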
Timeline
All times UTC.
2020-06-16
- 14:08 - Staging deploy to Kubernetes begins: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/1315283
- 14:36 - Investigation into the Kubernetes deployment failure begins
- 14:40 - Discovery that GitLab is attempting to generate a secret object which is not managed by the Infrastructure team
- 15:25 - skarbek declares incident to prevent deploys from going to production
- 15:30 - revert MR created: gitlab-org/gitlab!34646 (merged)
- 18:41 - the revert is tagged and sent off for building
- 20:50 - incident is marked as mitigated; we have a candidate package that we'd like to see make it to production
Incident Review
Summary
- Service(s) affected: staging.gitlab.com
- Team attribution: ~"group::release management"
- Minutes downtime or degradation: 0
Metrics
Customer Impact
- Who was impacted by this incident? No one
- What was the customer experience during the incident? Unaffected; the change never reached production
- How many customers were affected? None
- If a precise customer impact number is unknown, what is the estimated potential impact? None
Incident Response Analysis
- How was the event detected? Staging deploy failure
- How could detection time be improved? It cannot; the staging deploy did precisely what it was supposed to do
- How did we reach the point where we knew how to mitigate the impact? Internal Slack discussion targeting the known Merge Request that caused the issue.
- How could time to mitigation be improved? n/a
Post Incident Analysis
- How was the root cause diagnosed? Staging deploy failure
- How could time to diagnosis be improved? It cannot; the staging deploy did precisely what it was supposed to do
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
This incident was triggered by a change to the GitLab codebase: gitlab-org/gitlab!34249 (merged)
5 Whys
Lessons Learned
- Infrastructure/Development lacks visibility into changes that require intervention or configuration prior to them being deployed.
Corrective Actions
- Infrastructure will need to get ahead of this Merge Request to ensure that GitLab does not attempt to create unique keys per server, and that a single key shared across each environment is created instead (see the values sketch after this list): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10573
- GitLab should perhaps not create these secret objects on its own: gitlab-org/gitlab#222690
- Determine how Development can interact with Infrastructure for changes like this that need configuration ahead of time: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8912
- Secret management documentation update for the Helm charts: gitlab-org/charts/gitlab#2157
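As a sketch of the shared-key corrective action above: Infrastructure provisions one secret per environment and points the chart at it, so every server in that environment uses the same key and data handled by one server remains readable by the others. The values keys below are assumptions following the pattern other chart-managed secrets use; the authoritative names belong in the charts documentation update tracked above:

```yaml
# Hypothetical Helm values wiring the application to the
# environment-wide secret created by Infrastructure, rather than
# letting each server generate its own key.
global:
  appConfig:
    ciJwtSigningKey:
      secret: gitlab-ci-jwt-signing-key  # assumed secret name
      key: ci_jwt_signing_key            # assumed key within the secret
```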