2020-06-16: Unable to deploy to production due to missing configuration
Summary
A change to utilize a dedicated secret object for a new configuration item, ci_jwt_signing_key, landed on the Staging environment. The deployment to staging failed on our Kubernetes infrastructure because the application attempted to create a secret that had not been preconfigured, and Kubernetes mounts secrets as a read-only volume. After further discussion, it was decided that because the GitLab.com infrastructure was not properly prepared for this change, we would risk an outage if it were deployed to .com. No one in Infrastructure knew about this change, and because secret tokens need to be managed centrally across the entire infrastructure, each GitLab server would have ended up with its own dedicated key, preventing users from seeing data on different servers, as data encrypted by one server could not be decrypted by another.
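For context on the failure mode: in our Kubernetes deployments, secrets are projected into the pods as read-only volumes, so the application cannot create or write key material at runtime. A minimal sketch of that mount, using illustrative resource and path names rather than the chart's actual values:

```yaml
# Sketch of how a chart-managed secret reaches a Rails pod.
# Because the volume is mounted readOnly, any attempt by the
# application to create or modify the key from inside the pod fails.
apiVersion: v1
kind: Pod
metadata:
  name: webservice-example          # illustrative name
spec:
  containers:
    - name: webservice
      image: registry.example.com/gitlab/webservice:latest  # illustrative
      volumeMounts:
        - name: ci-jwt-signing-key
          mountPath: /srv/gitlab/config/secrets/ci_jwt_signing_key
          readOnly: true            # writes from the application are rejected
  volumes:
    - name: ci-jwt-signing-key
      secret:
        secretName: gitlab-ci-jwt-signing-key  # must exist before the deploy
```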
Unable to deploy to production due to missing configuration
Kubernetes deployments are unable to continue because a recently introduced configuration item was not implemented in a way that is backward compatible with Kubernetes deployments.
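The backward-compatible path is for the secret object to exist before the deploy runs, rather than being created by the application. A sketch of such a preconfigured secret, using assumed names for illustration only:

```yaml
# Hypothetical precreated secret holding the signing key. Applied by
# Infrastructure before the application deploy, so the application
# only ever reads it and never needs write access.
apiVersion: v1
kind: Secret
metadata:
  name: gitlab-ci-jwt-signing-key   # assumed name; must match the mount
  namespace: gitlab
type: Opaque
stringData:
  # Key generated out of band, e.g.:
  #   openssl genrsa -out ci_jwt_signing_key.pem 2048
  ci_jwt_signing_key: |
    -----BEGIN RSA PRIVATE KEY-----
    ...key material elided...
    -----END RSA PRIVATE KEY-----
```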
Timeline
All times UTC.
2020-06-16
- 14:08 - Staging deploy to Kubernetes begins: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/1315283
- 14:36 - Investigation into the Kubernetes deployment failure begins
- 14:40 - Discovery that GitLab is attempting to generate a secret object which is not managed by the Infrastructure team
- 15:25 - skarbek declares incident to prevent deploys from going to production
- 15:30 - revert MR created: gitlab-org/gitlab!34646 (merged)
- 18:41 - the revert is tagged and sent off for building
- 20:50 - incident is marked as mitigated; we have a candidate package that we'd like to see make it to production
Incident Review
Summary
- Service(s) affected: staging.gitlab.com
- Team attribution: ~"group::release management"
- Minutes downtime or degradation: 0
Metrics
Customer Impact
- Who was impacted by this incident? No one
- What was the customer experience during the incident? Unaffected; the change never reached production
- How many customers were affected? None
- If a precise customer impact number is unknown, what is the estimated potential impact? None
Incident Response Analysis
- How was the event detected? Staging deploy failure
- How could detection time be improved? It cannot; the staging deploy did precisely what it was supposed to do
- How did we reach the point where we knew how to mitigate the impact? Internal Slack discussion targeting the known Merge Request that caused the issue.
- How could time to mitigation be improved? n/a
Post Incident Analysis
- How was the root cause diagnosed? Staging deploy failure
- How could time to diagnosis be improved? It cannot; the staging deploy did precisely what it was supposed to do
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? No
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
This incident was triggered by a change to the GitLab codebase: gitlab-org/gitlab!34249 (merged)
5 Whys
Lessons Learned
- Infrastructure/Development lacks visibility into changes that require intervention or configuration prior to them being deployed.
Corrective Actions
- Infrastructure will need to get ahead of this Merge Request to ensure that GitLab does not attempt to create unique keys per server, and that a single key shared across each environment is created instead (see the values sketch after this list): https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/10573
- GitLab should perhaps not create these secret objects on its own: gitlab-org/gitlab#222690
- Determine how Development can interact with Infrastructure for changes like this that need configuration ahead of time: https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/8912
- Secret management documentation update for the Helm charts: gitlab-org/charts/gitlab#2157
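As a sketch of the shared-key corrective action above: Infrastructure provisions one secret per environment and points the chart at it, so every server in that environment uses the same key and data handled by one server remains readable by the others. The values keys below are assumptions following the pattern other chart-managed secrets use; the authoritative names belong in the charts documentation update tracked above:

```yaml
# Hypothetical Helm values wiring the application to the
# environment-wide secret created by Infrastructure, rather than
# letting each server generate its own key.
global:
  appConfig:
    ciJwtSigningKey:
      secret: gitlab-ci-jwt-signing-key  # assumed secret name
      key: ci_jwt_signing_key            # assumed key within the secret
```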