2021-01-21: license prod down
Summary
customers.gitlab.com and license.gitlab.com are now available. Both sites experienced an outage due to a change related to the license app; because the customers app makes API calls to the license app, the failure of license.gitlab.com also caused problems on the customers site.
Timeline
All times UTC.
2021-01-21
- 21:14 - EOC receives notification from PagerDuty
- 21:14 - @cmcfarland ran this job on the license-prd branch: https://ops.gitlab.net/gitlab-com/services-base/-/jobs/2824663
- 21:17 - @cmcfarland stopped the job
- 21:17 - @cmcfarland declares incident in Slack
- 21:49 - EOC receives resolved notification from PagerDuty
Corrective Actions
- https://ops.gitlab.net/gitlab-com/services-base/-/merge_requests/157 - Remove license prod and staging environments/branches from creating a stop_review job
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12670 - Review and update the Services-Base README
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12671 - Update runbooks for Customers/License/Version to make sure SREs can find and change those services
Incident Review
Summary
After a successful deploy to the license prod environment in services-base, the stop_review job was manually run in error.
- Service(s) affected: license.gitlab.com, customers.gitlab.com
- Team attribution: ~"team::Core-Infra"
- Time to detection: 3 minutes
- Minutes downtime or degradation: 35 minutes
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Internal and external customers trying to use license.gitlab.com or customers.gitlab.com
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - License updates, the issuing of new licenses, and customer account management were unavailable.
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - According to the project metrics, the same hour on the previous day saw about 46 total requests. If traffic during this incident was similar, roughly 46 requests were not served properly.
What were the root causes?
The services-base project manages the Auto DevOps environments for version and license. Each branch is an environment, and special branches (such as license-prd) correspond directly to production environments. Other branches are considered ephemeral, and CI jobs exist to clean up and remove those environments.
The clean-up (i.e., delete) CI job is normally excluded from the pipelines of production and staging environments. The job is manual, but running it is generally considered a safe action because it is not part of the pipeline for these protected branches.
The license-prd and license-stg branches, however, were not prevented from including the clean-up job in their pipelines. The site reliability engineer making changes to the license-prd branch did not fully understand the underlying CI system and thought (incorrectly) that the clean-up job was safe to run.
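The corrective merge request linked above removes the stop_review job from the license production and staging branches. A minimal sketch of how such an exclusion can be expressed with GitLab CI rules is shown below; the job names, script paths, and branch regex are illustrative assumptions, not the actual services-base configuration.

```yaml
# Hypothetical review-app style jobs; names and scripts are illustrative only.
deploy_review:
  stage: deploy
  script:
    - ./deploy.sh "$CI_COMMIT_REF_SLUG"        # deploy the branch's environment
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    on_stop: stop_review

stop_review:
  stage: deploy
  script:
    - ./teardown.sh "$CI_COMMIT_REF_SLUG"      # deletes the environment's resources
  environment:
    name: review/$CI_COMMIT_REF_SLUG
    action: stop
  rules:
    # Assumed shape of the fix: never create the teardown job on the
    # protected production/staging branches, so it cannot be run by mistake.
    - if: '$CI_COMMIT_BRANCH =~ /^(license|version)-(prd|stg)$/'
      when: never
    - when: manual
```

With rules along these lines, the teardown job does not appear at all in pipelines for the protected branches, so it cannot be triggered by mistake the way it was in this incident.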
Incident Response Analysis
- How was the incident detected?
  - The incident was manually raised when the job was observed deleting production resources.
- How could detection time be improved?
  - Pages were sent to the EOC within a minute of the deletions, so it may not be possible to improve detection time further.
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - This was known right away, since the actions that caused the incident were immediately apparent.
- How could time to mitigation be improved?
  - N/A
- What went well?
  - A site reliability engineer with a good understanding of the project was available to help fix the issue as quickly as possible.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - No
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No
Lessons Learned
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)