2020-11-18: Select2 images not loading CSS properly for license compliance and GitHub project import
Summary
Merge Request gitlab-org/gitlab!44319 (merged) introduced asynchronous, only-if-needed loading of certain select2 assets to improve overall frontend performance. The change was not applied correctly in three further places (Approvals, License Compliance and GitHub Importer), which left them broken. gitlab-org/gitlab!48053 (merged) fixed those three places.
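For readers unfamiliar with the pattern, here is a minimal sketch of only-if-needed asset loading and of how a call site that is not migrated can break. All names (`loadOnce`, `ensureSelect2`, `initMigratedDropdown`, `initMissedDropdown`) are hypothetical and the loader is a stand-in; this is not the code from either MR.

```typescript
// Hypothetical names throughout; this is not GitLab's actual code.
type AssetLoader = () => Promise<void>;

// Wrap a loader so the asset is requested at most once, on first use.
function loadOnce(loader: AssetLoader): AssetLoader {
  let pending: Promise<void> | undefined;
  return () => {
    pending = pending ?? loader();
    return pending;
  };
}

let loadCount = 0;
let select2Ready = false;

// Stand-in for a dynamic import of the select2 script and stylesheet.
const ensureSelect2 = loadOnce(async () => {
  loadCount += 1;
  await Promise.resolve(); // simulate the asynchronous fetch
  select2Ready = true;
});

// Migrated call site: waits for the assets before using the widget.
async function initMigratedDropdown(): Promise<string> {
  await ensureSelect2();
  return select2Ready ? "styled" : "broken";
}

// Missed call site: still assumes the assets are already on the page.
function initMissedDropdown(): string {
  return select2Ready ? "styled" : "broken";
}
```

In this sketch, `initMissedDropdown()` returns "broken" whenever nothing else on the page has triggered `ensureSelect2()` first, which is presumably similar to what happened in the three places that were missed.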
Timeline
All times UTC.
2020-11-18
- 06:50 - @Kushal merges gitlab-org/gitlab!44319 (merged)
- 14:29 - @willmeek opens gitlab-org/gitlab#284670 (closed) after QA tests fail (gitlab-org/quality/testcases#575 (comment 449960364))
- 14:58 - The changes are deployed to canary
- 16:48 - @timzallmann creates MR with the fix: gitlab-org/gitlab!48053 (merged)
- 17:45 - @tpazitny declares severity1 incident in Slack
- 17:53 - @tpazitny adjusts incident severity scope to severity2
- 18:05 - @nnelson escalates to IMOC: https://gitlab.pagerduty.com/incidents/PUW13WC
- 18:09 - @nnelson disables production canary through chatops
- 19:37 - @jivanvl merges gitlab-org/gitlab!48053 (merged)
- 21:59 - @jivanvl opens https://gitlab.com/gitlab-org/security/gitlab/-/merge_requests/1088 which resolves conflicts between gitlab-org/gitlab!48053 (merged) and the 13-6 auto deploy branch
2020-11-19
Corrective Actions
Incident Review
Summary
- Service(s) affected: GitLab.com CANARY (UI for Issue import from GitHub, License Compliance Feature, MR Approvals)
- Team attribution: ???
- Time to detection: 2:45 hours
- Minutes downtime or degradation: 3:11 hours
Metrics
GitHub Project imports in the week from 2020-11-16 to 2020-11-22. The incident occurred between the red lines.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Internal customers (as most GitLab employees use canary).
  - Any customer using canary on GitLab.com.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Preventing customers from importing projects from GitHub
  - Preventing customers from adding new licenses to License Compliance
  - Preventing customers from changing certain settings related to Merge Request approvals
- How many customers were affected?
  - N/A
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Almost none. The most critical affected feature seems to be GitHub Project imports, and the chart above shows no clear indication of any impact at all.
What were the root causes?
("5 Whys")
In my opinion (@leipert), looking at the initial MR (gitlab-org/gitlab!44319 (merged)), the root cause seems to be:
The changes were too broad and complex and "under-reviewed", and the MR shouldn't have been merged as-is.
Multiple factors contribute to this analysis:
- The MR is rather large (25 files, 2000 lines changed) and touches many parts of the code base, making it difficult to review.
- The MR doesn't contain screen recordings showing behavior of the pages before / after the change.
- It seems like the MR has just been reviewed by one person, despite being so large.
- The MR touches a deprecated / legacy part of our code base and thus is not necessarily well tested.
- It seems like the changes have been merged close to our "release cut-off" for 13.6 where things can be a bit frantic, because people "want to get their changes in".
Incident Response Analysis
- How was the incident detected?
  - Multiple people reported different issues related to the incident (e.g. @tpazitny). However, automatic Quality tests seem to have caught it first, based on the issue opened by @willmeek: gitlab-org/gitlab#284670 (closed)
- How could detection time be improved?
  - N/A
- How was the root cause diagnosed?
  - Not sure, but it seems like @markrian and maybe others realized it rather quickly based on our automatic testing.
- How could time to diagnosis be improved?
  - N/A
- How did we reach the point where we knew how to mitigate the impact?
  - Based on the changes @timzallmann introduced, the fix seems to have been pretty clear: just three more instances needed to be checked.
- How could time to mitigation be improved?
  - N/A
- What went well?
  - Automatic testing caught the issue in Canary without a broad impact.
  - People were quick to respond and collaborated on a fix.
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - N/A
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - N/A
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - N/A
Lessons Learned
- Always make sure that code is reviewed by at least one reviewer before an MR is merged, especially if the changes are complex.
- When refactoring legacy parts of the code base, don't rely on automated tests alone; also prepare screencasts showing the behavior before and after the change.
- If test coverage is missing, provide it before a refactor.