2020-12-30: Service desk replies not sent to authors
Summary
MR gitlab-org/gitlab!48823 (merged) has been in production since 2020-12-22. Copying from @cablett's revert MR description:
> Its corresponding issue gitlab-org/gitlab#226988 (closed) was blocked by gitlab-org/gitlab#288715 (closed), which should have been merged first. It's resulted in gitlab-org/gitlab#296087 (closed), so I have to revert.
> The MR itself is fine, it was just merged prematurely. It can be re-reverted (merged) after gitlab-org/gitlab#288715 (closed) is merged.
Timeline
All times UTC.
2020-12-30
- 16:48 - @greg asks in the Slack #questions channel, linking to https://forum.gitlab.com/t/service-desk-replies-not-sent-to-the-sender/46821:
  > Were there any recent changes to or incidents with Service Desk on GitLab.com in the past few days?
- 17:08 - gitlab-org/gitlab!48823 (merged) is identified as a possible culprit, with gitlab-org/gitlab#296087 (closed) referenced as an issue reproducing this bug. The severity of the bug is estimated at severity3.
- 19:49 - @cynthia sets the severity to severity1.
- 21:03 - Revert MR gitlab-org/gitlab!50699 (merged) is set up.
- 22:23 - Revert MR is merged; however, the tests in master were broken due to a date dependency.
2020-12-31
- 01:31 - MR fixing the tests is set up: gitlab-org/gitlab!50704 (merged)
- 03:01 - MR is merged.
- 08:41 - @marin declares an incident in Slack, based on the request to deploy this revert to production during the PCL (Production Change Lock).
Corrective Actions
- Propose a feature addition to warn when merging an MR related to an issue with open blocking issues: gitlab-org/gitlab#296968 (a minimal sketch of such a check follows this list)
- Add to the reviewer guidelines for community contributions a requirement to check issues and/or establish MR dependencies: gitlab-org/gitlab#298743 (closed)
- Open an issue to implement SLIs for Service Desk: gitlab-org/gitlab#298744 (closed)
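A minimal sketch of what the proposed pre-merge warning could look like, assuming the documented GitLab REST endpoints for an MR's closing issues and for issue links; the token handling, IDs, and script structure are illustrative, and response shapes should be verified against the current API docs:

```python
"""Sketch of a pre-merge "open blockers" warning (gitlab-org/gitlab#296968).

Assumptions: a GITLAB_TOKEN environment variable, and the documented GitLab
REST endpoints for an MR's closing issues and for issue links.
"""
import os

import requests

API = "https://gitlab.com/api/v4"
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}


def open_blockers(project_id: int, mr_iid: int) -> list[str]:
    """Titles of still-open issues that block the issues this MR closes."""
    closes = requests.get(
        f"{API}/projects/{project_id}/merge_requests/{mr_iid}/closes_issues",
        headers=HEADERS,
    ).json()
    blockers = []
    for issue in closes:
        links = requests.get(
            f"{API}/projects/{project_id}/issues/{issue['iid']}/links",
            headers=HEADERS,
        ).json()
        blockers += [
            link["title"]
            for link in links
            if link.get("link_type") == "is_blocked_by"
            and link["state"] == "opened"
        ]
    return blockers


if __name__ == "__main__":
    # Illustrative IDs: gitlab-org/gitlab and the MR from this incident.
    if blockers := open_blockers(278964, 48823):
        print("WARNING: linked issues are still blocked by:", blockers)
```

Run as a bot or CI step, a check like this would surface the warning without requiring the contributor to have permission to set MR dependencies themselves.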
Incident Review
Summary
- Service(s) affected: Service Desk
- Team attribution: ~"group:certify"
- Time to detection: ~8 days
- Minutes downtime or degradation: ~8 days (~11,520 minutes)
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External customers and internal customers. Specifically their customers, who received no response to their Service Desk communications even when there was a response on the issue.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Communications with their users through Service Desk issues were silently dropped.
- How many customers were affected?
  - All users of Service Desk.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - All outbound communication on Service Desk issues ceased for the duration of the incident.
What were the root causes?
gitlab-org/gitlab!48823 (merged) was reviewed and merged before gitlab-org/gitlab!48711 (merged), because:
- A merge request dependency of gitlab-org/gitlab!48823 (merged) on gitlab-org/gitlab!48711 (merged) was not set;
- An MR dependency is not automatically inferred from the blocking relationship between issues gitlab-org/gitlab#226988 (closed) and gitlab-org/gitlab#288715 (closed);
- There was no mention of the dependency in the description of gitlab-org/gitlab!48823 (merged);
- Both MRs were made by an experienced community contributor rather than an internal team member;
- Each was reviewed in isolation by different reviewers/maintainers; and
- Plan/Service Desk domain experts were not involved in the review/merge process.
Incident Response Analysis
- How was the incident detected?
  - Support tickets from customers. This was significantly delayed (~8 days) because of the nature of the bug. Customers of our customers had to realize there was a software problem and they weren't simply being ignored, then report that to the Service Desk operator, who then reported it to GitLab.
- How could detection time be improved?
  - A count of Service Desk emails sent over time should have shown a steep decline; an alert on this would make sense (see the sketch below). We don't have any smoke tests that I'm aware of for Service Desk functionality on GitLab.com. We may have a QA test that covers this, but the `package-and-qa` job wasn't run.
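As a sketch of the suggested alert signal: assuming a hypothetical counter such as `service_desk_emails_sent_total` were emitted (no such instrumentation is confirmed by this report), a Prometheus query over its rate could flag when replies stop going out. The metric name and Prometheus address are assumptions:

```python
"""Sketch of a decline alert on Service Desk email volume.

Assumptions: the counter name (service_desk_emails_sent_total) and the
Prometheus address are hypothetical; the instrumentation would have to be
added for this to work.
"""
import requests

PROMETHEUS = "http://prometheus.example.com"  # assumed address
QUERY = "sum(rate(service_desk_emails_sent_total[1h]))"


def current_send_rate() -> float:
    resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY})
    result = resp.json()["data"]["result"]
    # An empty vector means the counter reported nothing in the window.
    return float(result[0]["value"][1]) if result else 0.0


if current_send_rate() == 0.0:
    print("ALERT: no Service Desk emails sent in the last hour")
```

In practice this would live as a Prometheus alerting rule rather than a polling script; the sketch only shows the signal to alert on.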
- How was the root cause diagnosed?
  - (@cablett) Support flagged that !48823 had been recently merged. I confirmed that new Service Desk issues don't have the author listed under `external_participants`. I am familiar with the feature, so I initially thought it might be a bug in the creation logic. I then noted that the creation logic assumed the author had been added to `external_participants`. That's when I checked that the blocking issue hadn't been closed and its MR hadn't been merged, so I created a revert MR rather than considering a fix (a simplified illustration of the failure mode follows).
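A simplified illustration of the failure mode, in Python rather than GitLab's actual Ruby code: the reply path assumed a precondition (the author being in `external_participants`) that would only have been established by the unmerged MR, so the recipient list was silently empty. Both function names here are hypothetical stand-ins:

```python
"""Simplified illustration of the failure mode (not GitLab's actual code).

create_service_desk_issue stands in for the creation logic; the blocking MR
was the one that would have added the author to external_participants.
"""


def create_service_desk_issue(author_email: str, add_author: bool) -> dict:
    issue = {"external_participants": []}
    if add_author:  # this code path only exists once the blocking MR merges
        issue["external_participants"].append(author_email)
    return issue


def send_replies(issue: dict) -> None:
    # An empty list raises no error: replies are silently dropped.
    for email in issue["external_participants"]:
        print(f"sending reply to {email}")


issue = create_service_desk_issue("customer@example.com", add_author=False)
send_replies(issue)  # sends nothing, with no visible failure
```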
- How could time to diagnosis be improved?
  - (@cablett) Diagnosis was fairly straightforward, so there are no obvious inefficiencies that I can see. There was initially a bit of confusion regarding the status of Certify ("Where would I find the certify group? I don't see a Slack channel").
- How did we reach the point where we knew how to mitigate the impact?
  - (@cablett) The path was clear once it was determined that a revert was the best way forward.
- How could time to mitigation be improved?
  - (@cablett) I feel that the investigation and reversion went as smoothly as they could, even considering the broken date test. However, I suggested an improvement to the release-tools bot in https://gitlab.com/gitlab-org/release-tools/-/issues/492: it would be good if the bot left a note on the MR when it was picked into the deployment but excluded from the release due to a failed pipeline (a sketch of such a notification follows).
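A sketch of what that bot notification could look like. The notes endpoint is the standard GitLab REST API; how release-tools would detect the picked-but-excluded state is assumed here and would come from the bot's existing deployment logic:

```python
"""Sketch of the release-tools suggestion (release-tools#492).

Assumption: a GITLAB_TOKEN environment variable; the caller supplies the
project, MR, and failing pipeline URL from the bot's deployment logic.
"""
import os

import requests

API = "https://gitlab.com/api/v4"
HEADERS = {"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]}


def notify_excluded(project_id: int, mr_iid: int, pipeline_url: str) -> None:
    """Leave a note on an MR that was picked but excluded from the release."""
    body = (
        ":warning: This MR was picked into the deployment but excluded from "
        f"the release because its pipeline failed: {pipeline_url}"
    )
    requests.post(
        f"{API}/projects/{project_id}/merge_requests/{mr_iid}/notes",
        headers=HEADERS,
        data={"body": body},
    ).raise_for_status()
```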
- What went well?
  - Well-described issue titles and descriptions, with blocking relationships defined.
  - (@cablett) I'm glad I kept checking the pipeline to ensure it could deploy. When I saw the pipeline was red, I flagged it as a potential blocker to deployment and was able to identify the issue quickly and fix the date-dependent test (a sketch of the time-freezing pattern follows this list).
  - Using Slack and pinging in #s_plan was a good move.
  - The community contributor was assured that it was a process failure. (@cablett) I also reached out via DM. He has since raised gitlab-org/gitlab!51023 (merged) but cannot set an MR blocking dependency due to gitlab-org/gitlab#11393 (closed).
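The date-dependent test failure (see the 22:23 timeline entry) illustrates a general pitfall: tests that read the real clock break when the calendar moves. GitLab's suite is Ruby (where `travel_to` would serve this purpose); as a language-neutral sketch of the time-freezing pattern, here is the Python equivalent with freezegun, with a hypothetical function under test:

```python
"""Sketch of freezing time in tests so date-sensitive assertions are stable.

days_until is a hypothetical stand-in for any date-dependent logic.
"""
from datetime import date

from freezegun import freeze_time


def days_until(deadline: date) -> int:
    return (deadline - date.today()).days


@freeze_time("2020-12-30")
def test_days_until_is_stable():
    # With the clock frozen, this passes on any day the suite runs.
    assert days_until(date(2021, 1, 4)) == 5


test_days_until_is_stable()
```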
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Probably, before the introduction of MR dependencies.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - No.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Yes, by the deployment of gitlab-org/gitlab!48823 (merged).
Lessons Learned
- There is a risk of this happening when a community contributor is working on multiple, dependent MRs that are reviewed in isolation by different reviewers/maintainers, because the MR dependencies may not have been set.
- Even when the MRs are reviewed by the same people, a direct dependency might be missed because of human error. To minimize this, a domain expert should be involved in the review of larger community contributions.
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).
Incident Review Stakeholders