Incident Review: SLO violations for API error rates on GitLab.com
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
- If there is a need to schedule a synchronous review, complete the following steps:
  - In this issue, @mention the EOC, IMOC and other parties who were involved to let them know that we would like to schedule a sync review discussion of this issue.
  - Schedule a meeting that works best for those involved, and put a link to this review issue in the agenda. The meeting should primarily discuss what is already documented in this issue, and any questions that arise from it.
  - Ensure that the meeting is recorded. When complete, upload the recording to GitLab Unfiltered.
Customer Impact
-
Who was impacted by this incident? (i.e. external customers, internal customers)
- There was an increased error rate for deployment jobs using the GitLab agent for 4 hours and 10 minutes. There were 8 customer tickets filed for this issue.
-
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- Customers using the GitLab Agent were seeing deployment errors in their pipelines, such as: Error from server (InternalError): an error on the server ("unknown") has prevented the request from succeeding
-
How many customers were affected?
- 8 customer tickets were filed and there were approximately 1000 unique projects affected.
What were the root causes?
- The root cause of the incident has been identified as a post-deploy migration in gitlab-org/gitlab!144939 (merged). This was resolved by removing the not null constraint in gitlab-org/gitlab!145790 (merged).
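As a generic illustration of this class of failure, the sketch below shows how a NOT NULL constraint rejects writes from a code path that does not populate the constrained column. It does not reproduce the migration in gitlab-org/gitlab!144939. The table and column names are hypothetical, and SQLite stands in for the production PostgreSQL database only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical tables: identical except for the NOT NULL constraint on owner_id.
# These are NOT the tables touched by the actual migration.
conn.execute(
    "CREATE TABLE records_without_constraint "
    "(id INTEGER PRIMARY KEY, name TEXT, owner_id INTEGER)"
)
conn.execute(
    "CREATE TABLE records_with_constraint "
    "(id INTEGER PRIMARY KEY, name TEXT, owner_id INTEGER NOT NULL)"
)


def try_insert(table: str) -> str:
    """Insert a row that leaves owner_id unset, as older writer code might."""
    try:
        conn.execute(f"INSERT INTO {table} (name) VALUES (?)", ("example",))
        return "ok"
    except sqlite3.IntegrityError as exc:
        # The database refuses the write; callers see this as a server error.
        return f"rejected: {exc}"


before = try_insert("records_without_constraint")
after = try_insert("records_with_constraint")

print("without constraint:", before)
print("with constraint:   ", after)
```

Removing the constraint, as the fix MR did, corresponds to the first table: the same write succeeds again without any change to the writer.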
Incident Response Analysis
-
How was the incident detected?
- EOC received the alert: firing - Service api (gprd) | ApiServiceRailsRequestErrorSLOViolationRegional.
-
How could detection time be improved?
-
✅ Alert received at 09:30 UTC, incident was declared at 09:34 UTC.
-
-
How was the root cause diagnosed?
- First, the team looked into recent deployments, and a CNY deploy was suspected.
- The CNY deploy didn't quite align with the errors seen, so the release manager suspected the post-deploy migrations as a related cause.
- Thanks to @timofurrer, who was able to identify this MR as the root cause.
-
How could time to diagnosis be improved?
- The root cause was first identified at 09:58 UTC, which is within 30m of the alert.
- How did we reach the point where we knew how to mitigate the impact?
-
How could time to mitigation be improved?
- The DRI for reviewing and merging the fix MR was not clear.
- There could be room for improvement in the deploy pipeline, as it took a while to get the issue fixed.
- Related discussion: #17665 (comment 1790883642)
- We should be faster at MR review. We suggest using Infra Dev Escalation.
- Note that we were in an unlucky situation in this incident, because two other incidents blocked the fix for this one (#17658 (closed) and #17659 (closed)).
- In case the issue only happens on gprd-cny, we should drain the environment (/chatops run canary --drain --production) to mitigate the issue immediately.
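The response timings quoted in this section can be checked with a few lines of arithmetic. This is a minimal sketch that uses only the three timestamps recorded above (alert at 09:30 UTC, declaration at 09:34 UTC, root cause identified at 09:58 UTC). The helper name and constants are illustrative, and all three events are assumed to fall on the same day.

```python
from datetime import datetime

FMT = "%H:%M"


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from start to end, both HH:MM UTC on the same day."""
    delta = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
    return int(delta.total_seconds() // 60)


# Timestamps taken from this review (UTC).
ALERT_RECEIVED = "09:30"
INCIDENT_DECLARED = "09:34"
ROOT_CAUSE_IDENTIFIED = "09:58"

time_to_declare = minutes_between(ALERT_RECEIVED, INCIDENT_DECLARED)
time_to_diagnose = minutes_between(ALERT_RECEIVED, ROOT_CAUSE_IDENTIFIED)

print(f"alert -> incident declared:      {time_to_declare} min")
print(f"alert -> root cause identified:  {time_to_diagnose} min")
```

This gives 4 minutes from alert to declaration and 28 minutes from alert to root cause, consistent with the "within 30m" statement above.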
Post Incident Analysis
- Did we have other events in the past with the same root cause?
-
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
- No, all the related corrective actions were created as part of this incident.
-
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
- Incident was triggered by the Post Deployment migrations: https://ops.gitlab.net/gitlab-org/release/tools/-/pipelines/2909248
What went well?
- The collaboration from different teams was fantastic. Thanks @timofurrer @ahegyi @alipniagov @tigerwnz for all of your fruitful work here 🤝
- Time to detection and root cause identification were relatively quick.
Guidelines
Edited by Dat Tang