Incident Review: Uptick in coordinator related errors
Key Information
| Metric | Value |
|---|---|
| Customers Affected | Support noted approximately 150 tickets filed |
| Requests Affected | |
| Incident Severity | Severity::2 |
| Start Time | 2024-09-10 08:56 UTC |
| End Time | 2024-09-12 14:34 UTC |
| Total Duration | 53h 38m |
| Link to Incident Issue | 2024-09-10: Uptick in coordinator related errors (#18533 - closed) |
Summary
Customers reported that on GitLab.com some jobs were showing abnormally long execution times, even running past the defined timeout. The root cause is attributed to 2024-09-10: SidekiqServiceSidekiqQueueingApdexS... (#18538 - closed).
Details
Given the Redis failures noted in 2024-09-10: SidekiqServiceSidekiqQueueingApdexS... (#18538 - closed), customers experienced stuck CI jobs, and as a result exceeded their CI Minutes quota.
Outcomes/Corrective Actions
https://gitlab.com/gitlab-org/gitlab/-/issues/490681+ is the main issue where we are discussing how to improve our mitigation strategy for users who are unable to run additional pipelines after exceeding their CI minutes quota.
The main objective is to ensure we build better tooling for our Support Engineers and SREs to purchase additional minutes to unblock customers. Currently, Support mitigates this by resetting the customer's CI minutes quota for the entire month, but that can incur a high cost to the business depending on the day of the month on which the operation is executed (e.g. the 30th of the month vs. the 3rd).
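As a rough illustration of why the timing matters, the sketch below estimates how many already-consumed minutes are effectively forgiven when a namespace's monthly usage counter is reset on a given day. This is a hypothetical model, not GitLab's actual billing logic: the function name, the 10,000-minute quota, and the assumption that usage accrues evenly across the month are all illustrative.

```python
import calendar
from datetime import date


def estimated_write_off(reset_date: date, monthly_quota: int) -> float:
    """Rough estimate of CI minutes written off when a namespace's monthly
    usage counter is reset to zero on ``reset_date``.

    Hypothetical assumption: usage accrues roughly evenly across the month,
    so a reset on day N forgives about N / days_in_month of the quota that
    has already been consumed.
    """
    days_in_month = calendar.monthrange(reset_date.year, reset_date.month)[1]
    fraction_of_month_elapsed = reset_date.day / days_in_month
    return monthly_quota * fraction_of_month_elapsed


# Example with a hypothetical 10,000-minute monthly quota:
print(estimated_write_off(date(2024, 9, 3), 10_000))   # ~1,000 minutes forgiven
print(estimated_write_off(date(2024, 9, 30), 10_000))  # ~10,000 minutes forgiven
```

Under these assumptions, a reset late in the month forgives roughly ten times as many minutes as one early in the month, which is why the approval and timing of a bulk reset carry a real business cost.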
Learning Opportunities
What went well?
- The Verify team was able to ramp up quickly on a domain they had limited knowledge about (thank you @mfanGitLab @tianwenchen @allison.browne!! 🙏). We also appreciated the additional metrics provided around the potential CI minutes that would be reset and the potential costs to the business.
- We also appreciated the response from Fulfillment:Utilization, given the limited number of engineers available in the AMER timezones.
What was difficult?
- It was unclear until EOD AMER that this incident required the expertise of the Verify teams to troubleshoot. Verify also has only one engineer in APAC, with limited expertise in CI Minutes.
- While preparing the script to perform a bulk reset of CI Minutes, it was unclear who would be the ultimate DRI to approve this operation, given the cost this reset would incur for GitLab. Initially, it was discussed that we needed to reset all namespaces on gitlab.com on the 10th of September.
- It was difficult to narrow down the affected customers, as the team could not easily differentiate between valid pipeline failures and pipeline failures caused by this incident.
- It was unclear how we could purchase additional minutes for customers, so we ended up resetting their monthly CI Minutes quota to unblock them instead.
Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
- Announce the incident review in the incident channel on Slack:
:mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
If you have any review feedback please add it to <ISSUE_LINK>.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
- Create a few short sentences in the Summary section summarizing what happened (TL;DR)
- Use the description section to write a few paragraphs explaining what happened
- Link any corrective actions and describe any other actions or outcomes from the incident
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Add any appropriate labels based on the incident issue and discussions
- Once discussion wraps up in the comments, summarize any takeaways in the Details section
- Close the review before the due date