Incident Review: Uptick in coordinator related errors
Key Information
| Metric | Value |
|---|---|
| Customers Affected | Support noted approximately 150 tickets filed |
| Requests Affected | |
| Incident Severity | Severity::2 |
| Start Time | 2024-09-10 08:56 UTC |
| End Time | 2024-09-12 14:34 UTC |
| Total Duration | 53h 38m |
| Link to Incident Issue | 2024-09-10: Uptick in coordinator related errors (#18533 - closed) |
Summary
Customers reported that on GitLab.com some jobs were showing abnormally long execution times, even running past the defined timeout. The root cause is attributed to 2024-09-10: SidekiqServiceSidekiqQueueingApdexS... (#18538 - closed).
Details
Given the Redis failures noted in 2024-09-10: SidekiqServiceSidekiqQueueingApdexS... (#18538 - closed), customers experienced stuck CI jobs, and as a result exceeded their CI Minutes quota.
Outcomes/Corrective Actions
https://gitlab.com/gitlab-org/gitlab/-/issues/490681+ is the main issue where we are discussing how to improve our mitigation strategy for users who are unable to run additional pipelines after exceeding their CI minutes quota.
The main objective is to ensure we build better tooling for our Support Engineers and SREs to purchase additional minutes to unblock customers. Currently, Support mitigates this by resetting the customer's CI minutes quota for the entire month, but that can incur a high cost to the business depending on the day of the month on which the operation is executed (e.g. the 30th of the month vs. the 3rd).
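As a rough illustration of why the timing matters, the sketch below estimates how many already-consumed minutes are effectively forgiven when a namespace's monthly usage counter is reset on a given day. This is a hypothetical model, not GitLab's actual billing logic: the function name, the 10,000-minute quota, and the assumption that usage accrues evenly across the month are all illustrative.

```python
import calendar
from datetime import date


def estimated_write_off(reset_date: date, monthly_quota: int) -> float:
    """Rough estimate of CI minutes written off when a namespace's monthly
    usage counter is reset to zero on ``reset_date``.

    Hypothetical assumption: usage accrues roughly evenly across the month,
    so a reset on day N forgives about N / days_in_month of the quota that
    has already been consumed.
    """
    days_in_month = calendar.monthrange(reset_date.year, reset_date.month)[1]
    fraction_of_month_elapsed = reset_date.day / days_in_month
    return monthly_quota * fraction_of_month_elapsed


# Example with a hypothetical 10,000-minute monthly quota:
print(estimated_write_off(date(2024, 9, 3), 10_000))   # ~1,000 minutes forgiven
print(estimated_write_off(date(2024, 9, 30), 10_000))  # ~10,000 minutes forgiven
```

Under these assumptions, a reset late in the month forgives roughly ten times as many minutes as one early in the month, which is why the approval and timing of a bulk reset carry a real business cost.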
Learning Opportunities
What went well?
- The Verify team was able to ramp up quickly on a domain they had limited knowledge about (thank you @mfanGitLab @tianwenchen @allison.browne!! 🙏). We also appreciated the additional metrics provided around the potential CI minutes that would be reset and the potential costs to the business.
- We also appreciated the response from Fulfillment:Utilization, given the limited number of engineers available in the AMER timezones.
What was difficult?
- It was unclear until EOD AMER that this incident required the expertise of the Verify teams to troubleshoot. Verify also has only one engineer in APAC, with limited expertise in CI Minutes.
- While preparing the script to perform a bulk reset of CI Minutes, it was unclear who would be the ultimate DRI to approve this operation, given the cost this reset would incur for GitLab. Initially, it was discussed that we needed to reset all namespaces on gitlab.com on the 10th of September.
- It was difficult to narrow down the affected customers, as the team could not easily differentiate between valid pipeline failures and pipeline failures caused by this incident.
- It was unclear how we could purchase additional minutes for customers, so we ended up resetting their monthly CI Minutes quota to unblock them instead.
Review Guidelines
This review should be completed by the team which owns the service causing the alert. That team has the most context around what caused the problem and what information will be needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service owning team, they should assign someone from that team as the DRI.
For the person opening the Incident Review
- Set the title to Incident Review: (Incident issue name)
- Assign a Service::* label (most likely matching the one on the incident issue)
- Set a Severity::* label which matches the incident
- In the Key Information section, make sure to include a link to the incident issue
- Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
- Announce the incident review in the incident channel on Slack:
:mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
If you have any review feedback please add it to <ISSUE_LINK>.
For the assigned DRI
- Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
- If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
- Create a few short sentences in the Summary section summarizing what happened (TL;DR)
- Use the description section to write a few paragraphs explaining what happened
- Link any corrective actions and describe any other actions or outcomes from the incident
- Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
- Add any appropriate labels based on the incident issue and discussions
- Once discussion wraps up in the comments, summarize any takeaways in the Details section
- Close the review before the due date