Incident Review: Uptick in 429 errors, unexpected authentication errors
Incident Review
The DRI for the incident review is the issue assignee.
- Announce the incident review in the incident channel on Slack:
  > :mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
  > If you have any review feedback please add it to <ISSUE_LINK>.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out the relevant sections below or link to the meeting review notes that cover these topics.
- If there is a need to schedule a synchronous review, complete the following steps:
  - In this issue, @-mention the EOC, IMOC, and other parties who were involved with whom we would like to schedule a sync review discussion of this issue.
  - Schedule a meeting that works best for those involved, and link this review issue in the agenda. The meeting should primarily discuss what is already documented in this issue and any questions that arise from it.
  - Ensure that the meeting is recorded; when complete, upload the recording to GitLab Unfiltered.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Both external and internal customers were impacted.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Customers experienced an increase in rate-limited requests, which prevented them from accessing services where rate limits are implemented, such as Web, API, and Git. Customers who stayed within the rate limits could still access the endpoints. The primary impact was on the Git service over HTTPS.
- How many customers were affected?
  - Customers in around 50 root namespaces were impacted, mainly customers on our rate-limiting allowlist and anyone using the Git service over HTTPS. ref: https://log.gprd.gitlab.net/app/r/s/HgIoW
  - Based on PagerDuty, we had four different support tickets raised by customers for being rate limited. ref: #18173 (comment 1959467400)
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Based on the logs (https://log.gprd.gitlab.net/app/r/s/vWNHc), approximately 2,000,000 requests were bypassing rate limiting in a 5-minute window before the incident. During the incident, this number dropped to 15,000, indicating that approximately 1,985,000 requests were affected every 5 minutes (see the sketch below).
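The linked saved search is the authoritative source for these numbers. Purely as an illustration, a per-5-minute count like this can be pulled from Elasticsearch with a date histogram aggregation; the index pattern and the bypass-marker field below are hypothetical placeholders, not our actual log schema:

```
# Hypothetical sketch only: index pattern and field names are invented.
GET pubsub-haproxy-inf-gprd-*/_search
{
  "size": 0,
  "query": {
    "term": { "json.rate_limit_bypassed": true }
  },
  "aggs": {
    "requests_per_5m": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "5m" }
    }
  }
}
```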
What were the root causes?
- A configuration change to our HAProxy changed how our application-level rate limiting is handled for users.
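To make the failure mode concrete, here is a minimal, hypothetical sketch of the kind of mechanism involved: the edge marks allowlisted traffic with a header, and the application skips its own rate limiting when that header is present. The header name and allowlist file path are invented for illustration and are not our actual configuration:

```
# Hypothetical sketch only -- not the actual GitLab.com HAProxy config.
frontend https
    bind :443 ssl crt /etc/haproxy/certs/

    # Sources on the rate-limit allowlist (placeholder file path)
    acl rate_limit_allowlisted src -f /etc/haproxy/rate-limit-allowlist.lst

    # Never trust a client-supplied copy of the bypass header
    http-request del-header X-RateLimit-Bypass
    # Mark allowlisted traffic; the application skips its own
    # rate limiting when this header is present
    http-request set-header X-RateLimit-Bypass 1 if rate_limit_allowlisted

    default_backend web
```

In a setup like this, a change that drops or renames such a marker would subject every previously allowlisted request to application-level limits at once, which matches the sudden 429 spike observed here.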
Incident Response Analysis
- How was the incident detected?
  - A customer reached out to our support team through PagerDuty: https://gitlab.pagerduty.com/incidents/Q0W2XXYTD8YUSL?utm_campaign=channel
  - A few minutes later, we noticed a huge spike in HTTP 429 errors starting at 05:55 UTC. ref: https://log.gprd.gitlab.net/app/r/s/ORsMm
- How could detection time be improved?
  - An alert on an increase in 429 responses would have been a good indicator; see the sketch after this list.
  - Having Chef configuration changes displayed in the Elastic config log: https://nonprod-log.gitlab.net/app/r/s/gHvDq
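As a sketch of the first point, a Prometheus-style alert could page when the share of 429 responses rises. The metric name and threshold below are placeholders, assuming a request counter labelled by HTTP status code:

```yaml
# Illustrative sketch: metric name and threshold are placeholders.
groups:
  - name: rate-limiting
    rules:
      - alert: RateLimited429Spike
        # Fire when more than 5% of responses over 5 minutes are 429s
        expr: |
          sum(rate(http_responses_total{code="429"}[5m]))
            /
          sum(rate(http_responses_total[5m])) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "More than 5% of requests are being rate limited (429)"
```

A ratio-based expression like this avoids paging on absolute traffic growth and only fires when rate limiting itself becomes disproportionate.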
- How was the root cause diagnosed?
  - By verifying the timeline against the changes made to production, we identified the configuration change that caused the incident.
- How could time to diagnosis be improved?
- How did we reach the point where we knew how to mitigate the impact?
  - Once we identified that the configuration change was causing the 429 spike, we decided to revert the change and roll back to the previously working version.
- How could time to mitigation be improved?
  - If the configuration change had been present in the Elastic config log, it would have improved our time to mitigate. We already have a corrective-action issue for it: production-engineering#25536
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - #8101 (closed) is similar; it also caused rate limiting to be triggered due to a misconfiguration.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Completing gitlab-com/runbooks#159 (moved) would greatly improve detection time; it would tell us there is a problem before our customers notice.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - The incident was triggered by this CR: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/18169
What went well?
- Once the incident was declared, several people joined a synchronous call and mitigated the issue quickly.
- We had only one change that exactly matched our timeline.
- The engineer who made the change participated in the incident and quickly prepared the revert.