Incident Review: Connection errors on gitaly causing 500s
Incident Review
The DRI for the incident review is the issue assignee.
- If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated, and relevant graphs are included.
- If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - External customers
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - 500 errors when trying to reach a repository
- How many customers were affected?
  - See #16187 (comment 1517550012) for a detailed impact analysis.
  - 16,320 IP addresses were impacted (workhorse logs don't identify the customer) out of a total of 486,870, i.e. 3.35%.
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - Total error rate for the entire time period was 0.14%.
What were the root causes?
- A memory leak (caused by a goroutine leak) introduced in gitlab-org/gitaly@023c2a7d; the mitigation was to roll this back in gitlab-org/gitaly!6236 (merged).
- The limiter creates certain data structures and spawns goroutines to manage them. When the limiter cleans up those data structures, the goroutines don't stop and still hold references to them, so the GC can never reclaim them completely (see the sketch after this list).
- The code is part of a new reliability feature, Gitaly adaptive concurrency limiting (gitlab-org&10734 - closed). The feature had not been enabled for any actors; the incident was caused by the new semaphore implementation, which should not have changed how the limiting functionality works. After the rollback, the limiter uses the old semaphore.
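For illustration only, here is a minimal, hypothetical Go sketch of the leak pattern described above. The names (`limiter`, `queue`, `getQueue`, `cleanup`) are invented for this example and are not Gitaly's actual code; the point is that per-key state gets a managing goroutine, but cleaning up the state never stops the goroutine, so the state stays reachable.

```go
package main

import (
	"sync"
	"time"
)

// queue is a stand-in for the per-key bookkeeping state the limiter manages.
type queue struct {
	entries []string
}

type limiter struct {
	mu     sync.Mutex
	queues map[string]*queue
}

// getQueue lazily creates per-key state and spawns a goroutine to manage it.
func (l *limiter) getQueue(key string) *queue {
	l.mu.Lock()
	defer l.mu.Unlock()

	q, ok := l.queues[key]
	if !ok {
		q = &queue{}
		l.queues[key] = q

		// BUG: this goroutine has no stop signal. Even after cleanup removes
		// the queue from the map, the goroutine keeps running and keeps q
		// reachable, so the garbage collector can never reclaim it.
		go func() {
			for {
				time.Sleep(time.Second)
				_ = len(q.entries)
			}
		}()
	}
	return q
}

// cleanup removes the per-key state but never stops the goroutine above --
// this mismatch is the leak pattern described in the root cause.
func (l *limiter) cleanup(key string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	delete(l.queues, key)
}

func main() {
	l := &limiter{queues: map[string]*queue{}}
	l.getQueue("repo-1")
	l.cleanup("repo-1") // the queue is gone from the map, the goroutine is not
	time.Sleep(2 * time.Second)
}
```

The usual fix for this pattern is to give each spawned goroutine a stop signal (for example, a done channel or a context that cleanup cancels), so that removing the data structure also ends the goroutine that references it.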
Incident Response Analysis
- How was the incident detected?
  - The on-call engineer was paged.
- How could detection time be improved?
  - Develop early-warning alerts on memory usage trends (see the sketch after this list).
- How was the root cause diagnosed?
  - Timeline correlation with production changes.
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - One of the Gitaly engineers found the code causing the incident.
- How could time to mitigation be improved?
  - ...
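As a rough illustration of the "early-warning alerts on memory usage trends" idea mentioned under detection above, the following is a minimal, hypothetical Go sketch of an in-process watchdog that samples heap usage and goroutine counts and logs once both keep growing. The function name and thresholds are invented for this example; in practice this signal would more likely live in the monitoring stack as a trend-based alert rather than inside the service.

```go
package main

import (
	"log"
	"runtime"
	"time"
)

// watchMemoryGrowth samples heap usage and the goroutine count at a fixed
// interval and logs a warning once both have grown for `window` consecutive
// samples -- a crude stand-in for a trend-based early-warning alert.
func watchMemoryGrowth(interval time.Duration, window int) {
	var (
		prevHeap       uint64
		prevGoroutines int
		growingFor     int
	)

	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for range ticker.C {
		var ms runtime.MemStats
		runtime.ReadMemStats(&ms)
		goroutines := runtime.NumGoroutine()

		if ms.HeapInuse > prevHeap && goroutines >= prevGoroutines {
			growingFor++
		} else {
			growingFor = 0
		}
		prevHeap, prevGoroutines = ms.HeapInuse, goroutines

		if growingFor >= window {
			log.Printf("possible leak: heap_inuse=%d bytes, goroutines=%d, growing for %d samples",
				ms.HeapInuse, goroutines, growingFor)
		}
	}
}

func main() {
	// Sample every 30 seconds; warn after 10 consecutive growing samples.
	go watchMemoryGrowth(30*time.Second, 10)
	select {} // keep the process alive so the watchdog can run
}
```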
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Not the same root cause, but also a memory leak: https://gitlab.com/gitlab-org/gitaly/-/issues/4732+
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - During the Praefect incident (https://gitlab.com/gitlab-org/gitaly/-/issues/4732#note_1249369327), the idea of a soak test was investigated in gitlab-org/quality/performance#556 (closed), but no conclusion was reached.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
What went well?
- ...
Corrective Actions
- The main Gitaly process is not a part of its cg... (gitlab-org/gitaly#5535 - closed)
- Detect test packages that do not call `testhelp... (gitlab-org/gitaly#5522 - closed)
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24289+
- https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/24288+