2024 September Incidents and resulting PCL / FCL
Our availability is at 99.92% and it's only the second week of September.
| Severity | Name | Root cause | Action |
|---|---|---|---|
| severity1 | 2024-09-03: The sidekiq_queueing SLI of the sidekiq service on shard urgent-cpu-bound has an apdex violating SLO (incident review) | Symptom: DB connection pool starvation, with many idle-in-transaction connections. Improvements were identified for some workers (FlushCounterIncrementsWorker, https://gitlab.com/gitlab-org/gitlab/-/issues/482785, grouputilization), but these are likely not the root cause. | 0. Incident review in production#18494 (closed). 1. Infradev issues for workers that need to be addressed. 2. Create a circuit breaker for Sidekiq. 3. Create new connection pools for the urgent shards. |
| severity1 | 2024-09-03: GitLab.com is down (incident review) | Human error when making a database change (groupdatabase) | 1. Incident review in production#18491 (closed). 2. Summary will be linked to this comment. |
| severity2 | 2024-09-06: 500 error when trying to access repos/projects for some users (incident review) | Feature flag change caused Gitaly CPU saturation | 1. Incident review in production#18515 (closed). 2. Summary will be linked to this comment. |
| severity2 | 2024-09-10: Uptick in coordinator related errors (incident review) | Stuck builds were marked as cancelled, which consumed CI minutes. Some users exhausted their minutes very quickly because old stuck jobs were cancelled. (grouppipeline execution) | 1. Incident review in production#18550. 2. FCL requested (see comment); waiting on outcome of review. |
| severity1 | 2024-09-10: Increased errors on GitLab.com (incident review) | Failing disk. | 1. Support case open with GCP. 2. Incident review in production#18536 (closed). |
| severity1 | 2024-09-10: SidekiqServiceSidekiqQueueingApdexSLOViolationSingleShard (incident review) | Sidekiq workers went bad and stopped processing jobs. | 1. Root cause under investigation. 2. Incident review in production#18539 (closed). |
| severity2 | 2024-09-11: InactiveTokensDeletionCronWorker (incident review) | A newly released worker, ResourceAccessTokens::InactiveTokensDeletionCronWorker, deleted 1456 users that it shouldn't have. (groupauthentication) | 1. Incident review in production#18549 (closed). 2. FCL requested in comment. |
| severity1 security | SIRT Incident | | |
| severity2 | 2024-09-16: GitLab.com is down | | |
Production Change Locks
We discussed in a Slack thread whether a PCL was needed. The consensus was that a stable environment would provide space for further investigation and help preserve our availability (currently 99.92%).
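
For a sense of scale, the availability figure can be converted into minutes of downtime. This is only a rough sketch: the helper `downtime_minutes` is hypothetical, and it assumes the 99.92% figure is measured over all 30 days of September, which may not match how the availability number is actually calculated.

```python
# Hypothetical helper (not part of any GitLab tooling): converts an
# availability percentage into the downtime it implies over a window.
def downtime_minutes(availability_pct: float, window_days: int) -> float:
    """Minutes of unavailability implied by availability_pct over window_days."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100.0 - availability_pct) / 100.0


# Assumption: 99.92% is measured across the full 30-day month.
september = downtime_minutes(99.92, 30)
print(f"{september:.1f} minutes")  # prints "34.6 minutes"
```

Under that assumption, 99.92% corresponds to roughly 35 minutes of unavailability across the month, so each additional outage has a visible effect on the figure.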
Merge Requests for Production Change Locks
Edited by Rachel Nienaber