2024 September Incidents and resulting PCL / FCL

Our availability is at 99.92% and it's only the second week of September.
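To put the 99.92% figure in context, here is a minimal sketch that converts an availability percentage into minutes of implied unavailability. It assumes availability is a plain time-based fraction over a fixed window. The 14-day and 30-day windows and the function name `downtime_minutes` are hypothetical, chosen for illustration; the actual availability calculation may be weighted differently.

```python
def downtime_minutes(availability_pct: float, window_days: float) -> float:
    """Minutes of unavailability implied by an availability percentage.

    Assumption: availability is a simple fraction of wall-clock time
    over the window, which may not match how the real metric is computed.
    """
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Hypothetical windows: month to date (about 14 days) and a full 30-day month.
    for days in (14, 30):
        minutes = downtime_minutes(99.92, days)
        print(f"{days} days at 99.92%: {minutes:.1f} min unavailable")
```

Under these assumptions, 99.92% corresponds to roughly 16 minutes over 14 days, or about 35 minutes if the same rate held for a full 30-day month.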

| Severity | Name | Root cause | Action |
|---|---|---|---|
| severity::1 | 2024-09-03: The sidekiq_queueing SLI of the sidekiq service on shard urgent-cpu-bound has an apdex violating SLO (incident review) | Symptom: DB connection pool starvation, with many idle-in-transaction connections. Improvements were identified for some workers (FlushCounterIncrementsWorker, https://gitlab.com/gitlab-org/gitlab/-/issues/482785, group::utilization), but these are likely not the root cause. | 0. Incident review in production#18494 (closed)<br>1. Infradev issues for workers that need to be addressed.<br>2. Create a circuit breaker for Sidekiq.<br>3. Create new connection pools for the urgent shards. |
| severity::1 | 2024-09-03: GitLab.com is down (incident review) | Human error when making a database change (group::database) | 1. Incident review in production#18491 (closed).<br>2. Summary will be linked to this comment. |
| severity::2 | 2024-09-06: 500 error when trying to access repos/projects for some users (incident review) | Feature flag change caused Gitaly CPU saturation | 1. Incident review in production#18515 (closed)<br>2. Summary will be linked to this comment. |
| severity::2 | 2024-09-10: Uptick in coordinator related errors (incident review) | group::pipeline execution: Stuck builds were marked as cancelled, which consumed CI minutes. Some users exhausted their minutes very quickly because old stuck jobs were cancelled. | 1. Incident review in production#18550<br>2. FCL requested, waiting on outcome of review (comment). |
| severity::1 | 2024-09-10: Increased errors on GitLab.com (incident review) | Failing disk. | 1. Support case open with GCP.<br>2. Incident review in production#18536 (closed) |
| severity::1 | 2024-09-10: SidekiqServiceSidekiqQueueingApdexSLOViolationSingleShard (incident review) | Sidekiq workers went bad and stopped processing jobs. | 1. Root cause under investigation.<br>2. Incident review in production#18539 (closed) |
| severity::2 | 2024-09-11: InactiveTokensDeletionCronWorker (incident review) | group::authentication: A newly released worker, ResourceAccessTokens::InactiveTokensDeletionCronWorker, deleted 1456 users that it shouldn't have. | 1. Incident review in production#18549 (closed).<br>2. FCL requested in comment |
| severity::1 | security SIRT Incident | | |
| severity::2 | 2024-09-16: GitLab.com is down | | |

Production Change Locks

We discussed in a Slack thread whether a PCL was needed. The consensus was that a stable environment would provide space for further investigation and help preserve our availability (currently 99.92%).

Merge Requests for Production Change Locks

Edited by Rachel Nienaber