2024-09-10: SidekiqServiceSidekiqQueueingApdexSLOViolationSingleShard
Customer Impact
Background processing is slow. The main impact is that CI is not working properly: jobs are not shown as completed because job traces are unavailable.
Current Status
14:14 - The Redis trace chunks instance is running out of memory, which affects the availability of job logs. Redis is generally becoming saturated, which affects background processing.
14:18 - We’re going to try scaling up Redis trace chunks as a temporary mitigation strategy.
14:52 - Promoted to S1 incident
15:06 - The resized redis-tracechunks fleet has now stabilized. Memory utilization is still growing, but we have quite a bit of headroom.
15:41 - We have restarted affected pods.
16:03 - Restarting the pods improved the situation. Sidekiq workers have started to process jobs from the queue, and the queue size is decreasing. We are monitoring the status of the application.
16:31 - We continue to monitor the status of the application. Background jobs started after the pod restart should be processed normally. However, older ones are still in the queue. It might take ~20-30 minutes until the queue is empty.
16:46 - The background jobs queue is empty. All stuck jobs should be processed. We're verifying that the system is stable.
17:27 - The incident is resolved.
📝 Summary for CMOC notice / Exec summary:
- Customer Impact: Slow background processing and CI not working properly
- Service Impact: ServiceSidekiq, ServiceRedis
- Impact Duration: 13:40 - 17:37 (237 minutes)
- Root cause: The bad disk from "2024-09-10: Increased errors on GitLab.com" (#18535 - closed) was likely the initial trigger
📚 References and helpful links
Recent Events (available internally only):
- Feature Flag Log - Chatops to toggle Feature Flags Documentation
- Infrastructure Configurations
- GCP Events (e.g. host failure)
Deployment Guidance
- Deployments Log | GitLab.com Latest Updates
- Reach out to Release Managers for S1/S2 incidents to discuss Rollbacks, Hot Patching or speeding up deployments. | Rollback Runbook | Hot Patch Runbook
Use the following links to create issues related to this incident if additional work needs to be completed after it is resolved:
- Corrective action ❙ Infradev
- Incident Review ❙ Infra investigation followup
- Confidential Support contact ❙ QA investigation
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page. This might include the summary, timeline, or any other bits of information. Any such confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Security Note: If anything abnormal is found during the course of your investigation, please do not hesitate to contact security.