2020-07-14: 500 errors returned by the GitLab API, mostly when triggering other pipelines
Summary
A customer has reported seeing an increased occurrence of 500s being returned by the GitLab API in their pipelines, mostly when triggering other pipelines.
We've determined that anyone using child pipelines, and possibly others, is receiving 500 errors due to database timeouts.
Workarounds
Workaround 1: Project Settings
For the affected project, temporarily disable the option Settings > CI/CD > Auto-cancel redundant, pending pipelines so that all new pipelines can start properly.
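For anyone who needs to apply this across several projects, the same setting can probably be toggled through the Projects API rather than the UI. The following is a minimal sketch, assuming the project attribute `auto_cancel_pending_pipelines` (values `enabled` / `disabled`) is available on your GitLab version. The host `gitlab.example.com`, the token, and the function name are hypothetical placeholders. The function only builds the request; nothing is sent until it is passed to `urllib.request.urlopen`.

```python
import json
import urllib.parse
import urllib.request


def build_disable_auto_cancel_request(base_url, project_id, token):
    """Build (but do not send) a PUT request that disables auto-cancel.

    Assumption: the project attribute is named auto_cancel_pending_pipelines
    and takes the string "disabled" (it is not a boolean).
    """
    # Project IDs may be numeric or a URL-encoded "group/project" path.
    project = urllib.parse.quote(str(project_id), safe="")
    body = json.dumps({"auto_cancel_pending_pipelines": "disabled"}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/v4/projects/{project}",
        data=body,
        method="PUT",
        headers={"PRIVATE-TOKEN": token, "Content-Type": "application/json"},
    )


# Hypothetical usage (performs a real HTTP call, so it is left commented out):
# req = build_disable_auto_cancel_request("https://gitlab.example.com", "group/project", "<token>")
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

Because this is a temporary workaround, sending the same request with the value `enabled` should restore the setting once the fix is deployed.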
Workaround 2: Retry Job
Cancel any of the jobs currently in the created state in the stuck created pipeline, then retry the job(s). Retrying re-runs pipeline processing, which then handles the rest of the pipeline properly.
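The cancel-and-retry step can also be scripted against the Jobs API, which helps when many pipelines are stuck. This is a sketch only, assuming the standard endpoints for listing pipeline jobs (filtered by `scope[]=created`) and for cancelling and retrying a job. The host, token, and helper names are hypothetical. The helpers build requests without sending them.

```python
import urllib.parse
import urllib.request


def _project_url(base_url, project_id):
    # Project IDs may be numeric or a URL-encoded "group/project" path.
    project = urllib.parse.quote(str(project_id), safe="")
    return f"{base_url.rstrip('/')}/api/v4/projects/{project}"


def list_created_jobs_request(base_url, project_id, pipeline_id, token):
    """Build a GET request listing jobs in the 'created' state for a pipeline."""
    query = urllib.parse.urlencode({"scope[]": "created"})
    url = f"{_project_url(base_url, project_id)}/pipelines/{pipeline_id}/jobs?{query}"
    return urllib.request.Request(url, method="GET", headers={"PRIVATE-TOKEN": token})


def cancel_and_retry_requests(base_url, project_id, job_id, token):
    """Build the two POST requests, in order: cancel the job, then retry it."""
    base = f"{_project_url(base_url, project_id)}/jobs/{job_id}"
    return [
        urllib.request.Request(f"{base}/{action}", method="POST",
                               headers={"PRIVATE-TOKEN": token})
        for action in ("cancel", "retry")
    ]


# Hypothetical usage (performs real HTTP calls, so it is left commented out):
# for req in cancel_and_retry_requests("https://gitlab.example.com", 123, 456, "<token>"):
#     urllib.request.urlopen(req).close()
```

Per the workaround, a single job from the stuck pipeline should be enough; there is no need to loop over every created job.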
Timeline
All times UTC.
2020-07-14
- ...
- 21:39 - Workarounds identified #2407 (comment 379424444)
- 22:08 - Verifying staging deployment of patch.
- 23:06 - Production deployment pending on Sidekiq pods in Kubernetes.
- 23:15 - Production deployment complete.
Incident Review
Summary
- Service(s) affected :
- Team attribution :
- Minutes downtime or degradation :
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
- How could detection time be improved?
- How did we reach the point where we knew how to mitigate the impact?
- How could time to mitigation be improved?
Post Incident Analysis
- How was the root cause diagnosed?
- How could time to diagnosis be improved?
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change?
Timeline
- YYYY-MM-DD XX:YY UTC: action X taken
- YYYY-MM-DD XX:YY UTC: action Y taken
5 Whys
Lessons Learned
Corrective Actions
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)
Edited by AnthonySandoval