2021-11-22: Increased error rates across gitlab.com WEB and API services
Current Status
This incident has impacted general performance and resulted in timeouts for some actions. We are continuing to investigate the issue at this time. Our investigation has included reviewing and changing a few feature flags, database troubleshooting, and analysis of recent code changes. A rollback of a deploy is also in progress.
Quick reference:
- #5952 (comment 748517574) - Summary of findings: the problem, the fix, unintuitive quirks, risk of recurrence, etc.
- #5952 (comment 751163463) - More detailed walk-through of the pathology
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
All times UTC.
2021-11-22
- 13:52 - @rehab declares incident in Slack.
- 14:11 - @igorwwwwwwwwwwwwwwwwwwww blocks one of the heavy endpoints via cloudflare.
- 14:12 - @rehab reverts FF jupyter_clean_diffs to false.
- 14:47 - @igorwwwwwwwwwwwwwwwwwwww enabled the refresh_widget route for the gitlab-com/runbooks project.
- 16:09 - FF ci_destroy_unlocked_job_artifacts set to false.
- 16:35 - @abrandl started to run VACUUM FREEZE VERBOSE ANALYZE ci_job_artifacts;
- 17:14 - VACUUM finished on index_ci_job_artifacts_on_job_id_and_file_type which has most of the read traffic across indexes on this table.
- 17:31 - Operation should be returning to normal for all customers.
- 17:45 - VACUUM finished completely.
- 20:00 - autodeploy confirmed working again and incident marked as resolved. Work to re-enable disabled FFs is continuing.
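The timeline separates the point where VACUUM finished on the busiest index from the point where it finished completely. As a rough sketch of how a long-running manual VACUUM could be watched from a second session, the query below reads PostgreSQL's pg_stat_progress_vacuum view (available in PostgreSQL 9.6 and later). This is an illustration only, not a record of how the responders tracked progress during this incident.

```sql
-- Run from a separate session while the VACUUM is in progress.
-- pg_stat_progress_vacuum has one row per backend that is currently vacuuming.
SELECT
  p.pid,
  p.relid::regclass AS table_name,
  p.phase,                 -- e.g. 'scanning heap', 'vacuuming indexes'
  p.heap_blks_total,
  p.heap_blks_scanned,
  p.heap_blks_vacuumed,
  p.index_vacuum_count     -- completed index vacuum cycles so far
FROM pg_stat_progress_vacuum AS p
WHERE p.relid = 'ci_job_artifacts'::regclass;
```

The query returns no rows once the VACUUM has finished.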
2021-11-23
- 16:13 - Canary stage is undrained.
- 23:30 - Sidekiq job ExpireBuildArtifactsWorker is scheduled to not run: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1368 (merged)
- 23:36 - increased collection for auto_explain instrumentation to 10% and >1s.
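The 23:36 entry most plausibly corresponds to two settings of the PostgreSQL auto_explain module: a sample rate of 10% and a minimum statement duration of 1 second. That mapping is an assumption, since the entry does not name the parameters. A minimal session-level sketch of equivalent settings follows. A production fleet would more likely set these in its server configuration, and both statements normally require superuser privileges.

```sql
-- Sketch only: the mapping from "10% and >1s" to these two parameters is assumed.
LOAD 'auto_explain';

-- Log the execution plan of any statement that runs for longer than 1 second.
SET auto_explain.log_min_duration = '1s';

-- Consider only about 10% of statements for explaining, to limit overhead.
SET auto_explain.sample_rate = 0.1;
```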
2021-11-24
- 00:46 - Production deploy completed.
- 01:17 - Post-deploy migrations in production complete.
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.
- Under consideration (via @grzesiek):
  - adjusting vacuum thresholds for ci_job_artifacts
  - partitioning this table
- Investigation: gitlab-org/gitlab#346427 (closed)
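For the vacuum threshold adjustment under consideration, PostgreSQL allows autovacuum settings to be overridden per table through storage parameters. The sketch below shows the general shape of such a change. The numeric values are hypothetical placeholders chosen for illustration, not values proposed in this incident or in the linked investigation.

```sql
-- Autovacuum vacuums a table once its dead tuples exceed:
--   autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples
-- On a very large table the default scale factor (0.2) can delay vacuuming
-- for a long time, so a per-table override lowers the trigger point.
-- NOTE: 0.01 and 100000 are hypothetical example values.
ALTER TABLE ci_job_artifacts SET (
  autovacuum_vacuum_scale_factor = 0.01,
  autovacuum_vacuum_threshold = 100000
);

-- Check dead tuple counts and when autovacuum last ran on the table.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'ci_job_artifacts';
```

Any real values would need to be chosen against the table's actual size and write rate.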
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Investigation and summary of events (plus some additional corrective actions) can be found in gitlab-org/gitlab#346427 (closed).