2018-12-06 Delays in shared runners on GitLab.com
Summary
Service(s) affected: CI shared runners on GitLab.com
Team attribution:
Minutes downtime or degradation:
Outage time: 2018-12-05 14:40 UTC to 2018-12-05 19:40 UTC, a 5-hour interruption during which the number of pending jobs was abnormally high, per https://dashboards.gitlab.net/d/000000159/ci?panelId=2&fullscreen&orgId=1&from=1544008236203&to=1544040636000&var-runner_type=All&var-runner_managers=All&var-cache_server=All&var-gl_monitor_fqdn=postgres-02-db-gprd.c.gitlab-production.internal&var-has_minutes=yes&var-hanging_droplets_cleaner=All&var-droplet_zero_machines_cleaner=All&var-runner_job_failure_reason=All&var-gitlab_env=gprd&var-jobs_running_for_project=0
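The `from` and `to` parameters in the dashboard link above are epoch timestamps in milliseconds, so the time window the panel covers can be checked against the stated outage time. The following is a minimal sketch, not part of the original report; the helper name `dashboard_window` is hypothetical, and the URL is shortened to the time-range parameters only.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlparse, parse_qs

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def dashboard_window(url):
    """Return the (start, end) of a dashboard link as UTC datetimes.

    Hypothetical helper: assumes `from` and `to` are epoch milliseconds,
    as they are in the link quoted in this report.
    """
    query = parse_qs(urlparse(url).query)
    start_ms = int(query["from"][0])
    end_ms = int(query["to"][0])
    # Integer milliseconds added to the epoch avoid float rounding.
    return (EPOCH + timedelta(milliseconds=start_ms),
            EPOCH + timedelta(milliseconds=end_ms))


# Shortened form of the link, keeping only the time-range parameters.
url = ("https://dashboards.gitlab.net/d/000000159/ci"
       "?panelId=2&from=1544008236203&to=1544040636000")
start, end = dashboard_window(url)
print(start.isoformat(timespec="seconds"))
print(end.isoformat(timespec="seconds"))
```

Decoded this way, the panel covers 2018-12-05 11:10:36 UTC to 20:10:36 UTC, which brackets the 14:40 to 19:40 UTC outage window.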
Timeline
2018-12-06
Notes from Slack:
- 16:25 UTC - It looks like we are at 100% of our SSD disk quota for CI runners in GCP.
- 16:30 UTC - Started removing stale, unattached SSD disks.
- 17:00 UTC - Incident issue created.
- 17:15 UTC - Looking into enabling some DigitalOcean (DO) runners to give us some capacity.
- 17:25 UTC - DO runners enabled, so we should start picking up jobs.
- 17:47 UTC - Shared DO runners are helping bring the number of pending jobs down to better levels. Continuing to monitor.
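The cleanup in the 16:30 entry amounts to selecting SSD disks that no instance is using. The sketch below illustrates that selection and is not the tooling used during the incident. It assumes disk records shaped like Compute Engine disk resources (`name`, `type`, `users`, `creationTimestamp`); the sample data, the function name, and the one-hour age threshold are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def stale_unattached_ssds(disks, now, min_age=timedelta(hours=1)):
    """Return names of SSD disks with no users that are older than min_age.

    Assumption: each record has `name`, `type`, `creationTimestamp`, and an
    optional `users` list that is empty or missing when the disk is unattached.
    """
    stale = []
    for disk in disks:
        is_ssd = disk["type"].endswith("pd-ssd")
        is_unattached = not disk.get("users")
        created = datetime.fromisoformat(disk["creationTimestamp"])
        if is_ssd and is_unattached and now - created >= min_age:
            stale.append(disk["name"])
    return stale


# Hypothetical sample records, for illustration only.
disks = [
    {"name": "runner-disk-a", "type": "zones/us-east1-c/diskTypes/pd-ssd",
     "creationTimestamp": "2018-12-05T10:00:00+00:00"},
    {"name": "runner-disk-b", "type": "zones/us-east1-c/diskTypes/pd-ssd",
     "users": ["zones/us-east1-c/instances/runner-b"],
     "creationTimestamp": "2018-12-05T10:00:00+00:00"},
    {"name": "runner-disk-c", "type": "zones/us-east1-c/diskTypes/pd-standard",
     "creationTimestamp": "2018-12-05T10:00:00+00:00"},
]
now = datetime(2018, 12, 5, 16, 30, tzinfo=timezone.utc)
candidates = stale_unattached_ssds(disks, now)
print(candidates)
```

Only the selection step is shown. Deleting the returned disks, and choosing a safe age threshold so that disks belonging to runners still being provisioned are not removed, is left out deliberately.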