RCA: 2019-10-24: Elevated CI job queue durations
Incident: production#1275 (closed)
Summary
Due to a project import bug, some jobs had an `options` attribute with the wrong data type, which led to failures when assigning those jobs to runners, so they were put back into the queue in an endless loop. The fair usage job scheduling algorithm preferred them over most other jobs because they belonged to a short pipeline, so only a few other jobs got the chance to run on shared runners. This increased overall queue times, and the number of pending jobs kept rising.
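A minimal sketch of the failure loop in illustrative Ruby (this is not GitLab's actual scheduler; the fair-usage policy and job shape are simplified assumptions):

```ruby
# Toy model of the incident: a job whose assignment always fails is requeued
# forever, and a "short pipeline first" policy keeps picking it over others.
Job = Struct.new(:id, :options, :pipeline_size)

def assign(job)
  # The scheduler expects `options` to be a Hash; the corrupt jobs carried a String.
  raise TypeError, "options must be a Hash" unless job.options.is_a?(Hash)
  puts "job #{job.id} assigned to a shared runner"
end

queue = [
  Job.new(1, "---\nimage: ruby\n", 1),   # corrupt job from a short pipeline
  Job.new(2, { "image" => "ruby" }, 40), # healthy job from a long pipeline
]

3.times do
  job = queue.min_by(&:pipeline_size)    # fair usage: prefer short pipelines
  begin
    assign(job)
    queue.delete(job)
  rescue TypeError
    # assignment failed, so the job goes straight back into the queue;
    # job 2 never gets a turn and queue times climb
  end
end
```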
- Service(s) affected : ~"Service:CI Runners"
- Team attribution :
- Minutes downtime or degradation : 10:00 UTC - 20:20 UTC = 10h 20m
For calculating duration of event, use the Platform Metrics Dashboard to look at Apdex and SLO violations.
Impact & Metrics
Start with the following:
- What was the impact of the incident?
- many jobs pending for a long time
- Who was impacted by this incident?
- all users running jobs
- How did the incident impact customers?
- customers needed to wait a long time to get their jobs scheduled
- How many attempts were made to access the impacted service/feature?
- How many customers were affected?
- How many customers tried to access the impacted service/feature?
Job duration for 50th percentile:
Job duration for 90th percentile:
Provide any relevant graphs that could help understand the impact of the incident and its dynamics.
Detection & Response
Start with the following:
- How was the incident detected?
- reports from customer support about users seeing pending jobs
- Did alarming work as expected?
- we got no alerts; the SLO Apdex for CI runner latency was defined to alert on the 50th percentile, but in the first hours only the 70th percentile was severely affected (see the sketch after this list)
- How long did it take from the start of the incident to its detection?
- 42m (support reported customer issues at 10:42)
- How long did it take from detection to remediation?
- 10:42 - 20:20 = 9h 38m
- Were there any issues with the response to the incident? (e.g. bastion host used to access the service was not available, relevant team member wasn't pageable, ...)
- It took a long time to identify the root cause and to understand the impact.
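To illustrate why a p50-based alert stayed quiet while queue times degraded, here is a hedged sketch with made-up numbers (the `percentile` helper and the 60-second SLO are assumptions, not our actual alerting rules):

```ruby
# A skewed queue-time distribution: the median looks healthy while the
# 70th percentile is far past an (assumed) 60-second SLO.
durations = [5, 6, 7, 8, 9, 600, 700, 800, 900, 1000] # seconds, sorted

# Nearest-rank percentile over a sorted array.
def percentile(sorted, p)
  sorted[(sorted.length * p / 100.0).ceil - 1]
end

percentile(durations, 50) # => 9    (fine; a p50 alert never fires)
percentile(durations, 70) # => 700  (severely degraded, but unalerted)
```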
Root Cause Analysis
Jobs were getting stuck in the pending state.
- Why? - They only had a low chance of getting assigned to a shared runner.
- Why? - The shared runners were mostly occupied by a few jobs with corrupt options that were being retried indefinitely.
- Why? - The `Ci::Build#options` attribute was a string instead of a hash.
- Why? - The jobs came from imported projects, and the importer has a bug in 12.4.0 (12.3.5 works).
- Why?
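One plausible mechanism for the corrupt attribute, sketched here as an assumption (this RCA does not record the importer's exact code path): if the importer serializes an already-serialized value, deserialization yields a String where a Hash is expected.

```ruby
require "yaml"

options = { "image" => "ruby:2.6" }

once  = options.to_yaml # what export writes: a YAML document of the Hash
twice = once.to_yaml    # hypothetical import bug: serializing the string again

YAML.safe_load(once).class  # => Hash   (what Ci::Build#options expects)
YAML.safe_load(twice).class # => String (what the corrupt jobs carried)
```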
What went well
- customer support escalating to the infra team
- dev and infra working together to debug a tricky issue
What can be improved
- Alerting for pending jobs and rising job queues.
- Better understanding of job scheduling and the impact of elevated job queue times.
Start with the following:
- Using the root cause analysis, explain what can be improved to prevent this from happening again.
- Is there anything that could have been done to improve the detection or time to detection?
- Is there anything that could have been done to improve the response or time to response?
- Is there an existing issue that would have either prevented this incident or reduced the impact?
- Did we have any indication or beforehand knowledge that this incident might take place?
Corrective actions
- List issues that have been created as corrective actions from this incident.
- For each issue, include the following:
  - Issue labeled as corrective action.
  - Estimated date of completion of the corrective action.
  - Named individual who owns the delivery of the corrective action.
- Prevent corrupt job options: gitlab-org/gitlab!19122 (closed), gitlab-org/gitlab!19124 (merged)
- Prevent jobs from being rescheduled indefinitely: gitlab-org/gitlab#34897 (a bounded-retry sketch follows this list)
- Improve alerting for elevated job queue times
- Make it easier to identify which job is picked from which project by which runner: gitlab-org/gitlab#34889
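A hedged sketch of the bounded-retry idea behind gitlab-org/gitlab#34897 (the method and constant names here are illustrative assumptions, not GitLab's implementation):

```ruby
Job = Struct.new(:id, :scheduling_attempts, :state)

MAX_SCHEDULING_ATTEMPTS = 3

# Instead of requeueing forever, fail a job once assignment has been
# attempted too many times, so it is dropped and surfaced to the user.
def handle_failed_assignment(job, queue)
  job.scheduling_attempts += 1
  if job.scheduling_attempts >= MAX_SCHEDULING_ATTEMPTS
    job.state = :scheduler_failure
  else
    queue.push(job)
  end
end

job = Job.new(1, 0, :pending)
queue = []
3.times { handle_failed_assignment(job, queue) }
job.state # => :scheduler_failure after the third failed attempt
```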