2021-09-06 Shared runners with tag gitlab-org-docker are unavailable.

Current Status

Jobs internal to the GitLab organization, executing on its dedicated runners, stalled because ephemeral VMs could not be created and connected to. This was caused by a configuration change that has since been reverted. Only runners used by the GitLab organization (our www-gitlab-com and testing pipelines) were affected.

Summary for CMOC notice / Exec summary:

  1. Customer Impact: GitLab organization pipelines
  2. Customer Impact Duration: 2021-09-06 15:00 UTC - 2021-09-07 01:00 UTC (600 Minutes)
  3. Current state: Incident::Mitigated
  4. Root cause: RootCause::Config-Change

Timeline

View recent production deployment and configuration events / gcp events (internal only)

All times UTC.

2021-09-06

  • 15:00 (approx) - @akohlbecker enables OS Login setting project-wide on gitlab-ci and gitlab-org-ci.
  • 19:22 - @mmora & @cleveland notice that shared runners with the gitlab-org-docker tag are offline.
  • 19:56 - @mmora declares incident in Slack.
  • 19:58 - @akohlbecker disables the OS Login setting project-wide on gitlab-org-ci. Other steps are taken simultaneously (restarting the runners), so this is not identified as the fix.
  • 20:06 - We notice that gitlab-org-docker runners are back online.
  • 20:41 - Incident is marked as resolved.
  • 21:01 - Incident is re-opened as @rspeicher notices QA jobs on ops are stuck (docker runner tag).

2021-09-07

  • 00:58 - OS Login setting is disabled project-wide on gitlab-ci
  • 01:02 - Jobs are running again

Corrective Actions

Corrective actions should be added here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.

  • ...

Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.



Incident Review

  • Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated and relevant graphs are included in the summary
  • If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
  • Fill out relevant sections below or link to the meeting review notes that cover these topics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. GitLab team members and community contributors
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. CI jobs progressively stopped being processed, coming to a full stop at around 22:00 UTC
  3. How many customers were affected?
    1. Anyone trying to execute a pipeline
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. ...

What were the root causes?

  • Enabling the OS Login feature project-wide disabled SSH keys stored in instance metadata, and ultimately prevented docker-machine from connecting to the VMs it tried to create.
  • As ephemeral VMs were progressively destroyed once they reached their maximum number of executed jobs, they could not be replaced, leaving CI jobs increasingly stalled.
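The root cause above can be confirmed from the project metadata itself. A hedged sketch using `gcloud`, assuming access to the affected project (here `gitlab-ci`) and `grep` available locally:

```shell
# Inspect the project-wide metadata for the OS Login flag.
# When enable-oslogin is TRUE, the guest agent ignores SSH keys stored in
# instance metadata, which is what locked docker-machine out of the VMs.
gcloud compute project-info describe \
  --project gitlab-ci \
  --format=json \
  | grep -A1 '"key": "enable-oslogin"'
```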

Incident Analysis

  1. Why did this change happen?

To automate some parts of the gitlab-runner deployment process, we need a way to programmatically access runner managers from GitLab pipelines. OS Login allows us to connect via SSH using only a GCP service account, and thus simplifies credential management.

  2. What did we think was being changed by this change being applied?

We thought we were merely making the OS Login feature available for use on instances. The false sense of security came from following the wording in this guide:

Before you do anything else make sure to enable it for your project:

$ gcloud compute project-info add-metadata \
    --metadata enable-oslogin=TRUE
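The mitigation applied in the timeline was the inverse of this command. A sketch of the revert, assuming the affected project is `gitlab-ci` (the same applies to `gitlab-org-ci`):

```shell
# Remove the project-wide OS Login flag so that SSH keys stored in
# instance metadata are honoured again by new and existing VMs.
gcloud compute project-info remove-metadata \
  --project gitlab-ci \
  --keys enable-oslogin
```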

  3. What was actually changed by this change being applied?
  • Google's guest agents running on all our machines started rewriting the SSH configuration, entering a loop with Chef, which kept trying to overwrite it
  • SSH Keys stored in instance metadata were not being enabled anymore, preventing docker-machine from connecting to ephemeral runner VMs
  4. Why was this change made in the way it was?

We thought we were making the feature available for use.

  5. Why didn’t this change go through a production change request?

We didn't realise that this change would have any impact on existing infrastructure.

  6. Was yesterday a production change lock? (I don’t think so, but double check because it was a holiday in a lot of places)

Not according to this table.

Incident Response Analysis

  1. How was the incident detected?
    1. Team member noticed that their jobs were stuck
  2. How could detection time be improved?
    1. Add a dialtone measuring the time it takes for a job to go through the system and alert after a threshold
  3. How was the root cause diagnosed?
    1. Experimenting with a toy runner VM
  4. How could time to diagnosis be improved?
    1. Manage project settings in Terraform to create a record
  5. How did we reach the point where we knew how to mitigate the impact?
    1. Once the root cause was identified, the solution was immediate
  6. How could time to mitigation be improved?
    1. Not applicable
  7. What went well?
    1. Great communication between all the team members involved
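The dialtone proposed under "How could detection time be improved?" could look like the following sketch: a periodic canary that triggers a trivial pipeline and alerts when it takes too long to succeed. The project ID, threshold, and `GITLAB_TOKEN` variable are illustrative assumptions, and `jq` is assumed to be installed:

```shell
#!/bin/sh
# Dialtone sketch: trigger a canary pipeline and alert if it does not
# complete within a threshold. PROJECT_ID and THRESHOLD_SECONDS are
# placeholders; GITLAB_TOKEN must hold a token with API scope.
PROJECT_ID=12345
THRESHOLD_SECONDS=600
API="https://gitlab.com/api/v4/projects/$PROJECT_ID"

start=$(date +%s)
pipeline_id=$(curl -s --request POST \
  --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
  "$API/pipeline?ref=main" | jq -r '.id')

# Poll until the pipeline succeeds or the threshold is exceeded.
while :; do
  status=$(curl -s --header "PRIVATE-TOKEN: $GITLAB_TOKEN" \
    "$API/pipelines/$pipeline_id" | jq -r '.status')
  elapsed=$(( $(date +%s) - start ))
  [ "$status" = "success" ] && { echo "dialtone ok after ${elapsed}s"; exit 0; }
  if [ "$elapsed" -gt "$THRESHOLD_SECONDS" ]; then
    echo "ALERT: canary pipeline still '$status' after ${elapsed}s"
    exit 1
  fi
  sleep 15
done
```

In practice this would run on a schedule outside the affected CI infrastructure, so that a stalled runner fleet cannot also stall the alert.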

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. Not that I'm aware of
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. Not that I'm aware of
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. Yes, it was part of https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/13803

Lessons Learned

  • ...

Guidelines

  • Blameless RCA Guideline

Resources

  1. If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)
Edited Sep 09, 2021 by Adrien Kohlbecker