2022-03-15: Brief spike in artifact upload failures due to runner configuration change

Incident DRI

@tmaczukin

Current Status

A runner configuration change caused a brief increase in errors when uploading artifacts from CI jobs.

Affected CI jobs reported artifact upload errors such as:

WARNING: Uploading artifacts as "archive" to coordinator... 307 Temporary Redirect  id=2205205797 responseStatus=307 Temporary Redirect status=307 token=az5wMDn3
WARNING: Retrying...                                context=artifacts-uploader error=invalid argument

See #6582 (comment 875214827) for details.
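
For context, any job that declares artifacts exercises this upload path; the upload to the coordinator happens after the job's script finishes. A minimal sketch of such a job (the job name, script, and paths are illustrative, not taken from an affected pipeline):

  build:
    script:
      - make build                # any command that produces output files
    artifacts:
      paths:
        - dist/                   # uploading this directory is the step that
                                  # returned 307 during the incident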

Summary for CMOC notice / Exec summary:

  1. Customer Impact: For ten minutes, CI/CD jobs executed on the shared/green runner managers (instance runners on GitLab.com) failed to upload artifacts, if the job defined any.

  2. Service Impact: CI Runners

  3. Impact Duration: 12:07 - 12:17 (10 minutes)

  4. Root cause:

    The incident happened while we were proceeding with a change rollout. This was our second attempt, after the Day 3 step of the previous attempt had caused submodule failures.

    On Thursday 2022-03-10 at night (around 23:50 UTC), the EOC was informed about failures of submodule operations in CI/CD jobs. These were correctly identified as a result of the change we had rolled out a few hours earlier. The EOC decided to use the blue/green switch strategy to roll back the change. As discussed in that incident's investigation issue, this was unnecessary (most likely the result of a rollback procedure description that was not clear enough).

    The effect of this rollback was that shared runners were switched from blue to green, where the ci-gateway configuration had already been reverted. A side effect was that this also downgraded the Runner version, and that was the root cause.

    Our strategy for Runner version upgrades is to update the version on only one part of the fleet and switch traffic to it. The roles of the previously active part are not updated; we leave them untouched in case a revert to the previous version is needed. To support the ci-gateway configuration, and especially the redirects on artifact upload attempts, we had to ship an update to GitLab Runner. During our tests on staging.gitlab.com and the previous steps of the gitlab.com rollout of https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/14874+ we updated all three runner shards to the new version of Runner.

    This means that as of last Thursday we had:

    • shared/blue runners using a new version of GitLab Runner,
    • shared/green runners using an old version of GitLab Runner.
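
    Expressed as a purely illustrative inventory sketch (the shard names match the incident; the keys and values are placeholders, not our actual configuration), the fleet state was roughly:

      shared:
        blue:
          runner_version: new          # updated during the earlier rollout steps
          ci_gateway: enabled
          receiving_traffic: true
        green:
          runner_version: old          # intentionally left behind as a revert target
          ci_gateway: reverted
          receiving_traffic: false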

    The revert therefore rolled back not only the ci-gateway configuration but also the Runner version in use. This was not a problem for customers, as the old version was still working properly.

    When we had confirmed that all our fixes were working properly and decided today to start implementing Switch 'shared' runners shard to use ci-gateway... (#6582 - closed), we didn't notice that shared/green was running a Runner version that would fail with the ci-gateway configuration. This was an oversight in preparing the rollout: we had been working through the rollout steps for two weeks, and the Runner version on shared had been updated at the very beginning of that period, so we didn't notice that Thursday's EOC action had changed our environment (no blame for the EOC here! well done on handling the revert when submodules were failing!).

    Today, when we updated the configuration on shared/green, the runner managers started using the ci-gateway internal load balancer, but because their Runner version was too old for it, jobs started to fail.

    After identifying the issue and reverting the configuration, we switched runners to shared/blue, which runs the expected Runner version. With that we were able to repeat the steps of Switch 'shared' runners shard to use ci-gateway... (#6582 - closed), this time properly and without causing any failures.
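
    One way to catch such a version mismatch before switching traffic is a pre-flight CI job that asserts the Runner version via the predefined CI_RUNNER_VERSION variable. A minimal sketch (the tag and the minimum version below are assumptions for illustration, not the actual values from this rollout):

      verify-runner-version:
        tags: [shared]                 # assumed tag routing the job to the shard being switched
        script:
          - echo "Running on GitLab Runner ${CI_RUNNER_VERSION}"
          # Fail fast if this shard still runs a pre-ci-gateway Runner:
          - |
            required="14.9.0"          # assumed minimum version; not the real cutoff
            lowest=$(printf '%s\n' "$required" "$CI_RUNNER_VERSION" | sort -V | head -n1)
            if [ "$lowest" != "$required" ]; then
              echo "Runner ${CI_RUNNER_VERSION} is older than ${required}; do not switch traffic yet"
              exit 1
            fi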

Timeline

Recent Events (available internally only):

  • Deployments
  • Feature Flag Changes
  • Infrastructure Configurations
  • GCP Events (e.g. host failure)
  • Gitlab.com Latest Updates

All times UTC.

2022-03-15

  • 12:06 - initial configuration change for #6582 (closed) is applied on shared/green runner managers
  • 12:06 - from this moment, CI/CD jobs started and executed on the shared/green runner managers fail at the artifact upload step (if the job uses one)
  • 12:07 - the problem is discovered while executing a test pipeline (part of the "Post-changes steps" checklist in the change management issue).
  • 12:17 - changes are reverted on all shared/green runner managers
  • 12:23 - a test job confirms that the incident is resolved.
  • 12:40 - @jarv declares incident in Slack.

Create related issues

Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:

  • Support contact request
  • Corrective action
  • Investigation followup
  • Confidential issue
  • QA investigation

Takeaways

  • ...

Corrective Actions

Corrective actions should be put here as soon as an incident is mitigated. Ensure that all corrective actions mentioned in the notes below are included.

  • ...

Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue, visible only internally. By default, all information we can share will be public, in accordance with our transparency value.



Incident Review

  • Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated and relevant graphs are included in the summary
  • If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Action" section
  • Fill out relevant sections below or link to the meeting review notes that cover these topics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. ...
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. ...
  3. How many customers were affected?
    1. ...
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. ...

What were the root causes?

  • ...

Incident Response Analysis

  1. How was the incident detected?
    1. ...
  2. How could detection time be improved?
    1. ...
  3. How was the root cause diagnosed?
    1. ...
  4. How could time to diagnosis be improved?
    1. ...
  5. How did we reach the point where we knew how to mitigate the impact?
    1. ...
  6. How could time to mitigation be improved?
    1. ...
  7. What went well?
    1. ...

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. ...
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. ...
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. ...

What went well?

  • ...

Guidelines

  • Blameless RCA Guideline

Resources

  1. If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)