2021-09-13: Issues uploading artifacts: 524 or 502 errors
Current Status
GitLab.com is operating normally.
API requests for artifact uploads were seeing elevated rates of 502/524 errors. These errors showed up more frequently on pipelines with large numbers of files to upload. We determined that there was saturation on our nginx ingress and scaled it to better handle the load. A second investigation found that a Golang version update in the binary that handles uploads to GCS caused a significant increase in read requests, which produced a related degraded state and exacerbated the initial saturation problem. A rollback to the latest stable version has been performed.
Summary for CMOC notice / Exec summary:
- Customer Impact: customers reported 502 and 524 errors from the API when uploading artifacts (pipelines with large numbers of files were affected more)
- Customer Impact Duration: approximately 2021-09-13 10:00 UTC to 2021-09-13 23:10 UTC (13h10m)
- Current state: Incident::Resolved
- Root cause: RootCause::Software-Change
Timeline
View recent production deployment and configuration events / gcp events (internal only)
All times UTC.
Production Timeline
2021-09-13
- 07:34 - 14.3.202109130320-b1e1b2b9679.ee20599df78 package to canary begins
- 09:00 - 14.3.202109130620-4905b977756.f3ee0bedfc4 package to canary begins
- 09:50 - 14.3.202109130320-b1e1b2b9679.ee20599df78 package to main begins
- 11:05 - 14.3.202109130620-4905b977756.f3ee0bedfc4 package to main begins
- 13:25 - @kategrechishkina declares incident in Slack.
- 13:52 - 14.3.202109131120-669e0259bae.6b95ec25dd9 package to canary begins. This package never makes it to production.
- 16:20 - research continues with support from GCP; we have seen I/O issues on nginx pods/nodes.
- 16:30 - changing node pool configuration for nginx to be SSD-based, which should help with scaling the number of pods starting.
- 17:00 - initial (assumed) recovery from major saturation.
- 19:12 - re-opened incident after further reports of the issue.
- 22:45 - starting rollback to latest stable version.
- 23:10 - rollback mostly completed; error rates for uploads return to more normal levels.
Software root cause timeline
- 2021-08-16 - MR opened proposing upgrading CNG images to all use golang 1.17
- 2021-08-19 - MR waiting on upstream tests to be merged
- 2021-08-20 - golang 1.17 added to gitlab build images
- 2021-09-02 - CI testing added to workhorse
- 2021-09-08 - MR first review started
- 2021-09-09 - Omnibus MR for updating and testing golang opened
- 2021-09-10 - Successful QA test run of the golang 1.17 update in omnibus
- 2021-09-11 - Maintainer review completed, MR merged, and helm charts CI QA smoke test run
- 2021-09-13 - Automatically brought into the auto-deploy branch
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- Improve monitoring on GCS request rates/errors (see the instrumentation sketch after this list)
- Alert on high 503 error rate
- Additional Corrective Actions tracked in issue: #5533 (closed)
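As a rough illustration of the first two corrective actions, the sketch below shows how outbound GCS request rates and errors could be counted and exposed for alerting, assuming the prometheus/client_golang library. The metric name, labels, and the recordGCSRequest helper are hypothetical, not the metrics workhorse actually exports.

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// gcsRequests is a hypothetical counter for outbound GCS requests made while
// handling artifact uploads, labelled by operation and outcome.
var gcsRequests = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "artifact_gcs_requests_total",
		Help: "Outbound GCS requests made while handling artifact uploads.",
	},
	[]string{"operation", "status"},
)

func init() {
	prometheus.MustRegister(gcsRequests)
}

// recordGCSRequest would be called around each GCS call site.
func recordGCSRequest(operation string, err error) {
	status := "success"
	if err != nil {
		status = "error"
	}
	gcsRequests.WithLabelValues(operation, status).Inc()
}

func main() {
	// Simulate one observed GCS call so the counter has a sample.
	recordGCSRequest("range_read", nil)

	// Expose /metrics so request rates and error ratios can be graphed
	// and alerted on.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```

An alert on the rate of the error-labelled series, together with the existing load-balancer 5xx metrics, would then cover the "alert on high 503 error rate" item.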
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Action" section
- Fill out relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - ...
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - ...
- How many customers were affected?
  - ...
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes?
- The golang version used to compile the gitlab workhorse cloud native images was updated from 1.16 to 1.17 gitlab-org/build/CNG!736 (merged)
- The upgrade to Go v1.17 introduced a significant performance regression with ZIP files. Instead of 2 HTTP Range Requests, it appears the call to f.readDataDescriptor() in https://go-review.googlesource.com/c/go/+/312310/14/src/archive/zip/reader.go#120 caused additional HTTP Range Requests to be fired for every file in the archive, which significantly slowed down the generation of the artifact metadata.
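For illustration only, the read amplification described above can be measured locally by wrapping a ZIP archive in an io.ReaderAt that counts ReadAt calls, since against a remote object store like GCS each ReadAt typically maps to one HTTP Range Request. This is a standalone sketch, not workhorse code; the countingReaderAt type and the command-line argument are made up for the example.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"log"
	"os"
	"sync/atomic"
)

// countingReaderAt wraps a bytes.Reader and counts ReadAt calls. Against a
// remote backend such as GCS, each ReadAt would typically become an HTTP
// Range Request.
type countingReaderAt struct {
	r     *bytes.Reader
	calls int64
}

func (c *countingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	atomic.AddInt64(&c.calls, 1)
	return c.r.ReadAt(p, off)
}

func main() {
	data, err := os.ReadFile(os.Args[1]) // path to any local .zip file
	if err != nil {
		log.Fatal(err)
	}
	cr := &countingReaderAt{r: bytes.NewReader(data)}

	// Read the archive directory; the number of ReadAt calls this takes is
	// the quantity that regressed between Go 1.16 and Go 1.17.
	zr, err := zip.NewReader(cr, int64(len(data)))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("files in archive: %d, ReadAt calls: %d\n",
		len(zr.File), atomic.LoadInt64(&cr.calls))
}
```

Building and running this with Go 1.16 and Go 1.17 toolchains against the same multi-file archive should show the per-file increase in reads that the root cause bullet describes.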
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)