2021-09-12 Corrective actions for #5521
What happened
The upgrade to Go v1.17 introduced a significant performance regression with ZIP files. Instead of the usual 2 HTTP Range Requests per archive, it appears this call to f.readDataDescriptor() in https://go-review.googlesource.com/c/go/+/312310/14/src/archive/zip/reader.go#120 caused an additional HTTP Range Request to be fired for every file in the archive, which significantly slowed down the generation of the artifact metadata.
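To illustrate, here is a self-contained sketch (not the actual gitlab-zip-metadata code; httpReaderAt and its counting logic are hypothetical) that serves a ZIP over HTTP and counts how many Range requests it takes just to open it. On Go v1.16 the count stays small regardless of entry count; on Go v1.17 (before the upstream fix) it grows with the number of entries because of the per-file readDataDescriptor call:

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"log"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
	"time"
)

// httpReaderAt reads byte ranges of a remote object with HTTP Range
// requests, counting every request it issues.
type httpReaderAt struct {
	url      string
	client   *http.Client
	requests int64
}

func (r *httpReaderAt) ReadAt(p []byte, off int64) (int, error) {
	req, err := http.NewRequest(http.MethodGet, r.url, nil)
	if err != nil {
		return 0, err
	}
	req.Header.Set("Range", fmt.Sprintf("bytes=%d-%d", off, off+int64(len(p))-1))
	atomic.AddInt64(&r.requests, 1)

	resp, err := r.client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	return io.ReadFull(resp.Body, p)
}

func main() {
	// Build a small streamed archive in memory; zip.Writer emits data
	// descriptors, which is what triggers the extra reads on Go v1.17.
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for i := 0; i < 100; i++ {
		w, _ := zw.Create(fmt.Sprintf("file-%03d.txt", i))
		fmt.Fprintf(w, "entry %d", i)
	}
	zw.Close()

	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		http.ServeContent(w, req, "artifact.zip", time.Time{}, bytes.NewReader(buf.Bytes()))
	}))
	defer srv.Close()

	ra := &httpReaderAt{url: srv.URL, client: srv.Client()}
	if _, err := zip.NewReader(ra, int64(buf.Len())); err != nil {
		log.Fatal(err)
	}
	fmt.Printf("HTTP Range requests to open the archive: %d\n",
		atomic.LoadInt64(&ra.requests))
}
```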
Initial list of corrective action items:
- [Monitoring] Monitor GCS request rates (https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12348). Google told us our request rate jumped to a peak of 30K/s from under 2,000/s around 10:00 UTC, which helped us correlate the regression with the deploy time.
- [Process] The upgrade to Go v1.17 should have happened in development environments first, with at least 2 weeks (maybe even a month or two) of soak time to verify that upgrades don't cause issues. Someone testing a 500+ MB artifact with lots of files would likely have run into this problem. @amulvany had also asked me about a customer ticket complaining about slow GCS upload speeds; trying to replicate that report would probably have surfaced this issue as well.
- [Process] The upgrade (gitlab-org/build/CNG!736 (merged)) was quietly merged in CNG on Friday evening, but this sort of upgrade should have been broadcast. It didn't occur to me that it would land Monday morning when the deploys went out.
- [Process] We need to review why this problem wasn't encountered by our own pipelines. I would have expected CI artifacts to fail on dev.gitlab.org for packages, or even on a gitlab-org/gitlab-com pipeline with assets. Maybe we just didn't wait long enough before deploying the update to the entire fleet? UPDATE: Omnibus builds might not have gotten the upgrade. Perhaps we should merge Omnibus first, or simultaneously?
- [Testing] We need to set up automation that benchmarks uploading large CI artifacts (e.g. 1 GB) with lots of entries. This would also help us show that uploading large artifacts needs some work (e.g. gitlab-org/gitlab#285597 (closed)). Issue: gitlab-org/gitlab#340961 (closed). A benchmark sketch follows this list.
- [Instrumentation] We lacked visibility into what gitlab-zip-metadata was doing. We should add logging to it; ideally we'd log the processing time and the number of HTTP requests made. Consider adding tests that verify the number of HTTP Range Requests (as in the example in https://github.com/golang/go/issues/48374). An instrumentation sketch follows this list.
- [Monitoring] Add SLIs/SLOs for gitlab-zip-metadata from the metrics above.
- [Operation] We should consider capping the maximum time gitlab-zip-metadata can run. If it takes longer than 100 seconds to process the archive, Cloudflare times out the request, but the process continues to run; the runner then retries, causing even more requests to the GCS bucket. A timeout sketch follows this list.
- [Development] We should figure out how to make gitlab-zip-metadata work well with Go v1.17: gitlab-org/gitlab#340778 (closed). Filed an upstream Go issue: https://github.com/golang/go/issues/48374
- [Monitoring] Improve monitoring of Kubernetes node pools/nodes/workloads on nodes.
- [Monitoring] Alert on high HAProxy 503 error rate.
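A minimal sketch of the [Testing] benchmark promised above. It is hypothetical (buildArchive and BenchmarkOpenManyEntries are names I made up) and only measures opening an in-memory archive with many entries rather than a real upload to GCS, but a benchmark of this shape would have shown the Go v1.17 regression as a jump in per-open cost proportional to the entry count:

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"testing"
)

// buildArchive creates an in-memory ZIP with n small entries.
func buildArchive(n int) []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	for i := 0; i < n; i++ {
		w, err := zw.Create(fmt.Sprintf("entry-%05d.txt", i))
		if err != nil {
			panic(err)
		}
		fmt.Fprintf(w, "payload %d", i)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return buf.Bytes()
}

// BenchmarkOpenManyEntries measures the cost of opening an archive with
// many entries: on Go v1.17, zip.NewReader started reading each entry's
// data descriptor, so this cost grows with the entry count.
func BenchmarkOpenManyEntries(b *testing.B) {
	data := buildArchive(10000)
	reader := bytes.NewReader(data)
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		if _, err := zip.NewReader(reader, int64(len(data))); err != nil {
			b.Fatal(err)
		}
	}
}
```

Saved as a `_test.go` file, this runs with `go test -bench=OpenManyEntries`.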
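For the [Instrumentation] item, a minimal sketch of logging the two numbers we want: wall-clock processing time and HTTP request count. countingTransport is a hypothetical wrapper, not existing gitlab-zip-metadata code; any http.Client-based reader could be pointed at it:

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

// countingTransport wraps an http.RoundTripper and counts outgoing requests.
type countingTransport struct {
	base     http.RoundTripper
	requests int64
}

func (t *countingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	atomic.AddInt64(&t.requests, 1)
	return t.base.RoundTrip(req)
}

func main() {
	transport := &countingTransport{base: http.DefaultTransport}
	client := &http.Client{Transport: transport}

	start := time.Now()
	// ... read the archive through `client` and generate metadata here ...
	_ = client

	// The two numbers this corrective action asks us to log.
	log.Printf("zip metadata pass finished duration=%s http_requests=%d",
		time.Since(start), atomic.LoadInt64(&transport.requests))
}
```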
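For the [Operation] item, a minimal sketch of capping the runtime, assuming gitlab-zip-metadata is invoked as a subprocess (the real Workhorse wiring and CLI arguments may differ):

```go
package main

import (
	"context"
	"errors"
	"log"
	"os/exec"
	"time"
)

func main() {
	// 90s leaves headroom under the 100s Cloudflare cutoff, so the
	// process stops before the client ever sees a 504 and retries.
	ctx, cancel := context.WithTimeout(context.Background(), 90*time.Second)
	defer cancel()

	// The URL argument is illustrative; the real invocation may differ.
	cmd := exec.CommandContext(ctx, "gitlab-zip-metadata", "https://example.com/artifact.zip")
	out, err := cmd.Output()
	if errors.Is(ctx.Err(), context.DeadlineExceeded) {
		// CommandContext kills the process once the deadline passes,
		// so a timed-out request no longer leaves an orphaned worker
		// hammering the GCS bucket.
		log.Fatalf("gitlab-zip-metadata exceeded the 90s cap and was killed")
	}
	if err != nil {
		log.Fatalf("gitlab-zip-metadata failed: %v", err)
	}
	log.Printf("metadata generated: %d bytes", len(out))
}
```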