# Reduce flakiness of `www-gitlab-com` pipelines related to intermittent failures
## Overview

We are trying to increase the stability of the `www-gitlab-com` pipelines, and now have an SLO to investigate all pipeline failures on the `master` branch, which are reported to the `#master-broken-www-gitlab-com` internal Slack channel.

There are frequent flaky failures due to various seemingly intermittent network-related issues. Most of these appear to be related to the repo cloning/fetching step, but not all.

This issue is to explore strategies to reduce the flakiness: in particular, whether anything can be learned from the approaches used to make the `gitlab-com/gitlab` pipelines more stable, or whether there is no good mitigation and we just need to deal with retrying failed pipelines manually.
Here are some recent examples of these types of failures:
## Network errors while fetching the repo

We are aware that the `www-gitlab-com` repo is very large (over 6 GB), which likely contributes to this instability. There are ongoing efforts (the monorepo refactor, reducing the frequency of fetching) to address this, but in the meantime we would like to increase the reliability of the fetching if possible.

- `fatal: unable to access 'https://gitlab.com/gitlab-com/www-gitlab-com.git/': The requested URL returned error: 524` when fetching repo (failed job)
- `fatal: unable to access 'https://gitlab.com/gitlab-com/www-gitlab-com.git/': OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to gitlab.com:443` when fetching repo (failed job)
- `fatal: the remote end hung up unexpectedly - fatal: protocol error: bad pack header` when fetching repo (failed job)
- `error: RPC failed; curl 56 OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104 - fatal: the remote end hung up unexpectedly` when fetching repo (failed job)
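As a stopgap while these are investigated, fetch errors like the above can be worked around with a bounded retry loop around the command. A minimal sketch, where the `retry_cmd` helper, attempt count, and pause are assumptions and not anything currently in the pipeline:

```shell
#!/bin/sh
# Sketch: run a flaky command (e.g. a git fetch) up to N times before giving up.
# retry_cmd is a hypothetical helper, not part of the www-gitlab-com pipeline.
retry_cmd() {
  max=$1
  shift
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      echo "command failed after $max attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep 1 # brief pause before retrying, to let a transient network issue clear
  done
}

# Example: retry_cmd 3 git fetch origin master
```

This only helps for commands our own scripts run; the runner's own clone/fetch step is outside our scripts' control.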
## Network errors in scripts (other than repo fetching)

- `Errno::ENETUNREACH: Failed to open TCP connection to gitlab.com:443 (Network is unreachable - connect(2) for "gitlab.com" port 443)` when running `bin/generate_handbook_changelog` (failed job)
## Ideas

1. Implement CI-level retry, as shown here in the `gitlab` repo (done in !45662 (merged))
2. Try to leverage the `retry` script wrapper for scripts which fail, as shown here?
3. Improve the existing `ApiRetry` module to catch more errors. (done in !45672 (merged))
4. See if there are other API calls (via the `Gitlab` gem) which need to be wrapped in `ApiRetry`
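Ideas 2 and 3 both boil down to wrapping failure-prone calls in a bounded retry. A rough sketch of the pattern in Ruby, where the `with_network_retry` name, the error list, and the backoff are illustrative assumptions rather than the actual `ApiRetry` implementation:

```ruby
require 'socket'

# Hypothetical retry wrapper; the real ApiRetry module may rescue a
# different set of errors and use different timing.
NETWORK_ERRORS = [
  Errno::ECONNRESET,  # "connection reset by peer"
  Errno::ENETUNREACH, # "Network is unreachable", as seen in generate_handbook_changelog
  Errno::ETIMEDOUT,
  SocketError
].freeze

def with_network_retry(max_attempts: 3, base_delay: 1)
  attempts = 0
  begin
    attempts += 1
    yield
  rescue *NETWORK_ERRORS
    raise if attempts >= max_attempts
    sleep(base_delay * attempts) # simple linear backoff between attempts
    retry
  end
end

# Example: with_network_retry { Gitlab.merge_requests('gitlab-com/www-gitlab-com') }
```

Idea 4 would then amount to auditing the calls made through the `Gitlab` gem and routing them through a wrapper like this.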
## Updates

### Summary of Updates

- The implementation of Ideas 1 and 3 seems to have helped. The number of network failures has dropped from multiple per day to one or zero per day.
- There have been no non-network flakes since implementing Ideas 1 and 3 (but we currently have the integration specs commented out due to other flakiness unrelated to the network).
- The network failures have happened both during a fresh clone and during a fetch-only reinitialize.
### Update: Apr. 6, 2020

- Idea 1 (CI retry) has been implemented:

  ```yaml
  retry:
    max: 2 # This is confusing, but it means "3 runs at max".
    when:
      - unknown_failure
      - api_failure
      - runner_system_failure
      - job_execution_timeout
      - stuck_or_timeout_failure
  ```
- Idea 3 (improve `ApiRetry`) has been implemented.
- We will wait a week and see what flakiness still remains, then re-evaluate what the next steps should be.
### Update Apr. 7, 2020

- Got another repo cloning network failure:

  ```
  fatal: the remote end hung up unexpectedly
  fatal: early EOF
  fatal: index-pack failed
  ```

- So, whatever this is wasn't fixed by the CI-level retries added yesterday.
- Asked for advice in the `#g_runner` Slack channel.
### Update Apr. 9, 2020

- One job got several repo cloning network failures:

  ```
  Initialized empty Git repository in /builds/gitlab-com/www-gitlab-com/.git/
  Created fresh repository.
  fatal: the remote end hung up unexpectedly
  fatal: early EOF
  fatal: index-pack failed
  ```
### Update Apr. 10, 2020

- One job failed on master while fetching a reinitialized repo; note this was not a fresh clone:

  ```
  Fetching changes with git depth set to 10...
  Reinitialized existing Git repository in /builds/gitlab-com/www-gitlab-com/.git/
  fatal: unable to access 'https://gitlab.com/gitlab-com/www-gitlab-com.git/': The requested URL returned error: 524
  ```
### Update Apr. 13, 2020

- One pipeline failed three jobs on repo clone, with more errors like `fatal: the remote end hung up unexpectedly`. Here's one: https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/508591732
- It's interesting that it failed three jobs in one pipeline and no others: that points to Gitaly just having a hiccup, and it is not even necessarily related to the size of the repo, since cloning has been mostly stable across thousands of other jobs.
- Perhaps there's a bug against the product to be created here: something that could or should be caught by the CI-level retry setting, but isn't?
### Update Apr. 15, 2020

- Several failures early in the day, but they were caused by a systemwide incident around artifact uploads.
- One repo failure, during fetch/reinitialize:

  ```
  Reinitialized existing Git repository in /builds/gitlab-com/www-gitlab-com/.git/
  fatal: unable to access 'https://gitlab.com/gitlab-com/www-gitlab-com.git/': The requested URL returned error: 524
  ```

- It's interesting that several recent failures are on reinitializes rather than full clones. This indicates that it's not necessarily the repo size, because a reinitialize/fetch should be a quick fast-forward which doesn't pull much data down.
### Status after observing a week+ of activity (Apr. 21, 2020)

After observing a week's worth of activity on `www-gitlab-com` build failures, two things stand out:

- It's interesting that several recent failures are on reinitializes rather than full clones. This indicates that it's not necessarily the repo size, because a reinitialize/fetch should be a quick fast-forward which doesn't pull much data down.
- Perhaps there's a bug against the product to be created here: something that could or should be caught by the CI-level retry setting, but isn't? UPDATE: Opened an issue to implement CI-level retry for runner clone/fetch failures.

It doesn't appear necessary at this time to try Idea 2 (the `retry` script wrapper) or Idea 4 (auditing for other API calls to wrap in `ApiRetry`), because we haven't seen any failures these would have helped with. If we do, we can look into them.
### Update Apr. 21, 2020

- Opened an issue to implement CI-level retry for runner clone/fetch failures.
- Found out that this is already supported via the Job Stages Attempts variables: https://gitlab.com/gitlab-org/gitlab/-/blob/master/doc/ci/yaml/README.md#job-stages-attempts
- Opened a doc issue to make that more discoverable: gitlab-org/gitlab!30113 (merged)
- Added those variables: !47295 (merged)
- Closed this issue.
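For reference, the Job Stages Attempts variables are ordinary CI variables set in `.gitlab-ci.yml`. A sketch of the kind of configuration this adds, where the specific values shown are assumptions (the actual values are in !47295):

```yaml
# Sketch only; see !47295 for the values actually added to .gitlab-ci.yml.
variables:
  GET_SOURCES_ATTEMPTS: "3"        # retry the runner's clone/fetch step on failure
  ARTIFACT_DOWNLOAD_ATTEMPTS: "3"  # retry downloading dependency artifacts
  RESTORE_CACHE_ATTEMPTS: "3"      # retry restoring the CI cache
```

Unlike the job-level `retry:` setting, these retry individual runner stages within a single job run, so a transient clone failure no longer requires re-running the whole job.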
### Update Apr. 27, 2020

Reopening this issue. After almost a week of no failures following the addition of the Job Stages Attempts variables, there were two new errors today of a type which hadn't been seen (or at least recorded here) before:

```
Running with gitlab-runner 12.10.0-rc2 (6c8c540f)
  on docker-auto-scale-com 1d6b581d
Preparing the "docker+machine" executor
00:01
Using Docker executor with image registry.gitlab.com/gitlab-org/gitlab-build-images:www-gitlab-com-2.6 ...
Authenticating with credentials from job payload (GitLab Registry)
Pulling docker image registry.gitlab.com/gitlab-org/gitlab-build-images:www-gitlab-com-2.6 ...
Using docker image sha256:32e5ef4517beeb0b933b968da99a1c3b7bb688f2438e5d23d1d9366964a934c3 for registry.gitlab.com/gitlab-org/gitlab-build-images:www-gitlab-com-2.6 ...
Preparing environment
00:00
Uploading artifacts for failed job
00:00
ERROR: Job failed (system failure): Error response from daemon: mkdir /var/lib/docker/overlay2/73148cecddc032dba75fed934ed09655367763a2d4d714188a4e57a28916cdd0-init: read-only file system (docker.go:788:0s)
```
### Update Apr. 29, 2020

- Several failures related to an infrastructure issue with artifact upload: gitlab-com/gl-infra/production#2031 (closed)
### Status May 4, 2020

Re-closing this issue. All failures since Apr. 21 are attributable to other transient infrastructure failures or code-related errors. Otherwise, the pipelines seem very stable with respect to all of the original items mentioned in this issue.