Reduce flakiness of `www-gitlab-com` pipelines related to intermittent failures

Overview

We are trying to increase the stability of the www-gitlab-com pipelines, and we now have an SLO to investigate all pipeline failures on the master branch, which are reported to the #master-broken-www-gitlab-com internal Slack channel.

There are frequent flaky failures due to various seemingly intermittent network-related issues. Most of these seem related to the repo cloning/fetching step, but not all.

This issue is to explore strategies for reducing the flakiness, especially anything that can be learned from the approaches used to make the gitlab-com/gitlab pipelines more stable. If there is no good mitigation, we may just need to accept manually retrying the failed pipelines.

Here are some recent examples of these types of failures:

Network errors while fetching repo

We are aware that the www-gitlab-com repo is very large (over 6 GB), which likely contributes to this instability. There are ongoing efforts (the monorepo refactor, reducing the frequency of fetching) to address this, but in the meantime we would like to increase the reliability of the fetching if possible.

  1. fatal: unable to access 'https://gitlab.com/gitlab-com/www-gitlab-com.git/': The requested URL returned error: 524 when fetching repo (failed job)
  2. fatal: unable to access 'https://gitlab.com/gitlab-com/www-gitlab-com.git/': OpenSSL SSL_connect: SSL_ERROR_SYSCALL in connection to gitlab.com:443 when fetching repo (failed job)
  3. fatal: the remote end hung up unexpectedly - fatal: protocol error: bad pack header when fetching repo (failed job)
  4. error: RPC failed; curl 56 OpenSSL SSL_read: SSL_ERROR_SYSCALL, errno 104 - fatal: the remote end hung up unexpectedly when fetching repo (failed job)

Network errors in scripts (other than repo fetching)

  1. Errno::ENETUNREACH: Failed to open TCP connection to gitlab.com:443 (Network is unreachable - connect(2) for "gitlab.com" port 443) when running bin/generate_handbook_changelog (failed job)

Ideas

  1. Implement CI-level retry, as shown here in the gitlab repo (done in !45662 (merged))
  2. Try to leverage the retry script wrapper for scripts which fail, as shown here? (See the sketch after this list.)
  3. Improve the existing ApiRetry module to catch more errors. (done in !45672 (merged))
  4. See if there are other API calls (via the Gitlab gem) which need to be wrapped in ApiRetry
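
For reference, here is a minimal sketch of the kind of retry wrapper Ideas 2 and 3 describe. It is illustrative only: the module name, method name, and error list are assumptions, not the actual ApiRetry code or retry script wrapper in the repo.

  # Illustrative sketch only, not the repo's actual ApiRetry module.
  require 'net/http'
  require 'openssl'

  module RetrySketch
    # Transient network errors worth retrying; Errno::ENETUNREACH is the one
    # seen in the bin/generate_handbook_changelog failure above.
    NETWORK_ERRORS = [
      Errno::ENETUNREACH,
      Errno::ECONNRESET,
      Net::OpenTimeout,
      Net::ReadTimeout,
      OpenSSL::SSL::SSLError
    ].freeze

    # Runs the given block, retrying on the errors above with a short delay.
    def self.with_retries(attempts: 3, delay: 5)
      tries = 0
      begin
        yield
      rescue *NETWORK_ERRORS => e
        tries += 1
        raise if tries >= attempts
        warn "Retrying after #{e.class}: #{e.message} (attempt #{tries + 1} of #{attempts})"
        sleep delay
        retry
      end
    end
  end

  # Hypothetical usage, wrapping a Gitlab gem API call (Idea 4):
  # RetrySketch.with_retries { Gitlab.merge_requests('gitlab-com/www-gitlab-com') }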

Updates

Summary of Updates

  • The implementation of ideas 1 and 3 seems to have helped. The number of network failures has dropped from multiple per day to zero or one per day.
  • There have been no non-network flakes since implementing ideas 1 and 3 (but we currently have the integration specs commented out due to other flakiness unrelated to the network).
  • The network failures have happened both during a fresh clone and during a fetch-only reinitialize.

Update Apr. 6, 2020

  • Idea 1 (CI retry) has been implemented:
  retry:
    max: 2 # This is confusing but this means "3 runs at max".
    when:
      - unknown_failure
      - api_failure
      - runner_system_failure
      - job_execution_timeout
      - stuck_or_timeout_failure
  • Idea 3 (Improve ApiRetry) has been implemented.
  • We will wait a week and see what flakiness still remains, then re-evaluate what the next steps should be.

Update Apr. 7, 2020

Update Apr. 9, 2020

 Initialized empty Git repository in /builds/gitlab-com/www-gitlab-com/.git/
 Created fresh repository.
 fatal: the remote end hung up unexpectedly
 fatal: early EOF
 fatal: index-pack failed

Update Apr. 10, 2020

 Fetching changes with git depth set to 10...
 Reinitialized existing Git repository in /builds/gitlab-com/www-gitlab-com/.git/
 fatal: unable to access 'https://gitlab.com/gitlab-com/www-gitlab-com.git/': The requested URL returned error: 524

Update Apr. 13, 2020

  • One pipeline failed three jobs on repo clone, with more errors like fatal: the remote end hung up unexpectedly. Here's one: https://gitlab.com/gitlab-com/www-gitlab-com/-/jobs/508591732
  • It's interesting that it failed three jobs in one pipeline and no others. This points to Gitaly just having a hiccup, and isn't necessarily even related to the size of the repo, since cloning has been mostly stable across thousands of other jobs.
  • Perhaps there's a product bug to be filed here: something that could or should be caught by the CI-level retry setting, but isn't?

Update Apr. 15, 2020

 Reinitialized existing Git repository in /builds/gitlab-com/www-gitlab-com/.git/
 fatal: unable to access 'https://gitlab.com/gitlab-com/www-gitlab-com.git/': The requested URL returned error: 524

  • It's interesting that several of the recent failures are on reinitializes rather than full clones. This indicates that it's not necessarily the repo size, because a reinitialize/fetch should be a quick fast-forward which doesn't pull much data down.

Status after observing a week+ of activity (Apr. 21, 2020)

After observing a week's worth of activity on www-gitlab-com build failures, two things stand out:

  • It's interesting that several of the recent failures are on reinitializes rather than full clones. This indicates that it's not necessarily the repo size, because a reinitialize/fetch should be a quick fast-forward which doesn't pull much data down.
  • Perhaps there's a product bug to be filed here: something that could or should be caught by the CI-level retry setting, but isn't? UPDATE: Opened an issue to implement CI-level retry for runner clone/fetch failures.

It doesn't appear necessary at this time to try Idea 2 (retry script wrapper) or Idea 4 (auditing for other API calls to wrap in ApiRetry), because we haven't seen any failures that these would have helped with. If we do, we can look into them.

Update Apr. 21, 2020

Update Apr. 27, 2020

Reopening this issue. After almost a week of no failures since adding the Job Stages Attempts variables, there were two new errors today of a type which hadn't been seen (or at least recorded here) before:

  1. runner system failure... read-only file system
  2. runner system failure... read-only file system
 Running with gitlab-runner 12.10.0-rc2 (6c8c540f)
   on docker-auto-scale-com 1d6b581d
 Preparing the "docker+machine" executor
 Using Docker executor with image registry.gitlab.com/gitlab-org/gitlab-build-images:www-gitlab-com-2.6 ...
 Authenticating with credentials from job payload (GitLab Registry)
 Pulling docker image registry.gitlab.com/gitlab-org/gitlab-build-images:www-gitlab-com-2.6 ...
 Using docker image sha256:32e5ef4517beeb0b933b968da99a1c3b7bb688f2438e5d23d1d9366964a934c3 for registry.gitlab.com/gitlab-org/gitlab-build-images:www-gitlab-com-2.6 ...
 Preparing environment
 Uploading artifacts for failed job
 ERROR: Job failed (system failure): Error response from daemon: mkdir /var/lib/docker/overlay2/73148cecddc032dba75fed934ed09655367763a2d4d714188a4e57a28916cdd0-init: read-only file system (docker.go:788:0s)
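
For context, the "Job Stages Attempts variables" referred to above are the GitLab CI variables that retry individual job stages (source fetch, artifact download, cache restore). A minimal sketch of how they can be set in .gitlab-ci.yml follows; exactly which variables and values were added to this project is not shown here, so treat them as illustrative.

  variables:
    GET_SOURCES_ATTEMPTS: 3          # retries of the repo clone/fetch step
    ARTIFACT_DOWNLOAD_ATTEMPTS: 3    # retries of artifact downloads
    RESTORE_CACHE_ATTEMPTS: 3        # retries of cache restore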

Update Apr. 29, 2020

Several failures related to an infrastructure issue with artifact upload: gitlab-com/gl-infra/production#2031 (closed)

Status May 4, 2020

Re-closing this issue. All failures since Apr. 21 are attributable to other transient infrastructure failures or code-related errors. Otherwise, the pipelines seem very stable with respect to all of the original items mentioned in this issue.

