Failure in qa-schedules-subgroup-cleanup job - TCP connection timeout
Problem
Rake cleanup jobs for all Staging QA runs are failing today: the connection to staging.gitlab.com times out after about 250 API calls to delete subgroups.
total_sub_groups: 372
total_sub_group_pages: 4
==== Current Page: 1 ====
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
==== Current Page: 2 ====
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
==== Current Page: 3 ====
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
ERROR: Job failed: exit code 1
This may be due to Cloudflare, based on #199279 (comment 278340697). That now looks unlikely, because no 524 response was returned: https://gitlab.slack.com/archives/CB3LSMEJV/p1580325752470200?thread_ts=1580317297.454700&cid=CB3LSMEJV
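The "about 250" figure can be checked against a job log by counting the progress markers printed before the abort. The following is a minimal sketch, not part of the QA tooling: `count_progress_markers` is a hypothetical helper, the sample log is made up for illustration, and it assumes each `.` or `F` character corresponds to one delete call (the log itself does not say what the two characters mean).

```ruby
# Hypothetical helper: count the progress markers a cleanup run printed.
# Assumption: each "." or "F" stands for one delete API call.
def count_progress_markers(log)
  log.each_line
     .map(&:strip)
     .select { |line| line.match?(/\A[.F]+\z/) } # keep marker-only lines
     .sum(&:length)
end

# Made-up sample in the same shape as the job output.
sample = <<~LOG
  total_sub_groups: 5
  ==== Current Page: 1 ====
  FF.
  ==== Current Page: 2 ====
  F
  ERROR: Job failed: exit code 1
LOG

puts count_progress_markers(sample)
```

Feeding the raw job log through the same helper gives the number of calls that completed before the connection timed out.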
Impact
Blocking Release Deployer per #199279 (comment 278365579) - mitigated by gitlab-org/quality/pipeline-common!24 (merged)
Diagnosing steps
- Manual retries of the failed job fail at the same spot.
- Local runs of the subgroup cleanup succeed; failures occur only when running via the example job below on https://ops.gitlab.net/gitlab-org/quality/staging.
- Last known successful run was https://ops.gitlab.net/gitlab-org/quality/staging/-/jobs/869287
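Because the failure is a TCP connect timeout rather than an HTTP error, one further check is whether the runner can open a socket to the host at all, independent of rest-client. This is a rough sketch using only Ruby's standard library; `tcp_probe` is a hypothetical name and the 5-second timeout is an arbitrary choice.

```ruby
require 'socket'

# Hypothetical helper: try to open a TCP connection and report the outcome.
# Returns :ok on success, :timeout if the connect timed out, and :error for
# any other socket-level failure (connection refused, DNS failure, ...).
def tcp_probe(host, port, timeout: 5)
  Socket.tcp(host, port, connect_timeout: timeout) { |_sock| :ok }
rescue Errno::ETIMEDOUT
  :timeout
rescue SystemCallError, SocketError
  :error
end
```

Running `tcp_probe('staging.gitlab.com', 443)` from the CI job, both before the cleanup and right after it aborts, would show whether connections are blocked only once the job has made many calls.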
Example job: https://ops.gitlab.net/gitlab-org/quality/staging/-/jobs/870018
$ bundle exec rake delete_subgroups GITLAB_ADDRESS=$GITLAB_ADDRESS
rake aborted!
RestClient::Exceptions::OpenTimeout: Timed out connecting to server
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:739:in `rescue in transmit'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:642:in `transmit'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/support/api.rb:42:in `delete'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:54:in `block in delete_subgroups'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:53:in `each'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:53:in `delete_subgroups'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:44:in `block in run'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:35:in `times'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:35:in `run'
/builds/gitlab-org/quality/staging/gitlab/qa/Rakefile:12:in `block in <top (required)>'
/usr/local/bundle/gems/rake-12.3.0/exe/rake:27:in `<top (required)>'
/usr/local/bin/bundle:23:in `load'
/usr/local/bin/bundle:23:in `<main>'
Caused by:
Errno::ETIMEDOUT: Failed to open TCP connection to staging.gitlab.com:443 (Operation timed out - connect(2) for "staging.gitlab.com" port 443)
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:715:in `transmit'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/support/api.rb:42:in `delete'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:54:in `block in delete_subgroups'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:53:in `each'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:53:in `delete_subgroups'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:44:in `block in run'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:35:in `times'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:35:in `run'
/builds/gitlab-org/quality/staging/gitlab/qa/Rakefile:12:in `block in <top (required)>'
/usr/local/bundle/gems/rake-12.3.0/exe/rake:27:in `<top (required)>'
/usr/local/bin/bundle:23:in `load'
/usr/local/bin/bundle:23:in `<main>'
Caused by:
Errno::ETIMEDOUT: Operation timed out - connect(2) for "staging.gitlab.com" port 443
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:715:in `transmit'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:145:in `execute'
/usr/local/bundle/gems/rest-client-2.0.2/lib/restclient/request.rb:52:in `execute'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/support/api.rb:42:in `delete'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:54:in `block in delete_subgroups'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:53:in `each'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:53:in `delete_subgroups'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:44:in `block in run'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:35:in `times'
/builds/gitlab-org/quality/staging/gitlab/qa/qa/tools/delete_subgroups.rb:35:in `run'
/builds/gitlab-org/quality/staging/gitlab/qa/Rakefile:12:in `block in <top (required)>'
/usr/local/bundle/gems/rake-12.3.0/exe/rake:27:in `<top (required)>'
/usr/local/bin/bundle:23:in `load'
/usr/local/bin/bundle:23:in `<main>'
Tasks: TOP => delete_subgroups
(See full trace by running task with --trace)
Running...
total_sub_groups: 372
total_sub_group_pages: 4
==== Current Page: 1 ====
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF.FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
==== Current Page: 2 ====
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
==== Current Page: 3 ====
FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
ERROR: Job failed: exit code 1
The task runs fine locally, which makes this confusing. The failure does not directly affect subsequent QA test runs on Staging; the job exists mainly to keep Staging clean.
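One possible mitigation, if the timeouts turn out to be transient, is to retry the delete call with a growing delay instead of aborting the whole task on the first `RestClient::Exceptions::OpenTimeout`. The sketch below is an assumption, not the actual `delete_subgroups.rb` code: `with_retries` is a hypothetical helper, and the demo uses a stand-in error class so it runs without rest-client installed.

```ruby
# Hypothetical helper, not part of the QA tooling: run the block and retry it
# when it raises one of the given errors, doubling the delay after each
# failed attempt. Re-raises the error once all attempts are used up.
def with_retries(errors:, attempts: 3, base_delay: 1.0)
  tries = 0
  begin
    tries += 1
    yield
  rescue *errors
    raise if tries >= attempts
    sleep(base_delay * (2**(tries - 1)))
    retry
  end
end

# Stand-in for RestClient::Exceptions::OpenTimeout so the demo is self-contained.
class FakeOpenTimeout < StandardError; end

calls = 0
result = with_retries(errors: [FakeOpenTimeout], attempts: 3, base_delay: 0.01) do
  calls += 1
  raise FakeOpenTimeout if calls < 3 # fail twice, then succeed
  :deleted
end
puts "#{result} after #{calls} attempts"
```

In the real task the block would wrap the delete request and `errors:` would list `RestClient::Exceptions::OpenTimeout`. If the connection is being blocked deliberately (for example by rate limiting), retries alone will not help, and spacing out the calls would be the more relevant change.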
Edited by Kyle Wiebers