Investigate potential performance issue with QA Runners
We have seen a few tests failing either with job timeouts or with Chrome crashing, which points to performance issues.
Selenium::WebDriver::Error::InvalidArgumentError:
invalid argument
(Session info: chrome=120.0.6099.216)
# #0 0x558196fdff83 <unknown>
# #1 0x558196c98b2b <unknown>
# #2 0x558196c7eff6 <unknown>
# #3 0x558196c7c821 <unknown>
# #4 0x558196c7d01a <unknown>
# #5 0x558196c9bbbe <unknown>
# #6 0x558196d317a5 <unknown>
# #7 0x558196d120b2 <unknown>
# #8 0x558196d31006 <unknown>
# #9 0x558196d11e53 <unknown>
# #10 0x558196cd9dd4 <unknown>
# #11 0x558196cdb1de <unknown>
# #12 0x558196fa4531 <unknown>
# #13 0x558196fa8455 <unknown>
# #14 0x558196f90f55 <unknown>
# #15 0x558196fa90ef <unknown>
# #16 0x558196f7499f <unknown>
# #17 0x558196fcd008 <unknown>
# #18 0x558196fcd1d7 <unknown>
# #19 0x558196fdf124 <unknown>
# #20 0x7facd1db0ea7 start_thread
# ./qa/runtime/browser.rb:263:in `perform'
# ./qa/runtime/browser.rb:38:in `visit'
# ./qa/runtime/browser.rb:42:in `visit'
# ./qa/page/main/login.rb:209:in `redirect_to_login_page'
# ./qa/flow/login.rb:23:in `block in sign_in'
# ./qa/scenario/actable.rb:16:in `perform'
# ./qa/flow/login.rb:22:in `sign_in'
We also observed the following in the runner logs:
Error creating machine: Error in driver during machine creation: operation error: {IP_SPACE_EXHAUSTED IP space of 'projects/gitlab-qa-runners-2
🔍 Latest findings
We see two potential causes for this incident. Both are only hypotheses at this point, but acting on them may help mitigate the incident.
Hypothesis 1: IPs are exhausted and cannot create runners
In Kibana, the runner logs show several errors when there is a significant increase in pipelines triggering jobs.
Aggregating by most common error being logged:
1358 (35.9%): Machine creation failed
1259 (33.2%): Error creating machine: Error in driver during machine creation: operation error: {IP_SPACE_EXHAUSTED IP space of 'projects/gitlab-qa-runners-2/regions/us-east1/subnetworks/ephemeral-runners' is exhausted. [] []}
See the documentation on the IP_SPACE_EXHAUSTED error message, which is returned when trying to create Google Compute Engine instances.
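For anyone who wants to repeat the "most common error" aggregation outside Kibana, here is a minimal sketch. It assumes the log messages have already been exported as plain strings; the function name `top_errors` and the sample data are hypothetical and are not taken from the actual runner logs.

```python
from collections import Counter


def top_errors(messages, n=2):
    """Return (message, count, percent) tuples for the n most common messages."""
    counts = Counter(messages)
    total = sum(counts.values())
    return [
        (message, count, round(100 * count / total, 1))
        for message, count in counts.most_common(n)
    ]


# Hypothetical sample, only to show the shape of the result.
sample = (
    ["Machine creation failed"] * 3
    + ["Error creating machine"] * 2
    + ["Some other message"] * 5
)

for message, count, percent in top_errors(sample):
    print(f"{count} ({percent}%): {message}")
```

The percentages are relative to all messages in the input, which matches how the shares above are reported next to the raw counts.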
Potential Solution
We opened an MR in our infra repo config-mgmt to reconfigure the CIDR blocks and extend the ephemeral-runners subnet.
We are also reducing the idle machine count. The current value was noted as a very aggressive number of lookahead machines to keep available, which further contributes to the problem.
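To reason about how subnet size and idle count interact, the following rough sketch estimates how many VM addresses a CIDR block provides and how much headroom remains once idle and busy machines are counted. The CIDR blocks, idle count, and peak job count below are hypothetical placeholders, not the values from the MR. The sketch also assumes that Google Cloud reserves four addresses in each subnet's primary IPv4 range, and that each runner VM consumes exactly one address.

```python
import ipaddress

# Assumption: GCP reserves 4 addresses per primary IPv4 subnet range
# (network, default gateway, second-to-last, and broadcast).
GCP_RESERVED_ADDRESSES = 4


def usable_addresses(cidr: str) -> int:
    """Number of addresses left for VMs in the given subnet range."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - GCP_RESERVED_ADDRESSES, 0)


def headroom(cidr: str, idle_count: int, peak_busy: int) -> int:
    """Addresses left over when idle and busy machines exist at the same time.

    A negative result means machine creation can hit IP_SPACE_EXHAUSTED.
    """
    return usable_addresses(cidr) - (idle_count + peak_busy)


# Hypothetical values for illustration only.
print(usable_addresses("10.0.0.0/24"))                          # 252
print(usable_addresses("10.0.0.0/22"))                          # 1020
print(headroom("10.0.0.0/24", idle_count=100, peak_busy=200))   # -48
print(headroom("10.0.0.0/22", idle_count=50, peak_busy=200))    # 770
```

Under these assumptions, both changes work in the same direction: a wider CIDR block raises the ceiling, and a lower idle count reduces the number of addresses held by machines that are not running jobs.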
Hypothesis 2: Less than optimal machine setup
The runners that were serviced previously used different base machine images, which can make the VMs more performant.
Potential Solution
We will follow the configuration of all other runner managers for distribution and dotcom, and use a COS base image for the VMs and a lightweight Alpine image for the Docker machines. MR: https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/4603