Add retry to reconfigure method to avoid flaky failures (!937) · Merge requests · GitLab.org / GitLab QA

John McDonnell requested to merge jmd-add-retry-to-reconfigure into master May 15, 2022

What does this MR do and why?

To avoid flaky failures during reconfigure, this MR adds a retry mechanism to the reconfigure method. When reconfigure fails, the container is restarted and the step is retried, in the hope that the retry does not hit the same failure.
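
As a rough illustration of the idea (not the actual diff in this MR), the retry could look something like the sketch below. The names StatusError, FakeContainer, and reconfigure_with_retry are hypothetical stand-ins made up for this example. The limit of 3 attempts is taken from the validation notes in this description. The real implementation in GitLab QA may wait on container logs rather than call reconfigure directly.

```ruby
# Hypothetical stand-in for Gitlab::QA::Docker::Shellout::StatusError.
class StatusError < StandardError; end

# Hypothetical container double: reconfigure fails a fixed number of
# times and then succeeds. This is not the real GitLab QA container class.
class FakeContainer
  attr_reader :restarts, :reconfigure_calls

  def initialize(failures:)
    @failures_left = failures
    @restarts = 0
    @reconfigure_calls = 0
  end

  def reconfigure
    @reconfigure_calls += 1
    return :ok if @failures_left <= 0

    @failures_left -= 1
    raise StatusError, 'reconfigure failed'
  end

  def restart
    @restarts += 1
  end
end

# Try reconfigure up to max_attempts times, restarting the container
# between attempts. The last failure is re-raised to the caller.
def reconfigure_with_retry(container, max_attempts: 3)
  attempts = 0
  begin
    attempts += 1
    container.reconfigure
  rescue StatusError
    raise if attempts >= max_attempts

    container.restart
    retry
  end
end

# A container that fails once recovers after a single restart.
flaky = FakeContainer.new(failures: 1)
reconfigure_with_retry(flaky)
```

Bounding the number of attempts keeps a persistently broken container from looping forever: once the limit is reached, the original error still surfaces and the job fails as it did before.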

How to set up and validate locally

Replicating the errors found in our pipelines is difficult, as they are intermittent.
However, the following steps can trigger the error described above by deleting config files during startup.

In terminal window 1 - while true; do sleep 1; docker exec $(docker ps -q) bash -c "rm -rf /var/opt/gitlab/"; done
In terminal window 2 - ./exe/gitlab-qa Test::Instance::Image EE

With this change in place:

If the errors continue happening - reconfigure will fail 3 times before exiting with a Gitlab::QA::Docker::Shellout::StatusError, as happens today.
If the errors only happen once (cancel the script in terminal window 1 after it fails once) - you should see the container start as normal after restarting.

Working example of change

In the pipelines for this MR, see pipelines#539645510 / job ce:update-parallel 2/5.

Note how we encountered an error, Error executing action `enable` on resource 'runit_service[gitlab-exporter]', which previously would have caused the job to fail. We now restart the containers instead of allowing the raised error to fail the job, and the job passes successfully.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited May 16, 2022 by John McDonnell
