
Add retry to reconfigure method to avoid flaky failures

John McDonnell requested to merge jmd-add-retry-to-reconfigure into master

What does this MR do and why?

Works around #677 (closed).

To avoid these flaky failures, this MR adds a retry mechanism to reconfigure: when reconfigure fails, the container is restarted, in the hope that the retry does not hit the same failure.

How to set up and validate locally

The errors found in our pipelines are difficult to replicate, as they appear to be intermittent.
However, the following steps can trigger the error described, by deleting config files during startup.

  • In terminal window 1 - while true; do sleep 1; docker exec $(docker ps -q) bash -c "rm -rf /var/opt/gitlab/"; done
  • In terminal window 2 - ./exe/gitlab-qa Test::Instance::Image EE

With this change in place,

  • If the errors keep happening - reconfigure fails 3 times before exiting with a Gitlab::QA::Docker::Shellout::StatusError, as happens today.
  • If the error only happens once (cancel the script in terminal window 1 after the first failure) - you should see the container start as normal after restarting.
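
The behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: `StatusError` is a local stand-in for Gitlab::QA::Docker::Shellout::StatusError, and `reconfigure_with_retry` and its `on_retry` hook (where a container restart would go) are hypothetical names.

```ruby
# Hypothetical stand-in for Gitlab::QA::Docker::Shellout::StatusError.
class StatusError < StandardError; end

# Runs the block up to max_attempts times (3, matching the description above).
# After each failed attempt except the last, calls on_retry (for example, to
# restart the container) and tries again. Once the attempts are exhausted, the
# last StatusError is re-raised so the failure surfaces as it does today.
def reconfigure_with_retry(max_attempts: 3, on_retry: -> {})
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue StatusError
    raise if attempts >= max_attempts

    on_retry.call
    retry
  end
end

# Usage sketch: the first attempt fails, the second succeeds after one restart.
restart_count = 0
reconfigure_with_retry(on_retry: -> { restart_count += 1 }) do |attempt|
  raise StatusError, 'reconfigure failed' if attempt == 1

  :reconfigured
end
```

Only StatusError is rescued in this sketch, so unrelated exceptions would still fail immediately rather than trigger a restart.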

Working example of change

In the pipelines for this MR, see pipelines#539645510 / job ce:update-parallel 2/5.

Note how we encountered the error Error executing action `enable` on resource 'runit_service[gitlab-exporter]', which previously would have caused the job to fail. We now restart the containers instead of allowing the raised error to fail the job, and the job passes successfully.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by John McDonnell
