Retry review_* jobs up to 2 times

What does this MR do?

Retry review_* jobs up to 2 times

Deploying a new release can hit multiple intermittent instabilities, sometimes related to networking issues and sometimes to cluster resources. Most of the time when we face these, retrying the job is enough to make it succeed. We've been seeing many such failures, and they take a significant amount of engineers' time to debug a problem that ends up being solved by simply retrying the job. Additionally, this job appears to be idempotent, so it should be safe to run it multiple times. A sketch of what the retry policy could look like is included below.
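As a rough sketch of the kind of change this implies (the job name, stage, and script below are placeholders, not this repository's real definitions), a GitLab CI retry policy on a review job could be expressed as:

```yaml
# Hypothetical job definition: names and script are illustrative only.
review_deploy:
  stage: review
  script:
    - ./scripts/deploy_review_app.sh   # placeholder for the actual deploy step
  retry:
    max: 2                             # re-run the job up to 2 times on failure
    when:
      - runner_system_failure          # infrastructure-level flakiness
      - stuck_or_timeout_failure       # jobs that hang or time out
      - script_failure                 # intermittent deploy/network errors surfacing in the script
```

Because the deploy appears to be idempotent, retrying on `script_failure` as well as on infrastructure failures matches the goal of absorbing intermittent networking and cluster-resource errors without manual intervention.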

Related issues

Closes #5455 (context deadline exceeded of KAS k8s-proxy)

The above is just one example; there have been several others with different root causes.

Considerations

Perhaps there are improvements we could make to KAS or to our infrastructure to make these deployments more robust, and retrying might be masking the underlying problem. But the amount of time engineers and maintainers currently spend investigating broken pipelines and fixing timeout/connection-related issues is unacceptable, given how long and how frequently this has been happening. So it seems worth trying this experiment with retries. Either that, or we should keep prioritizing the investigation of these issues and other improvements as P1 until we stop being affected by them, or the effect is minimal.

Another thought: if something is really broken, we will still be able to detect it after the second retry.

Author checklist

See Definition of done.

For anything in this list which will not be completed, please provide a reason in the MR discussion.

Required

  • Merge Request Title and Description are up to date, accurate, and descriptive
  • MR targeting the appropriate branch
  • MR has a green pipeline on GitLab.com
  • When ready for review, follow the instructions in the "Reviewer Roulette" section of the Danger Bot MR comment, as per the Distribution experimental MR workflow

For merge requests from forks, consider the following options for Danger to work properly:

Expected (please provide an explanation if not completing)

  • Test plan indicating conditions for success has been posted and passes
  • Documentation created/updated
  • Tests added/updated
  • Integration tests added to GitLab QA
  • Equivalent MR/issue for omnibus-gitlab opened
  • Equivalent MR/issue for Gitlab Operator project opened (see Operator documentation on impact of Charts changes)
  • Validate potential values for new configuration settings. Formats such as integer 10, duration 10s, URI scheme://user:passwd@host:port may require quotation or other special handling when rendered in a template and written to a configuration file.