Retry component update (Gitaly, KAS) MR pipelines

What does this MR do and why?

Content

Add pipeline retry functionality for component update merge requests (Gitaly and KAS). This change is behind a feature flag called retry_pipeline_service. I'll create the feature flag after this MR has been merged and I'm ready to test.

This MR automates what release managers (RMs) have been doing by hand. It checks the latest pipeline in a Gitaly or KAS MR and either retries the failed jobs or retries the whole pipeline.

  • Implement RetryMrPipelineService to automatically retry failed jobs or create new MR pipelines. This class:

    • Returns without doing anything if the retry_pipeline_service feature flag is not enabled
    • Returns without doing anything if the MR is not open
    • Returns without doing anything if the latest MR pipeline has no failed jobs
    • Retries the pipeline if the latest pipeline is older than 8 hours (GitLab Rails MRs cannot be merged with a pipeline older than 8 hours)
    • Retries the pipeline if jobs have already been retried twice (that is, they have run 3 times)
    • Retries the failed jobs if none of the above applies
  • Create PipelineFailureService to detect and analyze pipeline failures. This class provides RetryMrPipelineService with information such as the failed jobs and the number of retries.

  • Refactor Components::Updater to use the new retry service

    • When a component update MR already exists, the Updater class will now call RetryMrPipelineService to retry the pipeline if required.
    • If RetryMrPipelineService retries the pipeline or jobs, Updater will exit, since there is no point in attempting to set MWPS (merge when pipeline succeeds) while CI is still running.
    • If RetryMrPipelineService did nothing, Updater will continue execution and attempt to set MWPS on the MR.
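
The rules above can be sketched as a small decision function. This is only an illustration of the ordering, not the code in this MR: PipelineState, FailedJob, retry_action and updater_next_step are hypothetical names, the feature flag and MR state are passed in as plain booleans, and "jobs have already been retried twice" is assumed to mean "any failed job has a retry count of 2 or more".

```ruby
# Hypothetical snapshot of what PipelineFailureService might report.
# These structs are illustrative only, not the real release-tools classes.
PipelineState = Struct.new(:created_at, :failed_jobs, keyword_init: true)
FailedJob = Struct.new(:name, :retry_count, keyword_init: true)

MAX_PIPELINE_AGE_SECONDS = 8 * 60 * 60 # pipelines older than 8 hours cannot be merged
MAX_JOB_RETRIES = 2                    # two retries means the job has run three times

# Returns :none, :retry_pipeline or :retry_jobs, checking the rules in the
# order they are listed in the description.
def retry_action(flag_enabled:, mr_open:, pipeline:, now: Time.now)
  return :none unless flag_enabled
  return :none unless mr_open
  return :none if pipeline.failed_jobs.empty?
  return :retry_pipeline if now - pipeline.created_at > MAX_PIPELINE_AGE_SECONDS
  # Assumption: one job reaching the retry limit is enough to retry the pipeline.
  return :retry_pipeline if pipeline.failed_jobs.any? { |job| job.retry_count >= MAX_JOB_RETRIES }

  :retry_jobs
end

# Mirrors the Updater flow: exit when something was retried, otherwise
# continue and attempt to set MWPS.
def updater_next_step(action)
  action == :none ? :set_mwps : :exit
end

# Example: a one-hour-old pipeline whose only failed job has not been retried yet.
now = Time.now
pipeline = PipelineState.new(
  created_at: now - 3600,
  failed_jobs: [FailedJob.new(name: 'example-job', retry_count: 0)]
)
action = retry_action(flag_enabled: true, mr_open: true, pipeline: pipeline, now: now)
puts action                    # prints "retry_jobs"
puts updater_next_step(action) # prints "exit"
```

Returning a symbol instead of performing the retry keeps the ordering of the checks easy to test in isolation; the real service may well be structured differently.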

On a related note, Siddharth has opened an MR to notify component teams when an MR is in an unmergeable state (for example due to merge conflicts): !4110 (merged).

Demo of this MR's functionality: https://youtu.be/HTZOZOIk-so?t=37

gitlab-com/gl-infra/delivery#20867

Tests

Test run (TEST=true) on a KAS MR
➜  release-tools git:(rp/retry-failed-pipelines) TEST=true op run --env-file=".env.read" -- bundle exec rake 'components:update_kas'
2025-05-05 16:59:56.936245 I [dry-run] ReleaseTools::Services::UpdateComponentService -- Finding the current component version -- {component: "gitlab-agent", branch: "master"}
2025-05-05 16:59:57.532999 D [dry-run] ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 16:59:57 +0530] 200 "GET https://gitlab.com/api/v4/projects/gitlab-org%2Fgitlab/repository/files/GITLAB_KAS_VERSION/raw" 41 
2025-05-05 16:59:59.285544 D [dry-run] ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 16:59:59 +0530] 200 "GET https://gitlab.com/api/v4/projects/gitlab-org%2Fsecurity%2Fcluster-integration%2Fgitlab-agent/repository/commits" - 
2025-05-05 17:00:00.584509 D [dry-run] ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:00:00 +0530] 200 "GET https://gitlab.com/api/v4/projects/gitlab-org%2Fsecurity%2Fcluster-integration%2Fgitlab-agent/repository/commits/6276aded2a69e5b41771a8955983fcbca1a323d2" - 
2025-05-05 17:00:02.241944 D [dry-run] ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:00:02 +0530] 200 "GET https://gitlab.com/api/v4/projects/gitlab-org%2Fgitlab/merge_requests" - 
2025-05-05 17:00:02.242269 I [dry-run] ReleaseTools::Tasks::Components::UpdateKas -- Found existing merge request -- {merge_request: "https://gitlab.com/gitlab-org/gitlab/-/merge_requests/190166", mwps: false, merge_status: "ci_must_pass"}
2025-05-05 17:00:03.147258 D [dry-run] ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:00:03 +0530] 200 "GET https://gitlab.com/api/v4/projects/278964/pipelines" - 
2025-05-05 17:00:03.759921 D [dry-run] ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:00:03 +0530] 200 "GET https://gitlab.com/api/v4/projects/278964/pipelines/1800398399/jobs" - 
2025-05-05 17:00:04.618101 D [dry-run] ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:00:04 +0530] 200 "GET https://gitlab.com/api/v4/projects/278964/pipelines/1800398399/bridges" 2 
2025-05-05 17:00:06.687129 D [dry-run] ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:00:06 +0530] 200 "GET https://gitlab.com/api/v4/projects/278964/pipelines/1800398399/jobs" - 
2025-05-05 17:00:06.688885 I [dry-run] ReleaseTools::Services::RetryMrPipelineService -- Retrying failed job -- {failed_job: "https://gitlab.com/gitlab-org/gitlab/-/jobs/9927711094"}
2025-05-05 17:00:06.688906 I [dry-run] ReleaseTools::Tasks::Components::UpdateKas -- The MR pipeline or job(s) were retried

Actual run (without TEST=true) on the same KAS MR
➜  release-tools git:(rp/retry-failed-pipelines) ✗ op run --env-file=".env.write" -- bundle exec rake 'components:update_kas' 
2025-05-05 17:01:54.110084 I ReleaseTools::Services::UpdateComponentService -- Finding the current component version -- {component: "gitlab-agent", branch: "master"}
2025-05-05 17:01:54.762484 D ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:01:54 +0530] 200 "GET https://gitlab.com/api/v4/projects/gitlab-org%2Fgitlab/repository/files/GITLAB_KAS_VERSION/raw" 41 
2025-05-05 17:01:55.611271 D ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:01:55 +0530] 200 "GET https://gitlab.com/api/v4/projects/gitlab-org%2Fsecurity%2Fcluster-integration%2Fgitlab-agent/repository/commits" - 
2025-05-05 17:01:56.195403 D ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:01:56 +0530] 200 "GET https://gitlab.com/api/v4/projects/gitlab-org%2Fsecurity%2Fcluster-integration%2Fgitlab-agent/repository/commits/6276aded2a69e5b41771a8955983fcbca1a323d2" - 
2025-05-05 17:01:56.900448 D ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:01:56 +0530] 200 "GET https://gitlab.com/api/v4/projects/gitlab-org%2Fgitlab/merge_requests" - 
2025-05-05 17:01:56.900891 I ReleaseTools::Tasks::Components::UpdateKas -- Found existing merge request -- {merge_request: "https://gitlab.com/gitlab-org/gitlab/-/merge_requests/190166", mwps: false, merge_status: "ci_must_pass"}
2025-05-05 17:01:57.340283 D ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:01:57 +0530] 200 "GET https://gitlab.com/api/v4/projects/278964/pipelines" - 
2025-05-05 17:01:58.143424 D ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:01:58 +0530] 200 "GET https://gitlab.com/api/v4/projects/278964/pipelines/1800398399/jobs" - 
2025-05-05 17:01:58.633581 D ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:01:58 +0530] 200 "GET https://gitlab.com/api/v4/projects/278964/pipelines/1800398399/bridges" 2 
2025-05-05 17:01:59.631605 D ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:01:59 +0530] 200 "GET https://gitlab.com/api/v4/projects/278964/pipelines/1800398399/jobs" - 
2025-05-05 17:02:01.622762 D ReleaseTools::GitlabClient -- [HTTParty] [2025-05-05 17:02:01 +0530] 201 "POST https://gitlab.com/api/v4/projects/278964/jobs/9927711094/retry" 3115 
2025-05-05 17:02:01.623038 I ReleaseTools::Services::RetryMrPipelineService -- Retried job -- {new_job: "https://gitlab.com/gitlab-org/gitlab/-/jobs/9929016910"}
2025-05-05 17:02:01.623053 I ReleaseTools::Tasks::Components::UpdateKas -- The MR pipeline or job(s) were retried
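
The two runs above differ only in the last step: the dry run logs the job it would retry, while the actual run sends a POST to the job's retry endpoint. A minimal sketch of that pattern follows. RecordingClient and retry_job are hypothetical stand-ins written for this illustration and are not the release-tools API; only the endpoint path is taken from the log.

```ruby
# Hypothetical stand-in for an API client. It makes no network calls and only
# records the paths it is asked to POST to.
class RecordingClient
  attr_reader :requests

  def initialize
    @requests = []
  end

  def post(path)
    @requests << path
    :created
  end
end

# In dry-run mode, log the intended retry and send nothing; otherwise issue
# the retry request. Returns :skipped or :retried.
def retry_job(client, project_id:, job_id:, dry_run:)
  path = "/projects/#{project_id}/jobs/#{job_id}/retry"

  if dry_run
    puts "[dry-run] Retrying failed job via POST #{path}"
    return :skipped
  end

  client.post(path)
  :retried
end
```

Called with dry_run: true, the function leaves client.requests empty; with dry_run: false, it records the retry path once.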

Author Check-list

  • Has documentation been updated?
Edited by Reuben Pereira
