Backend: Scheduled pipelines are triggering multiple times

Summary

Scheduled pipelines are occasionally triggered multiple times unexpectedly.

Steps to reproduce

Example Project

There have been a few of these:

GitLab JiHu Code Sync

| URL | Owner | Frequency chart | Schedule Screenshot |
| --- | --- | --- | --- |
| https://gitlab.com/gitlab-jh/code-sync/-/pipeline_schedules/146378/edit | @gitlab-jh-bot | https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12272748&udv=0 | image |

Triage Ops

| URL | Owner | Frequency chart | Schedule Screenshot |
| --- | --- | --- | --- |
| https://gitlab.com/gitlab-org/quality/triage-ops/-/pipeline_schedules/157187/edit | @gitlab-bot | https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12481416&udv=0 | image |

GitLab Bi-Hourly Pipeline

| URL | Owner | Frequency chart | Schedule Screenshot |
| --- | --- | --- | --- |
| https://gitlab.com/gitlab-org/gitlab/-/pipeline_schedules/23503/edit | @gitlab-bot | https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12481421&udv=0 | image |

What is the current bug behavior?

A scheduled pipeline is occasionally triggered multiple times for a single scheduled run.

What is the expected correct behavior?

Pipeline is triggered once according to the schedule

Relevant logs and/or screenshots

Output of checks

This bug happens on GitLab.com

Results of GitLab environment info

(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)

Results of GitLab application Check

(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:check SANITIZE=true`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)

(we will only investigate if the tests are passing)

Possible fixes

From #338609 (comment 942671252):

  1. PipelineScheduleWorker runs at 04:00am
  2. It finds "runnable schedules"
  3. For each schedule, it uses PipelineScheduleService to execute schedule.schedule_next_run! (which updates the next_run_at column in the database)
  4. It then asynchronously queues a RunPipelineScheduleWorker with the schedule information
  5. Importantly, steps 2-4 may take longer than 5 minutes, so the job may still be processing schedules from the original "runnable schedules" list when the next run starts 5 minutes later
  6. PipelineScheduleWorker runs at 04:05am - the 04:00am job is still running at this time
  7. It finds "runnable schedules" - this may include schedules that co-exist in the still-running 04:00am job due to next_run_at not being updated yet
  8. It queues a duplicate RunPipelineScheduleWorker for these schedules
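
The sequence above can be reproduced with a small in-memory model. This is only a sketch of the suspected race, not GitLab's code: `Schedule`, `runnable`, and `simulate_overlap` are hypothetical names, the one-hour increment stands in for `schedule_next_run!`, and appending to `enqueued` stands in for queuing a RunPipelineScheduleWorker.

```ruby
# Hypothetical in-memory model of the suspected race; not GitLab's actual code.
Schedule = Struct.new(:id, :next_run_at)

# Schedules whose next_run_at is due at `now` (the "runnable schedules" list).
def runnable(schedules, now)
  schedules.select { |s| s.next_run_at <= now }
end

# Simulates two overlapping cron runs, `interval` seconds apart.
# `processed_before_overlap` is how many schedules the first run has handled
# by the time the second run takes its own snapshot.
def simulate_overlap(schedules, now:, interval:, processed_before_overlap:)
  enqueued = []
  first_snapshot = runnable(schedules, now)

  process = lambda do |schedule, at|
    schedule.next_run_at = at + 3600 # stands in for schedule_next_run!
    enqueued << schedule.id          # stands in for queuing RunPipelineScheduleWorker
  end

  # The first run gets through part of its list...
  first_snapshot.first(processed_before_overlap).each { |s| process.call(s, now) }

  # ...then the second run starts and still sees the unprocessed schedules as due.
  second_snapshot = runnable(schedules, now + interval)

  # Both runs finish working through their own snapshots.
  first_snapshot.drop(processed_before_overlap).each { |s| process.call(s, now) }
  second_snapshot.each { |s| process.call(s, now + interval) }

  enqueued
end

schedules = (1..4).map { |id| Schedule.new(id, 0) }
enqueued = simulate_overlap(schedules, now: 0, interval: 300, processed_before_overlap: 2)
duplicates = enqueued.tally.select { |_, count| count > 1 }.keys

puts "enqueued: #{enqueued.inspect}"                   # [1, 2, 3, 4, 3, 4]
puts "duplicated schedule ids: #{duplicates.inspect}"  # [3, 4]
```

Only the schedules that the first run had not yet reached when the second run started end up duplicated, which would be consistent with the duplication being occasional rather than constant.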

I might be off the mark, but it appears that PipelineScheduleWorker might not be marked as idempotent? Looking at the docs on Sidekiq idempotent jobs, idempotent! with a deduplicate :until_executed strategy looks like it could be a good option. The docs state:

It can be used to prevent jobs from running simultaneously multiple times.

which looks to be the key problem here (I'm assuming that cron-based workers can still use this feature).
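
To illustrate what an "until executed" strategy means in principle, here is a minimal in-memory sketch. It is not GitLab's or Sidekiq's implementation: `DedupQueue`, `enqueue`, and `finished` are hypothetical names, and the assumption is simply that a job's deduplication key is held from enqueue until the job has finished running.

```ruby
require "set"

# Hypothetical stand-in for a job queue that deduplicates "until executed":
# a job with the same worker name and arguments is rejected while an earlier
# one is still queued or running.
class DedupQueue
  def initialize
    @in_flight = Set.new
  end

  # Returns true if the job was accepted, false if it was dropped as a duplicate.
  def enqueue(worker_name, args)
    key = [worker_name, args]
    return false if @in_flight.include?(key)

    @in_flight << key
    true
  end

  # Called once the job has finished executing; only now is the key released.
  def finished(worker_name, args)
    @in_flight.delete([worker_name, args])
  end
end

queue = DedupQueue.new
first  = queue.enqueue("PipelineScheduleWorker", []) # 04:00 run is accepted
second = queue.enqueue("PipelineScheduleWorker", []) # 04:05 run arrives while 04:00 is still running
queue.finished("PipelineScheduleWorker", [])         # 04:00 run completes
third  = queue.enqueue("PipelineScheduleWorker", []) # next run is accepted again

puts [first, second, third].inspect # [true, false, true]
```

Under this model the 04:05 run would be skipped rather than started alongside the 04:00 run. Whether cron-based workers actually behave this way is the open assumption noted above.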

Shifting schedule.schedule_next_run! into the RunPipelineScheduleWorker is a good idea, but wouldn't duplication still be possible if it takes longer than 5 minutes for the worker to be picked up and executed (for example, if there is a large backlog of pending jobs)?

  1. Move schedule_next_run! inside RunPipelineScheduleWorker
  2. Make RunPipelineScheduleWorker idempotent with deduplicate :until_executed
  3. Potentially remove the PipelineScheduleService altogether
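
The interaction between steps 1 and 2 can be sketched with another in-memory model. Everything here is hypothetical (`Schedule`, `run_with_backlog`, the one-hour increment, and the assumption that the queue is backed up for the whole period); it only illustrates why moving `schedule_next_run!` on its own might not be enough when jobs wait longer than the cron interval.

```ruby
require "set"

# Hypothetical in-memory model; names mirror the issue but this is not GitLab code.
Schedule = Struct.new(:id, :next_run_at)

# Fires `ticks` cron ticks, `interval` seconds apart, while the queue is backed up
# (no per-schedule job is picked up until all ticks have fired).
# Returns the ids of schedules for which a pipeline would be created, in order.
def run_with_backlog(schedules, ticks:, interval:, deduplicate:)
  queue = []
  pending = Set.new

  ticks.times do |tick|
    now = tick * interval
    schedules.select { |s| s.next_run_at <= now }.each do |s|
      # Step 2: drop the job if one for the same schedule is still pending.
      next if deduplicate && pending.include?(s.id)

      pending << s.id
      queue << s
    end
  end

  # The backlog clears. Step 1: each job advances next_run_at itself.
  queue.map do |s|
    s.next_run_at += 3600 # stands in for schedule_next_run!
    pending.delete(s.id)
    s.id
  end
end

build = -> { (1..2).map { |id| Schedule.new(id, 0) } }

without_dedup = run_with_backlog(build.call, ticks: 2, interval: 300, deduplicate: false)
with_dedup    = run_with_backlog(build.call, ticks: 2, interval: 300, deduplicate: true)

puts without_dedup.inspect # [1, 2, 1, 2]
puts with_dedup.inspect    # [1, 2]
```

In this model, step 1 alone still produces duplicates when the backlog outlasts the cron interval, while adding the per-schedule deduplication from step 2 removes them.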

To validate the fix we should compare the PipelineScheduleWorker duration before and after to ensure it's lower than 5 minutes.

Unknown but may be related to !62826 (merged)

Edited by Fabio Pitino