Backend: Scheduled pipelines are triggering multiple times

Summary

Scheduled pipelines are occasionaly triggered multiple times unexpectedly.

Steps to reproduce

Setup a scheduled pipeline with 0 */3 * * * Custom schedule on Beijing timeline
Wait
Measure and see how many pipelines are triggered per hour with a chart like https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12272748&udv=0

Example Project

There have been a few of these... (click to expand):

GitLab JiHu Code Sync

URL	Owner	Frequency chart	Schedule Screenshot
https://gitlab.com/gitlab-jh/code-sync/-/pipeline_schedules/146378/edit	@gitlab-jh-bot	https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12272748&udv=0

Triage Ops

URL	Owner	Frequency chart	Schedule Screenshot
https://gitlab.com/gitlab-org/quality/triage-ops/-/pipeline_schedules/157187/edit	@gitlab-bot	https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12481416&udv=0

GitLab Bi-Hourly Pipeline

URL	Owner	Frequency chart	Schedule Screenshot
https://gitlab.com/gitlab-org/gitlab/-/pipeline_schedules/23503/edit	@gitlab-bot	https://app.periscopedata.com/app/gitlab/590833/WIP:-Kyle-Wiebers-Scratchpad?widget=12481421&udv=0

What is the current bug behavior?

Pipeline occasionally gets triggered multiple times

What is the expected correct behavior?

Pipeline is triggered once according to the schedule

Relevant logs and/or screenshots

Output of checks

This bug happens on GitLab.com

Results of GitLab environment info

Expand for output related to GitLab environment info


(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)

Results of GitLab application Check

Expand for output related to the GitLab application check

(For installations with omnibus-gitlab package run and paste the output of:
sudo gitlab-rake gitlab:check SANITIZE=true)
(For installations from source run and paste the output of:
sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true)
(we will only investigate if the tests are passing)

Possible fixes

From #338609 (comment 942671252):

PipelineScheduleWorker runs at 04:00am

It finds "runnable schedules"

For each schedule, it uses PipelineScheduleService to execute schedule.schedule_next_run! (which updates the next_run_at column in the database)

It then asynchronously queues a RunPipelineScheduleWorker with the schedule information

Importantly, the above may take longer than 5 minutes, so it may still be processing schedules from the original "runnable schedules" list generated 5 minutes earlier

PipelineScheduleWorker runs at 04:05am - the 04:00am job is still running at this time

It finds "runnable schedules" - this may include schedules that co-exist in the still-running 04:00am job due to next_run_at not being updated yet

It duplicates a RunPipelineScheduleWorker for these schedules

I might be off the mark, but it appears that PipelineScheduleWorker might not be marked as idempotent? Looking at the docs on Sidekiq idempotent jobs, idempotent! with a deduplicate :until_executed strategy looks like it could be a good option. The docs state:

It can be used to prevent jobs from running simultaneously multiple times.

which looks to be the key problem here (I'm assuming that cron-based workers can still use this feature).

Shifting schedule.schedule_next_run! into the RunPipelineScheduleWorker is a good idea, but would duplication still not be possible, if it takes longer than 5 minutes for the worker to be picked up and executed (if there is a large backlog of pending jobs)?

Move schedule_next_run! inside RunPipelineScheduleWorker
Make RunPipelineScheduleWorker idempotent with deduplicate :until_executed
Potentially remove the PipelineScheduleService altogether

To validate the fix we should compare the PipelineScheduleWorker duration before and after to ensure it's lower than 5 minutes.

~~Unknown but may be related to !62826 (merged)~~

Edited May 31, 2022 by Fabio Pitino