Delayed jobs

Shinya Maeda requested to merge scheduled-manual-jobs into master

What does this MR do?

This MR implements a new feature: delayed jobs.

Basic concept

  • When a job in .gitlab-ci.yml specifies when: delayed and start_in: X sec/min/hour, the job starts running after X sec/min/hour instead of running immediately.
  • A delayed job can be unscheduled via the job's unschedule button. "Unschedule" means that the delayed job will not be executed automatically in the future, but users can still play it manually.
  • A delayed job can be canceled via the pipeline cancel button or the job cancel button. "Cancel" means that the delayed job will not be executed automatically in the future, but users can still retry it manually.
  • A delayed job can be triggered immediately. The button is on the job index page.
  • The timer of a delayed job starts ticking right after the previous stage has finished.
  • A delayed job blocks the current stage by default. For example, if a job is delayed by 1 hour, the current stage is blocked for 1 hour.
  • A manual job blocks a stage only when allow_failure: false is set, whereas a delayed job blocks the stage regardless of the allow_failure value.
  • A delayed job is delayed only once. For example, if a delayed job is canceled or unscheduled, users can only retry or play it, and in that case the job starts immediately.
  • UI/UX requirements are described in the issue description.

State transition of scheduled

Today we have 8 core statuses for ci_builds.status: created, pending, running, success, failed, canceled, skipped, and manual.

This MR adds a new status, scheduled, to ci_builds.status. The state transitions are as follows:

  • Every job's status starts at created.
  • When a job is scheduled to run in the future, the created status transitions to scheduled via ProcessPipelineService. BuildScheduleWorker (a Sidekiq worker) is scheduled to run at the right time.
  • When the scheduled time arrives, the scheduled status transitions to pending via RunScheduledBuildService. Before doing so, BuildScheduleWorker first checks whether the scheduled job is still playable.
  • When a user plays the scheduled job immediately, the scheduled status transitions to pending via PlayBuildService.
  • When a job is unscheduled while in the scheduled status, the status transitions to manual. In this case, BuildScheduleWorker (the scheduled Sidekiq job) will not proceed to RunScheduledBuildService.
  • When a job is unscheduled while it is not in the scheduled status, the system raises an exception.
  • When a job is canceled while in the scheduled status, the status transitions to canceled. In this case, BuildScheduleWorker (the scheduled Sidekiq job) will not proceed to RunScheduledBuildService.
  • The scheduled state transition is irreversible. A scheduled job can transition to pending, but no status other than created can transition back to scheduled.
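
The transitions listed above can be sketched as a small lookup table. This is only an illustration in plain Ruby: the class name DelayedJobSketch, the event names, and the InvalidTransition error are hypothetical and are not taken from the GitLab codebase, where the services named above drive these transitions.

```ruby
# Hypothetical, simplified model of the transitions listed above.
# It is not the state machine used in the GitLab codebase.
class DelayedJobSketch
  class InvalidTransition < StandardError; end

  # event => { status before => status after }
  TRANSITIONS = {
    schedule:   { created: :scheduled },   # job has when: delayed
    run:        { scheduled: :pending },   # the delay has elapsed
    play:       { scheduled: :pending },   # a user plays it immediately
    unschedule: { scheduled: :manual },
    cancel:     { scheduled: :canceled }
  }.freeze

  attr_reader :status

  def initialize
    @status = :created
  end

  # Applies an event, or raises if it is not allowed from the current status.
  def fire(event)
    @status = TRANSITIONS.fetch(event).fetch(@status) do
      raise InvalidTransition, "cannot #{event} a job in #{@status} status"
    end
  end
end

job = DelayedJobSketch.new
job.fire(:schedule)   # created -> scheduled
job.fire(:unschedule) # scheduled -> manual
```

Firing unschedule a second time would raise InvalidTransition, which mirrors the rule that unscheduling a job outside the scheduled status raises an exception. No event in the table leads back to scheduled except from created, which matches the irreversibility rule.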

Concerns

  • Pipeline/Build status is tightly coupled with the BE (e.g. Gitlab::Ci::Status::Build::Scheduled directly determines what the frontend components show). Can we create a dynamic component for a specific state (i.e. scheduled jobs)? => We'll follow up.
  • What's the compound status of a stage? (e.g. Job A: running, Job B: pending, Job C: scheduled => What is shown on pipeline-mini-graph?)
  • What if Sidekiq jobs are lost? https://gitlab.com/gitlab-org/gitlab-ce/issues/36791. Do we just leave them, or do we introduce a clean-up worker? => We clean up stale schedules in StuckCiJobsWorker.

Performance implication

In this MR, we add a scheduled_at column to the ci_builds table. This column is UPDATEd when the build is scheduled (to set the date) and again when the scheduled build finishes (to clear the date). Both updates are folded into the query that already performs the status transition (e.g. UPDATE ci_builds SET status = 'scheduled' WHERE id = 100), so no additional queries are executed during the life cycle (e.g. UPDATE ci_builds SET status = 'scheduled', scheduled_at = '2018-09-24 10:06:19.385977' WHERE id = 100).

However, due to the Sidekiq reliability problem, we can't guarantee that every scheduled job will be executed. A few jobs might get stuck in the scheduled state when the corresponding BuildScheduleWorker job has been lost to a SIGKILL.

To rescue those potential orphans, we're going to add a cleanup phase for stale scheduled jobs. This operation is included in StuckCiJobsWorker, since that worker is already meant to handle stale pending/running builds. To find stale scheduled builds, the worker executes SELECT * FROM ci_builds WHERE scheduled_at IS NOT NULL AND scheduled_at < '1 day ago'. Given that ci_builds is a very big table (at the moment, it contains over 100 million rows), we add a partial index on the (scheduled_at, id) columns where scheduled_at IS NOT NULL. This should make the operation much faster: it uses an Index Scan as the first step, so the expensive date comparison is performed only on a small subset of rows.
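
As a rough illustration of the predicate in that query, here is a plain-Ruby sketch that applies the same condition to in-memory rows. The names BuildRow, STALE_AFTER, and stale_scheduled are hypothetical. The real worker runs this as SQL against ci_builds, which is where the partial index matters.

```ruby
# Hypothetical in-memory stand-in for rows of ci_builds.
BuildRow = Struct.new(:id, :status, :scheduled_at)

# One day in seconds, matching the '1 day ago' cutoff in the query above.
STALE_AFTER = 24 * 60 * 60

# Returns the rows whose scheduled_at is set and older than the cutoff.
# Rows with a nil scheduled_at are skipped, which is the same set of rows
# that the partial index (WHERE scheduled_at IS NOT NULL) leaves out.
def stale_scheduled(rows, now: Time.now)
  cutoff = now - STALE_AFTER
  rows.select { |row| row.scheduled_at && row.scheduled_at < cutoff }
end
```

Because scheduled_at is cleared when a scheduled build finishes, most rows have no value in that column, so the partial index stays small compared to the table.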

Feature flag

This feature is behind the feature flag ci_enable_scheduled_build, so that if something is wrong with this implementation, we can minimize the impact by disabling the feature flag.

When ci_enable_scheduled_build is disabled, delayed jobs will not be created even if .gitlab-ci.yml has when: delayed. Instead, such a job is simply translated to a manual job.
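
The fallback described above can be sketched as follows. This is a minimal illustration, not the code in this MR: the helper name effective_when and the flag_enabled parameter are hypothetical, and the job is modelled as a plain Hash.

```ruby
# Hypothetical helper: returns the `when` value a job effectively gets.
# flag_enabled stands in for the state of ci_enable_scheduled_build.
def effective_when(job_config, flag_enabled:)
  requested = job_config[:when]
  # With the flag disabled, a delayed job is treated as a manual job.
  return "manual" if requested == "delayed" && !flag_enabled

  requested
end

job = { script: "date", when: "delayed", start_in: "3 minutes" }
effective_when(job, flag_enabled: true)  # => "delayed"
effective_when(job, flag_enabled: false) # => "manual"
```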

Here is how to manipulate the feature flag:

Feature.enabled?('ci_enable_scheduled_build') # Check if the feature is enabled
Feature.enable('ci_enable_scheduled_build') # Enable this feature
Feature.disable('ci_enable_scheduled_build') # Disable this feature

This feature will be evaluated in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5223. Once we have made sure it's fully functional, we're going to remove the feature flag in https://gitlab.com/gitlab-org/gitlab-ce/issues/52183.

BE TODO

  • Add a new status scheduled to ci_builds.status
  • Respect allow_failure: true/false; however, the scheduled status should block the pipeline
  • Retry shouldn't reschedule
  • Play immediately endpoint
  • Ping DB team to review
  • Write Unit tests
  • Write Integration tests
  • Feature flag

FE TODO

BE+FE TODO

  • Write Feature/Acceptance tests

What are the relevant issue numbers?

Close https://gitlab.com/gitlab-org/gitlab-ce/issues/51352

Sample gitlab-ci.yml

# This job starts running 3 minutes after it is scheduled
job:
  script: date
  when: delayed
  start_in: 3 minutes

Does this MR meet the acceptance criteria?

Edited by Shinya Maeda