Delayed jobs
What does this MR do?
This MR implements a new feature: delayed jobs.
- CE MR: here
- EE MR: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/7761
Basic concept
- When when: delayed and start_in: X sec/min/hour are specified for a job in .gitlab-ci.yml, the job starts running after X sec/min/hour instead of running immediately.
- A delayed job can be unscheduled via the job's unschedule button. "Unschedule" means that the delayed job will not be executed automatically in the future; users can still play it manually.
- A delayed job can be canceled via the pipeline cancel button or the job cancel button. "Cancel" means that the delayed job will not be executed automatically in the future; users can still retry it manually.
- A delayed job can be triggered immediately via a button on the job index page.
- The timer of a delayed job starts ticking right after the previous stage has finished.
- A delayed job blocks the current stage by default. For example, if a job is delayed to run 1 hour later, the current stage is blocked for 1 hour.
- A manual job blocks a stage only when allow_failure: false is set, whereas a delayed job blocks the stage regardless of the allow_failure value.
- A delayed job is delayed only once. For example, if a delayed job is canceled or unscheduled, users can only retry or play it. In this case, the job is fired immediately.
- UI/UX requirements are described in the issue description
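The timing rule above (the timer starts when the previous stage finishes) can be sketched in plain Ruby. This is only an illustration, not the parser used in this MR: start_in_seconds, scheduled_at, and UNIT_SECONDS are hypothetical names, and only the sec/min/hour units mentioned above are handled.

```ruby
# Hypothetical lookup table for the units mentioned above (sec/min/hour).
UNIT_SECONDS = {
  'sec' => 1, 'second' => 1,
  'min' => 60, 'minute' => 60,
  'hour' => 3600
}.freeze

# Hypothetical helper: converts a start_in value such as "3 minutes" to seconds.
def start_in_seconds(start_in)
  match = start_in.strip.match(/\A(\d+)\s*(sec|second|min|minute|hour)s?\z/i)
  raise ArgumentError, "unsupported start_in: #{start_in.inspect}" unless match

  match[1].to_i * UNIT_SECONDS.fetch(match[2].downcase)
end

# The delay is counted from the moment the previous stage finished,
# not from the moment the pipeline was created.
def scheduled_at(previous_stage_finished_at, start_in)
  previous_stage_finished_at + start_in_seconds(start_in)
end
```

For example, scheduled_at(Time.utc(2018, 9, 24, 10, 0, 0), '3 minutes') returns a Time three minutes after the given timestamp.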
State transition of scheduled
Today we have 8 core statuses for ci_builds.status: created, pending, running, success, failed, canceled, skipped, and manual.
This MR adds a new status, scheduled, to ci_builds.status. The state transitions as follows:
- Every job's status starts from created.
- When a job is scheduled to run in the future, the created status transitions to scheduled via ProcessPipelineService. BuildScheduleWorker (a Sidekiq worker) is scheduled to run at the right time.
- When the right time has come for a scheduled job, the scheduled status transitions to pending via RunScheduledBuildService. During this process, BuildScheduleWorker first checks whether the scheduled job is still playable.
- When a user plays the scheduled job immediately, the scheduled status transitions to pending via PlayBuildService.
- When a scheduled job is unscheduled while in the scheduled status, the scheduled status transitions to manual. In this case, BuildScheduleWorker (the scheduled Sidekiq job) will not proceed to RunScheduledBuildService.
- When a scheduled job is unscheduled while in a non-scheduled status, the system raises an exception.
- When a scheduled job is canceled while in the scheduled status, the scheduled status transitions to canceled. In this case, BuildScheduleWorker (the scheduled Sidekiq job) will not proceed to RunScheduledBuildService.
- The scheduled state transition is irreversible. A scheduled job can transition to pending; however, no status except created can transition back to scheduled.
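The transitions listed above can be summarized as a small lookup table. The following is a minimal plain-Ruby sketch, not the state machine definition used in this MR: the event names, SCHEDULED_TRANSITIONS, InvalidTransition, and transition are hypothetical, and only transitions that involve the scheduled status are modeled.

```ruby
# Transitions that involve the scheduled status, as listed above.
# Event names are hypothetical labels chosen for this sketch.
SCHEDULED_TRANSITIONS = {
  schedule:      { from: ['created'],   to: 'scheduled' }, # via ProcessPipelineService
  run_scheduled: { from: ['scheduled'], to: 'pending' },   # via RunScheduledBuildService
  play:          { from: ['scheduled'], to: 'pending' },   # via PlayBuildService
  unschedule:    { from: ['scheduled'], to: 'manual' },
  cancel:        { from: ['scheduled'], to: 'canceled' }
}.freeze

class InvalidTransition < StandardError; end

# Returns the new status, or raises if the event is not allowed
# from the current status (e.g. unscheduling a non-scheduled job).
def transition(status, event)
  rule = SCHEDULED_TRANSITIONS.fetch(event)
  unless rule[:from].include?(status)
    raise InvalidTransition, "cannot #{event} a job in #{status} status"
  end

  rule[:to]
end
```

Because created is the only status listed under schedule, the table also captures the irreversibility rule: transition('manual', :schedule) raises instead of returning scheduled.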
Concerns
- Pipeline/Build status is tightly coupled with BE (e.g. Gitlab::Ci::Status::Build::Scheduled is directly reflected in the frontend components). Can we create a dynamic component for a specific state (i.e. scheduled jobs)? => We'll follow up
- What's the compound status of a stage? (e.g. Job A: running, Job B: pending, Job C: scheduled => What is shown on pipeline-mini-graph?)
- What if sidekiq-jobs are lost? https://gitlab.com/gitlab-org/gitlab-ce/issues/36791. Do we just leave them, or do we introduce a clean-up worker? => We clean up stale schedules in StuckCiJobsWorker
Performance implication
In this MR, we add a scheduled_at column to the ci_builds table. This column is UPDATEd when the build is scheduled (to set the date) and again when the scheduled build finishes (to clear the date). Both writes happen in the same query as the status transition (e.g. UPDATE ci_builds SET status = 'scheduled' WHERE id = 100), so no additional queries are executed in the life cycle (e.g. UPDATE ci_builds SET status = 'scheduled', scheduled_at = '2018-09-24 10:06:19.385977' WHERE id = 100).
However, due to the Sidekiq reliability problem, we can't guarantee that every scheduled job will be executed. A few jobs might get stuck in the scheduled state if the corresponding BuildScheduleWorker job in the queue is lost because of a SIGKILL.
To rescue those potential orphans, we're going to add a cleanup phase for stale scheduled jobs. This operation is included in StuckCiJobsWorker, as it's meant to handle stale pending/running builds. In order to find stale scheduled builds, the worker executes SELECT * FROM ci_builds WHERE scheduled_at IS NOT NULL AND scheduled_at < '1 day ago'. Given that ci_builds is a very big table (at the moment, it contains over 100 million rows), we add a partial index on the (scheduled_at, id) columns where scheduled_at IS NOT NULL. This makes the operation much faster, because it uses an Index Scan as the first step, and the expensive date comparison is applied only to a small subset of rows.
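The staleness condition can be illustrated with an in-memory sketch in plain Ruby. It is not the query code of StuckCiJobsWorker: Build, ONE_DAY, and stale_scheduled_builds are hypothetical names, and the Struct only stands in for a ci_builds row.

```ruby
# Hypothetical in-memory stand-in for a ci_builds row.
Build = Struct.new(:id, :status, :scheduled_at)

ONE_DAY = 24 * 60 * 60 # seconds

# Mirrors the condition described above:
# scheduled_at IS NOT NULL AND scheduled_at < (now - 1 day).
def stale_scheduled_builds(builds, now: Time.now)
  cutoff = now - ONE_DAY
  builds.select { |build| !build.scheduled_at.nil? && build.scheduled_at < cutoff }
end
```

Rows whose scheduled_at is NULL are excluded before any date comparison happens, which is the same subset the partial index leaves out.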
Feature flag
This feature is behind the feature flag ci_enable_scheduled_build, so that if something goes wrong with this implementation, we can minimize the impact by disabling the feature flag.
When ci_enable_scheduled_build is disabled, delayed jobs will not be created even if .gitlab-ci.yml has when: delayed. Instead, such a job is simply translated to a manual job.
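The fallback described above can be sketched as follows. This is an assumption-laden illustration rather than the code in this MR: effective_when is a hypothetical helper, and flag_enabled stands in for the result of Feature.enabled?('ci_enable_scheduled_build').

```ruby
# Hypothetical helper: resolves the `when` value a job actually gets.
# flag_enabled stands in for Feature.enabled?('ci_enable_scheduled_build').
def effective_when(job_config, flag_enabled:)
  requested = job_config.fetch(:when, 'on_success')

  # With the flag disabled, a delayed job falls back to a manual job.
  return 'manual' if requested == 'delayed' && !flag_enabled

  requested
end
```

Jobs that do not use when: delayed are unaffected by the flag in this sketch.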
Here is how to manipulate the feature flag:
Feature.enabled?('ci_enable_scheduled_build') # Check if the feature is enabled
Feature.enable('ci_enable_scheduled_build') # Enable this feature
Feature.disable('ci_enable_scheduled_build') # Disable this feature
This feature will be evaluated in https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5223. Once we have made sure it's fully functional, we're going to remove the feature flag in https://gitlab.com/gitlab-org/gitlab-ce/issues/52183.
BE TODO
- Add a new status scheduled to ci_builds.status
- Respect allow_failure: true/false; however, scheduled status should block the pipeline
- Retry shouldn't reschedule
- Play immediately endpoint
- Ping DB team to review
- Write Unit tests
- Write Integration tests
- Feature flag
FE TODO
- dropdown in pipelines list
- [-] dropdown in environments list => https://gitlab.com/gitlab-org/gitlab-ce/issues/52129
- icons in pipeline graph
- tooltip in pipeline graph with remaining time (will be made dynamic in follow-up issue)
- buttons on job list
- empty state for scheduled jobs
- favicon overlay
- [-] Docs => https://gitlab.com/gitlab-org/gitlab-ce/issues/52127
BE+FE TODO
- Write Feature/Acceptance tests
What are the relevant issue numbers?
Close https://gitlab.com/gitlab-org/gitlab-ce/issues/51352
Sample gitlab-ci.yml
# This job starts running after a 3-minute delay
job:
  script: date
  when: delayed
  start_in: 3 minutes
Does this MR meet the acceptance criteria?
- Changelog entry added, if necessary
- [-] Documentation created/updated => https://gitlab.com/gitlab-org/gitlab-ce/issues/52127
- Tests added for this feature/bug
- Conforms to the code review guidelines
- Conforms to the merge request performance guidelines
- Conforms to the style guides
- Conforms to the database guides