Possibility to enforce the execution order of jobs using resource_group
Problem to solve
resource_groups from #15536 (closed) currently don't give any guarantee about the order in which jobs work with the same resource_group. The only guarantee is that they can't work on it at the same time. It's first-come-first-served, allowing jobs from older pipelines to run after newer ones have finished.
Alternative solution
We provided an alternative solution for this issue in https://docs.gitlab.com/ee/ci/yaml/#pipeline-level-concurrency-control-with-cross-projectparent-child-pipelines
Intended users
Further details
Only allowing forward deployments, as intended with #25276 (closed), might not be suitable for all possible use cases.
E.g. when only or except is used, two pipelines might want to run different jobs against the same resource_group depending on what was changed.
Or, when some jobs are only present in the older pipeline, they would be skipped and their work would be missed.
See #15536 (comment 281839557) for more detailed use cases.
Proposal
Process Flow (Today)
- When a job is about to be pending or has finished:
  - The system picks the first job from the list of waiting_for_resource status jobs and attempts to allocate a resource for it.
    - If the allocation succeeded, the job transitions to pending.
    - If the allocation failed, the job stays in waiting_for_resource status.
Process Flow (Tomorrow)
- When a job is about to be pending or has finished:
  - The system picks the first job from the list of waiting_for_resource or created status jobs and attempts to allocate a resource for it.
    - If the allocation succeeded, the job transitions to pending.
    - If the allocation failed, the job stays in waiting_for_resource or created status.
    - If the picked job is in created status, the allocation definitely fails.
  - The process order can be specified per resource group.
    - If the process order is asc_by_pipeline_id, the list is sorted by pipeline ID in ascending order.
    - If the process order is desc_by_pipeline_id, the list is sorted by pipeline ID in descending order.
  - The process scope can be specified per resource group.
    - If the process scope is waiting_for_resource, the list includes waiting_for_resource status jobs (i.e. the current behavior).
    - If the process scope is created_or_waiting_for_resource, the list includes created or waiting_for_resource status jobs.
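The allocation step described in the process flow can be sketched as follows. This is only an illustration under stated assumptions, not GitLab's implementation: the Job class and the pick_candidate and try_allocate helpers are hypothetical names, and a single boolean stands in for whether the resource is free.

```python
from dataclasses import dataclass

# Job statuses that each process scope puts on the candidate list.
SCOPES = {
    "waiting_for_resource": {"waiting_for_resource"},
    "created_or_waiting_for_resource": {"created", "waiting_for_resource"},
}
ORDERS = ("asc_by_pipeline_id", "desc_by_pipeline_id")


@dataclass
class Job:
    """Hypothetical stand-in for a CI job that uses the resource group."""
    name: str
    pipeline_id: int
    status: str


def pick_candidate(jobs, process_scope, process_order):
    """Return the first job of the scoped, sorted list, or None if it is empty."""
    if process_scope not in SCOPES:
        raise ValueError(f"unknown process scope: {process_scope}")
    if process_order not in ORDERS:
        raise ValueError(f"unknown process order: {process_order}")
    in_scope = [job for job in jobs if job.status in SCOPES[process_scope]]
    in_scope.sort(
        key=lambda job: job.pipeline_id,
        reverse=(process_order == "desc_by_pipeline_id"),
    )
    return in_scope[0] if in_scope else None


def try_allocate(jobs, resource_is_free, process_scope, process_order):
    """Attempt one allocation; return the job that became pending, or None."""
    candidate = pick_candidate(jobs, process_scope, process_order)
    if candidate is None or not resource_is_free:
        return None
    if candidate.status == "created":
        # A created job is not ready to run, so the allocation always fails
        # and every job behind it keeps waiting.
        return None
    candidate.status = "pending"
    return candidate
```

With deploy 1 still in created status and deploy 2 in waiting_for_resource status, try_allocate returns None under created_or_waiting_for_resource with asc_by_pipeline_id, because deploy 1 heads the list. Under the waiting_for_resource scope, deploy 2 is picked and transitions to pending.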
Examples
Example 1: Simple pipeline
Given that a user created three pipelines at once:
Pipeline 3: build -> test -> deploy 3
Pipeline 2: build -> test -> deploy 2
Pipeline 1: build -> test -> deploy 1
and the deploy job requires a resource from a resource group.
When the resource group property is process_scope: waiting_for_resource and process_order: asc_by_pipeline_id:
The deployment jobs are executed in a random order (because we don't know which deploy job starts running first).
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: asc_by_pipeline_id:
The deployment jobs are executed in the deploy 1 => deploy 2 => deploy 3 order.
Even if the deploy 2 job tries to run first, the system picks deploy 1 as the allocation candidate (which is not ready for execution), so the deploy 2 job stays in waiting_for_resource status.
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: desc_by_pipeline_id:
The deployment jobs are executed in the deploy 3 => deploy 2 => deploy 1 order.
Even if the deploy 2 job tries to run first, the system picks deploy 3 as the allocation candidate (which is not ready for execution), so the deploy 2 job stays in waiting_for_resource status.
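Example 1 can be replayed with a small simulation. This is a simplified sketch, not the real scheduler: the hypothetical execution_order function assumes one deploy job per pipeline and that a running deploy job finishes before the next one becomes ready. Its ready_order argument lists pipeline IDs in the order in which their deploy jobs become ready.

```python
def execution_order(ready_order, process_scope, process_order):
    """Return the pipeline IDs in the order their deploy jobs get to run."""
    # Every deploy job starts out as created (build and test are still running).
    status = {pipeline_id: "created" for pipeline_id in ready_order}
    executed = []

    def allocate_while_possible():
        while True:
            if process_scope == "waiting_for_resource":
                wanted = ("waiting_for_resource",)
            else:  # created_or_waiting_for_resource
                wanted = ("created", "waiting_for_resource")
            pool = sorted(
                (pid for pid, state in status.items() if state in wanted),
                reverse=(process_order == "desc_by_pipeline_id"),
            )
            # Stop when nothing is listed or the head of the list is not ready.
            if not pool or status[pool[0]] == "created":
                return
            status[pool[0]] = "finished"
            executed.append(pool[0])

    for pipeline_id in ready_order:
        status[pipeline_id] = "waiting_for_resource"
        allocate_while_possible()
    return executed
```

If the deploy jobs become ready in the order 2, 3, 1, the waiting_for_resource scope runs them in that same arrival order. The created_or_waiting_for_resource scope yields 1, 2, 3 with asc_by_pipeline_id and 3, 2, 1 with desc_by_pipeline_id.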
Example 2: The same resource group is used in multiple jobs in the same pipeline
Given that a user created three pipelines at once:
Pipeline 3: build -> test -> deploy 5,6
Pipeline 2: build -> test -> deploy 3,4
Pipeline 1: build -> test -> deploy 1,2
and the deploy job requires a resource from a resource group.
When the resource group property is process_scope: waiting_for_resource and process_order: asc_by_pipeline_id:
The deployment jobs are executed in a random order.
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: asc_by_pipeline_id:
The deployment jobs are executed in the following order:
- deploy 1 => deploy 2 OR deploy 2 => deploy 1 (The inner order is not guaranteed)
- deploy 3 => deploy 4 OR deploy 4 => deploy 3 (The inner order is not guaranteed)
- deploy 5 => deploy 6 OR deploy 6 => deploy 5 (The inner order is not guaranteed)
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: desc_by_pipeline_id:
The deployment jobs are executed in the following order:
- deploy 5 => deploy 6 OR deploy 6 => deploy 5 (The inner order is not guaranteed)
- deploy 3 => deploy 4 OR deploy 4 => deploy 3 (The inner order is not guaranteed)
- deploy 1 => deploy 2 OR deploy 2 => deploy 1 (The inner order is not guaranteed)
Example 3: Similar to Example 2 but deploy jobs are separate across different stages
Given that a user created three pipelines at once:
Pipeline 3: deploy 5 -> build -> test -> deploy 6
Pipeline 2: deploy 3 -> build -> test -> deploy 4
Pipeline 1: deploy 1 -> build -> test -> deploy 2
and the deploy job requires a resource from a resource group.
When the resource group property is process_scope: waiting_for_resource and process_order: asc_by_pipeline_id:
The deployment jobs are executed in a random order.
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: asc_by_pipeline_id:
The deployment jobs are executed in the following order:
- deploy 1 => deploy 2
- deploy 3 => deploy 4
- deploy 5 => deploy 6
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: desc_by_pipeline_id:
The deployment jobs are executed in the following order:
- deploy 5 => deploy 6
- deploy 3 => deploy 4
- deploy 1 => deploy 2
Interface
Users can change the properties of the resource group via the public v4 API. We will likely introduce the following path:
- PUT projects/:project_id/resource_groups/:key
  - options: process_mode is one of unordered, oldest_first or newest_first
- The default process mode is unordered.
(UI or .gitlab-ci.yml support is a future iteration)
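A client call against the proposed path might look like the sketch below. It only builds the request and does not send it. The /api/v4 prefix, the JSON body, and the PRIVATE-TOKEN header are assumptions taken from general GitLab API conventions, since the proposal only says the path is likely. build_update_request is a hypothetical helper name.

```python
import json
import urllib.parse
import urllib.request

# Process modes named in the proposal.
PROCESS_MODES = ("unordered", "oldest_first", "newest_first")


def build_update_request(base_url, project_id, key, process_mode, token):
    """Build (but do not send) a PUT request that sets the process mode."""
    if process_mode not in PROCESS_MODES:
        raise ValueError(f"process_mode must be one of {PROCESS_MODES}")
    # Resource group keys may contain characters that need URL encoding.
    encoded_key = urllib.parse.quote(key, safe="")
    url = f"{base_url}/api/v4/projects/{project_id}/resource_groups/{encoded_key}"
    body = json.dumps({"process_mode": process_mode}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "PRIVATE-TOKEN": token},
    )
```

Passing the returned object to urllib.request.urlopen would send it. The host, project ID, and token are placeholders to be supplied by the caller.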
Previous proposal
Proposal
Introducing an option in the .gitlab-ci.yml to configure when the lock for a resource_group is obtained. Obtaining it means that a waiting queue is maintained for each resource_group, which determines who is allowed to work with it next (a queueing mechanism should already exist for resource_group as part of the semaphore primitive implemented).
Currently the lock is obtained and waited for right before the job starts, which ensures that no jobs run at the same time for the same resource. However, the moment at which a job is ready to start (and therefore requests a lock) depends on the non-deterministic runtimes of previous jobs and the concurrent scheduling of jobs in general.
An alternative option that this feature should introduce is to obtain the lock when the job is created (which happens when the pipeline is created). This means that jobs wanting to work on the same resource_group are forced to work in the order in which they were created. This ensures that older jobs always run before newer jobs for one particular resource_group. The downside is that jobs might have to wait longer for a resource to become free in some situations. Jobs for other resource_groups, or jobs without one, should not be affected by this and will still benefit from concurrent execution.
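The difference between the two lock times can be illustrated with a toy waiting queue. This is a sketch under simplifying assumptions, not how GitLab implements its semaphore: run_order is a hypothetical function, all jobs target the same resource_group, and a job finishes before the next one becomes ready.

```python
from collections import deque


def run_order(created, ready, lock_at):
    """Return job names in the order they get to use the resource.

    created: job names in creation order.
    ready:   the same names in the order they become ready to start.
    lock_at: "create" or "start", i.e. when a job joins the waiting queue.
    """
    if lock_at not in ("create", "start"):
        raise ValueError(f"unknown lock_at value: {lock_at}")
    queue = deque()
    if lock_at == "create":
        # Every job takes its place in the queue as soon as it exists.
        queue.extend(created)
    is_ready = set()
    order = []
    for job in ready:
        is_ready.add(job)
        if lock_at == "start":
            # The job only joins the queue once it is about to start.
            queue.append(job)
        # The head of the queue runs as soon as it is ready to start.
        while queue and queue[0] in is_ready:
            order.append(queue.popleft())
    return order
```

For jobs created as job 1, job 2, job 3 that become ready as job 3, job 1, job 2, locking at create gives job 1, job 2, job 3. Locking at start follows the arrival order job 3, job 1, job 2. The create variant also shows the downside mentioned above: job 3 is ready first but waits for both older jobs.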
Sample Configuration
Assumption: resource_group can optionally be defined as a dictionary, using a property named resource to represent the string value from the non-dictionary mode.
This feature shall add a new property named lock_at (or obtain_at, obtain_lock_at, *_on, etc.) with the two possible values create and start.
The default value for this setting should, in my opinion, be create.
It should result in fewer issues due to race conditions and wrongly ordered deployments for users who are unaware of possible problems.
But this is debatable, as the current behavior for resource_group corresponds to start.
I can understand that you'd not want to change it now that resource_group is released.
    # pipeline-level
    resource_group:
      resource: "Resource 1"
      lock_at: create
    job_a:
      script: "echo a"
      resource_group:
        resource: 'Resource 2'
        lock_at: create
    job_b:
      script: "echo b"
      resource_group:
        resource: 'Resource 2'
        lock_at: start
        # I don't know if mixing this setting for the same resource is useful for anything, but we shouldn't restrict it.
    job_c:
      script: "echo c"
      resource_group:
        resource: 'Resource 3'
        lock_at: start
        # But I can think of situations where mixing it in the same pipeline for different resources might be useful.
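Accepting both the existing string form and the proposed dictionary form could be handled by normalizing the value once it is parsed. The sketch below is hypothetical: normalize_resource_group is not an existing GitLab function, lock_at is just one of the candidate property names, and the create default follows the preference stated in this proposal rather than a settled decision.

```python
ALLOWED_LOCK_AT = ("create", "start")


def normalize_resource_group(value, default_lock_at="create"):
    """Turn a parsed resource_group value into {"resource": ..., "lock_at": ...}."""
    if isinstance(value, str):
        # Existing non-dictionary mode: the string is the resource name.
        resource, lock_at = value, default_lock_at
    elif isinstance(value, dict):
        if "resource" not in value:
            raise ValueError("dictionary form needs a 'resource' entry")
        resource = value["resource"]
        lock_at = value.get("lock_at", default_lock_at)
    else:
        raise TypeError("resource_group must be a string or a dictionary")
    if lock_at not in ALLOWED_LOCK_AT:
        raise ValueError(f"lock_at must be one of {ALLOWED_LOCK_AT}")
    return {"resource": resource, "lock_at": lock_at}
```

Passing default_lock_at="start" would keep the behavior that resource_group has today for configurations that do not set the new property.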
A create lock for jobs should happen after the create lock on the pipeline level. Pipeline start can only happen after the creation of the pipeline and all the jobs for it. A job at the first stage that has a start value should lock the resource directly after the start lock of the pipeline.
Documentation
Extend the documentation for resource_group here: GitLab CI/CD Pipeline Configuration Reference
Availability & Testing
What risks does this change pose to our availability? It might make pipeline jobs wait longer for resources to become available, if the default is changed or the new option is used manually.
How might it affect the quality of the product?
- It should improve the reliability of concurrent pipelines that access shared resource_groups.
- Users that define their CI/CD Pipeline Configuration have a new option to fine-tune it for their use case.
What additional test coverage or changes to tests will be needed?
It should be tested that no deadlocks are introduced by this feature. If I'm not mistaken, there shouldn't be any for this change in theory. But depending on how it's implemented and how exactly GitLab works in the background, deadlocks might still be possible.
Will it require cross-browser testing?
No
What does success look like, and how can we measure that?
Success
- Users are using the new feature and have fast concurrent pipelines with deployments they can rely on.
- Users not aware of the possible risks of execution order should have a safer default when using resource_group or environment (which will use resource_group implicitly in the future, and should be added by #199048).
Measure
Count how often the new feature is used in the .gitlab-ci.yml files of public projects.
What is the type of buyer?
Core/Free, because it's a fundamental primitive required in conjunction with using resource_group. It's extending a feature that is already in Core/Free.