Possibility to enforce the execution order of jobs using resource_group
Problem to solve
resource_groups from #15536 (closed) currently give no guarantee about the order in which jobs work with the same resource_group. The only guarantee is that they can't work on it at the same time. It's first-come-first-served, which allows jobs from older pipelines to run after newer ones have finished.
Alternative solution
We provided an alternative solution for this issue in https://docs.gitlab.com/ee/ci/yaml/#pipeline-level-concurrency-control-with-cross-projectparent-child-pipelines
Intended users
Further details
Only allowing forward deployments, as intended with #25276 (closed), might not be suitable for all use cases.
E.g. when only or except is used, two pipelines might run different jobs against the same resource_group depending on what was changed.
Or, when some jobs are only present in the older pipeline, they would otherwise be skipped and missed.
See #15536 (comment 281839557) for more detailed use cases.
Proposal
Process Flow (Today)
- When a job is about to be pending or has finished:
  - The system picks the first job from the list of waiting_for_resource status jobs and attempts to allocate a resource for it.
    - If the allocation succeeded, the job transitions to pending.
    - If the allocation failed, the job stays in waiting_for_resource status.
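The current flow can be sketched in Python as follows. This is a minimal illustration, not GitLab's actual implementation: the Job class, the process_today function, and the resource_free flag are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    pipeline_id: int
    status: str  # e.g. "created", "waiting_for_resource", "pending"


def process_today(jobs: List[Job], resource_free: bool) -> Optional[Job]:
    """Pick the first waiting job and try to allocate the resource to it.

    Returns the job that moved to "pending", or None if nothing changed.
    """
    waiting = [j for j in jobs if j.status == "waiting_for_resource"]
    if not waiting:
        return None
    candidate = waiting[0]  # first-come-first-served, pipeline ID is ignored
    if resource_free:
        candidate.status = "pending"  # allocation succeeded
        return candidate
    return None  # allocation failed, the job stays waiting_for_resource
```

Because only jobs that are already waiting are considered, a job from a newer pipeline that becomes ready first is also served first.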
Process Flow (Tomorrow)
- When a job is about to be pending or has finished:
  - The system picks the first job from the list of waiting_for_resource or created status jobs and attempts to allocate a resource for it.
    - If the allocation succeeded, the job transitions to pending.
    - If the allocation failed, the job stays in waiting_for_resource or created status.
    - If the picked job is in created status, the allocation definitely fails.
  - The process order can be specified per resource group
    - If the process order is asc_by_pipeline_id, the list is sorted by pipeline ID in ascending order.
    - If the process order is desc_by_pipeline_id, the list is sorted by pipeline ID in descending order.
  - The process scope can be specified per resource group
    - If the process scope is waiting_for_resource, the list includes waiting_for_resource status jobs (i.e. the current behavior).
    - If the process scope is created_or_waiting_for_resource, the list includes created or waiting_for_resource status jobs.
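The proposed selection rules could look roughly like this in Python. It is a sketch under stated assumptions: Job, pick_candidate, try_allocate, and the resource_free flag are hypothetical names, and only the process_scope and process_order values come from the proposal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    name: str
    pipeline_id: int
    status: str  # e.g. "created", "waiting_for_resource", "pending"


def pick_candidate(jobs: List[Job], process_scope: str, process_order: str) -> Optional[Job]:
    """Return the first job of the scoped, sorted list (or None if it is empty)."""
    if process_scope == "waiting_for_resource":
        statuses = {"waiting_for_resource"}
    elif process_scope == "created_or_waiting_for_resource":
        statuses = {"created", "waiting_for_resource"}
    else:
        raise ValueError(f"unknown process_scope: {process_scope}")

    if process_order not in ("asc_by_pipeline_id", "desc_by_pipeline_id"):
        raise ValueError(f"unknown process_order: {process_order}")
    descending = process_order == "desc_by_pipeline_id"

    candidates = [j for j in jobs if j.status in statuses]
    candidates.sort(key=lambda j: j.pipeline_id, reverse=descending)
    return candidates[0] if candidates else None


def try_allocate(jobs: List[Job], process_scope: str, process_order: str,
                 resource_free: bool) -> Optional[Job]:
    """Attempt one allocation; return the job that became pending, if any."""
    candidate = pick_candidate(jobs, process_scope, process_order)
    if candidate is None or not resource_free:
        return None
    if candidate.status == "created":
        # The picked job is not ready yet, so the allocation fails and
        # every job behind it in the list keeps waiting.
        return None
    candidate.status = "pending"
    return candidate
```

The design choice worth noting is that a created job at the head of the list blocks the queue on purpose: that is what enforces the order.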
Examples
Example 1: Simple pipeline
Given that user created three pipelines at once:
Pipeline 3: build -> test -> deploy 3
Pipeline 2: build -> test -> deploy 2
Pipeline 1: build -> test -> deploy 1
and the deploy job requires a resource from a resource group.
When the resource group property is process_scope: waiting_for_resource and process_order: asc_by_pipeline_id.
The deployment jobs are executed in a random order (because we don't know which deploy job becomes ready to run first).
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: asc_by_pipeline_id.
The deployment jobs are executed in the deploy 1 => deploy 2 => deploy 3 order.
Even if the deploy 2 job tries to run first, the system picks deploy 1 as the allocation candidate (which is not ready for execution yet),
so the deploy 2 job stays in waiting_for_resource status.
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: desc_by_pipeline_id.
The deployment jobs are executed in the deploy 3 => deploy 2 => deploy 1 order.
Even if the deploy 2 job tries to run first, the system picks deploy 3 as the allocation candidate (which is not ready for execution yet),
so the deploy 2 job stays in waiting_for_resource status.
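The ordering in Example 1 with process_scope: created_or_waiting_for_resource can be reproduced with a small simulation. The simulate function below is hypothetical and simplified: it assumes one deploy job per pipeline, a single resource, and that a job which gets the resource finishes before the next job becomes ready.

```python
from typing import Dict, List


def simulate(ready_order: List[int], descending: bool = False) -> List[int]:
    """Return the pipeline IDs in the order their deploy jobs run.

    ready_order lists pipeline IDs in the order in which their deploy job
    becomes ready (moves from created to waiting_for_resource).
    """
    status: Dict[int, str] = {pid: "created" for pid in ready_order}
    executed: List[int] = []

    def drain() -> None:
        # Keep picking the first unfinished job in pipeline order. Stop as
        # soon as that job is still "created": its allocation fails.
        while True:
            remaining = sorted(
                (pid for pid, s in status.items() if s != "done"),
                reverse=descending,
            )
            if not remaining or status[remaining[0]] == "created":
                return
            status[remaining[0]] = "done"  # job ran and released the resource
            executed.append(remaining[0])

    for pid in ready_order:
        status[pid] = "waiting_for_resource"
        drain()
    return executed


# deploy 2 becomes ready first, then deploy 3, then deploy 1
print(simulate([2, 3, 1]))                   # asc_by_pipeline_id
print(simulate([2, 3, 1], descending=True))  # desc_by_pipeline_id
```

With ascending order, nothing runs until deploy 1 is ready; with descending order, deploy 3 and deploy 2 run as soon as deploy 3 is ready, and deploy 1 runs last.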
Example 2: The same resource group is used in multiple jobs in the same pipeline
Given that user created three pipelines at once:
Pipeline 3: build -> test -> deploy 5,6
Pipeline 2: build -> test -> deploy 3,4
Pipeline 1: build -> test -> deploy 1,2
and each deploy job requires a resource from the same resource group.
When the resource group property is process_scope: waiting_for_resource and process_order: asc_by_pipeline_id.
The deployment jobs are executed in a random order.
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: asc_by_pipeline_id.
The deployment jobs are executed in the following order:
- deploy 1 => deploy 2 OR deploy 2 => deploy 1 (The inner order is not guaranteed)
- deploy 3 => deploy 4 OR deploy 4 => deploy 3 (The inner order is not guaranteed)
- deploy 5 => deploy 6 OR deploy 6 => deploy 5 (The inner order is not guaranteed)
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: desc_by_pipeline_id.
The deployment jobs are executed in the following order:
- deploy 5 => deploy 6 OR deploy 6 => deploy 5 (The inner order is not guaranteed)
- deploy 3 => deploy 4 OR deploy 4 => deploy 3 (The inner order is not guaranteed)
- deploy 1 => deploy 2 OR deploy 2 => deploy 1 (The inner order is not guaranteed)
Example 3: Similar to Example 2 but deploy jobs are separate across different stages
Given that user created three pipelines at once:
Pipeline 3: deploy 5 -> build -> test -> deploy 6
Pipeline 2: deploy 3 -> build -> test -> deploy 4
Pipeline 1: deploy 1 -> build -> test -> deploy 2
and each deploy job requires a resource from the same resource group.
When the resource group property is process_scope: waiting_for_resource and process_order: asc_by_pipeline_id.
The deployment jobs are executed in a random order.
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: asc_by_pipeline_id.
The deployment jobs are executed in the following order:
- deploy 1 => deploy 2
- deploy 3 => deploy 4
- deploy 5 => deploy 6
When the resource group property is process_scope: created_or_waiting_for_resource and process_order: desc_by_pipeline_id.
The deployment jobs are executed in the following order:
- deploy 5 => deploy 6
- deploy 3 => deploy 4
- deploy 1 => deploy 2
Interface
Users can change the properties of the resource group via the public v4 API. We will likely introduce the following path:
- PUT projects/:project_id/resource_groups/:key
  - options:
    - process_mode is one of unordered, oldest_first or newest_first
- The default process mode is unordered.
(UI or .gitlab-ci.yml support is left for a future iteration.)
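A client call against the proposed path might be built like this. Treat it as a sketch only: the endpoint is still a proposal, and the host gitlab.example.com, the JSON request body, the token value, and the build_request helper are assumptions made for illustration. The /api/v4 prefix and the PRIVATE-TOKEN header follow GitLab's usual API conventions.

```python
import json
import urllib.parse
import urllib.request

PROCESS_MODES = ("unordered", "oldest_first", "newest_first")


def build_request(base_url: str, project_id: int, key: str,
                  process_mode: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a PUT request that sets process_mode."""
    if process_mode not in PROCESS_MODES:
        raise ValueError(f"process_mode must be one of {PROCESS_MODES}")
    # The resource group key is URL-encoded because it may contain
    # characters such as "/" or spaces.
    path = "/api/v4/projects/{}/resource_groups/{}".format(
        project_id, urllib.parse.quote(key, safe="")
    )
    body = json.dumps({"process_mode": process_mode}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + path,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json", "PRIVATE-TOKEN": token},
    )


request = build_request(
    "https://gitlab.example.com", 42, "production", "oldest_first", "<your-token>"
)
print(request.get_method(), request.full_url)
```

Passing the returned object to urllib.request.urlopen would send it; the example stops short of that so it can be read without a running GitLab instance.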
Previous proposal
Proposal
Introduce an option in .gitlab-ci.yml to configure when the lock for a resource_group is obtained. Obtaining it means that a waiting queue is maintained for each resource_group, which determines who is allowed to work with it next (a queueing mechanism should already exist for resource_group as part of the implemented semaphore primitive).
Currently the lock is requested and waited for right before the job starts, which ensures that no jobs run at the same time for the same resource. But the moment at which a job is ready to start (and therefore requests the lock) depends on the non-deterministic runtimes of previous jobs and on the concurrent scheduling of jobs in general.
An alternative option that this feature should introduce is to obtain the lock when the job is created (which happens when the pipeline is created). This means that jobs wanting to work on the same resource_group are forced to work in the order in which they were created. This ensures that older jobs always run before newer jobs for one particular resource_group. The downside is that jobs might have to wait longer for a resource to become free in some situations. Jobs for other resource_groups, or jobs without one, should not be affected and will still benefit from concurrent execution.
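The difference between the two lock times can be illustrated with a tiny FIFO queue. This is a hypothetical sketch, not the existing semaphore implementation: ResourceQueue and its methods are made-up names, and the check whether the resource is currently occupied is left out to keep the example short.

```python
from collections import deque
from typing import Deque, Optional, Set


class ResourceQueue:
    """FIFO waiting queue for a single resource_group."""

    def __init__(self) -> None:
        self._queue: Deque[str] = deque()

    def enqueue(self, job: str) -> None:
        """Register a job's lock request (at create time or at start time)."""
        self._queue.append(job)

    def next_to_run(self, ready: Set[str]) -> Optional[str]:
        """Run the head of the queue if it is ready; otherwise nobody runs."""
        if self._queue and self._queue[0] in ready:
            return self._queue.popleft()
        return None


# Lock at create: both jobs enqueue in creation order.
at_create = ResourceQueue()
at_create.enqueue("deploy 1")
at_create.enqueue("deploy 2")
# deploy 2 is ready first, but deploy 1 is ahead of it in the queue.
print(at_create.next_to_run({"deploy 2"}))

# Lock at start: a job enqueues only once it is ready to start.
at_start = ResourceQueue()
at_start.enqueue("deploy 2")
print(at_start.next_to_run({"deploy 2"}))
```

In the first case nothing runs until deploy 1 is ready, which is the extra waiting time mentioned above; in the second case deploy 2 overtakes deploy 1.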
Sample Configuration
Assumption: resource_group can optionally be defined as a dictionary, using a property named resource to represent the string value from the non-dictionary mode.
This feature shall add a new property named lock_at (or obtain_at, obtain_lock_at, *_on, etc.) with the two possible values create and start.
The default value for this setting should, in my opinion, be create.
It should result in fewer issues due to race conditions and wrongly ordered deployments for users who are unaware of the possible problems.
But this is debatable; the current behavior of resource_group corresponds to start.
I can understand that you'd not want to change it now that resource_group is released.
# pipeline-level
resource_group:
  resource: "Resource 1"
  lock_at: create

job_a:
  script: "echo a"
  resource_group:
    resource: 'Resource 2'
    lock_at: create

job_b:
  script: "echo b"
  resource_group:
    resource: 'Resource 2'
    lock_at: start
# I don't know if mixing this setting for the same resource is useful for anything, but we shouldn't restrict it.

job_c:
  script: "echo c"
  resource_group:
    resource: 'Resource 3'
    lock_at: start
# But I can think of situations where mixing it in the same pipeline for different resources might be useful.
A create lock for jobs should happen after the create lock on the pipeline-level. Pipeline start can only happen after the creation of the pipeline and all the jobs for it. A job at the first stage that has a start value should lock the resource directly after the start lock of the pipeline.
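Accepting both the plain string and the dictionary form of resource_group could be handled by a small normalization step. The following is a sketch only: normalize_resource_group is a hypothetical helper, and the default is passed as a parameter because the proposal leaves open whether it should be create or start.

```python
from typing import Dict, Union

LOCK_AT_VALUES = ("create", "start")


def normalize_resource_group(
    value: Union[str, Dict[str, str]],
    default_lock_at: str = "create",  # the proposed default, still debatable
) -> Dict[str, str]:
    """Turn either accepted form into {"resource": ..., "lock_at": ...}."""
    if isinstance(value, str):
        # Non-dictionary mode: the string is the resource name.
        return {"resource": value, "lock_at": default_lock_at}
    if not isinstance(value, dict) or "resource" not in value:
        raise ValueError("resource_group must be a string or a dict with 'resource'")
    lock_at = value.get("lock_at", default_lock_at)
    if lock_at not in LOCK_AT_VALUES:
        raise ValueError(f"lock_at must be one of {LOCK_AT_VALUES}")
    return {"resource": value["resource"], "lock_at": lock_at}


print(normalize_resource_group("Resource 1"))
print(normalize_resource_group({"resource": "Resource 2", "lock_at": "start"}))
```

Calling it with default_lock_at="start" would keep the behavior that resource_group has today for configurations that do not set lock_at.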
Documentation
Extend the documentation for resource_group here: GitLab CI/CD Pipeline Configuration Reference
Availability & Testing
What risks does this change pose to our availability? It might make pipeline jobs wait longer for resources to become available, if the default is changed or the new option is used manually.
How might it affect the quality of the product?
- It should improve the reliability of concurrent pipelines that access shared resource_groups
- Users that define their CI/CD Pipeline Configuration have a new option to fine-tune it for their use-case.
What additional test coverage or changes to tests will be needed?
It should be tested that no deadlocks are introduced by this feature. If I'm not mistaken, there shouldn't be any for this change in theory. But depending on how it's implemented and how exactly GitLab works in the background, this might be possible?
Will it require cross-browser testing?
No
What does success look like, and how can we measure that?
Success
- Users are using the new feature and have fast concurrent pipelines with deployments they can rely on.
- Users not aware of the possible risks of execution order should have a safer default when using resource_group or environment (which will use resource_group implicitly in the future, and should be added by #199048).
Measure
Count how often the new feature is used in .gitlab-ci.yml files of public projects.
What is the type of buyer?
Core/Free, because it's a fundamental primitive required in conjunction with resource_group. It extends a feature that is already in Core/Free.