Combine parallel:matrix with parallel:count to run multiple instances per matrix permutation (#601715)


For a TL;DR see this comment: https://gitlab.com/gitlab-org/gitlab/-/work_items/601715#note_3402863744

### Release notes

GitLab's [`parallel`](https://docs.gitlab.com/ci/yaml/#parallel) keyword runs **either** N instances of a job (`parallel: 5`) **or** one job per matrix permutation (`parallel:matrix`), never both. So a common need can't be expressed directly: "run this matrix, and split each permutation's work across several instances" (for example, sharding a test suite across providers/stacks).

This adds a `count:` multiplier alongside `matrix:`, running N instances per permutation. `parallel: { matrix: [...], count: 3 }` produces `(permutations × 3)` jobs:

```yaml
test:
  script: run_tests.sh
  parallel:
    matrix:
      - PROVIDER: [aws, gcp]
        STACK: [app, db]
    count: 3  # run 3 instances of EACH matrix permutation
```

The change is **purely additive**: pipelines not combining the two forms behave exactly as today. It also defines how the new per-instance dimension interacts with every adjacent feature: predefined variables, `extends:`, `needs:`, `dependencies:`, and `trigger` jobs.

### Problem to solve

As a [Sasha (Software Developer)](https://handbook.gitlab.com/handbook/product/personas/#sasha-software-developer) or [Simone (Software Engineer in Test)](https://handbook.gitlab.com/handbook/product/personas/#simone-software-engineer-in-test), I want multiple instances of _each_ matrix permutation, so I can split a test suite across parallel instances **within each** provider/stack combination, without polluting my matrix or generating config externally.

Today this is impossible because `parallel: <number>` and `parallel:matrix` are **mutually exclusive** (`parallel:` accepts a number or a `matrix:`, not both). Workarounds:

- **Pad the matrix** with a throwaway variable (e.g. `SHARD: [1, 2, 3]` on every row) to multiply job count. This adds a non-matrix dimension that leaks into job names and variables.
- **Generate config dynamically** via templating or a child pipeline, pushing a one-line declaration into external tooling.

This mutual exclusivity was **a deliberate, extensible design decision, not a technical barrier** (see Links). `parallel:` was built to grow sibling keys like this, and a per-permutation multiplier was sketched (as `replicas:`) and deferred when `parallel:matrix` shipped.

### User experience goal

A single `parallel:` block in `.gitlab-ci.yml` combining `matrix:` and `count:`, generating `(permutations × count)` jobs. Each job receives its matrix variables plus a per-permutation instance index for sharding work, with no external tooling or matrix padding.

### Proposal

Add a `count:` multiplier to `parallel:`. Combined with `matrix:`, each permutation runs `count` instances.

#### What you write, and what you get

```yaml
test:
  script: run_tests.sh
  parallel:
    matrix:
      - PROVIDER: [aws, gcp]
        STACK: [app, db]
    count: 3
```

**Resulting jobs.** The matrix expands to 4 permutations (`aws/app`, `aws/db`, `gcp/app`, `gcp/db`); each runs 3 instances → `4 × 3 = 12` jobs:

```
test: [aws, app] 1/3
test: [aws, app] 2/3
test: [aws, app] 3/3
test: [aws, db] 1/3
... (12 total)
```

**Job naming.** The combined name is the matrix name plus the sequential ` x/y` suffix, so `test: [aws, app] 2/3` reads as "instance 2 of 3 within the `[aws, app]` permutation." The `x/y` suffix counts instances _within a permutation_; the job's overall position across all 12 is not shown (it carries no actionable meaning).

**Variables in each job:**

- The **matrix variables** for its permutation (`PROVIDER`, `STACK`), as today.
- **`CI_NODE_TOTAL` = 12** and **`CI_NODE_INDEX` = 1..12**: total jobs and this job's position across the whole set.
- **`CI_COUNT_TOTAL` = 3** and **`CI_COUNT_INDEX` = 1..3** (new): instances per permutation, and which instance this job is _within its own permutation_. Set **only** when `count:` is combined with `matrix:`.
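The expansion described above can be sketched in a few lines of Python. This is an illustration of the proposed semantics only, not GitLab's implementation: the `expand` function and its return shape are hypothetical, and the ordering of `CI_NODE_INDEX` (permutation first, then instance) is an assumption, since the proposal only states that it runs across the whole set.

```python
from itertools import product


def expand(job_name, matrix, count=None):
    """Expand a matrix (list of {VAR: [values]} rows) into generated jobs.

    Hypothetical helper: returns one {"name", "variables"} dict per job.
    `count=None` models plain `parallel:matrix` (no instance dimension).
    """
    # One permutation per combination of values within each matrix row.
    permutations = []
    for row in matrix:
        keys = list(row)
        for values in product(*(row[k] for k in keys)):
            permutations.append(dict(zip(keys, values)))

    instances = count if count is not None else 1
    total = len(permutations) * instances

    jobs = []
    for perm in permutations:
        label = ", ".join(str(v) for v in perm.values())
        for i in range(1, instances + 1):
            name = f"{job_name}: [{label}]"
            variables = dict(perm)
            if count is not None:
                # Per-permutation suffix and the proposed CI_COUNT_* pair.
                name += f" {i}/{count}"
                variables["CI_COUNT_INDEX"] = str(i)
                variables["CI_COUNT_TOTAL"] = str(count)
            # Assumed ordering: position across all generated jobs.
            variables["CI_NODE_INDEX"] = str(len(jobs) + 1)
            variables["CI_NODE_TOTAL"] = str(total)
            jobs.append({"name": name, "variables": variables})
    return jobs


if __name__ == "__main__":
    matrix = [{"PROVIDER": ["aws", "gcp"], "STACK": ["app", "db"]}]
    for job in expand("test", matrix, count=3):
        print(job["name"], job["variables"])
```

Calling `expand` without `count` leaves names and variables in the plain-matrix shape, which is the "purely additive" property the proposal relies on.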
#### Why `CI_NODE_*` counts all jobs (and why `CI_COUNT_*` is added)

This is driven by backwards compatibility. [`CI_NODE_TOTAL`](https://docs.gitlab.com/ci/variables/predefined_variables/) is documented as "the total number of instances of this job running in parallel," and `CI_NODE_INDEX` as this job's position in that set:

| config | `CI_NODE_TOTAL` | `CI_NODE_INDEX` |
|--------|-----------------|-----------------|
| `parallel: N` | `N` | `1..N` |
| `parallel: { matrix: [...] }` | permutations | `1..permutations` |
| `parallel: { matrix, count: C }` | **perms × C** | `1..(perms × C)` |

Keeping `CI_NODE_*` as the count of _all_ generated jobs is the only option that preserves these variables' existing meaning for pipelines using `parallel: N` or `parallel:matrix` today. But that total can't tell a job "which instance am I within my permutation," which is exactly what per-permutation sharding needs.

So the feature adds **`CI_COUNT_INDEX` / `CI_COUNT_TOTAL`**, reporting only the per-permutation instance position, letting a user split work _within_ one matrix permutation:

```yaml
test:
  parallel:
    matrix:
      - PROVIDER: [aws, gcp]
    count: 3
  script:
    - bundle exec rspec_booster --job $CI_COUNT_INDEX/$CI_COUNT_TOTAL
```

The permutation itself is still identified by the matrix variables and the job name. I am open to changing `CI_NODE_*` meaning to solve this, but I'm unsure of GitLab's stance on backwards compatibility.

**Limits.** GitLab rejects a `parallel:` config creating more than **200** jobs (the documented max for both `parallel: N` and matrix permutations). The combined feature applies that same 200-job max to the **combined total** (`permutations × count`), reusing the existing "generates too many jobs" error. No new limit, no new message.
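Two pieces of arithmetic from this section can be sketched in Python: the combined-total limit check, and how a job script might use the proposed `CI_COUNT_*` pair to pick its share of work. The function names (`check_limit`, `shard`, `shard_from_env`) are hypothetical, and the round-robin split is just one possible sharding strategy, not something the proposal prescribes.

```python
PARALLEL_LIMIT = 200  # the existing cap quoted in this proposal


def check_limit(permutations, count):
    """Return permutations * count, or raise if it exceeds the cap."""
    total = permutations * count
    if total > PARALLEL_LIMIT:
        raise ValueError(f"generates too many jobs (maximum is {PARALLEL_LIMIT})")
    return total


def shard(items, index, total):
    """Return the items owned by 1-based instance `index` out of `total`.

    Items are sorted first so every instance sees the same order,
    then dealt out round-robin.
    """
    if not 1 <= index <= total:
        raise ValueError("index must be between 1 and total")
    return sorted(items)[index - 1::total]


def shard_from_env(items, env):
    """Shard using the proposed CI_COUNT_INDEX / CI_COUNT_TOTAL variables."""
    return shard(items, int(env["CI_COUNT_INDEX"]), int(env["CI_COUNT_TOTAL"]))


if __name__ == "__main__":
    specs = ["a_spec.rb", "b_spec.rb", "c_spec.rb", "d_spec.rb", "e_spec.rb"]
    env = {"CI_COUNT_INDEX": "2", "CI_COUNT_TOTAL": "3"}
    print(check_limit(4, 3))
    print(shard_from_env(specs, env))
```

In a real job, `env` would be `os.environ`; a plain dict is used here so the sketch runs outside a pipeline.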
#### Reusing configuration with `extends:`

Since `count:` is just another key in `parallel:`, a base job defining `matrix:` and a job extending it with `count:` combine into the full feature:

```yaml
.test_base:
  script: run_tests.sh
  parallel:
    matrix:
      - PROVIDER: [aws, gcp]
        STACK: [app, db]

rspec:
  extends: .test_base
  parallel:
    count: 3
```

After inheritance, `rspec` equals writing `{ matrix: [...], count: 3 }` directly: 4 × 3 = 12 jobs (`rspec: [aws, app] 1/3` … `rspec: [gcp, db] 3/3`), with the same names, variables, and 200-job limit. This combination is **intended**, and is a main reason the feature is an additive `count:` key rather than a separate keyword: teams keep a shared matrix base and let individual jobs opt into multiple instances.

**`parallel: 5` and `parallel: { count: 5 }` are the same thing.** `count:` is the primary spelling; plain `parallel: 5` is shorthand for `parallel: { count: 5 }`, guaranteed identical in **every context, including `extends`**. This matters because `extends` merges a plain number and a block differently (a number replaces the inherited value entirely; a block merges key-by-key). Without this guarantee, the two forms would combine _differently_ with an inherited `matrix:`. Treating them as the same removes that inconsistency: they're interchangeable, including when a base `matrix:` meets an inherited count. The rule stays simple: a count is a count, however written.

A `parallel:` block containing **only** `count:` (`parallel: { count: 5 }`) is valid and behaves exactly like `parallel: 5`: 5 instances, `CI_NODE_TOTAL` 5, `CI_NODE_INDEX` 1..5, no `CI_COUNT_*` (no matrix, so it's the plain numeric form). A user never has to reason about whether a count "requires" a matrix.

#### How `count` interacts with `needs:` and `dependencies:`

This is the most detailed part, because a job listing a `count`-using job in `needs`/`dependencies` now points at a _group of instances_ per permutation, not a single job. Every case is defined.
Guiding principle: **a reference written exactly as today keeps working and depends on every matching job; depending on one specific instance is opt-in.**

A `parallel:` block inside `needs` already works as a **filter**: its `matrix:` chooses which upstream _permutations_ to wait on, via fixed values (`PROVIDER: aws`) or the `$[[ matrix.PROVIDER ]]` expression GitLab already supports for [one-to-one matrix dependencies](https://docs.gitlab.com/ci/yaml/needs/#parallelmatrix-jobs). The proposal extends that filter to the instance dimension: `count:` inside `needs` chooses which upstream _instances_ to wait on, as a fixed number or an expression.

This gives **three rules**:

1. **`count:` omitted → depend on ALL instances** of each selected permutation (or the single upstream job, if it doesn't use `count`). This default matches how `needs`/`dependencies` already behave (a job needing a `parallel: 3` job depends on all 3, never one), making `count:` **purely additive**: adding it upstream never changes an existing job's dependencies unless the author explicitly writes a `count:` filter.
2. **Fixed `count: N` → depend on exactly instance `N`** of each selected permutation (e.g. `count: 3` → the `… 3/3` instance only). The instance-dimension equivalent of selecting one fixed matrix value; valid whenever the upstream uses `count`, and out-of-range `N` is a config error.
3. **`count: '$[[ count.INDEX ]]'` → pair instances one-to-one**, instance `k` of the test job to instance `k` of the build job. This new expression mirrors `$[[ matrix.VAR ]]` and is replaced with the depending job instance's own 1-based index. Valid **only** when both jobs use `count` with the **same** value (the depending job needs its own instances to provide the index; mismatched counts would leave some instance `k` with no counterpart). Any other use is a config error, not a silent guess.
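The three rules above can be sketched as a small resolver. This is a hypothetical Python model of the proposed behaviour, not GitLab code: `resolve_count_filter`, `ConfigError`, and the convention of returning `None` for "the single upstream job" are all assumptions made for illustration.

```python
COUNT_INDEX_EXPR = "$[[ count.INDEX ]]"


class ConfigError(ValueError):
    """Stands in for a pipeline configuration error."""


def resolve_count_filter(count_filter, upstream_count, own_count=None, own_index=None):
    """Return the upstream instance numbers one depending job waits on.

    count_filter   -- None (omitted), an int, or COUNT_INDEX_EXPR
    upstream_count -- the needed job's count, or None if it has no count
    own_count      -- the depending job's count, or None if it has no count
    own_index      -- the depending job instance's 1-based index

    Returns a list of instance numbers, or None when the upstream job has
    no instance dimension (depend on the single job per permutation).
    """
    if count_filter is None:
        # Rule 1: omitted means every instance of the matching permutation.
        if upstream_count is None:
            return None
        return list(range(1, upstream_count + 1))

    if count_filter == COUNT_INDEX_EXPR:
        # Rule 3: one-to-one pairing needs equal counts on both sides.
        if own_count is None:
            raise ConfigError("count.INDEX used by a job that has no count")
        if upstream_count != own_count:
            raise ConfigError("count.INDEX requires both jobs to use the same count")
        return [own_index]

    # Rule 2: a fixed instance number, valid only within the upstream range.
    if upstream_count is None or not 1 <= count_filter <= upstream_count:
        raise ConfigError("count filter does not match any upstream instance")
    return [count_filter]


if __name__ == "__main__":
    print(resolve_count_filter(None, 3))
    print(resolve_count_filter(3, 3))
    print(resolve_count_filter(COUNT_INDEX_EXPR, 3, own_count=3, own_index=2))
```

The permutation (`matrix:`) filter is left out on purpose; this sketch covers only the instance dimension within an already-selected permutation.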
Two naming choices follow: the expression is **`count.INDEX`** (not `copy.*`/`instance.*`) to align with the `count:` keyword and `CI_COUNT_INDEX` and introduce no new wording (only `INDEX` is offered; `count.TOTAL` has no pairing use and could be added later). And the design keeps **one consistent model**: `matrix:` and `count:` each _create_ jobs when written on a job, and _filter_ which upstream jobs to wait on when written inside `needs`, in both places as a fixed value or a `$[[ ]]` expression.

```yaml
linux:build:
  script: build.sh
  parallel:
    matrix:
      - PROVIDER: [aws, gcp]
    count: 3

linux:test:
  script: test.sh
  parallel:
    matrix:
      - PROVIDER: [aws, gcp]
    count: 3
  needs:
    - job: linux:build
      parallel:
        matrix:
          - PROVIDER: ['$[[ matrix.PROVIDER ]]']  # existing: match the same permutation
        count: '$[[ count.INDEX ]]'               # new: also match the same instance number
```

Below, "test job" declares `needs`, "build job" is referenced, and "matching permutation" is the upstream permutation chosen by the `matrix:` filter (or by referencing the job with no `matrix:` filter).

| test job uses `count` | build job uses `count` | `count:` in the `needs` filter | Result | Rule |
|-----------------------|------------------------|--------------------------------|--------|------|
| no | no | n/a (no `count` anywhere) | today's one-to-one matrix mapping, unchanged | n/a (no instances exist on either side) |
| no | yes | omitted | test job depends on ALL instances of the matching permutation | 1 (additive default, same as `parallel: N` today) |
| no | yes | `count: 3` (fixed) | test job depends on instance 3 only | 2 (fixed selection) |
| no | yes | `$[[ count.INDEX ]]` | **config error** | 3: test job has no instances, so no index to substitute |
| yes | no | omitted | every test instance depends on the single build job | 1 (only one upstream job exists) |
| yes | yes | omitted | every test instance depends on ALL build instances of the matching permutation | 1 (omitting must NOT silently switch to one-to-one) |
| yes | yes | `count: 3` (fixed) | every test instance depends on build instance 3 | 2 (same fixed target for every test instance) |
| yes | yes | `$[[ count.INDEX ]]`, **same** count | instance `k` → instance `k`, paired one-to-one | 3 (the expression's purpose; well-defined) |
| yes | yes | `$[[ count.INDEX ]]`, **different** counts | **config error** | 3: some instance `k` would have no counterpart |

These rules also settle whether `count:` is _allowed_ inside a `needs` filter: it is, neither silently ignored nor rejected. Silently ignoring a written `count: 3` would be worst (user asks for instance 3, quietly gets all); rejecting it would discard a useful capability. Treating it as a filter gives the written value a clear, useful meaning.

**Referring to a permutation by name in `dependencies:` / `needs:`.** GitLab documents [fetching artifacts from a specific matrix permutation](https://docs.gitlab.com/ci/jobs/job_control/#fetch-artifacts-from-a-parallelmatrix-job) by naming it exactly.
When that permutation uses `count`, it's several instances, not one job. Unlike `needs`, a plain `dependencies` entry is just a job name with nowhere to add a `count:` filter, so the **name itself** expresses specificity, read from whether it includes an ` x/y` instance suffix. (Same applies to a job name written inside `needs`.)

| reference written | refers to | rule |
|-------------------|-----------|------|
| job name only (`"ruby"`) | ALL generated jobs (every permutation × every instance) | (pre-existing) a name with no detail refers to all |
| permutation name, no suffix (`"ruby: [2.7, aws]"`) | ALL instances of THAT permutation (`… 1/C..C/C`) | 1: no suffix means all instances; avoids silently breaking an existing reference |
| permutation name with suffix (`"ruby: [2.7, aws] 2/C"`) | exactly that ONE instance | 2: targets a single instance |
| suffix out of range (`"… 9/C"`, `9 > C`) | matches nothing, same as naming any nonexistent job today | 2: unchanged behaviour; not the feature's job to police |

Treating an unsuffixed name as "all instances" is the decisive choice. Users write `dependencies: ["ruby: [2.7, aws]"]` today. If a suffix-less name did _not_ refer to the instances, then once `ruby` added `count`, that existing line would match no job and the artifact fetch would quietly break, breaking a working pipeline from the _other_ side. Keeping it as "all instances" preserves the line; its meaning simply widens from "one job" to "all instances of that permutation." This is consistent with the `needs` default, so the feature tells one story: **a name with no instance suffix means all matching jobs.**

#### Where `count` is not allowed: trigger jobs

A [trigger job](https://docs.gitlab.com/ci/pipelines/downstream_pipelines/) runs no script; it starts a downstream pipeline.
`parallel:matrix` on a trigger job is meaningful (one downstream pipeline per permutation, each with different variables), but `count` is not:

```yaml
deploy:
  trigger:
    project: my-group/deploy-project
  parallel:
    matrix:
      - PROVIDER: [aws, gcp]  # OK: one downstream pipeline per provider
    count: 3                  # rejected: 3 identical downstream pipelines per provider
```

**`count` is not allowed on a trigger job**, neither with `matrix:` nor alone. Writing it is a config error with a clear message, the same way a plain `parallel: N` is already rejected on a trigger job. The reasons build on each other:

- **It re-introduces what `parallel: N` is already rejected for.** Plain `parallel: N` is rejected because N _identical_ triggers are meaningless: no script to divide, so extra instances have nothing to do differently. `matrix:` + `count: N` is the same request: N _identical_ downstream pipelines per permutation (same provider, variables, downstream config, nothing to distinguish them). Rejecting `parallel: N` but allowing `count` would be inconsistent; both ask for "N identical triggers."
- **Allowing it would define behaviour with no real use** ("N identical downstream pipelines per permutation"), matching no identified use case and almost always a mistake (e.g. the same deployment firing repeatedly at the same target).
- **It keeps the first version focused** on the proven case of splitting work across instances of real jobs. This is also why `count` is limited to a job's own `parallel:` and can't reach a trigger job through `extends`.

This is the deliberate counterpart to allowing `count` inside a `needs` filter: a `needs` filter selects real upstream instances that exist, so a filter is meaningful; a trigger job has no instances to point at, so `count` has nothing to mean.

### Further details

#### Backwards compatibility

This feature is **purely additive**.
Any config not combining `count:` with `matrix:` behaves exactly as today (same jobs, names, variables, validation):

- `parallel: 5` → 5 jobs `job 1/5`..`5/5`, `CI_NODE_INDEX` 1..5, `CI_NODE_TOTAL` 5. `CI_COUNT_*` **not** set.
- `parallel: { matrix: [...] }` → one job per permutation, named `job: [values]`, `CI_NODE_TOTAL` = permutations. `CI_COUNT_*` **not** set; naming and `CI_NODE_*` unchanged.
- A `parallel:` block with neither `matrix:` nor a count, or with unknown keys, still produces today's error.

`CI_COUNT_INDEX` / `CI_COUNT_TOTAL` appear **only** when `count:` is combined with `matrix:`.

#### Technical fit with the existing implementation

This slots into existing extension points rather than reworking them:

- `parallel:` is modelled as a `Simplifiable` entry with `ParallelBuilds` (numeric) and `MatrixBuilds` (hash) strategies in `lib/gitlab/ci/config/entry/product/parallel.rb`. `MatrixBuilds` declares `PERMITTED_KEYS = %i[matrix]`; `count` is added there, the same key-by-key growth the `exclude:` sibling key already follows.
- Job expansion happens via `NumberStrategy` and `MatrixStrategy` (`lib/gitlab/ci/config/normalizer/`), combined in `factory.rb`. The feature composes these two existing strategies (matrix expansion × instance count), not a third mechanism.
- The 200-job cap is one constant, `Parallel::PARALLEL_LIMIT = 200`, enforced by `Matrix#number_of_generated_jobs` as "generates too many jobs (maximum is 200)". The feature applies the same constant and message to `permutations × count`; no new limit or error string.
- `CI_NODE_INDEX` / `CI_NODE_TOTAL` are emitted in `lib/gitlab/ci/variables/builder.rb` from `job.options[:instance]` and `ci_node_total_value`. The new `CI_COUNT_*` pair is added alongside, gated on the combined `matrix + count` case.
- `needs` already validates `parallel:` with `allowed_strategies` and rejects `parallel: <number>` (`lib/gitlab/ci/config/entry/needs.rb`); `$[[ matrix.VAR ]]` resolution lives in `lib/gitlab/ci/config/interpolation/matrix_interpolator.rb`. The `count:` filter and `$[[ count.INDEX ]]` expression extend these existing paths.

### Documentation

- Update [`parallel` / `parallel:matrix` in the CI YAML reference](https://docs.gitlab.com/ci/yaml/#parallel) to document `count:`, the combined form, and the 200-job limit applying to `permutations × count`.
- Update [predefined variables](https://docs.gitlab.com/ci/variables/predefined_variables/) to add `CI_COUNT_INDEX` and `CI_COUNT_TOTAL`, clarifying when they're set.
- Update [`needs:parallel:matrix`](https://docs.gitlab.com/ci/yaml/needs/#parallelmatrix-jobs) and [fetch artifacts from a parallel:matrix job](https://docs.gitlab.com/ci/jobs/job_control/#fetch-artifacts-from-a-parallelmatrix-job) to cover the `count:` filter, `$[[ count.INDEX ]]`, and the "name with no suffix = all instances" rule.

### Availability & Testing

This is config-parsing/expansion logic with no runtime infrastructure changes, so availability risk is low; the existing 200-job cap bounds amplification.

Test areas:

- **Unit**: `parallel.rb` (accepting `count:` in `MatrixBuilds`, numeric/block equivalence, `count`-only validity), `matrix.rb` / `number_of_generated_jobs` and the normalizer strategies (`permutations × count` expansion, `… x/y` naming), `variables/builder.rb` (`CI_COUNT_*` set only in the combined case; `CI_NODE_*` totals).
- **Integration**: `YamlProcessor` / `Config::Normalizer` for combined expansion, `extends` merging (base `matrix:` + inherited `count:`, numeric vs block), `needs`/`dependencies` selection (all three rules + error cases from the matrix table), and trigger-job rejection.
- **End-to-end**: a pipeline using `{ matrix:, count: }` sharding via `$CI_COUNT_INDEX`, plus a `needs` one-to-one `$[[ count.INDEX ]]` pairing.

### Is this a cross-stage feature?

Primarily owned by **group::pipeline authoring** (`Category:Pipeline Composition`, `section::ci`, `devops::verify`). It touches predefined CI variables (Runner) and artifact/`needs` resolution, so a check-in with the relevant Verify groups is worthwhile, but no other stage owns the change.

### Prior art: why the two forms are mutually exclusive today

The mutual exclusivity is **a design decision, not an accident or technical barrier**, made when `parallel:matrix` was introduced in [#15356 (closed)](https://gitlab.com/gitlab-org/gitlab/-/issues/15356) (shipped 13.3). Understanding it makes this proposal a natural extension, not a reversal.

The team reasoned explicitly about the _type_ accepted by `parallel:` ([note_309409940](https://gitlab.com/gitlab-org/gitlab/-/issues/15356#note_309409940)): an **Integer** (`parallel: 5`) means "run N instances"; a **Hash** (`parallel: { matrix: [...] }`) means "expand a matrix". The `matrix:` keyword was wrapped in a Hash **specifically so `parallel:` could grow sibling keys later** without colliding with the integer or array forms. As Kamil Trzciński (the architect) put it, the extra Hash level was added "to clearly indicate the intent (to cut on some magic), but also have an ability to extend `parallel:` further if needed."

Crucially, **this exact extension was already sketched in that discussion** as `replicas:` ([note_334877407](https://gitlab.com/gitlab-org/gitlab/-/issues/15356#note_334877407)):

```yaml
parallel:
  replicas: 10
  matrix:
    - OS: [windows, mac]
```

This was described as equivalent to running `parallel: 10` for each matrix variant, i.e. a per-permutation multiplier, which is precisely what `count:` provides. It was deferred, not rejected: maintainers were wary of designing for a future that hadn't arrived ([note_335023273](https://gitlab.com/gitlab-org/gitlab/-/issues/15356#note_335023273)), preferring to leave the extension point empty until a concrete, demanded use case appeared.
This proposal supplies it.

**The reserved extension point has already been used once**: after launch, `parallel:matrix` gained an `exclude:` sibling key, exactly the key-by-key growth the Hash design enabled. A `count:` key follows the same pattern. This proposal doesn't fight the original design; it fills a slot the design deliberately reserved, for the use case the maintainers said they were waiting for.

### Links / references

- [#15356 (closed)](https://gitlab.com/gitlab-org/gitlab/-/issues/15356): introduced `parallel:matrix` and the type-overloading design this builds on; contains the deferred `replicas:` sketch ([note_334877407](https://gitlab.com/gitlab-org/gitlab/-/issues/15356#note_334877407)) and the rationale for reserving the Hash extension point ([note_309409940](https://gitlab.com/gitlab-org/gitlab/-/issues/15356#note_309409940)).
- [#26362 (closed)](https://gitlab.com/gitlab-org/gitlab/-/issues/26362): relaxed the "matrix requires ≥2 items" validation (shipped 13.5, via community MR [!42170 (merged)](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/42170)); precedent for safely loosening a conservative `parallel:matrix` restriction.
- [#241127 (closed)](https://gitlab.com/gitlab-org/gitlab/-/issues/241127): customer-reported friction with the dummy-variable workaround this feature removes.
- [#27112](https://gitlab.com/gitlab-org/gitlab/-/issues/27112): open, adjacent idea (`extends` with an array of values to create multiple jobs).
