Introduce DR mode for CI queueing [RUN ALL RSPEC] [RUN AS-IF-FOSS]
What does this MR do?
There's a known set of deficiencies in a CI queueing that might impact our ability to process builds.
As a way to temporarily mitigate them we allow to disable them for a very limited period.
Purpose
- Take into account that these queries are executed against replicas, always
- The usage of these queries is long queues or 500 returned by
jobs/request
Related to
Impact
This introduces two defcon modes (that can be used separately from each other) that reduces complexity of this query:
explain analyze
SELECT "ci_builds".* FROM "ci_builds"
INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id"
LEFT JOIN project_features ON ci_builds.project_id = project_features.project_id
LEFT JOIN (
SELECT "ci_builds"."project_id", count(*) AS running_builds
FROM "ci_builds"
WHERE "ci_builds"."type" = 'Ci::Build' AND ("ci_builds"."status" IN ('running')) AND "ci_builds"."runner_id" IN (SELECT "ci_runners"."id" FROM "ci_runners" WHERE "ci_runners"."runner_type" = 1) GROUP BY "ci_builds"."project_id"
) AS project_builds ON ci_builds.project_id=project_builds.project_id
WHERE
("ci_builds"."status" IN ('pending')) AND "ci_builds"."runner_id" IS NULL AND
"projects"."shared_runners_enabled" = TRUE AND "projects"."pending_delete" = FALSE AND
(project_features.builds_access_level IS NULL or project_features.builds_access_level > 0) AND
"ci_builds"."type" = 'Ci::Build' AND
("projects"."visibility_level" = 20 OR (EXISTS (WITH RECURSIVE "base_and_ancestors" AS ((SELECT "namespaces".* FROM "namespaces" WHERE (namespaces.id = projects.namespace_id)) UNION (SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" WHERE "namespaces"."id" = "base_and_ancestors"."parent_id")) SELECT 1 FROM "base_and_ancestors" AS "namespaces" LEFT JOIN namespace_statistics ON namespace_statistics.namespace_id = namespaces.id WHERE "namespaces"."parent_id" IS NULL AND (COALESCE(namespaces.shared_runners_minutes_limit, 400, 0) = 0 OR COALESCE(namespace_statistics.shared_runners_seconds, 0) < COALESCE((namespaces.shared_runners_minutes_limit + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)), (400 + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)), 0) * 60))))
ORDER BY COALESCE(project_builds.running_builds, 0) ASC, ci_builds.id ASC;
(This misses tags matching, but this is required in all cases).
DEFCON: disable quota
explain analyze
SELECT "ci_builds".* FROM "ci_builds"
INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id"
LEFT JOIN project_features ON ci_builds.project_id = project_features.project_id
LEFT JOIN (
SELECT "ci_builds"."project_id", count(*) AS running_builds
FROM "ci_builds"
WHERE "ci_builds"."type" = 'Ci::Build' AND ("ci_builds"."status" IN ('running')) AND "ci_builds"."runner_id" IN (SELECT "ci_runners"."id" FROM "ci_runners" WHERE "ci_runners"."runner_type" = 1) GROUP BY "ci_builds"."project_id"
) AS project_builds ON ci_builds.project_id=project_builds.project_id
WHERE
("ci_builds"."status" IN ('pending')) AND "ci_builds"."runner_id" IS NULL AND
"projects"."shared_runners_enabled" = TRUE AND "projects"."pending_delete" = FALSE AND
(project_features.builds_access_level IS NULL or project_features.builds_access_level > 0) AND
"ci_builds"."type" = 'Ci::Build'
ORDER BY COALESCE(project_builds.running_builds, 0) ASC, ci_builds.id ASC;
DEFCON: disable fair scheduling
explain analyze
SELECT "ci_builds".* FROM "ci_builds"
INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id"
LEFT JOIN project_features ON ci_builds.project_id = project_features.project_id
WHERE
("ci_builds"."status" IN ('pending')) AND "ci_builds"."runner_id" IS NULL AND
"projects"."shared_runners_enabled" = TRUE AND "projects"."pending_delete" = FALSE AND
(project_features.builds_access_level IS NULL or project_features.builds_access_level > 0) AND
"ci_builds"."type" = 'Ci::Build' AND
("projects"."visibility_level" = 20 OR (EXISTS (WITH RECURSIVE "base_and_ancestors" AS ((SELECT "namespaces".* FROM "namespaces" WHERE (namespaces.id = projects.namespace_id)) UNION (SELECT "namespaces".* FROM "namespaces", "base_and_ancestors" WHERE "namespaces"."id" = "base_and_ancestors"."parent_id")) SELECT 1 FROM "base_and_ancestors" AS "namespaces" LEFT JOIN namespace_statistics ON namespace_statistics.namespace_id = namespaces.id WHERE "namespaces"."parent_id" IS NULL AND (COALESCE(namespaces.shared_runners_minutes_limit, 400, 0) = 0 OR COALESCE(namespace_statistics.shared_runners_seconds, 0) < COALESCE((namespaces.shared_runners_minutes_limit + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)), (400 + COALESCE(namespaces.extra_shared_runners_minutes_limit, 0)), 0) * 60))))
ORDER BY ci_builds.id ASC;
DEFCON: disable quota and fair scheduling
explain analyze
SELECT "ci_builds".* FROM "ci_builds"
INNER JOIN "projects" ON "projects"."id" = "ci_builds"."project_id"
LEFT JOIN project_features ON ci_builds.project_id = project_features.project_id
WHERE
("ci_builds"."status" IN ('pending')) AND "ci_builds"."runner_id" IS NULL AND
"projects"."shared_runners_enabled" = TRUE AND "projects"."pending_delete" = FALSE AND
(project_features.builds_access_level IS NULL or project_features.builds_access_level > 0) AND
"ci_builds"."type" = 'Ci::Build'
ORDER BY ci_builds.id ASC;
Does this MR meet the acceptance criteria?
Conformity
-
Does this MR need a changelog?-
I have included a changelog entry. -
I have not included a changelog entry because _____.
-
-
Documentation (if required) -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Availability and Testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. -
Tested in all supported browsers -
Informed Infrastructure department of a default or new setting change, if applicable per definition of done
Security
If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:
-
Label as security and @ mention @gitlab-com/gl-security/appsec
-
The MR includes necessary changes to maintain consistency between UI, API, email, or other methods -
Security reports checked/validated by a reviewer from the AppSec team
Merge request reports
Activity
added grouppipeline authoring label
added grouppipeline execution label and removed grouppipeline authoring label
added backend documentation labels
4 Messages CHANGELOG missing: If you want to create a changelog entry for GitLab FOSS, run the following:
bin/changelog -m 56658 "Introduce DR mode for CI queueing [RUN ALL RSPEC] [RUN AS-IF-FOSS]"
If you want to create a changelog entry for GitLab EE, run the following instead:
bin/changelog --ee -m 56658 "Introduce DR mode for CI queueing [RUN ALL RSPEC] [RUN AS-IF-FOSS]"
If this merge request doesn't need a CHANGELOG entry, feel free to ignore this message.
We are in the process of rolling out a new workflow for adding changelog entries. This new workflow uses Git commit subjects and Git trailers to generate changelogs. This new approach will soon replace the current YAML based approach. To ease the transition process, we recommend you start using both the old and new approach in parallel. This is not required at this time, but will make it easier to transition to the new approach in the future. To do so, pick the commit that should go in the changelog and add a
Changelog
trailer to it. For example:This is my commit's subject line This is the optional commit body. Changelog: added
The value of the
Changelog
trailer should be one of the following: added, fixed, changed, deprecated, removed, security, performance, other.For more information, take a look at the following resources:
https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1564
- https://docs.gitlab.com/ee/api/repositories.html#generate-changelog-data
If you'd like to see the new approach in action, take a look at the commits in the Omnibus repository.
This merge request adds or changes documentation files. A review from the Technical Writing team before you merge is recommended. Reviews can happen after you merge. You're adding or removing a feature flag, your MR title needs to include [RUN ALL RSPEC] [RUN AS-IF-FOSS]
(we may have updated it automatically for you and started a new MR pipeline) to ensure everything is covered.Documentation review
The following files require a review from a technical writer:
doc/administration/troubleshooting/defcon.md
The review does not need to block merging this merge request. See the:
-
Metadata for the
*.md
files that you've changed. The first few lines of each*.md
file identify the stage and group most closely associated with your docs change. - The Technical Writer assigned for that stage and group.
- Documentation workflows for information on when to assign a merge request for review.
Reviewer roulette
Changes that require review have been detected! A merge request is normally reviewed by both a reviewer and a maintainer in its primary category (e.g. frontend or backend), and by a maintainer in all other categories.
To spread load more evenly across eligible reviewers, Danger has picked a candidate for each review slot, based on their timezone. Feel free to override these selections if you think someone else would be better-suited or use the GitLab Review Workload Dashboard to find other available reviewers.
To read more on how to use the reviewer roulette, please take a look at the Engineering workflow and code review guidelines. Please consider assigning a reviewer or maintainer who is a domain expert in the area of the merge request.
Once you've decided who will review this merge request, assign them as a reviewer! Danger does not automatically notify them for you.
Category Reviewer Maintainer backend Subashis Chakraborty ( @subashis
) (UTC-6, 8 hours behind@ayufan
)Gabriel Mazetto ( @brodock
) (UTC+2, same timezone as@ayufan
)If needed, you can retry the
danger-review
job that generated this comment.Generated by
DangerEdited by 🤖 GitLab Bot 🤖changed milestone to %13.10
removed grouppipeline execution label
added groupcloud connector label
added 924 commits
-
830e1977...71011b0a - 923 commits from branch
master
- e6df75a2 - Introduce DEFCON mode for CI queueing
-
830e1977...71011b0a - 923 commits from branch
- Resolved by Kamil Trzciński
@ayufan thank you for this. I'm in favour of this approach
We should make sure that we include references from the infrastructure runbooks to documentation on when to set defcon mode.
Would it be possible to alert on this, say, if it was forgotten enabled?
Also, minor nit, but GitLab discourages military references in names: I'm not sure if DEFCON is applicable to this or not
Edited by Andrew Newdigate
unassigned @grzesiek
requested review from @alexbuijs
- Resolved by Kamil Trzciński
@ayufan I think this shaves a pretty good deal of both of the queries times, so i think it's a nice approach. Would it make sense if we used the same flag to enable both instead of 2 different ones. 2 different ones give more granular control, but since we intend to use this for emergency situations only, it might make sense to use 1 flag, wdyt?
removed review request for @alexbuijs
changed milestone to %13.11
added missed:13.10 label
added grouppipeline execution label and removed groupcloud connector label
added 1 commit
- 3557402e - Use one feature flag for two related optimizations
added 1 commit
- 738a7d59 - Use one feature flag for two related optimizations
added 1 commit
- 23c8dc66 - Use one feature flag for two related optimizations
mentioned in epic &5434 (closed)
mentioned in issue #324367 (closed)
Setting label(s) devopsverify sectionops based on ~"group::continuous integration".
added devopsverify sectionops labels
- Resolved by Kamil Trzciński
- Resolved by Kamil Trzciński
- Resolved by Kamil Trzciński
removed review request for @grzesiek
added 2943 commits
-
23c8dc66...238897eb - 2942 commits from branch
master
- c1a94592 - Introduce DR mode for CI queueing
-
23c8dc66...238897eb - 2942 commits from branch
@ayufan this is great
thank you. When this is delivered, we need to make sure that the infrastructure team are aware of it, when to use it, and how to enable it, probably through updates to the runbooks documentation.unassigned @andrewn
changed milestone to %13.12
added missed:13.11 label
requested review from @grzesiek
added 5990 commits
-
a36d194b...972168d4 - 5989 commits from branch
master
- f99f8ea0 - Introduce DR mode for CI queueing
-
a36d194b...972168d4 - 5989 commits from branch
added feature flag label
- Resolved by Kamil Trzciński
@ayufan follow up from gitlab-com/gl-infra/production#4481 (closed), @ahegyi pointed out that this would help, an uses the new partial index, but is still negatively impacted as the queue grows longer.
Ideally, if at all possible, the length of the queue should not lead to the query taking longer, as this leads to a positive feedback loop - more timeouts mean more pending jobs, means more timeouts.
cc @abrandl
requested review from @grzesiek
requested review from @mkaeppler
- Resolved by Kamil Trzciński
- Resolved by Matthias Käppler
- Resolved by Matthias Käppler
added 296 commits
-
f99f8ea0...e8fd52b4 - 295 commits from branch
master
- 49338c97 - Introduce DR mode for CI queueing
-
f99f8ea0...e8fd52b4 - 295 commits from branch
enabled an automatic merge when the pipeline for 61f38205 succeeds
mentioned in commit 0ee2d125
added workflowstaging label
added workflowcanary label and removed workflowstaging label
added workflowproduction label and removed workflowcanary label
added releasedcandidate label
added typemaintenance label
mentioned in issue gitlab-org/quality/triage-reports#20613 (closed)
mentioned in issue gitlab-org/quality/triage-reports#20978 (closed)
mentioned in issue gitlab-org/quality/triage-reports#21525 (closed)
mentioned in issue gitlab-org/quality/triage-reports#22029