[Feature Flag] Rollout `ci_validate_build_dependencies`
Summary
We implemented the build dependency validation feature a long time ago behind the ci_validate_build_dependencies feature flag. The implementation was complete and already enabled by default at the code level (i.e. already enabled on on-premises instances); however, we forgot to enable the feature flag on gitlab.com until today. This issue tracks enabling the feature flag on production and eventually removing the flag to resolve the technical debt.
Owners
- Team: ~"group::continuous integration"
- Most appropriate slack channel to reach out to: #g_ci
- Best individual to reach out to: @shinya.maeda
Expectations
What are we expecting to happen?
This feature validates that build dependencies are still valid at the time the downstream stage/jobs start running. To illustrate this behavior, say you have the following .gitlab-ci.yml:
compile:
  stage: build
  script: echo 'build' >> application_binary
  artifacts:
    paths:
      - application_binary

production:
  stage: deploy
  script: deploy application_binary
  # There are multiple ways to define dependencies.
  #
  # Option 1: Implicit dependencies
  # Since this `production` job is executed in `deploy` stage which happens after `build` stage,
  # this job automatically downloads the `compile` job's artifacts as a dependency.
  #
  # Option 2: Explicit dependencies
  # dependencies: [compile]
  # This `dependencies` keyword explicitly defines that `production` job depends on `compile` job,
  # so that this job will download the artifacts of `compile` job.
  #
  # Option 3: DAG pipeline
  # needs: [compile]
  # This is similar to `dependencies` keyword. It explicitly defines that the `production` job depends
  # on `compile` job, and will download the artifacts.
So the idea of a build dependency is to download upstream artifacts. But what if the upstream artifacts have already expired or been erased? The runner tries to fetch the artifacts and fails. In this case, the failure is treated as script_failure, which does not help in understanding the underlying problem.
With this feature, when such a case happens, the failure is explicitly marked as missing_dependency_failure, which shows a dedicated message on the job detail page. Also, since the job fails proactively when a runner requests a job (RegisterJobService), this feature potentially saves runners some load.
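The check described above can be sketched roughly as follows. This is a minimal illustration and not GitLab's actual implementation: the `UpstreamJob` class, its fields, and the `failure_reason_for` helper are hypothetical names. Only the `missing_dependency_failure` reason and the "expired or erased" conditions come from this issue.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class UpstreamJob:
    """Hypothetical stand-in for an upstream job that a downstream job depends on."""
    name: str
    artifacts_expired: bool = False
    erased: bool = False

    def valid_dependency(self) -> bool:
        # A dependency stops being valid once its artifacts expired or the job was erased.
        return not (self.artifacts_expired or self.erased)


def failure_reason_for(dependencies: List[UpstreamJob]) -> Optional[str]:
    """Return the failure reason to record before the job is handed to a runner.

    Returns None when every dependency is still valid, so the job can start.
    """
    for job in dependencies:
        if not job.valid_dependency():
            return "missing_dependency_failure"
    return None


if __name__ == "__main__":
    deps = [UpstreamJob("compile", artifacts_expired=True)]
    print(failure_reason_for(deps))
```

Because the check runs before the runner receives the job, a job with an invalid dependency never reaches the point of trying to download artifacts.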
One additional database query runs per validation to fetch the list of jobs that the build depends on. A sample query looks like the following:
SELECT
  "ci_builds".*
FROM
  "ci_builds"
WHERE
  "ci_builds"."type" = 'Ci::Build'
  AND "ci_builds"."commit_id" = 277941511
  AND (
    "ci_builds"."retried" = FALSE
    OR "ci_builds"."retried" IS NULL
  )
  AND (stage_idx < 6)
  AND "ci_builds"."name" IN (
    SELECT
      "ci_build_needs"."name"
    FROM
      "ci_build_needs"
    WHERE
      "ci_build_needs"."build_id" = 1135212741
      AND "ci_build_needs"."artifacts" = TRUE
  )
Another example:
SELECT
  "ci_builds".*
FROM
  "ci_builds"
WHERE
  "ci_builds"."type" = 'Ci::Build'
  AND "ci_builds"."commit_id" = 277941511
  AND (
    "ci_builds"."retried" = FALSE
    OR "ci_builds"."retried" IS NULL
  )
  AND (stage_idx < 5)
  AND "ci_builds"."name" IN (
    'rspec frontend_fixture',
    'rspec-ee frontend_fixture 1/2',
    'rspec-ee frontend_fixture 2/2'
  )
This query has already been optimized, so the expected timing is generally ~25ms with a cold cache.
What might happen if this goes wrong?
Users may report that their CI pipelines start failing with "There has been a missing dependency failure". This is expected, but some users might be surprised by this behavioral change.
What can we monitor to detect problems with this?
Given that the validation happens in RegisterJobService, we should monitor the load on the POST api/jobs/:id/request endpoint.
You can also check recent jobs that failed on a missing dependency with the following query:
SELECT * FROM ci_builds WHERE failure_reason = 5 ORDER BY created_at DESC LIMIT 10;
(This query could hit a statement timeout due to the table size. Please create a temporary index if necessary.)
Beta groups/projects
If applicable, any groups/projects that are happy to have this feature turned on early. Some organizations may wish to test big changes they are interested in with a small subset of users ahead of time for example.
- gitlab-org/gitlab project
- gitlab-org/gitlab-com groups
- ...
Roll Out Steps
- Enable on staging (/chatops run feature set feature_name true --staging)
- Test on staging
- Ensure that documentation has been updated
- Enable on GitLab.com for individual groups/projects listed above and verify behaviour (/chatops run feature set --project=gitlab-org/gitlab feature_name true)
- Coordinate a time to enable the flag with the SRE oncall and release managers
  - In #production mention @sre-oncall and @release-managers. Once an SRE on call and Release Manager on call confirm, you can proceed with the rollout
- Announce on the issue an estimated time this will be enabled on GitLab.com
- Enable on GitLab.com by running chatops command in #production (/chatops run feature set feature_name true)
- Cross post chatops Slack command to #support_gitlab-com (more guidance when this is necessary in the dev docs) and in your team channel
- Announce on the issue that the flag has been enabled
- Remove feature flag and add changelog entry
- After the flag removal is deployed, clean up the feature flag by running chatops command in #production channel
Rollout Commands
Given that this behavioral change could affect the success rate of existing CI/CD pipelines, we have granular options for enabling the flag.
Rollout for specific projects
- /chatops run feature set ci_validate_build_dependencies_override true to ensure that the feature is disabled globally.
- /chatops run feature set ci_validate_build_dependencies true --project=gitlab-org/gitlab to enable the main switch for specific projects.
  /chatops run feature set ci_validate_build_dependencies true --project=gitlab-com/www-gitlab-com
- /chatops run feature delete ci_validate_build_dependencies_override to let the main switch control the feature state.
Progress: 100%
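The command descriptions in this section imply how the two flags combine. The sketch below is an inference from those descriptions, not a copy of GitLab's code: `validation_active` is a hypothetical helper, and the defaults assume that the main switch is on when unset (as stated in the Summary) and that the override is off when deleted.

```python
def validation_active(main_switch: bool = True, override: bool = False) -> bool:
    """Hypothetical model of the two-flag setup.

    main_switch: state of ci_validate_build_dependencies (enabled by default in code).
    override:    state of ci_validate_build_dependencies_override (assumed off when deleted).
    The override wins: when it is on, validation is skipped regardless of the main switch.
    """
    return main_switch and not override


if __name__ == "__main__":
    # Override on: feature disabled globally, whatever the main switch says.
    print(validation_active(main_switch=True, override=True))
    # Override deleted: the main switch controls the feature state.
    print(validation_active(main_switch=True, override=False))
    print(validation_active(main_switch=False, override=False))
```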
Begin Percentage Rollout
- /chatops run feature set ci_validate_build_dependencies_override true to ensure that the feature is disabled globally.
- /chatops run feature delete ci_validate_build_dependencies to reset the main switch state.
- /chatops run feature set ci_validate_build_dependencies false to turn off the main switch.
- /chatops run feature set ci_validate_build_dependencies_override 95 --actors to disable the feature for 95% of the projects, i.e. roll out the feature to 5% of projects.
- /chatops run feature delete ci_validate_build_dependencies to reset the main switch state, which enables the feature by default.
Progress: 10%
Increase the percentage of actors
- /chatops run feature set ci_validate_build_dependencies_override 90 --actors to disable the feature for 90% of the projects, i.e. roll out the feature to 10% of projects.
Schedule
- 5% increase per working day (from 0% to 25%)
- 25% increase per working day (from 25% to 100%)
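Since the override flag takes the percentage of projects for which the feature is disabled, each rollout step has to be inverted before it goes into the command. The sketch below works through the schedule above. The helper names `rollout_schedule` and `override_command` are hypothetical, and the command format is copied from the percentage rollout commands in this issue. It deliberately produces no command for the final 100% step, because this issue does not state which command is used there.

```python
from typing import List


def rollout_schedule() -> List[int]:
    """Rollout percentages per working day: +5 up to 25%, then +25 up to 100%."""
    steps = []
    pct = 0
    while pct < 100:
        pct += 5 if pct < 25 else 25
        steps.append(pct)
    return steps


def override_command(rollout_pct: int) -> str:
    """Build the override command for a partial rollout.

    The override flag disables the feature for the given share of projects,
    so the value passed to it is 100 minus the rollout percentage.
    """
    if not 0 < rollout_pct < 100:
        raise ValueError("only partial rollouts (1-99%) map to an override percentage")
    disabled_pct = 100 - rollout_pct
    return (
        "/chatops run feature set ci_validate_build_dependencies_override "
        f"{disabled_pct} --actors"
    )


if __name__ == "__main__":
    for day, pct in enumerate(rollout_schedule(), start=1):
        if pct < 100:
            print(f"day {day}: {pct}% -> {override_command(pct)}")
        else:
            print(f"day {day}: {pct}% (full rollout)")
```

Under this schedule the rollout reaches 100% on the eighth working day.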
Rollback Steps
- This feature can be disabled by running the following Chatops command: /chatops run feature set ci_validate_build_dependencies false
History
We implemented a CI/CD feature, Dependency Validator, in %10.3.
Unfortunately, this feature had a bug, so we created a patch. This patch will go into %10.3 RC 3.
For the RC1/2 deployment, we disabled the feature with a feature flag (ci_disable_validates_dependencies). This is already in effect on production (gitlab.com): https://gitlab.slack.com/archives/C101F3796/p1513671612000019.
We need to re-enable this feature after RC 3 is deployed.
To re-enable it, please run !feature-set ci_disable_validates_dependencies false (please be careful: this is a double negation!).
If you have a question, please ping @ayufan, @stanhu or @dosuken123.