# Rollout of `ci_unlock_non_successful_pipelines` feature flag

Production Change

## Change Summary
This is for the rollout of gitlab-org/gitlab#428408 (closed).

We are opening a change management issue for this based on the documentation:

> When feature toggles, or associated features, have previously had to be rolled back due to user-impacting service degradation, or as a result of the previous toggle leading to a production incident.

This is related to #16451 (closed).

Now that the new unlocking mechanism has been completely rolled out, we will roll out this fix for the unlocking-related bugs. This has been attempted before with the old unlock mechanism, but it unfortunately resulted in an incident, hence the need for this change management request so we can monitor the rollout closely.
## Change Details

- Services Impacted - Redis, Sidekiq, Postgres
- Change Technician - @iamricecake
- Change Reviewer - DRI for the review of this change
- Time tracking - Around 10 days to incrementally enable the feature flag for all projects.
- Downtime Component - N/A
### Set Maintenance Mode in GitLab

N/A
## Detailed steps for the change

### Change Steps - steps to take to execute the change

We will slowly roll out `ci_unlock_non_successful_pipelines` to all projects, verifying the flag's state between increments (see the sketch after the steps below).

Estimated Time to Complete - 1 day per 10% increment, 10 days in total, but this might increase depending on how we see it perform.
- [ ] Set label ~"change::in-progress" `/label ~change::in-progress`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 10 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 20 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 30 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 40 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 50 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 60 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 70 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 80 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 90 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines true`
- [ ] Set label ~"change::complete" `/label ~change::complete`
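Between increments we can confirm the flag's current rollout state before moving to the next step. A minimal sketch, assuming the ChatOps `feature get` subcommand is available in our ChatOps bot (the exact subcommand is an assumption; if it is not available, the gate values can be checked from a Rails console instead):

```plaintext
# Check the current gate value(s) of the flag before bumping the next percentage (assumed subcommand)
/chatops run feature get ci_unlock_non_successful_pipelines
```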
## Rollback

### Rollback steps - steps to be taken in the event of a need to roll back this change

We can disable `ci_unlock_non_successful_pipelines` to revert to the old behavior of only unlocking successful pipelines. Worst case, we can enable `ci_stop_unlock_pipelines` to prevent any more pipelines from being enqueued for unlocking, regardless of whether `ci_unlock_non_successful_pipelines` is enabled or not (see the sketch after the rollback steps below). We can also disable `ci_unlock_pipelines_high`, `ci_unlock_pipelines_medium`, and `ci_unlock_pipelines` to prevent the limited capacity worker from picking up any more new jobs.

Estimated Time to Complete (mins) - 5 mins
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines false`
- [ ] `/chatops run feature set ci_unlock_pipelines_high false`
- [ ] `/chatops run feature set ci_unlock_pipelines_medium false`
- [ ] `/chatops run feature set ci_unlock_pipelines false`
- [ ] Set label ~"change::aborted" `/label ~change::aborted`
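If the worst case described above is reached, the kill switch flag would be enabled with the same `feature set` command used throughout this plan; a minimal sketch:

```plaintext
# Worst case only: stop enqueueing any more pipelines for unlocking
/chatops run feature set ci_stop_unlock_pipelines true
```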
## Monitoring

From gitlab-org/gitlab#428408 (closed):

- New Unlock Pipelines Mechanism Kibana Dashboard
- Redis Grafana Dashboard
- Grafana `Ci::Refs::UnlockPreviousPipelinesWorker` Overview
- Grafana `Ci::UnlockPipelinesInQueueWorker` Overview
- Kibana Logs for `Ci::Refs::UnlockPreviousPipelinesWorker` - Observe the following metadata attributes:
  - `total_pending_entries`
    - If there is a continuous increase of this number for a long time, consider increasing the limited capacity worker rate.
  - `total_new_entries`
- Kibana Logs for `Ci::UnlockPipelinesInQueueWorker` - Observe the following metadata attributes (a query sketch follows this list):
  - `exec_timeout`
    - This is not necessarily a bad thing; the worker is designed to pick up where it left off.
    - We can observe this in correlation with other factors to see if the workers can keep up with the amount of enqueued pipelines.
  - `unlocked_job_artifacts`
  - `unlocked_pipeline_artifacts`
- Grafana Sidekiq Overview
- Grafana PostgreSQL Overview
- Grafana PostgreSQL Tuple stat
- `pg_replication_lag_bytes` Prometheus Graph
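To narrow the Kibana logs to the unlock workers and the metadata attributes listed above, a filter along the following lines can be used. This is a sketch only: the field names (`json.class`, `json.job_status`, `json.extra.*`) are assumptions based on how GitLab Sidekiq structured logging typically surfaces `log_extra_metadata_on_done` values, so verify them against the linked dashboard first.

```plaintext
json.class : "Ci::UnlockPipelinesInQueueWorker" and json.job_status : "done"
```

The attributes above (for example `exec_timeout` and `unlocked_job_artifacts`) would then appear in each matching entry under an assumed path such as `json.extra.ci_unlock_pipelines_in_queue_worker.exec_timeout`.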
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels `blocks deployments` and/or `blocks feature-flags` are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the `eoc_approved` label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the `manager_approved` label on the issue.
  - Release managers have been informed prior to any C1, C2, or `blocks deployments` change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are `severity::1` or `severity::2`.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.