# Rollout of `ci_unlock_non_successful_pipelines` feature flag

Production Change

## Change Summary
This is for the rollout of gitlab-org/gitlab#428408 (closed).

We are opening a change management issue for this based on the documentation:

> When feature toggles, or associated features, have previously had to be rolled back due to user-impacting service degradation, or as a result of the previous toggle leading to a production incident.

This is related to #16451 (closed).

Now that the new unlocking mechanism has been completely rolled out, we will roll out this fix for the unlocking-related bugs. This has been attempted before with the old unlock mechanism, but it unfortunately resulted in an incident, hence the need for this change management request so we can monitor the rollout closely.
## Change Details

- Services Impacted - Redis, Sidekiq, Postgres
- Change Technician - @iamricecake
- Change Reviewer - DRI for the review of this change
- Time tracking - Around 10 days to incrementally enable the feature flag for all projects.
- Downtime Component - N/A
### Set Maintenance Mode in GitLab

N/A
## Detailed steps for the change

### Change Steps - steps to take to execute the change

We will slowly roll out `ci_unlock_non_successful_pipelines` to all projects, verifying the flag's state between increments (see the sketch after the steps below).

Estimated Time to Complete - 1 day per 10% increment, 10 days in total, but this might increase depending on how we see it perform.
- [ ] Set label ~"change::in-progress" `/label ~change::in-progress`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 10 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 20 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 30 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 40 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 50 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 60 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 70 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 80 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines 90 --actors`
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines true`
- [ ] Set label ~"change::complete" `/label ~change::complete`
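Between increments we can confirm the flag's current rollout state before moving to the next step. A minimal sketch, assuming the ChatOps `feature get` subcommand is available in our ChatOps bot (the exact subcommand is an assumption; if it is not available, the gate values can be checked from a Rails console instead):

```plaintext
# Check the current gate value(s) of the flag before bumping the next percentage (assumed subcommand)
/chatops run feature get ci_unlock_non_successful_pipelines
```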
## Rollback

### Rollback steps - steps to be taken in the event of a need to roll back this change

We can disable `ci_unlock_non_successful_pipelines` to revert to the old behavior of only unlocking successful pipelines. Worst case, we can enable `ci_stop_unlock_pipelines` to prevent any more pipelines from being enqueued for unlocking, regardless of whether `ci_unlock_non_successful_pipelines` is enabled or not (see the sketch after the rollback steps below). We can also disable `ci_unlock_pipelines_high`, `ci_unlock_pipelines_medium`, and `ci_unlock_pipelines` to prevent the limited capacity worker from picking up any more new jobs.

Estimated Time to Complete (mins) - 5 mins
- [ ] `/chatops run feature set ci_unlock_non_successful_pipelines false`
- [ ] `/chatops run feature set ci_unlock_pipelines_high false`
- [ ] `/chatops run feature set ci_unlock_pipelines_medium false`
- [ ] `/chatops run feature set ci_unlock_pipelines false`
- [ ] Set label ~"change::aborted" `/label ~change::aborted`
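If the worst case described above is reached, the kill switch flag would be enabled with the same `feature set` command used throughout this plan; a minimal sketch:

```plaintext
# Worst case only: stop enqueueing any more pipelines for unlocking
/chatops run feature set ci_stop_unlock_pipelines true
```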
## Monitoring

From gitlab-org/gitlab#428408 (closed):

- New Unlock Pipelines Mechanism Kibana Dashboard
- Redis Grafana Dashboard
- Grafana `Ci::Refs::UnlockPreviousPipelinesWorker` Overview
- Grafana `Ci::UnlockPipelinesInQueueWorker` Overview
- Kibana Logs for `Ci::Refs::UnlockPreviousPipelinesWorker` - Observe the following metadata attributes:
  - `total_pending_entries`
    - If there is a continuous increase of this number for a long time, consider increasing the limited capacity worker rate.
  - `total_new_entries`
- Kibana Logs for `Ci::UnlockPipelinesInQueueWorker` - Observe the following metadata attributes (a query sketch follows this list):
  - `exec_timeout`
    - This is not necessarily a bad thing; the worker is designed to pick up where it left off.
    - We can observe this in correlation with other factors to see if the workers can keep up with the amount of enqueued pipelines.
  - `unlocked_job_artifacts`
  - `unlocked_pipeline_artifacts`
- Grafana Sidekiq Overview
- Grafana PostgreSQL Overview
- Grafana PostgreSQL Tuple stat
- `pg_replication_lag_bytes` Prometheus Graph
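To narrow the Kibana logs to the unlock workers and the metadata attributes listed above, a filter along the following lines can be used. This is a sketch only: the field names (`json.class`, `json.job_status`, `json.extra.*`) are assumptions based on how GitLab Sidekiq structured logging typically surfaces `log_extra_metadata_on_done` values, so verify them against the linked dashboard first.

```plaintext
json.class : "Ci::UnlockPipelinesInQueueWorker" and json.job_status : "done"
```

The attributes above (for example `exec_timeout` and `unlocked_job_artifacts`) would then appear in each matching entry under an assumed path such as `json.extra.ci_unlock_pipelines_in_queue_worker.exec_timeout`.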
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels `blocks deployments` and/or `blocks feature-flags` are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the `eoc_approved` label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the `manager_approved` label on the issue.
  - Release managers have been informed prior to any C1, C2, or `blocks deployments` change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are `severity::1` or `severity::2`.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.