Improve workflow for picking changes into auto-deploy
Problem Statement
The workflow for using the Auto-Deploy pick label to ship fixes is manual, slow, and somewhat error-prone. Those errors can lead to picks not being included in a package when we expect them to be. We rely on this workflow whenever a breaking change halts deploys, so improving it would make it easier for Release Managers to control the contents and timing of the next deployment package. Consider the current workflow:
- A change is merged with the appropriate pick labels
- The next auto-deploy pick job will pick this change
- That branch then starts CI, which takes multiple hours. This leads to one of two situations:
  - The next tag job does not pull in this commit because its pipeline is still running, or
  - Due to timing, the prepare job begins and creates a new auto-deploy branch that does not contain the desired fix
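To make the timing problem concrete, here is a minimal Python model of how the tag job chooses a commit. It assumes (this issue does not confirm it) that the tag job normally tags the newest commit on the auto-deploy branch with a successful pipeline, and that enabling auto_deploy_tag_latest makes it tag the newest commit regardless of pipeline status. Commit and select_commit_to_tag are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Commit:
    sha: str
    pipeline_status: str  # e.g. "success", "running", "failed"


def select_commit_to_tag(commits: Sequence[Commit], tag_latest: bool) -> Optional[Commit]:
    """Return the commit the tag job would tag (hypothetical model).

    `commits` is ordered oldest to newest on the auto-deploy branch.
    """
    if not commits:
        return None
    if tag_latest:
        # Assumed behaviour of auto_deploy_tag_latest: ignore pipeline status.
        return commits[-1]
    # Assumed default behaviour: newest commit whose pipeline already passed.
    for commit in reversed(commits):
        if commit.pipeline_status == "success":
            return commit
    return None


branch = [Commit("aaa111", "success"), Commit("bbb222", "running")]  # bbb222 is the fresh pick
print(select_commit_to_tag(branch, tag_latest=False).sha)  # aaa111: the fix is left out
print(select_commit_to_tag(branch, tag_latest=True).sha)   # bbb222
```

Under these assumptions, the still-building pick is skipped by the default rule, which is why the runbook below enables the feature flag before tagging.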
To work around this, we currently follow this runbook: https://gitlab.com/gitlab-org/release/docs/-/blob/4ace13de754ff77189524ceb1a307a4eb34f4091/runbooks/how_to_speed_auto_deploy_process_for_urgent_merge_requests.md
- Cherry-pick the merge request into the latest auto-deploy branch by adding the ~Pick into auto-deploy and severity::2 (or higher) labels.
- Trigger the auto_deploy:pick pipeline schedule.
- Enable the auto_deploy_tag_latest feature flag.
- Trigger the auto_deploy:tag pipeline schedule. This will create a coordinated pipeline with the merge request that needs to be deployed.
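The runbook steps after labelling could be scripted roughly as follows. This is a minimal sketch that assumes a hypothetical ReleaseClient: its method names, and the assumption that play_pipeline_schedule returns the ID of the pipeline it started, are illustrative and are not existing release-tools APIs. A real client could be backed by GitLab's API for running a pipeline schedule immediately, plus whatever mechanism release-tools uses for its feature flags.

```python
from typing import Protocol


class ReleaseClient(Protocol):
    """Hypothetical client; these method names are not existing release-tools APIs."""

    def play_pipeline_schedule(self, description: str) -> int:
        """Run the named pipeline schedule now and return the new pipeline's ID."""
        ...

    def wait_for_pipeline(self, pipeline_id: int) -> str:
        """Block until the pipeline finishes and return its final status."""
        ...

    def enable_feature_flag(self, name: str) -> None:
        """Turn on the named feature flag."""
        ...


def expedite_pick(client: ReleaseClient) -> int:
    """Automate runbook steps 2-4, assuming the pick labels are already applied.

    Returns the ID of the tag pipeline, which creates the coordinated pipeline.
    """
    # Step 2: pick labelled MRs into the latest auto-deploy branch right away.
    pick_id = client.play_pipeline_schedule("auto_deploy:pick")
    status = client.wait_for_pipeline(pick_id)
    if status != "success":
        # Stop before tagging, so we never tag a branch the pick did not update.
        raise RuntimeError(f"auto_deploy:pick pipeline {pick_id} ended with {status!r}")
    # Step 3: let the tag job use the newest commit even though its CI is not green yet.
    client.enable_feature_flag("auto_deploy_tag_latest")
    # Step 4: tag now instead of waiting for the next scheduled run.
    return client.play_pipeline_schedule("auto_deploy:tag")
```

Waiting for the pick pipeline before tagging mirrors the ordering of the manual runbook. The runbook does not say when the flag should be disabled again, so this sketch leaves that decision to the caller.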
Goal
We should be able to automate the above.
When a merge request has the appropriate pick labels, consider adjusting release-tools so that it accommodates that MR regardless of the order in which the events above occur. Examples:
- If we pick a change and it is not tagged, we should at minimum be alerted that the next package to be built is missing that change. Conversely, perhaps the prepare job should know when its latest detected commit does not contain a desired change
- Perhaps the rake task that picks changes could detect that a pick label was used and perform the above runbook itself inside release-tools
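The "missing change" alert reduces to an ancestry check: is each picked commit reachable from the commit about to be tagged? Below is a minimal sketch over an in-memory parent map; in a real repository, git merge-base --is-ancestor answers the same question for a single pair of commits. Because cherry-picks create new SHAs on the auto-deploy branch, the picked SHAs here are assumed to be the ones recorded on that branch, not the MR's original commits. The function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set


def ancestors(graph: Dict[str, List[str]], head: str) -> Set[str]:
    """Return `head` plus every commit reachable from it through parent links."""
    seen: Set[str] = set()
    stack = [head]
    while stack:
        sha = stack.pop()
        if sha in seen:
            continue
        seen.add(sha)
        stack.extend(graph.get(sha, []))
    return seen


def missing_picks(graph: Dict[str, List[str]], candidate: str, picked: Iterable[str]) -> List[str]:
    """Return the picked SHAs that the commit about to be tagged does not contain."""
    reachable = ancestors(graph, candidate)
    return [sha for sha in picked if sha not in reachable]


# c3 is the cherry-picked fix; the tag job is about to use c2.
graph = {"c3": ["c2"], "c2": ["c1"], "c1": []}
missing = missing_picks(graph, candidate="c2", picked=["c3"])
if missing:
    print(f"Next package is missing picked commits: {missing}")
```

The same check could run in either the tag job or the prepare job, since both only need the candidate commit and the list of picks.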
Edge cases
- Automating this could open a supply chain attack vector. We should validate who applied the pick labels on an MR and reject MRs that don't qualify (we already do this, but may need to expand its scope depending on the situation)
- Due to timing, an older pipeline may still be running when we kick off a new one; consider cancelling the older pipeline
- If the above is not achievable, we need to reconsider the messaging on an MR that was picked into an auto-deploy branch at some point. Currently we state which branch it was picked into, but we may never deploy from that branch again
- Because we picked a change that did not yet have a green pipeline, the assets for that commit SHA have likely not been built yet. As a result, the Omnibus build pipeline takes longer than expected, and the delay built into the release-tools coordinated pipeline is no longer long enough. We then have to retry that job so the coordinated pipeline can continue as normal.
- ...
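The first two edge cases could be guarded with checks like the ones below. This is a minimal sketch under stated assumptions: label events are modelled on the user/label/action records that GitLab's resource label events API returns, pipeline IDs are assumed to increase over time, and the set of "active" statuses is an assumed subset of GitLab's pipeline statuses. All names, including the authorized-user set, are hypothetical. The IDs returned by pipelines_to_cancel could then be passed to GitLab's pipeline cancel endpoint.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set

PICK_LABEL = "Pick into auto-deploy"
# Assumed subset of GitLab pipeline statuses that still count as in progress.
ACTIVE_STATUSES = {"created", "pending", "running"}


@dataclass(frozen=True)
class LabelEvent:
    user: str
    label: str
    action: str  # "add" or "remove"


@dataclass(frozen=True)
class Pipeline:
    id: int
    ref: str
    status: str


def pick_label_trusted(events: Iterable[LabelEvent], authorized: Set[str], label: str = PICK_LABEL) -> bool:
    """True if the label is currently applied and was last added by an authorized user.

    `events` must be in chronological order.
    """
    last = None
    for event in events:
        if event.label == label:
            last = event
    return last is not None and last.action == "add" and last.user in authorized


def pipelines_to_cancel(pipelines: Iterable[Pipeline], ref: str, newest_id: int) -> List[int]:
    """IDs of still-active pipelines on `ref` that are older than `newest_id`.

    Assumes pipeline IDs increase over time, so a lower ID means an older pipeline.
    """
    return sorted(
        p.id
        for p in pipelines
        if p.ref == ref and p.status in ACTIVE_STATUSES and p.id < newest_id
    )
```

Checking the most recent event for the label, rather than any past "add", means a label that was removed and re-added by an unauthorized user is rejected.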
Logistics
2022-08-12
This needs further refinement prior to being worked on.
I'm marking this as low priority because we already have a well-documented method of achieving our goal. I'm simply looking for a way to automate it further.
@rpereira2 - feel free to edit this issue description as you see fit