Proposal for moving towards full CI/CD
Problem statement
The way Release Management is currently set up has a substantial impact on Delivery Group projects and OKR work. We have identified several reasons why we need to change our package delivery and publishing process.
There are two distinct aspects of Release Management duties, each with its own impact.
Releasing new packages
Usually, it doesn't take a lot of time, thanks to the automation in place, but it requires a lot of attention to detail and coordination with different teams. Ideally, it should be done by a single person throughout the entire release cycle. Release Manager tasks are spread out across the entire month and usually do not occupy multiple consecutive days, except for the time right before the actual release. We should definitely think about how we can improve the process of releasing new packages, but for now, I would label it as Delivery impact 3. Releasing new packages can easily be combined with OKR and project work, as the interruption time is not that high.
Delivering Auto-Deploy packages to Gitlab.com
I would consider this as the main driver of inefficiency and toil. The current process of delivering new GitLab versions to production SaaS can be problematic at times.
- Release Management is a team member-intensive effort, requiring two dedicated engineers per month working full-time.
- It naturally impacts the project and OKR work the team is carrying out. With the current team structure and Release Management duties, a project DRI can end up with an RM shift in the middle of the project, which impacts its progress.
- Projects tend to suffer from effort fragmentation: engineers originally assigned to a project may have to leave tasks unfinished (with a proper handover) to join RM shifts. If the engineers working on a project are in different time zones, the impact on project capacity is even bigger.
- Sometimes projects require onboarding, as in the case of Dedicated. This onboarding might take a couple of days or even weeks; it happens at the beginning of the project, and the team plans onboarding time into the backlog. If an engineer arrives from RM duties in the middle of project work, they miss the planned onboarding window, and time that could have been spent on project tasks is spent on onboarding instead.
- While the ChatOps approach is better than runbooks executed on a local machine, it's still a very manual process that requires paying a lot of attention to different Slack channels, watching the pipeline results, and waiting idly.
- This also interferes with the duties of releasing new packages, applying patches, and doing security releases: most of the time, the Release Manager has to wear two hats at once, with their focus split between watching auto-deploy pipelines and preparing new releases.
- We consider MTTP (Mean Time to Production) the leading performance indicator of the Delivery team, but the last 180 days of the chart show no improvement. It looks like we have hit the limit of MTTP improvement for this particular process; if we want to improve it further, the process has to change.
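For reference, MTTP is typically computed from deploy records as the average time between a change being merged and that change reaching production. A minimal sketch, with entirely hypothetical timestamps (the `mttp_hours` helper and sample data are illustrations, not our actual tooling):

```python
from datetime import datetime

def mttp_hours(records):
    """Average time to production in hours, given (merged_at, deployed_at) pairs."""
    deltas = [(deployed - merged).total_seconds() for merged, deployed in records]
    return sum(deltas) / len(deltas) / 3600

# Hypothetical sample: a weekday change and one that waited over a weekend.
records = [
    (datetime(2023, 1, 2, 9, 0), datetime(2023, 1, 2, 21, 0)),   # 12h
    (datetime(2023, 1, 6, 18, 0), datetime(2023, 1, 9, 10, 0)),  # 64h "weekend gap"
]
print(mttp_hours(records))  # → 38.0
```

The second record shows how a single weekend gap can dominate the average, which is why process changes, not just pipeline speedups, are needed to move the metric.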
Improving the procedures for rolling out new versions to production should be treated as Delivery impact 1.
Proposal
After reviewing a number of different proposals on how to improve Release Management practices, I would like to combine and expand on them here.
Issues and Epics
Here is the list of issues and epics that I used as the basis for my proposal.
- Introduce release manager and deploy manager rotation
- Production promotion window
- Ideas for reducing Release Manager workload
Goal
- Eliminate release management toil
- Improve the effectiveness of delivery processes
- Enable further MTTP key metric improvements
North Star
According to The Three Ways of DevOps, the first and most important DevOps principle is the Principle of Flow.
If we follow this principle, our North Star goal for delivering new versions to GitLab.com should include the following requirements:
- New versions should be rolled out to Production automatically and continuously as fast as possible. No manual steps should be allowed.
- We should be confident enough in the quality of these versions as we test them on different environments and stages.
- The only versions we skip should be the ones that didn't pass the quality check.
- New versions should not pile up at any step of the pipeline. We should focus on optimising for bottlenecks.
- New versions should contain a constant and minimal amount of changes. Merge trains are a good approach to achieve that.
- We should deploy changes only as they arrive: a pull rather than a push model. If we don't have enough changes (on a weekend, for example), we don't deploy. If we do have enough changes on a Friday evening, we deploy them as fast as possible. This way we avoid "weekend gaps" in MTTP.
- We move away from the "Deployer" type of Release Manager duties and switch to a "Delivery on-call" type.
- The "Delivery on-call" person can potentially combine these duties with Release Manager duties, but this is not necessary.
The Roadmap
It is obvious that achieving this North Star is a long process and requires multiple iterations, improvements, and feedback loops.
Phase 1
Splitting Auto-deploy and Release management duties
This was already discussed in the following issue: Introduce release manager and deploy manager rotation. The goal of this phase is to split the duties of the Release Manager and introduce the duty of Deploy Manager. The Deploy Manager is responsible for delivering auto_deploy packages to SaaS, and their shift is fully focused on that. The Release Manager leads the release process through the whole release cycle. As auto_deploy will be controlled by a dedicated Deploy Manager, the Release Manager can combine their RM duties with OKR and project work. Both parties synchronize their efforts to deliver auto_deploy and release packages in the safest manner, i.e. the Release Manager can ask the Deploy Manager to pause auto_deploy pipelines during a release.
Some additional thoughts regarding this phase:
- The Deploy Manager rotation should be one week long. This will help us iterate faster and collect as much feedback as possible. It also minimizes the impact on project work, as people working on projects are not away for long.
- Ideally, at the end of the shift, the Deploy Manager should come up with a list of issues, manual tasks, or blockers that they faced during the shift and convert it into action items within a single epic.
- The team should dedicate time to collecting feedback, resolving issues, and improving the auto_deploy process. These efforts should be part of quarterly OKRs, so that every quarter we get closer to the North Star.
- This phase should not have a deadline: we need to be confident enough to move to the next phase, and cutting corners might degrade the whole process.
Phase 2
Observed automated deployments
This idea isn't new either. It was discussed here Production promotion window
At this phase, we keep the Deploy Manager duties, but we drastically reduce the scope of their responsibilities, as we eliminated most of the toil during the first phase. At the beginning of the shift, the Deploy Manager enables the automatic rollout of auto_deploy packages and keeps observing the situation. They pause auto_deploy in case of an incident or for other reasons.
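The pause/resume behaviour described here can be thought of as a simple gate that the automatic rollout consults before promoting each package. A sketch with entirely hypothetical names (this is not our actual deployment tooling):

```python
class AutoDeployGate:
    """Gate the automatic rollout checks before promoting a package."""

    def __init__(self):
        self.paused = False
        self.reason = None

    def pause(self, reason):
        # Flipped by the Deploy Manager, e.g. during an active incident
        # or while the Release Manager cuts a release.
        self.paused = True
        self.reason = reason

    def resume(self):
        self.paused = False
        self.reason = None

    def may_promote(self):
        return not self.paused

# Hypothetical usage during a shift:
gate = AutoDeployGate()
gate.pause("S1 incident in progress")
print(gate.may_promote())  # → False
gate.resume()
print(gate.may_promote())  # → True
```

The design point is that the human's job shrinks to flipping one switch with a recorded reason, while promotion itself stays fully automated.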
The goal of this phase is to build confidence that it is safe to roll out versions automatically. During this phase, we should also collect feedback on, resolve issues with, and improve the flow of versions hitting production.
The team should be focused on building a constant flow of versions without piling them up, skipping versions, or idling. This work should also be included as a quarterly goal.
Phase 3
Delivery on-call and MTTP improvements
At the beginning of this phase, we eliminate the Deploy Manager role and introduce a Delivery on-call rotation, as we are confident enough in the quality of the versions and the processes we have built around auto_deploy. Delivery on-call shifts should be similar to SRE on-call and follow all SRE on-call best practices defined for the whole infrastructure department. We should have a proper set of alarms and define optimal escalation paths. We can also focus on improving MTTP and flow effectiveness during this phase. As an example, we might consider splitting the build pipeline and skipping all packaging steps for auto_deploy except Docker images and Gitaly, as we don't use dpkg or rpm on Kubernetes. We might also think about Release Management and observability improvements.
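The packaging-step split mentioned above amounts to a target filter in the build pipeline: auto_deploy builds only what Kubernetes consumes, while official releases keep every format. A sketch under that assumption, with all names (`build_targets`, the target strings) hypothetical:

```python
ALL_TARGETS = ["docker", "gitaly", "dpkg", "rpm"]

def build_targets(pipeline_kind):
    """Pick which package formats a pipeline should build."""
    if pipeline_kind == "auto_deploy":
        # auto_deploy goes to SaaS on Kubernetes: dpkg/rpm are never used there,
        # so skipping them shortens the build stage and thus MTTP.
        return [t for t in ALL_TARGETS if t in ("docker", "gitaly")]
    return ALL_TARGETS  # official releases still ship every package format

print(build_targets("auto_deploy"))  # → ['docker', 'gitaly']
print(build_targets("release"))      # → ['docker', 'gitaly', 'dpkg', 'rpm']
```

In practice this would be expressed as job rules in the build pipeline configuration rather than code, but the decision table is the same.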
Note: I don't think we should define a timeline for this effort. The goal is not to deliver it within a couple of quarters; it will take time and many improvement iterations.
The Hamster's Law: If you don't dig, you won't have a burrow.