# Next-gen auto-deploy
Thanks to daily auto-deploy branch creation, we lowered the MTTP indicator from ~50hr to ~30hr on average.
I think we have approached the limit of the current solution, so I'd like to lay out some improvements we can introduce now.
When talking about improvements, I believe we should split the MTTP into 2 sections:
- from merge to package
- from package to production
Package to production can be segmented again, but this is outside the scope of this issue.
Here I want to focus on improvements from merge to package. Let's first recap the timing of the current solution.
`release-tools` consists of 3 auto-deploy jobs:

- `auto_deploy:prepare` - at 04:00 on every day-of-week from Sunday through Friday, it creates a new auto-deploy branch from `master`
- `auto_deploy:pick` - at minute 0 and 30 past every hour, it cherry-picks merged merge requests labeled with `~pick into auto-deploy` into the auto-deploy branch and updates gitaly to the latest green master commit
- `auto_deploy:tag` - at minute 5 and 35 past every hour, it tags a new version from the latest green commit on the auto-deploy branch; when done, a package is built on our dev instance
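As a sketch, the three schedules above can be written as cron expressions. These are my reading of the stated schedule, not copied from the real release-tools configuration:

```python
# Cron schedules for the three auto-deploy jobs described above
# (illustrative; not the actual release-tools configuration).
AUTO_DEPLOY_SCHEDULES = {
    "auto_deploy:prepare": "0 4 * * 0-5",   # 04:00, Sunday through Friday
    "auto_deploy:pick":    "0,30 * * * *",  # every hour at :00 and :30
    "auto_deploy:tag":     "5,35 * * * *",  # every hour at :05 and :35
}
```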
## Timeline analysis
Those three jobs are quick to run, but there are other things to take into account. As an exercise, let's build a hypothetical timeline. I want to outline 2 use cases.
### A new auto-deploy branch
- 04:00 `auto_deploy:prepare` - a new auto-deploy branch is created; on that branch release-tools updates `GITALY_SERVER_VERSION` to the latest green master commit
- 04:00 a pipeline starts for the new commit on the auto-deploy branch
- 04:00 `auto_deploy:pick` - this job races with `auto_deploy:prepare`; it may end up picking commits into the old auto-deploy branch
- 04:05 `auto_deploy:tag` - this invocation fails as there are no green pipelines on the newly created auto-deploy branch
- 04:30 `auto_deploy:pick` - this may commit on the auto-deploy branch, either a pick or a new gitaly version; in that case a new pipeline starts
- 04:35 `auto_deploy:tag` - this invocation fails as there are no green pipelines on the newly created auto-deploy branch
- 04:50 (timing estimated from recent pipelines) the first pipeline completes for the new commit on the auto-deploy branch
- 05:00 `auto_deploy:pick` - this may commit on the auto-deploy branch, either a pick or a new gitaly version; in that case a new pipeline starts
- 05:05 `auto_deploy:tag` - finally we tag the first commit we created at 04:00

It took us 1h05min to tag; now the package can be built.
### A fix to deploy quickly
Here I'll consider the simple case of a revert: both the development time and the review time are really short. Nonetheless, we have a ~50min merge request pipeline before we can merge it.
Let's consider the best case here, when we merge to master right before `auto_deploy:pick` runs.
- 06:00 a `~pick into auto-deploy` merge request is merged to master
- 06:00 `auto_deploy:pick` - this picks the merge request onto the auto-deploy branch
- 06:00 a pipeline starts for the new commit on the auto-deploy branch
- 06:05 `auto_deploy:tag` - this invocation cannot tag the needed fix; the pipeline is still running
- 06:35 `auto_deploy:tag` - this invocation cannot tag the needed fix; the pipeline is still running
- 06:50 (timing estimated from recent pipelines) the pipeline completes for the new commit on the auto-deploy branch
- 07:05 `auto_deploy:tag` - finally we tag the commit we created at 06:00
This time I removed the irrelevant `auto_deploy:pick` invocations to declutter the timeline.
Again it took us 1h05min to tag; now the package can be built.
Worst case, we merge to master right after the `auto_deploy:pick` job; in that case we wait an extra 30min for the next scheduled run, for a total time of 1h35min.
As a general note, we also have to consider the ~50min pipeline of the original merge request, which puts us at 2h25min, but that is outside the scope of tracking from merge to package.
## Packaging
The problem here is that our journey toward a deployment is still very long.
Once we tag omnibus, we need ~1hr to build the packages and trigger a staging deploy.
To recap:
- new auto-deploy branch: ~2hr from branch creation to package
- revert and `~pick into auto-deploy`: worst case ~2hr30min from cherry-picking the change to the package; if we also include the original merge request pipeline, the total is ~3hr20min
And all the above examples assume no pipeline failures of any kind.
## Improvement proposals
Here I want to propose some changes that will reduce this time by ~50%.
### Parallel package building
The major pain point in the above timelines is waiting for the ~50min pipeline on the auto-deploy branch.
Instead of waiting, we can tag regardless of the pipeline status and defer the pipeline check to the deployer.
Let's reconsider our first timeline:
- 04:00 `auto_deploy:prepare` - a new auto-deploy branch is created; on that branch release-tools updates `GITALY_SERVER_VERSION` to the latest green master commit - gitlab-org/gitlab@e57f2b7e
- 04:00 a pipeline starts for the new commit on the auto-deploy branch - https://gitlab.com/gitlab-org/gitlab/pipelines/147781234
- 04:05 `auto_deploy:tag` - we tag gitlab-org/gitlab@e57f2b7e; package `13.1.202005200510+e57f2b7e935.f82b0c3ae51` can be built now
- 04:50 the pipeline completes for the new commit on the auto-deploy branch
- 05:05 `13.1.202005200510+e57f2b7e935.f82b0c3ae51` is ready to deploy
Now when the deployer starts, it can extract the GitLab SHA from `13.1.202005200510+e57f2b7e935.f82b0c3ae51` and verify the status of the original pipeline for gitlab-org/gitlab@e57f2b7e.
If it's green, the deployment starts; if it's red or canceled, we fail the deployment; if it's still running, we wait.
With this change, we save ~1hr on a ~2hr process.
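The deployer-side check could look roughly like this sketch. The version format follows the example above; the function names and the exact pipeline status strings are illustrative assumptions:

```python
import re

def extract_sha(package_version: str) -> str:
    """Return the GitLab commit SHA embedded in a package version
    such as '13.1.202005200510+e57f2b7e935.f82b0c3ae51'."""
    match = re.search(r"\+([0-9a-f]+)\.", package_version)
    if match is None:
        raise ValueError(f"no SHA found in {package_version!r}")
    return match.group(1)

def deployment_action(pipeline_status: str) -> str:
    """Map the original pipeline's status to the deployer's action:
    green starts the deploy, red/canceled fails it, anything else waits."""
    if pipeline_status == "success":
        return "deploy"
    if pipeline_status in ("failed", "canceled"):
        return "fail"
    return "wait"  # running, pending, etc.

print(extract_sha("13.1.202005200510+e57f2b7e935.f82b0c3ae51"))  # e57f2b7e935
```

The deployer would then look up the pipeline for that SHA and loop on `deployment_action` until it returns something other than `"wait"`.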
#### Implementation steps
- we add the pipeline-checking job to the deployer and allow it to fail; this gives us confidence in our check
- we make the check mandatory to continue the deployment
- we start tagging from the first non-failed pipeline on the auto-deploy branch (green, running, or queued)
- extra: we add an `on_failure` job on GitLab to abort package compilation and save resources
Bonus point: we get more visibility into the health of the auto-deploy branch. Right now we have to manually check for broken pipelines; with this improvement, a broken build will trigger a Slack notification.
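The third step above, tagging from the first non-failed pipeline, can be sketched as follows; the pipeline representation and status names are assumptions for illustration:

```python
# Statuses we consider taggable: the pipeline has not failed
# (green, running, or queued).
TAGGABLE_STATUSES = {"success", "running", "pending", "created"}

def first_taggable_commit(pipelines):
    """Given (sha, status) pairs ordered newest first, return the SHA
    of the first pipeline we are allowed to tag from, or None."""
    for sha, status in pipelines:
        if status in TAGGABLE_STATUSES:
            return sha
    return None

# The newest pipeline failed, so we tag from the running one behind it.
print(first_taggable_commit([("aaa111", "failed"), ("bbb222", "running")]))  # bbb222
```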
### Moving toward master deployment
This is an extra improvement that will cut the MTTP again. To get its benefits, we first need to implement parallel package building.
#### Why do we create the auto-deploy branches?
The auto-deploy branch is a safe staging area where we are protected from the speed of development on master. Originally we created those branches once a week, which was a huge improvement over the previous cadence of once a month (creating the stable branch was, in a way, creating an auto-deploy branch).
This year we moved to creating an auto-deploy branch twice a week, and now we create one every day.
One of the benefits of daily auto-deploy is that we end up picking fewer changes.
One of the key points in making daily branches was the idea that we should move fast by default and slow down when we have problems. Because of that, the release managers have the power to inhibit the creation of a new auto-deploy branch to focus on the stability of the current one.
So far we have never had to stop it: every day we deployed to production the changes from the day before.
Here is my proposal: why don't we create the auto-deploy branch when we tag?
Instead of creating a new branch every day, we create it on the spot when we tag:
- the `auto_deploy:tag` job starts,
- it selects the latest green `master` commit,
- it branches off an `auto-deploy-202005200510` branch (note the timestamp now also includes HHMM; the reference to the milestone is removed as well),
- it commits the appropriate gitaly version into that branch,
- and it tags the build.
Because we want to retain control over the speed of the process, we can introduce the concept of branch stickiness.
If an environment variable (e.g. `AUTO_DEPLOY_STICKY_BRANCH`) is set, then we go back to the old process and keep tagging from that branch instead of creating a new one.
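The tag-time branch selection, including the sticky-branch override, could be sketched like this. The branch name format and `AUTO_DEPLOY_STICKY_BRANCH` come from the proposal above; the function name is hypothetical:

```python
import os
from datetime import datetime

def auto_deploy_branch(now: datetime) -> str:
    """Return the branch to tag from: the sticky branch if the
    override is set, otherwise a fresh timestamped branch."""
    sticky = os.environ.get("AUTO_DEPLOY_STICKY_BRANCH")
    if sticky:
        # Old behavior: keep tagging from the pinned branch.
        return sticky
    # New behavior: the branch name carries the full YYYYMMDDHHMM
    # timestamp and no milestone reference.
    return "auto-deploy-" + now.strftime("%Y%m%d%H%M")

print(auto_deploy_branch(datetime(2020, 5, 20, 5, 10)))
# -> auto-deploy-202005200510 (when AUTO_DEPLOY_STICKY_BRANCH is unset)
```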
With these two improvements, I think we can reach production 2 or 3 times a day with changes from master. This is likely the lower limit we can reach without implementing automated production deployment pre-checks, but that covers from package to production.