# Next-gen auto-deploy
Thanks to daily auto-deploy branch creation, we lowered the MTTP indicator from ~50hr to ~30hr on average.
I think we have approached the limit of the current solution, so I'd like to lay out some improvements we can introduce now.
When talking about improvements, I believe we should split the MTTP into 2 sections:
- from merge to package
- from package to production
Package to production can be segmented again, but this is outside the scope of this issue.
Here I want to focus on improvements from merge to package. Let's first recap the timing of the current solution.
`release-tools` consists of 3 auto-deploy jobs:

- `auto_deploy:prepare` - at 04:00 on every day-of-week from Sunday through Friday, it creates a new auto-deploy branch from `master`
- `auto_deploy:pick` - at minute 0 and 30 past every hour, it cherry-picks merged merge requests labeled with `~pick into auto-deploy` into the auto-deploy branch and updates gitaly to the latest green master commit
- `auto_deploy:tag` - at minute 5 and 35 past every hour, it tags a new version from the latest green commit on the auto-deploy branch; when done, a package is built on our dev instance
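As a sketch, the three schedules above can be written as cron expressions. These are my reading of the stated schedule, not copied from the real release-tools configuration:

```python
# Cron schedules for the three auto-deploy jobs described above
# (illustrative; not the actual release-tools configuration).
AUTO_DEPLOY_SCHEDULES = {
    "auto_deploy:prepare": "0 4 * * 0-5",   # 04:00, Sunday through Friday
    "auto_deploy:pick":    "0,30 * * * *",  # every hour at :00 and :30
    "auto_deploy:tag":     "5,35 * * * *",  # every hour at :05 and :35
}
```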
## Timeline analysis
Those three jobs are quick to run, but there are other things to take into account. As an exercise, let's build a hypothetical timeline. I want to outline 2 use cases.
### A new auto-deploy branch
- 04:00 `auto_deploy:prepare` - a new auto-deploy branch is created; on that branch release-tools updates `GITALY_SERVER_VERSION` to the latest green master commit
- 04:00 a pipeline starts for the new commit on the auto-deploy branch
- 04:00 `auto_deploy:pick` - this job races with `auto_deploy:prepare`; it may end up picking commits into the old auto-deploy branch
- 04:05 `auto_deploy:tag` - this invocation fails as there are no green pipelines on the newly created auto-deploy branch
- 04:30 `auto_deploy:pick` - this may commit on the auto-deploy branch, either a pick or a new gitaly version; in that case a new pipeline starts
- 04:35 `auto_deploy:tag` - this invocation fails as there are no green pipelines on the newly created auto-deploy branch
- 04:50 (timing estimated from recent pipelines) the first pipeline completes for the new commit on the auto-deploy branch
- 05:00 `auto_deploy:pick` - this may commit on the auto-deploy branch, either a pick or a new gitaly version; in that case a new pipeline starts
- 05:05 `auto_deploy:tag` - finally we tag the first commit we created at 04:00

It took us 1h05min to tag; now the package can be built.
### A fix to deploy quickly
Here I'll consider the simple case of a revert: both the development time and the review time are really short. Nonetheless, we have a ~50min merge request pipeline before we can merge it.
Let's consider the best case here, when we merge to master right before `auto_deploy:pick` runs.
- 06:00 a `~pick into auto-deploy` merge request is merged to master
- 06:00 `auto_deploy:pick` - this picks the merge request onto the auto-deploy branch
- 06:00 a pipeline starts for the new commit on the auto-deploy branch
- 06:05 `auto_deploy:tag` - this invocation cannot tag the needed fix; the pipeline is still running
- 06:35 `auto_deploy:tag` - this invocation cannot tag the needed fix; the pipeline is still running
- 06:50 (timing estimated from recent pipelines) the pipeline completes for the new commit on the auto-deploy branch
- 07:05 `auto_deploy:tag` - finally we tag the commit we created at 06:00
This time I removed the irrelevant `auto_deploy:pick` invocations to declutter the timeline.
Again it took us 1h05min to tag; now the package can be built.
Worst case, we merge to master right after the `auto_deploy:pick` job; in that case we wait an extra 30min for the next scheduled run, for a total time of 1h35min.
As a general note, we also have to consider the ~50min pipeline of the original merge request, which puts us at 2h25min, but that is outside the scope of tracking from merge to package.
## Packaging
The problem here is that our journey toward a deployment is still very long.
Once we tag omnibus, we need ~1hr to build the packages and trigger a staging deploy.
To recap:
- new auto-deploy branch: ~2hr from branch creation to package
- revert and `~pick into auto-deploy`: worst case ~2hr30min from cherry-picking the change to the package; if we also include the original merge request pipeline, the total is ~3hr20min
And all the above examples assume no pipeline failures of any kind.
## Improvement proposals
Here I want to propose some changes that will reduce this time by ~50%.
### Parallel package building
The major pain point in the above timelines is waiting for the ~50min pipeline on the auto-deploy branch.
Instead of waiting, we can tag regardless of the pipeline status and defer the pipeline check to the deployer.
Let's reconsider our first timeline:
- 04:00 `auto_deploy:prepare` - a new auto-deploy branch is created; on that branch release-tools updates `GITALY_SERVER_VERSION` to the latest green master commit - gitlab-org/gitlab@e57f2b7e
- 04:00 a pipeline starts for the new commit on the auto-deploy branch - https://gitlab.com/gitlab-org/gitlab/pipelines/147781234
- 04:05 `auto_deploy:tag` - we tag gitlab-org/gitlab@e57f2b7e; package `13.1.202005200510+e57f2b7e935.f82b0c3ae51` can be built now
- 04:50 the pipeline completes for the new commit on the auto-deploy branch
- 05:05 `13.1.202005200510+e57f2b7e935.f82b0c3ae51` is ready to deploy
Now when the deployer starts, it can extract the GitLab SHA from `13.1.202005200510+e57f2b7e935.f82b0c3ae51` and verify the status of the original pipeline for gitlab-org/gitlab@e57f2b7e.
If it's green, the deployment starts; if it's red or canceled, we fail the deployment; if it's still running, we wait.
With this change, we save ~1hr on a ~2hr process.
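The deployer-side check could look roughly like this sketch. The version format follows the example above; the function names and the exact pipeline status strings are illustrative assumptions:

```python
import re

def extract_sha(package_version: str) -> str:
    """Return the GitLab commit SHA embedded in a package version
    such as '13.1.202005200510+e57f2b7e935.f82b0c3ae51'."""
    match = re.search(r"\+([0-9a-f]+)\.", package_version)
    if match is None:
        raise ValueError(f"no SHA found in {package_version!r}")
    return match.group(1)

def deployment_action(pipeline_status: str) -> str:
    """Map the original pipeline's status to the deployer's action:
    green starts the deploy, red/canceled fails it, anything else waits."""
    if pipeline_status == "success":
        return "deploy"
    if pipeline_status in ("failed", "canceled"):
        return "fail"
    return "wait"  # running, pending, etc.

print(extract_sha("13.1.202005200510+e57f2b7e935.f82b0c3ae51"))  # e57f2b7e935
```

The deployer would then look up the pipeline for that SHA and loop on `deployment_action` until it returns something other than `"wait"`.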
#### Implementation steps
- we add the pipeline-checking job to the deployer and allow it to fail; this gives us confidence in our check
- we make the check mandatory to continue the deployment
- we start tagging from the first non-failed pipeline on the auto-deploy branch (green, running, or queued)
- extra: we add an `on_failure` job on GitLab to abort package compilation and save resources
Bonus point: we get more visibility into the health of the auto-deploy branch. Right now we have to manually check for broken pipelines; with this improvement, a broken build will trigger a Slack notification.
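The third step above, tagging from the first non-failed pipeline, can be sketched as follows; the pipeline representation and status names are assumptions for illustration:

```python
# Statuses we consider taggable: the pipeline has not failed
# (green, running, or queued).
TAGGABLE_STATUSES = {"success", "running", "pending", "created"}

def first_taggable_commit(pipelines):
    """Given (sha, status) pairs ordered newest first, return the SHA
    of the first pipeline we are allowed to tag from, or None."""
    for sha, status in pipelines:
        if status in TAGGABLE_STATUSES:
            return sha
    return None

# The newest pipeline failed, so we tag from the running one behind it.
print(first_taggable_commit([("aaa111", "failed"), ("bbb222", "running")]))  # bbb222
```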
### Moving toward master deployment
This is an extra improvement that will cut the MTTP again. To get its benefits, we first need to implement parallel package building.
#### Why do we create the auto-deploy branches?
The auto-deploy branch is a safe staging area where we are protected from the speed of development on master. Originally we created those branches once a week, which was a huge improvement over the previous cadence of once a month (creating the stable branch was, in a way, creating an auto-deploy branch).
This year we moved to creating an auto-deploy branch twice a week, and now we create one every day.
One of the benefits of daily auto-deploy is that we end up picking fewer changes.
One of the key points in making daily branches was the idea that we should move fast by default and slow down when we have problems. Because of that, the release managers have the power to inhibit the creation of a new auto-deploy branch to focus on the stability of the current one.
So far we have never had to stop it: every day we deployed to production the changes from the day before.
Here is my proposal: why don't we create the auto-deploy branch when we tag?
Instead of creating a new branch every day, we create it on the spot when we tag:
- the `auto_deploy:tag` job starts,
- it selects the latest green `master` commit,
- it branches off an `auto-deploy-202005200510` branch (note the timestamp now also includes HHMM; the reference to the milestone is removed as well),
- it commits the appropriate gitaly version into that branch,
- and it tags the build.
Because we want to retain control over the speed of the process, we can introduce the concept of branch stickiness.
If an environment variable (e.g. `AUTO_DEPLOY_STICKY_BRANCH`) is set, then we go back to the old process and keep tagging from that branch instead of creating a new one.
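The tag-time branch selection, including the sticky-branch override, could be sketched like this. The branch name format and `AUTO_DEPLOY_STICKY_BRANCH` come from the proposal above; the function name is hypothetical:

```python
import os
from datetime import datetime

def auto_deploy_branch(now: datetime) -> str:
    """Return the branch to tag from: the sticky branch if the
    override is set, otherwise a fresh timestamped branch."""
    sticky = os.environ.get("AUTO_DEPLOY_STICKY_BRANCH")
    if sticky:
        # Old behavior: keep tagging from the pinned branch.
        return sticky
    # New behavior: the branch name carries the full YYYYMMDDHHMM
    # timestamp and no milestone reference.
    return "auto-deploy-" + now.strftime("%Y%m%d%H%M")

print(auto_deploy_branch(datetime(2020, 5, 20, 5, 10)))
# -> auto-deploy-202005200510 (when AUTO_DEPLOY_STICKY_BRANCH is unset)
```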
With these two improvements, I think we can reach production 2 or 3 times a day with changes from master. This is likely the lower limit we can reach without implementing automated production deployment pre-checks, but that covers from package to production.