Address slow rollout of bugfixes for Auto DevOps
Recent outages involving Auto DevOps have highlighted the need to roll out fixes quickly, in a controlled and targeted manner.
Versioning of ADO templates would allow for a quick redeployment of an older version when faced with an outage scenario.
Problem to solve
A production incident currently forces a full deployment rollback
- Sometimes we accidentally make a breaking change to a template, which then breaks users' pipelines on gitlab.com.
- In such cases, SREs have to perform a full rollback to the previous version. This has to be avoided at all costs.
- There is no capability to lock the version of the template after its inclusion.
- See this issue for more info.
Proposal
In the long run, we want full versioning of the Auto DevOps YAML itself. But the mechanism for this is still being discussed, and getting this right is very important for the product. For example, there is a proposal to move away from include:template entirely.
To address the immediate need for fast rollbacks, we will add the CI/CD variables AUTO_BUILD_IMAGE_VERSION, AUTO_DEPLOY_IMAGE_VERSION and DAST_AUTO_DEPLOY_IMAGE_VERSION to Auto DevOps, allowing users to control the version of the image used. For example:
variables:
  AUTO_BUILD_IMAGE_VERSION: v1.0.0
build:
  image: registry.gitlab.com/gitlab-org/cluster-integration/auto-build-image:${AUTO_BUILD_IMAGE_VERSION}
  # ...
The version can then be overridden at the instance level in the instance admin CI/CD configuration, or by running the following command:
curl --request POST --header "PRIVATE-TOKEN: <your_access_token>" \
"https://gitlab.example.com/api/v4/admin/ci/variables" \
--form "key=AUTO_BUILD_IMAGE_VERSION" \
--form "value=v1.3.1"
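If the variable already exists at the instance level, the POST request above will likely be rejected as a duplicate key. A minimal sketch of updating it in place instead, assuming the PUT endpoint of the same instance-level variables API; the helper name `update_image_version` and the `GITLAB_URL` and `GITLAB_TOKEN` environment variables are hypothetical and not part of the proposal:

```shell
# Placeholder host; override GITLAB_URL for a real instance (assumption).
GITLAB_URL="${GITLAB_URL:-https://gitlab.example.com}"

# Hypothetical helper: update an existing instance-level CI/CD variable.
#   $1 = variable key, e.g. AUTO_BUILD_IMAGE_VERSION
#   $2 = image version to pin, e.g. v1.3.1
# GITLAB_TOKEN is assumed to hold an administrator access token.
update_image_version() {
  curl --fail --request PUT --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    "${GITLAB_URL}/api/v4/admin/ci/variables/$1" \
    --form "value=$2"
}
```

With the function loaded, `update_image_version AUTO_BUILD_IMAGE_VERSION v1.3.1` would change the pinned version without first deleting the variable.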
More information
In earlier discussions, I have outlined some drawbacks of this approach, but I have come around and consider this the most pragmatic approach available.
It is also a valuable feature for users: just as we can pin images at the instance level to respond to an incident, users can pin the images in order to update at their own cadence. Users who do not want variable images can edit their .gitlab-ci.yml, for example:
include:
  - template: Auto-DevOps.gitlab-ci.yml
build:
  image: registry.gitlab.com/gitlab-org/cluster-integration/auto-build-image:v1.0.0
Definition of done
- CI/CD templates are updated with variable image versions
- Runbooks are updated with instructions on how to revert a bad image update. Roughly, the process would be:
  - Set the instance-level variable
  - Revert in an MR
  - Once the MR is rolled out, delete the instance-level variable
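The first and last runbook steps could be sketched as two small shell helpers around the instance-level variables API. This is only an illustration, not the final runbook content: the function names `pin_image_version` and `unpin_image_version`, and the `GITLAB_URL` and `GITLAB_TOKEN` environment variables, are hypothetical. The middle step (reverting in an MR) goes through the normal review process and is not scripted here.

```shell
# Placeholder host; override GITLAB_URL for a real instance (assumption).
GITLAB_URL="${GITLAB_URL:-https://gitlab.example.com}"
API="${GITLAB_URL}/api/v4/admin/ci/variables"

# Step 1: pin a known-good image version for the whole instance.
#   $1 = variable key, e.g. AUTO_DEPLOY_IMAGE_VERSION
#   $2 = known-good version tag
pin_image_version() {
  curl --fail --request POST --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    "${API}" \
    --form "key=$1" \
    --form "value=$2"
}

# Step 3: remove the pin once the reverting MR has rolled out,
# so pipelines fall back to the version set in the template.
#   $1 = variable key
unpin_image_version() {
  curl --fail --request DELETE --header "PRIVATE-TOKEN: ${GITLAB_TOKEN}" \
    "${API}/$1"
}
```

During an incident, `pin_image_version AUTO_BUILD_IMAGE_VERSION v1.0.0` would set the override, and `unpin_image_version AUTO_BUILD_IMAGE_VERSION` would remove it after the fix has shipped. The `--fail` flag makes curl exit non-zero on an HTTP error, so a rejected request is not mistaken for success.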