Explorations needed for the Auto DevOps Helm 3 upgrade
This is an exploratory issue to have a clear understanding of the tools needed and approach to be taken for the Auto DevOps Helm 3 upgrade. As it's an exploratory issue, its definition of done is to answer a set of questions to decrease risks and complexity.
Questions to answer
- How does the helm-2to3 migration work?
- Can we support both Helm 2 and 3 for a while? How?
Conclusions
How does the helm-2to3 migration work?
The helm 2to3 migration converts the in-cluster release data to the new format. It is a destructive process (some old releases can be lost) but, based on user reports, very reliable. The general flow should be something like this:
graph TD
A(Helm 2) -->B[Back up data]
B --> C[Convert in-cluster data to Helm 3]
C --> D[Verify Helm 3 releases]
D --> E{Success?}
E -->|Yes|F[Make a test release]
E -->|No|G[Maybe: Clean up Helm 3 data]
G --> A
F --> H{Success?}
H -->|Yes|Y[Clean up Helm 2 releases]
H -->|No|G
Y --> X(Helm 3)
(see also working CI example of the happy path)
- The back-up step is an additional safety precaution for the case where data loss or Helm 3 incompatibilities are only discovered after a few releases. I do not think we need to provide an automated restore procedure; documentation on how to manage the backups should be sufficient.
- I propose the test release because Helm 2 and Helm 3 can handle the same chart differently. We already saw this in our testing for auto-deploy-app. If we do not perform the test release before deleting the v2 data, we increase the chances of leaving users with a broken installation that requires a restore from backup.
- The clean-up-on-failure step helps make the process idempotent. Otherwise, data from a previously failed migration could get in the way of a future attempt.
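The flow in the diagram can be sketched as a small driver. This is only an illustration of the control flow, not a proposed implementation: `MigrationSteps`, its field names, and `migrate` are hypothetical, and each hook is assumed to wrap whatever Helm or kubectl invocation ends up performing that step.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MigrationSteps:
    """Hypothetical hooks, one per box in the diagram."""
    back_up: Callable[[], None]          # "Back up data"
    convert: Callable[[], None]          # "Convert in-cluster data to Helm 3"
    verify_helm3: Callable[[], bool]     # "Verify Helm 3 releases"
    test_release: Callable[[], bool]     # "Make a test release"
    clean_up_helm3: Callable[[], None]   # "Maybe: Clean up Helm 3 data"
    clean_up_helm2: Callable[[], None]   # "Clean up Helm 2 releases"


def migrate(steps: MigrationSteps) -> bool:
    """Run one migration attempt.

    Returns True when the release ends up on Helm 3. Returns False when the
    attempt failed and the partial Helm 3 data was removed, so the release is
    still on Helm 2 and a later attempt starts from a clean state.
    """
    steps.back_up()
    steps.convert()
    # The test release only runs if verification succeeded.
    if steps.verify_helm3() and steps.test_release():
        # Helm 2 data is deleted only after both checks have passed.
        steps.clean_up_helm2()
        return True
    steps.clean_up_helm3()
    return False
```

The point of the ordering is that the Helm 2 clean-up is the last step, so every failure path leaves the Helm 2 data and the backup in place.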
Can we support both Helm 2 and 3 for a while? How?
Yes.
Proposal
I think we should leverage the ongoing work in gitlab-org/charts/auto-deploy-app#70 (moved) to make a "safe" breaking-change release, so that new features only go to the Helm 3 variant going forward, while the Helm 2 variant receives bug fixes and security updates until %14.0:
- new users can get the Helm 3 version by default before %14.0,
- existing users stay on the last stable Helm 2 release until they opt in,
- at %14.0, we do a one-time failing pipeline that informs the user that Helm 2 is fully unsupported (but they can still remain on that image).
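The three rules above can be read as a decision table. The sketch below is one possible reading of them; `Action`, `choose_action`, and all parameter names are hypothetical, and how the "one-time" notice would be tracked (`notice_shown`) is an assumption that the proposal does not specify.

```python
from enum import Enum


class Action(Enum):
    USE_HELM3 = "use the Helm 3 variant"
    USE_HELM2 = "stay on the last stable Helm 2 release"
    FAIL_ONCE = "fail the pipeline once with an 'unsupported' notice"


def choose_action(gitlab_major: int, existing_helm2_user: bool,
                  opted_in_helm3: bool, notice_shown: bool) -> Action:
    """Pick the behaviour for one pipeline run under the proposed rollout."""
    # New users, and existing users who opted in, get Helm 3.
    if not existing_helm2_user or opted_in_helm3:
        return Action.USE_HELM3
    # From 14.0 on, an existing Helm 2 user sees a single failing pipeline.
    if gitlab_major >= 14 and not notice_shown:
        return Action.FAIL_ONCE
    # Otherwise the user remains on Helm 2 (unsupported from 14.0 on).
    return Action.USE_HELM2
```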
Alternative
An alternative is to support both Helm 2 and Helm 3 within auto-deploy-image, and develop new features for both. The main benefit of having both in the same image is that it can be done as a non-breaking change and does not depend on the work in #70 (closed) for opt-in Helm 3 support. The downside of this approach is a significant increase in image size, as well as code complexity.