Helm3 upgrade retrospective

Summary

GitLab uses Helm to deploy GitLab's cloud-native Chart and other services such as Prometheus and logging. Because official support for Helm 2 ended in November 2020, we decided to prioritize a Helm 3 upgrade on all clusters. This meant identifying where we were incompatible with Helm 3, both in the GitLab Chart and in the other charts we depend on, and then upgrading cluster by cluster.

The entire upgrade process took approximately 4 weeks of calendar time. Because that period included the holiday break, the actual work to complete the upgrade across 9 Kubernetes clusters was approximately 2 to 3 weeks.

Epic: &370 (closed)

What went well

Having a dual image that runs in both modes was very helpful in the beginning (#1413 (closed)). We were able to run Helm 2 and Helm 3 alongside each other for weeks, since the transition took a while.
Having preprod as a test environment was extremely helpful. We were able to sort out a number of issues before we moved to Staging.
The split between gitlab-com and gitlab-helmfiles worked to our advantage, since we could upgrade them separately.
Having dry-runs in CI was a good addition, as it helped to increase confidence.
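
As an illustration of the side-by-side approach and the CI dry-runs described above, the following is a minimal sketch, not the tooling we actually used. It assumes a dual image that exposes two binaries named `helm2` and `helm3`, and it uses a plain `helm upgrade --dry-run` as a stand-in for whatever the pipeline really ran. The context names, release, chart, and namespace are all hypothetical.

```python
from typing import Iterable, List, Set


def dry_run_command(helm_binary: str, kube_context: str, release: str,
                    chart: str, namespace: str) -> List[str]:
    """Build the argument list for one `helm upgrade --dry-run` invocation."""
    return [
        helm_binary, "upgrade", release, chart,
        "--namespace", namespace,
        "--kube-context", kube_context,
        "--dry-run",
    ]


def plan_dry_runs(contexts: Iterable[str], migrated: Set[str], release: str,
                  chart: str, namespace: str) -> List[List[str]]:
    """Pick the Helm binary per cluster: migrated clusters use the
    (hypothetical) helm3 binary, all others stay on helm2."""
    plan = []
    for ctx in contexts:
        binary = "helm3" if ctx in migrated else "helm2"
        plan.append(dry_run_command(binary, ctx, release, chart, namespace))
    return plan


if __name__ == "__main__":
    # Hypothetical context names; preprod is assumed to be migrated first.
    contexts = ["preprod", "staging", "production"]
    for cmd in plan_dry_runs(contexts, {"preprod"}, "gitlab",
                             "gitlab/gitlab", "gitlab"):
        print(" ".join(cmd))
```

The sketch only builds the commands, so it can be reviewed or logged in CI before anything is executed against a cluster.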

What could have been improved

Because of the multi-zonal clusters, a lot of upgrades needed to be done. With 9 clusters in total, this was a bit of a grind.
We needed to manage remote connections to all 9 clusters, since the upgrade was done outside of CI. Better tooling around this would have made it easier, perhaps a dedicated kubectl host for each cluster instead of tunneling through hosts.

What happened that we didn't expect

I initially expected that we would run into more issues with the GitLab Chart, but only a handful of updates were needed there.
Prometheus operator was a bit of a pain because we needed to upgrade the Chart first, and only then upgrade to Helm 3. We should do better at keeping on top of version updates (#1438 (closed)).

Edited Jan 28, 2021 by John Jarvis