Helm3 upgrade retrospective
Summary
GitLab uses Helm to deploy GitLab's cloud-native Chart and other services such as Prometheus and logging. Because official support for Helm 2 ended in November 2020, we decided to prioritize a Helm 3 upgrade on all clusters. This meant identifying where we were incompatible with Helm 3, both in the GitLab chart and in the other charts we depend on, and then upgrading cluster by cluster.
The entire upgrade took approximately 4 weeks on the calendar. Because that period included the holiday break, the actual work to complete the upgrade across 9 Kubernetes clusters was approximately 2 to 3 weeks.
Epic: &370 (closed)
What went well
- Having a dual image that runs in both modes was very helpful in the beginning (#1413 (closed)). We were able to run Helm 2 and Helm 3 alongside each other for weeks, since the transition took a while.
- Having preprod as a test environment was extremely helpful. We were able to sort out a number of issues before we moved to Staging.
- The split between gitlab-com and gitlab-helmfiles worked to our advantage, since we could upgrade them separately.
- Having dry-runs in CI was a good addition, as it helped to increase confidence.
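To illustrate the dry-run idea, here is a minimal sketch of how a CI job could build one Helm 3 dry-run command per cluster. It is not taken from our pipeline: the function names, the release and chart names, and the kube context names are all hypothetical placeholders, and the real jobs may be driven by other tooling. Only the `helm upgrade` flags shown (`--install`, `--dry-run`, `--namespace`, `--kube-context`) are standard Helm 3 options.

```python
from typing import List


def build_dry_run_command(
    release: str, chart: str, namespace: str, kube_context: str
) -> List[str]:
    """Return the argv for a Helm 3 dry-run upgrade against one cluster."""
    return [
        "helm", "upgrade", release, chart,
        "--install",                      # install the release if it does not exist yet
        "--dry-run",                      # render and validate without applying changes
        "--namespace", namespace,
        "--kube-context", kube_context,   # selects which cluster to talk to
    ]


def dry_run_plan(
    release: str, chart: str, namespace: str, kube_contexts: List[str]
) -> List[List[str]]:
    """Build one dry-run command per cluster context."""
    return [
        build_dry_run_command(release, chart, namespace, ctx)
        for ctx in kube_contexts
    ]


if __name__ == "__main__":
    # Hypothetical context names; the real clusters are not listed in this retrospective.
    contexts = ["preprod", "staging-zone-a", "staging-zone-b"]
    for cmd in dry_run_plan("gitlab", "gitlab/gitlab", "gitlab", contexts):
        # A CI job would pass each list to subprocess.run(cmd, check=True).
        print(" ".join(cmd))
```

Building the commands as lists keeps them easy to inspect in CI logs before anything is executed, which fits the goal of increasing confidence ahead of a real upgrade.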
What could have been improved
- Because of multi-zonal clusters, there were a lot of upgrades to do: 9 clusters in total, which was a bit of a grind.
- We needed to manage remote connections to all 9 clusters, since the upgrade was done outside of CI. Better tooling around this would have made it easier, perhaps a dedicated kubectl host for each cluster instead of tunneling through hosts.
What happened that we didn't expect
- I initially expected that we would run into more issues with the GitLab Chart, but only a handful of updates were needed there.
- Prometheus operator was a bit of a pain because we needed to upgrade the chart first, and then upgrade to Helm 3. We should do a bit better at keeping on top of version updates (#1438 (closed)).
Edited by John Jarvis