Incident Review: Gitaly/Praefect 16.0 configuration breaking changes
In #9091 (closed) we faced problems with upgrading the auto-deploy package because of breaking changes that 16.0.xxxx brought, specifically in the `gitaly` and `praefect` configuration.
Before the incident
Before the incident started, the ~"team::Practices" team was made aware, one week in advance, that this would be a 16.0 breaking change, and we opened https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/19359 to tackle it. One assumption we made was that we had until late May 2023, since 16.0 was scheduled for release on May 22nd. This turned out not to be the case: auto-deploy started tagging 16.xxx packages as soon as 15.11 was released on April 22nd.
Follow-up:
- Communication breakdown that a breaking change was coming: Configuration Deprecation process Refinement fo... (gitlab-org/gitlab#408557)
- Breaking changes for GitLab.com come a month earlier for configuration: gitlab-com/Product#5673 (comment 1372440947)
Learning:
- When there is a breaking change for GitLab.com, we should assume it will arrive a month early, since we release internally a month ahead of the public release.
April 24th: Deploy 16.0.xxx for the first time
On April 24th we deployed our first 16.0.xxx package, which triggered errors when installing it because the configuration had been marked for removal in 16.0.
We knew updating the configuration for `gitaly` and `praefect` would take time, so our first instinct was to temporarily bump the `removal` version from 16.0 to 16.1, so that we could unblock deployments and work on updating the configuration.
This, however, was not possible, since the removal checks ran before the new package was installed, so even after we bumped the `removal` we were still blocked. To fix this we had to hotpatch the VMs' `deprecations.rb` to set `removal` to 16.1. Since this was a hotpatch, we tried to limit it to servers we knew were affected. However, we later found that more servers were affected and needed to be hotpatched as well: #9112 (closed), #9110 (closed), #9580 (closed).
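For reference, this is a sketch of the kind of entry we patched, assuming the deprecation-list structure that omnibus-gitlab's `deprecations.rb` uses (an array of hashes whose `removal` key drives the preinstall check); the config keys, versions, and file path below are illustrative, not the exact entry:

```ruby
# Hotpatched on the already-installed package on each VM, roughly at:
#   /opt/gitlab/embedded/cookbooks/package/libraries/deprecations.rb
# Each deprecation is a hash; the `removal` key tells the preinstall
# check in which version the setting is dropped.
patched_entry = {
  config_keys: %w(gitaly listen_addr), # illustrative key, not the exact one
  deprecation: '15.10',                # illustrative version
  removal: '16.1',                     # bumped from '16.0' to unblock installs
  note: 'Moved under the new structured Gitaly configuration.'
}
```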
Another problem that forced us into this hotpatch: even after we updated the configuration for `pre`, we still had to hotpatch it, because `gitlab-ctl reconfigure` was failing and the configuration was never applied.
Late in EMEA time we updated the configuration, starting with `pre`, which took some time since it was the first time the engineers had to use HashiCorp Vault. Shout out to @mchacon3 for helping us out here!
Action Items:
- Make it easy to rollback a broken package: gitlab-org/omnibus-gitlab#7797
Learnings:
- To get the full list of servers we needed to hotpatch, we could have used `omnibus_build_info`
- How to update HashiCorp Vault: https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/vault/usage.md#chef-secrets
April 25th: Dealing with staging
We opened #9114 (closed) to start updating staging. We found this environment to be more complex, since the configuration was spread across multiple files and also lived in the base roles. We moved slowly here because we thought we weren't blocking deploys, and we wanted to make sure we had all the configuration in the right order so we could execute the same change in `gprd`.
April 26th: Finish with staging
We rolled out the configuration in staging and, from our testing, everything looked fine.
We also kept getting distracted here because we found more nodes in `pre` on which `gitlab-ctl reconfigure` was still failing.
We also had to update cookbook-omnibus-gitlab, since certificates were defined in that repository, and we needed to update some tests around it. This took almost all day, since we weren't familiar with the code and wanted to see whether we should move certificate management outside of this cookbook, which led to a dead end and to inconsistency in how we manage certificates.
During the AMER shift we also found that some configuration was broken and opened a new incident, #9262 (closed), since it was yet again blocking deployments. This happened because of how we migrated the configuration: we had moved `storage` into the wrong field.
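The shape change that makes this mistake easy is clearest side by side. Below is a hedged sketch of the Gitaly storage migration, assuming the 16.0 nested `gitaly['configuration']` format that replaced the flat settings; the paths and names are illustrative, and this review doesn't record which exact field we got wrong:

```ruby
# /etc/gitlab/gitlab.rb -- illustrative values, not our production config.

# Pre-16.0 flat style (removed in 16.0):
#   git_data_dirs({ 'default' => { 'path' => '/var/opt/gitlab/git-data' } })

# 16.0 nested style: settings live under gitaly['configuration'], and
# storage is an array of hashes at the top level of that hash. Nesting
# it under the wrong key yields a config Gitaly won't use as intended.
gitaly['configuration'] = {
  storage: [
    {
      name: 'default',
      # Unlike git_data_dirs, the new path points at the repositories
      # directory itself rather than its parent.
      path: '/var/opt/gitlab/git-data/repositories'
    }
  ]
}
```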
Learning:
- We should validate that `chef_client` is not failing, using `chef_client_error > 0` (sketched below)
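As a sketch of that validation, the following Ruby script evaluates the expression above against the Prometheus HTTP API (`/api/v1/query`); the Prometheus URL is a placeholder, and the same pattern could enumerate servers via the `omnibus_build_info` metric mentioned earlier:

```ruby
#!/usr/bin/env ruby
# Sketch: fail loudly if any scraped node reports a failing chef-client
# run. The metric name comes from this review; the URL is a placeholder.
require 'json'
require 'net/http'
require 'uri'

PROMETHEUS = ENV.fetch('PROMETHEUS_URL', 'http://prometheus.example.com:9090')

uri = URI("#{PROMETHEUS}/api/v1/query")
uri.query = URI.encode_www_form(query: 'chef_client_error > 0')

response = JSON.parse(Net::HTTP.get(uri))
failing = response.dig('data', 'result') || []

if failing.empty?
  puts 'chef-client is healthy on all scraped nodes'
else
  failing.each do |series|
    # The instance label identifies the node whose last run failed.
    puts "chef-client failing on #{series.dig('metric', 'instance')}"
  end
  exit 1
end
```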
April 27th: Preparing for Production rollout
There was the usual distraction of finding more servers that needed to be patched. We also updated the rest of the environments listed in https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/19359, since we wanted to sort them out before `gprd` so that we could be more confident about the change there. We also cleaned up some of the configuration, since it contained fields that no longer existed and was just dead configuration in our code base.
Learnings:
- Deploy nodes require the `praefect.db` configuration to run migrations (see the sketch below)
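A minimal sketch of what that looks like, assuming the 16.0 nested `praefect['configuration']` format; every value here is a placeholder:

```ruby
# /etc/gitlab/gitlab.rb on a deploy node -- placeholder values only.
# Without this database section, the Praefect migrations run during
# reconfigure have nothing to connect to.
praefect['configuration'] = {
  database: {
    host: 'praefect-db.internal.example.com',
    port: 5432,
    user: 'praefect',
    password: 'REDACTED',
    dbname: 'praefect_production'
  }
}
```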
April 28th: Updating gprd
On the 28th we finished and executed the configuration update for `gprd` (#9541 (closed)). This took most of the day, since a lot of files needed to be updated and required a lot of prep work. Luckily everything was updated successfully and there were no further problems with the Gitaly configuration.
Follow-up:
- Configuration refactor so it's not in multiple files: https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/19359#deprecationsremoved-configuration-warnings and https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/19359#configuration-refactors-we-need-to-do
Customer Impact
No customer impact, since this was a backstage incident.