Consider limiting how many times we change our database schema

https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9176 describes a production issue we had yesterday. In short, a migration caused an increase in load. Reverting the migration proved problematic, and it wasn't until earlier today that we managed to set up the merge requests for the revert.

This brings me to a problem I have been observing for several years now: we change our database schema a lot. As our database grows, there is less room for error. As our popularity grows, there will also be more eyes on GitLab.com, meaning we have to be even more careful.

Changing the schema as often as we do is unlikely to scale with that. You simply can't rewrite a 100 GB table every month (or even multiple times per month) without eventually causing an incident. With this in mind, I think it's time we start investigating what options we have to keep this under control. A naive approach would be to limit the number of migrations per team per month. On paper this may seem like a good idea, but in practice I can see it leading to fewer but much bigger migrations.
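To make the cost concrete, here is a minimal sketch of the kind of pattern we'd want instead of a one-shot rewrite. The `events` table, `processed` column, and migration version tag are made up for illustration. On PostgreSQL 10 and earlier, `ADD COLUMN ... DEFAULT ...` rewrites the entire table under an exclusive lock; adding the column as nullable and backfilling in small batches keeps every individual transaction short.

```ruby
# frozen_string_literal: true

# Hypothetical migration: add a column to a very large table without
# rewriting it in a single statement.
class AddProcessedToEvents < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  BATCH_SIZE = 10_000

  def up
    # Cheap: a nullable column without a default only touches the
    # catalog, so there is no table rewrite.
    add_column :events, :processed, :boolean

    # Backfill in batches so no single transaction holds locks on the
    # whole table.
    event_model.in_batches(of: BATCH_SIZE) do |batch|
      batch.update_all(processed: false)
      sleep(0.1) # give autovacuum and replicas room to keep up
    end

    # New rows get the default from here on; existing rows are done.
    change_column_default :events, :processed, false
  end

  def down
    remove_column :events, :processed
  end

  private

  # Ad-hoc model so the migration does not depend on application code.
  def event_model
    Class.new(ActiveRecord::Base) { self.table_name = 'events' }
  end
end
```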

The end goal is to ensure that when we introduce a migration, we know for a fact that it will work (and work well). If we don't have that kind of certainty, a migration should not be introduced at all. This may seem extreme, but as we grow we can't stick to the old approach of "It works on my computer, and we'll see how it works in production".
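One concrete way to get closer to that certainty is to make migrations fail fast instead of degrading production. A hedged sketch, with invented table and index names: setting conservative timeouts before DDL means a migration that can't promptly acquire its locks aborts (and can be retried) rather than queueing behind live traffic and causing a load spike.

```ruby
# frozen_string_literal: true

# Hypothetical migration: guard DDL with timeouts and build the index
# concurrently so writes are not blocked.
class AddIndexToEventsOnCreatedAt < ActiveRecord::Migration[6.0]
  disable_ddl_transaction!

  def up
    # Abort quickly if the lock can't be taken, instead of blocking
    # every other writer queued up behind us.
    execute "SET lock_timeout = '5s'"
    execute "SET statement_timeout = '15min'"

    # CONCURRENTLY avoids holding a write-blocking lock while the
    # index is built; it requires running outside a transaction.
    add_index :events, :created_at, algorithm: :concurrently
  end

  def down
    remove_index :events, :created_at
  end
end
```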

In the past we had a merge request template for database changes, but it was rarely used in practice. Adding more TODOs to the default merge request template won't work either, as most authors will probably just remove or ignore them (something that already happens with the TODOs currently in the template). We likely also need help from Infrastructure, as a migration can only be trusted as far as it has been tested, and good testing requires production data.
