Change procedure for making database changes

Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.

The problems described in https://gitlab.com/gitlab-com/infrastructure/issues/1803 are for me the final straw that broke the camel's back, and we need to change the way we make database changes and review these changes. Thus far there have essentially been 3 scenarios:

  1. Somebody makes a database change and all is well
  2. Somebody makes a database change. Several weeks later a database specialist (or somebody else for that matter) discovers serious problems with the changes
  3. Somebody makes a database change, breaks master or a deployment, and asks a DB specialist to look into it or even solve the problem

The biggest problem for me is scenario 3. While it doesn't happen that often, it happens often enough that it becomes profoundly annoying. A database specialist shouldn't be tasked with cleaning up other people's mess every time.

To deal with this we need to change our procedure. For example, I'm thinking of the following:

  • We define a group in gitlab-org called "Database Specialists". Any merge request changing anything database related (so basically any change in db/) needs to be approved by the "Database Specialists" group (using merge request approvals)
  • When designing a new feature any database related changes should be clearly stated, and a database specialist should review them before any coding is done. Don't ask a database specialist to design everything from scratch as this consumes too much of their time, instead do as much work yourself and merely ask a DB specialist to give their feedback/opinion.
  • Developers who block deployments because of problematic migrations (both during and after a deploy) should be put on call until the problem is solved. This allows production engineers to escalate problems, and will hopefully make developers take greater care in testing/reviewing/thinking about their changes
  • Each database change must be tested on staging, and each merge request must contain some boilerplate for the author to fill in (e.g. the migration names, the time it took to run, etc). If an author does not have staging access (e.g. they are not an employee of GitLab) it's up to the reviewer to gather the data
  • Each merge request containing database changes should be labelled gitlab-ce~3098698 This is usually already the case I think, but it's worth stating this more clearly
  • Database specialists should not be asked to review code until the migrations have been tested on staging, this to prevent them from drowning in work
  • When asking a database specialist to review a merge request the author should include a link to the diff of the migrations, instead of "Please review the migrations". This reduces the amount of work for a database specialist, especially when reviewing gigantic merge requests
  • If somebody either blocks a deployment due to a bogus database change, or GitLab.com becomes unstable, then solving this becomes the number one priority and the author is expected to drop whatever they are doing and solve this problem first.

This means that a hypothetical workflow would be:

  1. An issue is created to discuss a bug/feature/change/whatever. In this issue any database related changes are discussed. A database specialist may be asked for reviewing the proposed changes, though this is only really necessary at this point when adding new columns or tables.
  2. An MR is created for this issue
  3. The author adds the "Database Specialists" group to the merge request approvers
  4. The author adds the gitlab-ce~3098698 label to the MR
  5. The author runs the migrations on staging, and includes the timings for every migration in the MR body (e.g using a Markdown table)
  6. The author discusses the changes with any endbosses/colleagues (not necessarily DB specialists)
  7. The author asks a database specialist to review the migrations, including a link to the diffs in the comment
  8. A DB specialist reviews the changes and approves the MR if deemed OK, or they reject the changes requiring the author to adjust the code accordingly (or even close the MR, depending on the changes and feedback)
  9. This continues until the changes are approved by at least 1 database specialist
  10. The merge request is assigned to and endboss for final review
  11. The merge request is merged

Should the deploy fail (e.g. due to a timeout or a problem with a migration), then for the author of said migration the procedure will be:

  1. Stop whatever it is that you were working on
  2. Solve the problem
  3. Help get this fix in a new package version
  4. Be on call when the new changes are deployed, in case they break again

When things go wrong I think we also need to clearly note down which migrations broke, why they broke, and the merge requests that introduced them. This helps us keep track of the number of failures over time. This can also help us see if the problems are introduced by the same people over and over, though primarily the purpose is to simply see if we're improving over time.

I think we also need to start requiring backend developers (as a start) to take some kind of SQL/database course (preferably an existing one since I'm really not looking forward to training 80 developers myself). Such a course should cover topics such as:

  • How to write SQL queries in the first place
  • How to use PostgreSQL (as in, the extensions/features it may provide)
  • How to design database schemas properly
  • How to write efficient queries

Such a course would be useful for designers/frontend developers as well so they have a better understanding of the system as a whole, but it's not as much a hard requirement as it is for backend developers.

Finally, when hiring developers I think we need to make SQL knowledge a hard requirement, and make PostgreSQL knowledge a highly preferred skill to have. This prevents us from hiring developers that may be unfamiliar with SQL, then proceed to make a mess out of things.

cc @stanhu @ernstvn @DouweM @smcgivern @rspeicher @dzaporozhets: since this basically affects all backend teams

Edited by 🤖 GitLab Bot 🤖