2020-06-09: post-deployment migration failure
/label incident IncidentActive
Summary
The db/post_migrate/20200602143020_update_routes_for_lost_and_found_group_and_orphaned_projects.rb
post-deployment migration failed due to a unique index violation.
Timeline
All times UTC.
2020-06-09
- 10:41 - The production deployment failed
- 10:43 - Our tooling reported the failure on Slack
- 10:43 - @jarv and @nolith investigate the failure; the post-deployment migration failure is deemed safe for the production environment
- 10:49 - gitlab-org/gitlab!33653 (diffs, comment 357921042) was identified as the source of the failure
- 10:50 - @iroussos was engaged to assess the feasibility of a revert
- 10:59 - @iroussos is working on a revert
- 11:27 - @acaiazza declares the incident in Slack using the /incident declare command
- 13:44 - Proposed revert is merged into master: gitlab-org/gitlab!34163 (merged)
- 14:07 - Release Manager fast-tracks the fix by starting the build early (internal: https://gitlab.slack.com/archives/C0139MAV672/p1591708029245300)
- 18:37 - Production deploy of the revert is completed
Resolution plan:
- Wait for gitlab-org/gitlab!34163 (merged) to be merged
- Pick it into auto-deploy
- Tag a new release
- Deploy the new release to recover other pending migrations (if any)
Incident Review
Summary
There was a bug with the db/post_migrate/20200602143020_update_routes_for_lost_and_found_group_and_orphaned_projects.rb
post-deployment migration that was introduced with gitlab-org/gitlab!33653 (merged).
The migration depended on unique namespace names but did not check whether the path was also unique (different names can result in the same path). That causes no issue while generating a namespace, but results in an error when a route is created if another route with the same path already exists, as the sketch below illustrates.
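To make the failure mode concrete, here is a minimal sketch; the slugify helper is hypothetical and only approximates path generation, it is not GitLab's actual implementation:

```ruby
# Hypothetical helper, for illustration only -- not GitLab's actual
# path-generation code.
def slugify(name)
  name.downcase.gsub(/[^a-z0-9]+/, '-').gsub(/\A-+|-+\z/, '')
end

slugify('Lost & Found')   # => "lost-found"
slugify('Lost -- Found!') # => "lost-found" (different name, same path)
```

With a unique index on routes.path, inserting a second route for either of these names raises a duplicate-key error even though the names themselves are unique.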
The staging database is not up to date with production, so this error was not caught in staging. But even if the staging database had been up to date, we could have missed such a bug. We got "lucky" that there was a conflicting route in our own database: a different namespace name and path could have been chosen that would result in no conflicts. In that case, the migration would have succeeded in all our environments and could have caused issues in self-hosted instances.
On the code-testing side, as this was a bug, no specs were included to test against that case, so it was not caught by our test jobs. This migration uses a fairly unique namespace name and path, so without an explicit test for a conflicting path, such a bug would never be caught. A sketch of such a spec follows.
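For illustration, a regression spec along the following lines could have caught this class of bug. This is only a sketch: it assumes GitLab's migration spec helpers (table, migrate!), and 'lost-and-found' stands in for whatever path the migration actually generates.

```ruby
# frozen_string_literal: true

require 'spec_helper'

# Sketch only: `table` and `migrate!` are GitLab migration spec helpers;
# the conflicting path below is hypothetical.
RSpec.describe UpdateRoutesForLostAndFoundGroupAndOrphanedProjects do
  let(:namespaces) { table(:namespaces) }
  let(:routes)     { table(:routes) }

  it 'succeeds when a route with the target path already exists' do
    # Seed a route that conflicts on path (but not on name).
    namespace = namespaces.create!(name: 'Another Name', path: 'lost-and-found')
    routes.create!(source_id: namespace.id, source_type: 'Namespace',
                   name: namespace.name, path: namespace.path)

    expect { migrate! }.not_to raise_error
  end
end
```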
- Service(s) affected: Deployment to GitLab.com production
- Team attribution: Database
- Minutes downtime or degradation: 476
Customer Impact
- Who was impacted by this incident?
  Deployment of new updates to GitLab.com production.
- What was the customer experience during the incident?
  The deployment process had to be halted until the revert migration (gitlab-org/gitlab!34163 (merged)) was deployed.
- How many customers were affected?
- If a precise customer impact number is unknown, what is the estimated potential impact?
Incident Response Analysis
- How was the event detected?
  10:43 - Our tooling reported the failure on Slack.
- How could detection time be improved?
  Detection was immediate.
- How did we reach the point where we knew how to mitigate the impact?
  @jarv and @nolith investigated the failure, the author of the migration (@iroussos) was engaged, and the post-deployment migration failure was deemed safe for the production environment.
- How could time to mitigation be improved?
  The author of the migration was engaged and started working on a revert 10 minutes after the incident happened.
Post Incident Analysis
- How was the root cause diagnosed?
  The author of the migration (@iroussos) identified the issue and created a follow-up MR that fixed it.
- How could time to diagnosis be improved?
  The diagnosis was trivial once the failure occurred and the error messages were visible.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
  This was a fault in the business logic of the introduced update; we see no way to mitigate such issues other than even more thorough reviews.
- Was this incident triggered by a change?
  Yes - gitlab-org/gitlab!33653 (merged).
5 Whys
- The data migration failed when deployed on GitLab.com. Why?
  - The code did not satisfy one of the Model requirements: each path used in a route must be unique.
- Why was that requirement not checked?
  - The migration depended on a previous migration that created the paths for the namespaces and did not check for path uniqueness.
  - Path uniqueness is not enforced when a namespace is created, but paths must be unique when the route is created.
- Why did our Model specs not catch this faulty behavior/logic?
  - Data migrations cannot rely on existing models. They must be isolated and cannot use application code (e.g. models defined in app/models).
  - That means migration authors have to replicate existing functionality in the migration class (see the sketch after this list).
  - We require specs to be written for all data migrations, but a bug like this one can slip through those tests, as they are not as exhaustive as the ones we have for each of our Models. For example, the specs for Namespace, Group and Route add up to 3K lines, which cannot be reintroduced whenever a new migration is added.
  - This incident would have been avoided if our application-level models for Namespace, Group and Route could have been used. But we have to define new ones by cherry-picking only the functionality needed for the migration.
- Why was the logic that requires unique paths on routes not replicated in the migration?
  - There is no validation at the code level for path uniqueness when a Namespace is created; the process depends on a unique path being generated. That code was not properly included in the initial migration: by mistake, only the name was checked for uniqueness. But multiple namespace names can result in the same path, which was the root cause of this incident (see the sketch after this list).
- Why can't we use the application Models in background and data migrations?
  - Since these migrations can take a long time to run, it's possible for new versions of the application to be deployed while they are still running.
  - It's also possible for different migrations to be executed at the same time. This means that different background migrations should not migrate data in a way that would cause conflicts.
- Why was this issue not caught in staging?
  - The staging database is not up to date with production, so this error was not caught in staging.
  - Even if the staging database had been up to date, we could have missed such a bug. A different namespace name and path could have been chosen that would result in no conflicts. In that case, the migration would have succeeded in all our environments and could have caused issues in self-hosted instances.
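For reference, here is a minimal sketch of the isolated-model pattern discussed above, with the missing path-uniqueness check added. Class names, table columns, and the conflict-resolution strategy are all illustrative; this is not the actual migration code.

```ruby
# frozen_string_literal: true

# Illustrative sketch, not the actual migration: minimal models are
# re-defined inside the migration so it does not depend on app/models
# code that may change between releases.
class ExampleUpdateRoutes < ActiveRecord::Migration[6.0]
  class Namespace < ActiveRecord::Base
    self.table_name = 'namespaces'
  end

  class Route < ActiveRecord::Base
    self.table_name = 'routes'
  end

  def up
    name = 'Lost and Found'
    path = 'lost-and-found'

    # The application models validate route-path uniqueness; an isolated
    # migration has to replicate that check by hand, because the database
    # enforces a unique index on routes.path.
    path = "#{path}-#{SecureRandom.hex(4)}" if Route.exists?(path: path)

    namespace = Namespace.create!(name: name, path: path)
    Route.create!(source_id: namespace.id, source_type: 'Namespace',
                  name: name, path: path)
  end

  def down
    # No-op: illustrative sketch only.
  end
end
```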
Lessons Learned
- We should keep data migrations as simple as possible.
- When a complex migration is introduced, we should double-check during review that all the validations and Model logic are properly replicated.