2020-09-22: post-deployment migration failed on Production
Summary
The daily deploy got stuck when one of the two post-deploy migrations introduced by gitlab-org/gitlab!41663 (merged) failed in production.
The migration had run successfully in staging, but the affected table(s) there held only a few records, versus thousands of them in the production database.
The main reason is that staging has no records in group_import_states that reference a Group without an owner and with a parent defined. The error only occurs when the code cannot find an owner for a referenced group and tries to fetch the owner of its parent Group. With no records in group_import_states matching that scenario in staging, the migration could run without that branch of the code ever executing.
The incident team realised that a simple rollback was not possible, since some data/schema changes had already landed in production (without breaking anything).
The database team, helping with the incident, concluded there were two possible ways to resolve the issue and unblock the daily deploy. @georgekoltsov quickly found a migration code fix, so we could progress with the faster of the two options.
From there, the offending data migration was rolled back in staging, the code fix was merged, the post-deploy migrations were run in staging again with no issues (after some testing, explained here), and the whole daily deploy was resumed, progressing to our production environment with no issues.
This error could also have been caught early if tests covering this scenario had been included in the specs for the data migration. As a corrective action, full coverage for this use case has also been added to the specs in the fix (gitlab-org/gitlab!42987 (merged)). Additional discussion on preventing similar issues through data or background migration test coverage will continue in gitlab-org/gitlab#225199.
Change that added the migration (the offending change, finally reverted): gitlab-org/gitlab!41663 (diffs)
TASK [Run migrations] **********************************************************
fatal: [deploy-01-sv-gprd.c.gitlab-production.internal]: FAILED! => changed=true
cmd:
- /usr/bin/gitlab-rake
- db:migrate
delta: '0:00:20.456282'
end: '2020-09-22 10:42:25.257486'
msg: non-zero return code
rc: 1
start: '2020-09-22 10:42:04.801204'
stderr: |-
rake aborted!
StandardError: An error has occurred, all later migrations canceled:
Invalid single-table inheritance type: Group is not a subclass of CleanupGroupImportStatesWithNullUserId::Namespace
/opt/gitlab/embedded/service/gitlab-rails/db/post_migrate/20200909161624_cleanup_group_import_states_with_null_user_id.rb:39:in `default_owner'
/opt/gitlab/embedded/service/gitlab-rails/db/post_migrate/20200909161624_cleanup_group_import_states_with_null_user_id.rb:65:in `block (2 levels) in up'
/opt/gitlab/embedded/service/gitlab-rails/db/post_migrate/20200909161624_cleanup_group_import_states_with_null_user_id.rb:64:in `block in up'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:90:in `block in each_batch'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:68:in `step'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:68:in `each_batch'
/opt/gitlab/embedded/service/gitlab-rails/db/post_migrate/20200909161624_cleanup_group_import_states_with_null_user_id.rb:63:in `up'
/opt/gitlab/embedded/bin/bundle:23:in `load'
/opt/gitlab/embedded/bin/bundle:23:in `<main>'
Caused by:
ActiveRecord::SubclassNotFound: Invalid single-table inheritance type: Group is not a subclass of CleanupGroupImportStatesWithNullUserId::Namespace
/opt/gitlab/embedded/service/gitlab-rails/db/post_migrate/20200909161624_cleanup_group_import_states_with_null_user_id.rb:39:in `default_owner'
/opt/gitlab/embedded/service/gitlab-rails/db/post_migrate/20200909161624_cleanup_group_import_states_with_null_user_id.rb:65:in `block (2 levels) in up'
/opt/gitlab/embedded/service/gitlab-rails/db/post_migrate/20200909161624_cleanup_group_import_states_with_null_user_id.rb:64:in `block in up'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:90:in `block in each_batch'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:68:in `step'
/opt/gitlab/embedded/service/gitlab-rails/app/models/concerns/each_batch.rb:68:in `each_batch'
/opt/gitlab/embedded/service/gitlab-rails/db/post_migrate/20200909161624_cleanup_group_import_states_with_null_user_id.rb:63:in `up'
/opt/gitlab/embedded/bin/bundle:23:in `load'
/opt/gitlab/embedded/bin/bundle:23:in `<main>'
Tasks: TOP => db:migrate
(See full trace by running task with --trace)
stderr_lines: <omitted>
stdout: '== 20200909161624 CleanupGroupImportStatesWithNullUserId: migrating ==========='
stdout_lines: <omitted>
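For context on the SubclassNotFound error above: the namespaces table carries a type column (set to 'Group' for groups), so any ActiveRecord model mapped onto it gets single-table inheritance by default. The sketch below is a simplified, hypothetical reconstruction of the failure mode and of the usual way migration-local models avoid it; it is not the exact code of the migration or of the fix.

```ruby
# Hypothetical, simplified migration-local model; not the real migration code.
class CleanupGroupImportStatesWithNullUserId < ActiveRecord::Migration[6.0]
  class Namespace < ActiveRecord::Base
    self.table_name = 'namespaces'

    # Without this line, loading a namespaces row whose type column is 'Group'
    # makes Rails resolve the constant Group and require it to be a subclass
    # of this local Namespace class, raising ActiveRecord::SubclassNotFound
    # (the error in the log above). Disabling STI on the migration-local
    # model sidesteps the problem.
    self.inheritance_column = :_type_disabled
  end
end
```

The path that loads a parent Group only runs when a referenced group has no owner, which is why production (which has such rows) hit it and staging (which does not) never did.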
Timeline
All times UTC.
2020-09-22
- 10:50 am - aphillips declares incident in Slack using the /incident declare command.
- 11:00 am - The two database team members join the incident call.
- 01:08 pm - The incident team determines and tests the best way forward to mitigate (and resolve) the incident.
- 01:55 pm - The fix was merged and the delivery team started the deployment (staging and then production) again.
Incident Review
Summary
- Service(s) affected: Our daily deploy to staging and production.
- Team attribution: Delivery team
- Minutes downtime or degradation: From 10:50 UTC to 13:55 UTC: 3h 05min
Metrics
Customer Impact
No customer was impacted by this incident #2731 (comment 420008287).
Incident Response Analysis
- How was the event detected? The deploy process broke for the delivery team and @amy declared an incident in Slack.
- How could detection time be improved? It would have been great to catch this in staging instead of production, but the DB data in staging is quite old and different; this issue tries to address that limitation. Additionally, if tests covering this scenario had been included in the specs for the data migration, the error would have been caught before the deploy started.
- How did we reach the point where we knew how to mitigate the impact? We needed part of the Database team on the incident call (@abrandl, @iroussos) so they could review in depth what happened with the post-deploy migration. They consulted back and forth with @georgekoltsov, who had been involved in that code, to validate the fastest and safest way to mitigate and fix the issue.
- How could time to mitigation be improved? The author involved the database team early in the process. The problem was diagnosed, a solution was found, and a fix was ready in the quickest possible timeframe. We could have fully reverted the migration and reduced the time to mitigate by 30-60 minutes, but since a stable solution was found, we decided to fix the problem even at the cost of the additional delay. Given that, the only part of the mitigation plan that could be improved in similar cases is rolling back a migration in staging or production; with the experience of successfully rolling back a migration in this incident, we could add a runbook for rolling back migrations (a sketch of the relevant commands follows this list).
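As a starting point for such a runbook, here is a hedged sketch of the commands involved in rolling back a single post-deploy migration on an Omnibus deploy node. It reuses the gitlab-rake entry point from the failure log above together with standard Rails rake tasks, and is not necessarily the exact procedure followed during this incident.

```shell
# Hypothetical runbook sketch: roll back one post-deploy migration by its
# version timestamp (here, the one from the failure log above).
sudo gitlab-rake db:migrate:status | grep 20200909161624   # confirm the migration is "up"
sudo gitlab-rake db:migrate:down VERSION=20200909161624    # run its down method
sudo gitlab-rake db:migrate:status | grep 20200909161624   # verify it is now "down"
```

Note that db:migrate:down only helps if the migration defines a usable down method; data-only cleanups often cannot restore deleted rows, which is part of why a plain rollback was not straightforward in this incident.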
Post Incident Analysis
- How was the root cause diagnosed? See point 3 of the previous section.
- How could time to diagnosis be improved? See point 4 of the previous section.
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident? Only the daily deployment process was disrupted. As long as we have to fix the root cause of a bug that caused an incident and revert the modifications made by migrations in our staging and production environments, we will have to block deployments for as long as that process takes. The only real way to address this, as discussed in other sections, is by catching similar issues as early as possible, before they reach production.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, have you linked the issue which represents the change? Yes, the link is here: gitlab-org/gitlab!41663 (diffs)
5 Whys
- Why did the deploy break in production? There was a code bug: when the code cannot find an owner for a group referenced in group_import_states and tries to fetch its parent Group's owner via the default_owner method of the local CleanupGroupImportStatesWithNullUserId::Group class introduced in the data migration, it breaks because Group is not a subclass of the local CleanupGroupImportStatesWithNullUserId::Namespace class.
- Why did the deploy pass staging? There was not enough data in the tables involved in the post-deployment migration (in staging) to make it fail. All Groups referenced by group_import_states entries there have an owner defined, so the second part of default_owner, which tries to access a Group's parent, never ran.
- Why was a plain rollback (from production) not possible, so that we needed to troubleshoot manually? Some data (presumably) and the schema had already been modified in production, although without affecting its functioning.
- Why was this particular test case not covered by the spec? Both the author(s) and the reviewers of the MR missed that this case could cause issues and should be included in the specs for the migration class (a sketch of such a spec follows this list). The issues caused by having to duplicate existing specs for the custom (local) class definitions used in data and background migrations are discussed further in gitlab-org/gitlab#225199.
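To make that last point concrete, below is a hypothetical sketch of a migration spec exercising the branch that was missed: a group_import_states row pointing at a group that has no owner but does have a parent. It assumes GitLab's migration spec helpers (table, migrate!, require_migration!); the column values and the assertion are illustrative, not taken from the actual fix.

```ruby
# Hypothetical regression spec for the missed scenario; attribute names and
# values are illustrative, not copied from the real spec.
require 'spec_helper'
require_migration!

RSpec.describe CleanupGroupImportStatesWithNullUserId, :migration do
  let(:namespaces)          { table(:namespaces) }
  let(:group_import_states) { table(:group_import_states) }

  it 'handles a group without an owner whose parent must be looked up' do
    parent = namespaces.create!(name: 'parent', path: 'parent', type: 'Group')
    child  = namespaces.create!(name: 'child', path: 'child', type: 'Group', parent_id: parent.id)
    group_import_states.create!(group_id: child.id, user_id: nil, status: 0)

    # Loading the parent namespace must not trip single-table inheritance.
    expect { migrate! }.not_to raise_error
  end
end
```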
Lessons Learned
See the summary section, close to the end.
Corrective Actions
- Full coverage for this use case has also been added to the specs in the fix (gitlab-org/gitlab!42987 (merged)).
- infrastructure#7214
- Add a runbook for rolling back migrations
- Work that the delivery team is doing (@nolith and others), together with the Database team, to improve how post-deployment migrations are run as part of our release process (a sub-epic in this epic: &280)