Investigate processes needing restart after migration adding column
Background
During a recent production incident (https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7792), we ran into unsafe behaviour during database migrations.
A migration, AddRejectNonDcoCommitsToPushRules, was applied: !97938 (merged). This migration adds a reject_non_dco_commits column to the push_rules table.
After this migration ran, (a subset of?) saves on the PushRule ActiveRecord object started failing with this message:
Unable to save project. Error: unknown attribute 'reject_non_dco_commits' for PushRule.
(source)
This was fixed by restarting all of the pods.
Problem
This suggests that some migrations which add columns are not safe.
Rails caches its view of the database schema in each running process. It appears that in this case, Rails re-validated the cached schema against the actual schema, saw the new column, and cancelled the save operation.
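To make the suspected failure mode concrete, here is a minimal pure-Ruby sketch of a process that snapshots a table's columns at boot and rejects attributes it does not know. It is only a toy model under stated assumptions: FakeTable, CachedModel, and demo_stale_cache are hypothetical names, not Rails or GitLab APIs, and the column list apart from reject_non_dco_commits is illustrative. The actual trigger in the incident is not confirmed and still needs to be reproduced.

```ruby
# Toy model of a per-process schema cache. Hypothetical classes only;
# nothing here is a Rails or GitLab API.

class UnknownAttributeError < StandardError; end

# Stands in for the database: one table with a mutable column list.
class FakeTable
  attr_reader :columns

  def initialize(columns)
    @columns = columns.dup
  end

  # What a migration adding a column does to the real schema.
  def add_column(name)
    @columns << name
  end

  # A freshly read row always reflects the current schema.
  def read_row
    @columns.to_h { |column| [column, nil] }
  end
end

# Stands in for a model class inside one long-running process.
class CachedModel
  def initialize(table, model_name)
    @table = table
    @model_name = model_name
    @cached_columns = table.columns.dup # snapshot taken once, at "boot"
  end

  # Rejects any attribute that is missing from the cached snapshot.
  def assign(attributes)
    attributes.each_key do |name|
      next if @cached_columns.include?(name)

      raise UnknownAttributeError, "unknown attribute '#{name}' for #{@model_name}."
    end
    attributes
  end

  # Refreshes the snapshot; similar in effect to restarting the process.
  def reload_schema!
    @cached_columns = @table.columns.dup
  end
end

def demo_stale_cache
  table = FakeTable.new(%w[id project_id])           # illustrative columns
  old_process = CachedModel.new(table, "PushRule")   # booted before the migration
  table.add_column("reject_non_dco_commits")         # migration runs
  row = table.read_row                               # attributes built from the new schema

  before = begin
    old_process.assign(row)
    "ok"
  rescue UnknownAttributeError => e
    e.message
  end

  old_process.reload_schema!
  after = old_process.assign(row).key?("reject_non_dco_commits") ? "ok" : "missing"
  [before, after]
end

before, after = demo_stale_cache
puts before
puts after
```

In this model the stale process fails until its snapshot is refreshed, which matches the observation that restarting the pods fixed the errors. In a real Rails process, the per-model column cache can be refreshed with reset_column_information, but whether that alone would have been sufficient here is an open question.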
We cover many caveats, in particular dropping columns, in https://docs.gitlab.com/ee/development/database/avoiding_downtime_in_migrations.html. Until now, however, adding a column had been considered safe, as far as I can tell.
Potential impact: This creates not only an availability risk, but also a durability and data consistency one.
What is needed
We need to reproduce this issue and investigate under which conditions adding a column is not safe. Once we understand the trigger, we can decide whether we need to change how we do migrations that add columns.