Context

!108410 (merged) introduced migrations to rename the plan_limits.web_hook_calls column to plan_limits.web_hook_calls_high, following the rename column best practices.

The migrations (both regular and post) were executed successfully in staging, but their effects were noticed immediately in that environment:

PG::UndefinedColumn: ERROR:  column plan_limits.web_hook_calls does not exist
LINE 1: ...", "plan_limits"."ci_registered_project_runners", "plan_limi..
app/models/plan.rb:30:in `actual_limits', 
app/controllers/projects/settings/ci_cd_controller.rb:22:in `show'

Given the deployment order (canary environments first, main environments later), the regular migration was also executed in production, but the post-migration wasn't.

Schema version details
  • db/migrate/20230109095622_rename_web_hook_calls_to_web_hook_calls_high.rb was executed on staging and production. As a result, the plan_limits table has a web_hook_calls_high column in both environments:
-- staging
gitlabhq_production=> SELECT version FROM schema_migrations WHERE version = '20230109095622';
    version
----------------
 20230109095622

-- production
gitlabhq_production=> SELECT version FROM schema_migrations WHERE version = '20230109095622';
    version
----------------
 20230109095622
  • db/post_migrate/20230109100044_cleanup_web_hook_calls_column_rename.rb was only executed on staging.
-- staging
gitlabhq_production=> SELECT version FROM schema_migrations WHERE version = '20230109100044';
    version
----------------
 20230109100044

-- production
gitlabhq_production=> SELECT version FROM schema_migrations WHERE version = '20230109100044';
 version
---------
(0 rows)
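
The comparison above can also be expressed as a small script. This is only a sketch: the two version numbers come from the queries above, while the hash layout and the `missing_versions` helper are hypothetical and not part of GitLab.

```ruby
require 'set'

# Versions copied from the schema_migrations queries above.
# The environment names are only hash keys for this sketch.
APPLIED = {
  'staging'    => Set['20230109095622', '20230109100044'],
  'production' => Set['20230109095622']
}.freeze

# For each environment, list the versions that were applied somewhere else
# but are missing here.
def missing_versions(applied)
  all = applied.values.reduce(Set.new, :|)
  applied.transform_values { |versions| (all - versions).sort }
end

DRIFT = missing_versions(APPLIED)

DRIFT.each do |env, versions|
  puts "#{env}: missing #{versions.empty? ? 'nothing' : versions.join(', ')}"
end
```

In a real check the sets would presumably be loaded from each environment's schema_migrations table rather than hard-coded.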

The failures on staging triggered incident gitlab-com/gl-infra/production#8264 (closed). The purpose of that incident was to:

  • Restore the integrity of staging, and
  • Ensure the databases in staging and production were brought back to a consistent state.

Attempts to fix the problem

Attempt 1: Restore the column by renaming it from plan_limits.web_hook_calls_high back to plan_limits.web_hook_calls

Merge request: !109373 (closed)

The purpose of this MR was to revert the changes made by !108410 (merged), but it quickly grew in complexity because it had to account for the database state of staging, production, and self-managed instances. It later turned out that this approach wouldn't work, because it attempted to rename a column that still had triggers attached. A workaround would have been to turn the original post-migration into a regular migration, but that would have been risky since it involves deleting a column as part of a regular migration.

Attempt 2: Ignore plan_limits.web_hook_calls to avoid using it when it's renamed

Merge request !109441 (merged)

It was discovered that cleanup_concurrent_column_rename is not safe to use on a table/model that has ignored columns; see gitlab-com/gl-infra/production#8264 (comment 1246724753).

The purpose of this merge request was to ignore the web_hook_calls column so that it was no longer referenced by the plan_limits query. !109441 (merged) was deployed to staging-canary, but the specs targeting staging-canary and staging continued to fail.

This MR could solve the problem on staging, but deployments are designed to go to staging-canary first, followed by specs, then to production-canary, followed by specs, and only then to the main environments. The deployment can't progress until the QA specs are green.
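
To illustrate what ignoring a column is meant to achieve in this attempt, here is a minimal plain-Ruby sketch that builds an explicit SELECT list while skipping ignored columns. It is not ActiveRecord's implementation; the `select_sql` helper and the shortened column list are hypothetical.

```ruby
# Hypothetical, shortened column list for plan_limits; the real table has many more columns.
ALL_COLUMNS = %w[
  id plan_id ci_registered_project_runners web_hook_calls web_hook_calls_high
].freeze

# Build an explicit SELECT that leaves out every ignored column.
def select_sql(table, columns, ignored: [])
  visible = columns - ignored
  list = visible.map { |column| %("#{table}"."#{column}") }.join(', ')
  %(SELECT #{list} FROM "#{table}")
end

# With web_hook_calls ignored, the query no longer references it, so it
# keeps working whether or not the column still exists in the database.
puts select_sql('plan_limits', ALL_COLUMNS, ignored: %w[web_hook_calls])
```

The flip side is that any process still running with the column not ignored keeps asking for it by name, which matches the shape of the PG::UndefinedColumn error shown earlier.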

Attempt 3: Restore plan_limits.web_hook_calls without the rename helpers

Merge request: !109511 (merged)

This MR restores the previously renamed web_hook_calls column by:

  • Re-adding the old column if it doesn't exist (production still has it)
  • Cleaning up the old triggers
  • Installing a one-way sync trigger for web_hook_calls_high -> web_hook_calls
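
The three steps above can be sketched as PostgreSQL statements. The helper below only builds SQL strings and executes nothing. It is not the content of !109511: the function and trigger names, the column type and default, and the `old_triggers` argument are all assumptions made for illustration.

```ruby
# Returns the SQL for the three restore steps as an array of strings.
# Names, column type, and default are hypothetical.
def restore_statements(table:, old:, new:, old_triggers: [])
  fn = "sync_#{new}_to_#{old}"

  # One-way sync: every insert or update copies the new column into the old one.
  sync_function = <<~SQL.strip
    CREATE OR REPLACE FUNCTION #{fn}() RETURNS trigger LANGUAGE plpgsql AS $$
    BEGIN
      NEW."#{old}" := NEW."#{new}";
      RETURN NEW;
    END;
    $$;
  SQL

  statements = []
  # 1. Re-add the old column only where it is missing (assumed integer type).
  statements << %(ALTER TABLE "#{table}" ADD COLUMN IF NOT EXISTS "#{old}" integer DEFAULT 0 NOT NULL;)
  # 2. Drop the triggers left behind by the rename.
  old_triggers.each { |name| statements << %(DROP TRIGGER IF EXISTS "#{name}" ON "#{table}";) }
  # 3. Install the one-way sync function and its trigger.
  statements << sync_function
  statements << %(CREATE TRIGGER "#{fn}" BEFORE INSERT OR UPDATE ON "#{table}" FOR EACH ROW EXECUTE FUNCTION #{fn}();)
  statements
end

RESTORE_SQL = restore_statements(
  table: 'plan_limits',
  old: 'web_hook_calls',
  new: 'web_hook_calls_high',
  old_triggers: ['trigger_placeholder'] # placeholder, not the real trigger name
)

puts RESTORE_SQL.join("\n\n")
```

Using `IF NOT EXISTS` and `IF EXISTS` lets the same statements run on environments in different states, which is the situation staging and production were in.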

The web_hook_calls_high column still needs to be deleted.

This MR is in the process of being deployed.

Outcomes of this retrospective

  • Understand the root cause of this problem. We've executed several renames in the past, and it is not clear why this one caused a problem. Perhaps the PlanLimits model is not backwards compatible; see https://docs.gitlab.com/ee/development/multi_version_compatibility.html
  • Understand under which circumstances it is acceptable to rename a column. In !108410 (merged), a column was renamed to follow the same naming pattern as the others. In retrospect, perhaps this didn't require a column rename, and a Ruby alias could have had the same effect.
  • Define steps to prevent this from happening again.
  • Is it possible to detect these failures during the development cycle (e.g. in merge requests)?
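
As an illustration of the Ruby alias mentioned in the second outcome, here is a minimal sketch in plain Ruby. `PlanLimitsSketch` is a hypothetical stand-in built on a Struct, not the real model. In a Rails model the equivalent would presumably be `alias_attribute :web_hook_calls_high, :web_hook_calls`.

```ruby
# Stand-in for the model; a Struct replaces ActiveRecord in this sketch.
PlanLimitsSketch = Struct.new(:web_hook_calls) do
  # Expose the stored column under the name that matches the other limits,
  # without touching the database schema.
  alias_method :web_hook_calls_high, :web_hook_calls
  alias_method :web_hook_calls_high=, :web_hook_calls=
end

limits = PlanLimitsSketch.new(7)
limits.web_hook_calls_high = 9
puts limits.web_hook_calls # both names read and write the same value
```

Because the column keeps its name, no migration, trigger, or post-deployment cleanup is involved, and old and new application versions can run against the same schema.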
Edited by Mayra Cabrera