RCA: Stale database schema problem caused by `db/post_migrate/20230711093010_drop_default_partition_id_value_for_ci_tables.rb`

Problem

We had a customer escalation: downstream pipelines could not be created after a zero-downtime upgrade from 16.1 to 16.2.

The error logged by Ci::CreateDownstreamPipelineWorker was:

PG::NotNullViolation: ERROR:  null value in column "source_partition_id" of relation "ci_sources_pipelines" violates not-null constraint
DETAIL:  Failing row contains (2328526, 3928, 3912358, 3928, 3912288, 44676629, 100, null).

What happened?

  1. The application was running on 16.1.
  2. The database migrations for 16.2 were run.
  3. The application Puma/Sidekiq nodes were restarted.
  4. The database post migrations were run.
  5. The Ci::CreateDownstreamPipelineWorker started to fail with a data integrity error.
  6. Restarting Puma/Sidekiq fixed the issue.

Why did it fail?

  1. When the application was loaded in step 3, it read the database structure and cached ci_sources_pipelines.source_partition_id as having default 100.
  2. The post-migration 20230711093010_drop_default_partition_id_value_for_ci_tables.rb then changed that default to null: ci_sources_pipelines.source_partition_id default null.
  3. The code at https://gitlab.com/gitlab-org/gitlab/-/blob/v16.2.8-ee/app/models/ci/sources/pipeline.rb#L44 did not set the value, because the cached default of 100 was already present. Because the value equaled the cached default, it was not sent in INSERT INTO ci_sources_pipelines (source_partition_id); the application expected the database to fill it in via the default.
  4. After the restart, the application read the database default as nil, so Ci::Sources::Pipeline#set_source_partition_id copied the source_job.partition_id value.
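The sequence above can be reproduced in a minimal plain-Ruby sketch. It does not use ActiveRecord or PostgreSQL: FakeTable and FakeModel are hypothetical stand-ins, and the rule "skip attributes equal to the cached default" is a simplification of how the application builds the INSERT.

```ruby
# Hypothetical stand-in for a PostgreSQL table with a NOT NULL column.
class FakeTable
  attr_accessor :column_default
  attr_reader :rows

  def initialize(column_default)
    @column_default = column_default
    @rows = []
  end

  # Columns missing from the INSERT fall back to the current database default.
  def insert(attrs)
    value = attrs.fetch(:source_partition_id, column_default)
    raise ArgumentError, 'null value in column "source_partition_id"' if value.nil?

    @rows << { source_partition_id: value }
  end
end

# Hypothetical stand-in for an application process that caches the schema at boot.
class FakeModel
  def initialize(table)
    @table = table
    reload_schema
  end

  def reload_schema
    @cached_default = @table.column_default
  end

  def create(job_partition_id)
    value = @cached_default
    # Simplified callback: copy from the job only when nothing is set yet.
    value = job_partition_id if value.nil?
    # Simplified INSERT building: values equal to the cached default are omitted.
    attrs = value == @cached_default ? {} : { source_partition_id: value }
    @table.insert(attrs)
  end
end

table = FakeTable.new(100)
model = FakeModel.new(table)  # step 3: process boots and caches default 100
model.create(101)             # succeeds: the database fills in 100
table.column_default = nil    # step 4: the post-migration drops the default

begin
  model.create(101)           # step 5: column omitted, database has no default
  stale_result = :ok
rescue ArgumentError
  stale_result = :not_null_violation
end

model.reload_schema           # step 6: a restart re-reads the schema
model.create(101)
fresh_value = table.rows.last[:source_partition_id]
```

With the stale cache the insert fails; after the reload the value is copied from the job and sent explicitly.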

Why did the existing mitigation fail?

  1. We recently had another issue caused by a stale database schema cache: https://gitlab.com/gitlab-com/feature-change-locks/-/issues/38.
  2. We identified #412980 (closed) as a solution to the root cause.
  3. We forgot to add source_partition_id to columns_changing_defaults: #427489 (comment 1592168571)
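As a rough illustration of why the omission mattered, here is a plain-Ruby sketch. It assumes the mitigation works by always sending listed columns in the INSERT, even when their value equals the cached default. That behavior is an assumption, not a description of GitLab's implementation, and insert_attributes is a hypothetical helper.

```ruby
# Hypothetical helper: returns the attributes that would be sent in the INSERT.
# Values equal to the cached default are skipped unless the column is listed
# as having a changing default.
def insert_attributes(attributes, cached_defaults, columns_changing_defaults)
  attributes.select do |column, value|
    columns_changing_defaults.include?(column) || cached_defaults[column] != value
  end
end

cached_defaults = { partition_id: 100, source_partition_id: 100 }
attributes = { partition_id: 100, source_partition_id: 100 }

# Hypothetical list that omits source_partition_id.
incomplete = insert_attributes(attributes, cached_defaults, [:partition_id])

# Same call with source_partition_id listed as well.
complete = insert_attributes(attributes, cached_defaults,
                             [:partition_id, :source_partition_id])
```

In this sketch the incomplete list leaves source_partition_id out of the INSERT, so the database default (now null) applies; the complete list sends the value explicitly.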

Possible solutions

  1. Implement a proactive schema reload across the cluster: #412980 (comment 1404785697) or #412980 (comment 1408831804).
  2. Forbid DDL changes (adding, changing, or removing columns) in post migrations.
  3. (New) Run CI tests in which the application starts with the old DB structure and the database is updated mid-way to the new DB structure.
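One possible shape for the first option is sketched below in plain Ruby. The design in the linked comments is not reproduced here: the version-comparison approach, the Database struct, the AppProcess class, and the earlier schema version number are all assumptions made for illustration.

```ruby
# Hypothetical stand-in for the database: a schema version plus a column default.
Database = Struct.new(:schema_version, :column_default)

# Hypothetical application process that caches the default at boot and
# reloads it when the schema version it last saw is out of date.
class AppProcess
  attr_reader :cached_default

  def initialize(database)
    @database = database
    reload_schema
  end

  def reload_schema
    @seen_version = @database.schema_version
    @cached_default = @database.column_default
  end

  # Intended to be called before handling a job or request.
  # Returns true when a reload happened.
  def refresh_schema_if_stale
    return false if @seen_version == @database.schema_version

    reload_schema
    true
  end
end

# The starting version is an illustrative value, not a real migration.
database = Database.new(20230711093009, 100)
process = AppProcess.new(database)

# The post-migration runs while the process is still up.
database.schema_version = 20230711093010
database.column_default = nil

reloaded = process.refresh_schema_if_stale
```

After the check, the process sees the null default without a restart. A real cluster would also need a way to propagate the version change to every Puma and Sidekiq node.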
Edited by Kamil Trzciński