Geo 14.6.0 error: Column "verification_state" of relation "pages_deployments" does not exist

Summary

After upgrade to 14.6.0 some Geo customers may face the following errors repeating every minute in their Sidekiq log:

PG::UndefinedColumn: ERROR: column "verification_state" of relation "pages_deployments" does not exist 
LINE 1: ...config_name:main*/ UPDATE "pages_deployments" SET "verificat..."

The relation may also vary between pages_deployments and lfs_objects or uploads

Steps to reproduce

Reproducing this requires at least one Pages verification job that has exceeded its 8 hour runtime without failing/succeeding.

Example Project

This is not project specific.

What is the current bug behavior?

VerificationTimeoutWorker fails with an error when it encounters a timed out Pages, LFS Object, or Upload related verification job and is unable to mark these jobs as timed out.

What is the expected correct behavior?

VerificationTimeoutWorker does not fail with an error when marking some verification jobs as timed out.

Relevant logs and/or screenshots

{
  "severity": "WARN",
  "time": "2021-12-24T10:27:16.017Z", …
  "queue": "geo:geo_verification_timeout",
  "class": "Geo::VerificationTimeoutWorker",
  "args": [
    "pages_deployment"
  ],
  "meta.caller_id": "Geo::VerificationCronWorker",
  "meta.feature_category": "geo_replication",
  "message": "Geo::VerificationTimeoutWorker JID-a947ff98b4813459b4319223: fail: 0.7699 sec",
  "job_status": "fail",
  "error_message": "PG::UndefinedColumn: ERROR:  column \"verification_state\" of relation \"pages_deployments\" does not exist\nLINE 1: ...config_name:main*/ UPDATE \"pages_deployments\" SET \"verificat...\n                                                             ^\n",
  "error_class": "ActiveRecord::StatementInvalid",
  "error_backtrace": [
    "lib/gitlab/database/load_balancing/connection_proxy.rb:126:in `block in write_using_load_balancer'",
    "lib/gitlab/database/load_balancing/load_balancer.rb:112:in `block in read_write'",
    "lib/gitlab/database/load_balancing/load_balancer.rb:172:in `retry_with_backoff'",
    "lib/gitlab/database/load_balancing/load_balancer.rb:110:in `read_write'",
    "lib/gitlab/database/load_balancing/connection_proxy.rb:125:in `write_using_load_balancer'",
    "lib/gitlab/database/load_balancing/connection_proxy.rb:67:in `block (2 levels) in <class:ConnectionProxy>'",
    "ee/lib/gitlab/geo/verification_state.rb:246:in `block in fail_verification_timeouts'",
    "app/models/concerns/each_batch.rb:98:in `block (2 levels) in each_batch'",
    "app/models/concerns/each_batch.rb:98:in `block in each_batch'",
    "app/models/concerns/each_batch.rb:68:in `step'",
    "app/models/concerns/each_batch.rb:68:in `each_batch'",
    "ee/lib/gitlab/geo/verification_state.rb:245:in `fail_verification_timeouts'",
    "ee/app/models/concerns/geo/verifiable_replicator.rb:20:in `fail_verification_timeouts'",
    "ee/app/workers/geo/verification_timeout_worker.rb:17:in `perform'", …
  ],
}

Output of checks

This bug happens on GitLab 14.6.0 self-managed instances using Geo replication features.

Possible fixes

Related 14.6.0 MRs:

The issue occurs due to an incorrect relation used when attempting to adjust the value for verification_state while marking a verification job as failed due to time-out.

The approach in the MRs linked above that cause this error took a different approach than their predecessors such as !60935 (merged)

The difference was that in !60935 (merged) the verification_state and associated columns were ALTERed into the base table of their model, whereas in the above MRs they were introduced as new tables via a 1:1 relation to the base table.

The code used when seeking out the entries to mark as timeout was not changed to account for this relationship difference:

As a result, in the following snippet when running for Pages data, the relation fetched is PagesDeployment instead of the expected PagesDeploymentState, and PagesDeployment cannot perform an update_all with attributes referencing column verification_state which resides in a different relation PagesDeploymentState:

verification_timed_out.each_batch do |relation|
    relation.update_all(attrs)
end

See also premium customer ticket https://gitlab.zendesk.com/agent/tickets/258741 (internal link)

/cc @nhxnguyen @aakriti.gupta @vsizov @katrinleinweber

Edited by Harsh Chouraria