Geo: Introduce "verification disabled" state

Michael Kozono requested to merge mk/scope-verification-properly into master

What does this MR do and why?

Problem 1: Currently, if a resource cannot be verified because it is in object storage, the resource is marked "verification succeeded" in order to stop verification from happening, and to avoid a permanent loop of "verification failed, resync, repeat". But "verification succeeded" misrepresents the resource, both in the data and to the sysadmin.

Problem 2: Currently, if a resource cannot be verified because the primary has not checksummed it yet, the resource falls into a loop of "verification failed, resync, repeat" until the resource becomes checksummed on the primary. This is wasteful, though at least the problem is transient.

This MR introduces a "verification disabled" state for these cases. No wasteful loops, and no inaccurate representation of what's verified.
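
To make the intended behavior concrete, here is a minimal sketch of the decision a secondary makes after a verification attempt. It is illustrative only, not the code in this MR: the constant and method names are hypothetical, the integer values 2, 3 and 4 match the states shown in the validation steps, and the values 0 and 1 are assumptions.

```ruby
# Hypothetical sketch; names are illustrative, not GitLab's actual implementation.
# Values 2, 3 and 4 match the states mentioned in this description;
# 0 and 1 are assumptions.
VERIFICATION_STATES = {
  verification_pending: 0,
  verification_started: 1,
  verification_succeeded: 2,
  verification_failed: 3,
  verification_disabled: 4
}.freeze

# Decide which state a secondary should record after a verification attempt.
def next_verification_state(object_stored:, primary_checksum:, secondary_checksum:)
  # The resource cannot be verified: it is in object storage, or the
  # primary has not checksummed it yet. Record "disabled" instead of
  # a false success or a failure that would trigger a resync.
  return :verification_disabled if object_stored || primary_checksum.nil?

  primary_checksum == secondary_checksum ? :verification_succeeded : :verification_failed
end

state = next_verification_state(object_stored: true, primary_checksum: "abc", secondary_checksum: "abc")
puts VERIFICATION_STATES.fetch(state) # prints 4
```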

Maintainer: Please don't squash the commits.

Part of #299819 (closed)

Screenshots

See Javiera's screenshots during testing: !87034 (comment 955254106)

How to set up and validate locally

How to validate fix for object stored blobs

First, reproduce the problem on master branch:

  1. Check out the master branch on a GDK with Geo set up
  2. Configure object storage
  3. Visit /admin/geo/sites
  4. Find the secondary site on the page
  5. Click the Edit button (pencil icon)
  6. Check "Allow this secondary site to replicate content on Object Storage" and click Save
  7. Upload a file in an issue
  8. On the secondary Rails console, wait until Upload.last.replicator.registry.synced? returns true
  9. On the secondary Rails console, output verification state: Upload.last.replicator.registry.verification_state
  10. 🚫 Notice that it is 2, for "verification succeeded", which is inaccurate.

Try to repro the problem again on MR branch:

  1. On the secondary: git checkout mk/scope-verification-properly; gdk restart rails
  2. If ps aux | grep sidekiq-cluster | grep -v "grep" returns more than 1 line per running GDK, then kill the processes with pkill -lf 'sidekiq-cluster' and wait for GDK to start new ones
  3. Upload a file in an issue
  4. On the secondary Rails console, wait until Upload.last.replicator.registry.synced? returns true
  5. On the secondary Rails console, output verification state: Upload.last.replicator.registry.verification_state
  6. Notice that it is 4, for "verification disabled", which is accurate.
  7. On the secondary Rails console: Geo::MetricsUpdateWorker.new.perform to immediately update the status in the UI
  8. Notice no failure in the Upload verification progress bar, and notice that the verification progress bar total is 1 less than the replication progress bar
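
The "wait until ... returns true" steps above can be scripted rather than re-run by hand. The helper below is a hypothetical console convenience (wait_until is not part of GitLab or GDK), shown as a sketch that polls a block until it is truthy or a timeout elapses.

```ruby
# Hypothetical console helper; not part of GitLab or GDK.
# Polls the given block until it returns a truthy value (returns true)
# or the timeout elapses (returns false).
def wait_until(timeout: 60, interval: 1)
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  loop do
    return true if yield
    return false if Process.clock_gettime(Process::CLOCK_MONOTONIC) >= deadline

    sleep interval
  end
end

# In the secondary Rails console, this could be used as:
#   wait_until { Upload.last.replicator.registry.synced? }
#   Upload.last.replicator.registry.verification_state
```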

How to validate fix for the "not yet checksummed" problem

First, reproduce the problem on master branch.

On the primary:

  1. Stop Sidekiq so verification doesn't occur automatically: gdk stop rails-background-jobs
  2. Open Rails console: bin/rails console
  3. Clear primary checksum for an upload: u = Upload.first; u.verification_checksum = nil; u.verification_pending!

On the secondary:

  1. Stop Sidekiq so verification doesn't occur automatically: gdk stop rails-background-jobs
  2. Kill any lingering sidekiq processes if needed: pkill -lf 'sidekiq-cluster'
  3. Open Rails console: bin/rails console
  4. Trigger verification for that upload, then output verification state: u = Upload.first; u.replicator.verify; u.replicator.registry.verification_state
  5. 🚫 Notice that verification_state is 3, meaning "verification failed".
  6. Refresh this site's status data: Geo::MetricsUpdateWorker.new.perform
  7. In browser, visit /admin/geo/sites
  8. 🚫 Notice 1 failure in the Upload replication progress bar
  9. 🚫 This represents a transient verification failure when the resource is not yet checksummed on the primary. If we run Sidekiq, this will cause a verification => sync loop until the resource is checksummed on the primary, at which point verification will succeed on the secondary.

Now we can validate the fix on the MR branch.

(Note this is continued from above.) On the secondary:

  1. git checkout mk/scope-verification-properly; gdk restart rails-web
  2. Exit the already open Rails console (apparently reload! isn't enough): exit
  3. Open Rails console: bin/rails console
  4. Resync the upload (because it's currently marked failed sync), verify it, then output verification state: u = Upload.first; u.replicator.send(:download); u.replicator.verify; u.replicator.registry.verification_state
  5. Notice that verification_state is 4, meaning "verification disabled".
  6. Refresh this site's status data: Geo::MetricsUpdateWorker.new.perform
  7. In browser, visit /admin/geo/sites
  8. Notice no failure in the Upload verification progress bar, and notice that the verification progress bar total is 1 less than the replication progress bar
  9. This shows that there will be no verification => sync loop. When the resource becomes checksummed on the primary, a checksum_succeeded event will be created, which causes all secondaries to immediately reverify the resource. That is exactly the right time to attempt verification.
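
The flow in the last step can be modeled in a few lines. This is a toy simulation under stated assumptions: FakeRegistry and its methods are hypothetical stand-ins for a secondary's registry record, not GitLab classes, and the resync counter only illustrates that the "disabled" path never triggers a resync.

```ruby
# Hypothetical model of a secondary's registry record; not GitLab's actual classes.
class FakeRegistry
  attr_reader :state, :resyncs

  def initialize
    @state = :verification_pending
    @resyncs = 0
  end

  # Verify against the primary's checksum (nil means "not yet checksummed").
  def verify(primary_checksum, local_checksum)
    @state =
      if primary_checksum.nil?
        # Nothing to compare against: no failure is recorded, so no resync.
        :verification_disabled
      elsif primary_checksum == local_checksum
        :verification_succeeded
      else
        @resyncs += 1 # a genuine mismatch triggers a resync
        :verification_failed
      end
  end

  # Stand-in for handling a checksum_succeeded event from the primary.
  def on_checksum_succeeded(primary_checksum, local_checksum)
    verify(primary_checksum, local_checksum)
  end
end

registry = FakeRegistry.new
registry.verify(nil, "abc")
puts registry.state   # prints verification_disabled
registry.on_checksum_succeeded("abc", "abc")
puts registry.state   # prints verification_succeeded
puts registry.resyncs # prints 0
```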

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Michael Kozono
