Update checkpoint data_types

Igor Frenkel requested to merge 414977-update-pm-checkpoints into master

What does this MR do and why?

Existing checkpoints in an instance's database have their data_type set to advisories. This is incorrect: the data_type should be licenses. As a result, PackageMetadata::SyncService does not find the last checkpoint and starts the sync from scratch.

The sync is non-destructive, so no data is lost because of this bug, but re-syncing the entire dataset is unnecessary and costly in terms of resources.

Bug timeline

  1. Add fields to Checkpoint (!118939 - merged) is applied
    • adds data_type column
    • adds Enums::PackageMetadata::DATA_TYPES which is { advisories: 1, licenses: 2 }
    • sets existing checkpoint entries to data_type=advisories (or 1)
    • but this does not affect sync since the unique key on checkpoints only uses purl_type
  2. Add package metadata ingestion for version form... (!120027 - merged) is applied
    • changes unique key from purl_type to (data_type, version_format, purl_type)
    • the next time sync runs, it looks for checkpoints matching data_type=licenses (data_type=2) as well as purl_type
      • query changes from select * from pm_checkpoints where purl_type = X to select * from pm_checkpoints where purl_type = X and data_type = 2 ...
    • the existing (mislabeled) checkpoint is not found, so a new one is created starting at sequence: 0, chunk: 0
  3. after this MR is applied
    • the "mislabeled" checkpoints are now data_type=2 and will be found the next time sync runs
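
The timeline above can be reproduced with a small in-memory model. This is only a sketch: the hashes stand in for rows in pm_checkpoints, the sequence, chunk, purl_type and version_format values are illustrative, and the helper names (find_by_purl_type, find_by_full_key, relabel) are hypothetical rather than part of PackageMetadata::SyncService or the migration itself.

```ruby
# Enum values as described in the timeline above.
DATA_TYPES = { advisories: 1, licenses: 2 }.freeze

# Stand-in for a pm_checkpoints row that was written by a licenses sync
# but labeled data_type=advisories. All values are illustrative.
def mislabeled_checkpoints
  [{ data_type: DATA_TYPES[:advisories], version_format: 1, purl_type: 1,
     sequence: 1111, chunk: 0 }]
end

# Hypothetical lookup matching the original unique key (purl_type only).
def find_by_purl_type(rows, purl_type)
  rows.find { |row| row[:purl_type] == purl_type }
end

# Hypothetical lookup matching the new unique key
# (data_type, version_format, purl_type).
def find_by_full_key(rows, data_type:, version_format:, purl_type:)
  rows.find do |row|
    row[:data_type] == data_type &&
      row[:version_format] == version_format &&
      row[:purl_type] == purl_type
  end
end

# Hypothetical equivalent of the data fix: relabel advisories rows as licenses.
def relabel(rows)
  rows.map do |row|
    next row unless row[:data_type] == DATA_TYPES[:advisories]

    row.merge(data_type: DATA_TYPES[:licenses])
  end
end

rows = mislabeled_checkpoints
key = { data_type: DATA_TYPES[:licenses], version_format: 1, purl_type: 1 }

# Before the fix the lookup misses, so a sync would restart at sequence 0.
before_fix = find_by_full_key(rows, **key)
# After relabeling, the existing checkpoint is found again.
after_fix = find_by_full_key(relabel(rows), **key)
```

In this model, the lookup on purl_type alone still finds the row, which is why the mislabeling had no visible effect until the unique key changed.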

Example checkpoint lifecycle

How to set up and validate locally

  1. Add advisories checkpoints manually: gdk psql -c 'insert into pm_checkpoints (sequence, chunk, data_type, purl_type, version_format) values (1111, 0, 1, 1, 1)'
  2. Run the migration; the data above should be removed.
  3. Run ingestion, wait for at least one new checkpoint to appear.
  4. Verify that the checkpoints are of type licenses, that is, data_type has the integer value 2.

How to run ingestion

Run ingestion via rails runner

ingest.rb: ingest.rb

Run this via: bundle exec rails runner ingest.rb

Sync progress can be seen in log/application_json.log, where the sync url is indicated.

Note: The PM_SYNC_INDEV environment flag controls whether sync runs in the development environment. It is false by default. To enable sync, run export PM_SYNC_INDEV=true before running ingest.rb.
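
As a rough illustration of how a flag like this might gate a sync, the sketch below reads the variable and treats only the string true as enabled. The helper name sync_enabled_in_dev? and the exact parsing are assumptions made for this example; the actual check in the codebase may differ.

```ruby
# Hypothetical helper: returns true only when PM_SYNC_INDEV is set to "true".
# The environment is passed in as an argument so the check can be exercised
# with a plain Hash instead of mutating the real ENV.
def sync_enabled_in_dev?(env = ENV)
  env.fetch('PM_SYNC_INDEV', 'false').strip.downcase == 'true'
end
```

With this sketch, an unset variable behaves the same as PM_SYNC_INDEV=false, which matches the default described in the note.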

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #414977 (closed)

Edited by Igor Frenkel
