Skip to content

Skip advisory scans during initial sync

Problem to solve

Whan an instance syncs with the Package Metadata DB (PM DB) for the very first time, it triggers vulnerability scans for a very large number of advisories.

  • This might not scale, and it possibly impacts the responsiveness of the instance by enqueuing too many jobs.
  • It doesn't seem needed. Users should already have regular scans on their projects before continuous scans is enabled.

The same problem applies when all the advisories of the PM DB are exported again. This might be needed to fix export data.

To a lesser extent, a similar problem might occur when an instances catches up with PM DB after not being able to sync for a significant amount of time.

This problem was raised in #371063 (comment 1494215853).

Overview of advisory scan fan out. On initial sync, N would be the total amount of advisories in the package metadata database.

Overview of the advisory scan fan out

Links

Proposal

TBD. See current proposals:

A. Disable FF until the initial sync is completed

#371063 (comment 1494215853)

B. Tag initial export in order to skip scans

#371063 (comment 1496510422)

Pros

  • It's explicit.
  • It's robust, and supports a scenario where an instance can't sync for a while.
  • It supports all sorts of use-cases, like re-exporting for technical reasons without triggering scans.

Cons

  • We have to update the export format, the exporter, and the syncer.

C. Skip scans based on updated_at timestamp

#371063 (comment 1496516167)

Pros

  • It supports re-exporting everything.

Cons

  • A new field must be added to the exported JSON object.
  • It doesn't support re-ingesting everything.
  • Instances that can't sync for a while miss scans.

C2. Skip scans based on published date

Same as C. Skip scans based on ingestion date, but reuse the existing published_date field.

Pros

  • It supports re-exporting everything.
  • It supports re-ingesting everything.

Cons

  • Instances that can't sync for a while miss scans.
  • We miss old advisory recently added to the advisory database.

D. Skip scans when chunk is too large

#371063 (comment 1496519259)

Yet another option is to skip scans when the ingested chunk contains too many advisories.

TODO: evaluate pros and cons

E. Use a scan queue

Create a queue of advisories to be scanned with a set amount of advisories scanned at any one time.

Pros:

  • Comprehensive - can scan the entire backlog without worrying about missing something.
  • Monolith abstractions already exist for this limited capacity worker and state_machines-activerecord.
  • Doesn't need elaborate schemes for denoting initial scans or where scans should start.
  • No changes on the external service nor the protocol side.

Cons:

  • Will scan all ingested advisories - so the resources needed to scan all of the ingested data will still be used, there just won't be a thundering herd problem.

Implementation Plan

Proposal C chosen for expediency and explicit nature of solution.

Does this feature require an audit event?

/cc @ifrenkel @hacks4oats @gonzoyumo

Edited by Oscar Tovar