Add a scheduled worker and a sync service to ingest advisories exported by License DB into the vulnerability_advisories table of the main Postgres database.
Sashi Kumar Kumaresan changed title from Create or update vulnerability_advisories when gemnasium advisories is updated to Add scheduled sidekiq job to update vulerability_advisories when advisories are updated
Fabien Catteau changed title from Add scheduled sidekiq job to update vulerability_advisories when advisories are updated to Add scheduled sync worker for vulnerability advisories
Fabien Catteau changed the description
@ifrenkel @hacks4oats Do either of you want to refine this issue? You might split it up and create additional issues as part of the planning breakdown. New issues should be added to the same epic.
@ifrenkel @fcatteau I've started filling out some of the implementation plan, but I have some questions about the contents of the objects contained in the new lines. When reading the option selected, I got the impression that a line in the export could contain either a package license object or an advisory object. Here's an example of what I believe this could look like (note that I used a mock format for the advisory object).
If this is true, then I think this might create some complexity for the parsing and validation steps. I'd like to propose an approach where we export all advisories and licenses separately, and then reference them in the package metadata object via identifiers. This would simplify the validation logic, and it would also reduce the number of times we try to insert a license. Here's an example of what this new structure would look like:
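A minimal mock of the separated structure described above. All field names (spdx_identifier, affected_range, license_ids, advisory_ids, and so on) are hypothetical placeholders, not the actual export schema; the point is only that licenses and advisories become standalone records that package entries reference by id.

```ruby
require 'json'

# Hypothetical v2 export layout (mock field names): licenses and advisories
# are exported once as standalone records...
licenses = [{ 'id' => 'lic-1', 'spdx_identifier' => 'MIT' }]
advisories = [{ 'id' => 'adv-1', 'identifier' => 'CVE-2023-0001', 'affected_range' => '<1.2.3' }]

# ...and package metadata lines reference them via identifiers instead of
# embedding full license/advisory objects on every line.
packages = [
  {
    'purl_type' => 'gem',
    'name' => 'rails',
    'version' => '7.0.0',
    'license_ids' => ['lic-1'],
    'advisory_ids' => ['adv-1']
  }
]

puts JSON.generate(packages.first)
```

With this shape, ingesting a package line never needs to validate or upsert a full license/advisory object; it only records the relationships.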
For licenses and advisories, we would change the bulk insertable task so that we only create relationships between the package/advisories and drop the attempt to create a new license/advisory record. WDYT?
When reading the option selected, I got the impression that a line in the export could contain either a package license object or an advisory object.
@hacks4oats License information and advisories would be exported separately, using separate Cloud Run services, and separate GCP Buckets for export.
For licenses and advisories, we would change the bulk insertable task so that we only create relationships between the package/advisories and drop the attempt to create a new license/advisory record. WDYT?
If we go for full database normalization then the backend should upsert into pm_packages when ingesting advisories, and also when ingesting licenses, like today.
However, we might denormalize the database schema to achieve better performance, and to better isolate Vulnerability Scanning from License Scanning.
In the current implementation plan I've added, the assumption has been to create new v2 classes that we would use when a feature flag is enabled (the flag would need to be created and could be named something like package_metadata_synchronization_v2). The motivation for this is to provide an easy way to back out the changes in case there's an issue with the new version format or the sync implementation.
Pros
FF toggles are faster than an MR revert
Clear distinction between the v1 and v2 implementations
We can improve on the classes with some of the lessons learned in the first implementation. For example, we could have a base class that's shared by both the GCP and Offline connectors.
Cons
There is some duplication of code.
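The FF-gated selection between the v1 and v2 classes could be sketched roughly as below. The flag name package_metadata_synchronization_v2 is the one proposed above; everything else (the class layout, the sync_service_class helper, and the stubbed Feature module standing in for the monolith's feature flag check) is hypothetical.

```ruby
# Stub standing in for the monolith's feature flag API, so the sketch runs
# standalone; here we pretend the flag is enabled.
module Feature
  def self.enabled?(_name)
    true
  end
end

module PackageMetadata
  class SyncService # existing v1 implementation
  end

  module V2
    class SyncService # new v2 implementation, behind the flag
    end
  end

  # Hypothetical helper: pick the implementation based on the flag, so a
  # toggle (faster than an MR revert) backs out the v2 code path.
  def self.sync_service_class
    if Feature.enabled?(:package_metadata_synchronization_v2)
      V2::SyncService
    else
      SyncService
    end
  end
end

puts PackageMetadata.sync_service_class
```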
Oscar Tovar changed the description
Unfortunately, I wasn't able to complete the refinement for this before my PTO due to some higher priority work on the new license scanning feature. I'll be unassigning myself from this issue, but can help refine/work on this once I return.
That said, we probably need to alter the pm_checkpoints table to track import of advisories. Feel free to create an issue for this, or to simply cover it as a task of this issue.
The implementation plan looks good, but I have an alternate suggestion regarding the SyncWorker.
@hacks4oats apologies for this. I didn't review this comment before posting it. I realize that we already have this entry point with the Evented worker.
There are a few points that are still useful to discuss though:
we could push csv/json parsing out of DataObject into a dedicated parser which could handle either format
this would allow us to still handle v1 format which would be useful in case we split licenses from CVS MVC
Ingestion service may be a good candidate for splitting between advisories and licenses. On the one hand, it might be nice to add an advisory step to the current ingestion, as it would just work. On the other hand, the advisories step wouldn't be related to the other data. WDYT?
we could push csv/json parsing out of DataObject into a dedicated parser which could handle either format
this would allow us to still handle v1 format which would be useful in case we split licenses from CVS MVC
I'm not sure what this would look like, and might need an example to better understand the implementation. Would the parser parse each file instead of each line?
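One way the "dedicated parser" idea could look, sketched under assumptions: a parser object is chosen per file format and consumes a whole file (an IO), yielding one record per line, so both the v1 CSV format and a v2 NDJSON format go through the same interface. All class and method names here (CsvParser, NdjsonParser, parser_for) are hypothetical, not the actual monolith API, and the v1 column layout is a mock.

```ruby
require 'csv'
require 'json'
require 'stringio'

module PackageMetadata
  # v1 format: one CSV row per line (mock columns: purl_type,name,version,license).
  class CsvParser
    def parse(io)
      io.each_line.map { |line| CSV.parse_line(line.chomp) }
    end
  end

  # v2 format: one JSON object per line (NDJSON).
  class NdjsonParser
    def parse(io)
      io.each_line.map { |line| JSON.parse(line) }
    end
  end

  def self.parser_for(format)
    format == :ndjson ? NdjsonParser.new : CsvParser.new
  end
end

v1 = PackageMetadata.parser_for(:csv).parse(StringIO.new("gem,rails,7.0.0,MIT\n"))
v2 = PackageMetadata.parser_for(:ndjson).parse(StringIO.new(%({"name":"rails"}) + "\n"))
puts v1.first.inspect
puts v2.first.inspect
```

In this reading, the parser owns the file and iterates its lines, so DataObject no longer needs to know which serialization format it came from.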
Ingestion service may be a good candidate for splitting between advisories and licenses. On the one hand it might be nice to add an advisory step to the current ingestion as it would just work. On the other hand, the advisories step wouldn't be related to the other data.
My choice would be to split the advisories so that we could keep the license scanning and dependency scanning concerns separate. The other benefit I see is that having a separate ingestion job for advisories makes it possible to speed up the initial sync with the external license db, and it could serve as a control/status in the admin panel to be introduced by Show status of Package Metadata sync to instanc... (#394747).
@ifrenkel Thanks for updating the implementation plan! I've reread it and have one minor question. Should the advisory worker be PackageMetadata::Worker::SyncAdvisoriesWorker, or is it possible to use PackageMetadata::SyncAdvisoriesWorker instead? The latter is a bit more concise, but I'm not sure if this abides by the Ruby way of doing things.
I'm thinking that this needs to be broken up into at least 2 issues.
That makes sense to me. Do you think we can separate the issues so that the advisory work is separated from the license scanning work? My thinking is that we can prioritize the advisory scanning issues higher than the license scanning ones (since the new license scanning has now launched). One possible breakdown could look like this:
Create AdvisorySyncWorker
Create AdvisorySyncService
Create DataFile and AdvisoryDataObject
Edit: Also, I wanted to say thank you for linking the relevant discussions about the format of the advisories. This helped clear up a lot of the confusion I had around the structure of the data (I wasn't aware that we would use the same format used in GLAD).
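A very rough sketch of how the three pieces in the proposed breakdown could relate. The names AdvisorySyncWorker, AdvisorySyncService, and AdvisoryDataObject come from the list above; everything else (method names, the Struct fields, the inlined placeholder logic) is hypothetical. In the monolith the worker would be a cron-scheduled Sidekiq worker including ApplicationWorker rather than a plain Ruby class.

```ruby
module PackageMetadata
  # Hypothetical value object for one parsed advisory line.
  AdvisoryDataObject = Struct.new(:identifier, :affected_range)

  # Hypothetical service: in a real implementation this would enumerate
  # exported data files, parse each line into AdvisoryDataObject records,
  # and hand them to ingestion. Here it just returns a canned record.
  class AdvisorySyncService
    def execute
      [AdvisoryDataObject.new('CVE-2023-0001', '<1.2.3')]
    end
  end

  # Hypothetical scheduled worker: the cron entry point that delegates the
  # actual work to the service.
  class AdvisorySyncWorker
    def perform
      AdvisorySyncService.new.execute
    end
  end
end

puts PackageMetadata::AdvisorySyncWorker.new.perform.first.identifier
```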
separate the issues so that the advisory work is separated from the license scanning work
@hacks4oats are you talking about not doing the renames in this case? Like don't rename the sync service to PackageMetadata::LicenseSyncService but instead have PackageMetadata::SyncService and PackageMetadata::AdvisorySyncService?
@ifrenkel I was thinking of doing them, but after introducing the advisory-specific classes. Maybe I'm being too cautious here, but to me, keeping the code where it is helps with the stability of the license scanning feature while we work on ingesting advisories. I don't have a strong opinion on this and can see the clarity brought by renaming PackageMetadata::SyncService to PackageMetadata::LicenseSyncService when creating PackageMetadata::AdvisorySyncService.
@hacks4oats these are good points. One thing to note is that the two sync services would share a lot of common functionality, to the point where one of the points in the plan is to extract it into a base class. If we do this, we'd be touching license code anyway. WDYT?
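The shared-base-class idea could be sketched as follows, under assumptions: the common sync flow lives in a base class and each subclass only supplies its own data source and ingestion step. Class names mirror the ones discussed above; the method names (fetch_files, ingest) and placeholder bodies are illustrative, not the real API.

```ruby
module PackageMetadata
  # Hypothetical base class holding the flow common to both sync services.
  class BaseSyncService
    def execute
      fetch_files.flat_map { |file| ingest(file) }
    end

    private

    def fetch_files
      raise NotImplementedError
    end

    def ingest(_file)
      raise NotImplementedError
    end
  end

  class LicenseSyncService < BaseSyncService
    private

    def fetch_files
      ['licenses/0.ndjson'] # placeholder: would list exported license files
    end

    def ingest(file)
      ["ingested licenses from #{file}"]
    end
  end

  class AdvisorySyncService < BaseSyncService
    private

    def fetch_files
      ['advisories/0.ndjson'] # placeholder: would list exported advisory files
    end

    def ingest(file)
      ["ingested advisories from #{file}"]
    end
  end
end

puts PackageMetadata::AdvisorySyncService.new.execute.first
```

Renaming SyncService to LicenseSyncService would then happen as part of pulling the shared flow up into the base class, rather than as a separate change.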
@hacks4oats the previous implementation plan had a point about adding a validator. I temporarily excluded it from the current implementation plan because I want to discuss it.
The validators referenced have to do with public schemas, whereas our sync protocol is not for third parties to publish or consume; it's between the external db and the monolith. So I was thinking a validator isn't really necessary. WDYT?
@ifrenkel I added the validator to the implementation plan to act as a safeguard during the parsing of the exported artifacts. I vaguely remember that some entries in the license exports were missing a field, which caused the synchronization to error out, and thought that validation could help guard against something like this.
Pros
We can track entries that don't pass validation either through a counter, logs, or both.
Cons
There's a performance impact (whether small or large) from validating each line.
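The per-line safeguard discussed above could be as small as the sketch below: check required fields on each parsed record, count (or log) the failures, and skip invalid entries instead of aborting the whole sync. The class name, field names, and counter are all hypothetical.

```ruby
require 'json'

# Hypothetical per-line validator: required field names are mock values,
# and a simple counter stands in for the metrics/logging mentioned above.
class AdvisoryLineValidator
  REQUIRED_FIELDS = %w[identifier affected_range].freeze

  attr_reader :invalid_count

  def initialize
    @invalid_count = 0
  end

  def valid?(record)
    ok = REQUIRED_FIELDS.all? { |f| record[f] && !record[f].empty? }
    @invalid_count += 1 unless ok
    ok
  end
end

validator = AdvisoryLineValidator.new
lines = [
  '{"identifier":"CVE-2023-0001","affected_range":"<1.2.3"}',
  '{"identifier":""}' # missing/empty fields: skipped, not fatal
]
valid = lines.map { |l| JSON.parse(l) }.select { |r| validator.valid?(r) }
puts valid.size
puts validator.invalid_count
```

The counter gives the tracking mentioned in the pros, and the cost per line is a handful of hash lookups, which bounds the performance concern in the cons.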
I added a weight of 5 to this issue and believe it can be broken up into several issues. It will definitely be multiple MRs. As I don't have time left this week, I believe we can proceed on the first two points of the plan. The exception is ingestion (in point 2), which may change depending on Spike: Should we import CVEs and affected packa... (#404996 - closed)
Let me know your thoughts @hacks4oats about this being ready. I can break this issue up early next week to reduce the weight.