Optimize CVE enrichment ingestion (#490985) · Issues · GitLab.org / GitLab

Optimize CVE enrichment ingestion

## Overview

As discussed in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/165782+ (comments [1](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/165782#note_2116110112), [2](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/165782#note_2114910651), [3](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/165782#note_2116770649)), the current implementation of CVE enrichment ingestion updates entries even when their data has not changed (e.g. the EPSS score stayed the same over two days), because the `updated_at` value changes on every ingestion. This results in unnecessary writes. Removing these updates would improve performance.

There are three options for this:

1. Update entries only if the score has changed.
2. Insert all new data in a new partition separated by date, then drop the previous day's data.
3. Update the exporter to export only updated rows while keeping the ingestion process as-is.

We need to understand which of these options strikes the best balance between effort and benefit, and implement accordingly.

## Action plan

- [ ] Understand which approach is healthier, in terms of effort vs benefit.

### Option 1

- [x] Change the upsert to:
  ```sql
  DO UPDATE SET "score" = excluded."score", "updated_at" = excluded."updated_at"
  WHERE pm_epss.score != excluded."score"
  ```
- [ ] Decide whether this change belongs solely in CVE enrichment ingestion or whether it may be implemented as part of `app/models/concerns/bulk_insert_safe.rb`

### Option 2

- [ ] Implement [partitioning](https://docs.gitlab.com/ee/development/database/partitioning/) by date to ingest data anew instead of upserting (see comments [1](https://gitlab.com/groups/gitlab-org/-/epics/11544#note_1960893977), [2](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/165782#note_2116110112))
- [ ] Drop the previous day's partition after ingesting new data

### Option 3

- [ ] Update the exporter to export only updated rows
- [ ] Discuss exporter logic:
  * Compare rounded exported data with full scores (ensure the comparison is made against rounded values, as the GitLab DB stores rounded values while PMDB stores full values)
  * For example, add another column to store the previously exported rounded value
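
The exporter comparison discussed for Option 3 could look roughly like the sketch below. It is only an illustration under stated assumptions: the `EpssRow` struct, the `last_exported_score` column name, the helper names, and the two-decimal `EXPORT_PRECISION` are all hypothetical, and the real exporter and the precision GitLab actually stores may differ. The CVE identifiers are made-up placeholders.

```ruby
# Assumption: the number of decimal places kept on the GitLab side.
# The real precision is not stated in this issue.
EXPORT_PRECISION = 2

# Hypothetical row shape: PMDB keeps the full score plus the rounded
# value that was last exported (the "another column" idea above).
EpssRow = Struct.new(:cve, :score, :last_exported_score, keyword_init: true)

# Select only rows whose rounded score differs from the rounded value
# that was exported previously. Rows never exported are always selected.
def rows_to_export(rows, precision: EXPORT_PRECISION)
  rows.select do |row|
    row.last_exported_score.nil? ||
      row.score.round(precision) != row.last_exported_score
  end
end

# After a successful export, remember the rounded value that was sent.
def mark_exported(rows, precision: EXPORT_PRECISION)
  rows.each { |row| row.last_exported_score = row.score.round(precision) }
end

rows = [
  # Full score moved, but the rounded value is still 0.12: skipped.
  EpssRow.new(cve: 'CVE-0000-0001', score: 0.12345, last_exported_score: 0.12),
  # Rounded value changes from 0.12 to 0.13: exported.
  EpssRow.new(cve: 'CVE-0000-0002', score: 0.12678, last_exported_score: 0.12),
  # Never exported before: exported.
  EpssRow.new(cve: 'CVE-0000-0003', score: 0.5, last_exported_score: nil)
]

changed = rows_to_export(rows)
puts changed.map(&:cve).inspect
```

With this sample data only the second and third rows are selected. Comparing against the stored rounded value, rather than the full PMDB score, is what prevents a re-export when the full score moves without changing what GitLab would store.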
