Optimize CVE enrichment ingestion
## Overview
As discussed in https://gitlab.com/gitlab-org/gitlab/-/merge_requests/165782+ (comments [1](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/165782#note_2116110112), [2](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/165782#note_2114910651), [3](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/165782#note_2116770649)), the current implementation of CVE enrichment ingestion rewrites entries even when their data has not changed (e.g. the EPSS score stayed the same over two days), because the `updated_at` value changes on every ingestion run. This results in unnecessary writes. Skipping these updates would improve performance.
There are three options for this:
1. Update entries only if the score has changed.
2. Insert all new data in a new partition separated by date, then drop the previous day's data.
3. Update the exporter to export only updated rows, while keeping the ingestion process as-is.
We need to understand which of these options offers the best balance between effort and benefit, and implement it accordingly.
## Action plan
- [ ] Understand which approach is healthier, in terms of effort vs benefit.
### Option 1
- [x] Change the upsert to:
```sql
-- Only rewrite the row when the incoming score differs from the stored one.
DO UPDATE
SET "score" = excluded."score", "updated_at" = excluded."updated_at"
WHERE pm_epss.score != excluded."score"
```
- [ ] Decide whether this change belongs solely in CVE enrichment ingestion or whether it could be implemented as part of `app/models/concerns/bulk_insert_safe.rb`
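
To illustrate the effect of the conditional upsert in Option 1, here is a minimal sketch that runs the same `DO UPDATE ... WHERE` clause against an in-memory SQLite database. SQLite is only a stand-in for PostgreSQL here, and the three-column `pm_epss` schema, the `cve` conflict target, and the helper names `ingest` and `snapshot` are assumptions made for the example, not the real table definition or ingestion code.

```python
import sqlite3

# Hypothetical minimal schema; the real pm_epss table may differ.
SCHEMA = """
CREATE TABLE pm_epss (
    cve        TEXT PRIMARY KEY,
    score      REAL NOT NULL,
    updated_at TEXT NOT NULL
)
"""

# Same shape as the upsert above: the WHERE clause turns the update
# into a no-op when the incoming score equals the stored score.
UPSERT = """
INSERT INTO pm_epss (cve, score, updated_at)
VALUES (:cve, :score, :updated_at)
ON CONFLICT (cve) DO UPDATE
SET score = excluded.score, updated_at = excluded.updated_at
WHERE pm_epss.score != excluded.score
"""


def ingest(conn, scores, day):
    """Upsert one day's scores (a dict of CVE id -> score)."""
    rows = [{"cve": cve, "score": score, "updated_at": day}
            for cve, score in scores.items()]
    conn.executemany(UPSERT, rows)
    conn.commit()


def snapshot(conn):
    """Return {cve: (score, updated_at)} for every stored row."""
    query = "SELECT cve, score, updated_at FROM pm_epss ORDER BY cve"
    return {cve: (score, updated_at)
            for cve, score, updated_at in conn.execute(query)}


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Day 1: both rows are new and get inserted.
ingest(conn, {"CVE-2024-0001": 0.5, "CVE-2024-0002": 0.1}, "2024-09-01")
# Day 2: only CVE-2024-0002 has a different score.
ingest(conn, {"CVE-2024-0001": 0.5, "CVE-2024-0002": 0.2}, "2024-09-02")
result = snapshot(conn)
```

After the second run, the unchanged row keeps its day-1 `updated_at`, while the changed row carries the new score and timestamp. One trade-off to keep in mind: with this clause, `updated_at` no longer records when a row was last seen by ingestion, only when its score last changed.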
### Option 2
- [ ] Implement [partitioning](https://docs.gitlab.com/ee/development/database/partitioning/) by date to ingest data anew instead of upserting (see comments [1](https://gitlab.com/groups/gitlab-org/-/epics/11544#note_1960893977), [2](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/165782#note_2116110112))
- [ ] Drop the previous day's partition after ingesting new data
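
As a rough sketch of the daily rotation in Option 2, the helper below builds the two PostgreSQL statements involved: one to create the partition for the day being ingested, and one to drop the previous day's partition afterwards. It assumes `pm_epss` would be declared with `PARTITION BY RANGE` on a date column, and the partition naming scheme and function names are hypothetical. An actual implementation would presumably go through the partitioning tooling described in the linked documentation rather than hand-built SQL.

```python
from datetime import date, timedelta

# Assumption: the parent table is range-partitioned on a date column.
PARENT_TABLE = "pm_epss"


def partition_name(day):
    """Hypothetical naming scheme: pm_epss_YYYYMMDD."""
    return f"{PARENT_TABLE}_{day:%Y%m%d}"


def rotation_statements(today):
    """Return (create_sql, drop_sql) for one daily ingestion run.

    create_sql is run before ingesting today's data, and drop_sql is
    run only after ingestion has completed successfully.
    """
    yesterday = today - timedelta(days=1)
    tomorrow = today + timedelta(days=1)
    create_sql = (
        f"CREATE TABLE {partition_name(today)} "
        f"PARTITION OF {PARENT_TABLE} "
        f"FOR VALUES FROM ('{today.isoformat()}') "
        f"TO ('{tomorrow.isoformat()}')"
    )
    drop_sql = f"DROP TABLE IF EXISTS {partition_name(yesterday)}"
    return create_sql, drop_sql


create_sql, drop_sql = rotation_statements(date(2024, 9, 2))
```

Dropping a whole partition avoids row-by-row updates and deletes, which is the main appeal of this option. The cost is that every row is rewritten daily, including rows whose score did not change.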
### Option 3
- [ ] Update the exporter to export only updated rows
- [ ] Discuss exporter logic:
  * Compare rounded exported data with full scores (ensure the comparison is made against rounded values, as the GitLab DB stores rounded values while PMDB stores full values)
  * For example, add another column to store the previously exported rounded value
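
The exporter comparison discussed for Option 3 could look roughly like the sketch below: round the full score first, then compare it with the previously exported rounded value, and export only on a difference. The two-decimal precision, the half-up rounding mode, the `last_exported` field name, and the row layout are all assumptions made for illustration. The real precision should match whatever the GitLab DB stores.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumption: exported scores are rounded to two decimal places.
PRECISION = Decimal("0.01")


def rounded(score):
    """Round a full-precision score (given as a string) for export."""
    return Decimal(score).quantize(PRECISION, rounding=ROUND_HALF_UP)


def rows_to_export(rows):
    """Return (cve, rounded_score) for rows whose rounded value changed.

    Each row is a dict with the full score and the rounded value that
    was exported last time (None if the row was never exported).
    """
    changed = []
    for row in rows:
        new_value = rounded(row["score"])
        if row["last_exported"] is None or new_value != row["last_exported"]:
            changed.append((row["cve"], new_value))
    return changed


rows = [
    # Full score moved, but it still rounds to 0.50: not exported.
    {"cve": "CVE-2024-0001", "score": "0.50312", "last_exported": Decimal("0.50")},
    # Rounds to 0.13, previously exported as 0.12: exported.
    {"cve": "CVE-2024-0002", "score": "0.12790", "last_exported": Decimal("0.12")},
    # Never exported before: exported.
    {"cve": "CVE-2024-0003", "score": "0.90000", "last_exported": None},
]
to_export = rows_to_export(rows)
```

Comparing full scores directly would export the first row even though the rounded value seen by GitLab is unchanged, which is why the comparison is done after rounding.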