Reduce sbom_occurrences table growth
Why are we doing this work
Currently, we are essentially using sbom_occurrences as an append-only log and this doesn't scale.
When enabling the cyclonedx_sbom_ingestion feature flag, we saw approximately 1 million sbom_occurrences created per hour. This is because we're storing all the occurrences created in each pipeline. We should abandon the append-only log approach.
Instead, we should add columns tracking when the dependency appeared. For example:
| column | type | description |
|---|---|---|
| latest_default_branch_pipeline_id | bigint | Latest pipeline on the default branch where this dependency appeared |
| present_on_default_branch | bool | True if this dependency is currently present on the default branch |
This allows us to fullfil the current project requirements, which is to provide the same functionality as the existing dependency list. The dependencies which appear in the list will be the ones that have present_on_default_branch = true and the latest_default_branch_pipeline_id can be used to show the last successful scan. In the future, it we would like to show dependencies on other branches, we will need to come up with a different solution for that.
Relevant links
Non-functional requirements
-
Documentation: -
Feature flag: -
Performance: -
Normalising will increase the amount of joins we have to do. Test the potential impact.
-
-
Testing:
Implementation plan
- Write database migrations to:
- Delete all existing rows from
sbom_occurrences(DB lab says this would take around 7 seconds) - Remove
index_sbom_occurrences_on_ingestion_attributes - Create a new unique index on
%i[project_id component_id component_version_id source_id]
- Delete all existing rows from
- Update the ingestion tasks:
- Update the ingestion pipeline so that
Sbom::Ingestion::IngestReportsServicereceives a list of all thesbom_occurrenceids (ingested_ids) that were present in the ingested reports. - Pass the
ingested_idsto a new serviceSbom::Ingestion::PostIngestionService
- Update the ingestion pipeline so that
- Have the
PostIngestionServiceiterate through all the occurrences in the given project. - Delete the records which aren't present in
ingested_idsin batches of 100. - Update the
pipeline_idandcommit_shaof the already-existing records to that of the current pipeline
Verification steps
Auto-Summary 🤖
Discoto Usage
Points
Discussion points are declared by headings, list items, and single lines that start with the text (case-insensitive)
point:. For example, the following are all valid points:
#### POINT: This is a point* point: This is a point+ Point: This is a point- pOINT: This is a pointpoint: This is a **point**Note that any markdown used in the point text will also be propagated into the topic summaries.
Topics
Topics can be stand-alone and contained within an issuable (epic, issue, MR), or can be inline.
Inline topics are defined by creating a new thread (discussion) where the first line of the first comment is a heading that starts with (case-insensitive)
topic:. For example, the following are all valid topics:
# Topic: Inline discussion topic 1## TOPIC: **{+A Green, bolded topic+}**### tOpIc: Another topicQuick Actions
Action Description /discuss sub-topic TITLECreate an issue for a sub-topic. Does not work in epics /discuss link ISSUABLE-LINKLink an issuable as a child of this discussion
Last updated by this job
- TOPIC Ignored columns #373781 (comment 1198216147)
- TOPIC Partitioning #373781 (comment 1198227876)
- TOPIC Delete #373781 (comment 1198240485)
- TOPIC Indexes #373781 (comment 1198252592)
Discoto Settings
---
summary:
max_items: -1
sort_by: created
sort_direction: ascending
See the settings schema for details.