# Implement a new worker to store security reports for projects

## Why are we doing this work
The ingestion framework creates and updates vulnerability-related entities inside a single transaction. Within that transaction, we create or update up to 50 vulnerability records and their associated entities. Most of the queries fired in that transaction are UPSERTs, which require acquiring locks on unique index tuples. This is a problem: if multiple processes try to UPSERT records with the same unique attributes, there is lock contention and each process waits for the others to complete. With enough concurrent processes, all database connections can be held by `StoreSecurityReportsWorker` jobs for a long time, and the rest of the application suffers.
This scenario is exactly what happened recently; the details are here.
We have already implemented a temporary solution that synchronizes the ingestion of vulnerabilities per project. Because the unique attributes are scoped to a project, we can safely run the ingestion logic for different projects in parallel, but not for the same project.
In this issue, we want to implement a long-term solution: remove the application-layer semaphore and use the already implemented job deduplication policy.

The idea is that instead of ingesting the vulnerabilities for a project via a background job that accepts a pipeline ID, we ingest them via a background job that accepts the project ID. This way, we can deduplicate the jobs and prevent jobs for the same project from running in parallel.
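To illustrate the deduplication idea, here is a plain-Ruby simulation of "drop a job while an identical one is still pending" semantics, keyed on the project ID. This is a sketch only, not GitLab's actual Sidekiq deduplication middleware; the worker name is assumed from this issue's title.

```ruby
require 'set'

# Minimal simulation of "until_executing"-style job deduplication.
# Jobs are keyed by (worker name, arguments); a duplicate enqueue is
# dropped while an identical job is still waiting to run.
class DeduplicatingQueue
  WORKER = 'StoreSecurityReportsByProjectWorker' # hypothetical name

  def initialize
    @pending = Set.new # idempotency keys of jobs waiting to run
    @jobs = []
  end

  # Enqueue a job for a project. Returns false if an identical job
  # is already pending (the duplicate is dropped).
  def push(project_id)
    return false unless @pending.add?([WORKER, project_id])

    @jobs << project_id
    true
  end

  # Pop the next job; its key is released, so a new job for the same
  # project can be scheduled again.
  def pop
    project_id = @jobs.shift
    @pending.delete([WORKER, project_id])
    project_id
  end
end
```

With this model, repeated report uploads for one project collapse into a single pending job, which is exactly the property that removes the need for the application-layer semaphore.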
The new background job should select the latest succeeded pipeline of the default branch and ingest the vulnerabilities for that pipeline.
Potential issue: multiple pipelines for the default branch can succeed between scheduling the job and starting it. In that case, we can lose important data if an older pipeline generated different artifacts than the most recent one.
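The pipeline selection could look roughly like the following. This is a self-contained sketch using a plain Ruby struct in place of the real `Ci::Pipeline` model; `latest_default_branch_pipeline` is a hypothetical helper, and the highest ID is used as a proxy for recency.

```ruby
# Plain-Ruby stand-in for pipeline rows (the real model is Ci::Pipeline).
Pipeline = Struct.new(:id, :ref, :status)

# Pick the most recent successful pipeline on the default branch.
def latest_default_branch_pipeline(pipelines, default_branch)
  pipelines
    .select { |p| p.ref == default_branch && p.status == 'success' }
    .max_by(&:id)
end
```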
## Relevant links
- 2024-03-25: StoreSecurityReportsWorker saturate... (gitlab-com/gl-infra/production#17754 - closed)
- Use semaphore for storing security reports (!147816 - merged)
## Historical Context

- 2024-03-25: The `StoreSecurityReportsWorker` job has the lock-contention issue described above. It was discovered in this production incident.
- 2024-03-25: We implemented a temporary solution in !147816 (merged).
- 2024-05-15: An implementation of a new job utilizing the `pipeline_metadata` table was done. During the review, the CI team notified us we could not use that table (!151541 (comment 1908623485)).
- 2024-05-30: We decided to use Redis instead until we could come up with a plan using a new database table (!151541 (comment 1928200998)).
- 2024-06-04: The Redis-cache implementation got MR review feedback that we should fix this at the Sidekiq job deduplication layer (!151541 (comment 1933657674)). The feedback was blocking, as no MR approval was given and no "non-blocking" comment tag was used.
- 2024-06-12: Closed the MR and put this issue back into ~"workflow::refinement" (needs refining).
- Re-opened, as the MR was merged.
## Implementation plan

- Create a new database table/column/etc. to store a pipeline's ingestion state.
- Roughly do the same thing as this attempted MR, except make use of the database table you create.
- Transition to the new worker with a FF - see #460476 (closed) for the FF rollout issue.
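To sketch how the proposed ingestion state could make the project-scoped worker safe against the race described earlier, here is a plain-Ruby model where an in-memory hash stands in for the new table/column. All names are hypothetical; the real state would live in the database.

```ruby
# Hypothetical per-project ingestion state: remember the last pipeline
# ingested for each project, and skip any pipeline that is not newer.
# This makes re-running the project-scoped job idempotent.
class IngestionState
  def initialize
    @last_ingested = {} # project_id => pipeline_id
  end

  # Returns true (and records the pipeline) only when the candidate
  # pipeline is newer than the last one ingested for the project.
  def ingest?(project_id, pipeline_id)
    last = @last_ingested[project_id]
    return false if last && pipeline_id <= last

    @last_ingested[project_id] = pipeline_id
    true
  end
end
```

Because the check and the update are keyed per project, the deduplicated worker can always re-select the latest pipeline at run time and simply skip work that has already been done.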
## Verification steps

- Resolve Sentry errors; they should not re-occur.
- Unblock alerts per this thread; we shouldn't see new alerts.