Proposal - Implement Secure Data Consistency Worker
We have recently detected several cases where data in the GitLab database did not match its expected state, sometimes for years at a time.
The impact of this inconsistency varies. At the low end it is minimal and mostly just wastes storage space; at the high end it actively harms operations because the data does not reflect the expected application state. This can hurt both our ability to update data when implementing new features and the accuracy of the data we report to users.
Additionally, it has been flagged that certain operational behaviours we typically assume to be guaranteed are not perfectly lossless. For example, Sidekiq jobs are typically used to update data states on certain events, such as updating the archived/traversal_id values on vulnerability_reads and sbom_occurrences when a project is moved. Production incidents can result in jobs being lost. If that happens, these records can become stuck in an incorrect state.
Examples
- Some projects with vulnerabilities don't have their has_vulnerabilities flag set to true (severity1, priority1).
- 96 dependency list records that should have been deleted are orphaned in the database.
- Some sec-decomposition efforts require splitting of updates from single database transactions. While unlikely, this could theoretically allow for a desync between tracking characteristics in the main database and the actual records in the sec database.
Proposal
I propose that we implement a DataConsistency worker on a cron schedule to assert that certain assumptions about our data state are being appropriately maintained according to our intentions.
This worker should run on a semi-frequent, deprioritized basis, doing a low-cost scan across the data domain to check that certain pieces of business logic are being adhered to. If it detects discrepancies, it should ideally fix them and notify us. That way, an intermittent issue (such as a production incident) gets resolved for us, while a continually recurring one is proactively brought to our attention.
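To make the detect, fix and notify loop concrete, here is a minimal sketch in plain Ruby. It is an illustration only, not a proposed implementation: it deliberately avoids Rails and Sidekiq, it works on in-memory records, and every name in it (`Project`, `Vulnerability`, `HasVulnerabilitiesCheck`, `DataConsistencyWorker`, the `notifier` argument) is a hypothetical stand-in rather than an existing GitLab class. It uses the has_vulnerabilities example from above as its single check.

```ruby
require 'set'

# Hypothetical in-memory stand-ins for database records (assumption:
# the real worker would query the database in batches instead).
Project = Struct.new(:id, :has_vulnerabilities, keyword_init: true)
Vulnerability = Struct.new(:id, :project_id, keyword_init: true)

# One consistency check: a project's has_vulnerabilities flag should be
# true exactly when at least one vulnerability references that project.
class HasVulnerabilitiesCheck
  def initialize(projects, vulnerabilities)
    @projects = projects
    @vulnerabilities = vulnerabilities
  end

  def name
    'has_vulnerabilities_flag'
  end

  # Returns the projects whose flag disagrees with the actual records.
  def discrepancies
    ids_with_vulnerabilities = @vulnerabilities.map(&:project_id).to_set
    @projects.reject do |project|
      project.has_vulnerabilities == ids_with_vulnerabilities.include?(project.id)
    end
  end

  # Repairs one project by recomputing the flag from the actual records.
  def fix(project)
    project.has_vulnerabilities =
      @vulnerabilities.any? { |vulnerability| vulnerability.project_id == project.id }
  end
end

# Runs every registered check, fixes what it finds, and reports it.
# The notifier is any callable; it defaults to writing to stderr.
class DataConsistencyWorker
  def initialize(checks, notifier: ->(message) { warn(message) })
    @checks = checks
    @notifier = notifier
  end

  # Returns a hash of check name => number of records fixed in this run.
  def perform
    @checks.each_with_object({}) do |check, report|
      broken = check.discrepancies
      broken.each { |record| check.fix(record) }
      @notifier.call("#{check.name}: fixed #{broken.size} record(s)") if broken.any?
      report[check.name] = broken.size
    end
  end
end
```

Because each check recomputes the expected state from the underlying records, running the worker twice in a row is harmless: the second run finds nothing to fix and sends no notification. A check that keeps reporting fixes on every scheduled run is the signal that the problem is ongoing rather than a one-off.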