Security Finding partition logic can lead to disk saturation on small SM instances.
Problem
We've received reports of the security_findings_1 partition becoming a disk saturation risk for Self-Managed GitLab instances.
The security_findings table is an ingestion mechanism that temporarily holds all security findings produced by security scans, allowing performant comparisons and avoiding disk thrash. Without it, the merge request widget for new/fixed vulnerabilities in a branch would not be able to scale in most circumstances.
In ordinary operation, this partitioned table writes data to the same partition until that partition reaches 100GB in size. Once the partition exceeds this size, a new partition is created and new records are written to it. GitLab then retains the old partition only until the newest record in it is 90 days old (30 days for GitLab.com), at which point the entire partition is detached and its space is freed back for operational usage.
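For context, the size of each security_findings partition can be inspected directly in PostgreSQL. The query below is a small diagnostic sketch built on standard catalog tables; it assumes the default layout in which dynamic partitions live in the gitlab_partitions_dynamic schema.

```sql
-- List the security_findings partitions and their on-disk size,
-- to see how close the active partition is to the 100GB threshold.
SELECT
  child.relname                                     AS partition_name,
  pg_size_pretty(pg_total_relation_size(child.oid)) AS total_size
FROM pg_inherits
JOIN pg_class parent ON pg_inherits.inhparent = parent.oid
JOIN pg_class child  ON pg_inherits.inhrelid  = child.oid
WHERE parent.relname = 'security_findings'
ORDER BY child.relname;
```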
This has a few implications:
- Self-Managed admins may not be aware that a partition must reach 100GB before its data is cycled out.
- If the partition never grows past 100GB, the data in it is kept indefinitely, regardless of age or usefulness (see the check sketched below).
- When we discuss the retention period of Security Findings, we've not traditionally accounted for this exception, which likely means our documentation contains some incorrect specifications.
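To illustrate the second point, a check along these lines shows whether the active partition has already outlived the retention period while staying under the size threshold. This is only a sketch: the created_at column and the partition name are assumptions and should be verified against the instance's actual schema.

```sql
-- Hypothetical diagnostic: if past_retention_period is true but the
-- partition is still well under 100GB, the data will not be cycled out.
SELECT
  max(created_at)                              AS newest_record,
  now() - max(created_at) > interval '90 days' AS past_retention_period,
  pg_size_pretty(
    pg_total_relation_size('gitlab_partitions_dynamic.security_findings_1')
  )                                            AS partition_size
FROM gitlab_partitions_dynamic.security_findings_1;
```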
Possible Solutions
- Force a new partition after the retention period, regardless of partition size.
  - In theory this mitigates the problem for the vast majority of users, as data would be aged out whether or not the 100GB threshold is reached (see the sketch after this list).
- Make the retention period and maximum partition size admin-configurable.
  - Allowing instance admins to configure these values would let them tailor the retention behavior of this data to their needs and to the infrastructure and disk space available.
- Make the maximum partition size auto-scale according to the rate of security finding ingestion.
  - This is a more technical solution: new partitions could be created at a far smaller threshold, with logic that monitors how quickly partitions fill up and ramps the partition size up to avoid having too many partitions while keeping them large enough for a reasonably consistent ingestion rate. It would probably still be desirable for admins to be able to set a maximum size, so this may be more effort than it's worth.
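As a rough sketch of the first option, the rollover decision could be expressed as "create a new partition when either the size threshold or the retention period is exceeded". The query below only illustrates that condition; GitLab's actual partition management is implemented elsewhere, and the created_at column and partition name are assumptions for illustration.

```sql
-- Illustrative rollover condition: either trigger should cause a new
-- partition to be created, rather than size alone.
SELECT
  pg_total_relation_size('gitlab_partitions_dynamic.security_findings_1')
    > 100::bigint * 1024 * 1024 * 1024           AS over_size_threshold,
  now() - min(created_at) > interval '90 days'   AS over_retention_period
FROM gitlab_partitions_dynamic.security_findings_1;
```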
Workarounds
Because the data in this table is ephemeral and only retained temporarily anyway, if an instance is under distress due to the partition growing too large, it should be perfectly safe to manually truncate the partition and then rerun pipelines for any open MRs to resume standard operation.
This can be done via the Postgres console with the following command: TRUNCATE TABLE gitlab_partitions_dynamic.security_findings_1;
(The number at the end may differ if the instance has already created new partitions; the query below lists the partitions present.)
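A quick way to confirm which partition number(s) exist before truncating is to list them first. This is a sketch, not official guidance; substitute whatever partition name(s) the listing returns into the TRUNCATE statement.

```sql
-- List the security_findings partitions present on this instance,
-- then truncate the one(s) reported (substitute the actual name below).
SELECT tablename
FROM pg_tables
WHERE schemaname = 'gitlab_partitions_dynamic'
  AND tablename LIKE 'security\_findings\_%';

TRUNCATE TABLE gitlab_partitions_dynamic.security_findings_1;
```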