[Threat Insights] - Proposal: Separate "Very Large Group" namespace query optimisation from standard feature implementation
Intention
Per this conversation, this proposal suggests a change to Threat Insights development practice for new namespace-level queries: introduce namespace size limits by default on all newly implemented group queries, so that namespaces containing more than X projects are excluded from a feature until we have determined that the feature can safely and performantly serve namespaces above that size.
One such limit already exists: the Namespace SBOM dependency filter is restricted to namespaces with 600 or fewer projects.
Motivation
Namespace traversal queries in GitLab are beginning to push against the limits of what can be considered safe and performant for very large namespaces. Optimisation work is being investigated, as namespace hierarchy queries are a fundamental building block of queries across GitLab.
However, namespace-level features that depend on aggregating data across a namespace hierarchy tree cannot avoid performing this kind of query. When the tree is very large, Postgres is forced to read substantial amounts of data to produce the aggregation (some queries have needed to read in excess of 2GB). Reads of this size are slow, degrade the user experience, and pose an operational risk to GitLab as a whole due to the impact such large reads have on buffers and database contention.
As a result, when new group features such as the Group SBOM dependency report are implemented, they frequently find themselves held up in database review: Database Maintainers must spend days, and sometimes even weeks, working with the developer to squeeze every ounce of performance out of the query, mitigate all operational risks, and accept that for large groups we are reaching the bounds of GitLab's infrastructural capabilities. As @zmartins and @mokhax can attest, this can be a stressful and drawn-out process, and it is demotivating when a feature under pressure to ship within a milestone is held up at the final mark by a performance issue that takes as much effort to resolve as implementing the feature did in the first place.
By defaulting to a namespace size limit, we can split the effort of optimising the feature for very large namespaces apart from the effort of implementing it. This has a few benefits:
- We can iterate more quickly, knowing that our features are less likely to cause a production incident as we know they will be operating within a known bound in terms of namespace size.
- We can roll features out to smaller groups to allow testing in production sooner, allowing us to see actual performance characteristics which will better inform us about how well the feature can scale.
- We can push the effort for optimising for very large groups into the next milestone while still delivering the feature, which aligns with our Iteration value, and means that some of our users can benefit from the work while we handle the performance issues at scale.
TL;DR: Implement a group size limit for new namespace-level queries, and split the optimisation for very large groups into a separate issue to improve iteration.
Implementation
It's important to note here that the larger namespaces in GitLab are generally owned by GitLab's largest customers (and GitLab itself), so while the intention of this is to improve iteration, it is not intended to be an excuse to put off this optimisation process indefinitely. Ideally the optimisation effort should immediately follow the release of the feature.
The group-level SBOM dependency report currently uses a namespace limit of 600 projects, above which certain functionality is disabled. 600 projects was chosen because it excluded only the six largest namespaces and resulted in the query typically reading about 50MB of data per execution, which should be both performant and operationally safe for GitLab.
We can likely reuse the mechanics of that feature to implement a frontend flag/count indicating that the namespace is too large, and disable new features on both the frontend and backend for namespaces above 600 projects until we have actively optimised and tested them to handle larger project counts.
A limit of 600 is not guaranteed to perform well; a query can still perform poorly even for namespaces of this size. However, determining exactly how many projects a query can support takes the same effort as beginning to optimise the query for large namespaces, so the preference here is to pick a reasonably safe limit to operate within for initial feature releases.
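As a rough illustration of the guard described above, the backend check could look like the following Ruby sketch. All names here (`NamespaceSizeLimit`, `dependency_report_available?`) are hypothetical, not existing GitLab code; the real implementation would count projects via the namespace hierarchy.

```ruby
# Hypothetical sketch: gate a namespace-level query behind a project-count
# limit. Neither this module nor the constant exists in GitLab today.
module NamespaceSizeLimit
  DEFAULT_PROJECT_LIMIT = 600

  # Returns true when the namespace has too many projects for the feature.
  def self.exceeded?(project_count, limit: DEFAULT_PROJECT_LIMIT)
    project_count > limit
  end
end

# Callers short-circuit the expensive aggregation when the limit is hit,
# so Postgres never has to read the full hierarchy for oversized groups.
def dependency_report_available?(project_count)
  !NamespaceSizeLimit.exceeded?(project_count)
end
```

The same boolean could be exposed to the frontend to drive the "namespace too large" flag mentioned above.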
Other Considerations:
- These limits should probably be configurable if we are unable to optimise a query beyond a certain namespace size. The limit should be either modifiable or able to be disabled, as Self-Managed users might prefer a degraded experience to having features withheld from them. Since on a Self-Managed instance this won't compromise platform stability for other users as it could on GitLab.com, allowing user choice wouldn't hurt.
- This applies purely to new features we develop from now until the underlying problem is fixed.
Proposal
For any new group-level Vulnerability or Dependency List feature, limit the number of projects to 1,000, and show the user a notification explaining why this limitation is in place.
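The proposal implies the backend should tell the frontend both that the feature is unavailable and why, so the notification can be rendered. A minimal sketch of such a payload, with an illustrative method name and message (neither is existing GitLab code), might be:

```ruby
# Hypothetical sketch: the shape of a response the backend could return so
# the frontend can render the "why is this disabled?" notification.
PROJECT_LIMIT = 1_000

def feature_availability(project_count)
  if project_count > PROJECT_LIMIT
    {
      available: false,
      reason: "This group contains more than #{PROJECT_LIMIT} projects, so " \
              "this feature is disabled while it is optimised for very large groups."
    }
  else
    { available: true, reason: nil }
  end
end
```

In practice this would likely be a field on an existing GraphQL type rather than a standalone hash, but the shape is the same: a flag plus a user-facing reason.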