Currently, there is a limit of 5 policies, and each policy can have up to 5 rules. These limits were introduced to contain the impact in several areas:
Initially, reading from Gitaly added overhead to policy processing: the more policies we had, the longer it took to propagate them for a group/project (based on this comment). This has already been improved in Cache policy.yml in security policy project (#414528 - closed), where we started caching policy.yml, so it shouldn't be a big concern anymore.
The limit also directly affects ProcessScanResultPolicyWorker, because more policies mean more approval rules to create (and, with license_scanning, also license records). High limits would prolong the time it takes for a policy to propagate across the project/group.
The limit of 5 policies only partially addresses this concern: technically, it's possible to have a deep sub-group structure (group -> sub-group -> sub-sub-group -> ... -> project) and define 5 policies per sub-group, so the deepest project can already have many policies applied today.
Because we don't have a mechanism to selectively sync policies for affected projects based on what changed in the policy, we have to delete everything and re-create it from scratch.
The number of open merge requests in a project also plays a role, because for each open merge request we have to:
sync approval rules
sync any_merge_request rules, which invokes Gitaly to get the commit signatures (signed/unsigned)
We also have a limit of 5 rules per policy, but this doesn't seem to have been a concern so far -- is 5 enough? Policy limits may be more important for customers who want to define different approvers for different branches; the rule limit wouldn't help them achieve that.
There might be a small impact in places where we check whether policies apply in order to override project settings, such as Prevent branch modification when a policy disab... (&9705 - closed), but thanks to Gitaly caching and the simplicity of these checks, it doesn't seem like a big concern.
Limit the number of records for Approval Project Rules and Approval Merge Request Rules. A large number of these records can negatively affect the loading times of MRs.
(disclaimer: GDK loading times are worse than production)
If there are many rules that require approval, there could be user experience implications. This is not applicable to any_merge_request rules, as they don't create approval_rules, but 20 scan_finding and license_scanning rules that require approval would look like this:
(screenshots: the MR approvals widget after load, and expanded)
Implications of the database read model on the limits
In this section, I'd like to touch on how the database read model could lower the risk of raising the limits.
Selective policy sync
We will have the ability to selectively sync affected policies instead of recreating them from scratch. For example:
if we detect a change only in the description, we don't have to recreate approval rules
if we detect a removed policy, we can only remove the associated records
if we detect a new rule while other rules are kept as before, we don't need to recreate the existing ones and can only create the new ones
In the worst case, however, if everything in policy.yml changes, the impact can be the same as currently.
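The cases above can be sketched as a plain-Ruby diff over the old and new policy lists. Note that `diff_policies` and the hash shapes are hypothetical illustrations, not GitLab's actual sync code:

```ruby
# Hypothetical sketch of selective policy sync: given the previously synced
# policies and the newly pushed policy.yml content (both as arrays of hashes
# keyed by policy name), classify each policy so we only recreate what
# actually changed.
def diff_policies(old_policies, new_policies)
  old_by_name = old_policies.to_h { |p| [p[:name], p] }
  new_by_name = new_policies.to_h { |p| [p[:name], p] }
  common = old_by_name.keys & new_by_name.keys

  {
    # removed policies: only delete the associated records
    removed: old_by_name.keys - new_by_name.keys,
    # added policies: create approval rules from scratch
    added: new_by_name.keys - old_by_name.keys,
    # rules changed: approval rules must be recreated
    rules_changed: common.select do |name|
      old_by_name[name][:rules] != new_by_name[name][:rules]
    end,
    # only metadata (e.g. description) changed: no approval rule churn needed
    metadata_only: common.select do |name|
      old_by_name[name][:rules] == new_by_name[name][:rules] &&
        old_by_name[name] != new_by_name[name]
    end
  }
end
```

Each bucket then maps to a much narrower action than today's full delete-and-recreate.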
Cleanup of approval rule columns
Approval rules will hold a reference to the security policy rule, which means less redundant data needs to be recreated on a policy change. Having more approval rules will be less costly.
Duplicated data due to project_id in scan_result_policies
The tables will be normalized so that policy data is not duplicated for each project (relevant only for group policies). This will make group policy sync more performant, and having more approval rules will be less costly.
instead of creating and updating rule records for every MR we only update a single "template" rule that has an association to the MR, and only create new MR-level records when a user modifies the rule for that MR... and since what little data I've seen on it shows precious few people take the time to manually edit approval rules at the MR level, it's a much faster, cleaner approach.
With this, we would gain the following:
we wouldn't have to sync open merge requests, because we would update the "template" rule and it would automatically apply to the merge requests
if group-level approval rules work eventually in a similar fashion, this would further simplify the situation for group policies as we wouldn't have to copy the approval rules for each project
Proposal
To be able to test the new limits and revert them if we notice degraded performance, we can add an ApplicationSetting. This approach will also allow self-managed customers to adapt the limits to their needs. I would leave the rule limit at 5, as it is.
I think it makes sense to keep the default value at 5, as we have now; based on this comment, that limit is enough for most cases.
Avoid Limits - Limits should be in place to protect the system but not to “slowly try out” a feature
I would also propose setting a minimum of 5 and a maximum of 20 (at least initially) before we evaluate the performance on production. We know the feature works with 5, so I don't think it makes sense to go lower. I see the maximum as a means to protect the system, and 20 is already a big jump from what we have currently. If there's no significant impact on performance even with 20, we may update the validation and allow higher limits (e.g. 50).
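A minimal sketch of what this clamped setting could look like, in plain Ruby. The method name, constants, and the 5..20 range are illustrative of the proposal, not the actual GitLab implementation:

```ruby
# Hypothetical sketch of the proposed ApplicationSetting validation:
# the policy limit defaults to 5 and must stay within [5, 20] until we
# have evaluated performance on production.
DEFAULT_POLICY_LIMIT = 5
MIN_POLICY_LIMIT = 5
MAX_POLICY_LIMIT = 20

def validate_policy_limit(value)
  limit = value.nil? ? DEFAULT_POLICY_LIMIT : Integer(value)

  unless (MIN_POLICY_LIMIT..MAX_POLICY_LIMIT).cover?(limit)
    raise ArgumentError,
          "policy limit must be between #{MIN_POLICY_LIMIT} and #{MAX_POLICY_LIMIT}"
  end

  limit
end
```

Raising the ceiling later (e.g. to 50) would then only require loosening this validation, not a schema change.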
There might be some cases I'm not accounting for, but I think there's no good way to really judge the performance impact other than rolling it out incrementally on production.
We can start by increasing the limit to 10 and observe the performance using the execution metric introduced in Add execution duration metric for ProcessScanRe... (!139263 - merged) (or alternatively, by observing duration_s value for Security::ProcessScanResultPolicyWorker in the logs). Using the metric will allow us to sum the metrics by configuration_id to measure the policy propagation time.
To be able to test the new limits and revert them if we notice degraded performance, we can add an ApplicationSetting.
What would happen if:
We increase the limit of policies
A customer creates more policies than the current limit
We decide to reduce the limit again
Would some policies not be visible and applicable anymore?
Do we need to consider the limit of the yaml file?
As we increase the limit of rules per policy, we could hit the YAML size limit in the policy editor, defined by the max_yaml_size_bytes setting. If the YAML is bigger than this limit, we return an error: The parsed YAML is too big
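A minimal sketch of such a size guard, assuming a 1 MB default for max_yaml_size_bytes; the method name and wiring are illustrative, not the actual implementation:

```ruby
require 'yaml'

# Hypothetical sketch: reject the raw policy YAML before parsing when it
# exceeds max_yaml_size_bytes (assumed 1 MB default, per this discussion).
MAX_YAML_SIZE_BYTES = 1 * 1024 * 1024

def parse_policy_yaml(raw)
  if raw.bytesize > MAX_YAML_SIZE_BYTES
    raise ArgumentError, 'The parsed YAML is too big'
  end

  YAML.safe_load(raw, symbolize_names: true)
end
```

Checking bytesize before parsing keeps the guard cheap even for oversized input.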
If they sync the policies with the higher limit and we reduce the limit afterwards, the approval rules they have will stay untouched until they re-sync (e.g. by updating the policy or by relinking the policy project). This is indeed tricky territory and may not be ideal, but I can't think of another way to handle it.
The limit is read at the time the policy is processed, and the corresponding number of approval rules is created.
Do you see another way to handle it?
As we increase the limit of rules per policy
I would propose keeping the limit of rules per policy at 5 and only increasing the limit on the number of policies. The rule limit is defined in the schema, so it's not possible to change it dynamically. We would have to take it out of the JSON schema and validate the number of rules in code.
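Moving the rule limit out of the JSON schema could look roughly like this in code; `validate_rule_count!` is a hypothetical helper, not existing GitLab code:

```ruby
# Hypothetical sketch of validating the per-policy rule count in
# application code instead of the JSON schema, so the limit could be
# changed at runtime (e.g. via a setting) rather than being fixed in
# the schema file.
def validate_rule_count!(policy, limit: 5)
  rules = policy[:rules] || []
  return policy if rules.size <= limit

  raise ArgumentError,
        "policy '#{policy[:name]}' has #{rules.size} rules (limit: #{limit})"
end
```

The `limit:` keyword is where a dynamically configured value would be injected.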
Interesting - I didn't know about the max_yaml_size_bytes setting. I've tested with 100 policies in the YAML, each with 3 rules, changing the limit via ApplicationSetting and re-running Security::SyncScanPoliciesWorker, and I didn't come across this error. The default seems to be 1 MB, which should still be plenty.
The only alternative I can think of would be to enforce the limit only at policy creation, and to take all active policies into account during policy processing.
The rule limit is defined in the schema, so it's not possible to change it dynamically.
It would be possible to change it dynamically if we defined the schema in our code instead of reading a JSON file. But I believe this is something we are not considering.
We do have some type of validation, at least in the UI, when it comes to creating more than the defined limit of policies. See discussion here.
We also have a limit of 5 rules per policy, but this doesn't seem to have been a concern so far -- is 5 enough? Policy limits may be more important for customers who want to define different approvers for different branches; the rule limit wouldn't help them achieve that.
I believe customers are starting to hit the limits, and some anticipate needing many policies. A few factors contributing to exceeding the limit:
We now have support for MR compliance enforcement, a highly desired capability. We will continue to expand the settings we can enforce.
We previously did not have policy scoping; now that it's generally available, I think there will be an increased desire to set policies granularly against particular projects -- e.g. one policy for these 10 projects, another for these 20, another for these 5, etc.
Branches and approver rules -- as you identified, I think these will also be common. We can continue to refine policies to handle more complexity in fewer policies, such as by allowing different approvals based on various inputs.
Pipeline execution policies - these will also expand usage and bring new policies.
And with security policy scopes, I expect more customers will organize policies centrally where they may have created multiple policy projects, now that they can specify projects granularly.
In any case, I like the proposal of having a setting with a default limit and a max of 20. If we continue to have many customers bumping against this limit we can work to better understand why, as well as better understand any performance constraints, before we extend further. This would definitely give us good breathing room in the short term!
If there are many rules that require approval, there could be user experience implications. This is not applicable to any_merge_request rules, as they don't create approval_rules, but 20 scan_finding and license_scanning rules that require approval would look like this:
And on this, I don't think it would be common to have 20 rules impacting the same MRs. I think the desire for more policies will come from more granular enforcement for various cases, such as enforcement around MRs that follow a hotfix flow that varies from the standard flow (which may target specific branch patterns), or granular policy scopes requiring 3 policies doing something similar but for unique sets of projects, or the multiple-approvers case we saw come up recently.
Thanks @g.hickman for additional information, very valuable!
I believe customers are starting to hit the limits
I think we're clear on it, but just to be sure and to provide a bit more detail: the examples you've listed relate more to the policy limits than to the rule limits. We can raise policy limits to 10 or 20 and keep the rule limit at 5.
An example at the rule limit of 5 would be 1 policy with:
2 scan finding rules
2 license scanning rules
1 any merge request rule
the same set of approvers
the same approval settings
Because approval settings and approvers are per-policy and not per-rule, I came to the conclusion that a rule limit of 5 is enough in most cases and that it's more important to raise the policy limit.
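To make the example above concrete, such a policy could look roughly like this in policy.yml. This is illustrative and simplified; the field names are not guaranteed to match the actual schema:

```yaml
approval_policy:
  - name: illustrative-policy-at-rule-limit
    enabled: true
    rules:
      - type: scan_finding        # rule 1
        scanners: [dependency_scanning]
        severity_levels: [critical]
      - type: scan_finding        # rule 2
        scanners: [sast]
        severity_levels: [critical, high]
      - type: license_finding     # rule 3
        license_types: [GPL-3.0-only]
      - type: license_finding     # rule 4
        license_types: [AGPL-3.0-only]
      - type: any_merge_request   # rule 5
        commits: unsigned
    actions:
      - type: require_approval    # one set of approvers and settings,
        approvals_required: 1     # shared by all 5 rules
```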
And on this, I don't think this would be common to have 20 rules impacting the same MRs.
Agreed, I mainly wanted to highlight an edge case scenario which would theoretically be possible, but I also think it's rare to end up in this situation.
Starting with the setting is a great first step; we can evaluate the limits and their performance impact on GitLab.com, and allow self-managed instances to set their own limits, as it should not decrease their performance.
I'm looking forward to new improvements that should help reduce this limitation as we proceed with the read model.
I think we're clear on it, but just to be sure and to provide a bit more detail: the examples you've listed relate more to the policy limits than to the rule limits. We can raise policy limits to 10 or 20 and keep the rule limit at 5.
Good point, yes I think the rule limit has been less of an issue so far compared to the policy limit.
The only alternative I can think of would be to enforce the limit only at policy creation, and to take all active policies into account during policy processing.
That seems tricky, because users can also push policy.yml directly and don't have to go through our policy editor. We would have to add a push check to catch that, and on top of that, it would still not help us reduce the performance impact by processing fewer policies if we need to lower the limits.
if we define the schema in our code instead of reading a JSON file
Interesting... agreed, I would probably try to avoid doing that.