- Jul 27, 2024
As another step towards drastically improving the performance of policy evaluations with Open Policy Agent, we can provide the capability to pre-filter the data that DMD collects, before evaluating it.

For instance, when we know we only want to check against the subset of Docker images under a specific namespace, we can:

- create a Rego rule that filters for this
- add a DMD filter directive that pre-filters the data for this

This way, we can make sure that we only fetch the data we need, while still having control inside the policy around what should match.

Filters are applied with an `AND` across each field being filtered on, and an `OR` across the possible values for each field. We also allow wildcards with `*`, which are mapped to SQL's `%` (see the sketch below).

For now, we only support the `filter` directive, and filtering on `package_name` and `package_type`. Note that we use `package_type` rather than `package_manager`, as a step towards #446.

This is an extension of #603.
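As a rough illustration of how that filtering could translate to SQL, here is a minimal Go sketch, assuming a hypothetical `Filter` type and `buildFilterClause` function (neither is DMD's actual API): fields are combined with `AND`, the values for a field with `OR`, and `*` wildcards become `%` for use with `LIKE`.

```go
// Minimal sketch of how a `filter` directive could become a WHERE clause; the
// Filter type and buildFilterClause function are illustrative, not DMD's API.
package main

import (
	"fmt"
	"strings"
)

// Filter holds the allowed values for a single field, e.g. package_name.
type Filter struct {
	Field  string
	Values []string
}

// buildFilterClause ANDs across fields, ORs across a field's values, and maps
// the `*` wildcard to SQL's `%`. LIKE is used uniformly for simplicity.
func buildFilterClause(filters []Filter) (string, []any) {
	var fieldClauses []string
	var args []any
	for _, f := range filters {
		var valueClauses []string
		for _, v := range f.Values {
			valueClauses = append(valueClauses, fmt.Sprintf("%s LIKE ?", f.Field))
			args = append(args, strings.ReplaceAll(v, "*", "%"))
		}
		fieldClauses = append(fieldClauses, "("+strings.Join(valueClauses, " OR ")+")")
	}
	return strings.Join(fieldClauses, " AND "), args
}

func main() {
	where, args := buildFilterClause([]Filter{
		{Field: "package_name", Values: []string{"my-namespace/*"}},
		{Field: "package_type", Values: []string{"docker"}},
	})
	// Prints: WHERE (package_name LIKE ?) AND (package_type LIKE ?) [my-namespace/% docker]
	fmt.Println("WHERE "+where, args)
}
```

Using bind placeholders keeps the values parameterised, so the wildcard mapping happens on the bind argument rather than by splicing values directly into the SQL.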
Jamie Tanna authored
This is a significant refactor of how we perform policy evaluations and generate advisories from them, as part of #603.

Previously, we would retrieve all rows in the database, regardless of whether we needed to use them in a Policy, including `JOIN`s on tables and `GROUP_CONCAT`s which could increase the complexity of the query and pull in a lot of unnecessary data. Additionally, this led to quite high memory usage, which was found in 53fc5226 for Renovate datasource imports, but also affects evaluations here.

Instead, we want to parse the Policy definitions, determine whether each piece of data is actually queried, and if not, omit it from being fetched. This introduces a new type, the `policies.Evaluator`, which performs the evaluation of policies based on the `EvaluationInputOpts` that a given Policy maps to. These are derived from the parsed AST of the Policy itself, and enable looking up data only when necessary (see the sketch below).

As we're now querying data very dynamically, this is slightly out of scope for sqlc, so it requires hand-written queries. Unfortunately /this is very horrible and cursed/ and I am sad to have written it this way, but there wasn't a way to do this using sqlc, and to avoid adding other libraries, or writing every combination of method (e.g. `QueryRepoKey_Licenses_DependencyHealth`), we're doing awful string concatenation. It does work, albeit it is going to be a nightmare to maintain.

We now need to handle the case where the RepoKey isn't set for a given `policyViolationSnippet`, and if so, insert that snippet of a violation with `InsertAdvisoryForPartialRenovatePackage`/`InsertAdvisoryForPartialSBOMPackage`, which take advantage of `INSERT INTO ... SELECT`. This reduces the Go code we need to write, and instead pushes the work onto SQLite.

We also make sure that we process the first evaluation on its own, allowing us to capture any cases where e.g. HTTP requests can be pre-cached, avoiding a "Thundering Herd" of sorts. Then we iterate through the rest of the evaluations in batches, which we've moved to as a means to reduce the memory footprint of evaluations. We also make sure that we re-use the rows across any Policies that share the same `EvaluationInputOpts`, as it leads to a more resource-conscious approach.

Because we've now got an arbitrary query, we can wrap the whole thing in a `COUNT` to avoid having to write the query twice. We can also remove a few now-unused methods.

Closes #603.
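To make the dynamic querying more concrete, here is a minimal Go sketch, assuming illustrative table and column names and a simplified `EvaluationInputOpts` (neither matches the real schema or type): the `SELECT` list and `JOIN`s are only added when the parsed Policy actually references that data, and the same string can be wrapped in a `COUNT` rather than maintaining a second query.

```go
// Minimal sketch of dynamic query construction. EvaluationInputOpts here is a
// simplified stand-in, and the table/column names are illustrative only.
package main

import (
	"fmt"
	"strings"
)

// EvaluationInputOpts records which optional inputs a parsed Policy references.
type EvaluationInputOpts struct {
	Licenses         bool
	DependencyHealth bool
}

// buildQuery only SELECTs and JOINs the data a Policy will actually use,
// instead of always paying for every JOIN and GROUP_CONCAT.
func buildQuery(opts EvaluationInputOpts) string {
	columns := []string{"packages.package_name", "packages.package_type"}
	var joins []string

	if opts.Licenses {
		columns = append(columns, "GROUP_CONCAT(licenses.license) AS licenses")
		joins = append(joins, "LEFT JOIN licenses ON licenses.package_name = packages.package_name")
	}
	if opts.DependencyHealth {
		columns = append(columns, "health.score AS dependency_health")
		joins = append(joins, "LEFT JOIN health ON health.package_name = packages.package_name")
	}

	query := fmt.Sprintf("SELECT %s FROM packages %s",
		strings.Join(columns, ", "), strings.Join(joins, " "))
	if opts.Licenses {
		// GROUP_CONCAT needs a GROUP BY to aggregate licenses per package.
		query += " GROUP BY packages.package_name"
	}
	return query
}

// countQuery wraps an arbitrary query in a COUNT so we don't write it twice.
func countQuery(query string) string {
	return "SELECT COUNT(*) FROM (" + query + ")"
}

func main() {
	q := buildQuery(EvaluationInputOpts{Licenses: true})
	fmt.Println(q)
	fmt.Println(countQuery(q))
}
```

Policies that resolve to the same `EvaluationInputOpts` produce the same SQL, which is what allows the row set to be fetched once and re-used across them.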