Commits · v0.110.0 · tanna.dev / Dependency Management Data

Jul 27, 2024

feat(policies): allow pre-filtering data via comment directives · f110e8fc

Jamie Tanna authored 7 months ago and

Jamie Tanna committed 7 months ago

As another step towards drastically improving the performance of policy
evaluations with Open Policy Agent, we can provide the capability to
pre-filter the data that is collected by DMD, before evaluating it.

For instance, in the case that we know we only want to check against a
subset of Docker images which have a specific namespace, we can:

- create a Rego rule that filters for this
- add a DMD filter directive that pre-filters the data for this

This way, we can make sure that we only fetch the data we need, while
still having control inside the policy around what should match.

Filters are applied with an `AND` across each field being filtered on,
and an `OR` for any possible values.

We also allow wildcards with `*`, which are mapped to an SQL `%`.
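
A minimal sketch of how these semantics could translate into a SQL
`WHERE` clause. The `buildWhere` function and the map-based filter shape
are assumptions for illustration, not DMD's actual implementation, and
the sketch does not escape any literal `%` or `_` in the values:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildWhere turns a set of filters into a SQL WHERE fragment and its bound
// arguments. Fields are combined with AND, the values within a field with OR,
// and a `*` wildcard becomes a SQL `%` that is matched using LIKE.
//
// The function name and map shape are hypothetical. Field names are
// interpolated into the SQL, so callers must only pass names from a fixed
// allow-list (for instance `package_name` and `package_type`).
func buildWhere(filters map[string][]string) (string, []any) {
	// Sort the field names so the generated SQL is deterministic.
	fields := make([]string, 0, len(filters))
	for f := range filters {
		fields = append(fields, f)
	}
	sort.Strings(fields)

	var clauses []string
	var args []any
	for _, field := range fields {
		var ors []string
		for _, v := range filters[field] {
			if strings.Contains(v, "*") {
				ors = append(ors, field+" LIKE ?")
				args = append(args, strings.ReplaceAll(v, "*", "%"))
			} else {
				ors = append(ors, field+" = ?")
				args = append(args, v)
			}
		}
		if len(ors) > 0 {
			clauses = append(clauses, "("+strings.Join(ors, " OR ")+")")
		}
	}
	return strings.Join(clauses, " AND "), args
}

func main() {
	where, args := buildWhere(map[string][]string{
		"package_type": {"docker"},
		"package_name": {"example.com/team-a/*", "example.com/team-b/*"},
	})
	fmt.Println(where)
	fmt.Println(args)
}
```

Values are passed as bound arguments rather than concatenated into the
query, so only the field names need to be trusted.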

For now, we only support the `filter` directive, and filtering on the
`package_name` and `package_type` fields.

Note that we use `package_type`, not `package_manager`, as a step
towards #446.

As an extension of #603.


perf(advisories): only fetch required data for policies · 015feb72

Jamie Tanna authored 9 months ago

This is a significant refactor of how we perform policy evaluations and
the generation of advisories from policy evaluations, as part of #603.

Previously, we would retrieve all rows in the database, regardless of
whether we needed to use them in a Policy, including `JOIN`s on tables
and `GROUP_CONCAT`s which could increase the complexity of the query and
include a lot of unnecessary data.

Additionally, this would lead to quite high memory usage, which was
found in 53fc5226 for Renovate datasource imports, but also applies
here.

Instead, we want to parse the Policy definitions, and determine if the
relevant data is queried, and if not, omit it from being fetched.

This introduces a new type, the `policies.Evaluator`, which will perform
the evaluation of policies, based on the `EvaluationInputOpts` that a
given Policy maps to.

These are based on the parsed AST of the Policy itself, and enable the
lookup of data only when necessary.

As we're now very dynamically querying the data we need, this is
slightly out of scope of sqlc's domain, so requires that we hand-write
the queries.

Unfortunately /this is very horrible and cursed/ and I am sad to have
written it this way.

Unfortunately, there wasn't a way to do this using sqlc, and to avoid
adding other libraries, or writing every combination of queries, e.g.
`QueryRepoKey_Licenses_DependencyHealth`, we're doing awful string
concatenation. It does work, but is going to be a nightmare to
maintain.
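
To illustrate the trade-off, here is a rough sketch of assembling one
query from fragments instead of writing one query per combination. The
`inputOpts` struct stands in for `EvaluationInputOpts` (whose real fields
aren't shown here), and every table and column name is made up:

```go
package main

import (
	"fmt"
	"strings"
)

// inputOpts is a hypothetical stand-in for `EvaluationInputOpts`: each field
// records whether a policy actually reads that kind of data.
type inputOpts struct {
	Licenses         bool
	DependencyHealth bool
}

// buildQuery assembles a single SELECT from fragments, adding a JOIN only
// when the policy needs that data. All table and column names are invented
// for this example and do not reflect DMD's schema.
func buildQuery(opts inputOpts) string {
	cols := []string{"p.package_name", "p.package_type"}
	var joins []string
	if opts.Licenses {
		cols = append(cols, "l.license")
		joins = append(joins, "LEFT JOIN licenses l ON l.package_name = p.package_name")
	}
	if opts.DependencyHealth {
		cols = append(cols, "h.score")
		joins = append(joins, "LEFT JOIN dependency_health h ON h.package_name = p.package_name")
	}
	q := "SELECT " + strings.Join(cols, ", ") + " FROM packages p"
	if len(joins) > 0 {
		q += " " + strings.Join(joins, " ")
	}
	return q
}

func main() {
	fmt.Println(buildQuery(inputOpts{}))
	fmt.Println(buildQuery(inputOpts{Licenses: true}))
	fmt.Println(buildQuery(inputOpts{Licenses: true, DependencyHealth: true}))
}
```

With two optional data sets this already replaces four hand-written
queries, and each further option would double that number.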

We now need to handle when the RepoKey isn't set for a given
`policyViolationSnippet`, and if so insert that snippet of a violation
with `InsertAdvisoryForPartialRenovatePackage`/`InsertAdvisoryForPartialSBOMPackage`,
which take advantage of `INSERT INTO ... SELECT` to reduce the Go code
we need to write, and instead push the work onto SQLite.

We also make sure that we process the first evaluation on its own,
allowing us to possibly capture any cases where e.g. HTTP requests can
be pre-cached to avoid a "Thundering Herd" of sorts.

Then, we iterate through the rest of the batch, which we've moved to as
a means to reduce the memory footprint of evaluations.
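
The shape of this is roughly as below. This is a sketch only: the
`evaluateAll` helper, the fixed batch size and the callback signature are
assumptions made for the example, not the code in this commit:

```go
package main

import "fmt"

// evaluateAll runs eval on the first item by itself, so that any caches it
// populates (for instance for HTTP requests) are warm before the rest of the
// work starts, and then runs eval over the remaining items in batches of at
// most batchSize to bound how much is held in memory at once.
//
// The name, signature and batching strategy are hypothetical.
func evaluateAll[T any](items []T, batchSize int, eval func(batch []T) error) error {
	if len(items) == 0 {
		return nil
	}
	if batchSize < 1 {
		batchSize = 1
	}

	// The first evaluation is processed on its own.
	if err := eval(items[:1]); err != nil {
		return err
	}

	// Everything else is processed in fixed-size batches.
	rest := items[1:]
	for start := 0; start < len(rest); start += batchSize {
		end := start + batchSize
		if end > len(rest) {
			end = len(rest)
		}
		if err := eval(rest[start:end]); err != nil {
			return err
		}
	}
	return nil
}

func main() {
	items := []string{"eval-1", "eval-2", "eval-3", "eval-4", "eval-5"}
	err := evaluateAll(items, 2, func(batch []string) error {
		fmt.Println(len(batch), batch)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```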

We also make sure that we re-use the rows across any Policies that have
the same `EvaluationInputOpts`, as it leads to a more resource-cautious
approach.
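
One way to picture the re-use is to key the fetched rows on the options
themselves. In this sketch `inputOpts` is a hypothetical stand-in for
`EvaluationInputOpts`, assumed to be a comparable struct so it can be
used as a map key, and `policy`, `fetch` and `eval` are invented for the
example:

```go
package main

import "fmt"

// inputOpts is a hypothetical stand-in for `EvaluationInputOpts`. It only
// contains comparable fields, so it can be used directly as a map key.
type inputOpts struct {
	Licenses         bool
	DependencyHealth bool
}

// policy is a hypothetical pairing of a policy with the data it needs.
type policy struct {
	Name string
	Opts inputOpts
}

// evaluateGrouped fetches rows once per distinct set of options and re-uses
// them for every policy that shares those options. It returns the number of
// fetches performed.
func evaluateGrouped(policies []policy, fetch func(inputOpts) []string, eval func(policy, []string)) int {
	groups := make(map[inputOpts][]policy)
	for _, p := range policies {
		groups[p.Opts] = append(groups[p.Opts], p)
	}

	fetches := 0
	for opts, group := range groups {
		rows := fetch(opts) // one query per distinct set of options
		fetches++
		for _, p := range group {
			eval(p, rows)
		}
	}
	return fetches
}

func main() {
	policies := []policy{
		{Name: "licenses-a", Opts: inputOpts{Licenses: true}},
		{Name: "licenses-b", Opts: inputOpts{Licenses: true}},
		{Name: "names-only", Opts: inputOpts{}},
	}
	fetches := evaluateGrouped(
		policies,
		func(inputOpts) []string { return []string{"row"} },
		func(p policy, rows []string) { fmt.Println("evaluated", p.Name, "against", len(rows), "row(s)") },
	)
	fmt.Println("fetches:", fetches)
}
```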

Because we've got an arbitrary query, we can now wrap the whole thing in
a `COUNT` to avoid having to write the query twice.
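
As a sketch, the wrapping can be as small as the following, where
`countQuery` is a hypothetical helper name and the inner query is
whatever was assembled for the evaluation:

```go
package main

import (
	"fmt"
	"strings"
)

// countQuery wraps an arbitrary SELECT in a COUNT, so the number of rows it
// would return can be retrieved without maintaining a second copy of the
// query. A trailing semicolon is trimmed so the subquery stays valid.
//
// The helper name is hypothetical.
func countQuery(query string) string {
	inner := strings.TrimSuffix(strings.TrimSpace(query), ";")
	return "SELECT COUNT(*) FROM (" + inner + ")"
}

func main() {
	fmt.Println(countQuery("SELECT package_name FROM packages WHERE package_type = ?;"))
}
```

Any bound arguments for the inner query are passed unchanged when
executing the wrapped one.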

We can also remove a few now-unused methods.

Closes #603.
