  1. Jul 27, 2024
    • feat(policies): allow pre-filtering data via comment directives · f110e8fc
      Jamie Tanna authored and committed
      As another step towards drastically improving the performance of policy
      evaluations with Open Policy Agent, we can provide the capability to
      pre-filter the data that is collected by DMD, before evaluating it.
      
      For instance, in the case that we know we only want to check against a
      subset of Docker images which have a specific namespace, we can:
      
      - create a Rego rule that filters for this
      - add a DMD filter directive that pre-filters the data for this
      
      This way, we can make sure that we only fetch the data we need, while
      still having control inside the policy around what should match.
      
      Filters are applied with an `AND` across each field being filtered on,
      and an `OR` for any possible values.
      
      We also allow wildcards with `*`, which are mapped to an SQL `%`.
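
The filter semantics above could be sketched roughly as follows. This is a minimal illustration, not the code in this commit: the `buildFilterClause` function and the example values are hypothetical, and it assumes column names come from a trusted allow-list (such as `package_name` and `package_type`), because they are concatenated into the SQL directly.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// buildFilterClause is a hypothetical helper. It turns a map of
// column -> accepted values into a SQL WHERE fragment and its bind
// arguments. Values for one column are ORed together, columns are ANDed,
// and a `*` in a value is mapped to the SQL LIKE wildcard `%`.
// Literal `%` or `_` characters in a value are not escaped in this sketch.
func buildFilterClause(filters map[string][]string) (string, []any) {
	columns := make([]string, 0, len(filters))
	for column := range filters {
		columns = append(columns, column)
	}
	sort.Strings(columns) // keep the generated SQL deterministic

	var groups []string
	var args []any
	for _, column := range columns {
		var alternatives []string
		for _, value := range filters[column] {
			if strings.Contains(value, "*") {
				alternatives = append(alternatives, column+" LIKE ?")
				args = append(args, strings.ReplaceAll(value, "*", "%"))
			} else {
				alternatives = append(alternatives, column+" = ?")
				args = append(args, value)
			}
		}
		if len(alternatives) > 0 {
			groups = append(groups, "("+strings.Join(alternatives, " OR ")+")")
		}
	}
	return strings.Join(groups, " AND "), args
}

func main() {
	clause, args := buildFilterClause(map[string][]string{
		"package_type": {"docker"},
		"package_name": {"example.com/team-a/*", "example.com/team-b/*"},
	})
	fmt.Println(clause)
	fmt.Println(args)
}
```

Binding the values as arguments, rather than concatenating them, keeps user-supplied filter values out of the SQL text itself.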
      
      For now, we only support the `filter` directive, and only filtering on
      the `package_name` and `package_type` fields.
      
      Note that we use the `package_type` not `package_manager` as a step
      towards #446.
      
      As an extension of #603.
    • perf(advisories): only fetch required data for policies · 015feb72
      Jamie Tanna authored
      This is a significant refactor of how we perform policy evaluations and
      the generation of advisories from policy evaluations, as part of #603.
      
      Previously, we would retrieve all rows in the database, regardless of
      whether we needed to use them in a Policy, including `JOIN`s on tables
      and `GROUP_CONCAT`s which could increase the complexity of the query and
      include a lot of unnecessary data.
      
      Additionally, this would lead to quite high memory usage, which was
      found in 53fc5226 for Renovate datasource imports, but which also
      applies here.
      
      Instead, we want to parse the Policy definitions, and determine if the
      relevant data is queried, and if not, omit it from being fetched.
      
      This introduces a new type, the `policies.Evaluator`, which will perform
      the evaluation of policies, based on the `EvaluationInputOpts` that a
      given Policy maps to.
      
      These are based on the parsed AST of the Policy itself, and enable the
      lookup of data only when necessary.
      
      As we're now querying the data we need very dynamically, this is
      slightly outside sqlc's domain, so requires that we hand-write the
      queries.
      
      Unfortunately /this is very horrible and cursed/ and I am sad to have
      written it this way.
      
      Unfortunately, there wasn't a way to do this using sqlc. To avoid
      adding other libraries, and to avoid writing every combination of e.g.
      `QueryRepoKey_Licenses_DependencyHealth`, we're doing awful string
      concatenation. It does work, but it is going to be a nightmare to
      maintain.
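
As a rough illustration of the string-concatenation approach described above, the sketch below builds a query from a set of options. The fields on `EvaluationInputOpts`, the `buildQuery` function, and every table and column name other than `package_name` and `package_type` are assumptions made up for this example, not the project's real schema or code.

```go
package main

import (
	"fmt"
	"strings"
)

// EvaluationInputOpts is a hypothetical shape for the options derived from
// a Policy. The real type may carry different fields.
type EvaluationInputOpts struct {
	Licenses         bool
	DependencyHealth bool
}

// buildQuery only adds the columns and JOINs that the options ask for, so
// policies that don't use a dataset don't pay for fetching it.
// Table and column names here are illustrative assumptions.
func buildQuery(opts EvaluationInputOpts) string {
	columns := []string{"p.package_name", "p.package_type"}
	var joins []string

	if opts.Licenses {
		columns = append(columns, "l.license")
		joins = append(joins, "LEFT JOIN licenses l ON l.package_name = p.package_name")
	}
	if opts.DependencyHealth {
		columns = append(columns, "h.score")
		joins = append(joins, "LEFT JOIN dependency_health h ON h.package_name = p.package_name")
	}

	query := "SELECT " + strings.Join(columns, ", ") + " FROM packages p"
	if len(joins) > 0 {
		query += " " + strings.Join(joins, " ")
	}
	return query
}

func main() {
	fmt.Println(buildQuery(EvaluationInputOpts{}))
	fmt.Println(buildQuery(EvaluationInputOpts{Licenses: true}))
}
```

One function with conditional fragments replaces a hand-written query per combination of options, at the cost of the SQL no longer being checked at build time.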
      
      We now need to handle when the RepoKey isn't set for a given
      `policyViolationSnippet`, and if so insert that snippet of a violation
      with `InsertAdvisoryForPartialRenovatePackage`/`InsertAdvisoryForPartialSBOMPackage`,
      which take advantage of `INSERT INTO ... SELECT` to reduce the Go code
      we need to write, pushing the work onto SQLite instead.
      
      We also make sure that we process the first evaluation on its own,
      allowing us to possibly capture any cases where e.g. HTTP requests can
      be pre-cached to avoid a "Thundering Herd" of sorts.
      
      Then, we iterate through the rest of the batch, which we've moved to as
      a means to reduce the memory footprint of evaluations.
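
One way to read "first evaluation on its own, then the rest in batches" is sketched below. The `evaluateInBatches` function and its callback are hypothetical names, and the real code may split the work differently.

```go
package main

import "fmt"

// evaluateInBatches is a hypothetical helper. It hands the first item to
// `evaluate` on its own (so that any shared work, such as HTTP responses,
// has a chance to be cached), then hands over the remaining items in
// slices of at most batchSize to keep memory usage bounded.
func evaluateInBatches[T any](items []T, batchSize int, evaluate func(batch []T)) {
	if len(items) == 0 || batchSize < 1 {
		return
	}

	evaluate(items[:1])

	rest := items[1:]
	for start := 0; start < len(rest); start += batchSize {
		end := start + batchSize
		if end > len(rest) {
			end = len(rest)
		}
		evaluate(rest[start:end])
	}
}

func main() {
	rows := []string{"a", "b", "c", "d", "e", "f"}
	evaluateInBatches(rows, 2, func(batch []string) {
		fmt.Println(batch)
	})
}
```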
      
      We also make sure that we re-use the rows across any Policies that have
      the same `EvaluationInputOpts`, as it leads to a more resource-cautious
      approach.
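
Re-using rows across Policies with the same options could be approached by grouping on the options value, as in the sketch below. It assumes `EvaluationInputOpts` is a comparable struct so that it can be a map key. The fields shown, the `Policy` type, and `groupByOpts` are hypothetical.

```go
package main

import "fmt"

// EvaluationInputOpts is a hypothetical, comparable shape for the options
// a Policy maps to. Only comparable field types are used so the struct can
// act as a map key.
type EvaluationInputOpts struct {
	Licenses         bool
	DependencyHealth bool
}

// Policy is a hypothetical stand-in for a parsed policy.
type Policy struct {
	Name string
	Opts EvaluationInputOpts
}

// groupByOpts groups policies that need exactly the same data, so the rows
// for each distinct set of options only need to be fetched once.
func groupByOpts(policies []Policy) map[EvaluationInputOpts][]Policy {
	groups := make(map[EvaluationInputOpts][]Policy)
	for _, p := range policies {
		groups[p.Opts] = append(groups[p.Opts], p)
	}
	return groups
}

func main() {
	groups := groupByOpts([]Policy{
		{Name: "no-gpl", Opts: EvaluationInputOpts{Licenses: true}},
		{Name: "no-agpl", Opts: EvaluationInputOpts{Licenses: true}},
		{Name: "unmaintained", Opts: EvaluationInputOpts{DependencyHealth: true}},
	})
	for opts, policies := range groups {
		fmt.Printf("%+v is shared by %d policies\n", opts, len(policies))
	}
}
```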
      
      Because we've got an arbitrary query, we can now wrap the whole thing in
      a `COUNT` to avoid having to write the query twice.
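
Wrapping an arbitrary query in a `COUNT` might look like the sketch below. The `countQuery` helper is a hypothetical name; it relies on SQLite accepting a subquery in the `FROM` clause.

```go
package main

import (
	"fmt"
	"strings"
)

// countQuery is a hypothetical helper that wraps an existing SELECT in a
// COUNT, so the same generated query can be used both to fetch rows and to
// count them. A trailing semicolon is stripped so the query can be nested.
func countQuery(query string) string {
	inner := strings.TrimSuffix(strings.TrimSpace(query), ";")
	return "SELECT COUNT(*) FROM (" + inner + ")"
}

func main() {
	fmt.Println(countQuery("SELECT package_name FROM packages WHERE package_type = ?;"))
}
```

Any bind arguments for the inner query are passed unchanged when executing the wrapped query.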
      
      We can also remove a few now-unused methods.
      
      Closes #603.