Add a test or review step that explicitly checks the set of rules that are active in the Semgrep-based SAST analyzer

From gitlab-org/security-products/analyzers/semgrep!445 (comment 1968870324)

Background

This issue subsumes two problems:

1. How to verify which rules are active in a semgrep analyzer.
2. How to make sure that all rules from sast-rules are shipped.

When thinking about the Semgrep configuration, I think we always have to consider two modes of operation:

Vanilla mode: Running the semgrep analyzer with the vanilla rule-set without any modification. For Vanilla mode you could argue that problems 1 and 2 are identical. I consider problem 1 to be more about explicitly showing the user which rules are currently in use, and the easiest way to do this is to list all the rules that are picked up by semgrep.
Custom mode: Running the semgrep analyzer with a custom-ruleset configuration. Custom mode can add entirely new rules but can also modify existing ones, so the rule-set we ship under /rules can change as well (new rules can be added, rules can be overwritten, etc.). From my recent interactions with solution architects in our Slack channel, it seems that this feature is frequently used. If you see something like /usr/local/bin/semgrep -f /rules -o /builds/julianthome/tada/semgrep.sarif --sarif --no-rewrite-rule-ids --strict --disable-version-check --no-git-ignore --exclude spec --exclude test --exclude tests --exclude tmp --metrics on --verbose in the semgrep log, you cannot be certain that the rule-set is the same one that we shipped. Just showing the list of active rules is useful in this mode because users of custom rulesets would receive immediate feedback about which rules are active. There is currently a lack of transparency/visibility as to what rules are actually active in a semgrep run. https://gitlab.slack.com/archives/CLA54H7PY/p1718178603797339 is one of the more recent examples where dynamically computing and prominently displaying the list of active rules could have saved a couple of combined debugging hours 😊.

So to build a solution for problems 1 and 2, it is probably better to start looking at Custom mode first because it is more general: a solution that works for Custom mode also works for Vanilla mode.

Having the list of rule files and their checksums solves problem 1 for both Vanilla mode and Custom mode because it explicitly indicates the rules that are picked up by semgrep, even if configuration locations are changed by means of custom rule-sets. It also provides information about changes applied to existing rules. If a user modifies /rules/bandit.yml by means of a custom ruleset, we can verify the rule change (with regard to the standard rule) by just looking at the job log. As a side benefit, solution architects/customers/users get additional visual feedback on the rule changes they applied.
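The file list with checksums described above could be produced roughly as follows. This is only a sketch: the function names are hypothetical, the choice of SHA-256 and the "checksum, then path" log format are assumptions, and the actual analyzer may compute and print this differently.

```python
import hashlib
from pathlib import Path


def rule_checksums(rules_dir):
    """Map each YAML rule file under rules_dir (relative path) to its SHA-256 digest."""
    root = Path(rules_dir)
    checksums = {}
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix in {".yml", ".yaml"}:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            checksums[path.relative_to(root).as_posix()] = digest
    return checksums


def print_active_rules(rules_dir):
    """Print one 'checksum  path' line per rule file (assumed log format)."""
    for name, digest in rule_checksums(rules_dir).items():
        print(f"{digest}  {name}")
```

Calling print_active_rules("/rules") at analyzer start-up would put the list into the job log, where a modified /rules/bandit.yml shows up as a changed checksum.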

Having the list of rule files and their checksums solves problem 2 by enabling us to check the checksums of active rules against the sast-rules rule-packs we ship, in both Vanilla mode and Custom mode scenarios. At the moment, we only explain the manual verification process (https://gitlab.com/gitlab-org/security-products/analyzers/semgrep/-/blob/65c857e570b8b83235040ae591d8668d6222b911/README.md#rule-integrity-check), but as mentioned in #463607 (comment 1924624480), the idea (follow-up MR) would be to generate a manifest file as part of the sast-rules deployment, include it in the Docker image, and then check it against the active rules so that we can verify that the set of active rules is identical to the ones we shipped. If there is a deviation, we can print a warning. We cannot reliably verify at build time that the active rules equal the shipped rules.

Possible Use-cases

As discussed in the context of #463607, I would like to change the sast-rules deployment so that it generates a manifest file that includes meta-information about the rule-set, including the checksum of the rule files; another piece of meta-information we could add is the version number.

Releasing a new version of semgrep

I would like to include the manifest file in the semgrep analyzer and then add a check in our analyzer that compares the combined checksum from the manifest file with that of the active rules when a rule-set is loaded.
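A minimal sketch of such a check, assuming a JSON manifest with a "checksum" field and a combined checksum built from the sorted per-file SHA-256 digests. Both the manifest layout and the way the digests are combined are assumptions for illustration; the real manifest generated by the sast-rules deployment may look different.

```python
import hashlib
import json
from pathlib import Path


def combined_checksum(rules_dir):
    """Fold all YAML rule files into one SHA-256 digest, independent of traversal order."""
    root = Path(rules_dir)
    entries = []
    for path in root.rglob("*"):
        if path.is_file() and path.suffix in {".yml", ".yaml"}:
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            # Include the relative path so that renaming a file changes the result.
            entries.append(f"{path.relative_to(root).as_posix()}:{digest}\n")
    return hashlib.sha256("".join(sorted(entries)).encode("utf-8")).hexdigest()


def is_standard_ruleset(rules_dir, manifest_path):
    """Return True when the active rules match the checksum recorded in the manifest."""
    manifest = json.loads(Path(manifest_path).read_text())
    return combined_checksum(rules_dir) == manifest["checksum"]
```

A CI test job could run the same comparison against the freshly built image and fail when is_standard_ruleset returns False.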

If they don't match, we could print a message such as "semgrep runs with a non-standard configuration". I would like to integrate a CI test job that makes sure this does not happen, which would essentially fix #463607 in the context of releasing a new version.

Running semgrep with a custom configuration

In this context, just knowing the list of active rule files is useful because it signals to users whether their custom configuration is working, at least when they are adding new files (which is probably the most common case) 😊. Since you can also override files, I would suggest indicating that in the log by printing a similar message as before, "semgrep runs with a non-standard configuration", and adding the rules that were impacted by the change as additional information. That way, if I have a custom rule that modifies /rules/bandit.yml, the change to this rule is reflected in the log and provides a signal that my custom-ruleset configuration had an impact. At the moment, this configuration process is more or less trial and error: you have to set up a test project that includes some test vulnerabilities, run a scan, and then check the impact of your custom-ruleset configuration in the gl-sast-report.json. So I think that just showing more information in the log would be useful.
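To illustrate what listing the impacted rules could look like, here is a small sketch that compares two path-to-checksum mappings, one for the shipped rules and one for the active rules. The function names, the added/modified/removed categories, and the log line layout are hypothetical; only the warning message itself is taken from the proposal above.

```python
def diff_rulesets(shipped, active):
    """Compare two {path: checksum} mappings and classify the differences."""
    shipped_names, active_names = set(shipped), set(active)
    return {
        "added": sorted(active_names - shipped_names),
        "modified": sorted(
            name for name in shipped_names & active_names
            if shipped[name] != active[name]
        ),
        "removed": sorted(shipped_names - active_names),
    }


def log_lines(diff):
    """Build the warning plus one line per impacted rule file; empty if nothing changed."""
    if not any(diff.values()):
        return []
    lines = ["semgrep runs with a non-standard configuration"]
    for kind in ("added", "modified", "removed"):
        for name in diff[kind]:
            lines.append(f"  {kind}: {name}")
    return lines
```

With this shape, a custom ruleset that overrides bandit.yml would yield a "modified: bandit.yml" line, which is the signal that the configuration had an impact.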

Debugging case

If we get requests from customers in the future, we can see immediately whether or not they operate with the default rule-set by just looking at the log, which hopefully makes reproducing issues a little easier.

Edited Jun 28, 2024 by Julian Thome