For SAST analyzers that have already been transitioned to semgrep rules, and for other custom-rule-based analyzers such as VET, GitLab users may want to use various data sources to build and assemble their own custom rulesets.
Some of these data sources may include:

- a git repository
- local files (sourced from the local file system)
- raw, inline content
Users may also want to disable a subset of the rules or overwrite them.
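For reference, disabling or overriding individual predefined rules is already covered by the existing ruleset syntax in `.gitlab/sast-ruleset.toml`; roughly along these lines (a sketch; the identifier types and values are examples):

```toml
[semgrep]
  # Disable a single predefined rule by identifier.
  [[semgrep.ruleset]]
    disable = true
    [semgrep.ruleset.identifier]
      type = "semgrep_id"
      value = "gosec.G106-1"

  # Override attributes of another rule instead of disabling it.
  [[semgrep.ruleset]]
    [semgrep.ruleset.identifier]
      type = "CWE"
      value = "322"
    [semgrep.ruleset.override]
      severity = "Critical"
```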
We could extend the current rule syntax to add support for multiple data sources. The rules (the example below uses semgrep) can be sourced from multiple places (file, raw, git, url). The configuration file below shows an example we could use to build a custom rule pack `myrulepack.yml`.
In essence, the configuration below assembles a rule pack by pulling rules from an arbitrary git repository, a local file, and a raw/inline configuration:
```toml
[semgrep]
  description = 'semgrep custom rules configuration'
  targetdir = "/sgrules"
  validate = true

  [[semgrep.passthrough]]
    type = "raw"
    value = """
# My ruleset
"""
    target = "rule.yml"

  [[semgrep.passthrough]]
    type = "url"
    value = "https://semgrep.dev/c/p/gosec"
    target = "rule.yml"
    mode = "append"

  [[semgrep.passthrough]]
    type = "file"
    value = "foo.yml"
    target = "rule.yml"
    mode = "append"

  [[semgrep.passthrough]]
    type = "raw"
    mode = "append"
    target = "rule.yml"
    value = """
- id: "Foo"
  patterns:
    - pattern: "func Foo() {...}"
  message: |
    Function Foo detected
  metadata:
    cwe: "CWE-200: Exposure of Sensitive Information to an Unauthorized Actor"
  severity: "ERROR"
  languages:
    - "go"
"""

  [[semgrep.passthrough]]
    type = "file"
    value = "bar.yml"
    validator = "yaml"

  [[semgrep.passthrough]]
    type = "git"
    value = "https://github.com/dgryski/semgrep-go"
    ref = "b14e2f07411c22cadaab3a5d7df2346a99e7b36d"

  [[semgrep.passthrough]]
    type = "git"
    value = "https://gitlab.com/julianthome/semgrep-rules"
    subdir = "go"
    ref = "refs/heads/develop"
```
FWIW, here's the original thread discussing configuration formats and the tradeoffs behind the TOML vs. YAML decision: #229579 (comment 382581432)
@fcatteau @NicoleSchwartz I am wondering whether we could use a similar mechanism so that users/customers could hook their own advisory database into Dependency Scanning?
Hmm, we do use multiple sources today; I wonder how easy it would be to make the sources an array, as long as each entry followed a specific format?
@julianthome Thanks for sharing. That said, I don't see how this could be used to define custom advisory databases for Dependency Scanning. Could you tell us more?
@fcatteau We could use a mechanism similar to the one above to pull in additional rules (in addition to the ones we ship) so that customers/users could include their own (YAML) advisory files into the mix.
Rulepack synthesis is integrated in the rules module with gitlab-org/security-products/analyzers/ruleset!6 (merged). Apart from gosec and nodejs-scan, we are planning to use this for semgrep, VET, and hopefully DAST as well, so that they can be tailored/configured to specific use cases. The mechanism should be flexible enough to be used with the gemnasium analyzer as well, so that users/customers could add their own advisories to the mix if they wanted to. The documentation of this feature is available here: !75234 (merged).
I am just leaving this comment here because I do not know whether the ability to add custom advisories to the ones we ship is a commonly requested feature. If it is, we could re-use the same mechanism we already use for SAST and Secret Detection for DS, and potentially DAST as well. It would be great if we could use a common configuration schema for as many analyzers as possible.
> The mechanism should be flexible enough to be used with the gemnasium analyzer as well, so that users/customers could add their own advisories to the mix if they wanted to.
What would be the benefit, compared to supporting user-defined vulnerability databases? Or maybe these options are not mutually exclusive, and rules could be used to pass the YAML definitions of extra security advisories?
A concrete example would help actually. Could you illustrate how this would work? Thanks!
> What would be the benefit, compared to supporting user-defined vulnerability databases? Or maybe these options are not mutually exclusive, and rules could be used to pass the YAML definitions of extra security advisories?
@fcatteau Yes, essentially it allows for augmentation of existing advisories without requiring complete control/maintenance of a database. Take this pseudocode for example:
```toml
[gemnasium]
  description = 'gemnasium with additional custom advisory'
  validate = true

  [[gemnasium.passthrough]]
    type = "raw"
    # appending to an existing file
    mode = "append"
    target = "custom_advisory.yml"
    value = """
identifier: "CVE-2022-8675309"
identifiers:
  - "CVE-2022-8675309"
package_slug: "mycustom/package"
title: "Something bad"
description: "Oh noez"
date: "2022-01-04"
pubdate: "2022-01-04"
affected_range: "<1.0.1"
fixed_versions:
  - "1.0.1"
affected_versions: "All versions before 1.0.1"
not_impacted: "All versions after 1.0.1"
solution: "Upgrade to versions 1.0.1 or above."
urls:
  - "https://example.com/vuln/detail/CVE-2022-8675309"
cvss_v3: "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H"
uuid: "31029308-3b3d-4089-b6af-7729e672e133"
cwe_ids:
  - "CWE-8675309"
"""
```
Awesome! Thanks a lot @theoretick for providing this usage example. Another possible scenario would be to use a privately hosted advisory database or to combine the data from https://gitlab.com/gitlab-org/security-products/gemnasium-db with advisories stored in a privately hosted git repository:
```toml
[gemnasium]
  description = 'advisories'
  # targetdir where the custom configuration is assembled
  targetdir = "/advisories"
  # Automatically check the validity of the files after applying file, url or
  # raw passthroughs.
  # Appropriate validator is inferred based on the file extension.
  validate = true

  [[gemnasium.passthrough]]
    type = "git"
    value = "https://gitlab.com/gitlab-org/security-products/gemnasium-db"
    ref = "refs/heads/main"

  [[gemnasium.passthrough]]
    type = "git"
    value = "https://mygitlabinstance/advisories"
    ref = "b14e2f07411c22cadaab3a5d7df2346a99e7b36d"
```
Now, I'd like to learn more about how this works. What do we need to implement in the analyzer (gemnasium) to support this? From what I see, the analyzers call `ruleset.Load`, and they directly get the configuration file they've requested, in a unified way. Is this correct?
I'm reading ruleset_test.go to get familiar with this package, so I might figure that out by myself, eventually. 😄
After importing `gitlab.com/gitlab-org/security-products/analyzers/ruleset`, it should be sufficient to load the ruleset configuration with `rulesetConfig, err := ruleset.Load(rulesetPath, "gemnasium")` (semgrep example) and then process the passthroughs by invoking the `ruleset.ProcessPassthroughs` method (semgrep example).
Pretty sure the @gitlab-com/gl-security/appsec team is interested in being able to write custom rules for internal projects. Makes a lot of sense to me to have at least a local override.
Regarding the public rules in the Semgrep rule registry, I think it is worth looking at p/owasp-top-ten.
Feedback from the TUI customer was that using p/owasp-top-ten produced more relevant findings.
My apologies for not commenting on this much sooner; I've had a persistent TODO that I've been trying, unsuccessfully, to articulate. I've just decided to write a minimal comment first to kick off the discussion and hopefully gather my thoughts through some back and forth.
I'm concerned about the complexity of this for a few reasons: who the target user is, what our long-term support looks like, and what we need to consider from a security perspective.
It also feels more like we are building this capability for our own purposes, and I'm not sure if we want to introduce such features for customer usage. They could use this for sure, but it presents a significant lift around our support requirements (debugging complex join/union/pipeline configurations) and affects recommendations for build-your-own rulesets vs. using our pre-curated ones.
One of the niceties of our current ruleset implementation is that it is dead simple. It's almost inflexibly so, which is obviously a con for enterprise use cases but a pro for explicitness. The above proposal drastically increases the attack surface with remote fetching, parsing, and joining of rulesets and a not-quite-Turing-complete set of operands.[^1]
All that said: this covers a real use case and capability we aren't addressing, and I really want to see that addressed. I don't mean to say "no" to this feature, but I would love to iterate together on what it would look like to introduce these capabilities minimally and iteratively, without overwhelming ourselves with the future costs.
[^1]: I have been planning to organize a threat modeling exercise around one of our analyzers in the coming quarter (planned OKR) and imagine custom rulesets will be a fixture of that. I'd love to add this scope as another vector to consider there.
Good thoughts. Perhaps the first iterative step could be a way to pull in rules that are not shipped in the box? For example, an overwrite. It is possible that the community rules (e.g. p/owasp-top-ten, as @mbrandner1 mentioned) could yield better results, because we were focused on result parity with the existing OSS toolchain. On the other hand, for Goldimate customers we'd probably need a way to overwrite the shipped ruleset with our enhanced versions (my understanding; PMs, please correct me if I'm wrong).
I would also love to see it done in a generic way that even Composition Analysis and DAST could benefit from.
@theoretick It's a good idea to TM this. Curious what your security-related thoughts are. My understanding is that even if there's a way to point to a malicious ruleset (or something like file:// on the same box), it's still within the "shared responsibility model", because the user can always sabotage their CI (e.g., through environment variables) if desired.
> Good thoughts. Perhaps the first iterative step could be a way to pull in rules that are not shipped in the box? For example, an overwrite. It is possible that the community rules (e.g. p/owasp-top-ten, as @mbrandner1 mentioned) could yield better results, because we were focused on result parity with the existing OSS toolchain.
💯 There's the "quick" way, which is using a `before_script`, and there's the "better" way. The quick way is documented in #342435, and that issue serves as the key feature request for this capability.
> On the other hand, for Goldimate customers we'd probably need a way to overwrite the shipped ruleset with our enhanced versions (my understanding; PMs, please correct me if I'm wrong).
We essentially already do this within our analyzers, right? Unless you're talking about something more sophisticated than a copy-over. Merging YAML refs would definitely add some complexity if we wish to keep them more minimal.
> @theoretick It's a good idea to TM this. Curious what your security-related thoughts are. My understanding is that even if there's a way to point to a malicious ruleset (or something like file:// on the same box), it's still within the "shared responsibility model", because the user can always sabotage their CI (e.g., through environment variables) if desired.
That's true, but I guess it depends on how comfortable we are with footguns. There's also additional work that's liable to increase the surface area here, such as shared workspaces between jobs. And there's the age-old question of liability regarding our recommended configs versus customers' deployed ones (k8s privileged runners with questionable volume mounts). SSRFs and/or DoS'ing endpoints for ruleset retrieval, billion-laughs attacks with circular ruleset resolutions, etc.
Anyway, TM sounds fun and I haven't done one before so this should be a good candidate to play with the exercise 😸
Environment variable array of rules, top to bottom:

1. `./some/repo/path`
2. `mycustomrepository`
3. `p/owasptop10`
4. GitLab rules
In the above scenario, `./some/repo/path` would overwrite anything in `mycustomrepository`, which would overwrite `p/owasptop10`, which would overwrite anything in the GitLab rules.
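For illustration, that precedence chain could roughly be expressed with the passthrough syntax from the top of this thread, listing sources from lowest to highest precedence. This is only a sketch: the URLs and paths are placeholders, and it assumes that a later passthrough without `mode = "append"` overwrites earlier content for the same target, with the shipped GitLab rules remaining the implicit base.

```toml
[semgrep]
  description = 'precedence sketch: sources listed from lowest to highest precedence'
  targetdir = "/sgrules"

  # Stand-in for p/owasptop10 (placeholder registry URL).
  [[semgrep.passthrough]]
    type = "url"
    value = "https://semgrep.dev/c/p/owasptop10"
    target = "rules.yml"

  # mycustomrepository (placeholder URL); git passthroughs check out into
  # targetdir, listed here to indicate ordering.
  [[semgrep.passthrough]]
    type = "git"
    value = "https://gitlab.com/mycustomrepository"
    ref = "refs/heads/main"

  # Highest precedence: the local file wins over everything above.
  [[semgrep.passthrough]]
    type = "file"
    value = "./some/repo/path/rules.yml"
    target = "rules.yml"
```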
As for resolution of `ANALYZER_RULES`, we could first resolve `RULE_REPOSITORY` name values, then look on the filesystem, and then, if an entry contains a `p/` prefix, look in the Semgrep registry.
I guess we would also need to support disabling GitLab rules if a customer simply wanted to use their own rules.
Is there a reason we can't just support a variable and assume an order of precedence?
This sounds dead-simple to me 👍. We have to do some light munging with YAML merges but that's a solved problem.
> I guess we would also need to support disabling GitLab rules if a customer simply wanted to use their own rules.
We do this at the individual level by disabling pre-existing rulesets, but not globally. Happy to entertain approaches, but TBH I'm not sure we ever want to encourage this officially, as it would point towards either a need we should be handling directly or some weird, hacky abuse of our scanners. It should still be possible regardless; if needed, they can just add `semgrep-sast.before_script: rm -rf ./rules`.
I'll just add a side comment to this discussion as well: if this proposal IS needed to solve a real need we have around benchmarking and development of rulesets, then I'm in favor of implementing it, but perhaps as an undocumented version or fork of our existing code (i.e. a development dependency). That way we keep the simplicity of the end product but give ourselves flexibility during development.
I can't speak for SAST, but I can for DAST and my ASR project (which uses semgrep under the covers). This is a definite need: not right now, but it will be as we roll out more of our DAST checks.
Totally agree with the points @theoretick and @idawson mentioned above. It is also true that this feature is kind of important for us internally, because it enables us to dynamically pull rulesets for semgrep, VET, and DAST depending on the context (core<>goldimate). In addition, we would be able to use it for benchmarking and dogfooding (e.g., running custom rulesets on the GitLab codebase to find regressions).
That being said, I have the feeling that we are receiving an increasing amount of feedback from users/customers who would like to use such a feature (https://gitlab.slack.com/archives/C8S0HHM44/p1636515917174900, #339614 (comment 710558276)), so implementing an MVC that works as @idawson described above seems worthwhile. We should be able to have a basic MVC working by the beginning of next week.
The blog post (non-blocker) is going to be published on Dec 21, 2021. Other than that, all related issues are resolved, and this issue can be considered completed and closed.