For SAST analyzers that have already been transitioned to semgrep rules, and for other custom-rule-based analyzers such as VET, GitLab users may want to use various data sources to build and assemble their own custom rulesets.
Some of these data sources may include:

- a git repository
- local files (sourced from the local file system)
- raw, inline content
Users may also want to disable a subset of the rules or overwrite them.
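For reference, disabling or overriding individual predefined rules is already covered by the existing ruleset syntax in `.gitlab/sast-ruleset.toml`; roughly along these lines (a sketch; the identifier types and values are examples):

```toml
[semgrep]
  # Disable a single predefined rule by identifier.
  [[semgrep.ruleset]]
    disable = true
    [semgrep.ruleset.identifier]
      type = "semgrep_id"
      value = "gosec.G106-1"

  # Override attributes of another rule instead of disabling it.
  [[semgrep.ruleset]]
    [semgrep.ruleset.identifier]
      type = "CWE"
      value = "322"
    [semgrep.ruleset.override]
      severity = "Critical"
```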
We could extend the current rule syntax to add support for multiple data sources. The rules (the example below uses semgrep) can be sourced from multiple places (file, raw, git, url). The configuration file below shows an example we could use to build a custom rule pack `myrulepack.yml`.
In essence, the configuration below assembles a rule pack by pulling rules from an arbitrary git repository, a local file, and a raw/inline configuration:
```toml
[semgrep]
  description = 'semgrep custom rules configuration'
  targetdir = "/sgrules"
  validate = true

  [[semgrep.passthrough]]
    type = "raw"
    value = """
# My ruleset
"""
    target = "rule.yml"

  [[semgrep.passthrough]]
    type = "url"
    value = "https://semgrep.dev/c/p/gosec"
    target = "rule.yml"
    mode = "append"

  [[semgrep.passthrough]]
    type = "file"
    value = "foo.yml"
    target = "rule.yml"
    mode = "append"

  [[semgrep.passthrough]]
    type = "raw"
    mode = "append"
    target = "rule.yml"
    value = """
- id: "Foo"
  patterns:
    - pattern: "func Foo() {...}"
  message: |
    Function Foo detected
  metadata:
    cwe: "CWE-200: Exposure of Sensitive Information to an Unauthorized Actor"
  severity: "ERROR"
  languages:
    - "go"
"""

  [[semgrep.passthrough]]
    type = "file"
    value = "bar.yml"
    validator = "yaml"

  [[semgrep.passthrough]]
    type = "git"
    value = "https://github.com/dgryski/semgrep-go"
    ref = "b14e2f07411c22cadaab3a5d7df2346a99e7b36d"

  [[semgrep.passthrough]]
    type = "git"
    value = "https://gitlab.com/julianthome/semgrep-rules"
    subdir = "go"
    ref = "refs/heads/develop"
```
FWIW, here's the original thread discussing configuration formats and the tradeoffs behind the TOML vs. YAML decision: #229579 (comment 382581432)
@fcatteau @NicoleSchwartz I am wondering whether we could use a similar mechanism so that users/customers could hook their own advisory database into Dependency Scanning?
Hmm, we do use multiple sources today; I wonder how easy it would be to make the sources an array, as long as each entry followed a specific format?
@julianthome Thanks for sharing. That said, I don't see how this could be used to define custom advisory databases for Dependency Scanning. Could you tell us more?
@fcatteau We could use a mechanism similar to the one above to pull in additional rules (in addition to the ones we ship) so that customers/users could include their own (YAML) advisory files into the mix.
Rulepack synthesis is integrated in the rules module with gitlab-org/security-products/analyzers/ruleset!6 (merged). Apart from gosec and nodejs-scan, we are planning to use this for semgrep, VET, and hopefully DAST as well, so that they can be tailored/configured to specific use cases. The mechanism should be flexible enough to be used with the gemnasium analyzer as well, so that users/customers could add their own advisories to the mix if they wanted to. The documentation of this feature is available here: !75234 (merged).
I am just leaving this comment here because I do not know whether the ability to add custom advisories to the ones we ship is a commonly requested feature. If it is, we could re-use the same mechanism we already use for SAST and Secret Detection for DS, and potentially DAST as well. It would be great if we could use a common configuration schema for as many analyzers as possible.
> The mechanism should be flexible enough to be used with the gemnasium analyzer as well, so that users/customers could add their own advisories to the mix if they wanted to.
What would be the benefit, compared to supporting user-defined vulnerability databases? Or maybe these options are not mutually exclusive, and rules could be used to pass the YAML definitions of extra security advisories?
A concrete example would help actually. Could you illustrate how this would work? Thanks!
> What would be the benefit, compared to supporting user-defined vulnerability databases? Or maybe these options are not mutually exclusive, and rules could be used to pass the YAML definitions of extra security advisories?
@fcatteau Yes, essentially it allows for augmentation of existing advisories without requiring complete control/maintenance of a database. Take this pseudocode for example:
```toml
[gemnasium]
  description = 'gemnasium with additional custom advisory'
  validate = true

  [[gemnasium.passthrough]]
    type = "raw"
    # appending to an existing file
    mode = "append"
    target = "custom_advisory.yml"
    value = """
identifier: "CVE-2022-8675309"
identifiers:
  - "CVE-2022-8675309"
package_slug: "mycustom/package"
title: "Something bad"
description: "Oh noez"
date: "2022-01-04"
pubdate: "2022-01-04"
affected_range: "<1.0.1"
fixed_versions:
  - "1.0.1"
affected_versions: "All versions before 1.0.1"
not_impacted: "All versions after 1.0.1"
solution: "Upgrade to versions 1.0.1 or above."
urls:
  - "https://example.com/vuln/detail/CVE-2022-8675309"
cvss_v3: "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:N/I:N/A:H"
uuid: "31029308-3b3d-4089-b6af-7729e672e133"
cwe_ids:
  - "CWE-8675309"
"""
```
Awesome! Thanks a lot @theoretick for providing this usage example. Another possible scenario would be to use a privately hosted advisory database or to combine the data from https://gitlab.com/gitlab-org/security-products/gemnasium-db with advisories stored in a privately hosted git repository:
```toml
[gemnasium]
  description = 'advisories'
  # targetdir where the custom configuration is assembled
  targetdir = "/advisories"
  # Automatically check the validity of the files after applying file, url or
  # raw passthroughs.
  # Appropriate validator is inferred based on the file extension.
  validate = true

  [[gemnasium.passthrough]]
    type = "git"
    value = "https://gitlab.com/gitlab-org/security-products/gemnasium-db"
    ref = "refs/heads/main"

  [[gemnasium.passthrough]]
    type = "git"
    value = "https://mygitlabinstance/advisories"
    ref = "b14e2f07411c22cadaab3a5d7df2346a99e7b36d"
```
Now, I'd like to learn more about how this works. What do we need to implement in the analyzer (gemnasium) to support this? From what I see, the analyzers call `ruleset.Load`, and they directly get the configuration file they've requested, in a unified way. Is this correct?
I'm reading ruleset_test.go to get familiar with this package, so I might figure that out by myself, eventually. 😄
After importing `gitlab.com/gitlab-org/security-products/analyzers/ruleset`, it should be sufficient to load the ruleset configuration with `rulesetConfig, err := ruleset.Load(rulesetPath, "gemnasium")` (semgrep example) and then process the passthroughs by invoking the `ruleset.ProcessPassthroughs` method (semgrep example).
Pretty sure the @gitlab-com/gl-security/appsec team is interested in being able to write custom rules for internal projects. Makes a lot of sense to me to have at least a local override.
Regarding the public rules in the Semgrep rule registry, I think it is worth looking at p/owasp-top-ten.
Feedback from the TUI customer was that using p/owasp-top-ten produced more relevant findings.
My apologies for not commenting on this much sooner; I've had a persistent TODO that I've been trying, unsuccessfully, to articulate. I've just decided to write a minimal comment first to kick off the discussion and hopefully gather my thoughts through some back and forth.
I'm concerned about the complexity of this for a few reasons: who the target user is, what our long-term support looks like, and what we need to consider from a security perspective.
It also feels more like we are building this capability for our own purposes, and I'm not sure if we want to introduce such features for customer usage. They could use this for sure, but it presents a significant lift around our support requirements (debugging complex join/union/pipeline configurations) and affects recommendations for build-your-own rulesets vs. using our pre-curated ones.
One of the niceties of our current ruleset implementation is that it is dead simple. It's almost inflexibly so, which is obviously a con for enterprise use cases but a pro for explicitness. The above proposal drastically increases the attack surface with remote fetching, parsing, and joining of rulesets and a not-quite-Turing-complete set of operands.[^1]
All that said: this covers a real use case and capability we aren't addressing, and I really want to see that addressed. I don't mean to say "no" to this feature, but I would love to iterate together on what it would look like to introduce these capabilities minimally and iteratively, without overwhelming ourselves with the future costs.
[^1]: I have been planning to organize a threat modeling exercise around one of our analyzers in the coming quarter (planned OKR) and imagine custom rulesets will be a fixture of that. I'd love to add this scope as another vector to consider there.
Good thoughts. Perhaps the first iterative step could be a way to pull in rules that are not shipped in the box? For example, an overwrite. It is possible that the community rules (e.g. p/owasp-top-ten, as @mbrandner1 mentioned) could yield better results, because we were focused on result parity with the existing OSS toolchain. On the other hand, for Goldimate customers we'd probably need a way to overwrite the shipped ruleset with our enhanced versions (my understanding; PMs, please correct me if I'm wrong).
I would also love to see it done in a generic way that even Composition Analysis and DAST could benefit from.
@theoretick It's a good idea to TM this. Curious what your security-related thoughts are. My understanding is that even if there's a way to point to a malicious ruleset (or something like file:// on the same box), it's still within the "shared responsibility model", because the user can always sabotage their CI (e.g., through environment variables) if desired.
> Good thoughts. Perhaps the first iterative step could be a way to pull in rules that are not shipped in the box? For example, an overwrite. It is possible that the community rules (e.g. p/owasp-top-ten, as @mbrandner1 mentioned) could yield better results, because we were focused on result parity with the existing OSS toolchain.
💯 There's the "quick" way, which is using a `before_script`, and there's the "better" way. The quick way is documented in #342435, and that issue serves as the key feature request for this capability.
> On the other hand, for Goldimate customers we'd probably need a way to overwrite the shipped ruleset with our enhanced versions (my understanding; PMs, please correct me if I'm wrong).
We essentially already do this within our analyzers, right? Unless you're talking about something more sophisticated than a copy-over. Merging YAML refs would definitely add some complexity if we wish to keep them more minimal.
> @theoretick It's a good idea to TM this. Curious what your security-related thoughts are. My understanding is that even if there's a way to point to a malicious ruleset (or something like file:// on the same box), it's still within the "shared responsibility model", because the user can always sabotage their CI (e.g., through environment variables) if desired.
That's true, but I guess it depends on how comfortable we are with footguns. There's also additional work that's liable to increase the surface area here, such as shared workspaces between jobs. And there's the age-old question of liability regarding our recommended configs versus customers' deployed ones (k8s privileged runners with questionable volume mounts). SSRFs and/or DoS'ing endpoints for ruleset retrieval, billion-laughs attacks with circular ruleset resolutions, etc.
Anyway, TM sounds fun and I haven't done one before so this should be a good candidate to play with the exercise 😸
Environment variable array of rules, top to bottom:

1. `./some/repo/path`
2. `mycustomrepository`
3. `p/owasptop10`
4. GitLab rules
In the above scenario, `./some/repo/path` would overwrite anything in `mycustomrepository`, which would overwrite `p/owasptop10`, which would overwrite anything in the GitLab rules.
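For illustration, that precedence chain could roughly be expressed with the passthrough syntax from the top of this thread, listing sources from lowest to highest precedence. This is only a sketch: the URLs and paths are placeholders, and it assumes that a later passthrough without `mode = "append"` overwrites earlier content for the same target, with the shipped GitLab rules remaining the implicit base.

```toml
[semgrep]
  description = 'precedence sketch: sources listed from lowest to highest precedence'
  targetdir = "/sgrules"

  # Stand-in for p/owasptop10 (placeholder registry URL).
  [[semgrep.passthrough]]
    type = "url"
    value = "https://semgrep.dev/c/p/owasptop10"
    target = "rules.yml"

  # mycustomrepository (placeholder URL); git passthroughs check out into
  # targetdir, listed here to indicate ordering.
  [[semgrep.passthrough]]
    type = "git"
    value = "https://gitlab.com/mycustomrepository"
    ref = "refs/heads/main"

  # Highest precedence: the local file wins over everything above.
  [[semgrep.passthrough]]
    type = "file"
    value = "./some/repo/path/rules.yml"
    target = "rules.yml"
```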
As for resolution of `ANALYZER_RULES`, we could first resolve `RULE_REPOSITORY` name values, then look on the filesystem, and then, if an entry contains a `p/` prefix, look in the Semgrep registry.
I guess we would also need to support disabling GitLab rules if a customer simply wanted to use their own rules.
Is there a reason we can't just support a variable and assume an order of precedence?
This sounds dead-simple to me 👍. We have to do some light munging with YAML merges but that's a solved problem.
> I guess we would also need to support disabling GitLab rules if a customer simply wanted to use their own rules.
We do this at the individual level by disabling pre-existing rulesets, but not globally. Happy to entertain approaches, but TBH I'm not sure we ever want to encourage this officially, as it would point towards either a need we should be handling directly or some weird, hacky abuse of our scanners. It should still be possible regardless; if needed, they can just add `semgrep-sast.before_script: rm -rf ./rules`.
I'll just add a side comment to this discussion as well: if this proposal IS needed to solve a real need we have around benchmarking and development of rulesets, then I'm in favor of implementing it, but perhaps as an undocumented version or fork of our existing code (i.e. a development dependency). That way we keep the simplicity of the end product but give ourselves flexibility during development.
I can't speak for SAST, but I can for DAST and my ASR project (which uses semgrep under the covers). This is a definite need: not right now, but it will be as we roll out more of our DAST checks.
Totally agree with the points @theoretick and @idawson mentioned above. It is also true that this feature is kind of important for us internally, because it enables us to dynamically pull rulesets for semgrep, VET, and DAST depending on the context (core<>goldimate). In addition, we would be able to use it for benchmarking and dogfooding (e.g., running custom rulesets on the GitLab codebase to find regressions).
That being said, I have the feeling that we are receiving an increasing amount of feedback from users/customers who would like to use such a feature (https://gitlab.slack.com/archives/C8S0HHM44/p1636515917174900, #339614 (comment 710558276)), so implementing an MVC that works as @idawson described above seems worthwhile. We should be able to have a basic MVC working by the beginning of next week.
The blog post (non-blocker) is going to be published on Dec 21, 2021. Other than that, all related issues are resolved, and this issue can be considered completed and closed.