Commit e5dba3e8 authored by Vishwa Bhat, committed by Lucas Charles

ADR for Unified Scan Engine and Switch to Go engine

parent ffed3c42
- [004: Standalone Secret Detection Service](decisions/004_secret_detection_scanner_service)
- [005: Use Runway for service deployment](decisions/005_use_runway_for_deployment)
- [006: Unified SD Support for all GitLab Environments](decisions/006_support_for_all_environments)
- [007: Switch to Vectorscan-based Go scan engine](decisions/007_switch_to_go_scan_engine)
- [008: Unified SD Scan Engine](decisions/008_unified_scan_engine)

## Challenges


### Detection engine

#### Initial Decision

Our current secret detection offering uses [Gitleaks](https://github.com/gitleaks/gitleaks/)
for all secret scanning in pipeline contexts. By using its `--no-git` configuration
we can scan arbitrary text blobs outside of a repository context and continue to
Notable alternatives include high-performance regex engines such as [Hyperscan](https://github.com/intel/hyperscan) or its portable fork [Vectorscan](https://github.com/VectorCamp/vectorscan).
These systems may be worth exploring in the future if our performance characteristics show a need to grow beyond the existing stack; however, the team's velocity in building an independently scalable and generic scanning engine was prioritized. See [ADR 001](decisions/001_use_ruby_push_check_approach_within_monolith) for more on the implementation language considerations.

#### Recent Consideration

We will use the [Vectorscan](https://github.com/VectorCamp/vectorscan) regex engine, ported to Go, for the scan engine. See [ADR 007](decisions/007_switch_to_go_scan_engine.md) for the reasoning behind the language and engine considerations.

### Organization-level Controls

Configuration and workflows should be oriented around [Organizations](https://docs.gitlab.com/ee/architecture/blueprints/organization/index.html). Detection controls and governance patterns should support configuration across multiple projects and groups in a uniform way that emphasizes shared allowlists, organization-wide policies (i.e. disablement of push option bypass), and auditability.

Read more about the above decisions [here](./decisions/006_support_for_all_environments.md).

![High-level Secret Detection Design supporting Sync/Async scans](/images/engineering/architecture/design-documents/secret_detection/006_support_all_envs.png "High level design supporting sync and async scans")

#### High-level Design for Unified SD Scan Engine

With a single scanning engine for all scan target types, aided by target type-specific Adapters, the overall design of Secret Detection across the various scan target types looks like the following illustration:

![High-level Design for unified scan engine](/images/engineering/architecture/design-documents/secret_detection/008_high_level_design.png "High-level Design for unified scan engine")

Read more details about the unified scan engine [here](decisions/008_unified_scan_engine.md).

#### High-level SD detection flow for Work Items

---
title: "GitLab Secret Detection ADR 007: Switch to Vectorscan-based Go scan engine"
---

## Context

This ADR captures the [decision](https://gitlab.com/gitlab-org/gitlab/-/issues/526227) to switch from the RE2-based (ported) Ruby scan engine to a Vectorscan-based (ported) Go scan engine for Secret Detection scans.

Here's the historical context on why Ruby was chosen and the problems associated with it.

### Why was Ruby chosen earlier?

In the initial days of Secret Push Protection (SPP) feature development, we chose Ruby for running SD scans for two main reasons:

1. The expected velocity for the team to build the SPP feature was pretty high, and adding tight time constraints into the mix led us to choose the approach with the least resistance that did not compromise much on quality. Given the team's familiarity with Ruby and the ease of integration with GitLab Rails, building the SPP feature within a Ruby gem made sense.

2. Using the RE2 engine ported to Ruby boosted our confidence in the decision, as it performed much better in terms of [latency](https://gitlab.com/gitlab-org/gitlab/-/issues/423832) and [memory consumption](https://gitlab.com/gitlab-org/gitlab/-/issues/422574#note_1582015771) than the other language+regex combination under consideration, i.e., Go with its built-in regex engine.

Fast-forward to the technical discovery phase of building the Secret Detection Service (SDS). We retained the decision to use the same language (Ruby) and engine (RE2 port) for a few reasons:

1. The initial scope of the Secret Detection Service (SDS) was to contain _anything and everything_ pertaining to Secret Detection: pulling the files from a particular source for scanning, updating metadata in the DB, and running post-processing actions like revocation, to name a few. Given this idea, we could leverage and integrate with existing systems from GitLab Rails like workers, cache, DB, etc. However, we later descoped it to a stateless service responsible for just scanning the given payloads, while pre- and post-processing steps are handled on the GitLab Rails side.

2. The SPP feature was under active development by vendoring the SD Ruby gem, so a migration would have pushed the SPP feature release further out. Given the strict release timeline, we couldn't afford that; we needed a non-disruptive approach.

3. Runway did not (and still doesn't) support deployment to Self-Managed and Dedicated environments, and Cloud Connector was an effortful option to consider. Runway [had plans](https://gitlab.com/gitlab-com/gl-infra/platform/runway/team/-/issues/236) to support it, so a temporary approach of reusing the ongoing work (the vendored gem) made more sense.

4. RE2 regex performance was still _sufficiently_ better in terms of [latency](https://gitlab.com/gitlab-org/gitlab/-/issues/423832) and [memory consumption](https://gitlab.com/gitlab-org/gitlab/-/issues/422574#note_1582015771) than Go's built-in regex. We did not explore Go's alternative regex engines.

#### Quirks that come with choosing Ruby

1. We over-emphasized the raw regex performance of Ruby RE2 during decision-making and skipped other crucial factors that contribute to overall performance, particularly when serving scans via HTTP requests. Among the missed factors are server throughput, request latency, memory consumption + GC, and cost-effectiveness of resource utilization. Ruby underperforms in _all_ of these due to the following:

   1. Ruby generally has a high memory footprint, from its runtime VM to regex operations, and its GC is not well optimised (as with many other interpreted languages). We did consider the ["running the scan within a subprocess" approach](003_run_scan_within_subprocess.md) to address this; however, we had to drop it in SDS since gRPC doesn't support forking processes.

   2. The Global Interpreter Lock (GIL) allows only one request at a time per process per CPU, keeping other requests blocked for longer periods, which in turn could block customers on git push operations.

   3. There is no built-in support for concurrency. Even parallelism isn't easily achievable, as Ruby doesn't effectively utilize multiple CPU cores and threads due to the GIL.

2. Ruby lacks first-class support for gRPC, so it uses the [C-based gRPC](https://github.com/grpc/grpc) library under the hood. The C-Ruby bindings introduce some overhead, in addition to missing the language-specific optimizations like those we have for [Go](https://grpc.io/blog/grpc-go-perf-improvements/).

3. Considering the plan for SDS to be the sole platform for running scans across different scan target types, high throughput and low latency become mandatory requirements. Looking at the [metrics of gRPC servers](https://nexthink.com/blog/comparing-grpc-performance), Ruby is among the poorest performers on the list.

4. Missing first-class support for generating an efficient binary executable makes it difficult to unify the secret scanning engine across the entire Secret Detection suite of features.

## Proposal

The proposal was to explore other performant languages and regex engines suitable for running SD scans in different scan modes, i.e., as a distributed service and as an executable binary.

### Languages (and engines) considered for the decision

1. [Ruby](https://gitlab.com/vbhat161/regex-performance/-/tree/main/ruby) with RE2 Port

2. [Go](https://gitlab.com/vbhat161/regex-performance/-/tree/main/golang) with Vectorscan Port

3. [C++](https://gitlab.com/vbhat161/regex-performance/-/tree/main/cpp) with RE2 (native) and Vectorscan (native).

4. [Rust](https://gitlab.com/vbhat161/regex-performance/-/tree/main/rust) with Std Regex and Std Regex w/ parallelised rule processing.

### Conclusion

The [final conclusion](https://gitlab.com/gitlab-org/gitlab/-/issues/526227#note_2491660337) for the language and regex engine was the combination of **Go and Vectorscan (port)**.

Results for the micro-benchmarking and gRPC throughput testing can be found [here](https://gitlab.com/gitlab-org/gitlab/-/issues/526227#note_2460625471). The source code and dataset used for testing can be found [here](https://gitlab.com/vbhat161/regex-performance).

### Why will Go be suitable moving forward?

1. Go has a significant upper hand ([around 8-10x](https://nexthink.com/en/blog/comparing-grpc-performance)) over Ruby in terms of overall performance (as discussed above), which compensates for its moderate regex performance.

2. We did not explore other Go regex alternatives during the spike, since we were already convinced about using Ruby due to the constraints mentioned above. It is possible that third-party Go regex libraries perform even better.

3. First-class support for building executable binaries makes Go suitable even in areas where accessing the SD service is not feasible. This also helps us build a unified scanner engine, as we could serve Secret Detection as a service, a binary executable, and a CLI from a single Go codebase.

4. With [first-class gRPC support](https://github.com/grpc/grpc-go), we get better [tooling](https://github.com/grpc-ecosystem/awesome-grpc?tab=readme-ov-file#lang-go) for monitoring, testing, and debugging, plus a plethora of [middleware extensions](https://github.com/grpc-ecosystem/go-grpc-middleware). Go also aligns better with gRPC's type system, so there are fewer unexpected type-related crashes.

5. Given the critical role of SDS in the future, Go's maintainability, dependency management, and type system hold a significant advantage.

6. As a cherry on top, integration with [Cloud Connector for SDS](https://gitlab.com/gitlab-org/gitlab/-/work_items/525472) wouldn't be blocked on a missing auth SDK, as we currently are with Ruby.

In addition to the above, there are more positives to choosing Go over Ruby: an efficient GC, low runtime resource consumption, smaller container sizes leading to faster deployments, efficient file handling, native goroutines with channels for clean concurrency code, compile-time checks, mature profiling/instrumentation tools, and being cloud-native, i.e., better suited for containerized and microservice architectures.

### Why is now the right time?

The scanner engine is currently stable, with no significant feature planned on the immediate roadmap. The Secret Detection Service is still in Beta, with the major features that will actively use SDS still in the planning phase. We are at an inflection point where delaying this decision further will exponentially increase the migration complexity and technical debt as SDS matures and becomes more deeply integrated. This is our only chance to establish the appropriate technical decision for the long term.
---
title: "GitLab Secret Detection ADR 008: Unified SD Scan Engine"
---

## Context

Currently, Secret Detection scans run on different scan target types: source code (via Pipeline SD), Git commits (Push Protection), issue descriptions/comments (Client-side SD), with more target types to follow. Each of these scan target types has a different scan engine (Gitleaks, [one managed by the team](https://gitlab.com/gitlab-org/security-products/secret-detection/secret-detection-service), another by the Web frontend), and to add complexity to the mix, certain teams at GitLab maintain their own versions of secret detection. This leads to multiple issues (outlined below) and an inconsistent experience for customers:

* It creates a significant burden of maintaining feature parity across multiple engines.
* It limits our ability to optimize the core scanning logic for performance and efficiency.
* It becomes difficult to ensure ruleset quality (precision and recall) against all the scan engines.
* It also becomes difficult to optimize rule patterns, which is generally done against a specific regex engine.
* Intelligent scanning capabilities are integrated inconsistently across the engines, depending on each engine's compatibility.

## Proposal

The proposal is to build a unified Secret Detection core scanning engine that will be used for all scan target types. The engine's core responsibility will be to run a Secret Detection scan on a given set of payload(s) against the provided scan configuration. The pre-processing (gathering/structuring payloads into a scannable format) and post-processing steps (validity checks, etc.) will be managed on the caller side. This setup ensures the reusability of the core scanning logic across all the implementation layers of the scan target types.

### Characteristics of the unified scan engine

* The engine is portable. This is crucial considering some scan target types (source code/artifacts) run in an isolated environment like GitLab CI.
* The engine can run transactional or streaming-based scans.
* The engine is cross-platform compatible.
* The engine is stateless by nature.
* The engine is highly resource-efficient, able to run in resource-constrained environments (like a client IDE or a small CI environment).
* The engine can tweak its scan behavior by accepting user inputs (with defaults):
  * exclusions (rule/path/value)
  * custom ruleset
  * ruleset version
  * timeout constraints
  * payload size limit
  * payloads acceptable via file-path/dir-path/stdin (binary executable)
  * configurable resource(memory/cpu cores) limit (binary executable)
* The engine is able to run as a binary executable in air-gapped environments.

### Pre-requisite

The idea of a unified engine holds only if we have a scan engine that is highly efficient yet portable by nature. Per the [decision](007_switch_to_go_scan_engine.md), we need the Vectorscan-based Go scan engine implemented.

### Modes of using the Scan Engine

The minimal scope and stateless nature of the proposed scan engine open up the _portability_ advantage, which is a necessity for certain scan target types (source code or job artifacts running in a CI environment). Therefore, the scan engine could be adopted in one of the following three modes, depending on the nature of the scan target type (traffic, size, etc.):

* **As a Distributed Service**: The scan engine is wrapped with a REST/gRPC layer exposing scan API endpoints, and the caller makes the scan request over the network. This mode is suitable for SD features with high traffic and lightweight payloads (\<1MB). Example: scanning Work Items via the [Secret Detection Service](https://gitlab.com/gitlab-org/security-products/secret-detection/secret-detection-service)

* **As an Embedded module**: In this mode the core engine is _embedded_ within the same host as the caller application, and the caller invokes the scan for a scannable payload. This mode is transactional by nature. We already use this mode for the Push Protection feature, where the engine is embedded as a Ruby gem installed in the Rails monolith. The Rails monolith makes the scan request (including `git diff` data as the scannable payload) to the gem.

![Embedded Mode](/images/engineering/architecture/design-documents/secret_detection/008_scan_mode_embedded.png "Embedded Scan Mode")

* **As a Batch processor**: This is a special case to support [in-storage processing](https://en.wikipedia.org/wiki/In-situ_processing), where the Secret Detection program (+engine) runs where the data resides. This reverse approach is suitable for scan target types with larger data sizes, like source code or job artifacts, to avoid the data-transfer costs incurred between data storage and scan servers. The primary difference from Embedded mode is that in Embedded mode the caller includes the scannable payload within the scan request, whereas in Batch mode the caller points at the scannable payload(s) available on the target host (where the program and data reside), e.g. by passing a file path along with the request.

![Batch Mode](/images/engineering/architecture/design-documents/secret_detection/008_scan_mode_batch.jpg "Batch Scan Mode")

#### Adapters

The engine's minimal scope of running the scan on a given payload implies that the caller must implement the pre-processing and post-processing steps of the scan. Since the scan target types are located at different sources (e.g. Rails/CI/Gitaly), it is important to have a consistent implementation approach across all target types for better maintainability. In addition, a decent amount of common logic exists in the pre-processing and post-processing steps across different scan target types, which can be reused between Adapters.

We will follow the concept of `Adapters`, where an `Adapter` sits between the caller and the scan engine, abstracting the implementation details, similar to [Language Servers](https://en.wikipedia.org/wiki/Language_Server_Protocol) for IDEs. `Adapters` are primarily used in the Embedded or Batch modes.

![Workflow representing Adapters and Scan engine responsibilities](/images/engineering/architecture/design-documents/secret_detection/008_adapters.png "Workflow representation")

**Example Scenario**: We currently use the `gitleaks` scan engine for Pipeline SD scans. Though `gitleaks` offers a lot of functionality we don't need, three capabilities are needed for this use case:

1. Traverse the git commit tree and retrieve the diff data for each commit
2. Run a secret detection scan on the retrieved content (the scannable payload) efficiently (e.g. scan in parallel, early-exit via keyword-based matching, filter noise/false positives from the detected findings through entropy checks, etc.)
3. Create a consumable response as requested by the client via scan params

The above process is typically the same for any scan target type: `Step 1` represents the pre-processing phase of the scan, `Step 2` represents the SD scan operation itself, and `Step 3` represents the post-processing phase. The pre-processing and post-processing logic are generally scan-target-type dependent, whereas the scan logic is common across all types.

If we replicate the above process with the context of Adapters + unified scan engine in mind, we would have an `Adapter` (say `SourceCodeAdapter`) responsible for the pre-processing and post-processing logic, and the scan engine responsible for running the SD scan on the given scannable payloads.

Here's a comprehensive view of the unified scan engine with the different Adapters, one for each scan target type.

![High-level Design for unified scan engine with Adapters](/images/engineering/architecture/design-documents/secret_detection/008_high_level_design.jpg "High-level Design for unified scan engine with Adapters")

## Distribution

Given there are different modes of using the scan engine and the Vectorscan regex engine being platform-dependent, it is important to identify how the engine is made accessible within GitLab applications.

* When using the engine as a distributed (gRPC) service: the gRPC service will be deployed using Runway, making it accessible to internal GitLab.com applications. Clients can make remote scan requests to the service via gRPC endpoints.

* When using the engine as an Embedded module or a Batch processor: the nature of embedding differs depending on the client application. The engine can be imported as a Go module for Go-based applications, whereas for non-Go applications (mostly `Ruby` in GitLab Rails), there are two choices:

  1. Write Ruby C extension for a static library generated from Go's scan engine source
  2. Package Go binary executable within GitLab Rails using Omnibus

Since the Vectorscan engine scans a chunk of data instead of scanning line-by-line (the current approach), the [high heap usage problem](https://gitlab.com/gitlab-org/gitlab/-/issues/422574#note_1582015771) is no longer applicable. In addition, the Vectorscan engine operates on a pre-allocated [scratch space](https://intel.github.io/hyperscan/dev-reference/runtime.html#scratch-space), contributing to lower memory consumption for the overall scan. Calling Go functions through C bindings adds barely any overhead compared to IPC between Ruby and Go processes. All that said, the decision is leaning towards writing a Ruby C extension. However, we have yet to conduct a spike to confirm this decision with data-driven evidence.