Issue for documenting the technical discovery effort on the future architectural direction for Category:Secret Detection. See the parent epic for more context.
This issue is meant to be a noisy scratch-pad of ideas for discovery. On completion, we should have an architectural roadmap outlining a phased approach to delivering feature availability.
## Auto-Summary 🤖

### Discoto Usage

#### Points

Discussion points are declared by headings, list items, and single lines that start with the text (case-insensitive) `point:`. For example, the following are all valid points:

```
#### POINT: This is a point
* point: This is a point
+ Point: This is a point
- pOINT: This is a point
point: This is a **point**
```

Note that any markdown used in the point text will also be propagated into the topic summaries.

#### Topics

Topics can be stand-alone and contained within an issuable (epic, issue, MR), or can be inline.

Inline topics are defined by creating a new thread (discussion) where the first line of the first comment is a heading that starts with (case-insensitive) `topic:`. For example, the following are all valid topics:

```
# Topic: Inline discussion topic 1
## TOPIC: **{+A Green, bolded topic+}**
### tOpIc: Another topic
```

#### Quick Actions

| Action | Description |
| ------ | ----------- |
| `/discuss sub-topic TITLE` | Create an issue for a sub-topic. Does not work in epics. |
Pipeline scanning is fully supported on both GitLab SaaS and self-managed instances.
Revocation is only enabled on GitLab SaaS, but a loose specification was defined allowing self-managed users to run a revocation service (whether any do is currently unknown).
Do these deployment models work for new feature additions? For example, if we are expecting a deployed standalone service, will this component be optional or on-by-default for self-managed?
We may consider PoC development targeting GitLab.com for rapid prototyping and testing; however, the architecture should aim to support all instances of GitLab in a self-contained manner. This will mean tying any deployable service to Omnibus and our GitLab Helm charts for cloud-native deployments.
Outcome: solution should target all GitLab instances
If we are expanding our default ruleset capabilities to cover a larger swath of the product area, we should consider segmenting the rules under scan to both improve performance and target specific workflows.
One way we could segment is by performant rules. Another is by context-specific rules (don't scan for deploy tokens in issue bodies, but do in pipelines or push events); see the sketch below.
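As a loose illustration of how context-segmented rulesets could be represented (the rule IDs, context names, and fields here are assumptions for the sketch, not a settled schema):

```go
package main

import "fmt"

// Rule is a minimal stand-in for a detection rule, tagged with the scan
// contexts it should run in. The fields are illustrative only.
type Rule struct {
	ID       string
	Contexts map[string]bool // e.g. "push_event", "pipeline", "issue_body"
}

// rulesFor returns the subset of rules applicable to a given scan context,
// so that e.g. deploy-token rules can be skipped for issue bodies.
func rulesFor(rules []Rule, context string) []Rule {
	var out []Rule
	for _, r := range rules {
		if r.Contexts[context] {
			out = append(out, r)
		}
	}
	return out
}

func main() {
	rules := []Rule{
		{ID: "gitlab_personal_access_token", Contexts: map[string]bool{"push_event": true, "pipeline": true, "issue_body": true}},
		{ID: "gitlab_deploy_token", Contexts: map[string]bool{"push_event": true, "pipeline": true}},
	}
	fmt.Println(rulesFor(rules, "issue_body")) // deploy-token rule is filtered out here
}
```

A performance-based split could hang off the same structure, e.g. tagging rules with a cost tier and only running the expensive tier for high-value targets.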
One issue is that the current Category:Secret Detection capability has limited consumer-facing features within GitLab Free. This means that, even with expanded detection capabilities, we still have limited ways for a user to recognize a leak: there is no access to Category:Vulnerability Management, and if we move detection out of a pipeline context the normal JSON report would be inaccessible.
I'm not familiar with how new services tend to get spun up at GitLab. From what I've seen, there's a tendency to build features in the Rails monolith where we get some advantages around data locality and interoperability with existing modules.
What kind of deployment are we looking at with this service? I'm also aware it might be too early to discuss.
We almost certainly will need a standalone service to handle the actual scanning. Based on the potential volume we wouldn't want to overload our existing app servers and our background job processing through sidekiq doesn't feel like the most appropriate place to handle the load (plus performance would be subpar).
That said, all interfacing with the monolith will likely occur through asynchronous background jobs, such as the existing ScanSecurityReportSecretsWorker, so there will still be a sizeable Rails component.
Once we reach a "block pre-commits" stage, we would likely need to call this new service from Gitaly, but that's a long way off.
On the block-pre-commit topic, I want to link a great example of how a slow pre-receive hook can cause a Gitaly server to melt down: #246819 (comment 1164411983)
Ordered by priority. Feature parity first resolves known user-experience issues in a known, validated, and popular use case, and also helps expand coverage of post-processing flows.
1. Parity with existing solution
   - Repo contents (akin to historical scans)
   - Commits (closest to current pipeline behavior)
2. High-priority additions
   - Job logs
   - Issue/MR/Epic comments
   - Issue/MR/Epic descriptions
3. Everything else
   - Wikis (actually repos internally, so might be a quick win by porting Step 1 parity)
   - Issue/MR/Epic titles
   - Docker manifests
   - Visual images
   - Container image/layers
   - GitLab Error Tracking event data (unmasked production data)
Active push-based notifications could be a great iteration after an MVC, e.g. email and/or in-app alerting.
The easiest MVC would be leaning into the existing vulnerability management capabilities. Outside of security report parsing, we could leverage the manual creation API.
Longer-term, a separate UI may be worth exploring since secrets have some categorical differences from vulnerabilities in how resolution occurs, but that's out of scope for this initial work.
Point: Detection will continue to be surfaced through Vulnerability management UI.
### Revocation (Post-processing)
The Revocation flow won't change significantly, although there is some related work ongoing with #371658 (comment 1159759669).
That said, as mentioned in this comment, detection triggers for post-processing will expand to match each target of detection.
To allow us to scale up scanning of both commits and arbitrary text blobs, we need to develop a sidecar component for both our ~SaaS deployments and self-managed instances. While we could handle basic string matching within our background workers, we would very quickly run into resource contention issues given the request volume. In addition, by setting up a separate service we can better optimize around the specific performance considerations of high-volume token detection.
Based on the detection targets, we should first consider scanning push events to handle the main use case: moving detection out of a CI context while covering the critical path for platform-wide detection. For this reason, the initial architecture can interface with both the RoR application (via Sidekiq workers tied to the post-receive commit hook) and Gitaly for streaming commit blobs.
To support alternate use cases we may need to provide a mechanism to send arbitrary text blobs to the BlobScannerService, but given the expected payloads for hooking into push events, that's not a tenable approach for push event targets.
## How
To match existing architectural considerations, this service will likely be built using Go and communicate with both the RoR monolith and Gitaly over gRPC.
- Detection engine
- Configuration
- Authentication
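Before digging into those pieces, here is a minimal sketch of what the service boundary implied above could look like, assuming a Go sidecar reached over gRPC; the type, field, and method names are placeholders rather than a real API:

```go
// Illustrative only: the real service would use protobuf-generated types,
// but the shape of the boundary would be roughly this.
package main

import (
	"context"
	"fmt"
)

// ScanRequest is what the Rails monolith (via Sidekiq) or Gitaly would send:
// an opaque blob plus metadata identifying where it came from.
type ScanRequest struct {
	Target string // e.g. "push_event", "job_log", "issue_description"
	Blob   []byte
}

// Finding reports a single detected token.
type Finding struct {
	RuleID string
	Line   int
}

// BlobScanner mirrors the interface a generated gRPC server stub would expose.
type BlobScanner interface {
	ScanBlob(ctx context.Context, req *ScanRequest) ([]Finding, error)
}

// noopScanner is a stub implementation so the sketch compiles and runs.
type noopScanner struct{}

func (noopScanner) ScanBlob(ctx context.Context, req *ScanRequest) ([]Finding, error) {
	return nil, nil // the detection engine (e.g. gitleaks-style rules) plugs in here
}

func main() {
	var s BlobScanner = noopScanner{}
	findings, err := s.ScanBlob(context.Background(), &ScanRequest{Target: "push_event", Blob: []byte("...")})
	fmt.Println(findings, err)
}
```

The real boundary would likely be defined in protobuf, with Gitaly and the Sidekiq-driven Rails side holding the client stubs.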
### Detection engine
We currently utilize gitleaks for all token detection within our CI pipelines. By isolating detection away from a repository or pipeline context, we may no longer find it the best tool, but given our in-house experience with gitleaks and its support for ruleset configuration, we should start with gitleaks in a `--no-git` configuration and establish some benchmarks for scaling requests.
### Configuration
To allow differentiated rulesets by context, we should provide a mechanism for injecting token configuration at several levels; see below.
The initial MVC, however, can target a narrow band of well-defined GitLab-specific tokens, as sketched below.
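As a rough sketch of what layered ruleset injection could look like, assuming instance-wide defaults overridden at the group/project level; the token pattern and field names are assumptions for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// TokenRule is a single detection rule; the MVC set would be a handful of
// well-defined GitLab token formats.
type TokenRule struct {
	ID      string
	Keyword string         // cheap prefilter
	Pattern *regexp.Regexp // confirming regex
}

// defaultRules is the instance-wide baseline. The PAT pattern below is an
// assumption for the sketch, not the canonical rule definition.
var defaultRules = map[string]TokenRule{
	"gitlab_personal_access_token": {
		ID:      "gitlab_personal_access_token",
		Keyword: "glpat-",
		Pattern: regexp.MustCompile(`glpat-[0-9a-zA-Z_\-]{20}`),
	},
}

// effectiveRules overlays project-level rules on top of the defaults,
// which is one way the "several levels" of configuration could compose.
func effectiveRules(projectOverrides map[string]TokenRule) map[string]TokenRule {
	merged := make(map[string]TokenRule, len(defaultRules)+len(projectOverrides))
	for id, r := range defaultRules {
		merged[id] = r
	}
	for id, r := range projectOverrides {
		merged[id] = r
	}
	return merged
}

func main() {
	rules := effectiveRules(nil) // no overrides: MVC behaviour
	fmt.Println(len(rules), "rule(s) active")
}
```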
Our current offering is entirely limited to default branch detection due to technical limitations, see #299212 (closed). It should be a critical aspect of this new work to support detection across all branches.
The currently proposed push detection flow would cover this scenario, as the PostReceiveWorker ties into any commit.
Prior to completion of this, however, it would be ideal to re-examine whether there are any quick wins using the existing pipeline approach.
In short: the latter approach seems relatively simple, matches a pre-established GL pattern, and comes with substantially less overhead (no availability concerns, no auth/application boundaries, etc.). The drawbacks would depend on the binary size (the ES indexer is ~33MB, surprisingly) and potential contention of our Sidekiq fleet due to long-running blob scanning².
@amarpatel @connorgilbert As we discussed, if we are parking this one for a bit, the benchmarks are likely to get outdated fairly quickly, so it's worth suspending that work. While I would still consider that part of technical discovery, if you would like to track that effort as part of the PoC development, I can close this one out and spin off the separate phase in a new issue under &8667, wdyt?
1. Shelling out is mostly due to ES's unique loader requirement of delivering huge JSON payloads. In our case we could most likely use gRPC or something a bit cleaner. ↩
2. This is both a pro and a con. We have a ton of experience with Sidekiq contention and horizontal scaling, whereas we would essentially have to adapt to and learn new autoscaling strategies if this were an isolated service. ↩
The PoC is sort of ready now. It has two rules ported from Secrets and uses a combination of keyword matching and the regexp Go package. Keyword matching works the same as in Secrets: it acts as a fast prefilter to determine whether the more intensive regex operations should run to confirm the match.
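For reference, a minimal sketch of that prefilter-then-confirm flow (reading a blob on stdin and emitting JSON findings, similar in shape to the PoC described above); the single rule shown is a placeholder, not one of the two ported rules:

```go
package main

import (
	"encoding/json"
	"io"
	"os"
	"regexp"
	"strings"
)

// rule pairs a cheap keyword prefilter with the confirming regex.
type rule struct {
	ID      string
	Keyword string
	Re      *regexp.Regexp
}

// finding is what gets serialised to JSON on stdout.
type finding struct {
	RuleID string `json:"rule_id"`
	Match  string `json:"match"`
}

func main() {
	// Placeholder rule; real rules would be ported from the Secrets analyzer.
	rules := []rule{{
		ID:      "example_token",
		Keyword: "EXAMPLE-",
		Re:      regexp.MustCompile(`EXAMPLE-[A-Za-z0-9]{16}`),
	}}

	blob, err := io.ReadAll(os.Stdin)
	if err != nil {
		os.Exit(1)
	}
	text := string(blob)

	findings := []finding{}
	for _, r := range rules {
		// Fast prefilter: skip the more expensive regex if the keyword is absent.
		if !strings.Contains(text, r.Keyword) {
			continue
		}
		for _, m := range r.Re.FindAllString(text, -1) {
			findings = append(findings, finding{RuleID: r.ID, Match: m})
		}
	}

	_ = json.NewEncoder(os.Stdout).Encode(findings)
}
```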
I did some quick benchmarking using hyperfine of the end-to-end flow (piping in a text blob as input and producing JSON output), and the results look good.
Much of it seems to be shell-related overhead; the internal Go benchmarks are much, much faster. There's also a warning that the results may not be accurate since the command runs too quickly.
90KB of input across two lines:
```
➜ hush git:(main) hyperfine --warmup 10 'cat testdata/blobs/large_oneline | ./hush'
Benchmark 1: cat testdata/blobs/large_oneline | ./hush
  Time (mean ± σ):       0.9 ms ±   0.1 ms    [User: 1.4 ms, System: 0.2 ms]
  Range (min … max):     0.8 ms …   1.2 ms    2013 runs

  Warning: Command took less than 5 ms to complete. Results might be inaccurate.
```
300KB of input across ~11k lines:
```
➜ hush git:(main) hyperfine --warmup 10 'cat testdata/blobs/large | ./hush'
Benchmark 1: cat testdata/blobs/large | ./hush
  Time (mean ± σ):       0.9 ms ±   0.1 ms    [User: 1.4 ms, System: 0.2 ms]
  Range (min … max):     0.8 ms …   1.5 ms    2160 runs

  Warning: Command took less than 5 ms to complete. Results might be inaccurate.
```
Should we look at "more efficient" string searching algorithms for the keyword search? My thinking is no, because we won't be able to amortise the cost of preprocessing that these algorithms require, since this binary lives and dies per blob.
What if we dispatch a goroutine to handle each line? My hunch is we won't see better numbers (or even see worse ones) because each goroutine will need to compile regexp objects independently. There's also overhead in forking and joining routines.
Should we benchmark a large number of consecutive scans?
## Topic: Limiting concurrency with synchronous Sidekiq workers blocking on long-running scans
If we are running these scans synchronously via our Sidekiq worker fleet, we need to limit worker concurrency to prevent large queues against target objects from exhausting the available workers. For this we can use `LimitedCapacity::Worker`.
Sidekiq's default concurrency is 10 threads per process, so with 50 Sidekiq processes we are looking at 500 worker threads executing 500 jobs simultaneously.
Let's say we have a push-event rate of around 1,000 per minute with an average execution time of 30 seconds per scan. At that rate we would completely exhaust our fleet from push events alone, not to mention secondary target types and, more importantly, any remaining queues GitLab uses for handling background processing.
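As a rough check on those figures, Little's law gives the expected number of scans in flight at steady state:

```math
\lambda \cdot W \;=\; \frac{1000\ \text{pushes/min}}{60\ \text{s/min}} \times 30\ \text{s per scan} \;\approx\; 500\ \text{concurrent scans}
```

which is the full capacity of a 500-thread fleet before any other scan targets are considered.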
We should choose a fairly conservative concurrency limit initially and monitor queue size to ensure we are achieving our expected SLO.
I would be surprised if we pin to main as that file seems to imply 🤔 I wonder if that's defined elsewhere, although I couldn't immediately find it.
Otherwise, it sounds pretty straightforward! I do think we would need to present some evidence, like a production readiness review, to justify merging the additional binary to the maintainers, but that shouldn't stop us from moving forward with figuring out the Omnibus process.
Ideally, we would target the cloud-native chart as well, but Omnibus is a great first step! And glad to see it's clear-cut for GDK as well.
The real concern I was discussing with Connor last week is around approval/deployment, which could be process-heavy. Building off the architectural blueprint, we might need to start looking at the production readiness review if we are service-impacting.
I think we've sufficiently deviated from a deployed standalone service, but the impact on our Sidekiq instances could be significant, so running through the readiness review checklist, even informally, would help us communicate our bottlenecks in advance and give the infrastructure team an idea of what impact the rollout will have.
Maybe a super-MVC version of this rollout could look something like a Sidekiq job that scans for a single token, with no deployed binary, and logs results.
This keeps the UI surface minimal, gives us a chance to run the worker behind a feature flag for isolation and controlled observability, and keeps us unblocked on delivering value in advance of deploying our own binary.
1. Deploy a Sidekiq worker containing the hyper-minimal scanning job above.
2. Add an enqueueing feature flag for enqueueing of Sidekiq workers on post-receive events.
3. Add a separate execution feature flag for execution of the enqueued worker on post-receive events.
4. Enable the enqueueing flag using actor-based rollout for test projects or a small percentage. Since the execution flag will be off, workers will be no-ops, but this allows us to observe queue volume.
5. Enable the execution flag using actor-based rollout to check worker execution for a small number of projects.
In parallel we should actively pursue distributing a binary, but I think something like this could help us stay unblocked on the general rollout, since we tend to have more conservative timelines for getting items into Omnibus and for when those items hit production releases.
> We currently utilize gitleaks for all token detection within our CI pipelines. By isolating detection away from a repository or pipeline context, we may no longer find it the best tool, but given our in-house experience with gitleaks and its support for ruleset configuration, we should start with gitleaks in a `--no-git` configuration and establish some benchmarks for scaling requests.
Copying from one of our internal Google docs some initial research around Hyperscan:
Intel's Hyperscan library is used by GitHub's secret scanning. Hyperscan is very mature (10+ years old) and performant. There is limited regex support to achieve such performance, such as no lookahead, but that matches existing limitations when using gitleaks. It's also purpose-built to be generic, which could work well for multiple blob/text types beyond repositories.
[Risk] I will note that the Hyperscan project's health was a bit suspect until this month's release; the most recent commit before that was from Jan 2021. Here's one HN comment from an Intel team member calling it abandoned by Intel, and an unclear comment on the repo itself. It seems GitHub has forked it and may be maintaining it now. There's also Vectorscan, which looks like a more portable fork.
[Risk] We also have no in-house expertise with a new C++ dependency, so we would rely on community support for this added component.
Thanks for the input! We do need to expand protection outside of codebases as well (&8668 (closed)), though. And we can't rely on users using a particular development environment... while organizations want/need to shift detection left, they also can't rely on client-/developer-side protections. They need something on the server to implement centralized governance. I see there is a related discussion going on in https://gitlab.com/gitlab-org/gitlab/-/issues/414738#note_1422806623 as well on this exact topic so I'll keep this short.
@connorgilbert I completely agree with this vision, but we are MVCing implementation ideas whose upfront scalability is very questionable. This will likely lead either to a lot of lost time iterating on a pattern that must be completely refactored, or to the pattern being released regardless of scaling architecture and promptly turned off (by customers or gitlab.com) due to scalability problems.
If all content being written to an instance is the vision - sidekiq jobs should not even be considered a viable first MVC.
This is where the concept that Agile Architecture can save a lot of waste is very salient.
> If all content being written to an instance is the vision - sidekiq jobs should not even be considered a viable first MVC.
@DarwinJS Sidekiq is a well-understood and highly scalable queuing system that relies on both our company's and our customers' subject-matter expertise in managing and monitoring GitLab instances. Introducing an entirely new architecture is a much riskier proposition.
There are also many ways for us to iterate on a sidekiq-based implementation allowing gradual scaling:
> or the pattern is released regardless of scaling architecture, and promptly turned off (by customers or gitlab.com) due to scalability problems.
That is certainly not our intention or expectation: the aim of this discovery is to build our MVC in a gradual and scalable way with the express intention of not blocking our users. That requirement includes ensuring we provide a valuable feature on day 1 of production deployment, and I have confidence in our team's ability to deliver with an awareness of the risks involved.
The conversations in this issue were very valuable; ongoing work to adopt this approach is tracked in #413274. I'm marking as Duplicate only so that it is easier to spot the new issue (in the "Closed (duplicate)" header) when someone encounters this issue.