Static Analysis: real-time IDE SAST technical investigations and development
# Real-time IDE SAST - initial release technical investigations and development

## Overview

This epic tracks ~"Static Analysis::Blue" team work on real-time SAST from the IDE.

## Scope

This epic focuses on the technical tasks required to develop an architecture that reflects the priorities set by the [product description](https://gitlab.com/groups/gitlab-org/-/epics/10283) and [initial discovery work](https://gitlab.com/gitlab-org/gitlab/-/issues/420927).

## The problems to solve

The following outline is meant to structure future work. Once reviewed, the collapsed body of each item below will populate a child epic and can be removed from the description here.

1. [Create a service to scan customer code for security vulnerabilities](https://gitlab.com/groups/gitlab-org/-/epics/14254)
2. [Deploy the scanner service using Runway](https://gitlab.com/gitlab-org/gitlab/-/issues/462808)
3. [Add SAST scan feature to a GitLab supported IDE](https://gitlab.com/groups/gitlab-org/-/epics/14020)
4. [Implement profiling and benchmarking for performance tuning](https://gitlab.com/groups/gitlab-org/-/epics/13940)

## Spikes

In parallel with the primary MVC work outlined above, we will investigate alternatives with managed [spikes](https://handbook.gitlab.com/handbook/product/product-processes/#spikes). These spikes will serve to validate decisions already made and suggest timely course corrections.

**To begin work on a spike**, create a child epic that includes a description of the work, measures of success, and a timeframe. Update the list below, removing the ideas that are covered by the new child epic. Note: ensure there is team capacity before taking on new spike work.

**To add an idea for a spike**, add a bullet with a collapsible section below. Include details that would help create a child epic. Note: the list should be prioritized by product relevance.

- <details><summary>Reduce user impact of network latency by caching intermediate results</summary>

  - Prerequisites
    - An otherwise identical service that does not use caching (e.g. [POC](https://gitlab.com/gitlab-org/gitlab-vscode-extension/-/tree/lcharles-remote-security-scans))
    - Scanner service profiling and benchmarking harness
  - Tasks to consider
    - Index and store results by file content hash
    - Hash the AST and index findings by node
    - Pre-populate report cache in the background using anticipatory design, e.g. by scanning
      - all open tabs
      - recent files
      - code referenced from current file (via LSP)
    - Cache match subexpressions locally via [memoization](https://en.wikipedia.org/wiki/Memoization)
    - Use Lightz "partial scanning" with “signatures” cache
  - Measures of success
    - Scans triggered on files with cached results do not require re-scanning
    - Enable code decoration without any network traffic, e.g. at file load time (link to discussions motivating “on save” constraint)
  - Questions to answer
    - How can synchronization with remote caching...
      - further reduce the impact of network latency?
      - make use of pipeline vulnerability findings?
    - Can we design a caching strategy that’s compatible with Lightz “signatures” incremental scanning via “partial scan”?

  </details>
- <details><summary>Analyze and optimize ruleset for fast scanning</summary>

  - Prerequisites
    - Scanner profiling and benchmarking harness
  - Tasks to consider
    - Identify empirically "fast" rules
      - determine analytically why they're fast
    - Identify rules that can be matched with limited source scope, e.g. `f(...)` is limited to function calls
      - rulesets with limited scope can be applied incrementally, e.g. only the scoped context of changes needs to be re-scanned.
  - Measures of success
    - Scanner benchmarks to (in)validate ruleset optimization strategies
  - Questions to answer
    - Is further investigation into rules likely to identify better results?

  </details>
- <details><summary>Compare Lightz and Semgrep OSS performance on non-taint rules</summary>

  - Prerequisites
    - Scanner profiling and benchmarking harness
  - Tasks to consider
    - Compute scan times across a variety of languages, file sizes, and rules using both analyzers
  - Measures of success
    - Scanner benchmarks to inform tuning via scanner selection according to language, file size, rule

  </details>
- <details><summary>Identify opportunities to integrate telemetry</summary>

  - Prerequisites
    - IDE integrations with relevant, measurable user interactions like viewing of summary reports.
  - Tasks to consider
    - Detect "interest" in local scan results and report via [insights](https://docs.gitlab.com/ee/user/project/insights/), for example.
  - Measures of success
    - Articulation of usage for customer users and admins as well as GitLab product and engineering teams

  </details>
- <details><summary>Assess and improve “partial scanning” with Lightz</summary>

  - Prerequisites
    - Scanner profiling and benchmarking harness
  - Tasks to consider
    - Use an approximate/limited “signatures” database that's fast to compute, e.g. by using “signatures” of APIs reachable from current file
  - Measures of success
    - Performance gap and root cause analysis of inter-file real-time SAST with Lightz “partial scanning” for performance tuning

  </details>
- <details><summary>Run some analysis locally to improve performance</summary>

  - Prerequisites
    - Scanner service profiling and benchmarking harness
    - A fully local deployment is likely out of scope for realistic evaluation in a spike. This spike supports other spikes by running _some_ of the analysis locally.
  - Tasks to consider
    - where the primary spike concern is caching
      - use `js_of_ocaml` and `Emscripten` to build a "frontend" of Semgrep OSS directly into the extension (as is done by the semgrep-vscode extension) where
        - the "frontend" computes the common/generic AST from code and
        - the "backend" would be implemented in the scanner service, and would consume common/generic AST
      - deploy parts of VET as WASM for intra-project reachability anticipatory scanning
    - where the primary spike concern is Lightz "partial scanning"
      - deploy parts of VET as WASM to compute an approximation of Lightz signatures database
  - Measures of success
    - Scanner service benchmarks to inform tuning efforts
  - Questions to answer
    - Is localizing some of the scan worth the complexity and maintenance cost of additional toolchains?
    - Is there scope to further investigate partitioning scans into "frontend"/local and "backend"/remote in order to reduce latency?

  </details>

## Links

- https://gitlab.com/groups/gitlab-org/-/epics/10283+s
- https://gitlab.com/gitlab-org/gitlab/-/issues/420927+s
- https://gitlab.com/gitlab-org/gitlab/-/issues/405439+s
- https://gitlab.com/groups/gitlab-org/editor-extensions/-/epics/36+s

Disclaimer: This is a working draft.

<!-- triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION -->
*This page may contain information related to upcoming products, features and functionality. It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes. Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.*
<!-- triage-serverless v3 PLEASE DO NOT REMOVE THIS SECTION -->
epic