Incremental scanning for Advanced SAST (skip unchanged code)
## Executive Summary

**_Very short version:_** _Make Advanced SAST faster by caching results from previous scans, and only scanning code that has actually changed._

Advanced SAST does detailed analysis of customer programs to provide accurate results, but this analysis can take long enough that scans don't finish fast enough for customers to adopt Advanced SAST in their MR processes. In some large repositories, scans may even exceed the time limit for pipeline jobs.

To address this, we will cache the results of computationally expensive tasks and re-use them for future scans. Those scans will then have to do less work, since they can skip what's already computed in the cache. We will call this "incremental scanning".

Because most commits and MRs only involve a small portion of the repository, we expect this to have significant impact. Advanced SAST scan runtime is a significant blocker to adoption in large repositories, and being unable to adopt it in those repositories can block organizations from adopting Advanced SAST at all.

This will be a mostly invisible performance improvement (in a good way!), because each scan will produce full, correct, accurate results, but in less time. For practical risk reduction reasons, though, we will likely allow administrators to choose whether it is enabled.

This feature is a further iteration of https://gitlab.com/groups/gitlab-org/-/epics/16790+; the main difference is that this feature uses a cache to produce full results, while that feature produces only partial results. It will deliver customer results in concert with the algorithm changes and other performance improvements we are also separately delivering in other parts of https://gitlab.com/groups/gitlab-org/-/epics/16560+.

### Engineering Assessment

While the general idea is simple, there are implementation details that need exploring.

- Branch support.
  - A feature branch may not have a signature cache available, which means the engine may need to use the default branch cache as a starting point.
- Cache storage/retrieval.
  - [Early discussions](https://gitlab.com/groups/gitlab-org/-/epics/14123#note_2301497386) considered a relational database for storage with a new API. While this is likely to be a successful choice, it is also high effort and requires cross-stage coordination.
  - CI artifact storage is already available, and is likely to be fit for purpose. While this choice restricts the solution to CI, it seems appropriate given that the scope of this epic is limited to CI jobs anyway (e.g. no support for SAST in the IDE).
- Support for very large projects.
  - In this scenario, it's possible that GLAS times out before it's able to complete a scan. While there are workarounds (e.g. extend the pipeline timeout, reduce scan scope, use multi-core, increase computing resources), failure to complete a full scan prevents projects from benefiting from this feature.
  - We may want to explore the ability to stop/resume scans, so that a full scan is not required. For example, if the engine is aware of the pipeline timeout, it can save its cache and results before a forced exit. The next job will then have a cache available that it can use to continue scanning.

### Dependencies

- Team dependencies:
  - https://gitlab.com/groups/gitlab-org/-/epics/16790+
  - UI: Minimal or no direct changes required for UI.
- Epic/Issue dependencies
  - _Link to dependent epics/issues via the linked items widget below for ease of drill down_
- External dependencies: N/A

<details>
<summary>

## Other Interlock administrivia

</summary>

#### DRIs

- **PM**: @connorgilbert <!--also add as assignee to this epic-->
- **EM**: @thiagocsf <!--also add as assignee to this epic-->
- **UX/PDM**: No stable counterpart; cc @jmandell <!--also add as assignee to this epic-->
- **Group(s)**: ~"group::static analysis" <!--also add as label-->
- **Engineering Owner**: @twoodham

#### Initiative Driver - Product or Engineering?

- [x] **Product-driven initiatives (P1/P2/P3)**
  - Customer-facing features or improvements driven by Product teams that require engineering resources and commitment
  - These initiatives require a Product Priority label (P1/P2/P3)
  - They may also receive GTM tier labels (T1/T2/T3) for external communication
- [ ] **Engineering-driven initiatives (E1/E2/E3)**
  - Internal technical improvements that may not have customer-facing components
  - These initiatives require an Engineering Priority label (E1/E2/E3)
  - They have internal visibility only and are not externally communicated
  - Examples include: technical debt reduction, infrastructure improvements, refactoring, dependency upgrades

#### Sizing and Funding (Optional)

- **Size**: \[XS/S/M/L/XL\]
- **Funding Status**: \[Funded/Partially funded/Not funded\]

#### Hygiene Guidelines

:bulb: _See additional details about this process at https://handbook.gitlab.com/handbook/product-development/r-and-d-interlock/_

##### :one: Pre-Interlock

- [x] Update epic description with all relevant information
- [x] Ensure all dependencies are identified
- [x] Apply appropriate labels (see below)
- [ ] Apply target delivery Milestone
- [ ] Update interlock status as discussions progress (via label)

##### :two: Post-Interlock: once quarter begins

- Update health status weekly (via label)
- Document any newly identified risks or dependencies
- 
  Link to implementation epics/issues as work begins
- Flag any scope or timeline changes immediately

<!--Apply appropriate labels:
- [ ] Section (section::dev, section::ops, section::sec)
- [ ] Stage (devops::plan, devops::create, devops::verify, etc.)
- [ ] Group (group::product planning, group::project management, etc.)
- [ ] Interlock Priority (Product labels = Interlock Priority::P1, Interlock Priority::P2, Interlock Priority::P3, Engineering labels = Interlock Priority::E1, Interlock Priority::E2, Interlock Priority::E3)
- [ ] Investment theme (Investment theme::Core-Devops, Investment theme::Security-Compliance, Investment theme::AI across SDLC)
- [ ] Platforms (platform: GitLab.com, platform: dedicated, platform: dedicated for gov, platform: self-managed)
- [ ] Subscription tier (GitLab Ultimate, GitLab Premium, GitLab Free)
- [ ] Quarter (FY27 Q1, FY27 Q2, FY27 Q3, FY27 Q4)
- [ ] Pre-interlock status label (interlock status::New/Proposal in progress, interlock status::cancelled, etc)
- [ ] Post-interlock status label (R&D roadmap status::Executing, R&D roadmap status::Completed)
- [ ] Post-interlock, once quarter begins update health weekly (health::on track, health::needs attention, health::at risk)

*For guidance on labels, see the [labels guide here](https://handbook.gitlab.com/handbook/product-development/r-and-d-interlock/#labels-guide)-->

</details>

## Feature details

### Rollout control

This feature is the kind that could be on by default, and eventually we likely will make it so. However, practically, especially as we start rolling the feature out for the first time:

- If there are any infrastructure requirements or stability concerns, it would be ideal to allow the feature to be disabled by:
  - Self-Managed admins, including Dedicated admins
  - .com users of sufficient privilege (likely Maintainer or Owner), on a group or project basis
- We should show some indication in logs (at a minimum) or in the UI (ideal).
  This is to:
  - Aid in debugging
  - Build awareness around the feature’s existence and operational status
- We should use a feature flag to minimize rollout risk, at least for the initial rollout on .com. The FF is not necessary for Self-Managed and Dedicated releases as long as the rollout controls above, where admins can disable the feature, are available.

### Semantics/behavior

- Storage can be limited to the default branch.
  - _Assumption:_ other branches or branching strategies will still work, but could be slower. (That is, the optimization would be less effective but not _incorrect_.)
- If we are trading off between slightly increased runtime versus simplicity or reliability, we can accept a runtime hit in return for reliability.
- We will need to produce a defensible (even if not perfect) number for describing the feature, for example "Incremental scanning produces identical results in 40-80% less time in typical repositories".

We must:

- Include system limits for system stability
  - These are to be analyzed and proposed by Engineering, but, for example, could include size of cached objects or time taken to load those objects from storage.
- Support self-managed + offline, i.e., not use Cloud Connector for this feature
- Know the cases in which an incremental scan will miss something, and document them (if any)
- Only report results from incremental scans that would also appear in a full/non-incremental scan. (Put another way: an incremental scan must not report any finding that a full scan would not.)

We can, if needed:

- Start out by only offering the feature on non-default branches, if we have any concern about incremental scans being incomplete or incorrect. (This would be to reduce risk.)

### Interaction with other features

- This will likely obviate the need for diff-based MR scans in most cases (https://gitlab.com/groups/gitlab-org/-/epics/16790).
  However, diff-based scanning may still be independently useful, especially while incremental scanning is still under development or being released.
- It would be great if this technology could contribute to our ability to use Advanced SAST-based logic in real-time IDE scanning.

Note that this is _NOT_ a solution for removing irrelevant-seeming results from the merge request widget. Problems with such results are likely related to how security reports are handled, or how scan results are tracked over time.

### Target Metric

* Improved scan runtime performance, which can lead to an increase in adoption of Advanced SAST. See [my internal comment](https://gitlab.com/groups/gitlab-org/-/epics/15545#note_2890243734) for more details.
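
### Illustrative sketch (non-normative)

To make the caching idea from the Executive Summary concrete, here is a minimal Python sketch. It is not the Advanced SAST engine's design: it assumes, purely for illustration, that findings can be cached per file and keyed by a hash of the file's content, whereas the real engine may need to cache finer-grained or cross-file results. All names (`content_key`, `pick_cache`, `incremental_scan`, `toy_analyze`) and the `main` default branch name are hypothetical. The sketch shows two behaviors described above: a branch without its own cache falls back to the default branch's cache, and an incremental scan returns the same results as a full scan while analyzing fewer files.

```python
import hashlib
from typing import Callable, Dict, List, Tuple

Findings = List[str]
Cache = Dict[str, Findings]  # content hash -> findings for that content


def content_key(source: str) -> str:
    """Hash file content so that unchanged files map to the same cache entry."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def pick_cache(branch: str, caches: Dict[str, Cache],
               default_branch: str = "main") -> Cache:
    """Use the branch's own cache if it exists, else the default branch's cache."""
    if branch in caches:
        return caches[branch]
    return caches.get(default_branch, {})


def incremental_scan(
    files: Dict[str, str],
    analyze: Callable[[str], Findings],
    cache: Cache,
) -> Tuple[Dict[str, Findings], Cache, int]:
    """Scan every file, running `analyze` only on content missing from `cache`.

    Returns (results per path, cache to store for the next scan,
    number of files that actually had to be analyzed).
    """
    results: Dict[str, Findings] = {}
    new_cache: Cache = {}
    analyzed = 0
    for path, source in files.items():
        key = content_key(source)
        if key in cache:
            findings = cache[key]       # unchanged content: reuse cached findings
        else:
            findings = analyze(source)  # new or changed content: do the real work
            analyzed += 1
        results[path] = findings
        new_cache[key] = findings
    return results, new_cache, analyzed


def toy_analyze(source: str) -> Findings:
    """Stand-in for an expensive analysis (hypothetical rule)."""
    return ["use of eval"] if "eval(" in source else []


# First scan on the default branch: nothing is cached yet.
repo_v1 = {"a.py": "print(1)", "b.py": "eval(x)"}
_, default_cache, n_first = incremental_scan(repo_v1, toy_analyze, {})

# A feature branch adds one file and has no cache of its own.
repo_v2 = {"a.py": "print(1)", "b.py": "eval(x)", "c.py": "eval(y)"}
cache = pick_cache("feature", {"main": default_cache})
incremental_results, _, n_second = incremental_scan(repo_v2, toy_analyze, cache)

# A full scan of the same tree, for comparison.
full_results, _, _ = incremental_scan(repo_v2, toy_analyze, {})
```

In this toy setup only the added file is analyzed on the feature branch, and `incremental_results` matches `full_results`, which is the invariant the "We must" list asks for.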