Next Generation SAST engine (#3260) · Epics · GitLab.org

Next Generation SAST engine

### Problem to solve

Our current SAST offering is based on [various and heterogeneous tools](https://gitlab.com/gitlab-org/security-products/analyzers). While this approach was a great strategy to get up to speed quickly, it also has a lot of limitations. Most of them are due to the lack of a common code representation.

- Lack of common code representation
  - we cannot formalize a set of general attack or vulnerability patterns that are re-usable across language boundaries. Conceptually an XSS follows the same pattern irrespective of the language that was used to develop its vulnerable host application.
  - we cannot learn new attack patterns or code smells across different languages.
  - we cannot release features such as incremental code scanning, code navigation, snippet matching because they require access to a code representation that contains AST, data-flow and control-flow information.
  - we cannot mine for certain patterns across different GitLab projects developed in different languages.
  - data-flow and control-flow are not always provided. The latter is particularly important for some of our categories, like [RASP](https://about.gitlab.com/direction/defend/rasp/).
- Engines are all written in different languages

### Intended users

* [Delaney (Development Team Lead)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#delaney-development-team-lead)
* [Sasha (Software Developer)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#sasha-software-developer)
* [Devon (DevOps Engineer)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#devon-devops-engineer)
* [Sidney (Systems Administrator)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#sidney-systems-administrator)
* [Sam (Security Analyst)](https://about.gitlab.com/handbook/marketing/product-marketing/roles-personas/#sam-security-analyst)

### Further details

Some vulnerabilities are almost impossible to spot without a complete understanding of the code and of its control and data flow. SQL injections are a good example: most of the time they depend on a complete flow that rules based on regular expressions cannot cover.

### Proposal

The vulnerability research team already evaluated a potential foundation in issue https://gitlab.com/gitlab-com/gl-security/appsec/vulnerabiltiy-research/issues/4, and developed a limited [Ruby POC](https://gitlab.com/gitlab-org/gitlab/issues/35380) that extracts code information from the AST (functions, calls, arguments), augments it with data-flow and call information, and stores the result in a graph database. The graph database enables users to run code queries in order to answer questions such as: `Is there a data flow from a to b`.

### Permissions and Security

TODO

### Documentation

TODO

### Testing

TODO

### What does success look like, and how can we measure that?

- Better results (lower false-positive rate, FPR; higher true-positive rate, TPR)
- Contributions to our rules

### What is the type of buyer?

gitlab~3207279

### Links / references

#### LLVM IR References/Resources

* http://dev.stephendiehl.com/numpile/ - Python JIT with LLVM and LLVM IR
* https://github.com/dabeaz/llvm-py/blob/master/www/src/userguide.txt#L315 - Dealing with modules in interpreted languages in LLVM IR (specifically Python)
* https://us.pycon.org/2016/schedule/presentation/1995/ - Wrestling Python into LLVM IR
* https://llvm.org/pubs/2004-Spring-AlexanderssonMSThesis.html - Ruby to LLVM
* https://github.com/k0kubun/llrb/blob/master/README.md - Ruby -> LLVM IR -> LLVM Bitcode
* https://go.googlesource.com/gollvm/ - Go -> LLVM IR -> Backend
* https://github.com/ShiftLeftSecurity/llvm2graphml - LLVM to GraphML
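
#### Illustrative sketch of a data-flow query

The proposal describes extracting information from the AST, adding data-flow and call edges, and then asking questions such as `Is there a data flow from a to b`. The following is a minimal sketch of that idea, not the Ruby POC itself. It assumes Python's standard `ast` module as the front end and a plain in-memory dictionary in place of a graph database. The names `build_flow_graph`, `has_data_flow`, `read_param`, `execute`, and `log` are hypothetical and exist only for this example. The analysis is deliberately crude: it is flow-insensitive, ignores scopes, and only looks at plain assignments and direct calls.

```python
import ast
from collections import defaultdict, deque


def _names(expr: ast.AST) -> set[str]:
    """Collect every identifier that appears anywhere inside an expression."""
    return {n.id for n in ast.walk(expr) if isinstance(n, ast.Name)}


def build_flow_graph(source: str) -> dict[str, set[str]]:
    """Build a coarse graph where an edge u -> v means 'v may be derived from u'.

    Assumption: only plain assignments and calls to simple names are modelled.
    """
    graph: dict[str, set[str]] = defaultdict(set)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            # Every name used on the right-hand side flows into every target.
            sources = _names(node.value)
            for target in node.targets:
                for target_name in _names(target):
                    for src in sources:
                        graph[src].add(target_name)
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            # Every name used in an argument flows into the called function.
            for arg in node.args:
                for src in _names(arg):
                    graph[src].add(node.func.id)
    return dict(graph)


def has_data_flow(graph: dict[str, set[str]], source: str, sink: str) -> bool:
    """Answer 'is there a data flow from source to sink' by breadth-first search."""
    seen, queue = {source}, deque([source])
    while queue:
        current = queue.popleft()
        if current == sink:
            return True
        for nxt in graph.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


# Hypothetical input program: request data is concatenated into a SQL string.
SAMPLE = '''
user_input = read_param("id")
query = "SELECT * FROM users WHERE id = " + user_input
safe = "constant"
execute(query)
log(safe)
'''

graph = build_flow_graph(SAMPLE)
print(has_data_flow(graph, "read_param", "execute"))  # True
print(has_data_flow(graph, "read_param", "log"))      # False
```

A rule based on regular expressions would have to match the `execute(query)` call and separately guess where `query` came from. The graph query follows the chain `read_param -> user_input -> query -> execute` directly, which is the kind of complete flow that SQL injection detection depends on.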
