Generic SAST engine
Problem to solve
Our current SAST offering is built on a heterogeneous set of tools. This approach got us up to speed quickly, but it has many limitations, most of which stem from the lack of a common code representation.
- Lack of common code representation
  - We cannot formalize a set of general attack or vulnerability patterns that are reusable across language boundaries. Conceptually, an XSS vulnerability follows the same pattern irrespective of the language in which its vulnerable host application was developed.
  - We cannot learn new attack patterns or code smells across different languages.
  - We cannot release features such as incremental code scanning, code navigation, or snippet matching, because they require access to a code representation that contains AST, data-flow, and control-flow information.
  - We cannot mine for certain patterns across different GitLab projects developed in different languages.
  - Data-flow and control-flow information is not always provided. The latter is particularly important for some of our categories, like RASP.
- Engines are all written in different languages
Intended users
- Delaney (Development Team Lead)
- Sasha (Software Developer)
- Devon (DevOps Engineer)
- Sidney (Systems Administrator)
- Sam (Security Analyst)
Further details
Some vulnerabilities are almost impossible to spot without a complete understanding of the code and its control and data flow. SQL injections are a good example: most of the time they depend on a complete flow that rules based on regular expressions cannot cover.
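To illustrate the point, here is a minimal Python sketch that compares a regex rule with a very simple taint propagation over the AST. Everything in it is an assumption made for illustration: the sample handler, the regex rule, and the helper names `regex_findings` and `taint_findings` are hypothetical, and the taint tracking only handles straight-line assignments inside a single function. It is not how any of our current analyzers work.

```python
import ast
import re

# Hypothetical vulnerable handler: user input reaches execute() through
# two intermediate variables.
SAMPLE = '''
def handler(request):
    name = request.args["name"]
    clause = "WHERE name = '" + name + "'"
    query = "SELECT * FROM users " + clause
    cursor.execute(query)
'''

# Hypothetical regex rule: flag execute() calls that concatenate strings inline.
REGEX_RULE = re.compile(r"execute\([^)]*\+")


def regex_findings(source):
    """Return the source lines matched by the regex rule."""
    return [line.strip() for line in source.splitlines() if REGEX_RULE.search(line)]


def names_in(node):
    """Return every variable name that appears inside an AST node."""
    return {n.id for n in ast.walk(node) if isinstance(n, ast.Name)}


def taint_findings(source, sources=frozenset({"request"}), sink="execute"):
    """Return line numbers of sink calls that receive a tainted argument.

    Simplification: only top-level statements of each function are followed,
    in order, with no branches, loops, or inter-procedural flow.
    """
    findings = []
    tree = ast.parse(source)
    for func in (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)):
        tainted = set(sources)
        for stmt in func.body:
            # Propagate taint through assignments: x = <expr using tainted name>
            if isinstance(stmt, ast.Assign) and names_in(stmt.value) & tainted:
                for target in stmt.targets:
                    tainted |= names_in(target)
            # Report method calls named like the sink with a tainted argument.
            for call in (n for n in ast.walk(stmt) if isinstance(n, ast.Call)):
                if isinstance(call.func, ast.Attribute) and call.func.attr == sink:
                    if any(names_in(arg) & tainted for arg in call.args):
                        findings.append(call.lineno)
    return findings


if __name__ == "__main__":
    print("regex rule:", regex_findings(SAMPLE))
    print("taint flow:", taint_findings(SAMPLE))
```

In this sample the regex rule reports nothing because the concatenation happens before the `execute` call, while the flow-based check reports the call. A real engine would additionally need control-flow and inter-procedural information.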
Proposal
The vulnerability research team has already evaluated a potential foundation in issue https://gitlab.com/gitlab-com/gl-security/appsec/vulnerabiltiy-research/issues/4 and developed a limited Ruby POC that extracts code information from the AST (functions, calls, arguments), augments it with data-flow and call information, and stores the result in a graph database. The graph database enables users to run code queries that answer questions such as: is there a data flow from a to b?
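The pipeline described above (extract from the AST, add call and data-flow edges, store in a graph, query it) can be sketched roughly as follows. This is not the Ruby POC: it is a Python illustration in which plain in-memory dictionaries stand in for the graph database, and the names `SOURCE`, `extract_graph`, and `has_flow` are hypothetical. Data flow is approximated by assignment edges only.

```python
import ast
from collections import defaultdict, deque

# Hypothetical program to analyze.
SOURCE = '''
def read_input():
    return input()

def build_query(value):
    return "SELECT * FROM t WHERE c = " + value

def main():
    a = read_input()
    b = build_query(a)
    run(b)
'''


def data_names(expr):
    """Names read in an expression, excluding names used as call targets."""
    callees = {id(n.func) for n in ast.walk(expr) if isinstance(n, ast.Call)}
    return {n.id for n in ast.walk(expr)
            if isinstance(n, ast.Name) and id(n) not in callees}


def extract_graph(source):
    """Build two edge sets from the AST.

    calls: function name -> names of the functions it calls
    flows: "func.var" -> set of "func.var" nodes that receive its value
    """
    calls, flows = defaultdict(set), defaultdict(set)
    tree = ast.parse(source)
    for func in (n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)):
        for node in ast.walk(func):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                calls[func.name].add(node.func.id)
            if isinstance(node, ast.Assign):
                for target in node.targets:
                    if isinstance(target, ast.Name):
                        for used in data_names(node.value):
                            flows[f"{func.name}.{used}"].add(f"{func.name}.{target.id}")
    return calls, flows


def has_flow(flows, start, goal):
    """Breadth-first search: is there a data-flow path from start to goal?"""
    seen, queue = {start}, deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return True
        for nxt in flows.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


if __name__ == "__main__":
    calls, flows = extract_graph(SOURCE)
    print("calls made by main:", sorted(calls["main"]))
    print("flow a -> b:", has_flow(flows, "main.a", "main.b"))
    print("flow b -> a:", has_flow(flows, "main.b", "main.a"))
```

With a real graph database the `has_flow` function would become a path query, and the same query could be reused for any language that is lowered into the same node and edge schema.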
Permissions and Security
TODO
Documentation
TODO
Testing
TODO
What does success look like, and how can we measure that?
- Better results: lower false positive rate (FPR), higher true positive rate (TPR)
- Contributions to our rules
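For reference, the two rates can be computed from a labeled benchmark using the standard definitions FPR = FP / (FP + TN) and TPR = TP / (TP + FN). The helper name `rates` and the counts below are hypothetical and only show the arithmetic; measuring this in practice assumes we have a ground-truth set of known vulnerable and known safe code locations.

```python
def rates(tp, fp, tn, fn):
    """Return (false positive rate, true positive rate).

    Assumes the ground truth contains at least one positive and one
    negative case, so neither denominator is zero.
    """
    fpr = fp / (fp + tn)  # share of safe locations that were flagged
    tpr = tp / (tp + fn)  # share of real vulnerabilities that were found
    return fpr, tpr


if __name__ == "__main__":
    # Hypothetical counts for two engines on the same benchmark.
    print("current engines:", rates(tp=6, fp=20, tn=70, fn=4))
    print("generic engine: ", rates(tp=8, fp=5, tn=85, fn=2))
```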
What is the type of buyer?
Links / references
LLVM IR References/Resources
- http://dev.stephendiehl.com/numpile/ - Python JIT with LLVM and LLVM IR
- https://github.com/dabeaz/llvm-py/blob/master/www/src/userguide.txt#L315 - Dealing with modules in interpreted languages in LLVM IR (specifically Python)
- https://us.pycon.org/2016/schedule/presentation/1995/ - Wrestling Python into LLVM IR
- https://llvm.org/pubs/2004-Spring-AlexanderssonMSThesis.html - Ruby to LLVM
- https://github.com/k0kubun/llrb/blob/master/README.md - Ruby -> LLVM IR -> LLVM Bitcode
- https://go.googlesource.com/gollvm/ - Go -> LLVM IR -> Backend
- https://github.com/ShiftLeftSecurity/llvm2graphml - LLVM to GraphML