[RFC] [Skims] New graph representation and symbolic evaluation
Index
1. Syntax graph
2. Syntax CFG
3. Context graph
4. New symbolic evaluation
5. Simple example
6. Multi-function example
1/6 - Syntax graph
Background
After skims selects all the file paths that can be analyzed an AST (Abstract Syntax Tree) is constructed for each of these files using the Tree-Sitter library.
The AST represents each part of the code with multiple nodes to capture all the syntax parts used in the instructions at hand. The result of this task is a large graph, containing language dependent nodes, that represents the information of the file.
The AST is then used to create an execution flow of the file called the CFG (Control Flow Graph). This CFG allows to know all the possible execution paths that the program might use in an execution.
Finally, after the CFG is completed, each of its nodes are also represented as a group of NamedTuples with the objective of create a meta syntax that is equal for all languages in order to simplify the analysis of the code. This list of meta syntax steps is the input of the current symbolic module.
Problem description
The AST is heavily used throughout the preparation and execution of the SAST checks, however, the use of this graph implies time consuming iterations due to the large number of nodes and the construction of special code to process language dependent nodes.
Proposal
Do not use a list of NameTuple objects to represent the abstract syntax, instead use this objects as graph nodes in a compact representation that contains all the necessary information from the AST but in a language independent manner. Also do not calculate the CFG from the AST, construct the execution flow from the new syntax tree to minimize the number of nodes to analyze.
A possible architecture could be
Advantages
- Smaller representation of the code
- Faster iteration
- Representation with language independence
Downsides
- The current checks should be updated in order to use the new syntax tree.
2/6 - Syntax CFG
Background
The CFG is constructed locally for each method declaration. To complement the local information of the function a metadata structure is implemented with information about the global context of the file.
Problem description
The CFG does not take into account the information outside the definition of a specific function. This makes difficult to implement checks that need information about a non-local context.
Proposal
Extend the CFG construction from the root node of a file in order to include the global context initialization and all the function definition in a connected graph.
A possible architecture could be
Advantages
- No need for metadata.
- Possibility to access global data by following the CFG.
- Possibility to consult information of other files.
Downsides
- The consult of information in other files can take some time to fetch depending on the number of references.
3/6 - Context graph
Background
When a statement is evaluated, its components are resolved based on a context schema in which higher levels and previous components are available but lower levels and latter components are not. To illustrate this concept, consider the following code
global a = 3
function one(){
print(a)
local a = 1
}
function two(){
local a = 2
print(a)
}
The first thing to notice is that there are two context levels in the example. The first one is the global context, in which the a
variable and the functions are defined. The second one, is composed of local contexts related to the definition of each function.
The information of a higher level is available in a lower one, however, the information of lower level is not available in the higher one. This means that the global definition of a
is available in both functions but the definitions the variables inside the functions are not available in the global context nor in the context of other functions.
Another thing to consider is that the execution within a context is sequential, which means that only the instructions executed until the current point are available in the evaluation of a statement. This means that when one
is executed the reference to a
is resolved in the global context, there are no available definitions in the local one yet and the evaluation should continue the search in a higher level, producing print(3)
. On the other hand when two
is executed there is an available declaration in the current context and thus there is no need to search a higher level, so the output is print(2)
.
Problem description
Solve a variable reference could be a complex problem, it involves a context handle and search for intermediate references. This may produce code that is hard to understand and would involve a time consuming task depending on the size and references used in a specific graph.
Proposal
The CFG can be used a context manager and the reference solve could be highly simplified if additional arcs are constructed from a CFG node to each of its variable references.
To avoid unnecessary computation each the context for a CFG node can be added to the syntax graph only the first time is needed and then be consulted directly.
Advantages
- The CFG information is already calculated
- Avoid over complicated code
- High decrease in search and evaluation time
Downsides
- There is no guaranty that all the variable references in a CFG node are needed in a specific evaluation, however, the number of child nodes is usually small.
4/6 - New symbolic evaluation
Background
The current symbolic module takes a list of syntax steps, represented as NameTuples, to perform the evaluation of the code.
The process starts by marking the syntax steps representing inputs that could be dangerous, such as a HTTP request or a part of a SQL query. After that, the steps representing the parts of the code in which these inputs could be harmful, such as the usage of a variable from the HTTP request or the execution of the SQL query, are marked as sinks.
Then all the possible paths that can connect an input to a sink are calculated using the CFG.
Finally, each node in a path from an input to a sink is analyzed with the objective of determine if the dangerous input is indeed used in the sink, generating a vulnerability that must be reported.
Problem description
The analysis of each node in the path from an input to a sink is not divided by finding and the evaluation functions perform multiple secondary consults, such as value calculations, construction of attribute sets and other metadata related tasks.
This multiple responsibilities in the evaluation functions generates unnecessary overhead and make the code difficult to read.
Proposal
Define a unique responsibility for evaluation functions by decoupling the propagation of the danger state from the specific findings checks and metadata related tasks.
A possible architecture could be
Advantages
- Improve code reading
- Facilitate code maintenance by assigning a single responsibility to each function
- Avoid execution of unnecessary checks
Downsides
- A lot of evaluation functions from the symbolic module must be redefined to implement the change.
5/6 - Simple example
C# code
namespace Controllers
{
public class Calculate
{
public static void ProcessRequest(HttpRequest req, HttpResponse res)
{
string name = req.QueryString["name"];
res.Write("Hello " + name);
string value = req.QueryString["value"];
if (value == null || !Regex.IsMatch(value, "^[a-zA-Z0-9]+$"))
{
throw new InvalidOperationException("Invalid value");
}
res.AddHeader("X-Header", value);
}
}
}
Syntax graph
Syntax CFG
Full context graph
Evaluation of res.Write("Hello " + name);
Evaluation comments
The enumeration refers to a evaluation order number (yellow ids). The nodes are referred with the AST id (blue ids)
1. Start evaluation of Write
method invocation
11. When a Symbol Lookup
is evaluated the context module is called to search all references of the current lookup. The search goes backwards in the path iterating through the context graph for each CFG node. In this case, since the current Symbol Lookup
is part of the CFG node 34, the search starts with CFG node 31 and ends at CFG node 14 when the res
parameter is founded.
12. After the search is completed each CFG node founded is evaluated to ensure a full evaluation of the references it contains, then the result of the desired node is assigned. This means that the node 14 is evaluated but the result of node 42 in step 11 is the same as node 23 in step 14.
17. Since there is no partial evaluation, each node can be evaluated only once. When the evaluation of node 54 starts in step 17 the program does not re-evaluate the node 14 or it childs, it consults the previous result obtained in step 14. That is why there are no more executions in step 17 and there is only a consult to node 27.
Evaluation of res.AddHeader("X-Header", value);
Evaluation comments
This section assumes that the Write
evaluation example has already been read.
1. Start evaluation of AddHeader
method invocation
4. When there are multiple usages of the same variable in a single CFG node the evaluation of that node is triggered multiple times, however, only the first one is carried out, the others simply consult the result of the first one. That is why there is only one full evaluation of the if statement in step 21 to later consult the required references.
5. When the CFG node and the founded reference are the same, the evaluation does not behave differently, the different color is just to point that the result of the CFG node will be assigned to the lookup since is also the n_id of the resolved reference.
10. In this case there is an intermediate usage of the symbol in node 42 before its definition in node 23. When this happens the references are evaluated in execution order, which means that the CFG node associated to node 23 is evaluated first than the CFG node associated to node 42 (evaluation 11 and 15 respectively).
32. This Symbol Lookup
can not be solved because is not defined in the code. This means that the search of the context node does not return any references and thus there is no evaluation aside from a finding specific check. For space reasons the full context graph search is not included in the diagram.
6/6 - Multi-function example
C# code
namespace Controllers
{
public class Calculate
{
public static void Auxiliar(HttpRequest req, HttpResponse res)
{
string name = req.QueryString["name"];
res.Write("Hello " + name);
}
public static void ProcessRequest(HttpRequest req, HttpResponse res)
{
Auxiliar(req, res);
string value = req.QueryString["value"];
res.AddHeader("X-Header", value);
}
}
}
Syntax graph
Syntax CFG
Full context graph
Evaluation of res.Write("Hello " + name);
Evaluation of res.AddHeader("X-Header", value);
Evaluation comments
This section assumes that the hole document has already been read.
15. Every method invocation tries to resolve its own invocation expression, however, this is the first example in which that consult can be successful, that is why is not included in previous diagrams.
19. When a function invocation is resolved, a new evaluation context is created with its own path to evaluate the associated execution block. This is why there is a gray box for this part of the evaluation.
30. When the new evaluation context is created the danger status of the parameters used in the function invocation are stored. This means that in the gray evaluation nodes 23 and 27 have the same value of nodes 92 and 95 from the original evaluation. That is why those nodes do not have a yellow id in the gray evaluation.