Spike on Using Semgrep as a Foundation of API Discovery

Pros

Zero user configuration

Cons

Need to understand the limitations of the tooling

Spike

Investigate whether lightz (GitLab's version of semgrep) could be modified to extract API operations from source code. This would work by adding a new mode of operation (API Discovery) with a rule set tailored to that usage. We could then leverage the taint analysis to trace route registrations back to their implementation methods. From there we would analyze each implementation to identify arguments, body structure, use of headers, etc. The output from the tool would be an OpenAPI specification.

How would this potentially work?

Let's use a simple Flask+Python example to see how this might work.

Given the following Python code, we look for the app = Flask(...) assignment, which defines a Flask application. From there, taint analysis allows us to track the use of app and identify methods with @app.route(...) decorators. This is one way to add API operations to Flask.

From the app.route call we read the parameters to get the path (/employees) and the HTTP methods supported (GET). We can also identify arguments from the path itself (<int:id>) and possibly their types.
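To make the path handling concrete, here is a minimal sketch of how a Flask route rule could be turned into an OpenAPI path plus path parameters. It is plain Python rather than a lightz rule, and the function name flask_rule_to_openapi and the converter-to-schema mapping are assumptions made for illustration only.

```python
import re

# Flask's built-in URL converters mapped to OpenAPI schema types.
# The mapping itself is an assumption made for this sketch.
CONVERTER_TYPES = {
    "string": {"type": "string"},
    "int": {"type": "integer"},
    "float": {"type": "number"},
    "path": {"type": "string"},
    "uuid": {"type": "string", "format": "uuid"},
}

# Matches <name> and <converter:name>. Converters that take arguments,
# such as <string(length=2):code>, are not handled here.
_PARAM = re.compile(r"<(?:(?P<conv>[A-Za-z_]\w*):)?(?P<name>[A-Za-z_]\w*)>")


def flask_rule_to_openapi(rule):
    """Return (openapi_path, parameters) for a Flask route rule string."""
    parameters = []

    def replace(match):
        conv = match.group("conv") or "string"  # Flask defaults to string
        parameters.append({
            "name": match.group("name"),
            "in": "path",
            "required": True,
            "schema": CONVERTER_TYPES.get(conv, {"type": "string"}),
        })
        return "{" + match.group("name") + "}"

    return _PARAM.sub(replace, rule), parameters
```

For example, flask_rule_to_openapi("/employees/<int:id>") yields the path /employees/{id} and a single required integer path parameter named id.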

From there we would look at the method's implementation and find any calls to the request object to identify headers, form data, etc. If we see the request body being JSON-deserialized, we could track the usage of the resulting dictionary to identify the JSON structure and types, assuming lightz allows that level of analysis.

import json
from flask import Flask, jsonify, request
app = Flask(__name__)

employees = [{ 'id': 1, 'name': 'Ashley' }, ... ]

@app.route('/employees', methods=['GET'])
def get_employees():
    return jsonify(employees)

@app.route('/employees/<int:id>', methods=['GET'])
def get_employee_by_id(id: int):
    # get_employee is assumed to be a lookup helper defined elsewhere
    employee = get_employee(id)
    if employee is None:
        return jsonify({ 'error': 'Employee does not exist'}), 404
    return jsonify(employee)
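As a rough illustration of the discovery step described above, the sketch below uses Python's standard ast module as a stand-in for lightz. It is not how semgrep or lightz works internally. It only handles the simplest case: a module-level name = Flask(...) assignment and @name.route(...) decorators with literal arguments. The helper names find_flask_apps and find_routes are hypothetical, and ast.unparse requires Python 3.9 or later.

```python
import ast

# A trimmed copy of the example application, used only as parser input.
SOURCE = '''
from flask import Flask, jsonify
app = Flask(__name__)

@app.route('/employees', methods=['GET'])
def get_employees():
    return jsonify([])

@app.route('/employees/<int:id>', methods=['GET'])
def get_employee_by_id(id: int):
    return jsonify({})
'''


def find_flask_apps(tree):
    """Names bound by a module-level `name = Flask(...)` assignment."""
    names = set()
    for node in tree.body:
        if (isinstance(node, ast.Assign)
                and isinstance(node.value, ast.Call)
                and isinstance(node.value.func, ast.Name)
                and node.value.func.id == "Flask"):
            names.update(t.id for t in node.targets if isinstance(t, ast.Name))
    return names


def find_routes(source):
    """List routes registered through @<app>.route(...) decorators."""
    tree = ast.parse(source)
    apps = find_flask_apps(tree)
    routes = []
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        for dec in func.decorator_list:
            if not (isinstance(dec, ast.Call)
                    and isinstance(dec.func, ast.Attribute)
                    and dec.func.attr == "route"
                    and isinstance(dec.func.value, ast.Name)
                    and dec.func.value.id in apps
                    and dec.args):
                continue
            try:
                path = ast.literal_eval(dec.args[0])
                methods = ["GET"]  # Flask defaults to GET when methods= is omitted
                for kw in dec.keywords:
                    if kw.arg == "methods":
                        methods = list(ast.literal_eval(kw.value))
            except ValueError:
                continue  # non-literal arguments would need real data-flow analysis
            # Type hints on the handler's arguments, rendered back to source text.
            hints = {a.arg: ast.unparse(a.annotation)
                     for a in func.args.args if a.annotation is not None}
            routes.append({"path": path, "methods": methods,
                           "handler": func.name, "arg_types": hints})
    return routes
```

Calling find_routes(SOURCE) returns one entry per decorated handler, including the int hint on id. Aliased apps, blueprints, and add_url_rule registrations are exactly the cases where this syntactic approach falls short and taint analysis would be needed.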

Questions to answer

Does lightz support Python method decorators?
Does lightz support Python type hints? That is, given a method argument, can we access its annotation?
How hard would it be to add a new rule type to lightz that does what we need?
Can we inspect call arguments and read their static (literal) values?
Given a method, can we search that method and all methods it calls for uses of the global request variable? We want to identify any headers, form fields, or query strings being read or set, and any bodies being parsed as JSON, XML, etc.
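The last question can be sketched in plain Python to show what answer we are hoping for. The example below again uses the standard ast module, not lightz, and follows only direct calls by name within a single module. The sample handler, the TRACKED set, and the request_uses helper are all hypothetical.

```python
import ast

# Hypothetical handler that reads the request directly and through a helper.
SOURCE = '''
from flask import request

def read_token():
    return request.headers.get("X-Token")

def create_employee():
    token = read_token()
    body = request.get_json()
    dept = request.args.get("dept")
    return body["name"], dept, token
'''

# Attributes of flask.request that this sketch reports on.
TRACKED = {"headers", "args", "form", "json", "get_json", "cookies", "files"}


def request_uses(source, entry):
    """Collect `request.<attr>` uses in `entry` and the same-module functions it calls."""
    tree = ast.parse(source)
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    seen, pending, uses = set(), [entry], []
    while pending:
        name = pending.pop()
        if name in seen or name not in funcs:
            continue  # already visited, or defined outside this module
        seen.add(name)
        for node in ast.walk(funcs[name]):
            if (isinstance(node, ast.Attribute)
                    and isinstance(node.value, ast.Name)
                    and node.value.id == "request"
                    and node.attr in TRACKED):
                uses.append((name, node.attr))
            elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                pending.append(node.func.id)  # follow direct calls by name
    return sorted(uses)
```

For the sample above, request_uses(SOURCE, "create_employee") reports args and get_json in create_employee and headers in read_token. Method calls on objects, imported helpers, and aliased request objects are not followed, which is where a real inter-procedural analysis would have to take over.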

Proposal

Talk with Jason Leasure in SAST. Jason is not a member of the lightz team, but has been tasked with understanding lightz capabilities and identifying gaps. This is the first stop because it won't use up any of the lightz team's time, which we need to be sensitive to.
1. Include @mikeeddington in the meeting
Depending on the outcome of the meeting with Jason, identify a lightz member to have further discussions with.
Review existing lightz rules to get an idea of how the system works for finding vulnerabilities.
Review the semgrep/lightz code.
Write up results

Edited May 21, 2024 by Michael Eddington