One file with syntax errors should not stop SAST and similar kind of jobs from running

This SAST job failed for me recently, because apparently one file had an syntax error:

https://gitlab.com/gitlab-org/gitlab-ce/-/jobs/86638388

In my opinion one unparseable file should not stop scanning of hundred files other files, so I would do something like this:

all_files= set of all files
compiled_files = {}
error_files = {}

for file in all_files
try
   compiled = compile(file)
   compiled_files = compiled_files ∪ {compiled}
catch e
   error_files = error_files ∪ {file}

run_sast(compiled_files)
print "SAST could not scan " + error_files

At the moment we are simply scanning all javascript files in a repo. But it is totally legit to have corrupt or incorrect files in a repo, because you might be building a parser yourself and need those files for tests. You are testing systems how they react to broken inputs.

Activity Summary

Certain scanners require "pre-processing steps" to perform their analysis: in some cases compilation, or dependency fetching, or comment stripping
pre-processing steps can fail, causing the job to fail
question: if pre-processing fails, should we attempt scan anyway? can we?
- answer: it depends.
  - A "syntax error" here is an unsupported runtime essentially; new javascript features our parser doesn't recognize. If we can't strip comments what do we do? See below possible solutions.
  - If compilation fails, we cannot build a project. If we can initiate a partial scan that's essentially reporting incomplete data. We should avoid false negatives as much as possible.
this issue seems largely isolated to javascript, we haven't seen other languages with similar issues (although it's possible)

Possible solutions

Remove babel preprocessing #199660 (closed) - this should effectively solve the issue for nodejs-scan, the primary scanner that experiences this issue. A stop-gap is the implementation of #33065 (closed) which has a much lower weight and Level-of-effort.
Allow syntax errors - If we initiate a partial scan, that's essentially reporting incomplete data. We should avoid false negatives as much as possible. This is not viable, IMO

Edited Jul 17, 2020 by Thomas Woodham