Debugging report parsing and ingestion errors

Introduction

The analyzer can produce report parsing or ingestion errors in several situations. This guide walks through example tickets and issues (open and closed) where such bugs were encountered, and shows how to debug them and prepare a workaround that unblocks the customer before a bug fix is merged.

Reproducing:

Things you need:

To reproduce and understand the problem, gather the following items:

If the issue presents itself on a self-managed instance, obtain the version of GitLab the customer is running. Otherwise, you can reproduce on GitLab.com.
The .gitlab-ci.yml file from the customer. This should contain the configuration used for the scan, for example the Secure CI/CD variables and the image tag/version if the customer has pinned the analyzer to a specific version.
A screenshot of the security tab of the pipeline showing the error they are facing. For example, a SAST job might present the error: [Schema] property '/vulnerabilities/x/name' is invalid: error_type=maxLength, where x is the index of the offending item in the vulnerabilities section of the report, i.e. vulnerabilities[x].name. Because vulnerabilities is an array, the index starts at 0, so x = 0 refers to the first vulnerability.

A copy of the job log with SECURE_LOG_LEVEL: debug set. The variable is not strictly needed, because the job log shows the image:tag in use and the analyzer version even without it, but verbose logs are useful to have. For example:

...
...
Preparing the "docker" executor
00:04
Using Docker executor with image registry.gitlab.com/security-products/secrets:5 ...
Pulling docker image registry.gitlab.com/security-products/secrets:5 ...
Using docker image sha256:c3d021a0c765e9178360ae4811b6270b02c5cccdc05cf77ed13ef1e4e2119541 for registry.gitlab.com/security-products/secrets:5 with digest registry.gitlab.com/security-products/secrets@sha256:13ab9d49bbe76f08a324465f858a088ddbcaadf4fc6cbd1d7ae43829b826baa0
...
...
Using docker image sha256:c3d021a0c765e9178360ae4811b6270b02c5cccdc05cf77ed13ef1e4e2119541 for registry.gitlab.com/security-products/secrets:5 with digest registry.gitlab.com/security-products/secrets@sha256:13ab9d49bbe76f08a324465f858a088ddbcaadf4fc6cbd1d7ae43829b826baa0 ...
$ /analyzer run
[INFO] [secrets] [2024-05-13T07:19:03Z] [/go/src/app/main.go:24] ▶ GitLab secrets analyzer v5.2.7
...
...

The variable can be set in .gitlab-ci.yml like this:

include:
  - template: Jobs/SAST.gitlab-ci.yml

variables:
  SECURE_LOG_LEVEL: "debug"

The JSON report generated by the scan. It can be downloaded from the security tab of the pipeline page, or as an artifact of the specific pipeline on the pipelines list page. The report tells you which schema version is in use; it appears at the top of the file: "version": "15.0.7". For example:

$ jq . gl-sast-report.json | less
{
  "version": "15.0.7",
  "vulnerabilities": [
    {
      "id": "8110a4b9546eb128d6242e974b2efdb558632bd216c902cedde0e62a12cf3fb5",
      "category": "sast",
      "name": "Improper certificate validation",
...

The schema that the report is validated against. For self-managed instances, the schema ships with the instance. For GitLab.com, the schemas are available at https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/master/dist/, for example https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/v15.0.7/dist/sast-report-format.json?ref_type=tags. You can use the GitLab.com schemas to work on self-managed tickets as long as you pick the correct version. To see how GitLab versions correlate with schema versions, check schema_validator.rb at the matching tag, for example: https://gitlab.com/gitlab-org/gitlab/-/blob/v16.6.0-ee/lib/gitlab/ci/parsers/security/validators/schema_validator.rb?ref_type=tags
Logs / GitLab SOS. Some errors are not meaningful in the UI, especially scan ingestion errors (as opposed to report parsing errors), but the details may appear in the logs. For example: gitlab-org/gitlab#440853 (closed)
jq installed on the machine where you reproduce the issue: https://jqlang.github.io/jq/
(Optional) The script for validating a report against the schema: https://docs.gitlab.com/ee/development/integrations/secure.html#validate-locally. It displays the same error you observe in the UI. For example:

$ ruby validator.rb sast-report-format.json gl-sast-report.json 
property '/vulnerabilities/1563/name' is invalid: error_type=maxLength
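
The path in a message like the one above can be resolved against the report programmatically. The following is a minimal Ruby sketch, not part of the linked validator script: the helper names `pointer_from_error` and `resolve_pointer` are hypothetical, the inline report is a tiny stand-in for a customer report, and JSON Pointer escape sequences are ignored for simplicity.

```ruby
require "json"

# Extracts the path (e.g. "/vulnerabilities/1/name") from a validator message
# such as: property '/vulnerabilities/1/name' is invalid: error_type=maxLength
def pointer_from_error(message)
  match = message.match(/property '([^']+)'/)
  match && match[1]
end

# Walks the parsed report along the path. Numeric segments under an array are
# zero-based indexes, which is why /vulnerabilities/1 is the second item.
def resolve_pointer(document, pointer)
  pointer.split("/").drop(1).reduce(document) do |node, segment|
    case node
    when Array then node[Integer(segment)]
    when Hash  then node[segment]
    end
  end
end

# Stand-in for a real report; a real one could be loaded with
# JSON.parse(File.read("gl-sast-report.json")).
report = JSON.parse(<<~JSON)
  {
    "version": "15.0.7",
    "vulnerabilities": [
      { "name": "first finding" },
      { "name": "second finding" }
    ]
  }
JSON

message = "property '/vulnerabilities/1/name' is invalid: error_type=maxLength"
pointer = pointer_from_error(message)
value = resolve_pointer(report, pointer)
puts "#{pointer} -> #{value.inspect} (#{value.length} characters)"
```

Counting with `String#length` measures only the name itself, without the quotes that jq prints around string values.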

Steps to reproduce:

Once you have the items above, you are ready to reproduce the bug the customer is observing. At a minimum, make sure you have:

A copy of the gl-*-report.json file (gl-sast-report.json, gl-secret-detection-report.json, gl-dependency-scanning-report.json, etc.).
The GitLab version, if the problem is observed on a self-managed instance.
The analyzer version, which you can get from the job logs.
You then have two options: reproduce in the UI, or reproduce locally with the script provided in the documentation.
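
If the customer sends the job log as a file, the analyzer image and version can be extracted with a couple of regular expressions. This is a rough Ruby sketch: the helper name `analyzer_details` is hypothetical, and the two patterns are based only on the log lines quoted in this guide, so other analyzers or runner versions may print these lines differently.

```ruby
# Pulls the analyzer image reference and the analyzer version out of job log text.
# Returns nil for any value whose log line is not present.
def analyzer_details(log_text)
  image = log_text[/Using Docker executor with image (\S+)/, 1]
  banner = log_text.match(/GitLab (?<name>.+?) analyzer v(?<version>\d+\.\d+\.\d+)/)
  {
    image: image,
    analyzer: banner && banner[:name],
    version: banner && banner[:version]
  }
end

# Excerpt of the job log shown earlier in this guide.
sample_log = <<~LOG
  Using Docker executor with image registry.gitlab.com/security-products/secrets:5 ...
  $ /analyzer run
  [INFO] [secrets] [2024-05-13T07:19:03Z] [/go/src/app/main.go:24] ▶ GitLab secrets analyzer v5.2.7
LOG

p analyzer_details(sample_log)
```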

Reproducing in the UI

If it's a self-managed instance, deploy a test instance matching the customer's GitLab version.
Create a dummy security scan project. The project will contain the gl-*-report.json file obtained from the customer and a simple .gitlab-ci.yml file that uploads the report as an artifact in the job. Example:

sast-dummy-job:
  script:
    - echo "SAST Dummy Job"
  artifacts:
    reports:
      sast: gl-sast-report.json

You can also reproduce with the whole project if the customer is willing to share their project or a specific file. In this case, run a regular scan, and make sure you use the same analyzer version as the customer by pinning it in your .gitlab-ci.yml file:

include:
  - template: Jobs/SAST.gitlab-ci.yml

variables:
  SECURE_LOG_LEVEL: 'debug'
  
semgrep-sast:
  variables:
    SAST_ANALYZER_IMAGE_TAG: "4.18.1"

Once the pipeline has run, check the pipeline's security tab and confirm you can observe the same error.
Assuming you receive the message: property '/vulnerabilities/1563/name' is invalid: error_type=maxLength, you can use jq to double-check the name of the vulnerability at index 1563 (the 1564th vulnerability). This works for messages that are descriptive in the UI, that is, messages that actually mention an offending field in the report.

$ jq '.vulnerabilities[1563].name' gl-sast-report.json | wc -m
299

Check the correct schema version for the expected maxLength: https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/v15.0.7/dist/sast-report-format.json?ref_type=tags#L655. A vulnerability's name must be a string of at most 255 characters, whereas wc -m reports 299 characters for the offending name. (That count includes the surrounding double quotes and the trailing newline printed by jq, so the name itself is slightly shorter, but still well over the limit.)
Check whether an issue already exists, or create one, and prepare a workaround if possible to unblock the customer (examples in the tickets section).
If the message in the UI is generic, you will probably find more information in the logs/GitLab SOS, as in this issue: gitlab-org/gitlab#440853 (closed). In such cases, follow steps 1-3 and then run GitLab SOS to look for an exception in the logs. In this particular case, the error was explained by the combination of what the schema allows in the report and the constraints set on a particular column of the vulnerability_occurrences table in the database.
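
The UI and the validator may not make it obvious how many entries are affected, so it can help to list every vulnerability whose name is too long in one pass. Below is a small Ruby sketch of that check. The helper name `overlong_names` is hypothetical, and the limit of 255 is the maxLength this guide cites for the 15.0.7 SAST schema; confirm it against the schema version the customer's report uses.

```ruby
require "json"

# maxLength for vulnerability names in the 15.0.7 SAST schema, as cited in this guide.
MAX_NAME_LENGTH = 255

# Returns [index, length] pairs for every vulnerability whose name is too long.
# Indexes are zero-based, matching the x in /vulnerabilities/x/name.
def overlong_names(report, limit = MAX_NAME_LENGTH)
  report.fetch("vulnerabilities", []).each_with_index.filter_map do |vulnerability, index|
    name = vulnerability["name"].to_s
    [index, name.length] if name.length > limit
  end
end

# Stand-in data; a real report could be loaded with
# JSON.parse(File.read("gl-sast-report.json")).
report = {
  "version" => "15.0.7",
  "vulnerabilities" => [
    { "name" => "Improper certificate validation" },
    { "name" => "x" * 296 }
  ]
}

overlong_names(report).each do |index, length|
  puts "vulnerabilities[#{index}].name has #{length} characters (limit #{MAX_NAME_LENGTH})"
end
```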

Reproducing using the schema validation script

Obtain the gl-*-report.json file from the customer.
Check which version of the report has been produced.
Obtain the same schema from GitLab.com and save it locally. For example, if the customer's report has the version at 15.0.7 for a semgrep-sast job, you will use this file: https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/v15.0.7/dist/sast-report-format.json
Copy the script into a file validator.rb: https://docs.gitlab.com/ee/development/integrations/secure.html#validate-locally. The command to run is ruby validator.rb <schema_file.json> <report_from_customer.json>
Run the following:

$ ruby validator.rb sast-report-format.json gl-sast-report.json

You will get the same message as the customer did in the UI:

property '/vulnerabilities/1563/name' is invalid: error_type=maxLength

Follow steps 4-7 from the previous section.

NOTE: For scan ingestion errors, you might not get any output from this script, and the error will be present in the GitLab logs instead. This means that schema validation has passed and the problem lies deeper, e.g. in a database column constraint.
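
Picking the schema that matches the report can be scripted by reading the version field and building the schema URL from it. This is a minimal Ruby sketch: the helper name `schema_url` is hypothetical, the URL layout is simply the one used by the links in this guide, and the schema file name defaults to the SAST one, so pass a different file name for other report types.

```ruby
require "json"

SCHEMA_REPO = "https://gitlab.com/gitlab-org/security-products/security-report-schemas"

# Builds the URL of the schema file that matches the report's "version" field.
# Raises KeyError when the report has no version, which is itself worth knowing.
def schema_url(report, schema_file = "sast-report-format.json")
  version = report.fetch("version")
  "#{SCHEMA_REPO}/-/blob/v#{version}/dist/#{schema_file}"
end

# Stand-in for JSON.parse(File.read("gl-sast-report.json")).
report = JSON.parse('{"version": "15.0.7", "vulnerabilities": []}')
puts schema_url(report)
```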

Sample tickets - Analysis, Debugging and Workarounds

1. SAST: [Schema] property '/vulnerabilities/x/name' is invalid: error_type=maxLength

Issue(s): gitlab-org/gitlab#443628 (closed)

Ticket(s):

Sample Project: https://gitlab.com/gitlab-com/support/test-projects/ci-examples/secure/security-reports-errors/phpcs-maxlength-error/-/pipelines/1294672902/security

Note: Since the 17.0 template disables the phpcs-security-audit-sast job and instead uses semgrep for PHP files, this project uses a 16.11 template and disables semgrep to mimic the previous behavior. Customers on 17.0 onward won't face this bug if they are running semgrep.

REPRODUCTION STEPS:

Error in UI:

Error using script:

$ ruby validator.rb sast-report-format.json gl-sast-report.json 
property '/vulnerabilities/0/name' is invalid: error_type=maxLength

GitLab version 16.11.1

include:
  - remote: https://gitlab.com/gitlab-org/gitlab/-/raw/v16.11.1-ee/lib/gitlab/ci/templates/Jobs/SAST.gitlab-ci.yml

Analyzer Version:

$ /analyzer run
[INFO] [phpcs-security-audit v2] [2024-05-17T08:51:35Z] [/go/pkg/mod/gitlab.com/gitlab-org/security-products/analyzers/command/v2@v2.2.0/command.go:76] ▶ GitLab phpcs-security-audit v2 analyzer v4.1.6

Report Version:

$ cat gl-sast-report.json | head -n5
{
  "version": "15.0.7",
  "vulnerabilities": [
    {
      "id": "b41bb6c1e4e5f5e56a1e97f75a4d32d9a842ec7a3d5d26aca1f9cff7bff129bf",

Vulnerability name in schema allows for a maxLength of 255 characters: https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/v15.0.7/dist/sast-report-format.json#L655
jq check to confirm the report is not adhering to the schema:

$ jq '.vulnerabilities[0].name' gl-sast-report.json | wc -m
358

WORKAROUND: Truncate any vulnerability names that are longer than 255 characters

The workaround runs in an after_script section once the report has been generated. It downloads the jq binary in the job and truncates the vulnerability names. This particular image runs as the php user, so it was not possible to use a package manager, because that requires sudo privileges.
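
The truncation itself is simple. The sketch below shows the idea in Ruby rather than jq, so it is not the script used in the ticket, and it assumes a Ruby interpreter is available wherever it runs. The helper name `truncate_names` is hypothetical.

```ruby
require "json"

# Limit taken from the schema's maxLength for vulnerability names (255).
NAME_LIMIT = 255

# Returns a copy of the report with every over-long vulnerability name cut to
# the limit. Other fields and short names are left untouched.
def truncate_names(report, limit = NAME_LIMIT)
  vulnerabilities = report.fetch("vulnerabilities", []).map do |vulnerability|
    name = vulnerability["name"]
    next vulnerability unless name.is_a?(String) && name.length > limit

    vulnerability.merge("name" => name[0, limit])
  end
  report.merge("vulnerabilities" => vulnerabilities)
end

# Against a real report file this could be applied as:
#   report = JSON.parse(File.read("gl-sast-report.json"))
#   File.write("gl-sast-report.json", JSON.generate(truncate_names(report)))
sample = { "version" => "15.0.7", "vulnerabilities" => [{ "name" => "n" * 355 }] }
puts truncate_names(sample)["vulnerabilities"][0]["name"].length
```

Truncating changes the finding's name as shown in the vulnerability report, so treat it as a temporary measure until the customer can use a fixed analyzer.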

2. External Analyzer: Error parsing security report - IngestionError Ingestion failed for some vulnerabilities

Issue(s): Feature proposal: truncate description (and solution) fields during vulnerability ingestion to avoid "Validation failed: Description is too long (maximum is 15000 characters)" // custom security scanner - fixed in 16.10

REPRODUCTION STEPS:

The customer was using an external tool (Checkmarx SAST) to run a scan and generate a report that adheres to the schema: https://docs.gitlab.com/ee/development/integrations/secure.html#validate-locally

The only way to reproduce was to get a copy of the gl-sast-report.json file and create a dummy job to reproduce the error in the UI in GitLab versions earlier than 16.10.

Sample Project: https://gitlab.com/gitlab-com/support/test-projects/ci-examples/secure/security-reports-errors/external-tool-sast

Ticket(s):

https://gitlab.zendesk.com/agent/tickets/493712

REPRODUCTION STEPS:

Error in UI:

Error using script:

No output using script. Notice the error is not clear on which report fields are problematic.

GitLab version < 16.10
Analyzer Version: Not needed since this was an external tool
Report Version:

$ cat gl-sast-report.json | head -n5
{
  "version": "15.0.7",
  "vulnerabilities": [
    {
      "id": "0d4d2803bbd6d39ee9ea058582ed217e5d8d59553b72e51b5d37d5a2b632c071",

Logs: You will get an exception in the logs if you run the pipeline in the affected GitLab Version.

"exception.message":"Validation failed: Description is too long (maximum is 15000 characters)"

jq to check which vulnerability description is greater than 15,000 characters:

$ jq '[.vulnerabilities[].description] | to_entries | map(select(.value | length > 15000))' gl-sast-report.json
[
  {
    "key": 0,
    "value": "gdVWB9X4jlEWwr3cfWL6OYhyySWggS6Lv6LcaiKRns9hOOc6DTNXKSx6pQ

Description field in schema: https://gitlab.com/gitlab-org/security-products/security-report-schemas/-/blob/v15.0.7/dist/sast-report-format.json?ref_type=tags#L660 (allows for a description field to be a string of ~1 million characters)
Database column constraint of 15,000 characters:

\d+ vulnerability_occurrences
...
Check constraints:
    "check_4a3a60f2ba" CHECK (char_length(solution) <= 7000)
    "check_ade261da6b" CHECK (char_length(description) <= 15000)
    "check_f602da68dd" CHECK (char_length(cve) <= 48400)

The schema allows for a larger value than the DB column.

WORKAROUND: Truncate all description fields to 15,000 characters

The workaround was done in Ruby in an after_script section of the job.
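
The ticket's script is not reproduced here. As an illustration only, a Ruby sketch along these lines would cover the description field, and also the solution field, which has its own check constraint in the table above. The helper name `truncate_to_db_limits!` is hypothetical, and the limits are copied from the constraints shown in this guide, so verify them against the customer's GitLab version.

```ruby
require "json"

# Limits taken from the vulnerability_occurrences check constraints shown above.
FIELD_LIMITS = { "description" => 15_000, "solution" => 7_000 }.freeze

# Truncates over-long fields in place and returns a list of
# [vulnerability index, field, original length] for everything it changed,
# which is handy to print in the job log.
def truncate_to_db_limits!(report, limits = FIELD_LIMITS)
  changes = []
  report.fetch("vulnerabilities", []).each_with_index do |vulnerability, index|
    limits.each do |field, limit|
      value = vulnerability[field]
      next unless value.is_a?(String) && value.length > limit

      changes << [index, field, value.length]
      vulnerability[field] = value[0, limit]
    end
  end
  changes
end

# Stand-in data; in a job this would be the parsed gl-sast-report.json,
# written back with File.write after truncation.
report = { "vulnerabilities" => [{ "description" => "d" * 20_000, "solution" => "upgrade" }] }
truncate_to_db_limits!(report).each do |index, field, length|
  puts "truncated vulnerabilities[#{index}].#{field} (was #{length} characters)"
end
```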

DAST / DAST API

1. (Error) [ScanIngestionError] Ingestion failed for security scan | (Warning) [Parsing] Report artifact contained unicode null characters which are escaped during the ingestion.

Issue(s):

Ticket(s):

https://gitlab.zendesk.com/agent/tickets/503170
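
When triaging this warning, it may help to locate which report fields contain the null characters. The following Ruby sketch does only that; it is not taken from the ticket and does not by itself resolve the ingestion error. The helper name `null_character_paths` is hypothetical.

```ruby
require "json"

# Recursively collects the paths of all string values that contain U+0000.
def null_character_paths(node, path = "")
  case node
  when Hash
    node.flat_map { |key, value| null_character_paths(value, "#{path}/#{key}") }
  when Array
    node.each_with_index.flat_map { |value, index| null_character_paths(value, "#{path}/#{index}") }
  when String
    node.include?("\u0000") ? [path] : []
  else
    []
  end
end

# Stand-in data; a real report would be parsed from the downloaded artifact.
report = {
  "vulnerabilities" => [
    { "name" => "clean entry" },
    { "description" => "payload\u0000with null" }
  ]
}
puts null_character_paths(report)
```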

Container scanning

1. [Schema] - property '/vulnerabilities/x/location/image' does not match pattern: ^[^:]+(:\d+[^:]*)?:[^:]+$

Issue(s):

Ticket(s):

https://gitlab.zendesk.com/agent/tickets/312435
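
To see which image strings the pattern in this error accepts, it can be tried against a few values. This is a small Ruby sketch: the helper name `valid_image_location?` is hypothetical and the sample image names are made up for illustration. Note that the pattern requires at least one colon followed by a non-empty tag-like part.

```ruby
# Pattern copied from the error message. \A and \z replace ^ and $ so that the
# whole string must match (in Ruby, ^ and $ also match at line breaks).
IMAGE_PATTERN = /\A[^:]+(:\d+[^:]*)?:[^:]+\z/

def valid_image_location?(image)
  IMAGE_PATTERN.match?(image)
end

# Made-up image names for illustration.
samples = [
  "registry.example.com/group/app:1.2.3",      # image with a tag
  "registry.example.com:5000/group/app:1.2.3", # registry port plus tag
  "registry.example.com/group/app",            # no tag
  "registry.example.com/group/app:"            # empty tag
]

samples.each do |image|
  verdict = valid_image_location?(image) ? "matches" : "does not match"
  puts format("%-45s %s", image, verdict)
end
```

Running the same check with `jq '.vulnerabilities[x].location.image'` output from the customer's report shows quickly whether a missing tag or an extra colon is the cause.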

Edited May 17, 2024 by Christopher Mutua