Post-process leaked credentials on all branches (!114711) · Merge requests · GitLab.org / GitLab

Ahmed Hemdan requested to merge post-process-leaked-credentials-on-all-branches into master Mar 15, 2023

What does this MR do and why?

This merge request:

Updates security_finding_data JSON schema to include raw_source_code_extract.
Updates Security:: StoreFindingsService to persist raw_source_code_extract.
Updates Security::StoreScansService to:
- Schedule ScanSecurityReportSecretsWorker separately from StoreSecurityReportsWorker.
- Use security_scans instead of security_findings to determine if token revocation should take place.
Updates ScanSecurityReportSecretsWorker to depend on security_findings instead of vulnerability_findings.
Updates Security Report Ingestion Overview document to reflect the updated process.
And finally, updates all related specs to reflect the changes above.

Please keep in mind the following points while reviewing:

ScanSecurityReportSecretsWorker is moved out of StoreSecurityReportsWorker to make sure they run separately:
- I opted for this to:
  - Ensure report ingestion will keep running solely for the default branch.
  - While token revocation will run for all pipelines.
- I decided to not have them both run for all pipelines, as the explanation on how vulnerabilities are created from security reports implied that ingestion report had to happen on the default branch, and I didn't want to mess up with that process as I'm still figuring my way around this part of the codebase.
Additionally, vulnerability_findings is completely replaced with security_findings in ScanSecurityReportSecretsWorker because the latter exist for all pipelines, and not just the default branch as is the case with the former. I'm not entirely sure if this will have side-effects later on though, but I tried to minimize them by using each associated vulnerability_finding of a security_finding while collecting revocable_keys.
I had to remove a lot of spec examples in which private methods were being tested, as this is often considered a bad practice. Instead, I tried to write spec examples in such a way that tests the behaviour of the service classes or workers without calling private methods directly.
~~I'm not sure what happens for security_findings that didn't have raw_source_code_extract persisted before, do we have to backfill them? 🤔 cc/ @theoretick wdyt?~~

Resolves #299212 (closed).

Database query plans + SQL

This merge request updates ScanSecurityReportSecretsWorker to:

Use each_batch to iterate over security_findings in batches (see this line).

And Security::StoreScansService to:

To use security_scans instead of security_findings to deteremine if token revocation should take place (see this line).

In doing so, the following SQL queries are introduced/updated, respectively:

Query 1

Raw SQL

-- initial query to get the first id
SELECT
    "security_findings"."id"
FROM
    "security_findings"
    INNER JOIN "security_scans" ON "security_findings"."scan_id" = "security_scans"."id"
    INNER JOIN "security_scans" "scans_security_findings" ON "scans_security_findings"."id" = "security_findings"."scan_id"
WHERE
    "security_scans"."pipeline_id" = 810828014
    AND "security_scans"."scan_type" = 5
ORDER BY
    "security_findings"."id" ASC
LIMIT 1

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/16909/commands/57439

--- second query to get maximum findings by offset
SELECT
    "security_findings"."id"
FROM
    "security_findings"
    INNER JOIN "security_scans" ON "security_findings"."scan_id" = "security_scans"."id"
    INNER JOIN "security_scans" "scans_security_findings" ON "scans_security_findings"."id" = "security_findings"."scan_id"
WHERE
    "security_scans"."pipeline_id" = 810828014
    AND "security_scans"."scan_type" = 5
    AND "security_findings"."id" >= 1
ORDER BY
    "security_findings"."id" ASC
LIMIT 1 OFFSET 100

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/16909/commands/57440

-- actual query
SELECT
    "security_findings".*
FROM
    "security_findings"
    INNER JOIN "security_scans" ON "security_findings"."scan_id" = "security_scans"."id"
    INNER JOIN "security_scans" "scans_security_findings" ON "scans_security_findings"."id" = "security_findings"."scan_id"
WHERE
    "security_scans"."pipeline_id" = 810828014
    AND "security_scans"."scan_type" = 5
    AND "security_findings"."id" >= 1

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/16909/commands/57441

Note: a random master pipeline for which the secret_detection job had 170 leaks reported was used.

Query 2

Raw SQL

SELECT
    1 AS one
FROM
    "security_scans"
WHERE
    "security_scans"."pipeline_id" = 813814339
    AND "security_scans"."scan_type" = 5
LIMIT 1

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/16880/commands/57331

Please also see:

These two threads on why each_batch is introduced: 1, 2.
This thread on preloading vulnerability and vulnerability_findings.

Screenshots or screen recordings

Please see the screen recording below to see this in action.

post-process-leaked-credentials-on-all-pipelines-comp

How to set up and validate locally

To validate locally, please follow the steps below.

In your local GDK setup, create a new project if you don't have one already.
Add a .gitlab-ci.yml file to the project, which includes the secret-detection template, like the example below.

image: ruby:latest

include:
  - template: Jobs/Secret-Detection.gitlab-ci.yml

From Settings > Access Tokens, create a new project access token.
Create a new file with .rb extension in the repository. Let's call it gitlab_token.rb, for example.
Paste the project access token you created inside the file, as shown below.

  gitlab_token = "PASTE_PROJECT_ACCESS_TOKEN_HERE"

Save the file, and create a new branch, then commit the file to the branch.
Push the branch to your repository, and wait until the pipeline finishes running.
When the secret_detection job is finished, the access token should be revoked already.
Visit Settings > Access Tokens again to verify the token is no longer displayed under active tokens. 🎉

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

I have evaluated the MR acceptance checklist for this MR.

Edited Mar 22, 2023 by Ahmed Hemdan

Post-process leaked credentials on all branches

What does this MR do and why?

Database query plans + SQL

Screenshots or screen recordings

How to set up and validate locally

MR acceptance checklist

Merge request reports