Improve performance of package license query to re-enable SBOM based license scanning

Summary

The query to fetch package licenses is exhausts resources. From March 15 this query is running on replica nodes, but this is not enough to make this part of the code production ready.

We need to investigate and improve the performance of Gitlab::LicenseCompliance::SbomScanner#report.

Right now this issue is preventing us from enabling feature on production.

Further information

Please check the checkup report on replicas. You can see that the issue is coming from Ci::SyncReportsToReportApprovalRulesWorker. Most probably, because users interact the most with MR page and this sync approvals worker is triggered even if the change isn't related to the license compliance feature. The particular method is triggered in the sync service and then it is executed in the report method.

Logs

Implementation Plan

  1. Replace the single query in Gitlab::LicenseScanning::PackageLicenses#fetch with 2 faster queries to optimize the index scan: :

    1. The first query will use a common table expression (CTE) to get the list of pm_packages.id and version values that correspond to the given components:

      WITH needed_package_versions ( purl_type, name, version ) AS (
          VALUES
              ( 4, 'beego', 'v1.10.0' ),
              ( 6, 'camelcase', '1.2.1' ),
              ( 6, 'camelcase', '4.1.0' ),
              ( 6, 'cliui', '2.1.0' ),
              ( 4, 'cliui', '2.1.0' )
      )
      SELECT DISTINCT
          pm_packages.id,
          needed_package_versions.version
      FROM
          needed_package_versions
          JOIN pm_packages ON pm_packages.purl_type = needed_package_versions.purl_type AND
          pm_packages.name = needed_package_versions.name;
    2. The second query will feed the above pm_packages.id and version values into a revised form of the original query, now filtering by pm_packages.id values instead of package names, so that the pm_package_versions index can be scanned in one pass:

      SELECT
          pm_packages.purl_type,
          pm_packages.name,
          pm_package_versions.version,
          array_agg( pm_licenses.spdx_identifier )
      FROM
          pm_packages
          JOIN pm_package_versions ON pm_package_versions.pm_package_id = pm_packages.id
          JOIN pm_package_version_licenses ON pm_package_version_licenses.pm_package_version_id = pm_package_versions.id
          JOIN pm_licenses ON pm_licenses.id = pm_package_version_licenses.pm_license_id
      WHERE
          (
              pm_package_versions.pm_package_id,
              pm_package_versions.version
          ) IN (
              ( 386, 'v1.10.0' ),
              ( 387, '1.2.1' ),
              ( 387, '4.1.0' ),
              ( 388, '2.1.0' ),
              ( 389, '2.1.0' )
          )
      GROUP BY
          pm_packages.purl_type,
          pm_packages.name,
          pm_package_versions.version;
  2. Fix any tests broken by the above changes.

  3. Verify performance in staging and production environments.

  4. Re-enable the license_scanning_sbom_scanner feature flag.

Workarounds

The recommended workaround is to re-include the license scanning template.

# .gitlab-ci.yml or equivalent
include:
  - template: Security/License-Comliance.gitlab-ci.yml
Edited by Adam Cohen