Allow group.vulnerabilities GraphQL API with owasp_top_10 filters to use unnest IN query optimization
When expanding the Non - Owasp top 10 group by category on large groups, we observe that the underlying DB query from the GraphQL API is timing out.
Request Payload:
{"operationName":"groupVulnerabilities","variables":{"first":100,"vetEnabled":true,"fullPath":"gitlab-org","sort":"severity_desc","last":null,"state":["DETECTED","CONFIRMED"],"dismissalReason":[],"hasResolution":false,"reportType":["API_FUZZING","CONTAINER_SCANNING","COVERAGE_FUZZING","DAST","DEPENDENCY_SCANNING","SAST","SECRET_DETECTION","GENERIC"],"owaspTopTen":["NONE"]},"query":"query groupVulnerabilities($fullPath: ID!, $before: String, $after: String, $first: Int = 20, $last: Int, $projectId: [ID!], $severity: [VulnerabilitySeverity!], $reportType: [VulnerabilityReportType!], $scanner: [String!], $scannerId: [VulnerabilitiesScannerID!], $state: [VulnerabilityState!], $dismissalReason: [VulnerabilityDismissalReason!], $sort: VulnerabilitySort, $hasIssues: Boolean, $hasResolution: Boolean, $hasMergeRequest: Boolean, $hasRemediations: Boolean, $vetEnabled: Boolean = false, $clusterAgentId: [ClustersAgentID!], $owaspTopTen: [VulnerabilityOwaspTop10!]) {\n group(fullPath: $fullPath) {\n id\n vulnerabilities(\n before: $before\n after: $after\n first: $first\n last: $last\n severity: $severity\n reportType: $reportType\n scanner: $scanner\n scannerId: $scannerId\n state: $state\n dismissalReason: $dismissalReason\n projectId: $projectId\n sort: $sort\n hasIssues: $hasIssues\n hasResolution: $hasResolution\n hasMergeRequest: $hasMergeRequest\n hasRemediations: $hasRemediations\n clusterAgentId: $clusterAgentId\n owaspTopTen: $owaspTopTen\n ) {\n nodes {\n ...VulnerabilityFragment\n __typename\n }\n pageInfo {\n ...PageInfo\n __typename\n }\n __typename\n }\n __typename\n }\n}\n\nfragment VulnerabilityFragment on Vulnerability {\n id\n title\n state\n severity\n detectedAt\n dismissalReason\n vulnerabilityPath\n resolvedOnDefaultBranch\n userNotesCount\n falsePositive @include(if: $vetEnabled)\n hasRemediations\n issueLinks {\n nodes {\n id\n issue {\n id\n iid\n webUrl\n webPath\n title\n state\n __typename\n }\n __typename\n }\n __typename\n }\n mergeRequest {\n id\n webUrl\n state\n iid\n __typename\n }\n identifiers {\n externalType\n name\n __typename\n }\n location {\n ...VulnerabilityLocation\n __typename\n }\n project {\n id\n nameWithNamespace\n __typename\n }\n reportType\n scanner {\n id\n vendor\n __typename\n }\n __typename\n}\n\nfragment VulnerabilityLocation on VulnerabilityLocation {\n ... on VulnerabilityLocationClusterImageScanning {\n image\n kubernetesResource {\n agent {\n id\n name\n webPath\n __typename\n }\n __typename\n }\n __typename\n }\n ... on VulnerabilityLocationContainerScanning {\n image\n __typename\n }\n ... on VulnerabilityLocationDependencyScanning {\n blobPath\n file\n __typename\n }\n ... on VulnerabilityLocationSast {\n blobPath\n file\n startLine\n __typename\n }\n ... on VulnerabilityLocationSecretDetection {\n blobPath\n file\n startLine\n __typename\n }\n ... on VulnerabilityLocationDast {\n path\n __typename\n }\n __typename\n}\n\nfragment PageInfo on PageInfo {\n hasNextPage\n hasPreviousPage\n startCursor\n endCursor\n __typename\n}\n"}
Response:
{
"errors": [
{
"message": "Request timed out. Please try a less complex query or a smaller set of records."
}
]
}
As observed in the explain plans, this is happening even though we have an index because the in filter unnest optimisations are not being applied to the final query.
Observations
For owasp_top_10 IS NULL condition
Without unnest optimization: https://explain.depesz.com/s/63WG#stats https://explain.depesz.com/s/BZa2#stats
With unnest optimization applied: https://explain.depesz.com/s/yxtL#stats
For owasp_top_10 = ? condition
Without unnest optimization: https://explain.depesz.com/s/tlSv#stats
With unnest optimization: https://explain.depesz.com/s/uM03#stats
Above results conclude that not only for owasp_top_10 IS NULL, we can apply the unnest optimization for with values as well owasp_top_10 = ?.
Proposal
Current query:
SELECT
"vulnerability_reads".*
FROM
"vulnerability_reads"
WHERE
(traversal_ids >= '{9970}')
AND (traversal_ids < '{9971}')
AND "vulnerability_reads"."archived" = FALSE
AND "vulnerability_reads"."report_type" IN (6, 2, 5, 3, 1, 0, 4, 99)
AND "vulnerability_reads"."severity" IN (1, 2, 4, 5, 6, 7)
AND "vulnerability_reads"."state" IN (1, 4)
AND "vulnerability_reads"."resolved_on_default_branch" = FALSE
AND "vulnerability_reads"."owasp_top_10" IS NULL
ORDER BY
"vulnerability_reads"."severity" DESC,
"vulnerability_reads"."traversal_ids" DESC,
"vulnerability_reads"."vulnerability_id" DESC
LIMIT
101
New query:
SELECT
"vulnerability_reads".*
FROM
"vulnerability_reads"
WHERE
"vulnerability_reads"."vulnerability_id" IN (
SELECT
"vulnerability_reads"."vulnerability_id"
FROM
unnest('{6,2,5,3,1,0,4,99}' ::smallint []) AS "report_types"("report_type"),
unnest('{1,2,4,5,6,7}' ::smallint []) AS "severities"("severity"),
unnest('{1,4}' ::smallint []) AS "states"("state"),
LATERAL (
SELECT
"vulnerability_reads"."archived",
"vulnerability_reads"."report_type",
"vulnerability_reads"."severity",
"vulnerability_reads"."state",
"vulnerability_reads"."resolved_on_default_branch",
"vulnerability_reads"."owasp_top_10",
"vulnerability_reads"."traversal_ids",
"vulnerability_reads"."vulnerability_id"
FROM
"vulnerability_reads"
WHERE
(traversal_ids >= '{9970}')
AND (traversal_ids < '{9971}')
AND "vulnerability_reads"."archived" = FALSE
AND "vulnerability_reads"."resolved_on_default_branch" = FALSE
AND "vulnerability_reads"."owasp_top_10" IS NULL
AND (
vulnerability_reads."report_type" = "report_types"."report_type"
)
AND (
vulnerability_reads."severity" = "severities"."severity"
)
AND (vulnerability_reads."state" = "states"."state")
ORDER BY
"vulnerability_reads"."severity" DESC,
"vulnerability_reads"."traversal_ids" DESC,
"vulnerability_reads"."vulnerability_id" DESC
LIMIT
101
) AS vulnerability_reads
ORDER BY
"vulnerability_reads"."severity" DESC,
"vulnerability_reads"."traversal_ids" DESC,
"vulnerability_reads"."vulnerability_id" DESC
LIMIT
101
)
ORDER BY
"vulnerability_reads"."severity" DESC,
"vulnerability_reads"."traversal_ids" DESC,
"vulnerability_reads"."vulnerability_id" DESC
LIMIT
101
Current behaviour for owasp_top_10 IS NULL
-
Without unnest: Query is timing out as it is not using the available index. See: https://explain.depesz.com/s/BZa2#stats
-
With unnest: Query is using another index and although it does not timeout, it is still problematic as it is not making use of the owasp_top_10 group level index and because of the reason listed here it is still a concern. To force use the existing owasp_top_10 group level index we have to change
nullto-1or useorder by owasp_top_10in the query as summarized here. Changingnullto-1is the more straight forward approach. See for query plan: https://explain.depesz.com/s/yxtL#stats
Solution identified:
- Change
nullto-1. See: https://explain.depesz.com/s/L9GF#source - Use unnest optimisations along with the change to -1. See: https://explain.depesz.com/s/I6AB#source
Implementation plan:
| Milestone | MR's planned | Status |
|---|---|---|
| %17.3 | 1. Turn off the null FF for gitlab.com and change default value of column owasp_top_10 from null to -1 with a migration. 2. Change vulnerability_reads_finder to use -1 instead of null. |
1. !161823 (merged) workflowcomplete 2. !161890 (merged) workflowcomplete |
| %17.4 | 3. BBM migration to backfill null values, once this migration completes, turn on FF on gitlab.com (After this step, the timeouts should stop but the query would still take a bit longer to load for Non OWASP category on gitlab.com/gitlab-org). 4. Backport the above BBM to %17.3 as it is the last required stop. Subjected to confirmation in slack discussion. 5. Modify the existing index to use unnest optimizations to further improve performance that is acceptable within guidelines. ( resolved_on_default_branch should be in front of the severity, traversal_ids, and vulnerability_id columns) 6. Enable the FF on gitlab.com and schedule FF removal in %17.6 (as finalization can only be done after a required stop which is 17.5) |
3. !161962 (merged) workflowcomplete 4. Backport option is ruled out. 5. !164252 (merged) workflowcomplete 6. #473748 (comment 2098930032) workflowverification |
| %17.6 | 7. Finalize the BBM introduced in MR3/MR4. Subjected to slack discussion (if backport is agreed, we can move this and below FF removal step to %17.4) 8. Remove FF. |