Improve delete_software_license_policies query
What does this MR do and why?
This MR improves the performance of delete query to include the scan_result_policy_reads
of the concerned project only to reduce the number of IDs filtered. This MR also changes the query to sub-query instead of IN
query.
This is a corrective action of Incident: gitlab-com/gl-infra/production#17168 (closed)
The security_orchestration_policy_configuration_id
that caused the slow query is scoped to a namespace that has 54706 project. And each project has only 1 rows of scan_result_policies
but since we do policy_configuration. scan_result_policy_reads.pluck(:id)
it selects 54701
IDs.
More context on the data: gitlab-com/gl-infra/production#17168 (comment 1663027970)
Database
Before
SELECT
"scan_result_policies"."id"
FROM
"scan_result_policies"
WHERE
"scan_result_policies"."security_orchestration_policy_configuration_id" = 1029627
SELECT
"software_license_policies".*
FROM
"software_license_policies"
WHERE
"software_license_policies"."project_id" = 24583235
AND "software_license_policies"."scan_result_policy_id" IN (10324607, 10326279, 10324491, 10326623, 10324468, 10326230, 10324484, 10326628, 10324495, 10326710, 10324552, 10326753, 10324563, 10326804, 10324572, 10326875, 10324586, 10326412, 10324744, 10326386, 10324627, 10326702, 10324675, 10326810, 10324550, 10327045, 10324597, 10326937, 10324913, 10327059, 10324493, 10326617, 10324669, 10327301, 10324615, 10326282, 10324490, 10326621, 10324472, 10326246, 10324469, 10326235, 10324467, 10326553, 10324588, 10326624, 10324473, 10326227, 10324482, 10326228, 10324537, 10327189, 10324505, 10326631, 10324576, 10327187, 10324585, 10326312, 10324489, 10326604, 10324511, 10326644)
After
SELECT
"software_license_policies".*
FROM
"software_license_policies"
WHERE
"software_license_policies"."project_id" = 43110842
AND "software_license_policies"."scan_result_policy_id" IN (
SELECT
"scan_result_policies"."id"
FROM
"scan_result_policies"
WHERE
"scan_result_policies"."security_orchestration_policy_configuration_id" = 4261
AND "scan_result_policies"."project_id" = 43110842
Query Plan : https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/24184/commands/77181
Nested Loop (cost=0.85..18.21 rows=1 width=40) (actual time=16.599..266.054 rows=456 loops=1)
Buffers: shared hit=149 read=309 dirtied=185
I/O Timings: read=257.544 write=0.000
-> Index Scan using index_scan_result_policies_on_position_in_configuration on public.scan_result_policies (cost=0.43..3.45 rows=1 width=8) (actual time=8.047..8.049 rows=1 loops=1)
Index Cond: ((scan_result_policies.security_orchestration_policy_configuration_id = 4261) AND (scan_result_policies.project_id = 43110842))
Buffers: shared hit=4 read=3
I/O Timings: read=7.941 write=0.000
-> Index Scan using idx_software_license_policies_unique_on_project_and_scan_policy on public.software_license_policies (cost=0.42..14.76 rows=1 width=40) (actual time=8.544..257.659 rows=456 loops=1)
Index Cond: ((software_license_policies.project_id = 43110842) AND (software_license_policies.scan_result_policy_id = scan_result_policies.id))
Buffers: shared hit=145 read=306 dirtied=185
I/O Timings: read=249.603 write=0.000
Time: 268.676 ms
- planning: 2.321 ms
- execution: 266.355 ms
- I/O read: 257.544 ms
- I/O write: 0.000 ms
Shared buffers:
- hits: 149 (~1.20 MiB) from the buffer pool
- reads: 309 (~2.40 MiB) from the OS file cache, including disk I/O
- dirtied: 185 (~1.40 MiB)
- writes: 0
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Merge request reports
Activity
assigned to @sashi_kumar
- Resolved by 🤖 GitLab Bot 🤖
Proper labels assigned to this merge request. Please ignore me.
@sashi_kumar - please see the following guidance and update this merge request.1 Error Please add typebug typefeature, or typemaintenance label to this merge request. Edited by 🤖 GitLab Bot 🤖
- A deleted user
added backend databasereview pending typebug labels
Reviewer roulette
Changes that require review have been detected!
Please refer to the table below for assigning reviewers and maintainers suggested by Danger in the specified category:
Category Reviewer Maintainer backend @rcobb
(UTC-8, 8 hours behind author)
@brytannia
(UTC+1, 1 hour ahead of author)
database @ck3g
(UTC+1, 1 hour ahead of author)
@ahegyi
(UTC+1, 1 hour ahead of author)
Please check reviewer's status!
Feel free to override these selections if you think someone else would be better-suited or use the GitLab Review Workload Dashboard to find other available reviewers.
To read more on how to use the reviewer roulette, please take a look at the Engineering workflow and code review guidelines. Please consider assigning a reviewer or maintainer who is a domain expert in the area of the merge request.
Once you've decided who will review this merge request, assign them as a reviewer! Danger does not automatically notify them for you.
If needed, you can retry the
danger-review
job that generated this comment.Generated by
Danger- Resolved by Krasimir Angelov
In case we need the new query in the description
SoftwareLicensePolicy LOAD SELECT "software_license_policies"."id" FROM "software_license_policies" WHERE "software_license_policies"."project_id" = 1 AND "software_license_policies"."scan_result_policy_id" IN ( SELECT "scan_result_policies"."id" FROM "scan_result_policies" WHERE "scan_result_policies"."security_orchestration_policy_configuration_id" = 1 AND "scan_result_policies"."project_id" = 1) ORDER BY "software_license_policies"."id" ASC, "software_license_policies"."updated_at" ASC LIMIT 1
- Resolved by Phil Calder
@mc_rocha Would you mind doing the initial review?
requested review from @mc_rocha
marked the checklist item I have evaluated the MR acceptance checklist for this MR. as completed
added corrective action label
mentioned in issue gitlab-com/gl-infra/production#17168 (closed)
Allure report
allure-report-publisher
generated test report!e2e-test-on-gdk:
test report for 30ee368dexpand test summary
+-----------------------------------------------------------------------+ | suites summary | +------------------+--------+--------+---------+-------+-------+--------+ | | passed | failed | skipped | flaky | total | result | +------------------+--------+--------+---------+-------+-------+--------+ | Verify | 32 | 0 | 0 | 0 | 32 | ✅ | | Create | 40 | 0 | 7 | 0 | 47 | ✅ | | Plan | 55 | 0 | 0 | 0 | 55 | ✅ | | Package | 0 | 0 | 1 | 0 | 1 | ➖ | | Monitor | 8 | 0 | 0 | 0 | 8 | ✅ | | Govern | 57 | 0 | 0 | 0 | 57 | ✅ | | Framework sanity | 0 | 0 | 1 | 0 | 1 | ➖ | | Data Stores | 23 | 0 | 0 | 0 | 23 | ✅ | | Manage | 0 | 0 | 1 | 0 | 1 | ➖ | +------------------+--------+--------+---------+-------+-------+--------+ | Total | 215 | 0 | 10 | 0 | 225 | ✅ | +------------------+--------+--------+---------+-------+-------+--------+
e2e-package-and-test:
test report for 30ee368dexpand test summary
+-----------------------------------------------------------------------+ | suites summary | +------------------+--------+--------+---------+-------+-------+--------+ | | passed | failed | skipped | flaky | total | result | +------------------+--------+--------+---------+-------+-------+--------+ | Govern | 297 | 1 | 20 | 3 | 318 | ❌ | | Create | 147 | 0 | 16 | 0 | 163 | ✅ | | Monitor | 8 | 0 | 0 | 0 | 8 | ✅ | | Framework sanity | 0 | 0 | 2 | 0 | 2 | ➖ | | Data Stores | 4 | 0 | 0 | 0 | 4 | ✅ | | Plan | 8 | 0 | 0 | 0 | 8 | ✅ | | Package | 0 | 0 | 2 | 0 | 2 | ➖ | +------------------+--------+--------+---------+-------+-------+--------+ | Total | 464 | 1 | 40 | 3 | 505 | ❌ | +------------------+--------+--------+---------+-------+-------+--------+
- Resolved by Krasimir Angelov
- Resolved by Krasimir Angelov
I would test this on large volumes;
the solution with explicit ID lists in
IN
was implemented for some reason in the past, I think. If we see here that the subquery behaves worse on large volumes, then we could also consider a CTE with MATERIALIZED, retrieving the ID list first, and then used in the next step, after materialization. // Just a side comment, I haven't looked at the plansUPD: please don't consider this like I'm thinking that the change won't help; likely it will, we just need to double-check on large volumes.
Edited by Nikolay Samokhvalov
The query is generating correctly and giving the same results as the previous one:
SELECT "software_license_policies".* FROM "software_license_policies" WHERE "software_license_policies"."project_id" = 81 AND "software_license_policies"."scan_result_policy_id" IN (SELECT "scan_result_policies"."id" FROM "scan_result_policies" WHERE "scan_result_policies"."security_orchestration_policy_configuration_id" = 30 AND "scan_result_policies"."project_id" = 81)
- Resolved by Sashi Kumar Kumaresan
- Resolved by Sashi Kumar Kumaresan
The
security_orchestration_policy_configuration_id
that caused the slow query is scoped to a namespace that has 54706 project. And each project has only 1 rows ofscan_result_policies
but since we dopolicy_configuration. scan_result_policy_reads.pluck(:id)
it selects54701
IDs.More context on the data: gitlab-com/gl-infra/production#17168 (comment 1663027970)
changed milestone to %16.7
- Resolved by Krasimir Angelov
@NikolayS
, thanks for approving this merge request.This is the first time the merge request has been approved. To ensure we don't only run predictive pipelines, and we don't break
master
, a new pipeline will be started shortly.Please wait for the pipeline to start before resolving this discussion and set auto-merge for the new pipeline. See merging a merge request for more details.
added pipeline:mr-approved label
mentioned in issue gitlab-org/release/tasks#7431 (closed)
requested review from @mayra-cabrera
removed review request for @mayra-cabrera
requested review from @krasio
- Resolved by Phil Calder
To summarize my understanding of the this MR (please correct me if I got anything wrong):
Currently we do the following:
-- 1. Find scan_result_policies IDs SELECT "scan_result_policies"."id" FROM "scan_result_policies" WHERE "scan_result_policies"."security_orchestration_policy_configuration_id" = $1; -- 2. Find batch start, end, etc... SELECT "software_license_policies"."id" FROM "software_license_policies" WHERE "software_license_policies"."project_id" = $1 AND "software_license_policies"."scan_result_policy_id" IN (...scan_result_policies IDs...) ORDER BY "software_license_policies"."id" ASC, "software_license_policies"."updated_at" ASC LIMIT 1; -- 3. And then the actual DELETE query
The problem we had in gitlab-com/gl-infra/production#17168 (closed) is that for some
security_orchestration_policy_configuration_id
query1
may return many thousands of records.To mitigate this we combine them all in a single query (plus sub-query) and add additional filter by
project_id
:-- 1. Find batch start, end, etc... SELECT "software_license_policies"."id" FROM "software_license_policies" WHERE "software_license_policies"."project_id" = $1 AND "software_license_policies"."scan_result_policy_id" IN ( SELECT "scan_result_policies"."id" FROM "scan_result_policies" WHERE "scan_result_policies"."security_orchestration_policy_configuration_id" = $2 AND "scan_result_policies"."project_id" = $1 ) ORDER BY "software_license_policies"."id" ASC, "software_license_policies"."updated_at" ASC LIMIT 1; -- 2. And then the actual DELETE query
In both cases, the queries generated by
EachBatch
are not supported by index, and may time-out depending on the data distribution - !137707 (comment 1663135267).We also know that when additionally filtered by
project_id
, the existing query1
can return at most 25 records (!137707 (comment 1663132305)), so even if we keep it as a separate query this should be fine, and resolve the problem we hit during the incident.Additionally, we should look into improving the iteration in batches. If the assumption that a project may have at most 600 records in
software_license_policies
(!137707 (comment 1663137161)), we may try with slow iteration, and apply all the filter on the batch itself (this will need an index onproject_id, id
).As for this MR, I see two options:
- Leave it as is. It is not ideal, and the batch query plan is more complex that what we have now, but it eliminates the problematic
pluck
query. - Keep the pluck query, but update it to add filter by
project_id
, as we know it may return 25 results at most. This also resolves the immediate problem, keeps the plan for the batch query somehow simpler, but requires a code change.
I am fine with either one, since both are kind of equally not good. I'm going to approve for database, in case we want to proceed as is, but we need to follow up and make things better.
/cc @pcalder
- Leave it as is. It is not ideal, and the batch query plan is more complex that what we have now, but it eliminates the problematic
added databaseapproved label and removed databasereview pending label
enabled an automatic merge when the pipeline for 77eb43a6 succeeds
mentioned in issue #432749 (closed)
Hello @sashi_kumar
The database team is looking for ways to improve the database review process and we would love your help!
If you'd be open to someone on the database team reaching out to you for a chat, or if you'd like to leave some feedback asynchronously, just post a reply to this comment mentioning:
@gitlab-org/database-team
And someone will be by shortly!
Thanks for your help!
This message was generated automatically. You're welcome to improve it.