Investigate and improve delete_software_license_policies query
Follow up !137707 (comment 1663144066)
To summarize my understanding of the this MR (please correct me if I got anything wrong):
Currently we do the following:
-- 1. Find scan_result_policies IDs SELECT "scan_result_policies"."id" FROM "scan_result_policies" WHERE "scan_result_policies"."security_orchestration_policy_configuration_id" = $1; -- 2. Find batch start, end, etc... SELECT "software_license_policies"."id" FROM "software_license_policies" WHERE "software_license_policies"."project_id" = $1 AND "software_license_policies"."scan_result_policy_id" IN (...scan_result_policies IDs...) ORDER BY "software_license_policies"."id" ASC, "software_license_policies"."updated_at" ASC LIMIT 1; -- 3. And then the actual DELETE query
The problem we had in gitlab-com/gl-infra/production#17168 (closed) is that for some
security_orchestration_policy_configuration_id
query1
may return many thousands of records.To mitigate this we combine them all in a single query (plus sub-query) and add additional filter by
project_id
:-- 1. Find batch start, end, etc... SELECT "software_license_policies"."id" FROM "software_license_policies" WHERE "software_license_policies"."project_id" = $1 AND "software_license_policies"."scan_result_policy_id" IN ( SELECT "scan_result_policies"."id" FROM "scan_result_policies" WHERE "scan_result_policies"."security_orchestration_policy_configuration_id" = $2 AND "scan_result_policies"."project_id" = $1 ) ORDER BY "software_license_policies"."id" ASC, "software_license_policies"."updated_at" ASC LIMIT 1; -- 2. And then the actual DELETE query
In both cases, the queries generated by
EachBatch
are not supported by index, and may time-out depending on the data distribution - !137707 (comment 1663135267).We also know that when additionally filtered by
project_id
, the existing query1
can return at most 25 records (!137707 (comment 1663132305)), so even if we keep it as a separate query this should be fine, and resolve the problem we hit during the incident.Additionally, we should look into improving the iteration in batches. If the assumption that a project may have at most 600 records in
software_license_policies
(!137707 (comment 1663137161)), we may try with slow iteration, and apply all the filter on the batch itself (this will need an index onproject_id, id
).As for this MR, I see two options:
- Leave it as is. It is not ideal, and the batch query plan is more complex that what we have now, but it eliminates the problematic
pluck
query.- Keep the pluck query, but update it to add filter by
project_id
, as we know it may return 25 results at most. This also resolves the immediate problem, keeps the plan for the batch query somehow simpler, but requires a code change.I am fine with either one, since both are kind of equally not good. I'm going to approve for database, in case we want to proceed as is, but we need to follow up and make things better.