Skip to content
Snippets Groups Projects

Improve delete_software_license_policies query

Merged Sashi Kumar Kumaresan requested to merge sk/fix-17168-inc into master

What does this MR do and why?

This MR improves the performance of delete query to include the scan_result_policy_reads of the concerned project only to reduce the number of IDs filtered. This MR also changes the query to sub-query instead of IN query.

This is a corrective action of Incident: gitlab-com/gl-infra/production#17168 (closed)

The security_orchestration_policy_configuration_id that caused the slow query is scoped to a namespace that has 54706 project. And each project has only 1 rows of scan_result_policies but since we do policy_configuration. scan_result_policy_reads.pluck(:id) it selects 54701 IDs.

More context on the data: gitlab-com/gl-infra/production#17168 (comment 1663027970)

Database

Before

SELECT
    "scan_result_policies"."id" 
FROM
    "scan_result_policies" 
WHERE
    "scan_result_policies"."security_orchestration_policy_configuration_id" = 1029627


SELECT
    "software_license_policies".* 
FROM
    "software_license_policies" 
WHERE
    "software_license_policies"."project_id" = 24583235 
    AND "software_license_policies"."scan_result_policy_id" IN (10324607, 10326279, 10324491, 10326623, 10324468, 10326230, 10324484, 10326628, 10324495, 10326710, 10324552, 10326753, 10324563, 10326804, 10324572, 10326875, 10324586, 10326412, 10324744, 10326386, 10324627, 10326702, 10324675, 10326810, 10324550, 10327045, 10324597, 10326937, 10324913, 10327059, 10324493, 10326617, 10324669, 10327301, 10324615, 10326282, 10324490, 10326621, 10324472, 10326246, 10324469, 10326235, 10324467, 10326553, 10324588, 10326624, 10324473, 10326227, 10324482, 10326228, 10324537, 10327189, 10324505, 10326631, 10324576, 10327187, 10324585, 10326312, 10324489, 10326604, 10324511, 10326644)

After

SELECT
    "software_license_policies".* 
FROM
    "software_license_policies" 
WHERE
    "software_license_policies"."project_id" = 43110842 
    AND "software_license_policies"."scan_result_policy_id" IN (
        SELECT
            "scan_result_policies"."id" 
        FROM
            "scan_result_policies" 
        WHERE
            "scan_result_policies"."security_orchestration_policy_configuration_id" = 4261 
            AND "scan_result_policies"."project_id" = 43110842

Query Plan : https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/24184/commands/77181

 Nested Loop  (cost=0.85..18.21 rows=1 width=40) (actual time=16.599..266.054 rows=456 loops=1)
   Buffers: shared hit=149 read=309 dirtied=185
   I/O Timings: read=257.544 write=0.000
   ->  Index Scan using index_scan_result_policies_on_position_in_configuration on public.scan_result_policies  (cost=0.43..3.45 rows=1 width=8) (actual time=8.047..8.049 rows=1 loops=1)
         Index Cond: ((scan_result_policies.security_orchestration_policy_configuration_id = 4261) AND (scan_result_policies.project_id = 43110842))
         Buffers: shared hit=4 read=3
         I/O Timings: read=7.941 write=0.000
   ->  Index Scan using idx_software_license_policies_unique_on_project_and_scan_policy on public.software_license_policies  (cost=0.42..14.76 rows=1 width=40) (actual time=8.544..257.659 rows=456 loops=1)
         Index Cond: ((software_license_policies.project_id = 43110842) AND (software_license_policies.scan_result_policy_id = scan_result_policies.id))
         Buffers: shared hit=145 read=306 dirtied=185
         I/O Timings: read=249.603 write=0.000

Time: 268.676 ms  
  - planning: 2.321 ms  
  - execution: 266.355 ms  
    - I/O read: 257.544 ms  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 149 (~1.20 MiB) from the buffer pool  
  - reads: 309 (~2.40 MiB) from the OS file cache, including disk I/O  
  - dirtied: 185 (~1.40 MiB)  
  - writes: 0  

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Alan (Maciej) Paruszewski

Merge request reports

Loading
Loading

Activity

Filter activity
  • Approvals
  • Assignees & reviewers
  • Comments (from bots)
  • Comments (from users)
  • Commits & branches
  • Edits
  • Labels
  • Lock status
  • Mentions
  • Merge request status
  • Tracking
    • Resolved by Krasimir Angelov

      I would test this on large volumes;

      the solution with explicit ID lists in IN was implemented for some reason in the past, I think. If we see here that the subquery behaves worse on large volumes, then we could also consider a CTE with MATERIALIZED, retrieving the ID list first, and then used in the next step, after materialization. // Just a side comment, I haven't looked at the plans

      UPD: please don't consider this like I'm thinking that the change won't help; likely it will, we just need to double-check on large volumes.

      Edited by Nikolay Samokhvalov
  • The query is generating correctly and giving the same results as the previous one:

    SELECT "software_license_policies".* FROM "software_license_policies" WHERE "software_license_policies"."project_id" = 81 AND "software_license_policies"."scan_result_policy_id" IN (SELECT "scan_result_policies"."id" FROM "scan_result_policies" WHERE "scan_result_policies"."security_orchestration_policy_configuration_id" = 30 AND "scan_result_policies"."project_id" = 81)
  • added 1 commit

    Compare with previous version

  • The security_orchestration_policy_configuration_id that caused the slow query is scoped to a namespace that has 54706 project. And each project has only 1 rows of scan_result_policies but since we do policy_configuration. scan_result_policy_reads.pluck(:id) it selects 54701 IDs.

    More context on the data: gitlab-com/gl-infra/production#17168 (comment 1663027970)

  • Sashi Kumar Kumaresan changed the description

    changed the description

  • added 1 commit

    • 30ee368d - Use let_it_be instead of let

    Compare with previous version

  • Sashi Kumar Kumaresan changed milestone to %16.7

    changed milestone to %16.7

  • Hey folks, marking as an severity1 infradev given that it's blocking canary deploys.

  • Nikolay Samokhvalov approved this merge request

    approved this merge request

  • Marcos Rocha approved this merge request

    approved this merge request

  • Biren Shah approved this merge request

    approved this merge request

  • requested review from @mayra-cabrera

  • Mayra Cabrera removed review request for @mayra-cabrera

    removed review request for @mayra-cabrera

  • requested review from @krasio

  • Sashi Kumar Kumaresan changed the description

    changed the description

  • Alan (Maciej) Paruszewski changed the description

    changed the description

  • Krasimir Angelov approved this merge request

    approved this merge request

    • Resolved by Phil Calder

      To summarize my understanding of the this MR (please correct me if I got anything wrong):

      Currently we do the following:

      -- 1. Find scan_result_policies IDs
      SELECT
          "scan_result_policies"."id" 
      FROM
          "scan_result_policies" 
      WHERE
          "scan_result_policies"."security_orchestration_policy_configuration_id" = $1;
      
      
      -- 2. Find batch start, end, etc...
      SELECT
          "software_license_policies"."id"
      FROM
          "software_license_policies"
      WHERE
          "software_license_policies"."project_id" = $1
          AND "software_license_policies"."scan_result_policy_id" IN (...scan_result_policies IDs...)
      ORDER BY
          "software_license_policies"."id" ASC,
          "software_license_policies"."updated_at" ASC
      LIMIT
          1;
      
      -- 3. And then the actual DELETE query

      The problem we had in gitlab-com/gl-infra/production#17168 (closed) is that for some security_orchestration_policy_configuration_id query 1 may return many thousands of records.

      To mitigate this we combine them all in a single query (plus sub-query) and add additional filter by project_id:

      -- 1. Find batch start, end, etc...
      SELECT
          "software_license_policies"."id"
      FROM
          "software_license_policies"
      WHERE
          "software_license_policies"."project_id" = $1
          AND "software_license_policies"."scan_result_policy_id" IN (
              SELECT
                  "scan_result_policies"."id"
              FROM
                  "scan_result_policies"
              WHERE
                  "scan_result_policies"."security_orchestration_policy_configuration_id" = $2
                  AND "scan_result_policies"."project_id" = $1
          )
      ORDER BY
          "software_license_policies"."id" ASC,
          "software_license_policies"."updated_at" ASC
      LIMIT
        1;
      
      -- 2. And then the actual DELETE query

      In both cases, the queries generated by EachBatch are not supported by index, and may time-out depending on the data distribution - !137707 (comment 1663135267).

      We also know that when additionally filtered by project_id, the existing query 1 can return at most 25 records (!137707 (comment 1663132305)), so even if we keep it as a separate query this should be fine, and resolve the problem we hit during the incident.

      Additionally, we should look into improving the iteration in batches. If the assumption that a project may have at most 600 records in software_license_policies (!137707 (comment 1663137161)), we may try with slow iteration, and apply all the filter on the batch itself (this will need an index on project_id, id).

      As for this MR, I see two options:

      1. Leave it as is. It is not ideal, and the batch query plan is more complex that what we have now, but it eliminates the problematic pluck query.
      2. Keep the pluck query, but update it to add filter by project_id, as we know it may return 25 results at most. This also resolves the immediate problem, keeps the plan for the batch query somehow simpler, but requires a code change.

      I am fine with either one, since both are kind of equally not good. I'm going to approve for database, in case we want to proceed as is, but we need to follow up and make things better.

      /cc @pcalder

  • Sincheol (David) Kim resolved all threads

    resolved all threads

  • Sincheol (David) Kim approved this merge request

    approved this merge request

  • Sincheol (David) Kim enabled an automatic merge when the pipeline for 77eb43a6 succeeds

    enabled an automatic merge when the pipeline for 77eb43a6 succeeds

  • mentioned in issue #432749 (closed)

  • Hello @sashi_kumar :wave:

    The database team is looking for ways to improve the database review process and we would love your help!

    If you'd be open to someone on the database team reaching out to you for a chat, or if you'd like to leave some feedback asynchronously, just post a reply to this comment mentioning:

    @gitlab-org/database-team

    And someone will be by shortly!

    Thanks for your help! :heart:

    This message was generated automatically. You're welcome to improve it.

  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Loading
  • Please register or sign in to reply
    Loading