
Add background migration to backfill casted_cluster_agent_id

What does this MR do and why?

In GraphQL filter for vulnerabilities by container... (#358763 - closed) we want to add a filter for vulnerabilities by container image. To do that, we need to change the type of a column that was erroneously set to text instead of bigint. This MR adds a background migration that backfills casted_cluster_agent_id with data from the cluster_agent_id column. A final MR will remove the old column and rename casted_cluster_agent_id to cluster_agent_id. This is the second of three planned MRs (the first was Change column type of vulnerability_reads.clust... (!88333 - merged)).
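For illustration, here is a minimal sketch of how such a backfill is typically enqueued with GitLab's queue_batched_background_migration helper; the class name, job name, and down step are illustrative assumptions, not necessarily what this MR ships:

# Sketch of a post-deployment migration enqueuing the backfill.
# Class/constant names are illustrative; batch sizes match the estimates below.
class QueueBackfillCastedClusterAgentId < Gitlab::Database::Migration[2.0]
  MIGRATION = 'BackfillVulnerabilityReadsClusterAgent' # hypothetical job name
  DELAY_INTERVAL = 2.minutes
  BATCH_SIZE = 2_500
  SUB_BATCH_SIZE = 100

  def up
    queue_batched_background_migration(
      MIGRATION,
      :vulnerability_reads,
      :id,
      job_interval: DELAY_INTERVAL,
      batch_size: BATCH_SIZE,
      sub_batch_size: SUB_BATCH_SIZE
    )
  end

  def down
    delete_batched_background_migration(MIGRATION, :vulnerability_reads, :id, [])
  end
end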

⋊> env VERBOSE=true bundle exec rake db:migrate:up:main VERSION=20220525201022
main: == 20220525201022 AddTemporaryIndexForVulnerabilityReadsClusterAgentIdMigration: migrating 
main: -- transaction_open?()
main:    -> 0.0000s
main: -- index_exists?(:vulnerability_reads, [:id, :report_type], {:name=>"tmp_index_vulnerability_reads_on_report_type", :algorithm=>:concurrently})
main:    -> 0.0148s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0005s
main: -- add_index(:vulnerability_reads, [:id, :report_type], {:name=>"tmp_index_vulnerability_reads_on_report_type", :algorithm=>:concurrently})
main:    -> 0.0266s
main: -- execute("RESET statement_timeout")
main:    -> 0.0006s
main: -- transaction_open?()
main:    -> 0.0000s
main: -- index_exists?(:cluster_agents, [:id, :project_id], {:name=>"tmp_index_cluster_agents_on_project_id", :algorithm=>:concurrently})
main:    -> 0.0042s
main: -- add_index(:cluster_agents, [:id, :project_id], {:name=>"tmp_index_cluster_agents_on_project_id", :algorithm=>:concurrently})
main:    -> 0.0041s
main: == 20220525201022 AddTemporaryIndexForVulnerabilityReadsClusterAgentIdMigration: migrated (0.0696s) 
⋊> env VERBOSE=true bundle exec rake db:migrate:down:main VERSION=20220525201022
main: == 20220525201022 AddTemporaryIndexForVulnerabilityReadsClusterAgentIdMigration: reverting 
main: -- transaction_open?()
main:    -> 0.0000s
main: -- indexes(:vulnerability_reads)
main:    -> 0.0175s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0006s
main: -- remove_index(:vulnerability_reads, {:algorithm=>:concurrently, :name=>"tmp_index_vulnerability_reads_on_report_type"})
main:    -> 0.0050s
main: -- execute("RESET statement_timeout")
main:    -> 0.0008s
main: -- transaction_open?()
main:    -> 0.0000s
main: -- indexes(:cluster_agents)
main:    -> 0.0055s
main: -- remove_index(:cluster_agents, {:algorithm=>:concurrently, :name=>"tmp_index_cluster_agents_on_project_id"})
main:    -> 0.0025s
main: == 20220525201022 AddTemporaryIndexForVulnerabilityReadsClusterAgentIdMigration: reverted (0.0482s) 
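For reference, the migration above amounts to GitLab's concurrent-index helpers; a sketch reconstructed from the log output (the real file may differ in details):

class AddTemporaryIndexForVulnerabilityReadsClusterAgentIdMigration < Gitlab::Database::Migration[2.0]
  VULNERABILITY_READS_INDEX = 'tmp_index_vulnerability_reads_on_report_type'
  CLUSTER_AGENTS_INDEX = 'tmp_index_cluster_agents_on_project_id'

  disable_ddl_transaction!

  def up
    # Both indexes are temporary and only support the backfill queries below.
    add_concurrent_index :vulnerability_reads, [:id, :report_type], name: VULNERABILITY_READS_INDEX
    add_concurrent_index :cluster_agents, [:id, :project_id], name: CLUSTER_AGENTS_INDEX
  end

  def down
    remove_concurrent_index_by_name :vulnerability_reads, VULNERABILITY_READS_INDEX
    remove_concurrent_index_by_name :cluster_agents, CLUSTER_AGENTS_INDEX
  end
end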

Queries

explain SELECT MIN(vulnerability_reads.id), MAX(vulnerability_reads.id) FROM vulnerability_reads;

Time: 17ms

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10292/commands/36465

The vulnerability_reads table holds around 14 million records (https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10292/commands/36430), but we will only perform updates for ~300,000 of them (https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10292/commands/36432 and https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10292/commands/36475):

select vulnerability_reads.id from vulnerability_reads INNER JOIN cluster_agents ON CAST(vulnerability_reads.cluster_agent_id AS bigint) = cluster_agents.id AND vulnerability_reads.project_id = cluster_agents.project_id INNER JOIN projects ON vulnerability_reads.project_id = projects.id WHERE vulnerability_reads.report_type = 7;
Time: 9.225 s
  - planning: 18.630 ms
  - execution: 9.207 s
    - I/O read: 6.139 s
    - I/O write: 0.000 ms

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/10292/commands/36432

With a batch size of 2_500, we get 14_000_000 / 2_500 = 5_600 batches; the interval delay is set to 2 minutes, so the migration will take about 5_600 * 2.minutes = 11_200.minutes ≈ 187 hours.
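The same estimate, spelled out (assuming ActiveSupport's duration helpers):

require 'active_support/all'

batches = 14_000_000 / 2_500   # => 5600 batches
total   = 2.minutes * batches  # => 11200 minutes
puts total.in_hours.round(1)   # => 186.7 (~187 hours)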

Each batch will perform updates in sub-batches of 100:

explain select vulnerability_reads.id from vulnerability_reads INNER JOIN cluster_agents ON CAST(vulnerability_reads.cluster_agent_id AS bigint) = cluster_agents.id AND vulnerability_reads.project_id = cluster_agents.project_id INNER JOIN projects ON vulnerability_reads.project_id = projects.id WHERE vulnerability_reads.report_type = 7 AND vulnerability_reads.id BETWEEN 13750000 AND 13752000;
Time: 323.135 ms  
  - planning: 22.626 ms  
  - execution: 300.509 ms  
    - I/O read: 298.628 ms  
    - I/O write: 0.000 ms  

https://postgres.ai/console/gitlab/gitlab-production-ci/sessions/11077/commands/39693

Update query for a sub-batch

explain update vulnerability_reads SET casted_cluster_agent_id = CAST(vulnerability_reads.cluster_agent_id AS bigint) FROM cluster_agents WHERE CAST(vulnerability_reads.cluster_agent_id AS bigint) = cluster_agents.id AND vulnerability_reads.project_id = cluster_agents.project_id AND vulnerability_reads.report_type = 7 AND vulnerability_reads.id BETWEEN 13740000 AND 13740100;
Summary:
Time: 62.326 ms
  - planning: 46.800 ms
  - execution: 15.526 ms
    - I/O read: 13.950 ms
    - I/O write: 0.000 ms

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/11078/commands/39694
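For illustration, a hypothetical sub-batch job body that issues the UPDATE above; only the SQL mirrors this MR, the class and interface are assumptions:

# Hypothetical job class; GitLab's real batched-migration job interface may differ.
class BackfillCastedClusterAgentId
  def perform(start_id, end_id)
    # Same UPDATE as the query above, parameterized by the sub-batch ID range.
    ApplicationRecord.connection.execute(<<~SQL)
      UPDATE vulnerability_reads
      SET casted_cluster_agent_id = CAST(vulnerability_reads.cluster_agent_id AS bigint)
      FROM cluster_agents
      WHERE CAST(vulnerability_reads.cluster_agent_id AS bigint) = cluster_agents.id
        AND vulnerability_reads.project_id = cluster_agents.project_id
        AND vulnerability_reads.report_type = 7
        AND vulnerability_reads.id BETWEEN #{Integer(start_id)} AND #{Integer(end_id)}
    SQL
  end
end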

We perform 25 sub-batches per batch (2_500 / 100 = 25), and each batch should finish within 2 minutes, so the maximum time for a single update is 2.minutes / 25 = 120.seconds / 25 = 4.8.seconds; with the temporary indexes we are well below that value.
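And the per-update budget as a quick check (again assuming ActiveSupport):

require 'active_support/all'

sub_batches = 2_500 / 100          # => 25 sub-batch updates per batch
budget      = 2.minutes / sub_batches
puts budget.in_seconds             # => 4.8 seconds per update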

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to GraphQL filter for vulnerabilities by container... (#358763 - closed)
