Reschedule Feedback -> StateTransition background migration, attempt 3

What does this MR do and why?

This MR reschedules MigrateVulnerabilitiesFeedbackToVulnerabilitiesStateTransition. Differences this time around:

  1. Smaller delay interval
  2. Smaller sub-batch size

Related to #387665 (closed)

Database review

Reasons for failure

According to the investigation on Slack, one of the jobs reached its maximum number of retries and was marked as failed, which in turn failed the entire migration.

The offending job has ID 174091. It was on its third attempt and descends from a batch that started with a batch size of 250.

Job details
#<Gitlab::Database::BackgroundMigration::BatchedJob:0x00007f4a34705d48
 id: 174091,
 created_at: Thu, 09 Feb 2023 10:05:38.402537000 UTC +00:00,
 updated_at: Thu, 09 Feb 2023 10:29:17.170663000 UTC +00:00,
 started_at: Thu, 09 Feb 2023 10:29:01.528448000 UTC +00:00,
 finished_at: Thu, 09 Feb 2023 10:29:17.170157000 UTC +00:00,
 batched_background_migration_id: 354,
 min_value: 302210,
 max_value: 302891,
 batch_size: 31,
 sub_batch_size: 50,
 status: 2,
 attempts: 3,
 metrics: {},
 pause_ms: 100>

If I understand things correctly then it was:

  1. Tried three times with a batch size of 250
  2. Tried three times with a batch size of 125
  3. Tried three times with a batch size of 61
  4. Marked as failed because can_split? returned false: its batch_size > sub_batch_size check no longer held (the job above has batch_size 31 and sub_batch_size 50), so the batch could not be split further
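
The split check in step 4 can be sketched as below. This is a simplified, hypothetical model rather than the actual GitLab implementation: the Job struct and MAX_ATTEMPTS constant are illustrative names, and the real can_split? may check more conditions than the one quoted above.

```ruby
# Hypothetical stand-in for a batched job; only the fields used by the check.
Job = Struct.new(:batch_size, :sub_batch_size, :attempts, keyword_init: true)

# Assumed retry limit, taken from "tried three times" in the list above.
MAX_ATTEMPTS = 3

# A job can be split only once its attempts are exhausted and its batch is
# still larger than a single sub-batch.
def can_split?(job)
  job.attempts >= MAX_ATTEMPTS && job.batch_size > job.sub_batch_size
end

# Values copied from the job details of job 174091.
failed_job = Job.new(batch_size: 31, sub_batch_size: 50, attempts: 3)

# The same job at the original batch size, for contrast.
original_job = Job.new(batch_size: 250, sub_batch_size: 50, attempts: 3)

puts "batch_size 31:  can_split? = #{can_split?(failed_job)}"
puts "batch_size 250: can_split? = #{can_split?(original_job)}"
```

With a sub-batch size of 50, the check stops holding after only a few halvings of the batch size, which is why the job ran out of options so quickly.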

Reason for timeout

Thanks to !111630 (closed) and the logs, I determined that the timeout in this particular batch was caused by preloading:

SELECT
  "security_findings"."id",
  "security_findings"."scan_id",
  "security_findings"."scanner_id",
  "security_findings"."severity",
  "security_findings"."confidence",
  "security_findings"."project_fingerprint",
  "security_findings"."deduplicated",
  "security_findings"."uuid",
  "security_findings"."overridden_uuid",
  "security_findings"."finding_data"
FROM "security_findings"
WHERE "security_findings"."uuid" IN (
  '681c6f87-e2f6-58bb-9c2f-00c97a3af1f0',
  'c3d493a8-9de7-55b8-b3ed-2999cea89175',
  'd193d3f7-135b-5c00-aa53-7b72e3cd4208',
  'd82b64f0-a69d-520b-9535-e3e50077540b',
  'f23e5e57-bb28-58be-aa71-f57320874916'
);

This query has an atrocious query plan: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/15367/commands/53427

We discussed this in !97699 (comment 1175832004), and in !97699 (comment 1180066710) we agreed that a batch size of 250 is a safe choice, but it appears not to be a good choice for this particular batch.

I will test a smaller sub-batch size and see whether retries fix that particular batch. Otherwise I think we're looking at a smaller batch size.

Proposed fix

Reduce sub-batch size to 5.

Estimated runtime

We have 760329 rows in the vulnerability_feedback table. With a batch size of 250 that is 3042 batches, and with a delay interval of 2 minutes that comes to 6084 minutes, or ~101 hours of total runtime. Fortunately we don't have to migrate everything: each batch filters on the migrated_to_state_transition column, which leaves us with 120680 records. As such, the actual number of batches is 483, which gives us 966 minutes, or around 16 hours of total runtime.
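
The arithmetic behind this estimate can be reproduced with a short sketch. It assumes that every batch is full, that exactly one batch runs per delay interval, and that no retries or splits occur. The estimate helper and the constant names are made up for illustration.

```ruby
# Figures quoted in the estimate above.
TOTAL_ROWS      = 760_329
UNMIGRATED_ROWS = 120_680
BATCH_SIZE      = 250
DELAY_MINUTES   = 2

# Assumes one batch per delay interval, so runtime = batches * interval.
def estimate(rows, batch_size, delay_minutes)
  batches = (rows + batch_size - 1) / batch_size # integer ceiling division
  minutes = batches * delay_minutes
  { batches: batches, minutes: minutes, hours: (minutes / 60.0).round(1) }
end

full      = estimate(TOTAL_ROWS, BATCH_SIZE, DELAY_MINUTES)
remaining = estimate(UNMIGRATED_ROWS, BATCH_SIZE, DELAY_MINUTES)

puts "whole table:     #{full}"      # 3042 batches, 6084 minutes
puts "unmigrated only: #{remaining}" # 483 batches, 966 minutes
```

Any batch that has to be retried or split adds to these numbers, so they are best read as a lower bound.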

Sanity check

If my understanding is correct, each of the 483 batches will migrate 5 records per iteration, which gives us 250/5 = 50 iterations per batch. If a batch fails with a statement timeout 3 times, it will be split and retried with a smaller batch size (125 after the first split), again up to 3 times, and so on. As such, each batch gets up to 6 rounds of 3 attempts before it fails:

  1. Try with batch size of 250 up to 3 times
  2. Try with batch size of 125 up to 3 times
  3. Try with batch size of 62 up to 3 times
  4. Try with batch size of 31 up to 3 times
  5. Try with batch size of 15 up to 3 times
  6. Try with batch size of 7 up to 3 times
  7. Fail because 3 is lower than the sub-batch size of 5
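
The sequence above can be derived with a small sketch. It assumes that each split halves the batch size using integer division and that splitting stops once the batch size is no longer greater than the sub-batch size. split_chain is a hypothetical helper written for this description, not a GitLab method.

```ruby
SUB_BATCH_SIZE = 5 # proposed in this MR
MAX_ATTEMPTS   = 3 # attempts per batch size, per the list above

# Halve the batch size until it is no longer larger than the sub-batch size.
# Returns the sizes that get attempted and the size at which splitting stops.
def split_chain(batch_size, sub_batch_size)
  attempted = []
  while batch_size > sub_batch_size
    attempted << batch_size
    batch_size /= 2
  end
  [attempted, batch_size]
end

attempted, stop_size = split_chain(250, SUB_BATCH_SIZE)

# Number of sub-batch iterations in one full batch (ceiling division).
sub_batches_per_batch = (250 + SUB_BATCH_SIZE - 1) / SUB_BATCH_SIZE

puts "attempted sizes: #{attempted.inspect}"
puts "stops at batch size: #{stop_size}"
puts "worst-case attempts: #{attempted.size * MAX_ATTEMPTS}"
puts "sub-batches per full batch: #{sub_batches_per_batch}"
```

Under the same assumptions, calling split_chain(250, 50) yields only three attempted sizes before stopping at 31, which is in line with the failure analysed earlier.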

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Michał Zając