Follow-up from "Fix recording audit event for status check response changes"


The following discussion from !201308 (merged) should be addressed:

  • @DylanGriffith started a discussion:

    Suggestion (non-blocking): This batching strategy looks suboptimal. Per https://docs.gitlab.com/development/database/batched_background_migrations/#using-a-composite-or-partial-index-to-iterate-a-subset-of-the-table, we need indexes that exactly match all the where clauses applied before each_batch. In this case the existing index previously almost matched:

    "idx_status_check_responses_on_id_and_status" btree (id, status)

    But even that wasn't quite right: the index would have needed to be (status, id) for the previous query to actually make use of efficient pagination (see the migration sketch after this discussion).

    Given the query plan in https://console.postgres.ai/gitlab/gitlab-production-main/sessions/42438/commands/129933 we can see the index scan is not very efficient and it basically needs to scan the whole table.

    In this case it worked out OK because there isn't that much data in this table. It still isn't cheap, though: it loads ~86.00 MiB of data from disk without finding a single matching record.

    With this implementation, pagination performance will degrade as the table grows, and each_batch won't give us stable per-batch performance because no index matches the where clause.

    For this reason it's usually more efficient to run each_batch only over the exactly indexed columns. Here that's just id, so we might get better performance with:

          ::MergeRequests::StatusCheckResponse.each_batch do |batch|
            # Pluck only the ids that match the filters, then update just
            # those records instead of failing the whole batch
            record_ids = batch.pending.timeout_eligible.pluck_primary_key
            ::MergeRequests::StatusCheckResponse.where(id: record_ids).update_all(status: 'failed')
          end
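
As a minimal illustration of the index fix under discussion: a composite index ordered (status, id) would let each_batch paginate on id within the status filter, per the linked docs. The migration class, index name, and column list below are assumptions for illustration, not taken from the MR:

    class AddStatusAndIdIndexToStatusCheckResponses < Gitlab::Database::Migration[2.2]
      disable_ddl_transaction!

      INDEX_NAME = 'idx_status_check_responses_on_status_and_id'

      # Equality-filtered column first, pagination column second, so the
      # index exactly matches the where clause each_batch iterates under
      def up
        add_concurrent_index :status_check_responses, [:status, :id], name: INDEX_NAME
      end

      def down
        remove_concurrent_index_by_name :status_check_responses, INDEX_NAME
      end
    end

A partial index (e.g. restricted to the statuses the query filters on) could shrink this further, as the same docs page describes.
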
  • @DylanGriffith started a discussion:

    Nitpick (non-blocking): If we're already getting a batch of ids here, we probably don't need find_each. And if we were sure we wanted to work in batches, we'd be better off using each_batch instead of find_each. But in this case the full list of ids is already in memory, so it's safe to assume it will never be too large and we can just use each. find_each adds overhead if it ends up splitting the set into groups, and its queries aren't optimized the way each_batch's are (see the sketch below).
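
To make the find_each vs. each trade-off concrete, here is a hedged sketch under the same assumptions (record_ids has already been plucked into memory; audit_status_change is a hypothetical helper standing in for the audit-event call):

    # The ids are already in memory, so a single query loads the records
    # and plain each iterates them with no further database round trips.
    ::MergeRequests::StatusCheckResponse.where(id: record_ids).each do |response|
      audit_status_change(response) # hypothetical helper
    end

    # find_each would instead re-query the relation in primary-key batches
    # (1,000 rows by default), adding query overhead without the matched
    # indexes that make each_batch's pagination stable.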
