SPIKE: update_column_in_batches improvements
While checking update_column_in_batches for this MR !62627 (merged) - the following questions and queries came up:
- A
count
query for the whole query
- Might require proper indexes on gprd
- Is used to return if
total == 0
- Is used to set the batch size to 5% of the total
- A 5% default batch size based on the total affected tables
- Depending on the distribution of the data across the table for the particular query against the current table, this may be not that useful
-
A select query that checks
LIMIT 1 OFFSET <BATCH_SIZE>
to check if any update is needed -
The
UPDATE
andSELECT
both depend onid
and LIMIT
For a query returning a sparse result set on a large table, the range by id will be too large but the actual rows needing update will be small. As the ID range gets bigger, even with a small batch size we might risk some timeouts.
-- We detect the stop_id for the last affected row
SELECT "merge_requests"."id" FROM "merge_requests" WHERE "merge_requests"."id" >= 103199407 AND "merge_requests"."state_id" = 1 AND "merge_requests"."draft" = FALSE AND "merge_requests"."title" ~* '^\[draft\]|\(draft\)|draft:|draft|\[WIP\]|WIP:|WIP' ORDER BY "merge_requests"."id" ASC LIMIT 1 OFFSET 100
-- We issue an update query
UPDATE "merge_requests" SET "draft" = TRUE WHERE "merge_requests"."id" >= 103199407 AND "merge_requests"."id" < 103306493 AND "merge_requests"."state_id" = 1 AND "merge_requests"."draft" = FALSE AND "merge_requests"."title" ~* '^\[draft\]|\(draft\)|draft:|draft|\[WIP\]|WIP:|WIP'
- The last range is unbounded - this could have performance implications on a large table
SELECT "merge_requests"."id" FROM "merge_requests" WHERE "merge_requests"."id" >= 103306493 AND "merge_requests"."state_id" = 1 AND "merge_requests"."draft" = FALSE AND "merge_requests"."title" ~* '^\[draft\]|\(draft\)|draft:|draft|\[WIP\]|WIP:|WIP' ORDER BY "merge_requests"."id" ASC LIMIT 1 OFFSET 100
UPDATE "merge_requests" SET "draft" = TRUE WHERE "merge_requests"."id" >= 103306493 AND "merge_requests"."state_id" = 1 AND "merge_requests"."draft" = FALSE AND "merge_requests"."title" ~* '^\[draft\]|\(draft\)|draft:|draft|\[WIP\]|WIP:|WIP'
Risks
With larger and larger table sizes, and sparser queries we could have more risks of timeout for individual queries in a batch.
Some improvement ideas
Investigate: Update with limit clone in postgress
WITH cte AS (
SELECT id
FROM merge_requests
WHERE ...
LIMIT 1
)
UPDATE merge_requests
SET draft = true
FROM cte
WHERE merge_requests.id IN cte.id
Details with a batch size of 10
-- exec_query("SELECT COUNT(*) AS count FROM \"merge_requests\" WHERE \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'")
-> 0.0021s
-- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1")
-> 0.0010s
-- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 1 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 10")
-> 0.0010s
-- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 1 AND \"merge_requests\".\"id\" < 11 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'")
-> 0.0051s
-- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 11 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 10")
-> 0.0013s
-- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 11 AND \"merge_requests\".\"id\" < 21 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'")
-> 0.0046s
-- exec_query("SELECT \"merge_requests\".\"id\" FROM \"merge_requests\" WHERE \"merge_requests\".\"id\" >= 21 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP' ORDER BY \"merge_requests\".\"id\" ASC LIMIT 1 OFFSET 10")
-> 0.0011s
-- execute("UPDATE \"merge_requests\" SET \"draft\" = TRUE WHERE \"merge_requests\".\"id\" >= 21 AND \"merge_requests\".\"state_id\" = 1 AND \"merge_requests\".\"draft\" = FALSE AND \"merge_requests\".\"title\" ~* '^\\[draft\\]|\\(draft\\)|draft:|draft|\\[WIP\\]|WIP:|WIP'")
-> 0.0043s
Edited by Alper Akgun