Add and backfill project_id to merge_request_diffs
What does this MR do and why?
As part of the Cells architecture we will need to add a "sharding key" to all tables that are sharded across cells. Cells are separate GitLab instances and will eventually be how we run GitLab.com. We've documented in https://docs.gitlab.com/ee/architecture/blueprints/organization/isolation.html how this sharding key will be necessary to prevent any cross organization data in the database and allow us to confidently move data between cells.
The plan is to add either project_id
or namespace_id
to every table that ultimately belongs to an "Organization" and can therefore be moved between Cells as a complete unit. A lot of tables already have one of these columns but we need to add and backfill this column for a few hundred tables. A lot of the migrations will be similar to this one in that we can just look up another foreign key and find the project_id
from that related parent table.
This MR is the first in adding and backfilling this column to a large table, with asynchronous index creation and constraint validation.
The merge_request_diffs
table is very large, with over half a billion records. If we used the default batch size/delay (1000 records every 2 minutes) the migration would take over a year. To bring the total time down to something reasonable I've used a batch size of 10,000 instead. If we conservatively allow 0.5s for every sub-batch of 100 (see explain plan below), a batch should comfortably complete in around a minute (0.5s * 100 sub-batches, with a small pause between batches), allowing time for any database housekeeping before the next batch starts.
With a batch size of 10000
and a 2.minutes
wait between updates, updating ~540M rows should take around 540_000_000/10_000 * 2 minutes = 75 days
.
Database queries
Sample batch, from somewhere near the middle of the table:
UPDATE merge_request_diffs
SET project_id = merge_requests.target_project_id
FROM merge_requests
WHERE merge_requests.id = merge_request_diffs.merge_request_id
AND merge_request_diffs.id IN (
SELECT "merge_request_diffs"."id"
FROM "merge_request_diffs"
WHERE "merge_request_diffs"."id" BETWEEN 1 AND 844732995
AND "merge_request_diffs"."project_id" IS NULL
AND "merge_request_diffs"."id" >= 400000004 AND "merge_request_diffs"."id" < 400000172
)
https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/23870/commands/76396
MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
-
I have evaluated the MR acceptance checklist for this MR.
Related to #429174 (closed)