Skip to content

Add and backfill project_id to merge_request_diffs

What does this MR do and why?

As part of the Cells architecture we will need to add a "sharding key" to all tables that are sharded across cells. Cells are separate GitLab instances and will eventually be how we run GitLab.com. We've documented in https://docs.gitlab.com/ee/architecture/blueprints/organization/isolation.html how this sharding key will be necessary to prevent any cross organization data in the database and allow us to confidently move data between cells.

The plan is to add either project_id or namespace_id to every table that ultimately belongs to an "Organization" and can therefore be moved between Cells as a complete unit. A lot of tables already have one of these columns but we need to add and backfill this column for a few hundred tables. A lot of the migrations will be similar to this one in that we can just look up another foreign key and find the project_id from that related parent table.

This MR is the first in adding and backfilling this column to a large table, with asynchronous index creation and constraint validation.

The merge_request_diffs table is very large, with over half a billion records. If we used the default batch size/delay (1000 records every 2 minutes) the migration would take over a year. To bring the total time down to something reasonable I've used a batch size of 10,000 instead. If we conservatively allow 0.5s for every sub-batch of 100 (see explain plan below), a batch should comfortably complete in around a minute (0.5s * 100 sub-batches, with a small pause between batches), allowing time for any database housekeeping before the next batch starts.

With a batch size of 10000 and a 2.minutes wait between updates, updating ~540M rows should take around 540_000_000/10_000 * 2 minutes = 75 days.

Database queries

Sample batch, from somewhere near the middle of the table:

UPDATE merge_request_diffs
SET project_id = merge_requests.target_project_id
FROM merge_requests
WHERE merge_requests.id = merge_request_diffs.merge_request_id
AND merge_request_diffs.id IN (
  SELECT "merge_request_diffs"."id"
  FROM "merge_request_diffs"
  WHERE "merge_request_diffs"."id" BETWEEN 1 AND 844732995
  AND "merge_request_diffs"."project_id" IS NULL
  AND "merge_request_diffs"."id" >= 400000004 AND "merge_request_diffs"."id" < 400000172
)

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/23870/commands/76396

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #429174 (closed)

Edited by Tiger Watson

Merge request reports