WIP: Migrate Namespace#root_id column
What does this MR do?
Prework of https://gitlab.com/gitlab-org/gitlab-ce/issues/62214
Does this MR meet the acceptance criteria?
Conformity
-
Changelog entry -
Documentation created/updated or follow-up review issue created -
Code review guidelines -
Merge request performance guidelines -
Style guides -
Database guides -
Separation of EE specific content
Performance and testing
-
Review and add/update tests for this feature/bug. Consider all test levels. See the Test Planning Process. - [-] Tested in all supported browsers
Backend to do
-
1. Add a new column on Namespaces
:root_id
- This new column should be
NOT NULL
, but we can't enforce that limit until we have migrated the data. - This new column will store the root namespace equal to the namespace ID, so the child namespaces will use the ID of the root namespace.
- This new column should be
-
2. Create a background migration to fill in Namespaces root_id
- This migration would iterate over root namespaces only (
parent_id = NULL
), then run the usual CTE code to find all child nodes, then update all those.
- This migration would iterate over root namespaces only (
-
3. Include logic to fill root_id
when a namespace is created/updated
Next iteration
-
Once the background migration has been completed, add another migration to include an index on root_id
and to make it not null
Merge request reports
Activity
changed milestone to %12.1
added bugperformance typefeature + 1 deleted label
assigned to @mayra-cabrera
9 DELAY_INTERVAL = 10.minutes.to_i 10 11 disable_ddl_transaction! 12 13 class Namespace < ActiveRecord::Base 14 self.table_name = 'namespaces' 15 16 include EachBatch 17 end 18 19 def up 20 say 'Scheduling `PopulateNamespaceRootIdColumn` jobs' 21 22 # We currently have ~4_600_000 namespace records on GitLab.com 23 # This means it'll schedule ~460 jobs (10k each) within a 10 minutes gap. 24 # so this should take ~153 hours to complete (assuming 30k namespaces per hour) Not sure if my calculations are correct: I'm assuming 30k namespaces per hour as each job is scheduled with a 10 minutes gap, so in an hour we'll only have space for 3 jobs
So
4_600_000
/30_000
=153h
Edited by Mayra Cabrera
@nolith could you please take a look at the backend to do of this MR to see if it makes sense?
2 Warnings CHANGELOG missing. You can create one with:
bin/changelog -m 29772 "Migrate Namespace#root_id column"
If your merge request doesn’t warrant a CHANGELOG entry,
consider adding any of the ~backstage, ci-build, ~Documentation, meta, QA, test labels.
See the documentation.This merge request is missing the database label. 1 Message This merge request adds or changes files that require a review from the Database team. Database Review
The following files require a review from the Database team:
db/migrate/20190617154954_add_root_id_to_namespace.rb
db/migrate/20190617181054_schedule_populate_namespace_root_id.rb
db/schema.rb
lib/gitlab/background_migration/populate_namespace_root_id_column.rb
To make sure these changes are reviewed, take the following steps:
- Edit your merge request, and add
gl-database
to the list of Group approvers. - Mention
@gl-database
in a separate comment, and explain what needs to be reviewed by the team. Please don't mention the team until your changes are ready for review.
Reviewer roulette
Changes that require review have been detected! A merge request is normally reviewed by both a reviewer and a maintainer in its primary category (e.g. frontend or backend), and by a maintainer in all other categories.
To spread load more evenly across eligible reviewers, Danger has randomly picked a candidate for each review slot. Feel free to override this selection if you think someone else would be better-suited, or the chosen person is unavailable.
Once you've decided who will review this merge request, mention them as you normally would! Danger does not (yet?) automatically notify them for you.
Category Reviewer Maintainer database Toon Claes ( @toon
)Andreas Brandl ( @abrandl
)backend Reuben Pereira ( @rpereira2
)Jen-Shin Lin ( @godfat
)Generated by
DangerEdited by 🤖 GitLab Bot 🤖added 272 commits
-
2c36802e...6d4f33ce - 271 commits from branch
master
- 730c0e4f - Migrate root_id column on Namespaces table
-
2c36802e...6d4f33ce - 271 commits from branch
46 46 ldap_group_links: %w[group_id], 47 47 members: %w[source_id created_by_id], 48 48 merge_requests: %w[last_edited_by_id state_id], 49 namespaces: %w[owner_id parent_id], 49 namespaces: %w[owner_id parent_id root_id], This is needed otherwise we'll encounter a failure on https://gitlab.com/gitlab-org/gitlab-ce/-/jobs/232957339
1 # frozen_string_literal: true 2 3 class AddRootIdToNamespace < ActiveRecord::Migration[5.1] 4 include Gitlab::Database::MigrationHelpers 5 6 DOWNTIME = false 7 8 def change 9 add_column :namespaces, :root_id, :integer From https://gitlab.com/gitlab-org/gitlab-ce/-/jobs/232957339, it seems every column with
_id
needs to have a foreign key associated. I wonder if adding a FK would increase/decrease the background migration performance.We could add that foreign key in the next iteration (at the same time we add
NOT NULL
and the index forroot_id
)In the meantime, I've whitelisted that column in the meantime on
spec/db/schema_spec.rb
47 namespaces.each do |child_namespace_id, root_namespace_id| 48 statement = "WHEN #{child_namespace_id} THEN #{root_namespace_id}" 49 namespaces_information << statement 50 end 51 end 52 end 53 54 def update_sql_query(case_statements:, namespace_ids:) 55 <<~SQL 56 UPDATE namespaces 57 SET root_id = CASE id 58 #{case_statements.join("\n")} 59 END 60 WHERE id IN (#{namespace_ids.join(",")}) 61 SQL 62 end This would basically generate something like
UPDATE namespaces SET root_id = CASE id WHEN 1 THEN 1 WHEN 2 THEN 1 WHEN 5 THEN 1 WHEN 7 THEN 1 WHEN 9 THEN 1 WHEN 11 THEN 1 WHEN 13 THEN 1 WHEN 15 THEN 1 WHEN 17 THEN 1 WHEN 19 THEN 1 WHEN 21 THEN 1 WHEN 3 THEN 3 WHEN 4 THEN 3 WHEN 6 THEN 3 WHEN 8 THEN 3 WHEN 10 THEN 3 WHEN 12 THEN 3 WHEN 14 THEN 3 WHEN 16 THEN 3 WHEN 18 THEN 3 WHEN 20 THEN 3 WHEN 22 THEN 3 END WHERE id IN (1,2,5,7,9,11,13,15,17,19,21,3,4,6,8,10,12,14,16,18,20,22)
But with 10k statements. Unsure if this is the best update strategy
Edited by Mayra Cabrera
11 root_namespaces = root_namespaces_between(from_id: from_id, to_id: to_id) 12 return if root_namespaces.empty? 13 14 namespaces_information = associate_children_with_root_namespaces(root_namespaces) 15 sql_query = build_update_namespaces_sql(namespaces_information) 16 17 execute(sql_query) 18 end 19 20 private 21 22 def root_namespaces_between(from_id:, to_id:) 23 Namespace 24 .where('parent_id IS NULL') 25 .where(id: from_id..to_id) 26 end From Chatops:
> SELECT namespaces.* FROM namespaces WHERE (namespaces.parent_id IS NOT NULL) AND (namespaces.id BETWEEN 1 AND 10000) Index Scan using index_namespaces_on_parent_id_and_id on namespaces (cost=0.43..4758.31 rows=447 width=303) (actual time=26.683..26.683 rows=0 loops=1) Index Cond: ((parent_id IS NOT NULL) AND (id >= 1) AND (id <= 10000)) Buffers: shared hit=792 read=15 I/O Timings: read=3.449 Planning time: 3.271 ms Execution time: 26.726 ms
@nolith could you please take a look at the following discussions?
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/29772#note_182203139
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/29772#note_182201607
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/29772#note_183193168
- https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/29772#note_183193169
I could use your opinion on those, thanks
assigned to @nolith
Closing as we decided to use https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/29837
mentioned in merge request gitlab-com/www-gitlab-com!28830 (merged)
added teamDelivery label