User mapping - Create relation contributors in Direct Transfer

What does this MR do and why?

This MR creates a new relation export file for user_contributions. To export user contributions as a single relation, we cache every referenced user_id as the other relations are exported. A new UserContributionsExportWorker then waits until all other exports finish before querying for Users with the cached user_ids. That query is assigned to the exportable as user_contributions and passed to the existing RelationExportService.
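
Roughly, the flow looks like this (a sketch only: cache_key is a placeholder, and the exact caching calls and service signature are illustrative rather than the literal implementation):

# While other relations are exported, referenced user IDs get cached,
# e.g. in a Redis set keyed per bulk export:
Gitlab::Cache::Import::Caching.set_add(cache_key, merge_request.author_id)

# Once every other export has finished, UserContributionsExportWorker reads
# the cached IDs and builds the user_contributions relation:
user_ids = Gitlab::Cache::Import::Caching.values_from_set(cache_key)
portable.user_contributions = User.where(id: user_ids)

# The relation is then handed off to the existing RelationExportService:
BulkImports::RelationExportService.new(user, portable, 'user_contributions', jid).execute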

Because we can't accurately export user contributions until all other exports finish, we also can't create placeholder users on the import side with the right names and usernames up front. However, when it's time to import the user_contributions, we can simply update any already-created Import::SourceUsers and their placeholder users with the right name, email, and username. The import side will be handled in User mapping - Integration with Direct Transfer (#443557).
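
For illustration, the import-side update could look something like this (attribute and association names are assumptions; the real implementation belongs to #443557):

# When a user contribution is imported, backfill the matching source user
# and its placeholder user with the real details:
source_user = Import::SourceUser.find_by(namespace: portable.root_ancestor, source_user_identifier: contribution['id'])

if source_user&.placeholder_user
  source_user.placeholder_user.update!(
    name: contribution['name'],
    username: contribution['username'],
    email: contribution['public_email']
  )
end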

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Database changes

New indexes

I originally added new indexes on bulk_import_exports covering project_id/group_id and status, but since there can only be up to 32 exports per portable, they're no longer needed. The query plans below are kept for review to show the difference, and a sketch of the migration that would have added the indexes follows them.

partial_index_bulk_import_exports_on_project_id_and_status: Added index on project_id and status
Query:
SELECT bulk_import_exports.*
FROM bulk_import_exports
WHERE bulk_import_exports.project_id = (SELECT exports.project_id FROM bulk_import_exports AS exports WHERE exports.project_id IS NOT NULL ORDER BY exports.id DESC LIMIT 1)
AND bulk_import_exports.status = 1;
Execution plans:

Before: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28094/commands/87512

Index Scan using partial_index_bulk_import_exports_on_project_id_and_relation on public.bulk_import_exports  (cost=0.94..48.15 rows=31 width=116) (actual time=3.974..16.239 rows=31 loops=1)
   Index Cond: (bulk_import_exports.project_id = $0)
   Filter: (bulk_import_exports.status = 1)
   Rows Removed by Filter: 0
   Buffers: shared hit=43 read=15 dirtied=14
   I/O Timings: read=15.616 write=0.000
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..0.51 rows=1 width=16) (actual time=0.027..0.029 rows=1 loops=1)
           Buffers: shared hit=4
           I/O Timings: read=0.000 write=0.000
           ->  Index Scan Backward using bulk_import_exports_pkey on public.bulk_import_exports exports  (cost=0.43..291927.18 rows=3446586 width=16) (actual time=0.026..0.027 rows=1 loops=1)
                 Filter: (exports.project_id IS NOT NULL)
                 Rows Removed by Filter: 0
                 Buffers: shared hit=4
                 I/O Timings: read=0.000 write=0.000

After: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28094/commands/87514

Index Scan using partial_index_bulk_import_exports_on_project_id_and_status on public.bulk_import_exports  (cost=0.94..48.15 rows=31 width=116) (actual time=0.140..0.191 rows=31 loops=1)
   Index Cond: ((bulk_import_exports.project_id = $0) AND (bulk_import_exports.status = 1))
   Buffers: shared hit=32 read=3
   I/O Timings: read=0.061 write=0.000
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..0.51 rows=1 width=16) (actual time=0.027..0.028 rows=1 loops=1)
           Buffers: shared hit=4
           I/O Timings: read=0.000 write=0.000
           ->  Index Scan Backward using bulk_import_exports_pkey on public.bulk_import_exports exports  (cost=0.43..291927.18 rows=3446586 width=16) (actual time=0.026..0.026 rows=1 loops=1)
                 Filter: (exports.project_id IS NOT NULL)
                 Rows Removed by Filter: 0
                 Buffers: shared hit=4
                 I/O Timings: read=0.000 write=0.000

partial_index_bulk_import_exports_on_group_id_and_status: Added index on group_id and status
Query:
SELECT bulk_import_exports.*
FROM bulk_import_exports
WHERE bulk_import_exports.group_id = (SELECT exports.group_id FROM bulk_import_exports AS exports WHERE exports.group_id IS NOT NULL ORDER BY exports.id DESC LIMIT 1)
AND bulk_import_exports.status = 1;
Execution plans:

Before: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28094/commands/87515

Index Scan using partial_index_bulk_import_exports_on_group_id_and_relation on public.bulk_import_exports  (cost=1.55..15.69 rows=9 width=116) (actual time=0.177..0.177 rows=0 loops=1)
   Index Cond: (bulk_import_exports.group_id = $0)
   Filter: (bulk_import_exports.status = 1)
   Rows Removed by Filter: 0
   Buffers: shared hit=54
   I/O Timings: read=0.000 write=0.000
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..1.13 rows=1 width=16) (actual time=0.163..0.164 rows=1 loops=1)
           Buffers: shared hit=54
           I/O Timings: read=0.000 write=0.000
           ->  Index Scan Backward using bulk_import_exports_pkey on public.bulk_import_exports exports  (cost=0.43..291927.18 rows=417444 width=16) (actual time=0.162..0.162 rows=1 loops=1)
                 Filter: (exports.group_id IS NOT NULL)
                 Rows Removed by Filter: 31
                 Buffers: shared hit=54
                 I/O Timings: read=0.000 write=0.000

After: https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28094/commands/87517

Index Scan using partial_index_bulk_import_exports_on_group_id_and_status on public.bulk_import_exports  (cost=1.55..15.69 rows=9 width=116) (actual time=0.079..0.079 rows=0 loops=1)
   Index Cond: ((bulk_import_exports.group_id = $0) AND (bulk_import_exports.status = 1))
   Buffers: shared hit=34
   I/O Timings: read=0.000 write=0.000
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.43..1.13 rows=1 width=16) (actual time=0.071..0.071 rows=1 loops=1)
           Buffers: shared hit=34
           I/O Timings: read=0.000 write=0.000
           ->  Index Scan Backward using bulk_import_exports_pkey on public.bulk_import_exports exports  (cost=0.43..291927.18 rows=417444 width=16) (actual time=0.070..0.070 rows=1 loops=1)
                 Filter: (exports.group_id IS NOT NULL)
                 Rows Removed by Filter: 31
                 Buffers: shared hit=34
                 I/O Timings: read=0.000 write=0.000
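
For reference, the dropped indexes would have been added with a migration along these lines (a sketch only, not part of this MR):

class AddStatusIndexesToBulkImportExports < Gitlab::Database::Migration[2.2]
  disable_ddl_transaction!

  PROJECT_INDEX = 'partial_index_bulk_import_exports_on_project_id_and_status'
  GROUP_INDEX = 'partial_index_bulk_import_exports_on_group_id_and_status'

  def up
    add_concurrent_index :bulk_import_exports, [:project_id, :status],
      where: 'project_id IS NOT NULL', name: PROJECT_INDEX
    add_concurrent_index :bulk_import_exports, [:group_id, :status],
      where: 'group_id IS NOT NULL', name: GROUP_INDEX
  end

  def down
    remove_concurrent_index_by_name :bulk_import_exports, PROJECT_INDEX
    remove_concurrent_index_by_name :bulk_import_exports, GROUP_INDEX
  end
end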

How to set up and validate locally

  1. Start your local environment and ensure Sidekiq is running.
  2. Create a project with any relation listed in lib/gitlab/import_export/project/import_export.yml that has a column referencing a User. A merge request with an author is an easy example.
  3. Begin the export side of Direct Transfer by initializing and executing BulkImports::ExportService.
  4. Once the relations finish, find the export file for user_contributions. Exports sometimes take a little while; keep an eye on export.log to see when "Exporting user_contributions relation" appears in a message.
  5. Verify the export file contains all of the users referenced in your project (the snippet after the console commands below shows one way to check).

Console commands:

# Run the ExportService:
export_service = BulkImports::ExportService.new(portable: your_project, user: current_user, batched: true) # or batched: false, feel free to test both
export_service.execute

# Get the export file location:
export = your_project.bulk_import_exports.find_by(relation: 'user_contributions')
export.upload.export_file
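
To verify step 5, something like this lists the users in the export (the file format and storage details are assumptions, noted in the comments):

# Inspect the exported users (assumes batched: false, local file storage, and a
# gzipped NDJSON export file; with batched: true, look at export.batches instead):
require 'zlib'
require 'json'

path = export.upload.export_file.path
Zlib::GzipReader.open(path) do |gz|
  gz.each_line { |line| puts JSON.parse(line).slice('id', 'username', 'name') }
end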

Related to #454522 (closed)
