Save repository_type to LfsObjectsProject (!13894) · Merge requests · GitLab.org / GitLab

What does this MR do?

Overview

This MR adds a new property repository_type to the table lfs_objects_projects to record the repository type (one of 'project', 'wiki' or 'design') that the corresponding blob of the LFS object lives in.

As we are enabling the ability for design and wiki repositories to store files in LFS (https://gitlab.com/gitlab-org/gitlab-ee/issues/9490 and https://gitlab.com/gitlab-org/gitlab-ce/issues/43721), recording the repository_type will make it easier to prune unreferenced LFS objects in the future.

Note that this property is currently allows null values. This discussion thread decided to introduce this new column, ensure that all LFS files for the new 'design' repository are correctly saving with the right repository_type, and allow there to be null values (past and future), knowing that these records will all be for the regular 'project' repository. This technical debt will be addressed starting in https://gitlab.com/gitlab-org/gitlab-ce/issues/43721 where all parts of the app that create LfsObjects will be consolidated to save a repository_type value. There will also need to be an additional data migration and not null constraint added to the column after that piece of work.

More detail

https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/13389 adds the ability for images in a project's design repository (project.design_repository) to be stored in LFS.

This discussion thread raised the issue of how in the future we want to prune unreferenced LFS objects, and as LFS pointer files can now be in multiple repositories, to do so efficiently we should record which repository the blob lives in.

As an LFS object's oid is determined by the data in the blob file, it's possible now for a project to have the same LfsObject for blobs in three separate repositories ('project', 'wiki' or 'design'). This means up to three LfsObjectsProject records with the same project and lfs_object but different repository_types. As this model is the many-to-many join table the has_many relationships between lfs_objects and projects now need a distinct scope to limit the records returned to non-duplicates.

Lfs::FileTransformer can now create LfsObjectsProject with repository_type saved. This transformer is used by DesignManagement::SaveDesignsService, so LFS for blobs in the design repo will always have a repository_type saved.

Migration output

== 20190602014139 AddRepositoryTypeToLfsObjectsProject: migrating =============
-- add_column(:lfs_objects_projects, :repository_type, :integer, {:limit=>1, :null=>true})
   -> 0.0026s
== 20190602014139 AddRepositoryTypeToLfsObjectsProject: migrated (0.0027s) ====

EXPLAIN output

See this Slack thread where I've asked for input into the performance of distinct in these queries.

These two EXPLAIN queries compare master branch with 9490-record-repository_type-on-lfs_objects_projects branch.

EXPLAIN for project.lfs_objects query in master:

=> EXPLAIN for: SELECT "lfs_objects".* FROM "lfs_objects" INNER JOIN "lfs_objects_projects" ON "lfs_objects"."id" = "lfs_objects_projects"."lfs_object_id" WHERE "lfs_objects_projects"."project_id" = $1 [["project_id", 33]]
                                                        QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.30..16.52 rows=7 width=96)
   ->  Index Scan using index_lfs_objects_projects_on_project_id on lfs_objects_projects  (cost=0.15..7.28 rows=7 width=4)
         Index Cond: (project_id = 33)
   ->  Index Scan using lfs_objects_pkey on lfs_objects  (cost=0.15..1.31 rows=1 width=96)
         Index Cond: (id = lfs_objects_projects.lfs_object_id)
(5 rows)

EXPLAIN for project.lfs_objects query in this MR source branch 9490-record-repository_type-on-lfs_objects_projects:

=> EXPLAIN for: SELECT DISTINCT "lfs_objects".* FROM "lfs_objects" INNER JOIN "lfs_objects_projects" ON "lfs_objects"."id" = "lfs_objects_projects"."lfs_object_id" WHERE "lfs_objects_projects"."project_id" = $1 [["project_id", 28]]
                                                                          QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2.24..2.28 rows=2 width=158)
   ->  Sort  (cost=2.24..2.24 rows=2 width=158)
         Sort Key: lfs_objects.id, lfs_objects.oid, lfs_objects.size, lfs_objects.created_at, lfs_objects.updated_at, lfs_objects.file, lfs_objects.file_store
         ->  Hash Join  (cost=1.11..2.23 rows=2 width=158)
               Hash Cond: (lfs_objects.id = lfs_objects_projects.lfs_object_id)
               ->  Seq Scan on lfs_objects  (cost=0.00..1.07 rows=7 width=158)
               ->  Hash  (cost=1.09..1.09 rows=2 width=4)
                     ->  Seq Scan on lfs_objects_projects  (cost=0.00..1.09 rows=2 width=4)
                           Filter: (project_id = 28)
(9 rows)

Database checklist

Conforms to the database guides

When adding migrations:

Updated db/schema.rb
Added a down method so the migration can be reverted (uses a change method that is rollback-able)
Added the output of the migration(s) to the MR body
[-] Added tests for the migration in spec/migrations if necessary (e.g. when migrating data)

When adding or modifying queries to improve performance:

[-] Included data that shows the performance improvement, preferably in the form of a benchmark
[-] Included the output of EXPLAIN (ANALYZE, BUFFERS) of the relevant queries

When adding foreign keys to existing tables:

[-] Included a migration to remove orphaned rows in the source table before adding the foreign key
[-] Removed any instances of dependent: ... that may no longer be necessary

When adding tables:

[-] Ordered columns based on the Ordering Table Columns guidelines
[-] Added foreign keys to any columns pointing to data in other tables
[-] Added indexes for fields that are used in statements such as WHERE, ORDER BY, GROUP BY, and JOINs

When removing columns, tables, indexes or other structures:

[-] Removed these in a post-deployment migration
[-] Made sure the application no longer uses (or ignores) these structures

General checklist

[-] Changelog entry added, if necessary
[-] Documentation created/updated
Tests added for this feature/bug
Conforms to the code review guidelines
Conforms to the merge request performance guidelines
Conforms to the style guides

Edited Dec 10, 2020 by 🤖 GitLab Bot 🤖

Save repository_type to LfsObjectsProject

What does this MR do?

Database checklist

General checklist

Merge request reports