Skip to content

Add index to package files on file_name and file_sha256

Context

There're two PyPI endpoints that search for a package using package file's name and SHA256.

Currently those endpoints perform not great when there's a lot of data to process, because of the slow database queries: group level endpoint project level endpoint.

What does this MR do and why?

Add a new index to the packages_package_files table on file_name and file_sha256 columns.

Adding an index to package files on file_name and file_sha256 will allow to use Index Scan when searching for a package using package file name and file sha256.

MR acceptance checklist

Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Screenshots or screen recordings

No.

How to set up and validate locally

Migration up and down:

$ rails db:migrate:main
main: == [advisory_lock_connection] object_id: 125060, pg_backend_pid: 84786
main: == 20240514102603 AddFileNameFileSha256IndexToPackageFiles: migrating =========
main: -- transaction_open?(nil)
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0144s
main: -- index_exists?(:packages_package_files, "file_name, file_sha256", {:name=>"index_packages_package_files_on_file_name_file_sha256", :algorithm=>:concurrently})
main:    -> 0.0079s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0002s
main: -- add_index(:packages_package_files, "file_name, file_sha256", {:name=>"index_packages_package_files_on_file_name_file_sha256", :algorithm=>:concurrently})
main:    -> 0.0103s
main: -- execute("RESET statement_timeout")
main:    -> 0.0003s
main: == 20240514102603 AddFileNameFileSha256IndexToPackageFiles: migrated (0.0473s) 

main: == [advisory_lock_connection] object_id: 125060, pg_backend_pid: 84786
$ rails db:rollback:main
main: == [advisory_lock_connection] object_id: 124800, pg_backend_pid: 85099
main: == 20240514102603 AddFileNameFileSha256IndexToPackageFiles: reverting =========
main: -- transaction_open?(nil)
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0133s
main: -- indexes(:packages_package_files)
main:    -> 0.0043s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0002s
main: -- remove_index(:packages_package_files, {:algorithm=>:concurrently, :name=>"index_packages_package_files_on_file_name_file_sha256"})
main:    -> 0.0024s
main: -- execute("RESET statement_timeout")
main:    -> 0.0003s
main: == 20240514102603 AddFileNameFileSha256IndexToPackageFiles: reverted (0.0320s) 

main: == [advisory_lock_connection] object_id: 124800, pg_backend_pid: 85099

Database analysis

Query
SELECT "packages_packages".*
FROM "packages_packages"
INNER JOIN "packages_package_files" ON "packages_package_files"."package_id" = "packages_packages"."id"
WHERE "packages_packages"."project_id" = XXX
  AND "packages_packages"."status" IN (0, 1)
  AND "packages_packages"."package_type" = 5
  AND "packages_packages"."version" IS NOT NULL
  AND "packages_package_files"."file_name" = 'XXX'
  AND "packages_package_files"."file_sha256" = 'XXX'
ORDER BY "packages_packages"."id" DESC
LIMIT 1

Now

Explain plan
 Limit  (cost=3151.05..3151.05 rows=1 width=146) (actual time=42361.464..42361.472 rows=1 loops=1)
   Buffers: shared hit=109273 read=60300 dirtied=1
   I/O Timings: read=39719.486 write=0.000
   ->  Sort  (cost=3151.05..3151.05 rows=1 width=146) (actual time=42361.462..42361.468 rows=1 loops=1)
         Sort Key: packages_packages.id DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=109273 read=60300 dirtied=1
         I/O Timings: read=39719.486 write=0.000
         ->  Nested Loop  (cost=2950.40..3151.04 rows=1 width=146) (actual time=42361.410..42361.424 rows=1 loops=1)
               Buffers: shared hit=109270 read=60300 dirtied=1
               I/O Timings: read=39719.486 write=0.000
               ->  Bitmap Heap Scan on public.packages_package_files  (cost=2949.97..3147.58 rows=1 width=8) (actual time=42356.604..42356.613 rows=1 loops=1)
                     Filter: (packages_package_files.file_sha256 = '\x38613132616336323432633236316363663163346134323736323365333530353964383730653435373939343163613033623730373037646431643535623563'::bytea)
                     Rows Removed by Filter: 0
                     Buffers: shared hit=109270 read=60296
                     I/O Timings: read=39715.054 write=0.000
                     ->  Bitmap Index Scan using index_packages_package_files_on_file_name  (cost=0.00..2949.97 rows=129 width=0) (actual time=42354.086..42354.091 rows=1 loops=1)
                           Index Cond: ((packages_package_files.file_name)::text = 'kurant_filtering-1.0.1210442163-py3-none-any.whl'::text)
                           Buffers: shared hit=109270 read=60295
                           I/O Timings: read=39712.627 write=0.000
               ->  Index Scan using index_packages_on_available_pypi_packages on public.packages_packages  (cost=0.43..3.45 rows=1 width=146) (actual time=4.746..4.746 rows=1 loops=1)
                     Index Cond: ((packages_packages.project_id = 21145795) AND (packages_packages.id = packages_package_files.package_id))
                     Buffers: shared read=4 dirtied=1
                     I/O Timings: read=4.431 write=0.000

https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28234/commands/88109

Bitmap Heap Scan on public.packages_package_files with Filter + Bitmap Index Scan using index_packages_package_files_on_file_name 🐌

After adding index

pg.ai setup
exec CREATE INDEX index_packages_package_files_on_file_name_file_sha256 ON packages_package_files USING btree (file_name, file_sha256)
exec VACUUM ANALYZE packages_package_files
Explain plan
 Limit  (cost=7.18..7.18 rows=1 width=146) (actual time=13.626..13.629 rows=1 loops=1)
   Buffers: shared hit=3 read=10 dirtied=1
   I/O Timings: read=12.713 write=0.000
   ->  Sort  (cost=7.18..7.18 rows=1 width=146) (actual time=13.625..13.627 rows=1 loops=1)
         Sort Key: packages_packages.id DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=3 read=10 dirtied=1
         I/O Timings: read=12.713 write=0.000
         ->  Nested Loop  (cost=1.12..7.17 rows=1 width=146) (actual time=13.574..13.581 rows=1 loops=1)
               Buffers: shared read=10 dirtied=1
               I/O Timings: read=12.713 write=0.000
               ->  Index Scan using index_packages_package_files_on_file_name_file_sha256 on public.packages_package_files  (cost=0.69..3.71 rows=1 width=8) (actual time=7.974..7.978 rows=1 loops=1)
                     Index Cond: (((packages_package_files.file_name)::text = 'kurant_filtering-1.0.1210442163-py3-none-any.whl'::text) AND (packages_package_files.file_sha256 = '\x38613132616336323432633236316363663163346134323736323365333530353964383730653435373939343163613033623730373037646431643535623563'::bytea))
                     Buffers: shared read=6
                     I/O Timings: read=7.861 write=0.000
               ->  Index Scan using index_packages_on_available_pypi_packages on public.packages_packages  (cost=0.43..3.45 rows=1 width=146) (actual time=5.588..5.588 rows=1 loops=1)
                     Index Cond: ((packages_packages.project_id = 21145795) AND (packages_packages.id = packages_package_files.package_id))
                     Buffers: shared read=4 dirtied=1
                     I/O Timings: read=4.852 write=0.000

https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28234/commands/88115

The packages_package_files table has 96M rows https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28234/commands/88176.
The index size is 5048 MB.

Index Scan using index_packages_package_files_on_file_name_file_sha256 on public.packages_package_files

Edited by Dzmitry Meshcharakou

Merge request reports