Add index to package files on file_name and file_sha256
Context
There're two PyPI endpoints that search for a package using package file's name and SHA256.
Currently those endpoints perform not great when there's a lot of data to process, because of the slow database queries: group level endpoint project level endpoint.
What does this MR do and why?
Add a new index to the packages_package_files
table on file_name
and file_sha256
columns.
Adding an index to package files on file_name and file_sha256 will allow to use Index Scan when searching for a package using package file name and file sha256.
MR acceptance checklist
Please evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Screenshots or screen recordings
No.
How to set up and validate locally
Migration up and down:
$ rails db:migrate:main
main: == [advisory_lock_connection] object_id: 125060, pg_backend_pid: 84786
main: == 20240514102603 AddFileNameFileSha256IndexToPackageFiles: migrating =========
main: -- transaction_open?(nil)
main: -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main: -> 0.0144s
main: -- index_exists?(:packages_package_files, "file_name, file_sha256", {:name=>"index_packages_package_files_on_file_name_file_sha256", :algorithm=>:concurrently})
main: -> 0.0079s
main: -- execute("SET statement_timeout TO 0")
main: -> 0.0002s
main: -- add_index(:packages_package_files, "file_name, file_sha256", {:name=>"index_packages_package_files_on_file_name_file_sha256", :algorithm=>:concurrently})
main: -> 0.0103s
main: -- execute("RESET statement_timeout")
main: -> 0.0003s
main: == 20240514102603 AddFileNameFileSha256IndexToPackageFiles: migrated (0.0473s)
main: == [advisory_lock_connection] object_id: 125060, pg_backend_pid: 84786
$ rails db:rollback:main
main: == [advisory_lock_connection] object_id: 124800, pg_backend_pid: 85099
main: == 20240514102603 AddFileNameFileSha256IndexToPackageFiles: reverting =========
main: -- transaction_open?(nil)
main: -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main: -> 0.0133s
main: -- indexes(:packages_package_files)
main: -> 0.0043s
main: -- execute("SET statement_timeout TO 0")
main: -> 0.0002s
main: -- remove_index(:packages_package_files, {:algorithm=>:concurrently, :name=>"index_packages_package_files_on_file_name_file_sha256"})
main: -> 0.0024s
main: -- execute("RESET statement_timeout")
main: -> 0.0003s
main: == 20240514102603 AddFileNameFileSha256IndexToPackageFiles: reverted (0.0320s)
main: == [advisory_lock_connection] object_id: 124800, pg_backend_pid: 85099
Database analysis
Query
SELECT "packages_packages".*
FROM "packages_packages"
INNER JOIN "packages_package_files" ON "packages_package_files"."package_id" = "packages_packages"."id"
WHERE "packages_packages"."project_id" = XXX
AND "packages_packages"."status" IN (0, 1)
AND "packages_packages"."package_type" = 5
AND "packages_packages"."version" IS NOT NULL
AND "packages_package_files"."file_name" = 'XXX'
AND "packages_package_files"."file_sha256" = 'XXX'
ORDER BY "packages_packages"."id" DESC
LIMIT 1
Now
Explain plan
Limit (cost=3151.05..3151.05 rows=1 width=146) (actual time=42361.464..42361.472 rows=1 loops=1)
Buffers: shared hit=109273 read=60300 dirtied=1
I/O Timings: read=39719.486 write=0.000
-> Sort (cost=3151.05..3151.05 rows=1 width=146) (actual time=42361.462..42361.468 rows=1 loops=1)
Sort Key: packages_packages.id DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=109273 read=60300 dirtied=1
I/O Timings: read=39719.486 write=0.000
-> Nested Loop (cost=2950.40..3151.04 rows=1 width=146) (actual time=42361.410..42361.424 rows=1 loops=1)
Buffers: shared hit=109270 read=60300 dirtied=1
I/O Timings: read=39719.486 write=0.000
-> Bitmap Heap Scan on public.packages_package_files (cost=2949.97..3147.58 rows=1 width=8) (actual time=42356.604..42356.613 rows=1 loops=1)
Filter: (packages_package_files.file_sha256 = '\x38613132616336323432633236316363663163346134323736323365333530353964383730653435373939343163613033623730373037646431643535623563'::bytea)
Rows Removed by Filter: 0
Buffers: shared hit=109270 read=60296
I/O Timings: read=39715.054 write=0.000
-> Bitmap Index Scan using index_packages_package_files_on_file_name (cost=0.00..2949.97 rows=129 width=0) (actual time=42354.086..42354.091 rows=1 loops=1)
Index Cond: ((packages_package_files.file_name)::text = 'kurant_filtering-1.0.1210442163-py3-none-any.whl'::text)
Buffers: shared hit=109270 read=60295
I/O Timings: read=39712.627 write=0.000
-> Index Scan using index_packages_on_available_pypi_packages on public.packages_packages (cost=0.43..3.45 rows=1 width=146) (actual time=4.746..4.746 rows=1 loops=1)
Index Cond: ((packages_packages.project_id = 21145795) AND (packages_packages.id = packages_package_files.package_id))
Buffers: shared read=4 dirtied=1
I/O Timings: read=4.431 write=0.000
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28234/commands/88109
Bitmap Heap Scan on public.packages_package_files
with Filter
+ Bitmap Index Scan using index_packages_package_files_on_file_name
After adding index
pg.ai setup
exec CREATE INDEX index_packages_package_files_on_file_name_file_sha256 ON packages_package_files USING btree (file_name, file_sha256)
exec VACUUM ANALYZE packages_package_files
Explain plan
Limit (cost=7.18..7.18 rows=1 width=146) (actual time=13.626..13.629 rows=1 loops=1)
Buffers: shared hit=3 read=10 dirtied=1
I/O Timings: read=12.713 write=0.000
-> Sort (cost=7.18..7.18 rows=1 width=146) (actual time=13.625..13.627 rows=1 loops=1)
Sort Key: packages_packages.id DESC
Sort Method: quicksort Memory: 25kB
Buffers: shared hit=3 read=10 dirtied=1
I/O Timings: read=12.713 write=0.000
-> Nested Loop (cost=1.12..7.17 rows=1 width=146) (actual time=13.574..13.581 rows=1 loops=1)
Buffers: shared read=10 dirtied=1
I/O Timings: read=12.713 write=0.000
-> Index Scan using index_packages_package_files_on_file_name_file_sha256 on public.packages_package_files (cost=0.69..3.71 rows=1 width=8) (actual time=7.974..7.978 rows=1 loops=1)
Index Cond: (((packages_package_files.file_name)::text = 'kurant_filtering-1.0.1210442163-py3-none-any.whl'::text) AND (packages_package_files.file_sha256 = '\x38613132616336323432633236316363663163346134323736323365333530353964383730653435373939343163613033623730373037646431643535623563'::bytea))
Buffers: shared read=6
I/O Timings: read=7.861 write=0.000
-> Index Scan using index_packages_on_available_pypi_packages on public.packages_packages (cost=0.43..3.45 rows=1 width=146) (actual time=5.588..5.588 rows=1 loops=1)
Index Cond: ((packages_packages.project_id = 21145795) AND (packages_packages.id = packages_package_files.package_id))
Buffers: shared read=4 dirtied=1
I/O Timings: read=4.852 write=0.000
https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28234/commands/88115
The packages_package_files
table has 96M
rows https://console.postgres.ai/gitlab/gitlab-production-main/sessions/28234/commands/88176.
The index size is 5048
MB.
Index Scan using index_packages_package_files_on_file_name_file_sha256
on public.packages_package_files