Republishing a Maven package duplicates uploads
Summary
Running mvn deploy twice or more will upload and store multiple copies of the same file.
Steps to reproduce
- configure a project with Maven
- set a non-SNAPSHOT version
- run mvn deploy
- run mvn deploy again
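For context, deploying to the GitLab Maven registry usually goes through a distributionManagement block like the sketch below; the instance host and PROJECT_ID here are placeholders, not the values from the example project:

```xml
<!-- Sketch of a GitLab Maven registry target; host and PROJECT_ID are placeholders. -->
<distributionManagement>
  <repository>
    <id>gitlab-maven</id>
    <url>https://gitlab.example.com/api/v4/projects/PROJECT_ID/packages/maven</url>
  </repository>
</distributionManagement>
```

Running mvn deploy a second time against such a repository re-uploads every artifact, which is what triggers the duplication described above.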
Example Project
https://gitlab.com/nolith/test-pkgs/-/packages/19615
What is the current bug behavior?
It stores multiple copies of the exact same file.
What is the expected correct behavior?
Do not upload again (or at least overwrite) when file_md5, file_sha1, and file_name are already in the DB.
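The dedup check suggested above can be sketched in plain Ruby. This is not the actual GitLab code path; PackageFileStore and its store method are hypothetical names, used only to show the identity key (package_id, file_name, file_md5, file_sha1) rejecting a re-upload:

```ruby
require 'set'

# Hypothetical identity key for a package file, per the expected behavior above.
PackageFileKey = Struct.new(:package_id, :file_name, :file_md5, :file_sha1)

# Hypothetical store that skips inserts whose key is already present.
class PackageFileStore
  def initialize
    @seen  = Set.new
    @files = []
  end

  # Returns true if the file was stored, false if it was a duplicate.
  def store(attrs)
    key = PackageFileKey.new(attrs[:package_id], attrs[:file_name],
                             attrs[:file_md5], attrs[:file_sha1])
    return false if @seen.include?(key)

    @seen << key
    @files << attrs
    true
  end

  def count
    @files.size
  end
end
```

With this check in place, replaying the my-app-1.0.jar upload from the logs below would leave a single row instead of two.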
Relevant logs and/or screenshots
[43] pry(main)> mvn.package_files
Packages::PackageFile Load (0.5ms) SELECT "packages_package_files".* FROM "packages_package_files" WHERE "packages_package_files"."package_id" = $1 [["package_id", 14]]
↳ ./bin/rails:4
=> [#<Packages::PackageFile:0x00007fe480700528
id: 62,
package_id: 14,
created_at: 2019-07-03 10:00:32 UTC,
updated_at: 2019-07-03 10:00:32 UTC,
size: 2416,
file_type: nil,
file_store: 2,
file_md5: "aed4964fd5ddfdc088cdf83a0d2ab729",
file_sha1: "252d3e65bce4198048d656f51a8a598b8ff76de4",
file_name: "my-app-1.0.jar",
file: "my-app-1.0.jar">,
#<Packages::PackageFile:0x00007fe4637dbeb0
id: 59,
package_id: 14,
created_at: 2019-07-03 09:59:12 UTC,
updated_at: 2019-07-03 09:59:12 UTC,
size: 2416,
file_type: nil,
file_store: 2,
file_md5: "aed4964fd5ddfdc088cdf83a0d2ab729",
file_sha1: "252d3e65bce4198048d656f51a8a598b8ff76de4",
file_name: "my-app-1.0.jar",
file: "my-app-1.0.jar">,
#<Packages::PackageFile:0x00007fe4637dba78
id: 63,
package_id: 14,
created_at: 2019-07-03 10:00:36 UTC,
updated_at: 2019-07-03 10:00:36 UTC,
size: 1229,
file_type: nil,
file_store: 2,
file_md5: "95bd2a07ac1017f8eeb93dc7b69cfa35",
file_sha1: "d9b7f54b87fbebb7be1a2f0afa5b9bb735208a60",
file_name: "my-app-1.0.pom",
file: "my-app-1.0.pom">,
#<Packages::PackageFile:0x00007fe4637db5f0
id: 60,
package_id: 14,
created_at: 2019-07-03 09:59:16 UTC,
updated_at: 2019-07-03 09:59:16 UTC,
size: 1229,
file_type: nil,
file_store: 2,
file_md5: "95bd2a07ac1017f8eeb93dc7b69cfa35",
file_sha1: "d9b7f54b87fbebb7be1a2f0afa5b9bb735208a60",
file_name: "my-app-1.0.pom",
file: "my-app-1.0.pom">]
[44] pry(main)>
Impact
According to the following query, as of today (2019-07-04) we already have 1567 duplicated files out of 193710:
explain SELECT package_id, file_name, file_md5, file_sha1, count(*) as cnt FROM "packages_package_files" group by package_id, file_name, file_md5, file_sha1 having count(*) > 1;
HashAggregate (cost=26902.75..28531.72 rows=162897 width=123) (actual time=383.013..443.980 rows=1567 loops=1)
Group Key: package_id, file_name, file_md5, file_sha1
Filter: (count(*) > 1)
Rows Removed by Filter: 187662
Buffers: shared hit=742 read=4773
I/O Timings: read=117.427
-> Seq Scan on packages_package_files (cost=0.00..23997.10 rows=193710 width=115) (actual time=0.985..216.120 rows=193901 loops=1)
Buffers: shared hit=742 read=4773
I/O Timings: read=117.427
Planning time: 1.907 ms
Execution time: 447.629 ms
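One way to enforce the expected behavior at the database level would be a unique index over the same columns the query above groups by. This is a sketch, not a proposed migration; the existing 1567 duplicates would have to be cleaned up before such an index could be created:

```sql
-- Sketch: reject duplicate rows for the same package file identity.
-- Requires deduplicating existing rows first, or CREATE INDEX will fail.
CREATE UNIQUE INDEX index_packages_package_files_uniqueness
  ON packages_package_files (package_id, file_name, file_md5, file_sha1);
```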
I don't know how big those files are (they may be small), but this affects our cloud spend.
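Treating each duplicated group from the query above as one duplicated file, as the figures do, the share works out to under one percent:

```ruby
# Back-of-the-envelope share of duplicated files, using the figures above.
duplicates = 1567
total      = 193_710
share_pct  = (duplicates.to_f / total * 100).round(2)
puts "duplicates are #{share_pct}% of all package files"  # => 0.81%
```

Small today, but every republish of a non-SNAPSHOT version grows it.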