Republishing a Maven package duplicates uploads

Summary

Running mvn deploy two or more times uploads and stores multiple copies of the same file.

Steps to reproduce

  1. Configure a project with Maven
  2. Set a non-SNAPSHOT version
  3. Run mvn deploy
  4. Run mvn deploy again

Example Project

https://gitlab.com/nolith/test-pkgs/-/packages/19615

What is the current bug behavior?

Each deploy stores another copy of the exact same file.

What is the expected correct behavior?

Do not upload the file again (or at least overwrite the existing copy) when a file with the same file_md5, file_sha1, and file_name is already in the database.
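
The expected behavior amounts to a lookup before storing. The following is a minimal, self-contained Ruby sketch of that idea, not GitLab's actual upload code: `PackageFileRecord`, `InMemoryPackage`, and `store_unless_duplicate` are hypothetical names introduced for illustration, and a real implementation would query packages_package_files instead of an in-memory array.

```ruby
require 'digest/md5'
require 'digest/sha1'

# Hypothetical stand-in for a Packages::PackageFile row.
PackageFileRecord = Struct.new(:file_name, :file_md5, :file_sha1, :size, keyword_init: true)

# Hypothetical in-memory package; assumed here only to keep the sketch runnable.
class InMemoryPackage
  attr_reader :package_files

  def initialize
    @package_files = []
  end

  # Stores the upload only when no file with the same name and checksums exists.
  # Returns the existing record for a duplicate, otherwise the newly stored record.
  def store_unless_duplicate(file_name, content)
    md5  = Digest::MD5.hexdigest(content)
    sha1 = Digest::SHA1.hexdigest(content)

    existing = @package_files.find do |f|
      f.file_name == file_name && f.file_md5 == md5 && f.file_sha1 == sha1
    end
    return existing if existing

    record = PackageFileRecord.new(
      file_name: file_name, file_md5: md5, file_sha1: sha1, size: content.bytesize
    )
    @package_files << record
    record
  end
end
```

With this check, calling `store_unless_duplicate` twice with the same name and content leaves a single record, while a file with the same name but different content is still stored as a new record.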

Relevant logs and/or screenshots

[43] pry(main)> mvn.package_files
  Packages::PackageFile Load (0.5ms)  SELECT "packages_package_files".* FROM "packages_package_files" WHERE "packages_package_files"."package_id" = $1  [["package_id", 14]]
  ↳ ./bin/rails:4
=> [#<Packages::PackageFile:0x00007fe480700528
  id: 62,
  package_id: 14,
  created_at: 2019-07-03 10:00:32 UTC,
  updated_at: 2019-07-03 10:00:32 UTC,
  size: 2416,
  file_type: nil,
  file_store: 2,
  file_md5: "aed4964fd5ddfdc088cdf83a0d2ab729",
  file_sha1: "252d3e65bce4198048d656f51a8a598b8ff76de4",
  file_name: "my-app-1.0.jar",
  file: "my-app-1.0.jar">,
 #<Packages::PackageFile:0x00007fe4637dbeb0
  id: 59,
  package_id: 14,
  created_at: 2019-07-03 09:59:12 UTC,
  updated_at: 2019-07-03 09:59:12 UTC,
  size: 2416,
  file_type: nil,
  file_store: 2,
  file_md5: "aed4964fd5ddfdc088cdf83a0d2ab729",
  file_sha1: "252d3e65bce4198048d656f51a8a598b8ff76de4",
  file_name: "my-app-1.0.jar",
  file: "my-app-1.0.jar">,
 #<Packages::PackageFile:0x00007fe4637dba78
  id: 63,
  package_id: 14,
  created_at: 2019-07-03 10:00:36 UTC,
  updated_at: 2019-07-03 10:00:36 UTC,
  size: 1229,
  file_type: nil,
  file_store: 2,
  file_md5: "95bd2a07ac1017f8eeb93dc7b69cfa35",
  file_sha1: "d9b7f54b87fbebb7be1a2f0afa5b9bb735208a60",
  file_name: "my-app-1.0.pom",
  file: "my-app-1.0.pom">,
 #<Packages::PackageFile:0x00007fe4637db5f0
  id: 60,
  package_id: 14,
  created_at: 2019-07-03 09:59:16 UTC,
  updated_at: 2019-07-03 09:59:16 UTC,
  size: 1229,
  file_type: nil,
  file_store: 2,
  file_md5: "95bd2a07ac1017f8eeb93dc7b69cfa35",
  file_sha1: "d9b7f54b87fbebb7be1a2f0afa5b9bb735208a60",
  file_name: "my-app-1.0.pom",
  file: "my-app-1.0.pom">]
[44] pry(main)>

Impact

According to the following query, as of today (2019-07-04) we already have 1567 groups of duplicated files among the 193901 rows scanned (the planner estimated 193710):

explain SELECT package_id, file_name, file_md5, file_sha1, count(*) as cnt FROM "packages_package_files" group by package_id, file_name, file_md5, file_sha1 having count(*) > 1;

HashAggregate  (cost=26902.75..28531.72 rows=162897 width=123) (actual time=383.013..443.980 rows=1567 loops=1)
  Group Key: package_id, file_name, file_md5, file_sha1
  Filter: (count(*) > 1)
  Rows Removed by Filter: 187662
  Buffers: shared hit=742 read=4773
  I/O Timings: read=117.427
  ->  Seq Scan on packages_package_files  (cost=0.00..23997.10 rows=193710 width=115) (actual time=0.985..216.120 rows=193901 loops=1)
        Buffers: shared hit=742 read=4773
        I/O Timings: read=117.427
Planning time: 1.907 ms
Execution time: 447.629 ms

I don't know how big those files are (they may be small), but this affects cloud spend.
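
One way to size the problem is to sum the bytes of every copy beyond the first in each duplicate group. The sketch below does this in plain Ruby, assuming the rows are available as hashes keyed by the packages_package_files column names; `redundant_bytes` is a hypothetical helper, and the sample rows are the four files from the pry output above.

```ruby
# Hypothetical helper: bytes taken by redundant copies, i.e. every copy after
# the first within a (package_id, file_name, file_md5, file_sha1) group.
def redundant_bytes(rows)
  rows
    .group_by { |r| r.values_at(:package_id, :file_name, :file_md5, :file_sha1) }
    .values
    .sum { |group| group.drop(1).sum { |r| r[:size] } }
end

# Sample rows copied from the pry session in this issue.
rows = [
  { id: 62, package_id: 14, size: 2416, file_name: "my-app-1.0.jar",
    file_md5: "aed4964fd5ddfdc088cdf83a0d2ab729",
    file_sha1: "252d3e65bce4198048d656f51a8a598b8ff76de4" },
  { id: 59, package_id: 14, size: 2416, file_name: "my-app-1.0.jar",
    file_md5: "aed4964fd5ddfdc088cdf83a0d2ab729",
    file_sha1: "252d3e65bce4198048d656f51a8a598b8ff76de4" },
  { id: 63, package_id: 14, size: 1229, file_name: "my-app-1.0.pom",
    file_md5: "95bd2a07ac1017f8eeb93dc7b69cfa35",
    file_sha1: "d9b7f54b87fbebb7be1a2f0afa5b9bb735208a60" },
  { id: 60, package_id: 14, size: 1229, file_name: "my-app-1.0.pom",
    file_md5: "95bd2a07ac1017f8eeb93dc7b69cfa35",
    file_sha1: "d9b7f54b87fbebb7be1a2f0afa5b9bb735208a60" }
]

puts redundant_bytes(rows) # prints 3645 (2416 + 1229)
```

For the test package this is only a few kilobytes, which says nothing about the other duplicate groups; the same grouping would have to run over the full table to get a real figure.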

Edited Sep 02, 2020 by 🤖 GitLab Bot 🤖