Investigate the difference in license-db exporter bucket size between dev and prod
While investigating a failing sanity test, we noticed a difference in the size of the exporter bucket folders between dev and prod. The test started failing after we minimised the size of the exporter bucket and reran the feeder and exporter jobs on dev.
All Prod sizes:
v1/rubygem is of size 51409702 expecting 35000000
v1/pypi is of size 330926526 expecting 300000000
v1/packagist is of size 170505630 expecting 150000000
v1/nuget is of size 463417614 expecting 290000000
v1/npm is of size 1185650732 expecting 1100000000
v1/maven is of size 1447820310 expecting 650000000
v1/go is of size 1261803711 expecting 1100000000
v1/conan is of size 154756 expecting 120000
v1 is of size 4911692473 expecting 3800000000
Dev sizes:
v1/rubygem is of size 35761814 expecting 35000000
v1/pypi is of size 106560702 expecting 0 // Changed the sanity value to 0 to pass the test
v1/packagist is of size 154046091 expecting 150000000
v1/nuget is of size 239201025 expecting 0 // Changed the sanity value to 0 to pass the test
v1/npm is of size 1263705172 expecting 1100000000
v1/maven is of size 723026943 expecting 650000000
v1/go is of size 1345154165 expecting 1100000000
v1/conan is of size 118668 expecting 0 // Changed the sanity value to 0 to pass the test
v1 is of size 3867574580 expecting 3800000000
For example, pypi was 315.6 MiB on prod and 101.62 MiB on dev. After rerunning the pypi feeder and exporter jobs, the size on dev became 236.56 MiB, which is still considerably smaller than on prod.
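To make the gap easier to see per folder, the sanity-test output above can be compared side by side. The following is only a rough sketch: the helper names (`parse_sizes`, `to_mib`, `compare`) are made up for this issue and are not part of the exporter or the sanity test; the numbers are copied from the lists above.

```python
import re

PROD_LOG = """\
v1/rubygem is of size 51409702 expecting 35000000
v1/pypi is of size 330926526 expecting 300000000
v1/packagist is of size 170505630 expecting 150000000
v1/nuget is of size 463417614 expecting 290000000
v1/npm is of size 1185650732 expecting 1100000000
v1/maven is of size 1447820310 expecting 650000000
v1/go is of size 1261803711 expecting 1100000000
v1/conan is of size 154756 expecting 120000
v1 is of size 4911692473 expecting 3800000000
"""

DEV_LOG = """\
v1/rubygem is of size 35761814 expecting 35000000
v1/pypi is of size 106560702 expecting 0
v1/packagist is of size 154046091 expecting 150000000
v1/nuget is of size 239201025 expecting 0
v1/npm is of size 1263705172 expecting 1100000000
v1/maven is of size 723026943 expecting 650000000
v1/go is of size 1345154165 expecting 1100000000
v1/conan is of size 118668 expecting 0
v1 is of size 3867574580 expecting 3800000000
"""

# Matches lines such as "v1/pypi is of size 330926526 expecting 300000000".
LINE_RE = re.compile(r"^(\S+) is of size (\d+) expecting (\d+)")


def parse_sizes(log):
    """Return {folder: size_in_bytes} for every sanity-test line in the log."""
    sizes = {}
    for line in log.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            sizes[match.group(1)] = int(match.group(2))
    return sizes


def to_mib(size_in_bytes):
    """Convert bytes to MiB (1 MiB = 1024 * 1024 bytes)."""
    return size_in_bytes / (1024 * 1024)


def compare(prod, dev):
    """Return (folder, prod MiB, dev MiB, dev/prod ratio) for shared folders."""
    rows = []
    for folder, prod_size in prod.items():
        if folder in dev:
            dev_size = dev[folder]
            rows.append((folder, to_mib(prod_size), to_mib(dev_size), dev_size / prod_size))
    return rows


if __name__ == "__main__":
    for folder, prod_mib, dev_mib, ratio in compare(parse_sizes(PROD_LOG), parse_sizes(DEV_LOG)):
        print(f"{folder:<14} prod={prod_mib:10.2f} MiB  dev={dev_mib:10.2f} MiB  dev/prod={ratio:.2f}")
```

Going by the numbers above, pypi, maven and nuget shrink the most on dev, while npm and go are slightly larger on dev than on prod, so the difference is not uniform across package types.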
We need to investigate the reason behind this difference. Possible reasons include:
- unknown licenses (ID 0)? We no longer store them, but maybe we haven’t cleared the prod DB after changing that.
- incremental exports are significantly larger than one-shot export-all exports
- regression in the exporter
More information on how to find the size of a GCP bucket and its subfolders can be found in the documentation.
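As a complement to the documentation, here is a minimal sketch of how the per-folder sizes could be measured from Python with the `google-cloud-storage` client. It assumes that package is installed and that credentials are configured; the bucket name is a placeholder, and the function names are hypothetical rather than anything used by the sanity test. `gsutil du -s` on a bucket path is an alternative from the command line.

```python
def size_by_prefix(blobs, prefixes):
    """Sum object sizes (bytes) under each prefix.

    `blobs` is any iterable of objects with `.name` and `.size`,
    for example the result of `Client.list_blobs`.
    """
    totals = {prefix: 0 for prefix in prefixes}
    for blob in blobs:
        for prefix in prefixes:
            if blob.name.startswith(prefix):
                # `size` can be None for objects whose metadata was not loaded.
                totals[prefix] += blob.size or 0
    return totals


def list_bucket_blobs(bucket_name, prefix):
    """List objects in a GCS bucket under `prefix`.

    Imported lazily so the pure helper above can be used without the
    google-cloud-storage package installed.
    """
    from google.cloud import storage

    client = storage.Client()
    return client.list_blobs(bucket_name, prefix=prefix)


if __name__ == "__main__":
    # "my-exporter-bucket" is a placeholder, not the real bucket name.
    blobs = list_bucket_blobs("my-exporter-bucket", "v1/")
    # Trailing slashes keep "v1/pypi/" from matching a sibling like "v1/pypi2/".
    totals = size_by_prefix(blobs, ["v1/", "v1/pypi/", "v1/nuget/", "v1/conan/"])
    for prefix, total in totals.items():
        print(f"{prefix} is of size {total}")
```

Running the same measurement against both the dev and prod buckets would give numbers directly comparable to the lists above.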