Allow clients to verify authenticity and integrity of package metadata database exports

Proposal

We should provide a way to verify the authenticity and integrity of the exports stored in the buckets. Adding these properties to the exports provides confidence that the results produced by our continuous license scanning and continuous vulnerability scanning features can be trusted. As an initial iteration, we should add a way to verify integrity by publishing the SHA-256 checksums of each exported file. A common approach for this is to use the sha256sum tool. This tool is found on many *nix-style systems and can write many checksums into a single file, like so:

# Calculating the sha256sums for the advisory exports
$ sha256sum */*/*/*.ndjson # `> sha256sums.txt` can also output to file
51bd76713e69e1a20e4cadd34a9388cffd6d1c5404937b9f7d71fc8e5a531c6c  v2/conan/1688459246/000000000.ndjson
7ec2f41aca6d3a0dc8db00bec4355678de403f916d4b725dc5799fda62f1ea11  v2/conan/1688540541/000000000.ndjson
1d5ed6b684122d5efb7fc23a275bb2657ac5b22e62ba65a216fefeaf7f1f84ff  v2/conan/1688626965/000000000.ndjson
3fcb9648c1e6be3a97f284f606525f156c9f0f4458a7d8a8455f8a6aa83a1621  v2/conan/1688972549/000000000.ndjson
1a13c0ef2f0cc81a1fd22d4642379fa85c86f9e9be327e54939d5b33979cc9b5  v2/conan/1689145355/000000000.ndjson
62d3a7fc634ef66a6d4ce32d6bebcf818ebc3f775d8fd61a1cfb577c26730980  v2/go/1688459248/000000000.ndjson
00cb4a34e46396f2dc494cb17879e20e995112181f9e6bde9f8472baed2720ae  v2/go/1688972551/000000000.ndjson
b3ac5e3839a3a5cf84ecdf89647a39406433a863a156d5406520227ccba4a1f1  v2/go/1689145356/000000000.ndjson
6931b4b0676cb1359838d42ad74ab1b25ca6355402053cd9ce7b5e5de24a183e  v2/maven/1688459233/000000000.ndjson
131592c9b1c806fe3a23d3acfb569161eef436ed9f7662811a9baf2d095fe95b  v2/maven/1688540534/000000000.ndjson
ada1400565c97ec28910a1c50ba13e99a7ac900a0686f3be662cb33666753267  v2/maven/1688626957/000000000.ndjson
d3895fb64b664c8d61948f9117e4459602dc266d2dbd356e34543b9242382178  v2/maven/1688972541/000000000.ndjson
f6ddd83d190497deb38d480b6e26de1e5488049241b8e348edaa62a35f2cab16  v2/maven/1689058945/000000000.ndjson
1028a1ecab006cc693e488e46d83e034a1a7a03933069f71f92b87a2b18b4dc8  v2/maven/1689145347/000000000.ndjson
a7b36460f925b91a80a93570f1c0ef5a262c50711fcbf52373f9e4fae71204e7  v2/npm/1688459236/000000000.ndjson
5e7c68d9b34c915a57507c9c98c1383cc88a7177bb1c66cdb3104bba03a1cb4b  v2/npm/1688626959/000000000.ndjson
02d8a6ffd6942d6a94e343d44bd1abca0631e5f43c374ee3897621e34826fe7a  v2/npm/1688972543/000000000.ndjson
cf75acedd3388667c801924c1d53d24fc109596a5d1a954fb45e5638d44e4410  v2/npm/1689145348/000000000.ndjson
f0e996318c60ccf5d00c4e46d23fcecd1d3ad4b196ca6319823080d4847cb13b  v2/nuget/1688459250/000000000.ndjson
fbf53c85dd5b1d0eea654c0ff79e9d82cc6ab1591ace80e2d8aff32b07ffccef  v2/nuget/1688626967/000000000.ndjson
9a7a3d0bbd3f86a4c086da7aee6f49df64f49726f80ffde3a6a88ae6a04d37fa  v2/packagist/1688459238/000000000.ndjson
87aadf1774b4a669ba85d3cb68e4335acb62d4a1c51e79766cf01b378dc7d9f5  v2/packagist/1688540537/000000000.ndjson
f328a6437071d4c08703353e97b968cf33ea47615d2c8fe18d8444f7bb7a48e1  v2/packagist/1688713369/000000000.ndjson
73094930d2d7572dc4ec712f3b5e1c5e57c95762e475dd39b6e9f18a21f5f4ba  v2/packagist/1688972545/000000000.ndjson
97f624af70de9d28cd8bedc5c16745df6ae133d3a5a0e53eec18a2b3f01490a8  v2/packagist/1689145350/000000000.ndjson
456c985a3d827503993add60699b5bdc10b6a2672af9fb15d33a78e63c124799  v2/pypi/1688459241/000000000.ndjson
e424e173ffa7d984799cf58649ac7627ee356ad6b438efaf6a28d9c548aa3539  v2/pypi/1688626962/000000000.ndjson
bbc92c5a2a942bd67ddd3154e9cc0f59b573ded80593553797f8b79f2fcc9ae7  v2/pypi/1688713370/000000000.ndjson
76a720d59f44c7de2068d790af0c02da0cb20bda4c97f23da4193e0553b98fab  v2/pypi/1688972547/000000000.ndjson
61976978a89f1eaa684db77c36eec9767a87d6a881864ca5b77d1503cc4d1a49  v2/pypi/1689058949/000000000.ndjson
1a48be8a0aa323047c5d91021f5302f5b3dc270070623d9d9fe097c131a06701  v2/pypi/1689145352/000000000.ndjson
63124025dd265a42263e796955f003dcc3410c4bff7eb937b70802a0ab4ab78c  v2/rubygem/1688459243/000000000.ndjson
189c8b236fb939db7b830d0cb017675f4c1c6f156f23b464cfb9fa060ae252b5  v2/rubygem/1688626963/000000000.ndjson
4db53063c9f8c0b172de105ecfcbf24fa32ab97ef8cc4eba7e15862b03121652  v2/rubygem/1688713371/000000000.ndjson
5e7d7aa31f0f4d9731fa5827529cbc9ed9f4339b170ebd4dcdde89b1a4215c08  v2/rubygem/1688972548/000000000.ndjson
07756972e44ebeb7449674da74943056f4c3e100f05cffd98be12548778280a1  v2/rubygem/1689145353/000000000.ndjson
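
On the consuming side, the same tool can re-hash the files and compare them against the published list. The following is a minimal sketch, assuming the checksums are published as sha256sums.txt at the root of the export tree (the file name comes from the comment above; the helper names, the temporary directory, and the sample export are hypothetical placeholders):

```shell
#!/usr/bin/env sh
# Sketch: build a checksum manifest for a tree of exports, then verify it.
# The directory and file contents are placeholders, not real exports.
set -eu

# Write sha256sums.txt covering every export under the directory in $1.
generate_manifest() {
  (cd "$1" && sha256sum */*/*/*.ndjson > sha256sums.txt)
}

# Re-hash every file listed in the manifest; returns non-zero on any mismatch.
verify_manifest() {
  (cd "$1" && sha256sum -c sha256sums.txt)
}

workdir="$(mktemp -d)"
mkdir -p "$workdir/v2/npm/1688459236"
printf '{"name":"example"}\n' > "$workdir/v2/npm/1688459236/000000000.ndjson"

generate_manifest "$workdir"
verify_manifest "$workdir" >/dev/null && echo "manifest verified"

# Simulate corruption: a single appended byte is enough to fail the check.
printf 'x' >> "$workdir/v2/npm/1688459236/000000000.ndjson"
if ! verify_manifest "$workdir" >/dev/null 2>&1; then
  echo "mismatch detected"
fi
```

Because `sha256sum -c` exits non-zero on any mismatch, a client could use its exit status to decide whether to ingest an export at all.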

This approach provides integrity, but not authenticity. For that, a signature scheme will need to be adopted, e.g. GPG signatures. Adding signatures can be a complex task because it requires managing the private key securely, so it's proposed that this portion be done in a following iteration.
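
To make the later iteration more concrete, one possible shape is a detached signature over the checksum file. This is only a sketch under assumptions: the helper names are hypothetical, the key identifier passed to sign_manifest stands in for whatever signing key is eventually chosen, and nothing here addresses how that key would be stored.

```shell
#!/usr/bin/env sh
# Sketch: detached GPG signature over the checksum manifest.
# Key management is out of scope; the key identifier is supplied by the caller.
set -eu

# Publisher side: $1 = manifest path, $2 = signing key identifier (hypothetical).
# Writes an ASCII-armored signature to <manifest>.asc.
sign_manifest() {
  gpg --batch --yes --local-user "$2" --armor --output "$1.asc" --detach-sign "$1"
}

# Consumer side: $1 = manifest path. Succeeds only if <manifest>.asc is a good
# signature over the manifest from a public key present in the local keyring.
verify_signature() {
  gpg --batch --verify "$1.asc" "$1"
}
```

A consumer would additionally need to confirm that the signing key's fingerprint matches one published by GitLab; `gpg --verify` on its own only shows that some key in the local keyring produced the signature.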

So, why should we provide these features if the data is publicly available? Confidentiality is only one of the properties a malicious actor can target. Adding integrity ensures that the data has not been tampered with, and can be useful in handling or even preventing erroneous states. For example, if some data were malformed when written by the client, we'd be able to tell that the exports did not exactly match what was expected, and thus better handle the error state. Likewise, adding authenticity to our exports lets us verify not only that the data has not been tampered with, but also that it was written by us. This protects us against scenarios where the bucket is compromised and the exports are mutated. A sufficiently competent actor who has compromised the bucket(s) can include valid checksums for the new exports, and a client would still confirm that the exports are valid. The client could not confirm that the exports came from GitLab unless it could verify their origin using GitLab's public key.
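
The compromised-bucket scenario can be reproduced locally. This is an illustrative sketch only; the temporary directory stands in for the bucket, and the export contents are made up:

```shell
#!/usr/bin/env sh
# Sketch: a checksum manifest stored next to the exports cannot prove origin.
# All paths and contents are throwaway placeholders.
set -eu

bucket="$(mktemp -d)"   # stands in for the bucket
mkdir -p "$bucket/v2/pypi/1688459241"
printf '{"advisory":"original"}\n' > "$bucket/v2/pypi/1688459241/000000000.ndjson"
(cd "$bucket" && sha256sum */*/*/*.ndjson > sha256sums.txt)
original_manifest_digest="$(cd "$bucket" && sha256sum < sha256sums.txt)"

# An actor with write access rewrites an export and regenerates the manifest.
printf '{"advisory":"tampered"}\n' > "$bucket/v2/pypi/1688459241/000000000.ndjson"
(cd "$bucket" && sha256sum */*/*/*.ndjson > sha256sums.txt)

# The client's integrity check still passes...
if (cd "$bucket" && sha256sum -c sha256sums.txt >/dev/null 2>&1); then
  echo "checksums valid"
fi

# ...even though the manifest is no longer the one that was first published.
tampered_manifest_digest="$(cd "$bucket" && sha256sum < sha256sums.txt)"
if [ "$original_manifest_digest" != "$tampered_manifest_digest" ]; then
  echo "manifest changed"
fi
```

The check passes because the manifest was rewritten along with the data. Only something the actor cannot regenerate, such as a signature made with a private key held outside the bucket, would expose the change.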

Still, the above doesn't completely explain the why. Outside of hardware errors, why would a malicious actor intentionally tamper¹ with the public data? Tying this back to our products, license scanning and dependency scanning, clients count on us to give them the best possible results for important decisions such as adopting a new license or a new third-party library. A threat actor who knows this can choose to act in a manner that impacts the results of these features and, ultimately, the trust placed in us. A full threat model would produce the best results, but I've included some initial examples below:

Some entity could remove the vulnerabilities for a library that has been flagged as malicious.
Alternatively, they could abuse the trust in our solutions to introduce a malicious solution. Let's say that pkgA@1.0.0 is super popular, but it's been discovered to have a vulnerability. It may be possible to provide a solution that says something like "Switch to maliciousPkgA@1.0.0 which has been patched" instead of the valid solution.

To conclude, I think that these features provide a lot of benefits to us, and would also show the commitment that we have to supply chain security when it comes to the services we provide. It's not expected to be the only line of defense (the bucket should be secured as well), but it can act as an additional layer of defense.

Implementation Plan

TODO

¹ Tamper deserves some disambiguation. In this context, it means changing the data in any manner, including removing it altogether, which affects another product aspect: availability. Availability has its own nuances and deserves a separate issue for consideration.

Edited Jul 19, 2023 by Oscar Tovar