Calculate sha256 digest of artifact on PublishProvenanceService
## Background

#### What are "SLSA provenance statements"?

The ~"group::pipeline security" is working towards providing users with [SLSA Level 3 Provenance Attestations](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/slsa_level_3/). The [SLSA documentation](https://slsa.dev/spec/v1.1/provenance) describes provenance as follows:

> It’s the verifiable information about software artifacts describing where, when, and how something was produced. For higher SLSA levels and more resilient integrity guarantees, provenance requirements are stricter and need a deeper, more technical understanding of the predicate.

> Describe how an artifact or set of artifacts was produced so that:
> - Consumers of the provenance can verify that the artifact was built according to expectations.
> - Others can rebuild the artifact, if desired.

As a simplified TL;DR, in the [context of GitLab](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/slsa_level_3/), a provenance statement is a JSON document that correlates the sha256 sum of an artifact with the build information. A worker then [digitally signs](https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/slsa_level_3/decisions/004_attestation_in_sidekiq/) the statement, and the signed result is called a provenance attestation.

This is a highly sought-after feature, particularly for our ~"GitLab Ultimate" customers.

#### Current status

Currently, we are generating SLSA attestations based on the [artifact archive](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/193550/diffs#80cf108a4264b2db417f9cd5ae0626ec09c16971_0_17) (https://gitlab.com/gitlab-org/gitlab/-/issues/546150). This is an interim step toward generating SLSA provenance attestations for individual artifacts, as described in the [Add individual job artifacts to the SLSA provenance subject](https://gitlab.com/gitlab-org/gitlab/-/issues/547974) issue.
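To make the simplified description of a provenance statement above more concrete, here is a minimal sketch of a statement that ties a sha256 digest to build information. The `_type` and `predicateType` values follow the public in-toto Statement v1 and SLSA provenance v1 specifications. The helper name `build_statement`, the artifact name, and the placeholder predicate values are assumptions made for illustration, and this is not the actual GitLab implementation.

```ruby
require "json"
require "digest"

# Hypothetical helper for illustration only; not GitLab's implementation.
# Builds an in-toto style statement whose single subject is identified
# by the sha256 digest of the given bytes.
def build_statement(artifact_name:, artifact_bytes:, predicate:)
  {
    "_type" => "https://in-toto.io/Statement/v1",
    "subject" => [
      {
        "name" => artifact_name,
        "digest" => { "sha256" => Digest::SHA256.hexdigest(artifact_bytes) }
      }
    ],
    "predicateType" => "https://slsa.dev/provenance/v1",
    "predicate" => predicate
  }
end

# Placeholder build information (assumed values); a real predicate
# carries far more detail about the build.
predicate = {
  "buildDefinition" => { "buildType" => "https://example.com/hypothetical-build-type/v1" },
  "runDetails" => { "builder" => { "id" => "https://example.com/hypothetical-runner" } }
}

statement = build_statement(
  artifact_name: "artifacts.zip",
  artifact_bytes: "abc",
  predicate: predicate
)
puts JSON.pretty_generate(statement)
```

The statement itself is unsigned. Signing it, as described above, is a separate step and is not shown here.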
#### What are sha256 hashes required for?

Hashes are the sole mechanism through which the "subjects" are identified. See the "subject" documentation on [this page](https://github.com/in-toto/attestation/blob/7aefca35a0f74a6e0cb397a8c4a76558f54de571/spec/v1/statement.md).

> Set of software artifacts that the attestation applies to. Each element represents a single software artifact. Each element MUST have digest set.

> IMPORTANT: Subject artifacts are matched purely by digest, regardless of content type. If this matters to you, please comment on GitHub Issue

## Technical background

Job artifacts are currently stored in a zip file in the GitLab backend. We would prefer not to open that archive to generate the digest, because the archive could be large and processing it could be resource intensive.

```
% sha256sum test.txt
3c5bba498d6f7a2cb4c195cf0873c8b68c9407f04dfa9acaad7fe4875e5e93f1  test.txt

what we need is:

> file = Ci::Build.last.job_artifacts.filter { |a| a.file_type == "archive" }[0].file.file
> entry = Zip::File.open(file).entries[0]
> Digest::SHA256.hexdigest(entry.get_input_stream.read)
3c5bba498d6f7a2cb4c195cf0873c8b68c9407f04dfa9acaad7fe4875e5e93f1
```

## Relevant links

* ~~Example POC https://gitlab.com/gitlab-org/gitlab/-/merge_requests/190882/diffs#541b0e0d6fbffd8673186d5ff090e868d060d6a7_32_34.~~
* [Draft: Get SHA256 hashes from artifacts](https://gitlab.com/gitlab-org/gitlab/-/merge_requests/201393)
* Related ticket [Add individual job artifacts to the SLSA provenance subject](https://gitlab.com/gitlab-org/gitlab/-/issues/547974)
* [Previous discussion](https://gitlab.com/gitlab-org/gitlab/-/issues/546150#note_2542628969)
* [Previous work](https://docs.gitlab.com/ci/runners/configure_runners/#artifact-provenance-metadata)

## Non-functional requirements

- [x] Documentation: [ADR 005: SLSA SHA-256 hashing location](https://gitlab.com/gitlab-com/content-sites/handbook/-/merge_requests/15396)
- [x] Feature flag: This service is already called behind a feature flag.
- [x] Performance: We need to analyse the performance and impact, particularly for large files. See verification below.
- [x] Testing: See verification below.

## Implementation plan

- [x] Modify PublishProvenanceService so that it has a method that calculates the hash of the artifacts. See PoC above for reference.
- [x] Ensure we gracefully handle low disk space conditions: https://gitlab.com/gitlab-org/gitlab/-/issues/559267#note_2688826895
- [x] Create unit tests.
- [x] Log the output, for verification.

## Verification steps

- [x] Measure the performance of hashing a very large file. ~~10GB.~~ 1GB. We used 1GB because that is the default hard limit on artifacts. Done, see the output here: https://gitlab.com/gitlab-org/gitlab/-/merge_requests/201682#note_2720818052
- [x] Enable the behaviour in production for a specific project and observe the logs.
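
As a rough illustration of the large-file concern behind the performance check above, the sketch below hashes an IO stream in fixed-size chunks instead of reading it into memory in one call. The helper name `streaming_sha256` and the 1 MiB chunk size are assumptions for illustration; this is not the method added to PublishProvenanceService. It relies only on the stream responding to `read(length)`, and it is likewise an assumption that the stream returned by `get_input_stream` in the console snippet above behaves that way.

```ruby
require "digest"
require "stringio"

# Arbitrary chunk size chosen for illustration (1 MiB).
CHUNK_SIZE = 1024 * 1024

# Hypothetical helper: computes the sha256 hex digest of an IO-like
# object by reading it chunk by chunk, so memory use stays bounded
# by the chunk size rather than the size of the file.
def streaming_sha256(io, chunk_size: CHUNK_SIZE)
  digest = Digest::SHA256.new
  # read(length) returns nil at end of stream, which ends the loop.
  while (chunk = io.read(chunk_size))
    digest.update(chunk)
  end
  digest.hexdigest
end

data = "example artifact contents" * 1000
chunked = streaming_sha256(StringIO.new(data), chunk_size: 4096)
one_shot = Digest::SHA256.hexdigest(data)
puts chunked == one_shot
```

Chunked reading only bounds memory use. The CPU time still grows with the file size, so measurements such as the 1GB run linked above remain necessary.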