Proposal for adding binary file support to CI variables

assigned to @darbyfrey

mentioned in issue #345944 (closed)

👋 @darbyfrey, please can you add a type label to this issue to help with issue discovery in issue reports.

added auto updated label

Hi @darbyfrey,

Please add labels to your issue, this aids categorization and locating issues in the future.

Thanks for your help! ❤

You are welcome to help improve this comment.

added Incubation Engineering Department mobile-devops typefeature labels

There's just one security-related thought that I had - Since these are binaries, we need to be sure that nowhere along the process of the binary being handled is the user able to force execution outside of a CI job.

Yup, we should definitely remember about that!

But given the idea that the file received from the through by API (and in the future - through the UI) would be encoded and even additionally encrypted for the storage, and that it would be passed in all places in the encoded form until it's decoded on the runner side, it should be quite safe. It'll not differ from allowing to store a base64 encoded binary in the content of CI/CD variable - which we do already 🙂

@darbyfrey is the proposal here to store binary files in the database? How large these files can be? Are these executables? I think that ideally we should not store anything larger than a public / private key in the database, instead we should probably store them encrypted in the object storage.

Hi @grzesiek, the proposal here is to store the base64 encoded contents of a file in the database, so we'd be writing text data to the DB.

These files aren't large, only a few kilobytes in most cases. Most files are around 3-4 kilobytes after being base64 encoded. The Apple certs end up being a bit larger at around 20 kilobytes encoded. We could enforce an upload size limit too. Also, these files should never be executable.

WDYT? This data would fit within the constraints of the data type in postgres. If we limited the uploads to 20kb I don't think we'd ever get above 30kb encoded, so that would likely be the upper limit.

@darbyfrey I would consider storing them outside of the database, preferably in the object storage as encrypted files / objects.

Please note that you can store binary data in the database - there is no need to Base64 it before storing in PostgreSQL, this could be a waste of space as Base64 comes with around 33% of overhead in byte size.

I think that it should be possible to:

encrypt a binary file
store it in the object storage
save a tracking entry in the database (with IV required to decrypt)
expose to a runner through a pre-signed URL
teach runner how to retrieve it and decrypt
expose link to the file in an environment variable (like a file type secret variable)

Would it work for you @darbyfrey?

Thanks for this @grzesiek!

Yes, I think this would work. I think this approach would look something like:

Data Model

Create a new data model that would track the entry. This would store attributes like the file name, object storage location, encryption iv, project_id, etc.

Upload API

Similar to what is in the current proposal, adjusted to match the new data model:

curl --request POST --header "PRIVATE-TOKEN: <your_access_token>"
     "https://gitlab.example.com/api/v4/projects/1/some_endpoint" 
     --form "filename=upload_keystore.jks"
     --form "file=@/path/to/binary/file"

Retrieval API

The retrieval API would be the endpoint that the Runner would hit get the file information. This payload would include the file name, pre-signed download url, encryption iv, and any other attributes needed to download and decrypt the file.

Runner

The nice thing about this approach is that it could be functional without any changes to the runner. It would just require that the file download and decryption be done in the CI script. We wouldn't want to leave that long term, but that could be a first iteration.

Ideally there would be some way to tell the Runner which files to download and decrypt. Here's an example of how another platform does it. Perhaps a pattern like this would make it easy to tell the Runner which files to download for which jobs. Would that be technically feasible?

build_android:
  stage: build
  script:
    - ...
  secureFiles: (or another name)
    - upload_keystore.jks
  artifacts:
    paths:
      - build/app/outputs/apk/debug

I definitely agree with @grzesiek. While the reason that created this idea is storing small certificates, when added users will quickly start storing there everything: software packages, disk images, gigabytes-measured warez used in abuse scenarios. That should definitely not land in the DB. We should expect all the unexpected use cases here 🙂

I also like the idea of sending the information about these variables separately to the Runner with the pre-signed URL to download. However, we could just introduce an API and protect it with the CI_JOB_TOKEN authentication as other resources accessible from jobs are.

We need to remember that the philosophy behind GitLab CI/CD jobs is to execute shell scripts. Currently variables are defined by adding something like that to the script:

export TEXT_VARIABLE="variable value"
echo "variable value" > /path/to/the/project.tmp/FILE_VARIABLE
export FILE_VARIABLE="/path/to/the/project.tmp/FILE_VARIABLE"

This is of course adjusted to the needs of each shell that we support.

Now, for the proposed binary format this would not be as easy. Passing the binary file content with echo decoded_content_here will not work. Expecting that the target environment will have a proper encoder (especially if we want to leave a way to add more encoding mechanisms in the future) is risky.

With downloading from the URL we can handle this in a similar way how the artifacts are handled - through the GitLab Runner Helper.

So extending what @grzesiek proposed in 5. teach runner how to retrieve it and decrypt and 6. expose link to the file in an environment variable (like a file type secret variable):

We would extend the job API payload to get a new entry like:

type BinaryVariables []BinaryVariable

type BinaryVariable struct {
    VariableKey string `json:"variable_key"`
    URL         string `json:"url"`
    Encoding    string `json:"encoding"`
    Permissions int    `json:"permissions"` // we need to consider if we should support that.
                                            // For example - to make the binary files executable,
                                            // as in the Unix systems it'll be required to set
                                            // proper file permissions.
    Checksum    string `json:"checksum"` // just an idea - we could store the checksum of the file
                                         // received from the user before it was encoded, and then
                                         // check against it after it's decoded by the Runner - to
                                         // be sure that we've got what the user wanted.
                                         // Currently it's not a problem because we work on the data
                                         // created by the user all of the time (we may however encrypt
                                         // them - I don't remember that detail).
}

// (...)

type JobResponse struct {
    // (...)
    BinaryVariables BinaryVariables `json:"binary_variables"`
}

When preparing the variables (basing on the Variables entry in the JobResponse) Runner would additionally iterate over the BinaryVariables slice and create "virtual" variables pairing the variable key with a path made in predictable way. As with the file type variables we would use the "job tmp" directory as the directory and variable key as the file name. So in the script, the variable would be exported like:
```
export BINARY_VARIABLE=/path/to/the/project.tmp/BINARY_VARIABLE
```
Note, that in comparison to the file based variable, we don't have the echo "content" > /path/to/the/project.tmpe/BINARY_VARIABLE part.
We add another job script stage executed before user's script takes control, so exactly like cache and artifacts commands are done. This part would contain commands like:
```
gitlab-runner-helper extract-binary-variable --url "https://gitlab.com/api/v4/job/1234/variable/some_variable_id" --token "${CI_JOB_TOKEN}" --encoding "base64" --checksum "sha256:abcdef123456..." --permission 755 --path "${BINARY_VARIABLE}"
```
These would be added automatically by the Runner. URL, encoding, checksum and permissions would be rewritten 1:1 from the job payload. Token and path would defer to use the variables, that were already exported at the beginning of the script (making sure that we define their value only once).

The helper binary would then executed an HTTP request to the given URL with the given token. I think that this API should return already decrypted content - the encryption and decryption key would be then fully hidden in the GitLab instance. Received content would be decoded using the decoding strategy defined by encoding flag. The permissions would be set on the final file and the checksum would be also checked against the final file.
User's script, when it would take the control, could access the prepared file simply as:
```
sign-my-code path/to/the/code --key-file "${BINARY_VARIABLE}"
```
As the working directory needs to be shared between all job execution stages, the file will be there and the only repeated part will be the declaration of the variable added to the beginning of each stage's script.

BTW, this feature would be a nice solution for the security-like problem that we're struggling with for a longer time! (cc @erushton @DarrenEastman)

For example, when you want to sign files with GPG (like we do for GitLab Runner itself when preparing the DEB/RPM packages and the S3 releases), the only two options currently are:

use a dedicated runner - which doesn't scale well
store the GPG key content as a base64 encoded variable - which is very easy to be leaked (for example, by using the CI_DEBUG_TRACE on a public job), as the content is echoed in the script. And such content usually is not applicable for the variables masking.

With the binary approach downloaded from the URL with the helper binary, the only things that could leak is the URL (which should be protected with authentication), the CI_JOB_TOKEN (which is already masked) and metadata like encoding strategy or file permission. The content of the file is transferred and saved in the execution environment totally outside of script's output.

Thanks for those details @tmaczukin. Very useful. It is also great that this could help with things like exposing GPG or SSH keys to the runner 👍

when added users will quickly start storing there everything: software packages

Yeah, this would be a huge improvement in our caching mechanism 😬

Thanks for the feedback and all of the detail @tmaczukin!

I really like the idea of having the retrieval API do the decryption work. That's a nice simplification to the process and a reuse of the of the CI_JOB_TOKEN authentication we already have in place.

Regarding the BinaryVariable concept for the Runner:

encoding if we did all of the work of decrypting in the retrieval API, I think we could skip the need to encode/decode the file contents, unless you see a reason it would be needed?
permissions I don't think this would be necessary for the majority of use cases. For code signing the files wouldn't need to be executable, and for cases where it could be needed wouldn't it be possible to execute a chmod... as part of the CI script?
checksum I think this is an interesting idea. Are you thinking this would be more to prevent MITM attacks, or as a way to ensure our encryption/decryption is working correctly?

Yeah, this would be a huge improvement in our caching mechanism

Context on my vague joke: I could zip up my node_modules/ directory and add it as a binary CI variable. I don't actually think this would improve performance of anything in anyway, but it's a work around for caching if you don't have proper caching setup.

@darbyfrey

encoding if we did all of the work of decrypting in the retrieval API, I think we could skip the need to encode/decode the file contents, unless you see a reason it would be needed?

Good point. If the API endpoint will stream the binary content then indeed there is no need to encode/decode it with base64 or anything else.

permissions I don't think this would be necessary for the majority of use cases. For code signing the files wouldn't need to be executable, and for cases where it could be needed wouldn't it be possible to execute a chmod... as part of the CI script?

Of course it could be set with manual chmod call. But if we're adding support for binary files, then why not make it easier to use from the beginning 🙂

Especially that the MVC scope is already pointing that adding definitions will be possible only through API. So it's just a matter of accepting an additional parameter and using it on the Runner side. We could make this parameter optional and set it to 600 by default.

I just want to avoid a problem that we've introduced when support for Vault secrets was added - while discussing we've decided that we will support just the file type variables as it's more secure and users can easily load the content of the file to a normal variable. But we've quickly found that the majority of users wants or even needs to use the variables where the secret content is added directly and the need to manually load the secret content adds just a lot of redundant work. So we've implemented a simple improvement to the configuration... which took 8 months to get merged.

Setting permissions on the file is IMHO too simple to not include it in MVC. And my experience shows that it'll be definitely harder to add it after MVC as other priorities will move the focus elsewhere 🙂

checksum I think this is an interesting idea. Are you thinking this would be more to prevent MITM attacks, or as a way to ensure our encryption/decryption is working correctly?

The second. We already recommend that GitLab should be accessible through HTTPS to make the configuration secure, and that should in most cases prevent MITM attacks. If such kind of attack on communication between GitLab and Runner or job execution environment would be applicable, then the attacker could affect the job in a more easy way by just updating the script content. Or could update the checksum together with the binary file content. Checksum without a signature is not a security feature 😉

It's done only to make sure that the received content is proper. The uploaded binary will need to be encrypted and saved on some object storage. Then, when used, it will need to be downloaded from the object storage, decrypted and streamed to the job that requested it. There are many places where something can be messed here. A checksum will ensure that what the user gets in the job is what was uploaded when creating the variable.

The only thing to consider here is what to do when the mismatch will happen. I think that we should fail the job here (just like in case when the job is unable to upload or download the job artifact).

Thanks for the feedback @tmaczukin! I think that all makes sense, one question regarding permissions:

Do you think we'd need to support the complexity of unix style permissions (i.e. 744, etc)? Or could we simplify it to an enum of read-only, read/write, read/write/execute?

It seems like the group and world permissions wouldn't be necessary in the context of a runner, and it seems it like it could also make the process of setting permissions easier on Windows.

Well, I'm not following Windows world actively, but from what I remember Windows doesn't have file permission model at all. Or am I wrong? 🙂

I think we should created some more abstracted way to define the permissions. So that we can start with the simple permissions that you propose but in the future extend it to support the fully unix style or a specific Windows style (or other used OS) with changing the implementation details on the Runner and GitLab sides but without a need to change the API (which would create compatibility problems) 🙂

So the conclusion would be: let's support setting file permissions from the beginning but do it in a little abstracted way. Without exposing permission models specific to different OSes that we're supporting.

I've updated the issue description above to include the updated approach from these discussions. Please let me know if there is an area where I can include more detail or clarification, or if you have any other feedback on this approach. Thanks everyone for the great discussion!

changed the description

FYI - I'm currently in the process of refactoring how we represent variables in the backend and making it consistent everywhere. We currently have two ways to format variables - as a variables hash, and as an array of variable objects. This MR !75561 (closed) changes that so we represent variables as an array consistently all the time everywhere in the codebase. After that's merged I have a MR for adding support for multiple variable types in the CI config - !75439 (closed). Both of these are somewhat related to this initiative.

Thanks for the heads up @matteeyah, I'll follow along with those MRs.

mentioned in merge request !75695 (closed)

added to epic &6713

mentioned in issue #347149 (closed)

Hey everyone, I've got an MR going for the first iteration of this feature, so I thought I would post it here to get some early feedback !75695 (closed). I also created a quick demo to walk through it https://youtu.be/kRwo4aIHd2Q. Would love any feedback you all have!

@grzesiek @tmaczukin @fabiopitino @matteeyah

I've left one suggestion and one question related to the API design.

I'll leave reviewing the code to ones that have more knowledge and experience related to CI stack in Rails 🙂

Thanks again for the feedback @tmaczukin! I've updated the MR with those changes and recorded a new demo https://youtu.be/eK3FUskHfdo. I also created a simple ruby script that can be used to make things work on the Runner side.

I'm still looking to get some more eyes on !75695 (closed), so if there are folks that would have interest in some pairing or helping out with a review, please tag them in this issue. Any help would be greatly appreciated, thanks!

/cc @grzesiek @fabiopitino

changed the description

created merge request !77886 (merged) to address this issue

Proposal for adding binary file support to CI variables

Problem

Proposal

Data Model

Upload API

Get API

Download API

Runner

Designs

Child items 0

Activity

Data Model

Upload API

Retrieval API

Runner