Vault integration for CI/CD proof-of-concept

As part of the epic on Variables and secrets management, we want to build a proof-of-concept integration with HashiCorp Vault as the first step towards a first-class integration with Vault.

Use Case(s)

For this proof of concept we're going to cover at most two use cases for Vault and GitLab CI integration:

1) Deploy to AWS

Authenticate to Vault - Ideally using the TLS Certificate auth method.
Request time-boxed AWS credentials, using either the vault CLI or the API (using the API would also teach us how authentication works).
Deploy code to some AWS service.
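
To make the second step more concrete, here is a minimal sketch in Go of how a response from Vault's AWS secrets engine could be turned into the environment variables that the AWS CLI and SDKs read. It is not part of the PoC: the helper names (credsPath, awsEnv), the "deploy" role and the sample values are assumptions made for illustration, and the sketch deliberately leaves out authentication and the HTTP call itself.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// awsCredsResponse mirrors the parts of the AWS secrets engine response
// (GET /v1/<mount>/creds/<role>) that a deploy job would need.
type awsCredsResponse struct {
	LeaseDuration int `json:"lease_duration"` // lifetime in seconds
	Data          struct {
		AccessKey     string `json:"access_key"`
		SecretKey     string `json:"secret_key"`
		SecurityToken string `json:"security_token"` // null for IAM users
	} `json:"data"`
}

// credsPath builds the API path for a role. "aws" is the default mount
// point; the role name is whatever the Vault admin configured.
func credsPath(mount, role string) string {
	return fmt.Sprintf("/v1/%s/creds/%s", mount, role)
}

// awsEnv parses a response body and returns NAME=value pairs for the
// standard AWS environment variables, plus the credentials' lifetime.
func awsEnv(body []byte) ([]string, time.Duration, error) {
	var r awsCredsResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return nil, 0, err
	}
	if r.Data.AccessKey == "" || r.Data.SecretKey == "" {
		return nil, 0, fmt.Errorf("response contains no AWS credentials")
	}
	env := []string{
		"AWS_ACCESS_KEY_ID=" + r.Data.AccessKey,
		"AWS_SECRET_ACCESS_KEY=" + r.Data.SecretKey,
	}
	if r.Data.SecurityToken != "" {
		env = append(env, "AWS_SESSION_TOKEN="+r.Data.SecurityToken)
	}
	return env, time.Duration(r.LeaseDuration) * time.Second, nil
}

func main() {
	// Hypothetical response body; a real one comes from the Vault server.
	sample := []byte(`{"lease_duration":900,"data":{"access_key":"AKIAEXAMPLE","secret_key":"example-secret","security_token":null}}`)
	env, ttl, err := awsEnv(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(credsPath("aws", "deploy"), env, ttl)
}
```

The lease duration is returned alongside the variables because the credentials are time-boxed: a deploy that runs longer than the lease would need to request new ones.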

2) GitLab Infrastructure team use case

TODO: To be discussed directly with the infra team in relation to https://gitlab.com/groups/gitlab-com/gl-infra/-/epics/55

In Scope

Researching Vault
Discovery around authentication methods to Vault
Getting Vault secrets into a running gitlab-runner environment
Using the secrets to accomplish the use cases (or an example of the use cases) above

Out of Scope

Final decisions on integration implementation details
Installation of Vault by GitLab - see https://gitlab.com/gitlab-org/gitlab-ee/issues/9982
Final decision on how to translate GitLab concepts such as protected runners, protected branches/tags, etc. to the final Vault integration
How the final Vault integration will integrate with or replace existing GitLab secret variables.

Summary of PoC work

During the %11.11 and %12.0 release cycles we worked on the PoC for Vault integration. During this time we held two meetings and implemented one approach.

First meeting (2019-04-29) - PoC direction and possible future plans

Agenda and meeting notes (internal only)

During the first meeting we discussed which direction we should choose for the PoC. At that point we had an almost-ready Vault service for CI that can be used in CI tests when implementing things in GitLab and GitLab Runner.

After the meeting we defined a plan for the PoC and future improvements:

Configure the Vault connection, authentication and secrets in Runner's config.toml, request the secrets and turn them into variables for the job.

This approach lets us change only Runner (which simplifies the work), yet it still requires all the basic mechanisms to be implemented: installing Vault's library, handling the connection, handling authentication, requesting secrets, and creating and injecting variables. We can then reuse these mechanisms in any other approach to Vault integration, or when working on the extensions described below.
Extension: retrieve a short-lived token and pass it to the job as another variable. The user's job may then use Vault directly, with short-lived credentials that carry the given permissions.
Extension: use Vault to store secrets normally kept in config.toml, such as cache server credentials, the Runner token, the session secret, etc.
Extension: allow the Vault server and Vault secrets to be configured from .gitlab-ci.yml. This is the minimal requirement to make Vault usable on Shared Runners on GitLab.com.
Extension: allow the Vault server and Vault secrets to be configured from GitLab's UI at the Project, Group and Instance levels. This could be another way to enable Vault integration with Shared Runners on GitLab.com.
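
The base step of the plan (configure secrets, request them, turn them into job variables) could be sketched roughly as below. This is only an illustration in Go under stated assumptions: SecretMapping, SecretReader and JobVariables are hypothetical names, the fields do not reflect the config.toml syntax proposed in the MRs, and the Vault client is replaced by a plain function so the mapping logic stands on its own.

```go
package main

import (
	"fmt"
	"sort"
)

// SecretMapping is a hypothetical configuration entry: read Key from the
// secret stored at Path and expose it to the job as Variable.
type SecretMapping struct {
	Path     string
	Key      string
	Variable string
}

// SecretReader abstracts "read one secret from Vault". A real Runner would
// back it with an authenticated Vault client.
type SecretReader func(path string) (map[string]string, error)

// JobVariables resolves every mapping into a NAME=value job variable.
func JobVariables(mappings []SecretMapping, read SecretReader) ([]string, error) {
	cache := map[string]map[string]string{} // read each path only once
	vars := make([]string, 0, len(mappings))
	for _, m := range mappings {
		data, ok := cache[m.Path]
		if !ok {
			var err error
			if data, err = read(m.Path); err != nil {
				return nil, fmt.Errorf("reading %q: %v", m.Path, err)
			}
			cache[m.Path] = data
		}
		value, ok := data[m.Key]
		if !ok {
			return nil, fmt.Errorf("secret %q has no key %q", m.Path, m.Key)
		}
		vars = append(vars, m.Variable+"="+value)
	}
	sort.Strings(vars) // stable order, independent of config order
	return vars, nil
}

func main() {
	// In-memory stand-in for Vault, used only for this example.
	store := map[string]map[string]string{
		"secret/db": {"user": "app", "password": "example"},
	}
	read := func(path string) (map[string]string, error) {
		data, ok := store[path]
		if !ok {
			return nil, fmt.Errorf("no secret at %q", path)
		}
		return data, nil
	}
	vars, err := JobVariables([]SecretMapping{
		{Path: "secret/db", Key: "user", Variable: "DB_USER"},
		{Path: "secret/db", Key: "password", Variable: "DB_PASSWORD"},
	}, read)
	if err != nil {
		panic(err)
	}
	fmt.Println(vars)
}
```

Keeping the reader behind a function type is one way to let the same mapping code serve the later extensions, where the mapping would come from .gitlab-ci.yml or the GitLab UI instead of config.toml.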

We also mentioned a few other ways GitLab could integrate with Vault, including:

adding Vault to our Omnibus package and Cloud Native Charts
adding a UI to manage Vault secrets from within GitLab
contributing to the Vault project to add a GitLab authentication method

but we decided that these will not be part of the PoC.

PoC implementation

Over the %11.11 and %12.0 release cycles we started implementing the chosen PoC idea. It resulted in three connected MRs, each based on the previous one:

Add official Vault library for Go to our dependencies
Implement Vault client and several authentication methods

For the PoC we've decided to implement the token (the simplest), username & password (also simple) and TLS Certificates (more complex but also more flexible) authentication methods.
Implement secrets reading and injection mechanism.

For the PoC we've decided to support only the Key Value V1 and Key Value V2 secret engines.
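
The two Key Value versions differ in both the API path and the shape of the response, which is why they need separate handling. The sketch below, in Go, shows that difference using only the standard library; kvPath and kvData are hypothetical helper names and not taken from the MRs, which use the official Vault library instead.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// kvPath returns the HTTP API path for reading a secret. KV V2 inserts a
// "data" segment between the mount point and the secret path.
func kvPath(version int, mount, secret string) string {
	mount, secret = strings.Trim(mount, "/"), strings.Trim(secret, "/")
	if version == 2 {
		return "/v1/" + mount + "/data/" + secret
	}
	return "/v1/" + mount + "/" + secret
}

// kvData extracts the key/value pairs from a read response. KV V1 returns
// them directly under "data"; KV V2 nests them under "data.data", next to
// the version metadata.
func kvData(version int, body []byte) (map[string]interface{}, error) {
	var resp struct {
		Data map[string]interface{} `json:"data"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	if resp.Data == nil {
		return nil, fmt.Errorf("response has no data object")
	}
	if version != 2 {
		return resp.Data, nil
	}
	inner, ok := resp.Data["data"].(map[string]interface{})
	if !ok {
		return nil, fmt.Errorf("KV V2 response has no data.data object")
	}
	return inner, nil
}

func main() {
	// Hypothetical response bodies for the same secret in both engines.
	v1 := []byte(`{"data":{"password":"example"}}`)
	v2 := []byte(`{"data":{"data":{"password":"example"},"metadata":{"version":3}}}`)

	d1, _ := kvData(1, v1)
	d2, _ := kvData(2, v2)
	fmt.Println(kvPath(1, "secret", "app/db"), d1["password"])
	fmt.Println(kvPath(2, "secret", "app/db"), d2["password"])
}
```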

The implementation was done in a way that allows us to extend Runner with support for other authentication methods and secret engines in the future. It also proposes a configuration syntax for the config.toml file.
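
One way to get this kind of extensibility is a small interface plus a registry of factories, so that a new authentication method only needs a new entry. The Go sketch below illustrates the idea for the three PoC methods. All type names, the registry and the option keys ("mount", "role", and so on) are assumptions for illustration and do not describe the actual MRs; only the login paths and payload fields follow Vault's HTTP API.

```go
package main

import "fmt"

// AuthMethod describes how to log in to Vault. An empty path means no login
// call is needed (the token method uses its token directly).
type AuthMethod interface {
	LoginRequest() (path string, payload map[string]string)
}

type TokenAuth struct{ Token string }

func (TokenAuth) LoginRequest() (string, map[string]string) { return "", nil }

type UserpassAuth struct{ Mount, Username, Password string }

func (a UserpassAuth) LoginRequest() (string, map[string]string) {
	return "/v1/auth/" + a.Mount + "/login/" + a.Username,
		map[string]string{"password": a.Password}
}

// CertAuth relies on the client certificate presented during the TLS
// handshake; the payload only optionally names the certificate role.
type CertAuth struct{ Mount, Role string }

func (a CertAuth) LoginRequest() (string, map[string]string) {
	payload := map[string]string{}
	if a.Role != "" {
		payload["name"] = a.Role
	}
	return "/v1/auth/" + a.Mount + "/login", payload
}

func orDefault(value, def string) string {
	if value == "" {
		return def
	}
	return value
}

// authFactories is the extension point: supporting another method means
// adding one more entry here.
var authFactories = map[string]func(o map[string]string) AuthMethod{
	"token": func(o map[string]string) AuthMethod {
		return TokenAuth{Token: o["token"]}
	},
	"userpass": func(o map[string]string) AuthMethod {
		return UserpassAuth{orDefault(o["mount"], "userpass"), o["username"], o["password"]}
	},
	"cert": func(o map[string]string) AuthMethod {
		return CertAuth{orDefault(o["mount"], "cert"), o["role"]}
	},
}

// NewAuth builds the configured method or reports that it is unsupported.
func NewAuth(method string, o map[string]string) (AuthMethod, error) {
	factory, ok := authFactories[method]
	if !ok {
		return nil, fmt.Errorf("unsupported auth method %q", method)
	}
	return factory(o), nil
}

func main() {
	auth, err := NewAuth("userpass", map[string]string{"username": "ci", "password": "example"})
	if err != nil {
		panic(err)
	}
	path, payload := auth.LoginRequest()
	fmt.Println(path, payload)
}
```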

As part of the implementation we've added tests, including integration tests that use the Vault CI Service to simulate real usage.

Second meeting (2019-06-04) - PoC implementation review and decisions about next steps

Agenda and meeting notes (internal only)

During the second meeting we recapped what had been decided last time and then reviewed the implementation. A test project was created for this. It contains a job that simply prints two variables that are not defined by default and are configured as secrets retrieved from Vault.

The example project, including the content of the config.toml file used, can be found at https://gitlab.com/tmaczukin-test-projects/test-vault-integration. An example job that gets the secrets and turns them into job variables can be found at https://gitlab.com/tmaczukin-test-projects/test-vault-integration/-/jobs/224773960.

After seeing the real-life example, and after two other engineers had reviewed the MRs, we had a discussion in which several points were made:

The current implementation is good and very flexible, but the configuration is too complicated. Having the specific secrets defined in config.toml makes Vault harder to use in the first iterations, since one needs to manage a Runner with a specific configuration to get secrets into a job. At the same time, the config.toml file gets bloated with all of the structs that define how secrets are requested and turned into variables.

No final decision has been made (yet!), but we all agreed that we should not start with an implementation that forces the configuration to be set in config.toml. Instead, we should look at how the secret definitions can be set from GitLab from the very beginning.

For the connection and authentication part we agreed that it's OK to leave it on the GitLab Runner side and eventually add an option to configure it from the GitLab level as well.
There was an idea that for the KV secret engine we should allow an option to get all secrets instead of specifying each one explicitly. A counter-argument was raised immediately: with explicit declaration of secrets it's easier to track which job uses which secrets (e.g. with Vault auditing), while with a "give me all secrets" option such information is unavailable (since each job gets all secrets every time). The conclusion was that for the KV secret engine we should stay with explicit declaration.
While for KV it seems better to require an explicit declaration of each key, for some secrets there is no reason to require it. For example, someone who wants to use the AWS secret engine will, in 99% of cases, want to access both the Access Key and the Secret Key. The decision was that for each supported secret engine we should think about the best implementation and about when it's reasonable to provide all parts of the secret at once.
Another idea for simplifying the configuration and making users' lives easier was to remove the need to define the names of the variables in which the secrets are provided. The variables should be created automatically by Runner, based on the secret type. For example, for secrets taken from the KV store, the variable could follow a pattern like VAULT_[secret_name]_[key_name]. For the AWS store, the variables could follow a pattern like VAULT_[secret_name]_ACCESS_KEY/VAULT_[secret_name]_SECRET_KEY, since these are the only two values that can be taken from an AWS secret.

We haven't thought about a specific format for this yet, but the general direction should be to follow the convention over configuration rule when implementing the final version of the Vault integration. We should also handle each of the other secret engines individually when we decide to implement them in the future.
During the implementation and tests we found that Vault's Go library requires Go 1.9+, so any final implementation is currently blocked by our Upgrade Go to >1.8 issue. The decision was that we should prioritize the work on moving up from Go 1.8 in the next releases.
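
As an illustration of the automatic variable naming idea discussed above, the VAULT_[secret_name]_[key_name] and VAULT_[secret_name]_ACCESS_KEY/VAULT_[secret_name]_SECRET_KEY patterns could be generated as in the Go sketch below. The exact format was not decided, so the sanitizing rule (upper-case everything and replace any other character with an underscore) and the function names are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// envSegment upper-cases ASCII letters, keeps digits, and replaces every
// other character with "_". This rule is an assumption, not a decision.
func envSegment(s string) string {
	return strings.Map(func(r rune) rune {
		switch {
		case r >= 'a' && r <= 'z':
			return r - 'a' + 'A'
		case (r >= 'A' && r <= 'Z') || (r >= '0' && r <= '9'):
			return r
		default:
			return '_'
		}
	}, s)
}

// KVVariable names the variable for one key of a KV secret.
func KVVariable(secret, key string) string {
	return "VAULT_" + envSegment(secret) + "_" + envSegment(key)
}

// AWSVariables names the two variables created for an AWS secret.
func AWSVariables(secret string) (accessKey, secretKey string) {
	prefix := "VAULT_" + envSegment(secret)
	return prefix + "_ACCESS_KEY", prefix + "_SECRET_KEY"
}

func main() {
	fmt.Println(KVVariable("my-app/db", "password")) // VAULT_MY_APP_DB_PASSWORD
	access, secret := AWSVariables("deploy")
	fmt.Println(access, secret) // VAULT_DEPLOY_ACCESS_KEY VAULT_DEPLOY_SECRET_KEY
}
```

A rule like this can map different secret names to the same variable name (for example "my-app" and "my_app"), so a final implementation would probably need to detect and report such collisions.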

Final implementation

The decision still needs to be made, but the path that we're thinking about is:

Leave all of the mechanisms implemented in Runner.
Remove the configuration of secrets mapping entirely and leave only the Vault connection and Vault authentication configs (which is already a lot of options, and there will be more in the future). For now, part of the mechanisms would stay unused until the Rails side is done.
Implement the 1st extension: Runner retrieves a short-lived token and sends it to the job as a variable. Review, merge and release the Runner side of the integration.

With this we could ship an initial Vault integration without waiting for changes on the Rails side of GitLab (where the review process will be more complicated and take more time). We could then get feedback on how the configuration of authentication and communication with Vault works. To use Vault, users would need to request secrets manually from within the job script, but they would not need to pass the main credentials.
Separately, we could start working on the changes on the GitLab side: configuring the mapping of secrets to variables and sending it to Runner via an updated API. This would be the next iteration.
During the implementation we should keep all of the above concerns in mind (e.g. automatic variable naming).
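
The short-lived token step could look roughly like the Go sketch below: Runner, already authenticated with its main credentials, asks Vault's token create endpoint for a child token and hands only that token to the job. The helper names, the "ci-read" policy and the 15m TTL are assumptions for illustration. VAULT_TOKEN is used as the variable name because the vault CLI reads it, but the final name has not been decided.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tokenCreatePayload builds the body for POST /v1/auth/token/create,
// requesting a non-renewable token limited to the given policies and TTL.
func tokenCreatePayload(policies []string, ttl string) ([]byte, error) {
	return json.Marshal(map[string]interface{}{
		"policies":  policies,
		"ttl":       ttl,
		"renewable": false,
	})
}

// jobTokenVariable extracts auth.client_token from the create response and
// formats it as a job variable.
func jobTokenVariable(body []byte) (string, error) {
	var resp struct {
		Auth *struct {
			ClientToken string `json:"client_token"`
		} `json:"auth"`
	}
	if err := json.Unmarshal(body, &resp); err != nil {
		return "", err
	}
	if resp.Auth == nil || resp.Auth.ClientToken == "" {
		return "", fmt.Errorf("response contains no client token")
	}
	return "VAULT_TOKEN=" + resp.Auth.ClientToken, nil
}

func main() {
	payload, err := tokenCreatePayload([]string{"ci-read"}, "15m")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(payload))

	// Hypothetical response body; a real one comes from the Vault server.
	variable, err := jobTokenVariable([]byte(`{"auth":{"client_token":"s.example","lease_duration":900}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(variable)
}
```

With such a variable in place, the job script could call the vault CLI or the HTTP API directly, while the credentials configured in config.toml never leave Runner.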

Edited Jun 11, 2019 by Tomasz Maczukin