"drop job on secrets provider not found" causes regression for jobs requiring vault secrets when passing variables using dotenv files ('The secrets provider can not be found')

Summary

The "drop job on secrets provider not found" change (feature flag MR, FF issue, main MR) requires that:

  • secrets_provider? is true, which in turn requires that
  • hashicorp_vault_provider? || azure_key_vault_provider? is true, which means that
  • the provider's variables are present: VAULT_SERVER_URL in the case of HashiCorp Vault.

The change moves this validation to pipeline creation, rather than performing it when the job runs.

It's therefore no longer possible to supply this variable using a dotenv report from an earlier job in the pipeline, because at pipeline creation the dotenv doesn't exist yet.
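The timing problem can be sketched in a few lines of Ruby. This is a simplified illustration under stated assumptions, not GitLab's actual code: the JobSketch class and its variables hash are hypothetical, only the VAULT_SERVER_URL check described above is modelled, and the Azure check is a stub.

```ruby
# Simplified illustration only; JobSketch is hypothetical, not GitLab's model.
class JobSketch
  attr_reader :variables

  def initialize(variables = {})
    @variables = variables
  end

  # Mirrors the check described above: HashiCorp Vault counts as configured
  # only when VAULT_SERVER_URL has a non-empty value.
  def hashicorp_vault_provider?
    !variables.fetch('VAULT_SERVER_URL', '').to_s.empty?
  end

  # Stub: the Azure Key Vault variables are not modelled in this sketch.
  def azure_key_vault_provider?
    false
  end

  def secrets_provider?
    hashicorp_vault_provider? || azure_key_vault_provider?
  end
end

# At pipeline creation the dotenv report does not exist yet,
# so the job has no VAULT_SERVER_URL and validation fails.
at_creation = JobSketch.new({})

# After make_dotenv finishes, its dotenv report supplies the variable.
after_stage_one = JobSketch.new({ 'VAULT_SERVER_URL' => 'http://192.168.1.12' })

puts at_creation.secrets_provider?
puts after_stage_one.secrets_provider?
```

The same job passes or fails the check depending only on when the check runs, which is why retrying use_dotenv after make_dotenv completes succeeds.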

Steps to reproduce

  1. Simple reproduction CI code

    • Working Vault access is not required to reproduce this error.
    • For example, point the Vault address at a valid local web server; the job then fails with a runner system error.
    • The bug is in Rails, so reaching the runner error requires first getting through the Rails code.
    Minimal reproduction .gitlab-ci.yml:
    ---
    stages:
      - one
      - two
    
    make_dotenv:
      stage: one
      script:
        - echo 'VAULT_SERVER_URL=http://192.168.1.12' > vaultenv
      artifacts:
        reports:
          dotenv: vaultenv
    
    use_dotenv:
      stage: two
      secrets:
        SOME_SECRET:
          vault: foo/bar/password@secret
          file: false
      script:
        - echo 'hello world'
  2. In GitLab 16.5 and earlier, when the pipeline is created, make_dotenv starts and use_dotenv stays in the created state.

  3. Once the first stage completes, use_dotenv runs.

    • Note: to reproduce this, I did not provide a Vault server. The bug occurs in Rails when building the pipeline.
    • The output shows that on earlier versions, variables were passed using dotenv and the job attempted to use them.
    • I specified the IP address of a valid local NGINX server so that the runner could make an HTTP call.

    There has been a runner system failure, please try again

    Running with gitlab-runner 16.8.1 (a6097117)
      on xxx 4Yvq_4VE, system ID: s_10add9b5c8b1
    Resolving secrets 00:00
    Resolving secret "SOME_SECRET"...
    Using "vault" secret resolver...
    ERROR: Job failed (system failure): resolving secrets: initializing Vault service: preparing authenticated client: checking Vault server health: api error: status code 404: <html>
    <head><title>404 Not Found</title></head>
    <body>
    <center><h1>404 Not Found</h1></center>
    <hr><center>nginx/1.24.0</center>
    </body>
    </html>
  4. Run it in GitLab 16.6 or later.

    • The use_dotenv job fails immediately. At that point, make_dotenv isn't even on a runner yet.
  5. The failure reason shown is: The secrets provider can not be found
  6. Once the dotenv exists, the failed job can be re-run. Validation of the vault variable succeeds.

    • Customers don't want to have to retry every job that uses vault secrets.
    • In my reproduction, a runner system failure occurs as expected.

In 16.11 and later, the error reads:

The secrets provider can not be found. Check your CI/CD variables and try again.

Example Project

See CI snippet above.

What is the current bug behavior?

dotenv can no longer be used to supply VAULT_SERVER_URL to a job that requires Vault secrets, because validation occurs when the job is created, not when the job runs.

What is the expected correct behavior?

Validation should take into account all mechanisms in the product to supply variables.

Workaround

Define a dummy VAULT_SERVER_URL at the job level so that validation passes at pipeline creation. At runtime, variable precedence applies as usual and dotenv supplies the actual value.

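use_dotenv:
  stage: two
  variables:
    # Dummy value so validation passes; dotenv supplies the real URL at runtime.
    VAULT_SERVER_URL: 'http://127.0.0.1'
  secrets:
    SOME_SECRET:
      vault: foo/bar/password@secret
      file: false
  script:
    - echo 'hello world'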
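
Why the workaround works can be sketched as a hash merge. This is a minimal illustration, assuming (as the workaround relies on) that a dotenv value overrides a job-level value of the same name. The constants and method names below are hypothetical and are not part of GitLab.

```ruby
# Simplified illustration only; all names here are hypothetical.

# Value from the job's `variables:` block, known at pipeline creation.
JOB_LEVEL = { 'VAULT_SERVER_URL' => 'http://127.0.0.1' }.freeze

# Value from the make_dotenv report, known only after stage one completes.
DOTENV_REPORT = { 'VAULT_SERVER_URL' => 'http://192.168.1.12' }.freeze

# The second argument wins on key conflicts, modelling dotenv values
# taking precedence over job-level values.
def effective_variables(job_level, dotenv = {})
  job_level.merge(dotenv)
end

# Same presence check that validation is described as performing.
def provider_configured?(vars)
  !vars.fetch('VAULT_SERVER_URL', '').to_s.empty?
end

at_creation = effective_variables(JOB_LEVEL)
at_runtime  = effective_variables(JOB_LEVEL, DOTENV_REPORT)

puts provider_configured?(at_creation) # the dummy value satisfies validation
puts at_runtime['VAULT_SERVER_URL']    # the dotenv value is what the job uses
```

The dummy value only needs to be non-empty; it is never contacted as long as the dotenv report is present when the job runs.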

Relevant logs and/or screenshots

Output of checks

A customer reported the issue after upgrading from 16.3 to 16.7.

The feature flag was removed in 16.6.

Possible fixes

Edited Jun 26, 2024 by Ben Prescott (ex-GitLab)