Feature Request: Add Retry Configuration for HashiCorp Vault Secret Retrieval

Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.

Problem Statement

When using GitLab CI/CD's native HashiCorp Vault integration with the secrets keyword, jobs can fail entirely due to intermittent network issues or temporary Vault server unavailability. Currently, there is no built-in retry mechanism for Vault authentication or secret retrieval operations, which means a single transient failure causes the entire pipeline to fail.

Common scenarios where this becomes problematic:

  • Temporary network connectivity issues between GitLab Runner and Vault server
  • Vault server experiencing brief periods of high load (HTTP 429 rate limiting)
  • Transient DNS resolution failures
  • Load balancer or proxy timeouts (HTTP 502/503/504)
  • Temporary Vault seal/unseal operations

Proposed Solution

Add retry configuration options to the Vault secrets integration that allow users to configure:

  1. Number of retry attempts - How many times to retry before failing the job
  2. Retry delay/backoff - Time to wait between retries (with optional exponential backoff)
  3. Retryable status codes - Which HTTP status codes should trigger a retry

Suggested Implementation

Add new CI/CD variables to control retry behavior:

variables:
  VAULT_SERVER_URL: "https://vault.example.com:8200"
  VAULT_AUTH_ROLE: "myproject-staging"
  VAULT_RETRY_ATTEMPTS: "3"                    # Number of retry attempts (default: 0)
  VAULT_RETRY_DELAY: "2"                       # Initial delay in seconds (default: 1)
  VAULT_RETRY_BACKOFF: "exponential"           # Options: none, linear, exponential (default: none)
  VAULT_RETRY_STATUS_CODES: "429,502,503,504"  # Comma-separated list (default: 429,502,503,504)

Or alternatively, extend the secrets syntax:

job_with_secrets:
  id_tokens:
    VAULT_ID_TOKEN:
      aud: https://vault.example.com
  secrets:
    STAGING_DB_PASSWORD:
      vault: 
        path: myproject/staging/db/password@secret
        retry:
          attempts: 3
          delay: 2
          backoff: exponential
          status_codes: [429, 502, 503, 504]
  script:
    - access-staging-db.sh --token $STAGING_DB_PASSWORD

Benefits

  • Improved pipeline reliability: Reduces false failures from transient issues
  • Better user experience: Teams won't need to manually re-run failed jobs
  • Cost efficiency: Fewer wasted compute resources on jobs that fail for non-critical reasons
  • Production readiness: Makes the Vault integration more suitable for production workloads where reliability is critical
  • Industry standard: Retry logic with backoff is a common pattern for distributed systems and API integrations

Use Cases

  1. High-traffic environments: Where Vault may occasionally rate-limit requests
  2. Multi-region deployments: Where network latency/reliability varies
  3. Enterprise environments: With complex network topologies (proxies, firewalls, load balancers)
  4. Large-scale CI/CD: Running hundreds or thousands of concurrent jobs

Workarounds Currently Required

Users currently must implement custom retry logic in their job scripts:

script:
  - |
    for i in {1..3}; do
      if vault kv get -field=password secret/path && break
      sleep 2
    done

However, this approach:

  • Cannot be used with the native secrets: keyword
  • Requires manual JWT authentication handling
  • Duplicates code across multiple jobs
  • Is error-prone and harder to maintain

Related Features

This aligns with existing GitLab retry capabilities:

  • Jobs already support retry: keyword for general job failures
  • API clients typically implement retry logic for transient failures
  • Other secret managers (AWS Secrets Manager, Azure Key Vault) often include retry mechanisms in their SDKs

References

Edited by 🤖 GitLab Bot 🤖