Feature Request: Add Retry Configuration for HashiCorp Vault Secret Retrieval

Everyone can contribute. Help move this issue forward while earning points, leveling up and collecting rewards.

Close this issue

Problem Statement

When using GitLab CI/CD's native HashiCorp Vault integration with the secrets keyword, jobs can fail entirely due to intermittent network issues or temporary Vault server unavailability. Currently, there is no built-in retry mechanism for Vault authentication or secret retrieval operations, which means a single transient failure causes the entire pipeline to fail.

Common scenarios where this becomes problematic:

Temporary network connectivity issues between GitLab Runner and Vault server
Vault server experiencing brief periods of high load (HTTP 429 rate limiting)
Transient DNS resolution failures
Load balancer or proxy timeouts (HTTP 502/503/504)
Temporary Vault seal/unseal operations

Proposed Solution

Add retry configuration options to the Vault secrets integration that allow users to configure:

Number of retry attempts - How many times to retry before failing the job
Retry delay/backoff - Time to wait between retries (with optional exponential backoff)
Retryable status codes - Which HTTP status codes should trigger a retry

Suggested Implementation

Add new CI/CD variables to control retry behavior:

variables:
  VAULT_SERVER_URL: "https://vault.example.com:8200"
  VAULT_AUTH_ROLE: "myproject-staging"
  VAULT_RETRY_ATTEMPTS: "3"                    # Number of retry attempts (default: 0)
  VAULT_RETRY_DELAY: "2"                       # Initial delay in seconds (default: 1)
  VAULT_RETRY_BACKOFF: "exponential"           # Options: none, linear, exponential (default: none)
  VAULT_RETRY_STATUS_CODES: "429,502,503,504"  # Comma-separated list (default: 429,502,503,504)

Or alternatively, extend the secrets syntax:

job_with_secrets:
  id_tokens:
    VAULT_ID_TOKEN:
      aud: https://vault.example.com
  secrets:
    STAGING_DB_PASSWORD:
      vault: 
        path: myproject/staging/db/password@secret
        retry:
          attempts: 3
          delay: 2
          backoff: exponential
          status_codes: [429, 502, 503, 504]
  script:
    - access-staging-db.sh --token $STAGING_DB_PASSWORD

Benefits

Improved pipeline reliability: Reduces false failures from transient issues
Better user experience: Teams won't need to manually re-run failed jobs
Cost efficiency: Fewer wasted compute resources on jobs that fail for non-critical reasons
Production readiness: Makes the Vault integration more suitable for production workloads where reliability is critical
Industry standard: Retry logic with backoff is a common pattern for distributed systems and API integrations

Use Cases

High-traffic environments: Where Vault may occasionally rate-limit requests
Multi-region deployments: Where network latency/reliability varies
Enterprise environments: With complex network topologies (proxies, firewalls, load balancers)
Large-scale CI/CD: Running hundreds or thousands of concurrent jobs

Workarounds Currently Required

Users currently must implement custom retry logic in their job scripts:

script:
  - |
    for i in {1..3}; do
      if vault kv get -field=password secret/path && break
      sleep 2
    done

However, this approach:

Cannot be used with the native secrets: keyword
Requires manual JWT authentication handling
Duplicates code across multiple jobs
Is error-prone and harder to maintain

Related Features

This aligns with existing GitLab retry capabilities:

Jobs already support retry: keyword for general job failures
API clients typically implement retry logic for transient failures
Other secret managers (AWS Secrets Manager, Azure Key Vault) often include retry mechanisms in their SDKs

References

Edited Oct 27, 2025 by 🤖 GitLab Bot 🤖