Feature Request: Add Retry Configuration for HashiCorp Vault Secret Retrieval
Problem Statement
When using GitLab CI/CD's native HashiCorp Vault integration with the secrets keyword, jobs can fail entirely due to intermittent network issues or temporary Vault server unavailability. Currently, there is no built-in retry mechanism for Vault authentication or secret retrieval operations, which means a single transient failure causes the entire pipeline to fail.
Common scenarios where this becomes problematic:
- Temporary network connectivity issues between GitLab Runner and Vault server
- Vault server experiencing brief periods of high load (HTTP 429 rate limiting)
- Transient DNS resolution failures
- Load balancer or proxy timeouts (HTTP 502/503/504)
- Temporary Vault seal/unseal operations
Proposed Solution
Add retry configuration options to the Vault secrets integration that allow users to configure:
- Number of retry attempts - How many times to retry before failing the job
- Retry delay/backoff - Time to wait between retries (with optional exponential backoff)
- Retryable status codes - Which HTTP status codes should trigger a retry
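As a rough illustration of how these three options could interact, here is a minimal shell sketch. The helper names `retry_delay` and `is_retryable` are hypothetical, and the meaning given to each backoff mode (constant, delay times attempt number, delay doubling per attempt) is an assumption rather than anything GitLab defines today.

```shell
# Hypothetical helpers; names and semantics are assumptions for illustration.

# retry_delay BACKOFF INITIAL_DELAY ATTEMPT
# Prints the number of seconds to wait after failed attempt number ATTEMPT (1-based).
retry_delay() {
  local backoff="$1"
  local delay="$2"
  local attempt="$3"
  case "$backoff" in
    none)        echo "$delay" ;;                               # constant delay
    linear)      echo $(( delay * attempt )) ;;                 # d, 2d, 3d, ...
    exponential) echo $(( delay * (1 << (attempt - 1)) )) ;;    # d, 2d, 4d, ...
    *)           echo "unknown backoff: $backoff" >&2; return 1 ;;
  esac
}

# is_retryable STATUS CODES
# Succeeds if STATUS appears in the comma-separated list CODES.
is_retryable() {
  local status="$1"
  local codes="$2"
  case ",$codes," in
    *",$status,"*) return 0 ;;
    *)             return 1 ;;
  esac
}
```

Under these assumptions, an initial delay of 2 seconds with exponential backoff would wait 2, 4, and then 8 seconds across three retries, and a 403 response would fail immediately because it is not in the list 429,502,503,504.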
Suggested Implementation
Add new CI/CD variables to control retry behavior:
    variables:
      VAULT_SERVER_URL: "https://vault.example.com:8200"
      VAULT_AUTH_ROLE: "myproject-staging"
      VAULT_RETRY_ATTEMPTS: "3"                    # Number of retry attempts (default: 0)
      VAULT_RETRY_DELAY: "2"                       # Initial delay in seconds (default: 1)
      VAULT_RETRY_BACKOFF: "exponential"           # Options: none, linear, exponential (default: none)
      VAULT_RETRY_STATUS_CODES: "429,502,503,504"  # Comma-separated list (default: 429,502,503,504)
Or alternatively, extend the secrets syntax:
    job_with_secrets:
      id_tokens:
        VAULT_ID_TOKEN:
          aud: https://vault.example.com
      secrets:
        STAGING_DB_PASSWORD:
          vault:
            path: myproject/staging/db/password@secret
            retry:
              attempts: 3
              delay: 2
              backoff: exponential
              status_codes: [429, 502, 503, 504]
      script:
        - access-staging-db.sh --token $STAGING_DB_PASSWORD
Benefits
- Improved pipeline reliability: Reduces false failures from transient issues
- Better user experience: Teams won't need to manually re-run failed jobs
- Cost efficiency: Fewer wasted compute resources on jobs that fail for non-critical reasons
- Production readiness: Makes the Vault integration more suitable for production workloads where reliability is critical
- Industry standard: Retry logic with backoff is a common pattern for distributed systems and API integrations
Use Cases
- High-traffic environments: Where Vault may occasionally rate-limit requests
- Multi-region deployments: Where network latency/reliability varies
- Enterprise environments: With complex network topologies (proxies, firewalls, load balancers)
- Large-scale CI/CD: Running hundreds or thousands of concurrent jobs
Workarounds Currently Required
Users currently must implement custom retry logic in their job scripts:
    script:
      - |
        # Try up to 3 times; stop at the first success
        for i in {1..3}; do
          vault kv get -field=password secret/path && break
          sleep 2
        done
However, this approach:
- Cannot be used with the native secrets: keyword
- Requires manual JWT authentication handling
- Duplicates code across multiple jobs
- Is error-prone and harder to maintain
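To reduce the duplication, the loop can be factored into a small helper that each job sources. The following is only a sketch: `with_retry` is a hypothetical name, its first argument counts total tries (not retries after the first failure), and the doubling delay is an assumed backoff policy. It does not remove the need to handle JWT authentication manually.

```shell
# with_retry MAX_TRIES INITIAL_DELAY COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND until it succeeds or MAX_TRIES is reached,
# doubling the delay (in seconds) after each failed try.
with_retry() {
  local max_tries="$1"
  local delay="$2"
  local try=1
  shift 2
  while [ "$try" -le "$max_tries" ]; do
    "$@" && return 0
    # Only sleep if another try is coming.
    if [ "$try" -lt "$max_tries" ]; then
      sleep "$delay"
      delay=$(( delay * 2 ))
    fi
    try=$(( try + 1 ))
  done
  # All tries failed: return non-zero so the job fails visibly.
  return 1
}
```

A job could then call `with_retry 3 2 vault kv get -field=password secret/path`. Unlike a bare loop that ends on `sleep`, the helper returns a non-zero status when every try fails, so the job does not continue without the secret.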
Related Features
This aligns with existing GitLab retry capabilities:
- Jobs already support the retry: keyword for general job failures
- API clients typically implement retry logic for transient failures
- Other secret managers (AWS Secrets Manager, Azure Key Vault) often include retry mechanisms in their SDKs