OpenBao self-init fails when audit stream endpoint is unavailable during helm upgrade

Problem

When enabling OpenBao via helm upgrade, the self-initialization process fails to complete, resulting in "permission denied" errors when attempting to provision Secrets Manager for a project.

Root Cause (What We Know)

The issue occurs due to a timing problem during helm upgrade:

  1. When global.openbao.enabled: true is added, it changes the Rails configuration
  2. This triggers a rolling update of Webservice pods
  3. OpenBao pods come online before the new Webservice pods are ready
  4. Old Webservice pods (with global.openbao.enabled=false) reject audit stream requests with HTTP 500
  5. OpenBao's self-init process fails when the audit backend test fails
  6. The credential backend is never enabled, leaving OpenBao in an incomplete state
  7. Subsequent authentication attempts fail with "permission denied"

Evidence:

  • OpenBao logs show repeated "new audit backend failed test" errors with "status code not 2xx: 500"
  • No "credential backend enabled" message appears in logs
  • Database has only 21 records instead of 31 (incomplete self-init)
  • See: #582828 (comment 2935176465)
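The two log markers above can be checked mechanically. The following is a minimal sketch, assuming the messages appear in the logs exactly as quoted in this issue; the function name diagnose_openbao_selfinit is hypothetical and is not part of OpenBao or the chart.

```shell
#!/bin/sh
# Classify OpenBao log text read from stdin using the markers quoted in
# the Evidence list (assumed to match the real log lines verbatim).
# Prints one of:
#   complete   - a "credential backend enabled" message was found
#   incomplete - audit backend test failures were found, but no
#                "credential backend enabled" message
#   unknown    - neither marker was found
diagnose_openbao_selfinit() {
  logs=$(cat)
  if printf '%s\n' "$logs" | grep -q 'credential backend enabled'; then
    echo complete
  elif printf '%s\n' "$logs" | grep -q 'new audit backend failed test'; then
    echo incomplete
  else
    echo unknown
  fi
}
```

Possible usage, assuming the Deployment is named gitlab-openbao as in Option 2 and lives in the current namespace: kubectl logs deploy/gitlab-openbao | diagnose_openbao_selfinit. Note that kubectl logs on a Deployment reads a single pod, so each replica may need to be checked separately.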

What We Don't Know

Why doesn't OpenBao recover from audit backend test failures?

  • OpenBao retries the audit backend test but never completes self-init
  • The credential backend should eventually be enabled once the audit stream endpoint becomes available

Workarounds

Option 1: Disable Audit Streaming (Temporary)

Set openbao.config.audit.http.enabled: false in your values. This works, but audit streaming is lost for as long as the setting is in place.

Option 2: Scale Down/Clean/Scale Up

  1. Scale OpenBao down: kubectl scale deploy gitlab-openbao --replicas=0
  2. Clean the OpenBao table in the database: DELETE FROM openbao_kv_store;
  3. Scale OpenBao back up: kubectl scale deploy gitlab-openbao --replicas=2
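The three steps can be chained so that the table is only cleaned after the OpenBao pods are gone. This is a sketch under assumptions: it supposes the OpenBao database is PostgreSQL and reachable with psql, and the variables OPENBAO_POD_SELECTOR and OPENBAO_DB_URL are hypothetical placeholders that must be filled in for a real cluster. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of Option 2 as one function. DRY_RUN=1 (the default) only prints
# each command; set DRY_RUN=0 to execute them.
# Hypothetical names, not from the issue: OPENBAO_POD_SELECTOR, OPENBAO_DB_URL.
DRY_RUN="${DRY_RUN:-1}"
OPENBAO_POD_SELECTOR="${OPENBAO_POD_SELECTOR:-<openbao-pod-selector>}"
OPENBAO_DB_URL="${OPENBAO_DB_URL:-<openbao-db-url>}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"   # print the command instead of executing it
  else
    "$@"
  fi
}

reset_openbao() {
  # Each step runs only if the previous one succeeded.
  run kubectl scale deploy gitlab-openbao --replicas=0 &&
  # Wait until the OpenBao pods are deleted before touching the table.
  run kubectl wait --for=delete pod --selector="$OPENBAO_POD_SELECTOR" --timeout=120s &&
  run psql "$OPENBAO_DB_URL" -c 'DELETE FROM openbao_kv_store;' &&
  run kubectl scale deploy gitlab-openbao --replicas=2 &&
  run kubectl rollout status deploy gitlab-openbao
}
```

Calling reset_openbao with the defaults prints the five commands for review. Because the DELETE is destructive, it seems prudent to inspect the dry-run output and take a database backup before running with DRY_RUN=0.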

Option 3: Enable, then install

  1. Set global.openbao.enabled=true and upgrade the chart.
  2. Set global.openbao.enabled=true and openbao.install=true and upgrade the chart again.

This ensures that the Webservice pods can accept connections to the audit stream endpoint when the OpenBao pods come online.

See #582828 (comment 2937146435)
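The two upgrades of Option 3 might be scripted as follows. This is a sketch, not a tested procedure: the release name gitlab, the chart reference gitlab/gitlab, the values file name, and the Webservice Deployment name are all assumptions to adapt, and the values file is assumed not to set openbao.install itself. It defaults to a dry run that only prints the commands.

```shell
#!/bin/sh
# Sketch of Option 3 ("enable, then install"). DRY_RUN=1 (the default)
# only prints each command; set DRY_RUN=0 to execute them.
# Assumed, not from the issue: RELEASE, CHART, VALUES_FILE, WEBSERVICE_DEPLOY.
DRY_RUN="${DRY_RUN:-1}"
RELEASE="${RELEASE:-gitlab}"
CHART="${CHART:-gitlab/gitlab}"
VALUES_FILE="${VALUES_FILE:-values.yaml}"
WEBSERVICE_DEPLOY="${WEBSERVICE_DEPLOY:-<webservice-deployment>}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"   # print the command instead of executing it
  else
    "$@"
  fi
}

enable_then_install() {
  # Step 1: enable OpenBao support in Rails without installing OpenBao.
  run helm upgrade "$RELEASE" "$CHART" -f "$VALUES_FILE" \
    --set global.openbao.enabled=true &&
  # Wait for the Webservice rolling update to finish before step 2.
  run kubectl rollout status deploy "$WEBSERVICE_DEPLOY" &&
  # Step 2: install OpenBao now that the audit stream endpoint is served.
  run helm upgrade "$RELEASE" "$CHART" -f "$VALUES_FILE" \
    --set global.openbao.enabled=true --set openbao.install=true
}
```

The explicit wait between the two upgrades is an addition of this sketch; the issue only requires that the new Webservice pods are serving before the OpenBao pods start.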

Option 4: Enable HTTP audit device after install

Disable HTTP audit streaming during the initial OpenBao deployment, then enable it afterwards (see note #2941384091):

  1. Initial deployment: Set openbao.config.audit.http.enabled: false in your values:

    global:
      openbao:
        enabled: true
    
    openbao:
      install: true
      config:
        audit:
          http:
            enabled: false

  2. After OpenBao is healthy: Re-run helm upgrade with HTTP audit enabled:

    global:
      openbao:
        enabled: true
    
    openbao:
      install: true
      config:
        audit:
          http:
            enabled: true

This lets OpenBao complete self-initialization without depending on the audit stream endpoint, which is unavailable during the Webservice rolling update.

Proposed Solutions

Short-term

Improve the readiness probe to detect incomplete self-init (credential backend not enabled). However, restarting the pod won't fix an incomplete self-init on its own: the database also needs to be cleaned, as in Option 2.

Long-term

Fix OpenBao to properly recover from transient audit backend failures and complete self-init once the endpoint becomes available.

Environment

  • GitLab Chart: v9.6.1 and master
  • OpenBao Chart: v0.8.1 and v0.9.0
  • Deployment method: helm upgrade with OpenBao enabled

Edited by Fabien Catteau