Disable auth mount in OpenBao when Secrets Manager is disabled for a project

Problem Statement

When Secrets Manager is disabled for a project, the related secrets should become inaccessible immediately, and OpenBao resources should be cleaned up appropriately. Users should also be able to re-enable a disabled secrets manager.

Current State Analysis

Already implemented:

  • DISABLED status in enum and model
  • State machine with disable event: active → disabled
  • OpenBao client methods for cleanup operations
  • Existing provision infrastructure

Missing functionality:

  • No disable mutation exists
  • No background jobs for disable/enable operations
  • No re-enable capability in initialize flow

Architecture Overview

Design Decisions

Main Architectural Decision: Async Disable/Enable with Intermediate States

The core decision is to implement disable/enable operations as asynchronous background processes with intermediate states, mirroring the existing provisioning workflow. This means:

  • Disable flow: active → disabling → disabled (via background job)
  • Re-enable flow: disabled → enabling → active (via background job)
  • Fresh setup flow: nil → provisioning → active (via background job)
  • Cleanup scope: Delete only the pipeline and user JWT roles (a minimal, reversible operation)
  • Resource preservation: Keep OpenBao engines, policies, and secrets intact for fast re-enabling
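The three flows above can be sketched as a transition table in plain Ruby. This is only an illustration of the allowed status changes, not the model's actual state machine definition. Every event name here is an assumption made for the example, and the nil → provisioning step is left out because it corresponds to creating the record rather than to a transition.

```ruby
class InvalidTransition < StandardError; end

# Allowed status changes, taken from the flows listed above.
# Event names are hypothetical and chosen for readability.
STATUS_TRANSITIONS = {
  initiate_disable: { from: :active,       to: :disabling },
  complete_disable: { from: :disabling,    to: :disabled },
  initiate_enable:  { from: :disabled,     to: :enabling },
  complete_enable:  { from: :enabling,     to: :active },
  activate:         { from: :provisioning, to: :active }
}.freeze

# Returns the status that follows `event`, or raises InvalidTransition
# when the event is unknown or not allowed from the current status.
def next_status(current, event)
  rule = STATUS_TRANSITIONS.fetch(event) do
    raise InvalidTransition, "unknown event #{event}"
  end

  unless rule[:from] == current
    raise InvalidTransition, "cannot #{event} from #{current}"
  end

  rule[:to]
end
```

With this table, next_status(:active, :initiate_disable) returns :disabling, while a second :initiate_disable from :disabling raises InvalidTransition. That is the property the intermediate states rely on to block duplicate requests.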

Key Design Principle: Separate States for Different Operations

We use distinct intermediate states (provisioning, enabling, disabling) to clearly distinguish between different types of operations:

  • provisioning: Full OpenBao setup from scratch (engines, auth, policies, JWT roles)
  • enabling: Lightweight re-enable of disabled secrets manager (recreate JWT roles only)
  • disabling: Lightweight disable of active secrets manager (delete JWT roles only)

This separation provides operational clarity and allows different services to handle different complexity levels appropriately.

This approach prioritizes consistency with existing patterns, reliability through retries, and user experience with immediate feedback.

Why Intermediate Statuses (provisioning, disabling, enabling)?

  • User experience: Shows users that an operation is in progress and that they need to wait a few seconds
  • Prevents duplicate operations: Users can't trigger multiple disable/enable operations simultaneously
  • Distributed system coordination: Tracks state between GitLab DB and OpenBao operations
  • Consistent pattern: Matches existing provisioning workflow that users understand
  • Operation tracking: Provides visibility into current state for debugging/monitoring

Why Background Workers for Enable/Disable?

  • Distributed system reliability: Multiple API calls combined with DB state tracking increase the risk of partial failure
  • Auto-retry capability: Sidekiq retries can repair partial failures if operations are idempotent
  • Performance: Non-blocking UI operations - user gets immediate feedback
  • Scalability: Doesn't tie up web workers for external API calls
  • Error handling: Proper logging and monitoring of failures
  • Consistency: Same pattern as existing provision workflow
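The retry argument above only holds if the work done inside the worker is idempotent. A minimal sketch of that shape follows, using in-memory stand-ins so it runs without Rails, Sidekiq, or OpenBao. SecretsManagerRecord, InMemoryRoleStore, perform_disable, and the role name are all hypothetical; the real worker and client will look different.

```ruby
require "set"

# Hypothetical stand-in for the ProjectSecretsManager record.
SecretsManagerRecord = Struct.new(:status)

# Hypothetical stand-in for the OpenBao client.
class InMemoryRoleStore
  def initialize(roles)
    @roles = Set.new(roles)
  end

  # Deleting a role that is already gone is a no-op, so retries are safe.
  def delete_jwt_role(name)
    @roles.delete(name)
  end

  def role?(name)
    @roles.include?(name)
  end
end

# Shape of a disable worker's perform: guard on the status, run the
# idempotent cleanup, then record the final status.
def perform_disable(record, store, role_name)
  return :skipped unless record.status == :disabling

  store.delete_jwt_role(role_name)
  record.status = :disabled
  :done
end
```

If a run deletes the role but fails before the status update, the record stays in disabling. The retry then passes the guard, the delete is a no-op, and the status update completes. A run against a record that is already disabled is skipped.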

Service Separation

  • ProvisionService: Full setup (engines, auth, policies, JWT roles)
  • EnablingService: Lightweight re-enable (recreate JWT role only)
  • DisablingService: Lightweight disable (delete JWT role only)

Implementation Tasks

  1. Update Status Enum
    • Add DISABLING and ENABLING values to ProjectSecretsManagerStatusEnum
  2. Update Model Status and State Machine
    • Add disabling: 3 and enabling: 4 to STATUSES hash in ProjectSecretsManager
    • Add state transitions for disable and enable events
  3. Create Disable Components
    • Mutations::SecretsManagement::ProjectSecretsManagers::Disable - GraphQL mutation
    • SecretsManagement::ProjectSecretsManagers::InitiateDisableService - initiates disable process
    • SecretsManagement::DisableProjectSecretsManagerWorker - background worker
    • SecretsManagement::ProjectSecretsManagers::DisableService - performs actual cleanup
  4. Create Re-enable Components
    • SecretsManagement::EnableProjectSecretsManagerWorker - background worker
    • SecretsManagement::ProjectSecretsManagers::EnableService - recreates JWT role
  5. Update Existing Components
    • SecretsManagement::ProjectSecretsManagers::InitializeService - add logic to determine whether to call provision vs enable worker based on secrets manager state
    • SecretsManagement::SecretsManagerClient - add delete_jwt_role method
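Tasks 2 and 5 can be sketched together: the extended status hash, and the decision InitializeService has to make between the provision and enable workers. The values 3 and 4 come from the task list above. The values 0 to 2 for the existing statuses are an assumption, and initialize_action with its return symbols is a hypothetical helper, not the service's real interface.

```ruby
# Status integers. disabling: 3 and enabling: 4 are from the task list;
# the first three values are assumed to match the existing hash.
SECRETS_MANAGER_STATUSES = {
  provisioning: 0,
  active: 1,
  disabled: 2,
  disabling: 3,
  enabling: 4
}.freeze

# Decides what the initialize flow should do for a project.
# `status` is nil when the project has no secrets manager record yet.
def initialize_action(status)
  case status
  when nil       then :enqueue_provision_worker # fresh setup
  when :disabled then :enqueue_enable_worker    # lightweight re-enable
  else                :reject                   # active, or an operation is in progress
  end
end
```

Rejecting every other status keeps the initialize flow from racing with a disable, enable, or provision job that is already running.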

Follow-up Issues

Edge Case: Stuck Intermediate States

Problem Scenario: In rare cases, secrets managers can get stuck in intermediate states (provisioning, disabling, enabling) when:

  • OpenBao API calls succeed but database update fails (network hiccup, database timeout, etc.)
  • Worker process crashes after OpenBao operation but before status update
  • Database transaction rollback after successful OpenBao operation

Technical Implications:

  • Stuck in provisioning: OpenBao resources exist and work, but UI shows "Setting up..."
  • Stuck in disabling: JWT role deleted (pipelines fail), but UI shows "Disabling..."
  • Stuck in enabling: JWT role recreated (pipelines work), but UI shows "Enabling..."

UX Impact:

  • User confusion: Status doesn't match actual functionality
  • Feature appears broken: Users can't disable/enable despite backend working
  • Pipeline failures: For stuck disabling state, CI jobs fail but UI suggests operation in progress
  • Support burden: Users will report "broken" secrets manager

Required Follow-up Issue: Create a separate issue to address UX/UI handling of stuck intermediate states:

  • Detection mechanism: How to identify genuinely stuck states vs normal processing
  • User messaging: Clear indication when manual intervention is needed
  • Retry mechanisms: Allow users to retry/force completion of stuck operations
  • Monitoring/alerting: Detect and alert on stuck states automatically
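One possible detection mechanism, offered only as a starting point for the follow-up issue, is a time threshold: a record counts as stuck when it has been in an intermediate status for longer than a worker, including its retries, could plausibly need. The 15-minute threshold and the status_changed_at timestamp are assumptions; nothing above says such a column exists.

```ruby
# Statuses that should only be held briefly while a worker runs.
INTERMEDIATE_STATUSES = %i[provisioning disabling enabling].freeze

# Assumed threshold; the follow-up issue would need to pick a real value.
STUCK_AFTER_SECONDS = 15 * 60

# `status_changed_at` is a hypothetical Time of the last status change.
# `now` is injectable so the check is easy to test.
def stuck?(status, status_changed_at, now: Time.now)
  INTERMEDIATE_STATUSES.include?(status) &&
    (now - status_changed_at) > STUCK_AFTER_SECONDS
end
```

A check like this could drive both the user messaging and the alerting items, since it separates normal processing (recent status change) from a state that is unlikely to resolve on its own.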

Temporary Admin/SRE Solution: For immediate incident response when Sidekiq retries are exhausted, admins can manually re-enqueue the appropriate worker:

  • Stuck in disabling: Re-enqueue DisableProjectSecretsManagerWorker.perform_async(user_id, secrets_manager_id)
  • Stuck in enabling: Re-enqueue EnableProjectSecretsManagerWorker.perform_async(user_id, secrets_manager_id)
  • Stuck in provisioning: Re-enqueue ProvisionProjectSecretsManagerWorker.perform_async(user_id, secrets_manager_id)

This assumes the background services are idempotent and can safely retry the OpenBao operations.
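The manual procedure above amounts to a lookup from the stuck status to the worker to re-enqueue. The sketch below builds the console command as a string instead of enqueuing anything, so it runs without Sidekiq. Worker names and arguments are copied from the list above; reenqueue_command is a hypothetical helper, and the worker names may need their SecretsManagement:: namespace prefix when run in a real console.

```ruby
# Worker to re-enqueue for each stuck intermediate status, as listed above.
WORKER_FOR_STUCK_STATUS = {
  disabling:    "DisableProjectSecretsManagerWorker",
  enabling:     "EnableProjectSecretsManagerWorker",
  provisioning: "ProvisionProjectSecretsManagerWorker"
}.freeze

# Builds the console command for a stuck record.
# Returns nil for stable statuses, where nothing should be re-enqueued.
def reenqueue_command(status, user_id, secrets_manager_id)
  worker = WORKER_FOR_STUCK_STATUS[status]
  return nil unless worker

  "#{worker}.perform_async(#{user_id}, #{secrets_manager_id})"
end
```

For example, reenqueue_command(:disabling, 1, 42) returns "DisableProjectSecretsManagerWorker.perform_async(1, 42)", and reenqueue_command(:active, 1, 42) returns nil.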

Edited by Erick Bajao