Disable auth mount in OpenBao when Secrets Manager is disabled for a project
Problem Statement
When Secrets Manager is disabled for a project, the related secrets should become inaccessible immediately, and OpenBao resources should be cleaned up appropriately. Users should also be able to re-enable a disabled secrets manager.
Current State Analysis
- DISABLED status in enum and model
- State machine with disable event: active → disabled
- OpenBao client methods for cleanup operations
- Existing provision infrastructure
- No disable mutation exists
- No background jobs for disable/enable operations
- No re-enable capability in initialize flow
Architecture Overview
Design Decisions
Main Architectural Decision: Async Disable/Enable with Intermediate States
The core decision is to implement disable/enable operations as asynchronous background processes with intermediate states, mirroring the existing provisioning workflow. This means:
- Disable flow: active → disabling → disabled (via background job)
- Re-enable flow: disabled → enabling → active (via background job)
- Fresh setup flow: nil → provisioning → active (via background job)
- Cleanup scope: Only delete pipeline and user JWT role (minimal, reversible operation)
- Resource preservation: Keep OpenBao engines, policies, and secrets intact for fast re-enabling
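The three flows above can be read as a single transition table. The following is a minimal Ruby sketch, not the model's actual state machine: the event names and the way they map to transitions are assumptions for illustration (the source only names the disable and enable events), and next_status is a hypothetical helper.

```ruby
# Allowed status transitions, keyed by [current_status, event].
# Event names and their mapping to transitions are assumptions for illustration.
TRANSITIONS = {
  [nil, :provision]             => :provisioning,
  [:provisioning, :activate]    => :active,
  [:active, :initiate_disable]  => :disabling,
  [:disabling, :disable]        => :disabled,
  [:disabled, :initiate_enable] => :enabling,
  [:enabling, :enable]          => :active
}.freeze

# Returns the next status, or raises if the event is not allowed from the
# current status (for example, a second disable request while disabling).
def next_status(current, event)
  TRANSITIONS.fetch([current, event]) do
    raise ArgumentError, "cannot #{event} from #{current.inspect}"
  end
end

status = next_status(:active, :initiate_disable) # :disabling
status = next_status(status, :disable)           # :disabled
```

Because every request must start from a stable state, a duplicate request made while a job is still running has no matching entry and is rejected.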
Key Design Principle: Separate States for Different Operations
We use distinct intermediate states (provisioning, enabling, disabling) to clearly distinguish between different types of operations:
- provisioning: Full OpenBao setup from scratch (engines, auth, policies, JWT roles)
- enabling: Lightweight re-enable of disabled secrets manager (recreate JWT roles only)
- disabling: Lightweight disable of active secrets manager (delete JWT roles only)
This separation provides operational clarity and allows different services to handle different complexity levels appropriately.
This approach prioritizes consistency with existing patterns, reliability through retries, and user experience with immediate feedback.
Why Intermediate Statuses (provisioning, disabling, enabling)?
- User experience: Shows users that an operation is in progress and that they need to wait a few seconds
- Prevents duplicate operations: Users can't trigger multiple disable/enable operations simultaneously
- Distributed system coordination: Tracks state between GitLab DB and OpenBao operations
- Consistent pattern: Matches existing provisioning workflow that users understand
- Operation tracking: Provides visibility into current state for debugging/monitoring
Why Background Workers for Enable/Disable?
- Distributed system reliability: Multiple API calls combined with DB state tracking increase the risk of partial failure
- Auto-retry capability: Sidekiq retries can repair partial failures if operations are idempotent
- Performance: Non-blocking UI operations - user gets immediate feedback
- Scalability: Doesn't tie up web workers for external API calls
- Error handling: Proper logging and monitoring of failures
- Consistency: Same pattern as existing provision workflow
Service Separation
- ProvisionService: Full setup (engines, auth, policies, JWT roles)
- EnablingService: Lightweight re-enable (recreate JWT role only)
- DisablingService: Lightweight disable (delete JWT role only)
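To make the difference in scope concrete, here is a small sketch that models OpenBao state in memory. FakeOpenBao, JWT_ROLE, and the resource names are hypothetical stand-ins, not the real SecretsManagerClient or the real services; the point is only that disable and enable touch the JWT role while engines and policies stay in place.

```ruby
require 'set'

# In-memory stand-in for OpenBao state; NOT the real SecretsManagerClient.
FakeOpenBao = Struct.new(:engines, :policies, :jwt_roles) do
  def self.empty
    new(Set.new, Set.new, Set.new)
  end
end

JWT_ROLE = 'pipeline_and_user_role' # hypothetical role name

# Full setup: engine, policy, and JWT role.
def provision(bao)
  bao.engines << 'secrets_engine'
  bao.policies << 'pipeline_policy'
  bao.jwt_roles << JWT_ROLE
end

# Lightweight disable: remove only the JWT role, keep everything else.
def disable(bao)
  bao.jwt_roles.delete(JWT_ROLE)
end

# Lightweight re-enable: recreate only the JWT role.
def enable(bao)
  bao.jwt_roles << JWT_ROLE
end

bao = FakeOpenBao.empty
provision(bao)
disable(bao)
bao.jwt_roles.empty?             # true
bao.engines.size                 # 1, engines are untouched
enable(bao)
bao.jwt_roles.include?(JWT_ROLE) # true
```

Using sets keeps each operation safe to repeat, which matters once the operations run inside retried background jobs.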
Implementation Tasks
- Update Status Enum
  - Add DISABLING and ENABLING values to ProjectSecretsManagerStatusEnum
- Update Model Status and State Machine
  - Add disabling: 3 and enabling: 4 to STATUSES hash in ProjectSecretsManager
  - Add state transitions for disable and enable events
- Create Disable Components
  - Mutations::SecretsManagement::ProjectSecretsManagers::Disable - GraphQL mutation
  - SecretsManagement::ProjectSecretsManagers::InitiateDisableService - initiates disable process
  - SecretsManagement::DisableProjectSecretsManagerWorker - background worker
  - SecretsManagement::ProjectSecretsManagers::DisableService - performs actual cleanup
- Create Re-enable Components
  - SecretsManagement::EnableProjectSecretsManagerWorker - background worker
  - SecretsManagement::ProjectSecretsManagers::EnableService - recreates JWT role
- Update Existing Components
  - SecretsManagement::ProjectSecretsManagers::InitializeService - add logic to determine whether to call provision vs enable worker based on secrets manager state
  - SecretsManagement::SecretsManagerClient - add delete_jwt_role method
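The InitializeService change comes down to a dispatch on the current state. A rough sketch of that decision follows, assuming a nil status means the project has no secrets manager yet; worker_for_initialize is a hypothetical helper, the SecretsManagement:: namespace on the provision worker is assumed, and the workers are represented by their class names rather than real classes.

```ruby
PROVISION_WORKER = 'SecretsManagement::ProvisionProjectSecretsManagerWorker'
ENABLE_WORKER    = 'SecretsManagement::EnableProjectSecretsManagerWorker'

# Picks the worker to enqueue for an initialize request.
# status is nil when the project has no secrets manager record yet (assumption).
# Returns nil for every other status, meaning there is nothing to enqueue.
def worker_for_initialize(status)
  case status
  when nil       then PROVISION_WORKER
  when :disabled then ENABLE_WORKER
  end
end

worker_for_initialize(nil)       # provision worker: full setup
worker_for_initialize(:disabled) # enable worker: recreate JWT role only
worker_for_initialize(:active)   # nil: already active, nothing to do
```

How the service should respond in the nil case (an error, or a no-op) is left open here.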
Follow-up Issues
Edge Case: Stuck Intermediate States
Problem Scenario:
In rare cases, secrets managers can get stuck in intermediate states (provisioning, disabling, enabling) when:
- OpenBao API calls succeed but database update fails (network hiccup, database timeout, etc.)
- Worker process crashes after OpenBao operation but before status update
- Database transaction rollback after successful OpenBao operation
Technical Implications:
- Stuck in provisioning: OpenBao resources exist and work, but UI shows "Setting up..."
- Stuck in disabling: JWT role deleted (pipelines fail), but UI shows "Disabling..."
- Stuck in enabling: JWT role recreated (pipelines work), but UI shows "Enabling..."
UX Impact:
- User confusion: Status doesn't match actual functionality
- Feature appears broken: Users can't disable/enable despite backend working
- Pipeline failures: For stuck disabling state, CI jobs fail but UI suggests operation in progress
- Support burden: Users will report "broken" secrets manager
Required Follow-up Issue: Create a separate issue to address UX/UI handling of stuck intermediate states:
- Detection mechanism: How to identify genuinely stuck states vs normal processing
- User messaging: Clear indication when manual intervention is needed
- Retry mechanisms: Allow users to retry/force completion of stuck operations
- Monitoring/alerting: Detect and alert on stuck states automatically
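One possible detection mechanism, offered only as a starting point for the follow-up issue, is a time threshold: a record that has sat in an intermediate state far longer than the job normally takes is probably stuck. The 15-minute threshold, the stuck? helper, and the use of an updated_at timestamp are all assumptions, not part of the current design.

```ruby
INTERMEDIATE_STATUSES = %i[provisioning disabling enabling].freeze
STUCK_AFTER_SECONDS = 15 * 60 # assumed threshold; tune to the real retry schedule

# True when the record is in an intermediate status and has not been
# updated for longer than the threshold.
def stuck?(status, updated_at, now: Time.now, threshold: STUCK_AFTER_SECONDS)
  INTERMEDIATE_STATUSES.include?(status) && (now - updated_at) > threshold
end

now = Time.now
stuck?(:disabling, now - 3600, now: now) # true: an hour in disabling
stuck?(:disabling, now - 5, now: now)    # false: still normal processing
stuck?(:active, now - 3600, now: now)    # false: not an intermediate status
```

A threshold shorter than the total retry window would flag jobs that are still being retried, so it should be chosen with the worker's retry settings in mind.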
Temporary Admin/SRE Solution: For immediate incident response when Sidekiq retries are exhausted, admins can manually re-enqueue the appropriate worker:
- Stuck in disabling: Re-enqueue DisableProjectSecretsManagerWorker.perform_async(user_id, secrets_manager_id)
- Stuck in enabling: Re-enqueue EnableProjectSecretsManagerWorker.perform_async(user_id, secrets_manager_id)
- Stuck in provisioning: Re-enqueue ProvisionProjectSecretsManagerWorker.perform_async(user_id, secrets_manager_id)
This assumes the background services are idempotent and can safely retry the OpenBao operations.
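As an illustration of what idempotent means for the disable path, the sketch below simulates the crash scenario: a previous attempt deleted the JWT role but never updated the status, and the re-enqueued run must still finish. FakeBao, RoleNotFound, and disable! are hypothetical; whether the real OpenBao client raises on a missing role is not stated in the source, so the fake raises to model the less forgiving case.

```ruby
class RoleNotFound < StandardError; end

# In-memory stand-in for the OpenBao client; NOT the real SecretsManagerClient.
class FakeBao
  def initialize(roles)
    @roles = roles
  end

  # Raises when the role is already gone (assumed worst-case behavior).
  def delete_jwt_role(name)
    raise RoleNotFound, name unless @roles.delete(name)
  end
end

# Idempotent disable: a missing role counts as already deleted.
def disable!(bao, record, role)
  begin
    bao.delete_jwt_role(role)
  rescue RoleNotFound
    # A previous attempt already deleted the role; continue to the status update.
  end
  record[:status] = :disabled
end

bao = FakeBao.new(['pipeline_role'])
record = { status: :disabling }
bao.delete_jwt_role('pipeline_role')   # simulate: earlier run deleted the role, then crashed
disable!(bao, record, 'pipeline_role') # re-enqueued run still completes
record[:status]                        # :disabled
```

The enable and provision paths would need the mirror-image property: creating a role or engine that already exists must not fail the job.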