Create low-priority maintenance queue for cleanup and audit tasks
## Problem
Currently, maintenance and cleanup tasks are mixed with business-critical operations in various queues, particularly in the `cron` queue. These tasks include:
- Data cleanup operations
- Audit logging and verification
- Non-critical synchronization
- Archive operations
- Test data cleanup
- Weekly/monthly reports
**Issues**:
1. Maintenance tasks can delay business-critical operations
2. No clear separation between operational and maintenance work
3. Difficult to schedule maintenance during low-traffic periods
4. Can't easily throttle or pause maintenance work during incidents
## Proposal
Create a dedicated low-priority queue for maintenance, cleanup, and audit tasks that can run when system resources are available.
### New Queue: `maintenance` (weight 1-2)
**Purpose**: Non-urgent background tasks that improve system health but don't directly impact customers
**Characteristics**:
- Lowest priority (or second-lowest, just above the `action_mailbox_*` queues)
- Can be paused during incidents without customer impact
- Ideal for running during off-peak hours
- Should not block any customer-facing operations
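To make the effect of a low weight concrete, the sketch below simulates weighted queue polling: each queue is repeated in a pool as many times as its weight, the pool is shuffled, and duplicates are dropped to produce a polling order. This only approximates how a Sidekiq worker treats weighted queues. The `critical` and `default` queues and their weights are assumptions for illustration; only `maintenance` with weight 2 comes from this proposal.

```ruby
# Hypothetical weights; only `maintenance` (weight 2) comes from this proposal.
EXAMPLE_WEIGHTS = { "critical" => 10, "default" => 5, "maintenance" => 2 }.freeze

# Build one polling order: repeat each queue by its weight, shuffle, dedupe.
def polling_order(weights, rng)
  weights.flat_map { |name, weight| [name] * weight }.shuffle(random: rng).uniq
end

# Estimate how often each queue is checked first over many simulated polls.
def first_pick_share(weights, trials: 10_000, seed: 42)
  rng = Random.new(seed)
  counts = Hash.new(0)
  trials.times { counts[polling_order(weights, rng).first] += 1 }
  counts.transform_values { |count| count.fdiv(trials) }
end

shares = first_pick_share(EXAMPLE_WEIGHTS)
shares.each do |queue, share|
  puts format("%-12s checked first in %.1f%% of polls", queue, share * 100)
end
```

In this model a queue is checked first with probability equal to its weight divided by the sum of all weights, so `maintenance` would lead roughly 2 out of 17 polls here. It still gets picked up whenever the higher-weighted queues are empty.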
### Jobs to Move Here
**From `cron` queue**:
- `Quality::TestAccountCleanupCronJob` - Test data cleanup
- `Cron::Zuora::LocalCopyAuditJob` - Data consistency audits
- `Cron::ErrorMonitorings::WeeklyReportJob` - Weekly reporting
- `AuditProvisionsCronJob` - Provision auditing
**From other queues** (if applicable):
- Data archival jobs
- Log cleanup operations
- Stale record cleanup
- Database maintenance tasks
- Cache warming operations (non-critical)
**Future additions**:
- Any new audit or cleanup jobs
- Performance optimization tasks
- Data quality checks
- Metrics aggregation (non-real-time)
### Benefits
1. **Better resource utilization**: Maintenance runs when system has capacity
2. **Improved reliability**: Critical operations never blocked by cleanup tasks
3. **Easier incident management**: Can pause maintenance queue during incidents
4. **Clear separation**: Obvious distinction between operational and maintenance work
5. **Flexible scheduling**: Can adjust maintenance queue processing based on load
## Implementation Steps
1. **Identify all maintenance tasks**:
   - Audit current cron jobs
   - Search for cleanup/audit jobs in codebase
   - Categorize by urgency and customer impact
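A first pass over the codebase could be a simple name-based filter, sketched below. The list of name fragments is an assumed heuristic, and `Billing::ChargeCustomerJob` is a hypothetical job added only to show a non-match; every candidate would still need a manual review for urgency and customer impact.

```ruby
# Name fragments that suggest maintenance work (an assumed heuristic, not a rule).
MAINTENANCE_HINTS = /Cleanup|Audit|Report|Archive/

# Split job class names into maintenance candidates and everything else.
def partition_jobs(job_names)
  candidates, others = job_names.partition { |name| name.match?(MAINTENANCE_HINTS) }
  { candidates: candidates, others: others }
end

jobs = [
  "Quality::TestAccountCleanupCronJob",
  "Cron::Zuora::LocalCopyAuditJob",
  "Cron::ErrorMonitorings::WeeklyReportJob",
  "AuditProvisionsCronJob",
  "Billing::ChargeCustomerJob" # hypothetical business-critical job
]

result = partition_jobs(jobs)
puts "Review for maintenance queue: #{result[:candidates].join(', ')}"
puts "Leave in place: #{result[:others].join(', ')}"
```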
2. **Create base job class**:
```ruby
# app/jobs/maintenance/base_job.rb
module Maintenance
  class BaseJob < ApplicationJob
    queue_as :maintenance

    # Common configuration for maintenance jobs
    # - Lower retry attempts
    # - Longer timeouts acceptable
    # - Can be safely discarded if queue too deep
  end
end
```
3. **Update job classes**:
```ruby
# Example: Test cleanup
class Quality::TestAccountCleanupCronJob < Maintenance::BaseJob
  def perform
    # Cleanup logic
  end
end
```
4. **Update `config/sidekiq.yml`**:
```yaml
:queues:
  # ... higher priority queues ...
  - [maintenance, 2] # or 1, depending on action_mailbox priority
  - [action_mailbox_routing, 1]
  - [action_mailbox_incineration, 1]
```
5. **Add queue management**:
```ruby
# Ability to pause/resume maintenance queue
# Useful during incidents or high-load periods
module MaintenanceQueue
  def self.pause!
    # Pause processing
  end

  def self.resume!
    # Resume processing
  end
end
```
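One possible shape for the pause mechanism is a shared flag that maintenance jobs consult before doing any work. The sketch below is a standalone illustration, not the proposed implementation: `MaintenancePauseSwitch`, its key name, and `run_unless_paused` are hypothetical, and a plain Hash stands in for shared state such as Redis. How the flag is actually stored and checked would depend on the Sidekiq setup in use.

```ruby
# Hypothetical pause switch. `store` stands in for shared state such as Redis;
# a plain Hash is used here so the sketch runs on its own.
class MaintenancePauseSwitch
  KEY = "maintenance:paused"

  def initialize(store = {})
    @store = store
  end

  def pause!
    @store[KEY] = true
  end

  def resume!
    @store.delete(KEY)
  end

  def paused?
    @store.key?(KEY)
  end

  # Run the block unless paused; the return value tells the caller
  # whether the work ran or should be re-enqueued for later.
  def run_unless_paused
    return :deferred if paused?

    yield
    :performed
  end
end

switch = MaintenancePauseSwitch.new
switch.pause!
puts switch.run_unless_paused { puts "cleaning up" } # skipped while paused
switch.resume!
puts switch.run_unless_paused { puts "cleaning up" } # runs after resume
```

Returning a status instead of raising keeps a paused job from counting as a failure, which matters if the failure rate of the queue is monitored.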
6. **Document guidelines**:
   - When to use `maintenance` queue
   - How to pause/resume during incidents
   - Expected SLAs (can be hours or days)
   - Examples of maintenance vs. operational jobs
## Queue Assignment Guidelines
**Use `maintenance` queue for**:
- ✅ Data cleanup (old records, test data)
- ✅ Audit and verification tasks
- ✅ Non-critical synchronization
- ✅ Report generation (weekly, monthly)
- ✅ Archive operations
- ✅ Performance optimization tasks
- ✅ Data quality checks
**Do NOT use `maintenance` queue for**:
- ❌ Customer-facing operations
- ❌ Revenue-impacting tasks
- ❌ Time-sensitive notifications
- ❌ Real-time synchronization
- ❌ Security-critical operations
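The "Do NOT use" list can be read as a set of disqualifying traits. As a rough sketch, assuming each job under review has been tagged by hand with the traits that apply to it, a helper like the hypothetical `suggested_queue` below could make the decision explicit. The `:default` fallback is an assumption; in practice the job would stay on whatever operational queue it already uses.

```ruby
# Traits that rule a job out of the maintenance queue (from the "Do NOT use" list).
DISQUALIFYING_TRAITS = %i[
  customer_facing revenue_impacting time_sensitive real_time security_critical
].freeze

# Suggest a queue for a cleanup/audit-style job, given the traits that apply to it.
def suggested_queue(traits, fallback: :default)
  blockers = traits & DISQUALIFYING_TRAITS
  blockers.empty? ? :maintenance : fallback
end

puts suggested_queue([])                 # no disqualifying traits
puts suggested_queue(%i[time_sensitive]) # stays on an operational queue
```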
## Monitoring and Alerting
**Metrics to track**:
- Queue depth (alert if > 1000 jobs)
- Job age (alert if oldest job > 7 days)
- Failure rate (alert if > 10%)
- Processing rate (jobs per hour)
**Acceptable delays**:
- Hours: Cleanup tasks can wait
- Days: Audit tasks can be delayed
- Weeks: Historical reports can be very delayed
**Not acceptable**:
- Unbounded queue growth
- Repeated failures of the same job
- Jobs that never complete (everything should finish eventually, within weeks)
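The alert thresholds listed under "Metrics to track" could be checked with a small helper like the one below. This is a sketch under assumptions: the `stats` hash and its keys are hypothetical, and collecting the numbers from Sidekiq or a metrics system is left out.

```ruby
# Thresholds taken from "Metrics to track".
MAX_QUEUE_DEPTH  = 1000
MAX_JOB_AGE_DAYS = 7
MAX_FAILURE_RATE = 0.10

# `stats` is a hypothetical snapshot:
#   { depth:, oldest_job_age_days:, failed:, succeeded: }
def maintenance_alerts(stats)
  alerts = []
  alerts << :queue_too_deep if stats[:depth] > MAX_QUEUE_DEPTH
  alerts << :job_too_old if stats[:oldest_job_age_days] > MAX_JOB_AGE_DAYS

  total = stats[:failed] + stats[:succeeded]
  failure_rate = total.zero? ? 0.0 : stats[:failed].fdiv(total)
  alerts << :failure_rate_high if failure_rate > MAX_FAILURE_RATE
  alerts
end

snapshot = { depth: 1200, oldest_job_age_days: 3, failed: 5, succeeded: 95 }
p maintenance_alerts(snapshot)
```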
## Success Criteria
- All maintenance tasks identified and moved to new queue
- Maintenance queue has lowest priority (weight 1-2)
- Can pause/resume maintenance queue without customer impact
- Clear documentation for future maintenance jobs
- No customer-facing operations in maintenance queue
## Related
- Parent epic: gitlab-org&19587
- Related: #14269 (split cron queue)
- Related: #14273 (default queue audit)