Create low-priority maintenance queue for cleanup and audit tasks
## Problem

Currently, maintenance and cleanup tasks are mixed with business-critical operations in various queues, particularly in the `cron` queue. These tasks include:

- Data cleanup operations
- Audit logging and verification
- Non-critical synchronization
- Archive operations
- Test data cleanup
- Weekly/monthly reports

**Issues**:

1. Maintenance tasks can delay business-critical operations
2. No clear separation between operational and maintenance work
3. Difficult to schedule maintenance during low-traffic periods
4. Can't easily throttle or pause maintenance work during incidents

## Proposal

Create a dedicated low-priority queue for maintenance, cleanup, and audit tasks that can run when system resources are available.

### New Queue: `maintenance` (weight 1-2)

**Purpose**: Non-urgent background tasks that improve system health but don't directly impact customers

**Characteristics**:

- Lowest priority (or second-lowest after action_mailbox queues)
- Can be paused during incidents without customer impact
- Ideal for running during off-peak hours
- Should not block any customer-facing operations

### Jobs to Move Here

**From `cron` queue**:

- `Quality::TestAccountCleanupCronJob` - Test data cleanup
- `Cron::Zuora::LocalCopyAuditJob` - Data consistency audits
- `Cron::ErrorMonitorings::WeeklyReportJob` - Weekly reporting
- `AuditProvisionsCronJob` - Provision auditing

**From other queues** (if applicable):

- Data archival jobs
- Log cleanup operations
- Stale record cleanup
- Database maintenance tasks
- Cache warming operations (non-critical)

**Future additions**:

- Any new audit or cleanup jobs
- Performance optimization tasks
- Data quality checks
- Metrics aggregation (non-real-time)

### Benefits

1. **Better resource utilization**: Maintenance runs when system has capacity
2. **Improved reliability**: Critical operations never blocked by cleanup tasks
3. **Easier incident management**: Can pause maintenance queue during incidents
4. **Clear separation**: Obvious distinction between operational and maintenance work
5. **Flexible scheduling**: Can adjust maintenance queue processing based on load

## Implementation Steps

1. **Identify all maintenance tasks**:
   - Audit current cron jobs
   - Search for cleanup/audit jobs in codebase
   - Categorize by urgency and customer impact

2. **Create base job class**:

   ```ruby
   # app/jobs/maintenance/base_job.rb
   module Maintenance
     class BaseJob < ApplicationJob
       queue_as :maintenance

       # Common configuration for maintenance jobs
       # - Lower retry attempts
       # - Longer timeouts acceptable
       # - Can be safely discarded if queue too deep
     end
   end
   ```

3. **Update job classes**:

   ```ruby
   # Example: Test cleanup
   class Quality::TestAccountCleanupCronJob < Maintenance::BaseJob
     def perform
       # Cleanup logic
     end
   end
   ```

4. **Update `config/sidekiq.yml`**:

   ```yaml
   :queues:
     # ... higher priority queues ...
     - [maintenance, 2] # or 1, depending on action_mailbox priority
     - [action_mailbox_routing, 1]
     - [action_mailbox_incineration, 1]
   ```

5. **Add queue management**:

   ```ruby
   # Ability to pause/resume maintenance queue
   # Useful during incidents or high-load periods
   module MaintenanceQueue
     def self.pause!
       # Pause processing
     end

     def self.resume!
       # Resume processing
     end
   end
   ```

6. **Document guidelines**:
   - When to use `maintenance` queue
   - How to pause/resume during incidents
   - Expected SLAs (can be hours or days)
   - Examples of maintenance vs. operational jobs

## Queue Assignment Guidelines

**Use `maintenance` queue for**:

- ✅ Data cleanup (old records, test data)
- ✅ Audit and verification tasks
- ✅ Non-critical synchronization
- ✅ Report generation (weekly, monthly)
- ✅ Archive operations
- ✅ Performance optimization tasks
- ✅ Data quality checks

**Do NOT use `maintenance` queue for**:

- ❌ Customer-facing operations
- ❌ Revenue-impacting tasks
- ❌ Time-sensitive notifications
- ❌ Real-time synchronization
- ❌ Security-critical operations

## Monitoring and Alerting

**Metrics to track**:

- Queue depth (alert if > 1000 jobs)
- Job age (alert if oldest job > 7 days)
- Failure rate (alert if > 10%)
- Processing rate (jobs per hour)

**Acceptable delays**:

- Hours: Cleanup tasks can wait
- Days: Audit tasks can be delayed
- Weeks: Historical reports can be very delayed

**Not acceptable**:

- Should not grow unbounded
- Should not fail repeatedly
- Should complete eventually (within weeks)

## Success Criteria

- All maintenance tasks identified and moved to new queue
- Maintenance queue has lowest priority (weight 1-2)
- Can pause/resume maintenance queue without impact
- Clear documentation for future maintenance jobs
- No customer-facing operations in maintenance queue

## Related

- Parent epic: gitlab-org&19587
- Related: #14269 (split cron queue)
- Related: #14273 (default queue audit)
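
## Appendix: alert threshold sketch

The alert thresholds listed under "Monitoring and Alerting" (queue depth > 1000, oldest job > 7 days, failure rate > 10%) could be evaluated with a small check like the one below. This is a minimal sketch in plain Ruby, not an existing part of the codebase: `QueueStats` and `MaintenanceQueueAlerts` are hypothetical names, and the sketch assumes the numbers are gathered elsewhere (for example from the queue backend or a metrics system) and passed in. It also assumes failure rate means failed jobs divided by all finished jobs, which the issue does not spell out.

```ruby
# Hypothetical snapshot of the maintenance queue; how these values are
# collected is outside the scope of this sketch.
QueueStats = Struct.new(
  :depth,                  # number of jobs currently enqueued
  :oldest_job_age_seconds, # age of the oldest enqueued job
  :succeeded,              # jobs finished successfully in the sampling window
  :failed,                 # jobs that failed in the sampling window
  keyword_init: true
)

module MaintenanceQueueAlerts
  # Thresholds taken from the "Metrics to track" list.
  MAX_DEPTH = 1000
  MAX_JOB_AGE_SECONDS = 7 * 24 * 60 * 60
  MAX_FAILURE_RATE = 0.10

  # Returns an array of symbols, one per threshold that is exceeded.
  # An empty array means no alert should fire.
  def self.check(stats)
    alerts = []
    alerts << :queue_depth if stats.depth > MAX_DEPTH
    alerts << :job_age if stats.oldest_job_age_seconds > MAX_JOB_AGE_SECONDS

    finished = stats.succeeded + stats.failed
    # Skip the failure-rate check when nothing has finished yet,
    # which also avoids dividing by zero.
    if finished > 0 && stats.failed.to_f / finished > MAX_FAILURE_RATE
      alerts << :failure_rate
    end

    alerts
  end
end

stats = QueueStats.new(depth: 1500, oldest_job_age_seconds: 3600, succeeded: 95, failed: 5)
MaintenanceQueueAlerts.check(stats) # => [:queue_depth]
```

Keeping the thresholds as named constants in one place would make it easy to adjust them once real traffic on the `maintenance` queue is observed. Processing rate is left out because the issue lists it as a metric to track without giving an alert threshold.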