Evaluate hybrid Sidekiq + Solid Queue approach for critical job reliability
Summary
Instead of adding sidekiq-reliable-fetch to handle job loss during deployments, evaluate running Solid Queue alongside Sidekiq for critical jobs only.
Problem
Job loss during deployments is currently being addressed in !13833 by adding the sidekiq-reliable-fetch gem. That approach requires:
- Vendored gem (Sidekiq 6.x compatibility)
- Background cleanup processes
- Working queue management
- Manual recovery procedures
Proposed Solution
Run Solid Queue and Sidekiq in parallel:
- Add Solid Queue gem to the project
- Identify critical jobs that need reliability (e.g., ZuoraCallbackJob, UpdateGitlabPlanInfoJob)
- Move only those jobs to Solid Queue queues
- Keep remaining jobs on Sidekiq
- No migration of existing job data needed
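The routing decision in the list above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `CRITICAL_JOBS` and `adapter_for` are hypothetical names, and only the two job class names come from this issue.

```ruby
require "set"

# Hypothetical allow-list of critical job class names.
# The two entries are the examples named in this issue; the real list
# would come out of the "identify critical jobs" step.
CRITICAL_JOBS = Set["ZuoraCallbackJob", "UpdateGitlabPlanInfoJob"].freeze

# Returns the name of the backend a job should run on:
# :solid_queue for critical jobs, :sidekiq for everything else.
def adapter_for(job_class_name)
  CRITICAL_JOBS.include?(job_class_name.to_s) ? :solid_queue : :sidekiq
end

puts adapter_for("ZuoraCallbackJob")
puts adapter_for("SomeOtherJob") # hypothetical non-critical job name
```

If the critical jobs are Active Job classes, the result could feed a per-class `self.queue_adapter = ...` setting. Whether these jobs are Active Job classes or native Sidekiq workers is an assumption that would need checking first.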
Advantages
- Lower risk: Only critical jobs use new system
- Gradual migration path: Move more jobs over time if successful
- No job data migration required
- Easy rollback if issues arise
- Simpler than a full Solid Queue migration
Disadvantages
- Operational complexity: Managing two job systems
- Still requires Redis for Sidekiq
- Temporary state: eventually we need to either migrate fully to Solid Queue or commit to the reliable-fetch approach
Next Steps
- Evaluate Solid Queue's reliability guarantees vs sidekiq-reliable-fetch
- Identify which jobs are critical and need reliability
- Prototype running both systems in parallel
- Compare operational overhead and reliability outcomes
- Decide on long-term strategy (full Solid Queue migration vs reliable-fetch)
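For the reliability comparison, one possible measure is how many jobs enqueued around a deployment never complete. The sketch below assumes the prototype can collect enqueued and completed job IDs for each system; `job_loss_report` is a hypothetical helper and the IDs are made-up sample data, not measurements.

```ruby
require "set"

# Hypothetical helper: compares the job IDs enqueued around a deployment
# with the job IDs that eventually completed, and reports what is missing.
def job_loss_report(enqueued_ids, completed_ids)
  enqueued = enqueued_ids.to_set
  lost = enqueued - completed_ids.to_set
  rate = enqueued.empty? ? 0.0 : lost.size.to_f / enqueued.size
  { lost: lost.to_a.sort, loss_rate: rate }
end

# Made-up sample data for illustration only.
report = job_loss_report(%w[a b c d], %w[a c d])
puts report.inspect
```

Running the same check against both Sidekiq and Solid Queue during a test deployment would give a like-for-like number to compare.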