Draft: Different queue handling
What does this MR do and why?
This change improves how a background job system manages the cleanup of expired build artifacts by switching from a dynamic partitioning system to a fixed one.
Key improvements:
- Fixed partition system: Instead of calculating partitions based on the number of running workers (which could change), the system now uses a fixed set of 10,000 partitions that never changes once deployed to production.
- Better resource management: The new system follows the reliable queue pattern. Redis locks with timeouts track which partitions are actively being processed, which prevents multiple workers from processing the same partition simultaneously.
- Crash recovery: Added logic to automatically detect and restore "lost" partitions. When a worker crashes unexpectedly, the scheduler now identifies the missing partitions and makes them available for processing again.
- Cleaner error handling: The system now properly releases resources (partitions and locks) even when errors occur during processing, using try/finally-style cleanup.
- Simplified logic: Removed complex calculations about occupied partitions and replaced them with a straightforward lock-based approach that's easier to understand and maintain.
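To illustrate why a fixed partition count matters, here is a minimal sketch. It assumes artifacts are mapped to partitions by taking the artifact ID modulo the partition count; the description does not state the actual mapping, so `partition_for` and `dynamic_partition_for` are hypothetical names used only for illustration.

```python
PARTITION_COUNT = 10_000  # fixed partition count named in this description


def partition_for(artifact_id: int, partition_count: int = PARTITION_COUNT) -> int:
    """Map an artifact to a partition (assumed mapping: simple modulo)."""
    return artifact_id % partition_count


def dynamic_partition_for(artifact_id: int, running_workers: int) -> int:
    """Old-style mapping where the partition count follows the worker count."""
    return artifact_id % running_workers


artifact_id = 123_456

# Fixed scheme: the partition depends only on the artifact, not on the fleet size.
fixed = partition_for(artifact_id)

# Dynamic scheme: the same artifact moves when a worker is added.
with_four_workers = dynamic_partition_for(artifact_id, 4)
with_five_workers = dynamic_partition_for(artifact_id, 5)
```

Under this assumption, an artifact stays in the same partition no matter how many workers are running, whereas the dynamic mapping reshuffles artifacts whenever the worker count changes.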
The overall effect is a more reliable and predictable system for cleaning up old build artifacts. It handles worker failures better and scales more consistently, because partition assignment no longer depends on the number of running workers.
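The claim, release, and recovery flow described above can be sketched roughly as follows. This is an in-memory stand-in, not the code in this MR: `PartitionQueue`, `process_one`, and the injectable `clock` are hypothetical, and a Redis-backed version would presumably use key expiry (for example `SET` with the `NX` and `EX` options) instead of the dictionary of expiry times used here.

```python
import time
from typing import Callable, Dict, List, Optional


class PartitionQueue:
    """In-memory stand-in for the Redis-backed partition queue (hypothetical)."""

    def __init__(self, partition_count: int, lock_ttl: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.partition_count = partition_count
        self.lock_ttl = lock_ttl
        self.clock = clock
        self.available: List[int] = list(range(partition_count))
        self.locks: Dict[int, float] = {}  # partition -> lock expiry time

    def claim(self) -> Optional[int]:
        """Take the next available partition and lock it with a timeout."""
        if not self.available:
            return None
        partition = self.available.pop(0)
        self.locks[partition] = self.clock() + self.lock_ttl
        return partition

    def release(self, partition: int) -> None:
        """Drop the lock and return the partition to the available queue."""
        self.locks.pop(partition, None)
        if partition not in self.available:
            self.available.append(partition)

    def restore_lost(self) -> List[int]:
        """Re-queue partitions that are neither available nor validly locked."""
        now = self.clock()
        # Expired locks disappear, as keys with a timeout would in Redis.
        self.locks = {p: exp for p, exp in self.locks.items() if exp > now}
        known = set(self.available) | set(self.locks)
        lost = [p for p in range(self.partition_count) if p not in known]
        self.available.extend(lost)
        return lost


def process_one(queue: PartitionQueue, handler: Callable[[int], None]) -> Optional[int]:
    """Process one partition, releasing it even if the handler raises."""
    partition = queue.claim()
    if partition is None:
        return None
    try:
        handler(partition)
    finally:
        queue.release(partition)
    return partition
```

In this sketch, a worker that crashes after `claim` never calls `release`, so its partition is missing from the queue. Once the lock timeout passes, the scheduler's `restore_lost` call notices the partition is neither queued nor locked and makes it available again.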
References
Screenshots or screen recordings
| Before | After |
|---|---|
How to set up and validate locally
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #582116
Edited by Daniel Prause