Analyze job duration and adjust queue weights based on execution time
## Problem

Queue weights should account for job execution time so that long-running jobs do not starve other queues. Currently, we don't have a systematic approach to factoring job duration into weight assignments.

**Potential issues**:

- Long-running jobs with high weights can monopolize worker threads
- Quick jobs with low weights may experience unnecessary delays
- No documented relationship between job duration and appropriate weight

## Proposal

Analyze job execution times across all queues and adjust weights to balance throughput and fairness.

### Analysis Needed

For each queue, gather metrics on:

1. **Job duration** (P50, P95, P99 percentiles)
2. **Queue depth** during normal operations
3. **Job frequency** (jobs per hour/day)
4. **Failure rates** and retry patterns

### Queues to Investigate

**Potentially long-running (may need lower weights)**:

- `usage_billing` (weight 2): ClickHouse operations, data processing
  - `Billing::Usage::ConsumptionJob`
  - `Billing::Usage::EnrichmentJob`
  - `ExportChDataToS3Job`
- `salesforce` (weight 4): External API calls with potential timeouts
  - `Salesforce::CreateOpportunityJob`
  - `Salesforce::CreateQuoteForReconciliationJob`
- `zuora` (weight 4): Complex synchronization operations
  - `Zuora::RefreshLocalSubscriptionsJob`
  - `Zuora::SyncResourceJob`

**Potentially quick (could have higher weights)**:

- `mailers` (weight 2): Email delivery (usually fast)
- `expiration` (weight 3): Simple status updates
- `health_check` (weight 4): Quick health checks

### Weight Assignment Guidelines

Based on the analysis, establish guidelines like:

**Quick jobs (< 1 second average)**:
- Can have higher weights (7-10) without blocking
- Examples: Health checks, simple notifications, status updates

**Medium jobs (1-10 seconds average)**:
- Moderate weights (4-6) are appropriate
- Examples: API calls, database operations, email sending

**Long jobs (> 10 seconds average)**:
- Lower weights (2-3) to prevent starvation
- Examples: Bulk data processing, complex synchronization, report generation

**Very long jobs (> 30 seconds average)**:
- Lowest weights (1-2), or consider breaking into smaller jobs
- Examples: Large data exports, comprehensive audits

## Implementation Steps

1. **Gather production metrics** (last 30 days):

   ```ruby
   # Inspect queue depth and latency via the Sidekiq API.
   # Per-job duration percentiles require an APM tool or
   # Sidekiq's built-in metrics.
   require "sidekiq/api"

   Sidekiq::Queue.all.each do |queue|
     puts "#{queue.name}: depth=#{queue.size} latency=#{queue.latency}s"
   end
   ```

2. **Analyze patterns**:
   - Identify queues with high variance in job duration
   - Find queues where long jobs block quick jobs
   - Look for correlation between queue depth and job duration

3. **Propose weight adjustments**:
   - Document current vs. proposed weights
   - Explain the rationale based on metrics
   - Consider business priority alongside duration

4. **Test in staging**:
   - Simulate production load
   - Measure impact on queue latency
   - Verify no unintended consequences

5. **Monitor after deployment**:
   - Track queue depth changes
   - Monitor job latency (enqueue-to-execution time)
   - Watch for customer-reported issues

6. **Document findings**:
   - Create guidelines for future queue weight assignments
   - Include typical job durations for each queue
   - Establish a process for periodic review

## Success Criteria

- All queues have documented average job durations
- Weight assignments consider both business priority and execution time
- No queue experiences starvation due to long-running jobs in higher-priority queues
- Clear guidelines exist for assigning weights to new queues

## Related

- Parent epic: gitlab-org&19587
- Related: #14268 (weight granularity)
- Related: #14270 (user-facing vs internal)
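Once per-job durations are exported (from an APM tool or Sidekiq metrics), the P50/P95/P99 figures called for in the analysis can be computed with a small nearest-rank helper. This is an illustrative sketch, not part of Sidekiq; the `percentile` helper and sample data are hypothetical:

```ruby
# Nearest-rank percentile over a sample of job runtimes (in seconds).
# Returns nil for an empty sample.
def percentile(durations, pct)
  return nil if durations.empty?
  sorted = durations.sort
  rank = ((pct / 100.0) * sorted.length).ceil - 1
  sorted[rank.clamp(0, sorted.length - 1)]
end

# Hypothetical sample of runtimes for one queue.
durations = [0.4, 0.6, 1.2, 2.5, 3.1, 9.8, 14.0, 31.5]
p50 = percentile(durations, 50) # => 2.5
p95 = percentile(durations, 95) # => 31.5
p99 = percentile(durations, 99) # => 31.5
```

With these per-queue percentiles in hand, a high P95/P99 relative to P50 flags the high-variance queues that step 2 of the implementation plan asks us to identify.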