Analyze job duration and adjust queue weights based on execution time
## Problem
Queue weights should consider job execution time to prevent long-running jobs from starving other queues. Currently, we don't have a systematic approach to factoring job duration into weight assignments.
**Potential issues**:
- Long-running jobs with high weights can monopolize worker threads
- Quick jobs with low weights may experience unnecessary delays
- No documented relationship between job duration and appropriate weight
## Proposal
Analyze job execution times across all queues and adjust weights to balance throughput and fairness.
### Analysis Needed
For each queue, gather metrics on:
1. **Job duration** (P50, P95, P99 percentiles)
2. **Queue depth** during normal operations
3. **Job frequency** (jobs per hour/day)
4. **Failure rates** and retry patterns
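The duration percentiles above could be gathered with a small Sidekiq server middleware; this is a sketch only (the `DurationRecorder` name and in-memory storage are illustrative — production code would emit to StatsD/Prometheus rather than keep samples in memory):

```ruby
# Illustrative middleware sketch: records per-queue job durations so that
# P50/P95/P99 can be computed later. Not a production implementation.
class DurationRecorder
  def self.durations
    @durations ||= Hash.new { |h, k| h[k] = [] }
  end

  # Sidekiq server middleware interface: call(worker, job, queue) { ... }
  def call(_worker, _job, queue)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    yield
  ensure
    self.class.durations[queue] << Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
  end

  # Nearest-rank percentile over the recorded durations for a queue.
  def self.percentile(queue, pct)
    sorted = durations[queue].sort
    return nil if sorted.empty?
    sorted[((pct / 100.0) * sorted.length).ceil - 1]
  end
end
```

Registered via `config.server_middleware` in the Sidekiq initializer, this would collect enough samples to answer the percentile questions per queue.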
### Queues to Investigate
**Potentially long-running (may need lower weights)**:
- `usage_billing` (weight 2): ClickHouse operations, data processing
- `Billing::Usage::ConsumptionJob`
- `Billing::Usage::EnrichmentJob`
- `ExportChDataToS3Job`
- `salesforce` (weight 4): External API calls with potential timeouts
- `Salesforce::CreateOpportunityJob`
- `Salesforce::CreateQuoteForReconciliationJob`
- `zuora` (weight 4): Complex synchronization operations
- `Zuora::RefreshLocalSubscriptionsJob`
- `Zuora::SyncResourceJob`
**Potentially quick (could have higher weights)**:
- `mailers` (weight 2): Email delivery (usually fast)
- `expiration` (weight 3): Simple status updates
- `health_check` (weight 4): Quick health checks
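For reference, Sidekiq declares weighted queues as `[name, weight]` pairs in `sidekiq.yml`; a sketch of the current assignments listed above (the exact file layout in this repo may differ):

```yaml
:queues:
  - [salesforce, 4]
  - [zuora, 4]
  - [health_check, 4]
  - [expiration, 3]
  - [usage_billing, 2]
  - [mailers, 2]
```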
### Weight Assignment Guidelines
Based on the analysis, establish guidelines such as:
**Quick jobs (< 1 second average)**:
- Can have higher weights (7-10) without risking starvation of other queues
- Examples: Health checks, simple notifications, status updates
**Medium jobs (1-10 seconds average)**:
- Moderate weights (4-6) appropriate
- Examples: API calls, database operations, email sending
**Long jobs (10-30 seconds average)**:
- Lower weights (2-3) to prevent starvation
- Examples: Bulk data processing, complex synchronization, report generation
**Very long jobs (> 30 seconds average)**:
- Lowest weights (1-2) or consider breaking into smaller jobs
- Examples: Large data exports, comprehensive audits
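The tiers above can be sketched as a lookup from average duration to a suggested weight range; this helper is illustrative only, since real assignments should also weigh business priority:

```ruby
# Illustrative sketch: map average job duration (seconds) to the suggested
# weight range from the guidelines. Business priority can override this.
def suggested_weight_range(avg_seconds)
  case avg_seconds
  when 0...1   then 7..10 # quick jobs
  when 1...10  then 4..6  # medium jobs
  when 10...30 then 2..3  # long jobs
  else              1..2  # very long: also consider splitting the job
  end
end
```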
## Implementation Steps
1. **Gather production metrics** (last 30 days):
```ruby
# Sketch: current depth and enqueue-to-start latency per queue via the
# Sidekiq API. Duration percentiles require APM data or instrumentation.
require "sidekiq/api"

Sidekiq::Queue.all.each do |queue|
  puts format("%-20s size=%-6d latency=%.1fs", queue.name, queue.size, queue.latency)
end
puts "total processed: #{Sidekiq::Stats.new.processed}"
```
2. **Analyze patterns**:
- Identify queues with high variance in job duration
- Find queues where long jobs block quick jobs
- Look for correlation between queue depth and job duration
3. **Propose weight adjustments**:
- Document current vs. proposed weights
- Explain rationale based on metrics
- Consider business priority alongside duration
4. **Test in staging**:
- Simulate production load
- Measure impact on queue latency
- Verify no unintended consequences
5. **Monitor after deployment**:
- Track queue depth changes
- Monitor job latency (time from enqueue to start of execution)
- Watch for customer-reported issues
6. **Document findings**:
- Create guidelines for future queue weight assignments
- Include typical job durations for each queue
- Establish process for periodic review
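The variance check in step 2 (identifying queues with high variance in job duration) can be sketched with a coefficient-of-variation helper; the name is illustrative:

```ruby
# Illustrative helper: coefficient of variation (stddev / mean) of sampled
# job durations. A high CV suggests a queue mixes quick and long jobs.
# Assumes durations are positive; returns 0.0 for an empty sample.
def coefficient_of_variation(durations)
  return 0.0 if durations.empty?
  mean = durations.sum.to_f / durations.size
  variance = durations.sum { |d| (d - mean)**2 } / durations.size
  Math.sqrt(variance) / mean
end
```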
## Success Criteria
- All queues have documented average job durations
- Weight assignments consider both business priority and execution time
- No queue experiences starvation due to long-running jobs in higher-priority queues
- Clear guidelines exist for assigning weights to new queues
## Related
- Parent epic: gitlab-org&19587
- Related: #14268 (weight granularity)
- Related: #14270 (user-facing vs internal)