Review and document health_check queue priority
## Problem

The `health_check` queue currently has weight 4 (highest priority, tied with critical customer-facing operations). This queue runs every minute to monitor external service health (GitLab.com, Zuora) and manages maintenance mode.

**Questions to consider**:

1. Should health monitoring have the same priority as customer provisioning?
2. Does every-minute execution need to compete with customer-facing operations?
3. Could health checks run at slightly lower priority without impacting system reliability?
4. Are there scenarios where health checks could delay critical customer operations?

## Current Behavior

**Jobs in queue**:

- `HealthCheckCron::CheckGitlabJob` - Runs every minute
- `HealthCheckCron::CheckZuoraJob` - Runs every minute

**What they do**:

- Check if external services are reachable
- Enable maintenance mode if services are down
- Disable maintenance mode when services recover
- Pause/resume Sidekiq queues based on service health

**Current priority**: Weight 4 (same as `gitlab`, `zuora`, `salesforce`, `zuora_callback`)

## Options to Consider

### Option 1: Keep Current Priority (Weight 4 or 8-10 in new scale)

**Rationale**:

- System stability is paramount
- Quick detection of outages is critical
- Maintenance mode prevents cascading failures
- Health checks are very fast (< 1 second)

**Pros**:

- Fastest possible outage detection
- Immediate maintenance mode activation
- No risk of delayed health checks

**Cons**:

- Competes with customer-facing operations
- May not be necessary to check every minute with highest priority
- Could delay critical provisioning during high load

### Option 2: Slightly Lower Priority (Weight 3 or 6-7 in new scale)

**Rationale**:

- Health checks are monitoring, not customer-facing
- 1-minute frequency provides buffer for slight delays
- Still high priority, just not highest
- Allows critical customer operations to take precedence

**Pros**:

- Customer operations never delayed by health checks
- Still runs frequently enough for quick outage detection
- More appropriate priority for monitoring vs. operations

**Cons**:

- Slightly slower outage detection (seconds, not immediate)
- Could delay maintenance mode activation
- May miss brief outages

### Option 3: Separate Critical vs. Routine Health Checks

**Rationale**:

- Some health checks are more critical than others
- Could have different frequencies and priorities

**Structure**:

- `health_check_critical` (weight 8): GitLab.com, Zuora (every minute)
- `health_check_routine` (weight 5): Other services (every 5 minutes)

**Pros**:

- Granular control over monitoring priorities
- Can optimize frequency per service
- Critical services monitored with highest priority

**Cons**:

- More complex configuration
- May be over-engineering for current needs
- Harder to maintain

## Recommendation Needed

We need input from the team on:

1. **Observed behavior**: Have health checks ever delayed customer operations?
2. **Outage scenarios**: How quickly do we need to detect outages?
3. **Maintenance mode**: How critical is immediate activation?
4. **Job duration**: Confirm health checks are consistently fast (< 1 second)
5. **Frequency**: Is every-minute checking necessary, or could we reduce to every 2-3 minutes?

## Implementation Steps

1. **Gather data**:
   - Review health check job duration (P50, P95, P99)
   - Analyze queue depth during peak times
   - Check if health checks have ever been delayed
   - Review past outage detection times

2. **Discuss with team**:
   - SRE perspective on monitoring priorities
   - Engineering perspective on customer impact
   - Historical incidents related to health checks

3. **Make decision**:
   - Document rationale for chosen priority
   - Consider trade-offs between monitoring and operations
   - Align with overall queue priority strategy

4. **Update configuration** (if needed):

   ```yaml
   # config/sidekiq.yml
   :queues:
     # Option 1: Keep highest priority
     - [health_check, 10]

     # Option 2: Slightly lower
     - [health_check, 7]

     # Option 3: Split by criticality
     - [health_check_critical, 9]
     - [health_check_routine, 5]
   ```

5. **Monitor after change**:
   - Track outage detection time
   - Monitor maintenance mode activation speed
   - Watch for any customer impact

6. **Document decision**:
   - Why this priority was chosen
   - Trade-offs considered
   - Conditions that might warrant re-evaluation

## Success Criteria

- Clear understanding of health check priority requirements
- Documented rationale for chosen priority
- No degradation in outage detection or system reliability
- Appropriate balance between monitoring and customer operations

## Related

- Parent epic: gitlab-org&19587
- Related: #14268 (weight granularity)
- Related: #14271 (job duration analysis)
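
## Illustration: Relative Polling Share

To make the trade-off between Option 1 and Option 2 more concrete, here is a minimal sketch of how a weight change shifts each queue's share of polling. It assumes Sidekiq's weighted random queue ordering, where a queue's chance of being checked first is roughly proportional to its weight. It also assumes, for illustration only, that the five weight-4 queues named in this issue are the only queues configured; the real `config/sidekiq.yml` likely contains more, which would lower every share. `first_pick_share` is a hypothetical helper, not part of Sidekiq.

```ruby
# Hypothetical helper: approximate share of "checked first" polls per queue,
# assuming the chance is proportional to weight / sum of all weights.
def first_pick_share(weights)
  total = weights.values.sum.to_f
  weights.transform_values { |w| (w / total).round(3) }
end

# Assumption: only the weight-4 queues named in this issue are modeled.
current = {
  "health_check"   => 4,
  "gitlab"         => 4,
  "zuora"          => 4,
  "salesforce"     => 4,
  "zuora_callback" => 4
}

# Option 2 on the current scale: drop health_check to weight 3.
lowered = current.merge("health_check" => 3)

puts first_pick_share(current).inspect
puts first_pick_share(lowered).inspect
```

Under these assumptions, lowering `health_check` from 4 to 3 moves its share from 0.2 to about 0.158, while each customer-facing queue rises from 0.2 to about 0.211. The shift is small, which is worth weighing against the job duration data from step 1.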
issue