Review and document health_check queue priority
## Problem
The `health_check` queue currently has weight 4 (the highest weight, tied with critical customer-facing operations). Its jobs run every minute to monitor external service health (GitLab.com, Zuora) and manage maintenance mode.
**Questions to consider**:
1. Should health monitoring have the same priority as customer provisioning?
2. Does every-minute execution need to compete with customer-facing operations?
3. Could health checks run at slightly lower priority without impacting system reliability?
4. Are there scenarios where health checks could delay critical customer operations?
## Current Behavior
**Jobs in queue**:
- `HealthCheckCron::CheckGitlabJob` - Runs every minute
- `HealthCheckCron::CheckZuoraJob` - Runs every minute
**What they do**:
- Check if external services are reachable
- Enable maintenance mode if services are down
- Disable maintenance mode when services recover
- Pause/resume Sidekiq queues based on service health
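The behaviour listed above can be sketched in plain Ruby. This is a minimal illustration, not the code of the real jobs: `HealthCheckSketch`, its `probe` callable, and the `dependent_queues` list are hypothetical names, and the actual jobs may require several failed checks before toggling maintenance mode.

```ruby
# Hypothetical sketch; class and method names are illustrative only.
class HealthCheckSketch
  attr_reader :maintenance_mode, :paused_queues

  # probe: callable returning true when the external service is reachable.
  # dependent_queues: queues to pause while the service is down (assumed).
  def initialize(probe:, dependent_queues:)
    @probe = probe
    @dependent_queues = dependent_queues
    @maintenance_mode = false
    @paused_queues = []
  end

  # One run of the check, as a cron job would perform it every minute.
  def perform
    if service_up?
      @maintenance_mode = false
      @paused_queues = []
    else
      @maintenance_mode = true
      @paused_queues = @dependent_queues.dup
    end
    self
  end

  private

  def service_up?
    @probe.call
  rescue StandardError
    false # treat probe errors (timeouts, DNS failures) as "down"
  end
end

down = HealthCheckSketch.new(probe: -> { false }, dependent_queues: %w[zuora zuora_callback]).perform
puts "maintenance=#{down.maintenance_mode} paused=#{down.paused_queues.inspect}"
```

The point relevant to this issue is that each run is a single probe plus a state toggle, so the job itself should be short; what the priority discussion affects is how long the job waits in the queue before it runs.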
**Current priority**: Weight 4 (same as `gitlab`, `zuora`, `salesforce`, `zuora_callback`)
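To make "weight 4" concrete: with Sidekiq's weighted queue syntax, a weight is, as far as the Sidekiq documentation describes it, a relative polling frequency and not a strict rank. The sketch below estimates each queue's share of "checked first" as `weight / sum of weights`. Only the five weight-4 queues named in this issue are included; lower-weight queues are omitted, so the real shares would be somewhat smaller.

```ruby
# Expected share of "polled first" per queue under weighted random ordering.
# Assumes share = weight / total weight; queues not named in the issue are omitted.
def first_pick_share(weights)
  total = weights.values.sum.to_f
  weights.transform_values { |w| (w / total).round(3) }
end

current = {
  "health_check" => 4, "gitlab" => 4, "zuora" => 4,
  "salesforce" => 4, "zuora_callback" => 4
}

# What-if: health_check lowered by one step on the current scale.
lowered = current.merge("health_check" => 3)

puts "current: #{first_pick_share(current)}"
puts "lowered: #{first_pick_share(lowered)}"
```

Under these assumptions, dropping `health_check` from 4 to 3 shifts its share only modestly, which is worth keeping in mind when weighing the options: a lower weight reduces contention but does not guarantee that customer-facing jobs always go first.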
## Options to Consider
### Option 1: Keep Current Priority (Weight 4 or 8-10 in new scale)
**Rationale**:
- System stability is paramount
- Quick detection of outages is critical
- Maintenance mode prevents cascading failures
- Health checks are expected to be very fast (< 1 second; still to be confirmed with job duration data)
**Pros**:
- Fastest possible outage detection
- Immediate maintenance mode activation
- No risk of delayed health checks
**Cons**:
- Competes with customer-facing operations
- May not be necessary to check every minute with highest priority
- Could delay critical provisioning during high load
### Option 2: Slightly Lower Priority (Weight 3 or 6-7 in new scale)
**Rationale**:
- Health checks are monitoring, not customer-facing
- 1-minute frequency provides buffer for slight delays
- Still high priority, just not highest
- Allows critical customer operations to take precedence
**Pros**:
- Customer operations are less likely to wait behind health checks (queue weights set relative polling frequency, not strict ordering)
- Still runs frequently enough for quick outage detection
- More appropriate priority for monitoring vs. operations
**Cons**:
- Slightly slower outage detection (seconds, not immediate)
- Could delay maintenance mode activation
- May miss brief outages
### Option 3: Separate Critical vs. Routine Health Checks
**Rationale**:
- Some health checks are more critical than others
- Could have different frequencies and priorities
**Structure**:
- `health_check_critical` (weight 8): GitLab.com, Zuora (every minute)
- `health_check_routine` (weight 5): Other services (every 5 minutes)
**Pros**:
- Granular control over monitoring priorities
- Can optimize frequency per service
- Critical services monitored with highest priority
**Cons**:
- More complex configuration
- May be over-engineering for current needs
- Harder to maintain
## Recommendation Needed
We need input from the team on:
1. **Observed behavior**: Have health checks ever delayed customer operations?
2. **Outage scenarios**: How quickly do we need to detect outages?
3. **Maintenance mode**: How critical is immediate activation?
4. **Job duration**: Confirm health checks are consistently fast (< 1 second)
5. **Frequency**: Is every-minute checking necessary, or could we reduce to every 2-3 minutes?
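For the outage-detection and frequency questions, a rough worst-case estimate can help frame the discussion. The sketch below assumes that a single failed check is enough to trigger maintenance mode and that an outage starts just after a check has run; the 5-second queue wait and 1-second job duration are placeholder values, not measurements.

```ruby
# Worst-case seconds from outage start to detection.
# Assumption: one failed check triggers maintenance mode (not confirmed here).
def worst_case_detection_seconds(interval_s:, queue_wait_s:, job_duration_s:)
  interval_s + queue_wait_s + job_duration_s
end

[60, 120, 180].each do |interval|
  worst = worst_case_detection_seconds(
    interval_s: interval,
    queue_wait_s: 5,   # placeholder: time the job waits in the queue
    job_duration_s: 1  # placeholder: the "< 1 second" claim, rounded up
  )
  puts "every #{interval / 60} min -> detection within about #{worst} s"
end
```

If these placeholder numbers are roughly right, the check interval dominates detection time far more than a few seconds of queue wait, so the frequency decision likely matters more than the weight decision.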
## Implementation Steps
1. **Gather data**:
- Review health check job duration (P50, P95, P99)
- Analyze queue depth during peak times
- Check if health checks have ever been delayed
- Review past outage detection times
2. **Discuss with team**:
- SRE perspective on monitoring priorities
- Engineering perspective on customer impact
- Historical incidents related to health checks
3. **Make decision**:
- Document rationale for chosen priority
- Consider trade-offs between monitoring and operations
- Align with overall queue priority strategy
4. **Update configuration** (if needed):
```yaml
# config/sidekiq.yml
# Apply exactly one option; the alternatives are left commented out.
:queues:
  # Option 1: Keep highest priority
  - [health_check, 10]
  # Option 2: Slightly lower
  # - [health_check, 7]
  # Option 3: Split by criticality (weights as listed under Option 3)
  # - [health_check_critical, 8]
  # - [health_check_routine, 5]
```
5. **Monitor after change**:
- Track outage detection time
- Monitor maintenance mode activation speed
- Watch for any customer impact
6. **Document decision**:
- Why this priority was chosen
- Trade-offs considered
- Conditions that might warrant re-evaluation
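For the data-gathering step (P50, P95, P99 of job duration), a nearest-rank percentile over exported durations is usually enough. The sketch below is one way to compute it; the `durations` sample is made up for illustration and does not reflect real job timings.

```ruby
# Nearest-rank percentile over a list of durations in seconds.
def percentile(values, pct)
  raise ArgumentError, "values must not be empty" if values.empty?

  sorted = values.sort
  rank = (pct / 100.0 * sorted.size).ceil
  sorted[[rank, 1].max - 1]
end

# Made-up sample; replace with durations exported from logs or metrics.
durations = [0.12, 0.15, 0.11, 0.9, 0.14, 0.13, 0.16, 0.2, 0.18, 2.4]

[50, 95, 99].each do |pct|
  puts "P#{pct}: #{percentile(durations, pct)} s"
end
```

Looking at P95 and P99 rather than the median matters here, because a slow tail (for example a probe waiting on a timeout during an outage) is exactly the case in which a health check would occupy a worker for longer than the assumed "< 1 second".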
## Success Criteria
- Clear understanding of health check priority requirements
- Documented rationale for chosen priority
- No degradation in outage detection or system reliability
- Appropriate balance between monitoring and customer operations
## Related
- Parent epic: gitlab-org&19587
- Related: #14268 (weight granularity)
- Related: #14271 (job duration analysis)