Split-brain Prometheus: Sidekiq k8s migration is partitioning some application metrics between multiple prom servers
This is a placeholder.
As we move Sidekiq work into k8s, Sidekiq application metrics are being sharded between multiple Prometheus instances, and this may cause some alerts built on those metrics to fail to fire (or to fire falsely).
For example, the sidekiq_throttled_jobs_enqueued_without_dequeuing alert uses the following expression:
(
sum by (environment, queue, feature_category) (rate(sidekiq_enqueued_jobs_total{urgency="throttled"}[10m] offset 10m)) > 0
)
unless
(
sum by (environment, queue, feature_category) (rate(sidekiq_jobs_completion_seconds_count{urgency="throttled"}[20m])) > 0
or
sum by (environment, queue, feature_category) (rate(sidekiq_jobs_failed_total{urgency="throttled"}[20m])) > 0
or
sum by (environment, queue, feature_category) (rate(sidekiq_jobs_retried_total{urgency="throttled"}[20m])) > 0
or
sum by (environment, queue, feature_category) (avg_over_time(sidekiq_running_jobs{urgency="throttled"}[20m])) > 0
)
This works fine when all the metrics can be evaluated on a single Prometheus instance.
However, now that project_export has been migrated to k8s, the sidekiq_enqueued_jobs_total value is held on one prometheus server, while sidekiq_jobs_completion_seconds_count (and the rest) are on another.
In this case, the split leads to the alert firing when it shouldn't, but it's just as likely that other alerts split over multiple Prometheus servers will not fire when they should, which is potentially a worse situation.
To make this more complicated, some of the series exist on the prometheus-app instance, while some (project_export) do not. At a glance, this situation may not be noticed.
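To illustrate the failure mode, here is a rough Python sketch that models the alert's `unless` / `or` logic as set operations over label sets. It is only a toy model, not how Prometheus evaluates rules internally. The label values (`gprd`, `example_category`) and the names `prometheus_app` and `prometheus_k8s` are hypothetical, and each instance is reduced to "which label sets currently have a value > 0 for this metric".

```python
from typing import Dict, Set, Tuple

# (environment, queue, feature_category); the label values used below are hypothetical.
Labels = Tuple[str, str, str]
Instance = Dict[str, Set[Labels]]

# Metrics on the right-hand side of `unless` in the alert expression.
ACTIVITY_METRICS = [
    "sidekiq_jobs_completion_seconds_count",
    "sidekiq_jobs_failed_total",
    "sidekiq_jobs_retried_total",
    "sidekiq_running_jobs",
]


def evaluate_alert(instance: Instance) -> Set[Labels]:
    """Model `enqueued unless (a or b or c or d)` using only one instance's data."""
    enqueued = instance.get("sidekiq_enqueued_jobs_total", set())
    activity: Set[Labels] = set()
    for name in ACTIVITY_METRICS:
        activity |= instance.get(name, set())  # `or` behaves like a union
    return enqueued - activity  # `unless` drops matching label sets


def merge(*instances: Instance) -> Instance:
    """Model a unified view (for example via federation) of several instances."""
    merged: Instance = {}
    for instance in instances:
        for name, label_sets in instance.items():
            merged.setdefault(name, set()).update(label_sets)
    return merged


KEY: Labels = ("gprd", "project_export", "example_category")

# Enqueue counters live on one server, completion counters on the other.
prometheus_app: Instance = {"sidekiq_enqueued_jobs_total": {KEY}}
prometheus_k8s: Instance = {"sidekiq_jobs_completion_seconds_count": {KEY}}

print("prometheus-app alone:", evaluate_alert(prometheus_app))
print("k8s prometheus alone:", evaluate_alert(prometheus_k8s))
print("unified view:", evaluate_alert(merge(prometheus_app, prometheus_k8s)))
```

In this model the rule fires on the instance that only sees the enqueue counter, because the right-hand side of `unless` is empty there, while the unified view correctly returns nothing.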
Potential fixes
- Continue to scrape application metrics to the prometheus-app Prometheus instances. This would require non-k8s Prometheus to access things inside k8s, adding complexity.
- Evaluate rules at the Thanos level (not sure this is even possible, and if it is, I'm not sure I would want it).
- Use federation from k8s-Prometheus to the prometheus-app Prometheus instance to "unify" metrics on that instance.
Of the three options I can think of, I think I prefer the third the most.