As an MVC towards the broader anomaly detection and alerting issue (&119), we should provide alerting based on deviation from the weekly mean.
- Define a recording rule for a weekly moving average of desired metrics. See this doc for more details on potential implementations.
- Define a recording rule for a 5 minute moving average of all nodes/pods for each metric.
- Create an alert for each metric when the 5 minute moving average is X standard deviations beyond the weekly moving average.
- Create an alert for each metric when a node's 5 minute moving average is X standard deviations beyond the population's average.
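
The comparison between the 5 minute moving average and the weekly moving average could be sketched roughly as below. This is only an illustration in plain Python, not the recording rules themselves: the function name `is_anomalous`, its parameters, and the sample values are hypothetical, and `num_sigmas` stands in for the X in the list above.

```python
from statistics import fmean, pstdev


def is_anomalous(weekly_samples, recent_samples, num_sigmas=2.0):
    """Return True when the mean of recent_samples is more than
    num_sigmas standard deviations away from the mean of weekly_samples.

    weekly_samples: metric values covering the past week (the baseline).
    recent_samples: metric values covering the last 5 minutes.
    """
    weekly_mean = fmean(weekly_samples)
    weekly_std = pstdev(weekly_samples)  # population standard deviation
    recent_mean = fmean(recent_samples)

    if weekly_std == 0:
        # Assumption: with a perfectly flat baseline, any deviation counts.
        return recent_mean != weekly_mean

    z_score = (recent_mean - weekly_mean) / weekly_std
    return abs(z_score) > num_sigmas


# Made-up sample values for illustration only.
weekly = [10, 12, 11, 9, 10, 11, 12, 9, 10, 11]
print(is_anomalous(weekly, [10, 11, 10]))  # close to the weekly mean
print(is_anomalous(weekly, [15, 16, 14]))  # far above the weekly mean
```

In an actual setup the means and standard deviations would come from the recording rules rather than being computed from raw sample lists.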
In this case our definitions would be:
- Anomaly: two standard deviations away from the weekly moving average. Assuming a normal distribution, about 95% of samples fall within 2sigma of the mean, so a 2sigma threshold would, in theory, fire only on values that normal behaviour produces about 5% of the time.
- Outlier: a single node's metrics are two standard deviations away from the rest of the fleet.
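
The outlier definition could be sketched in the same way. This is a minimal illustration under stated assumptions: `find_outliers` and the node names are hypothetical, each node is compared against the mean and standard deviation of the other nodes (one reading of "the rest of the fleet"), and the threshold defaults to two standard deviations.

```python
from statistics import fmean, pstdev


def find_outliers(node_averages, num_sigmas=2.0):
    """Return the nodes whose 5 minute moving average is more than
    num_sigmas standard deviations away from the rest of the fleet.

    node_averages: dict mapping node name -> 5 minute moving average.
    """
    outliers = []
    for node, value in node_averages.items():
        # Assumption: "rest of the fleet" excludes the node being tested.
        rest = [v for name, v in node_averages.items() if name != node]
        if len(rest) < 2:
            continue  # too few peers to estimate a spread

        rest_mean = fmean(rest)
        rest_std = pstdev(rest)

        if rest_std == 0:
            # Identical peers: any difference from them counts as an outlier.
            if value != rest_mean:
                outliers.append(node)
            continue

        if abs(value - rest_mean) / rest_std > num_sigmas:
            outliers.append(node)
    return outliers


# Made-up CPU utilisation values for illustration only.
fleet = {
    "node-a": 0.50,
    "node-b": 0.52,
    "node-c": 0.48,
    "node-d": 0.51,
    "node-e": 0.95,
}
print(find_outliers(fleet))
```

With very small fleets the spread of the remaining nodes is estimated from only a few values, so a 2sigma threshold can flag nodes that are only slightly different; a larger X or a minimum fleet size may be worth considering.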