Alerting on abnormal Production behavior

Description

When operating an always-on production SaaS solution, it is important to be alerted when strange or undesirable behavior starts occurring. Ideally these alerts would fire before users are significantly impacted or the support phones and mailboxes start lighting up. Such alerts can flag not just performance and stability issues, but potentially also abuse.

One of the major sources of abnormal behavior or errors is the deployment of a new release into production. New code is now being executed, potentially with new features, which has not been tested in the true production environment before. While this is similar to the broader anomaly detection (https://gitlab.com/gitlab-org/gitlab-ee/issues/3610), we would want to shorten the timescale of our comparisons to increase sensitivity. For example, instead of comparing the current behavior against the moving average of a full week, we could compare it against the average for the past 30 minutes.

When a release is deployed, it is important to carefully monitor the behavior of the system and generate alerts if potential problems are found.

Proposal

We should consider:

Upon release, calculate the average of each metric for the past X minutes.
After Y minutes of "warm up" time, begin comparing the 5-minute moving average to the pre-deployment value. Alert if it is more than Z standard deviations away from that value.
Along with notifications, these alerts could also halt further deployments of the same branch of code. Manual review could then decide whether to proceed, halt and wait for the next code deploy with a fix, or roll back the changes.
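
The comparison proposed above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed implementation: samples are assumed to be (minute, value) pairs for a single metric, the standard deviation is assumed to be taken over the pre-deployment window, and all function names, parameter names, and default values are hypothetical.

```python
from statistics import mean, stdev


def baseline_stats(samples, release_time, baseline_minutes):
    """Mean and standard deviation of the metric just before the release.

    `samples` is a list of (minute, value) pairs (hypothetical format).
    Returns None if there are too few samples to compute a deviation.
    """
    window = [v for t, v in samples
              if release_time - baseline_minutes <= t < release_time]
    if len(window) < 2:
        return None
    return mean(window), stdev(window)


def should_alert(samples, release_time, now,
                 baseline_minutes=30,   # pre-deployment averaging window
                 warmup_minutes=10,     # "warm up" time after the release
                 moving_minutes=5,      # moving-average window
                 threshold_sigmas=3.0): # allowed deviation from the baseline
    """Return True if the recent moving average strays too far from the baseline."""
    # Stay silent until the warm-up period has elapsed.
    if now < release_time + warmup_minutes:
        return False

    stats = baseline_stats(samples, release_time, baseline_minutes)
    recent = [v for t, v in samples if now - moving_minutes < t <= now]
    if stats is None or not recent:
        return False  # not enough data to decide either way

    base_mean, base_std = stats
    # With a perfectly flat baseline (std == 0), any change at all triggers.
    return abs(mean(recent) - base_mean) > threshold_sigmas * base_std
```

A deploy pipeline could call `should_alert` once per minute after a release and, on a True result, send the notification and pause further deployments pending manual review.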

Edited Oct 03, 2017 by Joshua Lambert