Integrate production monitoring alerts in release-tools production checks
We need #1052 (closed) before we can start working on this.
This issue is about extending the release-tools production checks with the new Prometheus alerts.
- The check should be logged into the monthly issue (together with the already existing ones).
- We should compare `cny` and `main` stage metrics to determine `cny` reliability.
- We should verify `main` stage metrics as a safeguard against automated deployment during an (unreported) incident.
- The check should be overridable, and must not undo the work done in #941 (closed).
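The checks listed above could be combined roughly as follows. This is only a minimal sketch, not the release-tools implementation: the names `StageMetrics`, `CheckResult`, and `check_canary`, as well as every threshold value, are hypothetical, and the metrics are assumed to have already been fetched from Prometheus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageMetrics:
    """Hypothetical snapshot of one stage (cny or main)."""
    apdex: float       # 0.0 to 1.0, higher is better
    error_rate: float  # fraction of failed requests, lower is better


@dataclass(frozen=True)
class CheckResult:
    ok: bool
    reasons: tuple  # kept even when overridden, so they can be logged


def check_canary(cny, main, *, main_min_apdex=0.99, main_max_error_rate=0.01,
                 tolerance=0.005, override=False):
    """Compare cny against main and verify that main itself is healthy.

    All thresholds are illustrative assumptions, not real SLO values.
    """
    reasons = []

    # Safeguard: main must be healthy, otherwise an (unreported)
    # incident may be in progress.
    if main.apdex < main_min_apdex:
        reasons.append("main apdex below threshold")
    if main.error_rate > main_max_error_rate:
        reasons.append("main error rate above threshold")

    # Reliability: cny must not be noticeably worse than main.
    if cny.apdex < main.apdex - tolerance:
        reasons.append("cny apdex worse than main")
    if cny.error_rate > main.error_rate + tolerance:
        reasons.append("cny error rate worse than main")

    # The check stays overridable: an override allows the deployment
    # but the failure reasons are still reported.
    return CheckResult(ok=override or not reasons, reasons=tuple(reasons))
```

Returning the reasons even when `override` is set keeps the result suitable for logging into the monthly issue alongside the existing checks.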
What about Sentry?
Today a release manager also checks for new errors on Sentry. This check is unreliable because new occurrences of an old error are often not clustered with the existing one, resulting in false positives.
A lot of timeout errors happen during a deployment, and they are usually classified as new errors on Sentry.
With #1052 (closed) and near-missed incidents, #1050 (closed), we aim to replace that manual action with two other solutions.
With #1052 (closed) we check apdex and error rates, with requirements stricter than the ones we use to page the EOC. When a canary deployment induces an error, it will be reflected in apdex and error rates, and the production deployment will not be allowed.
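To illustrate what "stricter than the paging requirements" could mean in practice, here is a small sketch. The apdex formula is the standard one (satisfied requests plus half of the tolerating requests, divided by the total). The paging thresholds, the `strictness` factor, and the function names are assumptions made for the example and are not taken from #1052.

```python
def apdex(satisfied, tolerating, total):
    """Standard apdex score: (satisfied + tolerating / 2) / total."""
    if total == 0:
        # Assumption: no traffic is treated as healthy.
        return 1.0
    return (satisfied + tolerating / 2) / total


def error_rate(errors, total):
    """Fraction of requests that failed."""
    if total == 0:
        return 0.0
    return errors / total


def deployment_allowed(apdex_score, err_rate, *, paging_apdex=0.99,
                       paging_error_rate=0.01, strictness=0.5):
    """Gate a production deployment on thresholds tighter than paging.

    With strictness=0.5 only half of the paging "budget" is tolerated.
    All numeric values here are hypothetical.
    """
    min_apdex = 1 - (1 - paging_apdex) * strictness
    max_error_rate = paging_error_rate * strictness
    return apdex_score >= min_apdex and err_rate <= max_error_rate
```

With these illustrative defaults, an apdex of 0.992 would not page the EOC but would still block the production deployment.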
In addition, with #1050 (closed), a developer testing a feature can halt the deployment by creating a near-missed incident.