Improve Alerting
- Meet with the SRE team to discuss observability, monitoring, and alerting.
- Lower the severity of TezGraph alerts from Critical to a milder level while we have no production usage.
- Start an "Emergency ABC" knowledge base that lists what can go wrong and how we should fix it.
- Find a way to snooze a whole class of alerts. For some reason, TezGraph tends to fire an alert, resolve it, and then fire the same alert again as a new incident, so the whole acknowledge-and-snooze cycle has to be repeated, sometimes several times an hour.
- Study the alerts we had in the past few weeks, group them, and plan how to proactively make them less likely to recur.
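
To make the snoozing item concrete, here is a minimal sketch of what "snooze a class of alerts" could mean: the snooze is keyed on the alert class rather than on the individual incident, so a re-fire of the same alert inside the window does not page again. It assumes nothing about the alerting tool we actually use; the `ClassSnoozer` name, the one-hour window, the timestamps, and the `tezgraph-example-alert` class name are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ClassSnoozer:
    """Suppress notifications per alert class, not per incident (illustrative only)."""

    window: timedelta
    # Maps an alert class name to the time its snooze expires.
    _snoozed_until: dict = field(default_factory=dict)

    def snooze(self, alert_class: str, now: datetime) -> None:
        # Snoozing covers every future incident of this class until the window ends.
        self._snoozed_until[alert_class] = now + self.window

    def should_notify(self, alert_class: str, fired_at: datetime) -> bool:
        until = self._snoozed_until.get(alert_class)
        return until is None or fired_at >= until


# Hypothetical alert class and timestamps, for illustration only.
snoozer = ClassSnoozer(window=timedelta(hours=1))
t0 = datetime(2024, 1, 1, 12, 0)

first = snoozer.should_notify("tezgraph-example-alert", t0)
snoozer.snooze("tezgraph-example-alert", t0)

# The same alert resolves and re-fires ten minutes later as a new incident.
refire = snoozer.should_notify("tezgraph-example-alert", t0 + timedelta(minutes=10))

# After the window has passed, the class notifies again.
later = snoozer.should_notify("tezgraph-example-alert", t0 + timedelta(hours=2))

print(first, refire, later)
```

Keying on the class means one acknowledge-and-snooze action covers the whole resolve-and-refire cycle. Whether our alerting tool supports this natively is something to check with the SRE team.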