Add generic 'high load' alert
We don't have any load-related alert other than the DB one (https://gitlab.com/gitlab-com/runbooks/blob/master/alerts/db-under-heavy-load.rules#L1). As a result, a load of 6.5k on a machine went unnoticed until someone looked at the graphs (as described in infrastructure#3157)!
This MR adds a generic high load alert based on the node_load1 metric.
Two things to consider:
- I've set 100 as the triggering value, but I think it's too high. Considering that our biggest machines are the DB ones with 32 cores, and how load is measured on Linux, a load higher than 40-50 for more than 5 minutes should be enough to raise an alert. And if we exclude DB nodes, then even 20-30 would be enough.
- The second thing is the DB alert mentioned above. I don't know its origin, but I assume that someone set the level to 200 for a good reason. It also provides some additional data related to DB load. This MR introduces an alert that will fire much earlier than the DB-specific one. I think it would be good to exclude hosts matched by https://gitlab.com/gitlab-com/runbooks/blob/master/alerts/db-under-heavy-load.rules#L2 from the alert added here.
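
To put the candidate thresholds in perspective, here is a minimal Python sketch (not part of this MR) that normalizes a load value by core count and checks whether a run of samples stays above a threshold. The function names, the one-sample-per-minute interval, and the sample values are assumptions made for illustration only; the 32-core figure and the thresholds come from the points above.

```python
def load_per_core(load1, cores):
    """Normalize a 1-minute load average by the number of CPU cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load1 / cores


def sustained_high_load(samples, threshold, min_samples):
    """Return True if the last `min_samples` samples all exceed `threshold`.

    Assumption: one sample per minute, so min_samples=5 approximates
    "higher than the threshold for more than 5 minutes".
    """
    if min_samples <= 0 or len(samples) < min_samples:
        return False
    return all(s > threshold for s in samples[-min_samples:])


DB_CORES = 32  # size of the biggest (DB) machines mentioned above

# How many runnable tasks per core each candidate threshold implies.
for threshold in (40, 50, 100, 200):
    print(threshold, load_per_core(threshold, DB_CORES))

# Hypothetical load samples, one per minute.
recent = [10, 45, 46, 47, 48, 49]
print(sustained_high_load(recent, threshold=40, min_samples=5))
```

On a 32-core machine a threshold of 40 already means more than one task waiting per core, while 100 means roughly three per core, which is why a lower value seems reasonable for the generic alert.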
References infrastructure#3159