Add generic 'high load' alert

Tomasz Maczukin requested to merge tm-add-high-load-alert into master

We don't have any load-related alert other than the DB one (https://gitlab.com/gitlab-com/runbooks/blob/master/alerts/db-under-heavy-load.rules#L1). This means a load of 6.5k on a machine can go unnoticed until someone looks at the graphs (as described in infrastructure#3157)!

This MR adds a generic high load alert based on the node_load1 metric.

Two things to consider:

  1. I've set 100 as the triggering value, but I think it's too high. Considering that our biggest machines are the DB ones with 32 cores, and how load is measured on Linux, a load higher than 40-50 for more than 5 minutes should be enough to raise an alert. And if we exclude the DB nodes, then even 20-30 would be enough.

  2. The second thing is the DB alert mentioned above. I don't know its origins, but I assume that someone set the level to 200 for a good reason. It also provides some additional data related to DB load. This MR introduces an alert that will fire much earlier than the DB-specific one. I think it would be good to exclude the hosts matched by https://gitlab.com/gitlab-com/runbooks/blob/master/alerts/db-under-heavy-load.rules#L2 from the alert added here.
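
To make the two points above concrete, here is a rough sketch of what a lower-threshold rule that skips the DB hosts might look like. It is written in the Prometheus 2.x YAML rule format, which may not be the syntax the runbooks repository actually uses, and it is not the rule from this MR. The alert name, the group name, the `fqdn` label, the `db.*` pattern, the `severity` value and the threshold of 30 are all assumptions for illustration; the real matcher should mirror whatever the DB rule linked above uses.

```yaml
groups:
  - name: high-load.rules          # hypothetical group name
    rules:
      - alert: HighLoad            # hypothetical alert name
        # node_load1 is the 1-minute load average exported by node_exporter.
        # The fqdn label and the "db.*" pattern are placeholders (assumptions)
        # standing in for the host matcher used by the existing DB alert.
        # 30 is taken from the upper end of the 20-30 range suggested for
        # non-DB nodes; it is not the value set in this MR (100).
        expr: node_load1{fqdn!~"db.*"} > 30
        # Only fire if the condition holds continuously for 5 minutes.
        for: 5m
        labels:
          severity: warn           # assumed severity label
        annotations:
          summary: "High load on {{ $labels.instance }}"
          description: "1-minute load average has been above 30 for 5 minutes (current value: {{ $value }})."
```

With a matcher like this, the DB hosts stay covered only by their own alert at 200, while every other node is caught by the generic one.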

References infrastructure#3159

Edited by Craig Miskell