Deadman with multiple replicas

When using multiple replicas, queries from Alertmanager are distributed among the replicas in round-robin fashion.

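On rare occasions, the delay between two requests to a specific pod can be greater than the default timeout parameter. That pod then switches to DeadState and sends an alert, even though the other replicas are still being pinged: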

2021-01-22 09:34:08,064 httprobe request updated deadman state. Last ping: 2021-01-22 09:30:36.112817
2021-01-22 09:34:08,064 State is AliveState
100.103.128.0 - - [22/Jan/2021 09:34:08] "GET /httprobe HTTP/1.1" 200 -
2021-01-22 09:34:08,178 httprobe request updated deadman state. Last ping: 2021-01-22 09:29:05.247565
2021-01-22 09:34:08,178 State is AliveState
100.127.0.0 - - [22/Jan/2021 09:34:09] "GET /httprobe HTTP/1.1" 200 -
2021-01-22 09:34:09,281 httprobe request updated deadman state. Last ping: 2021-01-22 09:30:36.112817
2021-01-22 09:34:09,282 State is AliveState
100.114.0.0 - - [22/Jan/2021 09:34:12] "GET /httprobe HTTP/1.1" 200 -
2021-01-22 09:34:12,269 httprobe request updated deadman state. Last ping: 2021-01-22 09:33:05.243775
2021-01-22 09:34:12,269 State is AliveState
100.103.128.0 - - [22/Jan/2021 09:34:14] "GET /httprobe HTTP/1.1" 200 -
2021-01-22 09:34:14,993 httprobe request updated deadman state. Last ping: 2021-01-22 09:29:05.247565
2021-01-22 09:34:14,993 State is DeadState
2021-01-22 09:34:14,993 Send alert to receiver <deadman_state_machine.receivers.HttpPostJsonReceiver object at 0x7fe55a618240>
2021-01-22 09:34:14,993 POST request sent to HttpPostReceiver with data : {"alerts": [{"labels": {"severity": "critical", "alertname": "Watchdog"}, "status": "firing", "annotations": {"description": "The Man is Dead"}}], "alertname": "Watchdog"}
100.114.0.0 - - [22/Jan/2021 09:34:15] "GET /httprobe HTTP/1.1" 200 -
2021-01-22 09:34:15,462 httprobe request updated deadman state. Last ping: 2021-01-22 09:33:05.243775
2021-01-22 09:34:15,462 State is AliveState
100.127.0.0 - - [22/Jan/2021 09:34:18] "GET /httprobe HTTP/1.1" 200 -
2021-01-22 09:34:18,064 httprobe request updated deadman state. Last ping: 2021-01-22 09:30:36.112817
2021-01-22 09:34:18,064 State is AliveState
100.103.128.0 - - [22/Jan/2021 09:34:18] "GET /httprobe HTTP/1.1" 200 -
2021-01-22 09:34:18,179 httprobe request updated deadman state. Last ping: 2021-01-22 09:29:05.247565
2021-01-22 09:34:18,179 State is DeadState
100.127.0.0 - - [22/Jan/2021 09:34:19] "GET /httprobe HTTP/1.1" 200 -
2021-01-22 09:34:19,281 httprobe request updated deadman state. Last ping: 2021-01-22 09:30:36.112817
2021-01-22 09:34:19,281 State is AliveState
100.114.0.0 - - [22/Jan/2021 09:34:22] "GET /httprobe HTTP/1.1" 200 -
2021-01-22 09:34:22,270 httprobe request updated deadman state. Last ping: 2021-01-22 09:33:05.243775
2021-01-22 09:34:22,270 State is AliveState
100.103.128.0 - - [22/Jan/2021 09:34:24] "GET /httprobe HTTP/1.1" 200 -
2021-01-22 09:34:24,993 httprobe request updated deadman state. Last ping: 2021-01-22 09:29:05.247565
2021-01-22 09:34:24,993 State is DeadState

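The default Alertmanager repeat_interval of 60s, combined with 3 deadman replicas and a 300s timeout, is not optimal. In the log above, the replica whose last ping was at 09:29:05 reports DeadState at 09:34:14, about 309s later, while the other two replicas stay in AliveState.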

Proposal:

Increase the timeout
Decrease the repeat_interval
Calculate the optimal timeout parameter from the repeat_interval and the number of replicas.
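
As a rough sketch of the third option: under strict round-robin, each replica is pinged once every replicas * repeat_interval seconds, so the timeout has to exceed that gap, with room for at least one skipped turn. The function names and the missed_turns parameter below are hypothetical and not part of deadman; the one-skipped-turn margin is an assumption, not a measured value.

```python
def min_timeout(repeat_interval_s: float, replicas: int, missed_turns: int = 1) -> float:
    """Longest gap (seconds) a replica can see between pings, assuming
    round-robin distribution and up to `missed_turns` skipped turns.
    The deadman timeout should be strictly greater than this value."""
    if repeat_interval_s <= 0 or replicas < 1 or missed_turns < 0:
        raise ValueError("invalid parameters")
    return replicas * repeat_interval_s * (missed_turns + 1)


def max_repeat_interval(timeout_s: float, replicas: int, missed_turns: int = 1) -> float:
    """Inverse view: the repeat_interval at which the worst-case gap
    equals `timeout_s`. Pick a repeat_interval below this value."""
    if timeout_s <= 0 or replicas < 1 or missed_turns < 0:
        raise ValueError("invalid parameters")
    return timeout_s / (replicas * (missed_turns + 1))


if __name__ == "__main__":
    # Current setup: repeat_interval 60s, 3 replicas.
    print(min_timeout(60, 3))           # gap to exceed if one turn is skipped
    print(max_repeat_interval(300, 3))  # repeat_interval bound for a 300s timeout
```

With the current values, one skipped turn already produces a 360s gap, which is longer than the 300s timeout. Either raising the timeout above 360s or lowering repeat_interval below 50s would cover that case.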

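300 / 60 = 5 means that only 5 requests are sent to the 3 pods during the 300s timeout period, which is fewer than 2 per pod on average. A pod that is skipped once or twice can therefore go the whole timeout period without a ping.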
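
To put a number on this, here is a small sketch of the same arithmetic. The second function assumes the worst case for balancing, where each ping lands on a uniformly random replica instead of following strict round-robin; that is an assumption for illustration, not something the logs confirm. Both function names are hypothetical.

```python
def pings_per_replica(timeout_s: float, repeat_interval_s: float, replicas: int) -> float:
    """Average number of pings one replica receives within one timeout period."""
    return (timeout_s / repeat_interval_s) / replicas


def miss_probability(timeout_s: float, repeat_interval_s: float, replicas: int) -> float:
    """Probability that a given replica receives none of the pings sent
    during one timeout period, ASSUMING each ping is routed to a
    uniformly random replica (not strict round-robin)."""
    pings = int(timeout_s // repeat_interval_s)
    return ((replicas - 1) / replicas) ** pings


if __name__ == "__main__":
    print(pings_per_replica(300, 60, 3))  # about 1.67 pings per replica
    print(miss_probability(300, 60, 3))   # (2/3)**5, about 0.13
```

Under that random-routing assumption, doubling the timeout to 600s gives 10 pings per period and lowers the miss probability to (2/3)**10, which is under 2%.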

What do you think about that @rsicart ?

Edited Jan 22, 2021 by azman0101