I'm probably not the first to write about heartbeats and how they work, so here is just a quick overview.
Heartbeat mechanism: An agent sends a heartbeat to a management server on the monitoring port, 5723. The management server (the RMS, actually) keeps track of the last heartbeat received from each agent in its management group. When an agent's heartbeat hasn't been received for X time, a "heartbeat failure" alert is generated. The heartbeat failure alert in turn triggers a ping to the agent-managed server's FQDN. When this fails as well, another alert is generated: the "server unreachable" alert (named "Failed to Connect to Computer" in the queries below).
This blog is about the "X" time to wait before a heartbeat failure alert is generated. X isn't directly configurable. It's the product of 2 settings: the agent setting "heartbeat interval" (in seconds) and the server setting "number of missed heartbeats allowed". The defaults are 60 seconds and 3 missed heartbeats allowed, so by default "X" is 60 seconds × 3 = 180 seconds = 3 minutes.
I assume the default is OK in most cases, but if you have a large and complex network, or servers are rebooted without being put into maintenance mode first, it can generate a lot of noise. So here's what we did to find the optimal setting for X without having to experiment with the settings themselves.
By running the query below you can see the number of alerts for “Health Service HeartBeat Failure” and “Failed to Connect to Computer” with a closed status.
SELECT AlertStringName, COUNT(*) AS Number_of_alerts FROM AlertView
WHERE ResolutionState = 255
AND (AlertStringName = 'Health Service HeartBeat Failure' OR AlertStringName = 'Failed to Connect to Computer')
GROUP BY AlertStringName
ORDER BY 2 DESC
If you haven't changed the retention for closed alerts, this gives the number for about 1 week. For our environment it turned out to be around 3000 heartbeat alerts, which is roughly one alert per server per week. However, most of these alerts are gone before anyone has looked at the "problem".
In an earlier post I already gave some queries to identify auto-closing alerts. I modified the query a bit to only show heartbeat failure and failed to connect to computer alerts.
By running the query below you can see the number of these alerts that were closed by the system within 2 minutes after creation.
SELECT AlertStringName, COUNT(*) AS Number_of_alerts FROM AlertView
WHERE ResolutionState = 255
AND ResolvedBy = 'System'
AND (AlertStringName = 'Health Service HeartBeat Failure' OR AlertStringName = 'Failed to Connect to Computer')
AND DATEDIFF(MI, TimeRaised, TimeResolved) <= 2
GROUP BY AlertStringName
ORDER BY 2 DESC
In my environment this query showed that 1450 alerts were auto-closed within the first 2 minutes. These alerts were raised after the default 3 minutes without a heartbeat and closed at most 2 minutes later, so the agents were silent for about 5 minutes at most. If X had been 5, it probably would have prevented these 1450 alerts.
I've plotted X versus the expected number of heartbeat failures. Please note that I left out quite a lot of values for X without adjusting the scale for this, and that even with X at 1 hour I would still get a few heartbeat failures.
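The data behind such a plot can be pulled with a query along the lines of the sketch below. It is not the exact query I used: it assumes the default 3-minute threshold was in effect when the alerts were raised, and the column names MinutesOpen and CandidateX are just labels made up for this example. It counts the auto-closed heartbeat failure alerts per minute of lifetime and maps each lifetime to the X that would probably have suppressed those alerts.

```sql
-- Sketch: distribution of auto-closed heartbeat failure alerts by lifetime.
-- Assumption: alerts were raised with the default X of 3 minutes,
-- so an alert that stayed open for N minutes maps to a candidate X of N + 3.
WITH Closed AS (
    SELECT DATEDIFF(MI, TimeRaised, TimeResolved) AS MinutesOpen
    FROM AlertView
    WHERE ResolutionState = 255              -- closed alerts only
      AND ResolvedBy = 'System'              -- auto-closed, not closed by an operator
      AND AlertStringName = 'Health Service HeartBeat Failure'
)
SELECT MinutesOpen,
       MinutesOpen + 3 AS CandidateX,        -- hypothetical label, see note above
       COUNT(*)        AS Number_of_alerts
FROM Closed
GROUP BY MinutesOpen
ORDER BY MinutesOpen
```

Adding up Number_of_alerts for all rows up to a given CandidateX gives a rough estimate of the alerts that value of X would have avoided. Keep in mind that DATEDIFF with MI counts minute boundaries crossed rather than full elapsed minutes, so the buckets are approximate.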
So what’s the optimal X for this environment?
Actually, you still can't say what the setting should be. It depends on what is acceptable for your environment, as we're talking about how fast you can detect whether a server is down or not. Setting X to 60 would give us the fewest heartbeat alerts, but it wouldn't make any sense either. I believe finding a balance between noise and how soon we need to take a look is more important. I'd say the optimal X for my environment is 7-8. This will leave about 800 heartbeat alerts weekly, but that is acceptable for us.
Also note that you might miss unexpected reboots whatever the value of X is. If it's important not to miss them, just pick up the unexpected reboot event from the System event log with an alert rule and make that alert critical.