JAMA00

SCOM 2007 R2

What is the optimal setting for my environment when it comes to missed heartbeats?

Posted by rob1974 on July 14, 2010

I’m probably not the first to write about heartbeats and how the mechanism works, so here is just a quick overview.

Heartbeat mechanism: An agent sends a heartbeat on monitoring port 5723 to a management server. The management server (the RMS, actually) keeps track of the last received heartbeat of each agent in its management group. When an agent’s heartbeat hasn’t been received for X time, a “heartbeat failure” alert is generated. The heartbeat failure alert in turn triggers a ping on the agent-managed server’s FQDN. When this fails as well, another alert is generated: the “server unreachable” alert.

This post is about the time “X” to wait before a heartbeat failure alert is generated. X isn’t directly configurable; it’s a combination of two settings: the agent setting “heartbeat interval” (in seconds) and the server setting “number of missed heartbeats allowed”. The defaults are 60 seconds and 3 missed heartbeats allowed, so by default X is 60 × 3 = 180 seconds, or 3 minutes.
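The arithmetic above can be written down as a small helper. This is only an illustrative Python sketch: the function name `heartbeat_window_seconds` and its parameters are made up for this example and are not part of SCOM.

```python
def heartbeat_window_seconds(interval_seconds: int = 60, missed_allowed: int = 3) -> int:
    """Return X: the time without heartbeats before a heartbeat failure alert.

    interval_seconds -- the agent's heartbeat interval (default 60)
    missed_allowed   -- the server's number of missed heartbeats allowed (default 3)
    """
    return interval_seconds * missed_allowed


# Default settings: 60 s x 3 missed heartbeats
print(heartbeat_window_seconds() / 60)       # 3.0 minutes
# Same interval, 8 missed heartbeats allowed
print(heartbeat_window_seconds(60, 8) / 60)  # 8.0 minutes
```

Because X is a product, the same X can be reached in more than one way (for example 120 s × 4 also gives 8 minutes); which combination to prefer is a separate decision that this sketch does not cover.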


I assume the default is fine in most cases, but if you have a large and complex network, or servers are rebooted without being put in maintenance mode, it can generate a lot of noise. So here’s what we did to find the optimal setting for X without having to experiment with the settings themselves.

By running the query below you can see the number of alerts for “Health Service HeartBeat Failure” and “Failed to Connect to Computer” with a closed status.

SELECT AlertStringName, COUNT(*) AS Number_of_alerts FROM AlertView
WHERE ResolutionState = 255
AND (AlertStringName = 'Health Service HeartBeat Failure' OR AlertStringName = 'Failed to Connect to Computer')
GROUP BY AlertStringName
ORDER BY 2 DESC

If you haven’t changed the retention for closed alerts, this covers about one week. In our environment it came to around 3000 heartbeat alerts, which is roughly one alert per server per week. However, most of these alerts are gone before anyone has looked at the “problem”.

In an earlier post I already gave some queries to identify auto-closing alerts. I modified that query a bit to show only the heartbeat failure and failed to connect to computer alerts.

By running the query below you can see the number of heartbeat failure alerts which were closed within 2 minutes after creation.

SELECT AlertStringName, COUNT(*) AS Number_of_alerts FROM AlertView
WHERE ResolutionState = 255
AND ResolvedBy = 'system'
AND (AlertStringName = 'Health Service HeartBeat Failure' OR AlertStringName = 'Failed to Connect to Computer')
AND DATEDIFF(MI, TimeRaised, TimeResolved) <= 2
GROUP BY AlertStringName
ORDER BY 2 DESC

In my environment this query returned 1450 alerts that were auto-closed within the first 2 minutes. So if X had been 5 minutes (the default 3 plus those 2), it probably would have prevented 1450 alerts.

I’ve plotted X against the expected number of heartbeat failure alerts. Please note that I left out quite a few values of X without adjusting the scale for this, and that even with X at 1 hour I would still get a few heartbeat failures.

[Image: X versus the expected number of heartbeat failure alerts]
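The numbers behind such a plot can be estimated from how long each auto-closed alert stayed open. The Python sketch below is a rough illustration, not the method actually used for the graph. It assumes you have exported the `DATEDIFF(MI, TimeRaised, TimeResolved)` value of every system-closed heartbeat alert into a list. The function name `expected_alerts` and the sample lifetimes are invented for the example.

```python
DEFAULT_X_MINUTES = 3  # 60 s interval x 3 missed heartbeats


def expected_alerts(lifetimes_minutes, x_minutes, current_x=DEFAULT_X_MINUTES):
    """Estimate how many alerts would still be raised with a larger X.

    An alert that auto-closed within d minutes would probably not have been
    raised at all if the server had waited d minutes longer, i.e. if
    X >= current_x + d. Manually closed alerts are not modelled here.
    """
    extra_wait = x_minutes - current_x
    if extra_wait <= 0:
        # No additional waiting time, so nothing is suppressed.
        return len(lifetimes_minutes)
    return sum(1 for lifetime in lifetimes_minutes if lifetime > extra_wait)


# Hypothetical sample: minutes between TimeRaised and TimeResolved per alert.
sample_lifetimes = [1, 1, 2, 2, 4, 10, 45, 120]

for x in (3, 5, 8, 60):
    print(x, expected_alerts(sample_lifetimes, x))
```

With real data the list would hold a few thousand values, and the printed pairs are the points of the curve. The estimate is optimistic: it treats every quick auto-close as noise, while some of those alerts may have been short but real outages.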

So what’s the optimal X for this environment?

Actually, you still can’t say what the setting should be. It depends on what is acceptable for your environment, as we’re talking about how fast you can detect whether a server is down. Setting X to 60 would give us the fewest heartbeat alerts, but it wouldn’t make any sense either. I believe finding a balance between noise and the moment we really have to take a look is more important; I’d say the optimal X for my environment is 7-8 minutes. This still leaves about 800 heartbeat alerts weekly, which is acceptable for us.

Also note that you might miss unexpected reboots whatever the value of X is. If it’s important not to miss them, pick up the unexpected-reboot event from the System event log with an alert rule and make that alert critical.


4 Responses to “What is the optimal setting for my environment when it comes to missed heartbeats?”

  1. Dominique said

    Select alertstringname, count(*) as Number_of_alerts from AlertView

    =====
    I could not see any string matching the two
    alertstringname = 'Health Service HeartBeat Failure' or alertstringname = 'Failed to Connect to Computer'

    It is strange ??
    =====

    Select alertstringname, count(*) as Number_of_alerts from AlertView
    where alertstringname like '%heartbeat%'
    group by AlertStringName

    I have only Heartbeat failed
    ====

    Select alertstringname, count(*) as Number_of_alerts from AlertView
    where alertstringname like '%connect%'
    group by AlertStringName

    I have several items but not the one expected !!!

    Alert Message For Cluster Discovery Connect Status Monitor
    Alert Message For Cluster State Connect Status Monitor
    Connection Refused
    Connection Timeout
    Exchange 2007 Test Active Sync Connectivity Alert
    Exchange 2007 Test IMAP Connectivity Alert
    Exchange 2007 Test OWA Connectivity Internal Alert
    Exchange 2007 Test POP3 Connectivity Alert
    Name Service Provider Interface (NSPI) Proxy failed to connect to the global catalog. This server is down or unreachable. Clients will not be directed to this global catalog until it is available again.
    The backup of the Exchange store database was halted by the client or the connection with the client failed. Examine the log files of your third-party backup application or NT Backup.
    The connection between the Client Access server and the Mailbox server failed.

    Did I miss a step?

    Thanks,
    Dom

  2. rob1974 said

    You might simply not have any of these alerts, of course, or have overrides in place so you don’t get them. “Heartbeat failed” might be a different alert (e.g. cluster link), but it might also have something to do with the core MP you are using (this was done with R2 and the core update at that time, although I don’t believe it has changed since).

    An easy way to generate one is to stop an agent, wait for a critical alert to appear, and start the agent again. That’s just to test whether the mechanism works, though, because there’s no reason to start tuning those alerts when you don’t have any 🙂

  3. DOMINIQUE said

    let me try this stop as I need to know when a server is down…

    Thanks,
    Dom

  4. Hello,

    I would like to get a graph from these queries, is it easy?
    Thanks,
    Dom
