JAMA00

SCOM 2007 R2

Archive for July, 2010

What is the optimal setting for my environment when it comes to missed heartbeats?

Posted by rob1974 on July 14, 2010

I’m probably not the first to write about heartbeats and the mechanism behind them, so just a quick overview of how it works.

Heartbeat mechanism: an agent sends out a heartbeat on the monitoring port 5723 to a management server. The management server (the RMS, actually) keeps track of the last received heartbeat of each agent in its management group. When an agent’s heartbeat hasn’t been received for X amount of time, a “Health Service Heartbeat Failure” alert is generated. The heartbeat failure alert in turn triggers a ping of the agent-managed server’s FQDN. When this fails as well, another alert is generated: the “Failed to Connect to Computer” (server unreachable) alert.

This post is about the “X” amount of time to wait before a heartbeat failure alert is generated. X isn’t a directly configurable time; it’s the product of 2 settings: the agent setting “heartbeat interval” (in seconds) and the server setting “number of missed heartbeats allowed”. The defaults are 60 seconds and 3 missed heartbeats, so by default X = 60 × 3 seconds = 3 minutes.

[Image: the agent “Heartbeat Interval” setting in the console]

[Image: the management server “Number of missed heartbeats allowed” setting in the console]

I assume the default is fine in most cases, but if you have a large and complex network, or reboot servers without setting maintenance mode, it might generate a lot of noise as well. So here’s what we did to find the optimal setting for X without experimenting with the settings themselves.

By running the query below you can see the number of alerts for “Health Service HeartBeat Failure” and “Failed to Connect to Computer” with a closed status.

-- Count closed alerts (ResolutionState 255 = closed) per alert name
Select AlertStringName, count(*) as Number_of_Alerts from AlertView
where ResolutionState = 255
and (AlertStringName = 'Health Service HeartBeat Failure' or AlertStringName = 'Failed to Connect to Computer')
group by AlertStringName
order by 2 DESC

If you haven’t changed the retention for closed alerts, this gives the numbers for about 1 week. For our environment it turned out to be around 3000 heartbeat alerts, which is about one alert per server per week. However, most of these alerts were gone before anyone had looked at the “problem”.
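
By the way, you can verify that retention; a quick sketch, assuming the grooming settings still live in the PartitionAndGroomingSettings table of the OperationsManager database (this query isn’t from the original post):

-- How long are resolved alerts kept before grooming? Default is 7 days.
select ObjectName, DaysToKeep
from PartitionAndGroomingSettings
where ObjectName = 'Alert'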

In this post I’ve already given some queries to identify auto-closing alerts. I modified the query a bit to only show the heartbeat failure and failed to connect to computer alerts.

By running the query below you can see the number of those alerts that were closed within 2 minutes of creation.

-- Closed heartbeat alerts that the system auto-resolved within 2 minutes
Select AlertStringName, count(*) as Number_of_Alerts from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and (AlertStringName = 'Health Service HeartBeat Failure' or AlertStringName = 'Failed to Connect to Computer')
and DATEDIFF(MI, TimeRaised, TimeResolved) <= 2
group by AlertStringName
order by 2 DESC

The result of this query in my environment: 1450 alerts were auto-closed within the first 2 minutes. So if X had been 5 minutes, it probably would have prevented those 1450 alerts.

I’ve plotted X against the expected number of heartbeat failures. Please note I left out quite a lot of values for X but haven’t adjusted the scale accordingly, and even after 1 hour I would still get a few heartbeat failures.

[Image: chart of X versus the expected number of heartbeat failure alerts per week]
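
The data points behind a chart like this can be pulled in one go. A sketch using the same AlertView as above (not a query from the original post); the reasoning is that with the default X of 3 minutes, an alert that auto-closes T minutes after being raised corresponds to roughly 3 + T minutes without heartbeats, so any X of at least 3 + T would have suppressed it:

-- Distribution of auto-closed heartbeat alerts by minutes-to-close;
-- an alert that closes after T minutes would be suppressed by any X >= 3 + T.
select DATEDIFF(MI, TimeRaised, TimeResolved) as Minutes_Open, count(*) as Number_of_Alerts
from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and (AlertStringName = 'Health Service HeartBeat Failure' or AlertStringName = 'Failed to Connect to Computer')
group by DATEDIFF(MI, TimeRaised, TimeResolved)
order by 1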

So what’s the optimal X for this environment?

Actually, you still can’t say what the setting should be. It depends on what is acceptable for your environment, as we’re talking about how fast you can detect whether a server is down. Setting X to 60 minutes would give us the fewest heartbeat alerts, but it wouldn’t make much sense either. I believe finding a balance between noise and the point where we really have to take a look is more important; I’d say the optimal X for my environment is 7-8 minutes. That still leaves about 800 heartbeat alerts weekly, but this is acceptable for us.

Also note that you might miss unexpected reboots whatever the value of X is. If it’s important not to miss them, just pick up the unexpected-shutdown event from the System event log (event ID 6008) with an alert-generating rule and make that alert critical.

Posted in Agent Settings, Management Servers | 4 Comments »

The Moving Average threshold alerts.

Posted by rob1974 on July 14, 2010

A common tuning strategy is to look at the top 25 most common alerts; you can find such a report in the “Microsoft ODR Report Library”. When I first started looking at this report I noticed several alerts in this top X that I had never seen in the console before. The only way I could see these alerts was by creating a “closed alerts” view. Most of them were closed by “system” within 1-2 minutes.

I wanted to know how many alerts in my environment were automatically closed within 5 minutes. The query below gives the count of those alerts.

-- Count all alerts that the system auto-closed within 5 minutes
select count(*) from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and DATEDIFF(MI, TimeRaised, TimeResolved) <= 5

In my environment more than 1/3 of the total number of alerts (run only the first line of the query for the total) were closed within 5 minutes. Most of these alerts aren’t being looked at because they close so fast. So why are they generated in the first place?

With a few more queries I hoped to find some details about these alerts: to identify which alerts are generated the most and closed within the time limit (thanks to Brian McDermott for helping out with these).

-- Top 10 alert names among alerts auto-closed within 5 minutes
Select top 10 AlertStringName, count(*) as Number_of_Alerts from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and DATEDIFF(MI, TimeRaised, TimeResolved) < 5
group by AlertStringName
order by 2 DESC

And to identify which objects generated the most of these alerts:

-- Top 10 objects raising alerts that auto-close within 5 minutes
select top 10 MonitoringObjectFullName, count(*) as Number_of_Alerts from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and DATEDIFF(MI, TimeRaised, TimeResolved) < 5
group by MonitoringObjectFullName
order by 2 DESC
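
You can also combine the two dimensions; a small sketch along the same lines (not from the original post) that shows which alert fires most on which object:

-- Top alert name / object pairs among alerts auto-closed within 5 minutes
select top 10 MonitoringObjectFullName, AlertStringName, count(*) as Number_of_Alerts
from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and DATEDIFF(MI, TimeRaised, TimeResolved) < 5
group by MonitoringObjectFullName, AlertStringName
order by 3 DESC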

I can’t really give a strategy to tune these alerts, as they seem to be incidents where one server stays near its threshold value for some time (just going over it, dropping under it again, etc.). You should tune based on the specific monitor and its configuration. However, most of these alerts seem to happen with “moving average” type monitors.

A moving average is an average over several samples; with each new sample the oldest sample value is dropped in favor of the new one. With a moving average the monitor compares its average value to its threshold at the same rate as the sample frequency, and this can lead to a lot of state changes and thus alerts.

In the table below I’ve created fictional data to prove my point. The 3rd column displays the moving average over 5 samples; as you can see, from the 5th sample on it produces a value on every sample. The 4th column just takes the average over 5 samples, then takes 5 new samples before producing the next value.

Suppose we have a 1-minute sample rate. For the moving average that means getting 9 alerts in 30 minutes, each existing for 1 to 3 minutes. If we instead just take an average over 5 samples, drop the results and collect 5 new points, it would have been 3 alerts existing for 5 to 10 minutes. Still a lot, but because each alert stays “new” longer, it might get noticed by an operator, and an admin might take a look at the server and resolve the issue.

[Table: fictional samples with the 5-sample moving average (3rd column) and the plain 5-sample average (4th column)]
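
If you want to reproduce the contrast yourself, here is a minimal T-SQL sketch with made-up sample values (hypothetical numbers, not the table’s data):

-- Ten fictional samples; any values hovering near a threshold show the effect
declare @samples table (id int, value int)
insert into @samples
select 1, 78 union all select 2, 84 union all select 3, 79 union all
select 4, 81 union all select 5, 83 union all select 6, 77 union all
select 7, 82 union all select 8, 80 union all select 9, 85 union all
select 10, 79

-- Moving average: a fresh comparison value on every sample once 5 samples exist
select s.id,
       (select avg(cast(value as decimal(5,1))) from @samples s2
        where s2.id between s.id - 4 and s.id) as moving_avg
from @samples s
where s.id >= 5

-- Plain 5-sample average: only one comparison value per block of 5 samples
select (id - 1) / 5 as block_nr, avg(cast(value as decimal(5,1))) as block_avg
from @samples
group by (id - 1) / 5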

Tuning “moving average” monitors is quite difficult, and the only way to reduce alerts is to use a lower sample rate. A higher threshold value doesn’t work, as these are incidents where a server hovers near the threshold value.

Posted in management packs, troubleshooting | 1 Comment »

another DNS tuning post!

Posted by rob1974 on July 6, 2010

Actually, this is my first DNS tuning post, but as the DNS MP has proven itself quite noisy, you can find loads of other blogs with tuning tips. I haven’t found this one anywhere else though, so here goes my tip about the “DNS 200X Forwarder Availability Monitor” and why you should disable it or configure it properly.

What does this monitor do?

It just runs an A-record (ns)lookup for www.microsoft.com. So basically it assumes a forwarder is in place and that a lookup of www.microsoft.com actually uses that forwarder. Although many environments probably do (unless you use root hints), it’s actually very similar to the “DNS 200X External Resolution Monitor”, which does an NS-record lookup against www.microsoft.com (override this to microsoft.com for a better result, as that is a correct NS-record lookup).
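
For reference, the manual equivalents of both checks look roughly like this, run against one of your own DNS servers (the server name below is just a placeholder):

nslookup -type=A www.microsoft.com mydnsserver.mydomain.local
nslookup -type=NS microsoft.com mydnsserver.mydomain.local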

So the “forwarder availability monitor” doesn’t actually test forwarder availability, at least nothing that isn’t already being tested by the external resolution monitor.

But there is a use for this test. When you use “conditional forwarding” on a DNS server, you can configure this test to look up a record in the domain covered by the forwarding rule.

E.g. you have a conditional forwarding rule for “my2nddomain.com”. Set an override on the “host” to a-valid-a-record.my2nddomain.com and you actually make use of the forwarder availability monitor.

So when you make use of conditional forwarders, configure this monitor to actually use one of them. But when you don’t use forwarders, or you use the “forward all domains” option, just do like me and disable the monitor.

[Image: override configuration for the DNS Forwarder Availability Monitor]

Posted in management packs | Leave a Comment »