JAMA00

SCOM 2007 R2

The Moving Average threshold alerts.

Posted by rob1974 on July 14, 2010

A common tuning strategy is to look at the top 25 most common alerts. You can find this report in the ’Microsoft ODR Report Library’. When I first started looking at this report, I noticed several alerts in the top X that I had never seen in the console before. The only way I could see these alerts was by creating a “closed alert” view. Most of these alerts were closed by “system” within 1-2 minutes.

I wanted to know how many alerts in my environment were automatically closed within 5 minutes. The query below gives the count for those alerts.

select count(*) from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and DATEDIFF(MI,TimeRaised,TimeResolved) <= 5

In my environment, more than 1/3 of all alerts were closed within 5 minutes (run only the first line of the query to get the total alert count). Most of these alerts aren’t being looked at, because they close so fast. So why are they generated in the first place?

With a few more queries I hoped to find some details about these alerts (thanks to Brian McDermott for helping out with these). To identify which alerts are generated the most and closed within the time limit:

select top 10 AlertStringName, count(*) from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and DATEDIFF(MI,TimeRaised,TimeResolved) < 5
group by AlertStringName
order by 2 DESC

To identify which objects generated the most of these alerts:

select top 10 MonitoringObjectFullName, count(*) from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and DATEDIFF(MI,TimeRaised,TimeResolved) < 5
group by MonitoringObjectFullName
order by 2 DESC

I can’t really give a general strategy for tuning these alerts, as they seem to be incidents where a single server sits near its threshold value for some time (just going over it, dropping under it again, and so on). How to tune them depends on the monitor and its configuration. However, most of these alerts seem to come from “moving average” type monitors.

A moving average is an average over several samples; with each new sample, the oldest sample value is dropped in favor of the new one. With a moving average, the monitor compares the average to its threshold at the same rate as the sample frequency, which can lead to a lot of state changes and thus alerts.
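To make the mechanism concrete, here is a minimal Python sketch of a moving-average check. It is not how SCOM implements its monitors; the function names, the sample values and the threshold of 80 are all made up for illustration. It only shows how a value hovering around the threshold makes the averaged state flip back and forth.

```python
from collections import deque

def moving_average_states(samples, window=5, threshold=80.0):
    """Return (average, over_threshold) for every sample once the window is full."""
    recent = deque(maxlen=window)  # appending to a full deque drops the oldest sample
    results = []
    for value in samples:
        recent.append(value)
        if len(recent) == window:
            avg = sum(recent) / window
            results.append((avg, avg > threshold))
    return results

def count_state_changes(states):
    """Count how often the state flips between healthy and over threshold."""
    flags = [over for _, over in states]
    return sum(1 for prev, cur in zip(flags, flags[1:]) if prev != cur)

# Hypothetical samples hovering around an assumed threshold of 80.
samples = [78, 79, 81, 82, 83, 76, 75, 83, 85, 86, 70, 72]
states = moving_average_states(samples)
print(len(states), count_state_changes(states))  # prints: 8 3
```

With only 12 samples the average is evaluated 8 times and crosses the threshold 3 times, even though no single average is further than about 1 away from it.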

In the table below I’ve created fictional data to prove my point. The 3rd column displays the moving average over 5 samples. As you can see, after the 5th sample it gives a value with every sample. The 4th column takes the average over 5 samples, then collects 5 new samples before producing a new value.

Suppose we have a 1 minute sample rate. For the moving average this means getting 9 alerts in 30 minutes, each existing for 1 to 3 minutes. If we instead average over 5 minutes, then drop the results and collect 5 new points, there would have been 3 alerts, each existing for 5 to 10 minutes. Still a lot, but because each alert stays “new” longer, an operator might have noticed it and an admin might have taken a look at the server and resolved the issue.

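[Image: table of fictional sample data, with the 5-sample moving average in the 3rd column and the plain 5-sample average in the 4th column]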
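The same comparison can be sketched in Python. The data here is invented (a repeating 84, 78, 78 pattern against an assumed threshold of 80) and is not the data from the table, so the counts are only illustrative. The helper names are hypothetical, and an “alert” is simply counted as each transition from healthy to over threshold.

```python
def moving_averages(samples, window):
    """One average per sample once `window` samples are available."""
    return [sum(samples[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(samples))]

def block_averages(samples, window):
    """One average per full, non-overlapping block of `window` samples."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples) - window + 1, window)]

def count_alerts(averages, threshold):
    """Count healthy -> over-threshold transitions; each would raise a new alert."""
    alerts = 0
    unhealthy = False
    for avg in averages:
        now_unhealthy = avg > threshold
        if now_unhealthy and not unhealthy:
            alerts += 1
        unhealthy = now_unhealthy
    return alerts

# 30 hypothetical one-minute samples hovering around an assumed threshold of 80.
samples = [84, 78, 78] * 10
moving = moving_averages(samples, 5)
block = block_averages(samples, 5)
print(count_alerts(moving, 80), count_alerts(block, 80))  # prints: 9 2
```

On this made-up data the moving average is evaluated 26 times and raises 9 short-lived alerts, while the block average is evaluated 6 times and raises 2.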

Tuning “moving average” monitors is quite difficult, and the only way to reduce alerts is to use a lower sample rate. A higher threshold value doesn’t work, as these are incidents where a server is near the threshold value.
