JAMA00

SCOM 2007 R2

Archive for the ‘troubleshooting’ Category

Microsoft Exchange Server 2013 discovery.

Posted by rob1974 on May 13, 2014

The Microsoft Exchange Server management pack has once again seen some big changes. Loads of people have already blogged about that. Personally I like the simplicity of it and the fact that Exchange admins are in control, a.k.a. they don’t bother me anymore. However, Windows server admins started to bother me with script errors related to the Exchange Server 2013 discovery on servers that didn’t run Exchange 2013 in the first place.

After some investigation it turns out the discovery for Microsoft Exchange 2013 is a PowerShell script, which is targeted at the Windows Computer class. This is where it goes wrong. When you are monitoring Windows 2003 or Windows 2008 servers, chances are you don’t have PowerShell installed on those servers. Furthermore, why is the Exchange 2013 discovery running on those servers at all, as they’re not a supported OS for Exchange Server 2013?

So easy enough, I decided to override the discovery for Windows 2003. Simply choose the override “for a group”, select the Windows 2003 computer group and set the “Enabled” value to false. Job done.

Next I wanted to disable the discovery for the Windows 2008 servers as well, but not for the Windows 2008 R2 computers. Windows 2008 R2 is a supported OS for Exchange 2013 and PowerShell is installed by default, so there’s no issue there. The discovery will run and return nothing (or “not an Exchange server”) if Exchange isn’t installed. It won’t throw a script error, because PowerShell is present.

The Windows 2008 computer group in the Windows Server 2008 (Discovery) management pack also contains the Windows 2008 R2 computers, so it’s not as easy as with Windows Server 2003. I needed to create a Windows Server 2008 group which doesn’t contain Windows 2008 R2 servers.

Luckily I remembered a blog post by Kevin Holman about creating a group of computers that excludes the computers in another group (btw, glad he’s back on the support front, I really missed those deep dives in SCOM). I created a new group, edited the XML and set the override. The only difference between my group exclusion and Kevin Holman’s is that I use a reference to another MP in the “NotContained” section, as I use the “Microsoft Windows Server 2008 R2 Computer Group” which already exists in the Windows Server 2008 (Discovery) MP. This means the reference to that MP needs to be included in the XML below.
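The membership rule in that group boils down to a set difference: everything of the Windows Server 2008 computer class, minus whatever is contained in the R2 group. A minimal sketch of that idea in Python follows; the server names and the function name are made up for illustration and have nothing to do with how SCOM evaluates the rule internally.

```python
def windows_2008_only(all_2008_computers, r2_group_members):
    """Return the computers of the 2008 class that are NOT contained in the R2 group.

    Mirrors the idea of <MonitoringClass> + <NotContained> in the group
    population rule; this is an illustration, not SCOM code.
    """
    return sorted(set(all_2008_computers) - set(r2_group_members))


if __name__ == "__main__":
    # Hypothetical inventory: the 2008 class holds 2008 and 2008 R2 servers alike.
    all_2008 = ["srv01", "srv02", "srv03", "srv04"]
    r2_group = ["srv02", "srv04"]
    print(windows_2008_only(all_2008, r2_group))
```

Only the servers left over after the subtraction get the “Enabled = false” override for the Exchange discovery.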

The result is here. Save this as ValueBlueOverrideMicrosoft.Exchange.Server.xml (remove the “windows server 2003” override and reference to “Microsoft.Windows.Server.2003” if you don’t run this anymore):

<?xml version="1.0" encoding="utf-8"?>
<ManagementPack ContentReadable="true" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<Manifest>
<Identity>
<ID>ValueBlueOverrideMicrosoft.Exchange.Server</ID>
<Version>1.0.1.0</Version>
</Identity>
<Name>ValueBlueOverrideMicrosoft Exchange Server 2013</Name>
<References>
<Reference Alias="Exchange">
<ID>Microsoft.Exchange.15</ID>
<Version>15.0.620.18</Version>
<PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
</Reference>
<Reference Alias="MicrosoftSystemCenterInstanceGroupLibrary6172210">
<ID>Microsoft.SystemCenter.InstanceGroup.Library</ID>
<Version>6.1.7221.0</Version>
<PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
</Reference>
<Reference Alias="SystemCenter">
<ID>Microsoft.SystemCenter.Library</ID>
<Version>6.1.7221.81</Version>
<PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
</Reference>
<Reference Alias="MicrosoftWindowsServer2008Discovery6066670">
<ID>Microsoft.Windows.Server.2008.Discovery</ID>
<Version>6.0.6667.0</Version>
<PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
</Reference>
<Reference Alias="Windows">
<ID>Microsoft.Windows.Server.2003</ID>
<Version>6.0.6667.0</Version>
<PublicKeyToken>31bf3856ad364e35</PublicKeyToken>
</Reference>
</References>
</Manifest>
<TypeDefinitions>
<EntityTypes>
<ClassTypes>
<ClassType ID="ValueBlue.Microsoft.Server.2008.Only.Group" Accessibility="Public" Abstract="false" Base="MicrosoftSystemCenterInstanceGroupLibrary6172210!Microsoft.SystemCenter.InstanceGroup" Hosted="false" Singleton="true" />
</ClassTypes>
</EntityTypes>
</TypeDefinitions>
<Monitoring>
<Discoveries>
<Discovery ID="ValueBlue.Microsoft.Server.2008.Only.Group.DiscoveryRule" Enabled="true" Target="ValueBlue.Microsoft.Server.2008.Only.Group" ConfirmDelivery="false" Remotable="true" Priority="Normal">
<Category>Discovery</Category>
<DiscoveryTypes>
<DiscoveryRelationship TypeID="MicrosoftSystemCenterInstanceGroupLibrary6172210!Microsoft.SystemCenter.InstanceGroupContainsEntities" />
</DiscoveryTypes>
<DataSource ID="GroupPopulationDataSource" TypeID="SystemCenter!Microsoft.SystemCenter.GroupPopulator">
<RuleId>$MPElement$</RuleId>
<GroupInstanceId>$MPElement[Name="ValueBlue.Microsoft.Server.2008.Only.Group"]$</GroupInstanceId>
<MembershipRules>
<MembershipRule>
<MonitoringClass>$MPElement[Name="MicrosoftWindowsServer2008Discovery6066670!Microsoft.Windows.Server.2008.Computer"]$</MonitoringClass>
<RelationshipClass>$MPElement[Name="MicrosoftSystemCenterInstanceGroupLibrary6172210!Microsoft.SystemCenter.InstanceGroupContainsEntities"]$</RelationshipClass>
<Expression>
<NotContained>
<MonitoringClass>$MPElement[Name="MicrosoftWindowsServer2008Discovery6066670!Microsoft.Windows.Server.2008.R2.ComputerGroup"]$</MonitoringClass>
</NotContained>
</Expression>
</MembershipRule>
</MembershipRules>
</DataSource>
</Discovery>
</Discoveries>
<Overrides>
<DiscoveryPropertyOverride ID="OverrideForDiscoveryMicrosoftExchange15ServerDiscoveryRuleForContextMicrosoftWindowsServer2003ComputerGroupd5787b8329934b19ba24ca637d805307" Context="Windows!Microsoft.Windows.Server.2003.ComputerGroup" ContextInstance="cb87057c-606b-43e7-e861-8e5a0df201f6" Enforced="false" Discovery="Exchange!Microsoft.Exchange.15.Server.DiscoveryRule" Property="Enabled">
<Value>false</Value>
</DiscoveryPropertyOverride>
<DiscoveryPropertyOverride ID="OverrideForDiscoveryMicrosoftExchange15ServerDiscoveryRuleForContextValueBlue.Microsoft.Server.2008.OnlyGroup4700b619d2184cb2af12302026deee09" Context="ValueBlue.Microsoft.Server.2008.Only.Group" ContextInstance="4cc2c8a2-918a-2ec3-f05e-fa1042b4a4db" Enforced="false" Discovery="Exchange!Microsoft.Exchange.15.Server.DiscoveryRule" Property="Enabled">
<Value>false</Value>
</DiscoveryPropertyOverride>
</Overrides>
</Monitoring>
<LanguagePacks>
<LanguagePack ID="NLD" IsDefault="false">
<DisplayStrings>
<DisplayString ElementID="ValueBlueOverrideMicrosoft.Exchange.Server">
<Name>ValueBlueOverrideMicrosoft Exchange Server 2013</Name>
</DisplayString>
<DisplayString ElementID="ValueBlue.Microsoft.Server.2008.Only.Group">
<Name>ValueBlueWindows 2008 servers only (no windows 2008 r2 servers)</Name>
</DisplayString>
<DisplayString ElementID="ValueBlue.Microsoft.Server.2008.Only.Group.DiscoveryRule">
<Name>Populate ValueBlueWindows 2008 servers only (no windows 2008 r2 servers)</Name>
<Description>This discovery rule populates the group 'ValueBlueWindows 2008 servers only (no windows 2008 r2 servers)'</Description>
</DisplayString>
</DisplayStrings>
</LanguagePack>
<LanguagePack ID="ENU" IsDefault="false">
<DisplayStrings>
<DisplayString ElementID="ValueBlueOverrideMicrosoft.Exchange.Server">
<Name>ValueBlueOverrideMicrosoft Exchange Server 2013</Name>
</DisplayString>
<DisplayString ElementID="ValueBlue.Microsoft.Server.2008.Only.Group">
<Name>ValueBlueWindows 2008 servers only (no windows 2008 r2 servers)</Name>
</DisplayString>
<DisplayString ElementID="ValueBlue.Microsoft.Server.2008.Only.Group.DiscoveryRule">
<Name>Populate ValueBlueWindows 2008 servers only (no windows 2008 r2 servers)</Name>
<Description>This discovery rule populates the group 'ValueBlueWindows 2008 servers only (no windows 2008 r2 servers)'</Description>
</DisplayString>
</DisplayStrings>
</LanguagePack>
</LanguagePacks>
</ManagementPack>

Posted in grouping and scoping, management packs, troubleshooting

SNMP port claimed, but not by the snmp services.

Posted by rob1974 on October 24, 2013

 

I recently came across the following issue. SCOM was throwing alerts every minute and auto-closing them fast, all from this monitor: “Network Monitoring SNMP Trap port is already in use by another program”. Two things struck me when I looked at it. First, it was a manual reset monitor, so how could it be auto-closed? Second, it was about the management server itself and its inability to claim the SNMP trap receiver port (UDP 162), while I was sure I had disabled the SNMP services.

So what claimed the SNMP port? With 2 simple commands you can find out:

> netstat -ano -p UDP | find ":162"

This resulted in some PID. Now check the PID versus tasklist /svc and we should have the process responsible for the claim.

> tasklist /svc /fi "pid eq <pidnumberfound>"

and it returned nothing.
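If you have to repeat this check more often, the first step can be scripted. The sketch below is only an illustration in Python: it parses text shaped like the output of netstat -ano -p UDP and returns the PIDs bound to a given UDP port. The function name and the sample output are made up; the column layout (protocol, local address, foreign address, PID) is assumed to be the usual one. Unlike find ":162" it matches the port exactly, so a port like 1620 is not picked up by accident.

```python
def pids_bound_to_udp_port(netstat_output, port):
    """Return the sorted PIDs that own a UDP endpoint on the given local port.

    Expects text shaped like the output of `netstat -ano -p UDP`:
    protocol, local address, foreign address, PID (the state column is empty for UDP).
    """
    suffix = ":" + str(port)
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[0].upper() != "UDP":
            continue  # header lines, blank lines, other protocols
        if fields[1].endswith(suffix) and fields[-1].isdigit():
            pids.add(int(fields[-1]))
    return sorted(pids)


if __name__ == "__main__":
    # Made-up sample output, for illustration only.
    sample = """
  Proto  Local Address          Foreign Address        State           PID
  UDP    0.0.0.0:162            *:*                                    4321
  UDP    0.0.0.0:1620           *:*                                    999
  UDP    [::]:162               *:*                                    4321
"""
    print(pids_bound_to_udp_port(sample, 162))
```

The PID that comes out of this is the one to feed into the tasklist filter.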

After rechecking my results in quite a few different ways I came to the conclusion that the UDP port really had been claimed by a process that didn’t exist anymore. I feel it’s a bug in Windows Server, as it should always close the handles whenever a process dies, but let’s be pragmatic about it: reboot the server. After the reboot I ran the commands above again and found that “monitoringhost.exe” had claimed the port. W00t, we solved the problem, SCOM can receive traps again.

As mentioned at the start of the post, the alerts were closing fast, and because of that nobody who saw them would pay any attention to them. SCOM wasn’t receiving any traps in the above condition, so why was the alert of a manual reset monitor being auto-closed so fast?

The explanation is quite simple: the recovery task for this monitor. This recovery task runs, disables the Windows SNMP services and resets the monitor. The problem is that the recovery didn’t actually fix anything (the SNMP services were already disabled), but it did reset the monitor.

I think the solution should be a rewrite of the monitor, where the monitor checks if SCOM really has claimed the SNMP port. I could do this, but for now I will leave it. It might have just been one of those exotic errors you’ll see once in a lifetime…

Posted in Management Servers, troubleshooting

Website monitoring, another gotcha!

Posted by rob1974 on February 13, 2012

Recently I had an issue with website monitoring in a SCOM demo environment. I had configured a website test (through the template) to run every 2 minutes. I had created a DNS zone hosting the FQDN for this website. Then I paused the DNS zone and waited for the HTTP test to fail. I expected the HTTP test to fail and to have an alert in SCOM within 2 minutes. However, after 10 minutes I had nothing… Not really what you want when you do a live demo.

So what was happening here? Some of you might have guessed it already: it’s DNS caching on the client running the HTTP test. So how to stop this? Well, there are 3 things you can do.

 

1. By default the DNS Client service is started on a Windows machine. Simply stop the DNS Client service and the caching will stop (DNS queries will still resolve).

2. Increase the interval of the HTTP test. Anything more than 15 minutes will do…

3. Decrease the default cache time for the queries to something less than your test interval.

 

As I was giving a demo, option 2 was not an option for me. But I would seriously think about it when I do website tests in production. Obviously service level agreements should play a part in this, but a delay of max 30 minutes on an SLA of 8 hours would definitely be acceptable for me.

Option 1 was no option either. Besides caching, the DNS Client service also registers domain-joined hosts in DNS, so stopping it is not something I would recommend. And caching helps performance-wise, so I’m not sure I would ever want to disable it.

So I was left with option 3, but how to do this? In HKLM\SYSTEM\CurrentControlSet\services\Dnscache\parameters create the DWORD MaxCacheTtl (and MaxNegativeCacheTtl if it’s not already set to 0 with “NegativeCacheTime”) and give it a value below the interval of the website test. For a 2-minute test I used a value of 90 (seconds).
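If you need the same setting on more than one test host, a .reg file is easier to hand around than a description. The Python sketch below only generates the text of such a file; it doesn’t touch the registry. The function name is made up for illustration, the key path and value name are the ones mentioned above, and it only covers MaxCacheTtl (MaxNegativeCacheTtl could be added the same way if you need it).

```python
# Registry key from the post, written out in .reg notation.
DNSCACHE_KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Dnscache\Parameters"


def dns_cache_reg_file(max_cache_ttl_seconds, test_interval_seconds):
    """Build the text of a .reg file that caps the DNS client cache TTL.

    Refuses values that are not below the website test interval, because
    a cache that outlives the test interval is exactly the problem.
    """
    if not 0 < max_cache_ttl_seconds < test_interval_seconds:
        raise ValueError("MaxCacheTtl must be positive and below the test interval")
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        "[" + DNSCACHE_KEY + "]",
        # DWORD values are written as 8 hexadecimal digits in a .reg file.
        '"MaxCacheTtl"=dword:%08x' % max_cache_ttl_seconds,
    ]
    return "\r\n".join(lines) + "\r\n"


if __name__ == "__main__":
    # 90 seconds of caching for a test that runs every 120 seconds.
    print(dns_cache_reg_file(90, 120))
```

The check on the interval is the whole point of the sketch: the value only helps when it’s lower than the time between two runs of the website test.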

[Image: dnssettings]

Normally I think option 2 will be the best to go for. No use to run tests more than you would need to. However if you have a dedicated host for running website tests and you run those tests more often than every 15 minutes, consider reducing the max. cache time of the dns client.

Posted in general, troubleshooting

SCOM’s “un-discovery”. What doesn’t work here… And how to correct it.

Posted by rob1974 on January 26, 2011

 

SCOM’s main benefit for monitoring, imho, is its ability to discover what is running on a server and, based on that information, start monitoring the server with the appropriate rules. When you follow Microsoft’s best practices you’ll first perform a lightweight discovery to create a very basic class and have the heavier discoveries run against that basic class. This is pretty good stuff actually. It helps the performance of an agent quite a lot, as it will only run heavy discoveries if the server has an application role, and never on servers which have nothing to do with that application.

However, I’ve recently found out a drawback with this 2 step discovery, which I can probably explain the best with a real world example:

Discover the Windows domain controllers on “Windows computers”. (The management pack this discovery lives in is an exception: usually it’s in the application MP itself, but apparently MS considered being a domain controller basic info. Similar discoveries for workstations and member servers can be found in this MP as well.) For this discovery a WMI query is used to determine whether the “Windows computer” is a domain controller as well (SELECT NumberOfProcessors FROM Win32_ComputerSystem WHERE DomainRole > 3; if this returns something, it’s a DC).
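The “DomainRole > 3” part is easier to read with the documented values of Win32_ComputerSystem.DomainRole next to it. A small Python illustration (the function name is made up; the role values are the documented WMI ones):

```python
# Documented values of Win32_ComputerSystem.DomainRole.
DOMAIN_ROLES = {
    0: "Standalone Workstation",
    1: "Member Workstation",
    2: "Standalone Server",
    3: "Member Server",
    4: "Backup Domain Controller",
    5: "Primary Domain Controller",
}


def is_domain_controller(domain_role):
    """Same test as the WMI query's WHERE clause: DomainRole > 3."""
    return domain_role > 3


if __name__ == "__main__":
    for role, name in sorted(DOMAIN_ROLES.items()):
        print(role, name, is_domain_controller(role))
```

So the query returns a row only for roles 4 and 5, and the selected property (NumberOfProcessors) doesn’t matter at all; it’s just something to select.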


When it is a “windows domain controller” it will run a few other discoveries to determine more info.

Just by looking at the classes you can imagine it’s not really lightweight anymore.

[Image: the classes discovered for a domain controller]

So far so good: on all my Windows computers I run a simple query, and if that query returns something SCOM will also run a script that finds more interesting stuff about the DC.

But here’s the catch with this kind of discovery. Suppose I don’t need a certain DC anymore, but I still need to keep the server as it’s running some application I still need to use and monitor. What will happen? The lightweight discovery will do its job. It will correctly determine that the server is not a “windows domain controller” anymore and as a result it won’t run the script-discovery anymore.

You might ask: why is that bad, isn’t that what we wanted? Yes, you are correct, we didn’t want to run this discovery against servers that aren’t DCs, but SCOM doesn’t unlearn the discovered classes automatically. Because this discovery never runs again, SCOM never learns that this server doesn’t have the “Active Directory Domain Controller Computer Role” anymore. And this is the class that is used for targeting rules and monitors. So although SCOM knows the server isn’t a “windows domain controller” anymore, it is still monitoring the “Active Directory Domain Controller Computer Role”. This will result in quite a lot of noise (script errors, LDAP failures, etc.).

For now, there’s just a workaround available. You will need to override the 2nd discovery for that particular server. As the first discovery no longer includes this server as an object of the class, you can’t override the discovery for a “specific object of class: Windows Domain Controller”. You’ll need to create a group and include the server object. Then override the object discovery “for a group…” and choose the group you’ve just created.
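To make the catch concrete, here is a toy model of the 2-step discovery in Python. It’s a deliberate simplification for illustration, not how SCOM is implemented: the function name and the set of class names per server are made up, and step 2 is reduced to “add the role when it runs”.

```python
BASIC_CLASS = "Windows Domain Controller"
ROLE_CLASS = "Active Directory Domain Controller Computer Role"


def run_discovery_cycle(discovered, is_dc):
    """Run one cycle of the two-step discovery for a single server.

    `discovered` is the set of classes currently known for the server;
    a new set is returned so the input is left untouched.
    """
    discovered = set(discovered)

    # Step 1: the lightweight discovery runs on every Windows computer
    # and both adds and removes the basic class.
    if is_dc:
        discovered.add(BASIC_CLASS)
    else:
        discovered.discard(BASIC_CLASS)

    # Step 2: the heavy discovery is targeted at the basic class, so it
    # only runs (and only gets a chance to update the role) while the
    # server still is an instance of that class.
    if BASIC_CLASS in discovered:
        discovered.add(ROLE_CLASS)

    return discovered


if __name__ == "__main__":
    state = run_discovery_cycle(set(), is_dc=True)
    print("while a DC:    ", sorted(state))
    state = run_discovery_cycle(state, is_dc=False)
    print("after demotion:", sorted(state))
```

After the demotion the basic class is gone, but nothing ever runs again that could remove the role class, and that leftover class is what the rules and monitors are targeted at.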


What’s the point of disabling a discovery that didn’t run anyway? Well, now you can go to PowerShell and run the “Remove-DisabledMonitoringObject” cmdlet. This will remove the discovered object classes for this discovery and all of the monitoring attached to those classes.

Discoveries make SCOM stand out from other monitoring tools, but it needs to work both ways. Finding this out took me about 1 day. And that’s just 1 issue with 1 server (DNS was also installed on this server and had the same issue). Loads of servers might change role without me knowing about it, and when it’s not being reported to me I’ll just have extra noise in SCOM. I’m just not sure if this can be picked up within SCOM itself or whether the “un-discovery” needs to be done by the MP’s logic. For the AD part it needs to be picked up by Microsoft anyway, but if the logic has to be built into the management pack then it will have an impact on all the custom-built MPs by all you SCOM authors out there.

Posted in general, management packs, troubleshooting

WINS Connector Alert

Posted by rob1974 on August 26, 2010

The WINS connector checks the WINS lookup by a DNS server. I suppose the monitor only runs when you have configured DNS to use WINS forward lookup.


In order for this monitor to work you need to have a static record in WINS which doesn’t exist in DNS (this is not mentioned in the knowledge, but it runs an nslookup, so make sure the name doesn’t resolve through DNS) and configure the monitor to look up this HostName. By default the monitor looks up “PlaceHolder”, so you could just create a WINS record named “PlaceHolder”.

I had done this, but I still received errors, and when I ran the nslookup query I did receive a valid response for placeholder, so the WINS connector does work.

The knowledge mentions a debug flag to get some helpful troubleshooting information to solve the issue. The description of the debug events I got:

DNS.TTL.vbs : Starting DNS.TTL.vbs Host:PlaceHolder Server:xxx.xxx.xxx.xxx

DNS.TTL.vbs : Writing Property Bag . State=False ttl1:0 ttl2:0 Authority Flag:

It’s not helpful at all and, even worse, it’s wrong as well. The VBScript file’s name is ttl.vbs, not DNS.TTL.vbs, so look for that in the “system center management” folder. The documented parameters are wrong too, so manually running it fails as well.

To run the script manually on the dns server open a commandline box and run:

path.to.ttl.vbs>cscript /nologo ttl.vbs <hostname> <dnsserverip> false

<hostname> = static WINS entry which doesn’t exist in DNS (placeholder). (The script says to fill in an FQDN, but this is incorrect as well.)

<dnsserverip> = listening IP address of the DNS server

boolean = debug flag. When you set this to true you get the worthless debug information in SCOM, so keep it on false 🙂

When you save the output to an xml file and open it you’ll get something like this.

<Collection>
  <DataItem type="System.PropertyBagData" time="2010-08-25T17:18:49.1162249+02:00" sourceHealthServiceId="BA0AF2AD-5058-0DA0-D5D0-BF3CDD878B88">
    <ConversionType>StateData</ConversionType>
    <Property Name="state" VariantType="8">ERROR</Property>
  </DataItem>
</Collection>

I’ve modified the script so it will show the stdout of the nslookup and print a line when it exits the regex compare function early (which it shouldn’t when it functions OK).

Just save the script below to a temp location and run it from there. When you run this vbs and all goes well you should just get this as output:

std_output:

————
Got answer:
    HEADER:
        opcode = QUERY, id = 1, rcode = NOERROR
        header flags:  response, want recursion, recursion avail.
        questions = 1,  answers = 1,  authority records = 0,  additional = 0

    QUESTIONS:
        xxxxxxxxxxx, type = PTR, class = IN
    ANSWERS:
    ->  xxxxxxxxxxxxxxxxx

        name = xxxxxxxxxxxxxxxxxx
        ttl = 168 (2 mins 48 secs)

————
Server:  xxxxxxxxxxxxxxxxxx
Address:  xxxxxxxxxxxxxxxxxx

————
Got answer:
    HEADER:
        opcode = QUERY, id = 2, rcode = NOERROR
        header flags:  response, auth. answer, want recursion, recursion avail.
        questions = 1,  answers = 1,  authority records = 0,  additional = 0

    QUESTIONS:
        xxxxxxxxxxxxxx, type = A, class = IN
    ANSWERS:
    ->  xxxxxxxxxxxx

        internet address = xxxxxxxxxxxx
        ttl = 1200 (20 mins)

————
Name:    xxxxxxxxxxxxxxx
Address:  xxxxxxxxxxxx

When it fails it will log a line before this output:

exit function at regex: 16

This means the WINS_LOOKUP_REGEX array fails at "^\s*ttl = ",_ (element 16, counting from 0) or the line after that.

I couldn’t be bothered to figure out the exact regular expression mismatch, rewrite the WINS_LOOKUP_REGEX array, disable the monitor and create a new one with a new script. I’ve just disabled this monitor as it only gives me incorrect information.
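The way the script walks through that array is worth spelling out: every pattern has to match at the very start of whatever text is left, and the matched part is then cut off before the next pattern is tried. The Python sketch below mimics that idea with a much shorter pattern list and made-up nslookup-like text; it’s an illustration of the technique, not a port of the script, and it doesn’t claim to show the actual cause of the failure above.

```python
import re


def match_in_sequence(patterns, text):
    """Match each pattern at the start of the remaining text, in order.

    Returns (True, None) when all patterns matched, otherwise
    (False, index) with the index of the first pattern that failed.
    """
    remaining = text
    for index, pattern in enumerate(patterns):
        match = re.match(pattern, remaining)  # anchored at the start, like "^" & pattern
        if match is None:
            return False, index
        remaining = remaining[match.end():]  # consume what was matched
    return True, None


# Shortened, hypothetical pattern list: skip one line, then expect the ttl line.
PATTERNS = [r".*\r\n", r"\s*ttl = ", r"[0-9]+"]

if __name__ == "__main__":
    as_expected = "    ANSWERS:\r\n        ttl = 1200 (20 mins)\r\n"
    one_line_extra = ("    ANSWERS:\r\n"
                      "        internet address = 192.0.2.10\r\n"
                      "        ttl = 1200 (20 mins)\r\n")
    print(match_in_sequence(PATTERNS, as_expected))
    print(match_in_sequence(PATTERNS, one_line_extra))
```

Matching like this is very sensitive to the exact number of lines in the output: a single line more or less than the array expects and the next pattern no longer lines up.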

Modified script:

'
' Microsoft Corporation
' Copyright (c) Microsoft Corporation. All rights reserved.
'
' ttl.vbs
'
' Determine if a wins connector is healthy.
'
' Parameters -
'                       TargetComputer	The FQDN of the computer targeted by the script.
'			Server - the listening ip
'                      	DebugFlag          True / False

Option Explicit

SetLocale("en-us")

Const DNS_TRACEEVENTNUMBER	= 1125
Const DNS_SCRIPTNAME = "DNS.TTL.vbs"
Const SCOM_ERROR=1
Const SCOM_WARNING=2
Const SCOM_INFORMATIONAL=4
Const SCOM_PB_STATEDATA = 3
Const NSLOOKUP_PATH = "%SystemRoot%\system32\nslookup.exe" 

Dim ImagePath, oWMI, rc, oArgs, oAPI, oDiscoveryData, oInst, SourceID, ManagedEntityId, TargetComputer, OSVersion, oDebugFlag
Dim TTL1 , TTL2, AuthorityFlag
dim host,server,bolDebug
dim objAPI  ,boolWins , oPropertyBag
Dim  sCommand,iErrCode, sOutput, sError,m_sNetshPath ,aSubMatches,  oShell, objArgs

Dim WINS_LOOKUP_REGEX

WINS_LOOKUP_REGEX = Array( _
                                 ".*\r\n",_
                                 ".*\r\n",_
                                 ".*\r\n",_
                                 ".*\r\n",_
                                 ".*\r\n",_
			      "^[\s]*questions = [0-9]*,",_
			      "^[\s]*answers = ",_
			      "[0-9]*,",_
			      "^[\s]*authority records = ",_
			      "[0-9]*",_
                                 ".*\r\n",_
                                 ".*\r\n",_
                                 ".*\r\n",_
                                 ".*\r\n",_
                                 ".*\r\n",_
                                 ".*\r\n",_
                                 "^\s*ttl = ",_
                                 "[0-9]*.*" )         

'***************
'
' start here.
'
'***************
On Error Resume Next

Set objAPI = CreateObject("MOM.ScriptAPI")
If Err.Number <> 0 Then
  Wscript.Quit
end if
Set oPropertyBag = objAPI.CreateTypedPropertyBag(3)
If Err.Number <> 0 Then
  ThrowErrorAndExit "CreateStateDataTypedPropertyBag failed. code = " & Err.Number
end if

Set objArgs = WScript.Arguments
If objArgs.Count <> 3 Then
    Call objAPI.LogScriptEvent(DNS_SCRIPTNAME, DNS_TRACEEVENTNUMBER, SCOM_ERROR, "Usage: ttl.vbs <hostname> <dnsserverip> <debugflag: true | false>")
    wscript.Quit
End If 

host = objArgs(0)
server = objArgs(1)
bolDebug= objArgs(2)

Set oShell = CreateObject("WScript.Shell")
boolWins=cbool(false)
TTL1=0
TTL2=0

Set oShell = CreateObject("WScript.Shell")

trace "Starting " & DNS_SCRIPTNAME & " Host:" & host  & " Server:" & server 

sOutput=ExecuteCmd("-debug -querytype=a " +host + " "+ server ,NSLOOKUP_PATH  ,true)
If  LCase(sOutput) <> "error" Then

	If   GetSubMatches( WINS_LOOKUP_REGEX, sOutput, sOutput, aSubMatches) Then

		TTL1=cint(aSubMatches(8) )
		AuthorityFlag=cint(aSubMatches(4) )

    end if
end if

if 	AuthorityFlag=1 then
	Wscript.sleep (1020)
	sOutput=ExecuteCmd("-debug -querytype=a " +host + " "+ server ,NSLOOKUP_PATH  ,true)
	If  LCase(sOutput) <> "error" Then

		If   GetSubMatches( WINS_LOOKUP_REGEX, sOutput, sOutput, aSubMatches) Then

			TTL2=cint(aSubMatches(8) )
			if ttl1>ttl2 then
				 boolWins=cbool(true)
			end if
		end if
    end if
end if	

wscript.echo ""
wscript.echo "std_output:"
wscript.echo sOutput 

trace "Writing Property Bag . State=" & cstr(boolWins) & " ttl1:" & cstr(ttl1) & " ttl2:" & cstr(ttl2) & " Authority Flag:" & cstr(	AuthorityFlag)

if boolWins=true then
	oPropertyBag.AddValue "state", "OK"
else
	oPropertyBag.AddValue "state", "ERROR"
end if
objAPI.AddItem(oPropertyBag)
If Err.Number <> 0 Then ThrowErrorAndExit "Error adding state data to property bag. code = " & Err.Number
objAPI.ReturnItems
If Err.Number <> 0 Then ThrowErrorAndExit "Error returning property bag data. code = " & Err.Number  

Set objAPI = Nothing
Set oPropertyBag = Nothing

Wscript.Quit

'*******************************************************

Sub ThrowErrorAndExit(Message)

   Err.Clear
   Call objAPI.LogScriptEvent(DNS_SCRIPTNAME, DNS_TRACEEVENTNUMBER, SCOM_ERROR, Message)
   WScript.Quit

End Sub

Sub Trace(Message)
   If (bolDebug) Then
      Call objAPI.LogScriptEvent(DNS_SCRIPTNAME, DNS_TRACEEVENTNUMBER, SCOM_INFORMATIONAL, Message)
   End If

End Sub

Function GetSubMatches(ByVal aRegexes, ByVal sText, ByRef sRemainingText, ByRef aCapturedSubMatches)
  Dim oRegex
  Set oRegex = New RegExp
  oRegex.Global = False

  Dim oMatches
  Dim oMatch
  Dim sPattern
  Dim aSubMatches()
  aCapturedSubMatches = aSubMatches

  GetSubMatches = False

  Dim i
  Dim lSubMatchCount

  lSubMatchCount = 0
  sRemainingText = sText

  dim intCount
  intCount = 0

  For i = 0 To UBound(aRegexes)
    sPattern = aRegexes(i)
    oRegex.Pattern = "^" & sPattern
    Set oMatches = oRegex.Execute(sRemainingText)
    if oMatches.Count = 0 then
        wscript.echo "exit function at regex: " & intCount
    end if
    If oMatches.Count <> 1 Then
      sRemainingText = sText
      Exit Function
    End If

    Set oMatch = oMatches(0)
    sRemainingText = Mid(sRemainingText, oMatch.Length + 1)

    ' save output If odd line, or only line.

    If i Mod 2 = 1 Then
      lSubMatchCount = lSubMatchCount + 1
      ReDim Preserve aSubMatches(lSubMatchCount - 1)
      aSubMatches(lSubMatchCount - 1) = oMatch.Value
    elseIf UBound(aRegexes)=0 Then
      lSubMatchCount = lSubMatchCount + 1
      ReDim Preserve aSubMatches(lSubMatchCount - 1)
      aSubMatches(lSubMatchCount - 1) = oMatch.Value
    End If
    intCount = intCount + 1
  Next

  GetSubMatches = True
  aCapturedSubMatches = aSubMatches

End Function

Function ExecuteCmd(strOptionToUse, strCmdToUse, boolReadOutput)
Dim ncControlcommand
Dim oShell
Dim curDir
Dim strExecOut

Set oShell = CreateObject("WScript.Shell")
curDir = oShell.CurrentDirectory
ncControlcommand =  "cmd.exe /C """ & QuoteWrap(strCmdToUse) & " " & strOptionToUse & " " &"""" 

IF boolReadOutput Then
    strExecOut = RunCmd(ncControlcommand,true)
Else
    strExecOut = RunCmd(ncControlcommand,false)
End If
ExecuteCmd = strExecOut
End Function

Function RunCmd(CmdString, boolGetOutPut)
    Dim wshshell
    Dim oExec
    Dim output
    Dim strOutPut

    Set wshshell = CreateObject("WScript.Shell")
    Set oExec = wshshell.Exec(CmdString)
    Set output = oExec.StdOut
    Do While oExec.Status = 0
         WScript.Sleep 100
         if output.AtEndOfStream = false then
            IF boolGetOutPut Then
                    strOutPut = strOutPut & output.ReadAll
                End IF
         else
              exit Do
         End If
    Loop
    IF boolGetOutPut Then
        strOutPut = strOutPut & output.ReadAll
    Else
        strOutPut = "1"
    End IF    

    If oExec.ExitCode <> 0 Then
         strOutPut = "Error"

    End If

    Set wshshell = Nothing
    RunCmd = strOutPut
End Function

Function QuoteWrap(myString)
      If (myString <> "") And (left(mySTring,1) <> Chr(34)) And (Right(myString,1) <> Chr(34)) Then
            QuoteWrap = Chr(34) & myString & Chr(34)
      Else
            QuoteWrap = myString
      End If
End Function

Function IsValidObject(ByVal oObject)
  IsValidObject = False

  If IsObject(oObject) Then
    If Not oObject Is Nothing Then
      IsValidObject = True
    End If
  End If
End Function

Posted in management packs, troubleshooting

The Moving Average threshold alerts.

Posted by rob1974 on July 14, 2010

A common tuning strategy is to look at the top 25 most common alerts. You can find this report in the ‘Microsoft ODR Report Library’. When I first started looking at this report I noticed I had several alerts in this top X which I had never seen in the console before. The only way I could see these alerts was by creating a “closed alert” view. Most of these alerts were closed by “system” within 1-2 minutes.

I wanted to know how many alerts there were in my environment which were automatically closed within 5 minutes. The query below gives the count for those alerts.

select count(*) from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and DATEDIFF(MI, TimeRaised, TimeResolved) <= 5

In my environment more than 1/3 of the total alerts (run only the first line to get the total) were closed within 5 minutes. Most of these alerts aren’t being looked at, because they close so fast. So why are they generated in the first place?

With a few more queries I hoped to find some details about the alerts. The first one identifies which alerts are generated the most and closed within the time limit (thanks to Brian McDermott for helping out with these).

select top 10 AlertStringName, count(*) from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and DATEDIFF(MI, TimeRaised, TimeResolved) < 5
group by AlertStringName
order by 2 DESC

The second one identifies which objects generated the most of these alerts.

select top 10 MonitoringObjectFullName, count(*) from AlertView
where ResolutionState = 255
and ResolvedBy = 'system'
and DATEDIFF(MI, TimeRaised, TimeResolved) < 5
group by MonitoringObjectFullName
order by 2 DESC
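The filter these queries share can be sketched outside SQL as well, which is handy if you export alerts and want to play with the time limit. The Python below is an illustration on made-up rows, not a replacement for the queries. One detail it copies on purpose: DATEDIFF(MI, …) counts the minute boundaries crossed, not the elapsed minutes. The case-insensitive compare on ResolvedBy assumes a case-insensitive database collation.

```python
from datetime import datetime


def minute_boundaries_crossed(start, end):
    """Mimic T-SQL DATEDIFF(MI, start, end): count minute boundaries crossed."""
    start_floor = start.replace(second=0, microsecond=0)
    end_floor = end.replace(second=0, microsecond=0)
    return int((end_floor - start_floor).total_seconds() // 60)


def auto_closed_quickly(alerts, limit_minutes=5):
    """Return the alerts closed by 'system' in under `limit_minutes` minutes."""
    return [
        alert for alert in alerts
        if alert["ResolutionState"] == 255
        and alert["ResolvedBy"].lower() == "system"
        and minute_boundaries_crossed(alert["TimeRaised"], alert["TimeResolved"]) < limit_minutes
    ]


if __name__ == "__main__":
    # Made-up rows, for illustration only.
    rows = [
        {"ResolutionState": 255, "ResolvedBy": "System",
         "TimeRaised": datetime(2010, 7, 14, 10, 0, 30),
         "TimeResolved": datetime(2010, 7, 14, 10, 2, 10)},
        {"ResolutionState": 255, "ResolvedBy": "DOMAIN\\operator",
         "TimeRaised": datetime(2010, 7, 14, 10, 0, 0),
         "TimeResolved": datetime(2010, 7, 14, 10, 1, 0)},
        {"ResolutionState": 255, "ResolvedBy": "System",
         "TimeRaised": datetime(2010, 7, 14, 10, 0, 0),
         "TimeResolved": datetime(2010, 7, 14, 10, 30, 0)},
    ]
    print(len(auto_closed_quickly(rows)))
```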

I can’t really give a strategy to tune these alerts, as they seem to be incidents where 1 server is near its threshold value for some time (just going over it, dropping under it again, etc.). You should tune this based on the monitor and its configuration. However, most of these alerts seem to come from “moving average” type monitors.

A moving average is an average over several samples; with each new sample it drops the oldest sample value in favour of the new one. With a moving average the monitor compares its average value to its threshold at the same rate as the sample frequency, and this can lead to a lot of state changes and thus alerts.

In the table below I’ve created fictional data to prove my point. The 3rd column displays the moving average over 5 samples. As you can see, after the 5th sample it gives a value every sample. The 4th column just takes the average over 5 samples, then takes 5 new samples and creates a new value.

Suppose we have a 1 minute sample rate. For moving average it means getting 9 alerts in 30 minutes which exist for 1 to 3 minutes. When we just make an average over 5 minutes and then drop the results and collect 5 new points it would have been 3 alerts which exist 5 to 10 minutes. Still a lot, but because the alert exists as “new” longer, it might have been noted by some operator and some admin might have taken a look at the server and resolved the issue.
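The difference between the two ways of averaging is easy to play with in a few lines of Python. This is a sketch on made-up samples (not the data from the table), with hypothetical function names; an “alert” here is simply every transition from under the threshold to over it.

```python
def moving_averages(samples, window):
    """Average over the last `window` samples, recomputed at every new sample."""
    return [sum(samples[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(samples))]


def block_averages(samples, window):
    """Average over `window` samples, then start again with `window` new samples."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples) - window + 1, window)]


def count_alerts(averages, threshold):
    """Count the transitions from healthy to over the threshold."""
    alerts = 0
    over = False
    for value in averages:
        now_over = value > threshold
        if now_over and not over:
            alerts += 1
        over = now_over
    return alerts


if __name__ == "__main__":
    # Made-up samples of a counter hovering around a threshold of 80.
    samples = [70, 75, 85, 90, 82, 60, 95, 88, 65, 92,
               78, 99, 61, 97, 70, 96, 64, 98, 72, 91]
    print("moving:", count_alerts(moving_averages(samples, 5), 80))
    print("block: ", count_alerts(block_averages(samples, 5), 80))
```

The moving average gets compared to the threshold after every sample, the block average only once per 5 samples, so the first has many more chances to flip state when the server hovers around the threshold.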

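[Image: table with the sample values, the moving average over 5 samples and the plain average per 5 samples]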

Tuning “moving average” monitors is quite difficult and the only way to reduce alerts is to use a lower sample rate. A higher threshold value doesn’t help, as these are incidents where a server hovers around the threshold value, whatever that value is.

Posted in management packs, troubleshooting

Monitoring Multiple Active Directory Forests Without A Trust

Posted by rob1974 on January 26, 2010

SCOM can monitor other forests without have a trust to that forest. This can be achieved by using certificates. My colleague already made a post on how these certificates need to be set up. Even though  SCOM management servers and its agents have communication individual management packs might have the requirement of having a trust in place.

As we all read the management pack guides before we import a new management pack, you immediately see on page 9 multiple forest topology discovery and views: “to discover other forest, a trust relationship is required between the forest hosting the Operations Manager Manager Root Management Server and other forests”.

We don’t have a trust nor do we want one, but most of our monitored servers are in other forests and there are quite a lot of domain controllers amongst those servers. Does this line mean we don’t have to bother with importing the AD MP? Fortunately not: it only concerns the topology discovery and the corresponding views of forests and domains. All domain controllers are discovered and monitored by this MP, even when they are in different forests than the SCOM servers.

We’ve disabled the topology discoveries as we just have a single forest, single domain topology for SCOM and the people in charge of the domain controllers are scoped to those servers on the computer object, so they won’t see the topology anyway. However, I wouldn’t recommend disabling the discoveries when the main forest you are monitoring is the same as the SCOM forest or has a trust with that forest. By disabling these discoveries you basically lose your health model for AD and are left with just the health of domain controllers.

If you do want to disable the topology discovery, just disable the discoveries below:

image

As you can see i filtered on “root management server” and this is the reason why you need to have the trust. These discoveries all run on the RMS.

So, now we have monitoring of domain controllers without having knowledge about the domain/forest topology. Well not quite, we’re not there yet. We do have monitoring of domain controllers, but we also have lots of errors about several scripts not being able to run with an error description “failed to create the object ‘McActiveDir.ActiveDirectory’”.

The management pack guide states that the Active Directory Helper object will be installed when you install the agent. However, this is only true when you push the agent (well, I think so; I haven’t really tested it). For manually installed agents, none of the helper objects will be installed and you need to install this helper object (OOMADS.msi, found on the SCOM CD) on all domain controllers manually.

The installation is so easy, i won’t even make a screenshot of it, no selections at all, just next, finish and done. But as soon as the installation has finished the “failed to create object McActiveDir.ActiveDirectory” errors have disappeared. And pretty soon after the health explorer for a domain controller looks like this instead of some warnings we had before:

image

So now all we have to do is install the OOMADS.msi on all domain controllers. However, we don’t like manual installs and want to script the helper object installation in our scripted agent install.

We could pick some setting ourselves to determine whether a server is a domain controller or not, but we’d prefer the same setting SCOM uses to make that determination. So we started looking in the MPs and AD discoveries for such a setting. Soon we found a script called ADLocalDiscoveryDC.vbs which has the following description (including the typos :)):

‘*************************************************************************
‘ Script Name – AD Local Discovery DC

‘ Purpose     – Discovers weather a local server is DC or not
‘              
‘ Parameters  – Targer fqdn, netbiosname

‘ (c) Copyright 2003, Microsoft Corporation, All Rights Reserved
‘ Proprietary and confidential to Microsoft Corporation             
‘*************************************************************************

So this must be how a DC is discovered, but it isn’t. If you look more closely, it’s actually targeted at domain controllers. It discovers a lot of the roles of a DC, but it does not discover whether a computer is a domain controller.

However, this script is still useful, because it does install the helper object on a domain controller, provided OOMADS.msi is located in scomagentinstallpath\HelperObjects (there are some more checks before OOMADS.msi is installed; if you are interested, just find the function InstallOOMADs() and figure out what it does; this function might be different for each OS version). This means we just have to place the correct helper object (x86, amd64, ia64) in this directory, and the next time this discovery script runs OOMADS will be installed automatically. As this discovery only runs on domain controllers, you can copy OOMADS.msi to all agent-managed servers, which also prevents manual action whenever a server is promoted to DC after agent installation.

A bit off topic, but we still wondered how SCOM determines whether a server is a DC or not. This discovery takes place in Microsoft.SystemCenter.Internal by a WMI query “SELECT NumberOfProcessors FROM Win32_ComputerSystem WHERE DomainRole > 3”. If this returns something, the computer is marked as a domain controller. Most likely NumberOfProcessors is used because each computer has at least 1, so the query always returns something when the WHERE clause matches. Similar queries for client (DomainRole <= 1) and server (DomainRole > 1 and < 4) computers are used, and these are also discovered in the internal MP.
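The three DomainRole ranges can be written out as a small Python sketch. It only mirrors the comparisons quoted above; it does not query WMI itself, and the function name and returned labels are made up for this illustration. The valid range of 0 to 5 is taken from the documented values of Win32_ComputerSystem.DomainRole (0 and 1 are workstations, 2 and 3 are servers, 4 and 5 are domain controllers).

```python
def classify_domain_role(domain_role: int) -> str:
    """Map a Win32_ComputerSystem.DomainRole value to client, server or domain controller."""
    if not 0 <= domain_role <= 5:
        raise ValueError(f"unexpected DomainRole value: {domain_role}")
    if domain_role > 3:   # WHERE DomainRole > 3
        return "domain controller"
    if domain_role > 1:   # DomainRole > 1 and < 4
        return "server"
    return "client"       # DomainRole <= 1


for role in range(6):
    print(role, classify_domain_role(role))
```

A scripted agent install could use the same comparison to decide whether a machine counts as a DC, so it agrees with what SCOM will discover later.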

To sum things up:

– The AD mp discovers and monitors all domain controllers regardless of having a trust.

– Forest/domain topology is only discovered when there’s a trust between the SCOM forest and other forests. This only affects the health model of AD, not the health of an individual domain controller. In other words, domains/forests without a trust are not monitored, but all domain controllers are.

– To have all tests functional OOMADS.msi must be installed on all domain controllers. By placing the correct version of OOMADS.msi in “scomagentinstallpath\HelperObjects” it will be installed on each domain controller automatically.

Posted in management packs, troubleshooting | 3 Comments »

One or more management servers do not get new updates from the RMS.

Posted by rob1974 on December 22, 2009

We’ve had a few issues where a management server regularly logged event 21024 (requesting new updates), but never received a new update. The management server was still functional with its old configuration. This happened when we installed a new management pack, approved an agent or set an override. But it did not happen every time, which makes finding the problem very hard. Most likely it has something to do with file locking of the .edb files. We made sure our antivirus wasn’t scanning these files. This seemed to help a bit, but it does still happen every now and then.

To solve this issue we had to go and remove the configuration for this management group on the RMS and restart the healthservice.

Stop health service

Rename the Health Service State folder (..\System Center Operations Manager 2007\Health Service State)

Start health service

Because the management servers keep on serving their agents, it’s difficult to determine whether a management server has this issue. We used to stumble upon it by accident (e.g. we couldn’t move an agent to a management server, or we kept getting alerts from a rule which we had disabled), but we really wanted to know about it as soon as it happened.

We compared the modification dates of the opsmgrconnector.config.xml (in ..\System Center Operations Manager 2007\Health Service State\Connector Configuration Cache\<management group>\) files and found that the RMS differed quite a bit from the management servers, but the management servers all had more or less the same date.

We found the modify dates between the management servers and the root management server were always under 24 hours (looks like a forced configuration update once a day, although it might just be some discovery or our set agent proxy script). The management servers’ configuration xml were always within 1 minute of each other.

We’ve created and scheduled a script on an agent managed machine to check the differences between the config files every hour. When the threshold has been passed the script generates an event to the application log. The thresholds are shown in the table below.

               informational   warning       critical
Diff MS-RMS    >24 hours       >36 hours     >48 hours
Diff MS-MS     >1 minute       >2 minutes    >5 minutes
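The threshold logic from this table could look roughly like the Python sketch below. It is not the script we actually scheduled; the function and constant names are made up for the illustration, and writing the event to the application log is left out. Only the threshold values come from the table.

```python
import os
from datetime import timedelta

# Thresholds from the table, ordered from most to least severe.
MS_RMS_THRESHOLDS = [
    (timedelta(hours=48), "critical"),
    (timedelta(hours=36), "warning"),
    (timedelta(hours=24), "informational"),
]
MS_MS_THRESHOLDS = [
    (timedelta(minutes=5), "critical"),
    (timedelta(minutes=2), "warning"),
    (timedelta(minutes=1), "informational"),
]


def mtime_difference(path_a, path_b):
    """Return the difference in modification time between two files."""
    return timedelta(seconds=os.path.getmtime(path_a) - os.path.getmtime(path_b))


def severity(diff, thresholds):
    """Return the most severe level whose limit is exceeded, or None if within limits."""
    diff = abs(diff)  # only the size of the gap matters, not which file is newer
    for limit, level in thresholds:
        if diff > limit:
            return level
    return None


print(severity(timedelta(hours=37), MS_RMS_THRESHOLDS))
print(severity(timedelta(seconds=30), MS_MS_THRESHOLDS))
```

In a real check, `mtime_difference` would be called with the opsmgrconnector.config.xml paths of two servers and the result passed to `severity` with the matching threshold list.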

 

We’ve chosen to let these events be picked up by SCOM, as the management servers still accept alerts even when they haven’t received a configuration update recently. Just make sure the initial rules are distributed to the agent.

If you are experiencing the same issue, please vote for the bug report on the connect site as well.

Posted in Management Servers, troubleshooting | 2 Comments »