JAMA00

SCOM 2007 R2

Archive for October, 2013

SNMP port claimed, but not by the snmp services.

Posted by rob1974 on October 24, 2013

 

I recently came across the following issue. SCOM was throwing errors every minute and auto closing them fast from this monitor: Network Monitoring SNMP Trap port is already in use by another program. 2 things I notice when I looked at it. The first one was it was a manual reset monitor how can that be auto closed? And the second one, it was about the management server itself and it’s inability to claim the SNMP trap receiver port (UDP 162), while I was sure I had disabled the SNMP services.

So what claimed the SNMP port? With 2 simple commands you can find out:

> netstat –ano –p UDP| find “:162”.

This resulted in some PID. Now check the PID versus tasklist /svc and we should have the process responsible for the claim.

> tasklist /svc /fi “pid eq <pidnumberfound>”

and it returned nothing.

After quite some rechecking my results in different ways I came to the conclusion that the UPD port really had been claimed by an process that didn’t exist anymore. I feel that’s it is some bug in Windows server, it should always close the handles whenever a process dies, but let’s be pragmatic about it > reboot server. After the reboot I ran the commands above again to find out the “monitoringhost.exe” claimed the port. W00t we solved the problem, SCOM can receive traps again.

As mentioned in the start of the post. The alerts were closing fast. And because of that fact, if someone saw it they wouldn’t pay any attention to it. SCOM wasn’t receiving any traps in the above condition, so why was the alert of a manual reset monitor being auto closed fast?

The explanation is quite simple: the recovery task for this monitor. This recovery task runs and disables the Windows SNMP services and will do a reset of the monitor. The problem with this is that the recovery didn’t do anything (the SNMP services were already disabled), but it did reset the monitor.

I think the solution should be rewrite of the monitor, where the monitor checks if SCOM really has claimed the SNMP port. I could do this, but for now I will leave it. It might have just been one of those exotic errors you’ll see once in a lifetime…

Advertisements

Posted in Management Servers, troubleshooting | Leave a Comment »

Scom 2012 Design for a serviceprovider.

Posted by rob1974 on October 24, 2013

In the last year we’ve created a (high level) design to monitor an unlimited amount of windows servers (initial amount around 20.000). However, halfway during the year it became clear that this  company decided to on a new architecture all together. Not only SCOM 2012 would become obsolete in this architecture, also offshoring monitoring would make Marc and me obsolete (not to worry, we both have a new job). As i feel it’s kind of a waste of our work i’ll blog the 2 things that I feel were quite unique in this design.

There’s still a SCOM 2007 environment running designed for around 10.000 servers. We’ve designed this as a single instance and it is monitoring 50+ customers. Those customers were seperated by scoping on their computers (see our blog and/or lookup the service provider chapter in operations manager 2012 unleashed). However, we never reached 10.000 servers due to a few factors. The first was the amount of number of concurrent consoles used. The number of admins needing access will increase with the amount of servers you monitor, however the “official” support numbers of Microsoft will decrease the number of concurrent consoles with the number of supported agent (50 console with 6000 servers, 25 with 10.000 server). Now this isn’t a really hard number, but keep adding more hardware will just shift bottlenecks (from MS to DB and from to disk I/O to network I/O) and of course the price of hardware goes up exponentially.

The second issue we had, we had to support a lot of management packs and versions. We supported all Microsoft applications from windows 2000 and up. So we didn’t only run, Windows 2000, 2003 and 2008, but all versions for DHCP, DNS, SQL, Exchange, Sharepoint, IIS etc, etc. Some of the older ones still are the dreaded converted mp’s, some of which we had to  fix volatile discoveries ourselves. By the way, I’ve been quite negative about how Microsoft developed their mp’s and especially the lack of support on them, but the most recent mp’s are really done quite well. MS keep that up!

In the new design we’ve used a different approach. We decided not to aim for the biggest possible instance of SCOM, but to go to several smaller ones, plotted initially over different kind of customers my former company has to support. Some customers only needed OS support, some OS + middleware and others required monitoring of everything.

Based on our experience we decided to create only one technical design for all the management groups and have a support formula for the number of supported agents. Keep in mind this is not set in stone; it’s just to give the management an estimate about the costs and insight in the penalty for keep adding new functionality (mp’s) to the environment. All the management groups were still to hold more then one customers, but one customer should only be in one management group to be able to create distributed applications (automatic discoveries should be possible).

This is what we wrote down in the design:

The amount of agents a management group can support depends on a lot of different factors. The performance of the management servers and the database will be the bottleneck of the management group as a whole. A number of factors influence the performance e.g. number of concurrent consoles, number of management packs, health state of the monitored environments.

When a management group reaches it thresholds a few things can be done to improve performance:

1. Reduce the amount of data in the management group (solving issues and tuning, both leading to less alerts and performance data).

2. Reduce the console usage (reduces load on management servers and databases)

3. Reduce the monitored applications (less management packs).

4. Increase the hardware specs (more cpu’s, memory, faster disk access, more management servers).

5. Move agents to another management group (which reduces load in terms of data and console usage)

We’ll try to have 1 design for all management groups, so the hardware is the same. However the number of management packs is different in each management group. To give an estimate how this affects the number of supported agents per management group we’ve created a support formula.

The formula is based on 2 factors, the number of concurrent consoles and the number of management packs[1]. The health of the monitored environments is taken out of the equation. The hardware component is taken out as well but assumed to be well in line with the sizer helper of SCOM design guides from Microsoft. With less than 30 consoles and <10 monitored applications/operating systems the sizing is set to 6000 agent managed systems (no network, *nix or dedicated URL monitoring).

The tables for penalties for the number of consoles and managements packs and the resulting formula are given below. Please note, the official support limit by Microsoft for the number of consoles is set to 50.

consoles penalty (%) management packs penalty (%)
0 0 0 0
5 0 5 2
10 0 10 4
15 0 15 7
20 0 20 11
25 0 25 16
30 0 30 23
35 2,5 35 32
40 5 40 44
45 7,5 45 60
50 10 50 78
55 30 55 85
60 50 60 90
65 95
70 97
75 99

Number of supported agents = (1-mppenalty/100)*(1-consolepenalty/100)*6000

This results in the following table with shows the number of supported agents.

horizontal = concurrent consoles

vertical = number of management packs

<30 35 40 45 50 55 60
0 6000 5850 5700 5550 5400 4200 3000
5 5880 5733 5586 5439 5292 4116 2940
10 5760 5616 5472 5328 5184 4032 2880
15 5580 5441 5301 5162 5022 3906 2790
20 5340 5207 5073 4940 4806 3738 2670
25 5040 4914 4788 4662 4536 3528 2520
30 4620 4505 4389 4274 4158 3234 2310
35 4080 3978 3876 3774 3672 2856 2040
40 3360 3276 3192 3108 3024 2352 1680
45 2400 2340 2280 2220 2160 1680 1200
50 1320 1287 1254 1221 1188 924 660
55 900 878 855 833 810 630 450
60 600 585 570 555 540 420 300
65 300 293 285 278 270 210 150
70 180 176 171 167 162 126 90
75 60 59 57 56 54 42 30

Again we never put this to the test. But those figures resulted in our goals for this design. One management group for OS monitoring to hold 6000 servers, one for middleware to hold around 3500 and full blown monitoring for monitoring of around 2000 servers in one management group. Please note our first remark about tuning and solving alerts, those things are still very important for a healthy environment!

The next part that made our design stand out is the ability to move agents or actually entire customers from one management group to another. Now this is of course always possible, but we wanted to do this automated and keep all customizations if applicable. The reasons for such a move would be sizing/performance of the management group, mp support (e.g. the customer starts using a new OS, which isn’t supported in mg he is in), contract change with the customer (e.g. from OS support to middleware support or the other way round).

In order to be able to keep all customizations we’ve created a mp structure/naming convention:

jamaOverride<mpname> = holds all generic overrides to <mpname>. This mp is sealed and should be kept in sync between all management groups holding the <mpname> as it contains the enterprise monitoring standards

jama<customer>Override<mpname> = holds all customer specific overrides to <mpname> and/or a change to the generic sealed override mp. This is of course unsealed as it holds guids of specific instances.

jama<customer><name> = holds all specific monitoring rules for a customer and DA’s. This is not limited to just 1 mp, normal distribution rules should apply. In other words don’t create a big “default” mp but keep things grouped based on “targets” of the rules/monitors.

Now when we need to move a customer we can just move an agent to a new management group and import all the  jama<customer> mp’s in that mg. Of course this is only applicable if the new mg supports the same mp’s as the old one. jamaOverride<mpname> should already exists as it just contains enterprise wide overrides to classes. The jama<customer>Override<mpname> would contain specific instances. However specific instances get a guid on discovery. If you remove an agent and reinstall and agent you get a new guid, so the overrides wouldn’t apply anymore. Fortunately SCOM supports multihoming and one of the big advantages is cookdown. Cookdown will only work when the same scripts are using the same parameters, so this means that the agent’s guids must be the same when multihomed as that guid is a parameter for many datasources.

To make sure the guid is kept, a move between management groups needs to use this multihoming principle and a move would become: Add an agent to the new management group, make sure all discoveries have run (basically time/maybe a few agent restarts) and finally remove the old management group. Because this is a relative easy command (no actual install) the add and remove can be done via a SCOM task (add the “add new mg task” to the old mg and the “remove old mg” to the new mg).

Again we never built this and we don’t expect to be as smooth as we described here (always some unexpected references in an override mp :)), but hopefully it is to some use to someone.


[1] 1 management pack equals 1 application or OS. E.g. windows2008 is counted as 1 management despite it are 2 management packs in scom (discovery and monitoring).

Posted in general, management packs, Management Servers | Leave a Comment »