Constant, random "unknown" statuses
-
- Posts: 4
- Joined: Thu Feb 19, 2004 11:06 am
- Location: San Francisco, CA
Hi all,
We've been fine-tuning HostMonitor for a few weeks now, and we're currently monitoring about twenty servers. The problem is that HostMonitor constantly returns "unknown" statuses, which has my boss worried. It happens most often on our Exchange and SQL database servers, usually on tests that monitor certain services and the event logs. I currently have the action profile set up to repeat the test after it comes back "bad" or "unknown" a second time, followed by an email alert after the third failure and a pager alert after the fourth. In most cases this has kept us from getting deluged with notifications, but in a few instances, such as on the Exchange server, the tests still don't come back okay even after four tries.
I guess what I'm trying to work out is: what causes an "Unknown" status? Is it connected to the test timing out due to network latency, or to too many resources already being in use on the target server? Is there any way to reduce the number of "unknown" statuses we get back (would changing the time interval between tests make a difference)? I've seen something mentioned on the board about running HostMonitor as a service. Does that tend to produce better results?
I don't think HostMonitor will work any differently in service mode. As I understand it, you have problems with two test methods: the Service test and the NT Event Log test?
The Service test displays "Unknown" status when the system cannot connect to the remote Service Control Manager. Likewise, the NT Event Log test displays "Unknown" status when it cannot establish a connection to the remote system.
If this error occurs irregularly, one possible reason is high network traffic that causes DNS requests to fail. In that case, using IP addresses instead of system names should help.
Regards
Alex
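One way to check whether name resolution really is failing intermittently is to resolve the server name repeatedly from the monitoring machine and count the failures. The following is only a rough sketch: the function names, the attempt count and the delay are illustrative assumptions and have nothing to do with HostMonitor's own code.

```python
import socket
import time


def probe_dns(hostname, attempts=5, delay=1.0, resolver=socket.gethostbyname):
    """Resolve hostname several times and collect (ok, detail) results.

    The resolver argument is a hypothetical hook so the function can be
    exercised without a real DNS server.
    """
    results = []
    for i in range(attempts):
        try:
            results.append((True, resolver(hostname)))
        except OSError as exc:  # socket.gaierror is a subclass of OSError
            results.append((False, str(exc)))
        if i < attempts - 1:
            time.sleep(delay)
    return results


def failure_rate(results):
    """Fraction of attempts that failed to resolve."""
    if not results:
        return 0.0
    return sum(1 for ok, _ in results if not ok) / len(results)


# Example (hypothetical server name):
#   rate = failure_rate(probe_dns("exchange01", attempts=20))
```

If the failure rate is noticeably above zero during busy periods, that would be consistent with the DNS explanation, and switching the affected tests to IP addresses would be worth trying.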
I thought of one other thing to try. Most of the service tests are scheduled for every ten minutes. Perhaps it was trying to poll ALL of them on a given server at exactly the same ten-minute mark? I tried staggering the service tests 5-10 seconds apart on two of my worst offenders (the Exchange server and the database server), and so far they've been clean of unknown statuses for the past hour. I will continue to monitor this and try it on other servers.
Maybe not all of them, but many of them. There is a "Don't start more than [N] tests per second" option on the Behavior page in the Options dialog. This parameter defines how many tests per second the program will start. The default value (32) is good for most networks, but maybe it's not good in your case.
Regards
Alex
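The staggering idea can be illustrated with a small sketch. It only models the scheduling arithmetic; the test names and the 5-second step are assumptions for the example, and it does not read or write any HostMonitor configuration.

```python
def staggered_offsets(test_names, step_seconds=5):
    """Assign each test a start offset so no two start at the same moment."""
    return {name: i * step_seconds for i, name in enumerate(test_names)}


def starts_per_second(offsets):
    """Largest number of tests that share the same start second."""
    counts = {}
    for offset in offsets.values():
        counts[offset] = counts.get(offset, 0) + 1
    return max(counts.values(), default=0)


# Hypothetical service tests on one server, all due every ten minutes.
names = ["svc-store", "svc-mta", "svc-sa"]
unstaggered = {name: 0 for name in names}
staggered = staggered_offsets(names, step_seconds=5)
print(starts_per_second(unstaggered), starts_per_second(staggered))
```

With every test due at offset 0, the target server sees all the connections at once; with a step between them, at most one test starts in any given second.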
We still have issues with random unknown statuses...
Hi Alex,
We often receive unknowns for performance counter and event log checks on several servers. The other tests pointing to the same servers are fine (TCP, PING, service checks, etc.).
Here is another strange thing: if the test that is "unknown" is currently using an IP address and we change it to a hostname, it then goes back to a good status... OR VICE VERSA... if the test is currently using a hostname and we switch it to an IP address, this sometimes fixes it as well.
But a simple test refresh or disable\enable will not work.
There are one or two that don't come back at all, even after enabling\disabling and trying all the steps above.
No permissions have been changed on any of the servers.
We are using HostMonitor 4.30 on Windows 2000 Server with all the latest SPs and patches. We are running about 760 tests, with approximately 3,750,550 tests completed in the last two weeks. CPU utilization on the server averages about 30%. The machine is a Compaq 1850R (PII 500). Available memory is fine.
Is this too many tests over that time period?
We do notice that HostMonitor "freezes" for about 30 seconds every few minutes and the interface becomes unusable. After the 30 seconds everything goes back to normal and we can move around in the HostMonitor interface again. We do not know whether this is contributing to the unknown statuses, but wanted to mention it to see if anyone else is experiencing this.
HostMonitor is still doing a great job alerting, and nothing else has been an issue. Just the freeze-up issue and the unknown status issue.
Thanks for any assistance you can provide. I'm sending a screenshot of the tests for one of the servers to your email address in case that helps.
Thx,
Mark
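For a sense of scale, the figures quoted in the post can be turned into an average test rate. This is plain arithmetic on the numbers above, assuming "two weeks" means exactly 14 days; an average says nothing about bursts when many tests fall due at the same moment.

```python
TESTS_COMPLETED = 3_750_550       # from the post
PERIOD_SECONDS = 14 * 24 * 3600   # two weeks, assumed to be exactly 14 days
TEST_COUNT = 760                  # from the post

avg_per_second = TESTS_COMPLETED / PERIOD_SECONDS
runs_per_test = TESTS_COMPLETED / TEST_COUNT
avg_interval = PERIOD_SECONDS / runs_per_test

print(f"average tests started per second: {avg_per_second:.2f}")
print(f"average runs per test:            {runs_per_test:.0f}")
print(f"average interval per test (s):    {avg_interval:.0f}")
```

The average works out to roughly three tests per second, well below the default limit of 32 test starts per second mentioned earlier in the thread, so the total volume alone does not look excessive.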
To add to the last post....
I have to stop and restart hostmonitor to get the tests to go back to a "good" status most of the time.
Thx,
Mark
Eh!
We no longer believe that pdh.dll can work in a multithreaded environment. So we decided to implement a fourth method of working with this DLL: an external application. We will implement a simple utility that HostMonitor calls to perform the test. This way pdh.dll will be loaded for a single test only. Of course it will need more resources, but I hope it will work reliably.
Regards
Alex
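The general pattern described here, running each check in its own short-lived process so that a non-thread-safe library is loaded for one test only, might look roughly like the sketch below. It is not HostMonitor's implementation; the function name, the timeout and the "ok"/"unknown" mapping are assumptions chosen for illustration.

```python
import subprocess
import sys


def run_external_check(command, timeout=30):
    """Run one check in its own process and return (status, output).

    Any failure to start, a timeout, or a non-zero exit code is reported
    as "unknown"; this mapping is an assumption made for the example.
    """
    try:
        proc = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return "unknown", str(exc)
    if proc.returncode != 0:
        return "unknown", proc.stderr.strip()
    return "ok", proc.stdout.strip()


# Stand-in for a real counter-reading utility: a child Python process.
print(run_external_check([sys.executable, "-c", "print(42)"]))
```

The trade-off is the one noted in the post: starting a process per test costs more resources than a call inside the monitor, but state left behind by one test cannot affect the next.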
Same issue
We have spent dozens of hours attempting to find a pattern in these constant 'Unknown' statuses. We have more than 30 tests set up against the MS Exchange server, and when we cut over to a new Exchange server, most of the tests went into the 'Unknown' limbo state quite regularly. The old server running NT worked fine for tens of thousands of samples with no 'Unknowns'; the new server runs on 2000 SP4 and has been nothing but problems. These tests are all disabled because they are unreliable.
One of our SQL servers (which also runs 2000 SP4) is also having major problems (i.e. it returns 'Unknowns' much of the time).
We cut over to a new HostMonitor server running XP and couldn't get performance counter tests to work on more than 20 tests (all would go into 'Unknown' if more tests were enabled). So we rebuilt the server with 2000 SP4 and had the same result. Next we rebuilt the box with 2000 SP3 (to match the OS of the old server), and now we are back to the regular number of 'Unknowns' when testing Microsoft servers.
99% of our 'Unknown' responses are coming from Performance Counter Tests.
I do hope that the new solution from Hostmonitor can deal effectively with this issue.
Thank you,
Scott
There is a new module at www.ks-soft.net/download/hm445.zip
This version supports a new "External" mode for the Performance Counter test. It should fix the problems with the test. Use the Miscellaneous page in the Options dialog to set the new mode.
This version also supports the new SNMP Trap test method. If you want to try this method, I would recommend copying your existing Advanced Host Monitor into another directory (or onto another computer), unzipping the new modules, and setting up SNMP Trap tests for testing purposes only.
Regards
Alex
Alex:
I'm not sure this is related but.....
We find that after an overnight reboot of the roughly 60 machines that are part of a web farm, several perfmon tests get stuck in an 'unknown' state. Because all these tests are actually performed by RMA, we've found that restarting the RMA fixes this in many instances.
Here is what we think is going on. We believe that HM is requesting the perfmon data from the RMA too soon after reboot; indeed, these tests are currently dependent only upon an IP ping of the remote machine. Thus, if a test is performed AFTER the IP network becomes available but BEFORE the actual module that supports the specific perfmon test (for example, Web Publishing Services) has completed initialization, the RMA returns 'unknown' and then becomes 'stuck' at that value until the RMA is restarted.
We believe that we may be able to correct this by making such perfmon tests dependent upon some other kind of test (rather than a simple ping) that would ensure the relevant services/modules are loaded and initialized before calling RMA to perform the perfmon test. We just haven't had time to think this through completely.
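The dependency idea above can be sketched as a simple gate: a perfmon test runs only when every test it depends on, including a service-level check and not just ping, currently reports "ok". The data layout and test names below are hypothetical and do not reflect how HostMonitor stores its dependencies.

```python
def should_run(test, statuses):
    """Run a test only when every test it depends on is currently 'ok'.

    A dependency with no recorded status counts as not ready.
    """
    return all(statuses.get(dep) == "ok" for dep in test["depends_on"])


# Hypothetical perfmon test that depends on ping AND a service check.
perf_test = {"name": "iis-thread-count", "depends_on": ["ping", "w3svc-running"]}

just_rebooted = {"ping": "ok", "w3svc-running": "bad"}
fully_up = {"ping": "ok", "w3svc-running": "ok"}

print(should_run(perf_test, just_rebooted), should_run(perf_test, fully_up))
```

With only ping as a dependency, the test would run in the window after the network comes up but before the service has initialized, which is the window suspected of leaving the agent stuck.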
Timn,
Yesterday we uploaded an update for HostMonitor.
Today we uploaded an update for RMA: www.ks-soft.net/download/rma119.zip It also includes the perfobj.exe module, an external performance counter retriever.
If you update RMA, copy perfobj.exe into RMA's directory, and add the line "PerfWorkMode=3" to the [Misc] section of the rma.ini file, it should fix the problem. Please note: you have to restart the agent if you make changes to the rma.ini file.
Regards
Alex
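For anyone applying this to many agents, the rma.ini change could be scripted. The sketch below assumes rma.ini is a plain INI file that Python's configparser can parse; note that configparser drops comments when it rewrites a file, so back up the original first. The function name is made up for the example, and the agent still has to be restarted afterwards.

```python
import configparser


def enable_external_perf_mode(ini_path):
    """Add PerfWorkMode=3 to the [Misc] section, creating it if needed."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep key case as written (default lowercases)
    parser.read(ini_path)
    if not parser.has_section("Misc"):
        parser.add_section("Misc")
    parser.set("Misc", "PerfWorkMode", "3")
    with open(ini_path, "w") as handle:
        parser.write(handle, space_around_delimiters=False)


# Example (path taken from the error message later in the thread):
#   enable_external_perf_mode(r"C:\Program Files\RMA-Win\rma.ini")
```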
Alex:
I tried this on one machine. I upgraded the agent to 1.19, added a "[Misc]" section to the RMA.INI file with a line reading "PerfWorkMode=3", and restarted the agent. But now all my perfmon tests for this machine fail with a message such as:
RMA: 301 - Error: Invalid Result (C:\Program Files\RMA-Win>C:\Program Files\RMA-Win\perfobj.exe "\Process(inetinfo)\Thread Count" -n 1)
Any ideas? What did I do wrong?
It's already fixed. The new modules are located at the same place:
www.ks-soft.net/download/rma119.zip
www.ks-soft.net/download/hm445.zip
Regards
Alex