When you post information about a problem, please include the following details: OS version (e.g. Windows 2000 Professional SP3), HostMonitor version, and a problem description.
In our case, there's no reason or schedule for all tests to be fired off all at once.
Why do you think so? If some tests are scheduled to be performed every 20 sec and others every 5 min, that does not mean these tests cannot be performed at the same time (from time to time).
Also there are other reasons:
- master tests status change
- some HM script commands
- user interaction (well, that's probably not your case)
...
I think that firing off all tests is a symptom of the underlying problem.
I think 95 percent of our tests use the default 10-minute schedule.
As I was looking through the log file, I noticed that the "firing all tests at once" thing happens every midnight, which makes me wonder whether the HM test scheduler gets reset every day at 00:00:00?
We added one more logging option to HostMonitor. It can be activated through the HostMonitor or RCC GUI, or through the Telnet interface.
But we want to finish code verification and test some new options before the release...
If the new version stops monitoring, please either open the Auditing Tool (HostMonitor/RCC menu View->Auditing Tool) or use a telnet client (and Telnet Service) to send the "droplog1" command to HostMonitor.
Then send the debuglog1.txt file to support@ks-soft.net
I've installed the new version 9.32 and tested the debuglog feature.
It works:
06.11.2012 09:48:27
Timer1: 1 06.11.2012 09:48:26
Timer2: 1
PoolRecAvail: 4035
TTLimit1: 64
TCnt1: 13006
TTThreads: 190
LIdx: 3614
ATCnt2: 32
----
06.11.2012 13:28:59
Timer1: 1 06.11.2012 13:28:58
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 64
TCnt1: 13006
TTThreads: 217
LIdx: 13006
ATCnt2: 32
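For comparing two such snapshots, a small Python sketch like the one below may help. It assumes only what is visible above: a timestamp line, then "Name: value" lines, with "----" between snapshots. The meaning of the individual counters is not documented in this thread, so the script just reports which numeric values changed; the function names are invented for this example.

```python
def parse_snapshots(text):
    """Split a debug log into snapshots; each becomes a dict of its fields."""
    snapshots = []
    for block in text.split("----"):
        lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
        if not lines:
            continue
        snap = {"timestamp": lines[0]}  # first line of a block is the timestamp
        for ln in lines[1:]:
            key, _, value = ln.partition(":")
            value = value.strip()
            # Keep plain integers as int; anything else (e.g. Timer1) stays text.
            snap[key.strip()] = int(value) if value.isdigit() else value
        snapshots.append(snap)
    return snapshots


def changed_counters(before, after):
    """Return {name: (old, new)} for integer fields that differ."""
    return {
        k: (before[k], after[k])
        for k in before
        if isinstance(before[k], int)
        and isinstance(after.get(k), int)
        and before[k] != after[k]
    }


sample = """06.11.2012 09:48:27
Timer1: 1 06.11.2012 09:48:26
Timer2: 1
PoolRecAvail: 4035
TTLimit1: 64
TCnt1: 13006
TTThreads: 190
LIdx: 3614
ATCnt2: 32
----
06.11.2012 13:28:59
Timer1: 1 06.11.2012 13:28:58
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 64
TCnt1: 13006
TTThreads: 217
LIdx: 13006
ATCnt2: 32
"""

first, second = parse_snapshots(sample)
changes = changed_counters(first, second)
for name, (old, new) in changes.items():
    print(f"{name}: {old} -> {new}")
```

For the two snapshots above, only PoolRecAvail, TTThreads and LIdx differ; the other integer fields are identical.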
It would be better, however, if HostMonitor wrote the debug log periodically on its own.
Our system restarts automatically if no tests have been run for 15 minutes. Since this may happen at night, I am unable to trigger the debug log manually.
I installed the new version of HostMonitor yesterday. We run about 3 million tests per day. The error always occurs at around 12 million tests.
We will have to wait a few days ...
It may grow for a while then stop.
We checked the number of threads, handles and memory allocation when the problem appeared (according to information provided by customers); that is not the case...
Ok, we get a better picture.
It looks like performance is bad, but HostMonitor did not hang and did not stop monitoring. It waits for tests to be completed and should perform some new tests after a pause.
And performance is bad enough to look like stopped monitoring, but not bad enough to trigger syslog alerts.
That's why we could not find any mistake for such a long time. It actually works...
So let's investigate the "bad performance" issue instead of the "stopped monitoring" issue.
In your case HostMonitor performs about 64 tests per second? If many of these tests require about 3 min to complete (or some tests require 1 min to complete, others need 5 min), this may lead to such a problem.
Which of your tests require so much time?
On the other hand, the same result can be caused by some "hung" threads, some hung test items. In such a case HostMonitor should record errors in the system log (but you need to wait for this message, do not reboot HostMonitor for a while).
- If a lot of threads hang at the same time, you will see tests with Unknown test status and a "Timed out" Reply string (HostMonitor sets it after 15 min), plus a message in the system log;
- if some threads hang from time to time, you will see Unknown status as well, but you may need to wait too long for the error message in the system log...
So, let's check the log for tests with Unknown status and a "Timed out" reply.
If there are no such items, try to find tests that spend significant time in "Checking" status.
Then check the test type and test settings, and whether the test is performed by HostMonitor, Passive RMA or Active RMA.
Maybe you set too long a timeout interval for some tests?
In principle, what you say sounds logical. Since HostMonitor and the RCC are still responding, one must assume that the monitoring task is waiting, but for what? My timeouts are between 2 and 5 seconds, in individual cases 30 seconds (LDAP). For URL checks, the Windows timeout is often what gets reported. Incidentally, there is a small bug in the UDP test method: in the GUI I enter 2 seconds for the timeout, but the AlertThreshold variable shows 2 ms.
According to the Auditing Tool I have 37 tests/sec and 13,000 tests in total. I sometimes get timeouts in SNMP checks (25 tests/sec) with a timeout of 2000 ms. These are sporadic and cause only a few alarms per day (Unknown status = Bad).
When the "stopped monitoring" error occurs, you can see the rate drop shortly before the standstill:
[12.11.2012 11:10:33] HostMonitor: (tests/sec) Ok 35,45
[12.11.2012 11:15:35] HostMonitor: (tests/sec) Ok 36,84
[12.11.2012 11:20:37] HostMonitor: (tests/sec) Ok 35,88
[12.11.2012 11:25:40] HostMonitor: (tests/sec) Ok 36,27
[12.11.2012 11:30:42] HostMonitor: (tests/sec) Ok 35,92
[12.11.2012 11:35:43] HostMonitor: (tests/sec) Ok 21,62
[12.11.2012 11:40:45] HostMonitor: (tests/sec) Warning 0
[12.11.2012 11:45:47] HostMonitor: (tests/sec) Warning 0
[12.11.2012 11:50:55] HostMonitor: (tests/sec) Warning 0.00
[12.11.2012 11:55:55] HostMonitor: (tests/sec) Warning 0.00
[12.11.2012 12:00:56] HostMonitor: (tests/sec) Ok 36,16
[12.11.2012 12:05:57] HostMonitor: (tests/sec) Ok 36,33
[12.11.2012 12:10:59] HostMonitor: (tests/sec) Ok 36,4
[12.11.2012 12:16:00] HostMonitor: (tests/sec) Ok 36,71
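A log like this can be scanned for standstill periods automatically. The following Python sketch assumes only the line format shown above (timestamp in brackets, status word, rate with either a decimal comma or a decimal point); the function names are invented for this example, and the sample is an excerpt of the lines above.

```python
import re
from datetime import datetime

# Matches lines such as: [12.11.2012 11:35:43] HostMonitor: (tests/sec) Ok 21,62
LINE_RE = re.compile(r"\[(.+?)\] HostMonitor: \(tests/sec\) (\w+) ([\d.,]+)")


def parse_rates(log_text):
    """Return a list of (timestamp, tests_per_sec) tuples."""
    entries = []
    for m in LINE_RE.finditer(log_text):
        ts = datetime.strptime(m.group(1), "%d.%m.%Y %H:%M:%S")
        rate = float(m.group(3).replace(",", "."))  # accept decimal comma
        entries.append((ts, rate))
    return entries


def stall_windows(entries):
    """Return (first, last) timestamps of each run of zero-rate entries."""
    windows, start, last = [], None, None
    for ts, rate in entries:
        if rate == 0:
            if start is None:
                start = ts
            last = ts
        elif start is not None:
            windows.append((start, last))
            start = None
    if start is not None:  # log ends while still stalled
        windows.append((start, last))
    return windows


sample = """[12.11.2012 11:30:42] HostMonitor: (tests/sec) Ok 35,92
[12.11.2012 11:35:43] HostMonitor: (tests/sec) Ok 21,62
[12.11.2012 11:40:45] HostMonitor: (tests/sec) Warning 0
[12.11.2012 11:45:47] HostMonitor: (tests/sec) Warning 0
[12.11.2012 11:50:55] HostMonitor: (tests/sec) Warning 0.00
[12.11.2012 11:55:55] HostMonitor: (tests/sec) Warning 0.00
[12.11.2012 12:00:56] HostMonitor: (tests/sec) Ok 36,16
"""

entries = parse_rates(sample)
stalls = stall_windows(entries)
for first, last in stalls:
    print(f"no tests from {first} to {last} ({last - first})")
```

On the excerpt it reports a single standstill covering the four zero-rate entries between 11:40:45 and 11:55:55.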
On my system the problem occurs reliably after about 12 to 15 million tests, never before. The time of day does not matter. So I repeat: some parameter hits a limit here, on all 4 days.
What we need, in my opinion, is more detailed debug logging. Perhaps it is possible to generate a syslog message when no tests have been performed within a minute. If a list of the tests currently running appeared in the log, one could probably narrow down the error easily.
So far I have never been able to find any pointer to the cause of the error.
After 15 min at a test rate of 0/sec I let the system reboot automatically. I cannot accept a longer interruption.
By the way, I set the tests/sec parameter to 64 so that, after delays in the test interval, the system can catch up promptly with a short burst at a higher rate.