HM freezes after update from V8.86 to V9.06

Kris · Joined: 12 May 2010 Posts: 375

In our case, there's no reason or schedule for all tests to be fired off at once.

I think that firing off all tests is a symptom of the underlying problem.

KS-Soft · Joined: 03 Apr 2002 Posts: 12805 Location: USA

Kris · Joined: 12 May 2010 Posts: 375

I think 95 percent of our tests use the default 10-minute schedule.

As I was looking through the log file, I noticed that the "firing all tests at once" thing happens every midnight, which makes me wonder whether the HM test scheduler gets reset every day at 00:00:00?

KS-Soft · Joined: 03 Apr 2002 Posts: 12805 Location: USA

This may depend on various settings, like Midnight logging.
If you send your config files to us, we can check.

Regards
Alex

KS-Soft · Joined: 03 Apr 2002 Posts: 12805 Location: USA

We added one more logging option to HostMonitor. It can be activated through the HostMonitor or RCC GUI, or through the Telnet interface.
But we want to finish code verification and test some new options before release...

Regards
Alex

KS-Soft · Joined: 03 Apr 2002 Posts: 12805 Location: USA

Unfortunately, we are still unable to find the cause, and we cannot reproduce the problem.

But we made some changes in version 9.32
http://www.ks-soft.net/hostmon.eng/downpage.htm

If the new version stops monitoring, please either open the Auditing Tool (HostMonitor/RCC menu View->Auditing Tool) or use a telnet client (and the Telnet Service) to send the "droplog1" command to HostMonitor.
Then send the debuglog1.txt file to support@ks-soft.net

Regards
Alex

rc · Joined: 01 Aug 2005 Posts: 100

Hi Alex,

I've installed the new version 9.32 and tested the debuglog feature.
It works:

06.11.2012 09:48:27
Timer1: 1 06.11.2012 09:48:26
Timer2: 1
PoolRecAvail: 4035
TTLimit1: 64
TCnt1: 13006
TTThreads: 190
LIdx: 3614
ATCnt2: 32
----
06.11.2012 13:28:59
Timer1: 1 06.11.2012 13:28:58
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 64
TCnt1: 13006
TTThreads: 217
LIdx: 13006
ATCnt2: 32

It would be better, however, if HostMonitor also wrote the debug log periodically by itself.
Our system restarts automatically if no tests have run for 15 minutes. Since this may happen at night, I am unable to trigger the debug log manually.

Best regards
Enrico

KS-Soft · Joined: 03 Apr 2002 Posts: 12805 Location: USA

rc · Joined: 01 Aug 2005 Posts: 100

I installed the new version of HostMonitor yesterday. We run about 3 million tests per day. The error always occurs at around 12 million tests.
So we have to wait a few days...

Otherwise everything is running as usual.

I'll get back to you.

Enrico

rc · Joined: 01 Aug 2005 Posts: 100

A trend can already be seen.
The value of TTThreads appears to increase constantly:

----
06.11.2012 16:40:30
Timer1: 1 06.11.2012 16:40:29
Timer2: 1
PoolRecAvail: 4046
TTLimit1: 64
TCnt1: 13006
TTThreads: 242
LIdx: 4336
ATCnt2: 32
----
07.11.2012 08:15:29
Timer1: 1 07.11.2012 08:15:28
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 64
TCnt1: 13006
TTThreads: 376
LIdx: 11240
ATCnt2: 32
----
07.11.2012 08:17:18
Timer1: 1 07.11.2012 08:17:18
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 64
TCnt1: 13006
TTThreads: 377
LIdx: 13006
ATCnt2: 32
----
07.11.2012 12:37:46
Timer1: 1 07.11.2012 12:37:40
Timer2: 1
PoolRecAvail: 4046
TTLimit1: 64
TCnt1: 12942
TTThreads: 422
LIdx: 6704
ATCnt2: 32

Is this normal, or is there a threshold at which the system stops?
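
To follow the trend without comparing dumps by eye, the records can be parsed with a short script. This is only a sketch: it assumes debuglog1.txt has the layout shown above (a timestamp line, then `Name: value` lines, with records separated by lines starting with `----`). What TTThreads actually counts is not documented in this thread, so the growth figure is descriptive only. The function names are made up for illustration.

```python
from datetime import datetime

# Two records copied from the dumps above (first and last of the four).
SAMPLE = """\
06.11.2012 16:40:30
Timer1: 1 06.11.2012 16:40:29
Timer2: 1
PoolRecAvail: 4046
TTLimit1: 64
TCnt1: 13006
TTThreads: 242
LIdx: 4336
ATCnt2: 32
----
07.11.2012 12:37:46
Timer1: 1 07.11.2012 12:37:40
Timer2: 1
PoolRecAvail: 4046
TTLimit1: 64
TCnt1: 12942
TTThreads: 422
LIdx: 6704
ATCnt2: 32
"""


def parse_records(text):
    """Split a debug log into dicts, one per record (assumed layout)."""
    records = []
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("----"):
            current = None  # separator line: the next line starts a new record
            continue
        if current is None:
            # First line of a record is its timestamp.
            current = {"time": datetime.strptime(line, "%d.%m.%Y %H:%M:%S")}
            records.append(current)
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    return records


def thread_growth_per_hour(records):
    """Average change of TTThreads per hour between first and last record."""
    first, last = records[0], records[-1]
    hours = (last["time"] - first["time"]).total_seconds() / 3600
    return (int(last["TTThreads"]) - int(first["TTThreads"])) / hours


if __name__ == "__main__":
    recs = parse_records(SAMPLE)
    print(f"TTThreads grows by about {thread_growth_per_hour(recs):.1f} per hour")
```

For the two records used here, that is 180 additional TTThreads in just under 20 hours, roughly 9 per hour.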

KS-Soft · Joined: 03 Apr 2002 Posts: 12805 Location: USA

It may grow for a while and then stop.
We checked the number of threads, handles, and memory allocation when the problem appeared (according to information provided by customers); that is not the cause...

Regards
Alex

rc · Joined: 01 Aug 2005 Posts: 100

Hi Alex,

Unfortunately, it happened again today. After 15.6 million tests, monitoring was no longer performed. Access via RCC was still possible.

State of the HostMonitor process at that time:
hostmon.exe
Handles: 514
Threads: 21

HMVersion: HostMonitor v. 9.32
HMVersionBin: 2348

Debug-Log:
09.11.2012 11:11:10
Timer1: 1 09.11.2012 11:11:10
Timer2: 1
PoolRecAvail: 4095
TTLimit1: 64
TCnt1: 12951
TTThreads: 383
LIdx: 3581
ATCnt2: 32
----
12.11.2012 08:53:40
Timer1: 1 12.11.2012 08:53:39
Timer2: 1
PoolRecAvail: 4075
TTLimit1: 64
TCnt1: 12952
TTThreads: 992
LIdx: 12952
ATCnt2: 32
---- Monitoring is no longer running! ----
12.11.2012 11:46:48
Timer1: 1 12.11.2012 11:46:47
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 64
TCnt1: 12952
TTThreads: 1001
LIdx: 6162
ATCnt2: 32
----
12.11.2012 11:47:26
Timer1: 1 12.11.2012 11:47:25
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 64
TCnt1: 12952
TTThreads: 1001
LIdx: 6162
ATCnt2: 32
---- after restarting the service ----
12.11.2012 12:04:17
Timer1: 1 12.11.2012 12:04:17
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 64
TCnt1: 12952
TTThreads: 16
LIdx: 12952
ATCnt2: 32

I hope you can do something with these results.

Best regards
Enrico

KS-Soft · Joined: 03 Apr 2002 Posts: 12805 Location: USA

OK, we now have a better picture.
It looks like performance is bad, but HostMonitor did not hang and did not stop monitoring. It waits for tests to complete and should perform new tests after a pause.
Performance is bad enough to look like stopped monitoring, but not bad enough to trigger syslog alerts.
That's why we could not find any mistake for such a long time. It actually works...

So let's investigate the "bad performance" issue instead of the "stopped monitoring" issue.
In your case HostMonitor performs about 64 tests per second? If many of these tests require about 3 min to complete (or some tests require 1 min to complete and others need 5 min), this may lead to such a problem.
Which of your tests require so much time?

On the other hand, the same result can be caused by some "hung" threads, i.e. some hung test items. In that case HostMonitor should record errors in the system log (but you need to wait for this message; do not restart HostMonitor for a while).
- If a lot of threads hang at the same time, you will see tests with Unknown status and a "Timed out" Reply string (HostMonitor sets it after 15 min), plus a message in the system log;
- if some threads hang from time to time, you will see Unknown status as well, but you may need to wait a long time for the error message in the system log...

So, let's check the log for tests with Unknown status and a "Timed out" reply.
If there are no such items, try to find tests that spend significant time in "Checking" status.
Then check the test type and test settings, and whether the test is performed by HostMonitor, Passive RMA or Active RMA.
Maybe you set too long a timeout interval for some tests?

Regards
Alex

rc · Joined: 01 Aug 2005 Posts: 100

Hi Alex,

In principle, what you say sounds logical. Since HostMonitor and RCC are still responding, one must assume that the monitoring task is waiting, but for what? My timeouts are between 2 and 5 seconds, in individual cases up to 30 seconds (LDAP). For URL checks, it is often the Windows timeout that gets reported. Incidentally, there is a small bug in the UDP test method: in the GUI I enter 2 seconds for the timeout, but the AlertThreshold variable shows 2 ms.

According to the Auditing Tool, I have 37 tests/sec and 13,000 tests in total. I sometimes get timeouts in SNMP checks (25 tests/sec) with a timeout of 2000 ms. These are sporadic and have only caused a few alarms per day (Unknown status = Bad).

When the "stopped monitoring" error occurs, you can see how the value drops shortly before the standstill:

[12.11.2012 11:10:33] HostMonitor: (tests/sec) Ok 35,45
[12.11.2012 11:15:35] HostMonitor: (tests/sec) Ok 36,84
[12.11.2012 11:20:37] HostMonitor: (tests/sec) Ok 35,88
[12.11.2012 11:25:40] HostMonitor: (tests/sec) Ok 36,27
[12.11.2012 11:30:42] HostMonitor: (tests/sec) Ok 35,92
[12.11.2012 11:35:43] HostMonitor: (tests/sec) Ok 21,62
[12.11.2012 11:40:45] HostMonitor: (tests/sec) Warning 0
[12.11.2012 11:45:47] HostMonitor: (tests/sec) Warning 0
[12.11.2012 11:50:55] HostMonitor: (tests/sec) Warning 0.00
[12.11.2012 11:55:55] HostMonitor: (tests/sec) Warning 0.00
[12.11.2012 12:00:56] HostMonitor: (tests/sec) Ok 36,16
[12.11.2012 12:05:57] HostMonitor: (tests/sec) Ok 36,33
[12.11.2012 12:10:59] HostMonitor: (tests/sec) Ok 36,4
[12.11.2012 12:16:00] HostMonitor: (tests/sec) Ok 36,71
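
The standstill can also be picked out of such a log automatically. The following is a minimal sketch that assumes the line format shown above (timestamp in brackets, status word, rate with either a comma or a dot as decimal separator). The 1.0 tests/sec threshold and the function names are arbitrary choices for illustration.

```python
import re
from datetime import datetime

LINE_RE = re.compile(
    r"\[(\d{2}\.\d{2}\.\d{4} \d{2}:\d{2}:\d{2})\] "
    r"HostMonitor: \(tests/sec\) (\w+) ([\d.,]+)"
)

# A few lines copied from the log above.
SAMPLE = """\
[12.11.2012 11:30:42] HostMonitor: (tests/sec) Ok 35,92
[12.11.2012 11:35:43] HostMonitor: (tests/sec) Ok 21,62
[12.11.2012 11:40:45] HostMonitor: (tests/sec) Warning 0
[12.11.2012 11:45:47] HostMonitor: (tests/sec) Warning 0
[12.11.2012 11:50:55] HostMonitor: (tests/sec) Warning 0.00
[12.11.2012 11:55:55] HostMonitor: (tests/sec) Warning 0.00
[12.11.2012 12:00:56] HostMonitor: (tests/sec) Ok 36,16
"""


def parse_rate_log(text):
    """Return (timestamp, status, rate) tuples for every matching line."""
    samples = []
    for m in LINE_RE.finditer(text):
        when = datetime.strptime(m.group(1), "%d.%m.%Y %H:%M:%S")
        rate = float(m.group(3).replace(",", "."))  # accept "35,92" and "0.00"
        samples.append((when, m.group(2), rate))
    return samples


def stall_windows(samples, threshold=1.0):
    """Return (first, last) timestamps of each run of samples below threshold."""
    windows = []
    start = prev = None
    for when, _status, rate in samples:
        if rate < threshold:
            if start is None:
                start = when
            prev = when
        elif start is not None:
            windows.append((start, prev))
            start = None
    if start is not None:  # log ends while still stalled
        windows.append((start, prev))
    return windows


if __name__ == "__main__":
    for first, last in stall_windows(parse_rate_log(SAMPLE)):
        print(f"stalled from {first} to {last} ({last - first})")
```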

On my system, the phenomenon reliably occurs after about 12 to 15 million tests, never before. The time of day does not matter. So I repeat: some parameter appears to hit a limit here, each time after about 4 days.
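
As a quick plausibility check of these numbers: at the 37 tests/sec reported by the Auditing Tool, 12 to 15 million tests correspond to roughly four days of uptime. The snippet below is only this arithmetic written out; the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def days_until(total_tests: float, tests_per_sec: float) -> float:
    """Days of continuous monitoring needed to reach total_tests."""
    return total_tests / (tests_per_sec * SECONDS_PER_DAY)


if __name__ == "__main__":
    for total in (12e6, 15e6):
        days = days_until(total, 37)
        print(f"{total / 1e6:.0f} million tests at 37/sec: {days:.1f} days")
```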

What we need, in my opinion, is more detailed debug logging. Perhaps it is possible to generate a syslog message when no tests have been performed for a minute. If the log then contained a list of the tests currently running, it might be easy to narrow down the error.

So far I have never been able to find any indication of the error.
After 15 min at a test rate of 0/sec, I let the system restart automatically. I cannot accept a longer interruption.

By the way, I set the tests/sec parameter to 64 so that, after delays in the test interval, the system can catch up promptly with a short burst at a higher rate.

Until tomorrow...

Enrico