KS-Soft
Joined: 03 Apr 2002 Posts: 12807 Location: USA
Posted: Tue Nov 13, 2012 6:28 am
Quote: Since HostMonitor and the RCC are still responding, one must assume that the monitoring task is waiting, but for what?
It waits until some of these 1000 threads finish execution.
Do you see any "Timed out" or "Checking" test items?
Quote: Incidentally, there is a small bug in the UDP test method. In the GUI I enter 2 seconds as the timeout, but the AlertThreshold variable shows 2 ms.
You are right, that is our mistake.
It will be fixed in the next version.
Quote: With URL checks, the Windows timeout is often registered.
How many URL tests do you have?
Quote: In my opinion, what we need is more detailed debug logging. Perhaps it is possible to generate syslog messages when no tests have been performed within a minute. If a list of current tests appeared in the log, one could narrow down the error more easily.
I don't think we just need messages about pauses. We need a list of tests that take too much time to execute.
You probably already have such a log: "Timed out" items should be recorded in the regular log that stores test results. Do you see such items?
It's more difficult to catch tests that take 3 or 5 minutes to execute.
You can adjust one internal parameter and tell HostMonitor to mark a test as "timed out" after 3 or 5 minutes (instead of 15 minutes by default), but then some "normal" tests that need a lot of time to execute will fail...
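A minimal Python sketch of that trade-off, for illustration only. The `Probe` record, `mark_timed_out()` and the `limit_s` values are hypothetical (the thread does not name the internal parameter) and are not HostMonitor's implementation: any probe still "Checking" past the limit is switched to Unknown with a "Timed out" reply.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One running test probe (hypothetical model, not HostMonitor's data structure)."""
    test_id: int
    name: str
    started_at: float          # seconds on some monotonic clock
    status: str = "Checking"
    reply: str = ""


def mark_timed_out(probes, now, limit_s):
    """Mark probes running longer than limit_s as Unknown / "Timed out".

    Returns the IDs of the probes that were marked.
    """
    marked = []
    for p in probes:
        if p.status == "Checking" and now - p.started_at > limit_s:
            p.status = "Unknown"
            p.reply = "Timed out"
            marked.append(p.test_id)
    return marked


def demo(limit_s):
    now = 10_000.0
    probes = [
        Probe(1, "hung SNMP Get", now - 16 * 60),           # never going to finish
        Probe(2, "slow but healthy script", now - 4 * 60),  # legitimately takes ~4 min
        Probe(3, "fresh ping", now - 5),
    ]
    return mark_timed_out(probes, now, limit_s)


print(demo(15 * 60))  # [1]
print(demo(3 * 60))   # [1, 2]
```

With a 15 minute limit only the hung probe is marked; lowering the limit to 3 minutes also fails the slow but healthy test, which is the side effect described above.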
When the problem appears, could you check for "Timed out" test items (Unknown status, "Timed out" reply string)?
Do you see test items that stay in "Checking" status for a while?
Do you see "Timed out" items in the log?
Regards
Alex
rc
Joined: 01 Aug 2005 Posts: 100
Posted: Tue Nov 13, 2012 7:23 am
I analyzed the complete log from yesterday and found no evidence before the failure that something was wrong. Only 500 of a total of 3.2 million tests returned "timeout", and those were all SNMP Get tests.
I have 9130 SNMP Get tests in my configuration:
1 test with a 1000 ms timeout
8976 tests with a 2000 ms timeout
63 tests with a 3000 ms timeout
49 tests with a 5000 ms timeout
41 tests with a 10000 ms timeout
Regarding your other question:
I have a total of 103 URL checks and 86 HTTP checks.
At the relevant time, only one test had Unknown status.
With 13,000 tests in total, it is impossible for me to check for the "Checking" status manually. Could I perhaps create a view for it? Currently, neither the GUI nor the log shows what the application is doing.
Quote: You can adjust one internal parameter and tell HostMonitor to mark a test as "timed out" after 3 or 5 minutes (instead of 15 minutes by default), but then some "normal" tests that need a lot of time to execute will fail...
I couldn't find any parameter for this in hostmon.ini.
Perhaps it would be nice to have an indicator in the Auditing Tool that displays tests that pose a potential risk.
KS-Soft
Joined: 03 Apr 2002 Posts: 12807 Location: USA
Posted: Tue Nov 13, 2012 11:34 am
Quote: I analyzed the complete log from yesterday and found no evidence before the failure that something was wrong. Only 500 of a total of 3.2 million tests returned "timeout", and those were all SNMP Get tests.
There are two different messages:
"Timeout" is not a problem (unless the timeout specified for the test is very long).
"Timed out" indicates a problem regardless of test settings.
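When scanning a results log for these, note that a case-insensitive search for "timeout" will not match "Timed out", so the two have to be counted separately. A minimal Python sketch, assuming the reply strings have already been extracted from the log and are exactly "Timeout" or "Timed out" (the helper and category labels are hypothetical):

```python
from collections import Counter


def classify_reply(reply: str) -> str:
    """Sort a test reply string into one of the two cases above.

    Assumes the reply is exactly "Timeout" or "Timed out" (case-insensitive);
    real reply strings may carry extra text.
    """
    text = reply.strip().lower()
    if text == "timed out":
        return "timed out (probe never finished)"
    if text == "timeout":
        return "timeout (test's own limit)"
    return "other"


replies = ["Timeout", "Ok", "Timed out", "Timeout"]
summary = Counter(classify_reply(r) for r in replies)
print(summary["timeout (test's own limit)"], summary["timed out (probe never finished)"])  # 2 1
```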
Quote: I have 9130 SNMP Get tests in my configuration:
1 test with a 1000 ms timeout
8976 tests with a 2000 ms timeout
63 tests with a 3000 ms timeout
49 tests with a 5000 ms timeout
41 tests with a 10000 ms timeout
That should not be a problem, unless the Retries value is high...
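A rough Python estimate of why, using the timeout distribution posted above. It assumes (not confirmed in this thread) that retries run sequentially and each attempt can wait the full timeout, so one run takes at most timeout * (1 + retries); the Retries values tried are hypothetical.

```python
# Timeout distribution of the SNMP Get tests listed above: {timeout_ms: test_count}
SNMP_TIMEOUTS_MS = {1000: 1, 2000: 8976, 3000: 63, 5000: 49, 10000: 41}


def worst_case_ms(timeout_ms: int, retries: int) -> int:
    """Upper bound for one run, assuming every attempt waits the full timeout."""
    return timeout_ms * (1 + retries)


def longest_run_s(retries: int) -> float:
    """Slowest single SNMP Get run in the worst case, in seconds."""
    return max(worst_case_ms(t, retries) for t in SNMP_TIMEOUTS_MS) / 1000


def total_thread_seconds(retries: int) -> float:
    """Thread time used if every SNMP Get test hit its worst case once."""
    return sum(n * worst_case_ms(t, retries) for t, n in SNMP_TIMEOUTS_MS.items()) / 1000


assert sum(SNMP_TIMEOUTS_MS.values()) == 9130  # matches the reported total

for retries in (0, 2, 5):
    print(retries, longest_run_s(retries), total_thread_seconds(retries))
```

Under that assumption even 5 retries keep the slowest SNMP Get run at about a minute, far below the 15 minute default after which HostMonitor marks a test as "Timed out".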
Could you send the HML file with your tests to support@ks-soft.net? Maybe we can find something...
Quote: With 13,000 tests in total, it is impossible for me to check for the "Checking" status manually. Could I perhaps create a view for it? Currently, neither the GUI nor the log shows what the application is doing.
There is no such option.
Quote: Perhaps it would be nice to have an indicator in the Auditing Tool that displays tests that pose a potential risk.
It's not easy to tell...
HostMonitor should probably collect some statistics; this would be more useful than checking test settings.
Regards
Alex
apaitoperations
Joined: 24 Feb 2011 Posts: 40
Posted: Tue Nov 20, 2012 3:56 am Post subject: Can't find the debuglog feature
Hello!
I also installed version 9.32, but I can't find the debug log feature in RCC -> Auditing Tool.
What am I missing?
wbr
Georg Höllebauer
KS-Soft
Joined: 03 Apr 2002 Posts: 12807 Location: USA
Posted: Tue Nov 20, 2012 9:24 am
Just open the Auditing Tool; the log file is created automatically.
Regards
Alex
apaitoperations
Joined: 24 Feb 2011 Posts: 40
Posted: Tue Nov 20, 2012 9:30 am Post subject: debuglog location
And where will the debug log be created?
On the HM server, or on the machine where RCC is running and the Auditing Tool is opened?
wbr
Georg Höllebauer
KS-Soft
Joined: 03 Apr 2002 Posts: 12807 Location: USA
Posted: Tue Nov 20, 2012 9:32 am
On the system where HostMonitor is running: check the folder where the HostMonitor configuration files are located.
Regards
Alex
apaitoperations
Joined: 24 Feb 2011 Posts: 40
Posted: Tue Nov 20, 2012 9:35 am Post subject: found it :-)
HM stopped working about 30 minutes ago; here is the debug log.
The last thing that gets updated is userlog.xml, so HM is receiving commands like refresh but actually does nothing.
----
2012-11-20 10:47:18
Timer1: 1 2012-11-20 10:47:17
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 128
TCnt1: 9725
TTThreads: 70
LIdx: 3242
ATCnt2: 32
----
2012-11-20 10:48:21
Timer1: 1 2012-11-20 10:48:20
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 128
TCnt1: 9725
TTThreads: 95
LIdx: 8862
ATCnt2: 32
----
2012-11-20 10:50:44
Timer1: 1 2012-11-20 10:50:44
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 128
TCnt1: 9725
TTThreads: 128
LIdx: 9725
ATCnt2: 32
----
2012-11-20 16:00:59
Timer1: 1 2012-11-20 16:00:58
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 128
TCnt1: 9734
TTThreads: 1003
LIdx: 3154
ATCnt2: 32
----
2012-11-20 16:19:52
Timer1: 1 2012-11-20 16:19:52
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 128
TCnt1: 9734
TTThreads: 1003
LIdx: 3154
ATCnt2: 32
----
2012-11-20 16:29:03
Timer1: 1 2012-11-20 16:29:02
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 128
TCnt1: 9734
TTThreads: 1003
LIdx: 3154
ATCnt2: 32
----
2012-11-20 16:33:18
Timer1: 1 2012-11-20 16:33:18
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 128
TCnt1: 9734
TTThreads: 1003
LIdx: 3154
ATCnt2: 32
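To spot the moment the scheduler stops advancing, the debug log can be parsed block by block. A minimal Python sketch, assuming (this is not documented in the thread) that TTThreads counts active test threads and that an unchanged LIdx means no new tests are being picked up; the 1000-thread threshold comes from the logs posted here, and the sample is trimmed to the relevant keys. `parse_debuglog()` and `find_stalls()` are hypothetical helpers.

```python
from datetime import datetime


def parse_debuglog(text: str):
    """Split debuglog1.txt-style text (blocks separated by "----") into dicts."""
    snapshots = []
    for block in text.split("----"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        snap = {"time": datetime.strptime(lines[0], "%Y-%m-%d %H:%M:%S")}
        for ln in lines[1:]:
            key, _, value = ln.partition(":")
            tokens = value.split()
            # Keep the first token as int when numeric ("Timer1: 1 2012-..." -> 1).
            snap[key.strip()] = int(tokens[0]) if tokens and tokens[0].isdigit() else value.strip()
        snapshots.append(snap)
    return snapshots


def find_stalls(snapshots, thread_limit=1000):
    """Times where TTThreads sits at/above thread_limit and neither it nor LIdx moved."""
    stalls = []
    for prev, cur in zip(snapshots, snapshots[1:]):
        if (cur["TTThreads"] >= thread_limit
                and cur["TTThreads"] == prev["TTThreads"]
                and cur["LIdx"] == prev["LIdx"]):
            stalls.append(cur["time"])
    return stalls


SAMPLE = """----
2012-11-20 10:50:44
Timer1: 1 2012-11-20 10:50:44
TTThreads: 128
LIdx: 9725
----
2012-11-20 16:00:59
Timer1: 1 2012-11-20 16:00:58
TTThreads: 1003
LIdx: 3154
----
2012-11-20 16:19:52
Timer1: 1 2012-11-20 16:19:52
TTThreads: 1003
LIdx: 3154
"""

snaps = parse_debuglog(SAMPLE)
print([t.strftime("%H:%M:%S") for t in find_stalls(snaps)])  # ['16:19:52']
```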
KS-Soft
Joined: 03 Apr 2002 Posts: 12807 Location: USA
Posted: Tue Nov 20, 2012 9:42 am
It looks similar: monitoring is active, but HostMonitor does not start new tests because there are many started but unfinished test probes.
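A toy Python model of that behaviour: a fixed number of probe slots guarded by a `threading.BoundedSemaphore`, where a slot is released only when its probe returns. The `Scheduler` class and the limit of 3 are hypothetical and say nothing about HostMonitor's real internals; the point is only that a few probes that never return are enough to keep the main loop alive while nothing new starts.

```python
import threading


class Scheduler:
    """Toy scheduler: at most `limit` test probes run at once (hypothetical model)."""

    def __init__(self, limit: int):
        self._slots = threading.BoundedSemaphore(limit)

    def try_start(self, probe, wait_s: float = 0.1) -> bool:
        """Run probe in a new thread if a slot becomes free within wait_s seconds."""
        if not self._slots.acquire(timeout=wait_s):
            return False                  # every slot is held by an unfinished probe

        def run():
            try:
                probe()
            finally:
                self._slots.release()     # a slot is freed only when the probe returns

        threading.Thread(target=run, daemon=True).start()
        return True


hang = threading.Event()
sched = Scheduler(limit=3)
for _ in range(3):
    sched.try_start(hang.wait)            # probes that block until `hang` is set
print(sched.try_start(lambda: None))      # False: loop is alive, but nothing new starts
hang.set()                                # the stuck probes finally return
print(sched.try_start(lambda: None, wait_s=1.0))  # True
```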
Do you see items with Unknown status and "Timed out" reply value?
Do you see items that hold "Checking" status for a long time?
Regards
Alex
apaitoperations
Joined: 24 Feb 2011 Posts: 40
Posted: Tue Nov 20, 2012 9:51 am
Same here:
It's hard to tell whether there are many items in "Checking" status; we only use views that filter for Unknown and Bad items.
In the view properties there is no option to filter for items in "Checking" status.
Is there any other way to see only the items in "Checking" status, or statistics on which items had "Checking" status for a long time?
wbr
Georg Höllebauer
KS-Soft
Joined: 03 Apr 2002 Posts: 12807 Location: USA
Posted: Tue Nov 20, 2012 10:15 am
Update: www.ks-soft.net/download/hm932c.zip
(rcc.exe update is optional)
When this version creates debuglog1.txt, it checks for test items that have "Checking" status for 40 sec or more and records information about such items (status period, testID and test name).
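The check described above can be sketched in Python as a simple filter over test items. The `Item` fields and `long_checking()` are hypothetical (HostMonitor's internal data is not described in this thread); only the 40 second threshold and the recorded fields (time in status, test ID, test name) come from the description above.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """A monitored test item (hypothetical fields, not HostMonitor's internal layout)."""
    test_id: int
    name: str
    status: str
    status_since: float  # time the item entered its current status, in seconds


def long_checking(items, now, min_age_s=40):
    """Return (seconds in status, test ID, name) for items stuck in "Checking"."""
    rows = [
        (now - it.status_since, it.test_id, it.name)
        for it in items
        if it.status == "Checking" and now - it.status_since >= min_age_s
    ]
    return sorted(rows, reverse=True)  # longest-running first


items = [
    Item(101, "SNMP Get core-sw1", "Checking", 900.0),
    Item(102, "URL intranet", "Checking", 400.0),
    Item(103, "Ping gateway", "Ok", 0.0),
    Item(104, "HTTP portal", "Checking", 980.0),
]
for age, test_id, name in long_checking(items, now=1000.0):
    print(f"{age:.0f}s  {test_id}  {name}")
```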
However, updating means you have to restart HostMonitor, so the current list of "Checking" test items will be lost.
Right now you can create a view to check for "Unknown/Timed out" test items. Also, if there are such items, you should see records in the regular log that stores test results.
Regards
Alex
apaitoperations
Joined: 24 Feb 2011 Posts: 40
Posted: Tue Nov 20, 2012 10:22 am Post subject: Update
Hi!
I will update HM within the next 60 minutes and post again with new information.
Thanks for the quick update file!
wbr
Georg Höllebauer
apaitoperations
Joined: 24 Feb 2011 Posts: 40
Posted: Wed Nov 21, 2012 2:54 am Post subject: new debuglog
Hi!
Here is the new debug log with HM version 9.32c running:
----
2012-11-21 09:37:29
Timer1: 1 2012-11-21 09:37:28
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 64
TCnt1: 9726
TTThreads: 1001
LIdx: 4402
ATCnt2: 32
----
2012-11-21 09:38:32
Timer1: 1 2012-11-21 09:38:32
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 64
TCnt1: 9726
TTThreads: 1001
LIdx: 4402
ATCnt2: 32
TTThreads always stops at around 1001-1003 and never goes beyond that.
I don't see any tests with a "Timed out" reply or in "Checking" status when HM stops monitoring.
With my setup, it stops every 3 to 3.5 hours.
I have 39 checks/sec.
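Some rough arithmetic on those numbers, as a Python sketch. It assumes (hypothetically) that the roughly 1000 threads are probes that never finished and that they pile up at an even rate; the Nov 20 log shows the count can also jump, so treat the result as an average only. Under that assumption only about 1 run in 420 to 490 (roughly 0.2%) would need to hang, which may help explain why no single stuck test stands out.

```python
CHECKS_PER_SEC = 39     # reported above
STUCK_THREADS = 1000    # approximate TTThreads plateau in the debug logs


def checks_before_freeze(hours: float) -> int:
    """Test runs performed during an uptime of `hours` at the reported rate."""
    return round(CHECKS_PER_SEC * hours * 3600)


def runs_per_stuck_probe(hours: float) -> float:
    """Test runs per never-finishing probe, assuming each stuck thread is one lost run."""
    return checks_before_freeze(hours) / STUCK_THREADS


for hours in (3.0, 3.5):
    print(hours, checks_before_freeze(hours), runs_per_stuck_probe(hours))
```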
wbr
Georg Höllebauer
apaitoperations
Joined: 24 Feb 2011 Posts: 40
Posted: Wed Nov 21, 2012 7:02 am Post subject: another debuglog
HM froze again; here is the current debug log:
2012-11-21 14:00:00
Timer1: 1 2012-11-21 13:59:59
Timer2: 1
PoolRecAvail: 4096
TTLimit1: 40
TCnt1: 9732
TTThreads: 1001
LIdx: 1596
ATCnt2: 32
Again 1001 TTThreads. Maybe that's a limitation somewhere in Windows Server 2008 R2 SP1?
wbr
Georg Höllebauer
KS-Soft
Joined: 03 Apr 2002 Posts: 12807 Location: USA
Posted: Wed Nov 21, 2012 8:32 am
1001 threads and no test items in "Checking" status?? No test items with a "Timed out" reply??
How many threads does Windows Task Manager display for the hostmon.exe process?
Regards
Alex