HM freezes after update from V8.86 to V9.06

rc
Posts: 100
Joined: Mon Aug 01, 2005 7:51 am

HM freezes after update from V8.86 to V9.06

Post by rc »

Hi Aleks,

Yesterday I updated our HM installation from V8.86 to V9.06, and it ran for approximately 6 hours without problems. After that, HM froze and test execution stopped. There are no error messages, and the same issue occurred after restarting the HM service, but the number of threads was approximately 1000! It's a pity, because the old version V8.86 is very stable.

We use Windows Server 2003 SP2 on an HP ProLiant DL380 G6 (8 cores, 2.27 GHz Intel Xeon, 4086 MB RAM). We run 29 tests/sec.

If you like, I can send you our configuration files.

Kind regards
Enrico
KS-Soft
Posts: 12869
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

There are thousands of changes in the code; that's why we uploaded a Beta version and waited 6 weeks for bug reports. There were over 14,000 downloads, but nobody sent bug reports for the Beta version (as usual) :(
We also spent many, many weeks testing the software on our servers; it works great here.

Do you have ODBC Query tests? ODBC logging?
Have you changed the ODBC driver recently?
Yes, please send HML, LST and INI files to support@ks-soft.net

Regards
Alex
rc
Posts: 100
Joined: Mon Aug 01, 2005 7:51 am

Post by rc »

...OK, I have sent you the files.
We don't use ODBC logging, but we have 5 ODBC Query tests.
Also, I haven't changed the ODBC driver recently.

Regards
Enrico
apaitoperations
Posts: 40
Joined: Thu Feb 24, 2011 1:55 am

same here

Post by apaitoperations »

I upgraded from 8.68 to 8.82 and then the problems started.

The HM service gets stuck every few hours on Server 2008 R2 64-bit.

see http://www.ks-soft.net/cgi-bin/phpBB/vi ... torder=asc
rc
Posts: 100
Joined: Mon Aug 01, 2005 7:51 am

...finally somebody with the same issue

Post by rc »

... unfortunately, I still see this phenomenon as well.
Always after approximately 16 million executed tests, no more checks are executed.
However, the application still responds. There are never any entries in the log or unusually high values in Task Manager. For months I have been restarting the service automatically via a second HostMonitor.

In the meantime, I have moved all ODBC tests to an RMA agent and updated the ODBC driver. I do not use ODBC logging. So that is definitely not the cause.

The behaviour of HostMonitor (V9.18) is always the same. Everything works perfectly up to the limit of 16 million executed tests (3 million per day).

It would be nice if the bug could still be found.

Best regards
Enrico
apaitoperations
Posts: 40
Joined: Thu Feb 24, 2011 1:55 am

now we are already three

Post by apaitoperations »

Hello Enrico,

There are already three users with the same issue:

You, me and Kris!

As far as I understand, we use different versions on different operating systems, so that cannot be the problem. It must be something in the code after version 8.68, but maybe before version 8.82.

Maybe we should open a new topic for the three of us and post there?
What do you think?

wbr
Georg
apaitoperations
Posts: 40
Joined: Thu Feb 24, 2011 1:55 am

how to count checks

Post by apaitoperations »

Enrico,

How do you count the checks, so that you know about the 16 million limit?

My HM service freezes about every 5 hours.
I have 33 checks per second.

I also have no ODBC logging and no ODBC checks. I removed all I had because of permanent problems with ODBC drivers. I now use my own shell scripts to connect to databases.

But I use about 200 different text logs.

I also never see any entries in any log file when HM is stuck.

When HM is frozen, RCC stays functional; it even sends commands to the HM service, and the commands reach the HM service. I see that in userlog.xml.

So the HM service just sits and waits and doesn't perform any checks any more. I also see in Resource Monitor that there is always one thread that keeps the other threads waiting (use "Analyze Wait Chain" in Resource Monitor on hostmon.exe).

wbr
Georg
rc
Posts: 100
Joined: Mon Aug 01, 2005 7:51 am

Post by rc »

Hello Georg,

If it helps to find the solution, with pleasure :wink:

Enrico
rc
Posts: 100
Joined: Mon Aug 01, 2005 7:51 am

Post by rc »

Hello Georg,

I have just seen that you added another post.

I have HostMonitor send me an SMS with status information every day. If the service is restarted because no more tests are executed, I also get an SMS from the 2nd HostMonitor. From these I can see or calculate how many tests were executed.

This number is always about the same. For me it takes approximately 5 days until the next restart. That is at 37 tests per second.

Enrico
rc
Posts: 100
Joined: Mon Aug 01, 2005 7:51 am

Yesterday it happened again

Post by rc »

Hello Georg,

Yesterday it happened again.
After almost exactly 16 million checks in about 5 days, my second instance of HostMonitor again determined that the tests-per-second value of the main installation had reached 0, and it triggered an automatic restart of the service.

What is surprising, again, is that HostMonitor is still accessible: the second HostMonitor can easily retrieve the values for the "Check Host Monitor" test method.

Without this check, I could not see from the outside that no more checks are being performed. In the meantime I have moved all ODBC checks to an RMA, and the CPU and memory values stay at a low level all the time. Nevertheless, I see the same phenomenon with the latest version, v9.22.
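
The watchdog idea described here (restart once the tests-per-second reading stays at 0) can be sketched roughly as follows. This is a hypothetical illustration, not how HostMonitor implements it: the `StallDetector` class, the three-sample threshold and the sample readings are all assumptions, and how the reading is obtained and how the service is restarted are left out.

```python
from collections import deque


class StallDetector:
    """Reports a stall once the tests/sec reading has been zero for N samples in a row."""

    def __init__(self, required_zero_samples: int = 3):
        self.required = required_zero_samples
        # Keep only the most recent readings; older ones fall off automatically.
        self.recent = deque(maxlen=required_zero_samples)

    def add_sample(self, tests_per_second: float) -> bool:
        """Record one reading and return True if a restart looks warranted."""
        self.recent.append(tests_per_second)
        window_full = len(self.recent) == self.required
        return window_full and all(value == 0 for value in self.recent)


detector = StallDetector(required_zero_samples=3)
for reading in [37, 36, 0, 0, 0]:  # made-up readings, not measured data
    if detector.add_sample(reading):
        print("tests/sec stayed at 0: restart the monitored service")
```

Requiring several zero readings in a row, rather than a single one, is a design choice to avoid restarting on a momentary dip.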

@ Alex: Can this event not be recorded in the system log?
There must be some way to get an error message when monitoring stops.

The constant restarting of the service is not bad in itself, but it has the disadvantage that dependent actions in the alert profile that were pending before the restart cannot be executed afterwards, for example resetting the interval to its original value. In this case, I always have to intervene manually.

In the hope of a solution
Enrico
KS-Soft
Posts: 12869
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

First, we should understand what exactly happened.
Maybe HostMonitor works just fine, but there is some mistake in the HM Monitor test method or some statistics counter? Are you sure HostMonitor did not perform tests?

Could you remove the "restart service" action and, instead of restarting, try to check what is wrong:
1) use the Auditing Tool to check for errors/warnings
2) check the HostMonitor system log (specified on the System Log page in the HostMonitor Options dialog) for errors
3) check whether HostMonitor can perform tests: try to refresh some simple Ping test that does not have any Master tests and is performed directly by HostMonitor. Check the "Recurrences" and "Last test time" fields
4) check test statuses. Do you see a lot of tests with Unknown or Checking status?
5) check resource usage for each process started on the system. You may use the standard Windows Task Manager to check Handles, GDI and USER objects. What is the total resource usage on the system? How many handles/threads/GDI objects are used by the hostmon.exe process?
Write down some notes and counters, then restart HostMonitor.
rc wrote: @ Alex: Can this event not be recorded in the system log?
What kind of event? HostMonitor records an event when monitoring is stopped. You may also set up HostMonitor to start actions when monitoring is stopped or paused.
But we do not have any idea what happened on your system. Was monitoring stopped? Paused? Or maybe HostMonitor delayed test execution because of some logging-related problem? Or maybe HostMonitor delayed test execution because a lot of tests cannot finish while waiting for an answer from target systems?
Have you checked the system log and the Auditing Tool? Any errors?

Regards
Alex
rc
Posts: 100
Joined: Mon Aug 01, 2005 7:51 am

Post by rc »

OK Alex, thanks for the quick reply.

However, I can only repeat the same thing:
I have never found any error messages.
When the error occurs, checks can no longer be executed manually either. The "Disable / Enable Host Monitor" function then has no effect.
The error also does not occur at a particular time; for me it always occurs after exactly 16 million executed tests. Unfortunately, this often happens on weekends or at night. Since my availability monitoring is important, the application is restarted immediately.

Incidentally, I use the "store historical data in the file" function for a total of 12,639 tests.

As described, the application otherwise behaves normally. Only the main task, the testing, is not carried out.

Because I simply do not have the time to narrow down the cause further, I have worked around the problem by automatically restarting the service, and as you can see, other people have the same problem.

If there is no option that produces an error message, I honestly see no chance of finding the cause of the problem.

I'm sorry.

Enrico
KS-Soft
Posts: 12869
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

rc wrote: The "Disable / Enable Host Monitor" is then no function
You cannot disable a test??
Using RCC? Can you log in to the system directly and try to disable the test item using the HostMonitor GUI?

No errors in the log..
What about Unknown test statuses? Do you see a lot of such records in the log (not system log but regular log with test results)?

Regards
Alex
rc
Posts: 100
Joined: Mon Aug 01, 2005 7:51 am

Post by rc »

KS-Soft wrote: You cannot disable test?? Using RCC? Can you login to system directly and try to disable test item using HostMonitor GUI? No errors in the log.. What about Unknown test statuses? Do you see a lot of such records in the log (not system log but regular log with test results)?
No, I mean the stop / start monitoring feature.
No tests with "Unknown" status are displayed.
The log (I log each test result) ends exactly at the time when checking stops.
apaitoperations
Posts: 40
Joined: Thu Feb 24, 2011 1:55 am

exactly the same here

Post by apaitoperations »

Exactly the same happens here with my installation, but about every 5 hours; that's about 450,000 to 550,000 tests.

It just stops executing the checks. Before it happens, I think it gets slower and slower and the intervals are no longer met. I see that in the full-logging log files: a check should be performed every minute, but suddenly it is performed after 2 minutes, then 5 minutes, then it stops completely.

Also here, there is no sign of any event message in any log file or event log.
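
The slowdown pattern described above (1 minute, then 2, then 5, then nothing) could be spotted by comparing consecutive timestamps of one test against its expected interval. The following is a minimal sketch under stated assumptions: the timestamps are made-up sample data, extracting them from a real log file is not shown because the log format is not given in this thread, and `gaps_exceeding` and its `tolerance` parameter are hypothetical names.

```python
from datetime import datetime, timedelta


def gaps_exceeding(timestamps, expected: timedelta, tolerance: float = 1.5):
    """Return (previous, current, gap) for each consecutive pair whose gap is too long.

    A gap counts as too long when it exceeds expected * tolerance.
    """
    late = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap > expected * tolerance:
            late.append((prev, cur, gap))
    return late


# Made-up execution times of a test that should run once per minute.
start = datetime(2011, 1, 1, 12, 0)
runs = [start + timedelta(minutes=m) for m in (0, 1, 2, 4, 9)]

for prev, cur, gap in gaps_exceeding(runs, expected=timedelta(minutes=1)):
    print(f"late run at {cur:%H:%M}: gap of {gap} since {prev:%H:%M}")
```

Growing gaps shortly before the last logged result would support the impression that the scheduler slows down before it stops, rather than halting abruptly.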

wbr
Georg