Event logs intermittent transient failures

All questions related to installations, configurations and maintenance of Advanced Host Monitor (including additional tools such as RMA for Windows, RMA Manager, Web Service, RCC).
averylarry
Posts: 11
Joined: Tue Dec 17, 2019 8:02 am

Post by averylarry »

Perhaps this is just going to be normal, but it's frustrating.

All of my NT Event Log tests fail intermittently with "Cannot open event log. The handle is invalid." I have put them in a folder and set the folder property "Non-simultaneously test execution". This seems to help, but it does not eliminate the issue entirely.

HM 11.98 running on Windows Server 2019. It happens whether running as a service (Local System) or running as an application.
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

Usually this happens on Windows Server 2012, but we thought Microsoft fixed the problem in Windows 2012 R2 and Windows 2016.
Not sure we can do anything about this, but we will check...

Regards
Alex
averylarry
Posts: 11
Joined: Tue Dec 17, 2019 8:02 am

Post by averylarry »

This is also happening with Service tests, though less often. "Win32 error #1722 The RPC server is unavailable."
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

We rechecked our code and also tested a debug version with extra logging: HostMonitor works correctly, and it receives and uses correct handles. Looks like Windows bugs :(

Regards
Alex
averylarry
Posts: 11
Joined: Tue Dec 17, 2019 8:02 am

Post by averylarry »

I wish I knew more about how it works internally. So far when I restart the hostmonitor service, the tests all immediately start working again. Is there some type of connection pool being re-used that needs to be flushed?

Aside from hostmonitor, similar problems happen with web servers' application pools going stale -- they can be fixed by recycling/restarting the app pool.

Alternatively -- is there a way perhaps you could capture the issue and retry the test or force the test to use a new connection? Again, I don't really know how it works internally so I don't know if the underlying issue is trying to re-use a "dead" socket connection.
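
To illustrate what I mean by "retry the test with a new connection", here is a rough Python sketch. It is only an illustration of the idea, not how HostMonitor works internally: `connect` and `check` are hypothetical callables standing in for "open a fresh connection/handle" and "run the test on it", and I am assuming the failure shows up as an `OSError`.

```python
import time


def run_with_fresh_connection(connect, check, attempts=3, delay=1.0):
    """Run check(conn) on a newly opened connection, retrying on OSError.

    connect and check are hypothetical stand-ins: connect() opens a new
    connection (or handle) and check(conn) performs the actual test.
    A fresh connection is opened for every attempt and is never reused.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        conn = None
        try:
            conn = connect()
            return check(conn)
        except OSError as exc:
            last_error = exc
        finally:
            # Always drop the connection so the next attempt starts clean.
            if conn is not None:
                conn.close()
        if attempt < attempts:
            time.sleep(delay)
    # Every attempt failed: surface the last error to the caller.
    raise last_error
```

The point of the design is that nothing is carried over between attempts, so a "dead" connection cannot be picked up again on the retry.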

I don't want to be a pest about this -- but it will be very frustrating if Event Log and Service tests are simply unreliable. My Service tests are set to check every 30 minutes and they have failed 4 times in the last 8 hours. My Event Log tests run every 1-2 minutes and have failed 5 times in the last 8 hours.
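
For scale, those counts work out roughly as follows. This is a quick back-of-the-envelope sketch that assumes every test ran exactly on schedule for the whole 8 hours; the function name is just for illustration.

```python
def failure_rate(failures, window_hours, interval_minutes):
    """Fraction of scheduled test runs that failed within the window."""
    runs = window_hours * 60 / interval_minutes
    return failures / runs


# Service tests: every 30 minutes for 8 hours is 16 runs, 4 of which failed.
service = failure_rate(4, 8, 30)

# Event Log tests: every 1-2 minutes for 8 hours is 480 to 240 runs, 5 failed.
event_fast = failure_rate(5, 8, 1)
event_slow = failure_rate(5, 8, 2)

print(f"Service tests:   {service:.1%}")                          # 25.0%
print(f"Event Log tests: {event_fast:.1%} to {event_slow:.1%}")   # 1.0% to 2.1%
```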
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

"Is there some type of connection pool being re-used that needs to be flushed?"
Nothing on the HostMonitor side; not 100% sure about Windows.

"is there a way perhaps you could capture the issue and retry the test or force the test to use a new connection?"
We are testing some ideas.
Also, you can set a "repeat test" action for the Unknown status and some other actions for the "Bad" status.

"This is also happening with Service tests, though less often. 'Win32 error #1722 The RPC server is unavailable.'"
As far as we know, the services-related Windows API is reliable.
Maybe there is something wrong with your network, router or software (e.g. antivirus)? Maybe the target system(s) are too busy and do not respond sometimes?
Could you try to set up a TCP test to check the target servers, e.g. port 135? Will it set "No answer" status sometimes?
Have you checked memory, handles and CPU usage on the target system(s)?
Did the old Windows 2008 system always perform Service tests without errors?
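
Outside HostMonitor, the same kind of port check can be sketched in a few lines of Python. This is only a rough stand-in for the TCP test, not what HostMonitor does internally; the function names, the 3 second timeout and the "answer"/"no answer" wording are assumptions made for the example.

```python
import socket
import time


def probe_tcp(host, port=135, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def log_probes(host, port=135, count=10, interval=60.0):
    """Probe host:port count times, printing a timestamped line per probe."""
    results = []
    for i in range(count):
        ok = probe_tcp(host, port)
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} {host}:{port} {'answer' if ok else 'no answer'}")
        results.append(ok)
        if i < count - 1:
            time.sleep(interval)
    return results
```

Calling `log_probes("target-host", count=60)` (the host name is a placeholder) for an hour and comparing the "no answer" timestamps with the times of the RPC errors would show whether the failures line up with plain network problems.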

Regards
Alex
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

We checked our old records and performed a lot of new tests; the conclusion is the same: there are bugs in Windows wevtapi.dll :-(
E.g. you can call an API function with a correct handle and it will say "the handle is invalid"; then you call the same API with exactly the same parameters and the same handle and it is accepted. Sounds OK, just make a 2nd call? No, that does not work well either, because the 2nd call may or may not be accepted, and it can also throw a memory access violation error.

Anyway, we modified our code and found some workarounds, but we cannot fix it completely. Microsoft should do this...
If you have updated to version 12.00, you can apply the hot fix (unzip and replace hostmon.exe)
www.ks-soft.net/download/hm1201.zip

Regards
Alex
averylarry
Posts: 11
Joined: Tue Dec 17, 2019 8:02 am

Post by averylarry »

I really appreciate your effort. My service tests are running over 60% unknown status and my event log tests have been running at 45% unknown status.

I will upgrade to the beta and apply the hotfix and report my results next week.
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

Service tests should work fine.
As far as we know, the services-related Windows API is reliable.
Maybe there is something wrong with your network, router or software (e.g. antivirus)? Maybe the target system(s) are too busy and do not respond sometimes?
Could you try to set up a TCP test to check the target servers, e.g. port 135? Will it set "No answer" status sometimes?
Have you checked memory, handles and CPU usage on the target system(s)?
Did the old Windows 2008 system always perform Service tests without errors?

Regards
Alex
averylarry
Posts: 11
Joined: Tue Dec 17, 2019 8:02 am

Post by averylarry »

On the old server, statistics for event log tests over the last 1300 days:
99.17% alive
0.03% dead
0.79% unknown

Statistics for Service tests over the last 500 days:
98.32% alive
1.4% dead
0.03% unknown

If I turn the old server's HM back on, the Event log and Service tests are working (though clearly only for a few minutes). I can let them run over the weekend for comparison.
averylarry
Posts: 11
Joined: Tue Dec 17, 2019 8:02 am

Post by averylarry »

I'm back from vacation.

Update:
Event log tests all seem to be fine.

Service tests:
On the old server (Windows 2008 R2) running HM 8.14, all of the service tests run with only 0.04% unknown status (in other words, they work and are stable).
On the new server (Windows 2019) running HM 12.01 with the hotfix, the service tests vary. When testing a service on a Windows 10 workstation, they are running at 30-40% unknown status (RPC server unavailable). When testing a service on a Windows server (2008 R2, 2016 and 2019), they are running properly and stable at 0.05% unknown status.
averylarry
Posts: 11
Joined: Tue Dec 17, 2019 8:02 am

Post by averylarry »

The big new problem we have, however, is what I believe to be a memory leak somewhere. The HostMonitor process increases its memory usage until it hits about 1.9 GB, and then all of the tests basically stop working (and I get hundreds of emails). If I try to connect via RCC, I get this in the log:

[12/29/2019 7:13:30 AM] Connecting... Ok. TCP Connection established
[12/29/2019 7:13:30 AM] Authentication... Ok
[12/29/2019 7:13:30 AM] RCC handshake... Ok
[12/29/2019 7:13:30 AM] Retrieving palettes... Ok
[12/29/2019 7:13:30 AM] Retrieving RMA list... Ok
[12/29/2019 7:13:30 AM] Retrieving reports... Ok
[12/29/2019 7:13:30 AM] Retrieving global variables... Ok
[12/29/2019 7:13:30 AM] Retrieving user profiles... Ok
[12/29/2019 7:13:30 AM] Retrieving user menus... Ok
[12/29/2019 7:13:30 AM] Retrieving scripts... Ok
[12/29/2019 7:13:30 AM] Retrieving schedules... Ok
[12/29/2019 7:13:30 AM] Retrieving action list... Ok
[12/29/2019 7:13:30 AM] Retrieving options... Ok
[12/29/2019 7:13:30 AM] Retrieving test list... Error:
[12/29/2019 7:13:34 AM] Disconnecting... Disconnected

If I restart the service, everything is fine. hostmon.exe memory usage is 15 MB.
If I happen to already have RCC open and connected, it works and I can see things having problems. Mostly the ping tests go to status unknown and all the other tests go to status "Wait for Master" because they are all based on the ping tests as master tests.
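
To track how fast the memory grows between restarts, something like the following could log the process size at intervals. It is a rough sketch that assumes the Windows `tasklist` command with CSV output; the "Mem Usage" column format depends on the system locale, so the parser simply keeps the digits and treats them as kilobytes. The function names and the 5 minute interval are made up for the example.

```python
import csv
import io
import subprocess
import time


def parse_tasklist_csv(text):
    """Parse `tasklist /FO CSV /NH` output into (image, pid, mem_kb) tuples."""
    rows = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 5:
            continue  # blank lines or informational messages
        # Mem Usage looks like "15,360 K"; keep only the digits (assumed KB).
        digits = "".join(ch for ch in row[4] if ch.isdigit())
        if digits and row[1].isdigit():
            rows.append((row[0], int(row[1]), int(digits)))
    return rows


def sample_memory(image="hostmon.exe"):
    """Run tasklist (Windows only) filtered to one image name and parse it."""
    out = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image}", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_tasklist_csv(out)


def watch_memory(image="hostmon.exe", interval=300.0, samples=12):
    """Print a timestamped memory reading every interval seconds."""
    for i in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        for name, pid, mem_kb in sample_memory(image):
            print(f"{stamp} {name} pid={pid} {mem_kb / 1024:.1f} MB")
        if i < samples - 1:
            time.sleep(interval)
```

Running `watch_memory()` on the HostMonitor server and redirecting the output to a file would give a growth curve to send along with the config files.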
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

Please contact support by e-mail, send your config files if you can.
(support@ks-soft.net)

Regards
Alex