Author |
Message |
averylarry
Joined: 17 Dec 2019 Posts: 11
|
Posted: Tue Dec 17, 2019 9:45 am Post subject: Event logs intermittent transient failures |
|
|
Perhaps this is just going to be normal, but it's frustrating.
All of my NT Event Log tests fail intermittently with "Cannot open event log. The handle is invalid.". I have put them in a folder and set the folder property "Non-simultaneously test execution". This seems to help, but it does not eliminate the issue entirely.
HM 11.98 running on Windows Server 2019. It happens whether running as a service (Local System) or running as an application. |
|
|
|
KS-Soft
Joined: 03 Apr 2002 Posts: 12795 Location: USA
|
Posted: Tue Dec 17, 2019 10:39 am Post subject: |
|
|
Usually this happens on Windows Server 2012, but we thought Microsoft had fixed the problem in Windows 2012 R2 and Windows 2016.
Not sure we can do anything about this, but we will check...
Regards
Alex |
|
|
|
averylarry
Joined: 17 Dec 2019 Posts: 11
|
Posted: Wed Dec 18, 2019 8:28 am Post subject: |
|
|
This is also happening with Service tests, though less often. "Win32 error #1722 The RPC server is unavailable." |
|
|
|
KS-Soft
Joined: 03 Apr 2002 Posts: 12795 Location: USA
|
Posted: Wed Dec 18, 2019 8:33 am Post subject: |
|
|
We rechecked our code and also tested a debug version with extra logging: HostMonitor works correctly, and it receives and uses correct handles. It looks like a Windows bug.
Regards
Alex |
|
|
|
averylarry
Joined: 17 Dec 2019 Posts: 11
|
Posted: Wed Dec 18, 2019 9:05 am Post subject: |
|
|
I wish I knew more about how it works internally. So far when I restart the hostmonitor service, the tests all immediately start working again. Is there some type of connection pool being re-used that needs to be flushed?
Aside from hostmonitor, similar problems happen with web servers' application pools going stale -- they can be fixed by recycling/restarting the app pool.
Alternatively -- is there a way perhaps you could capture the issue and retry the test or force the test to use a new connection? Again, I don't really know how it works internally so I don't know if the underlying issue is trying to re-use a "dead" socket connection.
I don't want to be a pest about this -- but it will be very frustrating if Event Log and Service tests are simply unreliable. My Service tests are set to check every 30 minutes and they have failed 4 times in the last 8 hours. My Event Log tests run every 1-2 minutes and have failed 5 times in the last 8 hours. |
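As a rough check on the figures in this post, the failure counts can be turned into per-run rates. This is only a sketch under stated assumptions: it treats each category as a single test stream, assumes every scheduled run actually executed, and assumes "failed N times" counts individual runs. The function name `failure_rate` is made up for illustration.

```python
def failure_rate(failures: int, window_hours: float, interval_minutes: float) -> float:
    """Fraction of scheduled runs that failed within the observation window."""
    runs = window_hours * 60 / interval_minutes
    return failures / runs


# Service tests: every 30 minutes, 4 failures in 8 hours (16 runs).
service = failure_rate(4, 8, 30)

# Event Log tests: every 1-2 minutes, 5 failures in 8 hours (480 or 240 runs).
event_low = failure_rate(5, 8, 1)
event_high = failure_rate(5, 8, 2)

print(f"Service tests:   {service:.1%}")
print(f"Event Log tests: {event_low:.2%} to {event_high:.2%}")
```

Under those assumptions the Service tests fail on about a quarter of their runs, while the Event Log tests fail on roughly 1-2% of theirs. If several tests share each schedule, the per-test rates would be lower.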
|
|
|
KS-Soft
Joined: 03 Apr 2002 Posts: 12795 Location: USA
|
Posted: Wed Dec 18, 2019 10:21 am Post subject: |
|
|
Quote: | Is there some type of connection pool being re-used that needs to be flushed? |
Nothing on HostMonitor side, not 100% sure about Windows.
Quote: | is there a way perhaps you could capture the issue and retry the test or force the test to use a new connection? |
We are testing some ideas.
Also, you can set a "repeat test" action for the Unknown status and some other actions for the "Bad" status.
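For readers wondering what "capture the issue and retry" might look like in general terms, here is a minimal sketch of a bounded retry wrapper. It is not HostMonitor code and does not use any HostMonitor API; `run_with_retry` and the `check` callable are hypothetical names, and the sketch assumes a transient failure surfaces as an `OSError`.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def run_with_retry(check: Callable[[], T], attempts: int = 3,
                   delay_seconds: float = 1.0) -> T:
    """Call check() up to `attempts` times, re-raising the last OSError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return check()
        except OSError as error:
            last_error = error
            # Pause before the next attempt, but not after the final one.
            if attempt < attempts:
                time.sleep(delay_seconds)
    raise last_error
```

A wrapper like this only masks transient errors; it cannot help if the underlying call keeps failing or misbehaves on repeated calls.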
Quote: | This is also happening with Service tests, though less often. "Win32 error #1722 The RPC server is unavailable." |
As far as we know, the Windows API related to services is reliable.
Maybe there is something wrong with your network, router or software (e.g. antivirus)? Maybe the target system(s) are too busy and do not respond sometimes?
Could you try to set up a TCP test to check the target servers, e.g. port 135? Does it set "No answer" status sometimes?
Have you checked memory, handles and CPU usage on the target system(s)?
Did the old Windows 2008 system always perform Service tests without errors?
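Outside HostMonitor, the same kind of port 135 probe can be approximated with a few lines of Python. This is only a sketch: `tcp_port_open` is a hypothetical helper, the 3-second timeout is an arbitrary choice, and a successful connect shows only that the port accepts TCP connections, not that RPC calls will succeed.

```python
import socket


def tcp_port_open(host: str, port: int = 135, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example (hypothetical host name):
#   tcp_port_open("workstation01", 135)
```

Running it in a loop against the affected Windows 10 workstations and logging the results would show whether the port itself is intermittently unreachable.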
Regards
Alex |
|
|
|
KS-Soft
Joined: 03 Apr 2002 Posts: 12795 Location: USA
|
Posted: Fri Dec 20, 2019 9:29 am Post subject: |
|
|
We checked our old records and performed a lot of new tests; the conclusion is the same: there are bugs in the Windows wevtapi.dll.
E.g. you can call an API function with a correct handle and it will say "the handle is invalid"; then you call the same API with exactly the same parameters and the same handle and it will be accepted. Sounds OK, just make a 2nd call? No, that does not work well either, because the 2nd call may or may not be accepted, and it can also throw a memory access violation error.
Anyway, we modified our code and found some workarounds, but we cannot fix it completely. Microsoft should do this...
If you have updated to version 12.00, you can apply the hot fix (unzip and replace hostmon.exe):
www.ks-soft.net/download/hm1201.zip
Regards
Alex |
|
|
|
averylarry
Joined: 17 Dec 2019 Posts: 11
|
Posted: Fri Dec 20, 2019 9:56 am Post subject: |
|
|
I really appreciate your effort. My service tests are running over 60% unknown status and my event log tests have been running at 45% unknown status.
I will upgrade to the beta and apply the hotfix and report my results next week. |
|
|
|
KS-Soft
Joined: 03 Apr 2002 Posts: 12795 Location: USA
|
Posted: Fri Dec 20, 2019 10:57 am Post subject: |
|
|
Service tests should work fine.
As far as we know, the Windows API related to services is reliable.
Maybe there is something wrong with your network, router or software (e.g. antivirus)? Maybe the target system(s) are too busy and do not respond sometimes?
Could you try to set up a TCP test to check the target servers, e.g. port 135? Does it set "No answer" status sometimes?
Have you checked memory, handles and CPU usage on the target system(s)?
Did the old Windows 2008 system always perform Service tests without errors?
Regards
Alex |
|
|
|
averylarry
Joined: 17 Dec 2019 Posts: 11
|
Posted: Fri Dec 20, 2019 11:28 am Post subject: |
|
|
On the old server, statistics for event log tests over the last 1300 days:
99.17% alive
0.03% dead
0.79% unknown
Statistics for Service tests over the last 500 days:
98.32% alive
1.4% dead
0.03% unknown
If I turn the old server's HM back on, the Event log and Service tests are working (though clearly only for a few minutes). I can let them run over the weekend for comparison. |
|
|
|
averylarry
Joined: 17 Dec 2019 Posts: 11
|
Posted: Mon Jan 06, 2020 3:49 pm Post subject: |
|
|
I'm back from vacation.
Update:
Event log tests all seem to be fine.
Service tests:
On the old server (Windows 2008 R2) running HM 8.14, all of the service tests run with only 0.04% unknown status (in other words, they work and are stable).
On the new server (Windows 2019) running HM 12.01 with the hotfix, the service tests vary. When testing a service on a Windows 10 workstation, they are running at 30-40% unknown status (RPC server unavailable). When testing a service on a Windows server (2008 R2, 2016 and 2019), they are running properly and are stable at 0.05% unknown status. |
|
|
|
averylarry
Joined: 17 Dec 2019 Posts: 11
|
Posted: Mon Jan 06, 2020 3:53 pm Post subject: |
|
|
The big new problem we have, however, is what I believe to be a memory leak somewhere. The HostMonitor process increases its memory usage until it hits about 1.9 GB, and then all of the tests basically stop working (and I get hundreds of emails). If I try to connect via RCC, I get this in the log:
[12/29/2019 7:13:30 AM] Connecting... Ok. TCP Connection established
[12/29/2019 7:13:30 AM] Authentication... Ok
[12/29/2019 7:13:30 AM] RCC handshake... Ok
[12/29/2019 7:13:30 AM] Retrieving palettes... Ok
[12/29/2019 7:13:30 AM] Retrieving RMA list... Ok
[12/29/2019 7:13:30 AM] Retrieving reports... Ok
[12/29/2019 7:13:30 AM] Retrieving global variables... Ok
[12/29/2019 7:13:30 AM] Retrieving user profiles... Ok
[12/29/2019 7:13:30 AM] Retrieving user menus... Ok
[12/29/2019 7:13:30 AM] Retrieving scripts... Ok
[12/29/2019 7:13:30 AM] Retrieving schedules... Ok
[12/29/2019 7:13:30 AM] Retrieving action list... Ok
[12/29/2019 7:13:30 AM] Retrieving options... Ok
[12/29/2019 7:13:30 AM] Retrieving test list... Error:
[12/29/2019 7:13:34 AM] Disconnecting... Disconnected
If I restart the service, everything is fine. hostmon.exe memory usage is then 15 MB.
If I happen to already have RCC open and connected, it works and I can see things having problems. Mostly the ping tests go to status unknown and all the other tests go to status "Wait for Master" because they are all based on the ping tests as master tests. |
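When a connection log like the one above runs to many lines, a short script can pick out the step that failed. The sketch below assumes each line has the shape `[timestamp] Step... Result` and that a failure is a result starting with "Error"; both are inferred from this one log excerpt, not from any documented RCC log format, and `first_failed_step` is a made-up helper name.

```python
import re
from typing import Optional

# One log line: "[timestamp] Step name... Result text"
LINE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<step>.+?)\.\.\.\s*(?P<result>.*)$")

SAMPLE_LOG = """\
[12/29/2019 7:13:30 AM] Connecting... Ok. TCP Connection established
[12/29/2019 7:13:30 AM] Retrieving options... Ok
[12/29/2019 7:13:30 AM] Retrieving test list... Error:
[12/29/2019 7:13:34 AM] Disconnecting... Disconnected
"""


def first_failed_step(log_text: str) -> Optional[str]:
    """Return the name of the first step whose result starts with 'Error'."""
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match and match.group("result").startswith("Error"):
            return match.group("step")
    return None


print(first_failed_step(SAMPLE_LOG))  # prints "Retrieving test list"
```

For the excerpt in this post, the only failing step is retrieving the test list; every earlier step reports Ok.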
|
|
|
KS-Soft
Joined: 03 Apr 2002 Posts: 12795 Location: USA
|
Posted: Mon Jan 06, 2020 4:07 pm Post subject: |
|
|
Please contact support by e-mail, send your config files if you can.
(support@ks-soft.net)
Regards
Alex |
|
|
|
|