KS-Soft. Network Management Solutions

Event logs intermittent transient failures

 
averylarry



Joined: 17 Dec 2019
Posts: 11

Posted: Tue Dec 17, 2019 9:45 am    Post subject: Event logs intermittent transient failures

Perhaps this is just going to be normal, but it's frustrating.

All of my NT Event Log tests fail intermittently with "Cannot open event log. The handle is invalid." I have put them in a folder and set the folder property "Non-simultaneously test execution". This seems to help, but it does not eliminate the issue entirely.

HM 11.98 running on Windows Server 2019. It happens whether running as a service (Local System) or running as an application.
KS-Soft



Joined: 03 Apr 2002
Posts: 12791
Location: USA

Posted: Tue Dec 17, 2019 10:39 am

Usually this happens on Windows Server 2012, but we thought Microsoft had fixed the problem in Windows 2012 R2 and Windows 2016.
Not sure we can do anything about this, but we will check...

Regards
Alex

averylarry

Posted: Wed Dec 18, 2019 8:28 am

This is also happening with Service tests, though less often. "Win32 error #1722 The RPC server is unavailable."

KS-Soft

Posted: Wed Dec 18, 2019 8:33 am

We rechecked our code and also tested a debug version with extra logging. HostMonitor works correctly: it receives and uses correct handles. This looks like a Windows bug.

Regards
Alex

averylarry

Posted: Wed Dec 18, 2019 9:05 am

I wish I knew more about how it works internally. So far when I restart the hostmonitor service, the tests all immediately start working again. Is there some type of connection pool being re-used that needs to be flushed?

Aside from hostmonitor, similar problems happen with web servers' application pools going stale -- they can be fixed by recycling/restarting the app pool.

Alternatively -- is there a way perhaps you could capture the issue and retry the test or force the test to use a new connection? Again, I don't really know how it works internally so I don't know if the underlying issue is trying to re-use a "dead" socket connection.

I don't want to be a pest about this -- but it will be very frustrating if Event Log and Service tests are simply unreliable. My Service tests are set to check every 30 minutes and they have failed 4 times in the last 8 hours. My Event Log tests run every 1-2 minutes and have failed 5 times in the last 8 hours.
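
To put the figures above on a common scale, the sketch below converts a failure count into a per-run failure rate. It is only an illustration: the helper name `failure_rate` is hypothetical, and it assumes a single test item that runs exactly once per interval, which the post does not state. With several tests in each category, the real per-run rate would be lower.

```python
def failure_rate(failures: int, window_hours: float, interval_minutes: float) -> float:
    """Fraction of runs that failed, assuming one run per interval (an assumption)."""
    runs = window_hours * 60 / interval_minutes
    return failures / runs


# Service tests: every 30 minutes, 4 failures in 8 hours (16 runs).
service = failure_rate(failures=4, window_hours=8, interval_minutes=30)

# Event Log tests: every 1-2 minutes, 5 failures in 8 hours (480 or 240 runs).
event_log_fast = failure_rate(failures=5, window_hours=8, interval_minutes=1)
event_log_slow = failure_rate(failures=5, window_hours=8, interval_minutes=2)

print(f"Service: {service:.1%}")                                     # Service: 25.0%
print(f"Event Log: {event_log_fast:.1%} to {event_log_slow:.1%}")    # Event Log: 1.0% to 2.1%
```

Under those assumptions the Service test fails on about a quarter of its runs, while the Event Log test fails on roughly 1-2% of its runs.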

KS-Soft

Posted: Wed Dec 18, 2019 10:21 am

Quote:
Is there some type of connection pool being re-used that needs to be flushed?

Nothing on HostMonitor side, not 100% sure about Windows.

Quote:
is there a way perhaps you could capture the issue and retry the test or force the test to use a new connection?

We are testing some ideas...
Also, you can set a "repeat test" action for the Unknown status and some other actions for the "Bad" status.

Quote:
This is also happening with Service tests, though less often. "Win32 error #1722 The RPC server is unavailable."

As far as we know, the Windows API for services is reliable.
Maybe there is something wrong with your network, router, or software (e.g. antivirus)? Maybe the target system(s) are too busy and sometimes do not respond?
Could you try setting up a TCP test to check the target servers, e.g. on port 135? Does it sometimes set the "No answer" status?
Have you checked memory, handle, and CPU usage on the target system(s)?
Did the old Windows 2008 system always perform Service tests without errors?
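
For anyone who wants to try the port 135 check outside HostMonitor, here is a minimal sketch in Python. It is not how HostMonitor's TCP test works internally; the function names, the 3-second timeout, and the number of rounds are all assumptions chosen for illustration.

```python
import socket
import time
from typing import List


def tcp_port_answers(host: str, port: int = 135, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def probe_repeatedly(host: str, port: int = 135, rounds: int = 10,
                     pause_s: float = 1.0) -> List[bool]:
    """Probe the port several times so that intermittent failures show up."""
    results = []
    for i in range(rounds):
        results.append(tcp_port_answers(host, port))
        if i < rounds - 1:
            time.sleep(pause_s)
    return results
```

Calling `probe_repeatedly("target-host")`, where `target-host` is a placeholder for one of the monitored machines, returns a list of booleans. Occasional `False` entries would point toward the network or the target system rather than the Services API.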

Regards
Alex

KS-Soft

Posted: Fri Dec 20, 2019 9:29 am

We checked our old records and performed a lot of new tests. The conclusion is the same: there are bugs in Windows wevtapi.dll.
E.g. you can call an API function with a correct handle and it will say "the handle is invalid"; then you call the same API with exactly the same parameters and the same handle, and it will be accepted. Sounds OK, just make a 2nd call? No, that does not work well either, because the 2nd call may or may not be accepted, and it can also throw a memory access violation error.
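
The "just make a 2nd call" idea can be sketched as a bounded retry, shown here in Python. This is not HostMonitor's code and not the workaround KS-Soft shipped. `TransientError` and `make_flaky` are hypothetical stand-ins that simulate an intermittently failing call, and the sketch cannot cover the access violation case described above, since a crash inside native code is not an exception a wrapper like this would catch.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for an intermittent failure such as 'The handle is invalid'."""


def call_with_retry(operation: Callable[[], T], attempts: int = 3,
                    delay_s: float = 0.0) -> Optional[T]:
    """Run operation up to `attempts` times; return None if every attempt fails."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt < attempts:
                time.sleep(delay_s)
    return None  # the caller would report this as an "Unknown" result


def make_flaky(failures_before_success: int) -> Callable[[], str]:
    """Build a simulated operation that fails a fixed number of times, then succeeds."""
    state = {"calls": 0}

    def operation() -> str:
        state["calls"] += 1
        if state["calls"] <= failures_before_success:
            raise TransientError("The handle is invalid")
        return "event log opened"

    return operation
```

With three attempts, `call_with_retry(make_flaky(1))` succeeds on the second call, while `call_with_retry(make_flaky(5))` gives up and returns `None`. Because the 2nd call "may or may not be accepted", a bounded retry can only lower the failure rate, not remove it.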

Anyway, we modified our code and found some workarounds, but we cannot fix it completely. Microsoft should do this...
If you have updated to version 12.00, you can apply the hot fix (unzip and replace hostmon.exe):
www.ks-soft.net/download/hm1201.zip

Regards
Alex

averylarry

Posted: Fri Dec 20, 2019 9:56 am

I really appreciate your effort. My service tests are running over 60% unknown status and my event log tests have been running at 45% unknown status.

I will upgrade to the beta and apply the hotfix and report my results next week.

KS-Soft

Posted: Fri Dec 20, 2019 10:57 am

The Service test should work fine.
As far as we know, the Windows API for services is reliable.
Maybe there is something wrong with your network, router, or software (e.g. antivirus)? Maybe the target system(s) are too busy and sometimes do not respond?
Could you try setting up a TCP test to check the target servers, e.g. on port 135? Does it sometimes set the "No answer" status?
Have you checked memory, handle, and CPU usage on the target system(s)?
Did the old Windows 2008 system always perform Service tests without errors?

Regards
Alex

averylarry

Posted: Fri Dec 20, 2019 11:28 am

On the old server, statistics for event log tests over the last 1300 days:
99.17% alive
0.03% dead
0.79% unknown

Statistics for Service tests over the last 500 days:
98.32% alive
1.4% dead
0.03% unknown

If I turn the old server's HM back on, the Event log and Service tests are working (though clearly only for a few minutes). I can let them run over the weekend for comparison.

averylarry

Posted: Mon Jan 06, 2020 3:49 pm

I'm back from vacation.

Update:
Event log tests all seem to be fine.

Service tests:
On the old server (Windows 2008 R2) running HM 8.14, all of the service tests run with only 0.04% unknown status (in other words, they work and are stable).
On the new server (Windows 2019) running HM 12.01 with the hotfix, the service tests vary. When testing a service on a Windows 10 workstation, they run at 30-40% unknown status (RPC server unavailable). When testing a service on a Windows server (2008 R2, 2016, and 2019), they run properly and are stable at 0.05% unknown status.

averylarry

Posted: Mon Jan 06, 2020 3:53 pm

The big new problem we have, however, is what I believe to be a memory leak somewhere. The HostMonitor process's memory usage increases until it hits about 1.9 GB, and then all of the tests basically stop working (and I get hundreds of emails). If I try to connect via RCC, I get this in the log:

[12/29/2019 7:13:30 AM] Connecting... Ok. TCP Connection established
[12/29/2019 7:13:30 AM] Authentication... Ok
[12/29/2019 7:13:30 AM] RCC handshake... Ok
[12/29/2019 7:13:30 AM] Retrieving palettes... Ok
[12/29/2019 7:13:30 AM] Retrieving RMA list... Ok
[12/29/2019 7:13:30 AM] Retrieving reports... Ok
[12/29/2019 7:13:30 AM] Retrieving global variables... Ok
[12/29/2019 7:13:30 AM] Retrieving user profiles... Ok
[12/29/2019 7:13:30 AM] Retrieving user menus... Ok
[12/29/2019 7:13:30 AM] Retrieving scripts... Ok
[12/29/2019 7:13:30 AM] Retrieving schedules... Ok
[12/29/2019 7:13:30 AM] Retrieving action list... Ok
[12/29/2019 7:13:30 AM] Retrieving options... Ok
[12/29/2019 7:13:30 AM] Retrieving test list... Error:
[12/29/2019 7:13:34 AM] Disconnecting... Disconnected

If I restart the service, everything is fine and hostmon.exe memory usage is 15 MB.
If I happen to already have RCC open and connected, it works and I can see things having problems. Mostly the ping tests go to status unknown and all the other tests go to status "Wait for Master", because they all use the ping tests as master tests.
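
One way to catch the memory growth before it reaches the point where tests stop working is to poll the process size from a scheduled script. The sketch below is an assumption-laden illustration, not a KS-Soft tool: it parses the CSV output of the Windows `tasklist` command, the function names are hypothetical, the 1500 MB alert threshold is an arbitrary value below the roughly 1.9 GB figure reported above, and the thousands separator in the "Mem Usage" column depends on locale, so only the digits are kept.

```python
import csv
import io
import subprocess
from typing import Optional


def read_tasklist(image_name: str = "hostmon.exe") -> str:
    """Return `tasklist` CSV output (no header) for one image name. Windows only."""
    result = subprocess.run(
        ["tasklist", "/FI", f"IMAGENAME eq {image_name}", "/FO", "CSV", "/NH"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout


def parse_tasklist_memory_kb(csv_text: str, image_name: str) -> Optional[int]:
    """Return memory usage in KB for the first row matching image_name, else None."""
    for row in csv.reader(io.StringIO(csv_text)):
        # Expected columns: Image Name, PID, Session Name, Session#, Mem Usage
        if len(row) >= 5 and row[0].lower() == image_name.lower():
            digits = "".join(ch for ch in row[4] if ch.isdigit())
            return int(digits) if digits else None
    return None


def over_threshold(memory_kb: Optional[int], limit_mb: int = 1500) -> bool:
    """True when the process is known to use at least limit_mb of memory."""
    return memory_kb is not None and memory_kb >= limit_mb * 1024
```

On the monitoring server, `over_threshold(parse_tasklist_memory_kb(read_tasklist(), "hostmon.exe"))` could trigger an alert or a planned service restart, in the same spirit as the manual restart described above.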

KS-Soft

Posted: Mon Jan 06, 2020 4:07 pm

Please contact support by e-mail, send your config files if you can.
(support@ks-soft.net)

Regards
Alex
All times are GMT - 6 Hours