KS-Soft. Network Management Solutions

Intermittent global ping test fails
ckratsch
Guest





Posted: Fri Feb 27, 2015 7:52 am    Post subject: Intermittent global ping test fails

Platform:

Windows Server 2008 R2 Enterprise SP1 (6.1.7601)
Hostmon v9.90
Single GbE NIC connected at 1 Gbps

Our HM config contains a large number of ping tests, including ping tests for switches. These tests are configured to depend on their parent: Server ping depends on the ping for the switch it connects to, that switch depends on a router it connects to, that router depends on the core switch, the core switch depends on the switch that Hostmon server is connected to. This way, if the core switch goes down, we're not getting alerts for systems on the "other side" of it. Our whole test environment cascades in this manner.

Occasionally, we'll suddenly get a flood of false ping failure alerts from devices and systems all around the network. The only way to get things back to normal is to hope and wait. This morning, I think I finally narrowed it down some.

While this condition was going on, I noticed that the ping test for our core switch was "flapping." It would switch to "good," then *all* the child ping tests beneath it would start back up at the same time (with many of them failing). The core switch test would change status to "Checking," and then it would fail - setting all the child tests back to "Wait for master." Back and forth like this.

While this was going on - HM sending out so many ping tests at once - I opened a command prompt and pinged the core switch IP from there. Replies came back with normal TTL, but the rate at which those replies were displayed in the cmd window was slow and laggy, not the normal click, click, click rhythm of a ping -t.

So, I think I have narrowed this down to a shortcoming of the NIC, its firmware, its driver, or the Windows TCP/IP stack. One or more of those is being flooded with ping traffic, and Hostmon is interpreting that flood as a test failure. Then Hostmon runs the ping tests all at once, creating a new ping flood. There's the loop.

How can I resolve this? Is there a way to have Hostmon avoid creating those kinds of ping floods in the first place? Should I bind two or more GbE NICs together? For the record, I have just disabled autotuninglevel and set TcpAckFrequency and TCPNoDelay as per http://www.ks-soft.net/cgi-bin/phpBB/viewtopic.php?t=6800 and rebooted the Hostmon server. Everything is quiet again now, but I'm not sure whether those settings were effective, or whether simply rebooting the server gave the ping flood a chance to break down.
KS-Soft



Joined: 03 Apr 2002
Posts: 12795
Location: USA

Posted: Fri Feb 27, 2015 9:34 am

If you have 5000 tests, HostMonitor will not start all tests at once.
"Flood" depends on settings.

How many tests do you have? How many Ping tests? ICMP packets? Test interval?
Could you check average load using Auditing Tool (menu View)?

What settings do you use for the following options:
Don't start more than [N] tests per second
HostMonitor is multi-threaded so it can test many hosts simultaneously. This parameter defines how many tests per second the program will start.

Recheck dependant test items when master test status has been changed
This option tells HostMonitor to recheck all dependant test items immediately after their master test failure or recovery (otherwise tests will be executed on regular schedule). This helps to maintain more accurate statistic information.

Consider status of the master test obsolete after N seconds
This parameter is used by the program to determine whether the Master test status is up-to-date. Before starting a dependent test HostMonitor checks the status values of the Master tests defined on it. If Master test was performed more than N seconds ago, HostMonitor will recheck the Master test item before checking dependant items.
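
As a rough illustration of the last option (this is not HostMonitor's actual code; the function names and the dict layout are hypothetical), the freshness check described above could be sketched like this:

```python
import time


def master_is_obsolete(last_checked, now, obsolete_after):
    """True if the master result is more than `obsolete_after` seconds old."""
    return (now - last_checked) > obsolete_after


def run_dependent(test_name, master, obsolete_after, check, now=None):
    """Run a dependent test only if its master is fresh and good.

    master: dict with 'name', 'last_checked' (seconds) and 'status'.
    check:  callable that performs a test and returns 'good' or 'bad'
            (a stand-in for the real ping; an assumption of this sketch).
    """
    now = time.monotonic() if now is None else now
    # Recheck the master first if its last result is too old.
    if master_is_obsolete(master["last_checked"], now, obsolete_after):
        master["status"] = check(master["name"])
        master["last_checked"] = now
    # A bad master holds the dependent back instead of running it.
    if master["status"] != "good":
        return "wait for master"
    return check(test_name)
```

In this sketch, a small N means that almost every dependent test triggers a master recheck first, while a large N lets dependents rely on an older master result.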

Regards
Alex
ckratsch
Guest





Posted: Fri Feb 27, 2015 9:40 am

We have 395 ping tests in auditing. Our total tests are 995.

I believe the rest of the settings are the defaults; I don't recall changing any:

Don't start more than 32 tests per second (Should be lower?)
Recheck dependant test items is checked now (I imagine I should uncheck it!)
Consider status of master test obsolete after 5 seconds (I imagine this should be much higher to avoid that master test from rechecking like it has been.)
KS-Soft




Posted: Fri Feb 27, 2015 10:03 am

ICMP packets? Test interval?
Could you check average load using Auditing Tool (menu View)?

Quote:
Consider status of master test obsolete after 5 seconds (I imagine this should be much higher to avoid that master test from rechecking like it has been.)

If you have 10-20 master tests in total, 5 sec is fine.

Quote:
Recheck dependant test items is checked now (I imagine I should uncheck it!)

Yes, I think it's better to disable this option
(it was disabled by default)

Regards
Alex
ckratsch
Guest





Posted: Fri Feb 27, 2015 10:09 am

Avg load seems to be 7 tests/sec, if I'm looking at the right thing (Estimated workload tab).

I unchecked the "Recheck dependant test items" box just now. It may be days, weeks, even months before this crops up again - if ever. This kind of event is very intermittent. I'm willing to bet it occurs *only* when our core switch ping test has a short blip. And now that the "recheck" is disabled, I bet it's solved.

Nothing to do now but wait and see.

Edit: Oh, yes, ICMP tests. 1000 ms, 10-byte packets, 4 packets sent. 100% to fail.
KS-Soft




Posted: Fri Feb 27, 2015 12:55 pm

7 tests per second, 4 packets, 10 bytes.. should not be a problem.
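
A back-of-the-envelope check of that estimate, assuming IPv4 without options and standard Ethernet framing (the header sizes are well-known constants; the function and variable names are only for illustration). It counts echo requests only and ignores preamble and inter-frame gap; replies would roughly double the figures.

```python
ICMP_HEADER = 8      # bytes, ICMP echo header
IPV4_HEADER = 20     # bytes, IPv4 header without options
ETH_OVERHEAD = 18    # bytes, Ethernet header (14) + frame check sequence (4)
ETH_MIN_FRAME = 64   # bytes, minimum Ethernet frame including header and FCS


def frame_size(payload_bytes):
    """Size on the wire of one echo request, padded to the Ethernet minimum."""
    raw = payload_bytes + ICMP_HEADER + IPV4_HEADER + ETH_OVERHEAD
    return max(raw, ETH_MIN_FRAME)


def request_bits_per_second(tests_per_sec, packets_per_test, payload_bytes):
    """Outgoing echo-request traffic in bits per second."""
    return tests_per_sec * packets_per_test * frame_size(payload_bytes) * 8


avg = request_bits_per_second(7, 4, 10)    # reported average load
peak = request_bits_per_second(32, 4, 10)  # "Don't start more than 32 tests per second"
print(avg, peak)                            # 14336 65536
print(f"peak is {peak / 1e9:.6%} of a 1 Gbps link")
```

Even at the configured ceiling of 32 tests per second, the request traffic stays in the tens of kilobits per second, so raw bandwidth alone is an unlikely explanation.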

Regards
Alex
ckratsch
Guest





Posted: Mon Mar 02, 2015 9:08 am

Well, looks like it's decided to freak out again this morning. It is currently misbehaving (I have disabled alerts to avoid annoyance.) If there's any information you'd like me to collect, now is a great time.
KS-Soft




Posted: Mon Mar 02, 2015 9:23 am

What exactly does "freak out" mean?
Does the master test return "Host is alive" status while dependant items return "No answer" status, and HostMonitor starts the actions assigned to the dependant test items?

Regards
Alex
ckratsch
Guest





Posted: Mon Mar 02, 2015 9:30 am

Sorry - same symptoms as original post.

Our core switch ping test turns up "bad," everything goes to "Wait for master." Core switch comes back "Good" on next test interval, the waiting tests line up to be retested (with Recheck dependant test items unchecked). In the meantime, the core switch test turns "Bad" again, repeat until things finally "settle down" (Edit: not settling down yet). Again, my command line ping results come back with appropriate TTL, but they don't appear in the command window at the expected rate.

"Freak out" has just been our shorthand for when this happens.


Last edited by ckratsch on Mon Mar 02, 2015 9:43 am; edited 1 time in total
ckratsch
Guest





Posted: Mon Mar 02, 2015 9:33 am

Hm, I'm seeing something else interesting as I am watching the HM console: The ping test for our core switch appears to be rechecking itself about every ten seconds - even though it's configured to test every two minutes.
KS-Soft




Posted: Mon Mar 02, 2015 9:56 am

Of course
- Consider status of master test obsolete after 5 seconds
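
To make the connection explicit, here is a small simulation. It assumes that dependent tests keep coming due during the incident and that one master check takes a fixed time; both durations are illustrative values, not measurements from this system, and the function name is hypothetical.

```python
def master_check_times(duration, obsolete_after, master_check_time,
                       dependents_every=1.0):
    """Return the times at which the master gets rechecked on demand.

    A dependent test comes due every `dependents_every` seconds. Before it
    runs, the master is rechecked if its last result is older than
    `obsolete_after` seconds. The master's own test interval plays no role.
    """
    checks = []
    last_result_at = 0.0  # master assumed freshly checked at t = 0
    t = 0.0
    while t < duration:
        if t - last_result_at > obsolete_after:
            checks.append(t)
            t += master_check_time  # dependents wait while the master runs
            last_result_at = t
        t += dependents_every
    return checks


# 5 s obsolescence, assumed 4 s per master check, 60 s window
print(master_check_times(60, 5, 4))
# [6.0, 16.0, 26.0, 36.0, 46.0, 56.0]
```

With these assumed numbers the master is rechecked every 10 seconds no matter how long its own test interval is. That is consistent with the observation above, though it does not prove that this is what happened.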

Regards
Alex
ckratsch
Guest





Posted: Mon Mar 02, 2015 10:03 am

Yes, that is set for 5 seconds, should it be different?
KS-Soft




Posted: Mon Mar 02, 2015 10:17 am

Well, as I understand it, HostMonitor works correctly?
Why is the network connection to the 1st switch not reliable? Sorry, we cannot tell.
Maybe network traffic is too high and the switch drops ICMP packets in favour of TCP? A bad network card?

Regards
Alex
KS-Soft




Posted: Mon Mar 02, 2015 10:19 am

>Yes, that is set for 5 seconds, should it be different?

Already answered:

>>Consider status of master test obsolete after 5 seconds (I imagine this should be much higher to avoid that master test from rechecking like it has been.)

>If you have 10-20 master tests in total, 5 sec is fine.

Regards
Alex
ckratsch
Guest





Posted: Mon Mar 02, 2015 10:46 am

We don't have that few master tests. All of our tests are configured as master-child; all of them are connected in a giant tree, so that when something in between fails, we only get alerts for the actual failure, instead of having to fish the failure out of a bunch of false positives for tests that access destinations on the other side of the failure.

Example:

$Server Application Log test depends on $Server ping test
$Server ping test depends on $Switch3 ping test
$Switch3 ping test depends on $Switch2 ping test
$Switch2 ping test depends on $CoreSwitch ping test
Hostmon server is connected directly to $CoreSwitch

We have several hundred servers and other devices that are all configured to have tests cascading in this fashion.
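
The intent of that cascade can be sketched as a small parent map plus a filter that keeps only failures with no failed ancestor. This is only an illustration of the idea: the dict and function names are made up, and HostMonitor itself holds dependents in "Wait for master" rather than filtering failures after the fact.

```python
# child test -> master test (None means no master); names follow the example above
DEPENDS_ON = {
    "Server Application Log": "Server ping",
    "Server ping": "Switch3 ping",
    "Switch3 ping": "Switch2 ping",
    "Switch2 ping": "CoreSwitch ping",
    "CoreSwitch ping": None,
}


def root_causes(failed, depends_on):
    """Return the failed tests whose masters, all the way up, are healthy."""
    failed = set(failed)
    causes = []
    for test in failed:
        parent = depends_on.get(test)
        # Walk up the chain until we hit a failed ancestor or run out of masters.
        while parent is not None and parent not in failed:
            parent = depends_on.get(parent)
        if parent is None:
            causes.append(test)
    return sorted(causes)


print(root_causes(
    ["Server Application Log", "Server ping", "Switch3 ping"], DEPENDS_ON))
# ['Switch3 ping']
```

Only the highest failed item in each chain is reported. The flip side is that a single unreliable result near the top, such as the core switch ping, affects every test beneath it.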

----

I certainly understand that we may have a hardware issue, and we have replaced the ethernet cable and upgraded the NIC drivers on the Hostmon server. I know those things are outside of your scope. I just want to make sure that I have Hostmon configured properly in our environment. Thanks for your patience.
All times are GMT - 6 Hours
Page 1 of 2