Status reset after RMA connection error

Kapz
Posts: 216
Joined: Mon Dec 06, 2004 2:33 pm
Location: Denmark


Post by Kapz »

Hi !

HM5.10 on Win2003 but also seen in HM4.86:

* A test has status Bad
* HM executes the alert profile as expected
* HM for some reason loses the connection to the remote agent, resulting in a status of RMA Connection Error
* The connection to the remote agent is reestablished
* HM detects that the test still has status Bad
* HM executes the alert profile again

Is there a way to tell HM that an RMA Connection Error should not reset the test's state ?
This behavior only occurs on tests with status Bad. If a test with status Ok gets status RMA Connection Error and is then reconnected, the profile isn't triggered.

Note that we believe HM 4.70 and earlier did not behave this way - our impression is that it wasn't until 4.86 that an RMA Connection Error triggered a reset of a Bad test's state.

Thanks !

Kasper :O)
KS-Soft
Posts: 12821
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA

Post by KS-Soft »

I think you have changed the "Treat Unknown status as Bad" option (Test Properties dialog). If this option is enabled, HostMonitor will not reset the "Recurrences" counter when the status changes Unknown<->Bad

Regards
Alex
Kapz
Posts: 216
Joined: Mon Dec 06, 2004 2:33 pm
Location: Denmark

Post by Kapz »

Alex,

> I think you have changed the "Treat Unknown status as Bad" option (Test Properties dialog). If this option is enabled, HostMonitor will not reset the "Recurrences" counter when the status changes Unknown<->Bad

Yes, that could very well be it.
On all our tests that depend on an agent, I have disabled the "Treat Unknown status as Bad" option, as we got quite a few yellow Unknown statuses when nothing really was wrong.

Is there a hidden switch somewhere that can prevent HM from resetting the Recurrences counter even when "Treat Unknown status as Bad" isn't enabled - or will this break some logic somewhere ?

Thanks !

Kasper :O)
KS-Soft
Posts: 12821
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA

Post by KS-Soft »

> On all our tests that depend on an agent, I have disabled the "Treat Unknown status as Bad" option
You should ENABLE this option. If you want to keep the Recurrences counter when the test status changes from Unknown to Bad, you should enable the option.

> we got quite a few yellow Unknown statuses when nothing really was wrong
Usually, if HostMonitor cannot perform some test, it displays an error message in the "Reply" field. What do you see in the "Reply" field?

> Is there a hidden switch somewhere that can prevent HM from resetting the Recurrences counter even when "Treat Unknown status as Bad" isn't enabled - or will this break some logic somewhere ?
It will break logic. When this option is disabled, HostMonitor does not start alerts for Unknown status. What actions should be executed when a test changes status to Unknown and then (e.g. after 2 Unknown statuses) to Bad without resetting the Recurrences counter? HM would launch the actions that should be executed after the 3rd, 4th... consecutive bad result. It means the 1st and 2nd actions would never be started.
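
The argument above can be illustrated with a toy model in Python. This is only a sketch under stated assumptions, not HostMonitor's actual code: the function `fired_actions` and the flag `keep_counter_across_unknown` (standing in for the hypothetical hidden switch) are invented names, and the model assumes one action is configured per consecutive Bad result while Unknown results never start alerts.

```python
def fired_actions(statuses, keep_counter_across_unknown):
    """Return the recurrence numbers at which a Bad action would start.

    Toy model only: Unknown results never start alerts (option disabled).
    """
    fired = []
    recurrences = 0
    previous = None
    for status in statuses:
        if status == previous:
            recurrences += 1
        elif keep_counter_across_unknown and {status, previous} == {"Unknown", "Bad"}:
            # Hypothetical switch: do not reset on an Unknown<->Bad change.
            recurrences += 1
        else:
            # Normal behavior in this model: a status change resets the counter.
            recurrences = 1
        previous = status
        if status == "Bad":
            fired.append(recurrences)
    return fired


sequence = ["Unknown", "Unknown", "Bad", "Bad"]
print(fired_actions(sequence, keep_counter_across_unknown=False))  # [1, 2]
print(fired_actions(sequence, keep_counter_across_unknown=True))   # [3, 4]
```

With the hypothetical switch on, the first Bad result already counts as the 3rd recurrence, so actions tied to the 1st and 2nd Bad result are skipped in this model.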

Regards
Alex
KS-Soft
Posts: 12821
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA

Post by KS-Soft »

BTW: If you want to keep status/statistics for some tests (TestA, TestB) that depend on RMA functionality, add a TCP test to check the HostMonitor<->RMA connection and set up this test as the Master test for TestA and TestB
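
The idea behind this suggestion can be sketched as follows. This is a simplified illustration, assuming that dependent tests are simply skipped (and so keep their last status) while the Master test is not Ok; `run_cycle`, `perform`, and the status strings are invented for the example and are not part of HostMonitor.

```python
def run_cycle(master_ok, dependents, perform):
    """Run one check cycle for tests that depend on a master test.

    dependents maps test name -> last known status.
    perform is a callable that executes a test and returns its new status.
    Assumption: when the master check fails, dependents are skipped.
    """
    if not master_ok:
        # Skipped: every dependent test keeps its previous status.
        return dict(dependents)
    return {name: perform(name) for name in dependents}


state = {"TestA": "Bad", "TestB": "Ok"}

# Agent unreachable: nothing is performed, so the old statuses survive.
after_outage = run_cycle(False, state, lambda name: "Unknown")

# Agent reachable again: the tests are performed normally.
after_recovery = run_cycle(True, state, lambda name: "Ok")

print(after_outage)
print(after_recovery)
```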

Regards
Alex
Kapz
Posts: 216
Joined: Mon Dec 06, 2004 2:33 pm
Location: Denmark

Post by Kapz »

Alex,

>> On all our tests that depend on an agent, I have disabled the "Treat Unknown status as Bad" option
>You should ENABLE this option. If you want to keep Recurrences counter when test status changes from Unknown to Bad, you should enable the option.
Yes, that's what I learned from your answers.

>> we got quite a few yellow Unknown statuses when nothing really was wrong
> Usually if HostMonitor cannot perform some test, it displays error message in "Reply" field. What do you see in "Reply" field?
Sorry for not being accurate in my post; by yellow Unknowns I simply meant RMA Connection Errors.

>> Is there a hidden switch somewhere that can prevent HM from resetting the Recurrences counter even when "Treat Unknown status as Bad" isn't enabled - or will this break some logic somewhere ?
> It will break logic. When this option is disabled, HostMonitor does not start alerts for Unknown status. What actions should be executed when test changes status to Unknown and then (e.g. after 2 Unknown statuses) to Bad without resetting Recurrences counter? HM will launch actions that should be executed after 3rd, 4th... consecutive bad result. It means 1st and 2nd action never will be started.
Point taken. We only operate with one single action for all our tests and that is triggered after the third consecutive Bad result so I didn't take multiple actions into consideration.

> BTW: If you want to keep status/statistics for some tests (TestA, TestB) that depend on RMA functionality, add TCP test to check HostMonitor<->RMA connection and setup this test as Master test for TestA and TestB
We already do that, but even when the RMA agent answers on a given TCP port, there is no guarantee that a test performed by the agent won't result in an RMA Connection Error. This is actually the reason why we unchecked "Treat Unknown Status as Bad" in the first place.
While we can always count on a simple TCP test to the agent, RMA connection errors aren't that rare on tests performed *by* the agent.

So, bottom line: Unless I enable "Treat Unknown Status as Bad", I will still experience my Bad profile being triggered again upon an RMA connection error on a test that already had status Bad.
Enabling it will, on the other hand, result in an SMS sent to me saying "TestA Bad, RMA Connection Error" whenever a test with status Ok experiences three consecutive RMA Connection Errors - perhaps simply because the remote server is busy rather than because the test is actually Bad.
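
The trade-off can be checked against a small Python model. It is a sketch built only from the behavior described in this thread, not from HostMonitor's real implementation: `alert_positions` is an invented name, RMA Connection Error is represented as "Unknown", and a single action is assumed to fire on the third consecutive Bad result.

```python
def alert_positions(statuses, treat_unknown_as_bad, trigger_at=3):
    """Return the 1-based check numbers at which the single action fires."""
    fired = []
    recurrences = 0
    previous = None
    for check_number, status in enumerate(statuses, start=1):
        # With the option enabled, Unknown counts as Bad in this model.
        if treat_unknown_as_bad and status == "Unknown":
            effective = "Bad"
        else:
            effective = status
        # Any change of the effective status resets the counter.
        recurrences = recurrences + 1 if effective == previous else 1
        previous = effective
        if effective == "Bad" and recurrences == trigger_at:
            fired.append(check_number)
    return fired


bad_test_with_glitch = ["Bad", "Bad", "Bad", "Unknown", "Bad", "Bad", "Bad"]
ok_test_with_glitches = ["Ok", "Unknown", "Unknown", "Unknown", "Ok"]

print(alert_positions(bad_test_with_glitch, treat_unknown_as_bad=False))   # [3, 7]
print(alert_positions(bad_test_with_glitch, treat_unknown_as_bad=True))    # [3]
print(alert_positions(ok_test_with_glitches, treat_unknown_as_bad=False))  # []
print(alert_positions(ok_test_with_glitches, treat_unknown_as_bad=True))   # [4]
```

In this model, disabling the option repeats the alert for an already Bad test after a connection glitch, while enabling it raises an alert for an Ok test after three consecutive connection errors.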

Can you verify my conclusion ?
Please note, that I'm not trying to sound grumpy; I'm simply trying to get this right in my head :)

Kasper :O)
KS-Soft
Posts: 12821
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA

Post by KS-Soft »

> We already do that, but there is no guarantee that even though the RMA Agent answers on a given TCP port a test performed by the agent won't result in a RMA Connection Error. This is actually the reason why we unchecked "Treat Unknown Status as Bad" in the first place.
> While we can always count on a simple TCP test to the agent RMA connection errors aren't that rare on tests performed *by* the agent.
So, sometimes you see "RMA Connection Error" and then the next check returns a "good" (or "bad") result? Looks like you should increase the timeout specified for the agent (Agent Connection Parameters dialog).

Regards
Alex
Kapz
Posts: 216
Joined: Mon Dec 06, 2004 2:33 pm
Location: Denmark

Post by Kapz »

Alex,

> So, sometimes you see "RMA Connection Error" and then the next check returns a "good" (or "bad") result?
Yes, that's what happens.

> Looks like you should increase timeout specified for the agent (Agent Connection Parameters dialog).
Our agents already have a timeout value of 120 seconds, so it shouldn't be a timeout issue (also, the RMA Connection Error reply arrives well before 120 seconds have passed).
The trouble is that I cannot detect what exactly is causing these errors, as the reply "RMA Connection Error" is quite generic and the agent doesn't put anything in its bad log. I'd be happy to help track down what exactly goes wrong, so if some logs could potentially reveal anything, let me know where to look - or if you have some switch that enables logging for development purposes, I can enable that.

Kasper :O)
KS-Soft
Posts: 12821
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA

Post by KS-Soft »

> and the agent doesn't put anything in its bad log.
Looks like the agent does not receive the connection. What system is running the agent? Windows workstation or server edition?
How many test items are performed by the agent?

Probably this problem is described here: http://www.ks-soft.net/hostmon.eng/rma- ... m#problems

Regards
Alex