Active/passive RMA test

Stoltze · Post by **Stoltze** » Tue Apr 14, 2009 2:18 pm

Hi,

I would like a test that detects when the connection from an RMA goes down. Is this possible?

Pinging localhost isn't good enough if you're running with backup agents.

Any ideas?

KS-Soft · Post by **KS-Soft** » Tue Apr 14, 2009 2:24 pm

If you are talking about Passive RMA, you may use the TCP test.
If you are using Active RMA and you have assigned a backup agent... hmm, that's a problem.

Regards
Alex
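
For readers unfamiliar with the idea, the TCP test suggested above boils down to trying to open a TCP connection to the port the passive agent listens on. The sketch below is only an illustration in Python, not HostMonitor's own implementation; the function name, host name and port are placeholders chosen for this example.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Hypothetical usage: substitute the agent's real address and configured port.
# print(tcp_port_open("rma-host.example", 12345))
```

This only works when the monitoring side can initiate the connection, which is why it fits Passive RMA and not an Active RMA that connects outward to the server.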

Stoltze · Post by **Stoltze** » Tue Apr 14, 2009 2:32 pm

KS-Soft wrote:If you are using Active RMA and you have assigned a backup agent... hmm, that's a problem

And that's the one I'm using...

greyhat64 · Post by **greyhat64** » Tue Apr 14, 2009 5:18 pm

Couldn't you have a test to see if the active RMA service is running on the remote machine? If the test returns "Unknown", treat it as 'Bad' and take whatever action you choose. Maybe even make it the 'Master' for all your other RMA-dependent tests.
Or am I missing something here?

Stoltze · Post by **Stoltze** » Tue Apr 14, 2009 11:05 pm

greyhat64 wrote:Couldn't you have a test to see if the active RMA service is running on the remote machine? If the test returns "Unknown", treat it as 'Bad' and take whatever action you choose.

Yes, I could do that. But the connection can still be down even though the service is running.

That's why I would prefer a test done by AHM itself.

greyhat64 · Post by **greyhat64** » Wed Apr 15, 2009 12:23 am

I must still be missing something. . .

If the connection is down, a test for the service should return an 'Unknown' status, assuming you are using the agent to perform the test. And if you are using a backup agent, have a service test set up for each of them. If the 'primary' fails, the secondary will report it, and vice versa.

Stoltze · Post by **Stoltze** » Wed Apr 15, 2009 12:59 am

I will try to explain..

Two servers, each running an Active RMA agent, placed in separate physical or logical locations.
They are backups for each other.

But, due to some network problem, agent 1 is unable to connect to the AHM server. The service is still running, and every test is being done by agent 2.

AHM knows that agent 1 is down, so a test where you can choose agents from a dropdown and alert on a missing connection would be very nice.

KS-Soft · Post by **KS-Soft** » Wed Apr 15, 2009 7:41 am

Yes, we should add some option... I have added a task to the "to do" list.

Regards
Alex

Stoltze · Post by **Stoltze** » Wed Apr 15, 2009 7:45 am

KS-Soft wrote:Yes, we should add some option... I have added a task to the "to do" list.

Thanx Alex..

greyhat64 · Post by **greyhat64** » Wed Apr 15, 2009 3:18 pm

I think I understand now, but one last question. . .
If 'agent 1' is indeed accessible but not communicating, could the alert you are looking for be derived from the RMA audit log file?
If so, you could include in your test list a couple of log file tests for 'agent 1' and 'agent 2'.
And if the data logged is not sufficient, it might be easier to implement logging enhancements. This would have the side benefit of making an additional 'specialty' test type unnecessary.
BTW - Out of curiosity, does the non-communicative 'agent 1' communicate via the RMA Manager port/interface?

(OK so I had two questions)

I knew this rang a bell. Stoltze, you are in good company. . .
This sounds closely related to an earlier request by Robert_in_MTL and myself for a watchdog 'monitor of monitors': a separate application/service/utility that would check the viability of the hostmon server and its agents, necessary when no communication is possible from either.

Alex, You even stated,

We plan to implement watchdog utility. Most likely this year.

That was on Apr 14, 2008, almost exactly a year ago! I just know this is part of the upcoming v8.

Stoltze · Post by **Stoltze** » Thu Apr 16, 2009 12:59 am

greyhat64 wrote:If 'agent 1' is indeed accessible but not communicating, could the alert you are looking for be derived from the RMA audit log file?
If so, you could include in your test list a couple of log file tests for 'agent 1' and 'agent 2'.

Hmm.. So I guess what you're thinking about is a text file test with bad and good text, with the Active RMA logging into the same file.
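
To make the "bad and good text" idea concrete, here is a rough Python sketch of such a check. It assumes the agent writes one line per connection event; the log messages used here ("connected to", "connection lost") are invented for illustration and are not the actual RMA audit log format.

```python
from typing import Iterable


def status_from_log(lines: Iterable[str], good_text: str, bad_text: str) -> str:
    """Return 'Ok', 'Bad' or 'Unknown' based on the most recent matching log line."""
    status = "Unknown"  # nothing matched yet
    for line in lines:
        if bad_text in line:
            status = "Bad"
        elif good_text in line:
            status = "Ok"
    return status


# Hypothetical log excerpt; a real log would use the agent's own wording.
sample_log = [
    "2009-04-16 00:10 connected to HostMonitor",
    "2009-04-16 00:42 connection lost",
]
print(status_from_log(sample_log, "connected to", "connection lost"))
```

Because the last matching line wins, a reconnect logged after a drop would flip the status back to 'Ok'.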

So, this must be a decision for KS-Soft:

1) Use text log test
Pro: Existing test
Cons: You need to perform an extra test every x minutes
Not logical for new users

2) New test method
Pro: If possible, implemented without a test interval, reacting instead to a "signal" from the AHM server. "Signal" means an event-based, not time-based, reaction
Very logical for new users, and very easy to set up
Cons: New test method, possibly different from other tests

3) Watchdog utility
Let the new utility do the testing
Pro: No new test method
Cons: Separated from AHM, and therefore also separate alarms
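
The difference between option 2 and a scheduled test can be sketched roughly as follows. This is purely illustrative Python: the class and method names are made up, and nothing here reflects how AHM tracks agents internally. The point is only that an alert fires when the connection state changes, with no test interval involved.

```python
from typing import Callable, Dict, List, Optional


class AgentConnectionWatcher:
    """Hypothetical tracker that notifies listeners when an agent's state changes."""

    def __init__(self) -> None:
        self._connected: Dict[str, bool] = {}
        self._listeners: List[Callable[[str, bool], None]] = []

    def subscribe(self, listener: Callable[[str, bool], None]) -> None:
        self._listeners.append(listener)

    def report(self, agent: str, connected: bool) -> None:
        """Called (by assumption) whenever the server sees an agent connect or drop."""
        previous: Optional[bool] = self._connected.get(agent)
        self._connected[agent] = connected
        if previous != connected:
            # Fire only on a change of state, not on every report.
            for listener in self._listeners:
                listener(agent, connected)


watcher = AgentConnectionWatcher()
watcher.subscribe(lambda agent, up: print(agent, "connected" if up else "DISCONNECTED"))
watcher.report("agent 1", True)
watcher.report("agent 1", False)
```

A time-based test would instead poll the same state every x minutes, which is the extra cost listed under option 1.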

Alex, what do you think...?

greyhat64 wrote:BTW - Out of curiosity, does the non-communicative 'agent 1' communicate via the RMA Manager port/interface? (OK so I had two questions)

In this case, it wasn't because of an existing problem, but in theory..

E.g., if the agents use different ISPs, etc.

KS-Soft · Post by **KS-Soft** » Thu Apr 16, 2009 7:59 am

If 'agent 1' is indeed accessible but not communicating

What exactly does this mean? Do you mean the agent connects to HostMonitor but does not respond to test requests? Then HostMonitor should use the backup agent (if specified).
This may happen if you set too short a timeout interval, so the agent just does not have time to perform some tests.
If the agent hangs completely (I hope it never happens), HostMonitor should disconnect the agent.

Out of curiosity, does the non-communicative 'agent 1' communicate via the RMA Manager port/interface?

We are talking about a hypothetical situation, right? I assume the agent works fine on all systems in your network?
The agent uses 2 TCP connections: one for HostMonitor, another for RMA Manager. However, the agent uses many threads for test execution, so it receives requests over one connection but starts many threads. If some thread hangs, the agent should still respond to HostMonitor. It may lose the connection (for another reason, e.g. a network error), but it should not hang.
On the other hand, I don't know why the entire RMA might hang (not just a thread that performs some test). Without knowing the reason, I cannot tell you whether it would hang both agent connections (HostMonitor and RMA Manager) or just one. E.g. if your system crashes, RMA will stop working with both HostMonitor and RMA Manager. If you have something else in mind, please provide a more specific example.

Alex, You even stated, We plan to implement watchdog utility. Most likely this year.

Yes, I did. Then I said: If you own an Enterprise license, we can offer a 2nd license for 3 test items at no cost. After that we changed the priority for this task to "low". I think there are not many reasons to develop another utility with limited functionality when you can use a 2nd instance of HostMonitor.
If you own a Lite license, then this utility will not be included in the Lite package anyway.

Regards
Alex

greyhat64 · Post by **greyhat64** » Thu Apr 16, 2009 4:46 pm

I've had a 'delirious' day, and there's lots to respond to here, so I'll do my best.
First of all, I jumped into this conversation because it's a question I've asked before myself, and I'm curious about others' needs/experiences.

Alex,
As it turns out, my first two questions are a moot point. I was trying to identify what was/wasn't working, while it sounds like Stoltze is working through a theoretical problem. He said:

In this case, it wasn't because of an existing problem, but in theory.. E.g., if the agents use different ISPs, etc.

Your point about a crashed or rebooted remote system is well taken!

There are numerous unknown root causes for his theoretical RMA communication failure.

Stoltze, using your example ('agents use different ISPs. . .'):
If two agents are using two different ISPs, wouldn't you already be running tests that provide info for the most common root causes for such communication failures - response of the ISP router/gateway, firewall, or specific TCP/UDP ports, for instance? I would consider these much more valid and informative.

I like your thought regarding event-driven tests rather than schedule-driven tests, but that can be accommodated, in some cases and to some degree, by 'dependent expressions'. They are still run on a schedule, event-dependent vs. event-driven, but it helps. Simply run them on a short enough schedule to be effective and modify the schedule on a 'Bad' event if necessary.

Which brings me back around to an earlier point. I should be able to determine if a specific agent is not responding:

- Use the system's agent name as the 'Test by:'
- Have the 'agent 1' system test itself by setting up a 'dummy' test - ping, service, TCP, . . . pick the test, using the 'agent 1' system name/IP address as the 'host'
- Better yet, use WMI or a script test to get relevant network connection info, specifying 'agent 1' as the source and the AHM server as the remote destination, with the destination port being the active RMA port.

- Such a test should reveal agent failure, even if the backup agent is in use.
- Just don't use 'localhost' or 127.0.0.1 as the host- for obvious reasons.

- Then the work really begins, as mentioned before - trying to determine the root cause.
- The aforementioned tests (router, firewall, etc.) could be made 'dependants' of this test.
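
As a rough sketch of the script-test idea above, the check could look for an established TCP connection from the agent's machine to the AHM server in `netstat -an` style output. The Python below is an illustration only: the function name is made up, and the IP addresses and port in the sample are placeholders, not real defaults. It assumes the common column layout in which the foreign address sits immediately before the state.

```python
def has_established_connection(netstat_output: str, remote_ip: str, remote_port: int) -> bool:
    """Return True if the output lists an ESTABLISHED TCP connection to remote_ip:remote_port."""
    target = f"{remote_ip}:{remote_port}"
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines and non-TCP entries.
        if len(fields) < 4 or not fields[0].lower().startswith("tcp"):
            continue
        if "ESTABLISHED" not in fields:
            continue
        state_index = fields.index("ESTABLISHED")
        # Assumed layout: the foreign address is the column just before the state.
        if fields[state_index - 1] == target:
            return True
    return False


# Placeholder sample; a real script would capture the output of `netstat -an`.
sample = """
  TCP    192.168.1.10:49732     10.0.0.5:5056      ESTABLISHED
  TCP    192.168.1.10:445       0.0.0.0:0          LISTENING
"""
print(has_established_connection(sample, "10.0.0.5", 5056))
```

Whether such a script can report back while the agent itself is cut off from the server is exactly the open question in this thread, so treat it as a diagnostic aid, not a complete answer.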

Or have I missed the target again?

(Alex, I won't clutter this topic further, but look for a new entry in the 'watchdog utility' thread mentioned in my earlier response. In short I agree with you, with one caveat.)