RMA Client in Unknown status causing false alerts
I have the latest version and have installed about 18 sites with 1 RMA at each site and 17 tests per site. Every now and then the RMA at a site does not respond, which causes Unknown status. When I force the test, it's alive. It seems unstable. What could be causing this?
I'm experiencing the same issue occasionally... What is strange is that we will sometimes have several tests performed by one agent, and one test will time out and show Unknown while the other shows up as OK. I'm kind of lost on it; the only thing I can think of is to play around with the settings for the number of tests that are initiated at once.
I don't want to hijack this thread, but in the interest of getting a resolution for both of us, here are our details.
The issue has occurred on several different environments. All use active RMA.
Here is one example of what happens when the issue occurs:
Code:
Test: server.domain.local C Drive
Method: Drive space
12/15/2008 6:34:41 PM Unknown Timed out
12/15/2008 6:44:42 PM Ok 23 Gb
12/15/2008 6:56:37 PM Unknown RMA not connected
12/15/2008 6:58:56 PM Ok 23 Gb
12/16/2008 12:00:18 AM Ok 23 Gb
That particular test was on a Server 2008 machine; however, it occurred at the same time on several servers, the rest of which are all 2003, e.g.:
Code:
Test: WMI Service
Method: check service
12/14/2008 12:00:04 AM Ok 0 ms
12/15/2008 12:00:07 AM Ok 0 ms
12/15/2008 6:56:37 PM Unknown RMA not connected
12/15/2008 6:59:11 PM Ok 0 ms
12/16/2008 12:00:17 AM Ok 0 ms
What would be a good setting for the max # of tests initiated at once? Could lowering it or raising it have an effect? I have set it to 100, and the problem hasn't reoccurred, yet...
Quote: RMA not connected
Looks like the connection was dropped and the agent could not reconnect for several minutes.
Could you check the RMA logs? By default these text logs are stored in the same folder where the agent is installed (unless you changed the location).
Quote: What would be a good setting for the max # of tests initiated at once? could lowering it or raising it have an effect? I have set it to 100, and the problem hasn't reoccurred, yet...
I think there is some external problem: network error, firewall or antivirus monitor...
Do you have an antivirus monitor installed on the systems?
What version of HostMonitor and RMA do you use?
Regards
Alex
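To see how long the agent actually stayed in Unknown status, the test history can be scanned for the gap between the first Unknown result and the next result that is not Unknown. This is only a minimal sketch, assuming the history lines have the "date time AM/PM status reply" layout pasted in this thread; `parse_line` and `unknown_gaps` are hypothetical helper names, not part of HostMonitor.

```python
from datetime import datetime

# Assumed layout, copied from the history lines in this thread:
#   12/15/2008 6:34:41 PM Unknown Timed out
STAMP_FORMAT = "%m/%d/%Y %I:%M:%S %p"


def parse_line(line):
    """Split one history line into (timestamp, rest_of_line)."""
    parts = line.split()
    stamp = datetime.strptime(" ".join(parts[:3]), STAMP_FORMAT)
    return stamp, " ".join(parts[3:])


def unknown_gaps(lines):
    """Return (start, end, seconds) for each run of Unknown results.

    A run starts at the first Unknown line and ends at the next line
    that is not Unknown.
    """
    gaps = []
    start = None
    for line in lines:
        stamp, rest = parse_line(line)
        if rest.startswith("Unknown"):
            if start is None:
                start = stamp
        elif start is not None:
            gaps.append((start, stamp, (stamp - start).total_seconds()))
            start = None
    return gaps


# Sample data: the "C Drive" history posted above.
history = [
    "12/15/2008 6:34:41 PM Unknown Timed out",
    "12/15/2008 6:44:42 PM Ok 23 Gb",
    "12/15/2008 6:56:37 PM Unknown RMA not connected",
    "12/15/2008 6:58:56 PM Ok 23 Gb",
    "12/16/2008 12:00:18 AM Ok 23 Gb",
]

for start, end, seconds in unknown_gaps(history):
    print(start, "->", end, f"{seconds:.0f} s")
```

For the sample history this reports two gaps, roughly ten minutes and a bit over two minutes. Note that the gap length depends on the test interval, so it is an upper bound on the outage, not an exact measurement.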
But I don't understand how, out of 2 tests from one single RMA, one will come back OK and one will come back unknown when the tests run at the same time.
Edit: all RMAs and HM are on the latest available versions.
Also: today we added a simple connectivity test for each of our remote clients, which just pings 127.0.0.1 and returns the result to HM. I was watching it, and when it came time for the test to run, many of the agents showed "checking" for a long time and eventually went to unknown. I am connected through Remote Desktop to those agents, though, so I know it can't be a network connection issue.
As I was watching, while they said "checking", I right-clicked on one and told it to refresh the selected test, and it returned to OK.
Here are a couple of the test statistics for the connectivity tests, from 2 separate RMAs:
Code:
12/16/2008 1:43:30 PM Host is alive 16 ms
12/16/2008 1:43:30 PM Host is alive 0 ms
12/16/2008 2:01:33 PM Unknown Timed out
12/16/2008 2:04:35 PM Host is alive 0 ms
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:15:17 PM Host is alive 0 ms
12/16/2008 4:15:20 PM Host is alive 0 ms
Code:
12/16/2008 1:43:30 PM Host is alive 16 ms
12/16/2008 1:43:30 PM Host is alive 16 ms
12/16/2008 1:43:30 PM Host is alive 0 ms
12/16/2008 2:01:33 PM Unknown Timed out
12/16/2008 2:04:35 PM Host is alive 0 ms
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:14:55 PM Unknown Timed out
12/16/2008 4:15:17 PM Host is alive 0 ms
12/16/2008 4:15:20 PM Host is alive 0 ms
EDIT: I'm not sure if the statistics are accurate... the tests have the same test name, perhaps that is throwing it off and giving me the stats for the same test?
The issue is occurring at the moment for me. I have an agent whose tests came back as unknown in HM, and it says "RMA not connected"; however, in the RMA Manager it shows as connected. Any idea what could cause that? Also, this test is supposed to run every couple of minutes, but if I don't do anything, it just stays on unknown. If I manually refresh it, it comes back as OK. (BTW, I have access to the remote network via RDP, and it is still connected as well.)
Any help is appreciated.
Quote: but I don't understand how out of 2 tests from one single RMA, one will come back OK and one will come back unknown, when the tests run at the same time?
According to your previous post, 2 different test items failed at the same time:
================
Test: server.domain.local C Drive Method: Drive space
12/15/2008 6:56:37 PM Unknown RMA not connected
...
Test: WMI Service Method: check service
12/15/2008 6:56:37 PM Unknown RMA not connected
================
Could you please check RMA logs?
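The observation above, that two test items went to Unknown at the same moment, can also be checked mechanically across many tests. The following is a rough sketch, assuming each test's history has been copied out as plain lines in the layout shown in this thread; `simultaneous_unknowns` is a hypothetical helper, not a HostMonitor feature.

```python
from collections import defaultdict


def simultaneous_unknowns(histories):
    """Map each timestamp string to the tests that were Unknown at it.

    `histories` is {test_name: [history lines]}; only timestamps shared
    by two or more tests are returned.
    """
    seen = defaultdict(list)
    for test_name, lines in histories.items():
        for line in lines:
            parts = line.split()
            stamp = " ".join(parts[:3])  # date, time, AM/PM
            status = parts[3]
            if status == "Unknown" and test_name not in seen[stamp]:
                seen[stamp].append(test_name)
    return {stamp: names for stamp, names in seen.items() if len(names) > 1}


# Sample data: excerpts of the two histories posted earlier in the thread.
histories = {
    "server.domain.local C Drive": [
        "12/15/2008 6:34:41 PM Unknown Timed out",
        "12/15/2008 6:44:42 PM Ok 23 Gb",
        "12/15/2008 6:56:37 PM Unknown RMA not connected",
        "12/15/2008 6:58:56 PM Ok 23 Gb",
    ],
    "WMI Service": [
        "12/15/2008 12:00:07 AM Ok 0 ms",
        "12/15/2008 6:56:37 PM Unknown RMA not connected",
        "12/15/2008 6:59:11 PM Ok 0 ms",
    ],
}

shared = simultaneous_unknowns(histories)
print(shared)
```

If many tests share the same Unknown timestamp, that points at the agent connection as a whole and not at an individual test.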
Quote: Also: we added today a simple connectivity test for each of our remote clients today, which just pings 127.0.0.1 and returns the result to HM. I was watching it, and as it came time for the test to run, many of the agents showed "checking" for a long time, and eventually went to unknown. I am connected through remote desktop to those agents tho, so I know it can't be a network connection issue
Well, "Timed out" and "RMA not connected" are 2 different problems.
1) "RMA not connected" means the agent could not connect to HostMonitor for several minutes. Actually, this error may also appear right after a connection drop if you manually force the test to be "refreshed".
If you check the RMA logs, you will probably find more information about why the RMA cannot reconnect.
2) "Timed out" means the agent did not return a test result within 15 min. That looks strange...
Could you try to set up Passive RMA instead of Active? Just for testing...
Quote: EDIT: I'm not sure if the statistics are accurate... the tests have the same test name, perhaps that is throwing it off and giving me the stats for the same test?
It's better to use unique test names. Patterns can help you to do this:
http://www.ks-soft.net/hostmon.eng/mfra ... tterns.htm
You may use Quick Log to check the latest test results for a specific item.
http://www.ks-soft.net/hostmon.eng/mfra ... ickLogPane
Quote: The issue is occurring at the moment for me. I have an agent which in HM, had its tests come back as unknown, and it says RMA not connected, however in the RMA manager, it shows connected. Any idea what could cause that?
That's possible. 2 different applications use different sockets (TCP ports) and/or IP addresses.
As I said several times, let's check the logs.
Quote: Also, this test is supposed to run every couple minutes, but if I dont do anything, it just stays on unknown. If I manually refresh it, it will come back as OK. (BTW, I have access to the remote network via RDP, and it is connected still as well)
Could you please start Auditing? Menu View -> Auditing tool.
Any warnings?
Regards
Alex
Auditing shows no issues except that a couple of alert wav files could not be found.
I checked the RMA log on one server that we have specifically been having issues with... this is, I believe, the only one that has had the "not connected" issue... the others are just timing out. Here is a segment of that log; maybe you can tell me what it means.
Code:
[12/16/2008 9:43 PM] SERVER2.flnet.local Decode error: Cannot read data (RMA Manager)
[12/16/2008 10:14 PM] SERVER2.flnet.local Decode error: Cannot read data
[12/16/2008 10:14 PM] SERVER2.flnet.local Decode error: Cannot read data (RMA Manager)
[12/16/2008 10:15 PM] SERVER2.domain.local Connection error
[12/16/2008 11:58 PM] SERVER2.domain.local Connection error
[12/17/2008 2:31 AM] SERVER2.domain.local Decode error: Cannot read data
[12/17/2008 2:31 AM] SERVER2.domain.local Decode error: Cannot read data (RMA Manager)
[12/17/2008 2:31 AM] SERVER2.domain.local Connection error
[12/17/2008 5:59 AM] SERVER2.domain.local Decode error: Cannot read data
[12/17/2008 9:29 AM] SERVER2.domain.local Decode error: Cannot read data (RMA Manager)
However, I do see similar errors in some other RMA logs:
Code:
[11/16/2008 11:44 AM] server.domain.com Decode error: Cannot read data. An existing connection was forcibly closed by the remote host.
[11/16/2008 11:45 AM] server.domain.com Connection error
[11/16/2008 11:45 AM] server.domain.com Connection error
I also changed all the test names so there are no repeats.
I am out of my office atm, but I will try to set someone up for passive. We were hoping to avoid this, so that we don't have to mess with firewall rules on all of our remote clients.
Thanks for your help
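When comparing RMA logs from several servers, it can help to tally how often each kind of error shows up per host. Below is a minimal sketch, assuming the "[date time] host message" layout of the log segments pasted in this thread; `tally_errors` is a hypothetical helper, and it does not try to interpret the "(RMA Manager)" suffix, whose meaning is not explained here.

```python
import re
from collections import Counter

# Assumed layout, copied from the RMA log segments in this thread:
#   [12/17/2008 2:31 AM] SERVER2.domain.local Decode error: Cannot read data
LINE_PATTERN = re.compile(
    r"^\[(?P<stamp>[^\]]+)\]\s+(?P<host>\S+)\s+(?P<message>.+)$"
)


def tally_errors(lines):
    """Count log entries per (host, error kind).

    The error kind is the message text up to the first colon, so
    "Decode error: Cannot read data" is counted as "Decode error".
    Lines that do not match the assumed layout are skipped.
    """
    counts = Counter()
    for line in lines:
        match = LINE_PATTERN.match(line.strip())
        if match is None:
            continue
        kind = match.group("message").split(":", 1)[0].strip()
        counts[(match.group("host"), kind)] += 1
    return counts


# Sample data: a few lines from the SERVER2 log segment above.
sample = [
    "[12/16/2008 10:15 PM] SERVER2.domain.local Connection error",
    "[12/16/2008 11:58 PM] SERVER2.domain.local Connection error",
    "[12/17/2008 2:31 AM] SERVER2.domain.local Decode error: Cannot read data",
    "[12/17/2008 2:31 AM] SERVER2.domain.local Decode error: Cannot read data (RMA Manager)",
    "[12/17/2008 2:31 AM] SERVER2.domain.local Connection error",
]

counts = tally_errors(sample)
for (host, kind), number in sorted(counts.items()):
    print(host, kind, number)
```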
Quote:
[12/17/2008 2:31 AM] SERVER2.domain.local Decode error: Cannot read data
[12/17/2008 2:31 AM] SERVER2.domain.local Decode error: Cannot read data (RMA Manager)
It looks like some other application (not HostMonitor and not RMA Manager) accepted the connection request from the agent.
Not sure how this is possible....
I still think there is some external problem. Do you have any antivirus monitors, personal firewalls, or content monitoring software installed? Non-standard Winsock components?
Regards
Alex
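One quick way to look for external interference is to try a plain TCP connect from the agent's machine to the address and port the agent is configured to use, and time it. This is only a sketch under stated assumptions: `tcp_connect_time` is a hypothetical helper, and the demo below connects to a throwaway local listener so that it runs anywhere; the real host and port must be taken from your own RMA configuration.

```python
import socket
import time


def tcp_connect_time(host, port, timeout=5.0):
    """Try a plain TCP connect; return seconds taken, or None on failure."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - started
    except OSError:
        return None


# Demo against a throwaway local listener (an assumption for illustration;
# replace host and port with the values from your own RMA configuration).
listener = socket.socket()
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]

elapsed = tcp_connect_time("127.0.0.1", demo_port)
print("connected" if elapsed is not None else "connect failed", elapsed)
listener.close()
```

A successful connect only shows that something accepted the connection on that address and port. It does not prove that HostMonitor itself answered, which is exactly the doubt raised above, so the RMA logs remain the primary evidence.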
work around
I had the same problem, which was a REAL PAIN!!!!
Hated waking up in the morning with 400 emails on my BlackBerry.
Aaaaaanyway... my workaround is to untick "treat unknown reply as bad" in the properties of the tests.
It's not perfect, but it works for me.