Hi!
Just upgraded our HM 6.75 to 6.80 and tried to update two RMAs through RMA Manager.
After the 'update', the two RMAs are now at v. 3.35, but they used to be at 3.38 after an update performed a few months ago, probably from HM 6.70 or similar.
Is this decrease in version number expected, or did I miss something?
Thanks!
Kasper
RMA version number decreased
Alex,
Okay, I think I've found the problem.
After having upgraded HM from 6.75 to 6.80 I started getting alarms from four DNS tests that I run through RMA on four different Win2003 SP1 servers. The tests look like this:
; ------- Test #6061 -------
Method = DNS
;--- Common properties ---
;DestFolder = Root\DC Produktion\NS1\DNS Checks\
RMAgent = NS1
Title = NS1 DNS Check (www.anet.dk)
Comment =
RelatedURL =
ScheduleMode= Regular
Schedule = Mo-Su 0000-2359
Interval = 60
Alerts = Send SMS Mo-Su 0000-2359
ReverseAlert= No
UnknownIsBad= No
WarningIsBad= No
UseCommonLog= Yes
PrivLogMode = Default
CommLogMode = Default
SyncCounters= Yes
SyncAlerts = No
DependsOn = list
MasterTest-Alive = NS1 Port 53 (DNS)
;--- Test specific properties ---
Server = 195.41.139.21
Port = 53
Timeout = 60
Protocol = UDP
Hostname = www.anet.dk
QueryType = A
;-----------------------------------------------------------------------------
The only differences between the four tests are which RMA performs them and the Server parameter in the Test Specific Properties section.
I started getting the reply "RMA: Cannot read data", but after a while the test would return OK again. I cannot see a pattern in this.
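To check whether the DNS server itself sometimes answers slowly, one could time the same query outside HostMonitor, from the agent machine. The Python sketch below builds a plain RFC 1035 A query and sends it over UDP, reusing the Server, Port, Timeout, Protocol, Hostname and QueryType values from the test above. The function names `build_a_query` and `time_udp_query` are hypothetical; this is a diagnostic sketch under those assumptions, not how HostMonitor or RMA implements its DNS test.

```python
import socket
import struct
import time


def build_a_query(hostname: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet asking for an A record (RFC 1035)."""
    # Header: ID, flags (only RD, recursion desired, set), QDCOUNT=1, other counts 0
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in hostname.rstrip(".").split(".")
    ) + b"\x00"
    question = qname + struct.pack(">HH", 1, 1)  # QTYPE=A (1), QCLASS=IN (1)
    return header + question


def time_udp_query(server: str, hostname: str, port: int = 53, timeout: float = 60.0):
    """Send one A query over UDP and return (elapsed_seconds, reply_bytes_or_None).

    The reply is not parsed or validated; None means no answer within `timeout`.
    """
    packet = build_a_query(hostname)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        start = time.monotonic()
        sock.sendto(packet, (server, port))
        try:
            reply, _ = sock.recvfrom(512)  # classic UDP DNS message size limit
        except socket.timeout:
            return time.monotonic() - start, None
        return time.monotonic() - start, reply
```

Running `time_udp_query("195.41.139.21", "www.anet.dk", timeout=60)` in a loop on the affected server and logging the elapsed times would show whether slow or missing DNS answers line up with the "Cannot read data" results.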
Now it seems that if I upgrade any of these four servers, they all change from 3.38 to 3.35, except for one of them, which stays at 3.38.
If I upgrade other machines, they all go to 3.44 as expected.
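When checking a batch of agents after an upgrade, comparing versions component by component (rather than as plain strings) makes it easy to list the ones that did not reach the expected build. A minimal Python sketch, assuming the versions have been copied from RMA Manager into a dictionary by hand; `agents_behind` and the agent names other than NS1 are hypothetical.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a dotted version such as "3.38" into a tuple of ints, e.g. (3, 38)."""
    return tuple(int(part) for part in version.split("."))


def agents_behind(reported: dict[str, str], expected: str) -> list[str]:
    """Return the (sorted) names of agents reporting a version below `expected`."""
    target = parse_version(expected)
    return sorted(name for name, v in reported.items() if parse_version(v) < target)


# Hypothetical snapshot taken from RMA Manager after the upgrade
reported = {"NS1": "3.35", "NS2": "3.38", "WEB1": "3.44"}
print(agents_behind(reported, "3.44"))
```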
It seems like the file rma$.upd is the culprit. If I try to delete this file on one of the four servers, Windows tells me that it's being used by another program. I cannot stop the RMA service and am forced to kill rma.exe through Task Manager. As soon as I do this, a new rma.exe is launched, Service Manager reports the RMA service as running, and *now* RMA is reported as 3.44 in RMA Manager.
So basically it seems like the DNS tests mess up the RMAs performing them, and afterwards the upgrade procedure fails because Windows cannot stop the service in order to replace rma.exe with rma$.upd.
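The upgrade step described above (put rma$.upd in place of rma.exe once the service has stopped) follows a common stage-and-swap pattern. The Python sketch below illustrates that pattern under stated assumptions: `is_stopped` is a hypothetical callback standing in for a service-status check, and the wait limit is arbitrary. It only shows why a service that never stops leaves the old binary running; it is not RMA's actual update code.

```python
import os
import time
from typing import Callable


def apply_staged_update(
    target: str,
    staged: str,
    is_stopped: Callable[[], bool],  # hypothetical service-status check
    wait_seconds: float = 60.0,
    poll_interval: float = 1.0,
) -> bool:
    """Replace `target` with `staged` once the service reports it has stopped.

    Returns False and leaves both files untouched if the service is still
    running after `wait_seconds`.
    """
    deadline = time.monotonic() + wait_seconds
    while not is_stopped():
        if time.monotonic() >= deadline:
            return False  # service still running: the old binary stays in place
        time.sleep(poll_interval)
    os.replace(staged, target)  # overwrite the old file with the staged one
    return True
```

With this shape, a stuck service simply makes the swap give up, which would match the observation that the agents kept reporting an older version until rma.exe was killed.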
Did anything perhaps change from HM 6.75 to 6.80 in the DNS test method that could make the RMAs fail?
/Kasper
> I started getting the reply "RMA: Cannot read data", but after a while the test would return OK again. I cannot see a pattern in this.
I assume HostMonitor<->RMA connection timeout is shorter than DNS test timeout. That's why HostMonitor drops the connection before RMA returns the test result. Try to increase the timeout specified for the agent (Agent Connection Parameters dialog).
> So basically it seems like the DNS tests mess up the RMAs performing them, and afterwards the upgrade procedure fails because Windows cannot stop the service in order to replace rma.exe with rma$.upd.
> Did anything perhaps change from HM 6.75 to 6.80 in the DNS test method that could make the RMAs fail?
We have not touched the DNS test since version 6.72. I don't think the DNS test is the reason for this problem. Theoretically it may delay the service restart for a while (up to 60 sec if you are using a 60 sec timeout), but this should not lead to a problem like this... unless you have a lot of DNS tests and the DNS servers stop responding at the same time.
Tests like CPU Usage, Process and Performance Counter can lead to a "restart service" problem if you have a lot of tests and the target servers stop responding.
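The two timing arguments above can be written down as simple checks. A minimal Python sketch: the first function encodes the rule that the agent connection timeout should be longer than the test's own timeout; the second estimates how long a service stop might be held up, assuming (this is an assumption, not documented RMA behavior) that in-flight tests run concurrently and the service waits for each to finish or time out. Both function names are hypothetical.

```python
def connection_may_drop(connection_timeout: float, test_timeout: float) -> bool:
    """True if HostMonitor could give up on the agent before the test itself times out."""
    return connection_timeout <= test_timeout


def worst_case_stop_delay(in_flight_timeouts: list[float]) -> float:
    """Longest expected wait before the service can stop.

    Assumes in-flight tests run concurrently, so the delay is bounded by the
    largest single timeout rather than by the sum of all of them.
    """
    return max(in_flight_timeouts, default=0.0)


# Values from this thread: 120 s agent connection timeout, 60 s DNS test timeout
print(connection_may_drop(120, 60))
print(worst_case_stop_delay([60, 60, 60, 60]))
```

If the tests were instead processed one after another, the bound would be closer to the sum of the timeouts, which is one way "a lot of tests" against unresponsive targets could stretch a restart well beyond 60 seconds.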
Regards
Alex
Alex,
> I assume HostMonitor<->RMA connection timeout is shorter than DNS test timeout.
Actually not. Connection timeout to these agents is set to 120 seconds. The DNS test itself has a timeout of 60 seconds.
From the Bad Log on the server running the RMA I have quite a few of these lines:
[Time] [IP of my HM Server] Windows Socket error: An established connection was aborted by the software in your host machine (10053), on API 'send'
Does this provide any useful information?
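Error 10053 is Winsock's WSAECONNABORTED: the connection was aborted on the local side. To see how often these entries occur and for which HostMonitor address, the Bad Log could be tallied. A minimal Python sketch, assuming each entry literally has the shape quoted above (`[time] [ip] Windows Socket error: ... (code), on API 'name'`); the real log layout may differ, and `tally_socket_errors` is a hypothetical helper.

```python
import re
from collections import Counter

# Assumed entry shape, based on the line quoted above:
# [<time>] [<ip>] Windows Socket error: <text> (<code>), on API '<name>'
LINE_RE = re.compile(
    r"^\[(?P<time>[^\]]*)\]\s+\[(?P<ip>[^\]]*)\]\s+Windows Socket error:"
    r".*\((?P<code>\d+)\), on API '(?P<api>[^']*)'"
)


def tally_socket_errors(lines):
    """Count socket errors per (ip, error code, API call); other lines are ignored."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m:
            counts[(m["ip"], int(m["code"]), m["api"])] += 1
    return counts
```

Comparing the timestamps of the 10053 entries with the times HostMonitor marked the DNS tests as Unknown would show whether the two are related.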
Thanks!
/Kasper
Sorry for delay
> Actually not. Connection timeout to these agents is set to 120 seconds. The DNS test itself has a timeout of 60 seconds.
Strange... Does HostMonitor show Checking status for 60 sec and then change the status to Unknown? Or does HostMonitor change the status from Checking to Unknown within several seconds?
Regards
Alex
Alex,
> Sorry for delay
Same here.
> Strange... Does HostMonitor show Checking status for 60 sec and then change the status to Unknown? Or does HostMonitor change the status from Checking to Unknown within several seconds?
It's hard to tell, because after upgrading all agents on the machines giving the errors to v. 3.44, I haven't been able to reproduce the problem, which on the other hand is of course a good thing.
Bottom line seems to be that the problem went away with an upgrade of all the agents.
Still, these servers seem to be the only ones constantly generating these "Windows Socket error: An established connection was aborted ..." entries in their Bad Logs, so I guess there might be a little thingy going on here, but I'll try to get some info on that and write it up in a separate thread later.
/Kasper
> Still, these servers seem to be the only ones constantly generating these "Windows Socket error: An established connection was aborted ..." entries in their Bad Logs
If RMA still records new error messages, some test(s) still fail. If you are able to check how long HostMonitor shows "Checking" status, we will know whether:
- there is some problem with the agent, so it cannot complete the test within the specified timeout
- or there is some 3rd party application (antivirus? firewall?) that forcibly drops the connection
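The two cases above can be told apart by how long the test stays in "Checking" compared with its timeout. A minimal Python sketch of that reasoning; `classify_checking_duration` is hypothetical, and the 10-second cutoff for "within several seconds" is an assumed value, not a HostMonitor setting.

```python
def classify_checking_duration(
    checking_seconds: float,
    test_timeout: float,
    quick_threshold: float = 10.0,  # assumed meaning of "within several seconds"
) -> str:
    """Map how long a test stayed in Checking to the likely cause."""
    if checking_seconds >= test_timeout:
        # Ran the full timeout, then went Unknown: the agent could not finish in time
        return "agent did not complete the test within its timeout"
    if checking_seconds <= quick_threshold:
        # Went Unknown almost immediately: something cut the connection
        return "connection likely dropped by a 3rd party application (antivirus? firewall?)"
    return "inconclusive"


# With the 60 s DNS test timeout from this thread
print(classify_checking_duration(60, 60))
print(classify_checking_duration(3, 60))
```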
Regards
Alex