Windows 2003 RMA CPU Issue

scott.carroll@brulant.com
Posts: 29
Joined: Fri Dec 29, 2006 10:17 am

Post by scott.carroll@brulant.com »

Hi, I'm having a problem using RMA to monitor CPU on a few Windows 2003 boxes.

I recently installed 6 RMA agents on 4 Windows Server 2003 R2 Enterprise and 2 Windows Server 2003 Enterprise servers and have HostMonitor set to check the CPU on each of them every minute. The majority of the time the monitors work properly, but they randomly throw a "RMA: 301 - Cannot retrieve data." All of the other tests I run on these servers are working 100% of the time (services, processes, disk space, etc.), so it's only with CPU and it seems random.

I have already checked all of the services and dlls outlined in other tickets (rpc and remote registry are on, dlls are enabled, and the admin account has access to the proper registry keys) and everything looks in order.

How do I stop that error from happening? It's beginning to affect our client SLAs.
KS-Soft Europe
Posts: 2832
Joined: Tue May 16, 2006 4:41 am
Contact:

Post by KS-Soft Europe »

It could be a timeout issue. You should increase the timeout that is set with the rma_cfg utility (on the system where the agent is running). This timeout specifies the maximum amount of time (in seconds) that the agent will wait for the complete request packet from HostMonitor (after the initial TCP connection is established) before dropping the connection.
http://www.ks-soft.net/hostmon.eng/rma- ... m#Settings
Also, you may increase the timeout that is specified on HostMonitor's side in the "Agent Connection Parameters" dialog. This timeout specifies the maximum amount of time that HostMonitor will wait for an answer from the agent. Please note: this timeout should be long enough to allow the agent to perform a test before sending an answer to HostMonitor. http://www.ks-soft.net/hostmon.eng/mfra ... #agentlist

In your case you probably have to increase the test interval, because the "CPU usage" test takes some time and some amount of time is also needed for data exchange between HostMonitor and the agent. So 1 minute might not be enough if the system is heavily loaded.
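
As a rough illustration of the timing relationship described above, here is a minimal Python sketch. It is not part of HostMonitor or RMA; the function name `check_timing` and all numbers are hypothetical, and the two rules it checks are only a reading of the advice in this post (the HostMonitor-side timeout and the test interval should both leave room for the test itself plus the data exchange).

```python
def check_timing(test_duration_s, exchange_overhead_s,
                 hostmonitor_timeout_s, test_interval_s):
    """Return a list of warnings for a hypothetical set of timing values."""
    # Time one check needs end to end: the test itself plus data exchange.
    needed = test_duration_s + exchange_overhead_s
    warnings = []
    if hostmonitor_timeout_s < needed:
        warnings.append("HostMonitor-side timeout is shorter than test time plus data exchange")
    if test_interval_s < needed:
        warnings.append("test interval is shorter than test time plus data exchange")
    return warnings


# Illustrative numbers only: a quick test with a 120 s timeout and a 60 s interval.
print(check_timing(5, 2, 120, 60))
# A heavily loaded system where one check takes longer than the 1-minute interval.
print(check_timing(70, 5, 120, 60))
```

An empty list means the assumed values leave enough headroom; any warning points at the setting that would need to be raised.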

Does the RMA check CPU usage on the local system (the system where RMA is running)?

Regards,
Max

Post by scott.carroll@brulant.com »

Thanks for the quick reply!

Each agent's config was set to 10 seconds locally on the server it was installed on, so I changed each of them to 1 minute. The config on HostMonitor's side (via the RMA Manager), however, is already 2 minutes for each of them, and I know that it doesn't take that long for the individual monitors to error out...

Do you think the change from 10 seconds to 1 minute on the agent-side will resolve the issue?

Post by KS-Soft Europe »

scott.carroll@brulant.com wrote: Do you think the change from 10 seconds to 1 minute on the agent-side will resolve the issue?
It should help, I suppose.

Regards,
Max

Post by scott.carroll@brulant.com »

Hi again. The issue is still happening, so I was wondering if you have any further ideas? The timeout on the Host Monitor side in RMA Manager is 120 seconds for each agent, and the timeout on the agent side is the 1 minute I set yesterday.

I have several other Windows 2003 agents installed on other servers that are having no problems at all; it just seems to be these 6 servers.
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

Does HostMonitor show "Checking" status for 120 sec and then set Unknown status? Or does HostMonitor set Unknown status right away or within 5-10 sec?
Do you use these RMAs to check CPU Usage on the local system (the system where RMA is installed)? Do you use <local computer> as the parameter of the test?

Regards
Alex

Post by scott.carroll@brulant.com »

The few times that I have actually been able to get to the Host Monitor box while they were in error, each of the problem CPU monitors took only 5-10 seconds to error out (not the full 120 seconds). Also, yes, each RMA monitors the CPU locally on the box it is installed on, and each of the CPU monitors in Host Monitor uses <local computer> as the parameter for the computer to monitor.

As I mentioned before, the majority of the time these monitors work no problem, but once or twice a day they throw this error for some reason and it lasts anywhere from a minute to up to 10 or 15 minutes.

Post by KS-Soft »

Looks like something drops the connection... Do you see any errors in RMA's log file (named log2.txt by default)?

Regards
Alex

Post by scott.carroll@brulant.com »

Only one of the servers actually has the log file, so I'm assuming that the other RMAs haven't needed to log any errors yet. The one log file that does exist only has 4 lines in it, from back on August 29 (I believe when it was installed), but nothing recent, and I have seen the issue once on each server today.

Do you think this is an RMA-side error then? It doesn't look like a timeout error, since that is set to 60 seconds. Maybe a Windows setting or an RPC issue?

Post by KS-Soft »

Strange... maybe RMA logging is disabled?
BTW: What version of HostMonitor / RMA do you use?

Regards
Alex

Post by scott.carroll@brulant.com »

All RMAs are 3.44 and Host Monitor is 6.82 (they should all be the most up to date versions).

Also, all of the RMA configs have "Failure audit log to" checked and the log is "log2.txt" as you said.

What about re-installing the agents or installing backup agents? Or any other possibilities?

Post by KS-Soft »

I don't think reinstalling will help. A backup RMA will not help either, because the agent works fine and accepts the connection, and then something happens to the TCP channel :roll:
Could you send the rma.ini (RMA system) and agents.lst (HostMonitor's folder) files to support@ks-soft.net? Maybe we will find something...

BTW, as a workaround you may add an "advanced" mode action to the alert profile assigned to the tests. You may use the "Repeat test" action and a start condition like ('%Status%'=='Unknown')
http://www.ks-soft.net/hostmon.eng/mfra ... ncedaction
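
The idea behind this workaround, repeating a test when its status comes back as Unknown, can be pictured with a small Python sketch. This is only a generic illustration of the logic, not HostMonitor's implementation or API; `run_with_repeat`, `max_repeats` and the simulated statuses are hypothetical names and values.

```python
def run_with_repeat(run_test, max_repeats=1):
    """Run a test; while the status is 'Unknown', repeat it up to max_repeats times."""
    status = run_test()
    repeats = 0
    while status == "Unknown" and repeats < max_repeats:
        status = run_test()  # immediate re-check instead of waiting for the next interval
        repeats += 1
    return status


# Simulated test that returns Unknown once and then Ok (made-up data).
results = iter(["Unknown", "Ok"])
print(run_with_repeat(lambda: next(results)))
```

With a transient failure like the one in this thread, the repeated check would usually replace the Unknown status with a real result before any alert is considered.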

Regards
Alex

Post by scott.carroll@brulant.com »

Thanks so much again! I sent the files in with a pdf copy of the ticket so whoever gets it will have the history. If you could let me know what you think I would really appreciate it!

Also, I'd like to save the use of the Advanced mode as a last resort, but I'll keep it in mind if we can't find another resolution.

Thanks again and I look forward to hearing from you!

Post by KS-Soft »

Thanks for the files, the settings look OK.
Looks like we misunderstood the problem. I was thinking about the "Cannot read data" error, while you are receiving "Cannot retrieve data". My apologies :oops:
Well, "Cannot retrieve data" means there is nothing wrong with HostMonitor<->RMA communication. That's why the log file does not show any error. RMA cannot perform the test because the Windows API does not provide access to the registry. Why? H'm... hard to say.
Do you check the local system (the system where RMA is running)? Do you use "<local computer>" as the parameter of the test? Or are you using a hostname? An IP address?

Regards
Alex

Post by scott.carroll@brulant.com »

Hi again,

Yes, I am only testing the local machine that the RMA agent is installed on and yes, I use <local computer> instead of a hostname or IP.

Any ideas on a solution?

Also, if I end up having to go with the Advanced profile method instead of the standard one: right now I have it set so that it sends an alert after 5 bad attempts. If I change it to what you mention above (to not send an alert if the status is Unknown), is there also a way to keep that "5 bad, then send alert" setting?

Thanks.
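
The behaviour asked about above (alert only after 5 bad results, without Unknown results triggering an alert) can be sketched generically in Python. Whether and how HostMonitor supports this combination is not answered in this thread; the function `should_alert`, the status strings, and the choice that Unknown neither counts toward nor resets the streak are all assumptions made for illustration.

```python
def should_alert(statuses, threshold=5):
    """Return True once `threshold` consecutive 'Bad' results have been seen.

    Assumption: 'Unknown' results are skipped, so they neither count
    toward the streak nor reset it.
    """
    streak = 0
    for status in statuses:
        if status == "Unknown":
            continue  # ignore transient Unknown results
        if status == "Bad":
            streak += 1
            if streak >= threshold:
                return True
        else:
            streak = 0  # any good result resets the streak
    return False


# Made-up history: a few Unknown blips between good results, no alert expected.
print(should_alert(["Ok", "Unknown", "Ok", "Unknown", "Unknown", "Ok"]))
```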