Windows 2003 RMA CPU Issue
-
- Posts: 29
- Joined: Fri Dec 29, 2006 10:17 am
Windows 2003 RMA CPU Issue
Hi, I'm having a problem using RMA to monitor CPU on a few Windows 2003 boxes.
I recently installed 6 RMA agents on 4 Windows Server 2003 R2 Enterprise and 2 Windows Server 2003 Enterprise servers and have HostMonitor set to check the CPU on each of them every minute. The majority of the time the monitors work properly, but they randomly throw an "RMA: 301 - Cannot retrieve data" error. All of the other tests I run on these servers are working 100% of the time (services, processes, disk space, etc.), so it's only with CPU and it seems random.
I have already checked all of the services and DLLs outlined in other tickets (RPC and Remote Registry are on, DLLs are enabled, and the admin account has access to the proper registry keys) and everything looks in order.
How do I stop that error from happening? It's beginning to affect our client SLAs.
-
- Posts: 2832
- Joined: Tue May 16, 2006 4:41 am
- Contact:
It could be a timeout issue. Try increasing the timeout set with the rma_cfg utility (on the system where the agent is running). This timeout specifies the maximum amount of time (in seconds) that the agent will wait for the complete request packet from HostMonitor (after the initial TCP connection is established) before dropping the connection.
http://www.ks-soft.net/hostmon.eng/rma- ... m#Settings
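To illustrate what an agent-side timeout of this kind means, here is a minimal Python sketch. It is not RMA's actual code: the function name `read_request`, the fixed `expected_len`, and the exception choices are assumptions made for illustration only. It shows a receiver that gives up if the complete request has not arrived within the configured number of seconds.

```python
import socket
import time


def read_request(conn: socket.socket, expected_len: int, timeout_s: float) -> bytes:
    """Read exactly expected_len bytes, or raise TimeoutError after timeout_s seconds.

    Hypothetical helper: the real RMA packet format is not described in this thread.
    """
    deadline = time.monotonic() + timeout_s
    chunks = []
    received = 0
    while received < expected_len:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("request packet incomplete; dropping connection")
        conn.settimeout(remaining)  # never wait past the overall deadline
        try:
            chunk = conn.recv(expected_len - received)
        except socket.timeout:
            raise TimeoutError("request packet incomplete; dropping connection") from None
        if not chunk:
            raise ConnectionError("peer closed the connection early")
        chunks.append(chunk)
        received += len(chunk)
    return b"".join(chunks)
```

With a short timeout, a request that arrives slowly (for example over a congested link) is dropped even though the connection itself was accepted, which is why raising the value can help.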
Also, you may increase the timeout specified on HostMonitor's side in the "Agent Connection Parameters" dialog. This timeout specifies the maximum amount of time that HostMonitor will wait for an answer from the agent. Please note: this timeout should be big enough to allow an agent to perform a test before sending an answer to HostMonitor. http://www.ks-soft.net/hostmon.eng/mfra ... #agentlist
In your case you probably have to increase the test interval, because the "CPU usage" test takes some time, and some amount of time is also needed for data exchange between HostMonitor and the agent. So 1 minute might not be enough if the system is heavily loaded.
Does the RMA check CPU usage on the local system (the system where RMA is running)?
Regards,
Max
-
- Posts: 29
- Joined: Fri Dec 29, 2006 10:17 am
Thanks for the quick reply!
Each of the agents' configs was set to 10 seconds locally on the servers they were installed on, so I changed each of them to 1 minute. The config on HostMonitor's side (via the RMA Manager), however, is already 2 minutes for each of them, and I know that it doesn't take that long for the individual monitors to error out...
Do you think the change from 10 seconds to 1 minute on the agent-side will resolve the issue?
-
- Posts: 2832
- Joined: Tue May 16, 2006 4:41 am
- Contact:
-
- Posts: 29
- Joined: Fri Dec 29, 2006 10:17 am
Hi, again. The issue is still happening, so I was wondering if you have any further ideas? The timeout on the HostMonitor side in RMA Manager is 120 seconds for each agent and the timeout on the agent side is the 1 minute I set yesterday.
I have several other Windows 2003 agents installed on other servers that are having no problems at all; it just seems to be these 6 servers.
Does HostMonitor show "Checking" status for 120 sec and then set Unknown status? Or does HostMonitor set Unknown status right away or within 5-10 sec?
Do you use these RMAs to check CPU usage on the local system (the system where RMA is installed)? Do you use <local computer> as the parameter of the test?
Regards
Alex
-
- Posts: 29
- Joined: Fri Dec 29, 2006 10:17 am
The few times that I have actually been able to get to the HostMonitor box while they were in error, each of the problem CPU monitors took only 5-10 seconds to error out (not the full 120 seconds). Also, yes, each RMA monitors the CPU locally on the box it is installed on, and each of the CPU monitors in HostMonitor uses <local computer> as the parameter for the computer to monitor.
As I mentioned before, the majority of the time these monitors work without a problem, but once or twice a day they throw this error for some reason and it lasts anywhere from a minute up to 10 or 15 minutes.
-
- Posts: 29
- Joined: Fri Dec 29, 2006 10:17 am
Only one of the servers actually has the log file, so I'm assuming that the other RMAs haven't needed to log any errors yet. The one log file that does exist only has 4 lines in it from back on August 29 (I believe when it was installed), but nothing recent, and I have seen the issue once on each server today.
Do you think this is an RMA-side error then? It doesn't look like a timeout error since that is set to 60 seconds. Maybe a Windows setting or RPC issue?
-
- Posts: 29
- Joined: Fri Dec 29, 2006 10:17 am
I don't think reinstalling will help. A backup RMA will not help either, because the agent works fine and accepts the connection; then something happens to the TCP channel.
Could you send the rma.ini (RMA system) and agents.lst (HostMonitor's folder) files to support@ks-soft.net? Maybe we will find something...
BTW, as a workaround you may add an "advanced" mode action to the alert profile assigned to the tests. You may use the "Repeat test" action with a start condition like ('%Status%'=='Unknown')
http://www.ks-soft.net/hostmon.eng/mfra ... ncedaction
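The idea behind this workaround can be sketched in Python as follows. This is only an illustration of the logic, not HostMonitor's implementation: the function name `run_with_repeat`, the `max_repeats` parameter, and the status strings are assumptions, and in HostMonitor itself the behavior is configured in the alert profile rather than coded.

```python
from typing import Callable


def run_with_repeat(test: Callable[[], str], max_repeats: int = 1) -> str:
    """Run a test and repeat it while it reports 'Unknown' (hypothetical helper)."""
    status = test()
    repeats = 0
    # Mirrors a start condition like ('%Status%'=='Unknown'):
    # only an Unknown result triggers another run.
    while status == "Unknown" and repeats < max_repeats:
        status = test()
        repeats += 1
    return status
```

A transient Unknown result is then replaced by the outcome of the repeated test, so a one-off failure to retrieve data does not propagate as the final status.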
Regards
Alex

-
- Posts: 29
- Joined: Fri Dec 29, 2006 10:17 am
Thanks so much again! I sent the files in with a PDF copy of the ticket so whoever gets it will have the history. If you could let me know what you think I would really appreciate it!
Also, I'd like to save the use of the Advanced mode as a last resort, but I'll keep it in mind if we can't find another resolution.
Thanks again and I look forward to hearing from you!
Thanks for the files; the settings look OK.
Looks like we misunderstood the problem. I was thinking about the "Cannot read data" error, while you are receiving "Cannot retrieve data". My apologies.
"Cannot retrieve data" means there is nothing wrong with HostMonitor<->RMA communication. That's why the log file does not show any errors. RMA cannot perform the test because the Windows API does not provide access to the registry. Why? Hmm... hard to say.
Do you check the local system (the system where RMA is running)? Do you use "<local computer>" as the parameter of the test? Or are you using a hostname? An IP address?
Regards
Alex
-
- Posts: 29
- Joined: Fri Dec 29, 2006 10:17 am
Hi, again,
Yes, I am only testing the local machine that the RMA agent is installed on and yes, I use <local computer> instead of a hostname or IP.
Any ideas on a solution?
Also, if I end up having to go with the Advanced profile method instead of the standard one: right now I have it set so that after 5 bad attempts it sends an alert. If I change it to what you mention above (to not send an alert if status=Unknown), is there also a way to keep that "5 bad, then send alert" setting?
Thanks.
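For reference, the behavior asked about here (alert after 5 consecutive bad results while ignoring Unknown) can be written down as a small Python sketch. Whether and how a HostMonitor alert profile can express this is not confirmed anywhere in this thread; the class name `AlertCounter`, the status strings, and the choice that Unknown neither counts toward nor resets the streak are all assumptions for illustration.

```python
class AlertCounter:
    """Track consecutive 'Bad' results, ignoring 'Unknown' (hypothetical sketch)."""

    def __init__(self, threshold: int = 5) -> None:
        self.threshold = threshold
        self.consecutive_bad = 0

    def record(self, status: str) -> bool:
        """Record one test result; return True if an alert should be sent now."""
        if status == "Unknown":
            # Assumption: Unknown is skipped entirely and leaves the streak as-is.
            return False
        if status == "Bad":
            self.consecutive_bad += 1
            # Alert exactly once, when the streak first reaches the threshold.
            return self.consecutive_bad == self.threshold
        # Any other status (e.g. a good result) resets the streak.
        self.consecutive_bad = 0
        return False
```

An alternative design would let Unknown reset the streak; which is preferable depends on whether an Unknown result should be treated as evidence that the server recovered.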