System crashes after 24-48 hours

timn · Post by **timn** » Thu Jan 05, 2006 9:56 am

Update from the originator of this thread:

This is a long and convoluted thread -- but the end result was that the problem was NOT in Host Monitor. It was eventually discovered to be due to a memory leak in a driver for CommView -- third-party packet sniffing software that was installed on the machine running Host Monitor.  

(In fairness, Tamosoft, the vendor of CommView, had apparently already discovered and resolved this leak earlier, but for some reason, our installation had mismatched drivers.)

This issue required nearly 3 weeks of debugging to discover, but persistence eventually paid off. I would like to thank both Alex and Yoorix for their patience and assistance in troubleshooting the issue. As always, Host Monitor rocks.

This incident perhaps serves as a good example of how a problem that appears to be a 'Host Monitor' issue may actually be caused indirectly by other software on the box - and over which Host Monitor has no control.

-Tim

Alex:

This is the issue that I've alluded to in a couple of prior posts. Basically, every 24-48 hours, the system that HM is running on appears to run out of memory or some similar memory-related resource. After that, pretty much everything is hosed and a reboot is required.

Something tells me this is going to be difficult to nail down. I'd greatly appreciate any thoughts or suggestions you might have.

First, the facts:

Host Monitor version: 5.70

System Info:
Windows 2000 Server
Version 5.0.2195 Service Pack 4 Build 2195
OS Manufacturer Microsoft Corporation
System Name WATCHDOG
System Manufacturer Dell Computer Corporation
System Model PowerEdge 1650
System Type X86-based PC
Processor x86 Family 6 Model 11 Stepping 1 GenuineIntel ~1390 Mhz
Processor x86 Family 6 Model 11 Stepping 1 GenuineIntel ~1390 Mhz
BIOS Version Phoenix ROM BIOS PLUS Version 1.10 A11
Windows Directory C:\WINNT
System Directory C:\WINNT\system32
Boot Device \Device\Harddisk0\Partition2
Locale United States
Time Zone Central Standard Time
Total Physical Memory 2,096,560 KB
Available Physical Memory 1,772,492 KB
Total Virtual Memory 6,131,640 KB
Available Virtual Memory 5,609,052 KB
Page File Space 4,035,080 KB
Page File C:\pagefile.sys

As I've indicated in previous messages, this is a totally fresh install of Windows 2000 Server, with all the latest patches from MS. This hardware (Dell PE1650) is the same hardware that I've been successfully running HM on for over a year with no problem. (I did replace the hardware with a different PE1650 after this started but it made no difference.)

So, we are pretty sure this is not a hardware issue.

Other than the Windows OS, there are only 2 other pieces of user software on this box. The first is NetTime, used to sync the clocks on all Windows boxes in the company. This is used on over 250 computers here and has never caused an issue. The other is pcAnywhere v 11, used for remote console access. While pcAnywhere does not rank among my favorite programs, it has been fairly stable and has never caused the kind of problem we are seeing here.

On the old hardware, we also ran TrendMicro OfficeScan. Temporarily, we've left that off this new box, thinking that may have been the source of our problem. But.. apparently not.

Crash occurs 24-48 hours after reboot.

Our first indication that something is wrong is this error in the Application Event Log:

EventID: 1000
Source: UserEnv
Windows cannot determine the user or computer name. Return value (14).

Within 20 minutes or so after that, the following System Event error occurs:

EventID: 26
Source: Application Popup
Application popup: hostmon.exe - Bad Image : The application or
DLL C:\WINNT\system32\DBMSSOCN.DLL is not a valid Windows image.
Please check this against your installation diskette.

Regarding the first event (as per Microsoft):

Value 14 (Error code 14) = "Not enough storage is available to complete this operation." - Do one of the following, then retry the operation:
1. reduce the number of running programs
2. remove unwanted files from the disk the Paging File is on and restart the system
3. check the paging file disk for an I/O error
4. install additional memory in your system.

Item 1 - is not practical - there are no user programs running other than Host Monitor, NetTime and pcAnywhere.

Item 2 - There's not a lot of garbage on the disk - "C:" drive capacity is 33.8 GB with 30.4GB free

Item 3 - Ran a CHKDSK on the C drive - no errors found

Item 4 - System already has 2GB RAM, so this is not an insufficient physical memory issue (but likely some memory leak issue).

Also, regarding the 2nd event, we have looked at C:\WINNT\system32\DBMSSOCN.DLL and it appears to be fine - same exact size and version info as on our other Win2K boxes.

What would you suggest for tracking this down further?

Thanks, Tim

KS-Soft · Post by **KS-Soft** » Fri Jan 06, 2006 8:01 am

Regarding the corrupted DLL: my colleague says he had a similar problem on his home computer. The problem was caused by cold air near the window!

When the system was moved to a warm place, the problem went away.

Regarding the resource problem: I don't think it's a hardware problem either.
Perhaps you have configured some new test items that lead to this problem? E.g. ODBC Query check or ODBC Logging? Some ODBC drivers may cause resource leakage.
Or maybe you have added Performance Counters to check some 3rd party software? OK, you said there is no 3rd party software.
So, what test methods and alerts do you use? Do you use ODBC Logging? What ODBC driver do you use? Did you update that driver before the problem started? Microsoft security patches often lead to various problems as well, but our Windows 2000 works fine (our worst experience has been with Windows XP).

The 1st step in tracking the problem should be to check what kind of resources are leaking, and which process causes the leakage. It's pretty simple - start the standard Windows Task Manager and check Memory, Handles, GDI Objects and USER Objects for each process.

Regards
Alex

timn · Post by **timn** » Sat Jan 07, 2006 2:33 pm

KS-Soft wrote: Perhaps you have configured some new test items that lead to this problem? E.g. ODBC Query check or ODBC Logging? Some ODBC drivers may cause resource leakage.

This is interesting. In fact, some of my most recently defined tests were a set of ODBC queries, some of which were constantly erroring out because development had not quite yet caught up to my monitoring...

ODBC Maintenance Job Log Check - QVSQL06 Unknown Error: Access violation at address 00000000. Read of address 00000000 ODBC test (QVSQL06)

Had a crash again this morning, but after reading your suggestion I have disabled all of the constantly failing ODBC tests. We'll see if this makes a difference.

KS-Soft wrote: Or maybe you have added Performance Counters to check some 3rd party software? OK, you said there is no 3rd party software.

Well, I do check SQL Server Performance Counters on remote nodes - so this would technically be 'third party software' - but it's not been the source of trouble in the past. I'm more inclined to think this may be associated with my recently added ODBC checks.

KS-Soft wrote: So, what test methods and alerts do you use?

Nearly all - we find Host Monitor to be a very useful tool!

Note: at some point it would be nice if there were a "Copy Results" button on the "Estimate Load" dialog box

More seriously, during our season, we have ~4,300 (4.300) tests with a load of about 21 tests/sec.

KS-Soft wrote: Do you use ODBC Logging? What ODBC driver do you use? Did you update that driver before the problem started? Microsoft security patches often lead to various problems as well, but our Windows 2000 works fine (our worst experience has been with Windows XP).

Hmmm. Started to say the answer was NO, but looked in my log dir and found an old ODBC dBase (.dbf) file that is still being updated -- this should not be here - I will locate and turn off.

KS-Soft wrote: The 1st step in tracking the problem should be to check what kind of resources are leaking, and which process causes the leakage. It's pretty simple - start the standard Windows Task Manager and check Memory, Handles, GDI Objects and USER Objects for each process.

Thanks for your suggestions. I'll let you know how this goes.

timn · Post by **timn** » Mon Jan 09, 2006 9:41 am

Alex:

Just a quick update. HM crashed again this morning but I discovered that I had not disabled ALL ODBC tests - there were 3 still active and 2 were getting continuous connect errors.

I have now disabled ALL ODBC tests (as shown by View | Estimate Load... 0 active, 15 disabled)

We'll let this run for a period and see what happens...

KS-Soft · Post by **KS-Soft** » Mon Jan 09, 2006 11:45 am

What about resource usage?

timn · Post by **timn** » Mon Jan 09, 2006 5:25 pm

KS-Soft wrote: What about resource usage?

Well, there is one oddball here. The Dell OpenManage software (omaws32 - Embedded Systems Management for Dell servers) is maintaining 1,160 handles - more than double of any other process.

I checked a couple of other servers and the count of Handles is roughly the same.

Still, if this becomes suspect we can disable it - at least temporarily as a test.

(A quick search of Google, however, does NOT indicate that omaws32 is a well-known troublemaker.)

KS-Soft · Post by **KS-Soft** » Tue Jan 10, 2006 10:34 am

1,600 handles is not a problem as long as the total number of handles allocated by other processes stays below 10-12 thousand.
My Windows 2000 system stays stable until 11,000 - 12,000 handles are allocated.

What about GDI and User objects? Did you check resource usage at the moment the application crashed, or only when everything was working fine? It's good to perform 2 checks and compare resource usage.

Regards
Alex
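
The two-check comparison suggested above can be sketched in a few lines of Python. This is only an illustration: the process name `example.exe` and every counter value below are made up, and the snapshots are assumed to have been copied by hand from Task Manager rather than read through any API.

```python
def largest_growth(baseline, later, top=3):
    """Return (process, counter, delta) tuples for the biggest increases."""
    deltas = []
    for process, counters in later.items():
        base = baseline.get(process, {})
        for counter, value in counters.items():
            delta = value - base.get(counter, 0)
            if delta > 0:
                deltas.append((process, counter, delta))
    # Largest increase first
    deltas.sort(key=lambda item: item[2], reverse=True)
    return deltas[:top]


# Hypothetical snapshots for illustration only; not taken from the thread.
baseline = {
    "hostmon.exe": {"handles": 900, "gdi": 120, "user": 80},
    "example.exe": {"handles": 300, "gdi": 10, "user": 5},
}
later = {
    "hostmon.exe": {"handles": 910, "gdi": 120, "user": 80},
    "example.exe": {"handles": 4300, "gdi": 10, "user": 5},
}

for process, counter, delta in largest_growth(baseline, later):
    print(f"{process}: {counter} grew by {delta}")
```

A counter that keeps growing between a "healthy" snapshot and one taken close to the crash is the one worth chasing; counters that stay flat can be set aside.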

timn · Post by **timn** » Fri Jan 13, 2006 9:03 am

KS-Soft wrote: 1,600 handles is not a problem as long as the total number of handles allocated by other processes stays below 10-12 thousand.
My Windows 2000 system stays stable until 11,000 - 12,000 handles are allocated.

What about GDI and User objects? Did you check resource usage at the moment the application crashed, or only when everything was working fine? It's good to perform 2 checks and compare resource usage.

Yes, I've compared this to what I'm seeing on other machines and the values seem reasonable.

I'm having trouble catching it at the moment of crash - typically 1:00am to 5:00am

Have checked resources when everything looks fine and the baseline values seem to stay pretty stable. There does not appear to be a runaway resource, yet there must be one.

I've set up a couple of self-checks on the HM box - %Committed Bytes and Memory Pool Non-Paged Bytes, but these don't appear to have a problem at time of crash. CPU utilization averages about 18% and physical memory hovers around 75% free (of 2GB RAM).

Symptoms at time of crash vary but include:

Host Monitor fails many tests, then appears to lock up
Host Monitor process terminates
PCAnywhere process terminates
Processes do not appear to have write access to the hard drives
When attempting reboot, message "You do not have permission to shut down this machine" appears - requires hard power cycle.

All this varies but I do consistently get this in the Application Event log just before 'crash':

EventID: 1000
Source: UserEnv
Windows cannot determine the user or computer name. Return value (14).

I've set alert to page me now when this error occurs -- but it may be too late by the time this occurs. I'll know shortly.

KS-Soft · Post by **KS-Soft** » Mon Jan 16, 2006 8:56 am

We spent some time on microsoft.com and various tech forums and found some articles describing similar problems, but did not find any article with a solution. My colleague (and some articles) says it can be a problem with Active Directory replication, a corrupted sysvol or a corrupted computer account...
Do you see error messages in the security event log? BTW: is the system with the problem a domain controller? If not, check the event logs on the domain controller as well.
E.g. if the computer account for the computer listed is corrupted, the following steps fix the problem: join the computer to a workgroup named WORKGROUP, and then restart the computer; rejoin the computer to the domain, and then restart the computer again.
http://support.microsoft.com/?kbid=329708

Regards
Alex

timn · Post by **timn** » Mon Jan 16, 2006 11:01 am

Alex:

I think you are on to something here.

The Host Monitor box (WATCHDOG) is not a domain controller itself but belongs to our local domain.

The security Event log on WATCHDOG did not show anything strange, BUT there were several messages in the DC's security event log indicating login failures from WATCHDOG.

So, I unjoined WATCHDOG from the domain (joining workgroup WORKGROUP) but I then received a warning message saying (roughly) "You have unjoined the domain but the computer account on the Domain Controller could not be disabled". So I went to the DC and manually deleted the computer account for WATCHDOG.

Then I rebooted WATCHDOG. Then attempted to rejoin domain. But...WATCHDOG rejoined the domain just a little too quickly - it usually takes several seconds to join a machine to the domain but this 'join' came back almost immediately. Then I rebooted WATCHDOG again.

After rebooting, I noticed that the name WATCHDOG did not appear in the list of computer accounts on the DC. I waited several minutes but the name never showed up.

So, I unjoined WATCHDOG from the domain once again (back to WORKGROUP) and again received the strange dialog warning about being unable to disable computer account.

Next, I changed the name of the Host Monitor machine to WATCHDOG7, then rebooted. Then rejoined domain. Then rebooted again. This time, WATCHDOG7 appears in the list of computer accounts on the DC.

Now that you've pointed me in this direction, I recall running into a similar problem a couple of years ago (not involving Host Monitor) where problems resulted from replacing hardware and naming it the same as the old hardware.

Anyway, given all this weirdness, I strongly suspect you are correct about this. We should know more in a day or 2. Thanks!

-Tim

timn · Post by **timn** » Tue Jan 17, 2006 2:04 pm

OK, problem still exists but.. after solving previous problem, we are getting more information - the problem appears to be with resource \Memory\Pool Nonpaged Bytes.

Immediately after a reboot, \Memory\Pool Nonpaged Bytes is

13,094,912 (13 MB)

The value steadily increments for the next 24 hours. System crashes when \Memory\Pool Nonpaged Bytes reaches:

266,874,880 (266 MB)

Event log now shows Event ID 2019, src "Srv":

"The server was unable to allocate from the system nonpaged pool because the pool was empty."

Strange thing is, Task Manager shows no 'hog' of this resource. The largest is SERVICES.EXE at about 1/2 megabyte.

I maintain a private text log on this and can see this resource is being eaten up at 12 MB / hour. The increase appears pretty linear - right up to the crash point.
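
As a rough cross-check of the figures above, the time from reboot to pool exhaustion can be estimated from the starting value, the value at crash and the observed growth rate. The sketch below assumes strictly linear growth and decimal megabytes (matching the rounded "13 MB" and "266 MB" figures); the function name is invented for this example.

```python
BYTES_PER_MB = 1_000_000  # decimal MB, matching the rounded figures in the post


def hours_until_exhaustion(start_bytes, limit_bytes, rate_mb_per_hour):
    """Hours for a linearly growing counter to climb from start to limit."""
    if rate_mb_per_hour <= 0:
        raise ValueError("rate must be positive")
    return (limit_bytes - start_bytes) / (rate_mb_per_hour * BYTES_PER_MB)


# Values reported in this post
start = 13_094_912    # \Memory\Pool Nonpaged Bytes right after reboot
limit = 266_874_880   # value at which the system crashes
rate = 12             # observed growth in MB per hour

hours = hours_until_exhaustion(start, limit, rate)
print(f"Estimated time from reboot to pool exhaustion: {hours:.1f} hours")
```

At 12 MB / hour this works out to roughly 21 hours, which is in line with the counter climbing for about a day before the crash.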

Next: how to find it?

None of the processes seem to be taking up much:

Currently

\Process(_Total)\Pool Nonpaged Bytes = 1.1 MB

but

\Memory\Pool Nonpaged Bytes = 54 MB

Where's Waldo (i.e. the missing 53 MB of Non-Paged bytes)?
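
One hedged way to read the two counters above: if the per-process counter only reflects pool that is charged to processes (an assumption here, not something confirmed in this thread), then the difference is nonpaged pool allocated in kernel mode that no process accounts for, which would be consistent with a driver leak. The arithmetic, using the rounded values from this post:

```python
def unattributed_pool_mb(system_pool_mb, process_total_mb):
    """Nonpaged pool (in MB) that no process accounts for."""
    return system_pool_mb - process_total_mb


system_pool_mb = 54.0    # \Memory\Pool Nonpaged Bytes, as reported above
process_total_mb = 1.1   # \Process(_Total)\Pool Nonpaged Bytes, as reported above

gap_mb = unattributed_pool_mb(system_pool_mb, process_total_mb)
share = gap_mb / system_pool_mb
print(f"Unattributed nonpaged pool: {gap_mb:.1f} MB ({share:.0%} of the pool)")
```

With nearly all of the pool unattributed, stopping or restarting user-mode processes would not be expected to release it.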

Yoorix · Post by **Yoorix** » Tue Jan 17, 2006 2:59 pm

Strange problem. Maybe the following links would be useful:

http://support.microsoft.com/kb/888928/en-us or
http://support.microsoft.com/kb/833266/en-us.

Are you sure you have not installed any antivirus software like Norton Antivirus or McAfee?

Could you, please, review this link to match particular problem?
http://eventid.net/display.asp?eventid= ... rv&phase=1

timn · Post by **timn** » Tue Jan 17, 2006 6:38 pm

Yoorix:

Definitely no AV installed at this point.

I have also uninstalled the Dell OpenManage software and drivers but this has made no difference.

Also, just FYI, from the console, I restarted all services (well, all that would let me) hoping the restart would release the pool resource and I'd see the counter drop, but alas, this yielded no result.

I am currently exploring the other links you've suggested and will post more when finished.

Thanks for your help.

timn · Post by **timn** » Wed Jan 18, 2006 10:38 am

Yoorix:

I've gone through most of the info at the links you provided -- I haven't found much that seems relevant yet -- but I've still got a couple of things to follow up on.

But my latest testing seems to confirm that the resource leak is associated with some driver on my system that HM needs to use to perform testing.

Stopping the Host Monitor process and the HM Web Service does NOT restore the memory resource to the pool. And no (visible) process appears to be hogging the NP Pool. Thus, it is NOT Host Monitor itself that is the problem.
I created a new test list with only 1 test - i.e. monitor PerfCounter: \Memory\Pool Nonpaged Bytes. For the past 2 hours the return value has stayed constant at 16,015,360.

So, resource leak is most likely in a system or third-party driver that HM must use in order to perform tests. But with over 3,000 active tests, pinning down which tests are driving the resource leak could take some time.

A large percentage of my tests use either RMA or SNMP. I think I'll ignore tests actually performed by RMA and focus on tests performed by Host Monitor itself.

I also will check to make sure I am using the latest drivers and firmware for NIC, SCSI, etc. (they should be the latest because I just rebuilt this machine from scratch and downloaded the various latest drivers at that time - but maybe I missed something). Also consider, what if the latest driver contains the resource leak but the older driver -- on the older Host Monitor machine -- did not? Hmmmm.

KS-Soft · Post by **KS-Soft** » Wed Jan 18, 2006 12:11 pm

So, resource leak is most likely in a system or third-party driver that HM must use in order to perform tests.

I agree.

A large percentage of my tests use either RMA or SNMP. I think I'll ignore tests actually performed by RMA and focus on tests performed by Host Monitor itself.

Right.

I also will check to make sure I am using the latest drivers and firmware for NIC, SCSI, etc. (they should be the latest because I just rebuilt this machine from scratch and downloaded the various latest drivers at that time - but maybe I missed something). Also consider, what if the latest driver contains the resource leak but the older driver -- on the older Host Monitor machine -- did not? Hmmmm.

I am afraid the problem relates to a NEW driver/DLL/system module. You did not have this problem some time ago, which means some update caused it.

I understand it's very hard to catch such a problem. You may try to disable some test methods...
1) I think you may keep TCP, Ping, SMTP, POP, IMAP, DNS, UDP, LDAP, RADIUS, NTP, Trace and HTTP tests. These test methods use only HostMonitor's code and the Windows Winsock API. Winsock is pretty reliable, I don't think the problem is related to Winsock (you don't have any antivirus monitors, content monitoring software or personal firewalls, right?).

2) Try to disable (if possible) or increase the test interval (e.g. perform the test every 30 minutes instead of every 5 min) for the following tests:
- file related tests that check files on remote systems
- UNC tests (network client or driver can lead to the problem)
- URL (some update for IE can be the cause of this error)
- Event Log, CPU Usage, Performance Counter, Service, Process
- SNMP GET, SNMP Trap, Traffic Monitor, ODBC, Active Script.
If you cannot disable some of these tests, you may assign an agent to perform the test (the agent should be installed on a different system).

3) If the problem is fixed, try to enable tests one by one: UNC tests, then file related tests, then SNMP GET, SNMP Trap, Traffic Monitor...

Regards
Alex
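
The "enable tests one by one" procedure in step 3 can be sketched as a simple loop. Everything here is an assumption made for illustration: `measure_growth` stands in for watching \Memory\Pool Nonpaged Bytes for a while with only one group of tests enabled, the growth rates are simulated, and the group that "leaks" in the simulation was picked arbitrarily and says nothing about the real cause.

```python
def find_suspect_groups(groups, measure_growth, threshold_mb_per_hour=1.0):
    """Try one test group at a time; keep those whose pool growth exceeds the threshold."""
    suspects = []
    for group in groups:
        rate = measure_growth(group)  # MB per hour with only this group enabled
        if rate > threshold_mb_per_hour:
            suspects.append((group, rate))
    return suspects


# Simulated measurements, invented for this sketch (not real data).
SIMULATED_RATES = {
    "UNC": 0.0,
    "File": 0.0,
    "SNMP GET": 12.0,  # arbitrary choice of "leaking" group
    "SNMP Trap": 0.0,
    "Traffic Monitor": 0.0,
}


def simulated_growth(group):
    """Hypothetical stand-in for observing the counter over a fixed period."""
    return SIMULATED_RATES[group]


order = ["UNC", "File", "SNMP GET", "SNMP Trap", "Traffic Monitor"]
suspects = find_suspect_groups(order, simulated_growth)
for group, rate in suspects:
    print(f"{group}: pool grows at about {rate} MB/hour")
```

In practice each measurement would take hours rather than milliseconds, so testing whole groups of tests first and narrowing down only within a leaking group keeps the number of rounds manageable.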