This is a long and convoluted thread -- but the end result was that the problem was NOT in Host Monitor. It was eventually discovered to be due to a memory leak in a driver for CommView -- third-party packet sniffing software that was installed on the machine running Host Monitor.
(In fairness, Tamosoft, the vendor of CommView, had apparently already discovered and resolved this leak earlier, but for some reason, our installation had mismatched drivers.)
This issue required nearly 3 weeks of debugging to discover, but persistence eventually paid off. I would like to thank both Alex and Yoorix for their patience and assistance in troubleshooting the issue. As always, Host Monitor rocks.
This incident perhaps serves as a good example of how a problem that appears to be a 'Host Monitor' issue may actually be caused indirectly by other software on the box, over which Host Monitor has no control.
-Tim
Alex:
This is the issue that I've alluded to in a couple of prior posts. Basically, every 24-48 hours, the system that HM is running on appears to run out of memory or some similar memory-related resource. After that, pretty much everything is hosed and a reboot is required.
Something tells me this is going to be difficult to nail down. I'd greatly appreciate any thoughts or suggestions you might have.
First, the facts:
Host Monitor version: 5.70
As I've indicated in previous messages, this is a totally fresh install of Windows 2000 Server, with all the latest patches from MS. This hardware (Dell PE1650) is the same hardware that I've been successfully running HM on for over a year with no problem. (I did replace the hardware with a different PE1650 after this started but it made no difference.)
System Info:
Windows 2000 Server
Version 5.0.2195 Service Pack 4 Build 2195
OS Manufacturer Microsoft Corporation
System Name WATCHDOG
System Manufacturer Dell Computer Corporation
System Model PowerEdge 1650
System Type X86-based PC
Processor x86 Family 6 Model 11 Stepping 1 GenuineIntel ~1390 Mhz
Processor x86 Family 6 Model 11 Stepping 1 GenuineIntel ~1390 Mhz
BIOS Version Phoenix ROM BIOS PLUS Version 1.10 A11
Windows Directory C:\WINNT
System Directory C:\WINNT\system32
Boot Device \Device\Harddisk0\Partition2
Locale United States
Time Zone Central Standard Time
Total Physical Memory 2,096,560 KB
Available Physical Memory 1,772,492 KB
Total Virtual Memory 6,131,640 KB
Available Virtual Memory 5,609,052 KB
Page File Space 4,035,080 KB
Page File C:\pagefile.sys
So, we are pretty sure this is not a hardware issue.
Other than the Windows OS, there are only 2 other pieces of user software on this box. The first is NetTime, used to sync the clocks on all Windows boxes in the company. This is used on over 250 computers here and has never caused an issue. The other is pcAnywhere v 11, used for remote console access. While pcAnywhere does not rank among my favorite programs, it has been fairly stable and has never caused the kind of problem we are seeing here.
On the old hardware, we also ran TrendMicro OfficeScan. Temporarily, we've left that off this new box, thinking that may have been the source of our problem. But... apparently not.
Crash occurs 24-48 hours after reboot.
Our first indication that something is wrong is this error in the Application Event Log:
EventID: 1000
Source: UserEnv
Windows cannot determine the user or computer name. Return value (14).
Within 20 minutes or so after that, the following System Event error occurs:
EventID: 26
Source: Application Popup
Application popup: hostmon.exe - Bad Image : The application or
DLL C:\WINNT\system32\DBMSSOCN.DLL is not a valid Windows image.
Please check this against your installation diskette.
Regarding the first event (as per Microsoft):
Value 14 (Error code 14) = "Not enough storage is available to complete this operation." - Do one of the following, then retry the operation:
1. reduce the number of running programs
2. remove unwanted files from the disk the Paging File is on and restart the system
3. check the paging file disk for an I/O error
4. install additional memory in your system.
Item 1 - Not practical - there are no user programs running other than Host Monitor, NetTime and pcAnywhere.
Item 2 - There's not a lot of garbage on the disk - "C:" drive capacity is 33.8 GB with 30.4 GB free
Item 3 - Ran a CHKDSK on the C drive - no errors found
Item 4 - System already has 2GB RAM, so this is not an insufficient physical memory issue (but likely some memory leak issue).
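One way to test the leak theory would be to log available memory at regular intervals after a reboot and see whether it falls at a steady rate. The following is only a rough sketch in Python, assuming the samples are collected some other way (for example, from Performance Monitor logs). The function names are made up for illustration, and any sample values fed to it would be our own measurements, not anything produced by Host Monitor.

```python
def leak_rate_kb_per_hour(samples):
    """Least-squares slope of available memory against time.

    samples: list of (hours_since_boot, available_kb) pairs.
    Returns KB per hour; a steadily negative value suggests a leak.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    xs = [t for t, _ in samples]
    ys = [kb for _, kb in samples]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    denom = sum((x - x_bar) ** 2 for x in xs)
    if denom == 0:
        raise ValueError("samples must cover more than one point in time")
    numer = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return numer / denom


def hours_until_exhausted(samples):
    """Rough projection of hours left after the last sample.

    Returns None when available memory is not shrinking.
    """
    rate = leak_rate_kb_per_hour(samples)
    if rate >= 0:
        return None
    _, last_kb = samples[-1]
    return last_kb / -rate
```

If the projected time to exhaustion lands in the same 24-48 hour window as the crashes, that would support the leak theory. Logging the same counter per process (or the kernel pool counters) should then help narrow down which component is responsible.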
Also, regarding the second event, we have looked at C:\WINNT\system32\DBMSSOCN.DLL and it appears to be fine - exactly the same size and version info as on our other Win2K boxes.
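Matching size and version info does not strictly prove the file contents are identical, so a hash comparison against a copy from a known-good box would be a stronger check. Here is a minimal sketch, assuming Python is available on some machine that can read both copies. The function names are hypothetical, and the paths passed in would be wherever the two copies of the DLL happen to live.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(path_a, path_b):
    """True when both files have byte-for-byte identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)
```

If the hashes match, the "Bad Image" popup is more likely a side effect of the resource exhaustion than a genuinely damaged DLL.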
What would you suggest for tracking this down further?
Thanks, Tim