Escalation process for UN-Acknowledged alerts and repeats

Need new test, action, option? Post request here.
Svend
Posts: 38
Joined: Fri Sep 23, 2005 5:00 pm

Post by Svend »

It would also be good to be able to set up criteria to pass an alert on to another notification mechanism or person if the alert goes unacknowledged for a particular period of time, or if an event keeps repeating over a defined period.

Simple example: if the hard disk gets too full, send an email to the helpdesk. If the alert is not acknowledged for three days, start sounding the beeper on the HostMonitor server and send an SMS email alert to the Group IT Manager saying that nobody has acknowledged the issue.

Complex example: the Database Server's CPU sticks at 90% for 40 min (checked every 4 min), and an alert notification is sent to the IT Helpdesk. If this occurs five times within a two-week period, a notification is sent to the Group IT Manager letting her know that the IT Helpdesk is struggling to come to terms with a CPU issue.

Svend.
KS-Soft
Posts: 13012
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

> Svend: It would also be good to be able to set up criteria to pass an alert on to another notification mechanism or person if the alert goes unacknowledged for a particular period of time, or if an event keeps repeating over a defined period.

You may use "advanced" mode actions and the following macro variables:
%AcknowledgedBy%
%AckRecurrences%
%Recurrences%

Quote from the manual:
> This variable is useful when you want to configure alert profile to launch different actions for "non-acknowledged" and "acknowledged" test items. E.g. if condition to trigger action execution looks like ('%SimpleStatus%'=='DOWN') and (%AckRecurrences%==1), HostMonitor will start action when user confirmed status and test failed once again (after acknowledgement).

The 2nd example is hard to implement within the current HM architecture.
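
For the first, simpler example, the decision these variables allow can be modelled roughly as below. This is only a Python sketch of the logic, not HostMonitor expression syntax. The function name, the returned labels and the threshold of 3 are hypothetical, and it assumes that %AcknowledgedBy% is empty while nobody has acknowledged the item and that %Recurrences% is 1 on the first failing check; the manual should be checked on both points.

```python
def escalation_action(simple_status: str, acknowledged_by: str,
                      recurrences: int, escalate_after: int = 3) -> str:
    """Model of the 'disk full' escalation; all names here are illustrative."""
    if simple_status != "DOWN":
        return "none"             # test is fine, nothing to do
    if acknowledged_by:
        return "none"             # somebody has taken ownership of the alert
    if recurrences >= escalate_after:
        return "escalate"         # beeper + message to the Group IT Manager
    if recurrences == 1:
        return "notify_helpdesk"  # first failure: the normal helpdesk email
    return "none"                 # still failing, but not long enough yet


# With a test that runs once a day, escalate_after=3 is about three days.
print(escalation_action("DOWN", "", 1))
print(escalation_action("DOWN", "", 3))
```

With `>=` the escalation repeats on every check until someone acknowledges the alert; use `==` instead if it should fire only once.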

Regards
Alex
Kapz
Posts: 216
Joined: Mon Dec 06, 2004 2:33 pm
Location: Denmark

Post by Kapz »

Svend,

> Alex: 2nd example is hard to implement within current HM architecture

You can probably accomplish this through a little workaround. Below is an example of how this could be done - we use a lot of these funny constructions ;)

>This occurs five times within a two week period - notification gets sent to
> Group IT Manager letting her know that IT Helpdesk are struggling to
> come to terms with a CPU issue.

The alert profile that sends the "alert notification to IT Helpdesk" (triggered because "Database Server's CPU sticks at 90% for 40min checked every 4 min") also calls a batch file that creates a text file in C:\MyDir with a random number as its file name, e.g. 28461.txt (simply echo > C:\MyDir\%RANDOM%.txt).
Once every day, a scheduled task calls a utility that checks for files in C:\MyDir with time stamps older than 14 days and deletes them (deltmpfiles.exe can do this for you).
Also once every day, a check in HostMonitor counts the number of files in C:\MyDir. If the result is 5 or more, an alert profile notifies your Group IT Manager.

That's it. It might not be very sophisticated, but it'll do the job. The time stamps on the .txt files will also show your Group IT Manager when the CPU was under heavy load, which might be useful when tracking down activity on the SQL Server.
Once she has acknowledged the alert, your Group IT Manager can simply delete the .txt files and thus "reset" the error count.
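
If you would rather keep the three pieces in one script than combine a batch file, deltmpfiles.exe and a file-count test, the same marker-file idea can be sketched in Python. This is a minimal sketch, not a tested HostMonitor integration: the function names are made up for illustration, and it uses a UUID instead of %RANDOM% for the file name, since %RANDOM% only covers 0 to 32767 and could in principle repeat.

```python
import time
import uuid
from pathlib import Path
from typing import Optional

WINDOW_SECONDS = 14 * 24 * 60 * 60  # the two-week period from the example
THRESHOLD = 5                       # occurrences before the manager is told


def record_occurrence(marker_dir: Path) -> Path:
    """Create one empty marker file; its time stamp records when the alert fired."""
    marker_dir.mkdir(parents=True, exist_ok=True)
    marker = marker_dir / f"{uuid.uuid4().hex}.txt"
    marker.touch()
    return marker


def prune_old_markers(marker_dir: Path, now: Optional[float] = None) -> int:
    """Delete markers older than the window; return how many were removed."""
    now = time.time() if now is None else now
    removed = 0
    for marker in marker_dir.glob("*.txt"):
        if now - marker.stat().st_mtime > WINDOW_SECONDS:
            marker.unlink()
            removed += 1
    return removed


def should_escalate(marker_dir: Path) -> bool:
    """True when THRESHOLD or more markers remain inside the window."""
    return sum(1 for _ in marker_dir.glob("*.txt")) >= THRESHOLD
```

The alert action would call `record_occurrence(Path(r"C:\MyDir"))`, and the daily job would call `prune_old_markers` followed by `should_escalate`. Deleting the .txt files by hand still resets the count, exactly as in the batch version.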

HTH

Kasper :O)