AntonyP wrote:I believe that the warning status should not be added, simply because it would be easier to set the alarm trigger at an earlier stage.
E.g. on a disk usage test
hard disk has 2gb free space
I set 1gb free space alarm for HM
Now, what would be the meaning of having the warning status at 1.2gb? I can set the alarm at 1.2gb instead...
The difference is getting a phone call early in the morning for all red statuses.
So there is a need to separate warnings from real problems.
A warning means you have to do something to avoid a real problem, but not immediately.
I posed this question last year, and it is similar to the requests being made here. The vast majority of the tests that we have are performance-based tests (not fault-based tests). It would be very handy to have a qualification for when these types of tests actually go into alarm or are marked as 'Bad'. As some of the other users on this forum have pointed out, many of our IT staff are no longer looking at the alarms, since there are so many performance-based tests in alarm at any given time. We have about 15 alarms at any given time, mostly performance-based (CPU, Disk time, Page faults, etc.). Most are from different servers each minute (so the test is only in alarm during a single test interval), but they typically average only a few faults a day.
My thoughts are to add a feature where the person setting up the tests could decide how many consecutive test results would need to be in a 'bad' state before the test is set to bad, while keeping the same test interval. For example, if a server's CPU was at 100% for 10 minutes in a row (that is, 10 test cycles at 60-second intervals), then set the test to 'bad', rather than going bad the first time it gets a 100% test result.
I know that you stated that this would require additional coding to the core functionality. However, my organization is thinking of moving away from HM, and our two licenses, because they feel the product does not deal effectively with performance-based tests.
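The idea of requiring several consecutive bad results can be sketched roughly as follows. This is only an illustration in Python, not HostMonitor code; the class name `ConsecutiveFailureGate` and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveFailureGate:
    """Report 'bad' only after `required` consecutive failing probes.

    Hypothetical helper for illustration; not a HostMonitor API.
    """
    required: int
    streak: int = 0

    def update(self, probe_failed: bool) -> str:
        # A passing probe resets the streak; a failing one extends it.
        self.streak = self.streak + 1 if probe_failed else 0
        return "bad" if self.streak >= self.required else "ok"


gate = ConsecutiveFailureGate(required=10)
# Nine failing probes in a row are not enough to raise the alarm.
first_nine = [gate.update(True) for _ in range(9)]
tenth = gate.update(True)            # the 10th consecutive failure trips it
after_recovery = gate.update(False)  # one good probe clears the streak
```

With a 60-second test interval, `required=10` corresponds to the "100% for 10 minutes in a row" example; a single spike never reaches the threshold.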
Probably we should keep the "basic" scheme as is and provide the ability to set additional statuses using expressions, just like we did with "standard" and "advanced" actions.
E.g. implement 2 new statuses and 2 options
[x] Use expression to set Warning status
[ ] Use expression to set Normal status
So you will be able to use expressions like ('%Reply%'>'70 %') and ('%Reply%'<'90 %') and ('%MainRouter::SimpleStatus%'=='UP')
Warning/Normal statuses will be handled just like other bad/good statuses for statistics purposes. But such items can be displayed in a different color, and HostMonitor may apply a different sorting order or generate separate HTML reports.
This way we keep "basic" setup simple enough and provide great flexibility when you really need that.
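A rough sketch of how an optional expression could sit on top of the basic scheme, written in Python purely for illustration. The function names, the status strings, and the assumption that a matching expression overrides the basic status are all hypothetical and not taken from HostMonitor.

```python
from typing import Callable, Optional

# Hypothetical type for illustration only; not HostMonitor's API.
WarningRule = Callable[[float, str], bool]


def resolve_status(basic_status: str, reply: float, router_status: str,
                   warning_rule: Optional[WarningRule] = None) -> str:
    """Return the basic status unless an optional warning rule matches.

    Assumption: a matching rule overrides whatever the basic scheme decided.
    """
    if warning_rule is not None and warning_rule(reply, router_status):
        return "Warning"
    return basic_status


def cpu_rule(reply: float, router_status: str) -> bool:
    # Mirrors the example expression: 70 < Reply < 90 and the router is UP.
    return 70 < reply < 90 and router_status == "UP"


with_rule = resolve_status("Ok", 80.0, "UP", cpu_rule)
without_rule = resolve_status("Ok", 80.0, "UP")
```

Tests without a rule behave exactly as before, which is the point of keeping the basic setup untouched.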
- and it will be possible to set the test to bad after the 5th test between 70 and 90 % CPU ?
- So we can have an HTML report only showing warning and bad tests in different colours ?
- and it will be possible to set the test to bad after the 5th test between 70 and 90 % CPU ?
H'm..
- an expression like "('%Reply%'>'70 %') and ('%Reply%'<='90 %')" will set Warning status when CPU Usage is between 70 and 90 % (Bad status if CPU Usage is over 90%)
- an expression like "('%SimpleStatus%'=='DOWN') and (%Recurrences%<5)" will set Warning status for the 1st..4th failed probe (the 5th failed probe will use Bad status)
- you may combine conditions, e.g. "('%Reply%'>'70 %') and ('%Reply%'<='90 %') and ('%SimpleStatus%'=='DOWN') and (%Recurrences%<5)".
But it's impossible to combine them in your way (HostMonitor does not keep a history of all previous Reply values, except in the log of course), unless Warning status resets Recurrences. In that case we would need to redesign action-related behaviour (we don't really want to do that until version 8 or so).
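The Recurrences-based expression above can be pictured with a small Python sketch. It assumes the counter is 1 on the first failed probe and increases by one for each further consecutive failure; the function name and status strings are hypothetical.

```python
def status_for_probe(simple_status: str, recurrences: int) -> str:
    """Warning for the 1st..4th consecutive failed probe, Bad from the 5th.

    Assumption: `recurrences` is 1 on the first failure and grows by one
    per consecutive failure. Illustrative only; not HostMonitor code.
    """
    if simple_status == "DOWN":
        return "Warning" if recurrences < 5 else "Bad"
    return "Ok"


# Six failed probes in a row: four warnings, then Bad.
statuses = [status_for_probe("DOWN", n) for n in range(1, 7)]
```

The sketch only needs the current status and one counter, which is why this form works without a history of earlier Reply values.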
- So we can have an HTML report only showing warning and bad tests in different colours ?