KS-Soft. Network Management Solutions

Overall SLA based on dependent tests

 
peterjwest
Joined: 28 Jul 2008
Posts: 17

Posted: Thu Mar 21, 2013 7:33 am    Post subject: Overall SLA based on dependent tests

Hi,

I'm trying to achieve something specific using HostMonitor, but I have no idea if it's possible or not. Hopefully some kind soul on here might be able to give me a clue.

We have a complex Citrix environment which we would like to be able to give an 'overall' SLA report for.

My original idea was to make tests for each of the elements of the system that may affect the uptime of the system overall. I could then build the dependency hierarchy using the dependency section of the HostMonitor test, and this would ultimately mean that we could just report on the top-level test to get an overview of the uptime.

Unfortunately what actually seems to happen is that if a dependent test goes 'bad' then the parent test just doesn't run until it goes 'good' again. What this means is that the very top level test thinks we have 100% uptime when in fact we have had periods of downtime lower down the structure. They just don't get picked up because the top level test doesn't run.

I'm not sure I've explained this very well - but hopefully you get the idea. If not then please let me know and I'll try to give a better explanation.

Thanks

Pete
KS-Soft
Joined: 03 Apr 2002
Posts: 11852
Location: USA

Posted: Thu Mar 21, 2013 7:51 am

Quote:
Unfortunately what actually seems to happen is that if a dependent test goes 'bad' then the parent test just doesn't run until it goes 'good' again.

Actually, it works the other way around.
E.g. you set up a Ping test as the Master test for CPU Usage, Service, and Process tests. If the Ping test returns "No answer" status, the dependent tests will not be performed.
If you mark the "Synchronize counters" option for the CPU Usage, Service, and Process tests, then HostMonitor will increment the "Bad" counters for the dependent tests as well.
If you want an SLA report with "uptime" figures for the dependent test items, I do not see any problem.

Quote from the manual:
Synchronize counters
This option only applies to tests that have one or more master tests. When the option is turned off and some test is not launched because its launch condition has not been met, HostMonitor simply marks such a test with the "Wait for Master" status and does not change any counters. If, however, the option is turned on, HostMonitor will update statistics information according to the Master test status. Thus, if a router on which other tests depend has been tested to a "No answer" status, HostMonitor will increment the respective counters (like "Dead time", "Failed tests", etc.) for the router and for all dependent tests that have the "Synchronize counters" option on.
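To make the documented behaviour concrete, here is a minimal Python sketch. The `Test` class and `run_cycle` function are invented for illustration only; this is not HostMonitor code, just a model of the counter logic the manual describes:

```python
# Hypothetical sketch of the "Synchronize counters" behaviour described above.
class Test:
    def __init__(self, name, sync_counters=False, master=None):
        self.name = name
        self.sync_counters = sync_counters
        self.master = master
        self.failed = 0          # "Failed tests" counter
        self.status = "Ok"

def run_cycle(tests, results):
    """results maps test name -> True (good) / False (bad)."""
    for t in tests:
        if t.master is not None and t.master.status == "No answer":
            # Dependent test is skipped while its master is down...
            t.status = "Wait for Master"
            if t.sync_counters:
                t.failed += 1    # ...but its counters still track the outage
            continue
        ok = results[t.name]
        t.status = "Ok" if ok else "No answer"
        if not ok:
            t.failed += 1

ping = Test("Ping")
cpu = Test("CPU Usage", sync_counters=True, master=ping)
svc = Test("Service", sync_counters=False, master=ping)

run_cycle([ping, cpu, svc], {"Ping": False, "CPU Usage": True, "Service": True})
print(ping.failed, cpu.failed, svc.failed)  # 1 1 0
```

With "Synchronize counters" off (the `svc` test), the outage never shows up in that test's statistics - which is exactly the gap Pete ran into.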


Regards
Alex
peterjwest
Joined: 28 Jul 2008
Posts: 17

Posted: Fri Apr 05, 2013 4:52 am

Hi Alex,

Thanks for taking the time to reply - given your response I think I understand the setup a bit better now.

But the issue I now face is that I can only have 20 dependent tests - and it looks like making a hierarchy to handle the limit won't work.

I wish to aggregate the results from tests over 4 hosts and on each host I have 7 or 8 different tests. This gives me a total of 30+.

My original idea was to make it so that the primary test on each host would be a 'ping' test and this would be dependent on the other tests for that host. It seems to work because I can see that if the disk space test (which is a 'child' of the ping) fails then the failure count also updates on the Ping test.

But the problem is that the top-level test doesn't also update. So although hosts 1, 2, 3 and 4 may have failure counts of 5, 5, 10 and 5 respectively, we only ever have a failure count of zero at the top level (in fact the count at the top level won't match any of these values, because it should only increment by 1 irrespective of how many child tests have failed).

So to generate our SLA report for Citrix, it would seem we need to limit the number of tests for the SLA to no more than 20.

I don't know if I'm missing something, so if you have any more ideas they would be most welcome.

Thanks

Pete
KS-Soft
Joined: 03 Apr 2002
Posts: 11852
Location: USA

Posted: Fri Apr 05, 2013 7:24 am

Sorry, I don't understand you.
There is no limit on dependent tests: 100,000 test items can depend on a single master test or on several master tests. Maybe you are using the word "dependent" where you mean "master"?

Also, I do not understand what the master-dependent relation has to do with your report. If you want to create a report for some group of tests, put these tests into a folder and use the folder-level options (or the "Generate reports" action) to create a report for those tests.

Regards
Alex
peterjwest
Joined: 28 Jul 2008
Posts: 17

Posted: Fri Apr 05, 2013 9:41 am

I totally understand your confusion - it's proving hard for me to explain.

The issue is that if you perform an SLA Report on a number of tests that relate to a single system, then you don't really get a true picture of the system's availability.

If, for example, you monitor a number of Services and then also perform a Ping test then all of those will show downtime if the Server isn't online.

What I'm trying to achieve is a single 'parent' test that will give the overall availability of the system based on a number of dependent tests.

The concept is that if one Service is offline then the parent would reflect that. If two Services were down then the parent would still show the same result because it doesn't matter if one or two services is offline, the failure of either one would ultimately result in downtime of the system.

I'm basically trying to build a structure which gives a very simple SLA report for a complex system - the people seeing these reports don't care which component of the system was down - they just want to know what percentage uptime we have on our Citrix environment for the month of January, February or whatever.

The synchronisation of counter data appears to always be passed 'up' the chain of tests, so the only way for me to do what I'm attempting is to make sure that my top-level test is dependent on the tests below it. And this is where the 20-item limit kicks in.
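The roll-up being asked for here reduces to a few lines of logic. This Python sketch (with invented check names and sample data, nothing HostMonitor-specific) shows the "environment is down if ANY check is bad" rule:

```python
# Sketch of the desired roll-up: the environment counts as down for a
# given interval if ANY component check was bad during that interval.
# The sample data below is invented for illustration.

def overall_uptime(intervals):
    """intervals: list of dicts mapping check name -> True (good) / False (bad)."""
    up = sum(1 for checks in intervals if all(checks.values()))
    return 100.0 * up / len(intervals)

samples = [
    {"ping": True,  "disk": True,  "citrix_svc": True},   # all good -> up
    {"ping": True,  "disk": False, "citrix_svc": True},   # one bad  -> down
    {"ping": True,  "disk": False, "citrix_svc": False},  # two bad  -> still just one down interval
    {"ping": True,  "disk": True,  "citrix_svc": True},   # all good -> up
]
print(overall_uptime(samples))  # 50.0
```

Note how the interval with two failing checks counts as exactly one down interval, matching Pete's requirement that the parent should not increment once per failing child.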
KS-Soft
Joined: 03 Apr 2002
Posts: 11852
Location: USA

Posted: Fri Apr 05, 2013 9:47 am

Quote:
The issue is that if you perform an SLA Report on a number of tests that relate to a single system then you don't really get a true picture of the systems availability.
If, for example, you monitor a number of Services and then also perform a Ping test then all of those will show downtime if the Server isn't online.

IMHO it's a true picture - if the server does not respond, then it's not available.

Quote:
The concept is that if one Service is offline then the parent would reflect that. If two Services were down then the parent would still show the same result because it doesn't matter if one or two services is offline, the failure of either one would ultimately result in downtime of the system

Do you mean you need an item that will be "Bad" when ANY of the checks for some specific server is "Bad",
and that should be "Good" when ALL checks for that server are "Good"?

Regards
Alex
Back to top
View user's profile Send private message Send e-mail Visit poster's website
peterjwest



Joined: 28 Jul 2008
Posts: 17

PostPosted: Fri Apr 05, 2013 10:15 am    Post subject: Reply with quote

KS-Soft wrote:
Do you mean you need an item that will be "Bad" when ANY of the checks for some specific server is "Bad",
and that should be "Good" when ALL checks for that server are "Good"?

That's pretty much it.

We don't care how many checks fail - if any of them fail, the environment is not available.
KS-Soft
Joined: 03 Apr 2002
Posts: 11852
Location: USA

Posted: Fri Apr 05, 2013 10:33 am

Then you may create one additional test with a predefined "good" result (e.g. Ping localhost) and use the following options:
- This test depends on expression: (%FolderCurrent_BadTests% + %FolderCurrent_UnknownTests% == 0) or ((%FolderCurrent_BadTests%==1) and ("%Status%"=="Bad"))
- Otherwise status: Bad
- Synchronize counters: On
- Synchronize status and alerts: On

This test will have "Bad" status when ANY other test within the folder (the folder where the test is located) has Bad or Unknown status.
This test will have "Host is alive" status when ALL other tests within the folder have "Host is alive", "Ok", "Normal", "Disabled", etc. statuses.
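A rough sketch of how that expression behaves: the %FolderCurrent_BadTests%, %FolderCurrent_UnknownTests% and %Status% macros are HostMonitor's; the evaluator below is an invented illustration, not product code:

```python
# Illustrative model of the "depends on expression" logic described above.
# bad/unknown stand in for %FolderCurrent_BadTests% / %FolderCurrent_UnknownTests%,
# own_status for the sentinel test's own %Status%.

def sentinel_outcome(bad, unknown, own_status):
    # (Bad + Unknown == 0): every test in the folder is fine -> run normally.
    # (Bad == 1 and own status is Bad): the only bad test is the sentinel
    # itself, so keep running it to let it recover to good.
    expr = (bad + unknown == 0) or (bad == 1 and own_status == "Bad")
    # When the expression is false, the "Otherwise status: Bad" option applies.
    return "run" if expr else "Bad"

print(sentinel_outcome(0, 0, "Good"))  # run -> all folder tests good
print(sentinel_outcome(2, 0, "Good"))  # Bad -> some other tests are bad
print(sentinel_outcome(1, 0, "Bad"))   # run -> only the sentinel itself is bad
```

With "Synchronize counters" on, the sentinel's dead-time statistics then track the whole folder, which is exactly the single top-level figure the SLA report needs.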

Regards
Alex