NetApp free space monitoring

bmekler
Posts: 38
Joined: Tue Apr 17, 2012 4:51 am

NetApp free space monitoring

Post by bmekler »

I'm monitoring several NetApp clusters that frequently have volumes added and removed. Among other things, I monitor SNMP table 1.3.6.1.4.1.789.1.5.4.1.6 (.iso.org.dod.internet.private.enterprises.netapp.netapp1.filesys.dfTable.dfEntry.dfPerCentKBytesCapacity) for entries exceeding 80 (volumes or aggregates over 80% full). The one caveat is that every NetApp controller has a small system aggregate with a thick-provisioned system volume sized to 95% of that aggregate; this is hardcoded into the system and cannot be changed, so I have the test filter out entries that are equal to 95. This had been working well - until I upgraded to ONTAP 9.5. Now, some of the '.snapshot' entries which are also part of this table report their space utilization percentage as hundreds of percent instead of zero, and this screws up the output.

I have tried moving to the 'Drive Free Space (NetApp)' test, but that one does not seem to have a way to filter out the system aggregates - I can set it to monitor specific volumes/aggregates, but not to exclude something I don't want monitored, so I had to set it to alert if free space falls below 4%, which is less than ideal.
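
For reference, the check boils down to something like the rough script below. It's only a sketch: the hostname and community string are placeholders, and it assumes Net-SNMP's snmpwalk is installed on the monitoring box.

#!/usr/bin/env python3
# Rough sketch of the capacity check described above.
# Assumptions: Net-SNMP's snmpwalk is installed, the filer answers SNMP v2c
# with community 'public', and 'filer.example.com' is a placeholder hostname.
import subprocess

DF_PCT_OID = "1.3.6.1.4.1.789.1.5.4.1.6"   # dfPerCentKBytesCapacity
HOST = "filer.example.com"                  # placeholder

def walk_percent_used(host):
    """Return {row_index: percent_used} for every dfTable entry."""
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", "public", "-On", host, DF_PCT_OID],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = {}
    for line in out.splitlines():
        # Lines look like: .1.3.6.1.4.1.789.1.5.4.1.6.<row> = INTEGER: 37
        oid, _, value = line.partition(" = ")
        if "INTEGER" in value:
            rows[oid.rsplit(".", 1)[-1]] = int(value.split(":")[1])
    return rows

if __name__ == "__main__":
    for row, pct in walk_percent_used(HOST).items():
        if pct == 95:      # thick-provisioned system volume, always 95% - skip it
            continue
        if pct > 80:       # alert threshold
            print(f"row {row}: {pct}% used - over threshold")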

Would it be possible to add more robust filtering capabilities to the 'Drive Free Space (NetApp)' test?

Also, on the subject of NetApp, how feasible would it be to add a test to monitor SnapMirror/SnapVault relationship health status and lag thresholds? Right now, I'm monitoring specific rows in the 1.3.6.1.4.1.789.1.29.1.1 (.iso.org.dod.internet.private.enterprises.netapp.netapp1.sm.snapmirrorRelStatusTable.snapmirrorRelStatusEntry) table, but configuring those and matching the row numbers to the actual SnapMirror sources/destinations is time-consuming.
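
For illustration, matching rows by hand amounts to dumping that table grouped by row index. A rough sketch of how that could be scripted (same assumptions as above - Net-SNMP's snmpwalk, SNMP v2c with community 'public', placeholder hostname; the meaning of each column still has to be looked up in the NetApp MIB):

#!/usr/bin/env python3
# Dump the SnapMirror status table grouped by row index so that row numbers
# can be matched to relationships. Column meanings are not assumed here -
# check them against the NetApp MIB.
import subprocess
from collections import defaultdict

SM_TABLE_OID = "1.3.6.1.4.1.789.1.29.1.1"   # snapmirrorRelStatusEntry
HOST = "filer.example.com"                   # placeholder

def dump_snapmirror_rows(host):
    """Return {row_index: {column: raw_value}} for the whole table."""
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", "public", "-On", host, SM_TABLE_OID],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = defaultdict(dict)
    prefix_len = len(SM_TABLE_OID) + 2       # leading '.' plus trailing '.'
    for line in out.splitlines():
        oid, _, value = line.partition(" = ")
        column, _, row = oid[prefix_len:].partition(".")
        rows[row][column] = value.strip()
    return rows

if __name__ == "__main__":
    for row, columns in sorted(dump_snapmirror_rows(HOST).items()):
        print(f"row {row}:")
        for col, val in sorted(columns.items(), key=lambda kv: int(kv[0])):
            print(f"  column {col}: {val}")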

Thank you.
KS-Soft
Posts: 12821
Joined: Wed Apr 03, 2002 6:00 pm
Location: USA
Contact:

Post by KS-Soft »

bmekler wrote: Now, some of the '.snapshot' entries which are also part of this table report their space utilization percentage as hundreds of percent instead of zero, and this screws up the output.
Maybe this happens because snapshots exceed the reserve space and spill into the active file system?

Also, if you want to skip some volumes, you may choose the "Check listed drive(s)" option and use a list; you may specify a wildcard for some items, e.g. /vol/vol1*
Will this suit your needs?
bmekler wrote: Also, on the subject of NetApp, how feasible would it be to add a test to monitor SnapMirror/SnapVault relationship health status and lag thresholds?
We have to check; maybe we will add such a test in version 12.

Regards
Alex
bmekler
Posts: 38
Joined: Tue Apr 17, 2012 4:51 am

Post by bmekler »

KS-Soft wrote: Maybe this happens because snapshots exceed the reserve space and spill into the active file system?
Huh, you're right - I always set volumes to 0% snapshot space reservation in order to prevent this, but this upgrade reset a number of volumes back to the default 5%. These volumes were all part of a vSphere (vsi_on_nas) application, so that's probably what did it. I set them back to 0% reservation and this fixed the issue.
KS-Soft wrote: Also, if you want to skip some volumes, you may choose the "Check listed drive(s)" option and use a list; you may specify a wildcard for some items, e.g. /vol/vol1*
Will this suit your needs?
As I understand it, this includes volumes rather than excludes them, so I would need to set up over 200 independent tests, then maintain them as volumes are added and removed, which is not really feasible. In any case, after fixing the snapshot reservation, my original test is working again.
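
For clarity, what I'm after is an exclude list - roughly the opposite of the current option. Something along these lines (again only a sketch: hostname, community string and the exclusion patterns are made up, and I'm assuming dfFileSys, the volume/aggregate name, is column .2 of dfEntry - worth double-checking against the NetApp MIB):

#!/usr/bin/env python3
# Exclude-list sketch: walk volume names alongside the capacity column and
# skip anything matching an exclusion pattern (e.g. system aggregates).
import fnmatch
import subprocess

DF_NAME_OID = "1.3.6.1.4.1.789.1.5.4.1.2"   # dfFileSys (assumed column number)
DF_PCT_OID  = "1.3.6.1.4.1.789.1.5.4.1.6"   # dfPerCentKBytesCapacity
HOST = "filer.example.com"                   # placeholder
EXCLUDE = ["aggr0*", "*/.snapshot", "*vol0*"]  # example exclusion patterns

def walk(host, oid):
    """Return {row_index: value_text} for one dfTable column."""
    out = subprocess.run(
        ["snmpwalk", "-v2c", "-c", "public", "-On", host, oid],
        capture_output=True, text=True, check=True,
    ).stdout
    rows = {}
    for line in out.splitlines():
        full_oid, _, value = line.partition(" = ")
        rows[full_oid.rsplit(".", 1)[-1]] = value.partition(":")[2].strip().strip('"')
    return rows

if __name__ == "__main__":
    names = walk(HOST, DF_NAME_OID)
    usage = walk(HOST, DF_PCT_OID)
    for row, name in names.items():
        if any(fnmatch.fnmatch(name, pat) for pat in EXCLUDE):
            continue                        # explicitly excluded volume/aggregate
        pct = int(usage.get(row, "0"))
        if pct > 80:
            print(f"{name}: {pct}% used - over threshold")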
bmekler wrote: Also, on the subject of NetApp, how feasible would it be to add a test to monitor SnapMirror/SnapVault relationship health status and lag thresholds?
KS-Soft wrote: We have to check; maybe we will add such a test in version 12.

Thank you. If you need access to a test system, I may be able to help you out.