The FreeBSD Diary

Providing practical examples since 1998

3Ware Nagios plugin
3 September 2010

I use Nagios to monitor my servers and workstations. When something goes wrong, Nagios usually tells me before I notice the problem myself. A week or so ago, I noticed a rather odd RAID problem. Eventually, the problem was solved by upgrading the firmware on the controller.

In the meantime, I had located and installed a Nagios 3ware plugin. I like it and I'm using it on more than one server. However, now that I have turned on AUTO-VERIFY, I've found a spot where I can improve the plugin.

Verifying...!

Earlier today, I turned on AUTO-VERIFY for this controller. Tonight, Nagios is reporting:

Status: UNKNOWN
Status Information: UNKNOWN: 
/c0/u0 RAID-10 VERIFYING - 56% 64K 195.548 ON ON - 
/c0/u1 SPARE VERIFYING - 0% - 69.2404 - ON - 
/c0/u2 SPARE VERIFYING - 0% - 69.2404 - ON - 

If I look at the status output, I see:

$ sudo /usr/local/sbin/tw_cli info c0 u0
Password:

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-10   VERIFYING      -       62%     -     64K     195.548
u0-0     RAID-1    VERIFYING      62%     -       -     -       -
u0-0-0   DISK      OK             -       -       p0    -       65.1826
u0-0-1   DISK      OK             -       -       p2    -       65.1826
u0-1     RAID-1    VERIFYING      62%     -       -     -       -
u0-1-0   DISK      OK             -       -       p6    -       65.1826
u0-1-1   DISK      OK             -       -       p5    -       65.1826
u0-2     RAID-1    VERIFYING      63%     -       -     -       -
u0-2-0   DISK      OK             -       -       p3    -       65.1826
u0-2-1   DISK      OK             -       -       p4    -       65.1826
u0/v0    Volume    -              -       -       -     -       195.548

Now I'd rather have something other than UNKNOWN. Fortunately, I have the source.

The patch!

This is the patch:

--- /usr/local/libexec/nagios/check_3ware.sh	2010-08-27 02:34:55.000000000 +0100
+++ /home/dan/bin/check_3ware.sh	2010-09-02 01:08:39.000000000 +0100
@@ -66,6 +66,12 @@
 				MSG="$MSG $STATUS -"
 				PREEXITCODE=1
 				;;
+			VERIFYING)
+				CHECKUNIT=`$TWCLI info $i unitstatus | ${GREP} -E "${UNIT[$COUNT]}" | ${AWK} '{print $1,$3,$5}'`
+				STATUS="/$i/$CHECKUNIT"
+				MSG="$MSG $STATUS -"
+				PREEXITCODE=1
+				;;
 			DEGRADED)
 				CHECKUNIT=`$TWCLI info $i unitstatus | ${GREP} -E "${UNIT[$COUNT]}" | ${AWK} '{print $1,$3}'`
 				STATUS="/$i/$CHECKUNIT"

This is what the patched script outputs:

$ sudo ~/bin/check_3ware.sh
WARNING:  /c0/u0 VERIFYING 89% - /c0/u1 VERIFYING 0% - /c0/u2 VERIFYING 0% -

After replacing the original script, I get this output when testing it from the command line on the Nagios server:

$ /usr/local/libexec/nagios/check_nrpe2 -H supernews-vpn -c check_3ware.sh
WARNING:  /c0/u0 VERIFYING 99% - /c0/u1 VERIFYING 1% - /c0/u2 VERIFYING 0% -

I now see this on my Nagios webpage:

Status: WARNING
Status Information: WARNING:
/c0/u0 VERIFYING 99% - 
/c0/u1 VERIFYING 1% - 
/c0/u2 VERIFYING 0% - 
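
The new VERIFYING branch can be tried without a 3ware controller. What follows is a minimal sketch, not the plugin itself: it assumes the unit status lines have the same columns as the UNKNOWN report earlier in this article, and it feeds them in from a here-document instead of calling tw_cli. The function name summarize_units is hypothetical; MSG and PREEXITCODE are used the same way as in the patch.

```shell
#!/bin/sh
# summarize_units CONTROLLER: hypothetical helper for this sketch.
# Reads unit status lines on stdin and sets MSG and PREEXITCODE
# the way the VERIFYING branch of the patch does.
summarize_units() {
    ctl="$1"
    MSG=""
    PREEXITCODE=0
    # Columns: unit, type, status, %RCmpl, %V/I/M, then the rest.
    while read -r unit type status rcmpl vim rest; do
        case "$status" in
            VERIFYING)
                # Unit, status and %V/I/M: the fields awk '{print $1,$3,$5}' selects.
                MSG="$MSG /$ctl/$unit $status $vim -"
                PREEXITCODE=1
                ;;
            *)
                # OK and every other status are left to the real plugin.
                ;;
        esac
    done
}

# Sample lines copied from the UNKNOWN report shown earlier in this article.
summarize_units c0 <<'EOF'
u0 RAID-10 VERIFYING - 56% 64K 195.548 ON ON
u1 SPARE VERIFYING - 0% - 69.2404 - ON
u2 SPARE VERIFYING - 0% - 69.2404 - ON
EOF

if [ "$PREEXITCODE" -eq 1 ]; then
    echo "WARNING: $MSG"
fi
```

Because the function reads from a here-document rather than a pipe, the loop runs in the current shell and MSG is still set when the function returns.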

Other ideas

Tonight I started a battery test. The status immediately went to CRITICAL. That got me thinking about this patch:

$ diff -ruN /usr/local/libexec/nagios/check_3ware.sh ~/bin/check_3ware.sh
--- /usr/local/libexec/nagios/check_3ware.sh    2010-09-02 01:08:39.000000000 +0100
+++ /home/dan/bin/check_3ware.sh        2010-09-02 02:52:39.000000000 +0100
@@ -100,7 +100,7 @@
        # Check BBU's
        BBU=(`$TWCLI info $i |${GREP} -E "^bbu"|${AWK} '{print $1,$2,$3,$4,$5}'`)
        if [ "${BBU[0]}" = "bbu" ]; then
-               if [ "${BBU[1]}" != "On" ] || [ "${BBU[2]}" != "Yes" ] || [ "${BBU[3]}" != "OK" ] || [ "${BBU[4]}" != "OK" ]; then
+               if [ "${BBU[1]}" != "On" ] || [ "${BBU[2]}" != "Yes" ] || { [ "${BBU[3]}" != "OK" ] && [ "${BBU[3]}" != "Testing" ]; } || [ "${BBU[4]}" != "OK" ]; then
                     BBUEXITCODE=2
                     BBUERROR="BBU on $i failed"
                fi
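
The idea behind this change is that a BBU status of Testing should no longer count as a failure. In sh, && cannot appear inside a single [ ] test, so the condition has to be built from two separate tests grouped with braces. Here is a small sketch of that logic that can be run on its own. The function name bbu_exitcode and the sample lines are made up for illustration; the five fields are assumed to be the ones the plugin extracts with awk.

```shell
#!/bin/sh
# bbu_exitcode LINE: hypothetical helper for this sketch.
# LINE is assumed to hold the five fields the plugin prints with awk.
# Prints 2 (CRITICAL) if the BBU looks unhealthy, otherwise 0 (OK).
bbu_exitcode() {
    # Split the line into positional parameters $1..$5.
    set -- $1
    if [ "$1" != "bbu" ]; then
        echo 0      # no BBU line, nothing to check
        return
    fi
    if [ "$2" != "On" ] || [ "$3" != "Yes" ] ||
       { [ "$4" != "OK" ] && [ "$4" != "Testing" ]; } ||
       [ "$5" != "OK" ]; then
        echo 2      # CRITICAL
    else
        echo 0      # OK
    fi
}

# Made-up sample lines, not captured from a real controller.
bbu_exitcode "bbu On Yes OK OK"
bbu_exitcode "bbu On Yes Testing OK"
bbu_exitcode "bbu On Yes Fault OK"
```

The first two calls print 0 and the third prints 2, so a running battery test is accepted while any other non-OK status is still flagged.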

I also think I may change the status for VERIFYING from WARNING to OK, because really, everything IS OK. The controller is merely running VERIFY.
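
One way to do that without losing the progress information is to keep the VERIFYING message but make its severity adjustable. The following is only a sketch of the idea: VERIFY_EXITCODE, state_name and report_verify are hypothetical names, not part of the plugin. The exit codes are the standard Nagios plugin ones.

```shell
#!/bin/sh
# Nagios plugin exit codes: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
# VERIFY_EXITCODE is a hypothetical knob, not part of the plugin:
# 0 reports a running verify as OK, 1 keeps the WARNING behaviour.
VERIFY_EXITCODE=${VERIFY_EXITCODE:-0}

# state_name CODE: print the Nagios label for an exit code.
state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}

# report_verify UNIT PERCENT: show verify progress with the chosen severity.
report_verify() {
    echo "$(state_name "$VERIFY_EXITCODE"): $1 VERIFYING $2 -"
    return "$VERIFY_EXITCODE"
}

report_verify /c0/u0 99%
```

With the default of 0 this prints "OK: /c0/u0 VERIFYING 99% -", so the verify is still visible on the Nagios page but no longer raises a warning.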

FYI: I sent an email to the plugin author before I published this.

