The FreeBSD Diary

Providing practical examples since 1998

3Ware Nagios plugin
3 September 2010

I use Nagios to monitor my servers and workstations. If something goes wrong, Nagios usually tells me before I notice the problem myself. A week or so back, I noticed a rather odd RAID problem. Eventually, the problem was solved by upgrading the firmware on the controller.

In the meantime, I had located and installed a Nagios 3ware plugin. I like it and I'm using it on more than one server. However, now that I have turned on AUTO-VERIFY, I've found a spot where I can improve the plugin.

Verifying...!

Earlier today, I turned on AUTO-VERIFY for this controller. Tonight, Nagios is reporting:

Status: UNKNOWN
Status Information: UNKNOWN: 
/c0/u0 RAID-10 VERIFYING - 56% 64K 195.548 ON ON - 
/c0/u1 SPARE VERIFYING - 0% - 69.2404 - ON - 
/c0/u2 SPARE VERIFYING - 0% - 69.2404 - ON - 

If I look at the status output, I see:

$ sudo /usr/local/sbin/tw_cli info c0 u0
Password:

Unit     UnitType  Status         %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-10   VERIFYING      -       62%     -     64K     195.548
u0-0     RAID-1    VERIFYING      62%     -       -     -       -
u0-0-0   DISK      OK             -       -       p0    -       65.1826
u0-0-1   DISK      OK             -       -       p2    -       65.1826
u0-1     RAID-1    VERIFYING      62%     -       -     -       -
u0-1-0   DISK      OK             -       -       p6    -       65.1826
u0-1-1   DISK      OK             -       -       p5    -       65.1826
u0-2     RAID-1    VERIFYING      63%     -       -     -       -
u0-2-0   DISK      OK             -       -       p3    -       65.1826
u0-2-1   DISK      OK             -       -       p4    -       65.1826
u0/v0    Volume    -              -       -       -     -       195.548

Now I'd rather have something other than UNKNOWN. Fortunately, I have the source.

The patch!

This is the patch:

--- /usr/local/libexec/nagios/check_3ware.sh	2010-08-27 02:34:55.000000000 +0100
+++ /home/dan/bin/check_3ware.sh	2010-09-02 01:08:39.000000000 +0100
@@ -66,6 +66,12 @@
 				MSG="$MSG $STATUS -"
 				PREEXITCODE=1
 				;;
+			VERIFYING)
+				CHECKUNIT=`$TWCLI info $i unitstatus | ${GREP} -E "${UNIT[$COUNT]}" | ${AWK} '{print $1,$3,$5}'`
+				STATUS="/$i/$CHECKUNIT"
+				MSG="$MSG $STATUS -"
+				PREEXITCODE=1
+				;;
 			DEGRADED)
 				CHECKUNIT=`$TWCLI info $i unitstatus | ${GREP} -E "${UNIT[$COUNT]}" | ${AWK} '{print $1,$3}'`
 				STATUS="/$i/$CHECKUNIT"

This is what the patched script outputs:

$ sudo ~/bin/check_3ware.sh
WARNING:  /c0/u0 VERIFYING 89% - /c0/u1 VERIFYING 0% - /c0/u2 VERIFYING 0% -
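
To make the awk step in the new VERIFYING branch easier to follow, here is a minimal sketch of the field selection on its own. The sample line and its column layout are an assumption, reconstructed from the Nagios output shown earlier with the "/c0/" prefix removed, and summarize_unit is a hypothetical helper that does not exist in check_3ware.sh.

```shell
#!/bin/sh
# Sketch: pick the unit name, status, and verify percentage out of one
# unit line, the way the patched VERIFYING branch does with awk.

# Assumed sample line, reconstructed from the Nagios output above.
sample_line='u0 RAID-10 VERIFYING - 56% 64K 195.548 ON ON'

# summarize_unit is a hypothetical helper, not part of check_3ware.sh.
# It prints fields 1, 3 and 5 of the line passed as its first argument.
summarize_unit() {
    printf '%s\n' "$1" | awk '{print $1, $3, $5}'
}

# Prefix the controller name, as the plugin does with "/$i/$CHECKUNIT".
controller='c0'
echo "/$controller/$(summarize_unit "$sample_line")"
```

With this sample line the sketch prints "/c0/u0 VERIFYING 56%". Field 5 is the verify percentage, which is why the patch asks awk for $1, $3 and $5 instead of the $1 and $3 used in the DEGRADED branch.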

After replacing the original script, I get this output when testing it from the command line on the Nagios server:

$ /usr/local/libexec/nagios/check_nrpe2 -H supernews-vpn -c check_3ware.sh
WARNING:  /c0/u0 VERIFYING 99% - /c0/u1 VERIFYING 1% - /c0/u2 VERIFYING 0% -

I now see this on my Nagios webpage:

Status: WARNING
Status Information: WARNING:
/c0/u0 VERIFYING 99% - 
/c0/u1 VERIFYING 1% - 
/c0/u2 VERIFYING 0% - 

Other ideas

Tonight I started a battery test. The status immediately went to CRITICAL. That got me thinking about this patch:

$ diff -ruN /usr/local/libexec/nagios/check_3ware.sh ~/bin/check_3ware.sh
--- /usr/local/libexec/nagios/check_3ware.sh    2010-09-02 01:08:39.000000000 +0100
+++ /home/dan/bin/check_3ware.sh        2010-09-02 02:52:39.000000000 +0100
@@ -100,7 +100,7 @@
        # Check BBU's
        BBU=(`$TWCLI info $i |${GREP} -E "^bbu"|${AWK} '{print $1,$2,$3,$4,$5}'`)
        if [ "${BBU[0]}" = "bbu" ]; then
-               if [ "${BBU[1]}" != "On" ] || [ "${BBU[2]}" != "Yes" ] || [ "${BBU[3]}" != "OK" ] || [ "${BBU[4]}" != "OK" ]; then
+               if [ "${BBU[1]}" != "On" ] || [ "${BBU[2]}" != "Yes" ] || { [ "${BBU[3]}" != "OK" ] && [ "${BBU[3]}" != "Testing" ]; } || [ "${BBU[4]}" != "OK" ]; then
                     BBUEXITCODE=2
                     BBUERROR="BBU on $i failed"
                fi
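
The idea behind that change can be tried in isolation. The following is only a sketch under assumptions: bbu_fields_ok is a hypothetical helper, the sample values are made up, and the four arguments stand for BBU[1] through BBU[4] in the order the script compares them. A case statement is used for the status field so that "OK" and "Testing" are both accepted without combining tests inside a single pair of brackets.

```shell
#!/bin/sh
# Sketch: the BBU health test with "Testing" accepted as a valid status.

# bbu_fields_ok is a hypothetical helper, not part of check_3ware.sh.
# Arguments: the four values the script compares, BBU[1] to BBU[4].
# Returns 0 when all four look healthy, 1 otherwise.
bbu_fields_ok() {
    [ "$1" = "On" ] || return 1
    [ "$2" = "Yes" ] || return 1
    # Third value: accept "OK" or a battery test in progress.
    case "$3" in
        OK|Testing) ;;
        *) return 1 ;;
    esac
    [ "$4" = "OK" ] || return 1
    return 0
}

# Hypothetical sample values for a controller running a battery test.
if bbu_fields_ok On Yes Testing OK; then
    echo "BBU healthy"
else
    echo "BBU failed"
fi
```

With the sample values above, the sketch prints "BBU healthy". Any other value in the third position, or a mismatch in the other three, makes the helper return 1.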

I also think I may change the status for VERIFYING from WARNING to OK, because really, everything IS OK. The controller is merely running VERIFY.
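
One way to keep that choice easy to reverse is to hold the state used for VERIFYING in a single variable. This is a sketch only: unit_state and VERIFY_STATE are hypothetical names, the real plugin tracks its result in PREEXITCODE across all units, and the only firm facts here are the standard Nagios plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN).

```shell
#!/bin/sh
# Sketch: map a unit status to a Nagios exit code, with the code used
# for VERIFYING kept in one variable so it is easy to change.

# Standard Nagios plugin exit codes.
STATE_OK=0
STATE_WARNING=1
STATE_CRITICAL=2
STATE_UNKNOWN=3

# Hypothetical setting: report a running verify as OK, not WARNING.
# Set it to $STATE_WARNING to get the behaviour of the patch above.
VERIFY_STATE=$STATE_OK

# unit_state is a hypothetical helper, not part of check_3ware.sh.
# It prints the exit code for the status passed as its first argument.
unit_state() {
    case "$1" in
        OK)        echo "$STATE_OK" ;;
        VERIFYING) echo "$VERIFY_STATE" ;;
        *)         echo "$STATE_UNKNOWN" ;;
    esac
}

unit_state VERIFYING
```

As written, the sketch prints 0 for VERIFYING. A status it does not recognise still maps to 3, which matches the UNKNOWN result seen before the patch.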

FYI: I sent an email to the plugin author before I published this.
