The FreeBSD Diary (TM)
Providing practical examples since 1998
3Ware Nagios plugin
3 September 2010
I use Nagios to monitor my servers and workstations. If something goes wrong, Nagios usually tells me before I notice the problem myself. A week or so ago, I noticed a rather odd RAID problem, which was eventually solved by upgrading the firmware on the controller. In the meantime, I had located and installed a Nagios 3ware plugin. I like it and I'm using it on more than one server. However, now that I have turned on AUTO-VERIFY, I've found a spot where I can improve the plugin.
Verifying...!
Earlier today, I turned on AUTO-VERIFY for this controller. Tonight, Nagios is reporting:

Status: UNKNOWN
Status Information: UNKNOWN: /c0/u0 RAID-10 VERIFYING - 56% 64K 195.548 ON ON - /c0/u1 SPARE VERIFYING - 0% - 69.2404 - ON - /c0/u2 SPARE VERIFYING - 0% - 69.2404 - ON -

If I look at the status output, I see:

$ sudo /usr/local/sbin/tw_cli info c0 u0
Password:

Unit     UnitType  Status     %RCmpl  %V/I/M  Port  Stripe  Size(GB)
------------------------------------------------------------------------
u0       RAID-10   VERIFYING  -       62%     -     64K     195.548
u0-0     RAID-1    VERIFYING  62%     -       -     -       -
u0-0-0   DISK      OK         -       -       p0    -       65.1826
u0-0-1   DISK      OK         -       -       p2    -       65.1826
u0-1     RAID-1    VERIFYING  62%     -       -     -       -
u0-1-0   DISK      OK         -       -       p6    -       65.1826
u0-1-1   DISK      OK         -       -       p5    -       65.1826
u0-2     RAID-1    VERIFYING  63%     -       -     -       -
u0-2-0   DISK      OK         -       -       p3    -       65.1826
u0-2-1   DISK      OK         -       -       p4    -       65.1826
u0/v0    Volume    -          -       -       -     -       195.548

Now I'd rather have something other than UNKNOWN. Fortunately, I have the source.
The patch!
This is the patch:
--- /usr/local/libexec/nagios/check_3ware.sh 2010-08-27 02:34:55.000000000 +0100
+++ /home/dan/bin/check_3ware.sh 2010-09-02 01:08:39.000000000 +0100
@@ -66,6 +66,12 @@
MSG="$MSG $STATUS -"
PREEXITCODE=1
;;
+ VERIFYING)
+ CHECKUNIT=`$TWCLI info $i unitstatus | ${GREP} -E "${UNIT[$COUNT]}" | ${AWK} '{print $1,$3,$5}'`
+ STATUS="/$i/$CHECKUNIT"
+ MSG="$MSG $STATUS -"
+ PREEXITCODE=1
+ ;;
DEGRADED)
CHECKUNIT=`$TWCLI info $i unitstatus | ${GREP} -E "${UNIT[$COUNT]}" | ${AWK} '{print $1,$3}'`
STATUS="/$i/$CHECKUNIT"
This is what it outputs:
$ sudo ~/bin/check_3ware.sh
WARNING: /c0/u0 VERIFYING 89% - /c0/u1 VERIFYING 0% - /c0/u2 VERIFYING 0% -

After replacing the original script, I get this output when testing it from the command line on the Nagios server:

$ /usr/local/libexec/nagios/check_nrpe2 -H supernews-vpn -c check_3ware.sh
WARNING: /c0/u0 VERIFYING 99% - /c0/u1 VERIFYING 1% - /c0/u2 VERIFYING 0% -

I now see this on my Nagios webpage:

Status: WARNING
Status Information: WARNING: /c0/u0 VERIFYING 99% - /c0/u1 VERIFYING 1% - /c0/u2 VERIFYING 0% -
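To see why the patched output is so much shorter than the UNKNOWN message, it may help to run the awk field selection from the VERIFYING branch on its own. The following is only a sketch: the sample line is the /c0/u0 entry from the UNKNOWN report earlier in this article with the /c0/ prefix removed, and summarize_unit is a made-up helper name that is not part of the plugin.

```bash
#!/usr/bin/env bash
# Sample unitstatus line, taken from the UNKNOWN message reported by Nagios
# (assumption: this is what `tw_cli info c0 unitstatus` printed for u0).
sample_line='u0 RAID-10 VERIFYING - 56% 64K 195.548 ON ON'

# Hypothetical helper (not in the plugin): keep fields 1, 3 and 5
# (unit, status, verify percentage) and prefix the controller name,
# the same way the VERIFYING branch builds STATUS.
summarize_unit() {
    local controller=$1 line=$2
    local fields
    fields=$(printf '%s\n' "$line" | awk '{print $1,$3,$5}')
    printf '/%s/%s\n' "$controller" "$fields"
}

summarize_unit c0 "$sample_line"
```

The DEGRADED branch prints only fields 1 and 3, which is why the VERIFYING branch adds field 5: without it the percentage would be dropped from the message.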
Other ideas
Tonight I started a battery test. The status immediately went to CRITICAL. That got me thinking about this patch:
$ diff -ruN /usr/local/libexec/nagios/check_3ware.sh ~/bin/check_3ware.sh
--- /usr/local/libexec/nagios/check_3ware.sh 2010-09-02 01:08:39.000000000 +0100
+++ /home/dan/bin/check_3ware.sh 2010-09-02 02:52:39.000000000 +0100
@@ -100,7 +100,7 @@
# Check BBU's
BBU=(`$TWCLI info $i |${GREP} -E "^bbu"|${AWK} '{print $1,$2,$3,$4,$5}'`)
if [ "${BBU[0]}" = "bbu" ]; then
- if [ "${BBU[1]}" != "On" ] || [ "${BBU[2]}" != "Yes" ] || [ "${BBU[3]}" != "OK" ] || [ "${BBU[4]}" != "OK" ]; then
+ if [ "${BBU[1]}" != "On" ] || [ "${BBU[2]}" != "Yes" ] || { [ "${BBU[3]}" != "OK" ] && [ "${BBU[3]}" != "Testing" ]; } || [ "${BBU[4]}" != "OK" ]; then
BBUEXITCODE=2
BBUERROR="BBU on $i failed"
fi
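One thing to watch with this kind of condition: a single [ ... ] test cannot contain &&, so the "OK or Testing" check needs two tests grouped with braces. Here is a minimal sketch of that grouping which can be run without a controller. The function name bbu_exit_code and the value "Fault" are made up for illustration; only "On", "Yes", "OK" and "Testing" come from the plugin and the patch.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the plugin): takes the four bbu fields the
# plugin compares (BBU[1] through BBU[4]) and prints 2 for CRITICAL, 0 otherwise.
bbu_exit_code() {
    local f1=$1 f2=$2 f3=$3 f4=$4
    # The braces make "neither OK nor Testing" a single term in the || chain.
    if [ "$f1" != "On" ] || [ "$f2" != "Yes" ] ||
       { [ "$f3" != "OK" ] && [ "$f3" != "Testing" ]; } ||
       [ "$f4" != "OK" ]; then
        echo 2
    else
        echo 0
    fi
}

bbu_exit_code On Yes Testing OK   # battery test in progress: not an error
bbu_exit_code On Yes Fault OK     # "Fault" is a made-up value: flagged
```

The braces matter because && and || have equal precedence in the shell and are evaluated left to right, so an ungrouped chain would not mean "neither OK nor Testing".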
I also think I may change the status for VERIFYING from WARNING to OK because, really, everything IS OK: the controller is merely running VERIFY.

FYI: I sent an email to the plugin author before I published this.