The FreeBSD Diary

Monitor your 3Ware battery backup unit (BBU) 3 September 2010

NOTE: This was originally intended to be published on 2006-09-04. But. Umm. I forgot. It's been here, on the server, but never published. Tonight, I fixed that. :) Just over 4 years to the day.

NetSaint is a well known and long-established network monitor. Sure, it has been superseded by Nagios, but that's no reason to change! :) Really. I've been using NetSaint for a number of years, and I've had no desire to move to Nagios.

I started using NetSaint in 2001 and have written a few plug-ins for it. This past week, I ran a test on the battery backup unit. It was then that I decided I should monitor the BBU. This would build upon the plug-in I wrote last month. The things we do when the rest of the family is out of town....

Checking the BBU status

The easiest way to find out the BBU status is through the 3Ware CLI. I do it like this:

# tw_cli /c0/bbu show

Name  OnlineState  BBUReady  Status    Volt     Temp     Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           No        Charging  OK       OK       255    30-Aug-2006

#

You can see that the BBU is charging and it's not ready yet. You can also see that I've run the capacity test and the unit is good for 255 hours.

There is also a longer version:

# tw_cli /c0/bbu show all
/c0/bbu Firmware Version          = BBU: 2.00.00.011
/c0/bbu Serial Number             = L021902B6081199
/c0/bbu Online State              = On
/c0/bbu BBU Ready                 = No
/c0/bbu BBU Status                = Charging
/c0/bbu Battery Voltage           = OK
/c0/bbu Battery Temperature       = OK
/c0/bbu Estimated Backup Capacity = 255 Hours
/c0/bbu Last Capacity Test        = 30-Aug-2006
/c0/bbu Battery Installation Date = 09-Aug-2006
/c0/bbu Bootloader Version        = BBU 0.02.02.001
/c0/bbu PCB Revision              = 65

#
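
As a rough illustration of how the longer listing could be parsed, here is a minimal sketch. The sub name parse_bbu_show_all is hypothetical and is not part of the plug-in; the sample text is a few lines copied from the listing above, where a real script would read the command output instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: parse "tw_cli /c0/bbu show all" style output into
# a hash. Each line looks like:  /c0/bbu Online State              = On
sub parse_bbu_show_all {
    my ($text) = @_;
    my %bbu;
    foreach my $line (split /\n/, $text) {
        # Capture the field name and the value on either side of the "="
        if ($line =~ m{^/c\d+/bbu\s+(.+?)\s*=\s*(.*?)\s*$}) {
            $bbu{$1} = $2;
        }
    }
    return %bbu;
}

# Sample lines copied from the listing above; a real script would read
# the output of the command rather than a hard-coded string.
my $sample = <<'END';
/c0/bbu Online State              = On
/c0/bbu BBU Ready                 = No
/c0/bbu BBU Status                = Charging
/c0/bbu Estimated Backup Capacity = 255 Hours
END

my %bbu = parse_bbu_show_all($sample);
print "Status: $bbu{'BBU Status'}, Ready: $bbu{'BBU Ready'}\n";
```

Keying the hash on the field name as printed by tw_cli keeps the sketch independent of the order of the lines.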

Both versions are relatively easy to parse. Regardless, I took another approach.

# tw_cli info

Ctl   Model        Ports   Drives   Units   NotOpt   RRate   VRate   BBU
------------------------------------------------------------------------
c0    9550SX-8LP   8       8        3       0        4       4       Charging

#

Why?

I think the status of the RAID card itself is more important. The data provided by the info command includes all 3Ware RAID cards in the system, the number of drives, units, how many are not optimal, and the BBU status. Such information allows the creation of certain tests to verify this data meets expected values. For example, I could use this:

service[opti]=RAID controller;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0 8 3

The code could then verify that there are 8 drives and 3 units. Any deviation from this would raise an error. I haven't coded that part yet. I hope to add that level of detail later this month.
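
Since that part is not coded yet, the following is only a sketch of how such a comparison might look. The sub name check_counts, the hash keys, and the choice of CRITICAL for any mismatch are all assumptions made for illustration; the values come from the sample tw_cli info output above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: compare one parsed "tw_cli info" row against the
# expected drive and unit counts from the service definition.
# Returns a (state, message) pair using NetSaint state names.
# Treating every mismatch as CRITICAL is an assumption of this sketch.
sub check_counts {
    my ($row, $want_drives, $want_units) = @_;

    if ($row->{drives} != $want_drives) {
        return ('CRITICAL', "Expected $want_drives drives, found $row->{drives}");
    }
    if ($row->{units} != $want_units) {
        return ('CRITICAL', "Expected $want_units units, found $row->{units}");
    }
    if ($row->{notopt} != 0) {
        return ('CRITICAL', "$row->{notopt} unit(s) not optimal");
    }
    return ('OK', "$row->{drives} drives, $row->{units} units, all optimal");
}

# Values from the sample "tw_cli info" output for c0
my %c0 = (drives => 8, units => 3, notopt => 0);
my ($state, $message) = check_counts(\%c0, 8, 3);
print "$state: $message\n";
```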

Extracts from the plug-in code

I decided to make a general plug-in that might have multiple uses. In this case, I'm thinking that the script on the NetSaint server could take the number of drives and units and compare that to supplied arguments. The device status could then be set accordingly.

The first step is getting the status on the server (in this context, server means the computer on which the 3Ware RAID card is installed). Here is the script I used:

sub raid3ware {
        my $controller = shift;

        my $controllerlisting;
        my $command = "$commandlist{$os}{raid3ware}";

        open(PROCOUT, "$command |") || die;
        $_ = <PROCOUT>;
        while($_ = <PROCOUT>) {
                if (/^(c\d+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*/) {
                        $controllerlisting .= '(' . $1 . ','  . $2 . ',' . $3 .
                              ',' . $4 . ',' . $5 . ',' . $6 . ',' . $7 . ',' .
                              $8 . ',' . $9 . ')';
                }
        }
        if (defined($controllerlisting)) {
                print Client $controllerlisting;
        } else {
                print Client "no controllers?";
        }

        $controllerlisting = undef;
        close(PROCOUT);
}
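
To show the reply format this sub produces, here is a small standalone sketch. It is not the plug-in code: it splits on whitespace instead of using the nine-group regular expression, and the sub name format_controller_line is made up for this example. The sample line is copied from the tw_cli info output above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn one line of "tw_cli info" output into the
# "(a,b,c,...)" form sent back to the NetSaint server.
# Returns undef for lines that are not controller rows.
sub format_controller_line {
    my ($line) = @_;
    # A controller row starts with c<digits> and has nine columns
    my @fields = split ' ', $line;
    return unless @fields == 9 && $fields[0] =~ /^c\d+$/;
    return '(' . join(',', @fields) . ')';
}

my $sample = 'c0    9550SX-8LP   8       8        3       0        4       4       Charging';
print format_controller_line($sample), "\n";
```

With more than one controller installed, the real sub concatenates one such group per controller into a single reply line.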

I'm making use of a regular expression, looking for a line that starts with c followed by a number. It returns information for all controllers found. Over on the NetSaint server, I wrote this script:

$ cat check_3wareraid.pl
#!/usr/bin/perl
#
# See LICENSE for copyright information
#
# check_3wareraid.pl <host> <controller> [port]
#
# NetSaint host script to get the 3ware RAID status from a client that is running
# netsaint_statd.
#

require 5.003;
BEGIN { $ENV{PATH} = '/bin' }
use Socket;
use POSIX;

sub usage;

my $TIMEOUT = 15;

my %ERRORS = ('UNKNOWN', '-1',
		'OK', '0',
		'WARNING', '1',
		'CRITICAL', '2');
my $remote     = shift || &usage(%ERRORS);
my $controller = shift || &usage(%ERRORS);
my $port       = shift || 1040;

my $remoteaddr = inet_aton("$remote");
my $paddr      = sockaddr_in($port, $remoteaddr) || die "Can't create info for connection: $!\n";
my $proto      = getprotobyname('tcp');

socket(Server, PF_INET, SOCK_STREAM, $proto) || die "Can't create socket: $!";
setsockopt(Server, SOL_SOCKET, SO_REUSEADDR, 1);
connect(Server, $paddr) || die "Can't connect to server: $!";

my $state = "OK";
my $answer = undef;

# Just in case of problems, let's not hang NetSaint
$SIG{'ALRM'} = sub { 
     close(Server);
     select(STDOUT);
     print "No Answer from Client\n";
     exit $ERRORS{"UNKNOWN"};
};
alarm($TIMEOUT);

#print "invoking Server with:raid3wareunits $controller\n";

select(Server);
$| = 1;

print Server "raid3ware $controller\n";
my ($servanswer) = <Server>;
alarm(0);
close(Server);
select(STDOUT);

chomp($servanswer);

#print "REPLY: '$servanswer'\n";

$servanswer =~ s/\(//g;
my @servanswer = split(/\)/,$servanswer);

$answer = 'not found';
$state  = 'CRITICAL';

foreach $line (@servanswer) {
	my ($con, $model, $ports, $drives, $units, $notopt, $rrate, $vrate, $bbu) = split(/,/, $line);
	if ($con eq $controller) {
		if ($notopt eq '0') {
			if ($bbu eq 'OK') {
				$state  = "OK";
				$answer = 'All units optimal. Battery: ' . $bbu;
			} else {
				$state  = "WARNING";
				$answer = 'All units optimal. Battery: ' . $bbu;
			}
		} else {
			$answer = $notopt . ' unit(s) not optimal. Battery: ' . $bbu;
			$state = "CRITICAL";
		}
	}
}


print $answer;
exit $ERRORS{$state};

sub usage {
	print "Minimum arguments not supplied!\n";
        print "\n";
        print "Perl Check 3Ware RAID plugin for NetSaint\n";
        print "Copyright (c) 1999 Charlie Cook & Nick Reinking\n";
        print "Copyright (c) 2006 Dan Langille\n";
        print "\n";
        print "Usage: $0 <host> <controller>\n";
        print "\n";
	exit $ERRORS{"UNKNOWN"};
}

When running the script, you get something like this back:

$ perl check_3wareraid.pl opti c0
All units optimal. Battery: Charging

Yes, there's more to the NetSaint solution than what is shown here. I will provide a link to the full scripts later in this article. In the meantime, this is what NetSaint shows now:

[Screenshot: NetSaint status display, BBU charging]

The battery is charging because the computer has been offline for a few days. I don't leave this machine running all the time because of the noise. But that won't bother me once the machine is moved to the ISP.

The plug-in code

The following are the code segments you'll need:

Here are sample entries from my /usr/local/etc/netsaint/commands.cfg:

command[check_raid3wareunits.pl]=$USER1$/netsaint_statd/check_3wareraidunits.pl $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$
command[check_raid3ware.pl]=$USER1$/netsaint_statd/check_3wareraid.pl $HOSTADDRESS$ $ARG1$ $ARG2$ $ARG3$ $ARG4$

And here is what my server looks like in /usr/local/etc/netsaint/hosts.cfg:

# opti
service[opti]=PING;0;24x7;3;5;1;freebsd-admins;120;24x7;1;1;0;;check_ping3
service[opti]=LOAD;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rload! 3
service[opti]=PROCS;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rprocs!
service[opti]=USERS;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rusers! 4
service[opti]=DISKSALL;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_rall_disks!/dev 90 90
service[opti]=RAID controller;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0
service[opti]=RAID spare 1;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3wareunits.pl! c0 u1
service[opti]=RAID spare 2;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3wareunits.pl! c0 u2
service[opti]=RAID array  ;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3wareunits.pl! c0 u0
service[opti]=HTTP;0;24x7;3;2;1;freebsd-admins;120;24x7;1;1;1;;check_http

Starting the test

To start the test, you can use either the CLI command or the web interface. I chose the web interface. Click on Monitor | Battery Backup and you should see this:

[Screenshot: Battery Backup page of the 3Ware web interface]

Then click on Test Battery Capacity. When I ran the test, it took about 13 hours.

After the test

After running the BBU test, here is what I found in the alarms section of the 3Ware web interface:

[Screenshot: alarms section of the 3Ware web interface]

The test started early in the morning. The system first fully charged the battery. This took about an hour. The test finished about 12.5 hours later, giving a battery capacity of 255 hours. I find that value suspicious because it is 0xFF. It makes me want to run another test, this time after I allow the BBU to charge up.

Recurring Tests

Today I found this message:

Mar 30 05:44:20 supernews kernel: twa0: INFO: (0x04: 0x0053): Battery capacity test is overdue:

Looking back, I found these:

Mar 28 09:21:54 supernews kernel: twa0: INFO: (0x04: 0x0055): Battery charging started:
Mar 28 09:23:15 supernews kernel: twa0: INFO: (0x04: 0x0056): Battery charging completed:

Looking back through the security run output emails (I knew there was a good reason to keep these messages!), I found 13 such notices since 12 Nov 2006. Yikes. I should have noticed them. I never did until today.

I have recently switched from NetSaint to Nagios and did not create a plugin for Nagios. I'm sure NetSaint would have found the battery charging messages and reported them. It would be nice to be able to determine that a battery capacity test is overdue. I found nothing in the manual. My only thought is to grep /var/log/messages for the message.
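
As a starting point for that idea, here is a minimal sketch that scans syslog lines for the overdue notice. The sub name find_overdue is hypothetical, and the pattern is based only on the messages quoted above, so other driver versions may word it differently. The sample lines are copied from this article; a real check would read /var/log/messages instead.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: look for "capacity test is overdue" notices from
# the twa driver in a list of syslog lines.
# Returns the number of matches and the most recent matching line.
sub find_overdue {
    my @lines = @_;
    my @hits = grep { /twa\d+: INFO: .*Battery capacity test is overdue/ } @lines;
    return (scalar(@hits), $hits[-1]);
}

# Sample lines copied from the messages quoted in this article.
my @log = (
    'Mar 28 09:21:54 supernews kernel: twa0: INFO: (0x04: 0x0055): Battery charging started:',
    'Mar 30 05:44:20 supernews kernel: twa0: INFO: (0x04: 0x0053): Battery capacity test is overdue:',
);

my ($count, $last) = find_overdue(@log);
print "overdue notices: $count\n";
print "most recent: $last\n" if $count;
```

A monitoring plug-in built on this could report WARNING whenever the count is non-zero, although that mapping is a design choice and not something the 3Ware manual prescribes.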

So now it is time to run a battery test.

# tw_cli /c0/bbu test
Performing the battery capacity test will disable the write cache on the
           controller /c0 for up to 24 hours.
Do you want to continue ? Y|N [N]: Y
Sending battery capacity test message to /c0/bbu ... Done.
The write cache will be resumed when the test is completed
with no error.

After starting the test, here are the messages I found:

Mar 31 03:42:15 supernews kernel: twa0: INFO: (0x04: 0x004E): Battery capacity test started:
Mar 31 03:42:15 supernews kernel: twa0: INFO: (0x04: 0x0055): Battery charging started:
Mar 31 03:42:17 supernews kernel: twa0: INFO: (0x04: 0x0056): Battery charging completed:

The unit status is:

# tw_cli /c0/bbu show status
/c0/bbu BBU Status                = Testing

See also tw_cli /cx show alarms, where x is the controller number.

That's All Folks!

The BBU is vitally important to your RAID solution. Monitoring it is as important as monitoring the system itself. Enjoy.

