The easiest way to find out the BBU status is through the
3Ware CLI. I do it like this:
# tw_cli /c0/bbu show
Name  OnlineState  BBUReady  Status    Volt  Temp  Hours  LastCapTest
---------------------------------------------------------------------------
bbu   On           No        Charging  OK    OK    255    30-Aug-2006
#
You can see that the BBU is charging and it's not ready yet.
You can also see that I've run the capacity test and the unit is good for 255
hours.
There is also a longer version:
# tw_cli /c0/bbu show all
/c0/bbu Firmware Version = BBU: 2.00.00.011
/c0/bbu Serial Number = L021902B6081199
/c0/bbu Online State = On
/c0/bbu BBU Ready = No
/c0/bbu BBU Status = Charging
/c0/bbu Battery Voltage = OK
/c0/bbu Battery Temperature = OK
/c0/bbu Estimated Backup Capacity = 255 Hours
/c0/bbu Last Capacity Test = 30-Aug-2006
/c0/bbu Battery Installation Date = 09-Aug-2006
/c0/bbu Bootloader Version = BBU 0.02.02.001
/c0/bbu PCB Revision = 65
#
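The key = value lines above lend themselves to a hash. Here is a minimal sketch of one way to parse them; the helper name parse_bbu_show_all is made up for illustration, and the sample text is copied from the output shown above rather than read from a live controller.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "tw_cli /c0/bbu show all" output into a hash.
# Each line looks like "/c0/bbu Some Key = Some Value".
sub parse_bbu_show_all {
    my ($text) = @_;
    my %bbu;
    foreach my $line (split /\n/, $text) {
        # Capture the key (after the "/cN/bbu" prefix) and the value.
        if ($line =~ m{^/c\d+/bbu\s+(.+?)\s+=\s+(.*?)\s*$}) {
            $bbu{$1} = $2;
        }
    }
    return %bbu;
}

# Sample text copied from the output shown above.
my $sample = <<'END';
/c0/bbu Online State = On
/c0/bbu BBU Ready = No
/c0/bbu BBU Status = Charging
/c0/bbu Estimated Backup Capacity = 255 Hours
END

my %bbu = parse_bbu_show_all($sample);
print "ready=$bbu{'BBU Ready'} status=$bbu{'BBU Status'}\n";
```

In real use the text would come from running tw_cli instead of a here-document.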
Both versions are relatively easy to parse. Regardless, I took another
approach.
# tw_cli info
Ctl  Model       Ports  Drives  Units  NotOpt  RRate  VRate  BBU
------------------------------------------------------------------------
c0   9550SX-8LP  8      8       3      0       4      4      Charging
#
Why?
I think the status of the RAID card itself is more important. The data provided
by the info command includes all 3Ware RAID cards in the system, the
number of drives, units, how many are not optimal, and the BBU status. With that
information, a check can verify that the reported values match what is
expected. For example, I could use this:
service[opti]=RAID controller;0;24x7;3;2;1;raid-admins;120;24x7;1;1;1;;check_raid3ware.pl! c0 8 3
The code could then verify that there are 8 drives and three units. Any
deviation from this would raise an error. I haven't coded that part yet.
I hope to get that specific later this month.
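As a rough idea of what that comparison might look like, here is a sketch that works directly on one controller line from tw_cli info. It is not the finished plugin: the function name check_counts and the message wording are assumptions, and only the state names are borrowed from NetSaint.

```perl
use strict;
use warnings;

# Hypothetical check: compare the drive and unit counts in one
# "tw_cli info" controller line against expected values.
# Returns (state, message); the state names mirror NetSaint's.
sub check_counts {
    my ($info_line, $want_drives, $want_units) = @_;
    # Columns: Ctl Model Ports Drives Units NotOpt RRate VRate BBU
    my ($ctl, $model, $ports, $drives, $units) = split ' ', $info_line;
    return ('UNKNOWN', 'could not parse controller line')
        unless defined $units
            && $ctl    =~ /^c\d+$/
            && $drives =~ /^\d+$/
            && $units  =~ /^\d+$/;
    if ($drives != $want_drives || $units != $want_units) {
        return ('CRITICAL',
            "${ctl}: $drives drives / $units units, expected $want_drives / $want_units");
    }
    return ('OK', "${ctl}: $drives drives, $units units");
}

my ($state, $msg) = check_counts('c0 9550SX-8LP 8 8 3 0 4 4 Charging', 8, 3);
print "$state - $msg\n";
```

Any line that does not look like a controller line yields UNKNOWN rather than a guess.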
I decided to make a general plugin that might have multiple uses. In this case,
I'm thinking that the script on the NetSaint server could take the number of
drives and units and compare that to supplied arguments. The device status
could then be set accordingly.
The first step is getting the status on the server (in this context, server means
the computer on which the 3Ware RAID card is installed). Here is the script
I used:
sub raid3ware {
    my $controller = shift;
    my $controllerlisting;
    my $command = "$commandlist{$os}{raid3ware}";
    open(PROCOUT, "$command |") || die;
    # Skip the first line of output
    $_ = <PROCOUT>;
    while($_ = <PROCOUT>) {
        # One controller per line: Ctl Model Ports Drives Units NotOpt RRate VRate BBU
        if (/^(c\d+)\s+(\S+)\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*/) {
            $controllerlisting .= '(' . $1 . ',' . $2 . ',' . $3 .
                ',' . $4 . ',' . $5 . ',' . $6 . ',' . $7 . ',' .
                $8 . ',' . $9 . ')';
        }
    }
    if (defined($controllerlisting)) {
        print Client $controllerlisting;
    } else {
        print Client "no controllers?";
    }
    $controllerlisting = undef;
    close(PROCOUT);
}
I'm making use of a regular expression, looking for a line that starts with
c followed by a number. It returns information for all controllers found.
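To see what the regular expression produces, it can be run against the sample tw_cli info output from earlier. This is only a sketch: format_controllers is a made-up helper that builds the same parenthesized, comma-separated string, without the Client filehandle or the command lookup.

```perl
use strict;
use warnings;

# The pattern used in raid3ware(), applied to the sample lines shown earlier.
my $re = qr/^(c\d+)\s+(\S+)\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s*/;

# Hypothetical helper that builds the same "(a,b,...)" string.
sub format_controllers {
    my (@lines) = @_;
    my $listing = '';
    foreach my $line (@lines) {
        # Lines that are not controller lines (header, separator) are skipped.
        my @fields = $line =~ $re or next;
        $listing .= '(' . join(',', @fields) . ')';
    }
    return $listing;
}

print format_controllers(
    'Ctl Model Ports Drives Units NotOpt RRate VRate BBU',
    'c0 9550SX-8LP 8 8 3 0 4 4 Charging',
), "\n";
# prints (c0,9550SX-8LP,8,8,3,0,4,4,Charging)
```

With two controllers the result would simply be two such groups back to back.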
Over on the NetSaint server, I wrote this script:
$ cat check_3wareraid.pl
#!/usr/bin/perl
#
# See LICENSE for copyright information
#
# check_3wareraid.pl <host> <controller> [port]
#
# NetSaint host script to get the 3ware RAID status from a client that is running
# netsaint_statd.
#
require 5.003;
BEGIN { $ENV{PATH} = '/bin' }
use Socket;
use POSIX;
sub usage;
my $TIMEOUT = 15;
my %ERRORS = ('UNKNOWN', '-1',
              'OK', '0',
              'WARNING', '1',
              'CRITICAL', '2');
my $remote = shift || &usage(%ERRORS);
my $controller = shift || &usage(%ERRORS);
my $port = shift || 1040;
my $remoteaddr = inet_aton("$remote");
my $paddr = sockaddr_in($port, $remoteaddr) || die "Can't create info for connection: $!\n";
my $proto = getprotobyname('tcp');
socket(Server, PF_INET, SOCK_STREAM, $proto) || die "Can't create socket: $!";
setsockopt(Server, SOL_SOCKET, SO_REUSEADDR, 1);
connect(Server, $paddr) || die "Can't connect to server: $!";
my $state = "OK";
my $answer = undef;
# Just in case of problems, let's not hang NetSaint
$SIG{'ALRM'} = sub {
    close(Server);
    select(STDOUT);
    print "No Answer from Client\n";
    exit $ERRORS{"UNKNOWN"};
};
alarm($TIMEOUT);
#print "invoking Server with:raid3wareunits $controller\n";
select(Server);
$| = 1;
print Server "raid3ware $controller\n";
my ($servanswer) = <Server>;
alarm(0);
close(Server);
select(STDOUT);
chomp($servanswer);
#print "REPLY: '$servanswer'\n";
$servanswer =~ s/\(//g;
my @servanswer = split(/\)/,$servanswer);
$answer = 'not found';
$state = 'CRITICAL';
foreach my $line (@servanswer) {
    my ($con, $model, $ports, $drives, $units, $notopt, $rrate, $vrate, $bbu) = split(/,/, $line);
    if ($con eq $controller) {
        if ($notopt eq '0') {
            if ($bbu eq 'OK') {
                $state = "OK";
                $answer = 'All units optimal. Battery: ' . $bbu;
            } else {
                $state = "WARNING";
                $answer = 'All units optimal. Battery: ' . $bbu;
            }
        } else {
            # At least one unit is not optimal
            $answer = $notopt . ' unit(s) not optimal. Battery: ' . $bbu;
            $state = "CRITICAL";
        }
    }
}
print $answer;
exit $ERRORS{$state};
sub usage {
    print "Minimum arguments not supplied!\n";
    print "\n";
    print "Perl Check Users plugin for NetSaint\n";
    print "Copyright (c) 1999 Charlie Cook & Nick Reinking\n";
    print "Copyright (c) 2006 Dan Langille\n";
    print "\n";
    print "Usage: $0 <host> <controller>\n";
    print "\n";
    exit $ERRORS{"UNKNOWN"};
}
When running the script, you get something like this back:
$ perl check_3wareraid.pl opti c0
All units optimal. Battery: Charging
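The reply handling can also be tried without a network connection. The following is a sketch only: parse_reply and classify are hypothetical helpers that mirror the splitting and the OK / WARNING / CRITICAL decision in the script above, fed with a reply string in the format the client sends.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the script's reply handling: split the
# "(...)(...)" reply into one hash per controller, keyed by controller name.
sub parse_reply {
    my ($reply) = @_;
    my @names = qw(con model ports drives units notopt rrate vrate bbu);
    my %by_controller;
    while ($reply =~ /\(([^)]*)\)/g) {
        my %rec;
        @rec{@names} = split /,/, $1;
        next unless defined $rec{con};    # ignore an empty group
        $by_controller{ $rec{con} } = \%rec;
    }
    return \%by_controller;
}

# Hypothetical helper: the same decision the script makes.
# A missing controller or any non-optimal unit is CRITICAL.
sub classify {
    my ($rec) = @_;
    return 'CRITICAL'
        unless $rec && defined $rec->{notopt} && $rec->{notopt} eq '0';
    return (defined $rec->{bbu} && $rec->{bbu} eq 'OK') ? 'OK' : 'WARNING';
}

my $ctl = parse_reply('(c0,9550SX-8LP,8,8,3,0,4,4,Charging)');
print classify($ctl->{c0}), "\n";    # prints WARNING
```

A charging battery gives WARNING here, which matches the output shown above.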
Yes, there's more to the NetSaint solution than what is shown here. I will
provide a link to the full scripts later in this article. In the meantime,
this is what NetSaint shows now:
The battery is charging because the computer has been offline for a few days. I don't
leave this machine running all the time because of the noise. But that won't
bother me once the machine is moved to the ISP.
Today I found this message:
Mar 30 05:44:20 supernews kernel: twa0: INFO: (0x04: 0x0053): Battery capacity test is overdue:
Looking back, I found these:
Mar 28 09:21:54 supernews kernel: twa0: INFO: (0x04: 0x0055): Battery charging started:
Mar 28 09:23:15 supernews kernel: twa0: INFO: (0x04: 0x0056): Battery charging completed:
Looking back through the security run output emails (I knew there was a good
reason to keep these messages!), I found 13 such notices since 12 Nov 2006. Yikes.
I should have noticed them. I never did until today.
I have recently switched from NetSaint to Nagios and did not create a plugin for
Nagios. I'm sure NetSaint would have found the battery charging messages and reported
them. It would be nice to be able to determine that a battery capacity test is overdue.
I found nothing in the manual. My only thought is to grep /var/log/messages
for the message.
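A minimal sketch of that idea in Perl follows. The helper name overdue_notices is made up, the match string is taken from the kernel message quoted above, and the log path is the one on this system; other systems may log elsewhere.

```perl
use strict;
use warnings;

# Hypothetical helper: return the log lines that report an overdue
# battery capacity test. The match string comes from the kernel
# message quoted above.
sub overdue_notices {
    my (@lines) = @_;
    return grep { /twa\d+: .*Battery capacity test is overdue/ } @lines;
}

# Assumption: the messages land in /var/log/messages, as on this system.
my $log = '/var/log/messages';
if (open(my $fh, '<', $log)) {
    my @hits = overdue_notices(<$fh>);
    close($fh);
    print scalar(@hits), " overdue notice(s) in $log\n";
    print $hits[-1] if @hits;    # most recent notice
} else {
    print "cannot read $log: $!\n";
}
```

This only sees notices still present in the current log file; rotated logs would need to be scanned separately.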
So now it is time to run a battery test.
# tw_cli /c0/bbu test
Performing the battery capacity test will disable the write cache on the
controller /c0 for up to 24 hours.
Do you want to continue ? Y|N [N]: Y
Sending battery capacity test message to /c0/bbu ... Done.
The write cache will be resumed when the test is completed
with no error.
After starting the test, here are the messages I found:
Mar 31 03:42:15 supernews kernel: twa0: INFO: (0x04: 0x004E): Battery capacity test started:
Mar 31 03:42:15 supernews kernel: twa0: INFO: (0x04: 0x0055): Battery charging started:
Mar 31 03:42:17 supernews kernel: twa0: INFO: (0x04: 0x0056): Battery charging completed:
The unit status is:
# tw_cli /c0/bbu show status
/c0/bbu BBU Status = Testing
See also /cx show alarms.