The FreeBSD Diary
Providing practical examples since 1998
Monitoring your HDD using SMART and Nagios
13 March 2010
Monitoring your computer systems is a good idea. Many tools let you verify that specified services are running and available to clients. I use Nagios. You can check that Apache is still running, that Postfix is still accepting mail, and various other things. If you can write a test, Nagios can monitor it.
Typically, people monitor network connections, applications, and bandwidth consumption. Until recently, I did not monitor disk health. That has now changed.
I started using three new tools:
In this article I'll show you how I added SMART monitoring to my Nagios installation. munin is straightforward to install, but it is outside the scope of this article. It is for another time.
This article also assumes you have Nagios installed and NRPE running on the host you are monitoring. I use Fruity to manage my Nagios configuration, so I will gloss over that too.
Disks die. Usually, they die predictably. Tools exist for monitoring your HDD. Many modern disks contain SMART support. From http://en.wikipedia.org/wiki/S.M.A.R.T.:
Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T. (sometimes written as SMART), is a monitoring system for computer hard disks to detect and report on various indicators of reliability, in the hope of anticipating failures.
My first real introduction to SMART came from reading Watching a hard drive die by Greg Smith. Greg is present on the PostgreSQL Performance mailing list. He knows a lot about hardware and how to get the best out of it. As I was setting up a 10TB file server, I wanted to start monitoring the health of those disks.
To install smartmontools:
cd /usr/ports/sysutils/smartmontools/
make install clean
To have smartd start at boot:
echo 'smartd_enable="YES"' >> /etc/rc.conf
I used the default configuration file, but you could get more specific if you wanted:
cp -i /usr/local/etc/smartd.conf.sample /usr/local/etc/smartd.conf
To start smartd now:
# /usr/local/etc/rc.d/smartd start
Starting smartd.
This system has two HDDs, so I added this to /etc/periodic.conf to include drive health information in my daily status reports:
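A minimal sketch of such an entry, assuming the periodic script installed by the smartmontools port reads the daily_status_smart_devices variable, and assuming the two disks are the ad2 and ad4 devices used in the nrpe.cfg entries later in this article:

```shell
# /etc/periodic.conf fragment (sh syntax).
# Assumption: the port's daily periodic script reads this variable.
# Assumption: the two disks are ad2 and ad4, as in the nrpe.cfg entries.
daily_status_smart_devices="/dev/ad2 /dev/ad4"
```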
nagios-check_smartmon is a Nagios plugin that allows you to access smartmontools from within Nagios. To install it:
# cd /usr/ports/net-mgmt/nagios-check_smartmon
# make install clean
Let's see if we can run it:
# /usr/local/libexec/nagios/check_smartmon -d /dev/ad2
OK: device is functional and stable (temperature: 43)
That's what we need.
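Nagios judges a plugin by its exit status rather than by the text it prints. Under the standard plugin convention, 0 is OK, 1 is WARNING, 2 is CRITICAL, and 3 is UNKNOWN. The helper below is a small sketch for hand testing; nagios_state_name is a name made up for this example, and it assumes check_smartmon follows the convention as Nagios plugins are expected to:

```shell
#!/bin/sh
# Hypothetical helper: translate a Nagios plugin exit status into its state name.
# 0=OK, 1=WARNING, 2=CRITICAL; 3 and anything else is reported as UNKNOWN.
nagios_state_name() {
    case "$1" in
        0) echo "OK" ;;
        1) echo "WARNING" ;;
        2) echo "CRITICAL" ;;
        *) echo "UNKNOWN" ;;
    esac
}
```

To use it, run the plugin as root and pass its status along: /usr/local/libexec/nagios/check_smartmon -d /dev/ad2; nagios_state_name $?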
check_smartmon must be run with sufficient permissions to access the device. Under NRPE (net-mgmt/nrpe), the command runs as the nagios user, so I invoke it via sudo. The following are the entries I added to /usr/local/etc/nrpe.cfg to monitor the two HDDs in this system:
command[check_smartmon_ad2]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad2
command[check_smartmon_ad4]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad4
After changing the above configuration file, remember to restart nrpe:
# /usr/local/etc/rc.d/nrpe2 restart
Stopping nrpe2.
Starting nrpe2.
In order to allow the nagios user to run this command via sudo, I add the following via the visudo command:
nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_smartmon -d /dev/ad2
nagios ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_smartmon -d /dev/ad4
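When a sudoers entry lists arguments after the command, sudo permits the command only with exactly those arguments, so the nrpe.cfg line and the sudoers line have to agree. One way to keep them in step is to generate both from the same command string. This is only a sketch: the function names smartmon_cmd, nrpe_entry, and sudoers_entry are invented for illustration, and the output still has to be pasted into nrpe.cfg and visudo by hand.

```shell
#!/bin/sh
# Hypothetical helpers that print matching nrpe.cfg and sudoers lines
# for one disk, so the command string is identical in both files.
PLUGIN=/usr/local/libexec/nagios/check_smartmon

# smartmon_cmd DEVICE [WARN CRIT] -> the plugin command line
smartmon_cmd() {
    cmd="$PLUGIN -d /dev/$1"
    # Append temperature thresholds only when both are supplied.
    if [ -n "$2" ] && [ -n "$3" ]; then
        cmd="$cmd -w $2 -c $3"
    fi
    echo "$cmd"
}

# nrpe_entry DEVICE [WARN CRIT] -> line for /usr/local/etc/nrpe.cfg
nrpe_entry() {
    echo "command[check_smartmon_$1]=sudo $(smartmon_cmd "$@")"
}

# sudoers_entry DEVICE [WARN CRIT] -> line to add via visudo
sudoers_entry() {
    echo "nagios ALL=(ALL) NOPASSWD:$(smartmon_cmd "$@")"
}
```

Calling nrpe_entry ad2 and sudoers_entry ad2 prints the ad2 lines used in this article. Passing thresholds as well, for example nrpe_entry ad2 65 70, adds the same -w and -c options to both lines.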
From the nagios system, I ran these commands to verify that nrpe would return the expected results:
$ /usr/local/libexec/nagios/check_nrpe2 -H bast -c check_smartmon_ad2
OK: device is functional and stable (temperature: 42)
Good. So we know NRPE will run the command and return the expected results. Now it's a simple matter of configuring Nagios to run the above command.
Guess what? The check found something:
WARNING: device temperature (57) exceeds warning temperature threshold (55)
I started a long self test:
# smartctl -t long /dev/ad6
smartctl version 5.38 [i386-portbld-freebsd8.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 54 minutes for test to complete.
Test will complete after Sat Mar 13 20:38:33 2010

Use smartctl -X to abort test.
And soon after that:
CRITICAL: device temperature (61) exceeds critical temperature threshold (60)
I checked the HDD temperatures manually, by putting my hand on each drive, and found they were all at a similar temperature. I concluded that the SMART reading was wrong, which is not unheard of. I raised the thresholds in nrpe.cfg to allow for the higher reading:
command[check_smartmon_ad6]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad6 -w 65 -c 70
I also ran visudo and updated the ad6 entry to allow nagios to run the amended command.
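The effect of the -w and -c options can be pictured with a small function. This is an illustration rather than the plugin's actual code: classify_temp is a made-up name, and the strict greater-than comparison is an assumption based on the word "exceeds" in the messages above.

```shell
#!/bin/sh
# Hypothetical illustration of warning/critical temperature thresholds.
# Assumption: a reading must be strictly greater than a threshold to trip it,
# matching the word "exceeds" in the plugin's messages.
# classify_temp TEMP WARN CRIT
classify_temp() {
    if [ "$1" -gt "$3" ]; then
        echo "CRITICAL"
    elif [ "$1" -gt "$2" ]; then
        echo "WARNING"
    else
        echo "OK"
    fi
}
```

With the thresholds of 55 and 60 reported in the messages, classify_temp 57 55 60 gives WARNING and classify_temp 61 55 60 gives CRITICAL. With the raised thresholds, classify_temp 61 65 70 gives OK.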