The FreeBSD Diary

Providing practical examples since 1998

Monitoring your HDD using SMART and Nagios
13 March 2010

Monitoring your computer systems is a good idea. Many tools let you verify that specified services are running and available to clients. I use Nagios. It can check that Apache is still running, that Postfix is still accepting mail, and various other things. If you can write a test, Nagios can monitor it.

Typically, people monitor network connections, applications, and bandwidth consumption. Until recently, I did not monitor disk health. That has now changed.

I started using three new tools: smartmontools, nagios-check_smartmon, and munin.

In this article I'll show you how I added SMART monitoring to my Nagios installation. munin is straightforward to install, but it is outside the scope of this article. It is for another time.

This article also assumes you have Nagios installed and nrpe running on the host you are monitoring. I use Fruity for my Nagios configuration, so I will gloss over that part too.

SMART

Disks die. Usually, they die predictably. Tools exist for monitoring your HDD. Many modern disks contain SMART support. From http://en.wikipedia.org/wiki/S.M.A.R.T.:

Self-Monitoring, Analysis, and Reporting Technology, or S.M.A.R.T. (sometimes written as SMART), is a monitoring system for computer hard disks to detect and report on various indicators of reliability, in the hope of anticipating failures.

My first real introduction to SMART came from reading Watching a hard drive die by Greg Smith. Greg is present on the PostgreSQL Performance mailing list. He knows a lot about hardware and how to get the best out of it. As I was setting up a 10TB file server, I wanted to start monitoring the health of those disks.

smartmontools

To install smartmontools:

cd /usr/ports/sysutils/smartmontools/
make install clean

To have smartd start at boot:

echo 'smartd_enable="YES"' >> /etc/rc.conf

I used the default configuration file, but you could get more specific if you wanted:

cp -i /usr/local/etc/smartd.conf.sample /usr/local/etc/smartd.conf

To start smartd now:

# /usr/local/etc/rc.d/smartd start
Starting smartd.

I have two HDDs, so I added this to /etc/periodic.conf to include drive health information in my daily status reports:

daily_status_smart_devices="/dev/ad0 /dev/ad2"
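
Before relying on the daily report, it can help to ask each drive for its overall health verdict directly. What follows is only a minimal sketch. It assumes an ATA disk, for which `smartctl -H` prints a line beginning `SMART overall-health self-assessment test result:`. The function name parse_health is made up for this illustration.

```shell
#!/bin/sh
# Classify the output of `smartctl -H <device>`, read from stdin.
# Prints PASSED, FAILED, or UNKNOWN (when no verdict line is found).
parse_health() {
    result=$(sed -n 's/^SMART overall-health self-assessment test result: *//p' | head -n 1)
    case "$result" in
        PASSED) echo "PASSED" ;;
        "")     echo "UNKNOWN" ;;
        *)      echo "FAILED" ;;
    esac
}

# Usage on a real system (needs root, not run here):
#   for d in /dev/ad0 /dev/ad2; do
#       printf '%s: ' "$d"
#       smartctl -H "$d" | parse_health
#   done
```

UNKNOWN usually means the drive does not support SMART, or that smartctl needs a device type option for that controller.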

nagios-check_smartmon

nagios-check_smartmon is a Nagios plugin that allows you to access smartmontools from within Nagios. To install it:

# cd /usr/ports/net-mgmt/nagios-check_smartmon
# make install clean

Let's see if we can run it:

# /usr/local/libexec/nagios/check_smartmon -d /dev/ad2
OK: device is functional and stable (temperature: 43)

That's what we need.

nrpe changes

check_smartmon must be run with sufficient permission to access the device. Under Nagios, the command is run by net-mgmt/nrpe as the nagios user, so I invoke it through sudo. These are the entries I added to /usr/local/etc/nrpe.cfg to monitor the two HDDs in this system:

command[check_smartmon_ad2]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad2
command[check_smartmon_ad4]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad4

After changing the above configuration file, remember to restart nrpe:

# /usr/local/etc/rc.d/nrpe2 restart
Stopping nrpe2.
Starting nrpe2.

To allow the nagios user to run these commands via sudo, I added the following using visudo:

nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_smartmon -d /dev/ad2
nagios   ALL=(ALL) NOPASSWD:/usr/local/libexec/nagios/check_smartmon -d /dev/ad4
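
Each disk needs two lines that agree with each other: one in nrpe.cfg and one in sudoers. With more than a couple of disks, generating both from a single list is one way to keep them from drifting apart. This is a sketch only, and the function names nrpe_entry and sudoers_entry are hypothetical helpers, not part of nrpe or sudo. The output still has to be pasted into nrpe.cfg and into visudo by hand.

```shell
#!/bin/sh
# Path to the plugin as installed by the port.
PLUGIN=/usr/local/libexec/nagios/check_smartmon

# Print the nrpe.cfg line for one disk, e.g. "ad2".
nrpe_entry() {
    printf 'command[check_smartmon_%s]=sudo %s -d /dev/%s\n' "$1" "$PLUGIN" "$1"
}

# Print the matching sudoers line for the same disk.
sudoers_entry() {
    printf 'nagios   ALL=(ALL) NOPASSWD:%s -d /dev/%s\n' "$PLUGIN" "$1"
}

# Emit both lines for every disk to be monitored.
for disk in ad2 ad4; do
    nrpe_entry "$disk"
    sudoers_entry "$disk"
done
```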

From the Nagios server, I ran this command to verify that nrpe would return the expected results:

$ /usr/local/libexec/nagios/check_nrpe2 -H bast -c check_smartmon_ad2
OK: device is functional and stable (temperature: 42)

Good. So we know NRPE will perform the command and return the expected results. Now it's a simple matter of configuring Nagios to run the above command.
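
For readers who write Nagios object files by hand rather than using Fruity, the service definition might look roughly like the output of the sketch below. It is an illustration under assumptions, not the configuration used here: the template name generic-service and the command name check_nrpe2 are guesses at what a typical installation defines, and service_def is a made-up helper. Only the host name bast and the NRPE command name come from the test above.

```shell
#!/bin/sh
# Print a Nagios service definition that calls one NRPE command on one host.
# "generic-service" and "check_nrpe2" are assumed names; substitute whatever
# your Nagios configuration already defines.
service_def() {
    host=$1
    disk=$2
    cat <<EOF
define service {
    use                  generic-service
    host_name            $host
    service_description  SMART $disk
    check_command        check_nrpe2!check_smartmon_$disk
}
EOF
}

service_def bast ad2
```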

Guess what? The new check found something:

WARNING: device temperature (57) exceeds warning temperature threshold (55) 

I started a long self-test:

# smartctl -t long /dev/ad6
smartctl version 5.38 [i386-portbld-freebsd8.0] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF OFFLINE IMMEDIATE AND SELF-TEST SECTION ===
Sending command: "Execute SMART Extended self-test routine immediately in off-line mode".
Drive command "Execute SMART Extended self-test routine immediately in off-line mode" successful.
Testing has begun.
Please wait 54 minutes for test to complete.
Test will complete after Sat Mar 13 20:38:33 2010

Use smartctl -X to abort test.

And soon after that:

CRITICAL: device temperature (61) exceeds critical temperature threshold (60) 

Nice.
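
Once a long self-test has finished, its result can be read back with `smartctl -l selftest`. The sketch below picks out the most recent entry from that log. It assumes the usual ATA log layout, in which the newest entry is the line starting with `# 1` and a clean run contains the text `Completed without error`. The function name latest_selftest is hypothetical.

```shell
#!/bin/sh
# Read `smartctl -l selftest <device>` output on stdin and report on the
# most recent self-test (the "# 1" entry).
latest_selftest() {
    line=$(grep -E '^# *1 ' | head -n 1)
    case "$line" in
        *"Completed without error"*) echo "OK" ;;
        "")                          echo "NO LOG" ;;
        *)                           echo "CHECK" ;;
    esac
}

# Usage on a real system (needs root, not run here):
#   smartctl -l selftest /dev/ad6 | latest_selftest
```

CHECK covers everything else, including a test that is still in progress, so it is a prompt to read the full log rather than a verdict.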

I checked the drive temperatures by hand, literally, by putting my hand on each HDD, and they all felt about the same. I concluded that the SMART reading was wrong, which is not unknown, and raised the thresholds for that drive in nrpe.cfg to allow for the higher reading:

command[check_smartmon_ad6]=sudo /usr/local/libexec/nagios/check_smartmon -d /dev/ad6 -w 65 -c 70

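I also ran visudo and updated the ad6 entry to allow nagios to run the amended command. The sudoers entry lists the command together with its arguments, so it has to match the new command line exactly.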
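
To see how the -w and -c values change the result, here is a small sketch of the threshold logic using the standard Nagios plugin exit codes (0 for OK, 1 for WARNING, 2 for CRITICAL). It is not the plugin's actual code. It assumes, going by the word "exceeds" in the messages above, that a reading must be strictly greater than a threshold to trip it, and temp_state is a made-up name.

```shell
#!/bin/sh
# temp_state TEMP WARN CRIT
# Print the Nagios state for a temperature reading and return the
# matching plugin exit code: 0 = OK, 1 = WARNING, 2 = CRITICAL.
temp_state() {
    temp=$1
    warn=$2
    crit=$3
    if [ "$temp" -gt "$crit" ]; then
        echo "CRITICAL"
        return 2
    elif [ "$temp" -gt "$warn" ]; then
        echo "WARNING"
        return 1
    else
        echo "OK"
        return 0
    fi
}

# The reading of 61 against the raised thresholds (-w 65 -c 70):
temp_state 61 65 70
```

Against the thresholds of 55 and 60 reported in the messages above, the same reading of 61 would be CRITICAL.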

