check_smart

Last update: October 01, 2019

This is a plugin to monitor the SMART values of hard drives and solid state drives. It is a fork of check_smart, released in 2009 by Kurt Yoder. The biggest change is that this fork can also be used for disks behind a hardware RAID controller.

Download

Download check_smart.pl


Download the plugin and save it in your Nagios/monitoring plugin folder (usually /usr/lib/nagios/plugins, depending on your distribution). Afterwards adjust the permissions (usually chmod 755).
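The installation steps can be sketched as a small shell function. This is only a sketch: the function name install_check_smart and its arguments are hypothetical, and the default target directory is the usual path named above, which may differ on your distribution.

```shell
# Hypothetical helper: copy the downloaded plugin into the plugin folder
# and make it executable (mode 755).
install_check_smart() {
    src=$1                                  # path to the downloaded check_smart.pl
    dest_dir=${2:-/usr/lib/nagios/plugins}  # assumption: adjust for your distribution
    [ -f "$src" ] || { echo "source file not found: $src" >&2; return 1; }
    [ -d "$dest_dir" ] || { echo "plugin folder not found: $dest_dir" >&2; return 1; }
    cp "$src" "$dest_dir/check_smart.pl" || return 1
    chmod 755 "$dest_dir/check_smart.pl"
}
```

Run as root, a call such as "install_check_smart ./check_smart.pl" would place the plugin in the default folder.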

Community contributions are welcome on the GitHub repo.

Version history / Changelog

Feb 3, 2009: Kurt Yoder - initial version of script (rev 1.0)
Jul 8, 2013: Claudio Kuenzler - support hardware raids like megaraid (rev 2.0)
Jul 9, 2013: Claudio Kuenzler - update help output (rev 2.1)
Oct 11, 2013: Claudio Kuenzler - making the plugin work on FreeBSD (rev 3.0)
Oct 11, 2013: Claudio Kuenzler - allowing -i sat (SATA on FreeBSD) (rev 3.1)
Nov 4, 2013: Claudio Kuenzler - works now with CCISS on FreeBSD (rev 3.2)
Nov 4, 2013: Claudio Kuenzler - elements in grown defect list causes warning (rev 3.3)
Nov 6, 2013: Claudio Kuenzler - add threshold option "bad" (-b) (rev 4.0)
Nov 7, 2013: Claudio Kuenzler - modified help (rev 4.0)
Nov 7, 2013: Claudio Kuenzler - bugfix in threshold logic (rev 4.1)
Mar 19, 2014: Claudio Kuenzler - bugfix in defect list perfdata (rev 4.2)
Apr 22, 2014: Jerome Lauret - implemented -g to do a global lookup (rev 5.0)
Apr 25, 2014: Claudio Kuenzler - cleanup, merge Jerome's code, perfdata output fix (rev 5.1)
May 5, 2014: Caspar Smit - Fixed output bug in global check / issue #3 (rev 5.2)
Feb 4, 2015: Caspar Smit and cguadall - Allow detection of more than 26 devices / issue #5 (rev 5.3)
Feb 5, 2015: Bastian de Groot - Different ATA vs. SCSI lookup (rev 5.4)
Feb 11, 2015: Josh Behrends - Allow script to run outside of nagios plugins dir / wiki url update (rev 5.5)
Feb 11, 2015: Claudio Kuenzler - Allow script to run outside of nagios plugins dir for FreeBSD too (rev 5.5)
Mar 12, 2015: Claudio Kuenzler - Change syntax of -g parameter (regex is now expected as input) (rev 5.6)
Feb 6, 2017: Benedikt Heine - Fix Use of uninitialized value $device (rev 5.7)
Oct 10, 2017: Bobby Jones - Allow multiple devices for interface type megaraid, e.g. "megaraid,[1-5]" (rev 5.8)
Apr 28, 2018: Pavel Pulec (Inuits) - allow type "auto" (rev 5.9)
May 5, 2018: Claudio Kuenzler - Check selftest log for errors using new parameter -s (rev 5.10)
Dec 27, 2018: Claudio Kuenzler - Add exclude list (-e) to ignore certain attributes (rev 5.11)
Jan 8, 2019: Claudio Kuenzler - Fix 'Use of uninitialized value' warnings (rev 5.11.1)
Jun 4, 2019: Claudio Kuenzler - Add raw check list (-r) and warning thresholds (-w) (rev 6.0)
Jun 11, 2019: Claudio Kuenzler - Allow using pseudo bus device /dev/bus/N (rev 6.1)
Aug 19, 2019: Claudio Kuenzler - Add device model and serial number in output (rev 6.2)
Oct 1, 2019: Michael Krahe - Allow exclusion from perfdata as well (-E) and by attribute number (rev 6.3)

Requirements

  • Perl
  • smartmontools package (smartctl command is required)
  • For cciss (HP SmartArray) controllers, smartmontools >= 5.37
  • Entry in sudoers

Sudoers entry

The plugin needs root privileges, otherwise it cannot launch smartctl correctly. You have two options:

  • Launch the plugin itself with sudo
  • Launch the plugin itself as nagios user and the smartctl command as root with sudo

Here are some examples you can add to your sudoers with the command "visudo":

nagios ALL = NOPASSWD: /usr/local/libexec/nagios/check_smart.pl # for option 1 on FreeBSD
nagios ALL = NOPASSWD: /usr/local/sbin/smartctl # for option 2 on FreeBSD

nagios ALL = NOPASSWD: /usr/lib/nagios/plugins/check_smart.pl # for option 1 on Linux
nagios ALL = NOPASSWD: /usr/sbin/smartctl # for option 2 on Linux
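
The entries above all follow one pattern: user, host list, the NOPASSWD tag and the absolute path of the command. A minimal sketch of that pattern follows; the helper name sudoers_entry is hypothetical and not part of the plugin. A candidate file can be syntax-checked with "visudo -cf FILE" before it is put in place.

```shell
# Hypothetical helper: print a sudoers entry that lets a user run exactly
# one command as root without a password. sudoers expects the command as
# an absolute path, so relative paths are rejected here.
sudoers_entry() {
    user=$1
    cmd=$2
    case $cmd in
        /*) printf '%s ALL = NOPASSWD: %s\n' "$user" "$cmd" ;;
        *)  echo "command must be an absolute path: $cmd" >&2; return 1 ;;
    esac
}

# Example (option 2 on Linux):
#   sudoers_entry nagios /usr/sbin/smartctl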

Definition of the parameters

Short  Long            Description
-d     --device        A physical block device to be SMART monitored, e.g. /dev/sda. The pseudo-device /dev/bus/N is allowed.
-g     --global        A regular expression matching the physical devices to be monitored, e.g. "/dev/sd[a-z]" for devices /dev/sda to /dev/sdz.
                       -g can also be combined with megaraid devices. Example: -g /dev/sda -i 'megaraid,[0-3]'.
                       The global check quickly identifies obvious errors on multiple drives, but it does not show details of each drive. For a detailed check including performance data for historical graphing, a single drive check using -d is advised.
-i     --interface     Drive's interface type (auto|ata|scsi|3ware,N|areca,N|hpt,L/M/N|cciss,N|megaraid,N).
                       See https://www.smartmontools.org/wiki/Supported_RAID-Controllers for interface types.
                       If used in combination with -g/--global, the megaraid interface supports a regular expression, e.g. "-i megaraid,[8-9]".
-r*    --raw*          Comma separated list (without spaces) of ATA attributes to check for their raw values.
                       Default: 'Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block'.
-b*    --bad*          Threshold value (integer) at which to warn for N bad entries (ATA: Current Pending Sector, SCSI: grown defect list).
                       Note: Deprecated for ATA since check_smart version 6.0, use -w instead. Continue to use -b for SCSI drives.
-w*    --warn*         Comma separated list of thresholds for ATA drives (e.g. 'Reallocated_Sector_Ct=10,Current_Pending_Sector=62').
-e*    --exclude*      Comma separated list of SMART attributes which should be excluded (=ignored) from checks.
                       Also supports "When_failed" values, e.g. "In_the_past".
-E*    --exclude-all*  Comma separated list of SMART attributes which should be excluded (=ignored) completely, for both checks and performance data.
                       Also supports "When_failed" values, e.g. "In_the_past".
-s*    --selftest*     Additionally check SMART's selftest log for errors.
-h*    --help*         Show help/usage.
-v*    --version*      Show plugin version.
N/A    --debug*        Show debugging information.

* optional parameter

Either -d or -g parameter is required. -i is always required.

-e and -E exclude lists can co-exist.
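
The rule that either -d or -g must be given together with -i can be sketched as a pre-flight check, for example in a wrapper script. The function name have_required_args is hypothetical; the plugin performs its own argument validation, and this sketch only mirrors the rule stated above.

```shell
# Hypothetical pre-flight check: succeeds only if a target (-d or -g)
# and an interface (-i) are both present in the argument list.
have_required_args() {
    target=0
    iface=0
    while [ "$#" -gt 0 ]; do
        case $1 in
            -d|--device|-g|--global) target=1 ;;
            -i|--interface)          iface=1 ;;
        esac
        shift
    done
    [ "$target" -eq 1 ] && [ "$iface" -eq 1 ]
}

# Example:
#   have_required_args -d /dev/sda -i ata && echo "arguments look complete"
```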

Usage / running the plugin on the command line

Usage:

./check_smart.pl (-d string|-g regex) -i string [-r list] [-w list] [-b int] [-e list] [-E list] [-s] [--debug]

Example: Single SATA Drive:

./check_smart.pl -d /dev/sda -i ata
WARNING: Reallocated_Sector_Ct is non-zero (3), Program_Fail_Cnt_Total is non-zero (3), Runtime_Bad_Block is non-zero (3), Uncorrectable_Error_Cnt is non-zero (1)|Reallocated_Sector_Ct=3 Power_On_Hours=31415 Power_Cycle_Count=889 Program_Fail_Count_Chip=2 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=873 Used_Rsvd_Blk_Cnt_Chip=386 Used_Rsvd_Blk_Cnt_Tot=752 Unused_Rsvd_Blk_Cnt_Tot=3280 Program_Fail_Cnt_Total=3 Erase_Fail_Count_Total=0 Runtime_Bad_Block=3 Uncorrectable_Error_Cnt=1 ECC_Error_Rate=1 Offline_Uncorrectable=0 CRC_Error_Count=262 Available_Reservd_Space=1630 Total_LBAs_Written=3363787529 Total_LBAs_Read=3278685684

Example: Single SATA Drive with warning thresholds:

./check_smart.pl -d /dev/sda -i ata -w 'Reallocated_Sector_Ct=4,Runtime_Bad_Block=4,Uncorrectable_Error_Cnt=2'
WARNING: Reallocated_Sector_Ct is non-zero (3) (but less than threshold 4), Program_Fail_Cnt_Total is non-zero (3), Runtime_Bad_Block is non-zero (3) (but less than threshold 4), Uncorrectable_Error_Cnt is non-zero (1) (but less than threshold 2)|Reallocated_Sector_Ct=3 Power_On_Hours=31415 Power_Cycle_Count=889 Program_Fail_Count_Chip=2 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=873 Used_Rsvd_Blk_Cnt_Chip=386 Used_Rsvd_Blk_Cnt_Tot=752 Unused_Rsvd_Blk_Cnt_Tot=3280 Program_Fail_Cnt_Total=3 Erase_Fail_Count_Total=0 Runtime_Bad_Block=3 Uncorrectable_Error_Cnt=1 ECC_Error_Rate=1 Offline_Uncorrectable=0 CRC_Error_Count=262 Available_Reservd_Space=1630 Total_LBAs_Written=3363863033 Total_LBAs_Read=3278685684

Example: Single SATA Drive but exclude certain attribute checks (yet keep the attribute data in performance data):

./check_smart.pl -d /dev/sda -i ata -e 'Reallocated_Sector_Ct,Program_Fail_Cnt_Total'
WARNING: Runtime_Bad_Block is non-zero (3), Uncorrectable_Error_Cnt is non-zero (1)|Reallocated_Sector_Ct=3 Power_On_Hours=31416 Power_Cycle_Count=889 Program_Fail_Count_Chip=2 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=873 Used_Rsvd_Blk_Cnt_Chip=386 Used_Rsvd_Blk_Cnt_Tot=752 Unused_Rsvd_Blk_Cnt_Tot=3280 Program_Fail_Cnt_Total=3 Erase_Fail_Count_Total=0 Runtime_Bad_Block=3 Uncorrectable_Error_Cnt=1 ECC_Error_Rate=1 Offline_Uncorrectable=0 CRC_Error_Count=262 Available_Reservd_Space=1630 Total_LBAs_Written=3363924329 Total_LBAs_Read=3278685684

Example: Single SATA Drive but completely exclude certain attributes from both checks and performance data:

./check_smart.pl -d /dev/sda -i ata -E 'Reallocated_Sector_Ct,Program_Fail_Cnt_Total'
WARNING: Runtime_Bad_Block is non-zero (3), Uncorrectable_Error_Cnt is non-zero (1)|Power_On_Hours=31416 Power_Cycle_Count=889 Program_Fail_Count_Chip=2 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=873 Used_Rsvd_Blk_Cnt_Chip=386 Used_Rsvd_Blk_Cnt_Tot=752 Unused_Rsvd_Blk_Cnt_Tot=3280 Erase_Fail_Count_Total=0 Runtime_Bad_Block=3 Uncorrectable_Error_Cnt=1 ECC_Error_Rate=1 Offline_Uncorrectable=0 CRC_Error_Count=262 Available_Reservd_Space=1630 Total_LBAs_Written=3363924329 Total_LBAs_Read=3278685684

Example: Single SATA Drive with a manual override of which attributes are checked for their raw values:

./check_smart.pl -d /dev/sda -i ata -r 'Uncorrectable_Error_Cnt'
WARNING: Uncorrectable_Error_Cnt is non-zero (1)|Reallocated_Sector_Ct=3 Power_On_Hours=31416 Power_Cycle_Count=889 Program_Fail_Count_Chip=2 Erase_Fail_Count_Chip=0 Wear_Leveling_Count=873 Used_Rsvd_Blk_Cnt_Chip=386 Used_Rsvd_Blk_Cnt_Tot=752 Unused_Rsvd_Blk_Cnt_Tot=3280 Program_Fail_Cnt_Total=3 Erase_Fail_Count_Total=0 Runtime_Bad_Block=3 Uncorrectable_Error_Cnt=1 ECC_Error_Rate=1 Offline_Uncorrectable=0 CRC_Error_Count=262 Available_Reservd_Space=1630 Total_LBAs_Written=3363995193 Total_LBAs_Read=3278685684

Example: Drive attached to MegaRAID controller:

./check_smart.pl -d /dev/sda -i megaraid,8

Example: Intel RAID on FreeBSD 9.2 ("kldload mfip.ko" required):

/usr/local/libexec/nagios/check_smart.pl -d /dev/pass0 -i scsi

Example: SATA drives behind Intel RAID on FreeBSD 9.2 ("kldload mfip.ko" required):

/usr/local/libexec/nagios/check_smart.pl -d /dev/pass12 -i sat

Example: SCSI drives behind HP RAID (CCISS) on FreeBSD 6.0:

/usr/local/libexec/nagios/check_smart.pl -d /dev/ciss0 -i cciss,0
OK: no SMART errors detected|defect_list=0 sent_blocks=3093462752 temperature=24;;68

/usr/local/libexec/nagios/check_smart.pl -d /dev/ciss0 -i cciss,3
WARNING: 48 Elements in grown defect list | defect_list=48 sent_blocks=1137657348 temperature=22;;68

Example: Using threshold option (-b) to ignore 1 bad element, warning only when 2 bad elements are found:

/usr/local/libexec/nagios/check_smart.pl -d /dev/ciss0 -i cciss,1 -b 2
OK: 1 Elements in grown defect list (but less than threshold 2)|defect_list=1;2;2;; sent_blocks=2769458900762624 temperature=27;;65

Example: Check all SATA disks (sda - sdz) at the same time on Linux:

/usr/lib/nagios/plugins/check_smart.pl -g "/dev/sd[a-z]" -i ata
OK: [/dev/sda] - Device is clean --- [/dev/sdb] - Device is clean|

Example: Check all SCSI disks behind Intel RAID on FreeBSD 9.2 ("kldload mfip.ko" required):

/usr/local/libexec/nagios/check_smart.pl -g "/dev/pass[0-9]" -i scsi
OK: [/dev/pass0] - Device is clean --- [/dev/pass1] - Device is clean --- [/dev/pass2] - Device is clean --- [/dev/pass3] - Device is clean --- [/dev/pass4] - Device is clean --- [/dev/pass5] - Device is clean --- [/dev/pass6] - Device is clean --- [/dev/pass7] - Device is clean --- [/dev/pass8] - Device is clean --- [/dev/pass9] - Device is clean |

Example: Single SCSI drive on FreeBSD 10.1:

/usr/local/libexec/nagios/check_smart.pl -d /dev/da0 -i scsi
OK: no SMART errors detected. |sent_blocks=14067306 temperature=34;;60
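
The output lines in the examples above consist of a status text, a '|' separator and space separated performance data in label=value form, optionally followed by ';' separated thresholds. A minimal sketch for picking a single value out of such a line follows. The function names are hypothetical, and the sketch assumes labels without spaces or shell glob characters, as in the examples shown.

```shell
# Status text: everything before the first '|'.
status_text() {
    printf '%s\n' "${1%%|*}"
}

# Performance data: everything after the first '|' (empty if there is none).
perfdata() {
    case $1 in
        *'|'*) printf '%s\n' "${1#*|}" ;;
        *)     printf '\n' ;;
    esac
}

# Print the value of one perfdata label, without the ';' separated thresholds.
# Returns 1 if the label is not present.
perfdata_value() {
    label=$2
    for pair in $(perfdata "$1"); do
        case $pair in
            "$label"=*)
                v=${pair#*=}
                printf '%s\n' "${v%%;*}"
                return 0
                ;;
        esac
    done
    return 1
}

# Example:
#   out=$(./check_smart.pl -d /dev/sda -i ata)
#   perfdata_value "$out" Power_On_Hours
```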

Command definition (NRPE)

Example command definition for single drive in your nrpe.cfg:

command[check_smart]=sudo /usr/lib/nagios/plugins/check_smart.pl -d $ARG1$ -i $ARG2$ -w $ARG3$

Example command definition for multiple drives using -g parameter in your nrpe.cfg:

command[check_smart_multidrive]=sudo /usr/lib/nagios/plugins/check_smart.pl -g $ARG1$ -i $ARG2$ -w $ARG3$
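
Both definitions rely on NRPE replacing $ARG1$, $ARG2$ and $ARG3$ with the arguments sent by check_nrpe, which NRPE only does when command arguments are enabled (dont_blame_nrpe=1 in nrpe.cfg). The substitution can be sketched as follows to preview the resulting command line. The helper expand_nrpe_args is hypothetical and only imitates the replacement; it is not how NRPE itself is implemented.

```shell
# Hypothetical helper: replace the first occurrence of $ARG1$, $ARG2$, ...
# in a command definition with the supplied arguments, in order.
expand_nrpe_args() {
    cmd=$1
    shift
    n=1
    for arg in "$@"; do
        placeholder="\$ARG${n}\$"          # literal text such as $ARG1$
        case $cmd in
            *"$placeholder"*)
                head=${cmd%%"$placeholder"*}   # text before the placeholder
                tail=${cmd#*"$placeholder"}    # text after the placeholder
                cmd=$head$arg$tail
                ;;
        esac
        n=$((n + 1))
    done
    printf '%s\n' "$cmd"
}

# Example:
#   expand_nrpe_args 'sudo /usr/lib/nagios/plugins/check_smart.pl -d $ARG1$ -i $ARG2$ -w $ARG3$' \
#       /dev/sda sat 'Current_Pending_Sector=14'
```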

Service definition

Service definition in Nagios, Icinga 1.x, Shinken, Naemon

Basic check of a single drive (or drive in software raid):

# Check SMART of a typical single disk (or used in software raid)
define service{
  use generic-service
  host_name mylinux1
  service_description Disk SMART Status SDA
  check_command check_nrpe!check_smart!-a "/dev/sda" "sat" "Current_Pending_Sector=14,Reallocated_Sector_Ct=3"
}

Check SMART of multiple disks at the same time:

# Check SMART of multiple disks with regex (looking for /dev/sda until /dev/sdf)
define service{
  use generic-service
  host_name mylinux1
  service_description Disk SMART Status
  check_command check_nrpe!check_smart_multidrive!-a "/dev/sd[a-f]" "sat" "Current_Pending_Sector=14,Reallocated_Sector_Ct=3"
}

Check SMART of a drive behind a cciss (HP SmartArray) controller:

# Check SMART of a drive behind a cciss (HP SmartArray) raid controller
define service{
  use generic-service
  host_name myhpproliant1
  service_description Disk SMART Status cciss2
  check_command check_nrpe!check_smart!-a "/dev/cciss/c0d0" "cciss,2" "Current_Pending_Sector=14,Reallocated_Sector_Ct=3"
}

Here argument 3 ($ARG3$) is "Current_Pending_Sector=14,Reallocated_Sector_Ct=3". This suits a drive that already has 13 pending sectors and 2 reallocated sectors: the warning thresholds are set to 14 for the Current_Pending_Sector attribute and to 3 for the Reallocated_Sector_Ct attribute. As soon as the drive reaches 14 (or more) pending sectors or 3 (or more) reallocated sectors, the plugin returns a warning. This helps to see whether a disk is really failing, that is, whether the number of defective sectors is growing.
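
Setting each threshold to one above the drive's current raw value can be sketched as a small helper. The function name next_thresholds is hypothetical; the attribute names are written as they appear in the plugin's performance data, and the current values have to be supplied by you, for example from a previous plugin run.

```shell
# Hypothetical helper: turn "Attribute=current_value" pairs into a -w list
# whose thresholds are one above each current value.
next_thresholds() {
    list=
    for pair in "$@"; do
        name=${pair%%=*}
        value=${pair#*=}
        case $value in
            ''|*[!0-9]*) echo "not an integer: $pair" >&2; return 1 ;;
        esac
        # Append "name=value+1", comma separated and without spaces.
        list="${list:+$list,}${name}=$((value + 1))"
    done
    printf '%s\n' "$list"
}

# Example:
#   ./check_smart.pl -d /dev/sda -i ata \
#       -w "$(next_thresholds Current_Pending_Sector=13 Reallocated_Sector_Ct=2)"
```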

Service object definition Icinga 2.x

Check a single SATA drive with a specific warning threshold:

# SMART Check of drive sda
object Service "Hardware" {
  import "generic-service"
  host_name "linuxserver1"
  check_command = "nrpe"
  vars.nrpe_command = "check_smart"
  vars.nrpe_arguments = ["/dev/sda", "sat", "Current_Pending_Sector=14,Reallocated_Sector_Ct=3"]
}

Screenshots

check_smart multiple drives with drive names
check_smart multiple drives in icingaweb2
check_smart multiple alerts
check_smart warning
check_smart all ok with values below threshold
check_smart self log warning