Zabbix check for RAID failures

As always: if this helped you in any way and you have some spare bitcoins, you may donate them to me - 16tb2Rgn4uDptrEuR94BkhQAZNgfoMj3ug

One odd gap in Zabbix is out-of-the-box support for hardware errors. You could use IPMI, but that is a schlep to set up. A good way to monitor the disks behind an HP RAID controller on a Linux machine is HP's hpacucli utility.

The one I use is hpacucli-9.0-24.0.noarch.rpm

Download the RPM, save it to /etc/zabbix/scripts and install it:

svr1:/etc/zabbix/scripts # ls -ltr *.rpm
-rw-r--r-- 1 root root 6504897 Mar 25 11:27 hpacucli-9.0-24.0.noarch.rpm
svr1:/etc/zabbix/scripts # rpm -ivh hpacucli-9.0-24.0.noarch.rpm
Preparing...                ########################################### [100%]
   1:hpacucli               ########################################### [100%]
svr1:/etc/zabbix/scripts #

Create a script called zx_raid_status.stage1.sh in /etc/zabbix/scripts with vi; the script is below - just copy, paste and save:
#!/bin/bash

# Script (run by root) to get raid status

# Changelog

# 0.3 HP GEN 8 2 x controllers - King Rat 20130405
# 0.2 Provide absolute path to hpacucli binary, and make logging clearer
# 0.1 Base version - 20120423

# Params 

# Version 
VER="0.3"

if [ -f /etc/zabbix/scripts/diskstatus.log ];then
 rm /etc/zabbix/scripts/diskstatus.log
fi

touch /etc/zabbix/scripts/diskstatus.log
chown zabbix:zabbix /etc/zabbix/scripts/diskstatus.log

# The logical disk(s)
LOGFILE="/etc/zabbix/scripts/diskstatus.log"
echo "Version "$VER > $LOGFILE
echo "Disk(s) last checked at "`date` >> $LOGFILE
echo `hostname -a` >> $LOGFILE

LDSTAT="/tmp/zx_ldstatus"
> ${LDSTAT}
# The physical disks
PDSTAT="/tmp/zx_pdstatus"
> ${PDSTAT}

# Our logger tag
TAG="zx_raidstatus"

# The app location
APP="/usr/sbin/hpacucli"

# Functions
nocont()
{
# How many controllers
${APP} ctrl all show config | grep -i "slot" | awk '{print $6}' > /etc/zabbix/scripts/cont.txt
sort /etc/zabbix/scripts/cont.txt > /etc/zabbix/scripts/sort.log
}

out()
{
 # Write to the log file
 logger -s -t ${TAG}
}

runroot()
{
 # This has to be run as root
 if [ `whoami` != 'root' ]
 then
  echo "This has to be run by root" | out
  exit
 fi
}

pdstatus()
{
while read line;
do
 # Check the status of all physical disks
 ${APP} ctrl slot=$line pd all show status | out
 ${APP} ctrl slot=$line pd all show status >> $LOGFILE
 ECNT=`${APP} ctrl slot=$line pd all show status | egrep -i "(fail|error|offline|rebuild|ignoring|degraded|skipping|nok)" | wc -l`
 if [ ${ECNT} -gt 0 ]
 then
  echo "${ECNT} non-OK statuses being reported (physical disk)" | out
  echo "${ECNT} non-OK statuses being reported (physical disk)" >> $LOGFILE
  echo ${ECNT} > ${PDSTAT}
 else
  echo 0 > ${PDSTAT}
  echo "Physical drives - all ok" >> $LOGFILE
 fi
done < /etc/zabbix/scripts/sort.log
}

ldstatus()
{
while read line;
do
 # Check the status of all logical drives
 ${APP} ctrl slot=$line logicaldrive all show status | out
 ${APP} ctrl slot=$line logicaldrive all show status >> $LOGFILE
 ECNT=`${APP} ctrl slot=$line logicaldrive all show status | egrep -i "(fail|error|offline|rebuild|ignoring|degraded|skipping|nok)" | wc -l`
 if [ ${ECNT} -gt 0 ]
 then
  echo "${ECNT} non-OK statuses being reported (logical disk)" | out
  echo "${ECNT} non-OK statuses being reported (logical disk)" >> $LOGFILE
  echo ${ECNT} > ${LDSTAT}
 else
  echo 0 > ${LDSTAT}
  echo "Logical drives - all ok" >> $LOGFILE
 fi
done < /etc/zabbix/scripts/sort.log
}

# Execute

echo "${VER} started"
runroot
nocont
ldstatus
pdstatus
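Two pieces of the script are worth sanity-checking without real hardware: the slot-number extraction in nocont() and the non-OK filter that produces ECNT. The snippet below runs both against hypothetical hpacucli output lines (the controller model and drive IDs are made up); real output can differ per controller model, so adjust the awk field if yours does not match.

```shell
#!/bin/sh
# Hypothetical "ctrl all show config" controller line - field 6 is the slot number
CTRLLINE='Smart Array P410i in Slot 0 (Embedded)'
SLOT=$(echo "$CTRLLINE" | grep -i "slot" | awk '{print $6}')
echo "slot=$SLOT"

# Hypothetical "pd all show status" output - the egrep filter counts every
# line whose status is not a plain OK (Failed and Rebuilding both match)
SAMPLE='   physicaldrive 2C:1:1 (port 2C:box 1:bay 1, 300 GB): OK
   physicaldrive 3C:1:7 (port 3C:box 1:bay 7, 300 GB): Failed
   physicaldrive 3C:1:8 (port 3C:box 1:bay 8, 300 GB): Rebuilding'
ECNT=$(printf '%s\n' "$SAMPLE" | egrep -i "(fail|error|offline|rebuild|ignoring|degraded|skipping|nok)" | wc -l)
echo "errors=$ECNT"
```

This should print slot=0 and errors=2, which is exactly the count the stage-1 script writes to the /tmp status files.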

Create a script called zx_raid_status.pdstat.sh in /etc/zabbix/scripts with vi; the script is below - just copy, paste and save:
#!/bin/sh

# This is the second stage run by zabbix to get the last physical disk error count

# Changelog

# 0.1 Base version

# Params

# Our version
VER="0.1"

# Our files to read
PDSTAT="/tmp/zx_pdstatus"

cat ${PDSTAT}

Create a script called zx_raid_status.ldstat.sh in /etc/zabbix/scripts with vi; the script is below - just copy, paste and save:
#!/bin/sh

# This is the second stage run by zabbix to get the last logical disk error count

# Changelog

# 0.1 Base version

# Params

# Our version
VER="0.1"

# Our files to read
LDSTAT="/tmp/zx_ldstatus"

cat ${LDSTAT}
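The two-stage handoff can be exercised without a controller: stage one writes a single error count into the /tmp status file, and the pdstat/ldstat scripts simply cat it back to the agent. A minimal simulation (the count of 2 is made up):

```shell
#!/bin/sh
# Simulate stage one writing a physical-disk error count to the status file
echo 2 > /tmp/zx_pdstatus
# This cat is all the raid.pderror UserParameter script does
cat /tmp/zx_pdstatus
```

If this prints 2, the second stage will hand that same value to Zabbix.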

You should have the following when done
svr1:/opt/temp # cd /etc/zabbix/scripts/
svr1:/etc/zabbix/scripts # ls -ltr
total 6492
-rw-r--r-- 1 root   root        1503 Mar 25 11:25 zx_raid_status.stage1.sh
-rw-r--r-- 1 root   root         242 Mar 25 11:25 zx_raid_status.pdstat.sh
-rw-r--r-- 1 root   root         241 Mar 25 11:25 zx_raid_status.ldstat.sh
-rw-r--r-- 1 root   root     6504897 Mar 25 11:27 hpacucli-9.0-24.0.noarch.rpm
svr1:/etc/zabbix/scripts #

Make the scripts executable with chmod +x and set the owner to the zabbix user:
svr1:/etc/zabbix/scripts # chmod +x zx*.sh
svr1:/etc/zabbix/scripts # chown zabbix:zabbix zx*.sh
svr1:/etc/zabbix/scripts # ls -ltr zx*.sh
-rwxr-xr-x 1 zabbix zabbix 1503 Mar 25 11:25 zx_raid_status.stage1.sh
-rwxr-xr-x 1 zabbix zabbix  242 Mar 25 11:25 zx_raid_status.pdstat.sh
-rwxr-xr-x 1 zabbix zabbix  241 Mar 25 11:25 zx_raid_status.ldstat.sh
svr1:/etc/zabbix/scripts #

Run the script manually to make sure that it works:
svr1:/etc/zabbix/scripts # /etc/zabbix/scripts/zx_raid_status.stage1.sh
0.3 started
zx_raidstatus:
zx_raidstatus:    logicaldrive 1 (279.4 GB, RAID 1): OK
zx_raidstatus:    logicaldrive 2 (1.1 TB, RAID 0): OK
zx_raidstatus:    logicaldrive 3 (1.4 TB, RAID 1+0): Failed
zx_raidstatus:
zx_raidstatus: 4 non-OK statuses being reported (logical disk)
zx_raidstatus:
zx_raidstatus:    physicaldrive 2C:1:1 (port 2C:box 1:bay 1, 300 GB): OK
zx_raidstatus:    physicaldrive 2C:1:2 (port 2C:box 1:bay 2, 300 GB): OK
zx_raidstatus:    physicaldrive 2C:1:3 (port 2C:box 1:bay 3, 300 GB): OK
zx_raidstatus:    physicaldrive 2C:1:4 (port 2C:box 1:bay 4, 300 GB): OK
zx_raidstatus:    physicaldrive 3C:1:5 (port 3C:box 1:bay 5, 300 GB): OK
zx_raidstatus:    physicaldrive 3C:1:6 (port 3C:box 1:bay 6, 300 GB): OK
zx_raidstatus:    physicaldrive 3C:1:7 (port 3C:box 1:bay 7, 300 GB): Failed
zx_raidstatus:    physicaldrive 3C:1:8 (port 3C:box 1:bay 8, 300 GB): Failed
zx_raidstatus:    physicaldrive 4C:2:1 (port 4C:box 2:bay 1, 300 GB): OK
zx_raidstatus:    physicaldrive 4C:2:2 (port 4C:box 2:bay 2, 300 GB): OK
zx_raidstatus:    physicaldrive 4C:2:3 (port 4C:box 2:bay 3, 300 GB): OK
zx_raidstatus:    physicaldrive 4C:2:4 (port 4C:box 2:bay 4, 300 GB): Failed
zx_raidstatus:    physicaldrive 5C:2:5 (port 5C:box 2:bay 5, 300 GB): OK
zx_raidstatus:    physicaldrive 5C:2:6 (port 5C:box 2:bay 6, 300 GB): OK
zx_raidstatus:    physicaldrive 5C:2:7 (port 5C:box 2:bay 7, 300 GB): OK
zx_raidstatus:    physicaldrive 5C:2:8 (port 5C:box 2:bay 8, 300 GB): Failed
zx_raidstatus:
zx_raidstatus: 4 non-OK statuses being reported (physical disk)
svr1:/etc/zabbix/scripts #

Add the following line to the root crontab. It runs the script every 5 minutes; the script writes the current disk error counts to the status files in /tmp.
*/5 * * * * /etc/zabbix/scripts/zx_raid_status.stage1.sh > /dev/null 2>&1
svr1:/etc/zabbix/scripts # crontab -l
# DO NOT EDIT THIS FILE - edit the master and reinstall.
# (/tmp/crontab.XXXXIM2c9I installed on Mon Mar 25 11:37:16 2013)
# (Cron version V5.0 -- $Id: crontab.c,v 1.12 2004/01/23 18:56:42 vixie Exp $)
*/5 * * * * /etc/zabbix/scripts/zx_raid_status.stage1.sh > /dev/null 2>&1
svr1:/etc/zabbix/scripts #

Edit the Zabbix agent config file
svr1:/etc/zabbix/scripts # vi /etc/zabbix/zabbix_agentd.conf  

and add these lines to the bottom of the file
UserParameter=raid.lderror,/etc/zabbix/scripts/zx_raid_status.ldstat.sh
UserParameter=raid.pderror,/etc/zabbix/scripts/zx_raid_status.pdstat.sh
svr1:/etc/zabbix/scripts # tail /etc/zabbix/zabbix_agentd.conf
#UserParameter=mysql.qps,mysqladmin -uroot status|cut -f9 -d":"
#UserParameter=mysql.version,mysql -V
UserParameter=raid.lderror,/etc/zabbix/scripts/zx_raid_status.ldstat.sh
UserParameter=raid.pderror,/etc/zabbix/scripts/zx_raid_status.pdstat.sh
svr1:/etc/zabbix/scripts #

Stop and start the Zabbix agent
svr1:/etc/zabbix/scripts # /etc/init.d/zabbix-agent stop
Shutdown may take a while....
Shutting down zabbix_agent:                                                                                                                                                              done
svr1:/etc/zabbix/scripts # /etc/init.d/zabbix-agent start
Starting zabbix_agent:                                                                                                                                                                   done
svr1:/etc/zabbix/scripts # /etc/init.d/zabbix-agent status
Zabbix agent running(PID): 16290
16291
16292
16293
16294
svr1:/etc/zabbix/scripts #

The ITEMS and TRIGGERS are set up on the Zabbix server as follows.
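As a rough sketch, create two Zabbix agent items with the keys raid.lderror and raid.pderror (numeric type, update interval of 300 seconds to match the cron schedule), and one trigger per key that fires when the last value is non-zero. In the classic trigger-expression syntax of this Zabbix era, with svr1 as a placeholder host name:

```
Item key:  raid.pderror
Trigger:   {svr1:raid.pderror.last(0)}>0

Item key:  raid.lderror
Trigger:   {svr1:raid.lderror.last(0)}>0
```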


