From 1e2387474a449452b78520b9ad96a8b4b5e99722 Mon Sep 17 00:00:00 2001
From: Harald Pfeiffer
Date: Wed, 17 Apr 2019 19:07:19 +0200
Subject: initial commit of source fetch

---
 .../check_hpasm/check_hpasm-4.8/README | 346 +++++++++++++++++++++
 1 file changed, 346 insertions(+)
 create mode 100755 nagios-plugins-contrib-24.20190301~bpo9+1/check_hpasm/check_hpasm-4.8/README

diff --git a/nagios-plugins-contrib-24.20190301~bpo9+1/check_hpasm/check_hpasm-4.8/README b/nagios-plugins-contrib-24.20190301~bpo9+1/check_hpasm/check_hpasm-4.8/README
new file mode 100755
index 0000000..43fceac
--- /dev/null
+++ b/nagios-plugins-contrib-24.20190301~bpo9+1/check_hpasm/check_hpasm-4.8/README
@@ -0,0 +1,346 @@
+check_hpasm Nagios Plugin README
+---------------------
+
+This plugin checks the hardware health of HP Proliant servers with the
+hpasm software installed. It uses the hpasmcli command to acquire the
+condition of the system's critical components like cpus, power supplies,
+temperatures, fans and memory modules. Newer versions also use SNMP.
+
+* For instructions on installing this plugin for use with Nagios,
+  see below. In addition, generic instructions for the GNU toolchain
+  can be found in the INSTALL file.
+
+* For major changes between releases, read the CHANGES file.
+
+* For information on detailed changes that have been made,
+  read the Changelog file.
+
+* This plugin is self-documenting. All plugins that comply with
+  the basic guidelines for development will provide detailed help when
+  invoked with the '-h' or '--help' options.
+
+You can check for the latest plugin at:
+  http://www.consol.de/opensource/nagios/check-hpasm
+
+Send mail to gerhard.lausser@consol.de for assistance.
+Please include the OS type and version that you are using.
+Also, run the plugin with the '-v' option and provide the resulting
+version information.
+Of course, there may be additional diagnostic information
+required as well. Use good judgment.
+
+
+How to "compile" the check_hpasm script.
+--------------------------------------------------------
+
+1) Run the configure script to initialize variables and create a Makefile, etc.
+
+        ./configure --prefix=BASEDIRECTORY --with-nagios-user=SOMEUSER --with-nagios-group=SOMEGROUP --with-perl=PATH_TO_PERL --with-noinst-level=LEVEL --with-degrees=UNIT --with-perfdata --with-hpacucli
+
+   a) Replace BASEDIRECTORY with the path of the directory under which Nagios
+      is installed (default is '/usr/local/nagios')
+   b) Replace SOMEUSER with the name of a user on your system that will be
+      assigned permissions to the installed plugins (default is 'nagios')
+   c) Replace SOMEGROUP with the name of a group on your system that will be
+      assigned permissions to the installed plugins (default is 'nagios')
+   d) Replace PATH_TO_PERL with the path where a perl binary can be found.
+      Besides the system-wide perl you might have installed a private perl
+      just for the nagios plugins (default is the perl in your path).
+   e) Replace LEVEL with one of ok, warning, critical or unknown.
+      If the required hpasm-rpm is not installed, the check_hpasm plugin
+      will exit with the level specified. If you choose ok, the message
+      will say "ok - .... hpasm is not installed". This is different from
+      the "ok - hardware working fine" if hpasm was found.
+      The default is to treat a missing hpasm package as ok.
+   f) Replace UNIT with one of celsius or fahrenheit. The hpasmcli "show temp"
+      prints temperatures both in units of celsius and fahrenheit. With the
+      --with-degrees option you can decide which units will be shown in an
+      alarm message.
+      The default is "celsius".
+   g) You can tell check_hpasm to output performance data by default if
+      you call configure with the --enable-perfdata option.
+   h) You can tell check_hpasm to check the raid status with the hpacucli command
+      if you call configure with the --enable-hpacucli option.
+      You need the hpacucli rpm.
+
+2) "Compile" the plugin with the following command:
+
+        make
+
+   This will produce a "check_hpasm" script. You will also find
+   a "check_hpasm.pl", which you should ignore. It is the template for
+   the compilation and is filled with placeholders, which are replaced
+   during the make process.
+
+
+3) Install the compiled plugin script with the following command:
+
+        make install
+
+   The installation procedure will attempt to place the plugin in a
+   'libexec/' subdirectory in the base directory you specified with
+   the --prefix argument to the configure script.
+
+
+4) Verify that your configuration files for Nagios contain
+   the correct paths to the new plugin.
+
+
+5) Add this line to /etc/sudoers:
+   nagios ALL=NOPASSWD: /sbin/hpasmcli
+   or this, if you also installed the hpacucli package:
+   nagios ALL=NOPASSWD: /sbin/hpasmcli, /usr/sbin/hpacucli
+
+
+
+Command line parameters
+-----------------------
+
+-v, --verbose
+   Increased verbosity will print how check_hpasm communicates with the
+   hpasm daemon and which values were acquired.
+
+-t, --timeout
+   The number of seconds after which the plugin will abort.
+
+-b, --blacklist
+   If some components of your system are missing (typically the secondary
+   power supply bay is empty) and you tolerate this, then blacklist the
+   missing/failed component to avoid false alarms.
+   The value for this option is a slash-separated list of components to
+   ignore.
+   Example: -b p:1,2/f:2/t:3,4/c:1/d:0-1,0-2
+   means: ignore power supplies #1 and #2, fan #2, temperatures #3 and #4,
+   cpu #1 and dimms #1 and #2 in cartridge #0.
+
+-c, --customthresh
+   Override the machine-default temperature thresholds.
+   Example: -c 1:60/4:80/5:50
+   Sets the limit for temperature 1 to 60 degrees, temperature 4 to 80 degrees
+   and temperature 5 to 50 degrees.
+   You get the consecutive numbers by
+   calling check_hpasm -v
+   ...
+   checking temperatures
+   1 processor_zone temperature is 46 (62 max)
+   2 cpu#1 temperature is 43 (73 max)
+   3 i/o_zone temperature is 54 (68 max)
+   4 cpu#2 temperature is 46 (73 max)
+   5 power_supply_bay temperature is 38 (55 max)
+
+-p, --perfdata
+   Add performance data to the output even if you did not compile check_hpasm
+   with --with-perfdata in step 1.
+
+
+
+SNMP and Memory Modules
+-----------------------
+Older hardware does not always show valuable information when queried for
+the health of memory modules. Maybe it's because older modules do not support
+error checking at all.
+
+
+1. no cpqHeResMemModule
+---------------------------------------------------------------------------
+
+2. collapsed cpqHeResMemModule
+---------------------------------------------------------------------------
+
+Some (older) systems do not support the cpqHeResMemModuleEntry table.
+Either there is no oid with 1.3.6.1.4.1.232.6.2.14.11.1 at all
+or there is a single oid like
+
+Example:
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 524288
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.2 = INTEGER: 262144
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.3 = INTEGER: 0
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.4 = INTEGER: 524288
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.5 = INTEGER: 262144
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.6 = INTEGER: 0
+
+                                ^-- module number
+                              ^-- cartridge number (0 = system board)
+                                             ^-- size
+
+iso.3.6.1.4.1.232.6.2.14.11.1.1.0.6 = INTEGER: 0
+
+I compared 300 systems and found out that with
+1.3.6.1.4.1.232.6.2.14.11.1.<no1>.<no2>.<no3> = <no4>
+no1 is always 1
+no2 is always 0
+no3 is the number of memory slots (including the empty ones).
+no4 is always 0. It is probably the health status of the
+overall memory subsystem. I don't know.
+I will implement 0 = ok, not 0 = ask compaq
+
+cpqSiMemECCStatus provides no usable information. All my test systems
+showed 0 which is an undocumented value.
+
+function get_size(cpqHeResMemModuleEntry) will return 1.
+
+3.
cpqHeResMemModule containing crap
+---------------------------------------------------------------------------
+
+grepping for cpqSiMemBoardSize shows 4 modules
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.1 = INTEGER: 262144
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.2 = INTEGER: 262144
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.3 = INTEGER: 0
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.4 = INTEGER: 262144
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.5 = INTEGER: 262144
+iso.3.6.1.4.1.232.2.2.4.5.1.3.0.6 = INTEGER: 0
+
+grepping for cpqHeResMemEntry shows one module with zero values
+iso.3.6.1.4.1.232.6.2.14.11.1.1.0.0 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.2.0.0 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.3.0.0 = ""
+iso.3.6.1.4.1.232.6.2.14.11.1.4.0.0 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.5.0.0 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.6.0.0 = Hex-STRING: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
+
+
+4. cpqHeResMemModuleEntry and cpqSiMemModuleEntry use different table indexes
+---------------------------------------------------------------------------
+
+cpqSiMemBoardIndex      1.3.6.1.4.1.232.2.2.4.5.1.1
+cpqSiMemModuleIndex     1.3.6.1.4.1.232.2.2.4.5.1.2
+
+cpqHeResMemBoardIndex   1.3.6.1.4.1.232.6.2.14.11.1.1
+cpqHeResMemModuleIndex  1.3.6.1.4.1.232.6.2.14.11.1.2
+
+
+cpqSiMemBoardIndex
+SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.1 = INTEGER: 0
+SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.2 = INTEGER: 0
+SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.3 = INTEGER: 0
+SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.4 = INTEGER: 0
+SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.5 = INTEGER: 0
+SNMPv2-SMI::enterprises.232.2.2.4.5.1.1.0.6 = INTEGER: 0
+
+cpqHeResMemBoardIndex
+SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.1 = INTEGER: 0
+SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.2 = INTEGER: 0
+SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.3 = INTEGER: 0
+SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.4 = INTEGER: 0
+SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.5 = INTEGER: 0
+SNMPv2-SMI::enterprises.232.6.2.14.11.1.1.1.6 = INTEGER:
0
+
+It is not possible to use the SNMP-table-indices to identify the
+corresponding he-entry. Matching is done with nested loops.
+
+5. even worse: cpqHeResMemBoardIndex and cpqSiMemBoardIndex don't match
+---------------------------------------------------------------------------
+
+cpqSiMemBoardIndex
+iso.3.6.1.4.1.232.2.2.4.5.1.1.1.1 = INTEGER: 1
+iso.3.6.1.4.1.232.2.2.4.5.1.1.1.2 = INTEGER: 1
+iso.3.6.1.4.1.232.2.2.4.5.1.1.1.3 = INTEGER: 1
+iso.3.6.1.4.1.232.2.2.4.5.1.1.1.4 = INTEGER: 1
+iso.3.6.1.4.1.232.2.2.4.5.1.1.1.5 = INTEGER: 1
+iso.3.6.1.4.1.232.2.2.4.5.1.1.1.6 = INTEGER: 1
+iso.3.6.1.4.1.232.2.2.4.5.1.1.1.7 = INTEGER: 1
+iso.3.6.1.4.1.232.2.2.4.5.1.1.1.8 = INTEGER: 1
+iso.3.6.1.4.1.232.2.2.4.5.1.1.2.1 = INTEGER: 2
+iso.3.6.1.4.1.232.2.2.4.5.1.1.2.2 = INTEGER: 2
+iso.3.6.1.4.1.232.2.2.4.5.1.1.2.3 = INTEGER: 2
+iso.3.6.1.4.1.232.2.2.4.5.1.1.2.4 = INTEGER: 2
+iso.3.6.1.4.1.232.2.2.4.5.1.1.2.5 = INTEGER: 2
+iso.3.6.1.4.1.232.2.2.4.5.1.1.2.6 = INTEGER: 2
+iso.3.6.1.4.1.232.2.2.4.5.1.1.2.7 = INTEGER: 2
+iso.3.6.1.4.1.232.2.2.4.5.1.1.2.8 = INTEGER: 2
+iso.3.6.1.4.1.232.2.2.4.5.1.1.3.1 = INTEGER: 3
+
+cpqHeResMemBoardIndex
+iso.3.6.1.4.1.232.6.2.14.11.1.1.0.1 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.1.0.2 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.1.0.3 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.1.0.4 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.1.0.5 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.1.0.6 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.1.0.7 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.1.0.8 = INTEGER: 0
+iso.3.6.1.4.1.232.6.2.14.11.1.1.1.1 = INTEGER: 1
+iso.3.6.1.4.1.232.6.2.14.11.1.1.1.2 = INTEGER: 1
+iso.3.6.1.4.1.232.6.2.14.11.1.1.1.3 = INTEGER: 1
+iso.3.6.1.4.1.232.6.2.14.11.1.1.1.4 = INTEGER: 1
+iso.3.6.1.4.1.232.6.2.14.11.1.1.1.5 = INTEGER: 1
+iso.3.6.1.4.1.232.6.2.14.11.1.1.1.6 = INTEGER: 1
+iso.3.6.1.4.1.232.6.2.14.11.1.1.1.7 = INTEGER: 1
+iso.3.6.1.4.1.232.6.2.14.11.1.1.1.8 = INTEGER: 1
+iso.3.6.1.4.1.232.6.2.14.11.1.1.2.1 = INTEGER: 2
+
+
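The nested-loop matching mentioned in section 4 can be sketched roughly as
follows. This is an illustration in Python, not the plugin's own Perl code.
It assumes that each walked row has already been parsed into a dict, that the
module index column carries the values 1..6 (the README only prints the board
index column), and that rows are paired by the values of their index columns
rather than by OID suffix. The names si_rows, he_rows and match_modules are
invented for this example. The situation in section 5, where the board index
values themselves differ, is not handled here.

```python
# Hypothetical, hand-parsed rows. "oid_index" is the OID suffix, "board" and
# "module" are the values of the board/module index columns. The module
# values 1..6 are an assumption; the README only prints the board column.
si_rows = [
    {"oid_index": (0, n), "board": 0, "module": n} for n in range(1, 7)
]
he_rows = [
    {"oid_index": (1, n), "board": 0, "module": n} for n in range(1, 7)
]


def match_modules(si_rows, he_rows):
    """Pair each cpqSiMemModule row with a cpqHeResMemModule row."""
    pairs = []
    for si in si_rows:        # outer loop: system information table
        for he in he_rows:    # inner loop: health table
            # Compare column values, not OID suffixes: the suffixes differ
            # (0.n versus 1.n) although both rows describe the same module.
            if (si["board"], si["module"]) == (he["board"], he["module"]):
                pairs.append((si["oid_index"], he["oid_index"]))
                break
    return pairs


pairs = match_modules(si_rows, he_rows)
print(pairs[0])  # ((0, 1), (1, 1))
```

The quadratic loop is harmless at this size; a server has at most a few
dozen memory slots.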
+Redundant fans
+-----------------------
+I saw one old server which had only half of the possible fans installed.
+
+Fan#                             1      2      3      4      5      6
+
+cpqHeFltTolFanPresent            yes    no     yes    no     yes    no
+cpqHeFltTolFanRedundant          no     no     no     no     no     no
+cpqHeFltTolFanRedundantPartner   2      1      4      3      6      5
+cpqHeFltTolFanCondition          ok     other  ok     other  ok     other
+cpqHeFltTolFanLocation           cpu    cpu    cpu    cpu    io     io
+
+Normally this would result in
+...
+fan #1 (cpu) is not redundant
+fan #2 (cpu) is not redundant
+fan #3 (cpu) is not redundant
+fan #4 (cpu) is not redundant
+fan #5 (ioboard) is not redundant
+fan #6 (ioboard) is not redundant
+WARNING - fan #1 (cpu) is not redundant, fan #2 (cpu) is not redundant, fan #3 (cpu) is not redundant, fan #4 (cpu) is not redundant, fan #5 (ioboard) is not redundant, fan #6 (ioboard) is not redundant
+
+However, it was the server owner's decision to install only one fan per location instead of fan pairs, so for the owner this is a false alert.
+
+With --ignore-fan-redundancy, check_hpasm only looks at cpqHeFltTolFanCondition and ignores dependencies between two fans, so the result is:
+
+fan 1 speed is normal, pctmax is 50%, location is cpu, redundance is no, partner is 2
+fan 3 speed is normal, pctmax is 50%, location is cpu, redundance is no, partner is 4
+fan 5 speed is normal, pctmax is 50%, location is ioboard, redundance is no, partner is 6
+OK - System: 'proliant ml370 g3', ...
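The difference between the two modes can be sketched in a few lines of
Python. This is not the plugin's Perl implementation, only a model of the
behaviour shown in the two outputs above. The names FANS and check_fans are
invented for the example, and two points are assumptions: that absent fans
are skipped entirely when redundancy is ignored, and that a non-ok condition
would be reported as a WARNING.

```python
# Fan data from the table above; locations are spelled as in the plugin
# output ("ioboard"), not as in the table ("io").
FANS = [
    # (number, present, redundant, condition, location)
    (1, True, False, "ok", "cpu"),
    (2, False, False, "other", "cpu"),
    (3, True, False, "ok", "cpu"),
    (4, False, False, "other", "cpu"),
    (5, True, False, "ok", "ioboard"),
    (6, False, False, "other", "ioboard"),
]


def check_fans(fans, ignore_redundancy=False):
    """Return (status, problems) for a list of fan tuples."""
    problems = []
    for number, present, redundant, condition, location in fans:
        if ignore_redundancy:
            # Assumption: absent fans are skipped and only the condition
            # of installed fans is evaluated.
            if present and condition != "ok":
                problems.append(f"fan #{number} ({location}) is {condition}")
        elif not redundant:
            problems.append(f"fan #{number} ({location}) is not redundant")
    # Assumption: any problem maps to WARNING, as in the output above.
    status = "WARNING" if problems else "OK"
    return status, problems


print(check_fans(FANS)[0])                          # WARNING
print(check_fans(FANS, ignore_redundancy=True)[0])  # OK
```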
+
+
+A snmp forwarding trick
+-----------------------
+local - where check_hpasm runs
+remote - where a proliant can be reached
+proliant - where the snmp agent runs
+
+remote:
+ssh -R6667:localhost:6667 local
+socat tcp4-listen:6667,reuseaddr,fork UDP:proliant:161
+
+local:
+socat udp4-listen:161,reuseaddr,fork tcp:localhost:6667
+check_hpasm --hostname 127.0.0.1
+
+
+Sample data from real machines
+------------------------------
+
+hpasmcli=$(which hpasmcli)
+hpacucli=$(which hpacucli)
+# dump every hpasmcli "show" section, prefixing each line with its name
+for i in server powersupply fans temp dimm
+do
+  $hpasmcli -s "show $i" | while read -r line
+  do
+    printf "%s %s\n" "$i" "$line"
+  done
+done
+# the raid controller data is only available if hpacucli is installed
+if [ -x "$hpacucli" ]; then
+  for i in config status
+  do
+    $hpacucli ctrl all show $i | while read -r line
+    do
+      printf "%s %s\n" "$i" "$line"
+    done
+  done
+fi
+
+If you think check_hpasm is not working correctly, please run the above script
+and send me the output. It's also helpful to see the output of snmpwalk
+snmpwalk .... 1.3.6.1.4.1.232
+
+
+--
+Gerhard Lausser