The Compaq Health & Wellness Driver
The Compaq ProLiant Linux Team
April 2, 2001
This guide was designed to facilitate the installation and use of the Compaq Health and Wellness Driver on various Linux distributions on Compaq ProLiant Servers.
Notice
© 2001 Compaq Computer Corporation
Compaq, Compaq Insight Manager, NetFlex, NonStop, ProLiant, ROMPaq, and SmartStart are registered with the United States Patent and Trademark Office.
Alpha, AlphaServer, AlphaStation, ProSignia, and SoftPaq are trademarks and/or service marks of Compaq Computer Corporation.
Netelligent is a trademark and/or service mark of Compaq Information Technologies Group, L.P. in the U.S. and/or other countries.
Microsoft, MS-DOS, Windows, and Windows NT are trademarks and/or registered trademarks of Microsoft Corporation.
UNIX is a registered trademark of The Open Group.
SCO, UnixWare, OpenServer 5, and UnixWare 7 are registered trademarks of the Santa Cruz Operation.
Linux is a registered trademark of Linus Torvalds.
Red Hat is a registered trademark of Red Hat, Inc.
Caldera Systems and OpenLinux are either registered trademarks or trademarks of Caldera Systems.
TurboLinux is a trademark of Turbo Linux, Inc.
SuSE is a registered trademark of SuSE AG.
Other product names mentioned herein may be trademarks and/or registered trademarks of their respective companies.
The information in this publication is subject to change without notice and is provided "AS IS" WITHOUT WARRANTY OF ANY KIND. THE ENTIRE RISK ARISING OUT OF THE USE OF THIS INFORMATION REMAINS WITH RECIPIENT. IN NO EVENT SHALL COMPAQ BE LIABLE FOR ANY DIRECT, CONSEQUENTIAL, INCIDENTAL, SPECIAL, PUNITIVE OR OTHER DAMAGES WHATSOEVER (INCLUDING WITHOUT LIMITATION, DAMAGES FOR LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION OR LOSS OF BUSINESS INFORMATION), EVEN IF COMPAQ HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
The limited warranties for Compaq products are exclusively set forth in the documentation accompanying such products. Nothing herein should be construed as constituting a further or additional warranty.
This publication does not constitute an endorsement of the product or products that were tested. The configuration or configurations tested or described may or may not be the only available solution. This test is not a determination of product quality or correctness, nor does it ensure compliance with any federal, state, or local requirements.
Compaq Health & Wellness Driver How-To
Solution Guide prepared by Compaq ProLiant Linux Team
Second Edition (November 2000)
Third Edition (December 2000)
Fourth Edition (February 2001)
Fifth Edition (April 2001)
Abstract
Compaq has created many different tools for managing Compaq servers, and the health and wellness driver is a key component of them. This document describes the features of the health and wellness driver for Linux, how to install it, and how to use the information it provides.
Contents
1 What is the Health and Wellness Driver?
1.1 Exposing the Health log into /proc
1.2 System Temperature Monitoring
5 Compaq Integrated Management Log Viewer (IML Viewer)
1 What is the Health and Wellness Driver?
The Compaq Wellness Driver (cpqhealth.o) collects and monitors important operational data on your server to ensure that the system is "healthy". Any abnormal conditions are logged into a non-volatile Health Log and can be inspected by using certain /proc entries.
Compaq Servers are equipped with hardware sensors and firmware to monitor certain abnormal conditions such as abnormal temperature readings, fan failures, ECC memory errors, etc. The cpqhealth.o driver monitors these conditions and reports them to the administrator by printing messages on the console (preserved in /var/log/messages), and also logging the condition into the server's health log.
The following is a list of the features of the Compaq Wellness Driver:
1.1 Exposing the Health log into /proc
Most events trigger a log entry in a Compaq internal area of non-volatile memory (NVRAM). The Compaq Wellness Driver exposes this health log through the /proc filesystem.
1.2 System Temperature Monitoring
A Compaq server may contain several temperature sensors. If the normal operating range is exceeded for any of these sensors, the Compaq Wellness Driver does the following:
Use the Compaq System Configuration Utility to control the shutdown option.
If a cooling fan fails, the Compaq Wellness Driver does the following:
Use the Compaq System Configuration Utility to control the shutdown option.
1.4 Monitoring the System Fault Tolerant Power Supply
If a primary power supply fails, the server automatically switches over to a backup power supply. The system wellness driver does the following:
If a correctable ECC memory error occurs, the driver logs the error in the health log including the memory address causing the error. If too many errors occur at the same memory location, the driver disables the ECC error interrupts to prevent flooding the console with warnings (the hardware automatically corrects the ECC error).
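The flood-protection rule described above can be pictured as a per-address error counter. The following is only a toy sketch of that idea: the function name and the limit are hypothetical, and the driver's real threshold is not stated in this guide.

```shell
#!/bin/sh
# Toy model only: read one error address per line on standard input and print
# each address once, at the moment its error count exceeds the limit in $1.
# ASSUMPTION: the limit is a made-up parameter, not the driver's real value.
noisy_ecc_addresses() {
    awk -v limit="$1" '{ count[$1]++ } count[$1] == limit + 1 { print $1 }'
}

# Example:
#   printf '0x1000\n0x2000\n0x1000\n0x1000\n' | noisy_ecc_addresses 2
```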
1.6 Automatic Server Recovery (ASR)
The Automatic Server Recovery is implemented using a "heartbeat" timer that continually counts down. The driver frequently reloads the counter to prevent it from counting down to zero. If the ASR counts down to 0, it is assumed that the operating system is locked up and the system automatically attempts to reboot. Before rebooting, the driver does the following:
This server feature is configured using the Compaq System Configuration Utility.
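The heartbeat mechanism can be illustrated with a small simulation. This is a toy model under stated assumptions, not the driver's implementation: it touches no hardware, and the function name, the reload value, and the tick granularity are all invented for illustration.

```shell
#!/bin/sh
# Toy model of a heartbeat (watchdog) timer.
#   $1 = value the counter is reloaded to while the system is healthy
#   $2 = tick at which the simulated operating system hangs
# Prints the tick at which the counter reaches 0 (the simulated reboot).
asr_expiry_tick() {
    reload=$1
    hang_at=$2
    counter=$reload
    tick=0
    while [ "$counter" -gt 0 ]; do
        tick=$((tick + 1))
        if [ "$tick" -lt "$hang_at" ]; then
            counter=$reload            # system alive: counter is reloaded
        else
            counter=$((counter - 1))   # system hung: counter runs down
        fi
    done
    echo "$tick"
}
```

As long as the reload keeps happening, the counter never reaches zero; once it stops, the timeout is simply the reload value counted down.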
The health and wellness driver is available as a Red Hat Package Manager (RPM) file. As with any RPM file, you can install, query, refresh, and uninstall the package. The remainder of this section covers installing and uninstalling the package (for more information about RPM files, see the appropriate How-To document) and describes how the driver behaves during regular operation.
If you have a previous version of the Health Driver installed, it is important to uninstall this version before installing the new RPM file. See section 2.3 for information on uninstalling the driver.
After obtaining the RPM file, log in as root and type the following to install the driver:
rpm -ivh cpqhealth-2.0.0-9.<distribution>.i386.rpm
The RPM file may have a different version number depending on supported systems and functionality. <distribution> refers to the Linux distribution that the RPM supports; install the RPM only on that distribution. The driver is inserted immediately. On systems with variable-speed fans, you may notice the fans slowing down if the temperature is reasonably low. To check whether the driver loaded properly, type the following (available only to the system administrator):
lsmod
You should see an entry indicating that the driver "cpqhealth" was inserted.
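The check can also be scripted. The following is a minimal sketch rather than part of the Compaq package: the function name module_loaded is hypothetical, and it assumes that lsmod prints the module name in the first column, which is the usual layout.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the named module appears in lsmod-style
# output read from standard input (module name in the first column).
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }'
}

# Example use on a live system (as system administrator):
#   lsmod | module_loaded cpqhealth && echo "cpqhealth is loaded"
```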
This driver is currently supported according to the following matrix:
| Health Driver Version | Release Date | Distribution (Version) | Compaq Servers |
| 1.0.0-1 | 08/08/2000 | Red Hat Linux (6.1, 6.2); SuSE Linux (6.3); TurboLinux Server (6.0); Caldera OpenLinux eServer (2.3) | ProLiant 800, 1600, 1850R, 3000, 5500, 6400R, 8000, 8500, DL360, DL380, DL580, ML330, ML350, ML370, ML530, ML570 |
| 1.1.0-2 | 11/03/2000 | Red Hat Linux (6.2, 7.0); SuSE Linux (6.3, 7.0); TurboLinux Server (6.0, 6.0.5); Caldera OpenLinux eServer (2.3) | ProLiant DL360, DL380, DL580, ML330, ML350 1 GHz, ML370, ML530, ML570 |
| 1.1.1-1 | 11/15/2000 | Red Hat Linux (6.2, 7.0); SuSE Linux (6.3, 7.0); TurboLinux Server (6.0, 6.0.5); Caldera OpenLinux eServer (2.3) | ProLiant DL320, DL360, DL380, DL580, ML330, ML350 1 GHz, ML370, ML530, ML570 |
| 1.2.0-1 | 12/14/2000 | Red Hat Linux (6.2, 7.0); SuSE Linux (6.3, 7.0); TurboLinux Server (6.0.5); Caldera OpenLinux eServer (2.3.1) | ProLiant 8000, 8500, DL320, DL360, DL380, DL580, ML330, ML350 1 GHz, ML370, ML530, ML570 |
| 2.0.0-11 | 04/02/2001 | Red Hat Linux (6.2, 7.0); SuSE Linux (6.3, 7.0); TurboLinux Server (6.0.5); Caldera OpenLinux eServer (2.3.1) | ProLiant DL320, DL360, DL380, DL580, ML330, ML350 1 GHz, ML370, ML530, ML570 |
| 2.0.0-11 | 04/02/2001 | Red Hat Linux (7.0); SuSE Linux (7.0); Caldera OpenLinux eServer (2.3.1) | ProLiant 8000, 8500, DL760, ML330e, ML750 |
On any other machine, you will get an error message when you attempt to install the package. The driver will not be operational and it is advisable to uninstall the driver at your earliest convenience.
The Red Hat Package Manager provides the option to upgrade an RPM package. Before upgrading, it is important to uninstall any RPM packages that are dependent on the health driver, such as the Compaq Management Agents and the Compaq Remote Insight Driver, since these packages are dependent on a specific health driver version. Attempting to install these packages on an unsupported health driver version may result in an unstable system. Type the following, in order, to uninstall any of these packages if they are present on your system:
rpm -e cmanic
rpm -e cmastor
rpm -e cmasvr
rpm -e cmafdtn
rpm -e cpqrid
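The removal sequence above can be wrapped in a loop that skips packages that are not installed. This is a sketch under assumptions: the function name and the RPM_CMD override are hypothetical conveniences, while rpm -q and rpm -e are the standard query and erase options.

```shell
#!/bin/sh
# Packages that depend on the health driver, in the removal order given above.
DEPENDENT_PKGS="cmanic cmastor cmasvr cmafdtn cpqrid"

# Hypothetical helper: remove each dependent package that is installed.
# RPM_CMD defaults to rpm; set RPM_CMD="echo rpm" to preview the commands
# without changing the system.
remove_dependents() {
    for pkg in $DEPENDENT_PKGS; do
        if ${RPM_CMD:-rpm} -q "$pkg" >/dev/null 2>&1; then
            ${RPM_CMD:-rpm} -e "$pkg" || return 1
        fi
    done
}

# Example (as root): remove_dependents
```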
To upgrade the health driver, type the following command:
rpm -Uvh cpqhealth-2.0.0-9.<distribution>.i386.rpm
Please note that if the upgrade option is used, the health driver will be stopped after installation to preserve system stability. Please upgrade any components dependent on the Compaq Health Driver (cpqrid, cmafdtn, cmasvr, cmanic, cmastor).
To start the Health driver, type the following commands:
For Red Hat, Caldera, TurboLinux:
/etc/rc.d/init.d/cpqhealth start
For SuSE:
/etc/rc.d/cpqhealth start
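A script that has to run on more than one distribution can pick whichever of the two start scripts exists. The helper below is a sketch; its name and the optional root-prefix argument (handy for trying it out in a scratch directory) are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the path of the first cpqhealth init script that
# exists. An optional first argument is prepended as a root directory.
find_cpqhealth_init() {
    root="${1:-}"
    for script in /etc/rc.d/init.d/cpqhealth /etc/rc.d/cpqhealth; do
        if [ -f "$root$script" ]; then
            printf '%s\n' "$root$script"
            return 0
        fi
    done
    return 1
}

# Example (as root):
#   init=$(find_cpqhealth_init) && "$init" start
```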
You will notice that once installed, the driver will be automatically loaded every time your server boots up.
Several /proc entries are available when the driver is running. They are:
The contents of the /proc entries are described in section 4.
For additional information and help, a man page is available by typing:
man cpqhealth
Uninstalling follows the standard RPM procedure. Type:
rpm -e cpqhealth
If the health driver is running, it will be shut down. Should you reboot the system, the health driver will NOT be inserted at bootup time.
If you do not recall the version of the health driver installed, the following command may be used to discover the package version:
rpm -q cpqhealth
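By default, rpm -q prints the package as name-version-release (for example, cpqhealth-2.0.0-9). If a script needs the version and release separately, they can be split off with shell parameter expansion. The helper name below is hypothetical, and the sketch assumes the plain name-version-release form; rpm versions that append an architecture suffix would leave that suffix in the release field.

```shell
#!/bin/sh
# Hypothetical helper: split a name-version-release string and print
# "version release".
split_version_release() {
    nvr=$1
    release=${nvr##*-}           # text after the last dash
    name_version=${nvr%-*}       # everything before the last dash
    version=${name_version##*-}  # text after the remaining last dash
    printf '%s %s\n' "$version" "$release"
}

# Example:
#   split_version_release "$(rpm -q cpqhealth)"
```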
If you ever want to unload the driver, simply type (as system admin):
rmmod cpqhealth
The health driver will be removed from your system. Should an error condition occur, the driver will log an entry to the system log and to the health log as well as to the (text) console. In case of an emergency, the health driver will attempt to shut your system down gracefully. Using the rmmod command will not prevent the driver from being inserted at bootup time.
A prototype of the driver is installed in /lib/modules/Compaq/drivers/<kernel-type>. In addition, a copy of the driver is placed in /lib/modules/<current kernel-version>/misc. This allows the health driver to be inserted by hand from anywhere in the file system.
The health driver exposes the following device nodes that are used to control its operation. These character device nodes all have a major number of 207, and the minor numbers are as follows:
0 = /dev/cpqhealth/cpqw Redirector interface
1 = /dev/cpqhealth/crom EISA CROM
2 = /dev/cpqhealth/cdt Data Table
3 = /dev/cpqhealth/cevt Event Log
4 = /dev/cpqhealth/casr Automatic Server Recovery
5 = /dev/cpqhealth/cecc ECC Memory
6 = /dev/cpqhealth/cmca Machine Check Architecture
7 = /dev/cpqhealth/ccsm Deprecated CDT
8 = /dev/cpqhealth/cnmi NMI Handling
9 = /dev/cpqhealth/css Sideshow Management
10 = /dev/cpqhealth/cram CMOS interface
11 = /dev/cpqhealth/cpci PCI IRQ interface
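Should a node ever need to be recreated by hand, the numbering above maps directly onto mknod arguments (name, type c for character device, major, minor). The sketch below only prints the commands instead of running them; the function name is hypothetical, and whether the package recreates missing nodes on its own is not covered in this guide.

```shell
#!/bin/sh
# Major number and node names in minor-number order, taken from the list above.
CPQ_MAJOR=207
CPQ_NODES="cpqw crom cdt cevt casr cecc cmca ccsm cnmi css cram cpci"

# Hypothetical helper: print (not run) one mknod command per device node.
print_mknod_commands() {
    minor=0
    for name in $CPQ_NODES; do
        printf 'mknod /dev/cpqhealth/%s c %s %s\n' "$name" "$CPQ_MAJOR" "$minor"
        minor=$((minor + 1))
    done
}

# Example: print_mknod_commands
```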
In order to insert the driver at bootup time, the rc.local script is modified during the install process.
When events occur outside of normal operations, the health driver may display a console message. The following is a list of console messages the health driver provides as it monitors system health.
View the IML (the IML is described more fully in Section 4) to identify where the fault lies when failures are reported, and take the appropriate action.
If a memory module fails, view the IML to identify the faulty memory module. Plan for maintenance downtime and replace the module.
3.2 Thermal Sensors (Temperature)
If the temperature exceeds the acceptable threshold, ensure that all system fans are functional and that airflow to all system vents is not obstructed. Check room temperature and make sure air conditioning is not turned off at night.
If a critical fan has failed, replace the specified fan immediately, even if the fan appears functional (spinning). If a redundant fan has failed, replace the fan during scheduled maintenance.
Check the status and connections on all power supplies when failures are reported. If a power supply has failed, replace the specified power supply.
If a processor power module has failed, replace the specified processor power module.
Fan, power supply, and temperature information is available through the health driver's /proc entries. Internally, the driver tracks many more items; we will expose them over time and will probably create a special subdirectory for all Compaq-related entries.
Currently, all information is summarized in tables. Each table row represents one instance of a hardware device (temperature sensor, fan, or power supply). The table columns hold the following attributes:
1. Instance Number of the temperature sensor
2. Type of sensor.
3. Over Threshold? (0 = no, 1 = yes)
4. Data Available? (0 = no, 1 = yes)
5. Current Temperature Valid? (0 = no, 1 = yes)
6. Current Temperature in degrees Celsius
7. Threshold Temperature Valid? (0 = no, 1 = yes)
8. Threshold Temperature
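A temperature table with these eight columns lends itself to simple filtering with awk. The sketch below rests on assumptions that this guide does not confirm: that each sensor occupies one line, that the columns are separated by whitespace, and that they appear in the order listed. The function name is hypothetical, and the rows are read from standard input because the /proc entry name is not given here.

```shell
#!/bin/sh
# Hypothetical helper: read temperature rows on standard input and report
# every sensor whose "Over Threshold?" column (3) is 1.
# ASSUMPTION: one whitespace-separated row per sensor, columns as listed above.
report_hot_sensors() {
    awk 'NF >= 8 && $3 == 1 {
        # Column 5 says whether the current temperature in column 6 is valid.
        cur = ($5 == 1) ? ($6 " C") : "unknown"
        printf "sensor %s over threshold (current: %s)\n", $1, cur
    }'
}

# Example with made-up rows:
#   printf '1 2 0 1 1 30 1 60\n2 3 1 1 1 75 1 70\n' | report_hot_sensors
```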
1. Instance Number
2. Type of Fan
3. Location Designator
4. Speed State of Fan
5. Redundant Partner (Instance Number or 0 if not applicable)
6. Redundant Fan? (0 = no, 1 = yes)
7. Is Primary Fan? (0 = no, 1 = yes)
8. Is Hot Pluggable Fan? (0 = no, 1 = yes)
1. Instance Number
2. Type of Power Supply
3. Number of Ratings (can be 0)
4. Number of Channels (can be 0)
5. Number of Temperature Sensors (can be 0)
6. Number of Fans (can be 0)
7. A number of rows describing each rating's attributes (see below)
8. A number of rows describing each channel's attributes
9. A number of rows describing each temperature sensor's attributes
10. A number of rows describing each fan's attributes
The rating is the standard specification for the power supply. Here are its attributes:
1. Is Data Valid? (0 = no, 1 = yes)
2. Threshold Voltage
3. Total Power Output in Watts
Each channel has the following attributes:
1. Is Data Valid? (0 = no, 1 = yes)
2. Current Voltage in mV
3. Current Amperes in mA
4. Current Wattage in mW
The temperature sensor's attributes are as follows:
1. Is Data Valid? (0 = no, 1 = yes)
2. Current Temperature in Celsius
3. Current Threshold Temperature in Celsius
The cooling fan's information is:
1. Is Data Valid? (0 = no, 1 = yes)
2. Type of Fan
3. Current Speed State
4.4 Integrated Management Log (IML)
The log entry is structured like this:
5 Compaq Integrated Management Log Viewer (IML Viewer)
The information in the Integrated Management Log may also be leveraged through the IML Viewer application, which is also included in the Red Hat Package Manager file. The IML records system events, critical errors, power-on messages, memory errors, and any catastrophic hardware or software errors that typically cause a system to fail. The IML Viewer allows the manipulation of this data.
The IML Viewer is an application that runs in the X Window System environment. Type the following to run the IML Viewer:
cpqimlview
The Compaq Integrated Management Log Viewer automatically displays the current entries in the IML.

Each event in the IML Viewer has one of the following statuses to identify the severity of the event:
The severity of the event and other information in the IML Viewer helps to quickly identify and correct problems, thus minimizing downtime. The IML Viewer allows several capabilities to enhance the ability to identify, correct, and document server health. The following describes the menu options available.

The File Menu options include:



The Log Menu options include:
Clear All Entries - clear the IML. It is recommended to save the current contents into a file before emptying the log.


The View Menu options include:


This section describes common problems that might occur during installation and operation of the Health and Wellness Driver. In most cases a workaround is available, as described in the following paragraphs.
Symptom When the Health and Wellness Driver RPM file is installed you will get the following message:
Inserting Health & Wellness Driver...
This Compaq Health & Wellness Driver is not certified for your system.
Uninstall this package at your earliest convenience.
The driver is not inserted into the list of modules. If you try to force the insertion of the driver from /lib/modules/Compaq/drivers with insmod, the following message is output:
cpqhealth.o: init_module: Device or resource busy
Cause The Linux Compaq Health and Wellness Driver is only certified for a subset of systems that Compaq offers. The driver is deactivated for all other hardware and will not function by design.
Workaround There is no workaround for this.
6.2 Health Driver Immediately Stops after Installation
Symptom When the Health and Wellness Driver RPM file is installed you will get the following message:
If you are using the upgrade option of the RPM package, the
Compaq Health Driver will be stopped to prevent an unstable system.
Please upgrade any components dependent on the Compaq Health Driver
(cmafdtn, cmasvr, cmanic, cmastor, cpqrid)
To restart the Compaq Health Driver, type "/etc/rc.d/init.d/cpqhealth start"
Cause The Compaq Management Agents are only certified for a specific version of the Linux Compaq Health and Wellness Driver. Using the upgrade option instead of the install option can therefore cause problems in the interaction between the agents and the health driver, which may result in an unstable system. To prevent this, the health driver is stopped after an upgrade so that the system administrator can make sure the system has no dependencies that may cause problems.
Workaround Use the rpm -ivh command instead of the rpm -Uvh command to install the health driver. First uninstall any dependent packages and the old health driver, then install the new one:
rpm -e cmanic
rpm -e cmastor
rpm -e cmasvr
rpm -e cmafdtn
rpm -e cpqrid
rpm -e cpqhealth
rpm -ivh cpqhealth-2.0.0-9.<distribution>.i386.rpm
The message will still appear when the rpm -ivh option is used; this is merely a reminder. However, the health driver will not be stopped.
Symptom If you run a SuSE distribution, you will see no console messages appearing on the text screens (Ctrl+Alt+F1, for instance). However, the error messages do get logged properly in /var/log/messages.
If you run KDE or Gnome, xterms will also not show the console messages originating from the health driver.
Cause SuSE configures the syslogd daemon slightly differently than other distributions: system messages do not appear on the lower-numbered terminals (tty1-9), but appear exclusively on tty10 (Ctrl+Alt+F10).
Workaround If you prefer a different logging setup, modify /etc/syslog.conf as follows:
# Log all kernel messages to the console.
# Logging much else clutters up the screen.
kern.* /dev/console
# Log anything (except mail) of level info or higher.
# Don't log private authentication messages!
*.info;mail.none;news.none;authpriv.none /var/log/messages
Then send a HUP signal to the syslogd process, after which kernel messages should appear on all consoles:
kill -1 <pid of syslogd>
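If messages still do not appear, it is worth confirming that the kernel-to-console rule really is in the file that syslogd reads. The check below is a sketch; the function name is hypothetical, and the path of the configuration file is passed in by the caller.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given syslog configuration file routes
# all kernel messages to /dev/console (the "kern.*" rule shown above).
has_kernel_console_rule() {
    grep -Eq '^kern\.\*[[:space:]]+/dev/console[[:space:]]*$' "$1"
}

# Example:
#   has_kernel_console_rule "$conf" || echo "kern.* rule is missing from $conf"
```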
Symptom If you install the Health and Wellness driver on a system with a minimal Linux install, you might encounter the following message:
error: failed dependencies:
egcs >= 1.1.2 is needed by cpqhealth-2.0.0-9
/bin/objcopy is needed by cpqhealth-2.0.0-9
If you then download the egcs RPM file and try to install it, similar dependency errors prevent egcs from installing.
Cause The health and wellness driver depends on two (fairly basic) Linux applications being present in the system: gcc, the C compiler, and objcopy, a binary utility to fix up object modules. The former is contained in a package called egcs, the latter in an RPM named binutils. Unfortunately, these RPM files have dependencies on other packages which in turn have their own dependencies, etc. The Red Hat Package Manager is designed to detect failed dependencies which have to be met step by step. This can be frustrating to the user.
Workaround The following list contains all packages that both egcs and gcc depend on. Install the packages in the order shown to prevent failed dependencies.
1. ld-config
2. glibc
3. info-install
4. readline
5. termcap
6. libtermcap
7. bash
8. binutils
9. gcc-cpp
10. kernel-headers
11. glibc-devel
12. make
13. egcs
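Installing in the listed order can be automated once the RPM files have been collected in one directory. The sketch below makes several assumptions that should be checked against your distribution: the package names are taken verbatim from the list and may differ from the actual file names, each file is assumed to be named <name>-<version>...rpm with a version that starts with a digit, and the function name and the RPM_CMD override are hypothetical.

```shell
#!/bin/sh
# Package names in the installation order listed above
# (the last entry is spelled egcs, as in the rpm error message).
ORDERED_PKGS="ld-config glibc info-install readline termcap libtermcap bash binutils gcc-cpp kernel-headers glibc-devel make egcs"

# Hypothetical helper: install, in order, every listed package whose RPM file
# is found in directory $1. Set RPM_CMD="echo rpm" to preview the commands.
# ASSUMPTION: file names look like <name>-<version>...rpm, version starts
# with a digit (so "glibc-" does not also match glibc-devel).
install_in_order() {
    dir="$1"
    for name in $ORDERED_PKGS; do
        for file in "$dir"/"$name"-[0-9]*.rpm; do
            [ -f "$file" ] || continue
            ${RPM_CMD:-rpm} -ivh "$file" || return 1
        done
    done
}

# Example (as root): install_in_order /tmp/rpms
```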
In order to install these packages we recommend the following procedure:
Symptom When starting cpqimlview, the IML Viewer, you will see the following message:

The IML Viewer does not function after this error message appears.
Cause The health driver is not inserted on your system. This can happen, for instance, when cpqimlview is run after the Health and Wellness Driver package has been uninstalled. Another possible reason is that your system is not certified for the current version of the health driver.
Workaround You may want to try to insert the driver manually by typing:
modprobe cpqhealth
in a console window. This inserts the health driver (verify by typing 'lsmod'). If that does not work, your system is most likely not certified for the health driver.
Symptom When you attempt to install the driver, you see the following errors:
failed to open //var/lib/rpm/packages.rpm
error: cannot open //var/lib/rpm/packages.rpm
Cause Installing the driver requires system administrator (root) privileges.
Workaround Be sure to log in as root before you attempt the driver install.
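An install script can guard against this mistake by checking the user ID before calling rpm. The following is a minimal sketch; the function name is hypothetical, while id -u is the standard way to obtain the numeric user ID (0 for root).

```shell
#!/bin/sh
# Hypothetical helper: succeed only if the given numeric user ID is root's (0).
is_root_uid() {
    [ "$1" -eq 0 ]
}

# Example guard at the top of an install script:
#   is_root_uid "$(id -u)" || { echo "must be run as root" >&2; exit 1; }
```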