The Compaq Health & Wellness Driver

The Compaq ProLiant Linux Team

April 2, 2001

This guide was designed to facilitate the installation and use of the Compaq Health and Wellness Driver on various Linux distributions on Compaq ProLiant Servers.

Notice

© 2001 Compaq Computer Corporation

Compaq, Compaq Insight Manager, NetFlex, NonStop, ProLiant, ROMPaq, and SmartStart are registered with the United States Patent and Trademark Office.

Alpha, AlphaServer, AlphaStation, ProSignia, and SoftPaq are trademarks and/or service marks of Compaq Computer Corporation.

Netelligent is a trademark and/or service mark of Compaq Information Technologies Group, L.P. in the U.S. and/or other countries.

Microsoft, MS-DOS, Windows, and Windows NT are trademarks and/or registered trademarks of Microsoft Corporation.

UNIX is a registered trademark of The Open Group.

SCO, UnixWare, OpenServer 5, and UnixWare 7 are registered trademarks of the Santa Cruz Operation.

Linux is a registered trademark of Linus Torvalds.

Red Hat is a registered trademark of Red Hat, Inc.

Caldera Systems and OpenLinux are either registered trademarks or trademarks of Caldera Systems.

TurboLinux is a trademark of Turbo Linux, Inc.

SuSE is a registered trademark of SuSE AG.

Other product names mentioned herein may be trademarks and/or registered trademarks of their respective companies.

The information in this publication is subject to change without notice and is provided "AS IS" WITHOUT WARRANTY OF ANY KIND. THE ENTIRE RISK ARISING OUT OF THE USE OF THIS INFORMATION REMAINS WITH RECIPIENT. IN NO EVENT SHALL COMPAQ BE LIABLE FOR ANY DIRECT, CONSEQUENTIAL, INCIDENTAL, SPECIAL, PUNITIVE OR OTHER DAMAGES WHATSOEVER (INCLUDING WITHOUT LIMITATION, DAMAGES FOR LOSS OF BUSINESS PROFITS, BUSINESS INTERRUPTION OR LOSS OF BUSINESS INFORMATION), EVEN IF COMPAQ HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

The limited warranties for Compaq products are exclusively set forth in the documentation accompanying such products. Nothing herein should be construed as constituting a further or additional warranty.

This publication does not constitute an endorsement of the product or products that were tested. The configuration or configurations tested or described may or may not be the only available solution. This test is not a determination of product quality or correctness, nor does it ensure compliance with any federal, state, or local requirements.

Compaq Health & Wellness Driver How-To

Solution Guide prepared by Compaq ProLiant Linux Team

Second Edition (November 2000)

Third Edition (December 2000)

Fourth Edition (February 2001)

Fifth Edition (April 2001)

 

Abstract

Compaq has created many different tools for managing Compaq servers, a key component of which is the health and wellness driver. This document describes the features of the health and wellness driver for Linux, how it can be installed, and how the information it provides can be used.

Contents

1 What is the Health and Wellness Driver?

1.1 Exposing the Health log into /proc

1.2 System Temperature Monitoring

1.3 System Fan Monitoring

1.4 Monitoring the System Fault Tolerant Power Supply

1.5 ECC Memory Monitoring

1.6 Automatic Server Recovery (ASR)

2 Setup Procedure

2.1 Install

2.2 Upgrading the Driver

2.3 Running the Driver

2.4 Uninstall

2.5 Behind the Scenes

3 Console Messages

3.1 Memory

3.2 Thermal Sensors (Temperature)

3.3 Fans

3.4 Power Supplies

3.5 Processor Power Modules

4 Information Retrieval

4.1 Temperature

4.2 Fan

4.3 Power Supply

4.4 Integrated Management Log (IML)

5 Compaq Integrated Management Log Viewer (IML Viewer)

5.1 Running the IML Viewer

5.2 File Menu

5.3 Log Menu

5.4 View Menu

6 Troubleshooting

6.1 Non Certified Machines

6.2 Health Driver Immediately Stops after Installation

6.3 No Console Messages

6.4 Failed Dependencies

6.5 Failure in cpqimlview

6.6 Superuser Only

 

 

1 What is the Health and Wellness Driver?

The Compaq Wellness Driver (cpqhealth.o) collects and monitors important operational data on your server to ensure that the system is "healthy". Any abnormal conditions are logged into a non-volatile Health Log and can be inspected by using certain /proc entries.

Compaq Servers are equipped with hardware sensors and firmware to monitor certain abnormal conditions such as abnormal temperature readings, fan failures, ECC memory errors, etc. The cpqhealth.o driver monitors these conditions and reports them to the administrator by printing messages on the console (preserved in /var/log/messages), and also logging the condition into the server's health log.

The following is a list of the features of the Compaq Wellness Driver:

 

1.1 Exposing the Health log into /proc

Most events trigger a log entry in a Compaq internal area of non-volatile memory (NVRAM). This health log is exposed through the Compaq Wellness Driver in the /proc filesystem.

 

1.2 System Temperature Monitoring

A Compaq server may contain several temperature sensors. If the normal operating range is exceeded for any of these sensors, the Compaq Wellness Driver does the following:

Use the Compaq System Configuration Utility to control the shutdown option.

 

1.3 System Fan Monitoring

If a cooling fan fails, the Compaq Wellness Driver does the following:

Use the Compaq System Configuration Utility to control the shutdown option.

 

1.4 Monitoring the System Fault Tolerant Power Supply

If a primary power supply fails, the server automatically switches over to a backup power supply. The system wellness driver does the following:

 

1.5 ECC Memory Monitoring

If a correctable ECC memory error occurs, the driver logs the error in the health log including the memory address causing the error. If too many errors occur at the same memory location, the driver disables the ECC error interrupts to prevent flooding the console with warnings (the hardware automatically corrects the ECC error).
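The per-address counting described above can be illustrated with a minimal sketch. It is not the driver's implementation: the function name, the one-address-per-line input, and the limit are all assumptions made for illustration, since the text does not say how many errors count as "too many".

```shell
#!/bin/sh
# Hypothetical illustration only: read one memory address per line on
# stdin and print an address at the moment its error count first
# exceeds the limit given as "$1". The real threshold is not documented.
noisy_addresses() {
    awk -v limit="$1" '{ if (++count[$1] == limit + 1) print $1 }'
}
```

For example, piping four lines containing 0x10, 0x20, 0x10, 0x10 into `noisy_addresses 2` reports 0x10 once, on its third occurrence.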

 

1.6 Automatic Server Recovery (ASR)

Automatic Server Recovery is implemented using a "heartbeat" timer that continually counts down. The driver frequently reloads the counter to prevent it from counting down to zero. If the ASR timer reaches zero, the operating system is assumed to be locked up and the system automatically attempts to reboot. Before rebooting, the driver does the following:

This server feature is configured using the Compaq System Configuration Utility.
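The heartbeat mechanism can be pictured with a small simulation. This is only a sketch of the countdown-and-reload idea; the function name, the tick unit, and the parameters are assumptions and do not reflect actual ASR timer values.

```shell
#!/bin/sh
# Hypothetical simulation: a counter starts at "$1" ticks and loses one
# per tick. Every "$2" ticks the "driver" reloads it (0 means the
# operating system is hung and never reloads). Runs for "$3" ticks.
asr_simulate() {
    timeout=$1
    reload_every=$2
    total=$3
    counter=$timeout
    tick=1
    while [ "$tick" -le "$total" ]; do
        counter=$((counter - 1))
        if [ "$counter" -le 0 ]; then
            # Counter expired: the hardware would reboot the server here.
            echo "reboot at tick $tick"
            return 0
        fi
        if [ "$reload_every" -gt 0 ] && [ $((tick % reload_every)) -eq 0 ]; then
            counter=$timeout   # heartbeat: reload the counter
        fi
        tick=$((tick + 1))
    done
    echo "alive"
}
```

Calling `asr_simulate 5 3 20` models a healthy system that reloads the counter in time, while `asr_simulate 5 0 20` models a locked-up operating system whose counter runs out.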

 

2 Setup Procedure

The health and wellness driver is available as a Red Hat Package Manager file (RPM). As with every RPM file the following options are available: you may install, query, refresh and uninstall the package. For the remainder of this section, we discuss how to install and uninstall the package (for more information about RPM files, see the appropriate How-To document). We also show you how the driver should react during regular operation.

 

2.1 Install

If you have a previous version of the Health Driver installed, it is important to uninstall this version before installing the new RPM file. See section 2.4 for information on uninstalling the driver.

After obtaining the RPM file, login as root and type the following to install the driver:

rpm -ivh cpqhealth-2.0.0-9.<distribution>.i386.rpm

The RPM file may have a different version number depending on supported systems and functionality. <distribution> refers to the Linux distribution supported by the RPM; it is very important to install the RPM only on that distribution. The driver is inserted immediately. On systems with variable speed fans, you may notice the fans spinning more slowly if the temperature is reasonably low. To check whether the driver loaded properly, type the following (only available as system administrator):

lsmod

You should see an entry indicating that the driver "cpqhealth" was inserted.
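The check can also be scripted. The following is a minimal sketch; the function name module_listed is an assumption, and it relies only on the usual lsmod layout of a header line followed by one module per line with the module name in the first column.

```shell
#!/bin/sh
# Succeeds if a module named "$1" appears in lsmod-style output read
# from stdin. The first line is treated as the column header.
module_listed() {
    awk -v name="$1" 'NR > 1 && $1 == name { found = 1 }
                      END { exit (found ? 0 : 1) }'
}

# Example use on a live system (not executed by this sketch):
#   lsmod | module_listed cpqhealth && echo "cpqhealth is loaded"
```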

This driver is currently supported according to the following matrix:

Health Driver Version 1.0.0-1 (Release Date 08/08/2000)

    Distribution and Version: Red Hat Linux 6.1, 6.2; SuSE Linux 6.3; TurboLinux Server 6.0; Caldera OpenLinux eServer 2.3

    Compaq Servers: ProLiant 800, ProLiant 1600, ProLiant 1850R, ProLiant 3000, ProLiant 5500, ProLiant 6400R, ProLiant 8000, ProLiant 8500, ProLiant DL360, ProLiant DL380, ProLiant DL580, ProLiant ML330, ProLiant ML350, ProLiant ML370, ProLiant ML530, ProLiant ML570

Health Driver Version 1.1.0-2 (Release Date 11/03/2000)

    Distribution and Version: Red Hat Linux 6.2, 7.0; SuSE Linux 6.3, 7.0; TurboLinux Server 6.0, 6.0.5; Caldera OpenLinux eServer 2.3

    Compaq Servers: ProLiant DL360, ProLiant DL380, ProLiant DL580, ProLiant ML330, ProLiant ML350 1 GHz, ProLiant ML370, ProLiant ML530, ProLiant ML570

Health Driver Version 1.1.1-1 (Release Date 11/15/2000)

    Distribution and Version: Red Hat Linux 6.2, 7.0; SuSE Linux 6.3, 7.0; TurboLinux Server 6.0, 6.0.5; Caldera OpenLinux eServer 2.3

    Compaq Servers: ProLiant DL320, ProLiant DL360, ProLiant DL380, ProLiant DL580, ProLiant ML330, ProLiant ML350 1 GHz, ProLiant ML370, ProLiant ML530, ProLiant ML570

Health Driver Version 1.2.0-1 (Release Date 12/14/2000)

    Distribution and Version: Red Hat Linux 6.2, 7.0; SuSE Linux 6.3, 7.0; TurboLinux Server 6.0.5; Caldera OpenLinux eServer 2.3.1

    Compaq Servers: ProLiant 8000, ProLiant 8500, ProLiant DL320, ProLiant DL360, ProLiant DL380, ProLiant DL580, ProLiant ML330, ProLiant ML350 1 GHz, ProLiant ML370, ProLiant ML530, ProLiant ML570

Health Driver Version 2.0.0-11 (Release Date 04/02/2001)

    Distribution and Version: Red Hat Linux 6.2, 7.0; SuSE Linux 6.3, 7.0; TurboLinux Server 6.0.5; Caldera OpenLinux eServer 2.3.1

    Compaq Servers: ProLiant DL320, ProLiant DL360, ProLiant DL380, ProLiant DL580, ProLiant ML330, ProLiant ML350 1 GHz, ProLiant ML370, ProLiant ML530, ProLiant ML570

    Distribution and Version: Red Hat Linux 7.0; SuSE Linux 7.0; Caldera OpenLinux eServer 2.3.1

    Compaq Servers: ProLiant 8000, ProLiant 8500, ProLiant DL760, ProLiant ML330e, ProLiant ML750

On any other machine, you will get an error message when you attempt to install the package. The driver will not be operational and it is advisable to uninstall the driver at your earliest convenience.

 

2.2 Upgrading the Driver

The Red Hat Package Manager provides the option to upgrade an RPM package. Before upgrading, it is important to uninstall any RPM packages that are dependent on the health driver, such as the Compaq Management Agents and the Compaq Remote Insight Driver, since these packages are dependent on a specific health driver version. Attempting to install these packages on an unsupported health driver version may result in an unstable system. Type the following, in order, to uninstall any of these packages if they are present on your system:

rpm -e cmanic

rpm -e cmastor

rpm -e cmasvr

rpm -e cmafdtn

rpm -e cpqrid

To upgrade the health driver, type the following command:

rpm -Uvh cpqhealth-2.0.0-9.<distribution>.i386.rpm

Please note that if the upgrade option is used, the health driver will be stopped after installation to preserve system stability. Please upgrade any components dependent on the Compaq Health Driver (cpqrid, cmafdtn, cmasvr, cmanic, cmastor).

To start the Health driver, type the following commands:

For Red Hat, Caldera, TurboLinux: /etc/rc.d/init.d/cpqhealth start

For SuSE: /etc/rc.d/cpqhealth start
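The upgrade sequence in this section can be summarized as a dry-run script. This is a sketch under assumptions: the function names and the distribution keywords (redhat, caldera, turbolinux, suse) are invented for illustration, and the commands are printed rather than executed so they can be reviewed first. On a real system, only the dependent packages that are actually installed need to be removed.

```shell
#!/bin/sh
# Map a distribution keyword (hypothetical names chosen for this sketch)
# to the cpqhealth init script path given in the text.
init_script_for() {
    case "$1" in
        redhat|caldera|turbolinux) echo /etc/rc.d/init.d/cpqhealth ;;
        suse)                      echo /etc/rc.d/cpqhealth ;;
        *) echo "unknown distribution: $1" >&2; return 1 ;;
    esac
}

# Print, without executing, the upgrade commands in document order:
# remove dependent packages, upgrade the driver, start the driver.
upgrade_plan() {
    rpm_file=$1
    distro=$2
    script=$(init_script_for "$distro") || return 1
    for pkg in cmanic cmastor cmasvr cmafdtn cpqrid; do
        echo "rpm -e $pkg"
    done
    echo "rpm -Uvh $rpm_file"
    echo "$script start"
}
```

For example, `upgrade_plan cpqhealth-2.0.0-9.<distribution>.i386.rpm suse` (with the real file name substituted) prints seven commands ending in the SuSE start command. Dependent packages should be upgraded or reinstalled before the driver is started, as noted above.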

 

2.3 Running the Driver

You will notice that once installed, the driver will be automatically loaded every time your server boots up.

Several /proc entries are available when the driver is running. They are:

The contents of the /proc entries are described in section 4.

For additional information and help, a man page is available by typing:

man cpqhealth

 

2.4 Uninstall

Uninstalling follows the standard RPM procedure and is achieved by typing:

rpm -e cpqhealth

If the health driver is running, it will be shut down. Should you reboot the system, the health driver will NOT be inserted at bootup time.

If you do not recall the version of the health driver installed, the following command may be used to discover the package version:

rpm -q cpqhealth

If you ever want to unload the driver, simply type (as system admin):

rmmod cpqhealth

This removes the health driver from the running system, but using the rmmod command will not prevent the driver from being inserted at bootup time. While the driver is loaded, an error condition causes it to log an entry to the system log, to the health log, and to the (text) console. In case of an emergency, the health driver will attempt to shut your system down gracefully.

 

2.5 Behind the Scenes

A prototype of the driver is inserted in /lib/modules/Compaq/drivers/<kernel-type>. Furthermore, a copy of the driver is placed in /lib/modules/<current kernel-version>/misc. This allows the health driver to be inserted by hand from anywhere in the file system.

The health driver exposes the following device nodes that are used to control its operation. These character device nodes all have a major number of 207, and the minor numbers are as follows:

0 = /dev/cpqhealth/cpqw Redirector interface

1 = /dev/cpqhealth/crom EISA CROM

2 = /dev/cpqhealth/cdt Data Table

3 = /dev/cpqhealth/cevt Event Log

4 = /dev/cpqhealth/casr Automatic Server Recovery

5 = /dev/cpqhealth/cecc ECC Memory

6 = /dev/cpqhealth/cmca Machine Check Architecture

7 = /dev/cpqhealth/ccsm Deprecated CDT

8 = /dev/cpqhealth/cnmi NMI Handling

9 = /dev/cpqhealth/css Sideshow Management

10 = /dev/cpqhealth/cram CMOS interface

11 = /dev/cpqhealth/cpci PCI IRQ interface
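The minor-number assignments above can be kept in a small lookup table, for example to compare against the output of `ls -l /dev/cpqhealth`. This is a sketch only; the function names are assumptions, and nothing here creates or modifies device nodes.

```shell
#!/bin/sh
# Minor number and node name for each cpqhealth character device,
# copied from the list above (all nodes use major number 207).
cpqhealth_nodes() {
cat <<'EOF'
0 cpqw
1 crom
2 cdt
3 cevt
4 casr
5 cecc
6 cmca
7 ccsm
8 cnmi
9 css
10 cram
11 cpci
EOF
}

# Print the expected minor number for the node name given as "$1".
expected_minor() {
    cpqhealth_nodes | awk -v n="$1" '$2 == n { print $1 }'
}
```

For instance, `expected_minor casr` prints 4, which should match the minor number that `ls -l` shows for /dev/cpqhealth/casr.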

In order to insert the driver at bootup time, the rc.local script is modified during the install process.

 

3 Console Messages

When events occur outside of normal operations, the health driver may display a console message. The following is a list of console messages the health driver provides as it monitors system health.

View the IML (the IML is described more fully in Section 4) to identify where the fault lies when failures are reported, and take the appropriate action.

 

3.1 Memory

If a memory module fails, view the IML to identify the faulty memory module. Plan for maintenance downtime and replace the module.

 

3.2 Thermal Sensors (Temperature)

If the temperature exceeds the acceptable threshold, ensure that all system fans are functional and that airflow to all system vents is not obstructed. Check room temperature and make sure air conditioning is not turned off at night.

 

3.3 Fans

If a critical fan has failed, replace the specified fan immediately, even if the fan appears functional (spinning). If a redundant fan has failed, replace the fan during scheduled maintenance.

 

3.4 Power Supplies

Check the status and connections on all power supplies when failures are reported. If a power supply has failed, replace the specified power supply.

 

3.5 Processor Power Modules

If a processor power module has failed, replace the specified processor power module.

 

4 Information Retrieval

Fan, power supply, and temperature information is available through the health driver's /proc entries. Internally, the driver tracks many more items of information. We will expose them over time and will probably create a special subdirectory for all Compaq-related entries.

Currently, all information is summarized in tables. Each table row represents one instance of a hardware device (temperature sensor, fan, power supply). The table columns hold the following attributes:

 

4.1 Temperature

1. Instance Number of the temperature sensor

2. Type of sensor.

3. Over Threshold? (0 = no, 1 = yes)

4. Data Available? (0 = no, 1 = yes)

5. Current Temperature Valid? (0 = no, 1 = yes)

6. Current Temperature in degrees Celsius

7. Threshold Temperature Valid? (0 = no, 1 = yes)

8. Threshold Temperature
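The temperature columns lend themselves to a simple filter. The sketch below makes several assumptions that should be checked against a real system: rows are read from standard input because the /proc entry name is not given here, columns are taken to be whitespace-separated single tokens in the order listed above, and the function name is invented.

```shell
#!/bin/sh
# Read temperature table rows on stdin. For every sensor whose
# "Over Threshold?" column (3) is 1, print the instance number (1),
# the current temperature (6) and the threshold temperature (8).
# A value is replaced by "n/a" when its validity column (5 or 7) is not 1.
over_threshold_sensors() {
    awk 'NF >= 8 && $3 == 1 {
             cur = ($5 == 1) ? $6 : "n/a"
             thr = ($7 == 1) ? $8 : "n/a"
             print $1, cur, thr
         }'
}
```

With two made-up rows, "1 2 0 1 1 30 1 60" and "2 3 1 1 1 75 1 70", the filter prints only "2 75 70".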

 

4.2 Fan

1. Instance Number

2. Type of Fan

3. Location Designator

4. Speed State of Fan

5. Redundant Partner (Instance Number or 0 if not applicable)

6. Redundant Fan? (0 = no, 1 = yes)

7. Is Primary Fan? (0 = no, 1 = yes)

8. Is Hot Pluggable Fan? (0 = no, 1 = yes)

 

4.3 Power Supply

1. Instance Number

2. Type of Power Supply

3. Number of Ratings (can be 0)

4. Number of Channels (can be 0)

5. Number of Temperature Sensors (can be 0)

6. Number of Fans (can be 0)

7. A number of rows describing each rating's attributes (see below)

8. A number of rows describing each channel's attributes

9. A number of rows describing each temperature sensor's attributes

10. A number of rows describing each fan's attributes

The rating is the standard specification for the power supply. Here are its attributes:

1. Is Data Valid? (0 = no, 1 = yes)

2. Threshold Voltage

3. Total Power Output in Watts

Each channel has the following attributes:

1. Is Data Valid? (0 = no, 1 = yes)

2. Current Voltage in mV

3. Current Amperes in mA

4. Current Wattage in mW
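Because the channel values use milli-units, a quick consistency check is possible: millivolts times milliamperes gives microwatts, so dividing by 1000 yields milliwatts. The sketch below assumes integer readings and an invented function name; whether the reported wattage always equals this product is not stated in this document.

```shell
#!/bin/sh
# Compute power in mW from a voltage in mV ("$1") and a current in mA
# ("$2"): mV * mA = microwatts, and microwatts / 1000 = mW.
channel_mw() {
    echo $(( $1 * $2 / 1000 ))
}
```

For example, a channel at 12000 mV drawing 5000 mA corresponds to `channel_mw 12000 5000`, which gives 60000 mW (60 W).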

The temperature sensor's attributes are as follows:

1. Is Data Valid? (0 = no, 1 = yes)

2. Current Temperature in Celsius

3. Current Threshold Temperature in Celsius

The cooling fan's information is:

1. Is Data Valid? (0 = no, 1 = yes)

2. Type of Fan

3. Current Speed State

 

4.4 Integrated Management Log (IML)

The log entry is structured like this:

 

 

5 Compaq Integrated Management Log Viewer (IML Viewer)

The information in the Integrated Management Log may also be leveraged through the IML Viewer application, which is also included in the Red Hat Package Manager file. The IML records system events, critical errors, power-on messages, memory errors, and any catastrophic hardware or software errors that typically cause a system to fail. The IML Viewer allows the manipulation of this data.

 

5.1 Running the IML Viewer

The IML Viewer is an application that runs in the X Window System environment. Type the following to run the IML Viewer:

cpqimlview

The Compaq Integrated Management Log Viewer automatically displays the current entries in the IML.

Each event in the IML Viewer has one of the following statuses to identify the severity of the event:

The severity of the event and other information in the IML Viewer help to quickly identify and correct problems, thus minimizing downtime. The IML Viewer offers several capabilities to help identify, correct, and document server health issues. The following describes the menu options available.

 

5.2 File Menu

The File Menu options include:

 

 

 

5.3 Log Menu

The Log Menu options include:

 

 

 

5.4 View Menu

The View Menu options include:

6 Troubleshooting

This section describes common problems that might occur during installation and operation of the Health and Wellness Driver. In most cases, a workaround is available and is described below.

6.1 Non Certified Machines

Symptom When the Health and Wellness Driver RPM file is installed you will get the following message:

Inserting Health & Wellness Driver...

This Compaq Health & Wellness Driver is not certified for your system.

Uninstall this package at your earliest convenience.

The driver is not inserted into the list of modules. When trying to force the insertion of the driver in /lib/modules/Compaq/drivers with insmod, the following message is output:

cpqhealth.o: init_module: Device or resource busy

Cause The Linux Compaq Health and Wellness Driver is only certified for a subset of systems that Compaq offers. The driver is deactivated for all other hardware and will not function by design.

Workaround There is no workaround for this.

 

6.2 Health Driver Immediately Stops after Installation

Symptom When the Health and Wellness Driver RPM file is installed you will get the following message:

If you are using the upgrade option of the RPM package, the

Compaq Health Driver will be stopped to prevent an unstable system.

Please upgrade any components dependent on the Compaq Health Driver

(cmafdtn, cmasvr, cmanic, cmastor, cpqrid)

To restart the Compaq Health Driver, type "/etc/rc.d/init.d/cpqhealth start"

Cause The Compaq Management Agents are only certified for a specific version of the Linux Compaq Health and Wellness Driver. Since this is the case, using the upgrade option instead of the install option can invite certain problems in the interaction between the agents and the health driver that may result in an unstable system. To prevent this, the health driver will be stopped after an upgrade to provide the system administrator with the opportunity to make sure the system has no dependencies that may cause problems.

Workaround Use the rpm -ivh command instead of the rpm -Uvh command to install the health driver. Uninstall any dependent packages:

rpm -e cmanic

rpm -e cmastor

rpm -e cmasvr

rpm -e cmafdtn

rpm -e cpqrid

rpm -e cpqhealth

rpm -ivh cpqhealth-2.0.0-9.<distribution>.i386.rpm

The message will still appear when the rpm -ivh option is used; this is merely a reminder. However, the health driver will not be stopped.

6.3 No Console Messages

Symptom If you run a SuSE distribution, you will see no console messages appearing on the text screens (Ctrl+Alt+F1, for instance). However, the error messages do get logged properly in /var/log/messages.

If you run KDE or Gnome, xterms will also not show the console messages originating from the health driver.

Cause SuSE configures the syslogd daemon slightly differently than other distributions: The system messages will not appear on the lower digit terminals (tty1-9), but will exclusively appear on tty10 (Ctrl+Alt+F10).

Workaround If you are not happy with the message logging on your system, you may configure it differently by modifying /etc/syslog.conf in the following way:

# Log all kernel messages to the console.

# Logging much else clutters up the screen.

kern.* /dev/console

# Log anything (except mail) of level info or higher.

# Don't log private authentication messages!

*.info;mail.none;news.none;authpriv.none /var/log/messages

After you send a "HUP" signal to the syslogd process, your kernel messages should appear on all consoles.

kill -1 <pid of syslogd>
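The process ID can be looked up from a PID file instead of being typed by hand. The following is a minimal sketch; the function name is invented, and the default path /var/run/syslogd.pid is an assumption that may differ on your distribution.

```shell
#!/bin/sh
# Print the process ID stored in the PID file "$1" (assumed default:
# /var/run/syslogd.pid). Fails if the file is unreadable or does not
# contain a plain number.
syslogd_pid() {
    pid_file=${1:-/var/run/syslogd.pid}
    [ -r "$pid_file" ] || return 1
    pid=$(cat "$pid_file")
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;
    esac
    echo "$pid"
}

# Example use as root on a live system (not executed by this sketch):
#   kill -1 "$(syslogd_pid)"
```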

6.4 Failed Dependencies

Symptom If you install the Health and Wellness Driver on a system with a minimal Linux install, you might encounter the following message:

error: failed dependencies:

egcs >= 1.1.2 is needed by cpqhealth-2.0.0-9

/bin/objcopy is needed by cpqhealth-2.0.0-9

If you then download the egcs RPM file and try to install it, similar error messages appear and egcs fails to install as well.

Cause The health and wellness driver depends on two (fairly basic) Linux applications being present in the system: gcc, the C compiler, and objcopy, a binary utility to fix up object modules. The former is contained in a package called egcs, the latter in an RPM named binutils. Unfortunately, these RPM files have dependencies on other packages which in turn have their own dependencies, etc. The Red Hat Package Manager is designed to detect failed dependencies which have to be met step by step. This can be frustrating to the user.

Workaround We list all packages that egcs and binutils depend on. You must install the packages in the following order to prevent failed dependencies.

1. ld-config

2. glibc

3. info-install

4. readline

5. termcap

6. libtermcap

7. bash

8. binutils

9. gcc-cpp

10. kernel-headers

11. glibc-devel

12. make

13. egcs
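Installing the packages in this order can be scripted. The sketch below is one possible approach, not a Compaq-provided procedure. It assumes that all RPM files sit in one directory and follow the usual <name>-<version>...rpm naming, it uses the spelling egcs from the rpm error message for the last package, and the function names are invented. Passing "echo rpm -ivh" as the second argument gives a dry run.

```shell
#!/bin/sh
# Package names in the installation order listed above.
dependency_order() {
cat <<'EOF'
ld-config
glibc
info-install
readline
termcap
libtermcap
bash
binutils
gcc-cpp
kernel-headers
glibc-devel
make
egcs
EOF
}

# Install one RPM file per package from directory "$1", in order, and
# stop at the first problem. "$2" is the install command (default:
# "rpm -ivh").
install_in_order() {
    dir=$1
    installer=${2:-rpm -ivh}
    for pkg in $(dependency_order); do
        # "-[0-9]" keeps "glibc" from also matching "glibc-devel".
        set -- "$dir"/"$pkg"-[0-9]*.rpm
        if [ ! -e "$1" ]; then
            echo "no RPM file found for $pkg in $dir" >&2
            return 1
        fi
        $installer "$1" || return 1
    done
}
```

A dry run such as `install_in_order /tmp/rpms "echo rpm -ivh"` (the directory is only an example) prints the thirteen install commands without changing the system.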

In order to install these packages we recommend the following procedure:

 

6.5 Failure in cpqimlview

Symptom When starting cpqimlview, the IML Viewer, you will see the following message:

The IML is not functioning after this error message appears.

Cause The health driver is not inserted on your system. This can happen, for instance, when cpqimlview is used after the Health and Wellness Driver package has been uninstalled. Another possible reason is that your system is not certified for the current version of the health driver.

Workaround You may want to try to insert the driver manually by typing:

modprobe cpqhealth.o

in a console window. This will insert the health driver (verify by typing 'lsmod'). If that does not work, then your system is most likely not certified for the health driver.

 

6.6 Superuser Only

Symptom You will experience the following problems:

failed to open //var/lib/rpm/packages.rpm

error: cannot open //var/lib/rpm/packages.rpm

Cause Installing the driver requires system administrator (root) privileges.

Workaround Be sure to log in as root before you attempt the driver install.