Intel MPCMM0001 Network Card User Manual


 
MPCMM0001 Chassis Management Module Software Technical Product Specification 41
6 Process Monitoring and Integrity
6.1 Overview
The Chassis Management Module monitors the general health of processes running on the CMM
and can take recovery actions upon detection of failed processes. This is handled by the Process
Monitoring Service (PMS).
Upon detecting an unhealthy process, the PMS takes a configurable recovery action. Examples
of recovery actions include restarting the process and failing over to the standby CMM.
The PMS itself is also monitored to ensure that it is operating correctly. The PMS is monitored in
both a single CMM configuration and a redundant CMM configuration. When faults are detected in
the PMS, corrective actions are taken.
The PMS also provides dynamic configuration and status information through the CLI, RPC, and
SNMP interfaces. For example, users can administratively lock/disable monitoring of a process
while the PMS is running to suit their particular needs. The PMS also provides static configuration
so that customers can tune system parameters for the given platform. Examples
of these parameters include the monitoring interval, retries, and ramp-up times.
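To make the role of these parameters concrete, the following is a minimal sketch in Python, not the actual PMS configuration format. All names (MonitorConfig, RecoveryAction, should_recover) are hypothetical. The units (seconds) and the interpretation of "retries" as the number of failed checks tolerated before recovery are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class RecoveryAction(Enum):
    # Hypothetical names; the text names restart and failover only as examples.
    RESTART_PROCESS = "restart"
    FAILOVER_TO_STANDBY = "failover"


@dataclass
class MonitorConfig:
    name: str
    interval_s: float = 5.0     # monitoring interval (assumed unit: seconds)
    retries: int = 3            # failed checks tolerated before recovery (assumed meaning)
    ramp_up_s: float = 10.0     # grace period after start during which failures are ignored
    admin_locked: bool = False  # administratively locked: monitoring is disabled
    action: RecoveryAction = RecoveryAction.RESTART_PROCESS


def should_recover(cfg, uptime_s, consecutive_failures):
    """Return True when the configured recovery action should be taken."""
    if cfg.admin_locked:
        return False            # a locked process is never recovered
    if uptime_s < cfg.ramp_up_s:
        return False            # still ramping up; failures do not count yet
    return consecutive_failures > cfg.retries
```

In this sketch, a process with the default settings triggers recovery on its fourth consecutive failed check, provided it has been up longer than the ramp-up time and is not administratively locked.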
6.1.1 Process Existence Monitoring
Process existence monitoring utilizes the operating system's process table to determine the
existence of the process. When the CMM software is started, the PMS initializes and determines
the set of processes to monitor for process existence. The PMS periodically queries the operating
system for the existence of that set of processes. When a monitored process is found not to exist,
the PMS will generate a SEL entry and take a recovery action.
Process existence monitoring can be utilized on all permanent processes (processes which exist for
the life of the CMM software as a whole). It is particularly useful when monitoring processes that
were not specifically developed for running on the CMM. Applications that are provided by the
operating system vendor are examples of these types of processes. For the Linux* operating
system, processes like syslogd and crond would be good examples.
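As an illustration of the periodic existence check described above, the following Python sketch scans a Linux-style process table and reports which monitored processes are missing. It is not the PMS implementation. The function names are hypothetical, and the sketch assumes that reading /proc/<pid>/comm is an acceptable way to query the process table. Generating the SEL entry and taking the recovery action are left out.

```python
import os


def running_process_names(proc_root="/proc"):
    """Return the set of command names found in a Linux-style process table."""
    names = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():      # only numeric entries correspond to PIDs
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                names.add(f.read().strip())
        except OSError:              # process exited between listing and reading
            continue
    return names


def missing_processes(monitored, proc_root="/proc"):
    """Return the monitored names absent from the process table, sorted."""
    return sorted(set(monitored) - running_process_names(proc_root))
```

A monitor built on this would call missing_processes(["syslogd", "crond"]) once per monitoring interval and treat any returned name as a process that no longer exists.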
6.1.2 Thread Watchdog Monitoring
Thread watchdog monitoring requires that the process being monitored notifies the PMS of its
continued operation. These notifications allow the PMS to monitor the process both for existence
and for conditions where the process locks up. Each thread requiring monitoring within a process using
the thread watchdog registers with the PMS. The PMS loops through its list of registered
threads and determines whether each one is still operating. When any thread is determined
to be unresponsive (i.e., not notifying the PMS of its continued operation), the PMS will generate a
SEL entry and take a recovery action.
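The register, notify, and check cycle described above can be sketched as follows. This is an illustrative Python model only and does not reflect the actual PMS thread watchdog API. The class name, method names, and the single timeout parameter are assumptions. The clock is injectable so that the behavior can be exercised without waiting in real time.

```python
import threading
import time


class ThreadWatchdog:
    """Hypothetical heartbeat registry; not the actual PMS thread watchdog API."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock
        self._last_seen = {}              # thread name -> time of last notification
        self._lock = threading.Lock()     # registry is shared across threads

    def register(self, thread_name):
        """Start monitoring a thread; registration counts as a first notification."""
        with self._lock:
            self._last_seen[thread_name] = self._clock()

    def notify(self, thread_name):
        """Record that a registered thread is still operating."""
        with self._lock:
            if thread_name not in self._last_seen:
                raise KeyError(thread_name)
            self._last_seen[thread_name] = self._clock()

    def unresponsive(self):
        """Return the sorted names of threads silent for longer than the timeout."""
        now = self._clock()
        with self._lock:
            return sorted(name for name, seen in self._last_seen.items()
                          if now - seen > self._timeout_s)
```

In this model, each monitored thread calls notify() from its main loop, and the monitor periodically calls unresponsive(). Any name it returns corresponds to the point at which the PMS would generate a SEL entry and take a recovery action.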
Thread watchdog monitoring can be used on all processes that are instrumented with the PMS
thread watchdog API. It provides more functionality than process existence monitoring and can be
used in conjunction with process integrity monitoring to provide a comprehensive solution. Thread