Resource monitoring allows Novell Cluster Services to detect when an individual resource on a node has failed independently of its ability to detect node failures. Monitoring is disabled by default. It is enabled separately for each cluster resource.
When you enable resource monitoring, you must specify a polling interval, a failure rate, a failure action, and a timeout value. These settings control how error conditions are resolved for the resource.
The monitoring script runs at a frequency specified by the polling interval. By default, it runs every minute when the resource is online. You can specify the polling interval in minutes or seconds. The polling interval applies only to a given resource.
The failure rate is the maximum number of failures (Maximum Local Failures) detected by the monitoring script during a specified time interval. A failure action is initiated when the resource monitor detects that the resource has failed more times than the maximum number of local failures allowed during that time interval. For each failure up to and including the maximum, Cluster Services automatically attempts to unload and load the resource. The progress and output of the monitor script are appended to the /var/opt/novell/log/ncs/resource_name.monitor.out file.
For example, if you set the failure rate to 3 failures in 10 minutes, the failure action is initiated if the resource fails 4 times in a 10-minute period. For the first 3 failures, Cluster Services automatically attempts to unload and load the resource.
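The arithmetic in this example can be sketched as a small shell function. This is only an illustration: the function name `failure_response`, its arguments, and the assumption that failures are counted in a sliding window ending at the newest failure are all hypothetical, and this section does not describe how Cluster Services tracks the interval internally.

```shell
#!/bin/bash
# Hypothetical helper, not part of Novell Cluster Services.
# Arguments: max_local_failures, interval_seconds, now (epoch seconds),
# then the epoch timestamps of all recorded failures, newest included.
failure_response() {
    local max=$1 interval=$2 now=$3
    shift 3
    local count=0 ts
    for ts in "$@"; do
        # ASSUMPTION: only failures inside the window ending "now" count.
        if [ $(( now - ts )) -lt "$interval" ]; then
            count=$(( count + 1 ))
        fi
    done
    if [ "$count" -gt "$max" ]; then
        echo "failure_action"      # more failures than the maximum allowed
    else
        echo "restart_locally"     # unload, then load the resource
    fi
}

# Failure rate of 3 failures in 10 minutes (600 seconds).
failure_response 3 600 1000 500 700 1000       # third failure: restart_locally
failure_response 3 600 1000 500 700 900 1000   # fourth failure: failure_action
```

With these inputs, the third failure in the window still results in a local restart, and the fourth one triggers the failure action, which matches the example above.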
The failure action specifies what Cluster Services does when the action initiates: it sets the resource to a comatose state, migrates the resource to another server, or reboots the hosting node (without synchronizing or unmounting the disks). The reboot option is normally used only for a mission-critical cluster resource that must remain available.
If the failure action initiates and you chose the option to migrate the resource to another server, the resource migrates to the next server in its list of assigned nodes, which you previously ordered according to your preferences. The resource remains on the server it has migrated to unless you migrate it to another server or the failure action initiates again, in which case it again migrates to the next server in its list.
If the failure action initiates and you chose the option to reboot the hosting node without synchronizing or unmounting the disks, each of the resources on the hosting node fails over to the next server in its list of assigned nodes because of the reboot. This is a hard reboot, not a graceful one.
With resource monitoring, the Start, Failover, and Failback Modes have no effect on where the resource migrates. This means that a resource that has been migrated by the resource monitoring failure action does not migrate back (fail back) to the node it migrated from unless you manually migrate it back.
The timeout value determines how much time the script is given to complete. If the script does not complete within the specified time, the configured failure action is initiated. Cluster Services marks the process as failed immediately after the defined timeout expires, but it must wait for the process to conclude before it can start other resource operations.
The timeout value is applied only when the resource is migrated to another node. It is not used during resource online/offline procedures.
1. The monitoring script runs at the frequency you specify as the polling interval.
2. There are two conditions that trigger a response by Novell Cluster Services:
   - The script returns an error. Go to Step 3.
   - The script does not complete within the timeout value. Go to Step 4.
3. Novell Cluster Services tallies the error occurrence, compares it to the configured failure rate, then does one of the following:
   - Total errors in the interval are less than or equal to the Maximum Local Failures: Novell Cluster Services tries to resolve the error by offlining the resource, then onlining the resource.
     If this problem resolution effort fails, Novell Cluster Services goes to Step 4 immediately, regardless of the failure rate condition at that time.
   - Total errors in the interval are more than the Maximum Local Failures: Go to Step 4.
4. Novell Cluster Services initiates the configured failure action. Possible actions are:
   - Puts the resource in a comatose state
   - Migrates the resource to another server
   - Reboots the hosting node (without synchronizing or unmounting the disks)
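The decision flow described above can be modeled in a few lines of shell. This is a hedged sketch, not the product's implementation: the function `monitor_response`, its argument order, and the result strings are hypothetical names chosen for illustration. It assumes that a script timeout leads directly to the failure action, as stated for the timeout value earlier in this section.

```shell
#!/bin/bash
# Hypothetical model of the response flow; all names are illustrative.
# $1: trigger, "error" or "timeout"
# $2: total errors in the interval, including the current one
# $3: Maximum Local Failures
# $4: result of the local recovery attempt (offline, then online),
#     "ok" or "failed"; defaults to "ok"
monitor_response() {
    local trigger=$1 errors=$2 max=$3 recovery=${4:-ok}
    if [ "$trigger" = "timeout" ]; then
        echo "failure_action"        # script did not finish in time
        return
    fi
    if [ "$errors" -gt "$max" ]; then
        echo "failure_action"        # failure rate exceeded
        return
    fi
    if [ "$recovery" = "ok" ]; then
        echo "recovered_locally"     # offline, then online succeeded
    else
        echo "failure_action"        # local recovery failed
    fi
}

monitor_response error 3 3 ok        # prints recovered_locally
monitor_response error 2 3 failed    # prints failure_action
monitor_response timeout 1 3         # prints failure_action
```

The second call shows the case where the failure rate has not been exceeded but the local recovery attempt fails, so the failure action is initiated anyway.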
The resource monitoring function allows you to monitor the health of a specified resource by using a script that you create or customize. If you want Novell Cluster Services to check the health status of a resource, you must enable and configure resource monitoring for that resource. Enabling resource monitoring requires you to specify a polling interval, a failure rate, a failure action, and a timeout value.
If you are creating a new cluster resource, the Monitor Script page should already be displayed. You can start with Step 5.
1. In iManager, open the cluster management page.
2. Browse to locate and select the Cluster object of the cluster you want to manage.
3. Select the check box next to the resource that you want to configure monitoring for, then open the properties for that resource.
4. Open the resource monitoring settings.
5. Select the check box that enables resource monitoring for the selected resource.
   Resource monitoring is disabled by default.
6. For the polling interval, specify how often you want the resource monitoring script for this resource to run.
   You can specify the value in minutes or seconds.
7. Specify the number of failures (Maximum Local Failures) allowed during the specified time interval.
   For information, see Failure Rate.
8. Specify the failure action by indicating whether you want the resource to be set to a comatose state, to migrate to another server, or to reboot the hosting node (without synchronizing or unmounting the disks) if a failure action initiates. The reboot option is normally used only for a mission-critical cluster resource that must remain available.
   For information, see Failure Action.
9. Open the Monitor Script page.
10. Edit or add the necessary commands to the script to monitor the resource on the server.
    The resource templates included with Novell Cluster Services for Linux include resource monitoring scripts that you can customize.
    You can use the same commands that would be used at the Linux terminal console. For example, see Section 10.6.4, Monitoring Services That Are Critical to Clustering.
11. Specify the timeout value, then save the script.
    The timeout value determines how much time the script is given to complete. If the script does not complete within the specified time, the failure action you chose in Step 8 initiates.
12. Do one of the following:
    - If you are configuring a new resource, continue with Section 10.7.2, Setting the Start, Failover, and Failback Modes for a Resource.
    - Otherwise, save your changes.
      Changes for a resource’s properties are not applied while the resource is loaded or running on a server. You must offline, then online the resource to activate the changes for the resource.
The resource templates included with Novell Cluster Services for Linux include resource monitoring scripts that you can customize.
Example monitor scripts are available in the following sections:
Monitoring scripts can also be used for monitoring critical services needed by the resources, such as Linux User Management (namcd) and Novell eDirectory (ndsd). However, the monitoring is in effect only where the cluster resource is running.
IMPORTANT: The monitor script runs only on the cluster server where the cluster resource is currently online. The script does not monitor the critical services on its assigned cluster server when the resource is offline. The monitor script does not monitor critical services for any other cluster node.
For example, to monitor whether the namcd and ndsd services are running, add the following commands to the Monitor script:
    # (optional) status of the eDirectory service
    exit_on_error rcndsd status
    # (optional) status of the Linux User Management service
    exit_on_error rcnamcd status
You can use the namcd status command instead of rcnamcd status in the Monitor script if you want to automatically restart namcd if it is not loaded and running. However, namcd creates messages in /var/log/messages with each check.
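To see how a chain of exit_on_error checks behaves, the following sketch can be run on any Linux system. It is an illustration under stated assumptions: the `exit_on_error` function defined here is a local stand-in that only imitates the idea of the helper (run a command and stop the script with a nonzero status if it fails), and `run_monitor` is a hypothetical wrapper added for this example. Neither is the code shipped with Novell Cluster Services.

```shell
#!/bin/bash
# ASSUMPTION: local stand-in for the exit_on_error helper, written only
# for this illustration; the real helper is supplied by the product.
exit_on_error() {
    "$@"
    local rc=$?
    if [ "$rc" -ne 0 ]; then
        echo "monitor check failed (status $rc): $*" >&2
        exit "$rc"
    fi
}

# Hypothetical wrapper: runs each check in order inside a subshell, so a
# failing check ends the monitor run without closing the calling shell.
run_monitor() (
    for check in "$@"; do
        # Unquoted on purpose so "rcndsd status" splits into command + argument.
        exit_on_error $check
    done
    exit 0
)

# On a cluster node the checks would be "rcndsd status" and "rcnamcd status".
# "true" is a placeholder so that the sketch runs anywhere.
run_monitor "true" "true"
echo "monitor exit status: $?"
```

The first failing check stops the run and its status becomes the status of the whole monitor run, which is what allows a nonzero result to be counted as a failure of the resource.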