[Pacemaker] stonith monitor timeout not restartable
Ron Kerry
rkerry at sgi.com
Fri Nov 19 19:19:02 UTC 2010
I have a customer running pacemaker/openais from the SLE11-HAE distribution. On occasion we have a
stonith clone instance timeout. It becomes unrestartable and it is not generally recoverable unless
we completely stop openais on both nodes and restart. I have an example from the logs below. Is this
a known issue?
>> This is the stonith operation that has the problem
Nov 15 23:08:35 dmf2 lrmd: [83870]: debug: rsc:stonith-sgi-ipmi:1:10: monitor
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: perform_ra_op: resetting scheduler class to SCHED_OTHER
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: stonithd_signon: creating connection
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: sending out the signon msg.
Nov 15 23:08:35 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_98416 (pid=98416) succeeded to
signon to stonithd.
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: signed on to stonithd.
Nov 15 23:08:35 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_98416 [pid: 98416] requests a
resource operation monitor on stonith-sgi-ipmi:1 (external/sgi-ipmi)
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: waiting for the stonithRA reply msg.
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: a stonith RA operation queue to run, call_id=98417.
Nov 15 23:08:35 dmf2 lrmd: [98416]: debug: stonithd_receive_ops_result: begin
Nov 15 23:08:35 dmf2 stonithd: [98417]: debug: external_status: called.
Nov 15 23:08:35 dmf2 stonithd: [98417]: debug: external_run_cmd: Calling
'/usr/lib64/stonith/plugins/external/sgi-ipmi status'
>> the stonith operation times out!!
Nov 15 23:08:50 dmf2 stonithd: [83868]: WARN: external_stonith-sgi-ipmi:1_monitor process (PID
98417) timed out (try 1). Killing with signal SIGTERM (15).
Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: external_stonith-sgi-ipmi:1_monitor process (PID
98417) timed out (try 2). Killing with signal SIGKILL (9).
Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: Managed external_stonith-sgi-ipmi:1_monitor process
98417 killed by signal 9 [SIGKILL - Kill, unblockable].
Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: Child process external_stonith-sgi-ipmi:1_monitor
[98417] exited, its exit code: 0 when signo=9.
Nov 15 23:08:55 dmf2 stonithd: [83868]: WARN: A STONITH operation timed out.
Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: stonith-sgi-ipmi:1's (external/sgi-ipmi) op monitor
finished. op_result=2
Nov 15 23:08:55 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_98416 (pid=98416) signed off
Nov 15 23:08:55 dmf2 crmd: [83873]: info: process_lrm_event: LRM operation
stonith-sgi-ipmi:1_monitor_15000 (call=10, rc=2, cib-update=118, confirmed=false) invalid parameter
>> a timeout is interpreted as an invalid parameter
Nov 15 23:08:55 dmf2 crmd: [83873]: debug: te_update_diff: Processing diff (cib_modify): 0.453.14 ->
0.453.15 (S_IDLE)
Nov 15 23:08:55 dmf2 crmd: [83873]: debug: get_xpath_object: No match for
//cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
Nov 15 23:08:55 dmf2 crmd: [83873]: info: process_graph_event: Action
stonith-sgi-ipmi:1_monitor_15000/32 (0:2;32:7:0:0600ce86-cbe8-475c-add7-32ee6b812e21) initiated by a
different transitioner
Nov 15 23:08:55 dmf2 crmd: [83873]: info: abort_transition_graph: process_graph_event:456 -
Triggered transition abort (complete=1, tag=lrm_rsc_op, id=stonith-sgi-ipmi:1_monitor_15000,
magic=0:2;32:7:0:0600ce86-cbe8-475c-add7-32ee6b812e21, cib=0.453.15) : Foreign event
>> updating the failcount and last failure time
Nov 15 23:09:09 dmf2 crmd: [83873]: WARN: update_failcount: Updating failcount for
stonith-sgi-ipmi:1 on dmf2 after failed monitor: rc=2 (update=value++, time=1289880535)
Nov 15 23:09:09 dmf2 attrd: [83871]: debug: attrd_local_callback: update message from crmd:
fail-count-stonith-sgi-ipmi:1=value++
Nov 15 23:09:09 dmf2 crmd: [83873]: debug: attrd_update: Sent update:
fail-count-stonith-sgi-ipmi:1=value++ for dmf2
Nov 15 23:09:09 dmf2 attrd: [83871]: info: find_hash_entry: Creating hash entry for
fail-count-stonith-sgi-ipmi:1
Nov 15 23:09:09 dmf2 crmd: [83873]: debug: attrd_update: Sent update:
last-failure-stonith-sgi-ipmi:1=1289880535 for dmf2
Nov 15 23:09:09 dmf2 attrd: [83871]: debug: attrd_local_callback: Supplied: value++, Current:
(null), Stored: (null)
...
>> the timeout is considered a hard error and it will not allow a restart!
Nov 15 23:09:15 dmf2 pengine: [83872]: debug: unpack_rsc_op: stonith-sgi-ipmi:1_monitor_15000 on
dmf2 returned 2 (invalid parameter) instead of the expected value: 0 (ok)
Nov 15 23:09:15 dmf2 pengine: [83872]: ERROR: unpack_rsc_op: Hard error -
stonith-sgi-ipmi:1_monitor_15000 failed with rc=2: Preventing stonith-sgi-ipmi-set from re-starting
on dmf2
Nov 15 23:09:37 dmf2 pengine: [83872]: debug: native_assign_node: All nodes for resource
stonith-sgi-ipmi:1 are unavailable, unclean or shutting down (dmf1: 1, -1000000)
Nov 15 23:09:37 dmf2 pengine: [83872]: WARN: native_color: Resource stonith-sgi-ipmi:1 cannot run
anywhere
Nov 15 23:09:37 dmf2 pengine: [83872]: debug: clone_color: Allocated 1 stonith-sgi-ipmi-set
instances of a possible 2
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource CXFS (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource Public-IP (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource PDMO-IP (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource Openvault (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource DMF (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource NFS (Started dmf2)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Leave resource stonith-sgi-ipmi:0
(Started dmf1)
Nov 15 23:09:37 dmf2 pengine: [83872]: notice: LogActions: Stop resource stonith-sgi-ipmi:1 (dmf2)
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: unpack_rsc_op: stonith-sgi-ipmi:1_monitor_15000 on
dmf2 returned 2 (invalid parameter) instead of the expected value: 0 (ok)
Nov 15 23:09:58 dmf2 pengine: [83872]: ERROR: unpack_rsc_op: Hard error -
stonith-sgi-ipmi:1_monitor_15000 failed with rc=2: Preventing stonith-sgi-ipmi-set from re-starting
on dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: WARN: unpack_rsc_op: Processing failed op
stonith-sgi-ipmi:1_monitor_15000 on dmf2: invalid parameter (2)
Nov 15 23:09:58 dmf2 pengine: [83872]: WARN: unpack_rsc_op: Processing failed op CXFS_monitor_0 on
dmf2: unknown exec error (-2)
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: group_print: Resource Group: dmfGroup
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: CXFS (ocf::sgi:cxfs):
Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: Public-IP
(ocf::heartbeat:IPaddr2): Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: PDMO-IP
(ocf::heartbeat:IPaddr2): Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: Openvault
(ocf::sgi:openvault): Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: DMF (ocf::sgi:dmf): Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: NFS (ocf::heartbeat:nfsserver):
Started dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: clone_print: Clone Set: stonith-sgi-ipmi-set
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource stonith-sgi-ipmi:0 active on dmf1
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource stonith-sgi-ipmi:0 active on dmf1
Nov 15 23:09:58 dmf2 pengine: [83872]: debug: native_active: Resource stonith-sgi-ipmi:1 active on dmf2
Nov 15 23:09:58 dmf2 pengine: [83872]: notice: native_print: stonith-sgi-ipmi:1
(stonith:external/sgi-ipmi): Started dmf2 FAILED
Nov 15 23:10:08 dmf2 pengine: [83872]: notice: short_print: Started: [ dmf1 ]
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource stonith-sgi-ipmi:0:
preferring current location (node=dmf1, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource CXFS: preferring
current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource Public-IP:
preferring current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource PDMO-IP: preferring
current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource Openvault:
preferring current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource DMF: preferring
current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource NFS: preferring
current location (node=dmf2, weight=1)
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: common_apply_stickiness: Resource stonith-sgi-ipmi:1:
preferring current location (node=dmf2, weight=1)
>> This implies the resource can still restart, but the operation is cancelled anyway
Nov 15 23:10:08 dmf2 pengine: [83872]: info: get_failcount: stonith-sgi-ipmi-set has failed 1 times
on dmf2
Nov 15 23:10:08 dmf2 pengine: [83872]: notice: common_apply_stickiness: stonith-sgi-ipmi-set can
fail 999999 more times on dmf2 before being forced off
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: native_assign_node: All nodes for resource
stonith-sgi-ipmi:1 are unavailable, unclean or shutting down (dmf1: 1, -1000000)
Nov 15 23:10:08 dmf2 lrmd: [83870]: info: cancel_op: operation monitor[10] on
stonith::external/sgi-ipmi::stonith-sgi-ipmi:1 for client 83873, its parameters:
CRM_meta_interval=[15000] nodelist=[dmf1;admin;admin;supermicro;10.7.8.201
dmf2;admin;admin;supermicro;10CRM_meta_on_fail=[restart] CRM_meta_timeout=[15000]
CRM_meta_clone_max=[2] crm_feature_set=[3.0.1] CRM_meta_globally_unique=[false]
CRM_meta_name=[monitor] CRM_meta_clone=[1] cancelled
Nov 15 23:10:08 dmf2 crmd: [83873]: debug: do_fsa_action: actions:trace: // A_FINALIZE_TIMER_STOP
Nov 15 23:10:08 dmf2 pengine: [83872]: WARN: native_color: Resource stonith-sgi-ipmi:1 cannot run
anywhere
Nov 15 23:10:08 dmf2 lrmd: [83870]: debug: on_msg_cancel_op: operation 10 cancelled
Nov 15 23:10:08 dmf2 crmd: [83873]: debug: do_fsa_action: actions:trace: // A_TE_INVOKE
Nov 15 23:10:08 dmf2 pengine: [83872]: debug: clone_color: Allocated 1 stonith-sgi-ipmi-set
instances of a possible 2
Nov 15 23:10:08 dmf2 lrmd: [83870]: debug: on_msg_perform_op: add an operation operation stop[26] on
stonith::external/sgi-ipmi::stonith-sgi-ipmi:1 for client 83873, its parameters:
CRM_meta_timeout=[20000] CRM_meta_clone_max=[2] crm_feature_set=[3.0.1]
CRM_meta_globally_unique=[false] CRM_meta_clone=[1] to the operation list.
Nov 15 23:10:09 dmf2 crmd: [83873]: info: te_rsc_command: Initiating action 3: stop
stonith-sgi-ipmi:1_stop_0 on dmf2 (local)
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: sending out the signon msg.
Nov 15 23:10:09 dmf2 crmd: [83873]: debug: cancel_op: Cancelling op 10 for stonith-sgi-ipmi:1
(stonith-sgi-ipmi:1:10)
Nov 15 23:10:09 dmf2 crmd: [83873]: debug: cancel_op: Op 10 for stonith-sgi-ipmi:1
(stonith-sgi-ipmi:1:10): cancelled
Nov 15 23:10:09 dmf2 pengine: [83872]: notice: LogActions: Stop resource stonith-sgi-ipmi:1 (dmf2)
Nov 15 23:10:09 dmf2 crmd: [83873]: info: do_lrm_rsc_op: Performing
key=3:35:0:815c525b-e8f1-4bb4-a5eb-2229d1980cc1 op=stonith-sgi-ipmi:1_stop_0 )
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_102943 (pid=102943) succeeded
to signon to stonithd.
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: signed on to stonithd.
Nov 15 23:10:09 dmf2 lrmd: [102943]: info: Try to stop STONITH resource <rsc_id=stonith-sgi-ipmi:1>
: Device=external/sgi-ipmi
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: waiting for the stonithRA reply msg.
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_102943 [pid: 102943] requests
a resource operation stop on stonith-sgi-ipmi:1 (external/sgi-ipmi)
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: external_destroy: called.
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: a stonith RA operation queue to run, call_id=102944.
Nov 15 23:10:09 dmf2 lrmd: [102943]: debug: stonithd_receive_ops_result: begin
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: Child process external_stonith-sgi-ipmi:1_stop
[102944] exited, its exit code: 0 when signo=0.
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: stonith-sgi-ipmi:1's (external/sgi-ipmi) op stop
finished. op_result=0
Nov 15 23:10:09 dmf2 stonithd: [83868]: debug: client STONITH_RA_EXEC_102943 (pid=102943) signed off
Nov 15 23:10:09 dmf2 lrmd: [83870]: info: Managed stonith-sgi-ipmi:1:stop process 102943 exited with
return code 0.
Nov 15 23:10:09 dmf2 crmd: [83873]: info: process_lrm_event: LRM operation stonith-sgi-ipmi:1_stop_0
(call=26, rc=0, cib-update=122, confirmed=true) ok
Nov 15 23:10:09 dmf2 crmd: [83873]: info: match_graph_event: Action stonith-sgi-ipmi:1_stop_0 (3)
confirmed on dmf2 (rc=0)
--
Ron Kerry rkerry at sgi.com
More information about the Pacemaker
mailing list