[Pacemaker] On recovery of failed node, pengine fails to correctly monitor 'dirty' resources
David Vossel
dvossel at redhat.com
Tue Aug 12 23:54:56 CEST 2014
----- Original Message -----
> This seems to happen with every resource agent we've been using -- including
> ClusterMon, anything, and others. These resource agents also aren't being
> called when these errors appear, even though the logs in /var/log/messages
> make it look as if they were. The operations shown vary -- always a monitor,
> but sometimes a stop and start as well. In either case, the resource agent
> never actually receives those calls.
>
> Below is a log from this happening. I'll annotate it to try to provide some
> context.
>
> Aug 12 11:22:10 ti14-demo1 kernel: DLM (built Dec 9 2013 18:27:30) installed
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Corosync Cluster Engine
> ('1.4.1'): started and ready to provide service.
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Corosync built-in
> features: nss dbus rdma snmp
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Successfully read
> config from /etc/cluster/cluster.conf
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Successfully parsed
> cman config
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] Initializing transport
> (UDP/IP Unicast).
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] Initializing
> transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] The network interface
> [10.28.230.201] is now up.
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [QUORUM] Using quorum provider
> quorum_cman
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded:
> corosync cluster quorum service v0.1
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [CMAN ] CMAN 3.0.12.1 (built
> Apr 14 2014 09:36:10) started
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded:
> corosync CMAN membership service 2.90
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded:
> openais checkpoint service B.01.01
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded:
> corosync extended virtual synchrony service
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded:
> corosync configuration service
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded:
> corosync cluster closed process group service v1.01
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded:
> corosync cluster config database access v1.01
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded:
> corosync profile loading service
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [QUORUM] Using quorum provider
> quorum_cman
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [SERV ] Service engine loaded:
> corosync cluster quorum service v0.1
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Compatibility mode set
> to whitetank. Using V1 and V2 of the synchronization engine.
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] adding new UDPU member
> {10.28.230.201}
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] adding new UDPU member
> {10.28.230.203}
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [QUORUM] Members[1]: 1
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [QUORUM] Members[1]: 1
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [CPG ] chosen downlist: sender
> r(0) ip(10.28.230.201) ; members(old:0 left:0)
> Aug 12 11:22:14 ti14-demo1 corosync[2925]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Aug 12 11:22:18 ti14-demo1 fenced[2978]: fenced 3.0.12.1 started
> Aug 12 11:22:18 ti14-demo1 dlm_controld[2997]: dlm_controld 3.0.12.1 started
> Aug 12 11:22:19 ti14-demo1 corosync[2925]: [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Aug 12 11:22:19 ti14-demo1 corosync[2925]: [CMAN ] quorum regained,
> resuming activity
> Aug 12 11:22:19 ti14-demo1 corosync[2925]: [QUORUM] This node is within the
> primary component and will provide service.
> Aug 12 11:22:19 ti14-demo1 corosync[2925]: [QUORUM] Members[2]: 1 2
> Aug 12 11:22:19 ti14-demo1 corosync[2925]: [QUORUM] Members[2]: 1 2
> Aug 12 11:22:19 ti14-demo1 corosync[2925]: [CPG ] chosen downlist: sender
> r(0) ip(10.28.230.201) ; members(old:1 left:0)
> Aug 12 11:22:19 ti14-demo1 corosync[2925]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Aug 12 11:22:19 ti14-demo1 gfs_controld[3053]: gfs_controld 3.0.12.1 started
> Aug 12 11:22:20 ti14-demo1 pacemaker: Starting Pacemaker Cluster Manager
> Aug 12 11:22:20 ti14-demo1 pacemakerd[3136]: notice: crm_add_logfile:
> Additional logging available in /var/log/cluster/corosync.log
> Aug 12 11:22:20 ti14-demo1 pacemakerd[3136]: notice: main: Starting
> Pacemaker 1.1.10-14.el6_5.3 (Build: 368c726): generated-manpages
> agent-manpages ascii-docs publican-docs ncurses libqb-logging libqb-ipc
> nagios corosync-plugin cman
> Aug 12 11:22:21 ti14-demo1 stonith-ng[3143]: notice: crm_add_logfile:
> Additional logging available in /var/log/cluster/corosync.log
> Aug 12 11:22:21 ti14-demo1 lrmd[3144]: notice: crm_add_logfile: Additional
> logging available in /var/log/cluster/corosync.log
> Aug 12 11:22:21 ti14-demo1 stonith-ng[3143]: notice: crm_cluster_connect:
> Connecting to cluster infrastructure: cman
> Aug 12 11:22:21 ti14-demo1 attrd[3145]: notice: crm_add_logfile: Additional
> logging available in /var/log/cluster/corosync.log
> Aug 12 11:22:21 ti14-demo1 attrd[3145]: notice: crm_cluster_connect:
> Connecting to cluster infrastructure: cman
> Aug 12 11:22:21 ti14-demo1 cib[3142]: notice: crm_add_logfile: Additional
> logging available in /var/log/cluster/corosync.log
> Aug 12 11:22:21 ti14-demo1 pengine[3146]: notice: crm_add_logfile:
> Additional logging available in /var/log/cluster/corosync.log
> Aug 12 11:22:21 ti14-demo1 crmd[3147]: notice: crm_add_logfile: Additional
> logging available in /var/log/cluster/corosync.log
> Aug 12 11:22:21 ti14-demo1 crmd[3147]: notice: main: CRM Git Version:
> 368c726
> Aug 12 11:22:21 ti14-demo1 attrd[3145]: notice: main: Starting mainloop...
> Aug 12 11:22:21 ti14-demo1 cib[3142]: notice: crm_cluster_connect:
> Connecting to cluster infrastructure: cman
> Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: crm_cluster_connect:
> Connecting to cluster infrastructure: cman
> Aug 12 11:22:22 ti14-demo1 stonith-ng[3143]: notice: setup_cib: Watching
> for stonith topology changes
> Aug 12 11:22:22 ti14-demo1 stonith-ng[3143]: notice: unpack_config: On loss
> of CCM Quorum: Ignore
> Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: cman_event_callback:
> Membership 28: quorum acquired
> Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: crm_update_peer_state:
> cman_event_callback: Node ti14-demo1[1] - state is now member (was (null))
> Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: crm_update_peer_state:
> cman_event_callback: Node ti14-demo2[2] - state is now member (was (null))
> Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: do_started: The local CRM is
> operational
> Aug 12 11:22:22 ti14-demo1 crmd[3147]: notice: do_state_transition: State
> transition S_STARTING -> S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL
> origin=do_started ]
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: warning: do_log: FSA: Input
> I_DC_TIMEOUT from crm_timer_popped() received in state S_PENDING
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: do_state_transition: State
> transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC
> cause=C_FSA_INTERNAL origin=do_election_check ]
> Aug 12 11:22:43 ti14-demo1 attrd[3145]: notice: attrd_local_callback:
> Sending full refresh (origin=crmd)
> Aug 12 11:22:43 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Aug 12 11:22:43 ti14-demo1 pengine[3146]: notice: LogActions: Start
> cmon#011(ti14-demo1)
> Aug 12 11:22:43 ti14-demo1 pengine[3146]: notice: process_pe_message:
> Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-10.bz2
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating
> action 6: monitor cmon_monitor_0 on ti14-demo2
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating
> action 4: monitor cmon_monitor_0 on ti14-demo1 (local)
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM
> operation cmon_monitor_0 (call=5, rc=7, cib-update=26, confirmed=true) not
> running
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating
> action 3: probe_complete probe_complete on ti14-demo1 (local) - no waiting
> Aug 12 11:22:43 ti14-demo1 attrd[3145]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: probe_complete (true)
> Aug 12 11:22:43 ti14-demo1 attrd[3145]: notice: attrd_perform_update: Sent
> update 4: probe_complete=true
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating
> action 5: probe_complete probe_complete on ti14-demo2 - no waiting
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating
> action 7: start cmon_start_0 on ti14-demo1 (local)
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM
> operation cmon_start_0 (call=8, rc=0, cib-update=27, confirmed=true) ok
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating
> action 8: monitor cmon_monitor_60000 on ti14-demo1 (local)
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM
> operation cmon_monitor_60000 (call=11, rc=0, cib-update=28, confirmed=false)
> ok
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: run_graph: Transition 0
> (Complete=7, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-10.bz2): Complete
> Aug 12 11:22:43 ti14-demo1 crmd[3147]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
>
>
> Killing the cluster-mon process on ti14-demo1, resulting in a monitor
> failure (as expected)
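>
> (The "kill: ... No such process" stderr lines below come from the ClusterMon
> agent signal-checking the crm_mon PID it saved at start time. Paraphrased --
> this is a sketch only, and the exact code varies by resource-agents
> version -- its monitor does roughly this:
>
>     # sketch of ClusterMon_monitor, paraphrased from
>     # /usr/lib/ocf/resource.d/heartbeat/ClusterMon
>     if [ -f "$OCF_RESKEY_pidfile" ]; then
>         pid=$(cat "$OCF_RESKEY_pidfile")
>         # signal 0 doesn't kill; it only tests whether the process exists
>         if [ -n "$pid" ] && kill -s 0 "$pid" 2>/dev/null; then
>             return $OCF_SUCCESS
>         fi
>     fi
>     return $OCF_NOT_RUNNING
>
> so once the crm_mon process is killed, both the monitor and the subsequent
> stop print that message.)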
>
>
>
> Aug 12 11:24:44 ti14-demo1 lrmd[3144]: notice: operation_finished:
> cmon_monitor_60000:3206:stderr [
> /usr/lib/ocf/resource.d/heartbeat/ClusterMon: line 155: kill: (3175) - No
> such process ]
> Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM
> operation cmon_monitor_60000 (call=11, rc=7, cib-update=29, confirmed=false)
> not running
> Aug 12 11:24:44 ti14-demo1 crmd[3147]: warning: update_failcount: Updating
> failcount for cmon on ti14-demo1 after failed monitor: rc=7 (update=value++,
> time=1407857084)
> Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=abort_transition_graph ]
> Aug 12 11:24:44 ti14-demo1 attrd[3145]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-cmon (1)
> Aug 12 11:24:44 ti14-demo1 attrd[3145]: notice: attrd_perform_update: Sent
> update 9: fail-count-cmon=1
> Aug 12 11:24:44 ti14-demo1 attrd[3145]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-cmon (1407857084)
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Aug 12 11:24:44 ti14-demo1 attrd[3145]: notice: attrd_perform_update: Sent
> update 12: last-failure-cmon=1407857084
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: warning: unpack_rsc_op: Processing
> failed op monitor for cmon on ti14-demo1: not running (7)
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: LogActions: Recover
> cmon#011(Started ti14-demo1)
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: process_pe_message:
> Calculated Transition 1: /var/lib/pacemaker/pengine/pe-input-11.bz2
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: warning: unpack_rsc_op: Processing
> failed op monitor for cmon on ti14-demo1: not running (7)
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: LogActions: Recover
> cmon#011(Started ti14-demo1)
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: process_pe_message:
> Calculated Transition 2: /var/lib/pacemaker/pengine/pe-input-12.bz2
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: warning: unpack_rsc_op: Processing
> failed op monitor for cmon on ti14-demo1: not running (7)
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: LogActions: Recover
> cmon#011(Started ti14-demo1)
> Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating
> action 2: stop cmon_stop_0 on ti14-demo1 (local)
> Aug 12 11:24:44 ti14-demo1 pengine[3146]: notice: process_pe_message:
> Calculated Transition 3: /var/lib/pacemaker/pengine/pe-input-13.bz2
> Aug 12 11:24:44 ti14-demo1 lrmd[3144]: notice: operation_finished:
> cmon_stop_0:3211:stderr [ /usr/lib/ocf/resource.d/heartbeat/ClusterMon: line
> 145: kill: (3175) - No such process ]
> Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM
> operation cmon_stop_0 (call=16, rc=0, cib-update=33, confirmed=true) ok
> Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating
> action 7: start cmon_start_0 on ti14-demo1 (local)
> Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM
> operation cmon_start_0 (call=20, rc=0, cib-update=34, confirmed=true) ok
> Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating
> action 1: monitor cmon_monitor_60000 on ti14-demo1 (local)
> Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: process_lrm_event: LRM
> operation cmon_monitor_60000 (call=23, rc=0, cib-update=35, confirmed=false)
> ok
> Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: run_graph: Transition 3
> (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-13.bz2): Complete
> Aug 12 11:24:44 ti14-demo1 crmd[3147]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
>
>
> Rebooting ti14-demo2
>
>
>
> Aug 12 11:25:25 ti14-demo1 corosync[2925]: [TOTEM ] A processor failed,
> forming new configuration.
> Aug 12 11:25:27 ti14-demo1 corosync[2925]: [CMAN ] quorum lost, blocking
> activity
> Aug 12 11:25:27 ti14-demo1 corosync[2925]: [QUORUM] This node is within the
> non-primary component and will NOT provide any services.
> Aug 12 11:25:27 ti14-demo1 corosync[2925]: [QUORUM] Members[1]: 1
> Aug 12 11:25:27 ti14-demo1 corosync[2925]: [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: cman_event_callback:
> Membership 32: quorum lost
> Aug 12 11:25:27 ti14-demo1 kernel: dlm: closing connection to node 2
> Aug 12 11:25:27 ti14-demo1 corosync[2925]: [CPG ] chosen downlist: sender
> r(0) ip(10.28.230.201) ; members(old:2 left:1)
> Aug 12 11:25:27 ti14-demo1 corosync[2925]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: crm_update_peer_state:
> cman_event_callback: Node ti14-demo2[2] - state is now lost (was member)
> Aug 12 11:25:27 ti14-demo1 crmd[3147]: warning: match_down_event: No match
> for shutdown action on ti14-demo2
> Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: peer_update_callback:
> Stonith/shutdown of ti14-demo2 not matched
> Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: do_state_transition: State
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=check_join_state ]
> Aug 12 11:25:27 ti14-demo1 attrd[3145]: notice: attrd_local_callback:
> Sending full refresh (origin=crmd)
> Aug 12 11:25:27 ti14-demo1 attrd[3145]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-cmon (1407857084)
> Aug 12 11:25:27 ti14-demo1 crmd[3147]: warning: match_down_event: No match
> for shutdown action on ti14-demo2
> Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: peer_update_callback:
> Stonith/shutdown of ti14-demo2 not matched
> Aug 12 11:25:27 ti14-demo1 attrd[3145]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-cmon (1)
> Aug 12 11:25:27 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Aug 12 11:25:27 ti14-demo1 pengine[3146]: warning: unpack_rsc_op: Processing
> failed op monitor for cmon on ti14-demo1: not running (7)
> Aug 12 11:25:27 ti14-demo1 attrd[3145]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: probe_complete (true)
> Aug 12 11:25:27 ti14-demo1 pengine[3146]: notice: process_pe_message:
> Calculated Transition 4: /var/lib/pacemaker/pengine/pe-input-14.bz2
> Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: run_graph: Transition 4
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-14.bz2): Complete
> Aug 12 11:25:27 ti14-demo1 crmd[3147]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Aug 12 11:28:03 ti14-demo1 corosync[2925]: [TOTEM ] A processor joined or
> left the membership and a new membership was formed.
> Aug 12 11:28:03 ti14-demo1 corosync[2925]: [CMAN ] quorum regained,
> resuming activity
> Aug 12 11:28:03 ti14-demo1 corosync[2925]: [QUORUM] This node is within the
> primary component and will provide service.
> Aug 12 11:28:03 ti14-demo1 corosync[2925]: [QUORUM] Members[2]: 1 2
> Aug 12 11:28:03 ti14-demo1 corosync[2925]: [QUORUM] Members[2]: 1 2
> Aug 12 11:28:03 ti14-demo1 crmd[3147]: notice: cman_event_callback:
> Membership 36: quorum acquired
> Aug 12 11:28:03 ti14-demo1 crmd[3147]: notice: crm_update_peer_state:
> cman_event_callback: Node ti14-demo2[2] - state is now member (was lost)
> Aug 12 11:28:03 ti14-demo1 corosync[2925]: [CPG ] chosen downlist: sender
> r(0) ip(10.28.230.201) ; members(old:1 left:0)
> Aug 12 11:28:03 ti14-demo1 corosync[2925]: [MAIN ] Completed service
> synchronization, ready to provide service.
> Aug 12 11:28:11 ti14-demo1 crmd[3147]: notice: do_state_transition: State
> transition S_IDLE -> S_INTEGRATION [ input=I_NODE_JOIN cause=C_FSA_INTERNAL
> origin=peer_update_callback ]
> Aug 12 11:28:13 ti14-demo1 attrd[3145]: notice: attrd_local_callback:
> Sending full refresh (origin=crmd)
> Aug 12 11:28:13 ti14-demo1 attrd[3145]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-cmon (1407857084)
> Aug 12 11:28:13 ti14-demo1 attrd[3145]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-cmon (1)
> Aug 12 11:28:13 ti14-demo1 attrd[3145]: notice: attrd_trigger_update:
> Sending flush op to all hosts for: probe_complete (true)
> Aug 12 11:28:14 ti14-demo1 pengine[3146]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Aug 12 11:28:14 ti14-demo1 pengine[3146]: warning: unpack_rsc_op: Processing
> failed op monitor for cmon on ti14-demo1: not running (7)
>
>
> The monitor "fails" for cluster_mon on ti14-demo1 -- but it was never
> actually called, and it never failed in any way. cluster_mon has been
> running fine since being restarted at the very beginning.
cmon did fail earlier. This log message is the pengine re-processing that old
failure, not pacemaker reporting some new failure that just occurred.

If you don't want to see this message anymore, clear the failcount:
crm_resource -C -r cmon
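
To see what is stored, something like the following works (a sketch; exact
options vary slightly between pacemaker releases, see the man pages):

    # show resource fail counts alongside the cluster status
    crm_mon -1f

    # inspect the raw fail-count-* attributes in the CIB status section
    cibadmin -Q -o status | grep fail-count

After the cleanup, the pengine stops replaying the old failed op on every
transition.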
-- Vossel
> Aug 12 11:28:14 ti14-demo1 pengine[3146]: notice: process_pe_message:
> Calculated Transition 5: /var/lib/pacemaker/pengine/pe-input-15.bz2
> Aug 12 11:28:14 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating
> action 6: monitor cmon_monitor_0 on ti14-demo2
> Aug 12 11:28:14 ti14-demo1 crmd[3147]: notice: te_rsc_command: Initiating
> action 5: probe_complete probe_complete on ti14-demo2 - no waiting
> Aug 12 11:28:14 ti14-demo1 crmd[3147]: notice: run_graph: Transition 5
> (Complete=3, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-15.bz2): Complete
> Aug 12 11:28:14 ti14-demo1 crmd[3147]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
>
>
>
> On Monday, August 11, 2014 4:06 PM, David Vossel <dvossel at redhat.com> wrote:
>
>
>
>
>
> ----- Original Message -----
> > Greetings,
> >
> > We are using pacemaker and cman in a two-node cluster with
> > no-quorum-policy: ignore and stonith-enabled: false on a CentOS 6 system
> > (pacemaker-related RPM versions are listed below). We are seeing some
> > bizarre (to us) behavior when a node is fully lost (e.g. reboot -nf).
> > Here's the scenario:
> >
> > 1) Fail a resource named "some-resource", started with the
> > ocf:heartbeat:anything script (or others), on node01 (in our case it's a
> > master/slave resource we're pulling these observations from, but it can
> > happen on normal resources too).
> > 2) Wait for the resource to recover.
> > 3) Fail node02 (reboot -nf, or power loss)
> > 4) When node02 recovers, we see in /var/log/messages:
> > - Quorum is recovered
> > - Sending flush op to all hosts for master-some-resource,
> > last-failure-some-resource, probe_complete(true),
> > fail-count-some-resource(1)
> > - pengine Processing failed op monitor for some-resource on node01: unknown
> > error (1)
> > * After adding a simple tracing line that echoes "`date` called with $@"
> > to /tmp/log.rsc on every invocation (see the sketch after this list), we
> > do not see the resource agent being called at this time, on either node.
> > * Sometimes we see other operations that are likewise never sent to the
> > RA, including stop/start.
> > * The resource is actually happily running on node01 throughout this whole
> > process, so there's no reason we should be seeing this failure here.
> > * This issue is only seen on resources that had not yet been cleaned up.
> > Resources that were 'clean' when both nodes were last online do not have
> > this issue.
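> >
> > For reference, the tracing hook mentioned above is essentially the
> > following, added near the top of the resource agent (a minimal sketch;
> > /tmp/log.rsc is just our scratch log):
> >
> >     # log every invocation of this agent, before any action is dispatched
> >     echo "`date` called with $@" >> /tmp/log.rsc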
> >
> > We noticed this originally because we are using the ClusterMon RA to
> > report on different types of errors, and this is giving us false
> > positives. Any thoughts on configuration issues we might have, or whether
> > this sounds like a bug in pacemaker somewhere?
>
> This is likely a bug in whatever resource agent you are using. There's no
> way for us to know for sure without logs.
>
> -- Vossel
>
>
> >
> > Thanks!
> >
> > ----
> > Versions:
> > ccs-0.16.2-69.el6_5.1.x86_64
> > clusterlib-3.0.12.1-59.el6_5.2.x86_64
> > cman-3.0.12.1-59.el6_5.2.x86_64
> > corosync-1.4.1-17.el6_5.1.x86_64
> > corosynclib-1.4.1-17.el6_5.1.x86_64
> > fence-virt-0.2.3-15.el6.x86_64
> > libqb-0.16.0-2.el6.x86_64
> > modcluster-0.16.2-28.el6.x86_64
> > openais-1.1.1-7.el6.x86_64
> > openaislib-1.1.1-7.el6.x86_64
> > pacemaker-1.1.10-14.el6_5.3.x86_64
> > pacemaker-cli-1.1.10-14.el6_5.3.x86_64
> > pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
> > pacemaker-libs-1.1.10-14.el6_5.3.x86_64
> > pcs-0.9.90-2.el6.centos.3.noarch
> > resource-agents-3.9.2-40.el6_5.7.x86_64
> > ricci-0.16.2-69.el6_5.1.x86_64
> >
> >
> > _______________________________________________
> > Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > Project Home: http://www.clusterlabs.org
> > Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> > Bugs: http://bugs.clusterlabs.org
>
> >
>