[Pacemaker] Help understanding what went wrong
Alex Samad - Yieldbroker
Alex.Samad at yieldbroker.com
Thu Oct 30 01:06:31 UTC 2014
Hi
Seems like I had the same problem on my prod cluster.
Oct 30 01:21:53 prodrp1 lrmd[2536]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 17918) timed out
Oct 30 01:21:53 prodrp1 lrmd[2536]: warning: operation_finished: ybrpstat_monitor_5000:17918 - timed out after 20000ms
Oct 30 01:21:53 prodrp1 crmd[2539]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (4384) Timed Out (timeout=20000ms)
Oct 30 01:21:53 prodrp1 attrd[2537]: notice: attrd_cs_dispatch: Update relayed from prodrp2
This is the first log entry recorded on prodrp1.
The issue is that prodrp1 was the master for the IP.
I am wondering if it has something to do with it being a cloned resource?
So the cluster realises there is a problem, and in the middle of handling it the cluster-check monitor also fails, because (assumption here) it does some sort of cluster-wide check!
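One way to narrow this down is to time the agent's monitor action by hand on the node, outside the cluster, and (if the check really is slow under load) raise the op timeout. This is only a sketch: the agent path below assumes the standard OCF layout for the custom `yb` provider, and the 60s timeout is an example value, not a recommendation.

```shell
# Invoke the custom yb:proxy agent's monitor action directly and time it.
# Path assumes the standard OCF layout (/usr/lib/ocf/resource.d/<provider>/<type>);
# adjust if your agent is installed elsewhere.
export OCF_ROOT=/usr/lib/ocf
time "$OCF_ROOT/resource.d/yb/proxy" monitor; echo "rc=$?"

# If the check is legitimately slow, replace the monitor op with a longer
# timeout (example: keep the 5s interval, raise the timeout from 20s to 60s).
pcs resource op remove ybrpstat monitor interval=5s timeout=20s
pcs resource op add ybrpstat monitor interval=5s timeout=60s on-fail=restart
```

Running the agent several times while the cluster is under the same load as during the failure should show whether the script itself can exceed 20s, or whether the lrmd timeout is a symptom of something else (e.g. the node being starved during the failover).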
A
> -----Original Message-----
> From: Alex Samad - Yieldbroker [mailto:Alex.Samad at yieldbroker.com]
> Sent: Thursday, 30 October 2014 11:43 AM
> To: pacemaker at oss.clusterlabs.org
> Subject: [Pacemaker] Help understanding what went wrong
>
> Hi
>
>
> rpm -qa | grep pace
> pacemaker-libs-1.1.10-14.el6_5.3.x86_64
> pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
> pacemaker-cli-1.1.10-14.el6_5.3.x86_64
> pacemaker-1.1.10-14.el6_5.3.x86_64
>
> centos 6.5
>
>
> I have a 2 node cluster
> pcs config
> Cluster Name: ybrp
> Corosync Nodes:
>
> Pacemaker Nodes:
> wwwrp1 wwwrp2
>
> Resources:
> Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
> Attributes: ip=10.32.43.50 cidr_netmask=24 nic=eth0
> clusterip_hash=sourceip-sourceport
> Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s
> Operations: start on-fail=restart interval=0s timeout=60s (ybrpip-start-
> interval-0s)
> monitor on-fail=restart interval=5s timeout=20s (ybrpip-monitor-
> interval-5s)
> stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
> Clone: ybrpstat-clone
> Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1
> Resource: ybrpstat (class=ocf provider=yb type=proxy)
> Operations: monitor on-fail=restart interval=5s timeout=20s (ybrpstat-
> monitor-interval-5s)
>
> Stonith Devices:
> Fencing Levels:
>
> Location Constraints:
> Ordering Constraints:
> start ybrpstat-clone then start ybrpip (Mandatory) (id:order-ybrpstat-clone-
> ybrpip-mandatory)
> Colocation Constraints:
> ybrpip with ybrpstat-clone (INFINITY) (id:colocation-ybrpip-ybrpstat-clone-
> INFINITY)
>
> Cluster Properties:
> cluster-infrastructure: cman
> dc-version: 1.1.10-14.el6_5.3-368c726
> last-lrm-refresh: 1414629002
> no-quorum-policy: ignore
> stonith-enabled: false
>
>
>
> Basically 1 node died (wwwrp1) hardware failure
>
> I can see in the logs the cluster wants to bring the IP address over and it
> seems to do it but
>
>
>
> ## this seems to be the ip address moving
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used
>
>
> ## this seems to be where it checks other stuff
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM
> operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762,
> confirmed=true) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating
> action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM
> operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763,
> confirmed=false) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828
> (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
>
> # this is where the monitor times out, but it doesn't look like 20000ms had elapsed .. the initial try was at Oct 30 01:19:43
> Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
> Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
> Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
> Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)
>
> I'm guessing that because it timed out it went into failed mode. I need to
> know why it timed out; the script has never timed out before, in production
> or in testing...
>
> Am I reading this right? The reason the resource (the IP address) didn't
> fail over was that there was no ybrpstat running on wwwrp2, and the reason
> for that was that the monitor action failed/timed out.
>
> Thanks
> Alex
>
>
> === logs ==
>
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing
> failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start
> ybrpip#011(wwwrp1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Recover
> ybrpstat:0#011(Started wwwrp1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message:
> Calculated Transition 6827: /var/lib/pacemaker/pengine/pe-input-2.bz2
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Start
> ybrpip#011(wwwrp2)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: LogActions: Stop
> ybrpstat:0#011(wwwrp1)
> Oct 30 01:19:42 wwwrp2 pengine[2461]: notice: process_pe_message:
> Calculated Transition 6828: /var/lib/pacemaker/pengine/pe-input-3.bz2
> Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating
> action 7: start ybrpip_start_0 on wwwrp2 (local)
> Oct 30 01:19:42 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating
> action 1: stop ybrpstat_stop_0 on wwwrp1
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
> Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM
> operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762,
> confirmed=true) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: te_rsc_command: Initiating
> action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: process_lrm_event: LRM
> operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763,
> confirmed=false) ok
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: run_graph: Transition 6828
> (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
> Oct 30 01:19:43 wwwrp2 crmd[2462]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
> Oct 30 01:19:44 wwwrp2 lrmd[2459]: warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
> Oct 30 01:19:44 wwwrp2 crmd[2462]: error: process_lrm_event: LRM
> operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
> Oct 30 01:19:44 wwwrp2 crmd[2462]: warning: update_failcount: Updating
> failcount for ybrpstat on wwwrp2 after failed monitor: rc=1
> (update=value++, time=1414592384)
> Oct 30 01:19:44 wwwrp2 crmd[2462]: notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC
> cause=C_FSA_INTERNAL origin=abort_transition_graph ]
> Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending
> flush op to all hosts for: fail-count-ybrpstat (1)
> Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent
> update 12543: fail-count-ybrpstat=1
> Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_trigger_update: Sending
> flush op to all hosts for: last-failure-ybrpstat (1414592384)
> Oct 30 01:19:44 wwwrp2 attrd[2460]: notice: attrd_perform_update: Sent
> update 12545: last-failure-ybrpstat=1414592384
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Restart
> ybrpip#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover
> ybrpstat:0#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message:
> Calculated Transition 6829: /var/lib/pacemaker/pengine/pe-input-4.bz2
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: unpack_config: On loss of
> CCM Quorum: Ignore
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Restart
> ybrpip#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: LogActions: Recover
> ybrpstat:0#011(Started wwwrp2)
> Oct 30 01:19:44 wwwrp2 pengine[2461]: notice: process_pe_message:
> Calculated Transition 6830: /var/lib/pacemaker/pengine/pe-input-5.bz2
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org Getting started:
> http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org