[Pacemaker] Help understanding what went wrong

Wed Oct 29 20:42:31 EDT 2014

Hi

rpm -qa | grep pace
pacemaker-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cluster-libs-1.1.10-14.el6_5.3.x86_64
pacemaker-cli-1.1.10-14.el6_5.3.x86_64
pacemaker-1.1.10-14.el6_5.3.x86_64

centos 6.5

I have a 2 node cluster
pcs config 
Cluster Name: ybrp
Corosync Nodes:

Pacemaker Nodes:
 wwwrp1 wwwrp2 

Resources: 
 Resource: ybrpip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=10.32.43.50 cidr_netmask=24 nic=eth0 clusterip_hash=sourceip-sourceport 
  Meta Attrs: stickiness=0,migration-threshold=3,failure-timeout=600s 
  Operations: start on-fail=restart interval=0s timeout=60s (ybrpip-start-interval-0s)
              monitor on-fail=restart interval=5s timeout=20s (ybrpip-monitor-interval-5s)
              stop interval=0s timeout=60s (ybrpip-stop-interval-0s)
 Clone: ybrpstat-clone
  Meta Attrs: globally-unique=false clone-max=2 clone-node-max=1 
  Resource: ybrpstat (class=ocf provider=yb type=proxy)
   Operations: monitor on-fail=restart interval=5s timeout=20s (ybrpstat-monitor-interval-5s)

Stonith Devices: 
Fencing Levels: 

Location Constraints:
Ordering Constraints:
  start ybrpstat-clone then start ybrpip (Mandatory) (id:order-ybrpstat-clone-ybrpip-mandatory)
Colocation Constraints:
  ybrpip with ybrpstat-clone (INFINITY) (id:colocation-ybrpip-ybrpstat-clone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.10-14.el6_5.3-368c726
 last-lrm-refresh: 1414629002
 no-quorum-policy: ignore
 stonith-enabled: false

Basically 1 node died (wwwrp1) hardware failure

I can see in the logs the cluster wants to bring the IP address over and it seems to do it but

## this seems to be the ip address moving
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used

## this seems to be where it checks other stuff
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]

# this is where monitor times out, but it doesn't look like 20000ms .. the initial try was at Oct 30 01:19:43
Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
Oct 30 01:19:44 wwwrp2 crmd[2462]:    error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
Oct 30 01:19:44 wwwrp2 crmd[2462]:  warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)

I'm guessing because it timed out it went into failed mode. I need to know why it timed out. The script has never timed out before or in testing...

Am I reading this right.  The reason the resource didn't fail over (ip address) was because there was no ybrpstat running on wwwrp2 and the reason for that was the monitor action failed/timed out

Thanks
Alex

=== logs ==

Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Start   ybrpip#011(wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Recover ybrpstat:0#011(Started wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated Transition 6827: /var/lib/pacemaker/pengine/pe-input-2.bz2
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:42 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Start   ybrpip#011(wwwrp2)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: LogActions: Stop    ybrpstat:0#011(wwwrp1)
Oct 30 01:19:42 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated Transition 6828: /var/lib/pacemaker/pengine/pe-input-3.bz2
Oct 30 01:19:42 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 7: start ybrpip_start_0 on wwwrp2 (local)
Oct 30 01:19:42 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 1: stop ybrpstat_stop_0 on wwwrp1
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Adding inet address 10.32.43.50/24 with broadcast address 10.32.43.255 to device eth0
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: Bringing device eth0 up
Oct 30 01:19:43 wwwrp2 IPaddr2(ybrpip)[25778]: INFO: /usr/libexec/heartbeat/send_arp -i 200 -r 5 -p /var/run/resource-agents/send_arp-10.32.43.50 eth0 10.32.43.50 auto not_used not_used
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation ybrpip_start_0 (call=13476, rc=0, cib-update=11762, confirmed=true) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: te_rsc_command: Initiating action 8: monitor ybrpip_monitor_5000 on wwwrp2 (local)
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: process_lrm_event: LRM operation ybrpip_monitor_5000 (call=13479, rc=0, cib-update=11763, confirmed=false) ok
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: run_graph: Transition 6828 (Complete=6, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-3.bz2): Complete
Oct 30 01:19:43 wwwrp2 crmd[2462]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: child_timeout_callback: ybrpstat_monitor_5000 process (PID 25712) timed out
Oct 30 01:19:44 wwwrp2 lrmd[2459]:  warning: operation_finished: ybrpstat_monitor_5000:25712 - timed out after 20000ms
Oct 30 01:19:44 wwwrp2 crmd[2462]:    error: process_lrm_event: LRM operation ybrpstat_monitor_5000 (13473) Timed Out (timeout=20000ms)
Oct 30 01:19:44 wwwrp2 crmd[2462]:  warning: update_failcount: Updating failcount for ybrpstat on wwwrp2 after failed monitor: rc=1 (update=value++, time=1414592384)
Oct 30 01:19:44 wwwrp2 crmd[2462]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ybrpstat (1)
Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_perform_update: Sent update 12543: fail-count-ybrpstat=1
Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ybrpstat (1414592384)
Oct 30 01:19:44 wwwrp2 attrd[2460]:   notice: attrd_perform_update: Sent update 12545: last-failure-ybrpstat=1414592384
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Restart ybrpip#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated Transition 6829: /var/lib/pacemaker/pengine/pe-input-4.bz2
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: unpack_config: On loss of CCM Quorum: Ignore
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op start for ybrpstat:0 on wwwrp1: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: unpack_rsc_op: Processing failed op monitor for ybrpstat:0 on wwwrp2: unknown error (1)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]:  warning: common_apply_stickiness: Forcing ybrpstat-clone away from wwwrp1 after 1000000 failures (max=1000000)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Restart ybrpip#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: LogActions: Recover ybrpstat:0#011(Started wwwrp2)
Oct 30 01:19:44 wwwrp2 pengine[2461]:   notice: process_pe_message: Calculated Transition 6830: /var/lib/pacemaker/pengine/pe-input-5.bz2