[Pacemaker] IPaddr2 "Unknown interface" error caused a failover that didn't work

Wed Sep 30 18:15:55 UTC 2015

Hi everyone,
I experienced a weird issue last night where our cluster tried to
fail over due to an "Unknown interface" error.

It looks like when the IPaddr2 monitor tried to check the status of eth0, it
didn't find the device. Both nodes are VMs. I haven't found any reason why
eth0 would have "disappeared".
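
For reference, a minimal sketch of how one could check by hand whether the kernel still knows about the interface. It assumes a Linux host where each network device shows up under /sys/class/net; the helper name iface_exists is only illustrative and is not something IPaddr2 provides.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the named network interface is known to the kernel.
# Assumes Linux, where every network device has an entry under /sys/class/net.
iface_exists() {
    [ -e "/sys/class/net/$1" ]
}

# Report on the interface named in the IPaddr2 error message.
if iface_exists eth0; then
    echo "eth0 present"
else
    echo "eth0 missing"
fi
```

Running this in a loop with a timestamp around the time of the failure would be one way to see whether the device really goes away or whether only the lookup fails.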

<LOG NODE1>
Sep 29 21:25:04 node-01 IPaddr2(vip_v207_174)[4082]: ERROR: Unknown
interface [eth0] No such device.
Sep 29 21:25:04 node-01 IPaddr2(vip_v207_174)[4082]: ERROR: [findif] failed
Sep 29 21:25:05 node-01 crmd[3369]:   notice: process_lrm_event: Operation
vip_v207_174_monitor_10000: not configured (node=node-01, call=91, rc=6,
cib-update=73, confirmed=false)
Sep 29 21:25:06 node-01 attrd[3367]:   notice: attrd_cs_dispatch: Update
relayed from node-02
Sep 29 21:25:06 node-01 attrd[3367]:   notice: attrd_trigger_update:
Sending flush op to all hosts for: fail-count-vip_v207_174 (2)
Sep 29 21:25:06 node-01 attrd[3367]:   notice: attrd_perform_update: Sent
update 41: fail-count-vip_v207_174=2
Sep 29 21:25:06 node-01 attrd[3367]:   notice: attrd_cs_dispatch: Update
relayed from node-02
Sep 29 21:25:06 node-01 attrd[3367]:   notice: attrd_trigger_update:
Sending flush op to all hosts for: last-failure-vip_v207_174 (1443576306)
Sep 29 21:25:06 node-01 attrd[3367]:   notice: attrd_perform_update: Sent
update 43: last-failure-vip_v207_174=1443576306
Sep 29 21:25:07 node-01 crmd[3369]:   notice: process_lrm_event: Operation
fwcorp-mailto-sysadmin_stop_0: ok (node=node-01, call=110, rc=0,
cib-update=74, confirmed=true)
Sep 29 21:25:07 node-01 crmd[3369]:   notice: process_lrm_event: Operation
change-default-fw_stop_0: ok (node=node-01, call=112, rc=0, cib-update=75,
confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v254_230)[4259]: INFO: IP status = ok,
IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]:   notice: process_lrm_event: Operation
vip_v254_230_stop_0: ok (node=node-01, call=114, rc=0, cib-update=76,
confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v27_1)[4313]: INFO: IP status = ok,
IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]:   notice: process_lrm_event: Operation
vip_v27_1_stop_0: ok (node=node-01, call=116, rc=0, cib-update=77,
confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v26_1)[4366]: INFO: IP status = ok,
IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]:   notice: process_lrm_event: Operation
vip_v26_1_stop_0: ok (node=node-01, call=118, rc=0, cib-update=78,
confirmed=true)
Sep 29 21:25:07 node-01 IPaddr2(vip_v207_174)[4419]: INFO: IP status = ok,
IP_CIP=
Sep 29 21:25:07 node-01 crmd[3369]:   notice: process_lrm_event: Operation
vip_v207_174_stop_0: ok (node=node-01, call=120, rc=0, cib-update=79,
confirmed=true)
</LOG NODE1>

<LOG NODE2>
Sep 29 21:22:48 node-02 crmd[3241]:   notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
origin=crm_timer_popped ]
Sep 29 21:22:48 node-02 pengine[3240]:   notice: update_validation:
pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Sep 29 21:22:48 node-02 pengine[3240]:   notice: update_validation:
Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with
upgrade-1.3.xsl
Sep 29 21:22:48 node-02 pengine[3240]:   notice: update_validation:
Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Sep 29 21:22:48 node-02 pengine[3240]:   notice: unpack_config: On loss of
CCM Quorum: Ignore
Sep 29 21:22:48 node-02 crmd[3241]:   notice: run_graph: Transition 14769
(Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-786.bz2): Complete
Sep 29 21:22:48 node-02 crmd[3241]:   notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
Sep 29 21:22:48 node-02 pengine[3240]:   notice: process_pe_message:
Calculated Transition 14769: /var/lib/pacemaker/pengine/pe-input-786.bz2
Sep 29 21:25:06 node-02 crmd[3241]:  warning: update_failcount: Updating
failcount for vip_v207_174 on node-01 after failed monitor: rc=6
(update=value++, time=1443576306)
Sep 29 21:25:06 node-02 crmd[3241]:   notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Sep 29 21:25:06 node-02 pengine[3240]:   notice: update_validation:
pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Sep 29 21:25:06 node-02 pengine[3240]:   notice: update_validation:
Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with
upgrade-1.3.xsl
Sep 29 21:25:06 node-02 pengine[3240]:   notice: update_validation:
Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Sep 29 21:25:06 node-02 pengine[3240]:   notice: unpack_config: On loss of
CCM Quorum: Ignore
Sep 29 21:25:06 node-02 pengine[3240]:  warning: unpack_rsc_op_failure:
Processing failed op monitor for vip_v207_174 on node-01: not configured (6)
Sep 29 21:25:06 node-02 pengine[3240]:    error: unpack_rsc_op: Preventing
vip_v207_174 from re-starting anywhere: operation monitor failed 'not
configured' (6)
Sep 29 21:25:06 node-02 pengine[3240]:   notice: LogActions: Stop
 vip_v207_174#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]:   notice: LogActions: Stop
 vip_v26_1#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]:   notice: LogActions: Stop
 vip_v27_1#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]:   notice: LogActions: Stop
 vip_v254_230#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]:   notice: LogActions: Stop
 change-default-fw#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]:   notice: LogActions: Stop
 fwcorp-mailto-sysadmin#011(node-01)
Sep 29 21:25:06 node-02 pengine[3240]:   notice: process_pe_message:
Calculated Transition 14770: /var/lib/pacemaker/pengine/pe-input-787.bz2
Sep 29 21:25:06 node-02 crmd[3241]:   notice: te_rsc_command: Initiating
action 16: stop fwcorp-mailto-sysadmin_stop_0 on node-01
Sep 29 21:25:06 node-02 crmd[3241]:   notice: abort_transition_graph:
Transition aborted by status-node-01-fail-count-vip_v207_174,
fail-count-vip_v207_174=2: Transient attribute change (modify cib=0.94.107,
source=te_update_diff:391,
path=/cib/status/node_state[@id='node-01']/transient_attributes[@id='node-01']/instance_attributes[@id='status-node-01']/nvpair[@id='status-node-01-fail-count-vip_v207_174'],
0)
Sep 29 21:25:07 node-02 crmd[3241]:   notice: run_graph: Transition 14770
(Complete=2, Pending=0, Fired=0, Skipped=7, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-787.bz2): Stopped
Sep 29 21:25:07 node-02 pengine[3240]:   notice: update_validation:
pacemaker-1.2-style configuration is also valid for pacemaker-1.3
Sep 29 21:25:07 node-02 pengine[3240]:   notice: update_validation:
Upgrading pacemaker-1.3-style configuration to pacemaker-2.0 with
upgrade-1.3.xsl
Sep 29 21:25:07 node-02 pengine[3240]:   notice: update_validation:
Transformed the configuration from pacemaker-1.2 to pacemaker-2.0
Sep 29 21:25:07 node-02 pengine[3240]:   notice: unpack_config: On loss of
CCM Quorum: Ignore
Sep 29 21:25:07 node-02 pengine[3240]:  warning: unpack_rsc_op_failure:
Processing failed op monitor for vip_v207_174 on node-01: not configured (6)
Sep 29 21:25:07 node-02 pengine[3240]:    error: unpack_rsc_op: Preventing
vip_v207_174 from re-starting anywhere: operation monitor failed 'not
configured' (6)
Sep 29 21:25:07 node-02 pengine[3240]:   notice: LogActions: Stop
 vip_v207_174#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]:   notice: LogActions: Stop
 vip_v26_1#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]:   notice: LogActions: Stop
 vip_v27_1#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]:   notice: LogActions: Stop
 vip_v254_230#011(node-01)
Sep 29 21:25:07 node-02 pengine[3240]:   notice: LogActions: Stop
 change-default-fw#011(node-01)
Sep 29 21:25:07 node-02 crmd[3241]:   notice: te_rsc_command: Initiating
action 14: stop change-default-fw_stop_0 on node-01
Sep 29 21:25:07 node-02 pengine[3240]:   notice: process_pe_message:
Calculated Transition 14771: /var/lib/pacemaker/pengine/pe-input-788.bz2
Sep 29 21:25:07 node-02 crmd[3241]:   notice: te_rsc_command: Initiating
action 13: stop vip_v254_230_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]:   notice: te_rsc_command: Initiating
action 12: stop vip_v27_1_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]:   notice: te_rsc_command: Initiating
action 11: stop vip_v26_1_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]:   notice: te_rsc_command: Initiating
action 3: stop vip_v207_174_stop_0 on node-01
Sep 29 21:25:07 node-02 crmd[3241]:   notice: run_graph: Transition 14771
(Complete=8, Pending=0, Fired=0, Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-788.bz2): Complete
Sep 29 21:25:07 node-02 crmd[3241]:   notice: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
cause=C_FSA_INTERNAL origin=notify_crmd ]
</LOG NODE2>
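
To make the rc=6 in the logs above easier to read: the OCF resource agent API defines a fixed set of exit codes, and 6 is OCF_ERR_CONFIGURED, which the logs render as "not configured". Below is a small sketch that pulls the rc out of a crmd line and maps it to its OCF name. The function name ocf_rc_name is made up for illustration, and only the commonly seen codes 0 to 7 are listed.

```shell
#!/bin/sh
# Hypothetical helper: map an OCF resource agent exit code to its symbolic name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        *) echo "unknown ($1)" ;;
    esac
}

# Fragment of the process_lrm_event line from node-01, copied from the log.
line='vip_v207_174_monitor_10000: not configured (node=node-01, call=91, rc=6,'

# Extract the digits that follow "rc=".
rc=$(printf '%s\n' "$line" | sed -n 's/.*rc=\([0-9][0-9]*\).*/\1/p')

echo "rc=$rc -> $(ocf_rc_name "$rc")"
```

This matches the pengine lines on node-02, which pair the text 'not configured' with the code (6).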

I know that I found some posts that say to run sysctl -w
net.ipv4.conf.all.promote_secondaries=1 to avoid secondary addresses being
removed when the primary is gone, but in this case eth0 has a single address
that is managed through IPaddr2 within the crm configuration.
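
For completeness, a small sketch for checking the current value of the setting mentioned above. It assumes a Linux host that exposes sysctls under /proc/sys; the helper name show_promote_secondaries and its optional second argument are hypothetical and exist only so the helper can be tried against a scratch directory.

```shell
#!/bin/sh
# Hypothetical helper: print promote_secondaries for a scope
# ("all", "default", or an interface name such as eth0).
# The optional second argument overrides the sysctl root (assumption: /proc/sys).
show_promote_secondaries() {
    scope="${1:-all}"
    root="${2:-/proc/sys}"
    cat "$root/net/ipv4/conf/$scope/promote_secondaries"
}

# Example call:  show_promote_secondaries all
# To change the value at runtime (as root):
#   sysctl -w net.ipv4.conf.all.promote_secondaries=1
```

A value of 1 means that when a primary address is removed, a secondary address in the same subnet is promoted instead of being removed along with it.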

Here's the configuration of the nodes:

<CONFIGURATION>
Cluster Name: nodecluster1
Corosync Nodes:
 node-01 node-02
Pacemaker Nodes:
 node-01 node-02

Resources:
 Group: lbpcivip
  Resource: vip_v207_174 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.x.174 cidr_netmask=27 broadcast=x.x.x.191 nic=eth0
   Operations: monitor interval=10s (vip_v207_174-monitor-interval-10s)
  Resource: vip_v26_1 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.26.1
   Operations: monitor interval=10s (vip_v26_1-monitor-interval-10s)
  Resource: vip_v27_1 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.27.1
   Operations: monitor interval=10s (vip_v27_1-monitor-interval-10s)
  Resource: vip_v254_230 (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=x.x.254.230
   Operations: monitor interval=10s (vip_v254_230-monitor-interval-10s)
  Resource: change-default-fw (class=lsb type=fwdefaultgw)
   Operations: monitor interval=60s (change-default-fw-monitor-interval-60s)
  Resource: fwcorp-mailto-sysadmin (class=ocf provider=heartbeat type=MailTo)
   Attributes: email=its at touchtunes.com subject="[node - Clustered services]"
   Operations: monitor interval=60s (fwcorp-mailto-sysadmin-monitor-interval-60s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.11-97629de
 last-lrm-refresh: 1412269491
 no-quorum-policy: ignore
 stonith-enabled: false
</CONFIGURATION>

Does anyone have a suggestion on how I can solve this issue? Why didn't the
failover from node-01 to node-02 work?

If more information is required, let me know. Any suggestions would be
appreciated!

Thanks!

--
                         !!!!!
                       ( o o )
 --------------oOO----(_)----OOo--------------
   Luc Paulin
   email: paulinster(at)gmail.com
   Skype: paulinster