[Pacemaker] Failing to move around IPaddr2 resource

Mon Feb 20 01:44:55 UTC 2012

I have three servers on which I'm trying to set up IP failover with
Heartbeat. There are three IPs, one for each machine, and when a machine
goes down I want its IP to be assigned to a different machine. This is all
working splendidly.

In addition, I want an IP to be assigned to a different machine when
either the internal OR the external network interface of its machine goes
down. To do this, I have a ping resource on each machine that pings the
other two machines' internal and external IPs (so 4 IPs in total are pinged
from each machine). This is where I'm having problems.

When I take down a network interface manually with ifdown, the cluster
sometimes fails to stop the IP resources on the machines. This is what
crm_mon outputs:

============
Last updated: Sun Feb 19 19:29:53 2012
Stack: Heartbeat
Current DC: anlutest2 (32769730-5e5e-40d6-baa0-9748131232da) - partition with quorum
Version: 1.0.8-042548a451fce8400660f6031f4da6f0223dd5dd
3 Nodes configured, unknown expected votes
6 Resources configured.
============

Online: [ anlutest1 anlutest3 anlutest2 ]

address01       (ocf::heartbeat:IPaddr2):       Started anlutest2 (unmanaged) FAILED
address02       (ocf::heartbeat:IPaddr2):       Started anlutest3
address03       (ocf::heartbeat:IPaddr2):       Started anlutest1 (unmanaged) FAILED
ping01  (ocf::pacemaker:ping):  Started anlutest1
ping02  (ocf::pacemaker:ping):  Started anlutest2
ping03  (ocf::pacemaker:ping):  Started anlutest3

Failed actions:
    address01_stop_0 (node=anlutest2, call=454, rc=1, status=complete): unknown error
    address03_stop_0 (node=anlutest1, call=104, rc=1, status=complete): unknown error

The reason for this seems to be detailed in the syslog:

Feb 19 19:25:06 anlutest1 lrmd: [27108]: info: rsc:address03:104: stop
Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM operation address01_monitor_5000 (call=100, status=1, cib-update=0, confirmed=true) Cancelled
Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM operation address03_monitor_5000 (call=102, status=1, cib-update=0, confirmed=true) Cancelled
Feb 19 19:25:06 anlutest1 IPaddr2[32290]: [32350]: INFO: IP status = ok, IP_CIP=
Feb 19 19:25:06 anlutest1 IPaddr2[32291]: [32351]: INFO: IP status = ok, IP_CIP=
Feb 19 19:25:06 anlutest1 IPaddr2[32290]: [32354]: INFO: ip -f inet addr delete 50.97.234.170/29 dev eth1
Feb 19 19:25:06 anlutest1 IPaddr2[32291]: [32355]: INFO: ip -f inet addr delete 50.97.234.172/29 dev eth1
Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM operation address01_stop_0 (call=103, rc=0, cib-update=135, confirmed=true) ok
Feb 19 19:25:06 anlutest1 lrmd: [27108]: info: RA output: (address03:stop:stderr) RTNETLINK answers: Cannot assign requested address
Feb 19 19:25:06 anlutest1 crmd: [27111]: info: process_lrm_event: LRM operation address03_stop_0 (call=104, rc=1, cib-update=136, confirmed=true) unknown error
Feb 19 19:25:07 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush message from anlutest2
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: Update relayed from anlutest2
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-address03 (INFINITY)
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_perform_update: Sent update 377: fail-count-address03=INFINITY
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: Update relayed from anlutest2
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-address03 (1329701107)
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_perform_update: Sent update 379: last-failure-address03=1329701107
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush message from anlutest2
Feb 19 19:25:08 anlutest1 attrd: [27110]: info: attrd_ha_callback: flush message from anlutest2

But I have no idea what the RTNETLINK error means. Googling turns up
reports about Ubuntu wireless drivers, but these interfaces are all wired.
Does anyone have any idea what is going on? I suspect some sort of odd IP
assignment is happening, perhaps because the ping resources do not all
report their scores at the same time.
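
To make the timing in the syslog excerpt above easier to see, here is a minimal Python sketch that lists which `ip addr delete` commands were logged in the same second for the same device and subnet, and which resource's stop printed an RTNETLINK message. The function name `summarize_stops` and the regular expressions are assumptions written against the log format shown here, not part of Pacemaker or the IPaddr2 agent. Whether the overlap it reports is related to the error is not established by this post.

```python
import ipaddress
import re
from collections import defaultdict

# Matches the IPaddr2 "addr delete" lines as they appear in the excerpt.
DELETE_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} +\d+ \d\d:\d\d:\d\d) (?P<host>\S+) "
    r"IPaddr2\[\d+\]: \[\d+\]: INFO: ip -f inet addr delete "
    r"(?P<addr>\S+) dev (?P<dev>\S+)"
)
# Matches stderr captured by lrmd from a resource's stop action.
ERROR_RE = re.compile(r"\((?P<rsc>[^:()\s]+):stop:stderr\) RTNETLINK answers:")


def summarize_stops(log_text):
    """Return (overlapping_deletes, resources_with_rtnetlink_errors)."""
    # Collapse whitespace so lines wrapped by the mail client still match.
    flat = " ".join(log_text.split())
    groups = defaultdict(list)
    for m in DELETE_RE.finditer(flat):
        network = ipaddress.ip_interface(m["addr"]).network
        key = (m["ts"], m["host"], m["dev"], str(network))
        groups[key].append(m["addr"])
    # Keep only groups where more than one address was deleted together.
    overlapping = {k: v for k, v in groups.items() if len(v) > 1}
    failed = [m["rsc"] for m in ERROR_RE.finditer(flat)]
    return overlapping, failed
```

Run against the excerpt, this groups 50.97.234.170/29 and 50.97.234.172/29 together, since both are on eth1 in the same /29 and were deleted at 19:25:06, and it reports address03 as the resource whose stop emitted the RTNETLINK message.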

When I manually clean up the failed resources, they are properly assigned
to the nodes that aren't down. So if we can't resolve the underlying issue,
is there a way to automatically attempt a cleanup of failed resources a
limited number of times?
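
To illustrate what "a limited number of times" could mean, here is a minimal Python sketch of an external wrapper that retries the cleanup with an upper bound. It is not a built-in Pacemaker feature. It assumes the crm shell is installed so that `crm resource cleanup <resource>` can be run, and it relies on a caller-supplied `is_failed` check, which is hypothetical and would have to be implemented separately (for example by inspecting crm_mon output). The names `cleanup_with_limit`, `max_attempts` and `delay_s` are likewise illustrative.

```python
import subprocess
import time


def cleanup_with_limit(resource, is_failed, max_attempts=3, delay_s=10.0,
                       run=subprocess.run, sleep=time.sleep):
    """Retry `crm resource cleanup` while the resource is still failed.

    is_failed is a hypothetical callable: is_failed(resource) -> bool.
    Returns (cleanups_issued, still_failed).
    """
    attempts = 0
    while attempts < max_attempts and is_failed(resource):
        # check=False: a non-zero exit does not raise; we simply re-check.
        run(["crm", "resource", "cleanup", resource], check=False)
        attempts += 1
        # Give the cluster time to re-probe before checking again.
        sleep(delay_s)
    return attempts, is_failed(resource)
```

The `run` and `sleep` parameters exist only so the loop can be exercised without a cluster. If the resource is still failed after `max_attempts` cleanups, the function gives up and reports that, which leaves the resource for manual inspection.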

My configuration is here, in case there's anything wrong with it.

Anlu