[Pacemaker] Race condition in pacemaker/lrmd cooperation right after live migration
Vladislav Bogdanov
bubble at hoster-ok.com
Mon Jul 4 12:51:30 CET 2011
Hi all,
I have the feeling that a race condition is possible during live migration
of resources.
I put one node into standby mode, which made all resources migrate to
the other one.
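For reference, the standby step itself was nothing special, just the usual
crm shell command (the node name here is my assumption, based on the logs
below showing the VMs migrating from s01-1 to s01-0):

  crm node standby s01-1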
Virtual machines were successfully live-migrated, but were then marked as
FAILED almost immediately.
The logs show some interesting details:
=========
Jul 4 10:21:48 s01-1 VirtualDomain[22988]: INFO:
mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration to s01-0 succeeded.
Jul 4 10:21:48 s01-1 lrmd: [7741]: info: RA output:
(mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_to:stdout) Domain
mgmt01.c01.ttc.prague.cz.vds-ok.com has been undefined
Jul 4 10:21:48 s01-0 VirtualDomain[4641]: INFO:
mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration from s01-1 succeeded.
Jul 4 10:21:49 s01-0 lrmd: [1927]: info: RA output:
(mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_from:stderr)
mgmt01.c01.ttc.prague.cz.vds-ok.com-vm is active on more than one node,
returning the default value for <null>
Jul 4 10:21:49 s01-1 crmd: [7744]: info: do_lrm_rsc_op: Performing
key=110:695:0:7ae65826-5d35-41c0-945a-8336ecb0bc3c
op=mgmt01.c01.ttc.prague.cz.vds-ok.com-vm_stop_0 )
Jul 4 10:21:49 s01-1 lrmd: [7741]: info:
rsc:mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:1006: stop
Jul 4 10:21:49 s01-1 VirtualDomain[24062]: ERROR: Virtual domain
mgmt01.c01.ttc.prague.cz.vds-ok.com has no state during stop operation,
bailing out.
Jul 4 10:21:49 s01-1 crmd: [7744]: info: process_lrm_event: LRM
operation mgmt01.c01.ttc.prague.cz.vds-ok.com-vm_stop_0 (call=1006,
rc=0, cib-update=1031, confirmed=true) ok
=========
Note that the line with "is active on more than one node" immediately
follows "migration from s01-1 succeeded" in syslog (in both the local and
remote files), so it was put into the syslog queue right after the former one.
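If it helps to catch that window by hand, I would simply poll crm_resource
while a migration is in flight; this is only a rough sketch, with the
resource name taken from the logs above:

  # run on either node while the live migration is in progress
  while true; do
    date +%H:%M:%S.%N
    crm_resource --resource mgmt01.c01.ttc.prague.cz.vds-ok.com-vm --locate
    sleep 0.2
  done

If there is a moment when --locate reports the resource on both s01-1 and
s01-0, that would line up with the overlap suggested by the log ordering above.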