[Pacemaker] Race condition in pacemaker/lrmd cooperation right after live migration
Vladislav Bogdanov
bubble at hoster-ok.com
Mon Jul  4 11:51:30 UTC 2011

Hi all,
I suspect that a race condition is possible during live migration of
resources.
I put one node into standby mode, which made all resources migrate to
the other node.
The virtual machines were live-migrated successfully, but were marked as
FAILED almost immediately afterwards.
The logs show some interesting details:
=========
Jul  4 10:21:48 s01-1 VirtualDomain[22988]: INFO:
mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration to s01-0 succeeded.
Jul  4 10:21:48 s01-1 lrmd: [7741]: info: RA output:
(mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_to:stdout) Domain
mgmt01.c01.ttc.prague.cz.vds-ok.com has been undefined
Jul  4 10:21:48 s01-0 VirtualDomain[4641]: INFO:
mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration from s01-1 succeeded.
Jul  4 10:21:49 s01-0 lrmd: [1927]: info: RA output:
(mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_from:stderr)
mgmt01.c01.ttc.prague.cz.vds-ok.com-vm is active on more than one node,
returning the default value for <null>
Jul  4 10:21:49 s01-1 crmd: [7744]: info: do_lrm_rsc_op: Performing
key=110:695:0:7ae65826-5d35-41c0-945a-8336ecb0bc3c
op=mgmt01.c01.ttc.prague.cz.vds-ok.com-vm_stop_0 )
Jul  4 10:21:49 s01-1 lrmd: [7741]: info:
rsc:mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:1006: stop
Jul  4 10:21:49 s01-1 VirtualDomain[24062]: ERROR: Virtual domain
mgmt01.c01.ttc.prague.cz.vds-ok.com has no state during stop operation,
bailing out.
Jul  4 10:21:49 s01-1 crmd: [7744]: info: process_lrm_event: LRM
operation mgmt01.c01.ttc.prague.cz.vds-ok.com-vm_stop_0 (call=1006,
rc=0, cib-update=1031, confirmed=true) ok
=========
Note that the line with "is active on more than one node" immediately
follows "migration from s01-1 succeeded" in syslog (in both the local and
remote files), so it was put into the syslog queue right after the former one.
From what I understand, lrmd decided to fail the resource just because
the 'stop' operation had not yet been run on the other node.
What else could it be if my suspicion is wrong?
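
To make the ordering explicit, here is a minimal Python sketch (not part of
any Pacemaker tooling) that replays the key lines from the excerpt above and
checks that the "active on more than one node" message comes before the
'stop' on s01-1. The helper names parse() and first_index() are made up for
illustration. Syslog timestamps only have one-second resolution, so the
sketch assumes that the order of lines in the file reflects arrival order
and uses it as a tie-breaker.

```python
import re
from datetime import datetime

# Key lines from the excerpt above, with wrapped continuation lines joined.
LOG = """\
Jul  4 10:21:48 s01-1 VirtualDomain[22988]: INFO: mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration to s01-0 succeeded.
Jul  4 10:21:48 s01-0 VirtualDomain[4641]: INFO: mgmt01.c01.ttc.prague.cz.vds-ok.com: live migration from s01-1 succeeded.
Jul  4 10:21:49 s01-0 lrmd: [1927]: info: RA output: (mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:migrate_from:stderr) mgmt01.c01.ttc.prague.cz.vds-ok.com-vm is active on more than one node, returning the default value for <null>
Jul  4 10:21:49 s01-1 lrmd: [7741]: info: rsc:mgmt01.c01.ttc.prague.cz.vds-ok.com-vm:1006: stop
"""

# "<Mon> <day> <HH:MM:SS> <host> <rest of message>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(\S+)\s+(.*)$")


def parse(log, year=2011):
    """Return (timestamp, file_position, host, message) tuples, sorted.

    The year is not present in syslog lines, so it is passed in (assumption).
    Month names are parsed with %b, which expects an English/C locale.
    """
    events = []
    for pos, line in enumerate(log.splitlines()):
        match = LINE_RE.match(line)
        if match is None:
            continue
        stamp_text = " ".join(match.group(1).split())  # collapse "Jul  4"
        stamp = datetime.strptime(f"{year} {stamp_text}", "%Y %b %d %H:%M:%S")
        events.append((stamp, pos, match.group(2), match.group(3)))
    # Sort by timestamp first, then by position in the file as a tie-breaker.
    return sorted(events)


def first_index(events, needle):
    """Index of the first event whose message contains `needle`."""
    for index, event in enumerate(events):
        if needle in event[3]:
            return index
    raise ValueError(f"no event contains {needle!r}")


events = parse(LOG)
multi_active = first_index(events, "active on more than one node")
stop_issued = first_index(events, "vm:1006: stop")

for stamp, _, host, message in events:
    print(stamp.time(), host, message[:60])
print("multi-active reported before stop on s01-1:", multi_active < stop_issued)
```

This only confirms the ordering visible in the logs; it does not show which
component made the decision to mark the resource as failed.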
The Pacemaker version is 'almost' the 1.1-devel tip.
cluster-glue is 1.0.7.
I use my own version of the VirtualDomain RA, but it has the same migration
logic as the stock one.
Best,
Vladislav
    
    