[Pacemaker] Question about behavior after a failure during migrate_to

Kazunori INOUE kazunori.inoue3 at gmail.com
Thu Dec 19 02:36:27 EST 2013


Hi David,

2013/12/19 David Vossel <dvossel at redhat.com>:
>
> ----- Original Message -----
>> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
>> To: "pm" <pacemaker at oss.clusterlabs.org>
>> Sent: Wednesday, December 18, 2013 4:56:20 AM
>> Subject: [Pacemaker] Question about behavior after a failure during migrate_to
>>
>> Hi,
>>
>> When a node crashed while the VM resource was migrating, the VM ended up
>> started on two nodes. [1]
>> Is this the designed behavior?
>>
>> [1]
>>    Stack: corosync
>>    Current DC: bl460g1n6 (3232261592) - partition with quorum
>>    Version: 1.1.11-0.4.ce5d77c.git.el6-ce5d77c
>>    3 Nodes configured
>>    8 Resources configured
>>
>>
>>    Online: [ bl460g1n6 bl460g1n8 ]
>>    OFFLINE: [ bl460g1n7 ]
>>
>>    Full list of resources:
>>
>>    prmDummy        (ocf::pacemaker:Dummy): Started bl460g1n6
>>    prmVM2  (ocf::heartbeat:VirtualDomain): Started bl460g1n8
>>
>>
>>    # ssh bl460g1n6 virsh list --all
>>     Id    Name                           State
>>    ----------------------------------------------------
>>     113   vm2                            running
>>
>>    # ssh bl460g1n8 virsh list --all
>>     Id    Name                           State
>>    ----------------------------------------------------
>>     34    vm2                            running
>>
>>
>> [Steps to reproduce]
>> 1) Before migration: vm2 running on bl460g1n7 (DC)
>>
>>    Stack: corosync
>>    Current DC: bl460g1n7 (3232261593) - partition with quorum
>>    Version: 1.1.11-0.4.ce5d77c.git.el6-ce5d77c
>>    3 Nodes configured
>>    8 Resources configured
>>
>>
>>    Online: [ bl460g1n6 bl460g1n7 bl460g1n8 ]
>>
>>    Full list of resources:
>>
>>    prmDummy        (ocf::pacemaker:Dummy): Started bl460g1n7
>>    prmVM2  (ocf::heartbeat:VirtualDomain): Started bl460g1n7
>>
>>    ...snip...
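>>
>> (For illustration only, since the resource definitions are snipped
>> above: a live-migratable VirtualDomain resource is typically
>> configured along these lines -- the config path and values below are
>> placeholders, not the actual settings used here.)
>>
>>    primitive prmVM2 ocf:heartbeat:VirtualDomain \
>>        params config="/etc/libvirt/qemu/vm2.xml" \
>>               hypervisor="qemu:///system" \
>>               migration_transport="ssh" \
>>        op monitor interval="10s" timeout="30s" \
>>        meta allow-migrate="true"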
>>
>> 2) Migrate the VM resource,
>>
>>    # crm resource move prmVM2
>>
>>    bl460g1n6 was selected as the migration destination.
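>>
>>    (Note that "crm resource move" works by adding a temporary location
>>    constraint; after the test it has to be cleared again, e.g.:)
>>
>>    # crm resource unmove prmVM2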
>>
>>    Dec 18 14:11:36 bl460g1n7 crmd[6928]:   notice: te_rsc_command:
>> Initiating action 47: migrate_to prmVM2_migrate_to_0 on bl460g1n7
>> (local)
>>    Dec 18 14:11:36 bl460g1n7 lrmd[6925]:     info:
>> cancel_recurring_action: Cancelling operation prmVM2_monitor_10000
>>    Dec 18 14:11:36 bl460g1n7 crmd[6928]:     info: do_lrm_rsc_op:
>> Performing key=47:5:0:ddf348fe-fbad-4abb-9a12-8250f71b075a
>> op=prmVM2_migrate_to_0
>>    Dec 18 14:11:36 bl460g1n7 lrmd[6925]:     info: log_execute:
>> executing - rsc:prmVM2 action:migrate_to call_id:33
>>    Dec 18 14:11:36 bl460g1n7 crmd[6928]:     info: process_lrm_event:
>> LRM operation prmVM2_monitor_10000 (call=31, status=1, cib-update=0,
>> confirmed=true) Cancelled
>>    Dec 18 14:11:36 bl460g1n7 VirtualDomain(prmVM2)[7387]: INFO: vm2:
>> Starting live migration to bl460g1n6 (using remote hypervisor URI
>> qemu+ssh://bl460g1n6/system ).
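>>
>> (The remote hypervisor URI in that log line is built by the
>> VirtualDomain agent from its migration_transport setting; the
>> connection to the destination node can also be checked by hand with
>> something like:)
>>
>>    # virsh -c qemu+ssh://bl460g1n6/system list --all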
>>
>> 3) Then, after "virsh migrate" in VirtualDomain had completed but
>>    before the migrate_to action itself completed, I made bl460g1n7 crash.
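>>
>>    (Such a crash can be forced for testing with, for example, the
>>    kernel's sysrq trigger:)
>>
>>    # echo c > /proc/sysrq-trigger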
>>
>>    As a result, vm2 was already running on bl460g1n6, but Pacemaker
>>    started it on bl460g1n8 as well. [1]
>
> Oh, wow. I see what is going on.  If the migrate_to action fails, we actually have to call stop on the target node. I believe we attempt to handle these "dangling migrations" already, but something about your situation must be different.  Can you please create a crm_report so we can have your pengine files to test with?
>
> Creating a bug on bugs.clusterlabs.org to track this issue would also be a good idea.  The holidays are coming up and I could see this getting lost otherwise.
>
> Thanks,
> -- Vossel
>

I opened a Bugzilla entry about this:
* http://bugs.clusterlabs.org/show_bug.cgi?id=5186

I attached a crm_report to the Bugzilla entry; is that enough information?
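
If anything is missing from it, I can regenerate a report limited to the
test window with something like this (the times are just an example
around the test):

   # crm_report -f "2013-12-18 14:00:00" -t "2013-12-18 14:30:00" /tmp/prmVM2-migrate-failure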

>
>
>
>>    Dec 18 14:11:49 bl460g1n8 crmd[25981]:   notice: process_lrm_event:
>> LRM operation prmVM2_start_0 (call=31, rc=0, cib-update=28,
>> confirmed=true) ok
>>
>>
>> Best Regards,
>> Kazunori INOUE
>>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org



