[Pacemaker] Question about behavior of the post-failure during the migrate_to

David Vossel dvossel at redhat.com
Wed Dec 18 12:28:21 EST 2013





----- Original Message -----
> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
> To: "pm" <pacemaker at oss.clusterlabs.org>
> Sent: Wednesday, December 18, 2013 4:56:20 AM
> Subject: [Pacemaker] Question about behavior of the post-failure during the migrate_to
> 
> Hi,
> 
> While a VM resource was live-migrating, the source node crashed, and
> the VM ended up running on two nodes at once. [1]
> Is this the designed behavior?
> 
> [1]
>    Stack: corosync
>    Current DC: bl460g1n6 (3232261592) - partition with quorum
>    Version: 1.1.11-0.4.ce5d77c.git.el6-ce5d77c
>    3 Nodes configured
>    8 Resources configured
> 
> 
>    Online: [ bl460g1n6 bl460g1n8 ]
>    OFFLINE: [ bl460g1n7 ]
> 
>    Full list of resources:
> 
>    prmDummy        (ocf::pacemaker:Dummy): Started bl460g1n6
>    prmVM2  (ocf::heartbeat:VirtualDomain): Started bl460g1n8
> 
> 
>    # ssh bl460g1n6 virsh list --all
>     Id    Name                           State
>    ----------------------------------------------------
>     113   vm2                            running
> 
>    # ssh bl460g1n8 virsh list --all
>     Id    Name                           State
>    ----------------------------------------------------
>     34    vm2                            running
> 
> 
> [Steps to reproduce]
> 1) Before migration: vm2 is running on bl460g1n7 (the DC)
> 
>    Stack: corosync
>    Current DC: bl460g1n7 (3232261593) - partition with quorum
>    Version: 1.1.11-0.4.ce5d77c.git.el6-ce5d77c
>    3 Nodes configured
>    8 Resources configured
> 
> 
>    Online: [ bl460g1n6 bl460g1n7 bl460g1n8 ]
> 
>    Full list of resources:
> 
>    prmDummy        (ocf::pacemaker:Dummy): Started bl460g1n7
>    prmVM2  (ocf::heartbeat:VirtualDomain): Started bl460g1n7
> 
>    ...snip...
> 
> 2) Migrate the VM resource:
> 
>    # crm resource move prmVM2
> 
>    bl460g1n6 was selected as the migration destination.
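> 
>    ("crm resource move" works by inserting a location constraint that
>    forces the resource away from its current node; after a test like
>    this the constraint is normally cleared again, e.g. with:)
> 
>    # crm resource unmove prmVM2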
> 
>    Dec 18 14:11:36 bl460g1n7 crmd[6928]:   notice: te_rsc_command:
> Initiating action 47: migrate_to prmVM2_migrate_to_0 on bl460g1n7
> (local)
>    Dec 18 14:11:36 bl460g1n7 lrmd[6925]:     info:
> cancel_recurring_action: Cancelling operation prmVM2_monitor_10000
>    Dec 18 14:11:36 bl460g1n7 crmd[6928]:     info: do_lrm_rsc_op:
> Performing key=47:5:0:ddf348fe-fbad-4abb-9a12-8250f71b075a
> op=prmVM2_migrate_to_0
>    Dec 18 14:11:36 bl460g1n7 lrmd[6925]:     info: log_execute:
> executing - rsc:prmVM2 action:migrate_to call_id:33
>    Dec 18 14:11:36 bl460g1n7 crmd[6928]:     info: process_lrm_event:
> LRM operation prmVM2_monitor_10000 (call=31, status=1, cib-update=0,
> confirmed=true) Cancelled
>    Dec 18 14:11:36 bl460g1n7 VirtualDomain(prmVM2)[7387]: INFO: vm2:
> Starting live migration to bl460g1n6 (using remote hypervisor URI
> qemu+ssh://bl460g1n6/system ).
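> 
>    (Under the hood, the VirtualDomain agent performs a live migration
>    roughly equivalent to the following; the exact options depend on
>    the agent version and the configured resource parameters:)
> 
>    # virsh migrate --live vm2 qemu+ssh://bl460g1n6/system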
> 
> 3) Then, after "virsh migrate" in VirtualDomain had completed, but
>    before the migrate_to action itself finished, I made bl460g1n7
>    crash (see the example below).
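> 
>    (The crash can be forced, for example, by triggering a kernel
>    panic via sysrq, assuming the sysrq facility is enabled:)
> 
>    # echo c > /proc/sysrq-trigger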
> 
>    As a result, vm2 was already running on bl460g1n6, but Pacemaker
>    also started it on bl460g1n8. [1]

Oh, wow. I see what is going on. If the migrate_to action fails, we actually have to call stop on the target node. I believe we already attempt to handle these "dangling migrations", but something about your situation must be different. Can you please create a crm_report so we have your pengine files to test with?
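
For example, something along these lines (the time window below is just a placeholder; use one that covers your test):

   # crm_report -f "2013-12-18 14:00" -t "2013-12-18 14:30" migration-double-start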

Creating a bug on bugs.clusterlabs.org to track this issue would also be a good idea; the holidays are coming up, and I could see it getting lost otherwise.

Thanks,
-- Vossel




>    Dec 18 14:11:49 bl460g1n8 crmd[25981]:   notice: process_lrm_event:
> LRM operation prmVM2_start_0 (call=31, rc=0, cib-update=28,
> confirmed=true) ok
> 
> 
> Best Regards,
> Kazunori INOUE
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
> 



