[Pacemaker] Question about behavior of the post-failure during the migrate_to
David Vossel
dvossel at redhat.com
Wed Dec 18 17:28:21 UTC 2013
----- Original Message -----
> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
> To: "pm" <pacemaker at oss.clusterlabs.org>
> Sent: Wednesday, December 18, 2013 4:56:20 AM
> Subject: [Pacemaker] Question about behavior of the post-failure during the migrate_to
>
> Hi,
>
> When a node crashed while the VM resource was migrating, the VM ended up
> running on two nodes. [1]
> Is this the designed behavior?
>
> [1]
> Stack: corosync
> Current DC: bl460g1n6 (3232261592) - partition with quorum
> Version: 1.1.11-0.4.ce5d77c.git.el6-ce5d77c
> 3 Nodes configured
> 8 Resources configured
>
>
> Online: [ bl460g1n6 bl460g1n8 ]
> OFFLINE: [ bl460g1n7 ]
>
> Full list of resources:
>
> prmDummy (ocf::pacemaker:Dummy): Started bl460g1n6
> prmVM2 (ocf::heartbeat:VirtualDomain): Started bl460g1n8
>
>
> # ssh bl460g1n6 virsh list --all
> Id Name State
> ----------------------------------------------------
> 113 vm2 running
>
> # ssh bl460g1n8 virsh list --all
> Id Name State
> ----------------------------------------------------
> 34 vm2 running
>
>
> [Steps to reproduce]
> 1) Before migration: vm2 running on bl460g1n7 (DC)
>
> Stack: corosync
> Current DC: bl460g1n7 (3232261593) - partition with quorum
> Version: 1.1.11-0.4.ce5d77c.git.el6-ce5d77c
> 3 Nodes configured
> 8 Resources configured
>
>
> Online: [ bl460g1n6 bl460g1n7 bl460g1n8 ]
>
> Full list of resources:
>
> prmDummy (ocf::pacemaker:Dummy): Started bl460g1n7
> prmVM2 (ocf::heartbeat:VirtualDomain): Started bl460g1n7
>
> ...snip...
>
> 2) Migrate the VM resource,
>
> # crm resource move prmVM2
>
> bl460g1n6 was selected as the migration destination.
>
> Dec 18 14:11:36 bl460g1n7 crmd[6928]: notice: te_rsc_command:
> Initiating action 47: migrate_to prmVM2_migrate_to_0 on bl460g1n7
> (local)
> Dec 18 14:11:36 bl460g1n7 lrmd[6925]: info:
> cancel_recurring_action: Cancelling operation prmVM2_monitor_10000
> Dec 18 14:11:36 bl460g1n7 crmd[6928]: info: do_lrm_rsc_op:
> Performing key=47:5:0:ddf348fe-fbad-4abb-9a12-8250f71b075a
> op=prmVM2_migrate_to_0
> Dec 18 14:11:36 bl460g1n7 lrmd[6925]: info: log_execute:
> executing - rsc:prmVM2 action:migrate_to call_id:33
> Dec 18 14:11:36 bl460g1n7 crmd[6928]: info: process_lrm_event:
> LRM operation prmVM2_monitor_10000 (call=31, status=1, cib-update=0,
> confirmed=true) Cancelled
> Dec 18 14:11:36 bl460g1n7 VirtualDomain(prmVM2)[7387]: INFO: vm2:
> Starting live migration to bl460g1n6 (using remote hypervisor URI
> qemu+ssh://bl460g1n6/system ).
>
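(A quick aside on step 2: "crm resource move" without a destination node works by adding a temporary location constraint that bans the resource from its current node, so after testing it normally needs to be cleared, e.g. with something like

  # crm resource unmove prmVM2

The exact subcommand name can vary between crmsh versions.)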
> 3) Then, after "virsh migrate" inside VirtualDomain had completed but
> before the migrate_to operation itself had completed, I made bl460g1n7 crash.
>
> As a result, vm2 was already running on bl460g1n6, but Pacemaker
> then also started it on bl460g1n8. [1]
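(For anyone else trying to reproduce this: the report does not say how the node was crashed; on a test node an immediate, unclean crash can usually be forced with the sysrq trigger, for example

  # echo c > /proc/sysrq-trigger

which panics the kernel on bl460g1n7 without any orderly shutdown.)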
Oh, wow. I see what is going on. If the migrate_to action fails, we actually have to call stop on the target node. I believe we attempt to handle these "dangling migrations" already, but something about your situation must be different.

Can you please create a crm_report so we can have your pengine files to test with?
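For reference, something along these lines run on one of the surviving nodes should collect the pengine inputs and logs from the window around the failure (the times and destination below are just placeholders, adjust as needed):

  # crm_report -f "2013-12-18 14:00:00" -t "2013-12-18 14:30:00" /tmp/prmVM2-dangling-migration

Attach the resulting tarball and we can look at the transitions from it.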
Creating a bug on bugs.clusterlabs.org to track this issue would also be a good idea. The holidays are coming up and I could see this getting lost otherwise.
Thanks,
-- Vossel
> Dec 18 14:11:49 bl460g1n8 crmd[25981]: notice: process_lrm_event:
> LRM operation prmVM2_start_0 (call=31, rc=0, cib-update=28,
> confirmed=true) ok
>
>
> Best Regards,
> Kazunori INOUE
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>