[Pacemaker] Question about post-failure behavior during migrate_to
Kazunori INOUE
kazunori.inoue3 at gmail.com
Thu Dec 19 07:36:27 UTC 2013
Hi David,
2013/12/19 David Vossel <dvossel at redhat.com>:
>
> ----- Original Message -----
>> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
>> To: "pm" <pacemaker at oss.clusterlabs.org>
>> Sent: Wednesday, December 18, 2013 4:56:20 AM
>> Subject: [Pacemaker] Question about post-failure behavior during migrate_to
>>
>> Hi,
>>
>> When a node crashed while a VM resource was migrating, the VM ended
>> up running on two nodes. [1]
>> Is this the designed behavior?
>>
>> [1]
>> Stack: corosync
>> Current DC: bl460g1n6 (3232261592) - partition with quorum
>> Version: 1.1.11-0.4.ce5d77c.git.el6-ce5d77c
>> 3 Nodes configured
>> 8 Resources configured
>>
>>
>> Online: [ bl460g1n6 bl460g1n8 ]
>> OFFLINE: [ bl460g1n7 ]
>>
>> Full list of resources:
>>
>> prmDummy (ocf::pacemaker:Dummy): Started bl460g1n6
>> prmVM2 (ocf::heartbeat:VirtualDomain): Started bl460g1n8
>>
>>
>> # ssh bl460g1n6 virsh list --all
>> Id Name State
>> ----------------------------------------------------
>> 113 vm2 running
>>
>> # ssh bl460g1n8 virsh list --all
>> Id Name State
>> ----------------------------------------------------
>> 34 vm2 running
>>
>>
>> [Steps to reproduce]
>> 1) Before migration: vm2 is running on bl460g1n7 (the DC)
>>
>> Stack: corosync
>> Current DC: bl460g1n7 (3232261593) - partition with quorum
>> Version: 1.1.11-0.4.ce5d77c.git.el6-ce5d77c
>> 3 Nodes configured
>> 8 Resources configured
>>
>>
>> Online: [ bl460g1n6 bl460g1n7 bl460g1n8 ]
>>
>> Full list of resources:
>>
>> prmDummy (ocf::pacemaker:Dummy): Started bl460g1n7
>> prmVM2 (ocf::heartbeat:VirtualDomain): Started bl460g1n7
>>
>> ...snip...
>>
>> 2) Migrate the VM resource:
>>
>> # crm resource move prmVM2
>>
>> bl460g1n6 was selected as the migration destination.
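>>
>> (As an aside, "crm resource move" without a destination works by
>> injecting a temporary location constraint that bans the current node.
>> Roughly, and assuming the crmsh/crm_resource behavior of this era -
>> the exact constraint id may differ by version:
>>
>>   location cli-standby-prmVM2 prmVM2 -inf: bl460g1n7
>>
>> It can be removed afterwards with "crm resource unmove prmVM2".)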
>>
>> Dec 18 14:11:36 bl460g1n7 crmd[6928]: notice: te_rsc_command:
>> Initiating action 47: migrate_to prmVM2_migrate_to_0 on bl460g1n7
>> (local)
>> Dec 18 14:11:36 bl460g1n7 lrmd[6925]: info:
>> cancel_recurring_action: Cancelling operation prmVM2_monitor_10000
>> Dec 18 14:11:36 bl460g1n7 crmd[6928]: info: do_lrm_rsc_op:
>> Performing key=47:5:0:ddf348fe-fbad-4abb-9a12-8250f71b075a
>> op=prmVM2_migrate_to_0
>> Dec 18 14:11:36 bl460g1n7 lrmd[6925]: info: log_execute:
>> executing - rsc:prmVM2 action:migrate_to call_id:33
>> Dec 18 14:11:36 bl460g1n7 crmd[6928]: info: process_lrm_event:
>> LRM operation prmVM2_monitor_10000 (call=31, status=1, cib-update=0,
>> confirmed=true) Cancelled
>> Dec 18 14:11:36 bl460g1n7 VirtualDomain(prmVM2)[7387]: INFO: vm2:
>> Starting live migration to bl460g1n6 (using remote hypervisor URI
>> qemu+ssh://bl460g1n6/system ).
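>>
>> (For context, Pacemaker normally decomposes a full live migration into
>> three resource-agent actions - sketched here from the standard
>> migration model:
>>
>>   1. migrate_to   on the source node (bl460g1n7)
>>   2. migrate_from on the target node (bl460g1n6)
>>   3. stop         on the source node (bl460g1n7)
>>
>> A crash between steps 1 and 2 means the cluster never records that the
>> VM has already moved, which sets up the failure below.)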
>>
>> 3) Then, after "virsh migrate" inside VirtualDomain had completed but
>> before the migrate_to operation itself completed, I made bl460g1n7 crash.
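>>
>> (One way to trigger such a crash at exactly this point - shown as an
>> example only, not necessarily the method used here, and assuming SysRq
>> is enabled on the node - is a forced kernel panic:
>>
>>   # echo c > /proc/sysrq-trigger
>>
>> which halts bl460g1n7 before the migrate_to result can be reported.)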
>>
>> As a result, vm2 was already running on bl460g1n6, yet Pacemaker
>> also started it on bl460g1n8. [1]
>
> Oh, wow. I see what is going on. If the migrate_to action fails, we actually have to call stop on the target node. I believe we attempt to handle these "dangling migrations" already, but something about your situation must be different. Can you please create a crm_report so we can have your pengine files to test with?
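>
> (For example - a sketch only, with the time window taken from the logs
> above; adjust the times and the report name as needed:
>
>   # crm_report -f "2013-12-18 14:00:00" -t "2013-12-18 14:30:00" /tmp/prmVM2-report
>
> The resulting archive should include the pengine input files from the DC.)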
>
> Creating a bug on bugs.clusterlabs.org to track this issue would also be a good idea. The holidays are coming up and I could see this getting lost otherwise.
>
> Thanks,
> -- Vossel
>
I opened a Bugzilla entry about this:
* http://bugs.clusterlabs.org/show_bug.cgi?id=5186
I attached a crm_report to the Bugzilla entry; is that enough information?
>
>
>
>> Dec 18 14:11:49 bl460g1n8 crmd[25981]: notice: process_lrm_event:
>> LRM operation prmVM2_start_0 (call=31, rc=0, cib-update=28,
>> confirmed=true) ok
>>
>>
>> Best Regards,
>> Kazunori INOUE
>>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org