[Pacemaker] KVM live migration and multipath

Vladislav Bogdanov bubble at hoster-ok.com
Mon Jun 24 03:33:23 EDT 2013


22.06.2013 16:37, Sven Arnold wrote:
> Hi,
> 
> I am getting closer... Some updates for those who are interested.
> 
>>>> Did you turn caching off for your VMs disks?
>>>
>>> That's a point. Indeed caching was not explicitly turned off, and I just
>>> noticed that the default setting of the cache attribute of the device
>>> tag in libvirt has changed. [1]
>>> I would expect that libvirt flushes all caches before finalizing the
>>> migration process. But it is probably best to turn off caches anyway.
>>>
>>> I have now configured:
>>>
>>> <disk type='block' device='disk'>
>>>        <driver name='qemu' type='raw' cache='none'/>
>>
>> I would also switch to native IO (aio) if your kernel/qemu support
>> that. Otherwise qemu allocates several dedicated IO threads, which is
>> much slower than aio. There were some problems with aio in the past, but
>> it should work fine on recent enough distros.
>>
> 
> This is interesting. After switching to native io out of curiosity:
> 
> <driver name='qemu' type='raw' cache='none' io='native'/>
> 
> the situation looked much better - to my surprise I did not experience
> any further corruption with this virtual machine.

ok

> 
> Then I added a second and a third VM to the setup, only to get errors again
> on those machines. I noticed that those additional VMs had older qemu
> machine types (pc-0.11 and pc-0.12) set. After upgrading the domains to
> machine type pc-1.0:
> 
> <os>
>     <type arch='x86_64' machine='pc-1.0'>hvm</type>
>     <boot dev='hd'/>
> </os>
> 
> I did not trigger file system corruption again. So, at this moment it
> looks like it is important to:
> - turn caching off
> - use native aio
> - *and* use an up-to-date machine type
> 
> Failure to meet any of these criteria would result in fs corruption.
> Does this make sense at all?

The latter part is a little bit surprising. You may want to look at the
qemu sources to see what the differences between those machine versions
are; they are all in one file and well organized. Usually the differences
are not that big, but I cannot rule out that they affect what you observe.
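For a quick look, something like this should do (a rough sketch; the exact
option and file location depend on your qemu version):

  # list the machine types known to this qemu binary
  # (newer builds use "-machine help" instead)
  qemu-system-x86_64 -M ?

  # the pc-* machine definitions and their compat properties sit together;
  # in a qemu 1.0 tree that should be hw/pc_piix.c
  grep -n '"pc-' hw/pc_piix.c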


> 
>>
>> Maybe that depends on the combination of libvirt/qemu versions and the
>> migration mode used?
> 
> qemu is at 1.0 (1.0+noroms-0ubuntu14.8)
> libvirt is at 0.9.8 (0.9.8-2ubuntu17.10)
> 
>> And, do you always have fs corruption, independently of IO load?
>>
> 
> It seems that I have to create some IO load to trigger the corruption.
> 
>>
>> Did you try to stop all but one iSCSI connection to eliminate
>> multipathing?
>>
> 
> Not exactly. That would be what I do next if I still have problems.
> What I did was to use one iSCSI path directly (by using
> /dev/disk/by-path/... as the source of the block device). This seemed to
> work - but it is hard to tell whether I simply did not trigger a bug in my setup.
> 
> That everything worked with a single path (or at least seemed so) is not
> consistent with the observations above. Therefore I still do not trust
> the setup and will do some more long-term tests.
> 
> May I ask a few more questions?
> 
> Do you manage the multipath daemon with pacemaker? In my setup multipath
> is started at boot time and not managed by pacemaker.

Yes. And all multipath records as well. That is part of my iscsi RA.
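Such an agent basically just drives the usual multipath tooling; roughly
this kind of calls (a sketch only, the map alias is made up):

  # (re)build maps after the iSCSI sessions are up
  multipath -r
  # check that the map for this LUN exists and has all of its paths
  multipath -ll vm1-disk
  # on stop, flush the map again before the sessions are closed
  multipath -f vm1-disk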

> 
> Where do you loosen the dependencies between targets and initiator?
> I use two advisory orders:
> 
> order o-iscsitarget_before_iscsiinitiator 0: rg-iscsitarget
> clone-iscsiinitiator

Wait. You do have targets and initiators on the same boxes, correct?

But what is that for? I recall you have active/passive DRBD there, which
you then export with iSCSI and then connect to via iSCSI on the same host?
Why not just back the VMs with DRBD directly?

I have dedicated pairs of storage hosts, which run a cluster as well and
manage the targets there; that's why I use iSCSI.

> 
> order o-iscsiinitiator_before_libvirt 0: clone-iscsiinitiator
> clone-libvirtd
> 
> to be able to restart targets (needed for failover) and to
> restart iSCSI initiators (to scan for new targets easily). Is this good
> practice?

I wouldn't say so.
I manage targets/LUNs separately (but I have IET), so they are not part
of any config except pacemaker's CIB.
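Just to illustrate the idea (names, IQN and backing device are made up,
this is not my actual config), a target plus one LUN living only in the
CIB could look roughly like this:

  crm configure primitive p-tgt-vm1 ocf:heartbeat:iSCSITarget \
      params implementation=iet iqn=iqn.2013-06.com.example:vm1
  crm configure primitive p-lun-vm1 ocf:heartbeat:iSCSILogicalUnit \
      params implementation=iet target_iqn=iqn.2013-06.com.example:vm1 \
             lun=1 path=/dev/vg_storage/lv_vm1
  crm configure colocation c-lun-with-tgt inf: p-lun-vm1 p-tgt-vm1
  crm configure order o-tgt-before-lun inf: p-tgt-vm1 p-lun-vm1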

I also manage libvirt disk pools with pacemaker, so their definitions
are not part of any config either. I start with an absolutely "clean"
system (just the cluster and the common shared storage set up) and add
everything from pacemaker.
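There is no stock resource agent for libvirt storage pools as far as I
know, so think of that part as a thin wrapper around virsh; a sketch
(pool name, portal and IQN are placeholders), assuming an iSCSI-backed pool:

  # start: define and start the pool on this host
  virsh pool-define-as vm-pool iscsi \
      --source-host 192.168.100.10 \
      --source-dev iqn.2013-06.com.example:vm-disks \
      --target /dev/disk/by-path
  virsh pool-start vm-pool

  # stop: tear it down again
  virsh pool-destroy vm-pool
  virsh pool-undefine vm-pool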

And you have iscsiadm for the initiator part.
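I.e. the initiator side can be driven directly from the agent, along the
lines of (portal and IQN are placeholders):

  # discover targets on a portal
  iscsiadm -m discovery -t sendtargets -p 192.168.100.10
  # log in to (and later out of) a target
  iscsiadm -m node -T iqn.2013-06.com.example:vm-disks -p 192.168.100.10 --login
  iscsiadm -m node -T iqn.2013-06.com.example:vm-disks -p 192.168.100.10 --logout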

Vladislav
