[Pacemaker] Trouble with Xen high availability. Can't get it.
Богомолов Дмитрий Викторович
beatseed at mail.ru
Mon Dec 5 11:57:25 UTC 2011
Hello.
I have built a two-node cluster (Ubuntu 11.10 + corosync + DRBD + cman + Pacemaker) and configured a Xen resource to start a virtual machine (VM1 for short, running Ubuntu 10.10). The virtual machine's disks are on the DRBD resource.
Now I am testing availability.
I execute this command on node1:
$sudo crm node standby
And I receive this message:
block drbd1: Sending state for detaching disk failed
I notice that the DRBD service stops on node1:
$cat /proc/drbd
1: cs:Unconfigured
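For comparison, on a healthy node I would expect the state line to look something like this (from memory; the exact flags may differ, and ro: would be Primary/Primary in a dual-primary setup):
1: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----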
Is this Unconfigured state normal? In any case, the following happens:
The virtual machine does not stop; this is confirmed by ICMP echo responses from VM1. I open an interactive console for VM1 on node2 with:
$sudo xm console VM1
I can see that it continues to run, and the remote SSH session to VM1 also continues to work.
Then I bring node1 back online with:
$sudo crm node online
I receive these messages:
dlm: Using TCP for communications
dlm: connecting to 1
dlm: got connection from 1
At this point ICMP echo responses from VM1 stopped for about 15 seconds, and both the interactive VM1 console on node2 and the remote SSH session to VM1 showed a shutdown sequence.
That is, VM1 was restarted on node2, which I believe should not happen.
Next I put node2 into standby:
$sudo crm node standby
Again, I receive this message:
block drbd1: Sending state for detaching disk failed
I notice that the DRBD service stops on node2. The interactive VM1 console on node2 and the remote SSH session displayed a shutdown sequence, but the interactive VM1 console on node1 works normally.
This time, ICMP echo responses from VM1 stopped for 275 seconds. During this time I could not get a remote SSH connection to VM1.
Only after this long interval did the Xen services start working again.
Then I bring node2 back online:
$sudo crm node online
The situation is similar to the one described earlier: ICMP echo responses from VM1 stopped for about 15 seconds, and both the interactive VM1 console on node1 and the remote SSH session to VM1 showed a shutdown sequence.
That is, VM1 was restarted on node1.
I have repeated this procedure several times (4-5) with the same result. I also tried adding this to the Xen resource's meta attributes:
meta allow-migrate="true"
It did not change the behavior.
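For reference, the resource definition with this attribute looked roughly like the following (a sketch only; the xmfile path and the timeouts here are placeholders, not my exact values):
$sudo crm configure primitive VM1 ocf:heartbeat:Xen \
    params xmfile="/etc/xen/VM1.cfg" \
    op monitor interval="10s" timeout="30s" \
    meta allow-migrate="true"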
I wonder whether this allow-migrate parameter is necessary in an Active/Active configuration. It is not included in the Clusters from Scratch manual, but I saw it in other (active/passive) configurations, so I assume it is not necessary, because the Xen services are started equally on both servers. And I expect that the failure of one node must not stop services on the other node. Am I thinking correctly?
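For context, the DRBD resource is meant to run dual primary, as in Clusters from Scratch; a minimal sketch of such a master/slave definition (the resource and DRBD names here are placeholders) would be:
$sudo crm configure primitive drbd1 ocf:linbit:drbd \
    params drbd_resource="r0" \
    op monitor interval="30s"
# master-max=2 allows both nodes to be Primary at the same time
$sudo crm configure ms ms-drbd1 drbd1 \
    meta master-max="2" master-node-max="1" \
    clone-max="2" clone-node-max="1" notify="true"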
So: how can I avoid these restarts of VM1, and what do I need to do to keep VM1 running continuously?
What is the reason for the very different recovery delays (15 seconds for node1 versus 275 seconds for node2)? How can I reduce them, or better yet avoid them?
Do I need live migration? If yes, how do I set it up? I used meta allow-migrate="true", but it had no effect.
Could it be because I have not configured STONITH yet? At least that is my assumption.
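If so, I suppose something along these lines would be needed (just a sketch; the plugin choice and all parameters depend on the hardware, and the addresses and credentials below are made up):
$sudo crm configure primitive st-node1 stonith:external/ipmi \
    params hostname="node1" ipaddr="192.168.1.101" \
    userid="admin" passwd="secret" interface="lan"
# keep the fencing resource off the node it is supposed to fence
$sudo crm configure location st-node1-loc st-node1 -inf: node1
$sudo crm configure property stonith-enabled="true"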
I will be grateful for any help.