[Pacemaker] [Problem] A rebooted node does not correctly recognize the state of a node it has been cut off from.

Tue Jun 4 05:00:35 UTC 2013

Hi All,

We checked how the cluster recognizes node states using the following procedure.
We tested with the following combination (RHEL 6.4 guests):
 * corosync-2.3.0
 * pacemaker-Pacemaker-1.1.10-rc3

-------------------------

Step 1) Start all nodes and form a cluster.

[root@rh64-coro1 ~]# crm_mon -1 -Af
Last updated: Tue Jun  4 22:30:25 2013
Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition with quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Online: [ rh64-coro1 rh64-coro2 rh64-coro3 ]

Node Attributes:
* Node rh64-coro1:
* Node rh64-coro2:
* Node rh64-coro3:

Migration summary:
* Node rh64-coro1: 
* Node rh64-coro3: 
* Node rh64-coro2: 

Step 2) Stop the first node (rh64-coro1).

[root@rh64-coro2 ~]# crm_mon -1 -Af
Last updated: Tue Jun  4 22:30:55 2013
Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition with quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Online: [ rh64-coro2 rh64-coro3 ]
OFFLINE: [ rh64-coro1 ]

Node Attributes:
* Node rh64-coro2:
* Node rh64-coro3:

Migration summary:
* Node rh64-coro3: 
* Node rh64-coro2: 

Step 3) Restart the first node (rh64-coro1).

[root@rh64-coro1 ~]# crm_mon -1 -Af
Last updated: Tue Jun  4 22:31:29 2013
Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition with quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Online: [ rh64-coro1 rh64-coro2 rh64-coro3 ]

Node Attributes:
* Node rh64-coro1:
* Node rh64-coro2:
* Node rh64-coro3:

Migration summary:
* Node rh64-coro1: 
* Node rh64-coro3: 
* Node rh64-coro2: 

Step 4) Cut the interconnect between all nodes.

[root@kvm-host ~]# brctl delif virbr2 vnet1;brctl delif virbr2 vnet4;brctl delif virbr2 vnet7;brctl delif virbr3 vnet2;brctl delif virbr3 vnet5;brctl delif virbr3 vnet8
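
For anyone repeating the test, the command line above can be generated from a bridge-to-interface table. This is only a minimal sketch: the bridge and vnet names are copied from the command above, the helper name `brctl_commands` is hypothetical, and restoring the links with `brctl addif` is an assumption that was not part of the original procedure.

```python
# Bridge-to-interface mapping copied from the brctl command line above.
BRIDGE_PORTS = {
    "virbr2": ["vnet1", "vnet4", "vnet7"],
    "virbr3": ["vnet2", "vnet5", "vnet8"],
}


def brctl_commands(action, bridge_ports=BRIDGE_PORTS):
    """Build `brctl <action> <bridge> <iface>` strings (hypothetical helper).

    action is "delif" to cut the interconnect, or "addif" to restore it
    (the restore step is an assumption, not part of the reported test).
    """
    if action not in ("delif", "addif"):
        raise ValueError("action must be 'delif' or 'addif'")
    return [
        f"brctl {action} {bridge} {iface}"
        for bridge, ifaces in bridge_ports.items()
        for iface in ifaces
    ]


if __name__ == "__main__":
    # Print one command line to paste into a root shell on the KVM host.
    print(";".join(brctl_commands("delif")))
```

The commands are only built as strings and never executed, so the output can be reviewed before it is run on the host.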

-------------------------

The two nodes that were not rebooted then recognize the state of the other nodes correctly.

[root@rh64-coro2 ~]# crm_mon -1 -Af
Last updated: Tue Jun  4 22:32:06 2013
Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro2 (4214401216) - partition WITHOUT quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Node rh64-coro1 (4197624000): UNCLEAN (offline)
Node rh64-coro3 (4231178432): UNCLEAN (offline)
Online: [ rh64-coro2 ]

Node Attributes:
* Node rh64-coro2:

Migration summary:
* Node rh64-coro2: 

[root@rh64-coro3 ~]# crm_mon -1 -Af
Last updated: Tue Jun  4 22:33:17 2013
Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition WITHOUT quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Node rh64-coro1 (4197624000): UNCLEAN (offline)
Node rh64-coro2 (4214401216): UNCLEAN (offline)
Online: [ rh64-coro3 ]

Node Attributes:
* Node rh64-coro3:

Migration summary:
* Node rh64-coro3: 

However, the node that was rebooted does not recognize the state of one of the other nodes correctly.

[root@rh64-coro1 ~]# crm_mon -1 -Af
Last updated: Tue Jun  4 22:33:31 2013
Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro1 (4197624000) - partition WITHOUT quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.

Node rh64-coro3 (4231178432): UNCLEAN (offline)----------------> OK.
Online: [ rh64-coro1 rh64-coro2 ] ------------------------------> NG: rh64-coro2 should be UNCLEAN (offline).

Node Attributes:
* Node rh64-coro1:
* Node rh64-coro2:

Migration summary:
* Node rh64-coro1: 
* Node rh64-coro2: 

The correct behavior is for the rebooted node to recognize both of the other nodes as UNCLEAN, but it appears to wrongly recognize rh64-coro2 as online.
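
The expectation can be written as a small check on the `crm_mon -1` text: once every interconnect is cut, a node should list only itself as Online. This is a minimal sketch that assumes the output format shown in this mail; the function names `parse_node_states` and `wrongly_online` are hypothetical and are not part of Pacemaker.

```python
import re


def parse_node_states(crm_mon_text):
    """Return (online, unclean, offline) sets of node names from `crm_mon -1` text."""
    online, unclean, offline = set(), set(), set()
    for raw in crm_mon_text.splitlines():
        line = raw.strip()
        m = re.match(r"Online: \[(.*?)\]", line)
        if m:
            online.update(m.group(1).split())
            continue
        m = re.match(r"OFFLINE: \[(.*?)\]", line)
        if m:
            offline.update(m.group(1).split())
            continue
        # Matches e.g. "Node rh64-coro3 (4231178432): UNCLEAN (offline)"
        m = re.match(r"Node (\S+) \(\d+\): UNCLEAN", line)
        if m:
            unclean.add(m.group(1))
    return online, unclean, offline


def wrongly_online(local_node, crm_mon_text):
    """List peers still reported Online; expected to be empty on an isolated node."""
    online, _, _ = parse_node_states(crm_mon_text)
    return sorted(online - {local_node})


if __name__ == "__main__":
    # Node lines from the rh64-coro1 output above (annotations removed).
    coro1_view = (
        "Node rh64-coro3 (4231178432): UNCLEAN (offline)\n"
        "Online: [ rh64-coro1 rh64-coro2 ]\n"
    )
    print(wrongly_online("rh64-coro1", coro1_view))
```

With the rh64-coro1 output above, the check reports rh64-coro2; with the rh64-coro2 and rh64-coro3 outputs it reports nothing.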

This appears to be a problem in Pacemaker.
 * There seems to be a problem with the crm_peer_cache hash table.

Best Regards,
Hideo Yamauchi.