[Pacemaker] [Problem] A node that was rebooted in the cluster does not recognize the state of a node it has been cut off from.

Tue Jun 4 01:17:51 EDT 2013

Hi All,

I have filed this problem in Bugzilla.

 * http://bugs.clusterlabs.org/show_bug.cgi?id=5160

Best Regards,
Hideo Yamauchi.

--- On Tue, 2013/6/4, renayama19661014 at ybb.ne.jp <renayama19661014 at ybb.ne.jp> wrote:

> Hi All,
> 
> We checked how each node recognizes the cluster state using the following procedure.
> We tested with the following combination (RHEL 6.4 guests):
>  * corosync-2.3.0
>  * pacemaker-Pacemaker-1.1.10-rc3
> 
> -------------------------
> 
> Step 1) Start all nodes and form a cluster.
> 
> [root@rh64-coro1 ~]# crm_mon -1 -Af
> Last updated: Tue Jun  4 22:30:25 2013
> Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
> Stack: corosync
> Current DC: rh64-coro3 (4231178432) - partition with quorum
> Version: 1.1.9-db294e1
> 3 Nodes configured, unknown expected votes
> 0 Resources configured.
> 
> 
> Online: [ rh64-coro1 rh64-coro2 rh64-coro3 ]
> 
> 
> Node Attributes:
> * Node rh64-coro1:
> * Node rh64-coro2:
> * Node rh64-coro3:
> 
> Migration summary:
> * Node rh64-coro1: 
> * Node rh64-coro3: 
> * Node rh64-coro2: 
> 
> 
> Step 2) Stop the first node (rh64-coro1).
> 
> [root@rh64-coro2 ~]# crm_mon -1 -Af
> Last updated: Tue Jun  4 22:30:55 2013
> Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
> Stack: corosync
> Current DC: rh64-coro3 (4231178432) - partition with quorum
> Version: 1.1.9-db294e1
> 3 Nodes configured, unknown expected votes
> 0 Resources configured.
> 
> 
> Online: [ rh64-coro2 rh64-coro3 ]
> OFFLINE: [ rh64-coro1 ]
> 
> 
> Node Attributes:
> * Node rh64-coro2:
> * Node rh64-coro3:
> 
> Migration summary:
> * Node rh64-coro3: 
> * Node rh64-coro2: 
> 
> 
> Step 3) Restart the first node (rh64-coro1).
> 
> [root@rh64-coro1 ~]# crm_mon -1 -Af
> Last updated: Tue Jun  4 22:31:29 2013
> Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
> Stack: corosync
> Current DC: rh64-coro3 (4231178432) - partition with quorum
> Version: 1.1.9-db294e1
> 3 Nodes configured, unknown expected votes
> 0 Resources configured.
> 
> 
> Online: [ rh64-coro1 rh64-coro2 rh64-coro3 ]
> 
> 
> Node Attributes:
> * Node rh64-coro1:
> * Node rh64-coro2:
> * Node rh64-coro3:
> 
> Migration summary:
> * Node rh64-coro1: 
> * Node rh64-coro3: 
> * Node rh64-coro2: 
> 
> 
> Step 4) Cut the interconnect between all nodes.
> 
> [root@kvm-host ~]# brctl delif virbr2 vnet1;brctl delif virbr2 vnet4;brctl delif virbr2 vnet7;brctl delif virbr3 vnet2;brctl delif virbr3 vnet5;brctl delif virbr3 vnet8
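
For anyone reproducing Step 4, the following is a minimal sketch of the same cut as a reusable helper. It assumes the bridge and interface names are exactly the six pairs in the command above. The function name interconnect_cmds and the choice to print the commands instead of running them are illustrative and not part of the original report.

```shell
#!/bin/sh
# Bridge:interface pairs copied from the Step 4 command line.
PAIRS="virbr2:vnet1 virbr2:vnet4 virbr2:vnet7 virbr3:vnet2 virbr3:vnet5 virbr3:vnet8"

# Hypothetical helper: print one brctl command per pair.
# $1 is "delif" (cut the interconnect) or "addif" (restore it).
# Nothing is executed here; the output is meant to be reviewed first.
interconnect_cmds() {
    action=$1
    for pair in $PAIRS; do
        bridge=${pair%%:*}   # text before the colon
        iface=${pair#*:}     # text after the colon
        echo "brctl $action $bridge $iface"
    done
}

# Show the commands that would cut the interconnect.
interconnect_cmds delif
```

On the KVM host, the printed lines could be piped to sh to apply them, and the same helper called with addif would print the commands that reattach the interfaces afterwards.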
> 
> -------------------------
> 
> 
> The two nodes that were not rebooted then recognize the state of the other nodes correctly.
> 
> [root@rh64-coro2 ~]# crm_mon -1 -Af
> Last updated: Tue Jun  4 22:32:06 2013
> Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
> Stack: corosync
> Current DC: rh64-coro2 (4214401216) - partition WITHOUT quorum
> Version: 1.1.9-db294e1
> 3 Nodes configured, unknown expected votes
> 0 Resources configured.
> 
> 
> Node rh64-coro1 (4197624000): UNCLEAN (offline)
> Node rh64-coro3 (4231178432): UNCLEAN (offline)
> Online: [ rh64-coro2 ]
> 
> 
> Node Attributes:
> * Node rh64-coro2:
> 
> Migration summary:
> * Node rh64-coro2: 
> 
> [root@rh64-coro3 ~]# crm_mon -1 -Af
> Last updated: Tue Jun  4 22:33:17 2013
> Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
> Stack: corosync
> Current DC: rh64-coro3 (4231178432) - partition WITHOUT quorum
> Version: 1.1.9-db294e1
> 3 Nodes configured, unknown expected votes
> 0 Resources configured.
> 
> 
> Node rh64-coro1 (4197624000): UNCLEAN (offline)
> Node rh64-coro2 (4214401216): UNCLEAN (offline)
> Online: [ rh64-coro3 ]
> 
> 
> Node Attributes:
> * Node rh64-coro3:
> 
> Migration summary:
> * Node rh64-coro3: 
> 
> 
> However, the node that was rebooted does not recognize the state of one of the other nodes correctly.
> 
> [root@rh64-coro1 ~]# crm_mon -1 -Af
> Last updated: Tue Jun  4 22:33:31 2013
> Last change: Tue Jun  4 22:22:54 2013 via crmd on rh64-coro1
> Stack: corosync
> Current DC: rh64-coro1 (4197624000) - partition WITHOUT quorum
> Version: 1.1.9-db294e1
> 3 Nodes configured, unknown expected votes
> 0 Resources configured.
> 
> 
> Node rh64-coro3 (4231178432): UNCLEAN (offline)----------------> OK.
> Online: [ rh64-coro1 rh64-coro2 ] ------------------------------> Wrong: rh64-coro2 should be UNCLEAN (offline).
> 
> 
> Node Attributes:
> * Node rh64-coro1:
> * Node rh64-coro2:
> 
> Migration summary:
> * Node rh64-coro1: 
> * Node rh64-coro2: 
> 
> 
> The correct behavior is for the rebooted node to recognize the other nodes as UNCLEAN, but it appears to misjudge the state of rh64-coro2.
> 
> This appears to be a problem in Pacemaker.
>  * There seems to be a problem with the crm_peer_cache hash table.
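
The expected view described above, where a fully isolated node lists only itself as Online, can be checked mechanically against crm_mon -1 output. The following is a rough sketch that assumes the output layout shown in this report. The function names online_nodes, unclean_nodes and isolated_view_ok are hypothetical helpers and not part of Pacemaker.

```shell
#!/bin/sh
# Hypothetical helpers; each reads "crm_mon -1" text on stdin.

# Print the node names from the "Online: [ ... ]" line, one per line.
online_nodes() {
    sed -n 's/^Online: \[ \(.*\) \].*$/\1/p' | tr ' ' '\n' | sed '/^$/d'
}

# Print the names of nodes reported as "UNCLEAN (offline)".
unclean_nodes() {
    sed -n 's/^Node \([^ ]*\) .*UNCLEAN (offline).*$/\1/p'
}

# Succeed only if the local node ($1) is the sole entry in the Online list,
# which is what every node should report after Step 4.
isolated_view_ok() {
    [ "$(online_nodes)" = "$1" ]
}
```

On a cluster node this could be used as: crm_mon -1 -Af | isolated_view_ok "$(uname -n)", assuming the node name matches the output of uname -n. With the output shown above, the check would pass on rh64-coro2 and rh64-coro3 and fail on rh64-coro1.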
> 
> Best Regards,
> Hideo Yamauchi.
> 
> 