[Pacemaker] [Problem] A rebooted node does not correctly recognize the state of a node it was cut off from.
renayama19661014 at ybb.ne.jp
Tue Jun 4 05:00:35 UTC 2013
Hi All,
We confirmed the cluster's recognition of node states using the following procedure.
We tested with the following combination (RHEL6.4 guests):
* corosync-2.3.0
* pacemaker-Pacemaker-1.1.10-rc3
-------------------------
Step 1) Start all nodes and form a cluster.
[root@rh64-coro1 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:30:25 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition with quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.
Online: [ rh64-coro1 rh64-coro2 rh64-coro3 ]
Node Attributes:
* Node rh64-coro1:
* Node rh64-coro2:
* Node rh64-coro3:
Migration summary:
* Node rh64-coro1:
* Node rh64-coro3:
* Node rh64-coro2:
Step 2) Stop the first node (rh64-coro1).
[root@rh64-coro2 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:30:55 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition with quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.
Online: [ rh64-coro2 rh64-coro3 ]
OFFLINE: [ rh64-coro1 ]
Node Attributes:
* Node rh64-coro2:
* Node rh64-coro3:
Migration summary:
* Node rh64-coro3:
* Node rh64-coro2:
Step 3) Restart the first node (rh64-coro1).
[root@rh64-coro1 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:31:29 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition with quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.
Online: [ rh64-coro1 rh64-coro2 rh64-coro3 ]
Node Attributes:
* Node rh64-coro1:
* Node rh64-coro2:
* Node rh64-coro3:
Migration summary:
* Node rh64-coro1:
* Node rh64-coro3:
* Node rh64-coro2:
Step 4) Interrupt the interconnect between all nodes (by detaching the guests' interfaces from the bridges on the KVM host).
[root@kvm-host ~]# brctl delif virbr2 vnet1;brctl delif virbr2 vnet4;brctl delif virbr2 vnet7;brctl delif virbr3 vnet2;brctl delif virbr3 vnet5;brctl delif virbr3 vnet8
-------------------------
The two nodes that were not rebooted correctly recognize the state of the other nodes.
[root@rh64-coro2 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:32:06 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro2 (4214401216) - partition WITHOUT quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.
Node rh64-coro1 (4197624000): UNCLEAN (offline)
Node rh64-coro3 (4231178432): UNCLEAN (offline)
Online: [ rh64-coro2 ]
Node Attributes:
* Node rh64-coro2:
Migration summary:
* Node rh64-coro2:
[root@rh64-coro3 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:33:17 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro3 (4231178432) - partition WITHOUT quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.
Node rh64-coro1 (4197624000): UNCLEAN (offline)
Node rh64-coro2 (4214401216): UNCLEAN (offline)
Online: [ rh64-coro3 ]
Node Attributes:
* Node rh64-coro3:
Migration summary:
* Node rh64-coro3:
However, the rebooted node does not correctly recognize the state of one of the nodes.
[root@rh64-coro1 ~]# crm_mon -1 -Af
Last updated: Tue Jun 4 22:33:31 2013
Last change: Tue Jun 4 22:22:54 2013 via crmd on rh64-coro1
Stack: corosync
Current DC: rh64-coro1 (4197624000) - partition WITHOUT quorum
Version: 1.1.9-db294e1
3 Nodes configured, unknown expected votes
0 Resources configured.
Node rh64-coro3 (4231178432): UNCLEAN (offline)----------------> OK.
Online: [ rh64-coro1 rh64-coro2 ] ------------------------------> NG: rh64-coro2 should also be UNCLEAN (offline).
Node Attributes:
* Node rh64-coro1:
* Node rh64-coro2:
Migration summary:
* Node rh64-coro1:
* Node rh64-coro2:
The correct behavior would be for the rebooted node to recognize the other nodes as UNCLEAN, but it seems to report rh64-coro2's state incorrectly.
This looks like a problem in Pacemaker itself.
* The problem seems to be in the crm_peer_cache hash table.
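To illustrate the suspected failure mode, here is a minimal Python sketch (not Pacemaker source; the cache here is just a dict, and the event sequence is an assumption): if no membership event updates the cached entry for rh64-coro2 after the interconnect is cut, its stale "member" state from before the partition survives, and the node is wrongly shown as Online.

```python
# Illustrative sketch of the suspected crm_peer_cache problem:
# a stale peer entry makes an unreachable node appear online.

peer_cache = {}  # node name -> last known membership state


def update_peer(name, state):
    """Record a membership event for a peer (simplified)."""
    peer_cache[name] = state


def node_status(name):
    """State a status display would report, based only on the cache."""
    return peer_cache.get(name, "unknown")


# After Step 3 all three nodes are members.
for node in ("rh64-coro1", "rh64-coro2", "rh64-coro3"):
    update_peer(node, "member")

# Step 4: the interconnect is cut. A "lost" event arrives for
# rh64-coro3, but (the suspected bug) no event ever updates the
# cached entry for rh64-coro2.
update_peer("rh64-coro3", "lost")

assert node_status("rh64-coro3") == "lost"    # shown UNCLEAN (offline): OK
assert node_status("rh64-coro2") == "member"  # stale entry: wrongly Online
```

If this matches what happens, the fix would be to invalidate or refresh every cached peer entry when membership changes, not only the entries for which an event was received.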
Best Regards,
Hideo Yamauchi.