[Pacemaker] Nodes appear UNCLEAN (offline) during Pacemaker upgrade to 1.1.7
Parshvi
parshvi.17 at gmail.com
Fri Nov 23 13:27:59 UTC 2012
Parshvi <parshvi.17 at ...> writes:
>
> Hi,
> We are upgrading to Pacemaker 1.1.7 and Corosync 1.4.3.
> The previous version was:
> Pacemaker: 1.0.12
> Corosync : 1.2.7
> The issues faced in the older version are:
> 1) Numerous policy engine and crmd crashes, which prevent failed cluster
> resources from recovering.
> 2) Pacemaker logs show the FSM in a pending state; the service comes back in
> sync only after a restart.
>
> Environment:
> 1) OS: OEL 5.8
> RPMs (packages) for Pacemaker 1.1.7, Corosync 1.4.3 and other dependent
> packages are not available for OEL 5.8. Hence, we have built all packages
> from source (GitHub).
>
> We have a two-node cluster. We have installed the built binaries on both
> cluster nodes. crm_mon shows both nodes as online. All corosync and
> pacemaker processes appear started and running.
>
> Issues faced:
> We have another setup, also a two-node cluster (same as above).
> The package binaries have been installed on both nodes.
> One of the nodes appears UNCLEAN (offline) and the other node appears offline.
> The crmd process respawns continuously until its max respawn count is
> reached. The DC appears as NONE in crm_mon.
>
> I have checked SELinux and the firewall on the nodes (both are disabled).
>
> I have an hb_report of the nodes. I can share it if needed.
>
> I also created another two-node cluster: one node was from the WORKING
> cluster and the other node was from the NON_WORKING cluster.
> The crm_mon output for this cluster is:
>
> Last updated: Sat Nov 17 19:53:37 2012
> Last change: Sat Nov 17 19:53:27 2012 via crmd on node-112
> Stack: openais
> Current DC: node-112 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Node node-122: UNCLEAN (offline)
> Online: [ node-112 ]
>
> After some time, the UNCLEAN (offline) node appears as offline:
>
> Last updated: Sat Nov 17 20:26:48 2012
> Last change: Sat Nov 17 20:15:38 2012 via cibadmin on node-112
> Stack: openais
> Current DC: node-112 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
>
> Online: [ node-112 ]
> OFFLINE: [ node-122 ]
>
> I would request the owners to please respond with some input. The old
> version is a concern in our production environment.
The output of the following commands on the node appearing UNCLEAN (offline) is:
corosync-objctl | grep member
runtime.totem.pg.mrp.srp.members.1887545536.ip=r(0) ip(192.168.100.112)
runtime.totem.pg.mrp.srp.members.1887545536.join_count=1
runtime.totem.pg.mrp.srp.members.1887545536.status=joined
runtime.totem.pg.mrp.srp.members.2055317696.ip=r(0) ip(192.168.100.122)
runtime.totem.pg.mrp.srp.members.2055317696.join_count=1
runtime.totem.pg.mrp.srp.members.2055317696.status=joined
corosync-cfgtool -s
Printing ring status.
Local node ID 2055317696
RING ID 0
id = 192.168.100.122
status = ring 0 active with no faults
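
To make it easier to compare the corosync view with what crm_mon reports, the
membership lines above can be checked mechanically. The following is only a
minimal Python sketch and not part of the original report: the function names
parse_members and not_joined are hypothetical, and it assumes the key layout
shown in the corosync-objctl output above (as printed by this Corosync 1.4.3
build). The sample text is copied from that output.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   runtime.totem.pg.mrp.srp.members.1887545536.status=joined
# (key layout assumed from the corosync-objctl output in this thread)
MEMBER_RE = re.compile(
    r"^runtime\.totem\.pg\.mrp\.srp\.members\.(\d+)\.(\w+)=(.*)$"
)
# Extracts the address from a value such as "r(0) ip(192.168.100.112)"
IP_RE = re.compile(r"ip\(([^)]+)\)")


def parse_members(text):
    """Return {node_id: {"ip": str, "join_count": int, "status": str}}."""
    members = defaultdict(dict)
    for line in text.splitlines():
        match = MEMBER_RE.match(line.strip())
        if not match:
            continue  # ignore anything that is not a member entry
        node_id, key, value = match.groups()
        if key == "ip":
            ip = IP_RE.search(value)
            value = ip.group(1) if ip else value
        elif key == "join_count":
            value = int(value)
        members[int(node_id)][key] = value
    return dict(members)


def not_joined(members):
    """Return the sorted node IDs whose status is anything other than 'joined'."""
    return sorted(
        node_id
        for node_id, info in members.items()
        if info.get("status") != "joined"
    )


# Sample copied from the `corosync-objctl | grep member` output above.
SAMPLE = """\
runtime.totem.pg.mrp.srp.members.1887545536.ip=r(0) ip(192.168.100.112)
runtime.totem.pg.mrp.srp.members.1887545536.join_count=1
runtime.totem.pg.mrp.srp.members.1887545536.status=joined
runtime.totem.pg.mrp.srp.members.2055317696.ip=r(0) ip(192.168.100.122)
runtime.totem.pg.mrp.srp.members.2055317696.join_count=1
runtime.totem.pg.mrp.srp.members.2055317696.status=joined
"""

if __name__ == "__main__":
    members = parse_members(SAMPLE)
    for node_id, info in sorted(members.items()):
        print(node_id, info["ip"], info["status"], info["join_count"])
    print("members not joined:", not_joined(members))
```

For the output pasted above, the list of members that are not joined comes out
empty: corosync on this node sees both 192.168.100.112 and 192.168.100.122 as
joined, even though crm_mon shows node-122 as UNCLEAN (offline).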