[Pacemaker] Nodes appear UNCLEAN (offline) during Pacemaker upgrade to 1.1.7

Parshvi parshvi.17 at gmail.com
Fri Nov 23 13:27:59 UTC 2012


Parshvi <parshvi.17 at ...> writes:

> 
> Hi,
> We are upgrading to Pacemaker 1.1.7 and Corosync 1.4.3.
> The previous version was:
> Pacemaker: 1.0.12
> Corosync : 1.2.7
> The issues we faced with the older versions are:
> 1) Numerous policy engine and crmd crashes, which prevent failed cluster
> resources from recovering.
> 2) Pacemaker logs show the FSM in a pending state; the service comes back
> in sync only after a restart.
> 
> Environment:
> 1) OS: OEL 5.8
> RPMs (packages) for Pacemaker 1.1.7, Corosync 1.4.3 and other dependent
> packages are not available for OEL 5.8. Hence, we have built all packages
> from source (GitHub).
> 
> We have a two-node cluster. We have installed the built binaries on both
> cluster nodes. crm_mon shows both nodes as online. All corosync and
> pacemaker processes appear started and running.
> 
> Issues faced:
> We have another setup, also consisting of two nodes in a cluster (same as above).
> The package binaries have been installed on both nodes.
> One of the nodes appears UNCLEAN (offline) and the other node appears (offline).
> The crmd process continuously respawns until its max respawn count is reached.
> DC appears as NONE in crm_mon.
> 
> I have checked SELinux and the firewall on the nodes (both are disabled).
> 
> I have an hb_report of the nodes. I can share it if needed.
> 
> I also created another two-node cluster: one node was from the WORKING
> cluster and the other node was from the NON_WORKING cluster.
> The crm_mon output of this cluster is:
> 
> Last updated: Sat Nov 17 19:53:37 2012
> Last change: Sat Nov 17 19:53:27 2012 via crmd on node-112
> Stack: openais
> Current DC: node-112 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
> 
> Node node-122: UNCLEAN (offline)
> Online: [ node-112 ]
> 
> After some time, the UNCLEAN (offline) node appears as OFFLINE:
> 
> Last updated: Sat Nov 17 20:26:48 2012
> Last change: Sat Nov 17 20:15:38 2012 via cibadmin on node-112
> Stack: openais
> Current DC: node-112 - partition with quorum
> Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
> 2 Nodes configured, 2 expected votes
> 0 Resources configured.
> ============
> 
> Online: [ node-112 ]
> OFFLINE: [ node-122 ]
> 
> I would request the owners to please respond with some input. The old version
> is a concern in our production environment.
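
For anyone comparing the two crm_mon dumps quoted above, here is a minimal Python sketch that turns the node section of that text into a name-to-state mapping. The function name parse_crm_mon_nodes is hypothetical, and the parsing assumes only the three line shapes that appear in the dumps ("Node X: STATE", "Online: [ ... ]", "OFFLINE: [ ... ]"); other crm_mon versions may print different layouts.

```python
import re

def parse_crm_mon_nodes(text):
    """Map node name -> state, based only on the line shapes seen in the dumps."""
    states = {}
    for raw in text.splitlines():
        line = raw.strip()
        # e.g. "Node node-122: UNCLEAN (offline)"
        m = re.match(r"Node (\S+): (.+)$", line)
        if m:
            states[m.group(1)] = m.group(2)
            continue
        # e.g. "Online: [ node-112 ]" or "OFFLINE: [ node-122 ]"
        m = re.match(r"(Online|OFFLINE): \[(.*)\]$", line)
        if m:
            for name in m.group(2).split():
                states[name] = m.group(1).lower()
    return states

# Node sections copied from the two dumps in the quoted message.
first_dump = """\
Node node-122: UNCLEAN (offline)
Online: [ node-112 ]
"""
second_dump = """\
Online: [ node-112 ]
OFFLINE: [ node-122 ]
"""

print(parse_crm_mon_nodes(first_dump))
print(parse_crm_mon_nodes(second_dump))
```

Running it on both dumps makes the transition of node-122 from "UNCLEAN (offline)" to "offline" explicit while node-112 stays online throughout.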

The output of the following commands on the node appearing UNCLEAN (offline) is:
corosync-objctl  | grep member
runtime.totem.pg.mrp.srp.members.1887545536.ip=r(0) ip(192.168.100.112)
runtime.totem.pg.mrp.srp.members.1887545536.join_count=1
runtime.totem.pg.mrp.srp.members.1887545536.status=joined
runtime.totem.pg.mrp.srp.members.2055317696.ip=r(0) ip(192.168.100.122)
runtime.totem.pg.mrp.srp.members.2055317696.join_count=1
runtime.totem.pg.mrp.srp.members.2055317696.status=joined
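
The member lines above can be checked mechanically. The following Python sketch groups the runtime.totem.pg.mrp.srp.members.* keys by node ID and lists any member whose status is not "joined". The helper names parse_members and not_joined are hypothetical, and the key layout is assumed to be exactly what the output above shows. In this dump both members report "joined", which may indicate that the problem lies above the corosync membership layer, although that is only an inference from this one output.

```python
import re

# Key layout taken from the corosync-objctl output above.
MEMBER_RE = re.compile(
    r"^runtime\.totem\.pg\.mrp\.srp\.members\.(\d+)\.(\w+)=(.*)$"
)

def parse_members(text):
    """Group member keys by node ID: {node_id: {"ip": ..., "status": ...}}."""
    members = {}
    for line in text.splitlines():
        m = MEMBER_RE.match(line.strip())
        if m:
            node_id, key, value = m.groups()
            members.setdefault(int(node_id), {})[key] = value
    return members

def not_joined(members):
    """Return the sorted node IDs whose status is anything other than 'joined'."""
    return sorted(
        nid for nid, info in members.items() if info.get("status") != "joined"
    )

# Lines copied from the output above.
sample = """\
runtime.totem.pg.mrp.srp.members.1887545536.ip=r(0) ip(192.168.100.112)
runtime.totem.pg.mrp.srp.members.1887545536.join_count=1
runtime.totem.pg.mrp.srp.members.1887545536.status=joined
runtime.totem.pg.mrp.srp.members.2055317696.ip=r(0) ip(192.168.100.122)
runtime.totem.pg.mrp.srp.members.2055317696.join_count=1
runtime.totem.pg.mrp.srp.members.2055317696.status=joined
"""

members = parse_members(sample)
print(sorted(members))      # node IDs found in the dump
print(not_joined(members))  # empty list when every member is joined
```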

corosync-cfgtool -s
Printing ring status.
Local node ID 2055317696
RING ID 0
        id      = 192.168.100.122
        status  = ring 0 active with no faults







