[Pacemaker] node1 fencing itself after node2 being fenced

Asgaroth lists at blueface.com
Mon Feb 17 13:52:37 EST 2014


> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> Sent: 17 February 2014 00:55
> To: lists at blueface.com; The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] node1 fencing itself after node2 being fenced
> 
> 
> If you have configured cman to use fence_pcmk, then all cman/dlm/clvmd
> fencing operations are sent to Pacemaker.
> If you aren't running pacemaker, then you have a big problem as no-one can
> perform fencing.

I have configured pacemaker as the resource manager, and it is enabled to
start on boot as follows:

chkconfig cman on
chkconfig clvmd on
chkconfig pacemaker on
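
For reference, clvmd has to be up before pacemaker starts here, so the init
script ordering matters. One way to double-check it (stock RHEL 6 scripts;
the S-numbers may differ per distribution):

[root@test01 ~]# ls /etc/rc3.d/ | egrep 'cman|clvmd|pacemaker'

On my reading of the shipped scripts this should come back roughly as
S21cman, S24clvmd, S99pacemaker, but that is worth verifying locally.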

> 
> I don't know if you are testing without pacemaker running, but if so you
> would need to configure cman with real fencing devices.
>

I have been testing with pacemaker running, and the fencing appears to be
operating fine. The issue I seem to have is that clvmd is unable to
re-acquire its locks when the fenced node attempts to rejoin the cluster,
so clvmd just hangs when the startup script fires it off on boot-up. While
the 3rd node is in this state (hung clvmd), the other 2 nodes are unable
to obtain locks from it. As an example, this is what happens on node1
(running pvs) while the 3rd node is hung at the clvmd startup phase after
pacemaker has issued a fence operation:

[root@test01 ~]# pvs
  Error locking on node test03: Command timed out
  Unable to obtain global lock.
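
As a quick way to see where startup is wedged, checking the clvmd process
state on the hung node should show whether it is stuck in uninterruptible
sleep (the wchan column is only a hint, but a dlm-related wait channel
would fit the symptom):

[root@test03 ~]# ps -eo pid,stat,wchan:30,cmd | grep '[c]lvmd'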
 
The dlm elements look fine to me here too:

[root@test01 ~]# dlm_tool ls
dlm lockspaces
name          cdr
id            0xa8054052
flags         0x00000008 fs_reg
change        member 2 joined 0 remove 1 failed 1 seq 2,2
members       1 2 

name          clvmd
id            0x4104eefa
flags         0x00000000 
change        member 3 joined 1 remove 0 failed 0 seq 3,3
members       1 2 3
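
For completeness, the fence/dlm group membership and the dlm_controld debug
buffer can be checked as well (these are the cluster3-era commands on
RHEL 6; invocations may vary by version):

[root@test01 ~]# group_tool ls
[root@test01 ~]# dlm_tool dump | tail -20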

So cman/dlm look to be operating properly; however, clvmd hangs and never
exits, so pacemaker never starts on the 3rd node, which therefore sits in
the "pending" state while clvmd is hung:

[root@test02 ~]# crm_mon -Afr -1
Last updated: Mon Feb 17 15:52:28 2014
Last change: Mon Feb 17 15:43:16 2014 via cibadmin on test01
Stack: cman
Current DC: test02 - partition with quorum
Version: 1.1.10-14.el6_5.2-368c726
3 Nodes configured
15 Resources configured


Node test03: pending
Online: [ test01 test02 ]

Full list of resources:

 fence_test01      (stonith:fence_vmware_soap):    Started test01 
 fence_test02      (stonith:fence_vmware_soap):    Started test02 
 fence_test03      (stonith:fence_vmware_soap):    Started test01 
 Clone Set: fs_cdr-clone [fs_cdr]
     Started: [ test01 test02 ]
     Stopped: [ test03 ]
 Resource Group: sftp01-vip
     vip-001    (ocf::heartbeat:IPaddr2):       Started test01 
     vip-002    (ocf::heartbeat:IPaddr2):       Started test01 
 Resource Group: sftp02-vip
     vip-003    (ocf::heartbeat:IPaddr2):       Started test02 
     vip-004    (ocf::heartbeat:IPaddr2):       Started test02 
 Resource Group: sftp03-vip
     vip-005    (ocf::heartbeat:IPaddr2):       Started test02 
     vip-006    (ocf::heartbeat:IPaddr2):       Started test02 
 sftp01 (lsb:sftp01):   Started test01 
 sftp02 (lsb:sftp02):   Started test02 
 sftp03 (lsb:sftp03):   Started test02 

Node Attributes:
* Node test01:
* Node test02:
* Node test03:

Migration summary:
* Node test03: 
* Node test02: 
* Node test01:
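
Next time test03 wedges like this I will try to capture exactly where clvmd
is blocked. Assuming the kernel has sysrq enabled (kernel.sysrq=1), the
following should put the blocked-task stacks into the kernel log:

[root@test03 ~]# cat /proc/$(pidof clvmd)/stack
[root@test03 ~]# echo w > /proc/sysrq-trigger   # dumps D-state tasks to dmesg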
