[Pacemaker] cLVM stuck
Karl Rößmann
K.Roessmann at fkf.mpg.de
Thu Feb 9 14:29:35 UTC 2012
Hi all,

we run a three-node HA cluster using cLVM and Xen.
After installing some online updates node by node,
cLVM is stuck, and the last (updated) node does not want to rejoin
the cLVM.
Two nodes are still running.
The Xen VMs are running (they have their disks on cLVs),
but commands like 'lvdisplay' do not work.

Is there a way to recover the cLVM without restarting the whole cluster?
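For illustration, the kind of in-place recovery I am hoping for would be
something like the following (only a sketch: 'crm resource cleanup' as in
the SLES crm shell, clone names taken from our configuration below):

```shell
# Clear the failed-operation history for the stuck clones, so that
# Pacemaker retries the start on orion1 without touching orion2/orion7:
crm resource cleanup clvm_clone
crm resource cleanup cluvg1_clone

# Then check whether the clone instances come back:
crm_mon -1
```

But I am not sure this helps if clvmd itself is wedged on the surviving
nodes.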
We have the latest SuSE SLES SP1 with the HA Extension, including:
pacemaker-1.1.5-5.9.11.1
corosync-1.3.3-0.3.1
Some ERROR messages:
Feb 9 12:06:42 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
operation clvm:0_start_0 (15) Timed Out (timeout=240000ms)
Feb 9 12:13:41 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
operation cluvg1:2_start_0 (19) Timed Out (timeout=240000ms)
Feb 9 12:16:21 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
operation cluvg1:2_stop_0 (20) Timed Out (timeout=100000ms)
Feb 9 13:39:10 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
operation clvm:0_start_0 (15) Timed Out (timeout=240000ms)
Feb 9 13:53:38 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
operation cluvg1:2_start_0 (19) Timed Out (timeout=240000ms)
Feb 9 13:56:18 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
operation cluvg1:2_stop_0 (20) Timed Out (timeout=100000ms)
Feb 9 12:11:55 orion2 crm_resource: [13025]: ERROR:
resource_ipc_timeout: No messages received in 60 seconds
Feb 9 12:13:41 orion2 crmd: [5882]: ERROR: send_msg_via_ipc: Unknown
Sub-system (13025_crm_resource)... discarding message.
Feb 9 12:14:41 orion2 crmd: [5882]: ERROR: print_elem: Aborting
transition, action lost: [Action 35]: In-flight (id: cluvg1:2_start_0,
loc: orion1, priority: 0)
Feb 9 13:54:38 orion2 crmd: [5882]: ERROR: print_elem: Aborting
transition, action lost: [Action 35]: In-flight (id: cluvg1:2_start_0,
loc: orion1, priority: 0)
Some additional information:
crm_mon -1:
============
Last updated: Thu Feb 9 15:10:34 2012
Stack: openais
Current DC: orion2 - partition with quorum
Version: 1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60
3 Nodes configured, 3 expected votes
17 Resources configured.
============
Online: [ orion2 orion7 ]
OFFLINE: [ orion1 ]

 Clone Set: dlm_clone [dlm]
     Started: [ orion2 orion7 ]
     Stopped: [ dlm:0 ]
 Clone Set: clvm_clone [clvm]
     Started: [ orion2 orion7 ]
     Stopped: [ clvm:0 ]
 sbd_stonith (stonith:external/sbd): Started orion2
 Clone Set: cluvg1_clone [cluvg1]
     Started: [ orion2 orion7 ]
     Stopped: [ cluvg1:2 ]
 styx (ocf::heartbeat:Xen): Started orion7
 shib (ocf::heartbeat:Xen): Started orion7
 wiki (ocf::heartbeat:Xen): Started orion2
 horde (ocf::heartbeat:Xen): Started orion7
 www (ocf::heartbeat:Xen): Started orion7
 enventory (ocf::heartbeat:Xen): Started orion2
 mailrelay (ocf::heartbeat:Xen): Started orion2
crm configure show
node orion1 \
    attributes standby="off"
node orion2 \
    attributes standby="off"
node orion7 \
    attributes standby="off"
primitive cluvg1 ocf:heartbeat:LVM \
    params volgrpname="cluvg1" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s" \
    meta target-role="Started"
primitive clvm ocf:lvm2:clvmd \
    params daemon_timeout="30" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s" \
    meta target-role="Started"
primitive dlm ocf:pacemaker:controld \
    op monitor interval="120s" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s" \
    meta target-role="Started"
primitive enventory ocf:heartbeat:Xen \
    meta target-role="Started" allow-migrate="true" \
    operations $id="enventory-operations" \
    op monitor interval="10" timeout="30" \
    op migrate_from interval="0" timeout="600" \
    op migrate_to interval="0" timeout="600" \
    params xmfile="/etc/xen/vm/enventory" shutdown_timeout="60"
primitive horde ocf:heartbeat:Xen \
    meta target-role="Started" is-managed="true" allow-migrate="true" \
    operations $id="horde-operations" \
    op monitor interval="10" timeout="30" \
    op migrate_from interval="0" timeout="600" \
    op migrate_to interval="0" timeout="600" \
    params xmfile="/etc/xen/vm/horde" shutdown_timeout="120"
primitive sbd_stonith stonith:external/sbd \
    params sbd_device="/dev/disk/by-id/scsi-360080e50001c150e0000019e4df6d4d5-part1" \
    meta target-role="started"
...
...
...
clone cluvg1_clone cluvg1 \
    meta interleave="true" target-role="started" is-managed="true"
clone clvm_clone clvm \
    meta globally-unique="false" interleave="true" target-role="started"
clone dlm_clone dlm \
    meta globally-unique="false" interleave="true" target-role="started"
colocation cluvg1_with_clvm inf: cluvg1_clone clvm_clone
colocation clvm_with_dlm inf: clvm_clone dlm_clone
colocation enventory_with_cluvg1 inf: enventory cluvg1_clone
colocation horde_with_cluvg1 inf: horde cluvg1_clone
...
... more Xen VMs
...
order cluvg1_before_enventory inf: cluvg1_clone enventory
order cluvg1_before_horde inf: cluvg1_clone horde
order clvm_before_cluvg1 inf: clvm_clone cluvg1_clone
order dlm_before_clvm inf: dlm_clone clvm_clone
property $id="cib-bootstrap-options" \
    dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="3" \
    stonith-timeout="420s" \
    last-lrm-refresh="1328792018"
rsc_defaults $id="rsc_defaults-options" \
    resource-stickiness="10"
op_defaults $id="op_defaults-options" \
    record-pending="false"
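For completeness, these are diagnostics we can still run on the surviving
nodes without going through the hung daemon (flags as I understand them on
this SLES release; corrections welcome):

```shell
# Show the DLM lockspaces as seen by the kernel DLM
# (dlm_tool ships with the cluster stack):
dlm_tool ls

# Read-only local view of the LVM metadata with cluster locking
# disabled, bypassing clvmd entirely:
vgs --config 'global { locking_type = 0 }'
```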
--
Karl Rößmann Tel. +49-711-689-1657
Max-Planck-Institut FKF Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart email K.Roessmann at fkf.mpg.de