[Pacemaker] cLVM stuck
Andreas Kurz
andreas at hastexo.com
Thu Feb 9 15:50:08 UTC 2012
Hello,
On 02/09/2012 03:29 PM, Karl Rößmann wrote:
> Hi all,
>
> we run a three Node HA Cluster using cLVM and Xen.
>
> After installing some online updates node by node
Was the cluster in maintenance-mode, or was the cluster stack shut down on
the node that received the updates?
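(Just as a rough sketch, assuming the crm shell that ships with the SLES HA
Extension: one common way to quiesce the cluster for a rolling update is

  # tell Pacemaker to stop managing resources while packages are updated
  crm configure property maintenance-mode="true"
  # ... install the updates node by node ...
  crm configure property maintenance-mode="false"

... the other common approach is to stop the cluster stack on the node
being updated before patching it.)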
> the cLVM is stuck, and the last (updated) node does not want to join
> the cLVM.
> Two nodes are still running.
> The Xen VMs are running (they have their disks on cLVs),
> but commands like 'lvdisplay' do not work.
>
> Is there a way to recover the cLVM without restarting the whole cluster?
Any log entries about the controld on orion1? What is the output of
"crm_mon -1fr"? It looks like there is a problem with starting
dlm_controld.pcmk ... and I wonder why orion1 was not fenced on the stop
errors, or did that happen?
Did you inspect the output of "dlm_tool ls" and "dlm_tool dump" on all
nodes where the controld is running?
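For example, something along these lines on every node that still has a
running controld ... just a sketch:

  crm_mon -1fr    # one-shot status including fail counts and inactive resources
  dlm_tool ls     # list the lockspaces the dlm currently knows about
  dlm_tool dump   # dump the dlm_controld debug buffer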
Your crm_mon output shows orion1 offline ... that does not seem to match
the timeframe of your logs?
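And as an aside: if it turns out orion1 just has leftover failed start/stop
operations and the node itself is healthy, cleaning up the affected clones
on that node may be enough to let Pacemaker retry ... only a sketch,
assuming the crm shell and that the underlying problem is gone:

  crm resource cleanup clvm_clone orion1
  crm resource cleanup cluvg1_clone orion1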
Regards,
Andreas
--
Need help with Pacemaker?
http://www.hastexo.com/now
>
> we have the latest
> SUSE SLES SP1 and HA Extension, including
> pacemaker-1.1.5-5.9.11.1
> corosync-1.3.3-0.3.1
>
> Some ERROR messages:
> Feb 9 12:06:42 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
> operation clvm:0_start_0 (15) Timed Out (timeout=240000ms)
> Feb 9 12:13:41 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
> operation cluvg1:2_start_0 (19) Timed Out (timeout=240000ms)
> Feb 9 12:16:21 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM
> operation cluvg1:2_stop_0 (20) Timed Out (timeout=100000ms)
> Feb 9 13:39:10 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
> operation clvm:0_start_0 (15) Timed Out (timeout=240000ms)
> Feb 9 13:53:38 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
> operation cluvg1:2_start_0 (19) Timed Out (timeout=240000ms)
> Feb 9 13:56:18 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM
> operation cluvg1:2_stop_0 (20) Timed Out (timeout=100000ms)
>
>
>
> Feb 9 12:11:55 orion2 crm_resource: [13025]: ERROR:
> resource_ipc_timeout: No messages received in 60 seconds
> Feb 9 12:13:41 orion2 crmd: [5882]: ERROR: send_msg_via_ipc: Unknown
> Sub-system (13025_crm_resource)... discarding message.
> Feb 9 12:14:41 orion2 crmd: [5882]: ERROR: print_elem: Aborting
> transition, action lost: [Action 35]: In-flight (id: cluvg1:2_start_0,
> loc: orion1, priority: 0)
> Feb 9 13:54:38 orion2 crmd: [5882]: ERROR: print_elem: Aborting
> transition, action lost: [Action 35]: In-flight (id: cluvg1:2_start_0,
> loc: orion1, priority: 0)
>
>
>
> Some additional information:
>
> crm_mon -1:
> ============
> Last updated: Thu Feb 9 15:10:34 2012
> Stack: openais
> Current DC: orion2 - partition with quorum
> Version: 1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60
> 3 Nodes configured, 3 expected votes
> 17 Resources configured.
> ============
>
> Online: [ orion2 orion7 ]
> OFFLINE: [ orion1 ]
>
> Clone Set: dlm_clone [dlm]
> Started: [ orion2 orion7 ]
> Stopped: [ dlm:0 ]
> Clone Set: clvm_clone [clvm]
> Started: [ orion2 orion7 ]
> Stopped: [ clvm:0 ]
> sbd_stonith (stonith:external/sbd): Started orion2
> Clone Set: cluvg1_clone [cluvg1]
> Started: [ orion2 orion7 ]
> Stopped: [ cluvg1:2 ]
> styx (ocf::heartbeat:Xen): Started orion7
> shib (ocf::heartbeat:Xen): Started orion7
> wiki (ocf::heartbeat:Xen): Started orion2
> horde (ocf::heartbeat:Xen): Started orion7
> www (ocf::heartbeat:Xen): Started orion7
> enventory (ocf::heartbeat:Xen): Started orion2
> mailrelay (ocf::heartbeat:Xen): Started orion2
>
>
>
> crm configure show
> node orion1 \
> attributes standby="off"
> node orion2 \
> attributes standby="off"
> node orion7 \
> attributes standby="off"
> primitive cluvg1 ocf:heartbeat:LVM \
> params volgrpname="cluvg1" \
> op start interval="0" timeout="240s" \
> op stop interval="0" timeout="100s" \
> meta target-role="Started"
> primitive clvm ocf:lvm2:clvmd \
> params daemon_timeout="30" \
> op start interval="0" timeout="240s" \
> op stop interval="0" timeout="100s" \
> meta target-role="Started"
> primitive dlm ocf:pacemaker:controld \
> op monitor interval="120s" \
> op start interval="0" timeout="240s" \
> op stop interval="0" timeout="100s" \
> meta target-role="Started"
> primitive enventory ocf:heartbeat:Xen \
> meta target-role="Started" allow-migrate="true" \
> operations $id="enventory-operations" \
> op monitor interval="10" timeout="30" \
> op migrate_from interval="0" timeout="600" \
> op migrate_to interval="0" timeout="600" \
> params xmfile="/etc/xen/vm/enventory" shutdown_timeout="60"
> primitive horde ocf:heartbeat:Xen \
> meta target-role="Started" is-managed="true" allow-migrate="true" \
> operations $id="horde-operations" \
> op monitor interval="10" timeout="30" \
> op migrate_from interval="0" timeout="600" \
> op migrate_to interval="0" timeout="600" \
> params xmfile="/etc/xen/vm/horde" shutdown_timeout="120"
> primitive sbd_stonith stonith:external/sbd \
> params
> sbd_device="/dev/disk/by-id/scsi-360080e50001c150e0000019e4df6d4d5-part1" \
> meta target-role="started"
> ...
> ...
> ...
> clone cluvg1_clone cluvg1 \
> meta interleave="true" target-role="started" is-managed="true"
> clone clvm_clone clvm \
> meta globally-unique="false" interleave="true"
> target-role="started"
> clone dlm_clone dlm \
> meta globally-unique="false" interleave="true"
> target-role="started"
> colocation cluvg1_with_clvm inf: cluvg1_clone clvm_clone
> colocation clvm_with_dlm inf: clvm_clone dlm_clone
> colocation enventory_with_cluvg1 inf: enventory cluvg1_clone
> colocation horde_with_cluvg1 inf: horde cluvg1_clone
> ...
> ... more Xen VMs
> ...
> order cluvg1_before_enventory inf: cluvg1_clone enventory
> order cluvg1_before_horde inf: cluvg1_clone horde
> order clvm_before_cluvg1 inf: clvm_clone cluvg1_clone
> order dlm_before_clvm inf: dlm_clone clvm_clone
> property $id="cib-bootstrap-options" \
> dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="3" \
> stonith-timeout="420s" \
> last-lrm-refresh="1328792018"
> rsc_defaults $id="rsc_defaults-options" \
> resource-stickiness="10"
> op_defaults $id="op_defaults-options" \
> record-pending="false"
>
>