[Pacemaker] cLVM stuck
Karl Rößmann
K.Roessmann at fkf.mpg.de
Fri Feb 10 07:56:20 UTC 2012
Quoting Andreas Kurz <andreas at hastexo.com>:
> Hello,
>
> On 02/09/2012 03:29 PM, Karl Rößmann wrote:
>> Hi all,
>>
>> we run a three Node HA Cluster using cLVM and Xen.
>>
>> After installing some online updates node by node
>
> Was the cluster in maintenance-mode, or was the cluster shut down on the
> node that received the updates?
>
The cluster was shut down on the node that received the updates.
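For reference, a per-node online update of this kind typically looks roughly
like the sketch below (assuming the stock SLES openais init script and
zypper; an illustration, not our exact command history):

  rcopenais stop     # take this node out of the cluster
  zypper patch       # install the online updates
  rcopenais start    # rejoin the cluster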
>> the cLVM is stuck, and the last (updated) node does not want to join
>> the cLVM.
>> Two nodes are still running.
>> The Xen VMs are running (they have their disks on clustered LVs),
>> but commands like 'lvdisplay' do not work.
>>
>> Is there a way to recover the cLVM without restarting the whole cluster?
>
> Any log entries about the controld on orion1 ... what is the output of
> "crm_mon -1fr"? ... looks like there is a problem with starting
> dlm_controld.pcmk ... and I wonder why orion1 was not fenced on stop
> errors, or did that happen?
Yes, orion1 was fenced.
The controld entries on orion1:
Feb 9 10:31:56 orion1 cluster-dlm: log_config: dlm:controld conf 2 0 1 memb 16885952 117549248 join left 33663168
Feb 9 10:31:56 orion1 cluster-dlm: log_config: dlm:controld conf 2 0 1 memb 16885952 117549248 join left 33663168
Feb 9 10:42:23 orion1 cluster-dlm: log_config: dlm:controld conf 3 1 0 memb 16885952 33663168 117549248 join 33663168 left
Feb 9 10:46:04 orion1 cluster-dlm: log_config: dlm:controld conf 2 0 1 memb 16885952 33663168 join left 117549248
Feb 9 10:46:04 orion1 cluster-dlm: log_config: dlm:controld conf 2 0 1 memb 16885952 33663168 join left 117549248
Feb 9 10:54:25 orion1 cluster-dlm: log_config: dlm:controld conf 3 1 0 memb 16885952 33663168 117549248 join 117549248 left
Feb 9 11:11:23 orion1 lrmd: [6031]: info: RA output: (dlm:0:probe:stderr) dlm_controld.pcmk: no process found
Feb 9 11:11:30 orion1 lrmd: [6031]: info: RA output: (dlm:0:start:stderr) dlm_controld.pcmk: no process found
Feb 9 11:11:30 orion1 cluster-dlm[6347]: main: dlm_controld master started
Feb 9 11:11:30 orion1 cluster-dlm: log_config: dlm:controld conf 3 1 0 memb 16885952 33663168 117549248 join 16885952 left
Feb 9 12:02:39 orion1 lrmd: [6459]: info: RA output: (dlm:0:probe:stderr) dlm_controld.pcmk: no process found
Feb 9 12:02:41 orion1 lrmd: [6459]: info: RA output: (dlm:0:start:stderr) dlm_controld.pcmk: no process found
Feb 9 12:02:41 orion1 cluster-dlm[6773]: main: dlm_controld master started
Feb 9 12:02:41 orion1 cluster-dlm: log_config: dlm:controld conf 3 1 0 memb 16885952 33663168 117549248 join 16885952 left
Feb 9 13:35:07 orion1 lrmd: [14347]: info: RA output: (dlm:0:probe:stderr) dlm_controld.pcmk: no process found
Feb 9 13:35:09 orion1 lrmd: [14347]: info: RA output: (dlm:0:start:stderr) dlm_controld.pcmk: no process found
Feb 9 13:35:09 orion1 cluster-dlm[14773]: main: dlm_controld master started
Feb 9 13:35:09 orion1 cluster-dlm: log_config: dlm:controld conf 3 1 0 memb 16885952 33663168 117549248 join 16885952 left
output from crm_mon -1fr:
============
Last updated: Fri Feb 10 08:41:53 2012
Stack: openais
Current DC: orion2 - partition with quorum
Version: 1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60
3 Nodes configured, 3 expected votes
17 Resources configured.
============
Online: [ orion2 orion7 ]
OFFLINE: [ orion1 ]
Full list of resources:
Clone Set: dlm_clone [dlm]
    Started: [ orion2 orion7 ]
    Stopped: [ dlm:0 ]
Clone Set: clvm_clone [clvm]
    Started: [ orion2 orion7 ]
    Stopped: [ clvm:0 ]
sbd_stonith (stonith:external/sbd): Started orion2
Clone Set: cluvg1_clone [cluvg1]
    Started: [ orion2 orion7 ]
    Stopped: [ cluvg1:2 ]
styx (ocf::heartbeat:Xen): Started orion7
shib (ocf::heartbeat:Xen): Started orion7
wiki (ocf::heartbeat:Xen): Started orion2
horde (ocf::heartbeat:Xen): Started orion7
www (ocf::heartbeat:Xen): Started orion7
enventory (ocf::heartbeat:Xen): Started orion2
mailrelay (ocf::heartbeat:Xen): Started orion2
Migration summary:
* Node orion2:
* Node orion7:
>
> Did you inspect the output of "dlm_tool ls/dump" on all nodes where the
> controld is running?
>
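For reference, the outputs below were collected on each of the two nodes
still running the controld, roughly like this (a minimal sketch, using only
the commands already named above):

  # list the DLM lockspaces and their current change/status
  dlm_tool ls

  # dump the dlm_controld debug buffer (membership and fencing history)
  dlm_tool dump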
output from dlm_tool ls from orion2:
dlm lockspaces
name clvmd
id 0x4104eefa
flags 0x00000004 kern_stop
change member 3 joined 1 remove 0 failed 0 seq 3,3
members 16885952 33663168 117549248
new change member 2 joined 0 remove 1 failed 1 seq 4,10
new status wait_messages 1 wait_condition 0
new members 33663168 117549248
output from dlm_tool ls from orion7:
dlm lockspaces
name clvmd
id 0x4104eefa
flags 0x00000004 kern_stop
change member 3 joined 1 remove 0 failed 0 seq 1,1
members 16885952 33663168 117549248
new change member 2 joined 0 remove 1 failed 1 seq 2,8
new status wait_messages 1 wait_condition 0
new members 33663168 117549248
output from dlm_tool dump from orion2:
...
1328790910 clvmd wait_messages cg 9 need 2 of 3
1328792190 Processing membership 572
1328792190 Skipped active node 33663168: born-on=528, last-seen=572, this-event=572, last-event=568
1328792190 Skipped active node 117549248: born-on=536, last-seen=572, this-event=572, last-event=568
1328792190 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/16885952"
1328792190 Removed inactive node 16885952: born-on=568, last-seen=568, this-event=572, last-event=568
1328792190 dlm:controld conf 2 0 1 memb 33663168 117549248 join left 16885952
1328792190 dlm:ls:clvmd conf 2 0 1 memb 33663168 117549248 join left 16885952
1328792190 clvmd add_change cg 10 remove nodeid 16885952 reason 3
1328792190 clvmd add_change cg 10 counts member 2 joined 0 remove 1 failed 1
1328792190 Node 16885952/orion1 has not been shot yet
1328792190 clvmd check_fencing 16885952 not fenced add 1328790910 fence 0
1328792540 Node 16885952/orion1 was last shot 'now'
1328792540 clvmd check_fencing 16885952 1328790910 fenced at 1328792540
1328792540 clvmd check_fencing done
1328792540 clvmd check_quorum disabled
1328792540 clvmd check_fs none registered
1328792540 clvmd send nack old cg 8 new cg 10
1328792540 clvmd send_start cg 8 flags 6 data2 0 counts 3 2 0 1 1
1328792540 clvmd send nack old cg 6 new cg 10
1328792540 clvmd send_start cg 6 flags 6 data2 0 counts 3 2 0 1 1
1328792540 clvmd send nack old cg 4 new cg 10
1328792540 clvmd send_start cg 4 flags 6 data2 0 counts 3 2 0 1 1
1328792540 clvmd send_start cg 10 flags 2 data2 0 counts 3 2 0 1 1
1328792540 clvmd receive_start 33663168:8 len 80
1328792540 clvmd match_change 33663168:8 skip 4 already start
1328792540 clvmd match_change 33663168:8 skip 5 already start
1328792540 clvmd match_change 33663168:8 skip 6 is nacked
1328792540 clvmd match_change 33663168:8 skip 7 is nacked
1328792540 clvmd match_change 33663168:8 skip 8 is nacked
1328792540 clvmd match_change 33663168:8 skip 9 is nacked
1328792540 clvmd match_change 33663168:8 matches cg 10
1328792540 clvmd receive_start 33663168:8 is NACK
1328792540 clvmd wait_messages cg 10 need 2 of 2
...
output from dlm_tool dump from orion7:
...
1328792190 Processing membership 572
1328792190 Skipped active node 33663168: born-on=528, last-seen=572, this-event=572, last-event=568
1328792190 Skipped active node 117549248: born-on=536, last-seen=572, this-event=572, last-event=568
1328792190 del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/16885952"
1328792190 Removed inactive node 16885952: born-on=568, last-seen=568, this-event=572, last-event=568
1328792190 dlm:controld conf 2 0 1 memb 33663168 117549248 join left 16885952
1328792190 dlm:ls:clvmd conf 2 0 1 memb 33663168 117549248 join left 16885952
1328792190 clvmd add_change cg 8 remove nodeid 16885952 reason 3
1328792190 clvmd add_change cg 8 counts member 2 joined 0 remove 1 failed 1
1328792190 Node 16885952/orion1 has not been shot yet
1328792190 clvmd check_fencing 16885952 not fenced add 1328790910 fence 0
1328792540 clvmd receive_start 33663168:8 len 80
1328792540 clvmd match_change 33663168:8 skip 2 already start
1328792540 clvmd match_change 33663168:8 skip 3 already start
1328792540 clvmd match_change 33663168:8 skip 4 is nacked
1328792540 clvmd match_change 33663168:8 skip 5 is nacked
1328792540 clvmd match_change 33663168:8 skip 6 is nacked
1328792540 clvmd match_change 33663168:8 skip 7 is nacked
1328792540 clvmd match_change 33663168:8 matches cg 8
1328792540 clvmd receive_start 33663168:8 is NACK
1328792540 Node 16885952/orion1 was last shot 'now'
1328792540 clvmd check_fencing 16885952 1328790910 fenced at 1328792540
1328792540 clvmd check_fencing done
1328792540 clvmd check_quorum disabled
1328792540 clvmd check_fs none registered
1328792540 clvmd send nack old cg 6 new cg 8
1328792540 clvmd send_start cg 6 flags 6 data2 0 counts 1 2 0 1 1
1328792540 clvmd send nack old cg 4 new cg 8
1328792540 clvmd send_start cg 4 flags 6 data2 0 counts 1 2 0 1 1
1328792540 clvmd send nack old cg 2 new cg 8
1328792540 clvmd send_start cg 2 flags 6 data2 0 counts 1 2 0 1 1
1328792540 clvmd send_start cg 8 flags 2 data2 0 counts 1 2 0 1 1
1328792540 clvmd receive_start 33663168:6 len 80
1328792540 clvmd match_change 33663168:6 skip 2 already start
1328792540 clvmd match_change 33663168:6 skip 3 already start
1328792540 clvmd match_change 33663168:6 skip 4 is nacked
1328792540 clvmd match_change 33663168:6 skip 5 is nacked
1328792540 clvmd match_change 33663168:6 skip 6 is nacked
1328792540 clvmd match_change 33663168:6 skip 7 is nacked
1328792540 clvmd match_change 33663168:6 skip 8 is nacked
1328792540 clvmd find_change 33663168:6 no match
1328792540 clvmd wait_messages cg 8 need 2 of 2
1328792540 clvmd receive_start 33663168:4 len 80
1328792540 clvmd match_change 33663168:4 skip 2 already start
1328792540 clvmd match_change 33663168:4 skip 3 already start
...
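For what it's worth, the fencing and wait state that seems relevant here can
be pulled straight out of those dumps; a minimal sketch, again using only the
tools above:

  # when did dlm_controld consider node 16885952 (orion1) fenced,
  # and what is the clvmd lockspace still waiting for?
  dlm_tool dump | grep -E 'check_fencing|wait_messages'

  # per-lockspace view of the stuck change (kern_stop / wait_messages)
  dlm_tool ls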
> Your crm_mon output shows orion1 offline ... that does not seem to match
> the time frame of your logs?
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
>
>>
>> We have the latest
>> SUSE SLES SP1 with the HA Extension, including
>> pacemaker-1.1.5-5.9.11.1
>> corosync-1.3.3-0.3.1
>>
>> Some ERROR messages:
>> Feb 9 12:06:42 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM operation clvm:0_start_0 (15) Timed Out (timeout=240000ms)
>> Feb 9 12:13:41 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM operation cluvg1:2_start_0 (19) Timed Out (timeout=240000ms)
>> Feb 9 12:16:21 orion1 crmd: [6462]: ERROR: process_lrm_event: LRM operation cluvg1:2_stop_0 (20) Timed Out (timeout=100000ms)
>> Feb 9 13:39:10 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM operation clvm:0_start_0 (15) Timed Out (timeout=240000ms)
>> Feb 9 13:53:38 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM operation cluvg1:2_start_0 (19) Timed Out (timeout=240000ms)
>> Feb 9 13:56:18 orion1 crmd: [14350]: ERROR: process_lrm_event: LRM operation cluvg1:2_stop_0 (20) Timed Out (timeout=100000ms)
>>
>>
>>
>> Feb 9 12:11:55 orion2 crm_resource: [13025]: ERROR: resource_ipc_timeout: No messages received in 60 seconds
>> Feb 9 12:13:41 orion2 crmd: [5882]: ERROR: send_msg_via_ipc: Unknown Sub-system (13025_crm_resource)... discarding message.
>> Feb 9 12:14:41 orion2 crmd: [5882]: ERROR: print_elem: Aborting transition, action lost: [Action 35]: In-flight (id: cluvg1:2_start_0, loc: orion1, priority: 0)
>> Feb 9 13:54:38 orion2 crmd: [5882]: ERROR: print_elem: Aborting transition, action lost: [Action 35]: In-flight (id: cluvg1:2_start_0, loc: orion1, priority: 0)
>>
>>
>>
>> Some additional information:
>>
>> crm_mon -1:
>> ============
>> Last updated: Thu Feb 9 15:10:34 2012
>> Stack: openais
>> Current DC: orion2 - partition with quorum
>> Version: 1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60
>> 3 Nodes configured, 3 expected votes
>> 17 Resources configured.
>> ============
>>
>> Online: [ orion2 orion7 ]
>> OFFLINE: [ orion1 ]
>>
>> Clone Set: dlm_clone [dlm]
>> Started: [ orion2 orion7 ]
>> Stopped: [ dlm:0 ]
>> Clone Set: clvm_clone [clvm]
>> Started: [ orion2 orion7 ]
>> Stopped: [ clvm:0 ]
>> sbd_stonith (stonith:external/sbd): Started orion2
>> Clone Set: cluvg1_clone [cluvg1]
>> Started: [ orion2 orion7 ]
>> Stopped: [ cluvg1:2 ]
>> styx (ocf::heartbeat:Xen): Started orion7
>> shib (ocf::heartbeat:Xen): Started orion7
>> wiki (ocf::heartbeat:Xen): Started orion2
>> horde (ocf::heartbeat:Xen): Started orion7
>> www (ocf::heartbeat:Xen): Started orion7
>> enventory (ocf::heartbeat:Xen): Started orion2
>> mailrelay (ocf::heartbeat:Xen): Started orion2
>>
>>
>>
>> crm configure show
>> node orion1 \
>> attributes standby="off"
>> node orion2 \
>> attributes standby="off"
>> node orion7 \
>> attributes standby="off"
>> primitive cluvg1 ocf:heartbeat:LVM \
>> params volgrpname="cluvg1" \
>> op start interval="0" timeout="240s" \
>> op stop interval="0" timeout="100s" \
>> meta target-role="Started"
>> primitive clvm ocf:lvm2:clvmd \
>> params daemon_timeout="30" \
>> op start interval="0" timeout="240s" \
>> op stop interval="0" timeout="100s" \
>> meta target-role="Started"
>> primitive dlm ocf:pacemaker:controld \
>> op monitor interval="120s" \
>> op start interval="0" timeout="240s" \
>> op stop interval="0" timeout="100s" \
>> meta target-role="Started"
>> primitive enventory ocf:heartbeat:Xen \
>> meta target-role="Started" allow-migrate="true" \
>> operations $id="enventory-operations" \
>> op monitor interval="10" timeout="30" \
>> op migrate_from interval="0" timeout="600" \
>> op migrate_to interval="0" timeout="600" \
>> params xmfile="/etc/xen/vm/enventory" shutdown_timeout="60"
>> primitive horde ocf:heartbeat:Xen \
>> meta target-role="Started" is-managed="true" allow-migrate="true" \
>> operations $id="horde-operations" \
>> op monitor interval="10" timeout="30" \
>> op migrate_from interval="0" timeout="600" \
>> op migrate_to interval="0" timeout="600" \
>> params xmfile="/etc/xen/vm/horde" shutdown_timeout="120"
>> primitive sbd_stonith stonith:external/sbd \
>> params
>> sbd_device="/dev/disk/by-id/scsi-360080e50001c150e0000019e4df6d4d5-part1" \
>> meta target-role="started"
>> ...
>> ...
>> ...
>> clone cluvg1_clone cluvg1 \
>> meta interleave="true" target-role="started" is-managed="true"
>> clone clvm_clone clvm \
>> meta globally-unique="false" interleave="true"
>> target-role="started"
>> clone dlm_clone dlm \
>> meta globally-unique="false" interleave="true"
>> target-role="started"
>> colocation cluvg1_with_clvm inf: cluvg1_clone clvm_clone
>> colocation clvm_with_dlm inf: clvm_clone dlm_clone
>> colocation enventory_with_cluvg1 inf: enventory cluvg1_clone
>> colocation horde_with_cluvg1 inf: horde cluvg1_clone
>> ...
>> ... more Xen VMs
>> ...
>> order cluvg1_before_enventory inf: cluvg1_clone enventory
>> order cluvg1_before_horde inf: cluvg1_clone horde
>> order clvm_before_cluvg1 inf: clvm_clone cluvg1_clone
>> order dlm_before_clvm inf: dlm_clone clvm_clone
>> property $id="cib-bootstrap-options" \
>> dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
>> cluster-infrastructure="openais" \
>> expected-quorum-votes="3" \
>> stonith-timeout="420s" \
>> last-lrm-refresh="1328792018"
>> rsc_defaults $id="rsc_defaults-options" \
>> resource-stickiness="10"
>> op_defaults $id="op_defaults-options" \
>> record-pending="false"
>>
>>
>
>
>
--
Karl Rößmann Tel. +49-711-689-1657
Max-Planck-Institut FKF Fax. +49-711-689-1632
Postfach 800 665
70506 Stuttgart email K.Roessmann at fkf.mpg.de