[Pacemaker] Trouble Starting Filesystem

Art Zemon art at hens-teeth.net
Mon Dec 10 18:07:04 EST 2012


Folks,
 
I am still struggling with this problem. At the moment, I cannot get my OCFS2 filesystem to start at all. OCFS2 worked until I expanded my cluster from 2 nodes to 4 nodes.

I see this in /var/log/syslog. In particular, note the "FATAL: Module scsi_hostadapter not found." on the last line.

Dec 10 16:48:03 aztestc1 crmd: [2416]: info: do_lrm_rsc_op: Performing key=71:14:0:a766cb8e-4813-483e-a127-d67cf25979ea op=p_fs_share_plesk:0_start_0 )
Dec 10 16:48:03 aztestc1 lrmd: [2413]: debug: on_msg_perform_op:2396: copying parameters for rsc p_fs_share_plesk:0
Dec 10 16:48:03 aztestc1 lrmd: [2413]: debug: on_msg_perform_op: add an operation operation start[29] on p_fs_share_plesk:0 for client 2416, its parameters: CRM_meta_notify_start_resource=[p_fs_share_plesk:0 p_fs_share_plesk:1 ] CRM_meta_notify_stop_resource=[ ] fstype=[ocfs2] CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_master_uname=[ ] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[60000] options=[rw,noatime] CRM_meta_name=[start] CRM_meta_notify_inactive_resource=[p_fs_share_plesk:0 p_fs_share_plesk:1 ] CRM_meta_notify_start_uname=[aztestc1 aztestc2 ] crm_feature_set=[3.0 to the operation list.
Dec 10 16:48:03 aztestc1 lrmd: [2413]: info: rsc:p_fs_share_plesk:0 start[29] (pid 4528)
Dec 10 16:48:03 aztestc1 crmd: [2416]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
Dec 10 16:48:03 aztestc1 crmd: [2416]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
Dec 10 16:48:03 aztestc1 lrmd: [2413]: debug: rsc:p_drbd_share_plesk:1 monitor[16] (pid 4530)
Dec 10 16:48:03 aztestc1 crmd: [2416]: debug: get_xpath_object: No match for //cib_update_result//diff-added//crm_config in /notify/cib_update_result/diff
Dec 10 16:48:03 aztestc1 Filesystem[4528]: INFO: Running start for /dev/drbd/by-res/shareplesk on /shareplesk
Dec 10 16:48:03 aztestc1 drbd[4530]: DEBUG: shareplesk: Calling /usr/sbin/crm_master -Q -l reboot -v 10000
Dec 10 16:48:03 aztestc1 lrmd: [2413]: info: RA output: (p_fs_share_plesk:0:start:stderr) FATAL: Module scsi_hostadapter not found.
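
In case anyone wants to reproduce this outside of Pacemaker: from the p_fs_share_plesk params in the config below, the agent should be doing the equivalent of

mount -t ocfs2 -o rw,noatime /dev/drbd/by-res/shareplesk /shareplesk    # same device/options as p_fs_share_plesk

so running that by hand on aztestc1 may show the real error hiding behind the modprobe noise.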



DRBD is running in dual-primary mode:

root@aztestc1:~# service drbd status
drbd driver loaded OK; device status:
version: 8.3.11 (api:88/proto:86-96)
srcversion: 71955441799F513ACA6DA60 
m:res         cs         ro               ds                 p  mounted  fstype
1:shareplesk  Connected  Primary/Primary  UpToDate/UpToDate  C
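
For a quick per-resource check, drbdadm should report the same thing:

drbdadm role shareplesk    # should say Primary/Primary on both aztestc1 and aztestc2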



Apart from the two failed filesystem starts at the bottom, everything looks happy:

root@aztestc1:~# crm_mon -1
============
Last updated: Mon Dec 10 16:59:40 2012
Last change: Mon Dec 10 16:48:02 2012 via crmd on aztestc3
Stack: cman
Current DC: aztestc3 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
4 Nodes configured, unknown expected votes
10 Resources configured.
============

Online: [ aztestc3 aztestc4 aztestc1 aztestc2 ]

 Clone Set: cl_fencing [p_stonith]
     Started: [ aztestc2 aztestc1 aztestc4 aztestc3 ]
 Clone Set: cl_o2cb [p_o2cb]
     Started: [ aztestc1 aztestc2 ]
 Master/Slave Set: ms_drbd_share_plesk [p_drbd_share_plesk]
     Masters: [ aztestc2 aztestc1 ]

Failed actions:
    p_fs_share_plesk:1_start_0 (node=aztestc2, call=31, rc=1, status=complete): unknown error
    p_fs_share_plesk:0_start_0 (node=aztestc1, call=29, rc=1, status=complete): unknown error
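
To retry after failures like these, I clear the failed actions the same way I did on the 2-node cluster:

crm resource cleanup cl_fs_share_plesk

but the start just fails again.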



Here is my complete configuration, which does not work:

node aztestc1 \
	attributes standby="off"
node aztestc2 \
	attributes standby="off"
node aztestc3 \
	attributes standby="off"
node aztestc4 \
	attributes standby="off"
primitive p_drbd_share_plesk ocf:linbit:drbd \
	params drbd_resource="shareplesk" \
	op monitor interval="15s" role="Master" timeout="20s" \
	op monitor interval="20s" role="Slave" timeout="20s" \
	op start interval="0" timeout="240s" \
	op stop interval="0" timeout="100s"
primitive p_fs_share_plesk ocf:heartbeat:Filesystem \
	params device="/dev/drbd/by-res/shareplesk" directory="/shareplesk" fstype="ocfs2" options="rw,noatime" \
	op start interval="0" timeout="60" \
	op stop interval="0" timeout="60" \
	op monitor interval="20" timeout="40"
primitive p_o2cb ocf:pacemaker:o2cb \
	params stack="cman" \
	op start interval="0" timeout="90" \
	op stop interval="0" timeout="100" \
	op monitor interval="10" timeout="20"
primitive p_stonith stonith:fence_ec2 \
	params pcmk_host_check="static-list" pcmk_host_list="aztestc1 aztestc2 aztestc3 aztestc4" \
	op monitor interval="600s" timeout="300s" \
	op start start-delay="10s" interval="0"
ms ms_drbd_share_plesk p_drbd_share_plesk \
	meta master-max="2" notify="true" interleave="true" clone-max="2" is-managed="true" target-role="Started"
clone cl_fencing p_stonith \
	meta target-role="Started"
clone cl_fs_share_plesk p_fs_share_plesk \
	meta clone-max="2" interleave="true" notify="true" globally-unique="false" target-role="Started"
clone cl_o2cb p_o2cb \
	meta clone-max="2" interleave="true" globally-unique="false" target-role="Started"
location lo_drbd_plesk3 ms_drbd_share_plesk -inf: aztestc3
location lo_drbd_plesk4 ms_drbd_share_plesk -inf: aztestc4
location lo_fs_plesk3 cl_fs_share_plesk -inf: aztestc3
location lo_fs_plesk4 cl_fs_share_plesk -inf: aztestc4
location lo_o2cb3 cl_o2cb -inf: aztestc3
location lo_o2cb4 cl_o2cb -inf: aztestc4
order o_20plesk inf: ms_drbd_share_plesk:promote cl_o2cb:start
order o_40fs_plesk inf: cl_o2cb cl_fs_share_plesk
property $id="cib-bootstrap-options" \
	stonith-enabled="true" \
	stonith-timeout="180s" \
	no-quorum-policy="freeze" \
	dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
	cluster-infrastructure="cman" \
	last-lrm-refresh="1355179514"
rsc_defaults $id="rsc-options" \
	resource-stickiness="100"
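
If it helps, the constraint set can be sanity-checked against the live CIB with crm_verify:

crm_verify -L -V    # validate the running cluster's configuration, verbosely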



And here is my previous 2-node configuration, which "mostly" worked. Sometimes I had to run "crm resource cleanup cl_fs_share" by hand to get the filesystem to mount, but otherwise everything was fine.

node aztestc1 \
	attributes standby="off"
node aztestc2 \
	attributes standby="off"
primitive p_drbd_share ocf:linbit:drbd \
	params drbd_resource="share" \
	op monitor interval="15s" role="Master" timeout="20s" \
	op monitor interval="20s" role="Slave" timeout="20s" \
	op start interval="0" timeout="240s" \
	op stop interval="0" timeout="100s"
primitive p_fs_share ocf:heartbeat:Filesystem \
	params device="/dev/drbd/by-res/share" directory="/share" fstype="ocfs2" options="rw,noatime" \
	op start interval="0" timeout="60" \
	op stop interval="0" timeout="60" \
	op monitor interval="20" timeout="40"
primitive p_o2cb ocf:pacemaker:o2cb \
	params stack="cman" \
	op start interval="0" timeout="90" \
	op stop interval="0" timeout="100" \
	op monitor interval="10" timeout="20"
primitive p_stonith stonith:fence_ec2 \
	params pcmk_host_check="static-list" pcmk_host_list="aztestc1 aztestc2" \
	op monitor interval="600s" timeout="300s" \
	op start start-delay="10s" interval="0"
ms ms_drbd_share p_drbd_share \
	meta master-max="2" notify="true" interleave="true" clone-max="2" is-managed="true" target-role="Started"
clone cl_fencing p_stonith \
	meta target-role="Started"
clone cl_fs_share p_fs_share \
	meta interleave="true" notify="true" globally-unique="false" target-role="Started"
clone cl_o2cb p_o2cb \
	meta interleave="true" globally-unique="false"
order o_ocfs2 inf: ms_drbd_share:promote cl_o2cb
order o_share inf: cl_o2cb cl_fs_share
property $id="cib-bootstrap-options" \
	stonith-enabled="true" \
	stonith-timeout="180s" \
	dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
	cluster-infrastructure="cman" \
	last-lrm-refresh="1354808774"
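
To make comparing the two easier: besides renaming "share" to "shareplesk" everywhere, the 4-node config adds

  + nodes aztestc3 and aztestc4 (also added to pcmk_host_list on p_stonith)
  + clone-max="2" on cl_fs_share_plesk and cl_o2cb
  + target-role="Started" on cl_o2cb
  + the six -inf location constraints keeping ms_drbd_share_plesk,
    cl_fs_share_plesk, and cl_o2cb off aztestc3 and aztestc4
  + no-quorum-policy="freeze"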


Thoughts? Ideas? Suggestions?

Thank you,
    -- Art Z.

--
Art Zemon, President
 Hen's Teeth Network (http://www.hens-teeth.net/) for reliable web hosting and programming
 (866)HENS-NET / (636)447-3030 ext. 200 / www.hens-teeth.net
 




