[Pacemaker] Trouble Starting Filesystem

Art Zemon art at hens-teeth.net
Tue Dec 4 18:11:13 EST 2012


Folks,

I am having trouble starting my DRBD+OCFS2 filesystem. It seems to be
a timing problem: the filesystem tries to come up before DRBD has
brought the second node of the cluster into Primary mode. I find this in
the logs:

    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: RA output:
    (p_fs_share:1:start:stderr) FATAL: Module scsi_hostadapter not found.
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: RA output:
    (p_fs_share:1:start:stderr) blockdev:
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: RA output:
    (p_fs_share:1:start:stderr) cannot open /dev/drbd/by-res/share
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: RA output:
    (p_fs_share:1:start:stderr) :
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: RA output:
    (p_fs_share:1:start:stderr) Wrong medium type
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: RA output:
    (p_fs_share:1:start:stderr) mount.ocfs2
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: RA output:
    (p_fs_share:1:start:stderr) :
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: RA output:
    (p_fs_share:1:start:stderr) I/O error on channel
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: RA output:
    (p_fs_share:1:start:stderr) 
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: RA output:
    (p_fs_share:1:start:stderr) while opening device /dev/drbd1
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: RA output:
    (p_fs_share:1:start:stderr)
    Dec  4 15:50:05 aztestc4 Filesystem[1631]: ERROR: Couldn't mount
    filesystem /dev/drbd/by-res/share on /share
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: WARN: Managed
    p_fs_share:1:start process 1631 exited with return code 1.
    Dec  4 15:50:05 aztestc4 lrmd: [1177]: info: operation start[15] on
    p_fs_share:1 for client 1180: pid 1631 exited with return code 1
    Dec  4 15:50:05 aztestc4 crmd: [1180]: debug:
    create_operation_update: do_update_resource: Updating resouce
    p_fs_share:1 after complete start op (interval=0)
    Dec  4 15:50:05 aztestc4 crmd: [1180]: info: process_lrm_event: LRM
    operation p_fs_share:1_start_0 (call=15, rc=1, cib-update=18,
    confirmed=true) unknown error

If I simply wait a little while (maybe a minute, maybe less) and then
run "crm resource cleanup cl_fs_share", the filesystem starts properly
on both nodes. Here are the pertinent parts of my configuration:

    primitive p_drbd_share ocf:linbit:drbd \
        params drbd_resource="share" \
        op monitor interval="15s" role="Master" timeout="20s" \
        op monitor interval="20s" role="Slave" timeout="20s" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s"
    primitive p_fs_share ocf:heartbeat:Filesystem \
        params device="/dev/drbd/by-res/share" directory="/share" fstype="ocfs2" options="rw,noatime" \
        op start interval="0" timeout="60" \
        op stop interval="0" timeout="60" \
        op monitor interval="20" timeout="40"
    primitive p_o2cb ocf:pacemaker:o2cb \
        params stack="cman" \
        op start interval="0" timeout="90" \
        op stop interval="0" timeout="100" \
        op monitor interval="10" timeout="20"
    ms ms_drbd_share p_drbd_share \
        meta master-max="2" notify="true" interleave="true" clone-max="2" is-managed="true" target-role="Started"
    clone cl_fs_share p_fs_share \
        meta interleave="true" notify="true" globally-unique="false" target-role="Started"
    clone cl_o2cb p_o2cb \
        meta interleave="true" globally-unique="false"
    order o_ocfs2 inf: ms_drbd_share:promote cl_o2cb
    order o_share inf: cl_o2cb cl_fs_share

Should I increase the timeout value in

    primitive p_fs_share ocf:heartbeat:Filesystem \
        ... \
        op start interval="0" timeout="60"

to take care of this? I am dubious, because I think cl_o2cb starts
(which in turn allows cl_fs_share to start) before ms_drbd_share has
finished promoting.
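
For what it is worth, the alternative I have in mind (untested sketch;
the constraint IDs and the colocation below are my own additions, not
part of the running config) would be to spell out the actions in the
order chain and tie the filesystem clone to the Master role explicitly:

    # untested sketch: explicit start/promote actions in the order chain,
    # plus a colocation of the filesystem clone with the DRBD Master role
    order o_ocfs2 inf: ms_drbd_share:promote cl_o2cb:start
    order o_share inf: cl_o2cb:start cl_fs_share:start
    colocation c_fs_on_drbd_master inf: cl_fs_share ms_drbd_share:Master

But I would rather understand why the existing chain lets cl_fs_share
start early before I change anything.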

Thanks,
    -- Art Z.

-- 

Art Zemon, President
Hen's Teeth Network <http://www.hens-teeth.net/> for reliable web
hosting and programming
(866)HENS-NET / (636)447-3030 ext. 200 / www.hens-teeth.net
