[Pacemaker] Dual-Primary DRBD with OCFS2 on SLES 11 SP1
Dejan Muhamedagic
dejanmm at fastmail.fm
Thu Sep 29 13:28:06 UTC 2011
Hi Darren,
On Thu, Sep 29, 2011 at 02:15:34PM +0100, Darren.Mansell at opengi.co.uk wrote:
> (Originally sent to DRBD-user, reposted here as it may be more relevant)
>
> Hello all.
>
> I'm implementing a 2-node cluster using Corosync/Pacemaker/DRBD/OCFS2
> for dual-primary shared FS.
>
> I've followed the instructions on the DRBD applications site and it
> works really well.
>
> However, if I 'pull the plug' on a node, the other node continues to
> operate the clones, but the filesystem is locked and inaccessible (the
> monitor op works for the filesystem but fails for the OCFS2 resource).
>
> If I reboot one node, there are no problems and I can continue to
> access the OCFS2 FS.
>
> After I pull the plug:
>
> Online: [ test-odp-02 ]
> OFFLINE: [ test-odp-01 ]
>
>  Resource Group: Load-Balancing
>      Virtual-IP-ODP    (ocf::heartbeat:IPaddr2):       Started test-odp-02
>      Virtual-IP-ODPWS  (ocf::heartbeat:IPaddr2):       Started test-odp-02
>      ldirectord        (ocf::heartbeat:ldirectord):    Started test-odp-02
>  Master/Slave Set: ms_drbd_ocfs2 [p_drbd_ocfs2]
>      Masters: [ test-odp-02 ]
>      Stopped: [ p_drbd_ocfs2:1 ]
>  Clone Set: cl-odp [odp]
>      Started: [ test-odp-02 ]
>      Stopped: [ odp:1 ]
>  Clone Set: cl-odpws [odpws]
>      Started: [ test-odp-02 ]
>      Stopped: [ odpws:1 ]
>  Clone Set: cl_fs_ocfs2 [p_fs_ocfs2]
>      Started: [ test-odp-02 ]
>      Stopped: [ p_fs_ocfs2:1 ]
>  Clone Set: cl_ocfs2mgmt [g_ocfs2mgmt]
>      Started: [ test-odp-02 ]
>      Stopped: [ g_ocfs2mgmt:1 ]
>
> Failed actions:
>     p_o2cb:0_monitor_10000 (node=test-odp-02, call=19, rc=-2, status=Timed Out): unknown exec error
>
> test-odp-02:~ # mount
> /dev/drbd0 on /opt/odp type ocfs2 (rw,_netdev,noatime,cluster_stack=pcmk)
>
> test-odp-02:~ # ls /opt/odp
> ...just hangs forever...
>
> If I then power test-odp-01 back on, everything fails back fine and the
> ls command suddenly completes.
>
> It seems to me that OCFS2 is trying to talk to the node that has
> disappeared and doesn't time out. Does anyone have any ideas? (attached
> CRM and DRBD configs)
With stonith disabled, I doubt that your cluster can behave as
it should. OCFS2 sits on top of the DLM, and the DLM blocks all
lock traffic until a failed node has been fenced; with
stonith-enabled=false that fencing never happens, so the
filesystem hangs exactly as you see, and unblocks only when the
node comes back.
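
On SLES 11 SP1 the usual fencing mechanism is SBD on a small
shared partition. A minimal sketch, assuming the SBD device has
already been initialized (the by-id path is a placeholder; older
external/sbd versions take the device as a resource parameter,
newer ones read it from /etc/sysconfig/sbd instead):

    primitive stonith-sbd stonith:external/sbd \
            params sbd_device="/dev/disk/by-id/<your-sbd-partition>"
    property stonith-enabled="true"

The device is initialized once with sbd -d <device> create, and
the sbd daemon has to be running on both nodes before the stonith
resource will do anything.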
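
Also, since this is dual-primary, DRBD itself should be wired
into cluster fencing so a disconnected peer is outdated before it
can diverge. A sketch of the relevant sections for your r0
resource, assuming the stock handler scripts shipped with DRBD
(adjust the paths if yours differ):

    disk {
            fencing resource-and-stonith;
    }
    handlers {
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

crm-fence-peer.sh places a constraint that keeps the peer from
being promoted until it has resynced; crm-unfence-peer.sh removes
it again afterwards.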
Thanks,
Dejan
>
> Many thanks.
>
> Darren Mansell
>
Content-Description: crm.txt
> node test-odp-01
> node test-odp-02 \
>         attributes standby="off"
> primitive Virtual-IP-ODP ocf:heartbeat:IPaddr2 \
>         params lvs_support="true" ip="2.21.15.100" cidr_netmask="8" broadcast="2.255.255.255" \
>         op monitor interval="1m" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive Virtual-IP-ODPWS ocf:heartbeat:IPaddr2 \
>         params lvs_support="true" ip="2.21.15.103" cidr_netmask="8" broadcast="2.255.255.255" \
>         op monitor interval="1m" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive ldirectord ocf:heartbeat:ldirectord \
>         params configfile="/etc/ha.d/ldirectord.cf" \
>         op monitor interval="2m" timeout="20s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive odp lsb:odp \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive odpws lsb:odpws \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive p_controld ocf:pacemaker:controld \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive p_drbd_ocfs2 ocf:linbit:drbd \
>         params drbd_resource="r0" \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
>         params device="/dev/drbd/by-res/r0" directory="/opt/odp" fstype="ocfs2" options="rw,noatime" \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive p_o2cb ocf:ocfs2:o2cb \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> group Load-Balancing Virtual-IP-ODP Virtual-IP-ODPWS ldirectord
> group g_ocfs2mgmt p_controld p_o2cb
> ms ms_drbd_ocfs2 p_drbd_ocfs2 \
>         meta master-max="2" clone-max="2" notify="true"
> clone cl-odp odp
> clone cl-odpws odpws
> clone cl_fs_ocfs2 p_fs_ocfs2 \
>         meta target-role="Started"
> clone cl_ocfs2mgmt g_ocfs2mgmt \
>         meta interleave="true"
> location Prefer-Node1 ldirectord \
>         rule $id="prefer-node1-rule" 100: #uname eq test-odp-01
> order o_ocfs2 inf: ms_drbd_ocfs2:promote cl_ocfs2mgmt:start cl_fs_ocfs2:start
> order tomcatlast1 inf: cl_fs_ocfs2 cl-odp
> order tomcatlast2 inf: cl_fs_ocfs2 cl-odpws
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         no-quorum-policy="ignore" \
>         start-failure-is-fatal="false" \
>         stonith-action="reboot" \
>         stonith-enabled="false" \
>         last-lrm-refresh="1317207361"
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker