[Pacemaker] Dual-Primary DRBD with OCFS2 on SLES 11 SP1
Dejan Muhamedagic
dejanmm at fastmail.fm
Thu Sep 29 13:28:06 UTC 2011
Hi Darren,
On Thu, Sep 29, 2011 at 02:15:34PM +0100, Darren.Mansell at opengi.co.uk wrote:
> (Originally sent to DRBD-user, reposted here as it may be more relevant)
>
> Hello all.
>
> I'm implementing a 2-node cluster using Corosync/Pacemaker/DRBD/OCFS2
> for dual-primary shared FS.
>
> I've followed the instructions on the DRBD applications site and it
> works really well.
>
> However, if I 'pull the plug' on a node, the other node continues to
> operate the clones, but the filesystem is locked and inaccessible (the
> monitor op works for the filesystem but fails for the OCFS2 resource).
>
> If I reboot one node, there are no problems and I can continue to
> access the OCFS2 FS.
>
> After I pull the plug:
>
> Online: [ test-odp-02 ]
> OFFLINE: [ test-odp-01 ]
>
>  Resource Group: Load-Balancing
>      Virtual-IP-ODP    (ocf::heartbeat:IPaddr2):       Started test-odp-02
>      Virtual-IP-ODPWS  (ocf::heartbeat:IPaddr2):       Started test-odp-02
>      ldirectord        (ocf::heartbeat:ldirectord):    Started test-odp-02
>  Master/Slave Set: ms_drbd_ocfs2 [p_drbd_ocfs2]
>      Masters: [ test-odp-02 ]
>      Stopped: [ p_drbd_ocfs2:1 ]
>  Clone Set: cl-odp [odp]
>      Started: [ test-odp-02 ]
>      Stopped: [ odp:1 ]
>  Clone Set: cl-odpws [odpws]
>      Started: [ test-odp-02 ]
>      Stopped: [ odpws:1 ]
>  Clone Set: cl_fs_ocfs2 [p_fs_ocfs2]
>      Started: [ test-odp-02 ]
>      Stopped: [ p_fs_ocfs2:1 ]
>  Clone Set: cl_ocfs2mgmt [g_ocfs2mgmt]
>      Started: [ test-odp-02 ]
>      Stopped: [ g_ocfs2mgmt:1 ]
>
> Failed actions:
>     p_o2cb:0_monitor_10000 (node=test-odp-02, call=19, rc=-2, status=Timed Out): unknown exec error
>
> test-odp-02:~ # mount
> /dev/drbd0 on /opt/odp type ocfs2 (rw,_netdev,noatime,cluster_stack=pcmk)
>
> test-odp-02:~ # ls /opt/odp
> ...just hangs forever...
>
> If I then power test-odp-01 back on, everything fails back fine and the
> ls command suddenly completes.
>
> It seems to me that OCFS2 is trying to talk to the node that has
> disappeared and doesn't time out. Does anyone have any ideas? (attached
> CRM and DRBD configs)
With stonith disabled, I doubt that your cluster can behave as
it should. OCFS2 sits on top of the DLM, and the DLM blocks all
lock traffic until a failed node has been fenced; with
stonith-enabled=false that fencing never happens, so the
filesystem hangs exactly as you see, and unblocks only when the
node comes back.
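
On SLES 11 SP1 the usual fencing mechanism is SBD on a small
shared partition. A minimal sketch, assuming the SBD device has
already been initialized (the by-id path is a placeholder; older
external/sbd versions take the device as a resource parameter,
newer ones read it from /etc/sysconfig/sbd instead):

    primitive stonith-sbd stonith:external/sbd \
            params sbd_device="/dev/disk/by-id/<your-sbd-partition>"
    property stonith-enabled="true"

The device is initialized once with sbd -d <device> create, and
the sbd daemon has to be running on both nodes before the stonith
resource will do anything.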
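
Also, since this is dual-primary, DRBD itself should be wired
into cluster fencing so a disconnected peer is outdated before it
can diverge. A sketch of the relevant sections for your r0
resource, assuming the stock handler scripts shipped with DRBD
(adjust the paths if yours differ):

    disk {
            fencing resource-and-stonith;
    }
    handlers {
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
    }

crm-fence-peer.sh places a constraint that keeps the peer from
being promoted until it has resynced; crm-unfence-peer.sh removes
it again afterwards.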
Thanks,
Dejan
>
> Many thanks.
>
> Darren Mansell
>
Content-Description: crm.txt
> node test-odp-01
> node test-odp-02 \
>         attributes standby="off"
> primitive Virtual-IP-ODP ocf:heartbeat:IPaddr2 \
>         params lvs_support="true" ip="2.21.15.100" cidr_netmask="8" broadcast="2.255.255.255" \
>         op monitor interval="1m" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive Virtual-IP-ODPWS ocf:heartbeat:IPaddr2 \
>         params lvs_support="true" ip="2.21.15.103" cidr_netmask="8" broadcast="2.255.255.255" \
>         op monitor interval="1m" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive ldirectord ocf:heartbeat:ldirectord \
>         params configfile="/etc/ha.d/ldirectord.cf" \
>         op monitor interval="2m" timeout="20s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive odp lsb:odp \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive odpws lsb:odpws \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive p_controld ocf:pacemaker:controld \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive p_drbd_ocfs2 ocf:linbit:drbd \
>         params drbd_resource="r0" \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive p_fs_ocfs2 ocf:heartbeat:Filesystem \
>         params device="/dev/drbd/by-res/r0" directory="/opt/odp" fstype="ocfs2" options="rw,noatime" \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> primitive p_o2cb ocf:ocfs2:o2cb \
>         op monitor interval="10s" enabled="true" timeout="10s" \
>         meta migration-threshold="10" failure-timeout="600"
> group Load-Balancing Virtual-IP-ODP Virtual-IP-ODPWS ldirectord
> group g_ocfs2mgmt p_controld p_o2cb
> ms ms_drbd_ocfs2 p_drbd_ocfs2 \
>         meta master-max="2" clone-max="2" notify="true"
> clone cl-odp odp
> clone cl-odpws odpws
> clone cl_fs_ocfs2 p_fs_ocfs2 \
>         meta target-role="Started"
> clone cl_ocfs2mgmt g_ocfs2mgmt \
>         meta interleave="true"
> location Prefer-Node1 ldirectord \
>         rule $id="prefer-node1-rule" 100: #uname eq test-odp-01
> order o_ocfs2 inf: ms_drbd_ocfs2:promote cl_ocfs2mgmt:start cl_fs_ocfs2:start
> order tomcatlast1 inf: cl_fs_ocfs2 cl-odp
> order tomcatlast2 inf: cl_fs_ocfs2 cl-odpws
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.5-5bd2b9154d7d9f86d7f56fe0a74072a5a6590c60" \
>         cluster-infrastructure="openais" \
>         expected-quorum-votes="2" \
>         no-quorum-policy="ignore" \
>         start-failure-is-fatal="false" \
>         stonith-action="reboot" \
>         stonith-enabled="false" \
>         last-lrm-refresh="1317207361"
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker