[Pacemaker] Nodes will not promote DRBD resources to master on failover

Sat Mar 24 21:31:13 UTC 2012

Hello

I think this constraint is wrong:

======================================
colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master \
        ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
======================================

It should look like this:
======================================
colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master \
        ms_drbd_mount1:Master ms_drbd_mount2:Master
======================================
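
If the ordering inside a multi-resource colocation is hard to reason about,
one possible alternative is to split it into two-resource constraints, where
the first resource is placed on the node where the second one runs in the
given role. This is only a sketch: the constraint names below are made up
for illustration and it has not been tested against this cluster.
======================================
# hypothetical constraint names; each one ties g_vm to one DRBD master
colocation c_vm_on_vmstore inf: g_vm ms_drbd_vmstore:Master
colocation c_vm_on_mount1 inf: g_vm ms_drbd_mount1:Master
colocation c_vm_on_mount2 inf: g_vm ms_drbd_mount2:Master
======================================
The existing order constraint would stay as it is; these lines would only
replace c_drbd_libvirt_vm.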

On 24 March 2012 at 20:15, Andrew Martin <amartin at xes-inc.com> wrote:

> Hi Andreas,
>
> My complete cluster configuration is as follows:
> ============
> Last updated: Sat Mar 24 13:51:55 2012
> Last change: Sat Mar 24 13:41:55 2012
> Stack: Heartbeat
> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition with quorum
> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
> 3 Nodes configured, unknown expected votes
> 19 Resources configured.
> ============
>
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
> Online: [ node2 node1 ]
>
>  Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>      Masters: [ node2 ]
>      Slaves: [ node1 ]
>  Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>      Masters: [ node2 ]
>      Slaves: [ node1 ]
>  Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>      Masters: [ node2 ]
>      Slaves: [ node1 ]
>  Resource Group: g_vm
>      p_fs_vmstore (ocf::heartbeat:Filesystem): Started node2
>      p_vm (ocf::heartbeat:VirtualDomain): Started node2
>  Clone Set: cl_daemons [g_daemons]
>      Started: [ node2 node1 ]
>      Stopped: [ g_daemons:2 ]
>  Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
>      Started: [ node2 node1 ]
>      Stopped: [ p_sysadmin_notify:2 ]
>  stonith-node1 (stonith:external/tripplitepdu): Started node2
>  stonith-node2 (stonith:external/tripplitepdu): Started node1
>  Clone Set: cl_ping [p_ping]
>      Started: [ node2 node1 ]
>      Stopped: [ p_ping:2 ]
>
> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
>         attributes standby="off"
> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
>         attributes standby="off"
> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
>         attributes standby="on"
> primitive p_drbd_mount2 ocf:linbit:drbd \
>         params drbd_resource="mount2" \
>         op monitor interval="15" role="Master" \
>         op monitor interval="30" role="Slave"
> primitive p_drbd_mount1 ocf:linbit:drbd \
>         params drbd_resource="mount1" \
>         op monitor interval="15" role="Master" \
>         op monitor interval="30" role="Slave"
> primitive p_drbd_vmstore ocf:linbit:drbd \
>         params drbd_resource="vmstore" \
>         op monitor interval="15" role="Master" \
>         op monitor interval="30" role="Slave"
> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
>         params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \
>         op start interval="0" timeout="60s" \
>         op stop interval="0" timeout="60s" \
>         op monitor interval="20s" timeout="40s"
> primitive p_libvirt-bin upstart:libvirt-bin \
>         op monitor interval="30"
> primitive p_ping ocf:pacemaker:ping \
>         params name="p_ping" host_list="192.168.1.10 192.168.1.11" \
>         multiplier="1000" \
>         op monitor interval="20s"
> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
>         params email="me at example.com" \
>         params subject="Pacemaker Change" \
>         op start interval="0" timeout="10" \
>         op stop interval="0" timeout="10" \
>         op monitor interval="10" timeout="10"
> primitive p_vm ocf:heartbeat:VirtualDomain \
>         params config="/vmstore/config/vm.xml" \
>         meta allow-migrate="false" \
>         op start interval="0" timeout="120s" \
>         op stop interval="0" timeout="120s" \
>         op monitor interval="10" timeout="30"
> primitive stonith-node1 stonith:external/tripplitepdu \
>         params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx" \
>         pdu_password="xxx" hostname_to_stonith="node1"
> primitive stonith-node2 stonith:external/tripplitepdu \
>         params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx" \
>         pdu_password="xxx" hostname_to_stonith="node2"
> group g_daemons p_libvirt-bin
> group g_vm p_fs_vmstore p_vm
> ms ms_drbd_mount2 p_drbd_mount2 \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true"
> ms ms_drbd_mount1 p_drbd_mount1 \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true"
> ms ms_drbd_vmstore p_drbd_vmstore \
>         meta master-max="1" master-node-max="1" clone-max="2" \
>         clone-node-max="1" notify="true"
> clone cl_daemons g_daemons
> clone cl_ping p_ping \
>         meta interleave="true"
> clone cl_sysadmin_notify p_sysadmin_notify
> location l-st-node1 stonith-node1 -inf: node1
> location l-st-node2 stonith-node2 -inf: node2
> location l_run_on_most_connected p_vm \
>         rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master \
>         ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote \
>         ms_drbd_mount2:promote cl_daemons:start g_vm:start
> property $id="cib-bootstrap-options" \
>         dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>         cluster-infrastructure="Heartbeat" \
>         stonith-enabled="false" \
>         no-quorum-policy="stop" \
>         last-lrm-refresh="1332539900" \
>         cluster-recheck-interval="5m" \
>         crmd-integration-timeout="3m" \
>         shutdown-escalation="5m"
>
> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
> PDUMH20ATNET that I'm using as the STONITH device:
> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>
> As you can see, I left the DRBD service to be started by the operating
> system (as an LSB script at boot time); however, Pacemaker controls
> bringing the individual DRBD devices up and down. The behavior I observe
> is as follows: I issue "crm resource migrate p_vm" on node1 and fail over
> successfully to node2. During this time, node2 fences node1's DRBD devices
> (using dopd) and marks them as Outdated. Meanwhile, node2's DRBD devices are
> UpToDate. I then shut down both nodes and bring them back up. They
> reconnect to the cluster (with quorum); node1's DRBD devices are still
> Outdated and node2's DRBD devices are still UpToDate, both as expected.
> At this point, DRBD starts on both nodes, but node2 will not promote
> DRBD to master:
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE (standby)
> Online: [ node2 node1 ]
>
>  Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>      Slaves: [ node1 node2 ]
>  Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>      Slaves: [ node1 node2 ]
>  Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>      Slaves: [ node1 node2 ]
>
> I am having trouble sorting through the logging information because there
> is so much of it in /var/log/daemon.log, and I can't find an error message
> explaining why node2 is not promoted. At this point the DRBD devices
> are as follows:
> node2: cstate = WFConnection dstate=UpToDate
> node1: cstate = StandAlone dstate=Outdated
>
> I don't see any reason why node2 can't become DRBD master, or am I missing
> something? If I do "drbdadm connect all" on node1, then the cstate on both
> nodes changes to "Connected" and node2 immediately promotes the DRBD
> resources to master. Any ideas on why I'm observing this incorrect behavior?
>
> Any tips on how I can better filter through the pacemaker/heartbeat logs
> or how to get additional useful debug information?
>
> Thanks,
>
> Andrew
>
> ------------------------------
> *From: *"Andreas Kurz" <andreas at hastexo.com>
> *To: *pacemaker at oss.clusterlabs.org
> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> On 01/25/2012 08:58 PM, Andrew Martin wrote:
> > Hello,
> >
> > Recently I finished configuring a two-node cluster with pacemaker 1.1.6
> > and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster includes
> > the following resources:
> > - primitives for DRBD storage devices
> > - primitives for mounting the filesystem on the DRBD storage
> > - primitives for some mount binds
> > - primitive for starting apache
> > - primitives for starting samba and nfs servers (following instructions
> > here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
> > - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>
> not enough information ... please share at least your complete cluster
> configuration
>
> Regards,
> Andreas
>
> --
> Need help with Pacemaker?
> http://www.hastexo.com/now
>
> >
> > Perhaps this is best described through the output of crm_mon:
> > Online: [ node1 node2 ]
> >
> >  Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
> >      p_drbd_mount1:0     (ocf::linbit:drbd):     Started node2 (unmanaged)
> >      p_drbd_mount1:1     (ocf::linbit:drbd):     Started node1 (unmanaged) FAILED
> >  Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> >      p_drbd_mount2:0       (ocf::linbit:drbd):     Master node1 (unmanaged) FAILED
> >      Slaves: [ node2 ]
> >  Resource Group: g_core
> >      p_fs_mount1 (ocf::heartbeat:Filesystem):    Started node1
> >      p_fs_mount2   (ocf::heartbeat:Filesystem):    Started node1
> >      p_ip_nfs   (ocf::heartbeat:IPaddr2):       Started node1
> >  Resource Group: g_apache
> >      p_fs_mountbind1    (ocf::heartbeat:Filesystem):    Started node1
> >      p_fs_mountbind2    (ocf::heartbeat:Filesystem):    Started node1
> >      p_fs_mountbind3    (ocf::heartbeat:Filesystem):    Started node1
> >      p_fs_varwww        (ocf::heartbeat:Filesystem):    Started node1
> >      p_apache   (ocf::heartbeat:apache):        Started node1
> >  Resource Group: g_fileservers
> >      p_lsb_smb  (lsb:smbd):     Started node1
> >      p_lsb_nmb  (lsb:nmbd):     Started node1
> >      p_lsb_nfsserver    (lsb:nfs-kernel-server):        Started node1
> >      p_exportfs_mount1   (ocf::heartbeat:exportfs):      Started node1
> >      p_exportfs_mount2     (ocf::heartbeat:exportfs):      Started node1
> >
> > I have read through the Pacemaker Explained
> > <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
> > documentation, but could not find a way to further debug these
> > problems. First, I put node1 into standby mode to attempt failover to
> > the other node (node2). Node2 appeared to start the transition to
> > master, however it failed to promote the DRBD resources to master (the
> > first step). I have attached a copy of this session in commands.log and
> > additional excerpts from /var/log/syslog during important steps. I have
> > attempted everything I can think of to try and start the DRBD resource
> > (e.g. start/stop/promote/manage/cleanup under crm resource, restarting
> > heartbeat) but cannot bring it out of the slave state. However, if I set
> > it to unmanaged and then run drbdadm primary all in the terminal,
> > pacemaker is satisfied and continues starting the rest of the resources.
> > It then failed when attempting to mount the filesystem for mount2, the
> > p_fs_mount2 resource. I attempted to mount the filesystem myself and was
> > successful. I then unmounted it and ran cleanup on p_fs_mount2 and then
> > it mounted. The rest of the resources started as expected until the
> > p_exportfs_mount2 resource, which failed as follows:
> > p_exportfs_mount2     (ocf::heartbeat:exportfs):      started node2 (unmanaged) FAILED
> >
> > I ran cleanup on this and it started; however, when running this test
> > earlier today no command could successfully start this exportfs
> > resource.
> >
> > How can I configure pacemaker to better resolve these problems and be
> > able to bring the node up successfully on its own? What can I check to
> > determine why these failures are occurring? /var/log/syslog did not seem
> > to contain very much useful information regarding why the failures
> > occurred.
> >
> > Thanks,
> >
> > Andrew
> >
> >
> >
> >
>
>
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
>
>

-- 
this is my life and I will live it for as long as God wills