[Pacemaker] Nodes will not promote DRBD resources to master on failover

Andreas Kurz andreas at hastexo.com
Wed Apr 11 01:02:31 CEST 2012


On 04/10/2012 04:29 PM, Andrew Martin wrote:
> Hi Andreas,
> 
> ----- Original Message ----- 
> 
>> From: "Andreas Kurz" <andreas at hastexo.com>
>> To: pacemaker at oss.clusterlabs.org
>> Sent: Tuesday, April 10, 2012 5:28:15 AM
>> Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to
>> master on failover
> 
>> On 04/10/2012 06:17 AM, Andrew Martin wrote:
>>> Hi Andreas,
>>>
>>> Yes, I attempted to generalize hostnames and usernames/passwords in
>>> the
>>> archive. Sorry for making it more confusing :(
>>>
>>> I completely purged pacemaker from all 3 nodes and reinstalled
>>> everything. I then completely rebuilt the CIB by manually adding in
>>> each
>>> primitive/constraint one at a time and testing along the way. After
>>> doing this DRBD appears to be working at least somewhat better -
>>> the
>>> ocf:linbit:drbd devices are started and managed by pacemaker.
>>> However,
>>> if, for example, a node is STONITHed, when it comes back up it will not
>>> restart the ocf:linbit:drbd resources until I manually load the
>>> DRBD
>>> kernel module, bring the DRBD devices up (drbdadm up all), and
>>> cleanup
>>> the resources (e.g. crm resource cleanup ms_drbd_vmstore). Is it
>>> possible that the DRBD kernel module needs to be loaded at boot
>>> time,
>>> independent of pacemaker?
> 
>> No, this is done by the drbd OCF script on start.
> 
> 
>>>
>>> Here's the new CIB (mostly the same as before):
>>> http://pastebin.com/MxrqBXMp

There is that libvirt-bin upstart job resource, but it is not cloned, which
produces errors like:  Resource p_libvirt-bin (upstart::libvirt-bin) is active
on 2 nodes attempting recovery ...

I'd say letting upstart respawn libvirtd is quite fine. Removing this
primitive, and with it the group it belongs to and its dependencies, is OK.
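
A rough sketch of that cleanup, assuming the resource names from the posted CIB
(untested; edit the o_drbd-fs-vm order constraint first so it no longer
references cl_daemons):

crm configure edit o_drbd-fs-vm
crm configure delete cl_daemons g_daemons p_libvirt-bin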

>>>
>>> Typically quorumnode stays in the OFFLINE (standby) state, though
>>> occasionally it changes to pending. I have just tried
>>> cleaning /var/lib/heartbeat/crm on quorumnode again so we will see
>>> if
>>> that helps keep it in the OFFLINE (standby) state. I have it
>>> explicitly
>>> set to standby in the CIB configuration and also created a rule to
>>> prevent some of the resources from running on it:
>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
>>> attributes standby="on"
>>> ...
> 
>> The node should be in "ONLINE (standby)" state if you start heartbeat
>> and pacemaker is enabled with "crm yes" or "crm respawn" in ha.cf
> 
> I have never seen it listed as ONLINE (standby). Here's the ha.cf on quorumnode:
> autojoin none
> mcast eth0 239.0.0.43 694 1 0
> warntime 5
> deadtime 15
> initdead 60
> keepalive 2
> node node1
> node node2
> node quorumnode
> crm respawn
> 
> And here's the ha.cf on node[12]:
> autojoin none
> mcast br0 239.0.0.43 694 1 0
> bcast br1
> warntime 5
> deadtime 15
> initdead 60
> keepalive 2
> node node1
> node node2
> node quorumnode
> crm respawn
> respawn hacluster /usr/lib/heartbeat/dopd
> apiauth dopd gid=haclient uid=hacluster
> 
> The only difference between these boxes is that quorumnode is a CentOS 5.5 box so it is stuck at heartbeat 3.0.3, whereas node[12] are both on Ubuntu 10.04 using the Ubuntu HA PPA, so they are running heartbeat 3.0.5. Would this make a difference?
>

Hmmm ... heartbeat 3.0.3 is about 2 years old IIRC and there have been
some important fixes since then ... do you have any heartbeat logs from
quorumnode? Have you tried using ucast for br0/eth0?
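
For example, ucast entries in ha.cf would look roughly like this (a sketch; the
peer addresses are placeholders, and you need one line per peer and interface):

# on node1, for instance:
ucast br0 192.168.1.11
ucast br1 192.168.5.11
# plus a ucast line on br0 for quorumnode's LAN address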

Regards,
Andreas

-- 
Need help with Pacemaker?
http://www.hastexo.com/now

>>> location loc_not_on_quorumnode g_vm -inf: quorumnode
>>>
>>> Would it be wise to create additional constraints to prevent all
>>> resources (including each ms_drbd resource) from running on it,
>>> even
>>> though this should be implied by standby?
> 
>> There is no need for that. A node in standby will never run resources,
>> and if DRBD is not even installed on that node your resources won't
>> start there anyway.
> 
> I've removed this constraint
> 
>>>
>>> Below is a portion of the log from when I started a node yet DRBD
>>> failed
>>> to start. As you can see, it thinks the DRBD device is operating
>>> correctly and proceeds to start subsequent resources, e.g.
>>> Apr 9 20:22:55 node1 Filesystem[2939]: [2956]: WARNING: Couldn't
>>> find
>>> device [/dev/drbd0]. Expected /dev/??? to exist
>>> http://pastebin.com/zTCHPtWy
> 
>> The only thing I can read from those log fragments is that probes are
>> running ... not enough information. What would really be interesting are
>> the logs from the DC.
> 
> Here is the log from the DC for that same time period:
> http://pastebin.com/d4PGGLPi
> 
>>>
>>> After seeing these messages in the log I run
>>> # service drbd start
>>> # drbdadm up all
>>> # crm resource cleanup ms_drbd_vmstore
>>> # crm resource cleanup ms_drbd_mount1
>>> # crm resource cleanup ms_drbd_mount2
> 
>> None of that should be needed ... what is the output of "crm_mon -1frA"
>> before you do all those cleanups?
> 
> I will get this output the next time I can put the cluster in this state.
> 
>>> After this sequence of commands the DRBD resources appear to be
>>> functioning normally and the subsequent resources start. Any ideas
>>> on
>>> why DRBD is not being started as expected, or why the cluster is
>>> continuing to start resources that, according to the o_drbd-fs-vm
>>> constraint, should not start until DRBD is master?
> 
>> No idea, maybe creating a crm_report archive and sending it to the
>> list
>> can shed some light on that problem.
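
For example, roughly (the time window is just a placeholder covering the failed
start):

crm_report -f "2012-04-09 20:00" -t "2012-04-09 21:00" /tmp/drbd-promote-report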
> 
>> Regards,
>> Andreas
> 
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
> 
> Thanks,
> 
> Andrew
> 
>>>
>>> Thanks,
>>>
>>> Andrew
>>> ------------------------------------------------------------------------
>>> *From: *"Andreas Kurz" <andreas at hastexo.com>
>>> *To: *pacemaker at oss.clusterlabs.org
>>> *Sent: *Monday, April 2, 2012 6:33:44 PM
>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>> master on failover
>>>
>>> On 04/02/2012 05:47 PM, Andrew Martin wrote:
>>>> Hi Andreas,
>>>>
>>>> Here is the crm_report:
>>>> http://dl.dropbox.com/u/2177298/pcmk-Mon-02-Apr-2012.bz2
>>>
>>> You tried to do some obfuscation on parts of that archive? ... that
>>> doesn't really make it easier to debug ...
>>>
>>> Does the third node ever change its state?
>>>
>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending
>>>
>>> Looking at the logs, the transition graph says it aborts due to
>>> unrunnable operations on that node, which seems to be related to its
>>> pending state.
>>>
>>> Try to get that node up (or down) completely ... maybe a fresh
>>> start-over with a clean /var/lib/heartbeat/crm directory is
>>> sufficient.
>>>
>>> Regards,
>>> Andreas
>>>
>>>>
>>>> Hi Emmanuel,
>>>>
>>>> Here is the configuration:
>>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
>>>> attributes standby="off"
>>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
>>>> attributes standby="off"
>>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
>>>> attributes standby="on"
>>>> primitive p_drbd_mount2 ocf:linbit:drbd \
>>>> params drbd_resource="mount2" \
>>>> op start interval="0" timeout="240" \
>>>> op stop interval="0" timeout="100" \
>>>> op monitor interval="10" role="Master" timeout="20"
>>>> start-delay="1m" \
>>>> op monitor interval="20" role="Slave" timeout="20"
>>>> start-delay="1m"
>>>> primitive p_drbd_mount1 ocf:linbit:drbd \
>>>> params drbd_resource="mount1" \
>>>> op start interval="0" timeout="240" \
>>>> op stop interval="0" timeout="100" \
>>>> op monitor interval="10" role="Master" timeout="20"
>>>> start-delay="1m" \
>>>> op monitor interval="20" role="Slave" timeout="20"
>>>> start-delay="1m"
>>>> primitive p_drbd_vmstore ocf:linbit:drbd \
>>>> params drbd_resource="vmstore" \
>>>> op start interval="0" timeout="240" \
>>>> op stop interval="0" timeout="100" \
>>>> op monitor interval="10" role="Master" timeout="20"
>>>> start-delay="1m" \
>>>> op monitor interval="20" role="Slave" timeout="20"
>>>> start-delay="1m"
>>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
>>>> params device="/dev/drbd0" directory="/mnt/storage/vmstore"
>>> fstype="ext4" \
>>>> op start interval="0" timeout="60s" \
>>>> op stop interval="0" timeout="60s" \
>>>> op monitor interval="20s" timeout="40s"
>>>> primitive p_libvirt-bin upstart:libvirt-bin \
>>>> op monitor interval="30"
>>>> primitive p_ping ocf:pacemaker:ping \
>>>> params name="p_ping" host_list="192.168.3.1 192.168.3.2"
>>> multiplier="1000" \
>>>> op monitor interval="20s"
>>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
>>>> params email="me at example.com" \
>>>> params subject="Pacemaker Change" \
>>>> op start interval="0" timeout="10" \
>>>> op stop interval="0" timeout="10" \
>>>> op monitor interval="10" timeout="10"
>>>> primitive p_vm ocf:heartbeat:VirtualDomain \
>>>> params config="/mnt/storage/vmstore/config/vm.xml" \
>>>> meta allow-migrate="false" \
>>>> op start interval="0" timeout="180" \
>>>> op stop interval="0" timeout="180" \
>>>> op monitor interval="10" timeout="30"
>>>> primitive stonith-node1 stonith:external/tripplitepdu \
>>>> params pdu_ipaddr="192.168.3.100" pdu_port="1" pdu_username="xxx"
>>>> pdu_password="xxx" hostname_to_stonith="node1"
>>>> primitive stonith-node2 stonith:external/tripplitepdu \
>>>> params pdu_ipaddr="192.168.3.100" pdu_port="2" pdu_username="xxx"
>>>> pdu_password="xxx" hostname_to_stonith="node2"
>>>> group g_daemons p_libvirt-bin
>>>> group g_vm p_fs_vmstore p_vm
>>>> ms ms_drbd_mount2 p_drbd_mount2 \
>>>> meta master-max="1" master-node-max="1" clone-max="2"
>>>> clone-node-max="1"
>>>> notify="true"
>>>> ms ms_drbd_mount1 p_drbd_mount1 \
>>>> meta master-max="1" master-node-max="1" clone-max="2"
>>>> clone-node-max="1"
>>>> notify="true"
>>>> ms ms_drbd_vmstore p_drbd_vmstore \
>>>> meta master-max="1" master-node-max="1" clone-max="2"
>>>> clone-node-max="1"
>>>> notify="true"
>>>> clone cl_daemons g_daemons
>>>> clone cl_ping p_ping \
>>>> meta interleave="true"
>>>> clone cl_sysadmin_notify p_sysadmin_notify \
>>>> meta target-role="Started"
>>>> location l-st-node1 stonith-node1 -inf: node1
>>>> location l-st-node2 stonith-node2 -inf: node2
>>>> location l_run_on_most_connected p_vm \
>>>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping
>>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
>>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
>>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote
>>>> ms_drbd_mount1:promote
>>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start
>>>> property $id="cib-bootstrap-options" \
>>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>>>> cluster-infrastructure="Heartbeat" \
>>>> stonith-enabled="true" \
>>>> no-quorum-policy="freeze" \
>>>> last-lrm-refresh="1333041002" \
>>>> cluster-recheck-interval="5m" \
>>>> crmd-integration-timeout="3m" \
>>>> shutdown-escalation="5m"
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From: *"emmanuel segura" <emi2fast at gmail.com>
>>>> *To: *"The Pacemaker cluster resource manager"
>>>> <pacemaker at oss.clusterlabs.org>
>>>> *Sent: *Monday, April 2, 2012 9:43:20 AM
>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources
>>>> to
>>>> master on failover
>>>>
>>>> Sorry Andrew
>>>>
>>>> Can you post me your crm configure show again?
>>>>
>>>> Thanks
>>>>
>>>> On 30 March 2012 at 18:53, Andrew Martin <amartin at xes-inc.com
>>>> <mailto:amartin at xes-inc.com>> wrote:
>>>>
>>>> Hi Emmanuel,
>>>>
>>>> Thanks, that is a good idea. I updated the colocation constraint as
>>>> you described. Afterwards, the cluster remains in this state (with the
>>>> filesystem not mounted and the VM not started):
>>>> Online: [ node2 node1 ]
>>>>
>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>>> Masters: [ node1 ]
>>>> Slaves: [ node2 ]
>>>> Master/Slave Set: ms_drbd_tools [p_drbd_mount1]
>>>> Masters: [ node1 ]
>>>> Slaves: [ node2 ]
>>>> Master/Slave Set: ms_drbd_crm [p_drbd_mount2]
>>>> Masters: [ node1 ]
>>>> Slaves: [ node2 ]
>>>> Clone Set: cl_daemons [g_daemons]
>>>> Started: [ node2 node1 ]
>>>> Stopped: [ g_daemons:2 ]
>>>> stonith-node1 (stonith:external/tripplitepdu): Started node2
>>>> stonith-node2 (stonith:external/tripplitepdu): Started node1
>>>>
>>>> I noticed that Pacemaker had not issued "drbdadm connect" for any
>>>> of
>>>> the DRBD resources on node2
>>>> # service drbd status
>>>> drbd driver loaded OK; device status:
>>>> version: 8.3.7 (api:88/proto:86-91)
>>>> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by
>>>> root at node2, 2012-02-02 12:29:26
>>>> m:res cs ro ds p
>>>> mounted fstype
>>>> 0:vmstore StandAlone Secondary/Unknown Outdated/DUnknown r----
>>>> 1:mount1 StandAlone Secondary/Unknown Outdated/DUnknown r----
>>>> 2:mount2 StandAlone Secondary/Unknown Outdated/DUnknown r----
>>>> # drbdadm cstate all
>>>> StandAlone
>>>> StandAlone
>>>> StandAlone
>>>>
>>>> After manually issuing "drbdadm connect all" on node2 the rest of
>>>> the resources eventually started (several minutes later) on node1:
>>>> Online: [ node2 node1 ]
>>>>
>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>>> Masters: [ node1 ]
>>>> Slaves: [ node2 ]
>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>>> Masters: [ node1 ]
>>>> Slaves: [ node2 ]
>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>> Masters: [ node1 ]
>>>> Slaves: [ node2 ]
>>>> Resource Group: g_vm
>>>> p_fs_vmstore (ocf::heartbeat:Filesystem): Started node1
>>>> p_vm (ocf::heartbeat:VirtualDomain): Started node1
>>>> Clone Set: cl_daemons [g_daemons]
>>>> Started: [ node2 node1 ]
>>>> Stopped: [ g_daemons:2 ]
>>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
>>>> Started: [ node2 node1 ]
>>>> Stopped: [ p_sysadmin_notify:2 ]
>>>> stonith-node1 (stonith:external/tripplitepdu): Started node2
>>>> stonith-node2 (stonith:external/tripplitepdu): Started node1
>>>> Clone Set: cl_ping [p_ping]
>>>> Started: [ node2 node1 ]
>>>> Stopped: [ p_ping:2 ]
>>>>
>>>> The DRBD devices on node1 were all UpToDate, so it doesn't seem
>>>> right that it would need to wait for node2 to be connected before
>>>> it
>>>> could continue promoting additional resources. I then restarted
>>>> heartbeat on node2 to see if it would automatically connect the
>>>> DRBD
>>>> devices this time. After restarting it, the DRBD devices are not
>>>> even configured:
>>>> # service drbd status
>>>> drbd driver loaded OK; device status:
>>>> version: 8.3.7 (api:88/proto:86-91)
>>>> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by
>>>> root at webapps2host, 2012-02-02 12:29:26
>>>> m:res cs ro ds p mounted fstype
>>>> 0:vmstore Unconfigured
>>>> 1:mount1 Unconfigured
>>>> 2:mount2 Unconfigured
>>>>
>>>> Looking at the log I found this part about the drbd primitives:
>>>> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[2] on
>>>> p_drbd_vmstore:1 for client 10705: pid 11065 exited with return
>>>> code 7
>>>> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM
>>>> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11,
>>>> confirmed=true) not running
>>>> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[4] on
>>>> p_drbd_mount2:1 for client 10705: pid 11069 exited with return
>>>> code 7
>>>> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM
>>>> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=12,
>>>> confirmed=true) not running
>>>> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[3] on
>>>> p_drbd_mount1:1 for client 10705: pid 11066 exited with return
>>>> code 7
>>>> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM
>>>> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=13,
>>>> confirmed=true) not running
>>>>
>>>> I am not sure what exit code 7 is - is it possible to manually run
>>>> the monitor code or somehow obtain more debug about this? Here is
>>>> the complete log after restarting heartbeat on node2:
>>>> http://pastebin.com/KsHKi3GW
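
For reference: return code 7 is OCF_NOT_RUNNING, i.e. the probe simply found the
resource stopped. To run the agent's monitor action by hand, a rough sketch
(paths and the resource name assumed; crm_master may warn when run outside the
cluster):

OCF_ROOT=/usr/lib/ocf OCF_RESKEY_drbd_resource=vmstore \
    /usr/lib/ocf/resource.d/linbit/drbd monitor; echo $?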
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>>
>>> ------------------------------------------------------------------------
>>>> *From: *"emmanuel segura" <emi2fast at gmail.com
>>>> <mailto:emi2fast at gmail.com>>
>>>> *To: *"The Pacemaker cluster resource manager"
>>>> <pacemaker at oss.clusterlabs.org
>>>> <mailto:pacemaker at oss.clusterlabs.org>>
>>>> *Sent: *Friday, March 30, 2012 10:26:48 AM
>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources
>>>> to
>>>> master on failover
>>>>
>>>> I think this constraint is wrong
>>>> ==================================================
>>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
>>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
>>>> ===================================================
>>>>
>>>> change to
>>>> ======================================================
>>>> colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master
>>>> ms_drbd_mount1:Master ms_drbd_mount2:Master
>>>> =======================================================
>>>>
>>>> On 30 March 2012 at 17:16, Andrew Martin <amartin at xes-inc.com
>>>> <mailto:amartin at xes-inc.com>> wrote:
>>>>
>>>> Hi Emmanuel,
>>>>
>>>> Here is the output of crm configure show:
>>>> http://pastebin.com/NA1fZ8dL
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>>
>>> ------------------------------------------------------------------------
>>>> *From: *"emmanuel segura" <emi2fast at gmail.com
>>>> <mailto:emi2fast at gmail.com>>
>>>> *To: *"The Pacemaker cluster resource manager"
>>>> <pacemaker at oss.clusterlabs.org
>>>> <mailto:pacemaker at oss.clusterlabs.org>>
>>>> *Sent: *Friday, March 30, 2012 9:43:45 AM
>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources
>>>> to master on failover
>>>>
>>>> can you show me?
>>>>
>>>> crm configure show
>>>>
>>>> On 30 March 2012 at 16:10, Andrew Martin
>>>> <amartin at xes-inc.com <mailto:amartin at xes-inc.com>> wrote:
>>>>
>>>> Hi Andreas,
>>>>
>>>> Here is a copy of my complete CIB:
>>>> http://pastebin.com/v5wHVFuy
>>>>
>>>> I'll work on generating a report using crm_report as well.
>>>>
>>>> Thanks,
>>>>
>>>> Andrew
>>>>
>>>>
>>> ------------------------------------------------------------------------
>>>> *From: *"Andreas Kurz" <andreas at hastexo.com
>>>> <mailto:andreas at hastexo.com>>
>>>> *To: *pacemaker at oss.clusterlabs.org
>>>> <mailto:pacemaker at oss.clusterlabs.org>
>>>> *Sent: *Friday, March 30, 2012 4:41:16 AM
>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD
>>>> resources to master on failover
>>>>
>>>> On 03/28/2012 04:56 PM, Andrew Martin wrote:
>>>>> Hi Andreas,
>>>>>
>>>>> I disabled the DRBD init script and then restarted the
>>>> slave node
>>>>> (node2). After it came back up, DRBD did not start:
>>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4):
>>>> pending
>>>>> Online: [ node2 node1 ]
>>>>>
>>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>>>> Masters: [ node1 ]
>>>>> Stopped: [ p_drbd_vmstore:1 ]
>>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_tools]
>>>>> Masters: [ node1 ]
>>>>> Stopped: [ p_drbd_mount1:1 ]
>>>>> Master/Slave Set: ms_drbd_mount2 [p_drbdmount2]
>>>>> Masters: [ node1 ]
>>>>> Stopped: [ p_drbd_mount2:1 ]
>>>>> ...
>>>>>
>>>>> root at node2:~# service drbd status
>>>>> drbd not loaded
>>>>
>>>> Yes, expected unless Pacemaker starts DRBD
>>>>
>>>>>
>>>>> Is there something else I need to change in the CIB to
>>>> ensure that DRBD
>>>>> is started? All of my DRBD devices are configured like this:
>>>>> primitive p_drbd_mount2 ocf:linbit:drbd \
>>>>> params drbd_resource="mount2" \
>>>>> op monitor interval="15" role="Master" \
>>>>> op monitor interval="30" role="Slave"
>>>>> ms ms_drbd_mount2 p_drbd_mount2 \
>>>>> meta master-max="1" master-node-max="1"
>>> clone-max="2"
>>>>> clone-node-max="1" notify="true"
>>>>
>>>> That should be enough ... unable to say more without seeing
>>>> the complete
>>>> configuration ... too many fragments of information ;-)
>>>>
>>>> Please provide (e.g. pastebin) your complete cib (cibadmin
>>>> -Q) when
>>>> the cluster is in that state ... or, even better, create a
>>>> crm_report archive
>>>>
>>>>>
>>>>> Here is the output from the syslog (grep -i drbd
>>>> /var/log/syslog):
>>>>> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op:
>>>> Performing
>>>>> key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
>>>>> op=p_drbd_vmstore:1_monitor_0 )
>>>>> Mar 28 09:24:47 node2 lrmd: [3210]: info:
>>>> rsc:p_drbd_vmstore:1 probe[2]
>>>>> (pid 3455)
>>>>> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op:
>>>> Performing
>>>>> key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
>>>>> op=p_drbd_mount1:1_monitor_0 )
>>>>> Mar 28 09:24:48 node2 lrmd: [3210]: info:
>>>> rsc:p_drbd_mount1:1 probe[3]
>>>>> (pid 3456)
>>>>> Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op:
>>>> Performing
>>>>> key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
>>>>> op=p_drbd_mount2:1_monitor_0 )
>>>>> Mar 28 09:24:48 node2 lrmd: [3210]: info:
>>>> rsc:p_drbd_mount2:1 probe[4]
>>>>> (pid 3457)
>>>>> Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING:
>>>> Couldn't find
>>>>> device [/dev/drbd0]. Expected /dev/??? to exist
>>>>> Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked:
>>>>> crm_attribute -N node2 -n master-p_drbd_mount2:1 -l
>>> reboot -D
>>>>> Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked:
>>>>> crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l
>>> reboot -D
>>>>> Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked:
>>>>> crm_attribute -N node2 -n master-p_drbd_mount1:1 -l
>>> reboot -D
>>>>> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation
>>>> monitor[4] on
>>>>> p_drbd_mount2:1 for client 3213: pid 3457 exited with
>>>> return code 7
>>>>> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation
>>>> monitor[2] on
>>>>> p_drbd_vmstore:1 for client 3213: pid 3455 exited with
>>>> return code 7
>>>>> Mar 28 09:24:48 node2 crmd: [3213]: info:
>>>> process_lrm_event: LRM
>>>>> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7,
>>>> cib-update=10,
>>>>> confirmed=true) not running
>>>>> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation
>>>> monitor[3] on
>>>>> p_drbd_mount1:1 for client 3213: pid 3456 exited with
>>>> return code 7
>>>>> Mar 28 09:24:48 node2 crmd: [3213]: info:
>>>> process_lrm_event: LRM
>>>>> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7,
>>>> cib-update=11,
>>>>> confirmed=true) not running
>>>>> Mar 28 09:24:48 node2 crmd: [3213]: info:
>>>> process_lrm_event: LRM
>>>>> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7,
>>>> cib-update=12,
>>>>> confirmed=true) not running
>>>>
>>>> No errors, just probing ... so for some reason Pacemaker does not
>>>> want to start it ... use crm_simulate to find out why ... or provide
>>>> the information requested above.
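
For example, to see the allocation scores and planned actions from the live CIB
(a quick sketch):

crm_simulate -sL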
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>> --
>>>> Need help with Pacemaker?
>>>> http://www.hastexo.com/now
>>>>
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Andrew
>>>>>
>>>>>
>>>>
>>> ------------------------------------------------------------------------
>>>>> *From: *"Andreas Kurz" <andreas at hastexo.com
>>>> <mailto:andreas at hastexo.com>>
>>>>> *To: *pacemaker at oss.clusterlabs.org
>>>> <mailto:pacemaker at oss.clusterlabs.org>
>>>>> *Sent: *Wednesday, March 28, 2012 9:03:06 AM
>>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD
>>>> resources to
>>>>> master on failover
>>>>>
>>>>> On 03/28/2012 03:47 PM, Andrew Martin wrote:
>>>>>> Hi Andreas,
>>>>>>
>>>>>>> hmm ... what is that fence-peer script doing? If you
>>>> want to use
>>>>>>> resource-level fencing with the help of dopd, activate the
>>>>>>> drbd-peer-outdater script in the line above ... and
>>>> double check if the
>>>>>>> path is correct
>>>>>> fence-peer is just a wrapper for drbd-peer-outdater that
>>>> does some
>>>>>> additional logging. In my testing dopd has been working
>>> well.
>>>>>
>>>>> I see
>>>>>
>>>>>>
>>>>>>>> I am thinking of making the following changes to the
>>>> CIB (as per the
>>>>>>>> official DRBD
>>>>>>>> guide
>>>>>>
>>>>>
>>>>
>>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)
>>>> in
>>>>>>>> order to add the DRBD lsb service and require that it
>>>> start before the
>>>>>>>> ocf:linbit:drbd resources. Does this look correct?
>>>>>>>
>>>>>>> Where did you read that? No, deactivate the startup of
>>>> DRBD on system
>>>>>>> boot and let Pacemaker manage it completely.
>>>>>>>
>>>>>>>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>>>>>>>> colocation c_drbd_together inf:
>>>>>>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>>>>>>>> ms_drbd_mount2:Master
>>>>>>>> order drbd_init_first inf: ms_drbd_vmstore:promote
>>>>>>>> ms_drbd_mount1:promote ms_drbd_mount2:promote
>>>> p_drbd-init:start
>>>>>>>>
>>>>>>>> This doesn't seem to require that drbd be also running
>>>> on the node where
>>>>>>>> the ocf:linbit:drbd resources are slave (which it would
>>>> need to do to be
>>>>>>>> a DRBD SyncTarget) - how can I ensure that drbd is
>>>> running everywhere?
>>>>>>>> (clone cl_drbd p_drbd-init ?)
>>>>>>>
>>>>>>> This is really not needed.
>>>>>> I was following the official DRBD Users Guide:
>>>>>>
>>>>
>>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html
>>>>>>
>>>>>> If I am understanding your previous message correctly, I
>>>> do not need to
>>>>>> add an lsb primitive for the drbd daemon? It will be
>>>>>> started/stopped/managed automatically by my
>>>> ocf:linbit:drbd resources
>>>>>> (and I can remove the /etc/rc* symlinks)?
>>>>>
>>>>> Yes, you don't need that LSB script when using Pacemaker
>>>> and should not
>>>>> let init start it.
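
On Ubuntu 10.04 that would roughly be:

update-rc.d -f drbd remove    # drops the rc symlinks; the init script itself stays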
>>>>>
>>>>> Regards,
>>>>> Andreas
>>>>>
>>>>> --
>>>>> Need help with Pacemaker?
>>>>> http://www.hastexo.com/now
>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Andrew
>>>>>>
>>>>>>
>>>>
>>> ------------------------------------------------------------------------
>>>>>> *From: *"Andreas Kurz" <andreas at hastexo.com
>>>> <mailto:andreas at hastexo.com> <mailto:andreas at hastexo.com
>>>> <mailto:andreas at hastexo.com>>>
>>>>>> *To: *pacemaker at oss.clusterlabs.org
>>>> <mailto:pacemaker at oss.clusterlabs.org>
>>>> <mailto:pacemaker at oss.clusterlabs.org
>>>> <mailto:pacemaker at oss.clusterlabs.org>>
>>>>>> *Sent: *Wednesday, March 28, 2012 7:27:34 AM
>>>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD
>>>> resources to
>>>>>> master on failover
>>>>>>
>>>>>> On 03/28/2012 12:13 AM, Andrew Martin wrote:
>>>>>>> Hi Andreas,
>>>>>>>
>>>>>>> Thanks, I've updated the colocation rule to be in the
>>>> correct order. I
>>>>>>> also enabled the STONITH resource (this was temporarily
>>>> disabled before
>>>>>>> for some additional testing). DRBD has its own network
>>>> connection over
>>>>>>> the br1 interface (192.168.5.0/24
>>>> <http://192.168.5.0/24> network), a direct crossover cable
>>>>>>> between node1 and node2:
>>>>>>> global { usage-count no; }
>>>>>>> common {
>>>>>>> syncer { rate 110M; }
>>>>>>> }
>>>>>>> resource vmstore {
>>>>>>> protocol C;
>>>>>>> startup {
>>>>>>> wfc-timeout 15;
>>>>>>> degr-wfc-timeout 60;
>>>>>>> }
>>>>>>> handlers {
>>>>>>> #fence-peer
>>>> "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
>>>>>>> fence-peer "/usr/local/bin/fence-peer";
>>>>>>
>>>>>> hmm ... what is that fence-peer script doing? If you want
>>>> to use
>>>>>> resource-level fencing with the help of dopd, activate the
>>>>>> drbd-peer-outdater script in the line above ... and
>>>> double check if the
>>>>>> path is correct
>>>>>>
>>>>>>> split-brain
>>>> "/usr/lib/drbd/notify-split-brain.sh
>>>>>>> me at example.com <mailto:me at example.com>
>>>> <mailto:me at example.com <mailto:me at example.com>>";
>>>>>>> }
>>>>>>> net {
>>>>>>> after-sb-0pri discard-zero-changes;
>>>>>>> after-sb-1pri discard-secondary;
>>>>>>> after-sb-2pri disconnect;
>>>>>>> cram-hmac-alg md5;
>>>>>>> shared-secret "xxxxx";
>>>>>>> }
>>>>>>> disk {
>>>>>>> fencing resource-only;
>>>>>>> }
>>>>>>> on node1 {
>>>>>>> device /dev/drbd0;
>>>>>>> disk /dev/sdb1;
>>>>>>> address 192.168.5.10:7787
>>>> <http://192.168.5.10:7787>;
>>>>>>> meta-disk internal;
>>>>>>> }
>>>>>>> on node2 {
>>>>>>> device /dev/drbd0;
>>>>>>> disk /dev/sdf1;
>>>>>>> address 192.168.5.11:7787
>>>> <http://192.168.5.11:7787>;
>>>>>>> meta-disk internal;
>>>>>>> }
>>>>>>> }
>>>>>>> # and similar for mount1 and mount2
>>>>>>>
>>>>>>> Also, here is my ha.cf <http://ha.cf>. It uses both the
>>>> direct link between the nodes
>>>>>>> (br1) and the shared LAN network on br0 for communicating:
>>>>>>> autojoin none
>>>>>>> mcast br0 239.0.0.43 694 1 0
>>>>>>> bcast br1
>>>>>>> warntime 5
>>>>>>> deadtime 15
>>>>>>> initdead 60
>>>>>>> keepalive 2
>>>>>>> node node1
>>>>>>> node node2
>>>>>>> node quorumnode
>>>>>>> crm respawn
>>>>>>> respawn hacluster /usr/lib/heartbeat/dopd
>>>>>>> apiauth dopd gid=haclient uid=hacluster
>>>>>>>
>>>>>>> I am thinking of making the following changes to the CIB
>>>> (as per the
>>>>>>> official DRBD
>>>>>>> guide
>>>>>>
>>>>>
>>>>
>>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)
>>>> in
>>>>>>> order to add the DRBD lsb service and require that it
>>>> start before the
>>>>>>> ocf:linbit:drbd resources. Does this look correct?
>>>>>>
>>>>>> Where did you read that? No, deactivate the startup of
>>>> DRBD on system
>>>>>> boot and let Pacemaker manage it completely.
>>>>>>
>>>>>>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>>>>>>> colocation c_drbd_together inf:
>>>>>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>>>>>>> ms_drbd_mount2:Master
>>>>>>> order drbd_init_first inf: ms_drbd_vmstore:promote
>>>>>>> ms_drbd_mount1:promote ms_drbd_mount2:promote
>>>> p_drbd-init:start
>>>>>>>
>>>>>>> This doesn't seem to require that drbd be also running
>>>> on the node where
>>>>>>> the ocf:linbit:drbd resources are slave (which it would
>>>> need to do to be
>>>>>>> a DRBD SyncTarget) - how can I ensure that drbd is
>>>> running everywhere?
>>>>>>> (clone cl_drbd p_drbd-init ?)
>>>>>>
>>>>>> This is really not needed.
>>>>>>
>>>>>> Regards,
>>>>>> Andreas
>>>>>>
>>>>>> --
>>>>>> Need help with Pacemaker?
>>>>>> http://www.hastexo.com/now
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>>
>>>>>>> Andrew
>>>>>>>
>>>>
>>> ------------------------------------------------------------------------
>>>>>>> *From: *"Andreas Kurz" <andreas at hastexo.com
>>>> <mailto:andreas at hastexo.com> <mailto:andreas at hastexo.com
>>>> <mailto:andreas at hastexo.com>>>
>>>>>>> *To: *pacemaker at oss.clusterlabs.org
>>>> <mailto:pacemaker at oss.clusterlabs.org>
>>>>> <mailto:*pacemaker at oss.clusterlabs.org
>>>> <mailto:pacemaker at oss.clusterlabs.org>>
>>>>>>> *Sent: *Monday, March 26, 2012 5:56:22 PM
>>>>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD
>>>> resources to
>>>>>>> master on failover
>>>>>>>
>>>>>>> On 03/24/2012 08:15 PM, Andrew Martin wrote:
>>>>>>>> Hi Andreas,
>>>>>>>>
>>>>>>>> My complete cluster configuration is as follows:
>>>>>>>> ============
>>>>>>>> Last updated: Sat Mar 24 13:51:55 2012
>>>>>>>> Last change: Sat Mar 24 13:41:55 2012
>>>>>>>> Stack: Heartbeat
>>>>>>>> Current DC: node2
>>>> (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
>>>>>>>> with quorum
>>>>>>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>>>>>>>> 3 Nodes configured, unknown expected votes
>>>>>>>> 19 Resources configured.
>>>>>>>> ============
>>>>>>>>
>>>>>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4):
>>>> OFFLINE
>>>>> (standby)
>>>>>>>> Online: [ node2 node1 ]
>>>>>>>>
>>>>>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>>>>>>> Masters: [ node2 ]
>>>>>>>> Slaves: [ node1 ]
>>>>>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>>>>>>> Masters: [ node2 ]
>>>>>>>> Slaves: [ node1 ]
>>>>>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>>>>>> Masters: [ node2 ]
>>>>>>>> Slaves: [ node1 ]
>>>>>>>> Resource Group: g_vm
>>>>>>>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started
>>> node2
>>>>>>>> p_vm(ocf::heartbeat:VirtualDomain):Started node2
>>>>>>>> Clone Set: cl_daemons [g_daemons]
>>>>>>>> Started: [ node2 node1 ]
>>>>>>>> Stopped: [ g_daemons:2 ]
>>>>>>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
>>>>>>>> Started: [ node2 node1 ]
>>>>>>>> Stopped: [ p_sysadmin_notify:2 ]
>>>>>>>> stonith-node1(stonith:external/tripplitepdu):Started
>>> node2
>>>>>>>> stonith-node2(stonith:external/tripplitepdu):Started
>>> node1
>>>>>>>> Clone Set: cl_ping [p_ping]
>>>>>>>> Started: [ node2 node1 ]
>>>>>>>> Stopped: [ p_ping:2 ]
>>>>>>>>
>>>>>>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
>>>>>>>> attributes standby="off"
>>>>>>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
>>>>>>>> attributes standby="off"
>>>>>>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4"
>>>> quorumnode \
>>>>>>>> attributes standby="on"
>>>>>>>> primitive p_drbd_mount2 ocf:linbit:drbd \
>>>>>>>> params drbd_resource="mount2" \
>>>>>>>> op monitor interval="15" role="Master" \
>>>>>>>> op monitor interval="30" role="Slave"
>>>>>>>> primitive p_drbd_mount1 ocf:linbit:drbd \
>>>>>>>> params drbd_resource="mount1" \
>>>>>>>> op monitor interval="15" role="Master" \
>>>>>>>> op monitor interval="30" role="Slave"
>>>>>>>> primitive p_drbd_vmstore ocf:linbit:drbd \
>>>>>>>> params drbd_resource="vmstore" \
>>>>>>>> op monitor interval="15" role="Master" \
>>>>>>>> op monitor interval="30" role="Slave"
>>>>>>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
>>>>>>>> params device="/dev/drbd0" directory="/vmstore"
>>>> fstype="ext4" \
>>>>>>>> op start interval="0" timeout="60s" \
>>>>>>>> op stop interval="0" timeout="60s" \
>>>>>>>> op monitor interval="20s" timeout="40s"
>>>>>>>> primitive p_libvirt-bin upstart:libvirt-bin \
>>>>>>>> op monitor interval="30"
>>>>>>>> primitive p_ping ocf:pacemaker:ping \
>>>>>>>> params name="p_ping" host_list="192.168.1.10
>>>> 192.168.1.11"
>>>>>>>> multiplier="1000" \
>>>>>>>> op monitor interval="20s"
>>>>>>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
>>>>>>>> params email="me at example.com
>>>> <mailto:me at example.com> <mailto:me at example.com
>>>> <mailto:me at example.com>>" \
>>>>>>>> params subject="Pacemaker Change" \
>>>>>>>> op start interval="0" timeout="10" \
>>>>>>>> op stop interval="0" timeout="10" \
>>>>>>>> op monitor interval="10" timeout="10"
>>>>>>>> primitive p_vm ocf:heartbeat:VirtualDomain \
>>>>>>>> params config="/vmstore/config/vm.xml" \
>>>>>>>> meta allow-migrate="false" \
>>>>>>>> op start interval="0" timeout="120s" \
>>>>>>>> op stop interval="0" timeout="120s" \
>>>>>>>> op monitor interval="10" timeout="30"
>>>>>>>> primitive stonith-node1 stonith:external/tripplitepdu \
>>>>>>>> params pdu_ipaddr="192.168.1.12" pdu_port="1"
>>>> pdu_username="xxx"
>>>>>>>> pdu_password="xxx" hostname_to_stonith="node1"
>>>>>>>> primitive stonith-node2 stonith:external/tripplitepdu \
>>>>>>>> params pdu_ipaddr="192.168.1.12" pdu_port="2"
>>>> pdu_username="xxx"
>>>>>>>> pdu_password="xxx" hostname_to_stonith="node2"
>>>>>>>> group g_daemons p_libvirt-bin
>>>>>>>> group g_vm p_fs_vmstore p_vm
>>>>>>>> ms ms_drbd_mount2 p_drbd_mount2 \
>>>>>>>> meta master-max="1" master-node-max="1"
>>>> clone-max="2"
>>>>>>>> clone-node-max="1" notify="true"
>>>>>>>> ms ms_drbd_mount1 p_drbd_mount1 \
>>>>>>>> meta master-max="1" master-node-max="1"
>>>> clone-max="2"
>>>>>>>> clone-node-max="1" notify="true"
>>>>>>>> ms ms_drbd_vmstore p_drbd_vmstore \
>>>>>>>> meta master-max="1" master-node-max="1"
>>>> clone-max="2"
>>>>>>>> clone-node-max="1" notify="true"
>>>>>>>> clone cl_daemons g_daemons
>>>>>>>> clone cl_ping p_ping \
>>>>>>>> meta interleave="true"
>>>>>>>> clone cl_sysadmin_notify p_sysadmin_notify
>>>>>>>> location l-st-node1 stonith-node1 -inf: node1
>>>>>>>> location l-st-node2 stonith-node2 -inf: node2
>>>>>>>> location l_run_on_most_connected p_vm \
>>>>>>>> rule $id="l_run_on_most_connected-rule" p_ping:
>>>> defined p_ping
>>>>>>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
>>>>>>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
>>>>>>>
>>>>>>> As Emmanuel already said, g_vm has to come first in this
>>>>>>> colocation constraint ... g_vm must be colocated with
>>>>>>> the drbd masters.
>>>>>>>
>>>>>>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote
>>>> ms_drbd_mount1:promote
>>>>>>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start
>>>>>>>> property $id="cib-bootstrap-options" \
>>>>>>>>
>>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \
>>>>>>>> cluster-infrastructure="Heartbeat" \
>>>>>>>> stonith-enabled="false" \
>>>>>>>> no-quorum-policy="stop" \
>>>>>>>> last-lrm-refresh="1332539900" \
>>>>>>>> cluster-recheck-interval="5m" \
>>>>>>>> crmd-integration-timeout="3m" \
>>>>>>>> shutdown-escalation="5m"
>>>>>>>>
>>>>>>>> The STONITH plugin is a custom plugin I wrote for the
>>>> Tripp-Lite
>>>>>>>> PDUMH20ATNET that I'm using as the STONITH device:
>>>>>>>>
>>>>
>>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>>>>>>>
>>>>>>> And why aren't you using it? ... stonith-enabled="false"
>>>>>>>
>>>>>>>>
>>>>>>>> As you can see, I left the DRBD service to be started
>>>> by the operating
>>>>>>>> system (as an lsb script at boot time) however
>>>> Pacemaker controls
>>>>>>>> actually bringing up/taking down the individual DRBD
>>>> devices.
>>>>>>>
>>>>>>> Don't start drbd on system boot, give Pacemaker the full
>>>> control.
>>>>>>>
>>>>>>> The
>>>>>>>> behavior I observe is as follows: I issue "crm resource
>>>> migrate p_vm" on
>>>>>>>> node1 and failover successfully to node2. During this
>>>> time, node2 fences
>>>>>>>> node1's DRBD devices (using dopd) and marks them as
>>>> Outdated. Meanwhile
>>>>>>>> node2's DRBD devices are UpToDate. I then shutdown both
>>>> nodes and then
>>>>>>>> bring them back up. They reconnect to the cluster (with
>>>> quorum), and
>>>>>>>> node1's DRBD devices are still Outdated as expected and
>>>> node2's DRBD
>>>>>>>> devices are still UpToDate, as expected. At this point,
>>>> DRBD starts on
>>>>>>>> both nodes, however node2 will not set DRBD as master:
>>>>>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4):
>>>> OFFLINE
>>>>> (standby)
>>>>>>>> Online: [ node2 node1 ]
>>>>>>>>
>>>>>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>>>>>>> Slaves: [ node1 node2 ]
>>>>>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>>>>>>> Slaves: [ node1 node 2 ]
>>>>>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>>>>>> Slaves: [ node1 node2 ]
>>>>>>>
>>>>>>> There should really be no interruption of the drbd replication
>>>>>>> during vm migration that would activate the dopd ... does drbd have
>>>>>>> its own direct network connection?
>>>>>>>
>>>>>>> Please share your ha.cf <http://ha.cf> file and your
>>>> drbd configuration. Watch out for
>>>>>>> drbd messages in your kernel log file, that should give
>>>> you additional
>>>>>>> information when/why the drbd connection was lost.
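
For example:

grep -i drbd /var/log/kern.log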
>>>>>>>
>>>>>>> Regards,
>>>>>>> Andreas
>>>>>>>
>>>>>>> --
>>>>>>> Need help with Pacemaker?
>>>>>>> http://www.hastexo.com/now
>>>>>>>
>>>>>>>>
>>>>>>>> I am having trouble sorting through the logging
>>>> information because
>>>>>>>> there is so much of it in /var/log/daemon.log, but I
>>>> can't find an
>>>>>>>> error message printed about why it will not promote
>>>> node2. At this point
>>>>>>>> the DRBD devices are as follows:
>>>>>>>> node2: cstate = WFConnection dstate=UpToDate
>>>>>>>> node1: cstate = StandAlone dstate=Outdated
>>>>>>>>
>>>>>>>> I don't see any reason why node2 can't become DRBD
>>>> master, or am I
>>>>>>>> missing something? If I do "drbdadm connect all" on
>>>> node1, then the
>>>>>>>> cstate on both nodes changes to "Connected" and node2
>>>> immediately
>>>>>>>> promotes the DRBD resources to master. Any ideas on why
>>>> I'm observing
>>>>>>>> this incorrect behavior?
>>>>>>>>
>>>>>>>> Any tips on how I can better filter through the
>>>> pacemaker/heartbeat logs
>>>>>>>> or how to get additional useful debug information?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Andrew
>>>>>>>>
>>>>>>>>
>>>>
>>> ------------------------------------------------------------------------
>>>>>>>> *From: *"Andreas Kurz" <andreas at hastexo.com
>>>> <mailto:andreas at hastexo.com>
>>>>> <mailto:andreas at hastexo.com <mailto:andreas at hastexo.com>>>
>>>>>>>> *To: *pacemaker at oss.clusterlabs.org
>>>> <mailto:pacemaker at oss.clusterlabs.org>
>>>>>> <mailto:*pacemaker at oss.clusterlabs.org
>>>> <mailto:pacemaker at oss.clusterlabs.org>>
>>>>>>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
>>>>>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD
>>>> resources to
>>>>>>>> master on failover
>>>>>>>>
>>>>>>>> On 01/25/2012 08:58 PM, Andrew Martin wrote:
>>>>>>>>> Hello,
>>>>>>>>>
>>>>>>>>> Recently I finished configuring a two-node cluster
>>>> with pacemaker 1.1.6
>>>>>>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04.
>>>> This cluster
>>>>> includes
>>>>>>>>> the following resources:
>>>>>>>>> - primitives for DRBD storage devices
>>>>>>>>> - primitives for mounting the filesystem on the DRBD
>>>> storage
>>>>>>>>> - primitives for some mount binds
>>>>>>>>> - primitive for starting apache
>>>>>>>>> - primitives for starting samba and nfs servers
>>>> (following instructions
>>>>>>>>> here
>>>> <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
>>>>>>>>> - primitives for exporting nfs shares
>>>> (ocf:heartbeat:exportfs)
>>>>>>>>
>>>>>>>> not enough information ... please share at least your
>>>> complete cluster
>>>>>>>> configuration
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Andreas
>>>>>>>>
>>>>>>>> --
>>>>>>>> Need help with Pacemaker?
>>>>>>>> http://www.hastexo.com/now
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Perhaps this is best described through the output of
>>>> crm_mon:
>>>>>>>>> Online: [ node1 node2 ]
>>>>>>>>>
>>>>>>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>>> (unmanaged)
>>>>>>>>> p_drbd_mount1:0 (ocf::linbit:drbd):
>>>> Started node2
>>>>>>> (unmanaged)
>>>>>>>>> p_drbd_mount1:1 (ocf::linbit:drbd):
>>>> Started node1
>>>>>>>>> (unmanaged) FAILED
>>>>>>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>>>>>>>> p_drbd_mount2:0 (ocf::linbit:drbd):
>>>> Master node1
>>>>>>>>> (unmanaged) FAILED
>>>>>>>>> Slaves: [ node2 ]
>>>>>>>>> Resource Group: g_core
>>>>>>>>> p_fs_mount1 (ocf::heartbeat:Filesystem):
>>>> Started node1
>>>>>>>>> p_fs_mount2 (ocf::heartbeat:Filesystem):
>>>> Started node1
>>>>>>>>> p_ip_nfs (ocf::heartbeat:IPaddr2):
>>>> Started node1
>>>>>>>>> Resource Group: g_apache
>>>>>>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem):
>>>> Started node1
>>>>>>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem):
>>>> Started node1
>>>>>>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem):
>>>> Started node1
>>>>>>>>> p_fs_varwww (ocf::heartbeat:Filesystem):
>>>> Started node1
>>>>>>>>> p_apache (ocf::heartbeat:apache):
>>>> Started node1
>>>>>>>>> Resource Group: g_fileservers
>>>>>>>>> p_lsb_smb (lsb:smbd): Started node1
>>>>>>>>> p_lsb_nmb (lsb:nmbd): Started node1
>>>>>>>>> p_lsb_nfsserver (lsb:nfs-kernel-server):
>>>> Started node1
>>>>>>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs):
>>>> Started node1
>>>>>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs):
>>>> Started
>>>>> node1
>>>>>>>>>
>>>>>>>>> I have read through the Pacemaker Explained
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>> <http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained>
>>>>>>>>> documentation; however, I could not find a way to further
>>>> debug these
>>>>>>>>> problems. First, I put node1 into standby mode to
>>>> attempt failover to
>>>>>>>>> the other node (node2). Node2 appeared to start the
>>>> transition to
>>>>>>>>> master, however it failed to promote the DRBD
>>>> resources to master (the
>>>>>>>>> first step). I have attached a copy of this session in
>>>> commands.log and
>>>>>>>>> additional excerpts from /var/log/syslog during
>>>> important steps. I have
>>>>>>>>> attempted everything I can think of to try and start
>>>> the DRBD resource
>>>>>>>>> (e.g. start/stop/promote/manage/cleanup under crm
>>>> resource, restarting
>>>>>>>>> heartbeat) but cannot bring it out of the slave state.
>>>> However, if
>>>>> I set
>>>>>>>>> it to unmanaged and then run drbdadm primary all in
>>>> the terminal,
>>>>>>>>> pacemaker is satisfied and continues starting the rest
>>>> of the
>>>>> resources.
>>>>>>>>> It then failed when attempting to mount the filesystem
>>>> for mount2, the
>>>>>>>>> p_fs_mount2 resource. I attempted to mount the
>>>> filesystem myself
>>>>> and was
>>>>>>>>> successful. I then unmounted it and ran cleanup on
>>>> p_fs_mount2 and then
>>>>>>>>> it mounted. The rest of the resources started as
>>>> expected until the
>>>>>>>>> p_exportfs_mount2 resource, which failed as follows:
>>>>>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs):
>>>> started node2
>>>>>>>>> (unmanaged) FAILED
>>>>>>>>>
>>>>>>>>> I ran cleanup on this and it started, however when
>>>> running this test
>>>>>>>>> earlier today no command could successfully start this
>>>> exportfs
>>>>>> resource.
>>>>>>>>>
>>>>>>>>> How can I configure pacemaker to better resolve these
>>>> problems and be
>>>>>>>>> able to bring the node up successfully on its own?
>>>> What can I check to
>>>>>>>>> determine why these failures are occurring?
>>>> /var/log/syslog did not seem
>>>>>>>>> to contain very much useful information regarding why
>>>> the failures
>>>>>>>> occurred.
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>>
>>>>>>>>> Andrew
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org





