[Pacemaker] Nodes will not promote DRBD resources to master on failover

Andrew Martin amartin at xes-inc.com
Mon Apr 2 17:47:22 CEST 2012


Hi Andreas, 


Here is the crm_report: 
http://dl.dropbox.com/u/2177298/pcmk-Mon-02-Apr-2012.bz2 

Hi Emmanuel, 


Here is the configuration: 

node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \ 
attributes standby="off" 
node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \ 
attributes standby="off" 
node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \ 
attributes standby="on" 
primitive p_drbd_mount2 ocf:linbit:drbd \ 
params drbd_resource="mount2" \ 
op start interval="0" timeout="240" \ 
op stop interval="0" timeout="100" \ 
op monitor interval="10" role="Master" timeout="20" start-delay="1m" \ 
op monitor interval="20" role="Slave" timeout="20" start-delay="1m" 
primitive p_drbd_mount1 ocf:linbit:drbd \ 
params drbd_resource="mount1" \ 
op start interval="0" timeout="240" \ 
op stop interval="0" timeout="100" \ 
op monitor interval="10" role="Master" timeout="20" start-delay="1m" \ 
op monitor interval="20" role="Slave" timeout="20" start-delay="1m" 
primitive p_drbd_vmstore ocf:linbit:drbd \ 
params drbd_resource="vmstore" \ 
op start interval="0" timeout="240" \ 
op stop interval="0" timeout="100" \ 
op monitor interval="10" role="Master" timeout="20" start-delay="1m" \ 
op monitor interval="20" role="Slave" timeout="20" start-delay="1m" 
primitive p_fs_vmstore ocf:heartbeat:Filesystem \ 
params device="/dev/drbd0" directory="/mnt/storage/vmstore" fstype="ext4" \ 
op start interval="0" timeout="60s" \ 
op stop interval="0" timeout="60s" \ 
op monitor interval="20s" timeout="40s" 
primitive p_libvirt-bin upstart:libvirt-bin \ 
op monitor interval="30" 
primitive p_ping ocf:pacemaker:ping \ 
params name="p_ping" host_list="192.168.3.1 192.168.3.2" multiplier="1000" \ 
op monitor interval="20s" 
primitive p_sysadmin_notify ocf:heartbeat:MailTo \ 
params email="me at example.com" \ 
params subject="Pacemaker Change" \ 
op start interval="0" timeout="10" \ 
op stop interval="0" timeout="10" \ 
op monitor interval="10" timeout="10" 
primitive p_vm ocf:heartbeat:VirtualDomain \ 
params config="/mnt/storage/vmstore/config/vm.xml" \ 
meta allow-migrate="false" \ 
op start interval="0" timeout="180" \ 
op stop interval="0" timeout="180" \ 
op monitor interval="10" timeout="30" 
primitive stonith-node1 stonith:external/tripplitepdu \ 
params pdu_ipaddr="192.168.3.100" pdu_port="1" pdu_username="xxx" pdu_password="xxx" hostname_to_stonith="node1" 
primitive stonith-node2 stonith:external/tripplitepdu \ 
params pdu_ipaddr="192.168.3.100" pdu_port="2" pdu_username="xxx" pdu_password="xxx" hostname_to_stonith="node2" 
group g_daemons p_libvirt-bin 
group g_vm p_fs_vmstore p_vm 
ms ms_drbd_mount2 p_drbd_mount2 \ 
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" 
ms ms_drbd_mount1 p_drbd_mount1 \ 
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" 
ms ms_drbd_vmstore p_drbd_vmstore \ 
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" 
clone cl_daemons g_daemons 
clone cl_ping p_ping \ 
meta interleave="true" 
clone cl_sysadmin_notify p_sysadmin_notify \ 
meta target-role="Started" 
location l-st-node1 stonith-node1 -inf: node1 
location l-st-node2 stonith-node2 -inf: node2 
location l_run_on_most_connected p_vm \ 
rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping 
colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm 
order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote ms_drbd_mount2:promote cl_daemons:start g_vm:start 
property $id="cib-bootstrap-options" \ 
dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \ 
cluster-infrastructure="Heartbeat" \ 
stonith-enabled="true" \ 
no-quorum-policy="freeze" \ 
last-lrm-refresh="1333041002" \ 
cluster-recheck-interval="5m" \ 
crmd-integration-timeout="3m" \ 
shutdown-escalation="5m" 


Thanks, 


Andrew 


----- Original Message -----

From: "emmanuel segura" <emi2fast at gmail.com> 
To: "The Pacemaker cluster resource manager" <pacemaker at oss.clusterlabs.org> 
Sent: Monday, April 2, 2012 9:43:20 AM 
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover 

Sorry Andrew 

Can you post the output of crm configure show again?

Thanks 


On 30 March 2012 18:53, Andrew Martin < amartin at xes-inc.com > wrote:




Hi Emmanuel, 


Thanks, that is a good idea. I updated the colocation constraint as you described. Afterwards, the cluster remained in this state (with the filesystem not mounted and the VM not started):
Online: [ node2 node1 ] 


Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] 
Masters: [ node1 ] 
Slaves: [ node2 ] 
Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] 
Masters: [ node1 ] 
Slaves: [ node2 ] 
Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
Masters: [ node1 ] 
Slaves: [ node2 ] 
Clone Set: cl_daemons [g_daemons] 
Started: [ node2 node1 ] 
Stopped: [ g_daemons:2 ] 
stonith-node1 (stonith:external/tripplitepdu): Started node2 
stonith-node2 (stonith:external/tripplitepdu): Started node1 


I noticed that Pacemaker had not issued "drbdadm connect" for any of the DRBD resources on node2:

# service drbd status 
drbd driver loaded OK; device status: 
version: 8.3.7 (api:88/proto:86-91) 
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at node2, 2012-02-02 12:29:26 
m:res cs ro ds p mounted fstype 
0:vmstore StandAlone Secondary/Unknown Outdated/DUnknown r---- 
1:mount1 StandAlone Secondary/Unknown Outdated/DUnknown r---- 
2:mount2 StandAlone Secondary/Unknown Outdated/DUnknown r---- 
# drbdadm cstate all 
StandAlone 
StandAlone 
StandAlone 
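
As a small illustration of how the status output above can be checked, the following sketch lists the resources that are still in the StandAlone connection state. The function name standalone_resources is hypothetical (it is not part of DRBD), and the column layout is assumed to match the "m:res cs ro ds ..." format shown above.

```shell
# Print the names of DRBD resources whose connection state is StandAlone.
# Reads "service drbd status"-style lines (m:res cs ro ds ...) on stdin.
# standalone_resources is a hypothetical helper name, not a DRBD command.
standalone_resources() {
    # Field 1 is "<minor>:<resource>", field 2 is the connection state.
    awk '$2 == "StandAlone" { sub(/^[0-9]+:/, "", $1); print $1 }'
}
```

Typical use would be "service drbd status | standalone_resources"; each name printed could then be passed to "drbdadm connect", as was done by hand here.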


After manually issuing "drbdadm connect all" on node2 the rest of the resources eventually started (several minutes later) on node1: 

Online: [ node2 node1 ] 


Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] 
Masters: [ node1 ] 
Slaves: [ node2 ] 
Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] 
Masters: [ node1 ] 
Slaves: [ node2 ] 
Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
Masters: [ node1 ] 
Slaves: [ node2 ] 
Resource Group: g_vm 
p_fs_vmstore (ocf::heartbeat:Filesystem): Started node1 
p_vm (ocf::heartbeat:VirtualDomain): Started node1 
Clone Set: cl_daemons [g_daemons] 
Started: [ node2 node1 ] 
Stopped: [ g_daemons:2 ] 
Clone Set: cl_sysadmin_notify [p_sysadmin_notify] 
Started: [ node2 node1 ] 
Stopped: [ p_sysadmin_notify:2 ] 
stonith-node1 (stonith:external/tripplitepdu): Started node2 
stonith-node2 (stonith:external/tripplitepdu): Started node1 
Clone Set: cl_ping [p_ping] 
Started: [ node2 node1 ] 
Stopped: [ p_ping:2 ] 


The DRBD devices on node1 were all UpToDate, so it doesn't seem right that it would need to wait for node2 to be connected before it could continue promoting additional resources. I then restarted heartbeat on node2 to see if it would automatically connect the DRBD devices this time. After restarting it, the DRBD devices are not even configured: 

# service drbd status 
drbd driver loaded OK; device status: 
version: 8.3.7 (api:88/proto:86-91) 
GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at webapps2host, 2012-02-02 12:29:26 
m:res cs ro ds p mounted fstype 
0:vmstore Unconfigured 
1:mount1 Unconfigured 
2:mount2 Unconfigured 


Looking at the log I found this part about the drbd primitives: 

Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[2] on p_drbd_vmstore:1 for client 10705: pid 11065 exited with return code 7 
Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11, confirmed=true) not running 
Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[4] on p_drbd_mount2:1 for client 10705: pid 11069 exited with return code 7 
Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=12, confirmed=true) not running 
Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[3] on p_drbd_mount1:1 for client 10705: pid 11066 exited with return code 7 
Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=13, confirmed=true) not running 


I am not sure what exit code 7 means. Is it possible to run the monitor operation manually, or otherwise obtain more debug output about this? Here is the complete log after restarting heartbeat on node2:
http://pastebin.com/KsHKi3GW 
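
For reference, return code 7 in the OCF resource agent convention is OCF_NOT_RUNNING. For a probe (the monitor_0 operations in the log above) it reports that the resource is cleanly stopped, which is not an error in itself. The sketch below maps the standard OCF codes to their names; ocf_rc_name is a hypothetical helper, and the agent path in the trailing comment is an assumption about a typical installation.

```shell
# Map an OCF resource-agent exit code to its symbolic name.
# ocf_rc_name is a hypothetical helper, not part of Pacemaker.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "unknown ($1)" ;;
    esac
}

# Running the agent by hand (assumed paths; a master/slave agent may also
# expect further OCF_RESKEY_CRM_meta_* variables to be set):
# OCF_ROOT=/usr/lib/ocf OCF_RESKEY_drbd_resource=vmstore \
#     /usr/lib/ocf/resource.d/linbit/drbd monitor
# ocf_rc_name $?
```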


Thanks, 


Andrew 


From: "emmanuel segura" < emi2fast at gmail.com > 
To: "The Pacemaker cluster resource manager" < pacemaker at oss.clusterlabs.org > 
Sent: Friday, March 30, 2012 10:26:48 AM 
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover 

I think this constraint is wrong 
================================================== 
colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm 
=================================================== 

change to 
====================================================== 
colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master ms_drbd_mount1:Master ms_drbd_mount2:Master 
======================================================= 


On 30 March 2012 17:16, Andrew Martin < amartin at xes-inc.com > wrote:



Hi Emmanuel, 


Here is the output of crm configure show: 
http://pastebin.com/NA1fZ8dL 


Thanks, 


Andrew 



From: "emmanuel segura" < emi2fast at gmail.com > 
To: "The Pacemaker cluster resource manager" < pacemaker at oss.clusterlabs.org > 
Sent: Friday, March 30, 2012 9:43:45 AM 
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover 

Can you show me the output of crm configure show?


On 30 March 2012 16:10, Andrew Martin < amartin at xes-inc.com > wrote:



Hi Andreas, 


Here is a copy of my complete CIB: 
http://pastebin.com/v5wHVFuy 


I'll work on generating a report using crm_report as well. 


Thanks, 


Andrew 



From: "Andreas Kurz" < andreas at hastexo.com > 
To: pacemaker at oss.clusterlabs.org 
Sent: Friday, March 30, 2012 4:41:16 AM 
Subject: Re: [Pacemaker] Nodes will not promote DRBD resources to master on failover 

On 03/28/2012 04:56 PM, Andrew Martin wrote: 
> Hi Andreas, 
> 
> I disabled the DRBD init script and then restarted the slave node 
> (node2). After it came back up, DRBD did not start: 
> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending 
> Online: [ node2 node1 ] 
> 
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] 
> Masters: [ node1 ] 
> Stopped: [ p_drbd_vmstore:1 ] 
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] 
> Masters: [ node1 ] 
> Stopped: [ p_drbd_mount1:1 ] 
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
> Masters: [ node1 ] 
> Stopped: [ p_drbd_mount2:1 ] 
> ... 
> 
> root at node2:~# service drbd status 
> drbd not loaded 

Yes, expected unless Pacemaker starts DRBD 

> 
> Is there something else I need to change in the CIB to ensure that DRBD 
> is started? All of my DRBD devices are configured like this: 
> primitive p_drbd_mount2 ocf:linbit:drbd \ 
> params drbd_resource="mount2" \ 
> op monitor interval="15" role="Master" \ 
> op monitor interval="30" role="Slave" 
> ms ms_drbd_mount2 p_drbd_mount2 \ 
> meta master-max="1" master-node-max="1" clone-max="2" 
> clone-node-max="1" notify="true" 

That should be enough ... unable to say more without seeing the complete 
configuration ... too many fragments of information ;-) 

Please provide (e.g. pastebin) your complete cib (cibadmin -Q) when 
cluster is in that state ... or even better create a crm_report archive 

> 
> Here is the output from the syslog (grep -i drbd /var/log/syslog): 
> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing 
> key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc 
> op=p_drbd_vmstore:1_monitor_0 ) 
> Mar 28 09:24:47 node2 lrmd: [3210]: info: rsc:p_drbd_vmstore:1 probe[2] 
> (pid 3455) 
> Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing 
> key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc 
> op=p_drbd_mount1:1_monitor_0 ) 
> Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount1:1 probe[3] 
> (pid 3456) 
> Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing 
> key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc 
> op=p_drbd_mount2:1_monitor_0 ) 
> Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount2:1 probe[4] 
> (pid 3457) 
> Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING: Couldn't find 
> device [/dev/drbd0]. Expected /dev/??? to exist 
> Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked: 
> crm_attribute -N node2 -n master-p_drbd_mount2:1 -l reboot -D 
> Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked: 
> crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l reboot -D 
> Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked: 
> crm_attribute -N node2 -n master-p_drbd_mount1:1 -l reboot -D 
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[4] on 
> p_drbd_mount2:1 for client 3213: pid 3457 exited with return code 7 
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[2] on 
> p_drbd_vmstore:1 for client 3213: pid 3455 exited with return code 7 
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM 
> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=10, 
> confirmed=true) not running 
> Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[3] on 
> p_drbd_mount1:1 for client 3213: pid 3456 exited with return code 7 
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM 
> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11, 
> confirmed=true) not running 
> Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM 
> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=12, 
> confirmed=true) not running 

No errors, just probing ... so for some reason Pacemaker does not want to 
start it ... use crm_simulate to find out why ... or provide the information 
requested above. 
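
A minimal sketch of that crm_simulate check, assuming it is run on a cluster node: the -L option reads the live cluster state and -s prints the allocation scores (for master/slave resources the output typically includes promotion scores as well). The wrapper name show_allocation_scores is hypothetical.

```shell
# Show the scores Pacemaker computes from the live CIB.
# show_allocation_scores is a hypothetical wrapper name.
show_allocation_scores() {
    if ! command -v crm_simulate >/dev/null 2>&1; then
        echo "crm_simulate not found in PATH" >&2
        return 127
    fi
    # -L: use the live cluster state; -s: show allocation scores
    crm_simulate -L -s
}
```

Piping the output through "grep -i drbd" narrows it to the DRBD resources; a score of -INFINITY for a resource on a node would explain why it is not started or promoted there.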

Regards, 
Andreas 

-- 
Need help with Pacemaker? 
http://www.hastexo.com/now 

> 
> Thanks, 
> 
> Andrew 
> 
> ------------------------------------------------------------------------ 
> *From: *"Andreas Kurz" < andreas at hastexo.com > 
> *To: * pacemaker at oss.clusterlabs.org 
> *Sent: *Wednesday, March 28, 2012 9:03:06 AM 
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to 
> master on failover 
> 
> On 03/28/2012 03:47 PM, Andrew Martin wrote: 
>> Hi Andreas, 
>> 
>>> hmm ... what is that fence-peer script doing? If you want to use 
>>> resource-level fencing with the help of dopd, activate the 
>>> drbd-peer-outdater script in the line above ... and double check if the 
>>> path is correct 
>> fence-peer is just a wrapper for drbd-peer-outdater that does some 
>> additional logging. In my testing dopd has been working well. 
> 
> I see 
> 
>> 
>>>> I am thinking of making the following changes to the CIB (as per the 
>>>> official DRBD 
>>>> guide 
>> 
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html ) in 
>>>> order to add the DRBD lsb service and require that it start before the 
>>>> ocf:linbit:drbd resources. Does this look correct? 
>>> 
>>> Where did you read that? No, deactivate the startup of DRBD on system 
>>> boot and let Pacemaker manage it completely. 
>>> 
>>>> primitive p_drbd-init lsb:drbd op monitor interval="30" 
>>>> colocation c_drbd_together inf: 
>>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master 
>>>> ms_drbd_mount2:Master 
>>>> order drbd_init_first inf: ms_drbd_vmstore:promote 
>>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start 
>>>> 
>>>> This doesn't seem to require that drbd be also running on the node where 
>>>> the ocf:linbit:drbd resources are slave (which it would need to do to be 
>>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere? 
>>>> (clone cl_drbd p_drbd-init ?) 
>>> 
>>> This is really not needed. 
>> I was following the official DRBD Users Guide: 
>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html 
>> 
>> If I am understanding your previous message correctly, I do not need to 
>> add a lsb primitive for the drbd daemon? It will be 
>> started/stopped/managed automatically by my ocf:linbit:drbd resources 
>> (and I can remove the /etc/rc* symlinks)? 
> 
> Yes, you don't need that LSB script when using Pacemaker and should not 
> let init start it. 
> 
> Regards, 
> Andreas 
> 
> -- 
> Need help with Pacemaker? 
> http://www.hastexo.com/now 
> 
>> 
>> Thanks, 
>> 
>> Andrew 
>> 
>> ------------------------------------------------------------------------ 
>> *From: *"Andreas Kurz" < andreas at hastexo.com >
>> *To: * pacemaker at oss.clusterlabs.org
>> *Sent: *Wednesday, March 28, 2012 7:27:34 AM 
>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to 
>> master on failover 
>> 
>> On 03/28/2012 12:13 AM, Andrew Martin wrote: 
>>> Hi Andreas, 
>>> 
>>> Thanks, I've updated the colocation rule to be in the correct order. I 
>>> also enabled the STONITH resource (this was temporarily disabled before 
>>> for some additional testing). DRBD has its own network connection over 
>>> the br1 interface (192.168.5.0/24 network), a direct crossover cable 
>>> between node1 and node2: 
>>> global { usage-count no; } 
>>> common { 
>>> syncer { rate 110M; } 
>>> } 
>>> resource vmstore { 
>>> protocol C; 
>>> startup { 
>>> wfc-timeout 15; 
>>> degr-wfc-timeout 60; 
>>> } 
>>> handlers { 
>>> #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5"; 
>>> fence-peer "/usr/local/bin/fence-peer"; 
>> 
>> hmm ... what is that fence-peer script doing? If you want to use 
>> resource-level fencing with the help of dopd, activate the 
>> drbd-peer-outdater script in the line above ... and double check if the 
>> path is correct 
>> 
>>> split-brain "/usr/lib/drbd/notify-split-brain.sh me at example.com"; 
>>> } 
>>> net { 
>>> after-sb-0pri discard-zero-changes; 
>>> after-sb-1pri discard-secondary; 
>>> after-sb-2pri disconnect; 
>>> cram-hmac-alg md5; 
>>> shared-secret "xxxxx"; 
>>> } 
>>> disk { 
>>> fencing resource-only; 
>>> } 
>>> on node1 { 
>>> device /dev/drbd0; 
>>> disk /dev/sdb1; 
>>> address 192.168.5.10:7787; 
>>> meta-disk internal; 
>>> } 
>>> on node2 { 
>>> device /dev/drbd0; 
>>> disk /dev/sdf1; 
>>> address 192.168.5.11:7787; 
>>> meta-disk internal; 
>>> } 
>>> } 
>>> # and similar for mount1 and mount2 
>>> 
>>> Also, here is my ha.cf. It uses both the direct link between the nodes 
>>> (br1) and the shared LAN network on br0 for communicating: 
>>> autojoin none 
>>> mcast br0 239.0.0.43 694 1 0 
>>> bcast br1 
>>> warntime 5 
>>> deadtime 15 
>>> initdead 60 
>>> keepalive 2 
>>> node node1 
>>> node node2 
>>> node quorumnode 
>>> crm respawn 
>>> respawn hacluster /usr/lib/heartbeat/dopd 
>>> apiauth dopd gid=haclient uid=hacluster 
>>> 
>>> I am thinking of making the following changes to the CIB (as per the 
>>> official DRBD 
>>> guide 
>> 
> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html ) in 
>>> order to add the DRBD lsb service and require that it start before the 
>>> ocf:linbit:drbd resources. Does this look correct? 
>> 
>> Where did you read that? No, deactivate the startup of DRBD on system 
>> boot and let Pacemaker manage it completely. 
>> 
>>> primitive p_drbd-init lsb:drbd op monitor interval="30" 
>>> colocation c_drbd_together inf: 
>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master 
>>> ms_drbd_mount2:Master 
>>> order drbd_init_first inf: ms_drbd_vmstore:promote 
>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start 
>>> 
>>> This doesn't seem to require that drbd be also running on the node where 
>>> the ocf:linbit:drbd resources are slave (which it would need to do to be 
>>> a DRBD SyncTarget) - how can I ensure that drbd is running everywhere? 
>>> (clone cl_drbd p_drbd-init ?) 
>> 
>> This is really not needed. 
>> 
>> Regards, 
>> Andreas 
>> 
>> -- 
>> Need help with Pacemaker? 
>> http://www.hastexo.com/now 
>> 
>>> 
>>> Thanks, 
>>> 
>>> Andrew 
>>> ------------------------------------------------------------------------ 
>>> *From: *"Andreas Kurz" < andreas at hastexo.com >
>>> *To: * pacemaker at oss.clusterlabs.org
>>> *Sent: *Monday, March 26, 2012 5:56:22 PM 
>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to 
>>> master on failover 
>>> 
>>> On 03/24/2012 08:15 PM, Andrew Martin wrote: 
>>>> Hi Andreas, 
>>>> 
>>>> My complete cluster configuration is as follows: 
>>>> ============ 
>>>> Last updated: Sat Mar 24 13:51:55 2012 
>>>> Last change: Sat Mar 24 13:41:55 2012 
>>>> Stack: Heartbeat 
>>>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition 
>>>> with quorum 
>>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c 
>>>> 3 Nodes configured, unknown expected votes 
>>>> 19 Resources configured. 
>>>> ============ 
>>>> 
>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE 
> (standby) 
>>>> Online: [ node2 node1 ] 
>>>> 
>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] 
>>>> Masters: [ node2 ] 
>>>> Slaves: [ node1 ] 
>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] 
>>>> Masters: [ node2 ] 
>>>> Slaves: [ node1 ] 
>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
>>>> Masters: [ node2 ] 
>>>> Slaves: [ node1 ] 
>>>> Resource Group: g_vm 
>>>> p_fs_vmstore(ocf::heartbeat:Filesystem):Started node2 
>>>> p_vm(ocf::heartbeat:VirtualDomain):Started node2 
>>>> Clone Set: cl_daemons [g_daemons] 
>>>> Started: [ node2 node1 ] 
>>>> Stopped: [ g_daemons:2 ] 
>>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify] 
>>>> Started: [ node2 node1 ] 
>>>> Stopped: [ p_sysadmin_notify:2 ] 
>>>> stonith-node1(stonith:external/tripplitepdu):Started node2 
>>>> stonith-node2(stonith:external/tripplitepdu):Started node1 
>>>> Clone Set: cl_ping [p_ping] 
>>>> Started: [ node2 node1 ] 
>>>> Stopped: [ p_ping:2 ] 
>>>> 
>>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \ 
>>>> attributes standby="off" 
>>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \ 
>>>> attributes standby="off" 
>>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \ 
>>>> attributes standby="on" 
>>>> primitive p_drbd_mount2 ocf:linbit:drbd \ 
>>>> params drbd_resource="mount2" \ 
>>>> op monitor interval="15" role="Master" \ 
>>>> op monitor interval="30" role="Slave" 
>>>> primitive p_drbd_mount1 ocf:linbit:drbd \ 
>>>> params drbd_resource="mount1" \ 
>>>> op monitor interval="15" role="Master" \ 
>>>> op monitor interval="30" role="Slave" 
>>>> primitive p_drbd_vmstore ocf:linbit:drbd \ 
>>>> params drbd_resource="vmstore" \ 
>>>> op monitor interval="15" role="Master" \ 
>>>> op monitor interval="30" role="Slave" 
>>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \ 
>>>> params device="/dev/drbd0" directory="/vmstore" fstype="ext4" \ 
>>>> op start interval="0" timeout="60s" \ 
>>>> op stop interval="0" timeout="60s" \ 
>>>> op monitor interval="20s" timeout="40s" 
>>>> primitive p_libvirt-bin upstart:libvirt-bin \ 
>>>> op monitor interval="30" 
>>>> primitive p_ping ocf:pacemaker:ping \ 
>>>> params name="p_ping" host_list="192.168.1.10 192.168.1.11" 
>>>> multiplier="1000" \ 
>>>> op monitor interval="20s" 
>>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \ 
>>>> params email="me at example.com" \ 
>>>> params subject="Pacemaker Change" \ 
>>>> op start interval="0" timeout="10" \ 
>>>> op stop interval="0" timeout="10" \ 
>>>> op monitor interval="10" timeout="10" 
>>>> primitive p_vm ocf:heartbeat:VirtualDomain \ 
>>>> params config="/vmstore/config/vm.xml" \ 
>>>> meta allow-migrate="false" \ 
>>>> op start interval="0" timeout="120s" \ 
>>>> op stop interval="0" timeout="120s" \ 
>>>> op monitor interval="10" timeout="30" 
>>>> primitive stonith-node1 stonith:external/tripplitepdu \ 
>>>> params pdu_ipaddr="192.168.1.12" pdu_port="1" pdu_username="xxx" 
>>>> pdu_password="xxx" hostname_to_stonith="node1" 
>>>> primitive stonith-node2 stonith:external/tripplitepdu \ 
>>>> params pdu_ipaddr="192.168.1.12" pdu_port="2" pdu_username="xxx" 
>>>> pdu_password="xxx" hostname_to_stonith="node2" 
>>>> group g_daemons p_libvirt-bin 
>>>> group g_vm p_fs_vmstore p_vm 
>>>> ms ms_drbd_mount2 p_drbd_mount2 \ 
>>>> meta master-max="1" master-node-max="1" clone-max="2" 
>>>> clone-node-max="1" notify="true" 
>>>> ms ms_drbd_mount1 p_drbd_mount1 \ 
>>>> meta master-max="1" master-node-max="1" clone-max="2" 
>>>> clone-node-max="1" notify="true" 
>>>> ms ms_drbd_vmstore p_drbd_vmstore \ 
>>>> meta master-max="1" master-node-max="1" clone-max="2" 
>>>> clone-node-max="1" notify="true" 
>>>> clone cl_daemons g_daemons 
>>>> clone cl_ping p_ping \ 
>>>> meta interleave="true" 
>>>> clone cl_sysadmin_notify p_sysadmin_notify 
>>>> location l-st-node1 stonith-node1 -inf: node1 
>>>> location l-st-node2 stonith-node2 -inf: node2 
>>>> location l_run_on_most_connected p_vm \ 
>>>> rule $id="l_run_on_most_connected-rule" p_ping: defined p_ping 
>>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master 
>>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm 
>>> 
>>> As Emmanuel already said, g_vm has to come first in this 
>>> colocation constraint ... g_vm must be colocated with the drbd masters. 
>>> 
>>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote ms_drbd_mount1:promote 
>>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start 
>>>> property $id="cib-bootstrap-options" \ 
>>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c" \ 
>>>> cluster-infrastructure="Heartbeat" \ 
>>>> stonith-enabled="false" \ 
>>>> no-quorum-policy="stop" \ 
>>>> last-lrm-refresh="1332539900" \ 
>>>> cluster-recheck-interval="5m" \ 
>>>> crmd-integration-timeout="3m" \ 
>>>> shutdown-escalation="5m" 
>>>> 
>>>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite 
>>>> PDUMH20ATNET that I'm using as the STONITH device: 
>>>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf 
>>> 
>>> And why aren't you using it? ... stonith-enabled="false" 
>>> 
>>>> 
>>>> As you can see, I left the DRBD service to be started by the operating 
>>>> system (as an lsb script at boot time) however Pacemaker controls 
>>>> actually bringing up/taking down the individual DRBD devices. 
>>> 
>>> Don't start drbd on system boot, give Pacemaker the full control. 
>>> 
>>> The 
>>>> behavior I observe is as follows: I issue "crm resource migrate p_vm" on 
>>>> node1 and failover successfully to node2. During this time, node2 fences 
>>>> node1's DRBD devices (using dopd) and marks them as Outdated. Meanwhile 
>>>> node2's DRBD devices are UpToDate. I then shutdown both nodes and then 
>>>> bring them back up. They reconnect to the cluster (with quorum), and 
>>>> node1's DRBD devices are still Outdated as expected and node2's DRBD 
>>>> devices are still UpToDate, as expected. At this point, DRBD starts on 
>>>> both nodes, however node2 will not set DRBD as master: 
>>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE 
> (standby) 
>>>> Online: [ node2 node1 ] 
>>>> 
>>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore] 
>>>> Slaves: [ node1 node2 ] 
>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] 
>>>> Slaves: [ node1 node2 ] 
>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
>>>> Slaves: [ node1 node2 ] 
>>> 
>>> There should really be no interruption of the drbd replication on vm 
>>> migration that activates the dopd ... drbd has its own direct network 
>>> connection? 
>>> 
>>> Please share your ha.cf file and your drbd configuration. Watch out for 
>>> drbd messages in your kernel log file, that should give you additional 
>>> information when/why the drbd connection was lost. 
>>> 
>>> Regards, 
>>> Andreas 
>>> 
>>> -- 
>>> Need help with Pacemaker? 
>>> http://www.hastexo.com/now 
>>> 
>>>> 
>>>> I am having trouble sorting through the logging information because 
>>>> there is so much of it in /var/log/daemon.log, but I can't find an 
>>>> error message printed about why it will not promote node2. At this point 
>>>> the DRBD devices are as follows: 
>>>> node2: cstate = WFConnection dstate=UpToDate 
>>>> node1: cstate = StandAlone dstate=Outdated 
>>>> 
>>>> I don't see any reason why node2 can't become DRBD master, or am I 
>>>> missing something? If I do "drbdadm connect all" on node1, then the 
>>>> cstate on both nodes changes to "Connected" and node2 immediately 
>>>> promotes the DRBD resources to master. Any ideas on why I'm observing 
>>>> this incorrect behavior? 
>>>> 
>>>> Any tips on how I can better filter through the pacemaker/heartbeat logs 
>>>> or how to get additional useful debug information? 
>>>> 
>>>> Thanks, 
>>>> 
>>>> Andrew 
>>>> 
>>>> ------------------------------------------------------------------------ 
>>>> *From: *"Andreas Kurz" < andreas at hastexo.com >
>>>> *To: * pacemaker at oss.clusterlabs.org
>>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM 
>>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to 
>>>> master on failover 
>>>> 
>>>> On 01/25/2012 08:58 PM, Andrew Martin wrote: 
>>>>> Hello, 
>>>>> 
>>>>> Recently I finished configuring a two-node cluster with pacemaker 1.1.6 
>>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster 
> includes 
>>>>> the following resources: 
>>>>> - primitives for DRBD storage devices 
>>>>> - primitives for mounting the filesystem on the DRBD storage 
>>>>> - primitives for some mount binds 
>>>>> - primitive for starting apache 
>>>>> - primitives for starting samba and nfs servers (following instructions 
>>>>> here < http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf >) 
>>>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs) 
>>>> 
>>>> not enough information ... please share at least your complete cluster 
>>>> configuration 
>>>> 
>>>> Regards, 
>>>> Andreas 
>>>> 
>>>> -- 
>>>> Need help with Pacemaker? 
>>>> http://www.hastexo.com/now 
>>>> 
>>>>> 
>>>>> Perhaps this is best described through the output of crm_mon: 
>>>>> Online: [ node1 node2 ] 
>>>>> 
>>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged) 
>>>>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2 
>>> (unmanaged) 
>>>>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1 
>>>>> (unmanaged) FAILED 
>>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2] 
>>>>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1 
>>>>> (unmanaged) FAILED 
>>>>> Slaves: [ node2 ] 
>>>>> Resource Group: g_core 
>>>>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1 
>>>>> Resource Group: g_apache 
>>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_fs_varwww (ocf::heartbeat:Filesystem): Started node1 
>>>>> p_apache (ocf::heartbeat:apache): Started node1 
>>>>> Resource Group: g_fileservers 
>>>>> p_lsb_smb (lsb:smbd): Started node1 
>>>>> p_lsb_nmb (lsb:nmbd): Started node1 
>>>>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started node1 
>>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started node1 
>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started 
> node1 
>>>>> 
>>>>> I have read through the Pacemaker Explained documentation 
>>>>> < http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained > 
>>>>> but could not find a way to debug these problems further. First, I put 
>>>>> node1 into standby mode to attempt failover to the other node (node2). 
>>>>> Node2 appeared to start the transition to master, but it failed to 
>>>>> promote the DRBD resources to master (the first step). I have attached 
>>>>> a copy of this session in commands.log and additional excerpts from 
>>>>> /var/log/syslog during important steps. I have attempted everything I 
>>>>> can think of to start the DRBD resource (e.g. 
>>>>> start/stop/promote/manage/cleanup under crm resource, restarting 
>>>>> heartbeat) but cannot bring it out of the slave state. However, if I 
>>>>> set it to unmanaged and then run drbdadm primary all in the terminal, 
>>>>> pacemaker is satisfied and continues starting the rest of the 
>>>>> resources. It then failed when attempting to mount the filesystem for 
>>>>> mount2, the p_fs_mount2 resource. I attempted to mount the filesystem 
>>>>> myself and was successful. I then unmounted it and ran cleanup on 
>>>>> p_fs_mount2, and then it mounted. The rest of the resources started as 
>>>>> expected until the p_exportfs_mount2 resource, which failed as follows: 
>>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2 (unmanaged) FAILED 
>>>>> 
>>>>> I ran cleanup on this and it started; however, when running this test 
>>>>> earlier today, no command could successfully start this exportfs 
>>>>> resource. 
>>>>> 
>>>>> How can I configure pacemaker to better resolve these problems and be 
>>>>> able to bring the node up successfully on its own? What can I check to 
>>>>> determine why these failures are occurring? /var/log/syslog did not seem 
>>>>> to contain much useful information about why the failures occurred. 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> Andrew 
>>>>> 
>>>> _______________________________________________ 
>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org 
>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker 
>>>> 
>>>> Project Home: http://www.clusterlabs.org 
>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
>>>> Bugs: http://bugs.clusterlabs.org 
-- 
This is my life and I live it for as long as God wills 

</blockquote>



</blockquote>


