[Pacemaker] Nodes will not promote DRBD resources to master on failover
emmanuel segura
emi2fast at gmail.com
Mon Apr 2 16:43:20 CEST 2012
Sorry Andrew,
Can you post your "crm configure show" output again?
Thanks
On 30 March 2012 at 18:53, Andrew Martin <amartin at xes-inc.com> wrote:
> Hi Emmanuel,
>
> Thanks, that is a good idea. I updated the colocation constraint as you
> described. Afterwards, the cluster remains in this state (with the filesystem
> not mounted and the VM not started):
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> Clone Set: cl_daemons [g_daemons]
> Started: [ node2 node1 ]
> Stopped: [ g_daemons:2 ]
> stonith-node1 (stonith:external/tripplitepdu): Started node2
> stonith-node2 (stonith:external/tripplitepdu): Started node1
>
> I noticed that Pacemaker had not issued "drbdadm connect" for any of the
> DRBD resources on node2:
> # service drbd status
> drbd driver loaded OK; device status:
> version: 8.3.7 (api:88/proto:86-91)
> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by root at node2,
> 2012-02-02 12:29:26
> m:res cs ro ds p mounted fstype
> 0:vmstore StandAlone Secondary/Unknown Outdated/DUnknown r----
> 1:mount1 StandAlone Secondary/Unknown Outdated/DUnknown r----
> 2:mount2 StandAlone Secondary/Unknown Outdated/DUnknown r----
> # drbdadm cstate all
> StandAlone
> StandAlone
> StandAlone
>
> After manually issuing "drbdadm connect all" on node2, the rest of the
> resources eventually started (several minutes later) on node1:
> Online: [ node2 node1 ]
>
> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
> Masters: [ node1 ]
> Slaves: [ node2 ]
> Resource Group: g_vm
> p_fs_vmstore (ocf::heartbeat:Filesystem): Started node1
> p_vm (ocf::heartbeat:VirtualDomain): Started node1
> Clone Set: cl_daemons [g_daemons]
> Started: [ node2 node1 ]
> Stopped: [ g_daemons:2 ]
> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
> Started: [ node2 node1 ]
> Stopped: [ p_sysadmin_notify:2 ]
> stonith-node1 (stonith:external/tripplitepdu): Started node2
> stonith-node2 (stonith:external/tripplitepdu): Started node1
> Clone Set: cl_ping [p_ping]
> Started: [ node2 node1 ]
> Stopped: [ p_ping:2 ]
>
> The DRBD devices on node1 were all UpToDate, so it doesn't seem right that
> it would need to wait for node2 to be connected before it could continue
> promoting additional resources. I then restarted heartbeat on node2 to see
> if it would automatically connect the DRBD devices this time. After
> restarting it, the DRBD devices were not even configured:
> # service drbd status
> drbd driver loaded OK; device status:
> version: 8.3.7 (api:88/proto:86-91)
> GIT-hash: ea9e28dbff98e331a62bcbcc63a6135808fe2917 build by
> root at webapps2host, 2012-02-02 12:29:26
> m:res cs ro ds p mounted fstype
> 0:vmstore Unconfigured
> 1:mount1 Unconfigured
> 2:mount2 Unconfigured
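>
> If I understand the DRBD tooling correctly, I could configure them by hand
> with something along these lines (a manual workaround only - starting the
> ocf:linbit:drbd resources via Pacemaker should normally do this itself):
>
> # attach the backing devices and try to connect to the peer
> drbdadm up all
> # then verify the connection state again
> drbdadm cstate all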
>
> Looking at the log, I found this section about the DRBD primitives:
> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[2] on
> p_drbd_vmstore:1 for client 10705: pid 11065 exited with return code 7
> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM
> operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11,
> confirmed=true) not running
> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[4] on
> p_drbd_mount2:1 for client 10705: pid 11069 exited with return code 7
> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM
> operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=12,
> confirmed=true) not running
> Mar 30 11:10:32 node2 lrmd: [10702]: info: operation monitor[3] on
> p_drbd_mount1:1 for client 10705: pid 11066 exited with return code 7
> Mar 30 11:10:32 node2 crmd: [10705]: info: process_lrm_event: LRM
> operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=13,
> confirmed=true) not running
>
> I am not sure what exit code 7 means - is it possible to run the monitor
> action manually or somehow obtain more debugging output about this? Here is
> the complete log after restarting heartbeat on node2:
> http://pastebin.com/KsHKi3GW
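>
> Perhaps return code 7 is just the standard OCF "not running" code (the log
> lines do say "not running"), in which case the probes are simply reporting
> that the resources are stopped on node2. If I wanted to run the monitor
> action by hand, I assume something along these lines would work (the
> OCF_ROOT path and parameter names may differ on other installations):
>
> export OCF_ROOT=/usr/lib/ocf
> export OCF_RESKEY_drbd_resource=vmstore
> /usr/lib/ocf/resource.d/linbit/drbd monitor; echo "rc=$?"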
>
> Thanks,
>
> Andrew
>
> ------------------------------
> *From: *"emmanuel segura" <emi2fast at gmail.com>
> *To: *"The Pacemaker cluster resource manager" <
> pacemaker at oss.clusterlabs.org>
> *Sent: *Friday, March 30, 2012 10:26:48 AM
> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
> master on failover
>
> I think this constraint is wrong:
> ==================================================
> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
> ===================================================
>
> Change it to:
> ======================================================
> colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master
> ms_drbd_mount1:Master ms_drbd_mount2:Master
> =======================================================
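>
> As far as I know, the first resource in a crm colocation statement is the
> dependent one, i.e.
>
> colocation <id> <score>: <dependent-resource> <resource-it-must-follow>
>
> so g_vm has to come first, so that it is placed where the DRBD masters are,
> not the other way around.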
>
> On 30 March 2012 at 17:16, Andrew Martin <amartin at xes-inc.com> wrote:
>
>> Hi Emmanuel,
>>
>> Here is the output of crm configure show:
>> http://pastebin.com/NA1fZ8dL
>>
>> Thanks,
>>
>> Andrew
>>
>> ------------------------------
>> *From: *"emmanuel segura" <emi2fast at gmail.com>
>> *To: *"The Pacemaker cluster resource manager" <
>> pacemaker at oss.clusterlabs.org>
>> *Sent: *Friday, March 30, 2012 9:43:45 AM
>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>> master on failover
>>
>> Can you show me the output of:
>>
>> crm configure show
>>
>> On 30 March 2012 at 16:10, Andrew Martin <amartin at xes-inc.com> wrote:
>>
>>> Hi Andreas,
>>>
>>> Here is a copy of my complete CIB:
>>> http://pastebin.com/v5wHVFuy
>>>
>>> I'll work on generating a report using crm_report as well.
>>>
>>> Thanks,
>>>
>>> Andrew
>>>
>>> ------------------------------
>>> *From: *"Andreas Kurz" <andreas at hastexo.com>
>>> *To: *pacemaker at oss.clusterlabs.org
>>> *Sent: *Friday, March 30, 2012 4:41:16 AM
>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>> master on failover
>>>
>>> On 03/28/2012 04:56 PM, Andrew Martin wrote:
>>> > Hi Andreas,
>>> >
>>> > I disabled the DRBD init script and then restarted the slave node
>>> > (node2). After it came back up, DRBD did not start:
>>> > Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): pending
>>> > Online: [ node2 node1 ]
>>> >
>>> > Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>> > Masters: [ node1 ]
>>> > Stopped: [ p_drbd_vmstore:1 ]
>>> > Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>> > Masters: [ node1 ]
>>> > Stopped: [ p_drbd_mount1:1 ]
>>> > Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>> > Masters: [ node1 ]
>>> > Stopped: [ p_drbd_mount2:1 ]
>>> > ...
>>> >
>>> > root at node2:~# service drbd status
>>> > drbd not loaded
>>>
>>> Yes, that is expected unless Pacemaker starts DRBD.
>>>
>>> >
>>> > Is there something else I need to change in the CIB to ensure that DRBD
>>> > is started? All of my DRBD devices are configured like this:
>>> > primitive p_drbd_mount2 ocf:linbit:drbd \
>>> > params drbd_resource="mount2" \
>>> > op monitor interval="15" role="Master" \
>>> > op monitor interval="30" role="Slave"
>>> > ms ms_drbd_mount2 p_drbd_mount2 \
>>> > meta master-max="1" master-node-max="1" clone-max="2"
>>> > clone-node-max="1" notify="true"
>>>
>>> That should be enough ... unable to say more without seeing the complete
>>> configuration ... too many fragments of information ;-)
>>>
>>> Please provide (e.g. via pastebin) your complete CIB (cibadmin -Q) when the
>>> cluster is in that state ... or, even better, create a crm_report archive.
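>>>
>>> For example, something like this should collect the logs, the CIB and the
>>> PE inputs from all nodes into one archive (exact options can vary a bit
>>> between versions):
>>>
>>> crm_report -f "2012-03-28 09:00" /tmp/drbd-promotion-issue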
>>>
>>> >
>>> > Here is the output from the syslog (grep -i drbd /var/log/syslog):
>>> > Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
>>> > key=12:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
>>> > op=p_drbd_vmstore:1_monitor_0 )
>>> > Mar 28 09:24:47 node2 lrmd: [3210]: info: rsc:p_drbd_vmstore:1 probe[2]
>>> > (pid 3455)
>>> > Mar 28 09:24:47 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
>>> > key=13:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
>>> > op=p_drbd_mount1:1_monitor_0 )
>>> > Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount1:1 probe[3]
>>> > (pid 3456)
>>> > Mar 28 09:24:48 node2 crmd: [3213]: info: do_lrm_rsc_op: Performing
>>> > key=14:315:7:24416169-73ba-469b-a2e3-56a22b437cbc
>>> > op=p_drbd_mount2:1_monitor_0 )
>>> > Mar 28 09:24:48 node2 lrmd: [3210]: info: rsc:p_drbd_mount2:1 probe[4]
>>> > (pid 3457)
>>> > Mar 28 09:24:48 node2 Filesystem[3458]: [3517]: WARNING: Couldn't find
>>> > device [/dev/drbd0]. Expected /dev/??? to exist
>>> > Mar 28 09:24:48 node2 crm_attribute: [3563]: info: Invoked:
>>> > crm_attribute -N node2 -n master-p_drbd_mount2:1 -l reboot -D
>>> > Mar 28 09:24:48 node2 crm_attribute: [3557]: info: Invoked:
>>> > crm_attribute -N node2 -n master-p_drbd_vmstore:1 -l reboot -D
>>> > Mar 28 09:24:48 node2 crm_attribute: [3562]: info: Invoked:
>>> > crm_attribute -N node2 -n master-p_drbd_mount1:1 -l reboot -D
>>> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[4] on
>>> > p_drbd_mount2:1 for client 3213: pid 3457 exited with return code 7
>>> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[2] on
>>> > p_drbd_vmstore:1 for client 3213: pid 3455 exited with return code 7
>>> > Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
>>> > operation p_drbd_mount2:1_monitor_0 (call=4, rc=7, cib-update=10,
>>> > confirmed=true) not running
>>> > Mar 28 09:24:48 node2 lrmd: [3210]: info: operation monitor[3] on
>>> > p_drbd_mount1:1 for client 3213: pid 3456 exited with return code 7
>>> > Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
>>> > operation p_drbd_vmstore:1_monitor_0 (call=2, rc=7, cib-update=11,
>>> > confirmed=true) not running
>>> > Mar 28 09:24:48 node2 crmd: [3213]: info: process_lrm_event: LRM
>>> > operation p_drbd_mount1:1_monitor_0 (call=3, rc=7, cib-update=12,
>>> > confirmed=true) not running
>>>
>>> No errors, just probing ... so for some reason Pacemaker does not want to
>>> start it ... use crm_simulate to find out why, or provide the information
>>> requested above.
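>>>
>>> A minimal sketch of what I mean - run the policy engine against the live
>>> CIB and show the allocation scores (flags may differ slightly between
>>> versions):
>>>
>>> crm_simulate -sL
>>>
>>> The master/promotion scores in that output usually show why a node is not
>>> promoted or a resource not started.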
>>>
>>> Regards,
>>> Andreas
>>>
>>> --
>>> Need help with Pacemaker?
>>> http://www.hastexo.com/now
>>>
>>> >
>>> > Thanks,
>>> >
>>> > Andrew
>>> >
>>> >
>>> ------------------------------------------------------------------------
>>> > *From: *"Andreas Kurz" <andreas at hastexo.com>
>>> > *To: *pacemaker at oss.clusterlabs.org
>>> > *Sent: *Wednesday, March 28, 2012 9:03:06 AM
>>> > *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>> > master on failover
>>> >
>>> > On 03/28/2012 03:47 PM, Andrew Martin wrote:
>>> >> Hi Andreas,
>>> >>
>>> >>> hmm ... what is that fence-peer script doing? If you want to use
>>> >>> resource-level fencing with the help of dopd, activate the
>>> >>> drbd-peer-outdater script in the line above ... and double check if
>>> the
>>> >>> path is correct
>>> >> fence-peer is just a wrapper for drbd-peer-outdater that does some
>>> >> additional logging. In my testing dopd has been working well.
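>>> >>
>>> >> For illustration, a minimal wrapper of that kind could look something
>>> >> like this (a hypothetical sketch, not my exact script):
>>> >>
>>> >> #!/bin/sh
>>> >> # log the fencing call, then hand off to dopd's outdater
>>> >> logger -t fence-peer "outdating peer for resource $DRBD_RESOURCE"
>>> >> exec /usr/lib/heartbeat/drbd-peer-outdater -t 5 "$@"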
>>> >
>>> > I see
>>> >
>>> >>
>>> >>>> I am thinking of making the following changes to the CIB (as per the
>>> >>>> official DRBD guide,
>>> >>>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)
>>> >>>> in order to add the DRBD lsb service and require that it start before
>>> >>>> the ocf:linbit:drbd resources. Does this look correct?
>>> >>>
>>> >>> Where did you read that? No, deactivate the startup of DRBD on system
>>> >>> boot and let Pacemaker manage it completely.
>>> >>>
>>> >>>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>>> >>>> colocation c_drbd_together inf:
>>> >>>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>>> >>>> ms_drbd_mount2:Master
>>> >>>> order drbd_init_first inf: ms_drbd_vmstore:promote
>>> >>>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>> >>>>
>>> >>>> This doesn't seem to require that drbd also be running on the node
>>> >>>> where the ocf:linbit:drbd resources are slave (which it would need to
>>> >>>> be in order to act as a DRBD SyncTarget) - how can I ensure that drbd
>>> >>>> is running everywhere? (clone cl_drbd p_drbd-init ?)
>>> >>>
>>> >>> This is really not needed.
>>> >> I was following the official DRBD Users Guide:
>>> >>
>>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html
>>> >>
>>> >> If I am understanding your previous message correctly, I do not need to
>>> >> add an lsb primitive for the drbd daemon? It will be
>>> >> started/stopped/managed automatically by my ocf:linbit:drbd resources
>>> >> (and I can remove the /etc/rc* symlinks)?
>>> >
>>> > Yes, you don't need that LSB script when using Pacemaker and should not
>>> > let init start it.
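>>> >
>>> > On Ubuntu that should just be a matter of something like (assuming the
>>> > init script is called "drbd"):
>>> >
>>> > update-rc.d -f drbd remove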
>>> >
>>> > Regards,
>>> > Andreas
>>> >
>>> > --
>>> > Need help with Pacemaker?
>>> > http://www.hastexo.com/now
>>> >
>>> >>
>>> >> Thanks,
>>> >>
>>> >> Andrew
>>> >>
>>> >>
>>> ------------------------------------------------------------------------
>>> >> *From: *"Andreas Kurz" <andreas at hastexo.com <mailto:
>>> andreas at hastexo.com>>
>>> >> *To: *pacemaker at oss.clusterlabs.org <mailto:
>>> pacemaker at oss.clusterlabs.org>
>>> >> *Sent: *Wednesday, March 28, 2012 7:27:34 AM
>>> >> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>> >> master on failover
>>> >>
>>> >> On 03/28/2012 12:13 AM, Andrew Martin wrote:
>>> >>> Hi Andreas,
>>> >>>
>>> >>> Thanks, I've updated the colocation rule to be in the correct order.
>>> I
>>> >>> also enabled the STONITH resource (this was temporarily disabled
>>> before
>>> >>> for some additional testing). DRBD has its own network connection
>>> over
>>> >>> the br1 interface (192.168.5.0/24 network), a direct crossover cable
>>> >>> between node1 and node2:
>>> >>> global { usage-count no; }
>>> >>> common {
>>> >>> syncer { rate 110M; }
>>> >>> }
>>> >>> resource vmstore {
>>> >>> protocol C;
>>> >>> startup {
>>> >>> wfc-timeout 15;
>>> >>> degr-wfc-timeout 60;
>>> >>> }
>>> >>> handlers {
>>> >>>                 #fence-peer "/usr/lib/heartbeat/drbd-peer-outdater -t 5";
>>> >>> fence-peer "/usr/local/bin/fence-peer";
>>> >>
>>> >> hmm ... what is that fence-peer script doing? If you want to use
>>> >> resource-level fencing with the help of dopd, activate the
>>> >> drbd-peer-outdater script in the line above ... and double check if
>>> the
>>> >> path is correct
>>> >>
>>> >>>                 split-brain "/usr/lib/drbd/notify-split-brain.sh me at example.com";
>>> >>> }
>>> >>> net {
>>> >>> after-sb-0pri discard-zero-changes;
>>> >>> after-sb-1pri discard-secondary;
>>> >>> after-sb-2pri disconnect;
>>> >>> cram-hmac-alg md5;
>>> >>> shared-secret "xxxxx";
>>> >>> }
>>> >>> disk {
>>> >>> fencing resource-only;
>>> >>> }
>>> >>> on node1 {
>>> >>> device /dev/drbd0;
>>> >>> disk /dev/sdb1;
>>> >>> address 192.168.5.10:7787;
>>> >>> meta-disk internal;
>>> >>> }
>>> >>> on node2 {
>>> >>> device /dev/drbd0;
>>> >>> disk /dev/sdf1;
>>> >>> address 192.168.5.11:7787;
>>> >>> meta-disk internal;
>>> >>> }
>>> >>> }
>>> >>> # and similar for mount1 and mount2
>>> >>>
>>> >>> Also, here is my ha.cf. It uses both the direct link between the
>>> nodes
>>> >>> (br1) and the shared LAN network on br0 for communicating:
>>> >>> autojoin none
>>> >>> mcast br0 239.0.0.43 694 1 0
>>> >>> bcast br1
>>> >>> warntime 5
>>> >>> deadtime 15
>>> >>> initdead 60
>>> >>> keepalive 2
>>> >>> node node1
>>> >>> node node2
>>> >>> node quorumnode
>>> >>> crm respawn
>>> >>> respawn hacluster /usr/lib/heartbeat/dopd
>>> >>> apiauth dopd gid=haclient uid=hacluster
>>> >>>
>>> >>> I am thinking of making the following changes to the CIB (as per the
>>> >>> official DRBD guide,
>>> >>> http://www.drbd.org/users-guide/s-pacemaker-crm-drbd-backed-service.html)
>>> >>> in order to add the DRBD lsb service and require that it start before
>>> >>> the ocf:linbit:drbd resources. Does this look correct?
>>> >>
>>> >> Where did you read that? No, deactivate the startup of DRBD on system
>>> >> boot and let Pacemaker manage it completely.
>>> >>
>>> >>> primitive p_drbd-init lsb:drbd op monitor interval="30"
>>> >>> colocation c_drbd_together inf:
>>> >>> p_drbd-init ms_drbd_vmstore:Master ms_drbd_mount1:Master
>>> >>> ms_drbd_mount2:Master
>>> >>> order drbd_init_first inf: ms_drbd_vmstore:promote
>>> >>> ms_drbd_mount1:promote ms_drbd_mount2:promote p_drbd-init:start
>>> >>>
>>> >>> This doesn't seem to require that drbd also be running on the node
>>> >>> where the ocf:linbit:drbd resources are slave (which it would need to
>>> >>> be in order to act as a DRBD SyncTarget) - how can I ensure that drbd
>>> >>> is running everywhere? (clone cl_drbd p_drbd-init ?)
>>> >>
>>> >> This is really not needed.
>>> >>
>>> >> Regards,
>>> >> Andreas
>>> >>
>>> >> --
>>> >> Need help with Pacemaker?
>>> >> http://www.hastexo.com/now
>>> >>
>>> >>>
>>> >>> Thanks,
>>> >>>
>>> >>> Andrew
>>> >>>
>>> ------------------------------------------------------------------------
>>> >>> *From: *"Andreas Kurz" <andreas at hastexo.com <mailto:
>>> andreas at hastexo.com>>
>>> >>> *To: *pacemaker at oss.clusterlabs.org
>>> > <mailto:*pacemaker at oss.clusterlabs.org>
>>> >>> *Sent: *Monday, March 26, 2012 5:56:22 PM
>>> >>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>> >>> master on failover
>>> >>>
>>> >>> On 03/24/2012 08:15 PM, Andrew Martin wrote:
>>> >>>> Hi Andreas,
>>> >>>>
>>> >>>> My complete cluster configuration is as follows:
>>> >>>> ============
>>> >>>> Last updated: Sat Mar 24 13:51:55 2012
>>> >>>> Last change: Sat Mar 24 13:41:55 2012
>>> >>>> Stack: Heartbeat
>>> >>>> Current DC: node2 (9100538b-7a1f-41fd-9c1a-c6b4b1c32b18) - partition
>>> >>>> with quorum
>>> >>>> Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
>>> >>>> 3 Nodes configured, unknown expected votes
>>> >>>> 19 Resources configured.
>>> >>>> ============
>>> >>>>
>>> >>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE
>>> > (standby)
>>> >>>> Online: [ node2 node1 ]
>>> >>>>
>>> >>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>> >>>> Masters: [ node2 ]
>>> >>>> Slaves: [ node1 ]
>>> >>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>> >>>> Masters: [ node2 ]
>>> >>>> Slaves: [ node1 ]
>>> >>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>> >>>> Masters: [ node2 ]
>>> >>>> Slaves: [ node1 ]
>>> >>>> Resource Group: g_vm
>>> >>>>     p_fs_vmstore (ocf::heartbeat:Filesystem): Started node2
>>> >>>>     p_vm (ocf::heartbeat:VirtualDomain): Started node2
>>> >>>> Clone Set: cl_daemons [g_daemons]
>>> >>>> Started: [ node2 node1 ]
>>> >>>> Stopped: [ g_daemons:2 ]
>>> >>>> Clone Set: cl_sysadmin_notify [p_sysadmin_notify]
>>> >>>> Started: [ node2 node1 ]
>>> >>>> Stopped: [ p_sysadmin_notify:2 ]
>>> >>>> stonith-node1 (stonith:external/tripplitepdu): Started node2
>>> >>>> stonith-node2 (stonith:external/tripplitepdu): Started node1
>>> >>>> Clone Set: cl_ping [p_ping]
>>> >>>> Started: [ node2 node1 ]
>>> >>>> Stopped: [ p_ping:2 ]
>>> >>>>
>>> >>>> node $id="6553a515-273e-42fe-ab9e-00f74bd582c3" node1 \
>>> >>>> attributes standby="off"
>>> >>>> node $id="9100538b-7a1f-41fd-9c1a-c6b4b1c32b18" node2 \
>>> >>>> attributes standby="off"
>>> >>>> node $id="c4bf25d7-a6b7-4863-984d-aafd937c0da4" quorumnode \
>>> >>>> attributes standby="on"
>>> >>>> primitive p_drbd_mount2 ocf:linbit:drbd \
>>> >>>> params drbd_resource="mount2" \
>>> >>>> op monitor interval="15" role="Master" \
>>> >>>> op monitor interval="30" role="Slave"
>>> >>>> primitive p_drbd_mount1 ocf:linbit:drbd \
>>> >>>> params drbd_resource="mount1" \
>>> >>>> op monitor interval="15" role="Master" \
>>> >>>> op monitor interval="30" role="Slave"
>>> >>>> primitive p_drbd_vmstore ocf:linbit:drbd \
>>> >>>> params drbd_resource="vmstore" \
>>> >>>> op monitor interval="15" role="Master" \
>>> >>>> op monitor interval="30" role="Slave"
>>> >>>> primitive p_fs_vmstore ocf:heartbeat:Filesystem \
>>> >>>> params device="/dev/drbd0" directory="/vmstore"
>>> fstype="ext4" \
>>> >>>> op start interval="0" timeout="60s" \
>>> >>>> op stop interval="0" timeout="60s" \
>>> >>>> op monitor interval="20s" timeout="40s"
>>> >>>> primitive p_libvirt-bin upstart:libvirt-bin \
>>> >>>> op monitor interval="30"
>>> >>>> primitive p_ping ocf:pacemaker:ping \
>>> >>>> params name="p_ping" host_list="192.168.1.10 192.168.1.11"
>>> >>>> multiplier="1000" \
>>> >>>> op monitor interval="20s"
>>> >>>> primitive p_sysadmin_notify ocf:heartbeat:MailTo \
>>> >>>>         params email="me at example.com" \
>>> >>>> params subject="Pacemaker Change" \
>>> >>>> op start interval="0" timeout="10" \
>>> >>>> op stop interval="0" timeout="10" \
>>> >>>> op monitor interval="10" timeout="10"
>>> >>>> primitive p_vm ocf:heartbeat:VirtualDomain \
>>> >>>> params config="/vmstore/config/vm.xml" \
>>> >>>> meta allow-migrate="false" \
>>> >>>> op start interval="0" timeout="120s" \
>>> >>>> op stop interval="0" timeout="120s" \
>>> >>>> op monitor interval="10" timeout="30"
>>> >>>> primitive stonith-node1 stonith:external/tripplitepdu \
>>> >>>> params pdu_ipaddr="192.168.1.12" pdu_port="1"
>>> pdu_username="xxx"
>>> >>>> pdu_password="xxx" hostname_to_stonith="node1"
>>> >>>> primitive stonith-node2 stonith:external/tripplitepdu \
>>> >>>> params pdu_ipaddr="192.168.1.12" pdu_port="2"
>>> pdu_username="xxx"
>>> >>>> pdu_password="xxx" hostname_to_stonith="node2"
>>> >>>> group g_daemons p_libvirt-bin
>>> >>>> group g_vm p_fs_vmstore p_vm
>>> >>>> ms ms_drbd_mount2 p_drbd_mount2 \
>>> >>>> meta master-max="1" master-node-max="1" clone-max="2"
>>> >>>> clone-node-max="1" notify="true"
>>> >>>> ms ms_drbd_mount1 p_drbd_mount1 \
>>> >>>> meta master-max="1" master-node-max="1" clone-max="2"
>>> >>>> clone-node-max="1" notify="true"
>>> >>>> ms ms_drbd_vmstore p_drbd_vmstore \
>>> >>>> meta master-max="1" master-node-max="1" clone-max="2"
>>> >>>> clone-node-max="1" notify="true"
>>> >>>> clone cl_daemons g_daemons
>>> >>>> clone cl_ping p_ping \
>>> >>>> meta interleave="true"
>>> >>>> clone cl_sysadmin_notify p_sysadmin_notify
>>> >>>> location l-st-node1 stonith-node1 -inf: node1
>>> >>>> location l-st-node2 stonith-node2 -inf: node2
>>> >>>> location l_run_on_most_connected p_vm \
>>> >>>> rule $id="l_run_on_most_connected-rule" p_ping: defined
>>> p_ping
>>> >>>> colocation c_drbd_libvirt_vm inf: ms_drbd_vmstore:Master
>>> >>>> ms_drbd_mount1:Master ms_drbd_mount2:Master g_vm
>>> >>>
>>> >>> As Emmanuel already said, g_vm has to come first in this colocation
>>> >>> constraint .... g_vm must be colocated with the drbd masters.
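>>> >>>
>>> >>> I.e. something like:
>>> >>>
>>> >>> colocation c_drbd_libvirt_vm inf: g_vm ms_drbd_vmstore:Master
>>> >>>     ms_drbd_mount1:Master ms_drbd_mount2:Master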
>>> >>>
>>> >>>> order o_drbd-fs-vm inf: ms_drbd_vmstore:promote
>>> ms_drbd_mount1:promote
>>> >>>> ms_drbd_mount2:promote cl_daemons:start g_vm:start
>>> >>>> property $id="cib-bootstrap-options" \
>>> >>>> dc-version="1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c"
>>> \
>>> >>>> cluster-infrastructure="Heartbeat" \
>>> >>>> stonith-enabled="false" \
>>> >>>> no-quorum-policy="stop" \
>>> >>>> last-lrm-refresh="1332539900" \
>>> >>>> cluster-recheck-interval="5m" \
>>> >>>> crmd-integration-timeout="3m" \
>>> >>>> shutdown-escalation="5m"
>>> >>>>
>>> >>>> The STONITH plugin is a custom plugin I wrote for the Tripp-Lite
>>> >>>> PDUMH20ATNET that I'm using as the STONITH device:
>>> >>>> http://www.tripplite.com/shared/product-pages/en/PDUMH20ATNET.pdf
>>> >>>
>>> >>> And why aren't you using it? .... stonith-enabled="false"
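>>> >>>
>>> >>> Once you trust the STONITH resources, enabling it should be as simple
>>> >>> as (if I remember the crm syntax correctly):
>>> >>>
>>> >>> crm configure property stonith-enabled=true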
>>> >>>
>>> >>>>
>>> >>>> As you can see, I left the DRBD service to be started by the
>>> operating
>>> >>>> system (as an lsb script at boot time) however Pacemaker controls
>>> >>>> actually bringing up/taking down the individual DRBD devices.
>>> >>>
>>> >>> Don't start drbd on system boot, give Pacemaker the full control.
>>> >>>
>>> >>> The
>>> >>>> behavior I observe is as follows: I issue "crm resource migrate
>>> p_vm" on
>>> >>>> node1 and failover successfully to node2. During this time, node2
>>> fences
>>> >>>> node1's DRBD devices (using dopd) and marks them as Outdated.
>>> Meanwhile
>>> >>>> node2's DRBD devices are UpToDate. I then shutdown both nodes and
>>> then
>>> >>>> bring them back up. They reconnect to the cluster (with quorum), and
>>> >>>> node1's DRBD devices are still Outdated as expected and node2's DRBD
>>> >>>> devices are still UpToDate, as expected. At this point, DRBD starts
>>> on
>>> >>>> both nodes, however node2 will not set DRBD as master:
>>> >>>> Node quorumnode (c4bf25d7-a6b7-4863-984d-aafd937c0da4): OFFLINE
>>> > (standby)
>>> >>>> Online: [ node2 node1 ]
>>> >>>>
>>> >>>> Master/Slave Set: ms_drbd_vmstore [p_drbd_vmstore]
>>> >>>> Slaves: [ node1 node2 ]
>>> >>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1]
>>> >>>> Slaves: [ node1 node2 ]
>>> >>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>> >>>> Slaves: [ node1 node2 ]
>>> >>>
>>> >>> There should really be no interruption of the drbd replication on vm
>>> >>> migration that activates the dopd ... drbd has its own direct network
>>> >>> connection?
>>> >>>
>>> >>> Please share your ha.cf file and your drbd configuration. Watch out
>>> for
>>> >>> drbd messages in your kernel log file, that should give you
>>> additional
>>> >>> information when/why the drbd connection was lost.
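>>> >>>
>>> >>> E.g. something like this on both nodes, around the time of the
>>> >>> migration (the path assumes a Debian/Ubuntu-style kern.log):
>>> >>>
>>> >>> grep -i drbd /var/log/kern.log | tail -n 100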
>>> >>>
>>> >>> Regards,
>>> >>> Andreas
>>> >>>
>>> >>> --
>>> >>> Need help with Pacemaker?
>>> >>> http://www.hastexo.com/now
>>> >>>
>>> >>>>
>>> >>>> I am having trouble sorting through the logging information because
>>> >>>> there is so much of it in /var/log/daemon.log, but I can't find an
>>> >>>> error message printed about why it will not promote node2. At this
>>> point
>>> >>>> the DRBD devices are as follows:
>>> >>>> node2: cstate = WFConnection dstate=UpToDate
>>> >>>> node1: cstate = StandAlone dstate=Outdated
>>> >>>>
>>> >>>> I don't see any reason why node2 can't become DRBD master, or am I
>>> >>>> missing something? If I do "drbdadm connect all" on node1, then the
>>> >>>> cstate on both nodes changes to "Connected" and node2 immediately
>>> >>>> promotes the DRBD resources to master. Any ideas on why I'm
>>> observing
>>> >>>> this incorrect behavior?
>>> >>>>
>>> >>>> Any tips on how I can better filter through the pacemaker/heartbeat
>>> logs
>>> >>>> or how to get additional useful debug information?
>>> >>>>
>>> >>>> Thanks,
>>> >>>>
>>> >>>> Andrew
>>> >>>>
>>> >>>>
>>> ------------------------------------------------------------------------
>>> >>>> *From: *"Andreas Kurz" <andreas at hastexo.com
>>> > <mailto:andreas at hastexo.com>>
>>> >>>> *To: *pacemaker at oss.clusterlabs.org
>>> >> <mailto:*pacemaker at oss.clusterlabs.org>
>>> >>>> *Sent: *Wednesday, 1 February, 2012 4:19:25 PM
>>> >>>> *Subject: *Re: [Pacemaker] Nodes will not promote DRBD resources to
>>> >>>> master on failover
>>> >>>>
>>> >>>> On 01/25/2012 08:58 PM, Andrew Martin wrote:
>>> >>>>> Hello,
>>> >>>>>
>>> >>>>> Recently I finished configuring a two-node cluster with pacemaker
>>> 1.1.6
>>> >>>>> and heartbeat 3.0.5 on nodes running Ubuntu 10.04. This cluster
>>> > includes
>>> >>>>> the following resources:
>>> >>>>> - primitives for DRBD storage devices
>>> >>>>> - primitives for mounting the filesystem on the DRBD storage
>>> >>>>> - primitives for some mount binds
>>> >>>>> - primitive for starting apache
>>> >>>>> - primitives for starting samba and nfs servers (following
>>> instructions
>>> >>>>> here <http://www.linbit.com/fileadmin/tech-guides/ha-nfs.pdf>)
>>> >>>>> - primitives for exporting nfs shares (ocf:heartbeat:exportfs)
>>> >>>>
>>> >>>> not enough information ... please share at least your complete
>>> cluster
>>> >>>> configuration
>>> >>>>
>>> >>>> Regards,
>>> >>>> Andreas
>>> >>>>
>>> >>>> --
>>> >>>> Need help with Pacemaker?
>>> >>>> http://www.hastexo.com/now
>>> >>>>
>>> >>>>>
>>> >>>>> Perhaps this is best described through the output of crm_mon:
>>> >>>>> Online: [ node1 node2 ]
>>> >>>>>
>>> >>>>> Master/Slave Set: ms_drbd_mount1 [p_drbd_mount1] (unmanaged)
>>> >>>>> p_drbd_mount1:0 (ocf::linbit:drbd): Started node2
>>> >>> (unmanaged)
>>> >>>>> p_drbd_mount1:1 (ocf::linbit:drbd): Started node1
>>> >>>>> (unmanaged) FAILED
>>> >>>>> Master/Slave Set: ms_drbd_mount2 [p_drbd_mount2]
>>> >>>>> p_drbd_mount2:0 (ocf::linbit:drbd): Master node1
>>> >>>>> (unmanaged) FAILED
>>> >>>>> Slaves: [ node2 ]
>>> >>>>> Resource Group: g_core
>>> >>>>> p_fs_mount1 (ocf::heartbeat:Filesystem): Started node1
>>> >>>>> p_fs_mount2 (ocf::heartbeat:Filesystem): Started node1
>>> >>>>> p_ip_nfs (ocf::heartbeat:IPaddr2): Started node1
>>> >>>>> Resource Group: g_apache
>>> >>>>> p_fs_mountbind1 (ocf::heartbeat:Filesystem): Started
>>> node1
>>> >>>>> p_fs_mountbind2 (ocf::heartbeat:Filesystem): Started
>>> node1
>>> >>>>> p_fs_mountbind3 (ocf::heartbeat:Filesystem): Started
>>> node1
>>> >>>>> p_fs_varwww (ocf::heartbeat:Filesystem): Started
>>> node1
>>> >>>>> p_apache (ocf::heartbeat:apache): Started node1
>>> >>>>> Resource Group: g_fileservers
>>> >>>>> p_lsb_smb (lsb:smbd): Started node1
>>> >>>>> p_lsb_nmb (lsb:nmbd): Started node1
>>> >>>>> p_lsb_nfsserver (lsb:nfs-kernel-server): Started
>>> node1
>>> >>>>> p_exportfs_mount1 (ocf::heartbeat:exportfs): Started
>>> node1
>>> >>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): Started
>>> > node1
>>> >>>>>
>>> >>>>> I have read through the Pacemaker Explained documentation
>>> >>>>> (http://www.clusterlabs.org/doc/en-US/Pacemaker/1.1/html-single/Pacemaker_Explained),
>>> >>>>> however I could not find a way to further debug these
>>> >>>>> problems. First, I put node1 into standby mode to attempt failover
>>> to
>>> >>>>> the other node (node2). Node2 appeared to start the transition to
>>> >>>>> master, however it failed to promote the DRBD resources to master
>>> (the
>>> >>>>> first step). I have attached a copy of this session in
>>> commands.log and
>>> >>>>> additional excerpts from /var/log/syslog during important steps. I
>>> have
>>> >>>>> attempted everything I can think of to try and start the DRBD
>>> resource
>>> >>>>> (e.g. start/stop/promote/manage/cleanup under crm resource,
>>> restarting
>>> >>>>> heartbeat) but cannot bring it out of the slave state. However, if
>>> > I set
>>> >>>>> it to unmanaged and then run drbdadm primary all in the terminal,
>>> >>>>> pacemaker is satisfied and continues starting the rest of the
>>> > resources.
>>> >>>>> It then failed when attempting to mount the filesystem for mount2,
>>> the
>>> >>>>> p_fs_mount2 resource. I attempted to mount the filesystem myself
>>> > and was
>>> >>>>> successful. I then unmounted it and ran cleanup on p_fs_mount2 and
>>> then
>>> >>>>> it mounted. The rest of the resources started as expected until the
>>> >>>>> p_exportfs_mount2 resource, which failed as follows:
>>> >>>>> p_exportfs_mount2 (ocf::heartbeat:exportfs): started node2
>>> >>>>> (unmanaged) FAILED
>>> >>>>>
>>> >>>>> I ran cleanup on this and it started, however when running this
>>> test
>>> >>>>> earlier today no command could successfully start this exportfs
>>> >> resource.
>>> >>>>>
>>> >>>>> How can I configure pacemaker to better resolve these problems and
>>> be
>>> >>>>> able to bring the node up successfully on its own? What can I
>>> check to
>>> >>>>> determine why these failures are occurring? /var/log/syslog did not seem
>>> seem
>>> >>>>> to contain very much useful information regarding why the failures
>>> >>>> occurred.
>>> >>>>>
>>> >>>>> Thanks,
>>> >>>>>
>>> >>>>> Andrew
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>> >>>>>
>>
>>
>> --
>> this is my life and I live it for as long as God wills
>>
>
>
> --
> this is my life and I live it for as long as God wills
>
>
>
--
this is my life and I live it for as long as God wills