[Pacemaker] drbd under pacemaker - always get split brain

Wed Jul 11 03:37:26 EDT 2012

On 07/11/2012 04:50 AM, Andrew Beekhof wrote:
> On Wed, Jul 11, 2012 at 8:06 AM, Andreas Kurz <andreas at hastexo.com> wrote:
>> On Tue, Jul 10, 2012 at 8:12 AM, Nikola Ciprich
>> <nikola.ciprich at linuxbox.cz> wrote:
>>> Hello Andreas,
>>>> Why not use the RA that comes with the resource-agents package?
>>> well, I've historically used my own scripts and didn't even notice when the
>>> LVM resource agent appeared. I've switched to it now, thanks for the hint.
>>>> this "become-primary-on" was never activated?
>>> nope.
>>>
>>>
>>>> Is the drbd init script deactivated on system boot? Cluster logs should
>>>> give more insights ....
>>> yes, it's deactivated. I tried resyncing drbd by hand, deleted the logs,
>>> rebooted both nodes, checked that drbd wasn't started, and started corosync.
>>> result is here:
>>> http://nelide.cz/nik/logs.tar.gz
>>
>> It really really looks like Pacemaker is too fast when promoting to
>> primary ... before the connection to the already up second node can be
>> established.
> 
> Do you mean we're violating a constraint?
> Or is it a problem of the RA returning too soon?

It looks like an RA problem: the notifications after the start of the
resource and the promote that follows come very quickly, and DRBD has not
yet finished establishing the connection to the peer. I can't remember
seeing this before.

Regards,
Andreas
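
For anyone who wants to check their own kernel log for the same race, here
is a minimal Python sketch. It assumes dmesg lines in the format quoted
further down in this thread. The function name find_early_promotions and
the set of "no peer yet" connection states are assumptions made for
illustration; they are not part of DRBD or of the resource agent.

```python
import re

# Matches kernel log lines such as:
#   [  158.960577] block drbd2: role( Secondary -> Primary )
# Leading quote markers ("> ") are tolerated because search() is used.
LINE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+block (?P<dev>drbd\d+): (?P<rest>.*)"
)
CHANGE = re.compile(r"\b(?P<kind>conn|role)\( (?P<old>\w+) -> (?P<new>\w+) \)")

# Assumption: in these states the device has no established link to its peer.
NO_PEER_STATES = {"StandAlone", "Unconnected", "WFConnection"}


def find_early_promotions(lines):
    """Return (timestamp, device, conn_state) for every promotion to Primary
    that was logged while the device had no connection to its peer."""
    conn_state = {}  # device name -> last connection state seen in the log
    early = []
    for line in lines:
        m = LINE.search(line)
        if m is None:
            continue
        dev = m.group("dev")
        for change in CHANGE.finditer(m.group("rest")):
            if change.group("kind") == "conn":
                conn_state[dev] = change.group("new")
            elif change.group("new") == "Primary":
                # Assumption: a device with no conn line yet is StandAlone.
                state = conn_state.get(dev, "StandAlone")
                if state in NO_PEER_STATES:
                    early.append((float(m.group("ts")), dev, state))
    return early


if __name__ == "__main__":
    sample = [
        "[  157.685539] block drbd2: conn( StandAlone -> Unconnected )",
        "[  157.686071] block drbd2: conn( Unconnected -> WFConnection )",
        "[  158.960577] block drbd2: role( Secondary -> Primary )",
        "[  162.687183] block drbd2: conn( WFConnection -> WFReportParams )",
    ]
    for ts, dev, state in find_early_promotions(sample):
        print(f"{dev}: promoted at {ts:.6f}s while conn={state}")
```

Fed the dmesg excerpt from the original report, this flags the promotion
at 158.960577, which was logged while drbd2 was still in WFConnection,
about four seconds before the handshake with the peer.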

> 
>> I see in your logs that you have the DRBD 8.3.13 userland but the
>> 8.3.11 DRBD kernel module installed. Can you test with the 8.3.13 kernel
>> module? There have been fixes that look like they address this problem.
>>
>> Another quick fix that should also work: add a start-delay of a few
>> seconds to the start operation of DRBD
>>
>> ... or fix your after-split-brain policies to automatically resolve this
>> special type of split-brain (with 0 blocks to sync).
>>
>> Best Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> thanks for Your time.
>>> n.
>>>
>>>
>>>>
>>>> Regards,
>>>> Andreas
>>>>
>>>>
>>>>>
>>>>> thanks a lot in advance
>>>>>
>>>>> nik
>>>>>
>>>>>
>>>>> On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
>>>>>> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
>>>>>>> hello,
>>>>>>>
>>>>>>> I'm trying to solve a quite mysterious problem here.
>>>>>>> I've got a new cluster with a bunch of SAS disks for testing purposes.
>>>>>>> I've configured DRBDs (in primary/primary configuration).
>>>>>>>
>>>>>>> when I start drbd using drbdadm, it comes up nicely (both nodes
>>>>>>> are Primary, connected).
>>>>>>> however, when I start it using corosync, I always get split-brain, although
>>>>>>> no data has been written and there was no network disconnection or anything.
>>>>>>
>>>>>> your full drbd and Pacemaker configuration please ... some snippets from
>>>>>> something are very seldom helpful ...
>>>>>>
>>>>>> Regards,
>>>>>> Andreas
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> here's drbd resource config:
>>>>>>> primitive drbd-sas0 ocf:linbit:drbd \
>>>>>>>     params drbd_resource="drbd-sas0" \
>>>>>>>     operations $id="drbd-sas0-operations" \
>>>>>>>     op start interval="0" timeout="240s" \
>>>>>>>     op stop interval="0" timeout="200s" \
>>>>>>>     op promote interval="0" timeout="200s" \
>>>>>>>     op demote interval="0" timeout="200s" \
>>>>>>>     op monitor interval="179s" role="Master" timeout="150s" \
>>>>>>>     op monitor interval="180s" role="Slave" timeout="150s"
>>>>>>>
>>>>>>> ms ms-drbd-sas0 drbd-sas0 \
>>>>>>>    meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" notify="true" globally-unique="false" interleave="true" target-role="Started"
>>>>>>>
>>>>>>>
>>>>>>> here's the dmesg output when pacemaker tries to promote drbd, causing the splitbrain:
>>>>>>> [  157.646292] block drbd2: Starting worker thread (from drbdsetup [6892])
>>>>>>> [  157.646539] block drbd2: disk( Diskless -> Attaching )
>>>>>>> [  157.650364] block drbd2: Found 1 transactions (1 active extents) in activity log.
>>>>>>> [  157.650560] block drbd2: Method to ensure write ordering: drain
>>>>>>> [  157.650688] block drbd2: drbd_bm_resize called with capacity == 584667688
>>>>>>> [  157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 pages=2231
>>>>>>> [  157.653760] block drbd2: size = 279 GB (292333844 KB)
>>>>>>> [  157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
>>>>>>> [  157.673722] block drbd2: recounting of set bits took additional 2 jiffies
>>>>>>> [  157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
>>>>>>> [  157.673972] block drbd2: disk( Attaching -> UpToDate )
>>>>>>> [  157.674100] block drbd2: attached to UUIDs 0150944D23F16BAE:0000000000000000:8C175205284E3262:8C165205284E3263
>>>>>>> [  157.685539] block drbd2: conn( StandAlone -> Unconnected )
>>>>>>> [  157.685704] block drbd2: Starting receiver thread (from drbd2_worker [6893])
>>>>>>> [  157.685928] block drbd2: receiver (re)started
>>>>>>> [  157.686071] block drbd2: conn( Unconnected -> WFConnection )
>>>>>>> [  158.960577] block drbd2: role( Secondary -> Primary )
>>>>>>> [  158.960815] block drbd2: new current UUID 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
>>>>>>> [  162.686990] block drbd2: Handshake successful: Agreed network protocol version 96
>>>>>>> [  162.687183] block drbd2: conn( WFConnection -> WFReportParams )
>>>>>>> [  162.687404] block drbd2: Starting asender thread (from drbd2_receiver [6927])
>>>>>>> [  162.687741] block drbd2: data-integrity-alg: <not-used>
>>>>>>> [  162.687930] block drbd2: drbd_sync_handshake:
>>>>>>> [  162.688057] block drbd2: self 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 bits:0 flags:0
>>>>>>> [  162.688244] block drbd2: peer 7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 bits:0 flags:0
>>>>>>> [  162.688428] block drbd2: uuid_compare()=100 by rule 90
>>>>>>> [  162.688544] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
>>>>>>> [  162.691332] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
>>>>>>>
>>>>>>> to me it seems that it's being promoted too early, and I also wonder why there is the
>>>>>>> "new current UUID" stuff?
>>>>>>>
>>>>>>> I'm using centos6, kernel 3.0.36, drbd-8.3.13, pacemaker-1.1.6
>>>>>>>
>>>>>>> could anybody please try to advise me? I'm sure I'm doing something stupid, but can't figure out what...
>>>>>>>
>>>>>>> thanks a lot in advance
>>>>>>>
>>>>>>> with best regards
>>>>>>>
>>>>>>> nik
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>>>>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>>>>>
>>>>>>> Project Home: http://www.clusterlabs.org
>>>>>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>>>>>> Bugs: http://bugs.clusterlabs.org
>>>>>>>
>>> --
>>> -------------------------------------
>>> Ing. Nikola CIPRICH
>>> LinuxBox.cz, s.r.o.
>>> 28.rijna 168, 709 00 Ostrava
>>>
>>> tel.:   +420 591 166 214
>>> fax:    +420 596 621 273
>>> mobil:  +420 777 093 799
>>> www.linuxbox.cz
>>>
>>> mobil servis: +420 737 238 656
>>> email servis: servis at linuxbox.cz
>>> -------------------------------------
