[Pacemaker] drbd under pacemaker - always get split brain
Andreas Kurz
andreas at hastexo.com
Mon Jul 9 21:43:41 UTC 2012
On 07/09/2012 12:58 PM, Nikola Ciprich wrote:
> Hello Andreas,
>
> yes, you're right. I should have sent those in the initial post. Sorry about that.
> I've created a very simple test configuration on which I'm able to reproduce the problem.
> There's no stonith etc., since these are just two virtual machines for the test.
>
> crm conf:
>
> primitive drbd-sas0 ocf:linbit:drbd \
> params drbd_resource="drbd-sas0" \
> operations $id="drbd-sas0-operations" \
> op start interval="0" timeout="240s" \
> op stop interval="0" timeout="200s" \
> op promote interval="0" timeout="200s" \
> op demote interval="0" timeout="200s" \
> op monitor interval="179s" role="Master" timeout="150s" \
> op monitor interval="180s" role="Slave" timeout="150s"
>
> primitive lvm ocf:lbox:lvm.ocf \
Why not use the RA that comes with the resource-agents package?
> op start interval="0" timeout="180" \
> op stop interval="0" timeout="180"
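For comparison, a roughly equivalent primitive using the stock LVM agent from the resource-agents package could look like this (volume group name taken from the description further down; timeouts illustrative only):

  primitive lvm ocf:heartbeat:LVM \
          params volgrpname="vgshared" \
          op start interval="0" timeout="30" \
          op stop interval="0" timeout="30"

That agent simply activates/deactivates the named volume group, which seems to be all the custom lvm.ocf script does here.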
>
> ms ms-drbd-sas0 drbd-sas0 \
> meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" notify="true" globally-unique="false" interleave="true" target-role="Started"
>
> clone cl-lvm lvm \
> meta globally-unique="false" ordered="false" interleave="true" notify="false" target-role="Started" \
> params lvm-clone-max="2" lvm-clone-node-max="1"
>
> colocation col-lvm-drbd-sas0 inf: cl-lvm ms-drbd-sas0:Master
>
> order ord-drbd-sas0-lvm inf: ms-drbd-sas0:promote cl-lvm:start
>
> property $id="cib-bootstrap-options" \
> dc-version="1.1.6-3.el6-a02c0f19a00c1eb2527ad38f146ebc0834814558" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> no-quorum-policy="ignore" \
> stonith-enabled="false"
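Side note: with stonith-enabled="false" and no-quorum-policy="ignore" there is nothing to protect a dual-primary setup from exactly this scenario. Fine for a throwaway test, but a real cluster needs working fencing; a minimal sketch (the stonith agent and its parameters are placeholders, pick whatever matches the hardware):

  property stonith-enabled="true"
  primitive st-node1 stonith:external/ipmi \
          params hostname="node1" ipaddr="192.168.0.101" userid="admin" passwd="secret"
  primitive st-node2 stonith:external/ipmi \
          params hostname="node2" ipaddr="192.168.0.102" userid="admin" passwd="secret"
  location l-st-node1 st-node1 -inf: node1
  location l-st-node2 st-node2 -inf: node2

The location constraints keep each stonith resource off the node it is supposed to shoot.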
>
> the lvm resource activates the vgshared volume group on top of DRBD (LVM filters are set
> to accept /dev/drbd* devices only)
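Such a filter in /etc/lvm/lvm.conf would look roughly like this (the exact regex depends on the local device layout):

  filter = [ "a|^/dev/drbd.*|", "r|.*|" ]
  write_cache_state = 0

Setting write_cache_state = 0 (and removing /etc/lvm/cache/.cache once) keeps LVM from remembering backing devices it saw earlier.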
>
> drbd configuration:
>
> global {
> usage-count no;
> }
>
> common {
> protocol C;
>
> handlers {
> pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; ";
> pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; ";
> local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; ";
>
> # pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> # pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
> # local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
> # fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
> # split-brain "/usr/lib/drbd/notify-split-brain.sh root";
> # out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
> # before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
> # after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
> }
>
> net {
> allow-two-primaries;
> after-sb-0pri discard-zero-changes;
> after-sb-1pri discard-secondary;
> after-sb-2pri call-pri-lost-after-sb;
> #rr-conflict disconnect;
> max-buffers 8000;
> max-epoch-size 8000;
> sndbuf-size 0;
> ping-timeout 50;
> }
>
> syncer {
> rate 100M;
> al-extents 3833;
> # al-extents 257;
> # verify-alg sha1;
> }
>
> disk {
> on-io-error detach;
> no-disk-barrier;
> no-disk-flushes;
> no-md-flushes;
> }
>
> startup {
> # wfc-timeout 0;
> degr-wfc-timeout 120; # 2 minutes.
> # become-primary-on both;
this "become-primary-on" was never activated?
>
> }
> }
>
> note that the pri-on-incon-degr etc. handlers are intentionally commented out so I can
> see what's going on... otherwise the machine always got an immediate reboot.
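Leaving the reboot handlers out while debugging is understandable, but the commented-out fence-peer handler is a different story: for a dual-primary resource under Pacemaker, the usual recommendation is resource-level fencing, roughly:

  disk {
          fencing resource-and-stonith;
  }
  handlers {
          fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
          after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
  }

so that on connection loss DRBD freezes I/O and places a constraint on the peer's Master role before either side proceeds alone.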
>
> any idea?
Is the drbd init script deactivated on system boot? The cluster logs should
give more insight ...
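On CentOS 6 that is quickly checked (and fixed) with:

  chkconfig --list drbd
  chkconfig drbd off

The ocf:linbit:drbd agent expects to be fully in charge of the resource; a resource the init script has already brought up outside the cluster's control is a classic source of surprises like this.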
Regards,
Andreas
--
Need help with Pacemaker?
http://www.hastexo.com/now
>
> thanks a lot in advance
>
> nik
>
>
> On Sun, Jul 08, 2012 at 12:47:16AM +0200, Andreas Kurz wrote:
>> On 07/02/2012 11:49 PM, Nikola Ciprich wrote:
>>> hello,
>>>
>>> I'm trying to solve a quite mysterious problem here..
>>> I've got a new cluster with a bunch of SAS disks for testing purposes.
>>> I've configured the DRBDs (in primary/primary configuration)
>>>
>>> when I start drbd using drbdadm, it comes up nicely (both nodes
>>> are Primary, connected).
>>> however, when I start it under corosync, I always get a split-brain, although
>>> no data has been written and there was no network disconnection, nothing..
>>
>> your full drbd and Pacemaker configuration please ... isolated snippets
>> are very seldom helpful ...
>>
>> Regards,
>> Andreas
>>
>> --
>> Need help with Pacemaker?
>> http://www.hastexo.com/now
>>
>>>
>>> here's drbd resource config:
>>> primitive drbd-sas0 ocf:linbit:drbd \
>>> params drbd_resource="drbd-sas0" \
>>> operations $id="drbd-sas0-operations" \
>>> op start interval="0" timeout="240s" \
>>> op stop interval="0" timeout="200s" \
>>> op promote interval="0" timeout="200s" \
>>> op demote interval="0" timeout="200s" \
>>> op monitor interval="179s" role="Master" timeout="150s" \
>>> op monitor interval="180s" role="Slave" timeout="150s"
>>>
>>> ms ms-drbd-sas0 drbd-sas0 \
>>> meta clone-max="2" clone-node-max="1" master-max="2" master-node-max="1" notify="true" globally-unique="false" interleave="true" target-role="Started"
>>>
>>>
>>> here's the dmesg output from when pacemaker tries to promote drbd, causing the split-brain:
>>> [ 157.646292] block drbd2: Starting worker thread (from drbdsetup [6892])
>>> [ 157.646539] block drbd2: disk( Diskless -> Attaching )
>>> [ 157.650364] block drbd2: Found 1 transactions (1 active extents) in activity log.
>>> [ 157.650560] block drbd2: Method to ensure write ordering: drain
>>> [ 157.650688] block drbd2: drbd_bm_resize called with capacity == 584667688
>>> [ 157.653442] block drbd2: resync bitmap: bits=73083461 words=1141930 pages=2231
>>> [ 157.653760] block drbd2: size = 279 GB (292333844 KB)
>>> [ 157.671626] block drbd2: bitmap READ of 2231 pages took 18 jiffies
>>> [ 157.673722] block drbd2: recounting of set bits took additional 2 jiffies
>>> [ 157.673846] block drbd2: 0 KB (0 bits) marked out-of-sync by on disk bit-map.
>>> [ 157.673972] block drbd2: disk( Attaching -> UpToDate )
>>> [ 157.674100] block drbd2: attached to UUIDs 0150944D23F16BAE:0000000000000000:8C175205284E3262:8C165205284E3263
>>> [ 157.685539] block drbd2: conn( StandAlone -> Unconnected )
>>> [ 157.685704] block drbd2: Starting receiver thread (from drbd2_worker [6893])
>>> [ 157.685928] block drbd2: receiver (re)started
>>> [ 157.686071] block drbd2: conn( Unconnected -> WFConnection )
>>> [ 158.960577] block drbd2: role( Secondary -> Primary )
>>> [ 158.960815] block drbd2: new current UUID 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263
>>> [ 162.686990] block drbd2: Handshake successful: Agreed network protocol version 96
>>> [ 162.687183] block drbd2: conn( WFConnection -> WFReportParams )
>>> [ 162.687404] block drbd2: Starting asender thread (from drbd2_receiver [6927])
>>> [ 162.687741] block drbd2: data-integrity-alg: <not-used>
>>> [ 162.687930] block drbd2: drbd_sync_handshake:
>>> [ 162.688057] block drbd2: self 015E111F18D08945:0150944D23F16BAE:8C175205284E3262:8C165205284E3263 bits:0 flags:0
>>> [ 162.688244] block drbd2: peer 7EC38CBFC3D28FFF:0150944D23F16BAF:8C175205284E3263:8C165205284E3263 bits:0 flags:0
>>> [ 162.688428] block drbd2: uuid_compare()=100 by rule 90
>>> [ 162.688544] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2
>>> [ 162.691332] block drbd2: helper command: /sbin/drbdadm initial-split-brain minor-2 exit code 0 (0x0)
>>>
>>> to me it seems that it's being promoted too early, and I also wonder why the
>>> "new current UUID" message appears?
>>>
>>> I'm using centos6, kernel 3.0.36, drbd-8.3.13, pacemaker-1.1.6
>>>
>>> could anybody please advise me? I'm sure I'm doing something stupid, but I can't figure out what...
>>>
>>> thanks a lot in advance
>>>
>>> with best regards
>>>
>>> nik
>>>
>>>
>>>