[Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10

Andrew Beekhof andrew at beekhof.net
Mon Aug 26 23:01:43 EDT 2013


On 27/08/2013, at 3:31 AM, Andreas Mock <Andreas.Mock at web.de> wrote:

> Hi all,
> 
> While the linbit drbd resource agent seems to work perfectly on
> pacemaker 1.1.8 (standard software repository), we have problems
> with the latest release 1.1.10 and also with the newest head
> 1.1.11.xxx. 
> 
> As using drbd is not exactly uncommon, I really hope to find interested
> people willing to help me out. I can provide as much debug information as
> you want.
> 
> 
> Environment:
> RHEL 6.4 clone (Scientific Linux 6.4) cman based cluster.
> DRBD 8.4.3 compiled from sources.
> 64bit
> 
> - A drbd resource configured following the linbit documentation.
> - Manual start and stop (up/down) and setting primary of the drbd resource
> work smoothly (roughly the commands sketched below).
> - 2 nodes dis03-test/dis04-test
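> 
> For reference, "up/down and setting primary" means roughly the following
> drbdadm calls (a sketch, assuming the resource name "postfix" and the
> non-default drbd.conf path used in the pacemaker config below):
> 
>   drbdadm -c /usr/local/etc/drbd.conf up postfix         # attach the disk and connect to the peer
>   drbdadm -c /usr/local/etc/drbd.conf primary postfix    # promote to Primary on one node
>   drbdadm -c /usr/local/etc/drbd.conf secondary postfix  # demote back to Secondary
>   drbdadm -c /usr/local/etc/drbd.conf down postfix       # disconnect and detach again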
> 
> 
> 
> - Following simple config on pacemaker 1.1.8
> configure
>    property no-quorum-policy=stop
>    property stonith-enabled=true
>    rsc_defaults resource-stickiness=2
>    primitive r_stonith-dis03-test stonith:fence_mock \
>        meta resource-stickiness="INFINITY" target-role="Started" \
>        op monitor interval="180" timeout="300" requires="nothing" \
>        op start interval="0" timeout="300" \
>        op stop interval="0" timeout="300" \
>        params vmname=dis03-test pcmk_host_list="dis03-test"
>    primitive r_stonith-dis04-test stonith:fence_mock \
>        meta resource-stickiness="INFINITY" target-role="Started" \
>        op monitor interval="180" timeout="300" requires="nothing" \
>        op start interval="0" timeout="300" \
>        op stop interval="0" timeout="300" \
>        params vmname=dis04-test pcmk_host_list="dis04-test"
>    location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
>        rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
>    location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
>        rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
>    primitive r_drbd_postfix ocf:linbit:drbd \
>        params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \
>        op monitor interval="15s"  timeout="60s" role="Master" \
>        op monitor interval="45s"  timeout="60s" role="Slave" \
>        op start timeout="240" \
>        op stop timeout="240" \
>        meta target-role="Stopped" migration-threshold="2"
>    ms ms_drbd_postfix r_drbd_postfix \
>        meta master-max="1" master-node-max="1" \
>        clone-max="2" clone-node-max="1" \
>        notify="true" \
>        meta target-role="Stopped"
> commit
> 
> - Pacemaker is started from scratch
> - Config above is applied by crm -f <file> where
> <file> has the above config snippet.
> 
> - After that crm_mon shows the following status
> ----------------------8<-------------------------
> Last updated: Mon Aug 26 18:42:47 2013
> Last change: Mon Aug 26 18:42:42 2013 via cibadmin on dis03-test
> Stack: cman
> Current DC: dis03-test - partition with quorum
> Version: 1.1.10-1.el6-9abe687
> 2 Nodes configured
> 4 Resources configured
> 
> 
> Online: [ dis03-test dis04-test ]
> 
> Full list of resources:
> 
> r_stonith-dis03-test   (stonith:fence_mock):   Started dis04-test
> r_stonith-dis04-test   (stonith:fence_mock):   Started dis03-test
> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>     Stopped: [ dis03-test dis04-test ]
> 
> Migration summary:
> * Node dis04-test:
> * Node dis03-test:
> ----------------------8<-------------------------
> 
> cat /proc/drbd
> version: 8.4.3 (api:1/proto:86-101)
> GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by root at dis03-test,
> 2013-07-24 17:19:24
> 
> on both nodes. The drbd resource was previously shut down in a clean state,
> so that either node can become primary.
> 
> - Now the weird behaviour appears when trying to start the drbd resource
> with
>   crm resource start ms_drbd_postfix
> 
> 
> Output of crm_mon -1rf
> ----------------------8<-------------------------
> Last updated: Mon Aug 26 18:46:33 2013
> Last change: Mon Aug 26 18:46:30 2013 via cibadmin on dis04-test
> Stack: cman
> Current DC: dis03-test - partition with quorum
> Version: 1.1.10-1.el6-9abe687
> 2 Nodes configured
> 4 Resources configured
> 
> 
> Online: [ dis03-test dis04-test ]
> 
> Full list of resources:
> 
> r_stonith-dis03-test   (stonith:fence_mock):   Started dis04-test
> r_stonith-dis04-test   (stonith:fence_mock):   Started dis03-test
> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>     Slaves: [ dis03-test ]
>     Stopped: [ dis04-test ]
> 
> Migration summary:
> * Node dis04-test:
>   r_drbd_postfix: migration-threshold=2 fail-count=2 last-failure='Mon Aug
> 26 18:46:30 2013'
> * Node dis03-test:
> 

It's hard to imagine how pacemaker could cause drbdadm to fail, short of leaving one side promoted while trying to promote the other.
Perhaps the drbd folks could comment on what the error means.
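
Something like the following, run on both nodes right after the failed
promote, would show whether that was the case (just a sketch; adjust the
resource name and config path to match the setup above):

    drbdadm -c /usr/local/etc/drbd.conf role postfix    # local/peer role, e.g. Secondary/Secondary
    drbdadm -c /usr/local/etc/drbd.conf dstate postfix  # local/peer disk state, e.g. UpToDate/UpToDate
    drbdadm -c /usr/local/etc/drbd.conf cstate postfix  # connection state, e.g. Connected
    cat /proc/drbd                                      # the same information in one place

A disk state other than UpToDate on the node being promoted, with no
connection to an UpToDate peer, would be consistent with the "Need access
to UpToDate data" error quoted below.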

> Failed actions:
>    r_drbd_postfix_promote_0 (node=dis04-test, call=34, rc=1, status=complete,
>      last-rc-change=Mon Aug 26 18:46:29 2013, queued=1212ms, exec=0ms): unknown error
> ----------------------8<-------------------------
> 
> In the log of the drbd agent I can find the following
> when the promote request is handled on dis03-test:
> 
> ----------------------8<-------------------------
> ++ drbdadm -c /usr/local/etc/drbd.conf primary postfix
> 0: State change failed: (-2) Need access to UpToDate data
> Command 'drbdsetup primary 0' terminated with exit code 17
> + cmd_out=
> + ret=17
> + '[' 17 '!=' 0 ']'
> + ocf_log err 'postfix: Called drbdadm -c /usr/local/etc/drbd.conf primary
> postfix'
> + '[' 2 -lt 2 ']'
> + __OCF_PRIO=err
> + shift
> ----------------------8<-------------------------
> 
> While this works without problems on pacemaker 1.1.8, it doesn't work here.
> The error message leads me to assume that there is some kind of
> race condition where pacemaker fires the promotion too early.
> It probably has something to do with applying the attributes set by the
> drbd resource agent.
> But this is just a guess and I really don't know.
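> 
> To make that guess a bit more concrete: as far as I understand it, the
> agent tells pacemaker which node may become master by setting a transient
> node attribute via crm_master, roughly
> 
>   crm_master -l reboot -v 10000    # example score; the real agent computes the value
>   crm_master -l reboot -D          # remove the preference when the local data is not usable
> 
> and pacemaker should only promote a node once that attribute is in place.
> On the node that failed to promote it can be inspected with something like
> (the attribute name is usually master-<resource>, here presumably
> master-r_drbd_postfix):
> 
>   crm_attribute -N dis04-test -l reboot -n master-r_drbd_postfix --query -q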
> 
> ONE ADDITIONAL piece of information: as soon as I do a
> resource cleanup on the "defective" node, the master
> is promoted as expected. That means a
>   crm resource cleanup r_drbd_postfix dis03-test
> results in the following:
> 
> ----------------------8<-------------------------
> Last updated: Mon Aug 26 19:29:38 2013
> Last change: Mon Aug 26 19:29:28 2013 via cibadmin on dis04-test
> Stack: cman
> Current DC: dis03-test - partition with quorum
> Version: 1.1.10-1.el6-9abe687
> 2 Nodes configured
> 4 Resources configured
> 
> 
> Online: [ dis03-test dis04-test ]
> 
> Full list of resources:
> 
> r_stonith-dis03-test   (stonith:fence_mock):   Started dis04-test
> r_stonith-dis04-test   (stonith:fence_mock):   Started dis03-test
> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>     Masters: [ dis03-test ]
>     Slaves: [ dis04-test ]
> 
> Migration summary:
> * Node dis03-test:
> * Node dis04-test:
> ----------------------8<-------------------------
> 
> 
> 
> I really hope this gets some attention, as pacemaker 1.1.10
> is a milestone for Andrew, and drbd from linbit is surely
> a building block of many pacemaker-based clusters.
> 
> Cluster log of DC dis03-test at http://pastebin.com/2S9Y6V3P
> DRBD agent log at http://pastebin.com/ceYNEAhH
> 
> 
> So, any help welcome.
> 
> Best regards
> Andreas Mock
> 
> 
> _______________________________________________
> Linux-HA mailing list
> Linux-HA at lists.linux-ha.org
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
