[Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
Andrew Beekhof
andrew at beekhof.net
Tue Aug 27 22:11:40 UTC 2013
On 27/08/2013, at 2:51 PM, Andreas Mock <Andreas.Mock at web.de> wrote:
> Hi Andrew,
>
> as this is a real showstopper at the moment, I invested several more
> hours to make sure (as far as possible) that I have not made an error.
>
> Some additions:
> 1) I mirrored the whole mini drbd config to another pacemaker cluster.
> Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not
The version of drbd is the same too?
> 2) When I remove the target-role Stopped from the drbd ms resource
> and insert the config snippet related to the drbd device via crm -f <file>
> into a lean running pacemaker config (pacemaker cluster options, stonith
> resources), it seems to work. That means one of the nodes gets promoted.
>
> Then after stopping 'crm resource stop ms_drbd_xxx' and starting again
> I see the same promotion error as described.
>
> The drbd resource agent uses /usr/sbin/crm_master.
> Is there any possibility that feedback given through this client tool
> changes the timing behaviour of pacemaker? Or the way
> transitions are scheduled?
> Any idea what change in pacemaker might be related?
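For context: the linbit agent advertises its promotion preference through crm_master, which stores a transient node attribute that the policy engine reads when deciding whom to promote. A rough sketch of the calls involved (option names per the crm_master man page; the score value here is illustrative, not the agent's actual number):

----------------------8<-------------------------
# Advertise this node as a promotion candidate. "-l reboot" makes the
# attribute transient (cleared when the node restarts); 10000 is an
# illustrative score.
crm_master -Q -l reboot -v 10000

# On demote/failure: withdraw the preference again.
crm_master -Q -l reboot -D
----------------------8<-------------------------

If the attribute update became visible to the scheduler with different timing between 1.1.8 and 1.1.10, a promote could in principle be scheduled before the score lands; only the logs can say whether that is what happened here.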
# git diff --stat Pacemaker-1.1.8..Pacemaker-1.1.10 | tail -n 1
1610 files changed, 109697 insertions(+), 62940 deletions(-)
Needle, meet haystack.
Particularly since I have no idea what that drbd error means.
If you want me to have a look, you'll need to create a crm_report archive of "works" and "not works".
Logs aren't enough.
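Something along these lines should do it (option names per the crm_report man page; adjust the time window to bracket each test run, and run it once per scenario):

----------------------8<-------------------------
# Capture the "works" run (1.1.8 cluster):
crm_report --from "2013-08-26 18:40:00" --to "2013-08-26 18:50:00" works

# Capture the "not works" run (1.1.10 cluster), same procedure:
crm_report --from "2013-08-26 18:40:00" --to "2013-08-26 18:50:00" not-works
----------------------8<-------------------------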
>
> Best regards
> Andreas Mock
>
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> Sent: Tuesday, 27 August 2013 05:02
> To: General Linux-HA mailing list
> Cc: pacemaker at oss.clusterlabs.org
> Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd
> agent between pacemaker 1.1.8 and 1.1.10
>
>
> On 27/08/2013, at 3:31 AM, Andreas Mock <Andreas.Mock at web.de> wrote:
>
>> Hi all,
>>
>> while the linbit drbd resource agent seems to work perfectly on
>> pacemaker 1.1.8 (standard software repository) we have problems with
>> the last release 1.1.10 and also with the newest head 1.1.11.xxx.
>>
>> As using drbd is not uncommon, I really hope to find interested
>> people to help me out. I can provide as much debug information as you
>> want.
>>
>>
>> Environment:
>> RHEL 6.4 clone (Scientific Linux 6.4) cman based cluster.
>> DRBD 8.4.3 compiled from sources.
>> 64bit
>>
>> - A drbd resource configured following the linbit documentation.
>> - Manual start and stop (up/down) and setting primary of drbd resource
>> working smoothly.
>> - 2 nodes dis03-test/dis04-test
>>
>>
>>
>> - Following simple config on pacemaker 1.1.8 configure
>> property no-quorum-policy=stop
>> property stonith-enabled=true
>> rsc_defaults resource-stickiness=2
>> primitive r_stonith-dis03-test stonith:fence_mock \
>> meta resource-stickiness="INFINITY" target-role="Started" \
>> op monitor interval="180" timeout="300" requires="nothing" \
>> op start interval="0" timeout="300" \
>> op stop interval="0" timeout="300" \
>> params vmname=dis03-test pcmk_host_list="dis03-test"
>> primitive r_stonith-dis04-test stonith:fence_mock \
>> meta resource-stickiness="INFINITY" target-role="Started" \
>> op monitor interval="180" timeout="300" requires="nothing" \
>> op start interval="0" timeout="300" \
>> op stop interval="0" timeout="300" \
>> params vmname=dis04-test pcmk_host_list="dis04-test"
>> location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
>> rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
>> location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
>> rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
>> primitive r_drbd_postfix ocf:linbit:drbd \
>> params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \
>> op monitor interval="15s" timeout="60s" role="Master" \
>> op monitor interval="45s" timeout="60s" role="Slave" \
>> op start timeout="240" \
>> op stop timeout="240" \
>> meta target-role="Stopped" migration-threshold="2"
>> ms ms_drbd_postfix r_drbd_postfix \
>> meta master-max="1" master-node-max="1" \
>> clone-max="2" clone-node-max="1" \
>> notify="true" \
>> meta target-role="Stopped"
>> commit
>>
>> - Pacemaker is started from scratch
>> - Config above is applied by crm -f <file> where <file> has the above
>> config snippet.
>>
>> - After that crm_mon shows the following status
>> ----------------------8<-------------------------
>> Last updated: Mon Aug 26 18:42:47 2013
>> Last change: Mon Aug 26 18:42:42 2013 via cibadmin on dis03-test
>> Stack: cman
>> Current DC: dis03-test - partition with quorum
>> Version: 1.1.10-1.el6-9abe687
>> 2 Nodes configured
>> 4 Resources configured
>>
>>
>> Online: [ dis03-test dis04-test ]
>>
>> Full list of resources:
>>
>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>> Stopped: [ dis03-test dis04-test ]
>>
>> Migration summary:
>> * Node dis04-test:
>> * Node dis03-test:
>> ----------------------8<-------------------------
>>
>> cat /proc/drbd
>> version: 8.4.3 (api:1/proto:86-101)
>> GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515
>> build by root at dis03-test, 2013-07-24 17:19:24
>>
>> on both nodes. The drbd resource was shut down previously in a clean
>> state, so that either node can become the primary.
>>
>> - Now the weird behaviour when trying to start the drbd with
>> crm resource start ms_drbd_postfix
>>
>>
>> Output of crm_mon -1rf
>> ----------------------8<-------------------------
>> Last updated: Mon Aug 26 18:46:33 2013
>> Last change: Mon Aug 26 18:46:30 2013 via cibadmin on dis04-test
>> Stack: cman
>> Current DC: dis03-test - partition with quorum
>> Version: 1.1.10-1.el6-9abe687
>> 2 Nodes configured
>> 4 Resources configured
>>
>>
>> Online: [ dis03-test dis04-test ]
>>
>> Full list of resources:
>>
>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>> Slaves: [ dis03-test ]
>> Stopped: [ dis04-test ]
>>
>> Migration summary:
>> * Node dis04-test:
>> r_drbd_postfix: migration-threshold=2 fail-count=2
>> last-failure='Mon Aug 26 18:46:30 2013'
>> * Node dis03-test:
>>
>
> It's hard to imagine how pacemaker could cause drbdadm to fail, short of
> leaving the other side promoted while trying to promote another.
> Perhaps the drbd folks could comment on what the error means.
>
>> Failed actions:
>> r_drbd_postfix_promote_0 (node=dis04-test, call=34, rc=1, status=complete,
>> last-rc-change=Mon Aug 26 18:46:29 2013, queued=1212ms, exec=0ms): unknown error
>> ----------------------8<-------------------------
>>
>> In the log of the drbd agent I can find the following when the
>> promoting request is handled on dis03-test
>>
>> ----------------------8<-------------------------
>> ++ drbdadm -c /usr/local/etc/drbd.conf primary postfix
>> 0: State change failed: (-2) Need access to UpToDate data
>> Command 'drbdsetup primary 0' terminated with exit code 17
>> + cmd_out=
>> + ret=17
>> + '[' 17 '!=' 0 ']'
>> + ocf_log err 'postfix: Called drbdadm -c /usr/local/etc/drbd.conf primary postfix'
>> + '[' 2 -lt 2 ']'
>> + __OCF_PRIO=err
>> + shift
>> ----------------------8<-------------------------
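The "Need access to UpToDate data" error says drbd refused the promote because it saw no UpToDate copy of the data at that moment. A quick way to see what drbd itself thinks right before/after the failure (resource name "postfix" taken from the config above):

----------------------8<-------------------------
# Disk state of local/peer disks; a promote needs UpToDate data locally
# (or reachable on the peer):
drbdadm -c /usr/local/etc/drbd.conf dstate postfix

# Connection state and current roles, for comparison:
drbdadm -c /usr/local/etc/drbd.conf cstate postfix
drbdadm -c /usr/local/etc/drbd.conf role postfix
----------------------8<-------------------------

Capturing that output on both nodes in the failing window would help separate a drbd state problem from a pacemaker scheduling problem.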
>>
>> While this works without problems on pacemaker 1.1.8, it doesn't work here.
>> The error message leads me to suspect a kind of race condition
>> where pacemaker fires the promotion too early.
>> It probably has something to do with applying attributes from the drbd
>> resource agent.
>> But this is just a guess; I really don't know.
>>
>> ONE ADDITIONAL piece of information: as soon as I do a resource cleanup
>> on the "defective" node, the master is promoted as expected. That means:
>> crm resource cleanup r_drbd_postfix dis03-test
>> results in the following:
>>
>> ----------------------8<-------------------------
>> Last updated: Mon Aug 26 19:29:38 2013
>> Last change: Mon Aug 26 19:29:28 2013 via cibadmin on dis04-test
>> Stack: cman
>> Current DC: dis03-test - partition with quorum
>> Version: 1.1.10-1.el6-9abe687
>> 2 Nodes configured
>> 4 Resources configured
>>
>>
>> Online: [ dis03-test dis04-test ]
>>
>> Full list of resources:
>>
>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>> Masters: [ dis03-test ]
>> Slaves: [ dis04-test ]
>>
>> Migration summary:
>> * Node dis03-test:
>> * Node dis04-test:
>> ----------------------8<-------------------------
>>
>>
>>
>> I really hope I can get some attention, as pacemaker 1.1.10 is a
>> milestone for Andrew, and drbd from linbit is surely a building
>> block of many pacemaker-based clusters.
>>
>> Cluster log of DC dis03-test at http://pastebin.com/2S9Y6V3P
>> DRBD agent log at http://pastebin.com/ceYNEAhH
>>
>>
>> So, any help welcome.
>>
>> Best regards
>> Andreas Mock
>>
>>
>> _______________________________________________
>> Linux-HA mailing list
>> Linux-HA at lists.linux-ha.org
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
>
>