[Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10
Andrew Beekhof
andrew at beekhof.net
Tue Aug 27 22:11:40 UTC 2013
On 27/08/2013, at 2:51 PM, Andreas Mock <Andreas.Mock at web.de> wrote:
> Hi Andrew,
>
> as this is a real showstopper at the moment, I invested several more
> hours to make sure (as far as possible) that I have not made an error.
>
> Some additions:
> 1) I mirrored the whole mini drbd config to another pacemaker cluster.
> Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not
The version of drbd is the same too?
> 2) When I remove the target-role Stopped from the drbd ms resource
> and insert the config snippet related to the drbd device via crm -f <file>
> into a lean running pacemaker config (pacemaker cluster options, stonith
> resources), it seems to work. That means one of the nodes gets promoted.
>
> Then after stopping 'crm resource stop ms_drbd_xxx' and starting again
> I see the same promotion error as described.
>
> The drbd resource agent uses /usr/sbin/crm_master.
> Is there any possibility that feedback given through this client tool
> changes the timing behaviour of pacemaker? Or the way
> transitions are scheduled?
> Any idea what change in pacemaker might be related?
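For context: the linbit agent advertises its promotion preference through crm_master, which stores a transient node attribute that the policy engine reads when deciding whom to promote. A rough sketch of the calls involved (option names per the crm_master man page; the score value here is illustrative, not the agent's actual number):

----------------------8<-------------------------
# Advertise this node as a promotion candidate. "-l reboot" makes the
# attribute transient (cleared when the node restarts); 10000 is an
# illustrative score.
crm_master -Q -l reboot -v 10000

# On demote/failure: withdraw the preference again.
crm_master -Q -l reboot -D
----------------------8<-------------------------

If the attribute update became visible to the scheduler with different timing between 1.1.8 and 1.1.10, a promote could in principle be scheduled before the score lands; only the logs can say whether that is what happened here.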
# git diff --stat Pacemaker-1.1.8..Pacemaker-1.1.10 | tail -n 1
1610 files changed, 109697 insertions(+), 62940 deletions(-)
Needle, meet haystack.
Particularly since I have no idea what that drbd error means.
If you want me to have a look, you'll need to create a crm_report archive of "works" and "not works".
Logs aren't enough.
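Something along these lines should do it (option names per the crm_report man page; adjust the time window to bracket each test run, and run it once per scenario):

----------------------8<-------------------------
# Capture the "works" run (1.1.8 cluster):
crm_report --from "2013-08-26 18:40:00" --to "2013-08-26 18:50:00" works

# Capture the "not works" run (1.1.10 cluster), same procedure:
crm_report --from "2013-08-26 18:40:00" --to "2013-08-26 18:50:00" not-works
----------------------8<-------------------------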
>
> Best regards
> Andreas Mock
>
>
> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew at beekhof.net]
> Sent: Tuesday, 27 August 2013 05:02
> To: General Linux-HA mailing list
> Cc: pacemaker at oss.clusterlabs.org
> Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd
> agent between pacemaker 1.1.8 and 1.1.10
>
>
> On 27/08/2013, at 3:31 AM, Andreas Mock <Andreas.Mock at web.de> wrote:
>
>> Hi all,
>>
>> while the linbit drbd resource agent seems to work perfectly on
>> pacemaker 1.1.8 (standard software repository) we have problems with
>> the last release 1.1.10 and also with the newest head 1.1.11.xxx.
>>
>> As using drbd is not uncommon, I really hope to find interested
>> people to help me out. I can provide as much debug information as you
>> want.
>>
>>
>> Environment:
>> RHEL 6.4 clone (Scientific Linux 6.4) cman based cluster.
>> DRBD 8.4.3 compiled from sources.
>> 64bit
>>
>> - A drbd resource configured following the linbit documentation.
>> - Manual start and stop (up/down) and setting primary of drbd resource
>> working smoothly.
>> - 2 nodes dis03-test/dis04-test
>>
>>
>>
>> - Following simple config on pacemaker 1.1.8 configure
>> property no-quorum-policy=stop
>> property stonith-enabled=true
>> rsc_defaults resource-stickiness=2
>> primitive r_stonith-dis03-test stonith:fence_mock \
>> meta resource-stickiness="INFINITY" target-role="Started" \
>> op monitor interval="180" timeout="300" requires="nothing" \
>> op start interval="0" timeout="300" \
>> op stop interval="0" timeout="300" \
>> params vmname=dis03-test pcmk_host_list="dis03-test"
>> primitive r_stonith-dis04-test stonith:fence_mock \
>> meta resource-stickiness="INFINITY" target-role="Started" \
>> op monitor interval="180" timeout="300" requires="nothing" \
>> op start interval="0" timeout="300" \
>> op stop interval="0" timeout="300" \
>> params vmname=dis04-test pcmk_host_list="dis04-test"
>> location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
>> rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
>> location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
>> rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
>> primitive r_drbd_postfix ocf:linbit:drbd \
>> params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \
>> op monitor interval="15s" timeout="60s" role="Master" \
>> op monitor interval="45s" timeout="60s" role="Slave" \
>> op start timeout="240" \
>> op stop timeout="240" \
>> meta target-role="Stopped" migration-threshold="2"
>> ms ms_drbd_postfix r_drbd_postfix \
>> meta master-max="1" master-node-max="1" \
>> clone-max="2" clone-node-max="1" \
>> notify="true" \
>> meta target-role="Stopped"
>> commit
>>
>> - Pacemaker is started from scratch
>> - Config above is applied by crm -f <file> where <file> has the above
>> config snippet.
>>
>> - After that crm_mon shows the following status
>> ----------------------8<-------------------------
>> Last updated: Mon Aug 26 18:42:47 2013
>> Last change: Mon Aug 26 18:42:42 2013 via cibadmin on dis03-test
>> Stack: cman
>> Current DC: dis03-test - partition with quorum
>> Version: 1.1.10-1.el6-9abe687
>> 2 Nodes configured
>> 4 Resources configured
>>
>>
>> Online: [ dis03-test dis04-test ]
>>
>> Full list of resources:
>>
>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>> Stopped: [ dis03-test dis04-test ]
>>
>> Migration summary:
>> * Node dis04-test:
>> * Node dis03-test:
>> ----------------------8<-------------------------
>>
>> cat /proc/drbd
>> version: 8.4.3 (api:1/proto:86-101)
>> GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515
>> build by root at dis03-test, 2013-07-24 17:19:24
>>
>> on both nodes. The drbd resource was shut down previously in a clean
>> state, so that either node can become the primary.
>>
>> - Now the weird behaviour when trying to start the drbd with
>> crm resource start ms_drbd_postfix
>>
>>
>> Output of crm_mon -1rf
>> ----------------------8<-------------------------
>> Last updated: Mon Aug 26 18:46:33 2013
>> Last change: Mon Aug 26 18:46:30 2013 via cibadmin on dis04-test
>> Stack: cman
>> Current DC: dis03-test - partition with quorum
>> Version: 1.1.10-1.el6-9abe687
>> 2 Nodes configured
>> 4 Resources configured
>>
>>
>> Online: [ dis03-test dis04-test ]
>>
>> Full list of resources:
>>
>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>> Slaves: [ dis03-test ]
>> Stopped: [ dis04-test ]
>>
>> Migration summary:
>> * Node dis04-test:
>> r_drbd_postfix: migration-threshold=2 fail-count=2
>> last-failure='Mon Aug 26 18:46:30 2013'
>> * Node dis03-test:
>>
>
> It's hard to imagine how pacemaker could cause drbdadm to fail, short of
> leaving the other side promoted while trying to promote another.
> Perhaps the drbd folks could comment on what the error means.
>
>> Failed actions:
>> r_drbd_postfix_promote_0 (node=dis04-test, call=34, rc=1, status=complete,
>> last-rc-change=Mon Aug 26 18:46:29 2013, queued=1212ms, exec=0ms): unknown error
>> ----------------------8<-------------------------
>>
>> In the log of the drbd agent I can find the following when the
>> promoting request is handled on dis03-test
>>
>> ----------------------8<-------------------------
>> ++ drbdadm -c /usr/local/etc/drbd.conf primary postfix
>> 0: State change failed: (-2) Need access to UpToDate data
>> Command 'drbdsetup primary 0' terminated with exit code 17
>> + cmd_out=
>> + ret=17
>> + '[' 17 '!=' 0 ']'
>> + ocf_log err 'postfix: Called drbdadm -c /usr/local/etc/drbd.conf primary postfix'
>> + '[' 2 -lt 2 ']'
>> + __OCF_PRIO=err
>> + shift
>> ----------------------8<-------------------------
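The "Need access to UpToDate data" error says drbd refused the promote because it saw no UpToDate copy of the data at that moment. A quick way to see what drbd itself thinks right before/after the failure (resource name "postfix" taken from the config above):

----------------------8<-------------------------
# Disk state of local/peer disks; a promote needs UpToDate data locally
# (or reachable on the peer):
drbdadm -c /usr/local/etc/drbd.conf dstate postfix

# Connection state and current roles, for comparison:
drbdadm -c /usr/local/etc/drbd.conf cstate postfix
drbdadm -c /usr/local/etc/drbd.conf role postfix
----------------------8<-------------------------

Capturing that output on both nodes in the failing window would help separate a drbd state problem from a pacemaker scheduling problem.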
>>
>> While this works without problems on pacemaker 1.1.8, it doesn't work here.
>> The error message leads me to suspect a kind of race condition
>> where pacemaker fires the promotion too early.
>> It probably has something to do with applying attributes from the drbd
>> resource agent.
>> But this is just a guess; I really don't know.
>>
>> ONE ADDITIONAL piece of information: as soon as I do a resource cleanup
>> on the "defective" node, the master is promoted as expected. That means:
>> crm resource cleanup r_drbd_postfix dis03-test
>> results in the following:
>>
>> ----------------------8<-------------------------
>> Last updated: Mon Aug 26 19:29:38 2013
>> Last change: Mon Aug 26 19:29:28 2013 via cibadmin on dis04-test
>> Stack: cman
>> Current DC: dis03-test - partition with quorum
>> Version: 1.1.10-1.el6-9abe687
>> 2 Nodes configured
>> 4 Resources configured
>>
>>
>> Online: [ dis03-test dis04-test ]
>>
>> Full list of resources:
>>
>> r_stonith-dis03-test (stonith:fence_mock): Started dis04-test
>> r_stonith-dis04-test (stonith:fence_mock): Started dis03-test
>> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>> Masters: [ dis03-test ]
>> Slaves: [ dis04-test ]
>>
>> Migration summary:
>> * Node dis03-test:
>> * Node dis04-test:
>> ----------------------8<-------------------------
>>
>>
>>
>> I really hope I can get some attention, as pacemaker 1.1.10 is a
>> milestone for Andrew, and drbd from linbit is surely a building
>> block of many pacemaker-based clusters.
>>
>> Cluster log of DC dis03-test at http://pastebin.com/2S9Y6V3P
>> DRBD agent log at http://pastebin.com/ceYNEAhH
>>
>>
>> So, any help welcome.
>>
>> Best regards
>> Andreas Mock
>>
>>
>> _______________________________________________
>> Linux-HA mailing list
>> Linux-HA at lists.linux-ha.org
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
>
>