[Pacemaker] [Linux-HA] Probably a regression of the linbit drbd agent between pacemaker 1.1.8 and 1.1.10

Andreas Mock andreas.mock at web.de
Tue Aug 27 00:51:45 EDT 2013


Hi Andrew,

as this is a real showstopper at the moment, I invested a few more
hours to be sure (as far as possible) that I have not made an error.

Some additions:
1) I mirrored the whole mini drbd config to another pacemaker cluster.
Same result: pacemaker 1.1.8 works, pacemaker 1.1.10 does not.
2) When I remove the target role Stopped from the drbd ms resource
and insert the config snippet related to the drbd device via crm -f <file>
to a lean running pacemaker config (pacemaker cluster options, stonith
resources),
it seems to work. That means one of the nodes gets promoted.

Then, after stopping the resource with 'crm resource stop ms_drbd_xxx' and
starting it again, I see the same promotion error as described.
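
A small shell sketch for spotting that state without reading the whole
crm_mon output. It assumes the migration summary lines look like the
crm_mon -1rf output quoted further down; the function name
failed_resources is made up for illustration and is not part of
pacemaker or crmsh.

```shell
#!/bin/sh
# Hypothetical helper: list resources whose fail-count has reached
# their migration-threshold.
# Reads crm_mon text on stdin and expects lines of the (assumed) form
#   r_name: migration-threshold=N fail-count=M ...
# Leading spaces and mail quote markers ("> ") are ignored.
failed_resources() {
    sed -n 's/^[> ]*\([^ :]*\): migration-threshold=\([0-9]*\) fail-count=\([0-9]*\).*/\1 \3 \2/p' |
    while read -r rsc fails limit; do
        # A resource is banned from a node once fail-count >= threshold.
        if [ "$fails" -ge "$limit" ]; then
            echo "$rsc"
        fi
    done
}

# Typical use on a cluster node:
#   crm_mon -1rf | failed_resources
```

Fed with the migration summary quoted below, this should print
r_drbd_postfix, because fail-count=2 has reached migration-threshold=2.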

The drbd resource agent is using /usr/sbin/crm_master.
Is there a possibility that feedback given through this client tool
is changing the timing behaviour of pacemaker? Or the way
transitions are scheduled?
Any idea that may be related to a change in pacemaker?
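
To see whether the promote is fired before the local disk is UpToDate,
one could record the disk state right before the promotion. The
following is only a sketch: it assumes the usual DRBD 8.4 /proc/drbd
status line of the form " 0: cs:... ro:.../... ds:Local/Peer ...", and
local_disk_state is a hypothetical helper name, not something shipped
with drbd or the resource agent.

```shell
#!/bin/sh
# Hypothetical helper: print the local disk state (the part before the
# slash in "ds:Local/Peer") for one DRBD minor.
# Reads /proc/drbd-style text on stdin and prints nothing if the minor
# has no "ds:" field (for example when it is unconfigured).
local_disk_state() {
    minor=$1
    sed -n "s/^ *${minor}: .* ds:\([A-Za-z]*\)\/.*/\1/p"
}

# Typical use on a cluster node (minor 0, as in the agent log):
#   local_disk_state 0 < /proc/drbd
```

Anything other than UpToDate at the moment of the promote would at
least be consistent with the "Need access to UpToDate data" message,
although that alone would not prove a timing problem in pacemaker.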

Best regards
Andreas Mock


-----Original Message-----
From: Andrew Beekhof [mailto:andrew at beekhof.net]
Sent: Tuesday, 27 August 2013 05:02
To: General Linux-HA mailing list
Cc: pacemaker at oss.clusterlabs.org
Subject: Re: [Pacemaker] [Linux-HA] Probably a regression of the linbit drbd
agent between pacemaker 1.1.8 and 1.1.10


On 27/08/2013, at 3:31 AM, Andreas Mock <Andreas.Mock at web.de> wrote:

> Hi all,
> 
> while the linbit drbd resource agent seems to work perfectly on
> pacemaker 1.1.8 (standard software repository), we have problems with
> the latest release 1.1.10 and also with the newest head 1.1.11.xxx.
> 
> As using drbd is not so uncommon, I really hope to find interested
> people to help me out. I can provide as much debug information as you
> want.
> 
> 
> Environment:
> RHEL 6.4 clone (Scientific Linux 6.4) cman based cluster.
> DRBD 8.4.3 compiled from sources.
> 64bit
> 
> - A drbd resource configured following the linbit documentation.
> - Manual start and stop (up/down) and setting primary of drbd resource 
> working smoothly.
> - 2 nodes dis03-test/dis04-test
> 
> 
> 
> - Following simple config on pacemaker 1.1.8
> configure
>    property no-quorum-policy=stop
>    property stonith-enabled=true
>    rsc_defaults resource-stickiness=2
>    primitive r_stonith-dis03-test stonith:fence_mock \
>        meta resource-stickiness="INFINITY" target-role="Started" \
>        op monitor interval="180" timeout="300" requires="nothing" \
>        op start interval="0" timeout="300" \
>        op stop interval="0" timeout="300" \
>        params vmname=dis03-test pcmk_host_list="dis03-test"
>    primitive r_stonith-dis04-test stonith:fence_mock \
>        meta resource-stickiness="INFINITY" target-role="Started" \
>        op monitor interval="180" timeout="300" requires="nothing" \
>        op start interval="0" timeout="300" \
>        op stop interval="0" timeout="300" \
>        params vmname=dis04-test pcmk_host_list="dis04-test"
>    location r_stonith-dis03_hates_dis03 r_stonith-dis03-test \
>        rule $id="r_stonith-dis03_hates_dis03-test_rule" -inf: #uname eq dis03-test
>    location r_stonith-dis04_hates_dis04 r_stonith-dis04-test \
>        rule $id="r_stonith-dis04_hates_dis04-test_rule" -inf: #uname eq dis04-test
>    primitive r_drbd_postfix ocf:linbit:drbd \
>        params drbd_resource="postfix" drbdconf="/usr/local/etc/drbd.conf" \
>        op monitor interval="15s"  timeout="60s" role="Master" \
>        op monitor interval="45s"  timeout="60s" role="Slave" \
>        op start timeout="240" \
>        op stop timeout="240" \
>        meta target-role="Stopped" migration-threshold="2"
>    ms ms_drbd_postfix r_drbd_postfix \
>        meta master-max="1" master-node-max="1" \
>        clone-max="2" clone-node-max="1" \
>        notify="true" \
>        meta target-role="Stopped"
> commit
> 
> - Pacemaker is started from scratch
> - Config above is applied by crm -f <file> where <file> has the above 
> config snippet.
> 
> - After that crm_mon shows the following status
> ----------------------8<-------------------------
> Last updated: Mon Aug 26 18:42:47 2013
> Last change: Mon Aug 26 18:42:42 2013 via cibadmin on dis03-test
> Stack: cman
> Current DC: dis03-test - partition with quorum
> Version: 1.1.10-1.el6-9abe687
> 2 Nodes configured
> 4 Resources configured
> 
> 
> Online: [ dis03-test dis04-test ]
> 
> Full list of resources:
> 
> r_stonith-dis03-test   (stonith:fence_mock):   Started dis04-test
> r_stonith-dis04-test   (stonith:fence_mock):   Started dis03-test
> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>     Stopped: [ dis03-test dis04-test ]
> 
> Migration summary:
> * Node dis04-test:
> * Node dis03-test:
> ----------------------8<-------------------------
> 
> cat /proc/drbd
> version: 8.4.3 (api:1/proto:86-101)
> GIT-hash: 89a294209144b68adb3ee85a73221f964d3ee515 build by root@dis03-test, 2013-07-24 17:19:24
> 
> on both nodes. The drbd resource was previously shut down in a clean
> state, so that either node can become the primary.
> 
> - Now the weird behaviour when trying to start the drbd with
>   crm resource start ms_drbd_postfix
> 
> 
> Output of crm_mon -1rf
> ----------------------8<-------------------------
> Last updated: Mon Aug 26 18:46:33 2013
> Last change: Mon Aug 26 18:46:30 2013 via cibadmin on dis04-test
> Stack: cman
> Current DC: dis03-test - partition with quorum
> Version: 1.1.10-1.el6-9abe687
> 2 Nodes configured
> 4 Resources configured
> 
> 
> Online: [ dis03-test dis04-test ]
> 
> Full list of resources:
> 
> r_stonith-dis03-test   (stonith:fence_mock):   Started dis04-test
> r_stonith-dis04-test   (stonith:fence_mock):   Started dis03-test
> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>     Slaves: [ dis03-test ]
>     Stopped: [ dis04-test ]
> 
> Migration summary:
> * Node dis04-test:
>   r_drbd_postfix: migration-threshold=2 fail-count=2 last-failure='Mon Aug 26 18:46:30 2013'
> * Node dis03-test:
> 

It's hard to imagine how pacemaker could cause drbdadm to fail, short of
leaving the other side promoted while trying to promote another.
Perhaps the drbd folks could comment on what the error means.

> Failed actions:
>    r_drbd_postfix_promote_0 (node=dis04-test, call=34, rc=1, status=complete, last-rc-change=Mon Aug 26 18:46:29 2013, queued=1212ms, exec=0ms): unknown error
> ----------------------8<-------------------------
> 
> In the log of the drbd agent I can find the following when the 
> promoting request is handled on dis03-test
> 
> ----------------------8<-------------------------
> ++ drbdadm -c /usr/local/etc/drbd.conf primary postfix
> 0: State change failed: (-2) Need access to UpToDate data
> Command 'drbdsetup primary 0' terminated with exit code 17
> + cmd_out=
> + ret=17
> + '[' 17 '!=' 0 ']'
> + ocf_log err 'postfix: Called drbdadm -c /usr/local/etc/drbd.conf primary postfix'
> + '[' 2 -lt 2 ']'
> + __OCF_PRIO=err
> + shift
> ----------------------8<-------------------------
> 
> While this works without problems on pacemaker 1.1.8, it doesn't work here.
> The error message leads me to assume that there is some kind of race
> condition where pacemaker fires the promotion too early.
> It probably has something to do with the attributes applied by the drbd
> resource agent.
> But this is just a guess and I really don't know.
> 
> ONE ADDITIONAL piece of information: As soon as I do a resource cleanup
> on the "defective" node, the master is promoted as expected. That means a
>   crm resource cleanup r_drbd_postfix dis03-test
> results in the following:
> 
> ----------------------8<-------------------------
> Last updated: Mon Aug 26 19:29:38 2013
> Last change: Mon Aug 26 19:29:28 2013 via cibadmin on dis04-test
> Stack: cman
> Current DC: dis03-test - partition with quorum
> Version: 1.1.10-1.el6-9abe687
> 2 Nodes configured
> 4 Resources configured
> 
> 
> Online: [ dis03-test dis04-test ]
> 
> Full list of resources:
> 
> r_stonith-dis03-test   (stonith:fence_mock):   Started dis04-test
> r_stonith-dis04-test   (stonith:fence_mock):   Started dis03-test
> Master/Slave Set: ms_drbd_postfix [r_drbd_postfix]
>     Masters: [ dis03-test ]
>     Slaves: [ dis04-test ]
> 
> Migration summary:
> * Node dis03-test:
> * Node dis04-test:
> ----------------------8<-------------------------
> 
> 
> 
> I really hope I can get some attention, as pacemaker 1.1.10 is a
> milestone for Andrew and drbd from linbit is almost certainly a building
> block of many pacemaker based clusters.
> 
> Cluster log of DC dis03-test at http://pastebin.com/2S9Y6V3P
> DRBD agent log at http://pastebin.com/ceYNEAhH
> 
> 
> So, any help is welcome.
> 
> Best regards
> Andreas Mock
> 
> 