[Pacemaker] DRBD promotion timeout after pacemaker stop on other node

Sun Nov 10 23:30:22 UTC 2013

On 5 Nov 2013, at 2:22 am, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:

> Hi Andrew, David, all,
> 
> I just found something interesting, and I don't know whether it is a bug or not.
> 
> When I run "service pacemaker stop" on a node that has a drbd resource
> promoted, that resource is not promoted on the other node, and the promote
> operation times out.
> 
> This is related to the drbd fence integration with pacemaker and to an
> insufficient default (recommended) promote timeout for the drbd resource.
> 
> crm-fence-peer.sh places its constraint in the CIB one second after the
> promote operation times out (the promote op has a 90s timeout, and
> crm-fence-peer.sh uses that same value as its own timeout, fully using it
> up if it cannot say for sure that the peer node is in a "sane" state:
> online or cleanly offline).
> 
> Increasing the promote op timeout seems to help, but I'd expect the
> promote to complete almost immediately instead of waiting an extra 90
> seconds for nothing.
> 
> Looking at the crm-fence-peer.sh script, it would determine the peer state
> as offline immediately if the peer's node state meets all of the following:
> * doesn't contain "expected" tag or has it set to "down"
> * has "in_ccm" tag set to false
> * has "crmd" tag set to anything except "online"
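
[Editorial sketch] The three conditions listed above can be read as a single predicate. The Python below is only a model of the list as described in this message, not the logic of crm-fence-peer.sh itself. The function name, the dict representation of a node state, and the sample attribute values are assumptions made for illustration.

```python
def peer_looks_cleanly_offline(node_state):
    """Return True only when all three conditions from the list hold.

    node_state is a plain dict of attribute name -> string value, used
    here as a hypothetical stand-in for a node's state entry in the CIB.
    """
    expected = node_state.get("expected")  # None when the attribute is absent
    expected_absent_or_down = expected is None or expected == "down"
    in_ccm_false = node_state.get("in_ccm") == "false"
    crmd_not_online = node_state.get("crmd") != "online"
    return expected_absent_or_down and in_ccm_false and crmd_not_online


# Illustrative values only: a peer whose "expected" attribute was never
# changed to "down", versus one where it was.
expected_still_member = {"crmd": "offline", "in_ccm": "false", "expected": "member"}
expected_down = {"crmd": "offline", "in_ccm": "false", "expected": "down"}

print(peer_looks_cleanly_offline(expected_still_member))  # False
print(peer_looks_cleanly_offline(expected_down))          # True
```

Under this reading, a lingering "expected" value other than "down" is enough to keep the predicate false, which matches the waiting behaviour described above.
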
> 
> On the other hand, crmd sets "expected" = "down" only after fencing is
> complete (probably the same for "in_ccm"?). Shouldn't it do the same (or
> maybe just remove that tag) when a clean shutdown is about to complete?

That would make sense.  Are you using the plugin, cman or corosync 2?

> Or maybe it is possible to provide some different hint to
> crm-fence-peer.sh?
> 
> Another option (actually a hack) would be to add a delay during shutdown
> between stopping the resources and stopping the processes (so that the drbd
> handler on the other node determines that the peer is still online and
> places the constraint immediately), but that is very fragile.
> 
> pacemaker is a one-week-old merge of the clusterlabs and beekhof masters,
> drbd is 8.4.4. Everything runs on corosync 2 (2.3.1) with libqb 0.16 on
> CentOS 6.
> 
> Vladislav