[Pacemaker] DRBD promotion timeout after pacemaker stop on other node

Mon Nov 11 03:32:23 UTC 2013

11.11.2013 02:30, Andrew Beekhof wrote:
> 
> On 5 Nov 2013, at 2:22 am, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
> 
>> Hi Andrew, David, all,
>>
>> I just found something interesting; I don't know whether it is a bug.
>>
>> When running "service pacemaker stop" on a node which has a drbd
>> resource promoted, that resource is not promoted on the other node, and
>> the promote operation times out.
>>
>> This is related to the drbd fence integration with pacemaker and to an
>> insufficient default (recommended) promote timeout for the drbd resource.
>>
>> crm-fence-peer.sh places the constraint in the cib one second after the
>> promote operation times out (the promote op has a 90s timeout, and
>> crm-fence-peer.sh uses that value as its own timeout, and fully uses it
>> if it cannot say for sure that the peer node is in a "sane" state:
>> online or cleanly offline).
>>
>> Increasing the promote op timeout seems to help, but I'd expect the
>> promotion to complete almost immediately, instead of waiting an extra
>> 90 seconds for nothing.
>>
>> Looking at the crm-fence-peer.sh script, it would determine the peer
>> state as offline immediately if the node state (all of)
>> * doesn't contain "expected" tag or has it set to "down"
>> * has "in_ccm" tag set to false
>> * has "crmd" tag set to anything except "online"
>>
>> On the other hand, crmd sets "expected" = "down" only after fencing is
>> complete (probably the same for "in_ccm"?). Shouldn't it do the same (or
>> maybe just remove that tag) when a clean shutdown is about to complete?
> 
> That would make sense.  Are you using the plugin, cman or corosync 2?

corosync2
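
For reference, the "immediately offline" check described in the quoted
list above could be sketched roughly as below. This is only an
illustration in Python, not the actual code of crm-fence-peer.sh (which is
a shell script). It assumes that the "tags" mentioned are attributes of
the peer's node_state element in the cib status section, and the function
name is made up for the example. A missing "crmd" attribute is treated as
"not online", which is also an assumption.

```python
import xml.etree.ElementTree as ET


def peer_is_cleanly_offline(node_state_xml: str) -> bool:
    """Hypothetical helper: apply the three conditions from the list above.

    All three must hold for the peer to be considered offline right away.
    """
    node = ET.fromstring(node_state_xml)
    expected = node.get("expected")  # None when the attribute is absent
    in_ccm = node.get("in_ccm")
    crmd = node.get("crmd")
    return (
        expected in (None, "down")   # no "expected", or set to "down"
        and in_ccm == "false"        # "in_ccm" set to false
        and crmd != "online"         # "crmd" is anything except "online"
    )


# Assumed example state after a clean stop of the peer: "expected" has not
# been set to "down", so the check fails and the handler keeps waiting.
after_clean_stop = (
    '<node_state uname="node1" in_ccm="false" crmd="offline" '
    'expected="member"/>'
)
print(peer_is_cleanly_offline(after_clean_stop))
```

If "expected" were set to "down" (or removed) at the end of a clean
shutdown, the same check would pass and the constraint could be placed
without waiting for the full timeout.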


> 
>> Or maybe it is possible to provide some different hint for
>> crm-fence-peer.sh?
>>
>> Another option (actually a hack) would be to add a delay during shutdown
>> between stopping resources and stopping processes (so the drbd handler on
>> the other node determines the peer is still online, and places the
>> constraint immediately), but that is very fragile.
>>
>> pacemaker is a one-week-old merge of the ClusterLabs and beekhof masters,
>> drbd is 8.4.4. Everything runs on corosync2 (2.3.1) with libqb 0.16 on
>> CentOS6.
>>
>> Vladislav
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org