[Pacemaker] DRBD promotion timeout after pacemaker stop on other node

Mon Nov 4 10:22:00 EST 2013

Hi Andrew, David, all,

I just found something interesting; I don't know whether it is a bug or
not.

When doing "service pacemaker stop" on a node which has a drbd resource
promoted, that resource is not promoted on the other node, and the
promote operation times out.

This is related to the drbd fence integration with pacemaker and to an
insufficient default (recommended) promote timeout for the drbd resource.

crm-fence-peer.sh places the constraint in the cib one second after the
promote operation times out (the promote op has a 90s timeout, and
crm-fence-peer.sh uses that value as its own timeout and fully utilizes
it if it cannot say for sure that the peer node is in a "sane" state,
i.e. online or cleanly offline).

Increasing the promote op timeout seems to help, but I'd expect the
promotion to complete almost immediately instead of waiting an extra
90 seconds for nothing.

Looking at the crm-fence-peer.sh script, it would determine the peer
state as offline immediately if the node state (all of)
* doesn't contain "expected" tag or has it set to "down"
* has "in_ccm" tag set to false
* has "crmd" tag set to anything except "online"
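
To make the three conditions above concrete, here is a rough sketch in
Python. It is not the code from crm-fence-peer.sh (that is a shell
script); the function name and the sample node_state lines are made up
for illustration, and I am assuming the three "tags" are attributes of
the node_state element in the cib status section.

```python
import xml.etree.ElementTree as ET


def peer_looks_offline(node_state_xml: str) -> bool:
    """Return True only if ALL three conditions from the list above hold.

    Hypothetical helper: it mirrors the described check, not the real
    shell implementation in crm-fence-peer.sh.
    """
    node = ET.fromstring(node_state_xml)

    expected = node.get("expected")  # None when the attribute is absent
    in_ccm = node.get("in_ccm")
    crmd = node.get("crmd")

    expected_down = expected is None or expected == "down"
    not_in_ccm = in_ccm == "false"
    crmd_not_online = crmd != "online"

    return expected_down and not_in_ccm and crmd_not_online


# Assumed sample: peer left the cluster, but "expected" was not set to "down".
after_clean_stop = (
    '<node_state id="2" uname="node2" in_ccm="false" '
    'crmd="offline" expected="member"/>'
)
# Assumed sample: same peer after fencing has completed.
after_fencing = (
    '<node_state id="2" uname="node2" in_ccm="false" '
    'crmd="offline" expected="down"/>'
)

if __name__ == "__main__":
    print(peer_looks_offline(after_clean_stop))  # False
    print(peer_looks_offline(after_fencing))     # True
```

With the first sample the check cannot conclude that the peer is
offline, which would correspond to the case where the handler waits for
its full timeout.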

On the other hand, crmd sets "expected" = "down" only after fencing is
complete (probably the same for "in_ccm"?). Shouldn't it do the same (or
maybe just remove that tag) when a clean shutdown is about to complete?
Or maybe it is possible to provide some different hint for
crm-fence-peer.sh?

Another option (actually a hack) would be to add a delay to the shutdown
between stopping the resources and stopping the processes (so that the
drbd handler on the other node determines that the peer is still online
and places the constraint immediately), but that is very fragile.

pacemaker is a one-week-old merge of the ClusterLabs and beekhof
masters, drbd is 8.4.4. Everything runs on corosync2 (2.3.1) with
libqb 0.16 on CentOS6.

Vladislav