[Pacemaker] DRBD promotion timeout after pacemaker stop on other node

Mon Nov 11 07:00:59 CET 2013

11.11.2013 06:32, Vladislav Bogdanov wrote:
> 11.11.2013 02:30, Andrew Beekhof wrote:
>>
>> On 5 Nov 2013, at 2:22 am, Vladislav Bogdanov <bubble at hoster-ok.com> wrote:
>>
>>> Hi Andrew, David, all,
>>>
>>> I just found an interesting fact; I don't know whether it is a bug.
>>>
>>> When doing "service pacemaker stop" on a node which has a drbd resource
>>> promoted, that resource does not get promoted on the other node, and the
>>> promote operation times out.
>>>
>>> This is related to the drbd fence integration with pacemaker and to an
>>> insufficient default (recommended) promote timeout for the drbd resource.
>>>
>>> crm-fence-peer.sh places its constraint into the cib one second after the
>>> promote operation times out (the promote op has a 90s timeout, and
>>> crm-fence-peer.sh uses that value as its own timeout, fully utilizing it
>>> if it cannot say for sure that the peer node is in a "sane" state, i.e.
>>> online or cleanly offline).
>>>
>>> It seems that increasing the promote op timeout helps, but I'd expect the
>>> promotion to complete almost immediately instead of waiting an extra 90
>>> seconds for nothing.
>>>
>>> Looking at the crm-fence-peer.sh script, it would determine the peer state
>>> as offline immediately if the node state meets all of the following:
>>> * doesn't contain "expected" tag or has it set to "down"
>>> * has "in_ccm" tag set to false
>>> * has "crmd" tag set to anything except "online"
>>>
>>> On the other hand, crmd sets "expected" = "down" only after fencing is
>>> complete (probably the same for "in_ccm"?). Shouldn't it do the same (or
>>> maybe just remove that tag) when a clean shutdown is about to complete?
>>
>> That would make sense.  Are you using the plugin, cman or corosync 2?

Is this OK, or am I missing something?
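
For illustration only, the three offline conditions listed in the quoted
message can be written as a small predicate. This is a sketch of the logic as
described above, not the actual shell code from crm-fence-peer.sh. The function
name, the dict representation of the node state, and the sample value "member"
for "expected" are assumptions made for the example.

```python
def peer_looks_offline(node_state):
    """Return True if all three conditions from the quoted list hold.

    node_state is a hypothetical dict of the node's state attributes,
    with values kept as strings.
    """
    expected = node_state.get("expected")
    # 1. "expected" is missing or set to "down"
    expected_down = expected is None or expected == "down"
    # 2. "in_ccm" is set to false
    not_in_ccm = node_state.get("in_ccm") == "false"
    # 3. "crmd" is anything except "online"
    crmd_not_online = node_state.get("crmd") != "online"
    return expected_down and not_in_ccm and crmd_not_online


# Hypothetical state after a clean shutdown where "expected" was never
# changed to "down": the check fails on the first condition.
after_clean_stop = {"expected": "member", "in_ccm": "false", "crmd": "offline"}

# Hypothetical state after fencing has completed.
after_fencing = {"expected": "down", "in_ccm": "false", "crmd": "offline"}

if __name__ == "__main__":
    print(peer_looks_offline(after_clean_stop))
    print(peer_looks_offline(after_fencing))
```

With these sample inputs only the second state passes all three checks, which
matches the behaviour described above: as long as "expected" is not reset on a
clean shutdown, the script cannot decide immediately and waits out its timeout.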