[Pacemaker] Y should pacemaker be started simultaneously.

Digimer lists at alteeve.ca
Sat Oct 18 00:32:22 EDT 2014


On 18/10/14 12:18 AM, Andrei Borzenkov wrote:
> On Mon, 06 Oct 2014 10:27:49 -0400
> Digimer <lists at alteeve.ca> wrote:
>
>> On 06/10/14 02:11 AM, Andrei Borzenkov wrote:
>>> On Mon, Oct 6, 2014 at 9:03 AM, Digimer <lists at alteeve.ca> wrote:
>>>> If stonith was configured, after the timeout, the first node would fence
>>>> the second node ("unable to reach" != "off").
>>>>
>>>> Alternatively, you can set corosync to 'wait_for_all' and have the first
>>>> node do nothing until it sees the peer.
>>>>
>>>
>>> Am I right that wait_for_all is available only in corosync 2.x and not in 1.x?
>>
>> You are correct, yes.
>>
>>>> To do otherwise would be to risk a split-brain. Each node needs to know the
>>>> state of the peer in order to run services safely. By having both start at
>>>> the same time, then they know what the other is doing. By disabling quorum,
>>>> you allow one node to continue to operate when the other leaves, but it
>>>> needs that initial connection to know for sure what it's doing.
>>>>
>>>
>>> Does it apply to both corosync 1.x and 2.x or only to 2.x with
>>> wait_for_all? I ask because I was also confused about the precise
>>> meaning of disabling quorum in pacemaker (setting no-quorum-policy:
>>> ignore). So if I have a two-node cluster with pacemaker 1.x and corosync
>>> 1.x with no-quorum-policy=ignore and no fencing, what happens when
>>> a single node starts?
>>
>> Quorum tells the cluster that if a peer leaves (gracefully or was
>> fenced), the remaining node is allowed to continue providing services.
>>
>> Stonith is needed to put a node that is in an unknown state into a known
>> state, be it because it couldn't reach the node when starting or because
>> the node stopped responding.
>>
>> So quorum and stonith play rather different roles.
>>
>> Without stonith, regardless of quorum, you risk split-brains and/or data
>> corruption. Operating a cluster without stonith is to operate a cluster
>> in an undetermined state and should never be done.
>>
>
> OK, let me try to rephrase. Is it possible to achieve the same effect as
> wait_for_all in corosync 2.x with a combination of pacemaker 1.1.x and
> corosync 1.x? I.e. ensure that the cluster does not come up *on the
> first startup* until all nodes are present? So just make cluster nodes
> wait for others to join instead of trying to stonith them?

No, not that I know of. To achieve the same behaviour, I wrote my own 
program[1] to do this. It is called on boot and waits for the peer to 
become reachable, then it starts the cluster stack. So the same effect 
is gained, but it's done outside corosync directly.

Note that I wrote it for corosync 1.x + cman + rgmanager, but the
concepts port trivially.
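
To illustrate the idea (this is not the linked program itself, just a
minimal sketch in Python), the logic is roughly: poll the peer until it
answers, then start the cluster stack. The reachability test used here (a
TCP connection to a port on the peer), the function names, and the
optional give-up time are all assumptions made for the example.

```python
import socket
import subprocess
import time


def peer_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Assumption: an open TCP port (e.g. SSH) is taken as "peer is up".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def wait_for_peer(host, port, interval=5.0, max_wait=None,
                  probe=peer_reachable, sleep=time.sleep):
    """Poll the peer until it answers.

    Returns True once the peer is reachable, or False if max_wait seconds
    of sleeping have elapsed first. With max_wait=None it waits forever.
    """
    waited = 0.0
    while not probe(host, port):
        if max_wait is not None and waited >= max_wait:
            return False
        sleep(interval)
        waited += interval
    return True


def start_when_peer_is_up(host, port, start_command):
    """Block until the peer is reachable, then run the start command.

    start_command is whatever starts the cluster stack on this node; it is
    supplied by the caller and is not assumed here.
    """
    wait_for_peer(host, port)
    return subprocess.call(start_command)
```

Called from a boot script, this would look something like
start_when_peer_is_up("node2", 22, ["service", "cman", "start"]), where the
peer name, the port, and the start command are placeholders for whatever
fits your nodes. Waiting forever by default is deliberate: the point is
that a lone node does nothing until it sees its peer.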

digimer

1. https://github.com/digimer/an-cdb/blob/master/tools/safe_anvil_start

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?



