[Pacemaker] How to really deal with gateway restarts?

Thu Jun 17 15:01:04 UTC 2010

> On Tue, Jun 15, 2010 at 5:21 PM, Maros Timko <timkom at gmail.com> wrote:
>>>>>> I thought "dampen" attribute could help with some of the options, but
>>>>>> actually it does not.
>>>>>
>>>>> It should do. Hard to say without any logs from the two machines.
>>>>
>>>> Unfortunately I don't have the log files here; I can provide them if that would help.
>>>> Are you sure dampen should help here? From my testing it only does:
>>>> "Unless the next attribute value is stable for dampen interval, do not
>>>> change the attribute value in CIB". However, the pingd attribute is
>>>> set per node, so the two values are stored in separate XML sections,
>>>> meaning they are not correlated by dampen.
>>>
>>> But the attribute that is set in both cases is called "pingd", so yes,
>>> dampen should definitely apply here.
>>> What version of pacemaker do you have? That would also be relevant.
>>
>> # rpm -qa|egrep 'pacem|heart|glue'
>> heartbeat-3.0.1-1.el5.x86_64
>> cluster-glue-libs-1.0.1-1.el5.x86_64
>> cluster-glue-1.0.1-1.el5.x86_64
>> pacemaker-1.0.7-2.el5.x86_64
>> heartbeat-libs-3.0.1-1.el5.x86_64
>> pacemaker-libs-1.0.7-2.el5.x86_64
>>
>> Please find attached ha-debug logs from following tests:
>
> A hb_report archive would be preferred, it contains everything needed
> to figure out what's going on and I wouldn't need to ask for your
> configuration ;-)

OK, for simplicity I have created an academic configuration that uses
a dampen of 10 seconds because it uses a monitor interval of 10 seconds.
So if attrd waits for the dampen period until all nodes have sent their
updates before updating the CIB (and possibly triggering failover), this
should work no matter how long the delay between the monitoring cycles
of the nodes is.
However, testing shows that:
 - if the current node is the DC and is active (running resources), it
moves resources (or restarts them if stop would take longer) on gateway
failure
 - if the current node is not the DC and is active (running resources),
it moves resources (or restarts them if stop would take longer) when the
gateway connection is re-established
Please find attached the hb_report archive as well as the ha-debug files
from both nodes: I increased the debug level for attrd, but those
messages did not make it into the merged ha-log file in the report.
What I did:
 1. DC was active, I disconnected both public cables at the same time.
     Resources migrated to standby
 2. DC was not active, I reconnected both public cables at the same time.
     Resources migrated to standby DC

Let me know if you would need anything else.

Tino
>
>>  1. dc and non-dc from situation where VM was restarted. Currently
>> active node was DC.
>>  2. dc_noreboot and non-dc_noreboot from situation where VM was not
>> restarted (I would like to achieve this). Currently active node was
>> not the DC.
>> Both tests used dampen=5s
>>
>> So it seems like it would be better to have DC assigned into standby
>> node every time (this would also make failovers faster).
>
> Nope.
>
>> But there is
>> no option to force a DC election or assign the role. Am I right?
>
> Right, and for good reason, because it's not relevant.
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: report-ping.tgz
Type: application/x-gzip
Size: 42078 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100617/738181bd/attachment-0002.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Dummy-ha-debug-dc
Type: application/octet-stream
Size: 43229 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100617/738181bd/attachment-0008.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Dummy-ha-debug-non-dc
Type: application/octet-stream
Size: 28463 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20100617/738181bd/attachment-0009.obj>