[Pacemaker] Stuck in a STONITH cycle

Tue Oct 16 16:04:11 EDT 2012

On Wed, Oct 17, 2012 at 3:37 AM, David Parker <dparker at utica.edu> wrote:
> On 10/16/2012 04:45 AM, Andrew Beekhof wrote:
>>
>> On Tue, Oct 16, 2012 at 3:04 PM, David Parker <dparker at utica.edu> wrote:
>>>
>>> ----- Original Message -----
>>> From: David Parker <dparker at utica.edu>
>>> Date: Friday, October 12, 2012 4:57 pm
>>> Subject: [Pacemaker] Stuck in a STONITH cycle
>>> To: pacemaker at oss.clusterlabs.org
>>>
>>>> I have two nodes set up in a cluster to provide a MySQL server (mysqld)
>>>> in HA on a virtual IP address.  This was working fine until I had to
>>>> reboot the servers.  All I did was change the interface each node uses
>>>> for its primary IP address (changed from eth1 to eth0 on each node).
>>>> Now I'm stuck in a cycle.  Let's say node 1 has the virtual IP and is
>>>> running mysqld, and node 2 is down.  When node 2 boots up, it will
>>>> STONITH node 1 for no apparent reason and take over the resources, which
>>>> shouldn't happen.  When node 1 boots up again, it will STONITH node 2
>>>> and take over the resources, which again shouldn't happen.
>>>
>>> ...
>>>>
>>>> Oct 12 16:27:22 ha1 crmd: [1176]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
>>>> Oct 12 16:27:23 ha1 crmd: [1176]: WARN: get_uuid: Could not calculate UUID for ha2
>>>> Oct 12 16:27:23 ha1 crmd: [1176]: WARN: populate_cib_nodes_ha: Node ha2: no uuid found
>>>> Oct 12 16:27:23 ha1 crmd: [1176]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>>>>
>>>> The exact opposite shows up on the node "ha2" (it says ha1 has no
>>>> uuid).  Did the UUID of each node change because the physical interface
>>>> changed?  Any other ideas?  Thanks in advance.
>>>>
>>> Just wanted to follow up in case anyone else encounters this problem.  I was
>>> able to solve the problem by moving the primary IP address of each node back
>>> to its original interface (eth1), so it seems the UUID of each node in
>>> the cluster depends on the interface.
>>
>> No. The on-disk UUID isn't dynamic.
>> In fact once set, it never changes.
>>
>> I'm not sure what you managed to do, but I'm glad you have it working
>> again.
>
>
> Thanks, Andrew.  I checked out the link you provided in your other
> response[1], and the STONITH death match exactly describes the behavior I
> was seeing.  Strangely, though, none of the three conditions listed in that
> article were present in my configuration.  Network communication was not
> broken, neither node was physically failing, and there were no HA resources
> acting wonky.
>
> The weirdest part is that the nodes could ping each other, but their ability
> to see each other via the crm was broken.  The error in each node's log was
> that it couldn't calculate the UUID for the other node.

For Heartbeat based clusters there is actually an on-disk table of
known nodes and their UUIDs.
So the error message is a bit misleading: there's no calculation, just a lookup.

Very very strange.
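
To illustrate the point about lookup versus calculation, here is a minimal
sketch in Python. It is not Heartbeat's or Pacemaker's actual code: the
table contents, the UUID value, and the function body are all assumptions
made up for illustration, and only the wording of the warning is taken from
the log above. It shows how a plain table miss can produce a message that
reads as if a computation had failed.

```python
# Hypothetical in-memory stand-in for the on-disk table of known nodes.
# The node name comes from the thread; the UUID value is invented.
KNOWN_NODES = {
    "ha1": "00000000-0000-0000-0000-000000000001",
}


def get_uuid(node_name, table=KNOWN_NODES):
    """Return the stored UUID for node_name, or None if it is not in the table."""
    uuid = table.get(node_name)
    if uuid is None:
        # Nothing is computed here. A miss only means the node has no
        # entry in the table, even though the message says "calculate".
        print(f"WARN: get_uuid: Could not calculate UUID for {node_name}")
    return uuid
```

In this sketch, get_uuid("ha2") prints the warning and returns None simply
because "ha2" has no entry, which is the kind of situation a missing or
stale table entry would produce.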

> For some reason,
> changing the interface back on each node solved the problem, but I guess
> we'll never know why it happened in the first place.
>
> [1] http://oss.clusterlabs.org/pipermail/pacemaker/2012-October/015674.html
>
>
>>>   With each node's IP address back on
>>> eth1, the cluster works fine and there's no STONITH cycle.
>>>
>>> Now another question...  Is there a way to update the UUID of each node if
>>> you do something crazy and move IP addresses to new interfaces, like I did?
>>
>>
>>
>>>      Thanks,
>>>      Dave
>>>
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>>
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>>>
>
>
> --
>
> Dave Parker
> Systems Administrator
> Utica College
> Integrated Information Technology Services
> (315) 792-3229
> Registered Linux User #408177
>