[Pacemaker] Stuck in a STONITH cycle

Tue Oct 16 12:37:09 EDT 2012

On 10/16/2012 04:45 AM, Andrew Beekhof wrote:
> On Tue, Oct 16, 2012 at 3:04 PM, David Parker <dparker at utica.edu> wrote:
>> ----- Original Message -----
>> From: David Parker <dparker at utica.edu>
>> Date: Friday, October 12, 2012 4:57 pm
>> Subject: [Pacemaker] Stuck in a STONITH cycle
>> To: pacemaker at oss.clusterlabs.org
>>
>>> I have two nodes set up in a cluster to provide a MySQL server (mysqld)
>>> in HA on a virtual IP address.  This was working fine until I had to
>>> reboot the servers.  All I did was change the interface each node uses
>>> for its primary IP address (changed from eth1 to eth0 on each node).
>>> Now I'm stuck in a cycle.  Let's say node 1 has the virtual IP and is
>>> running mysqld, and node 2 is down.  When node 2 boots up, it will
>>> STONITH node 1 for no apparent reason and take over the resources, which
>>> shouldn't happen.  When node 1 boots up again, it will STONITH node 2
>>> and take over the resources, which again shouldn't happen.
>> ...
>>> Oct 12 16:27:22 ha1 crmd: [1176]: info: populate_cib_nodes_ha: Requesting the list of configured nodes
>>> Oct 12 16:27:23 ha1 crmd: [1176]: WARN: get_uuid: Could not calculate UUID for ha2
>>> Oct 12 16:27:23 ha1 crmd: [1176]: WARN: populate_cib_nodes_ha: Node ha2: no uuid found
>>> Oct 12 16:27:23 ha1 crmd: [1176]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>>>
>>> The exact opposite shows up on the node "ha2" (it says ha1 has no
>>> uuid).  Did the UUID of each node change because the physical interface
>>> changed?  Any other ideas?  Thanks in advance.
>>>
>> Just wanted to follow up in case anyone else encounters this problem.  I was
>> able to solve the problem by moving the primary IP address of each node back
>> to its original interface (eth1), so it seems the UUID of each node in
>> the cluster depends on the interface.
> No. The on disk uuid isn't dynamic.
> In fact once set, it never changes.
>
> I'm not sure what you managed to do, but I'm glad you have it working again.

Thanks, Andrew.  I checked out the link you provided in your other 
response[1], and the STONITH death match exactly describes the behavior 
I was seeing.  Strangely, though, none of the three conditions listed in 
that article were present in my configuration.  Network communication 
was not broken, neither node was physically failing, and there were no 
HA resources acting wonky.

The weirdest part is that the nodes could ping each other, but their 
ability to see each other via the crm was broken.  The error in each 
node's log was that it couldn't calculate the UUID for the other node.  
For some reason, changing the interface back on each node solved the 
problem, but I guess we'll never know why it happened in the first place.
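
For anyone who wants to check whether their own nodes are logging the same 
warning, here is a minimal sketch in Python.  It assumes the log lines look 
like the crmd excerpt quoted above; the pattern and the function name 
peers_without_uuid are only illustrative and not part of Pacemaker, and 
other versions may format the message differently.

```python
import re

# Matches the crmd warning quoted earlier in this thread, e.g.
#   "... ha1 crmd: [1176]: WARN: get_uuid: Could not calculate UUID for ha2"
# The layout is an assumption taken from that excerpt only.
UUID_WARNING = re.compile(
    r"(?P<host>\S+) crmd: \[\d+\]: WARN: get_uuid: "
    r"Could not calculate UUID for (?P<peer>\S+)"
)


def peers_without_uuid(lines):
    """Return sorted (reporting_host, peer) pairs found in the given log lines."""
    found = set()
    for line in lines:
        match = UUID_WARNING.search(line)
        if match:
            found.add((match.group("host"), match.group("peer")))
    return sorted(found)
```

To use it, pass any iterable of log lines, for example an open file handle 
for whichever file your syslog daemon writes crmd messages to.  If each node 
reports the other as a peer, that matches the symptom described above.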

[1] http://oss.clusterlabs.org/pipermail/pacemaker/2012-October/015674.html

>>   With each node's IP address back on
>> eth1, the cluster works fine and there's no STONITH cycle.
>>
>> Now another question...  Is there a way to update the UUID of each node if
>> you do something crazy and move IP addresses to new interfaces, like I did?
>
>
>>      Thanks,
>>      Dave
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>>

-- 

Dave Parker
Systems Administrator
Utica College
Integrated Information Technology Services
(315) 792-3229
Registered Linux User #408177