[Pacemaker] IPMI stonith resource gets stuck

Jérôme Charaoui jcharaoui at cmaisonneuve.qc.ca
Fri Jan 30 16:03:18 UTC 2015


On 2015-01-30 07:49, Dejan Muhamedagic wrote:
> Hi,
>
> On Wed, Jan 28, 2015 at 01:53:17PM -0500, Jérôme Charaoui wrote:
>> Hi,
>>
>> I'm testing a 2-node Corosync (1.4.6) and Pacemaker
>> (1.1.10+git20130802) cluster on Debian 8.0 and having some problems
>> with the stonith resources.
>>
>> I've set up two external/ipmi resources on each node and wanted to
>> test how they would react by physically unplugging the IPMI device
>> network interfaces.
>>
>> On the DC, no problem, the resource monitor fails, stop op succeeds
>> and due to location constraints, as expected the resource enters the
>> stop state and stays there. After replugging the network cable and
>> cleaning up the resource, it gets restored to normal state.
>>
>> On the slave node, different scenario: after monitor op fails, stop
>> op also fails for an unknown reason. The cluster then retries the
>
> The stop operation for stonith devices does not involve the
> device at all, it's just stonithd operation, something like
> "disable resource". From the "slave" logs, after some abort,
>
> Jan 28 12:04:22 [31422] scatlas01 stonith-ng:    error: crm_abort:      crm_glib_handler: Forked child 15705 to record non-fatal assert at logging.c:73 : Source ID 63 was not found when attempting to remove it
>
> stonithd exits:
>
> Jan 28 12:05:42 [31422] scatlas01 stonith-ng:     info: st_child_term:  Child 16540 timed out, sending SIGTERM
> Jan 28 12:05:42 [31422] scatlas01 stonith-ng:     info: crm_signal_dispatch:    Invoking handler for signal 15: Terminated
> Jan 28 12:05:42 [31422] scatlas01 stonith-ng:     info: stonith_shutdown:       Terminating with  2 clients
>
> Apparently, there're a number of stop operations started, for the
> same resource, which all exited (or got cancelled) around
> 12:29:09. There probably was some confusion in lrmd after
> stonithd left.

Thank you for looking at this, much appreciated.

The timeout issue intrigued me because I had noticed ipmitool sometimes
taking over 10 seconds when attempting to execute an action on a
non-responding IPMI device over the lanplus interface.

So I had a look at the ipmi stonith plugin code and the ipmitool manpage 
itself and noticed this little gem in the latter:

    -R <count>  Set the number of retries for lan/lanplus interface (default=4).

I then went ahead and added "-R 1" in the plugin's ipmitool_opts 
variable, and my problem went away!
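
For illustration, here is a minimal sketch of the same idea outside the
stonith plugin: it builds an ipmitool command line with an explicit -R
retry count and runs it under a hard time limit. The helper names, the
host and credentials, and the 20-second limit are assumptions made up
for this example; they are not taken from the external/ipmi plugin.

```python
import subprocess


def build_ipmitool_cmd(host, user, password, retries=1, interface="lanplus"):
    """Return an ipmitool argv that queries the chassis power status.

    -R sets the retry count for the lan/lanplus interface
    (the manpage gives a default of 4).
    """
    return [
        "ipmitool",
        "-I", interface,
        "-H", host,
        "-U", user,
        "-P", password,
        "-R", str(retries),
        "chassis", "power", "status",
    ]


def power_status(host, user, password, retries=1, timeout=20):
    """Run the query; return (returncode, output), or (None, reason) on failure.

    The timeout value is an arbitrary choice for this sketch.
    """
    cmd = build_ipmitool_cmd(host, user, password, retries)
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout
        )
    except (subprocess.TimeoutExpired, FileNotFoundError) as exc:
        # ipmitool hung past the limit, or is not installed
        return None, str(exc)
    return result.returncode, (result.stdout or result.stderr).strip()
```

Calling power_status("192.0.2.10", "admin", "secret") against an
unplugged IPMI device should then give up after one retry instead of
the default four, which is the behaviour change that "-R 1" is meant to
produce in the plugin.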


> In short, you ran into a bug, but I guess that
> that bug got fixed in the meantime.

This bug report seems like a match:
https://github.com/ClusterLabs/pacemaker/pull/334

If I'm reading the changelog correctly, this fix was released in
1.1.12, correct?


> Beekhof and David Vossel should know.
>
> Thanks,
>
> Dejan
>
>> stop operation unsuccessfully until I have the node enter/exit
>> standby mode. Replugging the network cable on the IPMI device has no
>> effect.
>>
>> At least, that's what I figure is happening from these logs:
>>
>> DC: http://pastebin.com/raw.php?i=QpwG6nea
>> Slave: http://pastebin.com/raw.php?i=3nesX8yJ
>> Config: http://pastebin.com/raw.php?i=3FrJuwWz
>>
>> Any help tracking down the issue would be much appreciated.
>>
>> Thanks!
>>
>> --
>> Jérôme Charaoui
>> Technicien informatique
>> Collège de Maisonneuve
>>
>>
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
>





