[Pacemaker] IPMI stonith resource gets stuck

Dejan Muhamedagic dejanmm at fastmail.fm
Fri Jan 30 07:49:48 EST 2015


Hi,

On Wed, Jan 28, 2015 at 01:53:17PM -0500, Jérôme Charaoui wrote:
> Hi,
> 
> I'm testing a 2-node Corosync (1.4.6) and Pacemaker
> (1.1.10+git20130802) cluster on Debian 8.0 and having some problems
> with the stonith resources.
> 
> I've set up two external/ipmi resources on each node and wanted to
> test how they would react by physically unplugging the IPMI device
> network interfaces.
> 
> On the DC, no problem, the resource monitor fails, stop op succeeds
> and due to location constraints, as expected the resource enters the
> stop state and stays there. After replugging the network cable and
> cleaning up the resource, it gets restored to normal state.
> 
> On the slave node, different scenario: after monitor op fails, stop
> op also fails for an unknown reason. The cluster then retries the

The stop operation for stonith devices does not involve the
device at all; it is just a stonithd operation, something like
"disable resource". From the "slave" logs, after some abort,

Jan 28 12:04:22 [31422] scatlas01 stonith-ng:    error: crm_abort:      crm_glib_handler: Forked child 15705 to record non-fatal assert at logging.c:73 : Source ID 63 was not found when attempting to remove it

stonithd exits:

Jan 28 12:05:42 [31422] scatlas01 stonith-ng:     info: st_child_term:  Child 16540 timed out, sending SIGTERM
Jan 28 12:05:42 [31422] scatlas01 stonith-ng:     info: crm_signal_dispatch:    Invoking handler for signal 15: Terminated
Jan 28 12:05:42 [31422] scatlas01 stonith-ng:     info: stonith_shutdown:       Terminating with  2 clients 
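
For anyone who wants to pull this sequence out of a larger log, here is a minimal Python sketch. It assumes the field layout seen in the lines quoted above (timestamp, [pid], host, daemon, level, function, message). The helper names and the choice of what to flag (error-level lines plus stonith_shutdown) are my own illustration, not anything provided by Pacemaker.

```python
import re

# Layout inferred from the quoted log lines (an assumption, not a format spec):
# "<Mon DD HH:MM:SS> [pid] host daemon: level: function: message"
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+"
    r"\[(?P<pid>\d+)\]\s+(?P<host>\S+)\s+(?P<daemon>[\w-]+):\s+"
    r"(?P<level>\w+):\s+(?P<func>\w+):\s+(?P<msg>.*)$"
)

# The four stonith-ng lines quoted in this message.
SAMPLE = """\
Jan 28 12:04:22 [31422] scatlas01 stonith-ng:    error: crm_abort:      crm_glib_handler: Forked child 15705 to record non-fatal assert at logging.c:73 : Source ID 63 was not found when attempting to remove it
Jan 28 12:05:42 [31422] scatlas01 stonith-ng:     info: st_child_term:  Child 16540 timed out, sending SIGTERM
Jan 28 12:05:42 [31422] scatlas01 stonith-ng:     info: crm_signal_dispatch:    Invoking handler for signal 15: Terminated
Jan 28 12:05:42 [31422] scatlas01 stonith-ng:     info: stonith_shutdown:       Terminating with  2 clients
"""


def daemon_events(text, daemon="stonith-ng"):
    """Return one dict per log line that was written by the given daemon."""
    events = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("daemon") == daemon:
            event = match.groupdict()
            event["msg"] = event["msg"].strip()
            events.append(event)
    return events


def flag_events(events):
    """Keep error-level lines and the daemon shutdown line (hypothetical filter)."""
    return [
        e for e in events
        if e["level"] == "error" or e["func"] == "stonith_shutdown"
    ]


if __name__ == "__main__":
    for e in flag_events(daemon_events(SAMPLE)):
        print(e["ts"], e["level"], e["func"], e["msg"])
```

To use it on a real log, read the file contents and pass them to daemon_events() in place of SAMPLE.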

Apparently, a number of stop operations were started for the
same resource, and they all exited (or got cancelled) around
12:29:09. There was probably some confusion in lrmd after
stonithd exited. In short, you ran into a bug, but I guess that
it has been fixed in the meantime.

Beekhof and David Vossel should know.

Thanks,

Dejan

> stop operation unsuccessfully until I have the node enter/exit
> standby mode. Replugging the network cable on the IPMI device has no
> effect.
> 
> At least, that's what I figure is happening from these logs:
> 
> DC: http://pastebin.com/raw.php?i=QpwG6nea
> Slave: http://pastebin.com/raw.php?i=3nesX8yJ
> Config: http://pastebin.com/raw.php?i=3FrJuwWz
> 
> Any help tracking down the issue would be much appreciated.
> 
> Thanks!
> 
> -- 
> Jérôme Charaoui
> Technicien informatique
> Collège de Maisonneuve
> 
> 



