[Pacemaker] Stonithd segfaulting and causing unclean?
Andrew Beekhof
andrew at beekhof.net
Fri Mar 21 01:08:41 UTC 2014
On 21 Mar 2014, at 12:40 am, Michał Margula <alchemyx at uznam.net.pl> wrote:
> Hello,
>
> We had many unresolved issues with Pacemaker some time ago. I think
> almost all of them were solved by fixing the link between the cluster
> nodes (we removed the media converters, replaced them with SFP+ NICs,
> and upgraded to 10Gbps).
>
> Now it seems to be working fine, with a few exceptions:
>
> - if I kill one node manually (power off; IPMI is still operational,
> so stonith works fine)
>
> or
>
> - if I put one of the nodes into standby while it is running a few Xen domUs
>
>
> The node becomes unclean. The funny thing is that if I kill (or put
> into standby) node B, node A also becomes unclean, so crm_mon shows
> Node-A: UNCLEAN (Online), Node-B: UNCLEAN (OFFLINE). To be honest, I
> have a lot of trouble diagnosing this. (By the way, is there any
> documentation on how to read Pacemaker's logs?)
>
> One thing I found that worries me is:
>
> Mar 20 04:16:39 rivendell-A kernel: [ 774.635312] stonithd[10089]:
> segfault at 0 ip 00007f51a1aa5bd4 sp 00007fff20c7fb50 error 4 in
> libcrmcommon.so.2.0.0[7f51a1a93000+2d000]
>
> It happens on both nodes. It also seems to happen only when I define a
> manual fencing device (meatware) like this:
>
> primitive manual-fencing-of-A stonith:meatware \
> params hostlist="rivendell-B" \
> op monitor interval="60s" \
> meta target-role="Started"
> primitive manual-fencing-of-B stonith:meatware \
> params hostlist="rivendell-A" \
> op monitor interval="60s" \
> meta target-role="Started"
> location location-manual-fencing-of-A manual-fencing-of-A -inf: rivendell-A
> location location-manual-fencing-of-B manual-fencing-of-B -inf: rivendell-B
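Independent of the segfault, that snippet is worth double-checking: as written, each device ends up pinned to the very node named in its hostlist. manual-fencing-of-A can fence rivendell-B, but its constraint only bans it from rivendell-A, so in a two-node cluster it runs on rivendell-B, i.e. on the node it would fence (and likewise for the other device). The usual convention is to ban a stonith device from its own target; a sketch with the same resource names and the constraints flipped to the hostlist node:

```text
primitive manual-fencing-of-A stonith:meatware \
        params hostlist="rivendell-B" \
        op monitor interval="60s" \
        meta target-role="Started"
primitive manual-fencing-of-B stonith:meatware \
        params hostlist="rivendell-A" \
        op monitor interval="60s" \
        meta target-role="Started"
# Keep each device off the node it is able to fence:
location location-manual-fencing-of-A manual-fencing-of-A -inf: rivendell-B
location location-manual-fencing-of-B manual-fencing-of-B -inf: rivendell-A
```

Whether this relates to the crash is a separate question, but it is cheap to rule out.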
>
> Here is the configuration currently in use (without manual
> fencing): http://pastebin.com/CudX6wx3
>
> By the way, is there a way to recover from such a situation? I can
> only fix it by restarting corosync or rebooting a node, but that then
> kills the other node because of the UNCLEAN state.
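On recovery: a hedged sketch of less drastic things to try before rebooting, assuming the Pacemaker CLI tools of that era (the commands exist, but whether they clear your particular unclean state is not guaranteed; the resource name below is a placeholder):

```text
# Tell the cluster the node really is down, so it stops waiting to fence it.
# Only do this once you are SURE the node is powered off:
stonith_admin --confirm rivendell-B

# Clear a resource's failed-operation history so its state is re-evaluated
# ("some-resource" is a placeholder for one of your resources):
crm_resource --cleanup --resource some-resource --node rivendell-A

# Re-check cluster state:
crm_mon -1
```

If the unclean state persists after a confirmed fence and cleanup, a corosync restart on one node is the usual next step.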
>
> Also, if it is a Pacemaker bug, how do I debug or fix it? We are
> currently using 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff on
> Debian Wheezy.
The first step is to look in the logs for any errors.
The second is to produce a stack trace from the crash.
The third is to read http://blog.clusterlabs.org/blog/2014/potential-for-data-corruption-in-pacemaker-1-dot-1-6-through-1-dot-1-9/ and get a newer version.
>
> I see there are more up-to-date versions, but not in Debian. Should I
> consider upgrading?
>
> Thank you!
>
> --
> Michał Margula, alchemyx at uznam.net.pl, http://alchemyx.uznam.net.pl/
> "W życiu piękne są tylko chwile" ("In life, only the moments are beautiful") [Ryszard Riedel]
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org