[Pacemaker] Stonithd segfaulting and causing unclean?
Andrew Beekhof
andrew at beekhof.net
Fri Mar 21 01:08:41 UTC 2014
On 21 Mar 2014, at 12:40 am, Michał Margula <alchemyx at uznam.net.pl> wrote:
> Hello,
>
> We had many unresolved issues with Pacemaker some time ago. I think
> almost all of them were solved by fixing the link between the cluster
> nodes (we removed the media converters, replaced them with SFP+ NICs,
> and upgraded to 10Gbps).
>
> Now it seems to be working fine, with a few exceptions:
>
> - if I kill one node manually (power off; IPMI is still operational,
> so stonith works fine)
>
> or
>
> - if I put one of the nodes into standby while it is running a few Xen domUs
>
>
> The node becomes unclean. The funny thing is that if I kill (or put
> into standby) node B, node A also becomes unclean, so crm_mon shows
> Node-A: UNCLEAN (Online), Node-B: UNCLEAN (OFFLINE). To be honest, I
> have a lot of trouble diagnosing this. (By the way, is there any
> documentation on how to read Pacemaker's logs?)
>
> One thing I found that worries me is:
>
> Mar 20 04:16:39 rivendell-A kernel: [ 774.635312] stonithd[10089]:
> segfault at 0 ip 00007f51a1aa5bd4 sp 00007fff20c7fb50 error 4 in
> libcrmcommon.so.2.0.0[7f51a1a93000+2d000]
>
> It happens on both nodes. It also seems to happen only when I define a
> manual fencing device (meatware) like this:
>
> primitive manual-fencing-of-A stonith:meatware \
> params hostlist="rivendell-B" \
> op monitor interval="60s" \
> meta target-role="Started"
> primitive manual-fencing-of-B stonith:meatware \
> params hostlist="rivendell-A" \
> op monitor interval="60s" \
> meta target-role="Started"
> location location-manual-fencing-of-A manual-fencing-of-A -inf: rivendell-A
> location location-manual-fencing-of-B manual-fencing-of-B -inf: rivendell-B
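Independent of the segfault, that snippet is worth double-checking: as written, each device ends up pinned to the very node named in its hostlist. manual-fencing-of-A can fence rivendell-B, but its constraint only bans it from rivendell-A, so in a two-node cluster it runs on rivendell-B, i.e. on the node it would fence (and likewise for the other device). The usual convention is to ban a stonith device from its own target; a sketch with the same resource names and the constraints flipped to the hostlist node:

```text
primitive manual-fencing-of-A stonith:meatware \
        params hostlist="rivendell-B" \
        op monitor interval="60s" \
        meta target-role="Started"
primitive manual-fencing-of-B stonith:meatware \
        params hostlist="rivendell-A" \
        op monitor interval="60s" \
        meta target-role="Started"
# Keep each device off the node it is able to fence:
location location-manual-fencing-of-A manual-fencing-of-A -inf: rivendell-B
location location-manual-fencing-of-B manual-fencing-of-B -inf: rivendell-A
```

Whether this relates to the crash is a separate question, but it is cheap to rule out.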
>
> Here is the configuration currently in use (without manual
> fencing): http://pastebin.com/CudX6wx3
>
> By the way, is there a way to recover from such a situation? I can
> only fix it by restarting corosync or rebooting a node, but that then
> kills the other node because of the UNCLEAN state.
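On recovery: a hedged sketch of less drastic things to try before rebooting, assuming the Pacemaker CLI tools of that era (the commands exist, but whether they clear your particular unclean state is not guaranteed; the resource name below is a placeholder):

```text
# Tell the cluster the node really is down, so it stops waiting to fence it.
# Only do this once you are SURE the node is powered off:
stonith_admin --confirm rivendell-B

# Clear a resource's failed-operation history so its state is re-evaluated
# ("some-resource" is a placeholder for one of your resources):
crm_resource --cleanup --resource some-resource --node rivendell-A

# Re-check cluster state:
crm_mon -1
```

If the unclean state persists after a confirmed fence and cleanup, a corosync restart on one node is the usual next step.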
>
> Also, if it is a Pacemaker bug, how do I debug or fix it? We are
> currently using 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff on
> Debian Wheezy.
The first step is to look in the logs for any errors.
The second is to produce a stack trace from the crash.
The third is to read http://blog.clusterlabs.org/blog/2014/potential-for-data-corruption-in-pacemaker-1-dot-1-6-through-1-dot-1-9/ and get a newer version.
>
> I see there are more up-to-date versions, but not in Debian. Should I
> consider upgrading?
>
> Thank you!
>
> --
> Michał Margula, alchemyx at uznam.net.pl, http://alchemyx.uznam.net.pl/
> "W życiu piękne są tylko chwile" ("In life, only the moments are beautiful") [Ryszard Riedel]
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org