[Pacemaker] CIB write-to-disk bug?
Alan Robertson
alanr at unix.sh
Thu Apr 1 14:27:02 UTC 2010
Lars Ellenberg wrote:
> On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
>> OK....
>>
>> Since there was no ssh-as-root between the cluster nodes, I didn't
>> send all the logs along from every node in the cluster - and it
>> didn't occur to me to look at all of them.
>>
>> However, the problem has gotten curiouser and curiouser - because ALL
>> the nodes in the cluster reported the same problem at the same
>> time...
>>
>> That makes it a lot less likely to be a race condition with the disk
>> writing infrastructure...
>>
>> I've attached the relevant lines from the various machines -
>> slightly processed (date stamp format changed and a few other minor
>> things).
>>
>> Let me know if you want me to send all the system logs along...
>
> There should be core files.
> You should be able to get some interesting information out of them,
> especially "the_cib" and "digest" at the point of abort().
>
>>>
>>> Also, for my reference - what method are you using to compute the
>>> digest of the file? That is, what command should I execute to get
>>> the same results?
>
> It's an md5sum over the xml tree -- not over the formatted ascii buffer,
> though, so "md5sum cib.xml" won't do.
> I think it is the same as
> echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
> But there is "cibadmin --md5-sum -x cib.xml",
> to use the exact same code path.
This is a change from how it used to be (the last time I looked - at
least according to my not-always-reliable memory). Thanks for the update.
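For reference, the perl one-liner quoted above can be approximated in
Python. This is only a sketch of that shell pipeline (strip leading and
trailing whitespace from every line, join the lines, prepend one space,
append the newline that echo adds, then md5 the result). The function
name approx_cib_digest is made up for illustration, and nothing here is
taken from the Pacemaker sources, so "cibadmin --md5-sum -x cib.xml"
remains the way to get the exact same code path.

```python
import hashlib

# Whitespace set intended to mirror perl's \s for ASCII input (assumption).
_WS = " \t\n\r\f\v"


def approx_cib_digest(xml_text: str) -> str:
    """Approximate the quoted perl | md5sum pipeline (hypothetical helper).

    Mirrors: echo " $(perl -pe 's/^\\s*(.*?)\\s*\\z/$1/g' file)" | md5sum
    It hashes text, not the parsed XML tree, so it may differ from what
    the cib itself computes.
    """
    # perl -p handles one line at a time; each line loses its leading and
    # trailing whitespace (including the newline), so the lines end up
    # concatenated with nothing between them.
    joined = "".join(line.strip(_WS) for line in xml_text.split("\n"))
    # echo " $(...)" prepends one space and appends one newline.
    payload = " " + joined + "\n"
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```

Calling approx_cib_digest(open("cib.xml").read()) gives a hex string
that can be compared against the stored digest, with the caveat that
Lars only said he thinks the one-liner is equivalent.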
>> 2010/03/31_19:02:52 vhost0384 [13294]: ERROR: crm_abort:
>> write_cib_contents: Triggered fatal assert at io.c:624 :
>> retrieveCib(tmp1, tmp2, FALSE) != NULL
>
> So it did not verify right after it was written.
> Can you reproduce?
I have no idea. I didn't do anything much. Hopefully the test suite
does a lot more strenuous things...
> The core files may actually contain some hints,
> so have a look there.
None of them verified. All the nodes in the cluster failed the test at
the same time - and now I have no official CIBs on disk on any cluster
node... I sent Andrew all the CIBs, and all the core files, and
basically everything under /var/lib/heartbeat/ from one machine.
They're from the latest official release - so the binaries that match
them are readily available.
Thanks Lars!
--
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce