[Pacemaker] CIB write-to-disk bug?

Thu Apr 1 16:29:21 UTC 2010

On Thu, Apr 01, 2010 at 08:27:02AM -0600, Alan Robertson wrote:
> Lars Ellenberg wrote:
> >On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
> >>OK....
> >>
> >>Since there was no ssh-as-root between the cluster nodes, I didn't
> >>send all the logs along from every node in the cluster - and it
> >>didn't occur to me to look at all of them.
> >>
> >>However, the problem has gotten curiouser and curiouser - because ALL
> >>the nodes in the cluster reported the same problem at the same
> >>time...
> >>
> >>That makes it a lot less likely to be a race condition with the disk
> >>writing infrastructure...
> >>
> >>I've attached the relevant lines from the various machines -
> >>slightly processed (date stamp format changed and a few other minor
> >>things).
> >>
> >>Let me know if you want me to send all the system logs along...
> >
> >There should be core files.
> >You should be able to get some interesting information out there,
> >especially "the_cib" and "digest" at the point of abort().
> >
> >>>
> >>>Also, for my reference - what method are you using to compute the
> >>>digest of the file?  That is, what command should I execute to get
> >>>the same results?
> >
> >It's an md5sum over the xml tree -- not over the formatted ascii buffer,
> >though, so "md5sum cib.xml" won't do.
> >I think it is the same as
> > echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
> >But there is "cibadmin --md5-sum -x cib.xml",
> >to use the exact same code path.
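
For anyone who wants to reproduce the one-liner above without perl, here is a rough Python sketch of the same pipeline. It only mirrors the shell command (strip leading and trailing whitespace from every line, join the lines, prepend one space, append the newline that echo adds, then md5). The function name approx_cib_digest is made up for illustration, and there is no guarantee the result matches what the cib computes internally; "cibadmin --md5-sum -x cib.xml" remains the way to exercise the real code path.

```python
import hashlib


def approx_cib_digest(raw: bytes) -> str:
    """Approximate the quoted shell pipeline; NOT Pacemaker's own digest code."""
    # perl -pe 's/^\s*(.*?)\s*\z/$1/g' removes leading and trailing
    # whitespace (including the newline) from every input line, so the
    # lines end up concatenated with nothing between them.
    joined = b"".join(line.strip() for line in raw.split(b"\n"))
    # echo " $(...)" prepends a single space and appends a single newline.
    return hashlib.md5(b" " + joined + b"\n").hexdigest()


if __name__ == "__main__":
    # Hypothetical sample input; a real cib.xml is much larger.
    sample = b"<cib>\n  <configuration/>\n</cib>\n"
    print(approx_cib_digest(sample))
```

Because indentation and line breaks are stripped before hashing, two files that differ only in formatting give the same value here, which is why a plain "md5sum cib.xml" does not match.
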
> 
> This is a change from how it used to be (the last time I looked - at
> least according to my not-always-reliable memory).  Thanks for the
> update.
> 
> 
> >>2010/03/31_19:02:52	vhost0384	[13294]: ERROR: crm_abort:
> >>write_cib_contents: Triggered fatal assert at io.c:624 :
> >>retrieveCib(tmp1, tmp2, FALSE) != NULL
> >
> >So it did not verify right after it was written.
> >Can you reproduce?
> 
> I have no idea.  I didn't do anything much.  Hopefully the test
> suite does a lot more strenuous things...
> 
> >The core files may actually contain some hints,
> >so have a look there.
> 
> None of them verified.  All the nodes in the cluster failed the test
> at the same time - and now I have no official CIBs on disk - on any
> cluster nodes...  I sent Andrew all the CIBs, and all the core

Well, Andrew is on vacation right now... you will have noticed.

> files, and basically everything under /var/lib/heartbeat/ from one
> machine. They're from the latest official release - so the binaries
> that match them are readily available.

The strange thing is that your "corrupt" cib.uHFtAW
contains a <status/> section.  It should not.
No other cib*.raw or cib.xml does.

Because <status/> is explicitly filtered out in write_cib_contents:
 free_xml_from_parent(the_cib, cib_status_root);
before
 write_xml_file(the_cib, tmp1, FALSE),
so that should never have made it in there.
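
To check the on-disk files for this yourself, something like the following Python sketch would do. It is only a diagnostic aid and not the Pacemaker code: the helper names count_status_sections and strip_status are invented here, and it assumes the element is literally named "status" and treats every such element as a status section. The second helper just illustrates the filter-before-write idea in general terms.

```python
import xml.etree.ElementTree as ET


def count_status_sections(xml_text: str) -> int:
    """Count elements named 'status' anywhere in the document."""
    root = ET.fromstring(xml_text)
    return sum(1 for el in root.iter() if el.tag == "status")


def strip_status(xml_text: str) -> str:
    """Return the document with every 'status' element removed.

    Illustrative only: this mimics the idea of filtering the status
    section out before writing, not the actual C implementation.
    """
    root = ET.fromstring(xml_text)
    # Snapshot the element list first so removal does not disturb iteration.
    for parent in list(root.iter()):
        for child in list(parent):
            if child.tag == "status":
                parent.remove(child)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical minimal document, not a real CIB.
    sample = "<cib><configuration/><status/></cib>"
    print(count_status_sections(sample))
    print(count_status_sections(strip_status(sample)))
```

A count above zero for a file written to disk would confirm that the status section slipped past the filter; a count of two would point at a duplicated section.
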

Something is very wrong somewhere...

Did you manage to get two status sections in there, somehow?
Did you try anything funky with the cib as the last action before this failed?

Try it again with a higher log level.  Sorry, no time right now to rebuild
your exact thing with your exact gcc and stuff to look at your core file.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.