[Pacemaker] CIB write-to-disk bug?

Thu Apr 1 05:42:35 EDT 2010

On Thu, Apr 01, 2010 at 12:12:47AM -0600, Alan Robertson wrote:
> OK....
> 
> Since there was no ssh-as-root between the cluster nodes, I didn't
> send all the logs along from every node in the cluster - and it
> didn't occur to me to look at all of them.
> 
> However, the problem has gotten curiouser and curiouser - because ALL
> the nodes in the cluster reported the same problem at the same
> time...
> 
> That makes it a lot less likely to be a race condition with the disk
> writing infrastructure...
> 
> I've attached the relevant lines from the various machines -
> slightly processed (date stamp format changed and a few other minor
> things).
> 
> Let me know if you want me to send all the system logs along...

There should be core files.
You should be able to get some interesting information out of them,
especially "the_cib" and "digest" at the point of abort().

> >I did not make manual changes on a running CIB. I was using the
> >cluster shell at the time.   The CIB it is complaining about
> >appears to be an intact, valid CIB with contents approximately
> >like they should have been at the time.  By the way, I have a
> >report from another IBMer that they have seen systems that stop
> >writing to their local CIBs.  I'll contact him.
> >
> >Here are some relevant facts:
> >  These machines are virtual guests in a cloud somewhere - operations
> >    have somewhat unpredictable latency.  But, nothing too egregious
> >    was happening at the time or Heartbeat would have bitched.
> >  I was doing some testing at the time.  I was putting on and
> >    taking off constraints using the cluster shell
> >    migrate and unmigrate operations.
> >
> >Given that the file looks intact, and I know how the CIB is
> >written to disk (since I originally wrote that code), I wonder if
> >it isn't a versioning issue / race condition.  That is, the code
> >for writing to disk does NOT guarantee when it gets done (assuming
> >you're still using it).  It would be easy to do a checksum on the
> >wrong version compared to the version you thought it should be (or
> >before it completed).
> >
> >Andrew:  You should have already received all the relevant logs in
> >a separate email.
> >
> >Also, for my reference - what method are you using to compute the
> >digest of the file?  That is, what command should I execute to get
> >the same results?

It's an md5sum over the XML tree -- not over the formatted ASCII buffer,
though, so "md5sum cib.xml" won't do.
I think it is the same as
 echo " $(perl -pe 's/^\s*(.*?)\s*\z/$1/g' cib.whatever)" | md5sum
But there is "cibadmin --md5-sum -x cib.xml",
to use the exact same code path.
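
For anyone who would rather not parse the perl one-liner, here is a rough
Python sketch of what that shell pipeline does: strip leading and trailing
whitespace from every line, join the lines, put one space in front and a
newline at the end (that is what echo " $(...)" produces), then md5 the
result. The function names are made up for illustration. This only mirrors
the one-liner above, which is itself just a guess at the real digest, so use
cibadmin when the value actually matters.

```python
import hashlib


def normalize(raw: bytes) -> bytes:
    """Mimic: echo " $(perl -pe 's/^\\s*(.*?)\\s*\\z/$1/g' file)"."""
    # perl -p reads newline-terminated records, so split on "\n" only.
    # bytes.strip() removes ASCII whitespace at both ends of each line,
    # which is close to what \s matches in the perl substitution.
    stripped = [line.strip() for line in raw.split(b"\n")]
    # The lines are joined with nothing in between; echo adds the
    # leading space from the quoted string and a trailing newline.
    return b" " + b"".join(stripped) + b"\n"


def approx_cib_digest(raw: bytes) -> str:
    """md5 hex digest of the normalized buffer (an approximation only)."""
    return hashlib.md5(normalize(raw)).hexdigest()


# Hypothetical sample input, not a real CIB.
sample = b"<cib>\n  <configuration/>\n</cib>\n"
print(approx_cib_digest(sample))
```

If the value printed for your cib.xml differs from what cibadmin reports,
trust cibadmin; the sketch is only meant to show why a plain "md5sum cib.xml"
gives a different answer.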

> 2010/03/31_19:02:52	vhost0384	[13294]: ERROR: crm_abort:
> write_cib_contents: Triggered fatal assert at io.c:624 :
> retrieveCib(tmp1, tmp2, FALSE) != NULL

So the CIB failed verification when it was read back right after being written.
Can you reproduce?
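
To make clear what kind of check is failing, here is a generic Python sketch
of the write-then-verify pattern: write the data and its digest to two files,
read both back, and compare. This is NOT the code in io.c; the function names
and file layout are invented for illustration, and the digest here is a plain
md5 over the file bytes rather than over the XML tree.

```python
import hashlib
import os
import tempfile


def _write_synced(path: str, payload: bytes) -> None:
    # Write and fsync so the bytes are on disk before we read them back.
    with open(path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())


def write_with_digest(data: bytes, data_path: str, digest_path: str) -> None:
    """Write data and a sidecar file holding its md5 hex digest."""
    _write_synced(data_path, data)
    _write_synced(digest_path, hashlib.md5(data).hexdigest().encode("ascii"))


def verify(data_path: str, digest_path: str) -> bool:
    """Re-read both files and check that the stored digest still matches."""
    with open(data_path, "rb") as fh:
        data = fh.read()
    with open(digest_path, "rb") as fh:
        expected = fh.read().decode("ascii").strip()
    return hashlib.md5(data).hexdigest() == expected


# Hypothetical usage with throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    cib = os.path.join(tmp, "cib.xml")
    sig = os.path.join(tmp, "cib.xml.sig")
    write_with_digest(b"<cib/>\n", cib, sig)
    print(verify(cib, sig))
```

A check like this fails if the data and the digest on disk do not belong to
the same version, whatever the reason turns out to be.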

The core files may actually contain some hints,
so have a look there.

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.