[Pacemaker] CIB write-to-disk bug?

Alan Robertson alanr at unix.sh
Wed Mar 31 23:15:08 UTC 2010


Hi,

I've run into what looks at first blush to be a CIB bug in writing to disk.

The key messages from this incident are these:


Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: validate_cib_digest: 
Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf 
(/var/lib/heartbeat/crm/cib.GUdD9T), calculated 
0bac3440f5c42f0f37d22ea7dfe433e8
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Checksum of 
/var/lib/heartbeat/crm/cib.uHFtAW failed!  Configuration contents ignored!
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Usually this 
is caused by manual changes, please refer to 
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Mar 31 19:02:52 vhost0384 cib: [13294]: WARN: retrieveCib: Continuing 
but /var/lib/heartbeat/crm/cib.uHFtAW will NOT used.


I did not make manual changes on a running CIB. I was using the cluster 
shell at the time.   The CIB it is complaining about appears to be an 
intact, valid CIB whose contents are approximately what they should have 
been at the time.  By the way, another IBMer has reported seeing systems 
that stop writing to their local CIBs.  I'll contact him.

Here are some relevant facts:
   These machines are virtual guests in a cloud somewhere, so operations
   have somewhat unpredictable latency.  But nothing too egregious
   was happening at the time, or Heartbeat would have bitched.

   I was doing some testing at the time.  I was adding and removing
   constraints using the cluster shell migrate and unmigrate
   operations.

Given that the file looks intact, and I know how the CIB is written to 
disk (since I originally wrote that code), I wonder if it isn't a 
versioning issue / race condition.  That is, the code for writing to 
disk does NOT guarantee when the write completes (assuming you're still 
using it).  It would be easy to compute a checksum on a different version 
than the one you thought you were checking (or on a file whose write had 
not yet completed).
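
To make the kind of race I am speculating about concrete, here is a 
minimal sketch in Python.  It is NOT the actual CIB code: the 
DeferredWriter class, the file name, and the toy XML contents are all 
hypothetical, and the use of MD5 is only an assumption based on the 
32-hex-digit digests in the log.  It just shows how a digest recorded for 
one version can be compared against a different version that is actually 
on disk.

```python
import hashlib


def digest(data: bytes) -> str:
    # Assumption: MD5 is used only because the logged digests are
    # 32 hex digits long; the real digest method may differ.
    return hashlib.md5(data).hexdigest()


class DeferredWriter:
    """Toy model of a write path that does not guarantee when data lands."""

    def __init__(self):
        self.disk = {}      # filename -> bytes, stands in for the filesystem
        self.pending = []   # queued (filename, contents) writes, oldest first

    def queue_write(self, name: str, contents: bytes) -> None:
        # The caller returns immediately; nothing is on "disk" yet.
        self.pending.append((name, contents))

    def flush_one(self) -> None:
        # Complete the oldest queued write.
        name, contents = self.pending.pop(0)
        self.disk[name] = contents


def simulate_race():
    writer = DeferredWriter()
    v1 = b'<cib epoch="1"/>'   # hypothetical older version
    v2 = b'<cib epoch="2"/>'   # hypothetical newer version

    writer.queue_write("cib.xml", v1)
    writer.queue_write("cib.xml", v2)

    # The digest is recorded for the version we believe is current...
    expected = digest(v2)

    # ...but only the first write has completed when we validate.
    writer.flush_one()
    stale = digest(writer.disk["cib.xml"])

    # Once the second write lands, the digests agree again.
    writer.flush_one()
    final = digest(writer.disk["cib.xml"])
    return expected, stale, final


if __name__ == "__main__":
    expected, stale, final = simulate_race()
    print("mismatch while write is pending:", expected != stale)
    print("match after write completes:   ", expected == final)
```

In this sketch both versions are perfectly valid files; the mismatch 
comes purely from validating at the wrong moment, which would be 
consistent with the rejected file looking intact.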

Andrew:  You should have already received all the relevant logs in a 
separate email.

Also, for my reference - what method are you using to compute the digest 
of the file?  That is, what command should I execute to get the same 
results?

-- 
     Alan Robertson <alanr at unix.sh>

"Openness is the foundation and preservative of friendship...  Let me 
claim from you at all times your undisguised opinions." - William 
Wilberforce



