[Pacemaker] CIB write-to-disk bug?
Alan Robertson
alanr at unix.sh
Wed Mar 31 23:15:08 UTC 2010
Hi,
I've run into what looks at first blush to be a CIB bug in writing to disk.
The key messages from this incident are these:
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: validate_cib_digest:
Digest comparision failed: expected 316049fa7ee8d2e107573ce7cded07cf
(/var/lib/heartbeat/crm/cib.GUdD9T), calculated
0bac3440f5c42f0f37d22ea7dfe433e8
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Checksum of
/var/lib/heartbeat/crm/cib.uHFtAW failed! Configuration contents ignored!
Mar 31 19:02:52 vhost0384 cib: [13294]: ERROR: retrieveCib: Usually this
is caused by manual changes, please refer to
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Mar 31 19:02:52 vhost0384 cib: [13294]: WARN: retrieveCib: Continuing
but /var/lib/heartbeat/crm/cib.uHFtAW will NOT used.
I did not make manual changes on a running CIB. I was using the cluster
shell at the time. The CIB it is complaining about appears to be an
intact, valid CIB with contents approximately like they should have been
at the time. By the way, I have a report from another IBMer that they
have seen systems that stop writing to their local CIBs. I'll contact him.
Here are some relevant facts:
  - These machines are virtual guests in a cloud somewhere - operations
    have somewhat unpredictable latency. But nothing too egregious
    was happening at the time or Heartbeat would have bitched.
  - I was doing some testing at the time. I was putting on and
    taking off constraints using the cluster shell
    migrate and unmigrate operations.
Given that the file looks intact, and I know how the CIB is written to
disk (since I originally wrote that code), I wonder whether this is a
versioning issue or a race condition. That is, the code for writing to
disk does NOT guarantee when the write completes (assuming you're still
using that code). It would be easy to compute a checksum on a different
version than the one you thought you were checking, or to compute it
before the write had completed.
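To make the suspected race concrete, here is a minimal sketch of the failure mode, not the actual CIB code. Everything in it is hypothetical: the file names cib.xml and cib.xml.sig, the helper names, and the use of MD5 (assumed only because the digests in the log are 32 hex digits long). It shows a digest taken from one version being stored next to a newer version of the file, and then the same check passing once the digest is computed from the exact bytes that were written.

```python
import hashlib
import os
import tempfile


def md5_hex(data: bytes) -> str:
    # MD5 is an assumption here; the real digest method is not known from the logs.
    return hashlib.md5(data).hexdigest()


def write_atomically(path: str, data: bytes) -> None:
    """Write to a temp file in the same directory, sync it, then rename into place."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see the old or the new file, never a partial one
    except BaseException:
        os.unlink(tmp)
        raise


def verify(cib_path: str, sig_path: str) -> bool:
    """Return True if the stored digest matches the file contents on disk."""
    with open(cib_path, "rb") as f:
        data = f.read()
    with open(sig_path, "r") as f:
        expected = f.read().strip()
    return md5_hex(data) == expected


def simulate_stale_digest(workdir: str):
    """Return (result with a stale digest, result with a matching digest)."""
    cib = os.path.join(workdir, "cib.xml")      # hypothetical file name
    sig = os.path.join(workdir, "cib.xml.sig")  # hypothetical file name
    v1 = b'<cib epoch="1"/>'
    v2 = b'<cib epoch="2"/>'

    # The digest is captured for version 1 ...
    digest_v1 = md5_hex(v1)
    # ... but version 2 reaches the disk before that digest is stored.
    write_atomically(cib, v2)
    write_atomically(sig, digest_v1.encode())
    stale_ok = verify(cib, sig)

    # Computing the digest from the same bytes that were written avoids the mismatch.
    write_atomically(sig, md5_hex(v2).encode())
    consistent_ok = verify(cib, sig)
    return stale_ok, consistent_ok
```

In this sketch the file on disk is intact and valid in both cases; only the pairing of file version and digest differs, which would be consistent with an intact-looking CIB failing its checksum.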
Andrew: You should have already received all the relevant logs in a
separate email.
Also, for my reference - what method are you using to compute the digest
of the file? That is, what command should I execute to get the same
results?
--
Alan Robertson <alanr at unix.sh>
"Openness is the foundation and preservative of friendship... Let me
claim from you at all times your undisguised opinions." - William
Wilberforce