[Pacemaker] getting started - crm hangs when adding resources, even "crm ra classes" hangs

Wed Mar 14 16:58:20 CET 2012

On Mar 14, 2012, at 9:45 AM, Florian Haas wrote:
>>> The current cluster-glue package in squeeze-backports,
>>> cluster-glue_1.0.9+hg2665-1~bpo60+2, has upstart disabled.
>>> Double-check that you're running that version. If you do, and the
>>> issue persists, please let us know.
>> 
>> Indeed, that's the version that hit the repo last night when I decided to quit. This morning, I tried that version and concluded I was experiencing the same issue.
> 
> Are you absolutely certain?
> 
> Can you confirm that you're running the ~bpo60+2 (note trailing "2")
> build, that you're actually running an lrmd binary from that version
> (meaning: that you properly killed your lrmd prior to installing that
> package), _and_ that "lrmadmin -C" does *not* list "upstart"?

Let's discard all of my previous conclusions. Apparently I was confused. 

Now I'm sure I'm running +2 on all three nodes, and I restarted pacemaker and corosync on all of them. I'm basing my knowledge of what versions I'm running on apt-cache policy, output copied below. From that, I'm also reasonably sure that whatever patched versions of cluster-glue and glib I built are not installed now.

I can confirm that lrmadmin -C does not list upstart (also below). Nor does it leak sockets, as reported by "lsof -f | grep lrm_callback_sock". However, sometimes pacemakerd will not stop cleanly. I thought it might happen when stopping pacemaker on the current DC, but after successfully reproducing this failure twice, I couldn't do it again. Pacemakerd seems to exit but fails to notify the other nodes of its shutdown. Syslog is flooded with "Retransmit List" messages (log attached). These persist until I stop corosync. Immediately after stopping pacemaker and corosync on one node, "crm status" on the other nodes still reports that node as online. After a while, the stopped node switches to offline; I assume some timeout expires and the other nodes conclude it crashed.

# lrmadmin -C
There are 4 RA classes supported:
lsb
ocf
heartbeat
stonith
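
For anyone who wants to repeat this check on several nodes, here is a minimal sketch of how the output above could be tested for "upstart" in a script. It assumes the output format shown here (one header line ending in a colon, then one class per line); the function name parse_ra_classes and the SAMPLE string are made up for illustration and are not part of cluster-glue.

```python
def parse_ra_classes(output):
    """Return the RA class names from `lrmadmin -C` style output.

    Assumes a header line ending in ':' followed by one class per line,
    as in the output pasted above.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Drop the "There are N RA classes supported:" header if it is present.
    if lines and lines[0].endswith(":"):
        lines = lines[1:]
    return lines


# Sample text copied from the lrmadmin -C output in this message.
SAMPLE = """There are 4 RA classes supported:
lsb
ocf
heartbeat
stonith
"""

classes = parse_ra_classes(SAMPLE)
if "upstart" in classes:
    print("upstart is still listed; lrmd is probably from an older build")
else:
    print("upstart is not listed")
```

On a live node the text would come from actually running lrmadmin -C (for example through subprocess) instead of the SAMPLE string.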

# apt-cache policy pacemaker corosync cluster-glue libglib2.0-0
libglib2.0-0:
 Installed: 2.24.2-1
 Candidate: 2.24.2-1
 Version table:
*** 2.24.2-1 0
       500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
       100 /var/lib/dpkg/status
cluster-glue:
 Installed: 1.0.9+hg2665-1~bpo60+2
 Candidate: 1.0.9+hg2665-1~bpo60+2
 Package pin: 1.0.9+hg2665-1~bpo60+2
 Version table:
*** 1.0.9+hg2665-1~bpo60+2 1000
       100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
       100 /var/lib/dpkg/status
    1.0.6-1 1000
       500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
corosync:
 Installed: 1.4.2-1~bpo60+1
 Candidate: 1.4.2-1~bpo60+1
 Package pin: 1.4.2-1~bpo60+1
 Version table:
*** 1.4.2-1~bpo60+1 1000
       100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
       100 /var/lib/dpkg/status
    1.2.1-4 1000
       500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages
pacemaker:
 Installed: 1.1.6-2~bpo60+1
 Candidate: 1.1.6-2~bpo60+1
 Package pin: 1.1.6-2~bpo60+1
 Version table:
*** 1.1.6-2~bpo60+1 1000
       100 http://backports.debian.org/debian-backports/ squeeze-backports/main amd64 Packages
       100 /var/lib/dpkg/status
    1.0.9.1+hg15626-1 1000
       500 http://ftp.egr.msu.edu/debian/ squeeze/main amd64 Packages

-------------- next part --------------
A non-text attachment was scrubbed...
Name: pacemaker_shutdown.log.gz
Type: application/x-gzip
Size: 3539 bytes
Desc: not available
URL: <http://oss.clusterlabs.org/pipermail/pacemaker/attachments/20120314/ac06793d/attachment-0001.gz>