[Pacemaker] do_lrm_control: Failed to sign on to the LRM repeatedly!
Dave Williams
dave at opensourcesolutions.co.uk
Thu Nov 18 23:08:38 UTC 2010
I have a problem with a cluster that won't start up. It is running a two-node
failover (master/slave) clustered FTP server, using DRBD to replicate the
filesystem.
I upgraded from 10.04 Lucid to 10.10 Maverick to obtain support for
upstart resource agents.
Running:
pacemaker 1.0.9.1-2ubuntu4
corosync 1.2.1-1ubuntu1
cluster-agents 1:1.0.3-3
Before the upgrade it was working reasonably well (except for failing to
detect that vsftpd was running, which I diagnosed as being due to upstart
having hijacked the LSB-compliant SysV startup script and replaced it with
its own non-compliant version).
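In case it helps anyone else hitting the same thing, a quick way to see
whether upstart has taken over a service is to compare the upstart job with
the old init script. The vsftpd paths below are just what I would expect on
Maverick, so treat this as a sketch:

ls /etc/init/vsftpd.conf /etc/init.d/vsftpd   # upstart job vs SysV script
status vsftpd                                 # what upstart thinks
/etc/init.d/vsftpd status                     # what an lsb: resource agent would see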
The daemon logs show:
crmd: WARN: lrm_signon: can not initiate connection
crmd: [4963]: WARN: do_lrm_control: Failed to sign on to the LRM 29 (30
max) time
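For completeness, this is roughly how I pulled those lines out, assuming the
messages land in /var/log/daemon.log as they do here:

grep -E 'lrmd|lrm_signon|do_lrm_control' /var/log/daemon.log | tail -n 50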
netstat -anp shows:
unix 2 [ ] DGRAM 22204 4546/lrmd
which implies at least part of lrmd is running.
I don't know what this implies, but I cannot find any unix sockets in the
filesystem.
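This is roughly how I looked for them; the /var/run/heartbeat path is where I
later saw lrmd creating its sockets (see the strace notes below), so I take
that to be the expected location:

netstat -xap | grep lrmd       # all unix sockets belonging to lrmd
ls -l /var/run/heartbeat/      # where lrm_cmd_sock and the callback socket should be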
ps axf shows:
25525 ? Ssl 0:00 /usr/sbin/corosync
25532 ? SLs 0:00 \_ /usr/lib/heartbeat/stonithd
25533 ? S 0:00 \_ /usr/lib/heartbeat/cib
25534 ? Z 0:00 \_ [lrmd] <defunct>
25535 ? S 0:00 \_ /usr/lib/heartbeat/attrd
25536 ? Z 0:00 \_ [pengine] <defunct>
25537 ? S 0:00 \_ /usr/lib/heartbeat/crmd
25540 ? S 0:00 \_ /usr/lib/heartbeat/cib
25541 ? S 0:00 \_ /usr/lib/heartbeat/lrmd
25542 ? S 0:00 \_ /usr/lib/heartbeat/attrd
25543 ? S 0:00 \_ /usr/lib/heartbeat/pengine
25547 ? Z 0:00 \_ [corosync] <defunct>
25548 ? Z 0:00 \_ [corosync] <defunct>
25553 ? Z 0:00 \_ [corosync] <defunct>
25555 ? Z 0:00 \_ [corosync] <defunct>
25866 ? S 0:00 \_ /usr/lib/heartbeat/crmd
(This was from another run, so the PIDs differ from those above.)
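A slightly clearer way to see the parent/child relationship and which
children are zombies (state Z) is something like the following, assuming the
oldest corosync process is the real parent:

ps -o pid,ppid,stat,cmd --ppid $(pgrep -xo corosync)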
crm_mon -1 shows:
============
Last updated: Wed Nov 17 00:13:25 2010
Stack: openais
Current DC: NONE
2 Nodes configured, 2 expected votes
2 Resources configured.
============
OFFLINE: [ node1 node2 ]
Clearly "Current DC: NONE" is the symptom that results from lrmd not
being communicative.
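To rule out the membership layer itself, I assume the ring status can be
checked with something like the following (I have not confirmed how much this
actually tells you):

corosync-cfgtool -s            # corosync ring / membership status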
strace analysis shows the initial (now defunct) lrmd creating
"/var/run/heartbeat/lrm_cmd_sock" and ..callback_sock, then being terminated
by a SIGTERM about one second later from the second lrmd instance, which
carries on running. This appears to cause the first instance to delete the
socket.
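For reference, something along these lines should reproduce the trace (the
exact flags are my reconstruction, not a transcript of what I ran); following
forks with -f from the init script start is what catches the short-lived
first lrmd:

strace -f -tt -e trace=socket,bind,listen,unlink,kill \
    -o /tmp/corosync.strace /etc/init.d/corosync start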
I haven't followed the source far enough yet to understand whether this is
expected behaviour or an erroneous condition, but it appears the missing
socket is the cause of the error messages. Whether this is why my cluster
won't start I am not 100% sure.
It may be some form of timing condition, because I did once manage to get the
stack running via corosync stops and starts with a random delay in between
(roughly as sketched below).
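What I did was essentially the following; the length of the delay is a guess
I did not record, so take the numbers as illustrative:

/etc/init.d/corosync stop
sleep $((RANDOM % 10 + 5))     # bash; random-ish pause before restarting
/etc/init.d/corosync start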
(I note that "/etc/init.d/corosync stop" leaves some processes running!)
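To get back to a clean state before retrying I resort to something like this
by hand; the pkill pattern and the socket clean-up are my own guesses at what
is safe on this node, so use with care:

ps axf | grep -E '[c]orosync|[h]eartbeat'   # anything still running?
pkill -f /usr/lib/heartbeat/                # assumes nothing else on the node runs from this path
rm -f /var/run/heartbeat/lrm_cmd_sock       # stale socket, if one was left behind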
Can anyone help me debug this, find the root cause, and work out a solution?
Thanks