[Pacemaker] CentOS 6.4 - pacemaker 1.1.8 - heartbeat

Andreas asg at ftpproxy.org
Mon Apr 8 11:24:16 CEST 2013


Am 08.04.2013 03:54, schrieb Andrew Beekhof:
> Looks like pacemaker is already running. How are you trying to start
> pacemaker?
>

Started as usual with the init script:

service heartbeat start
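(For context: on a heartbeat-based stack the pacemaker daemons are started and supervised by heartbeat itself, which is usually enabled in /etc/ha.d/ha.cf with a directive like the one below; the exact directive spelling can differ between heartbeat versions, so treat this as an assumed example, not my actual config.)

# /etc/ha.d/ha.cf (excerpt, assumed setup)
# Tell heartbeat to start and respawn the pacemaker daemons
crm respawn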


This produces the following error messages:

Apr 5 13:35:11 lb1 pengine[29465]: error: main: Failed to create IPC
server: shutting down and inhibiting respawn
Apr 5 13:35:27 lb1 pengine[29468]: error: qb_ipcs_us_publish: Could not
bind AF_UNIX (): Address already in use (98)
Apr 5 13:35:27 lb1 pengine[29468]: error: mainloop_add_ipc_server: Could
not start pengine IPC server: Address already in use (-98)
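Errno 98 is EADDRINUSE on Linux: something has already bound pengine's IPC socket, which usually means a leftover pengine from a previous, incompletely stopped cluster is still alive. A quick check I can run before starting heartbeat (a sketch; it assumes pgrep is available and that the daemon's process name is literally "pengine"):

```shell
#!/bin/sh
# Check for a stray pengine daemon still holding its IPC socket
# from a previous, incompletely stopped cluster.
if pgrep -x pengine >/dev/null 2>&1; then
    echo "pengine still running:"
    pgrep -lx pengine
else
    echo "no pengine process found"
fi
```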

And the cluster node does not start completely; here is the full log
sequence:


Apr 5 13:09:43 lb1 heartbeat: [29423]: info: Starting
"/usr/lib64/heartbeat/crmd" as uid 499 gid 498 (pid 29423)
Apr 5 13:09:43 lb1 heartbeat: [29422]: info: Starting
"/usr/lib64/heartbeat/attrd" as uid 499 gid 498 (pid 29422)
Apr 5 13:09:43 lb1 heartbeat: [29420]: info: Starting
"/usr/lib64/heartbeat/lrmd -r" as uid 0 gid 0 (pid 29420)
Apr 5 13:09:43 lb1 heartbeat: [29421]: info: Starting
"/usr/lib64/heartbeat/stonithd" as uid 0 gid 0 (pid 29421)
Apr 5 13:09:43 lb1 heartbeat: [29419]: info: Starting
"/usr/lib64/heartbeat/cib" as uid 499 gid 498 (pid 29419)
Apr 5 13:09:43 lb1 heartbeat: [29418]: info: Starting
"/usr/lib64/heartbeat/ccm" as uid 499 gid 498 (pid 29418)
Apr 5 13:09:43 lb1 lrmd: [29420]: info: max-children set to 4 (1
processors online)
Apr 5 13:09:44 lb1 lrmd: [29420]: info: enabling coredumps
Apr 5 13:09:44 lb1 lrmd: [29420]: info: Started.
Apr 5 13:09:44 lb1 ccm: [29418]: info: Hostname: lb1
Apr 5 13:09:44 lb1 heartbeat: [29409]: info: the send queue length from
heartbeat to client ccm is set to 1024
Apr 5 13:09:44 lb1 attrd[29422]: notice: crm_cluster_connect: Connecting
to cluster infrastructure: heartbeat
Apr 5 13:09:44 lb1 heartbeat: [29409]: info: the send queue length from
heartbeat to client attrd is set to 1024
Apr 5 13:09:44 lb1 crmd[29423]: notice: main: CRM Git Version: 2a917dd
Apr 5 13:09:44 lb1 stonith-ng[29421]: notice: crm_cluster_connect:
Connecting to cluster infrastructure: heartbeat
Apr 5 13:09:44 lb1 attrd[29422]: notice: main: Starting mainloop...
Apr 5 13:09:44 lb1 heartbeat: [29409]: info: the send queue length from
heartbeat to client stonith-ng is set to 1024
Apr 5 13:09:44 lb1 pengine[29426]: error: qb_ipcs_us_publish: Could not
bind AF_UNIX (): Address already in use (98)
Apr 5 13:09:44 lb1 pengine[29426]: error: mainloop_add_ipc_server: Could
not start pengine IPC server: Address already in use (-98)
Apr 5 13:09:44 lb1 pengine[29426]: error: main: Failed to create IPC
server: shutting down and inhibiting respawn
Apr 5 13:09:44 lb1 cib[29419]: notice: crm_cluster_connect: Connecting
to cluster infrastructure: heartbeat
Apr 5 13:09:44 lb1 crmd[29423]: warning: do_cib_control: Couldn't
complete CIB registration 1 times... pause and retry
Apr 5 13:09:44 lb1 crmd[29423]: error: crmdManagedChildDied: Child
process pengine exited (pid=29426, rc=100)
Apr 5 13:09:44 lb1 heartbeat: [29409]: WARN: 1 lost packet(s) for [lb2]
[74316:74318]
Apr 5 13:09:44 lb1 heartbeat: [29409]: info: No pkts missing from lb2!
Apr 5 13:09:44 lb1 heartbeat: [29409]: info: the send queue length from
heartbeat to client cib is set to 1024
Apr 5 13:09:45 lb1 cib[29419]: notice: cib_server_process_diff: Not
applying diff 0.44.18 -> 0.44.19 (sync in progress)
Apr 5 13:09:45 lb1 cib[29419]: notice: cib_server_process_diff: Not
applying diff 0.44.19 -> 0.44.20 (sync in progress)
Apr 5 13:09:45 lb1 cib[29419]: notice: cib_server_process_diff: Not
applying diff 0.44.20 -> 0.44.21 (sync in progress)
Apr 5 13:09:45 lb1 cib[29419]: notice: cib_server_process_diff: Not
applying diff 0.44.21 -> 0.44.22 (sync in progress)
Apr 5 13:09:45 lb1 cib[29419]: notice: cib_server_process_diff: Not
applying diff 0.44.22 -> 0.44.23 (sync in progress)
Apr 5 13:09:45 lb1 stonith-ng[29421]: notice: setup_cib: Watching for
stonith topology changes
Apr 5 13:09:46 lb1 crmd[29423]: notice: crm_cluster_connect: Connecting
to cluster infrastructure: heartbeat
Apr 5 13:09:46 lb1 heartbeat: [29409]: info: the send queue length from
heartbeat to client crmd is set to 1024
Apr 5 13:09:47 lb1 cib[29419]: notice: crm_update_peer_state:
crm_update_ccm_node: Node lb2[1] - state is now member (was (null))
Apr 5 13:09:47 lb1 cib[29419]: notice: crm_update_peer_state:
crm_update_ccm_node: Node lb1[0] - state is now member (was (null))
Apr 5 13:09:48 lb1 crmd[29423]: warning: do_lrm_control: Failed to sign
on to the LRM 1 (30 max) times
Apr 5 13:09:48 lb1 crmd[29423]: notice: crmd_client_status_callback:
Status update: Client lb1/crmd now has status [join] (DC=false)
Apr 5 13:09:48 lb1 crmd[29423]: notice: crmd_client_status_callback:
Status update: Client lb1/crmd now has status [online] (DC=false)
Apr 5 13:09:48 lb1 crmd[29423]: notice: crmd_client_status_callback:
Status update: Client lb2/crmd now has status [online] (DC=false)
Apr 5 13:09:48 lb1 crmd[29423]: warning: do_lrm_control: Failed to sign
on to the LRM 2 (30 max) times
Apr 5 13:09:48 lb1 crmd[29423]: warning: do_lrm_control: Failed to sign
on to the LRM 3 (30 max) times
Apr 5 13:09:50 lb1 crmd[29423]: warning: do_lrm_control: Failed to sign
on to the LRM 4 (30 max) times
Apr 5 13:09:52 lb1 crmd[29423]: warning: do_lrm_control: Failed to sign
on to the LRM 5 (30 max) times
Apr 5 13:09:54 lb1 crmd[29423]: warning: do_lrm_control: Failed to sign
on to the LRM 6 (30 max) times
Apr 5 13:09:56 lb1 crmd[29423]: warning: do_lrm_control: Failed to sign
on to the LRM 7 (30 max) times

....

>
>>
>> # /usr/libexec/pacemaker/pengine -V
>> info: qb_ipcs_us_publish:    server name: pengine
>> error: qb_ipcs_us_publish:   Could not bind AF_UNIX (): Address already in use (98)
>> info: qb_ipcs_us_withdraw:   withdrawing server sockets
>> info: qb_ipcs_us_withdraw:   withdrawing server sockets
>> error: mainloop_add_ipc_server:      Could not start pengine IPC server: Address already in use (-98)
>> error: main:  Failed to create IPC server: shutting down and inhibiting respawn
>> info: crm_xml_cleanup:       Cleaning up memory from libxml2
>>

I just tried this to isolate the problem.

When I ran it, there was definitely no other cluster component running!
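Since libqb publishes its IPC endpoints as abstract AF_UNIX sockets on Linux, the bind can also be checked directly in the kernel socket table; an entry there means some process, possibly one not obviously named pengine, still holds the socket. A sketch (/proc/net/unix is standard Linux; that the socket name contains "pengine" is an assumption about libqb's naming):

```shell
#!/bin/sh
# Look for an abstract AF_UNIX socket whose name mentions "pengine".
# Abstract sockets live only in the kernel table, not the filesystem,
# so they appear in /proc/net/unix rather than under /var/run.
if grep -q pengine /proc/net/unix 2>/dev/null; then
    echo "a pengine socket is still bound:"
    grep pengine /proc/net/unix
else
    echo "no pengine socket found"
fi
```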

Thanks
Andreas


