[Pacemaker] heartbeat stop hangs sometimes
Markus M.
adrock0905 at alice.de
Mon Mar 1 21:36:10 UTC 2010
Lars Ellenberg wrote:
> I've seen this too,
> a few times.
>...
> And I don't yet have a reliable way to reproduce it, either.
> If you have, let us know!
We are using a simple shell script that runs /etc/init.d/heartbeat
start/stop with a varying delay between start and stop (the delay starts
at 60 seconds and increases by 20 seconds each time).
The script does a maximum of 24 iterations, and so far it has never run
through all of them without heartbeat hanging.
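For anyone who wants to try this, the loop described above can be sketched roughly as follows. This is not the original script: the variable and function names are made up for illustration, and it assumes the same delay is used after start and after stop, which the description leaves open.

```shell
#!/bin/sh
# Sketch of a start/stop reproduction loop (hypothetical names, not the
# original script). Values follow the description: first delay 60 seconds,
# 20 seconds more on each iteration, at most 24 iterations.
INIT_SCRIPT="${INIT_SCRIPT:-/etc/init.d/heartbeat}"
MAX_ITERATIONS="${MAX_ITERATIONS:-24}"
START_DELAY="${START_DELAY:-60}"
DELAY_STEP="${DELAY_STEP:-20}"

# Print the delay in seconds for iteration $1 (1-based).
delay_for_iteration() {
    echo $(( $START_DELAY + ($1 - 1) * $DELAY_STEP ))
}

# Run the start/stop cycles. If "stop" hangs, the loop simply blocks at
# that iteration, and the last line printed shows which delay triggered it.
run_cycles() {
    i=1
    while [ "$i" -le "$MAX_ITERATIONS" ]; do
        delay=$(delay_for_iteration "$i")
        echo "iteration $i: delay ${delay}s"
        "$INIT_SCRIPT" start || return 1
        sleep "$delay"
        "$INIT_SCRIPT" stop || return 1
        # Assumption: wait the same time again before the next start.
        sleep "$delay"
        i=$(( i + 1 ))
    done
}
```

Calling run_cycles at the end of the file (as root) starts the test. With these values the last iteration uses a delay of 520 seconds.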
> Maybe the following helps (sorry, patch is likely not whitespace clean)
I applied the patch to the pacemaker source rpm found at clusterlabs.org.
Unfortunately it doesn't fix the problem. Heartbeat still hangs:
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_STOPPING [ input=I_STOP
cause=C_FSA_INTERNAL origin=notify_crmd ]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_dc_release: DC role
released
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: stop_subsystem: Sent -TERM
to pengine: [17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_te_control:
Transitioner is now inactive
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_te_control:
Disconnecting STONITH...
Mar 01 22:06:43 dbprod21 crmd: [17075]: info:
tengine_stonith_connection_destroy: Fencing daemon disconnected
Mar 01 22:06:43 dbprod21 crmd: [17075]: notice: Not currently connected.
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Terminating
the pengine
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: stop_subsystem: Sent -TERM
to pengine: [17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Waiting for
subsystems to exit
Mar 01 22:06:43 dbprod21 crmd: [17075]: WARN: register_fsa_input_adv:
do_shutdown stalled the FSA with pending inputs
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: All
subsystems stopped, continuing
Mar 01 22:06:43 dbprod21 crmd: [17075]: WARN: do_log: FSA: Input
I_RELEASE_SUCCESS from do_dc_release() received in state S_STOPPING
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Terminating
the pengine
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: stop_subsystem: Sent -TERM
to pengine: [17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Waiting for
subsystems to exit
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: All
subsystems stopped, continuing
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: crmdManagedChildDied:
Process pengine:[17092] exited (signal=0, exitcode=0)
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: pe_msg_dispatch: Received
HUP from pengine:[17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: pe_connection_destroy:
Connection to the Policy Engine released
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: All
subsystems stopped, continuing
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_lrm_control:
Disconnected from the LRM
Mar 01 22:06:43 dbprod21 ccm: [17070]: info: client (pid=17075) removed
from ccm
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_ha_control:
Disconnected from Heartbeat
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_cib_control:
Disconnecting CIB
Mar 01 22:06:43 dbprod21 cib: [17071]: info: cib_process_readwrite: We
are now in R/O mode
Mar 01 22:06:43 dbprod21 crmd: [17075]: info:
crmd_cib_connection_destroy: Connection to the CIB terminated...
Mar 01 22:06:43 dbprod21 cib: [17071]: WARN: send_ipc_message: IPC
Channel to 17075 is not connected
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_exit: Performing
A_EXIT_0 - gracefully exiting the CRMd
Mar 01 22:06:43 dbprod21 cib: [17071]: WARN: send_via_callback_channel:
Delivery of reply to client 17075/89bca114-6817-460e-90e7-c5ccd4ef6a23
failed
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: free_mem: Dropping
I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
Mar 01 22:06:43 dbprod21 cib: [17071]: WARN: do_local_notify: A-Sync
reply to crmd failed: reply failed
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_exit: [crmd] stopped (0)
Mar 01 22:06:43 dbprod21 heartbeat: [17057]: info: killing
/usr/lib64/heartbeat/attrd process group 17074 with signal 15
Mar 01 22:14:58 dbprod21 cib: [17071]: info: cib_stats: Processed 40
operations (1750.00us average, 0% utilization) in the last 10min
[root@dbprod21 log]# ps -efw | grep heart
root 17057 1 0 22:03 ? 00:00:00 heartbeat: master control process
root 17060 17057 0 22:03 ? 00:00:00 heartbeat: FIFO reader
root 17061 17057 0 22:03 ? 00:00:00 heartbeat: write: ucast eth0
root 17062 17057 0 22:03 ? 00:00:00 heartbeat: read: ucast eth0
root 17063 17057 0 22:03 ? 00:00:00 heartbeat: write: ucast eth0
root 17064 17057 0 22:03 ? 00:00:00 heartbeat: read: ucast eth0
root 17065 17057 0 22:03 ? 00:00:00 heartbeat: write: serial /dev/ttyS0
root 17066 17057 0 22:03 ? 00:00:00 heartbeat: read: serial /dev/ttyS0
101 17070 17057 0 22:04 ? 00:00:00 /usr/lib64/heartbeat/ccm
101 17071 17057 0 22:04 ? 00:00:00 /usr/lib64/heartbeat/cib
root 17072 17057 0 22:04 ? 00:00:00 /usr/lib64/heartbeat/lrmd -r
root 17073 17057 0 22:04 ? 00:00:00 /usr/lib64/heartbeat/stonithd
101 17074 17057 0 22:04 ? 00:00:00 /usr/lib64/heartbeat/attrd
root 17506 16920 0 22:35 pts/0 00:00:00 grep heart
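To make it easier to see which children of the master control process are still alive after a hung stop (here ccm, cib, lrmd, stonithd and attrd are still running, while crmd has already exited), a small filter along these lines may help. The function name children_of is made up, and it assumes the usual `ps -ef` column order (UID PID PPID C STIME TTY TIME CMD).

```shell
# List "PID CMD" for every process whose parent PID is $1.
# Reads `ps -ef` style lines on stdin; assumes the command starts at column 8.
children_of() {
    awk -v ppid="$1" '
        $3 == ppid {
            cmd = $8
            for (i = 9; i <= NF; i++) cmd = cmd " " $i
            print $2, cmd
        }'
}
```

For the listing above, `ps -ef | children_of 17057` would print every process whose parent is the master control process, including the FIFO reader and the ucast/serial read and write processes.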
Regards
Markus