[Pacemaker] heartbeat stop hangs sometimes

Mon Mar 1 21:36:10 UTC 2010

Lars Ellenberg wrote:
> I've seen this too,
> a few times.
 >...
> And I don't yet have a reliable way to reproduce it, either.
> If you have, let us know!

We are using a simple shell script that runs /etc/init.d/heartbeat
start and stop with a varying delay between the two (the delay starts
at 60 seconds and is incremented by 20 seconds each time).

The script does a maximum of 24 iterations, and so far it has never run
through all of them without heartbeat hanging.
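
A loop of the kind described above could look roughly like the following
sketch. It is not the original script: the variable names, the
300-second stop timeout, and the use of GNU coreutils timeout(1) to
detect a hung stop are all assumptions made for illustration. The delay
is assumed to sit between start and the following stop.

```shell
#!/bin/sh
# Sketch of a start/stop stress loop. All names and defaults below are
# assumptions for illustration, not taken from the original script.
HB_INIT=${HB_INIT:-/etc/init.d/heartbeat}  # init script named in the post
MAX_ITER=${MAX_ITER:-24}                   # maximum number of iterations
FIRST_DELAY=${FIRST_DELAY:-60}             # seconds between first start and stop
DELAY_STEP=${DELAY_STEP:-20}               # seconds added after each iteration
STOP_TIMEOUT=${STOP_TIMEOUT:-300}          # assumed limit before stop counts as hung
SLEEP=${SLEEP:-sleep}                      # overridable so the loop can be dry-run

stress_heartbeat() {
    delay=$FIRST_DELAY
    i=1
    while [ "$i" -le "$MAX_ITER" ]; do
        echo "iteration $i: start, wait ${delay}s, stop"
        "$HB_INIT" start || { echo "iteration $i: start failed"; return 2; }
        "$SLEEP" "$delay"
        # timeout(1) exits with status 124 when it had to kill the command
        rc=0
        timeout "$STOP_TIMEOUT" "$HB_INIT" stop || rc=$?
        if [ "$rc" -eq 124 ]; then
            echo "iteration $i: stop hung after ${delay}s delay"
            return 1
        fi
        delay=$((delay + DELAY_STEP))
        i=$((i + 1))
    done
    echo "all $MAX_ITER iterations completed without a hang"
}
```

Calling stress_heartbeat after sourcing the file runs the loop. Stopping
at the first hang leaves the stuck processes in place for inspection.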

> Maybe the following helps (sorry, patch is likely not whitespace clean)

I applied the patch to the pacemaker source rpm found at clusterlabs.org.
Unfortunately it doesn't fix the problem. Heartbeat still hangs:

Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_STOPPING [ input=I_STOP cause=C_FSA_INTERNAL origin=notify_crmd ]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_dc_release: DC role released
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: stop_subsystem: Sent -TERM to pengine: [17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_te_control: Transitioner is now inactive
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_te_control: Disconnecting STONITH...
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: tengine_stonith_connection_destroy: Fencing daemon disconnected
Mar 01 22:06:43 dbprod21 crmd: [17075]: notice: Not currently connected.
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Terminating the pengine
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: stop_subsystem: Sent -TERM to pengine: [17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Waiting for subsystems to exit
Mar 01 22:06:43 dbprod21 crmd: [17075]: WARN: register_fsa_input_adv: do_shutdown stalled the FSA with pending inputs
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: All subsystems stopped, continuing
Mar 01 22:06:43 dbprod21 crmd: [17075]: WARN: do_log: FSA: Input I_RELEASE_SUCCESS from do_dc_release() received in state S_STOPPING
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Terminating the pengine
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: stop_subsystem: Sent -TERM to pengine: [17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: Waiting for subsystems to exit
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: All subsystems stopped, continuing
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: crmdManagedChildDied: Process pengine:[17092] exited (signal=0, exitcode=0)
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: pe_msg_dispatch: Received HUP from pengine:[17092]
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: pe_connection_destroy: Connection to the Policy Engine released
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_shutdown: All subsystems stopped, continuing
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_lrm_control: Disconnected from the LRM
Mar 01 22:06:43 dbprod21 ccm: [17070]: info: client (pid=17075) removed from ccm
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_ha_control: Disconnected from Heartbeat
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_cib_control: Disconnecting CIB
Mar 01 22:06:43 dbprod21 cib: [17071]: info: cib_process_readwrite: We are now in R/O mode
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: crmd_cib_connection_destroy: Connection to the CIB terminated...
Mar 01 22:06:43 dbprod21 cib: [17071]: WARN: send_ipc_message: IPC Channel to 17075 is not connected
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
Mar 01 22:06:43 dbprod21 cib: [17071]: WARN: send_via_callback_channel: Delivery of reply to client 17075/89bca114-6817-460e-90e7-c5ccd4ef6a23 failed
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: free_mem: Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
Mar 01 22:06:43 dbprod21 cib: [17071]: WARN: do_local_notify: A-Sync reply to crmd failed: reply failed
Mar 01 22:06:43 dbprod21 crmd: [17075]: info: do_exit: [crmd] stopped (0)
Mar 01 22:06:43 dbprod21 heartbeat: [17057]: info: killing /usr/lib64/heartbeat/attrd process group 17074 with signal 15
Mar 01 22:14:58 dbprod21 cib: [17071]: info: cib_stats: Processed 40 operations (1750.00us average, 0% utilization) in the last 10min

[root@dbprod21 log]# ps -efw | grep heart
root     17057     1  0 22:03 ?        00:00:00 heartbeat: master control process
root     17060 17057  0 22:03 ?        00:00:00 heartbeat: FIFO reader
root     17061 17057  0 22:03 ?        00:00:00 heartbeat: write: ucast eth0
root     17062 17057  0 22:03 ?        00:00:00 heartbeat: read: ucast eth0
root     17063 17057  0 22:03 ?        00:00:00 heartbeat: write: ucast eth0
root     17064 17057  0 22:03 ?        00:00:00 heartbeat: read: ucast eth0
root     17065 17057  0 22:03 ?        00:00:00 heartbeat: write: serial /dev/ttyS0
root     17066 17057  0 22:03 ?        00:00:00 heartbeat: read: serial /dev/ttyS0
101      17070 17057  0 22:04 ?        00:00:00 /usr/lib64/heartbeat/ccm
101      17071 17057  0 22:04 ?        00:00:00 /usr/lib64/heartbeat/cib
root     17072 17057  0 22:04 ?        00:00:00 /usr/lib64/heartbeat/lrmd -r
root     17073 17057  0 22:04 ?        00:00:00 /usr/lib64/heartbeat/stonithd
101      17074 17057  0 22:04 ?        00:00:00 /usr/lib64/heartbeat/attrd
root     17506 16920  0 22:35 pts/0    00:00:00 grep heart

Regards
Markus