<div dir="ltr">Hi, <div><br></div><div>I am working on a classic corosync+pacemaker Linux-HA cluster (2 servers). After rebooting one server, corosync is running when it comes back, but pacemaker is dead. </div><div><br></div><div>In corosync.log, we can see the following: </div>
<div>--------------------------------------------------------</div><div><div>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com">foo.bar.com</a> crmd: info: crmd_exit: <span class="" style="white-space:pre">        </span>Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]</div>
<div>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com">foo.bar.com</a> crmd: debug: lrm_state_verify_stopped: <span class="" style="white-space:pre">        </span>Checking for active resources before exit</div><div>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com">foo.bar.com</a> crmd: info: crmd_cs_destroy: <span class="" style="white-space:pre">        </span>connection closed</div>
<div>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com">foo.bar.com</a> crmd: info: crmd_init: <span class="" style="white-space:pre">        </span>Inhibiting automated respawn</div><div><b>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com">foo.bar.com</a> crmd: info: crmd_init: <span class="" style="white-space:pre">        </span>2068 stopped: Network is down (100)</b></div>
<div><b>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com">foo.bar.com</a> crmd: warning: crmd_fast_exit: <span class="" style="white-space:pre">        </span>Inhibiting respawn: 100 -> 100</b></div><div>Jul 17 03:56:04 [2068] <a href="http://foo.bar.com">foo.bar.com</a> crmd: info: crm_xml_cleanup: <span class="" style="white-space:pre">        </span>Cleaning up memory from libxml2</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: debug: qb_ipcs_dispatch_connection_request: <span class="" style="white-space:pre">        </span>HUP conn (2057-2068-14)</div><div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: debug: qb_ipcs_disconnect: <span class="" style="white-space:pre">        </span>qb_ipcs_disconnect(2057-2068-14) state:2</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: info: crm_client_destroy: <span class="" style="white-space:pre">        </span>Destroying 0 events</div><div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: debug: qb_rb_close: <span class="" style="white-space:pre">        </span>Free'ing ringbuffer: /dev/shm/qb-pacemakerd-response-2057-2068-14-header</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: debug: qb_rb_close: <span class="" style="white-space:pre">        </span>Free'ing ringbuffer: /dev/shm/qb-pacemakerd-event-2057-2068-14-header</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: debug: qb_rb_close: <span class="" style="white-space:pre">        </span>Free'ing ringbuffer: /dev/shm/qb-pacemakerd-request-2057-2068-14-header</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: error: pcmk_child_exit: <span class="" style="white-space:pre">        </span>Child process crmd (2068) exited: Network is down (100)</div><div>
Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: warning: pcmk_child_exit: <span class="" style="white-space:pre">        </span>Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: debug: update_node_processes: <span class="" style="white-space:pre">        </span>Node <a href="http://foo.bar.com">foo.bar.com</a> now has process list: 00000000000000000000000000111112 (was 00000000000000000000000000111312)</div>
<div><b>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: notice: pcmk_shutdown_worker: <span class="" style="white-space:pre">        </span>Shuting down Pacemaker</b></div><div><b>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: debug: pcmk_shutdown_worker: <span class="" style="white-space:pre">        </span>crmd confirmed stopped</b></div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: notice: stop_child: <span class="" style="white-space:pre">        </span>Stopping pengine: Sent -15 to process 2067</div><div>Jul 17 03:56:04 [2067] <a href="http://foo.bar.com">foo.bar.com</a> pengine: info: crm_signal_dispatch: <span class="" style="white-space:pre">        </span>Invoking handler for signal 15: Terminated</div>
<div>Jul 17 03:56:04 [2067] <a href="http://foo.bar.com">foo.bar.com</a> pengine: info: qb_ipcs_us_withdraw: <span class="" style="white-space:pre">        </span>withdrawing server sockets</div></div><div><br></div><div>
<br></div><div><div>Jul 17 03:56:04 [2063] <a href="http://foo.bar.com">foo.bar.com</a> cib: debug: qb_ipcs_unref: <span class="" style="white-space:pre">        </span>qb_ipcs_unref() - destroying</div><div>Jul 17 03:56:04 [2063] <a href="http://foo.bar.com">foo.bar.com</a> cib: info: crm_xml_cleanup: <span class="" style="white-space:pre">        </span>Cleaning up memory from libxml2</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: info: pcmk_child_exit: <span class="" style="white-space:pre">        </span>Child process cib (2063) exited: OK (0)</div><div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: debug: update_node_processes: <span class="" style="white-space:pre">        </span>Node <a href="http://foo.bar.com">foo.bar.com</a> now has process list: 00000000000000000000000000000002 (was 00000000000000000000000000000102)</div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: warning: qb_ipcs_event_sendv: <span class="" style="white-space:pre">        </span>new_event_notification (2057-2063-13): Broken pipe (32)</div>
<div><b>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: debug: pcmk_shutdown_worker: <span class="" style="white-space:pre">        </span>cib confirmed stopped</b></div><div><b>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: notice: pcmk_shutdown_worker: <span class="" style="white-space:pre">        </span>Shutdown complete</b></div>
<div>Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: notice: pcmk_shutdown_worker: <span class="" style="white-space:pre">        </span>Attempting to inhibit respawning after fatal error</div><div>
Jul 17 03:56:04 [2057] <a href="http://foo.bar.com">foo.bar.com</a> pacemakerd: info: crm_xml_cleanup: <span class="" style="white-space:pre">        </span>Cleaning up memory from libxml2</div><div>Jul 17 03:56:04 corosync [CPG ] exit_fn for conn=0x17e3a20</div>
<div>Jul 17 03:56:04 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)</div><div>Jul 17 03:56:04 corosync [CPG ] got procleave message from cluster node 433183754</div>
<div>Jul 17 03:56:07 corosync [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)</div><div><b>Jul 17 03:56:19 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)</b></div>
<div><b>Jul 17 03:56:19 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)</b></div></div><div>--------------------------------------------------------<br></div>
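<div>A minimal sketch, in Python, of how the child-exit lines in the log above can be pulled out to see which daemon exited first and with what code. The regular expression is an assumption based only on the lines quoted here (other Pacemaker versions may word the message differently), and find_child_exits is a hypothetical helper name: </div>

```python
import re

# Pattern assumed from the pcmk_child_exit lines quoted above.
# Groups: 1 = child name, 2 = pid, 3 = exit reason, 4 = exit code.
CHILD_EXIT = re.compile(
    r"pacemakerd:\s+\w+:\s+pcmk_child_exit:\s+"
    r"Child process (\S+) \((\d+)\) exited: (.+?) \((\d+)\)\s*$"
)


def find_child_exits(lines):
    """Return (child, pid, reason, code) for each child-exit line (hypothetical helper)."""
    exits = []
    for line in lines:
        m = CHILD_EXIT.search(line)
        if m:
            exits.append((m.group(1), int(m.group(2)), m.group(3), int(m.group(4))))
    return exits


# Two lines copied from the log above, with the HTML markup removed.
sample = [
    "Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: error: pcmk_child_exit: \t"
    "Child process crmd (2068) exited: Network is down (100)",
    "Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: info: pcmk_child_exit: \t"
    "Child process cib (2063) exited: OK (0)",
]

for child, pid, reason, code in find_child_exits(sample):
    print(child, pid, reason, code)
```

<div>Run against the full corosync.log, this would list every pacemakerd child exit in order, which makes it easier to check whether crmd is always the first to go. </div>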
<div><br></div><div>Here are my HA cluster parameters and package versions: </div><div>--------------------------------------------------------<br></div><div><div>property cib-bootstrap-options: \</div><div> dc-version=1.1.10-1.el6_4.4-368c726 \</div>
<div> cluster-infrastructure="classic openais (with plugin)" \</div><div> expected-quorum-votes=2 \</div><div> stonith-enabled=false \</div><div> no-quorum-policy=ignore \</div><div> start-failure-is-fatal=false \</div>
<div> default-action-timeout=300s</div><div>rsc_defaults rsc-options: \</div><div> resource-stickiness=100</div></div><div><br></div><div><div><br></div><div>pacemaker-1.1.10-1.el6_4.4.x86_64</div><div>corosync-1.4.1-15.el6_4.1.x86_64</div>
<div><br></div></div><div>--------------------------------------------------------</div><div><br></div><div>I am not sure whether the network had a momentary disconnection (both servers are VMware VMs), but the logs appear to show that. </div><div>
So is an unexpected network issue the root cause? As I understand it, that is exactly what HA should handle. </div><div>Or is there any other clue about the root cause? </div><div><br></div><div>Many thanks, </div><div>Emre</div>
</div>