[Pacemaker] Pacemaker is down after server reboot; corosync.log shows "Network is down (100)", "Shuting down Pacemaker"
Emre He
emre.he at gmail.com
Fri Jul 18 04:40:56 CEST 2014
Attaching the corosync.conf:
------------------------------------
compatibility: whitetank

totem {
        version: 2
        token: 10000
        token_retransmits_before_loss_const: 10
        secauth: off
        threads: 0
        interface {
                ringnumber: 0
                member {
                        memberaddr: 10.0.0.1
                }
                member {
                        memberaddr: 10.0.0.2
                }
                bindnetaddr: 10.0.0.1
                mcastport: 5405
                ttl: 1
        }
        transport: udpu
}

logging {
        fileline: off
        to_stderr: no
        to_logfile: yes
        to_syslog: yes
        syslog_facility: local6
        syslog_priority: debug
        debug: on
        logfile: /var/log/cluster/corosync.log
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
        }
}

amf {
        mode: disabled
}

service {
        ver: 1
        name: pacemaker
}

aisexec {
        user: root
        group: root
}
-----------------------------------
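
Once the rebooted node is back up, ring and membership state can be checked with the stock corosync 1.x tools and crm_mon (a minimal sketch, run on the rebooted node):
-----------------------------------
# ring status of the local corosync instance
corosync-cfgtool -s

# dump the runtime object database and look for the udpu members
corosync-objctl | grep -i member

# cluster status as seen by Pacemaker (once pacemakerd is running again)
crm_mon -1
-----------------------------------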
2014-07-18 10:35 GMT+08:00 Emre He <emre.he at gmail.com>:
> Hi,
>
> I am working with a classic corosync+pacemaker Linux-HA cluster (2 servers).
> After rebooting one server, when it comes back up, corosync is running but
> pacemaker is dead.
>
> In corosync.log, we can see the following:
> --------------------------------------------------------
> Jul 17 03:56:04 [2068] foo.bar.com crmd: info: crmd_exit: Dropping I_TERMINATE: [ state=S_STOPPING cause=C_FSA_INTERNAL origin=do_stop ]
> Jul 17 03:56:04 [2068] foo.bar.com crmd: debug: lrm_state_verify_stopped: Checking for active resources before exit
> Jul 17 03:56:04 [2068] foo.bar.com crmd: info: crmd_cs_destroy: connection closed
> Jul 17 03:56:04 [2068] foo.bar.com crmd: info: crmd_init: Inhibiting automated respawn
> Jul 17 03:56:04 [2068] foo.bar.com crmd: info: crmd_init: 2068 stopped: Network is down (100)
> Jul 17 03:56:04 [2068] foo.bar.com crmd: warning: crmd_fast_exit: Inhibiting respawn: 100 -> 100
> Jul 17 03:56:04 [2068] foo.bar.com crmd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: qb_ipcs_dispatch_connection_request: HUP conn (2057-2068-14)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: qb_ipcs_disconnect: qb_ipcs_disconnect(2057-2068-14) state:2
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: info: crm_client_destroy: Destroying 0 events
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-pacemakerd-response-2057-2068-14-header
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-pacemakerd-event-2057-2068-14-header
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: qb_rb_close: Free'ing ringbuffer: /dev/shm/qb-pacemakerd-request-2057-2068-14-header
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: error: pcmk_child_exit: Child process crmd (2068) exited: Network is down (100)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: warning: pcmk_child_exit: Pacemaker child process crmd no longer wishes to be respawned. Shutting ourselves down.
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: update_node_processes: Node foo.bar.com now has process list: 00000000000000000000000000111112 (was 00000000000000000000000000111312)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: notice: pcmk_shutdown_worker: Shuting down Pacemaker
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: pcmk_shutdown_worker: crmd confirmed stopped
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: notice: stop_child: Stopping pengine: Sent -15 to process 2067
> Jul 17 03:56:04 [2067] foo.bar.com pengine: info: crm_signal_dispatch: Invoking handler for signal 15: Terminated
> Jul 17 03:56:04 [2067] foo.bar.com pengine: info: qb_ipcs_us_withdraw: withdrawing server sockets
>
> Jul 17 03:56:04 [2063] foo.bar.com cib: debug: qb_ipcs_unref: qb_ipcs_unref() - destroying
> Jul 17 03:56:04 [2063] foo.bar.com cib: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: info: pcmk_child_exit: Child process cib (2063) exited: OK (0)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: update_node_processes: Node foo.bar.com now has process list: 00000000000000000000000000000002 (was 00000000000000000000000000000102)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: warning: qb_ipcs_event_sendv: new_event_notification (2057-2063-13): Broken pipe (32)
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: debug: pcmk_shutdown_worker: cib confirmed stopped
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: notice: pcmk_shutdown_worker: Shutdown complete
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error
> Jul 17 03:56:04 [2057] foo.bar.com pacemakerd: info: crm_xml_cleanup: Cleaning up memory from libxml2
> Jul 17 03:56:04 corosync [CPG ] exit_fn for conn=0x17e3a20
> Jul 17 03:56:04 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
> Jul 17 03:56:04 corosync [CPG ] got procleave message from cluster node 433183754
> Jul 17 03:56:07 corosync [pcmk ] WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
> Jul 17 03:56:19 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
> Jul 17 03:56:19 corosync [pcmk ] WARN: route_ais_message: Sending message to local.stonith-ng failed: ipc delivery failed (rc=-2)
> --------------------------------------------------------
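>
> For what it's worth, exit status 100 corresponds to ENETDOWN ("Network is
> down") on Linux, so the message names that errno. Whether the link on this
> node really dropped around 03:56 can be cross-checked against the kernel log
> (a rough sketch; the interface name eth0 is an assumption):
>
> grep -iE 'link (is )?(up|down)' /var/log/messages   # link events logged by the NIC driver
> ip link show eth0                                    # current link/administrative state
> ethtool eth0 | grep -i 'link detected'               # current carrier state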
>
> Here are my HA cluster parameters and package versions:
> --------------------------------------------------------
> property cib-bootstrap-options: \
>         dc-version=1.1.10-1.el6_4.4-368c726 \
>         cluster-infrastructure="classic openais (with plugin)" \
>         expected-quorum-votes=2 \
>         stonith-enabled=false \
>         no-quorum-policy=ignore \
>         start-failure-is-fatal=false \
>         default-action-timeout=300s
> rsc_defaults rsc-options: \
>         resource-stickiness=100
>
>
> pacemaker-1.1.10-1.el6_4.4.x86_64
> corosync-1.4.1-15.el6_4.1.x86_64
>
> --------------------------------------------------------
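>
> Note that with "ver: 1" in the service block above, corosync does not start
> Pacemaker itself; pacemakerd has to come up as its own init service after
> corosync. A quick sanity check on the rebooted node (a sketch; RHEL 6 style
> init assumed):
>
> chkconfig --list | egrep 'corosync|pacemaker'   # are both enabled at boot?
> service corosync status
> service pacemaker status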
>
> I am not sure whether the network had a brief disconnection (both servers
> are VMware VMs), but that is what the logs seem to show.
> So is a transient network glitch really the root cause? My understanding is
> that this is exactly the kind of failure HA is supposed to handle.
> Or is there any other clue about the root cause?
>
> many thanks,
> Emre
>