[Pacemaker] stonithd dumps core since 1.0.0
Andrew Beekhof
beekhof at gmail.com
Tue Oct 14 13:31:01 UTC 2008
On Oct 14, 2008, at 3:15 PM, Roderick van Domburg wrote:
> Hello everyone,
>
> We have been running cman+gfs2 and heartbeat+pacemaker
> simultaneously on our systems. This worked great until we updated to
> heartbeat-2.99.2 and pacemaker-1.0.0 yesterday, which crashes while
> calling is_openais_cluster(). Previously we ran heartbeat-2.99.1 and
> pacemaker-0.7.3 successfully.
Not so much a core dump (unexpected termination) as an assertion
failure (self-initiated "let's get out of here NOW").
What you're seeing is me in the middle of refreshing all the
packages. Specifically, I haven't yet enabled Heartbeat support in the
Pacemaker packages, which is why you're seeing:
> Oct 14 14:50:55 node1 stonithd: [1492]: ERROR: crm_abort:
> is_heartbeat_cluster: Triggered fatal assert at utils.c:1626 :
> is_openais_cluster()
Which is basically Pacemaker saying "You're trying to run me on top of
Heartbeat and I wasn't built to support that".
A saner error might not be a bad idea.
I'll go enable Heartbeat support now.
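
To illustrate the difference between the two failure styles, here is a
minimal C sketch. It is not the actual Pacemaker code: the
SUPPORT_HEARTBEAT macro and the functions stack_supported(),
require_stack_fatal() and require_stack_checked() are hypothetical names
made up for this example, and it assumes OpenAIS support is always
compiled in.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical build-time switch: 1 if Heartbeat support was compiled in.
 * It defaults to 0 here to mimic a package built without it. */
#ifndef SUPPORT_HEARTBEAT
#define SUPPORT_HEARTBEAT 0
#endif

enum stack { STACK_HEARTBEAT, STACK_OPENAIS };

/* Returns 1 if this build can run on the given cluster stack.
 * Assumption: OpenAIS support is always present in this sketch. */
int stack_supported(enum stack s)
{
    if (s == STACK_HEARTBEAT) {
        return SUPPORT_HEARTBEAT;
    }
    return 1;
}

/* Fatal-assert style: abort() raises SIGABRT, so the parent process
 * reports "killed by signal 6" and a core file is normally written. */
void require_stack_fatal(enum stack s)
{
    if (!stack_supported(s)) {
        fprintf(stderr, "Triggered fatal assert: stack_supported()\n");
        abort();
    }
}

/* Saner style: explain the problem and return -1 so the caller can
 * exit with an ordinary error code instead of aborting. */
int require_stack_checked(enum stack s)
{
    if (!stack_supported(s)) {
        fprintf(stderr,
                "ERROR: this build has no Heartbeat support; "
                "rebuild with it enabled or use the OpenAIS stack\n");
        return -1;
    }
    return 0;
}
```

On a build like this, require_stack_checked(STACK_HEARTBEAT) prints the
message and returns -1, whereas require_stack_fatal(STACK_HEARTBEAT)
ends the process with SIGABRT.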
>
>
> I'll post this to the linux-ha list too.
>
> /var/log/messages:
>
> Oct 14 14:49:55 node1 logd: [1455]: info: logd started with default
> configuration.
> Oct 14 14:49:55 node1 logd: [1463]: info: G_main_add_SignalHandler:
> Added signal handler for signal 15
> Oct 14 14:49:55 node1 logd: [1455]: info: G_main_add_SignalHandler:
> Added signal handler for signal 15
> Oct 14 14:49:55 node1 heartbeat: [1479]: info: Enabling logging daemon
> Oct 14 14:49:55 node1 heartbeat: [1479]: info: logfile and debug
> file are those specified in logd config file (default /etc/logd.cf)
> Oct 14 14:49:55 node1 heartbeat: [1479]: info: ******************
> Oct 14 14:49:55 node1 heartbeat: [1479]: info: Configuration
> validated. Starting heartbeat 2.99.2
> Oct 14 14:49:55 node1 heartbeat: [1480]: info: heartbeat: version
> 2.99.2
> Oct 14 14:49:55 node1 heartbeat: [1480]: info: Heartbeat generation:
> 1219055953
> Oct 14 14:49:55 node1 heartbeat: [1480]: info: glib: UDP multicast
> heartbeat started for group 239.0.0.45 port 694 interface eth0
> (ttl=1 loop=0)
> Oct 14 14:49:55 node1 heartbeat: [1480]: info:
> G_main_add_TriggerHandler: Added signal manual handler
> Oct 14 14:49:55 node1 heartbeat: [1480]: info:
> G_main_add_TriggerHandler: Added signal manual handler
> Oct 14 14:49:55 node1 heartbeat: [1480]: notice: Using watchdog
> device: /dev/watchdog
> Oct 14 14:49:55 node1 heartbeat: [1480]: info:
> G_main_add_SignalHandler: Added signal handler for signal 17
> Oct 14 14:49:55 node1 heartbeat: [1480]: info: Local status now set
> to: 'up'
> Oct 14 14:50:55 node1 heartbeat: [1480]: WARN: node node2: is dead
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Comm_now_up():
> updating status to active
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Local status now set
> to: 'active'
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client
> "/usr/lib64/heartbeat/ccm" (498,496)
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client
> "/usr/lib64/heartbeat/cib" (498,496)
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client
> "/usr/lib64/heartbeat/lrmd -r" (0,0)
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client
> "/usr/lib64/heartbeat/stonithd" (0,0)
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client
> "/usr/lib64/heartbeat/attrd" (498,496)
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client
> "/usr/lib64/heartbeat/crmd" (498,496)
> Oct 14 14:50:55 node1 heartbeat: [1489]: info: Starting "/usr/lib64/
> heartbeat/ccm" as uid 498 gid 496 (pid 1489)
> Oct 14 14:50:55 node1 heartbeat: [1492]: info: Starting "/usr/lib64/
> heartbeat/stonithd" as uid 0 gid 0 (pid 1492)
> Oct 14 14:50:55 node1 heartbeat: [1491]: info: Starting "/usr/lib64/
> heartbeat/lrmd -r" as uid 0 gid 0 (pid 1491)
> Oct 14 14:50:55 node1 heartbeat: [1493]: info: Starting "/usr/lib64/
> heartbeat/attrd" as uid 498 gid 496 (pid 1493)
> Oct 14 14:50:55 node1 heartbeat: [1490]: info: Starting "/usr/lib64/
> heartbeat/cib" as uid 498 gid 496 (pid 1490)
> Oct 14 14:50:55 node1 heartbeat: [1494]: info: Starting "/usr/lib64/
> heartbeat/crmd" as uid 498 gid 496 (pid 1494)
> Oct 14 14:50:55 node1 lrmd: [1491]: info: G_main_add_SignalHandler:
> Added signal handler for signal 15
> Oct 14 14:50:55 node1 stonithd: [1492]: info:
> G_main_add_SignalHandler: Added signal handler for signal 10
> Oct 14 14:50:55 node1 stonithd: [1492]: info:
> G_main_add_SignalHandler: Added signal handler for signal 12
> Oct 14 14:50:55 node1 cib: [1490]: info: G_main_add_SignalHandler:
> Added signal handler for signal 15
> Oct 14 14:50:55 node1 cib: [1490]: info: G_main_add_TriggerHandler:
> Added signal manual handler
> Oct 14 14:50:55 node1 cib: [1490]: info: G_main_add_SignalHandler:
> Added signal handler for signal 17
> Oct 14 14:50:55 node1 attrd: [1493]: info: G_main_add_SignalHandler:
> Added signal handler for signal 15
> Oct 14 14:50:55 node1 attrd: [1493]: info: main: Starting up....
> Oct 14 14:50:55 node1 attrd: [1493]: ERROR: main: HA Signon failed
> Oct 14 14:50:55 node1 attrd: [1493]: ERROR: main: Aborting startup
> Oct 14 14:50:55 node1 heartbeat: [1480]: WARN: Managed /usr/lib64/
> heartbeat/attrd process 1493 exited with return code 100.
> Oct 14 14:50:55 node1 ccm: [1489]: info: Hostname: node1
> Oct 14 14:50:55 node1 crmd: [1494]: info: main: CRM Hg Version:
> node: 9a6c6d1dd87154b11fdf9ff7fadf5fd33500bca4
> Oct 14 14:50:55 node1 crmd: [1494]: info: crmd_init: Starting crmd
> Oct 14 14:50:55 node1 crmd: [1494]: info: G_main_add_SignalHandler:
> Added signal handler for signal 15
> Oct 14 14:50:55 node1 crmd: [1494]: info: G_main_add_TriggerHandler:
> Added signal manual handler
> Oct 14 14:50:55 node1 crmd: [1494]: info: G_main_add_SignalHandler:
> Added signal handler for signal 17
> Oct 14 14:50:55 node1 stonithd: [1492]: ERROR: crm_abort:
> is_heartbeat_cluster: Triggered fatal assert at utils.c:1626 :
> is_openais_cluster()
> Oct 14 14:50:55 node1 cib: [1490]: info: retrieveCib: Reading
> cluster configuration from: /var/lib/heartbeat/crm/cib.xml (digest: /
> var/lib/heartbeat/crm/cib.xml.sig)
> Oct 14 14:50:55 node1 lrmd: [1491]: info: G_main_add_SignalHandler:
> Added signal handler for signal 17
> Oct 14 14:50:55 node1 lrmd: [1491]: info: G_main_add_SignalHandler:
> Added signal handler for signal 10
> Oct 14 14:50:55 node1 lrmd: [1491]: info: G_main_add_SignalHandler:
> Added signal handler for signal 12
> Oct 14 14:50:55 node1 lrmd: [1491]: info: Started.
> Oct 14 14:50:55 node1 heartbeat: [1480]: WARN: Managed /usr/lib64/
> heartbeat/stonithd process 1492 killed by signal 6 [SIGABRT - Abort].
> Oct 14 14:50:55 node1 heartbeat: [1480]: ERROR: Managed /usr/lib64/
> heartbeat/stonithd process 1492 dumped core
> Oct 14 14:50:55 node1 heartbeat: [1480]: ERROR: Respawning client "/
> usr/lib64/heartbeat/stonithd":
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client
> "/usr/lib64/heartbeat/stonithd" (0,0)
> Oct 14 14:50:56 node1 cib: [1490]: info: startCib: CIB
> Initialization completed successfully
> Oct 14 14:50:56 node1 cib: [1490]: CRIT: cib_init: Cannot sign in to
> the cluster... terminating
> Oct 14 14:50:56 node1 heartbeat: [1480]: WARN: Managed /usr/lib64/
> heartbeat/cib process 1490 exited with return code 100.
> Oct 14 14:50:56 node1 heartbeat: [1480]: EMERG: Rebooting system.
> Reason: /usr/lib64/heartbeat/cib
> Oct 14 14:50:56 node1 crmd: [1494]: WARN: do_cib_control: Couldn't
> complete CIB registration 1 times... pause and retry
> Oct 14 14:50:56 node1 crmd: [1494]: info: crmd_init: Starting crmd's
> mainloop
> Oct 14 14:50:56 node1 heartbeat: [1495]: info: Starting "/usr/lib64/
> heartbeat/stonithd" as uid 0 gid 0 (pid 1495)
> Oct 14 14:50:56 node1 stonithd: [1495]: info:
> G_main_add_SignalHandler: Added signal handler for signal 10
> Oct 14 14:50:56 node1 stonithd: [1495]: info:
> G_main_add_SignalHandler: Added signal handler for signal 12
> Oct 14 14:50:56 node1 stonithd: [1495]: ERROR: crm_abort:
> is_heartbeat_cluster: Triggered fatal assert at utils.c:1626 :
> is_openais_cluster()
> Oct 14 14:50:57 node1 kernel: md: stopping all md devices.
> Oct 14 14:51:17 node1 syslogd 1.4.1: restart.
>
> This occurs no matter whether cman and openais are running or not.
>
> I have attached the coredump.
> Version information:
>
> - CentOS 5.2 x86_64 (2.6.18-92.1.13.el5xen)
> - heartbeat-common.x86_64 2.99.2-21.1
> - heartbeat-resources.x86_64 2.99.2-21.1
> - heartbeat.x86_64 2.99.2-21.1
> - libheartbeat2.x86_64 2.99.2-21.1
> - pacemaker.x86_64 1.0.0-1.6
> - libpacemaker3.x86_64 1.0.0-1.6
> - openais.x86_64 0.80.3-19.1
> - cman.x86_64 2.0.84-2.el5_2.1
>
> ha.cf:
>
> autojoin none
> mcast eth0 239.0.0.45 694 1 0
> warntime 15
> deadtime 60
> initdead 60
> keepalive 3
> node node1
> node node2
> crm on
> watchdog /dev/watchdog
> use_logd on
>
> openais.conf:
>
> totem {
> version: 2
> secauth: on
> threads: 1
> heartbeat_failures_allowed: 3
> interface {
> ringnumber: 0
> bindnetaddr: 10.0.3.1
> mcastaddr: 239.0.0.45
> mcastport: 5405
> }
> }
>
> logging {
> debug: off
> timestamp: on
> }
>
> amf {
> mode: disabled
> }
>
> I have tried switching either one to another IP, but to no avail.
> Any insights into this behavior?
>
> Kind regards,
>
> Roderick
> <core.1492>