[Pacemaker] stonithd dumps core since 1.0.0

Tue Oct 14 13:31:01 UTC 2008

On Oct 14, 2008, at 3:15 PM, Roderick van Domburg wrote:

> Hello everyone,
>
> We have been running cman+gfs2 and heartbeat+pacemaker  
> simultaneously on our systems. This worked great until we updated to  
> heartbeat-2.99.2 and pacemaker-1.0.0 yesterday, which crashes while  
> calling is_openais_cluster(). Previously we ran heartbeat-2.99.1 and  
> pacemaker-0.7.3 successfully.

Not so much a core dump (unexpected termination) as an assertion  
failure (a self-initiated "let's get out of here NOW").
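
To illustrate the distinction, here is a minimal sketch in Python (not Pacemaker's actual code, and POSIX-only). It assumes a fatal assert ends in an abort() call, which is consistent with the "killed by signal 6 [SIGABRT - Abort]" line in the log. The child script and the run_child() helper are hypothetical names made up for this example.

```python
import signal
import subprocess
import sys

# Hypothetical stand-in for a daemon that hits a fatal assertion:
# it reports the problem and then deliberately aborts itself.
CHILD = (
    "import os, sys\n"
    "sys.stderr.write('fatal assert triggered\\n')\n"
    "sys.stderr.flush()\n"
    "os.abort()\n"  # raises SIGABRT in the child process
)


def run_child():
    """Run the child and return (signal name or None, stderr text)."""
    proc = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
    )
    # On POSIX, a negative return code -N means "killed by signal N".
    if proc.returncode < 0:
        return signal.Signals(-proc.returncode).name, proc.stderr
    return None, proc.stderr


if __name__ == "__main__":
    name, err = run_child()
    print(f"child ended by {name}; it said: {err.strip()}")
```

The supervising process only sees "child died from SIGABRT" (and possibly a core file), so from the outside a deliberate abort can look much like a crash.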

What you're seeing is me in the middle of refreshing all the  
packages... specifically, I haven't yet enabled Heartbeat support in the  
Pacemaker packages, which is why you're seeing:

> Oct 14 14:50:55 node1 stonithd: [1492]: ERROR: crm_abort:  
> is_heartbeat_cluster: Triggered fatal assert at utils.c:1626 :  
> is_openais_cluster()

Which is basically Pacemaker saying "You're trying to run me on top of  
Heartbeat and I wasn't built to support that".
A saner error might not be a bad idea.
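
As a rough idea of what "saner" could mean, here is a small Python sketch, purely illustrative and not what Pacemaker does or will do. The names BUILT_WITH, UnsupportedStackError and require_stack are hypothetical; the point is only that the check names the stack it found and the stacks it was built for, instead of tripping a bare assertion.

```python
class UnsupportedStackError(RuntimeError):
    """Raised when the running cluster stack was not compiled in."""


# Hypothetical: the stacks this particular build was compiled to support.
BUILT_WITH = frozenset({"openais"})


def require_stack(stack, built_with=BUILT_WITH):
    """Return `stack` if it is supported, else raise a descriptive error."""
    if stack in built_with:
        return stack
    raise UnsupportedStackError(
        f"running on '{stack}', but this build only supports: "
        f"{', '.join(sorted(built_with))}; "
        f"rebuild with '{stack}' support enabled"
    )


if __name__ == "__main__":
    try:
        require_stack("heartbeat")
    except UnsupportedStackError as exc:
        # A clear log line and a clean exit instead of an abort.
        print(f"ERROR: {exc}")
```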

I'll go enable Heartbeat support now.

>
>
> I'll post this to the linux-ha list too.
>
> /var/log/messages:
>
> Oct 14 14:49:55 node1 logd: [1455]: info: logd started with default  
> configuration.
> Oct 14 14:49:55 node1 logd: [1463]: info: G_main_add_SignalHandler:  
> Added signal handler for signal 15
> Oct 14 14:49:55 node1 logd: [1455]: info: G_main_add_SignalHandler:  
> Added signal handler for signal 15
> Oct 14 14:49:55 node1 heartbeat: [1479]: info: Enabling logging daemon
> Oct 14 14:49:55 node1 heartbeat: [1479]: info: logfile and debug  
> file are those specified in logd config file (default /etc/logd.cf)
> Oct 14 14:49:55 node1 heartbeat: [1479]: info: ******************
> Oct 14 14:49:55 node1 heartbeat: [1479]: info: Configuration  
> validated. Starting heartbeat 2.99.2
> Oct 14 14:49:55 node1 heartbeat: [1480]: info: heartbeat: version  
> 2.99.2
> Oct 14 14:49:55 node1 heartbeat: [1480]: info: Heartbeat generation:  
> 1219055953
> Oct 14 14:49:55 node1 heartbeat: [1480]: info: glib: UDP multicast  
> heartbeat started for group 239.0.0.45 port 694 interface eth0  
> (ttl=1 loop=0)
> Oct 14 14:49:55 node1 heartbeat: [1480]: info:  
> G_main_add_TriggerHandler: Added signal manual handler
> Oct 14 14:49:55 node1 heartbeat: [1480]: info:  
> G_main_add_TriggerHandler: Added signal manual handler
> Oct 14 14:49:55 node1 heartbeat: [1480]: notice: Using watchdog  
> device: /dev/watchdog
> Oct 14 14:49:55 node1 heartbeat: [1480]: info:  
> G_main_add_SignalHandler: Added signal handler for signal 17
> Oct 14 14:49:55 node1 heartbeat: [1480]: info: Local status now set  
> to: 'up'
> Oct 14 14:50:55 node1 heartbeat: [1480]: WARN: node node2: is dead
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Comm_now_up():  
> updating status to active
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Local status now set  
> to: 'active'
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client  
> "/usr/lib64/heartbeat/ccm" (498,496)
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client  
> "/usr/lib64/heartbeat/cib" (498,496)
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client  
> "/usr/lib64/heartbeat/lrmd -r" (0,0)
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client  
> "/usr/lib64/heartbeat/stonithd" (0,0)
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client  
> "/usr/lib64/heartbeat/attrd" (498,496)
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client  
> "/usr/lib64/heartbeat/crmd" (498,496)
> Oct 14 14:50:55 node1 heartbeat: [1489]: info: Starting
> "/usr/lib64/heartbeat/ccm" as uid 498  gid 496 (pid 1489)
> Oct 14 14:50:55 node1 heartbeat: [1492]: info: Starting
> "/usr/lib64/heartbeat/stonithd" as uid 0  gid 0 (pid 1492)
> Oct 14 14:50:55 node1 heartbeat: [1491]: info: Starting
> "/usr/lib64/heartbeat/lrmd -r" as uid 0  gid 0 (pid 1491)
> Oct 14 14:50:55 node1 heartbeat: [1493]: info: Starting
> "/usr/lib64/heartbeat/attrd" as uid 498  gid 496 (pid 1493)
> Oct 14 14:50:55 node1 heartbeat: [1490]: info: Starting
> "/usr/lib64/heartbeat/cib" as uid 498  gid 496 (pid 1490)
> Oct 14 14:50:55 node1 heartbeat: [1494]: info: Starting
> "/usr/lib64/heartbeat/crmd" as uid 498  gid 496 (pid 1494)
> Oct 14 14:50:55 node1 lrmd: [1491]: info: G_main_add_SignalHandler:  
> Added signal handler for signal 15
> Oct 14 14:50:55 node1 stonithd: [1492]: info:  
> G_main_add_SignalHandler: Added signal handler for signal 10
> Oct 14 14:50:55 node1 stonithd: [1492]: info:  
> G_main_add_SignalHandler: Added signal handler for signal 12
> Oct 14 14:50:55 node1 cib: [1490]: info: G_main_add_SignalHandler:  
> Added signal handler for signal 15
> Oct 14 14:50:55 node1 cib: [1490]: info: G_main_add_TriggerHandler:  
> Added signal manual handler
> Oct 14 14:50:55 node1 cib: [1490]: info: G_main_add_SignalHandler:  
> Added signal handler for signal 17
> Oct 14 14:50:55 node1 attrd: [1493]: info: G_main_add_SignalHandler:  
> Added signal handler for signal 15
> Oct 14 14:50:55 node1 attrd: [1493]: info: main: Starting up....
> Oct 14 14:50:55 node1 attrd: [1493]: ERROR: main: HA Signon failed
> Oct 14 14:50:55 node1 attrd: [1493]: ERROR: main: Aborting startup
> Oct 14 14:50:55 node1 heartbeat: [1480]: WARN: Managed
> /usr/lib64/heartbeat/attrd process 1493 exited with return code 100.
> Oct 14 14:50:55 node1 ccm: [1489]: info: Hostname: node1
> Oct 14 14:50:55 node1 crmd: [1494]: info: main: CRM Hg Version:  
> node: 9a6c6d1dd87154b11fdf9ff7fadf5fd33500bca4
> Oct 14 14:50:55 node1 crmd: [1494]: info: crmd_init: Starting crmd
> Oct 14 14:50:55 node1 crmd: [1494]: info: G_main_add_SignalHandler:  
> Added signal handler for signal 15
> Oct 14 14:50:55 node1 crmd: [1494]: info: G_main_add_TriggerHandler:  
> Added signal manual handler
> Oct 14 14:50:55 node1 crmd: [1494]: info: G_main_add_SignalHandler:  
> Added signal handler for signal 17
> Oct 14 14:50:55 node1 stonithd: [1492]: ERROR: crm_abort:  
> is_heartbeat_cluster: Triggered fatal assert at utils.c:1626 :  
> is_openais_cluster()
> Oct 14 14:50:55 node1 cib: [1490]: info: retrieveCib: Reading
> cluster configuration from: /var/lib/heartbeat/crm/cib.xml
> (digest: /var/lib/heartbeat/crm/cib.xml.sig)
> Oct 14 14:50:55 node1 lrmd: [1491]: info: G_main_add_SignalHandler:  
> Added signal handler for signal 17
> Oct 14 14:50:55 node1 lrmd: [1491]: info: G_main_add_SignalHandler:  
> Added signal handler for signal 10
> Oct 14 14:50:55 node1 lrmd: [1491]: info: G_main_add_SignalHandler:  
> Added signal handler for signal 12
> Oct 14 14:50:55 node1 lrmd: [1491]: info: Started.
> Oct 14 14:50:55 node1 heartbeat: [1480]: WARN: Managed
> /usr/lib64/heartbeat/stonithd process 1492 killed by signal 6 [SIGABRT - Abort].
> Oct 14 14:50:55 node1 heartbeat: [1480]: ERROR: Managed
> /usr/lib64/heartbeat/stonithd process 1492 dumped core
> Oct 14 14:50:55 node1 heartbeat: [1480]: ERROR: Respawning client
> "/usr/lib64/heartbeat/stonithd":
> Oct 14 14:50:55 node1 heartbeat: [1480]: info: Starting child client  
> "/usr/lib64/heartbeat/stonithd" (0,0)
> Oct 14 14:50:56 node1 cib: [1490]: info: startCib: CIB  
> Initialization completed successfully
> Oct 14 14:50:56 node1 cib: [1490]: CRIT: cib_init: Cannot sign in to  
> the cluster... terminating
> Oct 14 14:50:56 node1 heartbeat: [1480]: WARN: Managed
> /usr/lib64/heartbeat/cib process 1490 exited with return code 100.
> Oct 14 14:50:56 node1 heartbeat: [1480]: EMERG: Rebooting system.   
> Reason: /usr/lib64/heartbeat/cib
> Oct 14 14:50:56 node1 crmd: [1494]: WARN: do_cib_control: Couldn't  
> complete CIB registration 1 times... pause and retry
> Oct 14 14:50:56 node1 crmd: [1494]: info: crmd_init: Starting crmd's  
> mainloop
> Oct 14 14:50:56 node1 heartbeat: [1495]: info: Starting
> "/usr/lib64/heartbeat/stonithd" as uid 0  gid 0 (pid 1495)
> Oct 14 14:50:56 node1 stonithd: [1495]: info:  
> G_main_add_SignalHandler: Added signal handler for signal 10
> Oct 14 14:50:56 node1 stonithd: [1495]: info:  
> G_main_add_SignalHandler: Added signal handler for signal 12
> Oct 14 14:50:56 node1 stonithd: [1495]: ERROR: crm_abort:  
> is_heartbeat_cluster: Triggered fatal assert at utils.c:1626 :  
> is_openais_cluster()
> Oct 14 14:50:57 node1 kernel: md: stopping all md devices.
> Oct 14 14:51:17 node1 syslogd 1.4.1: restart.
>
> This occurs no matter whether cman and openais are running or not.
>
> I have attached the coredump.
> Version information:
>
> - CentOS 5.2 x86_64 (2.6.18-92.1.13.el5xen)
> - heartbeat-common.x86_64 2.99.2-21.1
> - heartbeat-resources.x86_64 2.99.2-21.1
> - heartbeat.x86_64 2.99.2-21.1
> - libheartbeat2.x86_64 2.99.2-21.1
> - pacemaker.x86_64 1.0.0-1.6
> - libpacemaker3.x86_64 1.0.0-1.6
> - openais.x86_64 0.80.3-19.1
> - cman.x86_64 2.0.84-2.el5_2.1
>
> ha.cf:
>
> autojoin none
> mcast eth0 239.0.0.45 694 1 0
> warntime 15
> deadtime 60
> initdead 60
> keepalive 3
> node node1
> node node2
> crm on
> watchdog /dev/watchdog
> use_logd on
>
> openais.conf:
>
> totem {
> 	version: 2
> 	secauth: on
> 	threads: 1
> 	heartbeat_failures_allowed: 3
> 	interface {
> 		ringnumber: 0
> 		bindnetaddr: 10.0.3.1
> 		mcastaddr: 239.0.0.45
> 		mcastport: 5405
> 	}
> }
>
> logging {
> 	debug: off
> 	timestamp: on
> }
>
> amf {
> 	mode: disabled
> }
>
> I have tried switching either one to another IP, but to no avail.
> Any insights into this behavior?
>
> Kind regards,
>
> Roderick
> <core.1492>