[Pacemaker] SOLVED: node is unable to join cluster after upgrade (crmd dies)

Wed Jun 18 17:49:51 EDT 2014

Am 18.06.2014 um 10:17 schrieb Krause, Markus <krause at biochem.mpg.de>:

> 
> Am 18.06.2014 um 01:13 schrieb Andrew Beekhof <andrew at beekhof.net>:
> 
>> 
>> On 18 Jun 2014, at 2:59 am, Krause, Markus <krause at biochem.mpg.de> wrote:
>> 
>>> Hi all, 
>>> 
>>> I am using pacemaker/openais and drbd to have a high-available MySQL server which worked „for years“ (sind SLES 11 SP 0) without configuration chances. 
>>> Just two days ago I did an update (zypper up) which now leads to the issue described below.
>>> Although the subject of my email is quite similar to many others on this list I could not find an equivalent (so it seems) problem or a solution. I tried to provide as much information as seemed to be reasonable to me so I apologize in advance for the many lines!
>> 
>> This looks ominous:
>> 
>>  error: reap_dead_nodes: 	We're not part of the cluster anymore
>> 
>> It could be a software bug, but I'd start by checking the firewall.
>> Upgrades will sometimes affect those.
> 
> Hi Andrew, 
> 
> thanks for your response.
> I already turned off the firewall, currently on both nodes iptables -nvL shows (here for sql01a):
> ———————
> sql01a:~ # iptables -nvL
> Chain INPUT (policy ACCEPT 0 packets, 0 bytes)
>  pkts bytes target     prot opt in     out     source               destination         
> 
> Chain FORWARD (policy ACCEPT 0 packets, 0 bytes)
>  pkts bytes target     prot opt in     out     source               destination         
> 
> Chain OUTPUT (policy ACCEPT 0 packets, 0 bytes)
>  pkts bytes target     prot opt in     out     source               destination         
> sql01a:~ # 
> ———————
> 
> regards,
>   Markus

Hi all, 
as I noticed that some permissions on the first node (sql01a) looked „weird“ I decided to reinstall the whole system on this node and after copying all the configuration from the second host everything is working as expected without further changes. It seems that the update process messed up some permissions (and more?).

Thanks to all who answered to this issue for their time an assistance!

BTW: Is there a „clean" way to remove all cluster relevant settings (deleting dirs, configurations) apart from reinstalling all if something is wrong on only one host?

regards,
   Markus

> 
>> 
>>> 
>>> Two servers with identical configuration (hard and software) running SLES 11 SP3 (with HAE) have the following crm config:
>>> 
>>> ———————
>>> sql01a:~ # crm configure show
>>> node sql01a
>>> node sql01b
>>> primitive drbd_drbd0 ocf:linbit:drbd \
>>> 	params drbd_resource="drbd0" \
>>> 	op monitor interval="14" role="Master" \
>>> 	op monitor interval="16" role="Slave" \
>>> 	op start interval="0" timeout="240" \
>>> 	op stop interval="0" timeout="240"
>>> primitive ha_ip ocf:heartbeat:IPaddr2 \
>>> 	op monitor interval="5" timeout="20" \
>>> 	params ip="10.4.11.5" cidr_netmask="16" iflabel="ha" nic="eth0"
>>> primitive mount_drbd0 ocf:heartbeat:Filesystem \
>>> 	params device="/dev/drbd0" fstype="ext3" directory="/var/drbd0" \
>>> 	meta target-role="Started" \
>>> 	op start interval="0" timeout="60" \
>>> 	op stop interval="0" timeout="300"
>>> primitive mysqld ocf:heartbeat:mysql \
>>> 	params binary="/usr/sbin/mysqld" config="/etc/my.cnf" datadir="/var/drbd0/mysql/data" user="mysql" group="mysql" log="/var/log/mysql.log" pid="/var/lib/mysql/mysqld.pid" socket="/var/lib/mysql/mysql.sock" test_table="ha_check.mysql_check" test_user=„XXX" test_passwd=„XXX" enable_creation="no" \
>>> 	op monitor interval="10" timeout="60" \
>>> 	op start interval="0" timeout="240" \
>>> 	op stop interval="0" timeout="240"
>>> primitive ping_gateway ocf:pacemaker:ping \
>>> 	params dampen="10" multiplier="100" host_list="10.4.0.1" \
>>> 	op start interval="0" timeout="60" \
>>> 	op monitor interval="5" timeout="60"
>>> group group_mysqld mount_drbd0 mysqld ha_ip
>>> ms ms_drbd_drbd0 drbd_drbd0 \
>>> 	meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true"
>>> clone ping_clone ping_gateway
>>> location loc_drbdmaster_drbd0 ms_drbd_drbd0 \
>>> 	rule $id="loc_drbdmaster_ping-rule" $role="Master" pingd: defined pingd
>>> colocation col_ms_drbd_drbd0_mount_drbd0 inf: group_mysqld ms_drbd_drbd0:Master
>>> order ord_ms_drbd_drbd0_mount_drbd0 inf: ms_drbd_drbd0:promote group_mysqld:start
>>> property $id="cib-bootstrap-options" \
>>> 	dc-version="1.1.10-f3eeaf4" \
>>> 	cluster-infrastructure="classic openais (with plugin)" \
>>> 	expected-quorum-votes="2" \
>>> 	no-quorum-policy="ignore" \
>>> 	stonith-enabled="false“
>>> ———————
>>> 
>>> this worked since the first setup long ago.
>>> after the update (zypper up) two days ago only sql01b is running, sql01a is unable to join the cluster.
>>> 
>>> ———————
>>> sql01a:~ # crm status:
>>> Last updated: Tue Jun 17 18:20:29 2014
>>> Last change: Tue Jun 17 17:26:20 2014 by root via crm_attribute on sql01b
>>> Stack: classic openais (with plugin)
>>> Current DC: sql01b - partition with quorum
>>> Version: 1.1.10-f3eeaf4
>>> 2 Nodes configured, 2 expected votes
>>> 7 Resources configured
>>> 
>>> 
>>> Online: [ sql01b ]
>>> OFFLINE: [ sql01a ]
>>> 
>>> Clone Set: ping_clone [ping_gateway]
>>>    Started: [ sql01b ]
>>>    Stopped: [ sql01a ]
>>> Master/Slave Set: ms_drbd_drbd0 [drbd_drbd0]
>>>    Masters: [ sql01b ]
>>>    Stopped: [ sql01a ]
>>> Resource Group: group_mysqld
>>>    mount_drbd0	(ocf::heartbeat:Filesystem):	Started sql01b 
>>>    mysqld	(ocf::heartbeat:mysql):	Started sql01b 
>>>    ha_ip	(ocf::heartbeat:IPaddr2):	Started sql01b
>>> ———————
>>> 
>>> Starting the cluster tools on sql01a („rcopenais start“) leads to the following lines (snipped the „OK-looking“ lines at top as I enabled debugging in corosync.conf; if necessary i can of course post the whole output, but that’s a lot of lines ...) in /var/log/messages:
>>> 
>>> ——————— 
>>> [snip]
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [MAIN  ] main.c:278 Completed service synchronization, ready to provide service.
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [TOTEM ] totempg.c:292 waiting_trans_ack changed to 0
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a mgmtd: [14062]: debug: main: run the loop...
>>> Jun 17 18:20:18 sql01a mgmtd: [14062]: info: Started.
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:486 ERROR: pcmk_wait_dispatch: Child process cib exited (pid=14056, rc=100)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:490 notice: pcmk_wait_dispatch: Child process cib no longer wishes to be respawned
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] utils.c:340 info: update_member: Node sql01a now has process list: 00000000000000000000000000151212 (1380882)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:486 ERROR: pcmk_wait_dispatch: Child process crmd exited (pid=14061, rc=100)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:490 notice: pcmk_wait_dispatch: Child process crmd no longer wishes to be respawned
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] utils.c:340 info: update_member: Node sql01a now has process list: 00000000000000000000000000151012 (1380370)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:486 ERROR: pcmk_wait_dispatch: Child process attrd exited (pid=14059, rc=100)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:490 notice: pcmk_wait_dispatch: Child process attrd no longer wishes to be respawned
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] utils.c:340 info: update_member: Node sql01a now has process list: 00000000000000000000000000150012 (1376274)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1672 debug: send_cluster_id: Local update: id=17499146, born=1452, seq=1452
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:486 ERROR: pcmk_wait_dispatch: Child process pengine exited (pid=14060, rc=100)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:490 notice: pcmk_wait_dispatch: Child process pengine no longer wishes to be respawned
>>> [snip]
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1365 info: send_member_notification: Sending membership update 1452 to 1 children
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:962 debug: pcmk_cluster_id_callback: Node update: sql01a (1.1.10)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:962 debug: pcmk_cluster_id_callback: Node update: sql01a (1.1.10)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.cib failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.crmd failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:18 sql01a corosync[14051]:   [pcmk  ] plugin.c:1565 WARN: route_ais_message: Sending message to local.attrd failed: ipc delivery failed (rc=-2)
>>> Jun 17 18:20:22 sql01a sshd[14068]: Connection closed by 10.4.1.60 [preauth]
>>> Jun 17 18:20:23 sql01a corosync[14051]:   [CONFDB] confdb.c:405 lib_init_fn: conn=0x663570
>>> Jun 17 18:20:23 sql01a corosync[14051]:   [CONFDB] confdb.c:411 exit_fn for conn=0x663570
>>> Jun 17 18:20:27 sql01a stonith-ng[14057]:    error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
>>> Jun 17 18:20:27 sql01a stonith-ng[14057]:   notice: crm_update_peer_state: plugin_handle_membership: Node sql01a[17499146] - state is now member (was (null))
>>> Jun 17 18:20:27 sql01a stonith-ng[14057]:   notice: get_node_name: Could not obtain a node name for classic openais (with plugin) nodeid 34276362
>>> Jun 17 18:20:27 sql01a stonith-ng[14057]:   notice: crm_update_peer_state: plugin_handle_membership: Node (null)[34276362] - state is now member (was (null))
>>> Jun 17 18:20:27 sql01a stonith-ng[14057]:   notice: plugin_handle_membership: Membership 1452: quorum acquired
>>> [snip]
>>> ———————
>>> 
>>> some more info:
>>> ———————
>>> sql01a:~ # corosync-cfgtool -s
>>> Printing ring status.
>>> Local node ID 17499146
>>> RING ID 0
>>> 	id	= 10.4.11.1
>>> 	status	= ring 0 active with no faults
>>> RING ID 1
>>> 	id	= 10.255.4.1
>>> 	status	= ring 1 active with no faults
>>> sql01a:~ # corosync-objctl | grep member
>>> runtime.totem.pg.mrp.srp.members.17499146.ip=r(0) ip(10.4.11.1) r(1) ip(10.255.4.1) 
>>> runtime.totem.pg.mrp.srp.members.17499146.join_count=1
>>> runtime.totem.pg.mrp.srp.members.17499146.status=joined
>>> runtime.totem.pg.mrp.srp.members.34276362.ip=r(0) ip(10.4.11.2) r(1) ip(10.255.4.2) 
>>> runtime.totem.pg.mrp.srp.members.34276362.join_count=1
>>> runtime.totem.pg.mrp.srp.members.34276362.status=joined
>>> ———————
>>> This is similar to the other (working) host.
>>> 
>>> Interesting is:
>>> ———————
>>> sql01a:~ # rcopenais status
>>> Running
>>> sql01a:~ # ps ax | grep -E "pace|heart|coro" | grep -v grep
>>> 14051 ?        Ssl    0:00 /usr/sbin/corosync
>>> 14057 ?        S<     0:00 /usr/lib64/pacemaker/stonithd
>>> 14058 ?        S<     0:00 /usr/lib64/pacemaker/lrmd
>>> 14062 ?        S<     0:00 /usr/lib64/heartbeat/mgmtd
>>> sql01a:~ # 
>>> ———————
>>> 
>>> 
>>> For digging further what’s going wrong I then started the missing processes (cib, attrd, pengine and crmd) from the command line (using the -V switch). cib, attrd and pengine are starting fine (even crm now shows a result, although „wrong“) but crmd fails:
>>> 
>>> ———————
>>> sql01a:~ # /usr/lib64/pacemaker/crmd -V
>>> notice: main: 	CRM Git Version: f3eeaf4
>>>   info: do_log: 	FSA: Input I_STARTUP from crmd_init() received in state S_STARTING
>>>   info: config_find_next: 	Processing additional service options...
>>>   info: get_config_opt: 	Found 'openais_clm' for option: name
>>>   info: config_find_next: 	Processing additional service options...
>>>   info: get_config_opt: 	Found 'openais_evt' for option: name
>>>   info: config_find_next: 	Processing additional service options...
>>>   info: get_config_opt: 	Found 'openais_ckpt' for option: name
>>>   info: config_find_next: 	Processing additional service options...
>>>   info: get_config_opt: 	Found 'openais_amf_v2' for option: name
>>>   info: config_find_next: 	Processing additional service options...
>>>   info: get_config_opt: 	Found 'openais_msg' for option: name
>>>   info: config_find_next: 	Processing additional service options...
>>>   info: get_config_opt: 	Found 'openais_lck' for option: name
>>>   info: config_find_next: 	Processing additional service options...
>>>   info: get_config_opt: 	Found 'openais_tmr' for option: name
>>>   info: config_find_next: 	Processing additional service options...
>>>   info: get_config_opt: 	Found 'pacemaker' for option: name
>>>   info: get_config_opt: 	Found '0' for option: ver
>>>   info: get_cluster_type: 	Detected an active 'classic openais (with plugin)' cluster
>>>   info: do_cib_control: 	CIB connection established
>>> notice: crm_cluster_connect: 	Connecting to cluster infrastructure: classic openais (with plugin)
>>>   info: init_cs_connection_classic: 	Creating connection to our Corosync plugin
>>>   info: init_cs_connection_classic: 	AIS connection established
>>> notice: get_node_name: 	Defaulting to uname -n for the local classic openais (with plugin) node name
>>>   info: plugin_get_details: 	Server details: id=17499146 uname=sql01a cname=pcmk
>>>   info: crm_get_peer: 	Created entry 6002e998-b8ba-4daa-9c5d-1082bd1ab73a/0x65b8a0 for node sql01a/17499146 (1 total)
>>>   info: crm_get_peer: 	Node 17499146 is now known as sql01a
>>>   info: peer_update_callback: 	sql01a is now (null)
>>>   info: crm_get_peer: 	Node 17499146 has uuid sql01a
>>>   info: crm_update_peer_proc: 	init_cs_connection_classic: Node sql01a[17499146] - unknown is now online
>>>   info: peer_update_callback: 	Client sql01a/peer now has status [online] (DC=<null>)
>>>   info: init_cs_connection_once: 	Connection to 'classic openais (with plugin)': established
>>> notice: get_node_name: 	Defaulting to uname -n for the local classic openais (with plugin) node name
>>>   info: do_ha_control: 	Connected to the cluster
>>>   info: lrmd_ipc_connect: 	Connecting to lrmd
>>>   info: do_lrm_control: 	LRM connection established
>>>   info: do_started: 	Delaying start, no membership data (0000000000100000)
>>>   info: do_started: 	Delaying start, no membership data (0000000000100000)
>>>   info: do_started: 	Delaying start, no membership data (0000000000100000)
>>> notice: plugin_handle_membership: 	Membership 1460: quorum acquired
>>>   info: crm_update_peer_proc: 	plugin_handle_membership: Node sql01a[17499146] - unknown is now member
>>>   info: peer_update_callback: 	Client sql01a/peer now has status [offline] (DC=<null>)
>>> notice: peer_update_callback: 	Our peer connection failed
>>> notice: crm_update_peer_state: 	plugin_handle_membership: Node sql01a[17499146] - state is now member (was (null))
>>>   info: peer_update_callback: 	sql01a is now member (was (null))
>>>   info: crm_update_peer: 	plugin_handle_membership: Node sql01a: id=17499146 state=member addr=r(0) ip(10.4.11.1) r(1) ip(10.255.4.1) (new) votes=1 (new) born=1460 seen=1460 proc=00000000000000000000000000140012
>>>   info: crm_get_peer: 	Created entry 8925fb86-2b70-4c91-af03-a8251a9182fc/0x65ecf0 for node sql01b/34276362 (2 total)
>>>   info: crm_get_peer: 	Node 34276362 is now known as sql01b
>>>   info: peer_update_callback: 	sql01b is now (null)
>>>   info: crm_get_peer: 	Node 34276362 has uuid sql01b
>>>   info: crm_update_peer_proc: 	plugin_handle_membership: Node sql01b[34276362] - unknown is now member
>>>   info: peer_update_callback: 	Client sql01b/peer now has status [online] (DC=<null>)
>>> notice: crm_update_peer_state: 	plugin_handle_membership: Node sql01b[34276362] - state is now member (was (null))
>>>   info: peer_update_callback: 	sql01b is now member (was (null))
>>>   info: crm_update_peer: 	plugin_handle_membership: Node sql01b: id=34276362 state=member addr=r(0) ip(10.4.11.2) r(1) ip(10.255.4.2) (new) votes=1 (new) born=1436 seen=1460 proc=00000000000000000000000000151312
>>>  error: reap_dead_nodes: 	We're not part of the cluster anymore
>>>   info: plugin_handle_membership: 	Membership 1460: quorum retained
>>>  error: do_log: 	FSA: Input I_ERROR from peer_update_callback() received in state S_STARTING
>>> notice: do_state_transition: 	State transition S_STARTING -> S_RECOVERY [ input=I_ERROR cause=C_CRMD_STATUS_CALLBACK origin=peer_update_callback ]
>>> warning: do_recover: 	Fast-tracking shutdown in response to errors
>>>  error: do_started: 	Start cancelled... S_RECOVERY
>>>  error: do_log: 	FSA: Input I_ERROR from reap_dead_nodes() received in state S_RECOVERY
>>>   info: do_dc_release: 	DC role released
>>>   info: do_te_control: 	Transitioner is now inactive
>>>  error: do_log: 	FSA: Input I_TERMINATE from do_recover() received in state S_RECOVERY
>>>   info: do_state_transition: 	State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
>>> notice: lrm_state_verify_stopped: 	Stopped 0 recurring operations at shutdown (0 ops remaining)
>>>   info: do_lrm_control: 	Disconnecting from the LRM
>>>   info: lrmd_api_disconnect: 	Disconnecting from lrmd service
>>>   info: lrmd_ipc_connection_destroy: 	IPC connection destroyed
>>>   info: lrm_connection_destroy: 	LRM Connection disconnected
>>>   info: lrmd_api_disconnect: 	Disconnecting from lrmd service
>>> notice: do_lrm_control: 	Disconnected from the LRM
>>>   info: crm_cluster_disconnect: 	Disconnecting from cluster infrastructure: classic openais (with plugin)
>>> notice: terminate_cs_connection: 	Disconnecting from Corosync
>>>   info: cluster_disconnect_cpg: 	No CPG connection
>>>   info: crm_cluster_disconnect: 	Disconnected from classic openais (with plugin)
>>>   info: do_ha_control: 	Disconnected from the cluster
>>>   info: do_cib_control: 	Disconnecting CIB
>>>   info: crmd_cib_connection_destroy: 	Connection to the CIB terminated...
>>>   info: do_exit: 	Performing A_EXIT_0 - gracefully exiting the CRMd
>>>   info: do_exit: 	[crmd] stopped (0)
>>>   info: crmd_exit: 	Dropping I_RELEASE_SUCCESS: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_dc_release ]
>>>   info: crmd_exit: 	Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
>>>   info: crmd_cs_destroy: 	connection closed
>>>   info: crmd_init: 	13510 stopped: OK (0)
>>>  error: crmd_fast_exit: 	Could not recover from internal error
>>>   info: crm_xml_cleanup: 	Cleaning up memory from libxml2
>>> sql01a:~ # 
>>> ———————
>>> 
>>> what I tried up to now:
>>> - stop sql01a, remove all cib.xml and crm.xml in their folders and restart sql01a (I know this is quite harsh but what did I have to loose …) -> the config on sql01a has been taken from sql01b but the crmd always dies as before (same error messages)
>>> - I removed all cluster related packages (pacemaker, corosync, drbd, …) from sql01a and deleted the remaining dirs and config files, reinstalled all with copies from sql01b -> same result.
>>> - I removed the whole crm config from sql01b (the working cluster host) using cibadmin -Q —force and re-added the config (using crm „commands") -> same result
>>> - of course rebooted/restarted both servers several times -> nothing changes …
>>> 
>>> 
>>> As deleting the config from sql01a (several times) I thought the error must be in the config/communication on/with sql01b but I also removed and added the config there, in addition there is now firewall between them and there is communication.
>>> 
>>> Does anybody know/understand what is going on here?
>>> How can I solve this issue and get my cluster back to work as it did for quite a long time recently?
>>> 
>>> 
>>> thanks in advance for any hints!
>>> 
>>> best regards,
>>>  Markus
>>> 
>>> 
>>> 
>>> 
>>> -- .- .-. -.- ..- ... / -.- .-. .- ..- ... .
>>> Markus Krause
>>> Max Planck Institute of Biochemistry
>>> - Computing Center (core services, networking, OS X)
>>> Am Klopferspitz 18
>>> D-82152 Martinsried, Germany
>>> Phone: +49(89)8578-2825
>>> 
>>> _______________________________________________
>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>>> 
>>> Project Home: http://www.clusterlabs.org
>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>>> Bugs: http://bugs.clusterlabs.org
>> 
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> -- .- .-. -.- ..- ... / -.- .-. .- ..- ... .
> Markus Krause
> Max Planck Institute of Biochemistry
> - Computing Center (core services, networking, OS X)
> Am Klopferspitz 18
> D-82152 Martinsried, Germany
> Phone: +49(89)8578-2825
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org

-- .- .-. -.- ..- ... / -.- .-. .- ..- ... .
Markus Krause
Max Planck Institute of Biochemistry
- Computing Center (core services, networking, OS X)
Am Klopferspitz 18
D-82152 Martinsried, Germany
Phone: +49(89)8578-2825

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20140618/9dfdbe55/attachment-0003.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/pkcs7-signature
Size: 4514 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20140618/9dfdbe55/attachment-0003.p7s>