[Pacemaker] Pacemaker crash on node unstandby/standby.
Justin Burnham
jburnham25 at gmail.com
Sat Oct 19 01:45:10 UTC 2013
Hi,
I am having an issue with Pacemaker on the cman stack where I can reliably
make Pacemaker crash and dump core by putting the node into or out of
standby.
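For reference, these are the commands I run on the node itself; both
directions trigger the crash:

  # put the node into standby, then bring it back out
  pcs cluster standby `uname -n`
  pcs cluster unstandby `uname -n`

Here is my messages log from one such unstandby: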
Oct 18 18:33:11 localhost crmd[19258]: notice: do_state_transition: State
transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
origin=abort_transition_graph ]
Oct 18 18:33:11 localhost cib[19253]: notice: cib:diff: Diff: --- 0.45.4
Oct 18 18:33:11 localhost cib[19253]: notice: cib:diff: Diff: +++ 0.46.1
01d8e730566358dcb36a7bc5b9ac587c
Oct 18 18:33:11 localhost cib[19253]: notice: cib:diff: --
<nvpair id="nodes-node01-standby" name="standby" value="on" />
Oct 18 18:33:11 localhost cib[19253]: notice: cib:diff: ++ <cib
epoch="46" num_updates="1" admin_epoch="0" validate-with="pacemaker-1.2"
crm_feature_set="3.0.7" update-origin="node01"
update-client="crm_attribute" cib-last-written="Fri Oct 18 18:32:57 2013"
have-quorum="0" dc-uuid="node01" />
Oct 18 18:33:11 localhost pengine[19257]: notice: unpack_config: On loss
of CCM Quorum: Ignore
Oct 18 18:33:11 localhost pengine[19257]: notice: LogActions: Start
reader_vip_2#011(node01)
Oct 18 18:33:11 localhost pengine[19257]: notice: LogActions: Start
writer_vip#011(node01)
Oct 18 18:33:11 localhost pengine[19257]: notice: LogActions: Start
reader_vip_1#011(node01)
Oct 18 18:33:11 localhost cib[19253]: error: cib_ipc_closed: Client
not found in the hashtable
Oct 18 18:33:11 localhost cib[19253]: error: qb_ipcs_connection_unref:
ref:0 state:3 (19253-20281-21)
Oct 18 18:33:11 localhost pengine[19257]: notice: process_pe_message:
Calculated Transition 4: /var/lib/pacemaker/pengine/pe-input-77.bz2
Oct 18 18:33:12 localhost lrmd[19255]: notice: operation_finished:
reader_vip_2_start_0:20284 [ 2013/10/18_18:33:12 INFO: ip -f inet addr add
10.10.10.9/32 brd 10.10.10.9 dev eth1 ]
Oct 18 18:33:12 localhost lrmd[19255]: notice: operation_finished:
reader_vip_2_start_0:20284 [ 2013/10/18_18:33:12 INFO: ip link set eth1 up ]
Oct 18 18:33:12 localhost lrmd[19255]: notice: operation_finished:
reader_vip_2_start_0:20284 [ 2013/10/18_18:33:12 INFO:
/usr/lib64/heartbeat/send_arp -i 200 -r 5 -p
/var/run/heartbeat/rsctmp/send_arp-10.10.10.9 eth1 10.10.10.9
auto not_used not_used ]
Oct 18 18:33:13 localhost abrtd: Directory 'ccpp-2013-10-18-18:33:11-19253'
creation detected
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished:
reader_vip_1_start_0:20286 [ 2013/10/18_18:33:12 INFO: ip -f inet addr add
10.10.10.8/32 brd 10.10.10.8 dev eth1 ]
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished:
reader_vip_1_start_0:20286 [ 2013/10/18_18:33:12 INFO: ip link set eth1 up ]
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished:
reader_vip_1_start_0:20286 [ 2013/10/18_18:33:12 INFO:
/usr/lib64/heartbeat/send_arp -i 200 -r 5 -p
/var/run/heartbeat/rsctmp/send_arp-10.10.10.8 eth1 10.10.10.8
auto not_used not_used ]
Oct 18 18:33:13 localhost abrt[20283]: Saved core dump of pid 19253
(/usr/libexec/pacemaker/cib) to
/var/spool/abrt/ccpp-2013-10-18-18:33:11-19253 (10153984 bytes)
Oct 18 18:33:13 localhost rsyslogd-2177: imuxsock lost 75 messages from pid
18983 due to rate-limiting
Oct 18 18:33:13 localhost pacemakerd[19247]: notice: pcmk_child_exit:
Child process cib terminated with signal 6 (pid=19253, core=128)
Oct 18 18:33:13 localhost pacemakerd[19247]: notice: pcmk_process_exit:
Respawning failed child process: cib
Oct 18 18:33:13 localhost cib[20433]: notice: crm_add_logfile: Additional
logging available in /var/log/cluster/corosync.log
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished:
writer_vip_start_0:20285 [ 2013/10/18_18:33:12 INFO: ip -f inet addr add
10.10.10.10/32 brd 10.10.10.10 dev eth1 ]
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished:
writer_vip_start_0:20285 [ 2013/10/18_18:33:12 INFO: ip link set eth1 up ]
Oct 18 18:33:13 localhost lrmd[19255]: notice: operation_finished:
writer_vip_start_0:20285 [ 2013/10/18_18:33:13 INFO:
/usr/lib64/heartbeat/send_arp -i 200 -r 5 -p
/var/run/heartbeat/rsctmp/send_arp-10.10.10.10 eth1 10.10.10.10
auto not_used not_used ]
Oct 18 18:33:13 localhost cib[20433]: notice: crm_cluster_connect:
Connecting to cluster infrastructure: cman
Oct 18 18:33:13 localhost crmd[19258]: notice: crm_ipc_send: Connection
to cib_shm closed
Oct 18 18:33:13 localhost crmd[19258]: error:
cib_native_perform_op_delegate: Couldn't perform cib_modify operation
(timeout=120s): -107: Transport endpoint is not connected (107)
Oct 18 18:33:13 localhost crmd[19258]: error:
cib_native_perform_op_delegate: CIB disconnected
Oct 18 18:33:13 localhost crmd[19258]: warning: cib_rsc_callback: Resource
update -70 failed: (rc=-70) Communication error on send
Oct 18 18:33:13 localhost crmd[19258]: notice: process_lrm_event: LRM
operation reader_vip_2_start_0 (call=60, rc=0, cib-update=-70,
confirmed=true) ok
Oct 18 18:33:13 localhost crmd[19258]: warning: cib_rsc_callback: Resource
update -107 failed: (rc=-107) Transport endpoint is not connected
Oct 18 18:33:13 localhost crmd[19258]: notice: process_lrm_event: LRM
operation reader_vip_1_start_0 (call=64, rc=0, cib-update=-107,
confirmed=true) ok
Oct 18 18:33:13 localhost crmd[19258]: warning: cib_rsc_callback: Resource
update -107 failed: (rc=-107) Transport endpoint is not connected
Oct 18 18:33:13 localhost crmd[19258]: notice: process_lrm_event: LRM
operation writer_vip_start_0 (call=62, rc=0, cib-update=-107,
confirmed=true) ok
Oct 18 18:33:13 localhost crmd[19258]: error: crm_ipc_read: Connection
to cib_shm failed
Oct 18 18:33:13 localhost crmd[19258]: error: mainloop_gio_callback:
Connection to cib_shm[0x26c7080] closed (I/O condition=17)
Oct 18 18:33:13 localhost crmd[19258]: error:
crmd_cib_connection_destroy: Connection to the CIB terminated...
Oct 18 18:33:13 localhost crmd[19258]: error: do_log: FSA: Input I_ERROR
from crmd_cib_connection_destroy() received in state S_TRANSITION_ENGINE
Oct 18 18:33:13 localhost crmd[19258]: warning: do_state_transition: State
transition S_TRANSITION_ENGINE -> S_RECOVERY [ input=I_ERROR
cause=C_FSA_INTERNAL origin=crmd_cib_connection_destroy ]
Oct 18 18:33:13 localhost crmd[19258]: error: do_recover: Action
A_RECOVER (0000000001000000) not supported
Oct 18 18:33:13 localhost crmd[19258]: warning: do_election_vote: Not
voting in election, we're in state S_RECOVERY
Oct 18 18:33:13 localhost crmd[19258]: warning: destroy_action: Cancelling
timer for action 4 (src=71)
Oct 18 18:33:13 localhost crmd[19258]: warning: destroy_action: Cancelling
timer for action 6 (src=72)
Oct 18 18:33:13 localhost crmd[19258]: warning: destroy_action: Cancelling
timer for action 8 (src=73)
Oct 18 18:33:13 localhost crmd[19258]: error: do_log: FSA: Input
I_TERMINATE from do_recover() received in state S_RECOVERY
Oct 18 18:33:13 localhost crmd[19258]: error: verify_stopped: Resource
reader_vip_1 was active at shutdown. You may ignore this error if it is
unmanaged.
Oct 18 18:33:13 localhost crmd[19258]: error: verify_stopped: Resource
reader_vip_2 was active at shutdown. You may ignore this error if it is
unmanaged.
Oct 18 18:33:13 localhost crmd[19258]: error: verify_stopped: Resource
writer_vip was active at shutdown. You may ignore this error if it is
unmanaged.
Oct 18 18:33:13 localhost crmd[19258]: notice: terminate_cs_connection:
Disconnecting from Corosync
Oct 18 18:33:13 localhost crmd[19258]: error: do_exit: Could not recover
from internal error
Oct 18 18:33:13 localhost pacemakerd[19247]: error:
qb_ipcs_connection_unref: ref:0 state:3 (19247-19258-19)
Oct 18 18:33:14 localhost abrtd: Directory 'ccpp-2013-10-18-18:33:14-19247'
creation detected
Oct 18 18:33:14 localhost abrt[20454]: Saved core dump of pid 19247
(/usr/sbin/pacemakerd) to /var/spool/abrt/ccpp-2013-10-18-18:33:14-19247
(9646080 bytes)
Oct 18 18:33:24 localhost abrtd: Sending an email...
Oct 18 18:33:24 localhost abrtd: Email was sent to: root at localhost
Oct 18 18:33:25 localhost abrtd: New problem directory
/var/spool/abrt/ccpp-2013-10-18-18:33:11-19253, processing
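abrt keeps the problem directories around locally as well; if I have the
RHEL 6 abrt-cli syntax right, something like this lists and inspects them:

  # list recorded crashes, then show details for the cib dump
  abrt-cli list
  abrt-cli info /var/spool/abrt/ccpp-2013-10-18-18:33:11-19253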
I get "[abrt] full crash report" emails as well.
some basic info from the emails
Subject: [abrt] full crash report
User-Agent: Heirloom mailx 12.4 7/29/08
Content-Type: text/plain; charset=us-ascii
Status: RO
abrt_version: 2.0.8
cgroup:
cmdline: /usr/libexec/pacemaker/cib
executable: /usr/libexec/pacemaker/cib
kernel: 2.6.32-358.18.1.el6.x86_64
pid: 19253
pwd: /var/lib/pacemaker/cores
time: Fri 18 Oct 2013 06:33:11 PM PDT
uid: 495
username: hacluster
sosreport.tar.xz: Binary file, 401160 bytes
and here is the second email, for pacemakerd:
Subject: [abrt] full crash report
User-Agent: Heirloom mailx 12.4 7/29/08
Content-Type: text/plain; charset=us-ascii
Status: R
abrt_version: 2.0.8
cgroup:
cmdline: pacemakerd
executable: /usr/sbin/pacemakerd
kernel: 2.6.32-358.18.1.el6.x86_64
pid: 19247
pwd: /var/lib/pacemaker/cores/root
time: Fri 18 Oct 2013 06:33:14 PM PDT
uid: 0
username: root
sosreport.tar.xz: Binary file, 401964 bytes
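If a backtrace is more useful than the raw core, I can try to pull one out
of the abrt directory with gdb, along these lines (this assumes the matching
pacemaker debuginfo packages are installed, and that abrt names the saved
core file "coredump"):

  # dump a full backtrace of every thread in the crashed cib process
  gdb -batch -ex "thread apply all bt full" \
      /usr/libexec/pacemaker/cib \
      /var/spool/abrt/ccpp-2013-10-18-18:33:11-19253/coredump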
This is CentOS 6.4 running the following versions:
corosync-1.4.1-15.el6_4.1.x86_64
pacemaker-libs-1.1.8-7.el6.x86_64
corosynclib-1.4.1-15.el6_4.1.x86_64
pacemaker-cli-1.1.8-7.el6.x86_64
pacemaker-cluster-libs-1.1.8-7.el6.x86_64
pacemaker-1.1.8-7.el6.x86_64
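For anyone who wants to compare against their own systems, a query along
these lines should reproduce that package list:

  # list the installed corosync and pacemaker packages
  rpm -qa | egrep 'corosync|pacemaker'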
Here is my corosync.conf file:
compatibility: whitetank

totem {
    version: 2
    rrp_mode: none
    netmtu: 1500
    secauth: on
    threads: 0
    transport: udp
    token: 1000
    token_retransmit: 238
    hold: 100
    join: 50
    consensus: 1200
    merge: 200
    downcheck: 1000
    fail_recv_const: 2500
    seqno_unchanged_const: 30
    heartbeat_failures_allowed: 0
    max_network_delay: 50
    window_size: 50
    max_messages: 17
    miss_count_const: 5
    rrp_problem_count_timeout: 2000
    rrp_problem_count_threshold: 10
    rrp_problem_count_mcast_threshold:
    rrp_token_expired_timeout: 47
    rrp_autorecovery_check_timeout: 1000

    interface {
        ringnumber: 0
        bindnetaddr: 10.10.10.0
        broadcast: no
        mcastaddr: 239.0.0.111
        mcastport: 4000
        ttl: 1
    }
}

logging {
    timestamp: off
    fileline: off
    to_stderr: no
    to_logfile: yes
    to_syslog: yes
    logfile: /var/log/corosync.log
    logfile_priority: info
    syslog_facility: daemon
    syslog_priority: info
    debug: off
}

amf {
    mode: disabled
}
and here is my cluster.conf:
<?xml version="1.0"?>
<cluster config_version="1" name="nodes">
    <logging debug="on"/>
    <clusternodes>
        <clusternode name="node01" nodeid="1"/>
        <clusternode name="node02" nodeid="2"/>
    </clusternodes>
</cluster>
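One thing I was not sure about: does a two-node cman cluster also need the
usual two_node votes stanza inside <cluster>? I am thinking of something
like the following, which is my guess at the syntax rather than what I am
currently running:

  <!-- hypothetical: two-node quorum settings, from memory -->
  <cman two_node="1" expected_votes="1"/>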
If I'm doing something obviously wrong, please let me know. If you would
like the full contents of the coredump and the abrt emails, I can also
provide those.
Thanks for your time,
Justin