[Pacemaker] Segfault on monitor resource

Mon Jan 26 17:22:57 UTC 2015

Oh, I forgot some important details:

root# (S) crm status
============
Last updated: Mon Jan 26 18:21:35 2015
Last change: Sun Jan 25 05:19:13 2015 via crm_resource on lb01
Stack: Heartbeat
Current DC: lb01 (43b2c5a1-9552-4438-962b-6e98a2dd67c7) - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, unknown expected votes
8 Resources configured.
============

Online: [ lb01 lb02 ]

 IP-rsc_mysql (ocf::heartbeat:IPaddr2): Started lb02
 IP-rsc_nginx (ocf::heartbeat:IPaddr2): Started lb02
 IP-rsc_nginx6 (ocf::heartbeat:IPv6addr): Started lb02
 IP-rsc_mysql6 (ocf::heartbeat:IPv6addr): Started lb02
 IP-rsc_elasticsearch6 (ocf::heartbeat:IPv6addr): Started lb02
 IP-rsc_elasticsearch (ocf::heartbeat:IPaddr2): Started lb02
 Ldirector-rsc (ocf::heartbeat:ldirectord): Started lb02
 Nginx-rsc (ocf::heartbeat:nginx): Started lb02

This is running on:

Debian      7.8
pacemaker   1.1.7-1

2015-01-26 18:20 GMT+01:00 Oscar Salvador <osalvador.vilardaga at gmail.com>:

> Hi!
>
> I'm writing here because two days ago I experienced a strange problem in
> my Pacemaker cluster.
> Everything was working fine until the monitor operation of the Nginx
> resource suddenly segfaulted:
>
> Jan 25 03:55:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition 7551
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pengine/pe-input-90.bz2): Complete
> Jan 25 03:55:24 lb02 crmd: [9975]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
> Jan 25 04:00:08 lb02 cib: [9971]: info: cib_stats: Processed 1 operations
> (0.00us average, 0% utilization) in the last 10min
> Jan 25 04:10:24 lb02 crmd: [9975]: info: crm_timer_popped: PEngine Recheck
> Timer (I_PE_CALC) just popped (900000ms)
> Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_TIMER_POPPED
> origin=crm_timer_popped ]
> Jan 25 04:10:24 lb02 crmd: [9975]: info: do_state_transition: Progressed
> to state S_POLICY_ENGINE after C_TIMER_POPPED
> Jan 25 04:10:24 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing
> failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
> Jan 25 04:10:24 lb02 pengine: [10028]: notice: common_apply_stickiness:
> Ldirector-rsc can fail 999997 more times on lb02 before being forced off
> Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> Jan 25 04:10:24 lb02 pengine: [10028]: notice: process_pe_message:
> Transition 7552: PEngine Input stored in: /var/lib/pengine/pe-input-90.bz2
> Jan 25 04:10:24 lb02 crmd: [9975]: info: do_te_invoke: Processing graph
> 7552 (ref=pe_calc-dc-1422155424-7644) derived from
> /var/lib/pengine/pe-input-90.bz2
> Jan 25 04:10:24 lb02 crmd: [9975]: notice: run_graph: ==== Transition 7552
> (Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0,
> Source=/var/lib/pengine/pe-input-90.bz2): Complete
> Jan 25 04:10:24 lb02 crmd: [9975]: notice: do_state_transition: State
> transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS
> cause=C_FSA_INTERNAL origin=notify_crmd ]
>
>
> Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
> (Nginx-rsc:monitor:stderr) Segmentation fault   ******* here it starts
>
> The last line above is where the trouble starts.
> Then:
>
> Jan 25 04:10:30 lb02 lrmd: [9972]: info: RA output:
> (Nginx-rsc:monitor:stderr) Killed
> /usr/lib/ocf/resource.d//heartbeat/nginx: 910:
> /usr/lib/ocf/resource.d//heartbeat/nginx: Cannot fork
>
> I guess this is where Nginx was killed.
>
> After that there are some other errors, until Pacemaker decides to move the
> resources to the other node:
>
> Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation
> Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false)
> invalid parameter
> Jan 25 04:10:30 lb02 crmd: [9975]: info: process_graph_event: Detected
> action Nginx-rsc_monitor_10000 from a different transition: 5739 vs. 7552
> Jan 25 04:10:30 lb02 crmd: [9975]: info: abort_transition_graph:
> process_graph_event:476 - Triggered transition abort (complete=1,
> tag=lrm_rsc_op, id=Nginx-rsc_last_failure_0,
> magic=0:2;4:5739:0:42d1ed53-9686-4174-84e7-d2c230ed8832, cib=
> 3.14.40) : Old event
> Jan 25 04:10:30 lb02 crmd: [9975]: WARN: update_failcount: Updating
> failcount for Nginx-rsc on lb02 after failed monitor: rc=2 (update=value++,
> time=1422155430)
> Jan 25 04:10:30 lb02 crmd: [9975]: notice: do_state_transition: State
> transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL
> origin=abort_transition_graph ]
> Jan 25 04:10:30 lb02 attrd: [9974]: info: log-rotate detected on logfile
> /var/log/ha-log
> Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending
> flush op to all hosts for: fail-count-Nginx-rsc (1)
> Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing
> Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid
> parameter' (rc=2)
> Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing
> failed op Nginx-rsc_last_failure_0 on lb02: invalid parameter (2)
> Jan 25 04:10:30 lb02 pengine: [10028]: WARN: unpack_rsc_op: Processing
> failed op Ldirector-rsc_last_failure_0 on lb02: not running (7)
> Jan 25 04:10:30 lb02 pengine: [10028]: notice: common_apply_stickiness:
> Ldirector-rsc can fail 999997 more times on lb02 before being forced off
> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
>  IP-rsc_mysql (lb02)
> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
>  IP-rsc_nginx (lb02)
> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
>  IP-rsc_nginx6        (lb02)
> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Stop
>  IP-rsc_elasticsearch (lb02)
> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move
>  Ldirector-rsc        (Started lb02 -> lb01)
> Jan 25 04:10:30 lb02 pengine: [10028]: notice: LogActions: Move
>  Nginx-rsc    (Started lb02 -> lb01)
> Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_perform_update: Sent
> update 23: fail-count-Nginx-rsc=1
> Jan 25 04:10:30 lb02 attrd: [9974]: notice: attrd_trigger_update: Sending
> flush op to all hosts for: last-failure-Nginx-rsc (1422155430)
>
> I see that Pacemaker is complaining about errors such as "invalid
> parameter", for example in these lines:
>
> Jan 25 04:10:30 lb02 crmd: [9975]: info: process_lrm_event: LRM operation
> Nginx-rsc_monitor_10000 (call=52, rc=2, cib-update=7633, confirmed=false)
> invalid parameter
>
> Jan 25 04:10:30 lb02 pengine: [10028]: ERROR: unpack_rsc_op: Preventing
> Nginx-rsc from re-starting on lb02: operation monitor failed 'invalid
> parameter' (rc=2)
>
> To me it sounds like a syntax problem in the resource definitions, but I've
> checked the config with crm_verify and it reports no errors:
>
> root# (S) crm_verify -LVV
> root# (S)
>
> So I'm just wondering why pacemaker is complaining about an invalid
> parameter.
>
> These are my CIB objects:
>
> node $id="43b2c5a1-9552-4438-962b-6e98a2dd67c7" lb01
> node $id="68328520-68e0-42fd-9adf-062655691643" lb02
> primitive IP-rsc_elasticsearch ocf:heartbeat:IPaddr2 \
> params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> primitive IP-rsc_elasticsearch6 ocf:heartbeat:IPv6addr \
> params ipv6addr="xxxxxxxxxxxxxxxx" \
> op monitor interval="10s"
> primitive IP-rsc_mysql ocf:heartbeat:IPaddr2 \
> params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> primitive IP-rsc_mysql6 ocf:heartbeat:IPv6addr \
> params ipv6addr="xxxxxxxxxxxxxx" \
> op monitor interval="10s"
> primitive IP-rsc_nginx ocf:heartbeat:IPaddr2 \
> params ip="xx.xx.xx.xx" nic="eth0" cidr_netmask="255.255.255.224"
> primitive IP-rsc_nginx6 ocf:heartbeat:IPv6addr \
> params ipv6addr="xxxxxxxxxxxxxx" \
> op monitor interval="10s"
> primitive Ldirector-rsc ocf:heartbeat:ldirectord \
> op monitor interval="10s" timeout="30s"
> primitive Nginx-rsc ocf:heartbeat:nginx \
> op monitor interval="10s" timeout="30s"
> location cli-standby-IP-rsc_elasticsearch6 IP-rsc_elasticsearch6 \
> rule $id="cli-standby-rule-IP-rsc_elasticsearch6" -inf: #uname eq lb01
> location cli-standby-IP-rsc_mysql IP-rsc_mysql \
> rule $id="cli-standby-rule-IP-rsc_mysql" -inf: #uname eq lb01
> location cli-standby-IP-rsc_mysql6 IP-rsc_mysql6 \
> rule $id="cli-standby-rule-IP-rsc_mysql6" -inf: #uname eq lb01
> location cli-standby-IP-rsc_nginx IP-rsc_nginx \
> rule $id="cli-standby-rule-IP-rsc_nginx" -inf: #uname eq lb01
> location cli-standby-IP-rsc_nginx6 IP-rsc_nginx6 \
> rule $id="cli-standby-rule-IP-rsc_nginx6" -inf: #uname eq lb01
> colocation hcu_c inf: Nginx-rsc Ldirector-rsc IP-rsc_mysql IP-rsc_nginx IP-rsc_nginx6 IP-rsc_elasticsearch
> order hcu_o inf: IP-rsc_nginx IP-rsc_nginx6 IP-rsc_mysql Ldirector-rsc Nginx-rsc IP-rsc_elasticsearch
> property $id="cib-bootstrap-options" \
> dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
> cluster-infrastructure="Heartbeat" \
> stonith-enabled="false"
>
>
> Do you have any hints I can follow?
>
> Thanks in advance!
>
> Oscar
>
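
A note on the return codes in the quoted log, in case it helps when reading
it: the rc=2 and (7) values are exit codes of the resource agent script, which
Pacemaker labels with the OCF meanings ("invalid parameter" is OCF_ERR_ARGS,
"not running" is OCF_NOT_RUNNING). Since the number is the exit status of the
agent run rather than the result of a CIB check, a monitor script that dies
abnormally could presumably surface as rc=2 even when crm_verify finds nothing
wrong. The helper below is only a sketch for decoding such codes; the function
name ocf_rc_name is made up for illustration and is not part of Pacemaker or
resource-agents.

```shell
#!/bin/sh
# Map an OCF resource-agent exit code to its symbolic name.
# ocf_rc_name is a hypothetical helper, not a Pacemaker command.
ocf_rc_name() {
    case "$1" in
        0) echo "OCF_SUCCESS" ;;
        1) echo "OCF_ERR_GENERIC" ;;
        2) echo "OCF_ERR_ARGS" ;;            # logged as "invalid parameter"
        3) echo "OCF_ERR_UNIMPLEMENTED" ;;
        4) echo "OCF_ERR_PERM" ;;
        5) echo "OCF_ERR_INSTALLED" ;;
        6) echo "OCF_ERR_CONFIGURED" ;;
        7) echo "OCF_NOT_RUNNING" ;;         # logged as "not running"
        *) echo "unknown ($1)" ;;
    esac
}

ocf_rc_name 2   # prints OCF_ERR_ARGS
ocf_rc_name 7   # prints OCF_NOT_RUNNING
```

With that mapping, the Nginx-rsc_monitor_10000 failure (rc=2) and the older
Ldirector-rsc failure (rc=7) in the log can be told apart at a glance.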