[Pacemaker] lrmd fork: cannot allocate memory

walter.pisani at erptech.it
Wed Feb 12 06:57:16 EST 2014


Hello,

I have a problem with a Pacemaker cluster (3 nodes, SAP production environment).

Node 1

Feb 11 12:00:39 s-xxx-05 lrmd: [12995]: info: operation monitor[85] on ip_wd_WIC_pri for client 12998: pid 27282 exited with return code 0
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: (ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: (ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: (ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 lrmd: [12995]: info: RA output: (ipbck_wd_WIC_pri:monitor:stderr) /usr/lib/ocf/resource.d//heartbeat/IPaddr2: fork: Cannot allocate memory
Feb 11 12:01:16 s-xxx-05 crmd: [12998]: info: process_lrm_event: LRM operation ipbck_wd_WIC_pri_monitor_10000 (call=87, rc=7, cib-update=105, confirmed=false) not running
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_ais_dispatch: Update relayed from s-xxx-06
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-ipbck_wd_WIC_pri (1)
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_perform_update: Sent update 28: fail-count-ipbck_wd_WIC_pri=1
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_ais_dispatch: Update relayed from s-xxx-06
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-ipbck_wd_WIC_pri (1392116476)
Feb 11 12:01:16 s-xxx-05 attrd: [12996]: notice: attrd_perform_update: Sent update 31: last-failure-ipbck_wd_WIC_pri=1392116476
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: perform_ra_op::3123: fork: Cannot allocate memory
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: unable to perform_ra_op on operation monitor[14] on usrsap_WBW_pri:2 for client 12998, its parameters: CRM_meta_record_pending=[false] CRM_meta_clone=[2] fstype=[ocfs2] device=[/dev/sapBWPvg/sapWBW] CRM_meta_clone_node_max=[1] CRM_meta_notify=[false] CRM_meta_clone_max=[3] CRM_meta_globally_unique=[false] crm_feature_set=[3.0.6] directory=[/usr/sap/WBW] CRM_meta_name=[monitor] CRM_meta_interval=[60000] CRM_meta_timeout=[60000]
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: perform_ra_op::3123: fork: Cannot allocate memory
Feb 11 12:01:17 s-xxx-05 lrmd: [12995]: ERROR: unable to perform_ra_op on operation stop[95] on webdisp_WIC_pri for client 12998, its parameters: CRM_meta_name=[stop] crm_feature_set=[3.0.6] CRM_meta_record_pending=[false] CRM_meta_timeout=[300000] InstanceName=[WIC_W39_vsicpwd] START_PROFILE=[/sapmnt/WIC/profile/WIC_W39_vsicpwd]

Node 2

Feb 11 12:00:17 s-xxx-06 pengine: [10338]: notice: process_pe_message: Transition 3196: PEngine Input stored in: /var/lib/pengine/pe-input-476.bz2
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: process_graph_event: Detected action ipbck_wd_WIC_pri_monitor_10000 from a different transition: 2546 vs. 3196
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: abort_transition_graph: process_graph_event:476 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=ipbck_wd_WIC_pri_last_failure_0, magic=0:7;321:2546:0:8544b0c8-b0fd-4249-a6ad-0ca818ba5f67, cib=0.1910.325) : Old event
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: WARN: update_failcount: Updating failcount for ipbck_wd_WIC_pri on s-xxx-05 after failed monitor: rc=7 (update=value++, time=1392116476)
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-s-xxx-05-fail-count-ipbck_wd_WIC_pri, name=fail-count-ipbck_wd_WIC_pri, value=1, magic=NA, cib=0.1910.326) : Transient attribute: update
Feb 11 12:01:16 s-xxx-06 crmd: [10339]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-s-xxx-05-last-failure-ipbck_wd_WIC_pri, name=last-failure-ipbck_wd_WIC_pri, value=1392116476, magic=NA, cib=0.1910.327) : Transient attribute: update
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_nodes: Blind faith: not fencing unseen nodes
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op sapmnt_ICP_pri:1_last_failure_0 on s-xxx-04: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op sapmnt_ICP_pri:2_last_failure_0 on s-xxx-05: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op ipbck_wd_WIC_pri_last_failure_0 on s-xxx-05: not running (7)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ipbck_wd_WIC_pri can fail 4 more times on s-xxx-05 before being forced off
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Recover ipbck_wd_WIC_pri      (Started s-xxx-05)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Restart ascs_ICP_pri  (Started s-xxx-05)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Restart webdisp_WIC_pri       (Started s-xxx-05)
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: do_te_invoke: Processing graph 3197 (ref=pe_calc-dc-1392116477-4106) derived from /var/lib/pengine/pe-input-477.bz2
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: te_rsc_command: Initiating action 414: stop webdisp_WIC_pri_stop_0 on s-xxx-05
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: WARN: status_from_rc: Action 414 (webdisp_WIC_pri_stop_0) on s-xxx-05 failed (target: 0 vs. rc: -2): Error
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: WARN: update_failcount: Updating failcount for webdisp_WIC_pri on s-xxx-05 after failed stop: rc=-2 (update=INFINITY, time=1392116477)
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: abort_transition_graph: match_graph_event:277 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=webdisp_WIC_pri_last_failure_0, magic=4:-2;414:3197:0:8544b0c8-b0fd-4249-a6ad-0ca818ba5f67, cib=0.1910.328) : Event failed
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: run_graph: ==== Transition 3197 (Complete=2, Pending=0, Fired=0, Skipped=11, Incomplete=0, Source=/var/lib/pengine/pe-input-477.bz2): Stopped
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-s-xxx-05-fail-count-webdisp_WIC_pri, name=fail-count-webdisp_WIC_pri, value=INFINITY, magic=NA, cib=0.1910.329) : Transient attribute: update
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-s-xxx-05-last-failure-webdisp_WIC_pri, name=last-failure-webdisp_WIC_pri, value=1392116477, magic=NA, cib=0.1910.330) : Transient attribute: update
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: process_pe_message: Transition 3197: PEngine Input stored in: /var/lib/pengine/pe-input-477.bz2
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: unpack_config: On loss of CCM Quorum: Ignore
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_nodes: Blind faith: not fencing unseen nodes
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op sapmnt_ICP_pri:1_last_failure_0 on s-xxx-04: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op sapmnt_ICP_pri:2_last_failure_0 on s-xxx-05: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op webdisp_WIC_pri_last_failure_0 on s-xxx-05: unknown exec error (-2)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: pe_fence_node: Node s-xxx-05 will be fenced to recover from resource failure(s)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: WARN: unpack_rsc_op: Processing failed op ipbck_wd_WIC_pri_last_failure_0 on s-xxx-05: not running (7)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: common_apply_stickiness: ocfs_global_clone can fail 4 more times on s-xxx-04 before being forced off
.
.
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Move    ipbck_wd_WIC_pri      (Started s-xxx-05 -> s-xxx-04)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Move    ascs_ICP_pri  (Started s-xxx-05 -> s-xxx-04)
Feb 11 12:01:17 s-xxx-06 pengine: [10338]: notice: LogActions: Move    webdisp_WIC_pri       (Started s-xxx-05 -> s-xxx-04)
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response ]
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: info: do_te_invoke: Processing graph 3198 (ref=pe_calc-dc-1392116477-4108) derived from /var/lib/pengine/pe-warn-26.bz2
Feb 11 12:01:17 s-xxx-06 crmd: [10339]: notice: te_fence_node: Executing reboot fencing operation (464) on s-xxx-05 (timeout=12000)
Feb 11 12:01:17 s-xxx-06 stonith-ng: [10335]: info: initiate_remote_stonith_op: Initiating remote operation reboot for s-xxx-05: fff269bd-70f1-490b-a46f-92f2eaaa04f1
Feb 11 12:01:18 s-xxx-06 pengine: [10338]: WARN: process_pe_message: Transition 3198: WARNINGs found during PE processing. PEngine Input stored in: /var/lib/pengine/pe-warn-26.bz2
Feb 11 12:01:18 s-xxx-06 pengine: [10338]: notice: process_pe_message: Configuration WARNINGs found during PE processing.  Please run "crm_verify -L" to identify issues.
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: can_fence_host_with_device: Refreshing port list for stonith-sbd_pri
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: can_fence_host_with_device: stonith-sbd_pri can fence s-xxx-05: dynamic-list
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: call_remote_stonith: Requesting that s-xxx-06 perform op reboot s-xxx-05
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: can_fence_host_with_device: stonith-sbd_pri can fence s-xxx-05: dynamic-list
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: stonith_fence: Found 1 matching devices for 's-xxx-05'
Feb 11 12:01:18 s-xxx-06 stonith-ng: [10335]: info: stonith_command: Processed st_fence from s-xxx-06: rc=-1
Feb 11 12:01:18 s-xxx-06 sbd: [25130]: info: Delivery process handling /dev/mapper/SBD_LUN_QUORUM
Feb 11 12:01:18 s-xxx-06 sbd: [25130]: info: Writing reset to node slot s-xxx-05


Node 3

Feb 11 12:00:01 s-xxx-04 /usr/sbin/cron[22525]: (root) CMD ([ -x /usr/lib64/sa/sa1 ] && exec /usr/lib64/sa/sa1 -S ALL 1 1)
Feb 11 12:00:01 s-xxx-04 syslog-ng[4795]: Log statistics; dropped='pipe(/dev/xconsole)=0', dropped='pipe(/dev/tty10)=0', processed='center(queued)=11361', processed='center(received)=6355', processed='destination(messages)=1462', processed='destination(mailinfo)=4893', processed='destination(mailwarn)=0', processed='destination(localmessages)=0', processed='destination(newserr)=0', processed='destination(mailerr)=0', processed='destination(netmgm)=0', processed='destination(warn)=103', processed='destination(console)=5', processed='destination(null)=0', processed='destination(mail)=4893', processed='destination(xconsole)=5', processed='destination(firewall)=0', processed='destination(acpid)=0', processed='destination(newscrit)=0', processed='destination(newsnotice)=0', processed='source(src)=6355'
Feb 11 12:01:17 s-xxx-04 stonith-ng: [12951]: info: crm_new_peer: Node s-xxx-06 now has id: 101344266
Feb 11 12:01:17 s-xxx-04 stonith-ng: [12951]: info: crm_new_peer: Node 101344266 is now known as s-xxx-06
Feb 11 12:01:17 s-xxx-04 stonith-ng: [12951]: info: stonith_command: Processed st_query from s-xxx-06: rc=0
Feb 11 12:01:23 s-xxx-04 corosync[12944]:  [TOTEM ] A processor failed, forming new configuration.
Feb 11 12:01:29 s-xxx-04 corosync[12944]:  [CLM   ] CLM CONFIGURATION CHANGE



Does this "Cannot allocate memory" error indicate that no memory could be allocated to fork a new resource agent instance?

I have 128 GB of RAM.

THP (transparent hugepages) is set to never.
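One thing I can check is the kernel overcommit accounting, since with strict overcommit (vm.overcommit_memory=2) a fork of a large process such as lrmd's resource agent children can fail with ENOMEM even when plenty of RAM is free. Below is a small, hypothetical Python sketch (not something I have run yet, just reading the standard /proc interfaces) to collect the figures I would compare:

    #!/usr/bin/env python3
    # Hypothetical sketch: dump the kernel figures that usually matter when
    # fork() fails with ENOMEM despite free RAM -- the overcommit policy and
    # the commit limit versus what is already committed.

    def read_meminfo():
        """Parse /proc/meminfo into a dict of kB values."""
        info = {}
        with open("/proc/meminfo") as f:
            for line in f:
                key, rest = line.split(":", 1)
                info[key] = int(rest.strip().split()[0])  # value in kB
        return info

    def read_sysctl(name):
        """Read a vm.* sysctl via /proc/sys."""
        with open("/proc/sys/vm/" + name) as f:
            return f.read().strip()

    if __name__ == "__main__":
        mem = read_meminfo()
        print("vm.overcommit_memory =", read_sysctl("overcommit_memory"))
        print("vm.overcommit_ratio  =", read_sysctl("overcommit_ratio"))
        print("MemFree      : %10d kB" % mem["MemFree"])
        print("CommitLimit  : %10d kB" % mem["CommitLimit"])
        print("Committed_AS : %10d kB" % mem["Committed_AS"])
        # With strict overcommit accounting, forking a large process can fail
        # once Committed_AS approaches CommitLimit, regardless of MemFree.
        if mem["Committed_AS"] > 0.9 * mem["CommitLimit"]:
            print("Committed_AS is close to CommitLimit; strict overcommit "
                  "could explain the fork ENOMEM.")

If the output shows Committed_AS near CommitLimit at the time of the failure, that would point at the overcommit settings rather than at lrmd or the resource agents themselves.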


Versions:

openais-1.1.4-5.8.7.1
libopenais3-1.1.4-5.8.7.1
pacemaker-mgmt-2.1.1-0.6.2.17
pacemaker-1.1.7-0.13.9
drbd-pacemaker-8.4.2-0.6.6.7
pacemaker-mgmt-client-2.1.1-0.6.2.17
libpacemaker3-1.1.7-0.13.9

OS: SLES 11 SP2, kernel 3.0.80-0.7-default

Please ask me if you need more information.


Thanks

Bye
Walter

