[Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1

Tue Dec 17 21:05:09 UTC 2013

----- Original Message -----
> From: "Kazunori INOUE" <kazunori.inoue3 at gmail.com>
> To: "pm" <pacemaker at oss.clusterlabs.org>
> Sent: Tuesday, December 17, 2013 5:43:53 AM
> Subject: [Pacemaker] lrmd segfault at pacemaker 1.1.11-rc1
> 
> Hi,
> 
> When I repeatedly ran 'node standby' and 'node online', lrmd crashed
> with SIGSEGV because "op->id" in cancel_recurring_action() was NULL.

That's a really weird one. I don't see how op->id could be NULL there. You might need to run lrmd under valgrind to detect whatever is really going on here.
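
For what it's worth, passing a null pointer to a "%s" conversion is undefined behavior in C, so a guard at the log call is a cheap way to rule that path out while debugging. The following is only a minimal sketch under stated assumptions: demo_action_t, demo_action_id() and demo_format_cancel() are hypothetical names made up for illustration, not Pacemaker's or libqb's real types or functions. It also would not help if op points at memory that has already been freed, which is the kind of problem valgrind is better placed to catch.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the action object; NOT Pacemaker's real struct. */
typedef struct demo_action {
    const char *id;   /* e.g. "prmPing_monitor_10000"; may be NULL in this sketch */
} demo_action_t;

/* Return a string that is always safe to hand to a "%s" conversion. */
static const char *demo_action_id(const demo_action_t *op)
{
    if (op == NULL) {
        return "<no action>";
    }
    return (op->id != NULL) ? op->id : "<no id>";
}

/* Format the log line into buf; returns whatever snprintf returns. */
static int demo_format_cancel(char *buf, size_t len, const demo_action_t *op)
{
    return snprintf(buf, len, "Cancelling operation %s", demo_action_id(op));
}
```

If the placeholder text shows up in the log, the id really was missing; if the crash persists with the guard in place, a stale or corrupted op pointer is the more likely explanation.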

-- Vossel

> 
> Dec 17 19:01:21 vm3 crmd[2433]:     info: do_state_transition: State
> transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS
> cause=C_IPC_MESSAGE origin=handle_response ]
> Dec 17 19:01:21 vm3 crmd[2433]:     info: do_te_invoke: Processing
> graph 437 (ref=pe_calc-dc-1387274481-5672) derived from
> /var/lib/pacemaker/pengine/pe-input-437.bz2
> Dec 17 19:01:21 vm3 crmd[2433]:   notice: te_rsc_command: Initiating
> action 17: stop prmStonith4_stop_0 on vm3 (local)
> Dec 17 19:01:21 vm3 crmd[2433]:     info: do_lrm_rsc_op: Performing
> key=17:437:0:40d7b9a2-c373-4459-a811-9c225d1a9555
> op=prmStonith4_stop_0
> Dec 17 19:01:21 vm3 lrmd[2430]:     info: log_execute: executing -
> rsc:prmStonith4 action:stop call_id:3487
> Dec 17 19:01:21 vm3 stonith-ng[2429]:     info: stonith_command:
> Processed st_device_remove from lrmd.2430: OK (0)
> Dec 17 19:01:21 vm3 lrmd[2430]:     info: log_finished: finished -
> rsc:prmStonith4 action:stop call_id:3487  exit-code:0 exec-time:0ms
> queue-time:0ms
> Dec 17 19:01:21 vm3 pengine[2432]:   notice: process_pe_message:
> Calculated Transition 437: /var/lib/pacemaker/pengine/pe-input-437.bz2
> Dec 17 19:01:21 vm3 crmd[2433]:   notice: te_rsc_command: Initiating
> action 33: stop prmPg_stop_0 on vm3 (local)
> Dec 17 19:01:21 vm3 lrmd[2430]:     info: cancel_recurring_action:
> Cancelling operation prmPg_monitor_10000
> Dec 17 19:01:21 vm3 crmd[2433]:     info: do_lrm_rsc_op: Performing
> key=33:437:0:40d7b9a2-c373-4459-a811-9c225d1a9555 op=prmPg_stop_0
> Dec 17 19:01:21 vm3 lrmd[2430]:     info: log_execute: executing -
> rsc:prmPg action:stop call_id:3489
> Dec 17 19:01:21 vm3 crmd[2433]:     info: process_lrm_event: LRM
> operation prmStonith4_monitor_3600000 (call=3473, status=1,
> cib-update=0, confirmed=true) Cancelled
> Dec 17 19:01:21 vm3 crmd[2433]:   notice: process_lrm_event: LRM
> operation prmStonith4_stop_0 (call=3487, rc=0, cib-update=3090,
> confirmed=true) ok
> Dec 17 19:01:21 vm3 crmd[2433]:     info: process_lrm_event: LRM
> operation prmPg_monitor_10000 (call=3485, status=1, cib-update=0,
> confirmed=true) Cancelled
> Dec 17 19:01:21 vm3 crmd[2433]:     info: match_graph_event: Action
> prmStonith4_stop_0 (17) confirmed on vm3 (rc=0)
> Dec 17 19:01:21 vm3 crmd[2433]:   notice: te_rsc_command: Initiating
> action 40: stop prmPing_stop_0 on vm3 (local)
> Dec 17 19:01:21 vm3 cib[2428]:     info: cib_process_request:
> Completed cib_modify operation for section status: OK (rc=0,
> origin=local/crmd/3090, version=0.440.2)
> Dec 17 19:01:21 vm3 stonith-ng[2429]:     info: crm_client_destroy:
> Destroying 0 events
> Dec 17 19:01:21 vm3 pacemakerd[2424]:    error: child_death_dispatch:
> Managed process 2430 (lrmd) dumped core
> Dec 17 19:01:21 vm3 pacemakerd[2424]:   notice: pcmk_child_exit: Child
> process lrmd terminated with signal 11 (pid=2430, core=1)
> Dec 17 19:01:21 vm3 pacemakerd[2424]:   notice: pcmk_process_exit:
> Respawning failed child process: lrmd
> Dec 17 19:01:21 vm3 pacemakerd[2424]:    error: pcmk_process_exit:
> Rebooting system
> Dec 17 19:10:40 vm3 root: Mark:pcmk:1387275040
> 
> $ gdb /usr/libexec/pacemaker/lrmd core.2430
> (gdb) bt
> #0  0x000000323f8480ac in vfprintf () from /lib64/libc.so.6
> #1  0x000000323f86f9d2 in vsnprintf () from /lib64/libc.so.6
> #2  0x0000003fcb81726d in qb_log_real_va_ (cs=0x3fcf208658,
> ap=0x7ffff6f5fc80) at log.c:230
> #3  0x0000003fcb8173ea in qb_log_real_ (cs=0x3fcf208658) at log.c:255
> #4  0x0000003fcf003a9c in cancel_recurring_action (op=0xb9fae0) at
> services.c:356
> #5  0x0000003fcf003bc6 in services_action_cancel (name=0xb9f350
> "prmPing", action=0xb9ee90 "monitor", interval=10000) at
> services.c:381
> #6  0x0000000000406595 in cancel_op (rsc_id=0xb9f350 "prmPing",
> action=0xb9ee90 "monitor", interval=10000) at lrmd.c:1197
> #7  0x00000000004067aa in process_lrmd_rsc_cancel (client=0xb926c0,
> id=7030, request=0xb95ad0) at lrmd.c:1261
> #8  0x0000000000406a51 in process_lrmd_message (client=0xb926c0,
> id=7030, request=0xb95ad0) at lrmd.c:1300
> #9  0x0000000000402a06 in lrmd_ipc_dispatch (c=0xb91af0,
> data=0x7f9f30acbc08, size=362) at main.c:141
> #10 0x0000003fcb8126f8 in _process_request_ (c=0xb91af0,
> ms_timeout=10) at ipcs.c:698
> #11 0x0000003fcb812ad5 in qb_ipcs_dispatch_connection_request (fd=5,
> revents=1, data=0xb91af0) at ipcs.c:801
> #12 0x0000003fcc0327b1 in gio_read_socket (gio=0xb92880,
> condition=G_IO_IN, data=0xb91138) at mainloop.c:437
> #13 0x0000003fc9c3feb2 in g_main_context_dispatch () from
> /lib64/libglib-2.0.so.0
> #14 0x0000003fc9c43d68 in ?? () from /lib64/libglib-2.0.so.0
> #15 0x0000003fc9c44275 in g_main_loop_run () from /lib64/libglib-2.0.so.0
> #16 0x00000000004030cc in main (argc=1, argv=0x7ffff6f606c8) at main.c:314
> 
> I am still investigating, but I have not found the cause yet.
> 
> Because the crm_report is large, I uploaded it here:
> https://drive.google.com/file/d/0B9eNn1AWfKD4WGY5bllMQW1BbDA/edit?usp=sharing
> 
> Best Regards,
> Kazunori INOUE
> 