[Pacemaker] inconsistence in crm_mon and crm resource show

Wed Mar 21 16:48:45 UTC 2012

Fixed now,

By mistake I had removed the property stonith-enabled=false, so the surviving node kept trying to fence the other node, which had crashed or been rebooted. As a result, all resources stayed down, waiting for the fence operation to report completion.
After I restored the property, the behavior is as expected: resources are started on the second node without waiting for fencing.

Best regards

Jozef

-----Original Message-----
From: Janec, Jozef 
Sent: Wednesday, March 21, 2012 11:47 AM
To: The Pacemaker cluster resource manager
Subject: Re: [Pacemaker] inconsistence in crm_mon and crm resource show

> 
> On 2012-03-21T09:42:26, "Janec, Jozef" <jozef.janec at hp.com> wrote:
> 
> > Node b300ple0: UNCLEAN (offline)
> >         rs_nw_dbjj7     (ocf::heartbeat:IPaddr) Started
> >         rs_nw_cijj7     (ocf::heartbeat:IPaddr) Started
> > Node b400ple0: online
> >         sbd_fense_SHARED2       (stonith:external/sbd) Started
> >
> > Inactive resources:
> >
> > rs_nw_cijj7    (ocf::heartbeat:IPaddr):        Started b300ple0
> > rs_nw_dbjj7    (ocf::heartbeat:IPaddr):        Started b300ple0
> >
> > b400ple0:(/root/home/root)(root)#crm resource show
> > rs_nw_cijj7    (ocf::heartbeat:IPaddr) Started
> > sbd_fense_SHARED2      (stonith:external/sbd) Started
> > rs_nw_dbjj7    (ocf::heartbeat:IPaddr) Started
> > b400ple0:(/root/home/root)(root)#
> >
> > b400ple0:(/root/home/root)(root)#/usr/sbin/crm_resource -W -r
> > rs_nw_cijj7 resource rs_nw_cijj7 is running on: b300ple0 
> > b400ple0:(/root/home/root)(root)#
> >
> > but b300ple0 is down
> 
> Resources are still considered owned because the node wasn't fenced yet.
> 

[Jozef Janec]
Yes, I can see that in the logs:

Mar 21 06:18:00 b400ple0 stonith-ng: [8603]: ERROR: log_operation: Operation 'reboot' [3159] for host 'b300ple0' with device 'sbd_fense_SHARED2' returned: 1 (call 0 from (null))
Mar 21 06:18:00 b400ple0 stonith-ng: [8603]: info: process_remote_stonith_execExecResult <st-reply st_origin="stonith_construct_async_reply" t="stonith-ng" st_op="st_notify" st_remote_op="5cb46419-bfdb-4115-85d9-6ec447b38823" st_callid="0" st_callopt="0" st_rc="1" st_output="Performing: stonith -t external/sbd -T reset b300ple0 failed: b300ple0 0.05859375" src="b400ple0" seq="172" />
Mar 21 06:18:06 b400ple0 stonith-ng: [8603]: ERROR: remote_op_timeout: Action reboot (5cb46419-bfdb-4115-85d9-6ec447b38823) for b300ple0 timed out
Mar 21 06:18:06 b400ple0 stonith-ng: [8603]: info: remote_op_done: Notifing clients of 5cb46419-bfdb-4115-85d9-6ec447b38823 (reboot of b300ple0 from a8125881-30df-4bd4-a5b1-666020a29eba by (null)): 1, rc=-7
Mar 21 06:18:06 b400ple0 crmd: [8608]: info: tengine_stonith_callbackStonithOp <remote-op state="1" st_target="b300ple0" st_op="reboot" />
Mar 21 06:18:06 b400ple0 stonith-ng: [8603]: info: stonith_notify_client: Sending st_fence-notification to client 8608/bc1b0c7d-2cec-4e96-9523-5f6c51b52508
Mar 21 06:18:06 b400ple0 crmd: [8608]: info: tengine_stonith_callback: Stonith operation 44/15:49:0:44f2b175-7292-473a-a4e8-f9abda5b3ef6: Operation timed out (-7)
Mar 21 06:18:06 b400ple0 crmd: [8608]: ERROR: tengine_stonith_callback: Stonith of b300ple0 failed (-7)... aborting transition.
Mar 21 06:18:06 b400ple0 crmd: [8608]: info: abort_transition_graph: tengine_stonith_callback:401 - Triggered transition abort (complete=0) : Stonith failed
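
To pick the fencing failures out of a longer log, a small filter along these lines might help. It is only a sketch: the regular expression assumes every entry has the "timestamp host daemon: [pid]: level: function: message" shape seen in the excerpt above, and the function name errors is made up for this example. Entries that do not fit that shape are silently skipped.

```python
import re

# Three entries copied from the log excerpt above.
LOG = """\
Mar 21 06:18:00 b400ple0 stonith-ng: [8603]: ERROR: log_operation: Operation 'reboot' [3159] for host 'b300ple0' with device 'sbd_fense_SHARED2' returned: 1 (call 0 from (null))
Mar 21 06:18:06 b400ple0 stonith-ng: [8603]: ERROR: remote_op_timeout: Action reboot (5cb46419-bfdb-4115-85d9-6ec447b38823) for b300ple0 timed out
Mar 21 06:18:06 b400ple0 crmd: [8608]: info: abort_transition_graph: tengine_stonith_callback:401 - Triggered transition abort (complete=0) : Stonith failed
"""

# Assumed entry layout: "<time> <host> <daemon>: [<pid>]: <level>: <function>: <message>"
LINE_RE = re.compile(
    r"^(?P<time>\w{3} +\d+ [\d:]{8}) (?P<host>\S+) (?P<daemon>[\w-]+): "
    r"\[(?P<pid>\d+)\]: (?P<level>\w+): (?P<func>\w+): (?P<msg>.*)$"
)


def errors(text):
    """Return (time, daemon, function, message) for every ERROR entry."""
    found = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("level") == "ERROR":
            found.append(
                (
                    match.group("time"),
                    match.group("daemon"),
                    match.group("func"),
                    match.group("msg"),
                )
            )
    return found


if __name__ == "__main__":
    for entry in errors(LOG):
        print(entry)
```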

Because I rebooted the node manually to simulate an outage and have not started rcopenais yet, the sbd daemon is not running there yet either:

b400ple0:(/var/log/ha)(root)#/usr/sbin/sbd -d /dev/mapper/SHARED1_part1 list
0       b400ple0        clear
1       b300ple0        reset   b400ple0
b400ple0:(/var/log/ha)(root)#/usr/sbin/sbd -d /dev/mapper/SHARED2_part1  list
0       b300ple0        reset   b400ple0
1       b400ple0        clear

It is waiting until sbd picks up the reset command and acts on it.
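
The same check can be scripted. The sketch below parses the text printed by the sbd list command shown above and reports slots whose message is not "clear". The column meanings (slot number, node name, message, sender) are my reading of that output, not something taken from the sbd documentation, and the function names are invented for this example.

```python
def parse_sbd_list(text):
    """Parse the text printed by `sbd -d <device> list` into one dict per slot.

    Assumed columns: slot number, node name, message, optional sender.
    """
    slots = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, shell prompts and anything not starting with a slot number.
        if len(fields) < 3 or not fields[0].isdigit():
            continue
        slots.append(
            {
                "slot": int(fields[0]),
                "node": fields[1],
                "message": fields[2],
                "sender": fields[3] if len(fields) > 3 else None,
            }
        )
    return slots


def pending(slots):
    """Return the slots that still hold a message other than 'clear'."""
    return [slot for slot in slots if slot["message"] != "clear"]


# Output of the first sbd list command above (device SHARED1_part1).
SHARED1 = """\
0       b400ple0        clear
1       b300ple0        reset   b400ple0
"""

if __name__ == "__main__":
    for slot in pending(parse_sbd_list(SHARED1)):
        print(slot["node"], "still has a pending", slot["message"], "from", slot["sender"])
```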

My question is: where is the information stored that says the resource is still up? Is it in the lrm part? I have found that I can use crm node clearstate, which should set the node's state to offline and probably release the resources, but I want to find out exactly where this information is kept. All of the information is, or should be, located in the CIB, and I would like to know exactly which part is responsible for this behavior so that I can understand it better.
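
One way to look for it, as far as I understand the CIB layout, is to dump the status section (for example with cibadmin -Q -o status) and list the operations recorded under each node's lrm part. The sketch below assumes the usual node_state / lrm / lrm_resources / lrm_resource / lrm_rsc_op nesting. The XML sample is hypothetical and heavily trimmed; it is not taken from this cluster, and the call-id value in it is invented.

```python
import xml.etree.ElementTree as ET

# HYPOTHETICAL, trimmed status section for illustration only.
# Real entries carry many more attributes than shown here.
SAMPLE_STATUS = """\
<status>
  <node_state uname="b300ple0">
    <lrm>
      <lrm_resources>
        <lrm_resource id="rs_nw_cijj7" class="ocf" provider="heartbeat" type="IPaddr">
          <lrm_rsc_op operation="start" call-id="5" rc-code="0"/>
        </lrm_resource>
      </lrm_resources>
    </lrm>
  </node_state>
</status>
"""


def last_ops(status_xml):
    """Map (node, resource) to the (operation, rc-code) with the highest call-id."""
    root = ET.fromstring(status_xml)
    result = {}
    for node in root.iter("node_state"):
        for rsc in node.iter("lrm_resource"):
            ops = rsc.findall("lrm_rsc_op")
            if not ops:
                continue
            # Assumption: the highest call-id marks the most recent operation.
            latest = max(ops, key=lambda op: int(op.get("call-id", "-1")))
            key = (node.get("uname"), rsc.get("id"))
            result[key] = (latest.get("operation"), latest.get("rc-code"))
    return result


if __name__ == "__main__":
    for (node, rsc), (operation, rc) in last_ops(SAMPLE_STATUS).items():
        print(node, rsc, operation, rc)
```

If the last recorded operation for a resource on the dead node is a successful start, that would explain why the resource is still shown as Started there until the node is fenced or its state is cleared.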

Best regards

Jozef
