[Pacemaker] wrong device in stonith_admin -l

David Vossel dvossel at redhat.com
Mon Dec 17 11:43:18 EST 2012



----- Original Message -----
> From: laurent+pacemaker at u-picardie.fr
> To: pacemaker at oss.clusterlabs.org
> Sent: Tuesday, December 11, 2012 6:51:20 PM
> Subject: [Pacemaker] wrong device in stonith_admin -l
> 
> 
> Hi,
> 
> I've just observed something weird.
> A node is running a stonith resource for which gethosts gives an
> empty
> node list. The result of stonith_admin -l does include it in the
> device list !
> 
> result of "stonith_admin -l elasticsearch-05" run from
> elasticsearch-06 :
>  stonith-xen-peatbull
>  stonith-xen-eddu
> 2 devices found
> 
> stonith-xen-peatbull is a correct fencing device
> stonith-xen-eddu is a fencing device with an empty hostlist
> 
> running "my-xen0 gethosts" with the stonith-xen-eddu params by hand
> doesn't return any host, and it does exit with 0 (is that correct to
> return 0 with an empty host list ?)
>
> 
> logs :
> Dec 12 01:09:10 elasticsearch-06 stonith-ng[18181]:   notice:
> stonith_device_register: Added 'stonith-cluster-xen' to the device
> list (6 active devices)
> Dec 12 01:09:10 elasticsearch-06 attrd[18183]:   notice:
> attrd_trigger_update: Sending flush op to all hosts for:
> probe_complete (true)
> Dec 12 01:09:10 elasticsearch-06 attrd[18183]:   notice:
> attrd_perform_update: Sent update 5: probe_complete=true
> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]:   notice:
> stonith_device_register: Added 'stonith-xen-eddu' to the device list
> (6 active devices)
> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]:   notice:
> stonith_device_register: Added 'stonith-xen-peatbull' to the device
> list (6 active devices)
> Dec 12 01:09:12 elasticsearch-06 stonith: [18434]: info:
> external/my-xen0-ha device OK.
> Dec 12 01:09:12 elasticsearch-06 crmd[18185]:   notice:
> process_lrm_event: LRM operation stonith-cluster-xen_start_0
> (call=61,rc=0, cib-update=27, confirmed=true) ok
> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info:
> external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status'
> output: elasticsearch-05
> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info:
> external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status'
> output: elasticsearch-06
> Dec 12 01:09:15 elasticsearch-06 stonith: [18465]: info:
> external/my-xen0 device OK.
> Dec 12 01:09:15 elasticsearch-06 crmd[18185]:   notice:
> process_lrm_event: LRM operation stonith-xen-peatbull_start_0
> (call=68, rc=0, cib-update=28, confirmed=true) ok
> Dec 12 01:09:15 elasticsearch-06 stonith: [18458]: info:
> external/my-xen0 device OK.
> Dec 12 01:09:15 elasticsearch-06 crmd[18185]:   notice:
> process_lrm_event: LRM operation stonith-xen-eddu_start_0 (call=66,
> rc=0, cib-update=29, confirmed=true) ok
> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice:
> dynamic_list_search_cb: Disabling port list queries for
> stonith-xen-kornog (1): (null)
> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice:
> dynamic_list_search_cb: Disabling port list queries for
> stonith-xen-nikka (1): (null)
> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice:
> dynamic_list_search_cb: Disabling port list queries for
> stonith-xen-yoichi (1): (null)
> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: CRIT:
> external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: ERROR: Could not
> list hosts for external/my-xen0.
> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: CRIT:
> external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: ERROR: Could not
> list hosts for external/my-xen0.
> Dec 12 01:12:37 elasticsearch-06 stonith-ng[18181]:   notice:
> dynamic_list_search_cb: Disabling port list queries for
> stonith-xen-eddu (1): failed:  255

We discover which hosts an agent can fence by running this command internally in stonith.

# agent -o list

From there we expect an exit code of 0 and the list of nodes to be in the output.
https://fedorahosted.org/cluster/wiki/FenceAgentAPI

Looking at your logs, stonith-xen-eddu is returning -1 (255) as the return code when we issue the 'list' action.  That means we don't try to get the dynamic list again; we assume the 'list' action isn't supported. From there we fall back to using the 'status' action to dynamically determine whether the agent can fence a particular host.  I'm guessing the 'status' action is returning true (return codes 0 or 2) for hosts you wouldn't expect the agent to be able to fence for some reason.
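
To make the fallback described above concrete, here is a rough sketch of the decision in Python. It is not the stonith-ng code. The function names, and the assumption that the 'list' output carries one host per line with an optional alias after a comma or whitespace, are mine and purely illustrative.

```python
from typing import List


def parse_host_list(output: str) -> List[str]:
    """Extract host names from the output of a 'list' action.

    Assumption (illustrative only): one host per line, optionally
    followed by a comma or whitespace and an alias.
    """
    hosts = []
    for line in output.splitlines():
        fields = line.replace(",", " ").split()
        if fields:  # skip blank or separator-only lines
            hosts.append(fields[0])
    return hosts


def can_fence(target: str, list_rc: int, list_output: str, status_rc: int) -> bool:
    """Decide whether a device is considered able to fence `target`.

    list_rc / list_output: exit code and output of the 'list' action.
    status_rc: exit code of the 'status' action run against `target`.
    """
    if list_rc == 0:
        # 'list' worked: the target must appear in the returned hosts.
        return target in parse_host_list(list_output)
    # 'list' failed, so it is treated as unsupported and 'status' is
    # used instead; return codes 0 and 2 both count as "can fence".
    return status_rc in (0, 2)


# The case from the logs: 'list' fails with 255 and 'status' returns 0,
# so the device still ends up in the list for elasticsearch-05.
print(can_fence("elasticsearch-05", 255, "", 0))
```

Under this sketch, a 'list' action that exits 0 with an empty list would exclude the device, whereas a non-zero exit hands the decision to 'status'.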

-- Vossel

> 
> David, I mentioned a node being wrongly fenced in the
> "stonith-timeout
> duration 0 is too low" bug, could it be related ?
> 
> 



