[Pacemaker] wrong device in stonith_admin -l

laurent+pacemaker at u-picardie.fr
Fri Dec 14 21:45:49 UTC 2012


Andrew Beekhof <andrew at beekhof.net> writes:

> On Wed, Dec 12, 2012 at 11:51 AM,  <laurent+pacemaker at u-picardie.fr> wrote:
>>
>> Hi,
>>
>> I've just observed something weird.
>> A node is running a stonith resource for which gethosts gives an empty
>> node list. The result of stonith_admin -l does include it in the
>> device list !
>>
>> result of "stonith_admin -l elasticsearch-05" run from
>> elasticsearch-06 :
>>  stonith-xen-peatbull
>>  stonith-xen-eddu
>> 2 devices found
>>
>> stonith-xen-peatbull is a correct fencing device
>> stonith-xen-eddu is a fencing device with an empty hostlist
>>
>> running "my-xen0 gethosts" with the stonith-xen-eddu params by hand
>> doesn't return any host, and it does exit with 0 (is that correct to
>> return 0 with an empty host list ?)
>>
>> logs :
>> Dec 12 01:09:10 elasticsearch-06 stonith-ng[18181]:   notice: stonith_device_register: Added 'stonith-cluster-xen' to the device list (6 active devices)
>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]:   notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]:   notice: attrd_perform_update: Sent update 5: probe_complete=true
>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]:   notice: stonith_device_register: Added 'stonith-xen-eddu' to the device list (6 active devices)
>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]:   notice: stonith_device_register: Added 'stonith-xen-peatbull' to the device list (6 active devices)
>> Dec 12 01:09:12 elasticsearch-06 stonith: [18434]: info: external/my-xen0-ha device OK.
>> Dec 12 01:09:12 elasticsearch-06 crmd[18185]:   notice: process_lrm_event: LRM operation stonith-cluster-xen_start_0 (call=61,rc=0, cib-update=27, confirmed=true) ok
>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-05
>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-06
>> Dec 12 01:09:15 elasticsearch-06 stonith: [18465]: info: external/my-xen0 device OK.
>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]:   notice: process_lrm_event: LRM operation stonith-xen-peatbull_start_0 (call=68, rc=0, cib-update=28, confirmed=true) ok
>> Dec 12 01:09:15 elasticsearch-06 stonith: [18458]: info: external/my-xen0 device OK.
>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]:   notice: process_lrm_event: LRM operation stonith-xen-eddu_start_0 (call=66, rc=0, cib-update=29, confirmed=true) ok
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-kornog (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-nikka (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]:   notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-yoichi (1): (null)
>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: ERROR: Could not list hosts for external/my-xen0.
>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: ERROR: Could not list hosts for external/my-xen0.
>> Dec 12 01:12:37 elasticsearch-06 stonith-ng[18181]:   notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-eddu (1): failed:  255
>>
>> David, I mentioned a node being wrongly fenced in the "stonith-timeout
>> duration 0 is too low" bug, could it be related ?

Hi,

> Doubtful, what does your config look like?

I've restarted from scratch with a simpler setup:
primitive dummy_01 ocf:heartbeat:Dummy \
        meta allow-migrate="true" \
        op monitor interval="180" timeout="20"
primitive stonith-xen-eddu stonith:external/my-xen0 \
        params hostlist="elasticsearch-01 elasticsearch-02 elasticsearch-03 elasticsearch-04" dom0="eddu"
clone clone-stonith-xen-eddu stonith-xen-eddu \
        meta clone-max="3" clone-node-max="1"
location clone-stonith-xen-eddu-location-01 clone-stonith-xen-eddu \
        rule $id="clone-stonith-xen-eddu-location-01-rule" inf: defined #uname
location dummy_01-location-01 dummy_01 \
        rule $id="dummy_01-location-01-rule" inf: defined #uname
property $id="cib-bootstrap-options" \
        dc-version="1.1.8-56429db" \
        cluster-infrastructure="corosync" \
        stonith-timeout="120" \
        symmetric-cluster="false" \
        no-quorum-policy="stop" \
        stonith-enabled="true"

There are 6 nodes: elasticsearch-01 ... 06.
AFAIK pcmk_host_check defaults to "dynamic-list".

When the external stonith agent is called with "gethosts", it checks
whether any of the guests are running on eddu (the Xen dom0/host).
In this case none of them are running on eddu, so it returns
an empty hostlist.
The logs show a critical message about the empty hostlist,
so I guess it's not valid for a stonith primitive to temporarily
have no hosts to fence.
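
A minimal sketch of the kind of filtering described above may help. It is
not the real my-xen0 agent, whose code isn't shown in this thread: the
function name filter_hosts is hypothetical, and the list of guests running
on the dom0 is passed in as an argument instead of being queried from Xen.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the real my-xen0 agent.
# filter_hosts HOSTLIST RUNNING
#   HOSTLIST: space-separated guests the device is configured for
#   RUNNING:  space-separated guests currently running on the dom0
# Prints, one per line, the configured guests that are running on the dom0.
filter_hosts() {
    for host in $1; do
        for dom in $2; do
            if [ "$host" = "$dom" ]; then
                echo "$host"
                break
            fi
        done
    done
}

# None of the configured guests run on this dom0: nothing is printed,
# yet the exit status is still 0, which is the situation described above.
filter_hosts "elasticsearch-01 elasticsearch-02" "elasticsearch-05 elasticsearch-06"
```

With this shape, an empty hostlist and a successful exit are indistinguishable
from "the query worked" unless the caller also checks the output.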


It's just that I would certainly not expect that device to appear in the
result of "stonith_admin -l nodename".
And it does! :)

I've just reproduced it, starting a new cluster from scratch and
using the above config.
Let's say the stonith agent runs on nodes 02, 03 and 04.
The first time I run stonith_admin -l "elasticsearch-01" on node 02,
03 or 04, it returns "No devices found". From the second attempt on, it
lists "stonith-xen-eddu" as a valid device.

That's the same behavior I observed with the "stonith-timeout duration 0
is too low" bug.
I wouldn't be surprised if they were related: after a timeout or
an empty hostlist, the stonith device is wrongly reported as
a valid fencing device instead of being blacklisted/disabled.
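
To spell out the behavior I would expect, here is a small sketch. It is only
an illustration of the expectation, not how stonith-ng implements device
lookup; the function name device_can_fence and its arguments are made up.

```shell
#!/bin/sh
# Hypothetical sketch of the EXPECTED rule, not stonith-ng's actual logic.
# device_can_fence TARGET RC HOSTS
#   TARGET: node we want to fence
#   RC:     exit status of the device's host list query
#   HOSTS:  space-separated host list the query returned
# Returns 0 only if the query succeeded and TARGET is in HOSTS.
device_can_fence() {
    target="$1"
    rc="$2"
    hosts="$3"
    # A failed or timed-out query must never make the device eligible.
    [ "$rc" -eq 0 ] || return 1
    for h in $hosts; do
        [ "$h" = "$target" ] && return 0
    done
    # Empty list, or target not in it: the device is not eligible.
    return 1
}

# Query failed with 255 and returned no hosts: the device should not be listed.
device_can_fence "elasticsearch-01" 255 "" || echo "not a valid device"
```

Under this rule a device whose query failed or came back empty would never
show up for any target, which is the opposite of what I see from the second
attempt on.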

I hope it's a bit clearer now. If not, I'll have to learn how to
write a test case for it (that would definitely make it clearer!)
:-)


> IIRC, these agents want to be told which machines they can fence

I'd say that's true for the IPMI agent.
But a Xen guest might be migrated from one host to another.


-- 
Laurent



