[Pacemaker] wrong device in stonith_admin -l
Andrew Beekhof
andrew at beekhof.net
Mon Dec 17 02:38:30 UTC 2012
On Sat, Dec 15, 2012 at 8:45 AM, <laurent+pacemaker at u-picardie.fr> wrote:
> Andrew Beekhof <andrew at beekhof.net> writes:
>
>> On Wed, Dec 12, 2012 at 11:51 AM, <laurent+pacemaker at u-picardie.fr> wrote:
>>>
>>> Hi,
>>>
>>> I've just observed something weird.
>>> A node is running a stonith resource for which gethosts gives an empty
>>> node list, yet the output of stonith_admin -l still includes it in the
>>> device list!
>>>
>>> result of "stonith_admin -l elasticsearch-05" run from
>>> elasticsearch-06 :
>>> stonith-xen-peatbull
>>> stonith-xen-eddu
>>> 2 devices found
>>>
>>> stonith-xen-peatbull is a correct fencing device
>>> stonith-xen-eddu is a fencing device with an empty hostlist
>>>
>>> Running "my-xen0 gethosts" by hand with the stonith-xen-eddu params
>>> doesn't return any host, and it exits with 0 (is it correct to
>>> return 0 with an empty host list?)
>>>
>>> logs :
>>> Dec 12 01:09:10 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-cluster-xen' to the device list (6 active devices)
>>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
>>> Dec 12 01:09:10 elasticsearch-06 attrd[18183]: notice: attrd_perform_update: Sent update 5: probe_complete=true
>>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-xen-eddu' to the device list (6 active devices)
>>> Dec 12 01:09:11 elasticsearch-06 stonith-ng[18181]: notice: stonith_device_register: Added 'stonith-xen-peatbull' to the device list (6 active devices)
>>> Dec 12 01:09:12 elasticsearch-06 stonith: [18434]: info: external/my-xen0-ha device OK.
>>> Dec 12 01:09:12 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-cluster-xen_start_0 (call=61,rc=0, cib-update=27, confirmed=true) ok
>>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-05
>>> Dec 12 01:09:14 elasticsearch-06 stonith: [18465]: info: external_run_cmd: '/usr/lib/stonith/plugins/external/my-xen0 status' output: elasticsearch-06
>>> Dec 12 01:09:15 elasticsearch-06 stonith: [18465]: info: external/my-xen0 device OK.
>>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-xen-peatbull_start_0 (call=68, rc=0, cib-update=28, confirmed=true) ok
>>> Dec 12 01:09:15 elasticsearch-06 stonith: [18458]: info: external/my-xen0 device OK.
>>> Dec 12 01:09:15 elasticsearch-06 crmd[18185]: notice: process_lrm_event: LRM operation stonith-xen-eddu_start_0 (call=66, rc=0, cib-update=29, confirmed=true) ok
>>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-kornog (1): (null)
>>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-nikka (1): (null)
>>> Dec 12 01:12:34 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-yoichi (1): (null)
>>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>>> Dec 12 01:12:34 elasticsearch-06 stonith: [19301]: ERROR: Could not list hosts for external/my-xen0.
>>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: CRIT: external_hostlist: 'my-xen0 gethosts' returned an empty hostlist
>>> Dec 12 01:12:37 elasticsearch-06 stonith: [19332]: ERROR: Could not list hosts for external/my-xen0.
>>> Dec 12 01:12:37 elasticsearch-06 stonith-ng[18181]: notice: dynamic_list_search_cb: Disabling port list queries for stonith-xen-eddu (1): failed: 255
>>>
>>> David, I mentioned a node being wrongly fenced in the "stonith-timeout
>>> duration 0 is too low" bug, could it be related ?
>
> Hi,
>
>> Doubtful, what does your config look like?
>
> I've restarted from scratch with a simpler setup:
> primitive dummy_01 ocf:heartbeat:Dummy \
> meta allow-migrate="true" \
> op monitor interval="180" timeout="20"
> primitive stonith-xen-eddu stonith:external/my-xen0 \
> params hostlist="elasticsearch-01 elasticsearch-02 elasticsearch-03 elasticsearch-04" dom0="eddu"
> clone clone-stonith-xen-eddu stonith-xen-eddu \
> meta clone-max="3" clone-node-max="1"
> location clone-stonith-xen-eddu-location-01 clone-stonith-xen-eddu \
> rule $id="clone-stonith-xen-eddu-location-01-rule" inf: defined #uname
> location dummy_01-location-01 dummy_01 \
> rule $id="dummy_01-location-01-rule" inf: defined #uname
> property $id="cib-bootstrap-options" \
> dc-version="1.1.8-56429db" \
> cluster-infrastructure="corosync" \
> stonith-timeout="120" \
> symmetric-cluster="false" \
> no-quorum-policy="stop" \
> stonith-enabled="true"
>
> There are 6 nodes: elasticsearch-01 ... 06
> AFAIK pcmk_host_check defaults to "dynamic-list".
>
> When the external stonith agent is called with "gethosts", it checks
> whether any of the guests are running on eddu (the xen dom0/host).
> In this case none of them is running on eddu, so it returns
> an empty hostlist.
> Looking at the logs, there's a critical message concerning the empty
> hostlist.
> So I guess it's not valid for a stonith primitive to temporarily
> have no hosts to fence.
Just to be clear, that's the cluster-glue stonith binary complaining,
not Pacemaker.
>
> It's just that I would certainly not expect that device to appear in
> the result of "stonith_admin -l nodename".
> And it does! :)
Might be time to create a bug and attach logs.
> I've just reproduced it again starting a new cluster from scratch and
> using the above config.
> Let's say the stonith agent runs on nodes 02, 03 and 04.
> The first time I run stonith_admin -l "elasticsearch-01" on node 02,
> 03 or 04 it returns "No devices found". From the second attempt on, it
> lists "stonith-xen-eddu" as a valid device.
>
> That's the same behavior I observed with the "stonith-timeout duration 0
> is too low" bug.
> I wouldn't be surprised if the two were related: after a timeout or
> an empty hostlist, the stonith device is wrongly reported as
> a valid fencing device instead of being blacklisted/disabled.
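A minimal sketch of the expectation described above, in Python for brevity. It is not how stonith-ng is implemented; `HostQuery` and `devices_for` are made-up names, and the host lists are illustrative values loosely based on the earlier logs. The idea is only that a device whose host-list query failed, or came back empty, should never be offered for any target.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class HostQuery:
    """Outcome of one dynamic host-list query (hypothetical structure)."""
    device: str       # name of the fencing device
    rc: int           # exit status of the host-list query
    hosts: List[str]  # hosts the device reported it can fence


def devices_for(target: str, results: List[HostQuery]) -> List[str]:
    """Return the devices that should be listed as able to fence `target`."""
    eligible = []
    for r in results:
        if r.rc != 0:
            continue  # query failed or timed out: do not offer this device
        if target in r.hosts:  # an empty host list never matches
            eligible.append(r.device)
    return eligible


results = [
    HostQuery("stonith-xen-peatbull", 0, ["elasticsearch-05", "elasticsearch-06"]),
    HostQuery("stonith-xen-eddu", 255, []),  # empty hostlist, failed query
]
print(devices_for("elasticsearch-05", results))
```

With these inputs only stonith-xen-peatbull is listed; the same holds if stonith-xen-eddu's query exits 0 but returns no hosts.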
>
> I hope it's a bit clearer now. If not, I'll have to try to learn how to
> write a test case for it. (That would definitely make it clearer!)
> :-)
>
>
>> IIRC, these agents want to be told which machines they can fence
>
> I'd say that's true for the ipmi agent.
> But a xen guest might be migrated from one host to another.
Agreed. But I believe that's how most of them are written.
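A rough sketch of the two styles being contrasted here, again in Python and under stated assumptions: `gethosts_static` and `gethosts_dynamic` are hypothetical names, not functions from cluster-glue or the my-xen0 agent, and the set of guests running on the dom0 is passed in rather than queried from Xen.

```python
from typing import List, Set

# The hostlist parameter from the configuration quoted above.
CONFIGURED = "elasticsearch-01 elasticsearch-02 elasticsearch-03 elasticsearch-04"


def gethosts_static(hostlist: str) -> List[str]:
    """Report exactly the hosts the agent was told it can fence."""
    return hostlist.split()


def gethosts_dynamic(hostlist: str, running_on_dom0: Set[str]) -> List[str]:
    """Report only the configured guests currently running on this dom0."""
    return [h for h in hostlist.split() if h in running_on_dom0]


print(gethosts_static(CONFIGURED))
print(gethosts_dynamic(CONFIGURED, {"elasticsearch-02"}))
print(gethosts_dynamic(CONFIGURED, set()))  # no guest on the dom0: empty list
```

The last call is the case from this thread: no guest runs on eddu, so the dynamic variant returns an empty list, while the static variant would still return all four configured hosts.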