[Pacemaker] a question on the `ping` RA

Thu May 29 11:19:53 UTC 2014

Hello,

We have set up a cluster of 10 nodes to serve a Lustre filesystem to a
computational cluster, with Pacemaker+Corosync handling failover
between hosts.  Each host is connected to an Ethernet network and an
InfiniBand network, and we set up a `ping` resource to check that
storage nodes can see compute nodes over InfiniBand.  The intention is
that a storage node which cannot communicate with compute nodes over
IB hands over its resources to another storage node.

Here's the relevant section from `crm configure show`::

    primitive ping ocf:pacemaker:ping \
            params name=ping dampen=5s multiplier=10 host_list="lustre-mds1 ibr01c01b01n01 ...(24 hosts omitted)..." \
            op start timeout=120 interval=0 \
            op monitor timeout=60 interval=10 \
            op stop timeout=20 interval=0
    clone ping_clone ping \
            meta globally-unique=false clone-node-max=1 is-managed=true target-role=Started
    # Bind OST locations to hosts that can actually support them.
    location mdt-location mdt \
            [...]
            rule $id="mdt_only_if_ping_works" -INFINITY: not_defined ping or ping number:lte 0

In our understanding of the `ping` RA, this would add a score from 0
to 520, depending on how many compute nodes a storage node can ping.

Since the resource stickiness is 2000, resources would only move if
the `ping` RA failed completely and the host was totally cut off from
the IB network.
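
As a minimal sketch of the two pieces involved here: the `ping` RA is
documented to set a node attribute equal to the number of reachable
hosts times `multiplier`, and the rule shown above only applies
-INFINITY when that attribute is undefined or not above zero.  The
function names below are hypothetical, the host counts are made up for
illustration, and the sketch does not model any rule hidden behind the
`[...]` in the configuration.

```python
def ping_attribute(reachable_hosts, multiplier=10):
    """Attribute value set by the ping RA: reachable hosts times multiplier."""
    return reachable_hosts * multiplier


def node_is_banned(ping_value):
    """Evaluate `not_defined ping or ping number:lte 0`.

    `None` stands for "attribute not defined" (an assumption of this sketch).
    """
    return ping_value is None or ping_value <= 0


# Illustrative values only: 3 reachable hosts, then none, then no attribute.
for value in (ping_attribute(3), ping_attribute(0), None):
    print(value, node_is_banned(value))
```

Under this reading, a node that still reaches at least one host keeps a
positive attribute value and is not excluded by the rule shown.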

However, last night resources moved back and forth between two storage
nodes.  The only trace left in the logs is that `ping` failed
everywhere, plus some trouble reports from Corosync (which we cannot
explain and which could be the real cause)::

    May 28 00:29:19 lustre-mds1 ping(ping)[8147]: ERROR: Unexpected result for 'ping -n -q -W 5 -c 3  iblustre-mds1' 2: ping: unknown host iblustre-mds1
    May 28 00:29:22 lustre-mds1 corosync[23879]:   [TOTEM ] Incrementing problem counter for seqid 11125389 iface 10.129.93.10 to [9 of 10]
    May 28 00:29:25 lustre-mds1 corosync[23879]:   [TOTEM ] Incrementing problem counter for seqid 11126239 iface 10.129.93.10 to [10 of 10]
    May 28 00:29:25 lustre-mds1 corosync[23879]:   [TOTEM ] Marking seqid 11126239 ringid 0 interface 10.129.93.10 FAULTY
    May 28 00:29:26 lustre-mds1 corosync[23879]:   [TOTEM ] Automatically recovered ring 0
    May 28 00:29:27 lustre-mds1 lrmd[23906]:  warning: child_timeout_callback: ping_monitor_10000 process (PID 8147) timed out
    May 28 00:29:27 lustre-mds1 lrmd[23906]:  warning: operation_finished: ping_monitor_10000:8147 - timed out after 60000ms
    May 28 00:29:27 lustre-mds1 crmd[23909]:    error: process_lrm_event: Operation ping_monitor_10000: Timed Out (node=lustre-mds1.ften.es.hpcn.uzh.ch, call=267, timeout=60000ms)
    May 28 00:29:27 lustre-mds1 corosync[23879]:   [TOTEM ] Incrementing problem counter for seqid 11126319 iface 10.129.93.10 to [1 of 10]
    May 28 00:29:27 lustre-mds1 crmd[23909]:  warning: update_failcount: Updating failcount for ping on lustre-mds1.ften.es.hpcn.uzh.ch after failed monitor: rc=1 (update=value++, time=1401229767)
    [...]
    May 28 00:30:03 lustre-mds1 crmd[23909]:  warning: update_failcount: Updating failcount for ping on lustre-oss1.ften.es.hpcn.uzh.ch after failed monitor: rc=1 (update=value++, time=1401229803)
    May 28 00:30:03 lustre-mds1 crmd[23909]:   notice: run_graph: Transition 472 (Complete=7, Pending=0, Fired=0, Skipped=1, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-2770.bz2): Stopped
    May 28 00:30:03 lustre-mds1 pengine[23908]:  warning: unpack_rsc_op_failure: Processing failed op monitor for ping:0 on lustre-oss4.ften.es.hpcn.uzh.ch: unknown error (1)
    May 28 00:30:03 lustre-mds1 pengine[23908]:  warning: unpack_rsc_op_failure: Processing failed op monitor for ping:1 on lustre-oss5.ften.es.hpcn.uzh.ch: unknown error (1)
    May 28 00:30:03 lustre-mds1 pengine[23908]:  warning: unpack_rsc_op_failure: Processing failed op monitor for ping:2 on lustre-oss6.ften.es.hpcn.uzh.ch: unknown error (1)
    May 28 00:30:03 lustre-mds1 pengine[23908]:  warning: unpack_rsc_op_failure: Processing failed op monitor for ping:3 on lustre-oss7.ften.es.hpcn.uzh.ch: unknown error (1)
    May 28 00:30:03 lustre-mds1 pengine[23908]:  warning: unpack_rsc_op_failure: Processing failed op monitor for ping:4 on lustre-oss8.ften.es.hpcn.uzh.ch: unknown error (1)
    May 28 00:30:03 lustre-mds1 pengine[23908]:  warning: unpack_rsc_op_failure: Processing failed op monitor for ping:5 on lustre-mds1.ften.es.hpcn.uzh.ch: unknown error (1)
    May 28 00:30:03 lustre-mds1 pengine[23908]:  warning: unpack_rsc_op_failure: Processing failed op monitor for ping:6 on lustre-mds2.ften.es.hpcn.uzh.ch: unknown error (1)
    May 28 00:30:03 lustre-mds1 pengine[23908]:  warning: unpack_rsc_op_failure: Processing failed op monitor for ping:7 on lustre-oss1.ften.es.hpcn.uzh.ch: unknown error (1)
    May 28 00:30:03 lustre-mds1 pengine[23908]:  warning: unpack_rsc_op_failure: Processing failed op monitor for ping:8 on lustre-oss2.ften.es.hpcn.uzh.ch: unknown error (1)
    May 28 00:30:03 lustre-mds1 pengine[23908]:  warning: unpack_rsc_op_failure: Processing failed op monitor for ping:9 on lustre-oss3.ften.es.hpcn.uzh.ch: unknown error (1)
    May 28 00:30:03 lustre-mds1 pengine[23908]:   notice: LogActions: Restart mdt#011(Started lustre-mds1.ften.es.hpcn.uzh.ch)
    May 28 00:30:03 lustre-mds1 pengine[23908]:   notice: LogActions: Move    mgt#011(Started lustre-mds2.ften.es.hpcn.uzh.ch -> lustre-mds1.ften.es.hpcn.uzh.ch)
    May 28 00:30:03 lustre-mds1 pengine[23908]:   notice: LogActions: Restart ost00#011(Started lustre-oss1.ften.es.hpcn.uzh.ch)
    May 28 00:30:03 lustre-mds1 pengine[23908]:   notice: LogActions: Restart ost01#011(Started lustre-oss3.ften.es.hpcn.uzh.ch)
    [...]
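
To see at a glance which nodes reported a failed operation, the
`unpack_rsc_op_failure` lines can be grouped by operation and resource.
The following is only a rough sketch: the function names are
hypothetical, the pattern is written against the message wording seen
in the excerpt above, and it tolerates line wrapping by matching any
whitespace between words.  It summarises the log and says nothing
about the underlying cause.

```python
import re
from collections import defaultdict

# Matches e.g. "Processing failed op monitor for ping:0 on <node>:",
# even if the message was wrapped across several lines.
FAILED_OP = re.compile(
    r"Processing\s+failed\s+op\s+(\w+)\s+for\s+(\S+?)(?::\d+)?\s+on\s+(\S+):"
)


def failed_ops(log_text):
    """Return (operation, resource, node) tuples found in log_text."""
    return FAILED_OP.findall(log_text)


def nodes_by_failure(log_text):
    """Map (operation, resource) to the sorted list of affected nodes."""
    grouped = defaultdict(set)
    for op, resource, node in failed_ops(log_text):
        grouped[(op, resource)].add(node)
    return {key: sorted(nodes) for key, nodes in grouped.items()}


# One message from the excerpt above, wrapped as it appears in the mail.
sample = """\
May 28 00:30:03 lustre-mds1 pengine[23908]:  warning:
unpack_rsc_op_failure: Processing failed op monitor for ping:0 on
lustre-oss4.ften.es.hpcn.uzh.ch: unknown error (1)
"""
print(nodes_by_failure(sample))
```

Running it over the full log would show whether the `ping` monitor
failed on all ten nodes within the same transition, as the excerpt
suggests.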

So, questions:

- is this the way one is supposed to use the `ping` RA, i.e., to
  compute a score based on the number of reachable test nodes?

- or rather does the `ping` RA trigger failure events when even one of
  the nodes cannot be pinged?

- could the ping failure have triggered the resource restart above?

- any hints on how to debug the issue further?

Thank you for any help!

Kind regards,
Riccardo

--
Riccardo Murri
http://www.gc3.uzh.ch/people/rm

Grid Computing Competence Centre
University of Zurich
Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
Tel: +41 44 635 4222
Fax: +41 44 635 6888