[Pacemaker] monitoring action fails

Wed Nov 19 14:47:13 UTC 2008

Actually, there was (also?) a bug here causing re-probe loops.
Fix in:
   http://hg.clusterlabs.org/pacemaker/stable-1.0/rev/3df83ce5c974

On Wed, Nov 19, 2008 at 14:25, Andrew Beekhof <beekhof at gmail.com> wrote:
> My suspicion here is that the RA is messing up the monitoring action.
> I'd suggest trying with just one of the drbd clones and see if that works.
>
> On Wed, Nov 12, 2008 at 13:19, Raoul Bhatia [IPAX] <r.bhatia at ipax.at> wrote:
>> hi,
>>
>> i have a cluster with several resources.
>>
>> i issued crm_resource -P and the cluster is now in a strange state,
>> which it cannot resolve by itself:
>>
>>> Node: wc01 (31de4ab3-2d05-476e-8f9a-627ad6cd94ca): standby
>>> Node: wc02 (f36760d8-d84a-46b2-b452-4c8cac8b3396): standby
>> ...
>>> Master/Slave Set: ms_drbd_www
>>>     drbd_www:0  (ocf::heartbeat:drbd) Master [  wc01    wc02 ]
>>>     drbd_www:1  (ocf::heartbeat:drbd) Master [  wc01    wc02 ]
>> ...
>>> Master/Slave Set: ms_drbd_mysql
>>>     drbd_mysql:0        (ocf::heartbeat:drbd) Master [  wc01    wc02 ]
>>>     drbd_mysql:1        (ocf::heartbeat:drbd) Master [  wc01    wc02 ]
>>
>> failed actions:
>>> Failed actions:
>>>     drbd_www:1_monitor_0 (node=wc02, call=13666, rc=0): complete
>>>     drbd_www:0_monitor_0 (node=wc02, call=13665, rc=0): complete
>>>     drbd_mysql:1_monitor_0 (node=wc02, call=13672, rc=0): complete
>>>     drbd_mysql:0_monitor_0 (node=wc02, call=13671, rc=0): complete
>>
>> those monitoring failures repeat continuously. in the logfiles i find:
>> ...
>>> crmd[14105]: 2008/11/12_13:14:19 WARN: status_from_rc: Action 16 (drbd_www:0_monitor_0) on wc02 failed (target: 8 vs. rc: 0): Error
>>> crmd[14105]: 2008/11/12_13:14:19 info: abort_transition_graph: __FUNCTION__:385 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=drbd_www:0_monitor_0, magic=0:0;16:670:8:d3f15030-d3f0-421d-a477-ce19a2cae321) : Event failed
>>> crmd[14105]: 2008/11/12_13:14:19 info: update_abort_priority: Abort priority upgraded from 0 to 1
>>> crmd[14105]: 2008/11/12_13:14:19 info: update_abort_priority: Abort action done superceeded by restart
>>> crmd[14105]: 2008/11/12_13:14:19 info: match_graph_event: Action drbd_www:0_monitor_0 (16) confirmed on wc02 (rc=4)
>>> crmd[14105]: 2008/11/12_13:14:19 WARN: status_from_rc: Action 17 (drbd_www:1_monitor_0) on wc02 failed (target: 8 vs. rc: 0): Error
>>> crmd[14105]: 2008/11/12_13:14:19 info: abort_transition_graph: __FUNCTION__:385 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=drbd_www:1_monitor_0, magic=0:0;17:670:8:d3f15030-d3f0-421d-a477-ce19a2cae321) : Event failed
>>> crmd[14105]: 2008/11/12_13:14:19 info: match_graph_event: Action drbd_www:1_monitor_0 (17) confirmed on wc02 (rc=4)
>> ...
>>
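For reference, the "target: 8 vs. rc: 0" in the status_from_rc lines above can be read against the standard OCF return code table: the probe was expected to return 8 but returned 0. The sketch below only translates the numbers into names; ocf_rc_name is a hypothetical helper, not part of Pacemaker, and it is not meant for the "rc=4" in the match_graph_event lines, which may not be an OCF exit code at all.

```shell
#!/bin/sh
# Hypothetical helper (not part of Pacemaker): map an OCF resource agent
# exit code to its symbolic name from the OCF return code table.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo UNKNOWN ;;
    esac
}

# Values taken from the status_from_rc log line: target 8, rc 0.
echo "expected: $(ocf_rc_name 8), got: $(ocf_rc_name 0)"
```

Read this way, the cluster apparently expected to find the instance running as master, while the recorded result said it was merely running.
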
>> i put some debug information into the drbd ocf ra:
>>> #!/bin/sh
>>> echo "----" >> /tmp/lalala
>>
>> but /tmp/lalala stays empty. if i manually call the drbd ra with
>> all parameters i get the expected rc 8.
>>
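The manual check described above (calling the RA by hand and looking at its exit code) could be scripted roughly as follows. This is only a sketch: run_ra is a hypothetical wrapper, the default OCF_ROOT of /usr/lib/ocf is an assumption about the install, and the agent needs its OCF_RESKEY_* parameters passed in as NAME=value pairs.

```shell
#!/bin/sh
# Hypothetical wrapper: run a resource agent by hand and report its exit code.
# Usage: run_ra /path/to/agent action [NAME=value ...]
run_ra() {
    ra=$1
    action=$2
    shift 2
    # Remaining arguments are NAME=value pairs exported to the agent only.
    # The OCF_ROOT default is an assumption; adjust it to the local install.
    env OCF_ROOT="${OCF_ROOT:-/usr/lib/ocf}" "$@" "$ra" "$action"
    rc=$?
    echo "$ra $action -> rc=$rc"
    return "$rc"
}
```

A call might look like `run_ra /usr/lib/ocf/resource.d/heartbeat/drbd monitor OCF_RESKEY_drbd_resource=www`, where both the agent path and the parameter value are assumptions about this particular setup. Comparing the exit code from such a run with the rc recorded by crmd shows whether the two disagree.
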
>> hb_report http://ip52.ipax.at/~raoul/cluster/no_monitor_action.tar.gz
>> (it's kinda big as a lot of actions failed)
>>
>> cheers,
>> raoul
>>
>> ps: i already tried to revoke the crm_standby, but this does not
>> resolve the error messages and does not call the drbd ocf ra.
>> --
>> ____________________________________________________________________
>> DI (FH) Raoul Bhatia M.Sc.          email.          r.bhatia at ipax.at
>> Technischer Leiter
>>
>> IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
>> Barawitzkagasse 10/2/2/11           email.            office at ipax.at
>> 1190 Wien                           tel.               +43 1 3670030
>> FN 277995t HG Wien                  fax.            +43 1 3670030 15
>> ____________________________________________________________________
>