[ClusterLabs] [OCF] Pacemaker reports a multi-state clone resource instance as running while it is not in fact
Ken Gaillot
kgaillot at redhat.com
Mon Jan 4 16:36:03 CET 2016
On 01/04/2016 09:25 AM, Bogdan Dobrelya wrote:
> On 04.01.2016 15:50, Bogdan Dobrelya wrote:
>> So far so bad.
>> I made a dummy OCF script [0] to simulate an example
>> promote/demote/notify failure mode for a multistate clone resource which
>> is very similar to the one I reported originally. And the test to
>> reproduce my case with the dummy is:
>> - install dummy resource ocf ra and create the dummy resource as README
>> [0] says
>> - just watch the a) OCF logs from the dummy and b) outputs for the
>> reoccurring commands:
>>
>> # while true; do date; ls /var/lib/heartbeat/trace_ra/dummy/ | tail -1;
>> sleep 20; done&
>> # crm_resource --resource p_dummy --list-operations
>>
>> At some point I noticed:
>> - there are no more "OK" messages logged from the monitor actions,
>> although according to the trace_ra dumps' timestamps, all monitors are
>> still being invoked!
>>
>> - at some point I noticed very strange results reported by the:
>> # crm_resource --resource p_dummy --list-operations
>> p_dummy (ocf::dummy:dummy): FAILED : p_dummy_monitor_103000
>> (node=node-1.test.domain.local, call=579, rc=1, last-rc-change=Mon Jan
>> 4 14:33:07 2016, exec=62107ms): Timed Out
>> or
>> p_dummy (ocf::dummy:dummy): Started : p_dummy_monitor_103000
>> (node=node-3.test.domain.local, call=-1, rc=1, last-rc-change=Mon Jan 4
>> 14:43:58 2016, exec=0ms): Timed Out
>>
>> - according to the trace_ra dumps reoccurring monitors are being invoked
>> by the intervals *much longer* than configured. For example, a 7 minutes
>> of "monitoring silence":
>> Mon Jan 4 14:47:46 UTC 2016
>> p_dummy.monitor.2016-01-04.14:40:52
>> Mon Jan 4 14:48:06 UTC 2016
>> p_dummy.monitor.2016-01-04.14:47:58
>>
>> Given that said, it is very likely there is some bug exist for
>> monitoring multi-state clones in pacemaker!
>>
>> [0] https://github.com/bogdando/dummy-ocf-ra
>>
>
> Also note, that lrmd spawns *many* monitors like:
> root 6495 0.0 0.0 70268 1456 ? Ss 2015 4:56 \_
> /usr/lib/pacemaker/lrmd
> root 31815 0.0 0.0 4440 780 ? S 15:08 0:00 | \_
> /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31908 0.0 0.0 4440 388 ? S 15:08 0:00 |
> \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31910 0.0 0.0 4440 384 ? S 15:08 0:00 |
> \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> root 31915 0.0 0.0 4440 392 ? S 15:08 0:00 |
> \_ /bin/sh /usr/lib/ocf/resource.d/dummy/dummy monitor
> ...
At first glance, that looks like your monitor action is calling itself
recursively, but I don't see how in your code.
> At some point, there was 1111 already. Then I unmanaged the p_dummy but
> it grew up to the 2403 after that. The number of running monitors may
> grow or decrease as well.
> Also, the /var/lib/heartbeat/trace_ra/dummy/ still have been populated
> by new p_dummy.monitor* files with recent timestamps. Why?..
>
> If I pkill -9 all dummy monitors, lrmd spawns another ~2000 almost
> instantly :) Unless the node became unresponsive at some point. And
> after restarted by power off&on:
> # crm_resource --resource p_dummy --list-operations
> p_dummy (ocf::dummy:dummy): Started (unmanaged) :
> p_dummy_monitor_30000 (node=node-1.test.domain.local, call=679, rc=1,
> last-rc-change=Mon Jan 4 15:04:25 2016, exec=66747ms): Timed Out
> or
> p_dummy (ocf::dummy:dummy): Stopped (unmanaged) :
> p_dummy_monitor_103000 (node=node-3.test.domain.local, call=142, rc=1,
> last-rc-change=Mon Jan 4 15:14:59 2016, exec=65237ms): Timed Out
>
> And then lrmd repeats all of the fun again.
>
>
More information about the Users
mailing list