[Pacemaker] [SOLVED] Resource-Monitoring with an "On Fail"-Action
Tom Tux
tomtux80 at gmail.com
Fri Mar 19 05:08:44 UTC 2010
Hi
Thanks a lot for your help.
So now it's Novell's turn... :-)
Regards,
Tom
2010/3/18 Dejan Muhamedagic <dejanmm at fastmail.fm>:
> Hi,
>
> On Thu, Mar 18, 2010 at 02:15:07PM +0100, Tom Tux wrote:
>> Hi Dejan
>>
>> hb_report -V says:
>> cluster-glue: 1.0.2 (b75bd738dc09263a578accc69342de2cb2eb8db6)
>
> Yes, unfortunately that one is buggy.
>
>> I've opened a case with Novell. They will fix this problem by updating
>> to the newest cluster-glue release.
>>
>> Could it be that I have another configuration issue in my
>> cluster config? I think that with the following settings, the resource
>> should be monitored:
>>
>> ...
>> ...
>> primitive MySQL_MonitorAgent_Resource lsb:mysql-monitor-agent \
>> meta migration-threshold="3" \
>> op monitor interval="10s" timeout="20s" on-fail="restart"
>> op_defaults $id="op_defaults-options" \
>> on-fail="restart" \
>> enabled="true"
>> property $id="cib-bootstrap-options" \
>> expected-quorum-votes="2" \
>> dc-version="1.0.6-c48e3360eb18c53fd68bb7e7dbe39279ccbc0354" \
>> cluster-infrastructure="openais" \
>> stonith-enabled="true" \
>> no-quorum-policy="ignore" \
>> stonith-action="reboot" \
>> last-lrm-refresh="1268838090"
>> ...
>> ...
>>
>>
>> And when I look at the last-run time with "crm_mon -fort1", it shows me:
>> MySQL_Server_Resource: migration-threshold=3
>> + (32) stop: last-rc-change='Wed Mar 17 10:49:55 2010'
>> last-run='Wed Mar 17 10:49:55 2010' exec-time=5060ms queue-time=0ms
>> rc=0 (ok)
>> + (40) start: last-rc-change='Wed Mar 17 11:09:06 2010'
>> last-run='Wed Mar 17 11:09:06 2010' exec-time=4080ms queue-time=0ms
>> rc=0 (ok)
>> + (41) monitor: interval=20000ms last-rc-change='Wed Mar 17
>> 11:09:10 2010' last-run='Wed Mar 17 11:09:10 2010' exec-time=20ms
>> queue-time=0ms rc=0 (ok)
>>
>> And the results above were from yesterday....
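
A quick way to confirm from the shell whether the recurring monitor is actually firing, using the resource and node names from this thread. The crm shell subcommands below are a sketch based on the Pacemaker 1.0 tooling, to be run against the live cluster:

```
# One-shot overview with fail counts and per-operation history
# (the same flags Tom uses above, spelled out)
crm_mon -1 -f -o -r -t

# Show the fail count for the resource on a given node
crm resource failcount MySQL_MonitorAgent_Resource show node1

# Clear stale operation history and force a re-probe
crm resource cleanup MySQL_MonitorAgent_Resource
```

If the last-rc-change timestamps in the operation history stop advancing while the monitor interval is 10s, the recurring op is not being scheduled, which matches the cluster-glue bug discussed below.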
>
> The configuration looks fine to me.
>
> Cheers,
>
> Dejan
>
>> Thanks for your help.
>> Kind regards,
>> Tom
>>
>>
>>
>> 2010/3/18 Dejan Muhamedagic <dejanmm at fastmail.fm>:
>> > Hi,
>> >
>> > On Wed, Mar 17, 2010 at 12:38:47PM +0100, Tom Tux wrote:
>> >> Hi Dejan
>> >>
>> >> Thanks for your answer.
>> >>
>> >> I'm using this cluster with the packages from the SLES11 HAE
>> >> (High Availability Extension) repository. Given that, is it
>> >> possible to upgrade cluster-glue from source?
>> >
>> > Yes, though I don't think that any SLE11 version has this bug.
>> > When was your version released? What does hb_report -V say?
>> >
>> >> I think the better way is to wait for updates in the HAE repository
>> >> from Novell. Or do you have experience upgrading cluster-glue from
>> >> source (even when it was installed with zypper/rpm)?
>> >>
>> >> Do you know when the HAE repository will be updated?
>> >
>> > Can't say. It would be best to talk to Novell about the issue.
>> >
>> > Cheers,
>> >
>> > Dejan
>> >
>> >> Thanks a lot.
>> >> Tom
>> >>
>> >>
>> >> 2010/3/17 Dejan Muhamedagic <dejanmm at fastmail.fm>:
>> >> > Hi,
>> >> >
>> >> > On Wed, Mar 17, 2010 at 10:57:16AM +0100, Tom Tux wrote:
>> >> >> Hi Dominik
>> >> >>
>> >> >> The problem is that the cluster does not run the monitor action every
>> >> >> 20s. The last time it ran was at 09:21, and it is now 10:37:
>> >> >
>> >> > There was a serious bug in some cluster-glue packages, and what
>> >> > you're experiencing sounds like it. I can't say exactly which
>> >> > packages (probably something like 1.0.1; they were never released). At
>> >> > any rate, I'd suggest upgrading to cluster-glue 1.0.3.
>> >> >
>> >> > Thanks,
>> >> >
>> >> > Dejan
>> >> >
>> >> >> MySQL_MonitorAgent_Resource: migration-threshold=3
>> >> >> + (479) stop: last-rc-change='Wed Mar 17 09:21:28 2010'
>> >> >> last-run='Wed Mar 17 09:21:28 2010' exec-time=3010ms queue-time=0ms
>> >> >> rc=0 (ok)
>> >> >> + (480) start: last-rc-change='Wed Mar 17 09:21:31 2010'
>> >> >> last-run='Wed Mar 17 09:21:31 2010' exec-time=3010ms queue-time=0ms
>> >> >> rc=0 (ok)
>> >> >> + (481) monitor: interval=10000ms last-rc-change='Wed Mar 17
>> >> >> 09:21:34 2010' last-run='Wed Mar 17 09:21:34 2010' exec-time=20ms
>> >> >> queue-time=0ms rc=0 (ok)
>> >> >>
>> >> >> If I restart the whole cluster, then the new return code (exit 99 or
>> >> >> exit 4) is seen by the cluster monitor.
>> >> >>
>> >> >>
>> >> >> 2010/3/17 Dominik Klein <dk at in-telegence.net>:
>> >> >> > Hi Tom
>> >> >> >
>> >> >> > have a look at the logs and see whether the monitor op really returns
>> >> >> > 99 (grep for the resource id). If so, I'm not sure what the cluster
>> >> >> > does with rc=99. As far as I know, rc=4 would mean status=failed
>> >> >> > (unknown, actually).
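
For context on the return codes discussed here: for an lsb: resource, Pacemaker's monitor runs the init script's status action and expects the LSB status exit codes (0 = running, 1 = dead but pid file exists, 3 = not running); a value like 99 falls outside that set. A minimal sketch of a spec-conforming status check follows; the pid-file path is made up for the example:

```shell
# Hypothetical LSB-style status check for an init script.
# Exit codes follow the LSB spec: 0 running, 1 dead but pid file left, 3 not running.
PIDFILE="${PIDFILE:-/tmp/mysql-monitor-agent.pid}"

status() {
    [ -f "$PIDFILE" ] || return 3           # no pid file: not running
    pid=$(cat "$PIDFILE")
    kill -0 "$pid" 2>/dev/null && return 0  # process alive: running
    return 1                                # pid file exists, process dead
}

# Demonstrate the two common cases:
rm -f "$PIDFILE"
rc_missing=0; status || rc_missing=$?       # expect 3: no pid file
echo $$ > "$PIDFILE"
rc_running=0; status || rc_running=$?       # expect 0: our own shell is alive
rm -f "$PIDFILE"
echo "missing=$rc_missing running=$rc_running"   # prints: missing=3 running=0
```

Sticking to these codes (rather than an arbitrary value like 99) keeps the cluster's interpretation of a failed monitor unambiguous.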
>> >> >> >
>> >> >> > Regards
>> >> >> > Dominik
>> >> >> >
>> >> >> > Tom Tux wrote:
>> >> >> >> Thanks for your hint.
>> >> >> >>
>> >> >> >> I've configured an lsb-resource like this (with migration-threshold):
>> >> >> >>
>> >> >> >> primitive MySQL_MonitorAgent_Resource lsb:mysql-monitor-agent \
>> >> >> >> meta target-role="Started" migration-threshold="3" \
>> >> >> >> op monitor interval="10s" timeout="20s" on-fail="restart"
>> >> >> >>
>> >> >> >> I have now modified the init script "/etc/init.d/mysql-monitor-agent"
>> >> >> >> to exit with a non-zero return code (for example, exit 99) when the
>> >> >> >> monitor operation queries the status. But the cluster does not
>> >> >> >> recognise a failed monitor action. Why this behaviour? To the
>> >> >> >> cluster, everything seems OK.
>> >> >> >>
>> >> >> >> node1:/ # showcores.sh MySQL_MonitorAgent_Resource
>> >> >> >> Resource                     Score     Node   Stickiness  #Fail  Migration-Threshold
>> >> >> >> MySQL_MonitorAgent_Resource  -1000000  node1  100         0      3
>> >> >> >> MySQL_MonitorAgent_Resource  100       node2  100         0      3
>> >> >> >>
>> >> >> >> I also saw that the "last-run" entry (crm_mon -fort1) for this
>> >> >> >> resource is not up to date. It seems to me that the monitor action
>> >> >> >> does not occur every 10 seconds. Why? Any hints on this behaviour?
>> >> >> >>
>> >> >> >> Thanks a lot.
>> >> >> >> Tom
>> >> >> >>
>> >> >> >>
>> >> >> >> 2010/3/16 Dominik Klein <dk at in-telegence.net>:
>> >> >> >>> Tom Tux wrote:
>> >> >> >>>> Hi
>> >> >> >>>>
>> >> >> >>>> I have a question about resource monitoring:
>> >> >> >>>> I'm monitoring an IP resource every 20 seconds and have configured the
>> >> >> >>>> "on-fail" action with "restart". This works fine: if the
>> >> >> >>>> "monitor" operation fails, the resource is restarted.
>> >> >> >>>>
>> >> >> >>>> But how can I define this resource so that it migrates to the other
>> >> >> >>>> node if it still fails after 10 restarts? Is this possible? How does
>> >> >> >>>> the "failcount" interact with this scenario?
>> >> >> >>>>
>> >> >> >>>> In the documentation I read that the resource "fail_count" will
>> >> >> >>>> increase every time the resource restarts. But I can't see this
>> >> >> >>>> fail_count.
>> >> >> >>> Look at the meta attribute "migration-threshold".
>> >> >> >>>
>> >> >> >>> Regards
>> >> >> >>> Dominik
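
To spell out the suggestion: with migration-threshold="10", the resource is allowed 10 failures on a node before that node is banned and the resource moves; failure-timeout optionally expires old failures so the ban is not permanent. A sketch in crm shell configure syntax, with an illustrative IPaddr2 primitive (the resource name and ip value are made up):

```
primitive MyIP_Resource ocf:heartbeat:IPaddr2 \
    params ip="192.168.100.10" \
    meta migration-threshold="10" failure-timeout="120s" \
    op monitor interval="20s" timeout="30s" on-fail="restart"
```

Once the fail count on a node reaches the threshold, that node is scored -INFINITY for the resource until the failures expire or are cleared manually (crm resource cleanup).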
>> >> >> >
>> >> >> >
>> >> >> > _______________________________________________
>> >> >> > Pacemaker mailing list
>> >> >> > Pacemaker at oss.clusterlabs.org
>> >> >> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> >> >> >
>> >> >>
>> >> >
>> >>
>> >
>>
>