[Pacemaker] Problems with jboss on pacemaker
Benjamin Knoth
knoth at mpdl.mpg.de
Thu May 5 10:26:57 UTC 2011
Hi again,
I copied the jboss OCF agent and modified its variables so that the
script uses my values. Now I get the following every time I start it:
./jboss-test start
jboss-test[6165]: DEBUG: [jboss] Enter jboss start
jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
Something is wrong.
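
My guess is that the start action keeps retrying monitor_jboss until
the monitor succeeds, so a monitor that can never succeed makes start
loop until lrmd kills it on timeout. Here is how I am checking that by
hand (a sketch; the parameter values are the ones from my config
quoted below, and I assume the stock agent matches processes with
pgrep -f, so adjust this to your copy):

# Does the process-string pattern actually match the running JBoss?
pgrep -f "java -Dprogram.name=run.sh" || echo "pstring does not match"

# Exercise the unmodified agent with parameters passed the OCF way,
# via OCF_RESKEY_* environment variables, instead of editing the script.
export OCF_ROOT=/usr/lib/ocf
export OCF_RESKEY_java_home=/usr/lib64/jvm/java
export OCF_RESKEY_jboss_home=/usr/share/jboss
export OCF_RESKEY_jboss_pstring="java -Dprogram.name=run.sh"
/usr/lib/ocf/resource.d/heartbeat/jboss monitor; echo "monitor rc=$?"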
Cheers
Benjamin
On 05.05.2011 12:03, Benjamin Knoth wrote:
> Hi,
>
> Am 05.05.2011 11:46, schrieb Dejan Muhamedagic:
>> On Wed, May 04, 2011 at 03:44:02PM +0200, Benjamin Knoth wrote:
>>>
>>>
>>> On 04.05.2011 13:18, Benjamin Knoth wrote:
>>>> Hi,
>>>>
>>>> On 04.05.2011 12:18, Dejan Muhamedagic wrote:
>>>>> Hi,
>>>>>
>>>>> On Wed, May 04, 2011 at 10:37:40AM +0200, Benjamin Knoth wrote:
>>>>>
>>>>>
>>>>> On 04.05.2011 09:42, Florian Haas wrote:
>>>>>>>> On 05/04/2011 09:31 AM, Benjamin Knoth wrote:
>>>>>>>>> Hi Florian,
>>>>>>>>> I tested it with the OCF agent, but I couldn't get it to run.
>>>>>>>>
>>>>>>>> Well that's really helpful information. Logs? Error messages? Anything?
>>>>>
>>>>> Logs:
>>>>>
>>>>> May 4 09:55:10 vm36 lrmd: [19214]: WARN: p_jboss_ocf:start process (PID
>>>>> 27702) timed out (try 1). Killing with signal SIGTERM (15).
>>>>>
>>>>>> You need to set/increase the timeout for the start operation to
>>>>>> match the maximum expected start time. Take a look at "crm ra
>>>>>> info jboss" for minimum values.
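>>>>>>
>>>>>> For example, something like this (illustrative; pick a start
>>>>>> timeout that covers your real worst-case start time):
>>>>>>
>>>>>> # show the agent metadata, including the suggested timeouts
>>>>>> # for each operation
>>>>>> crm ra info ocf:heartbeat:jboss
>>>>>> # then raise the start timeout on the primitive
>>>>>> crm configure edit p_jboss_ocf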
>>>>>
>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: find_hash_entry: Creating
>>>>> hash entry for fail-count-p_jboss_ocf
>>>>> May 4 09:55:10 vm36 lrmd: [19214]: WARN: operation start[342] on
>>>>> ocf::jboss::p_jboss_ocf for client 19217, its parameters:
>>>>> CRM_meta_name=[start] crm_feature_set=[3.0.1]
>>>>> java_home=[/usr/lib64/jvm/java] CRM_meta_timeout=[240000]
>>>>> jboss_stop_timeout=[30] jboss_home=[/usr/share/jboss]
>>>>> jboss_pstring=[java -Dprogram.name=run.sh] : pid [27702] timed out
>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: attrd_trigger_update: Sending
>>>>> flush op to all hosts for: fail-count-p_jboss_ocf (INFINITY)
>>>>> May 4 09:55:10 vm36 crmd: [19217]: WARN: status_from_rc: Action 64
>>>>> (p_jboss_ocf_start_0) on vm36 failed (target: 0 vs. rc: -2): Error
>>>>> May 4 09:55:10 vm36 lrmd: [19214]: info: rsc:p_jboss_ocf:346: stop
>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: attrd_perform_update: Sent
>>>>> update 2294: fail-count-p_jboss_ocf=INFINITY
>>>>> May 4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
>>>>> - p_jboss_lsb_monitor_0 failed with rc=5: Preventing p_jboss_lsb from
>>>>> re-starting on vm36
>>>>> May 4 09:55:10 vm36 crmd: [19217]: WARN: update_failcount: Updating
>>>>> failcount for p_jboss_ocf on vm36 after failed start: rc=-2
>>>>> (update=INFINITY, time=1304495710)
>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: find_hash_entry: Creating
>>>>> hash entry for last-failure-p_jboss_ocf
>>>>> May 4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
>>>>> p_jboss_cs_monitor_0 found resource p_jboss_cs active on vm36
>>>>> May 4 09:55:10 vm36 crmd: [19217]: info: abort_transition_graph:
>>>>> match_graph_event:272 - Triggered transition abort (complete=0,
>>>>> tag=lrm_rsc_op, id=p_jboss_ocf_start_0,
>>>>> magic=2:-2;64:1375:0:fc16910d-2fe9-4daa-834a-348a4c7645ef,
>>>>> cib=0.535.2) : Event failed
>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: attrd_trigger_update: Sending
>>>>> flush op to all hosts for: last-failure-p_jboss_ocf (1304495710)
>>>>> May 4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
>>>>> - p_jboss_init_monitor_0 failed with rc=5: Preventing p_jboss_init from
>>>>> re-starting on vm36
>>>>> May 4 09:55:10 vm36 crmd: [19217]: info: match_graph_event: Action
>>>>> p_jboss_ocf_start_0 (64) confirmed on vm36 (rc=4)
>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: attrd_perform_update: Sent
>>>>> update 2297: last-failure-p_jboss_ocf=1304495710
>>>>> May 4 09:55:10 vm36 pengine: [19216]: WARN: unpack_rsc_op: Processing
>>>>> failed op p_jboss_ocf_start_0 on vm36: unknown exec error (-2)
>>>>> May 4 09:55:10 vm36 crmd: [19217]: info: te_rsc_command: Initiating
>>>>> action 1: stop p_jboss_ocf_stop_0 on vm36 (local)
>>>>> May 4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
>>>>> p_jboss_ocf_monitor_0 found resource p_jboss_ocf active on vm37
>>>>> May 4 09:55:10 vm36 crmd: [19217]: info: do_lrm_rsc_op: Performing
>>>>> key=1:1376:0:fc16910d-2fe9-4daa-834a-348a4c7645ef op=p_jboss_ocf_stop_0 )
>>>>> May 4 09:55:10 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
>>>>> (ocf::heartbeat:jboss): Stopped
>>>>> May 4 09:55:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
>>>>> has failed INFINITY times on vm36
>>>>> May 4 09:55:10 vm36 pengine: [19216]: WARN: common_apply_stickiness:
>>>>> Forcing p_jboss_ocf away from vm36 after 1000000 failures (max=1000000)
>>>>> May 4 09:59:10 vm36 pengine: [19216]: info: unpack_config: Node scores:
>>>>> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
>>>>> May 4 09:59:10 vm36 crmd: [19217]: WARN: status_from_rc: Action 50
>>>>> (p_jboss_ocf_start_0) on vm37 failed (target: 0 vs. rc: -2): Error
>>>>> May 4 09:59:10 vm36 pengine: [19216]: info: determine_online_status:
>>>>> Node vm36 is online
>>>>> May 4 09:59:10 vm36 crmd: [19217]: WARN: update_failcount: Updating
>>>>> failcount for p_jboss_ocf on vm37 after failed start: rc=-2
>>>>> (update=INFINITY, time=1304495950)
>>>>> May 4 09:59:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
>>>>> - p_jboss_lsb_monitor_0 failed with rc=5: Preventing p_jboss_lsb from
>>>>> re-starting on vm36
>>>>> May 4 09:59:10 vm36 crmd: [19217]: info: abort_transition_graph:
>>>>> match_graph_event:272 - Triggered transition abort (complete=0,
>>>>> tag=lrm_rsc_op, id=p_jboss_ocf_start_0,
>>>>> magic=2:-2;50:1377:0:fc16910d-2fe9-4daa-834a-348a4c7645ef,
>>>>> cib=0.535.12) : Event failed
>>>>> May 4 09:59:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
>>>>> p_jboss_cs_monitor_0 found resource p_jboss_cs active on vm36
>>>>> May 4 09:59:10 vm36 crmd: [19217]: info: match_graph_event: Action
>>>>> p_jboss_ocf_start_0 (50) confirmed on vm37 (rc=4)
>>>>> May 4 09:59:10 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
>>>>> (ocf::heartbeat:jboss): Stopped
>>>>> May 4 09:59:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
>>>>> has failed INFINITY times on vm37
>>>>> May 4 09:59:10 vm36 pengine: [19216]: WARN: common_apply_stickiness:
>>>>> Forcing p_jboss_ocf away from vm37 after 1000000 failures (max=1000000)
>>>>> May 4 09:59:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
>>>>> has failed INFINITY times on vm36
>>>>> May 4 09:59:10 vm36 pengine: [19216]: info: native_color: Resource
>>>>> p_jboss_ocf cannot run anywhere
>>>>> May 4 09:59:10 vm36 pengine: [19216]: notice: LogActions: Leave
>>>>> resource p_jboss_ocf (Stopped)
>>>>> May 4 09:59:31 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
>>>>> (ocf::heartbeat:jboss): Stopped
>>>>> ....
>>>>>
>>>>> Now I don't know how I can reset the resource p_jboss_ocf to test it again.
>>>>>
>>>>>> crm resource cleanup p_jboss_ocf
>>>>
>>>> That's the right way, but when I run this command, in the shell or
>>>> in the crm shell, both times I get:
>>>> Cleaning up p_jboss_ocf on vm37
>>>> Cleaning up p_jboss_ocf on vm36
>>>>
>>>> But if I look at the monitoring with crm_mon -1, I get this every time:
>>>>
>>>> Failed actions:
>>>> p_jboss_ocf_start_0 (node=vm36, call=-1, rc=1, status=Timed Out):
>>>> unknown error
>>>> p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
>>>> installed
>>>> p_jboss_ocf_start_0 (node=vm37, call=281, rc=-2, status=Timed Out):
>>>> unknown exec error
>>>>
>>>> p_jboss was deleted from the config yesterday.
>>>
>>> For demonstration:
>>>
>>> 15:34:22 ~ # crm_mon -1
>>>
>>> Failed actions:
>>> p_jboss_ocf_start_0 (node=vm36, call=376, rc=-2, status=Timed Out):
>>> unknown exec error
>>> p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
>>> installed
>>> p_jboss_ocf_start_0 (node=vm37, call=283, rc=-2, status=Timed Out):
>>> unknown exec error
>>>
>>> 15:35:02 ~ # crm resource cleanup p_jboss_ocf
>>> INFO: no curses support: you won't see colors
>>> Cleaning up p_jboss_ocf on vm37
>>> Cleaning up p_jboss_ocf on vm36
>>>
>>> 15:39:12 ~ # crm resource cleanup p_jboss
>>> INFO: no curses support: you won't see colors
>>> Cleaning up p_jboss on vm37
>>> Cleaning up p_jboss on vm36
>>>
>>> 15:39:19 ~ # crm_mon -1
>>>
>>> Failed actions:
>>> p_jboss_ocf_start_0 (node=vm36, call=376, rc=-2, status=Timed Out):
>>> unknown exec error
>>> p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
>>> installed
>>> p_jboss_ocf_start_0 (node=vm37, call=283, rc=-2, status=Timed Out):
>>> unknown exec error
>
> Strange: after I edited the config, all the other failed actions were
> cleared; only these failed actions are still displayed.
>
> Failed actions:
> p_jboss_ocf_start_0 (node=vm36, call=380, rc=-2, status=Timed Out):
> unknown exec error
> p_jboss_ocf_start_0 (node=vm37, call=287, rc=-2, status=Timed Out):
> unknown exec error
>
>>
>> Strange, perhaps you ran into a bug here. You can open a bugzilla
>> with hb_report.
>>
>> Anyway, you should fix the timeout issue.
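>>
>> Besides cleanup, you can also clear the failcount explicitly; a
>> sketch, assuming your crmsh version already has the failcount
>> subcommand:
>>
>> # delete the per-node failcount so the resource is no longer
>> # forced away from that node
>> crm resource failcount p_jboss_ocf delete vm36
>> crm resource failcount p_jboss_ocf delete vm37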
>
> I know, but what should I do to resolve this issue?
>
> my config entry for jboss is:
>
> primitive p_jboss_ocf ocf:heartbeat:jboss \
>     params java_home="/usr/lib64/jvm/java" \
>         jboss_home="/usr/share/jboss" \
>         jboss_pstring="java -Dprogram.name=run.sh" \
>         jboss_stop_timeout="30" \
>     op start interval="0" timeout="240s" \
>     op stop interval="0" timeout="240s" \
>     op monitor interval="20s"
>
> In the worst case JBoss needs at most 120s, and that really is the worst case.
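>
> To check whether the agent itself is at fault, it can also be run
> outside the cluster with ocf-tester from the resource-agents package
> (a sketch, reusing the parameters above; the agent path may differ):
>
> ocf-tester -n p_jboss_ocf \
>     -o java_home="/usr/lib64/jvm/java" \
>     -o jboss_home="/usr/share/jboss" \
>     -o jboss_pstring="java -Dprogram.name=run.sh" \
>     -o jboss_stop_timeout="30" \
>     /usr/lib/ocf/resource.d/heartbeat/jboss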
>
> Cheers,
> Benjamin
>
>>
>> Thanks,
>>
>> Dejan
>>
>>
>>>>>
>>>>> And after some tests I have some resources in the Failed actions
>>>>> list that no longer exist. How can I delete them?
>>>>>
>>>>>> The same way.
>>>>>
>>>>>> Thanks,
>>>>>
>>>>>> Dejan
>>>>>
>>>
>>> Thx
>>>
>>> Benjamin
>>>>>
>>>>>
>>>>>
>>>>>>>>
>>>>>>>> Florian
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
--
Benjamin Knoth
Max Planck Digital Library (MPDL)
Systemadministration
Amalienstrasse 33
80799 Munich, Germany
http://www.mpdl.mpg.de
Mail: knoth at mpdl.mpg.de
Phone: +49 89 38602 202
Fax: +49 89 38602 280