[Pacemaker] Problems with jboss on pacemaker
Dejan Muhamedagic
dejanmm at fastmail.fm
Fri May 6 07:32:47 UTC 2011
On Thu, May 05, 2011 at 06:39:09PM +0200, Benjamin Knoth wrote:
> Hi
>
> On 05.05.2011 16:35, Dejan Muhamedagic wrote:
> > On Thu, May 05, 2011 at 12:26:57PM +0200, Benjamin Knoth wrote:
> >> Hi again,
> >>
> >> I copied the jboss OCF script and modified the variables so that the
> >> script uses my settings when I start it. Now every time I start the
> >> OCF script I get the following:
> >>
> >> ./jboss-test start
> >> jboss-test[6165]: DEBUG: [jboss] Enter jboss start
> >> jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
> >> jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
> >> jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
> >>
> >> Something is wrong.
> >
> > Typically, the start operation includes a monitor at the end to
> > make sure that the resource really started. In this case it
> > looks like the monitor repeatedly fails. You should check the
> > monitor operation. Take a look at the output of "crm ra info
> > jboss" for parameters which affect monitoring. BTW, you can
> > test your resource outside the cluster using ocf-tester.
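> > For example, a minimal invocation (a sketch; the parameter values
> > are just the ones from your configuration, adjust to your setup):
> >
> >   ocf-tester -n jboss-test \
> >       -o java_home=/usr/lib64/jvm/java \
> >       -o jboss_home=/usr/share/jboss \
> >       /usr/lib/ocf/resource.d/heartbeat/jboss
> >
> > It exercises the agent's mandatory actions (meta-data, start,
> > monitor, stop) in sequence and reports which one fails.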
>
> I can't find ocf-tester, or I don't know how to use it.
> The JBoss log says that JBoss starts, but with the OCF script it can't
> deploy some packages. The most important part is:
>
> 18:18:16,654 ERROR [MainDeployer] Could not start deployment:
> file:/data/jboss-4.2.2.GA/server/default/tmp/deploy/tmp8457743723406154025escidoc-core.ear-contents/escidoc-core.war
> org.jboss.deployment.DeploymentException: URL
> file:/data/jboss-4.2.2.GA/server/default/tmp/deploy/tmp8457743723406154025escidoc-core.ear-contents/escidoc-core-exp.war/
> deployment failed
>
> --- Incompletely deployed packages ---
> org.jboss.deployment.DeploymentInfo@844a3a10 {
> url=file:/data/jboss-4.2.2.GA/server/default/deploy/escidoc-core.ear }
> deployer: org.jboss.deployment.EARDeployer@40f940f9
> status: Deployment FAILED reason: URL
> file:/data/jboss-4.2.2.GA/server/default/tmp/deploy/tmp8457743723406154025escidoc-core.ear-contents/escidoc-core-exp.war/
> deployment failed
> state: FAILED
> watch: file:/data/jboss-4.2.2.GA/server/default/deploy/escidoc-core.ear
> altDD: null
> lastDeployed: 1304612289701
> lastModified: 1304612278000
> mbeans:
>
> After 4 minutes JBoss is shut down by Pacemaker.
>
> If I run the init script normally, it runs fine and all important
> packages deploy.
>
> I checked the difference between the processes started by the init
> script and by the OCF script from Pacemaker:
>
> pacemaker
>
> root 20074 0.0 0.0 12840 1792 ? S 17:56 0:00 /bin/sh
> /usr/lib/ocf/resource.d//heartbeat/jboss start
> root 20079 0.0 0.0 48336 1368 ? S 17:56 0:00 su -
> jboss -s /bin/bash -c export JAVA_HOME=/usr/lib64/jvm/java;\n?
> export JBOSS_HOME=/usr/share/jboss;\n?
> /usr/share/jboss/bin/run.sh -c default
> -Djboss.bind.address=0.0.0.0
>
> init-script
>
> root 20079 0.0 0.0 48336 1368 ? S 17:56 0:00 su
> jboss -s /bin/bash -c /usr/share/jboss/bin/run.sh -c default
> -Djboss.bind.address=0.0.0.0
No idea. Perhaps somebody using jboss here can take a look. Or
you could experiment a bit to find out which part makes the
difference. Apart from the two exported vars, the rest of the
command line is the same. In addition, the OCF RA does 'su -'.
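
A quick way to see what 'su -' changes (a rough sketch; 'su -' starts
a login shell, which re-reads the jboss user's profile and resets the
environment, while plain 'su' inherits the caller's environment):

  su - jboss -s /bin/bash -c 'env | sort' > /tmp/env.login
  su jboss -s /bin/bash -c 'env | sort' > /tmp/env.plain
  diff /tmp/env.login /tmp/env.plain

If a variable the deployment depends on (PATH, JAVA_OPTS, a temp or
work directory) differs between the two, that would explain it.
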
Thanks,
Dejan
> Cheers
>
> Benjamin
>
> >
> > Thanks,
> >
> > Dejan
> >
> >> Cheers
> >> Benjamin
> >>
> >>> On 05.05.2011 12:03, Benjamin Knoth wrote:
> >>> Hi,
> >>>
> >>>> On 05.05.2011 11:46, Dejan Muhamedagic wrote:
> >>>> On Wed, May 04, 2011 at 03:44:02PM +0200, Benjamin Knoth wrote:
> >>>>>
> >>>>>
> >>>>>> On 04.05.2011 13:18, Benjamin Knoth wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>>>> On 04.05.2011 12:18, Dejan Muhamedagic wrote:
> >>>>>>> Hi,
> >>>>>>>
> >>>>>>> On Wed, May 04, 2011 at 10:37:40AM +0200, Benjamin Knoth wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>>> On 04.05.2011 09:42, Florian Haas wrote:
> >>>>>>>>>> On 05/04/2011 09:31 AM, Benjamin Knoth wrote:
> >>>>>>>>>>> Hi Florian,
> >>>>>>>>>>> I tested it with the OCF agent, but it wouldn't run.
> >>>>>>>>>>
> >>>>>>>>>> Well that's really helpful information. Logs? Error messages? Anything?
> >>>>>>>
> >>>>>>> Logs
> >>>>>>>
> >>>>>>> May 4 09:55:10 vm36 lrmd: [19214]: WARN: p_jboss_ocf:start process (PID
> >>>>>>> 27702) timed out (try 1). Killing with signal SIGTERM (15).
> >>>>>>>
> >>>>>>>> You need to set/increase the timeout for the start operation to
> >>>>>>>> match the maximum expected start time. Take a look at "crm ra
> >>>>>>>> info jboss" for minimum values.
> >>>>>>>
> >>>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: find_hash_entry: Creating
> >>>>>>> hash entry for fail-count-p_jboss_ocf
> >>>>>>> May 4 09:55:10 vm36 lrmd: [19214]: WARN: operation start[342] on
> >>>>>>> ocf::jboss::p_jboss_ocf for client 19217, its parameters:
> >>>>>>> CRM_meta_name=[start] crm_feature_set=[3.0.1]
> >>>>>>> java_home=[/usr/lib64/jvm/java] CRM_meta_timeout=[240000] jboss_sto
> >>>>>>> p_timeout=[30] jboss_home=[/usr/share/jboss] jboss_pstring=[java
> >>>>>>> -Dprogram.name=run.sh] : pid [27702] timed out
> >>>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: attrd_trigger_update: Sending
> >>>>>>> flush op to all hosts for: fail-count-p_jboss_ocf (INFINITY)
> >>>>>>> May 4 09:55:10 vm36 crmd: [19217]: WARN: status_from_rc: Action 64
> >>>>>>> (p_jboss_ocf_start_0) on vm36 failed (target: 0 vs. rc: -2): Error
> >>>>>>> May 4 09:55:10 vm36 lrmd: [19214]: info: rsc:p_jboss_ocf:346: stop
> >>>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: attrd_perform_update: Sent
> >>>>>>> update 2294: fail-count-p_jboss_ocf=INFINITY
> >>>>>>> May 4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
> >>>>>>> - p_jboss_lsb_monitor_0 failed with rc=5: Preventing p_jboss_lsb from
> >>>>>>> re-starting on vm36
> >>>>>>> May 4 09:55:10 vm36 crmd: [19217]: WARN: update_failcount: Updating
> >>>>>>> failcount for p_jboss_ocf on vm36 after failed start: rc=-2
> >>>>>>> (update=INFINITY, time=1304495710)
> >>>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: find_hash_entry: Creating
> >>>>>>> hash entry for last-failure-p_jboss_ocf
> >>>>>>> May 4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
> >>>>>>> p_jboss_cs_monitor_0 found resource p_jboss_cs active on vm36
> >>>>>>> May 4 09:55:10 vm36 crmd: [19217]: info: abort_transition_graph:
> >>>>>>> match_graph_event:272 - Triggered transition abort (complete=0,
> >>>>>>> tag=lrm_rsc_op, id=p_jboss_ocf_start_0,
> >>>>>>> magic=2:-2;64:1375:0:fc16910d-2fe9-4daa-834a-348a4c7645ef, cib=0.53
> >>>>>>> 5.2) : Event failed
> >>>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: attrd_trigger_update: Sending
> >>>>>>> flush op to all hosts for: last-failure-p_jboss_ocf (1304495710)
> >>>>>>> May 4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
> >>>>>>> - p_jboss_init_monitor_0 failed with rc=5: Preventing p_jboss_init from
> >>>>>>> re-starting on vm36
> >>>>>>> May 4 09:55:10 vm36 crmd: [19217]: info: match_graph_event: Action
> >>>>>>> p_jboss_ocf_start_0 (64) confirmed on vm36 (rc=4)
> >>>>>>> May 4 09:55:10 vm36 attrd: [19215]: info: attrd_perform_update: Sent
> >>>>>>> update 2297: last-failure-p_jboss_ocf=1304495710
> >>>>>>> May 4 09:55:10 vm36 pengine: [19216]: WARN: unpack_rsc_op: Processing
> >>>>>>> failed op p_jboss_ocf_start_0 on vm36: unknown exec error (-2)
> >>>>>>> May 4 09:55:10 vm36 crmd: [19217]: info: te_rsc_command: Initiating
> >>>>>>> action 1: stop p_jboss_ocf_stop_0 on vm36 (local)
> >>>>>>> May 4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
> >>>>>>> p_jboss_ocf_monitor_0 found resource p_jboss_ocf active on vm37
> >>>>>>> May 4 09:55:10 vm36 crmd: [19217]: info: do_lrm_rsc_op: Performing
> >>>>>>> key=1:1376:0:fc16910d-2fe9-4daa-834a-348a4c7645ef op=p_jboss_ocf_stop_0 )
> >>>>>>> May 4 09:55:10 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
> >>>>>>> (ocf::heartbeat:jboss): Stopped
> >>>>>>> May 4 09:55:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
> >>>>>>> has failed INFINITY times on vm36
> >>>>>>> May 4 09:55:10 vm36 pengine: [19216]: WARN: common_apply_stickiness:
> >>>>>>> Forcing p_jboss_ocf away from vm36 after 1000000 failures (max=1000000)
> >>>>>>> May 4 09:59:10 vm36 pengine: [19216]: info: unpack_config: Node scores:
> >>>>>>> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> >>>>>>> May 4 09:59:10 vm36 crmd: [19217]: WARN: status_from_rc: Action 50
> >>>>>>> (p_jboss_ocf_start_0) on vm37 failed (target: 0 vs. rc: -2): Error
> >>>>>>> May 4 09:59:10 vm36 pengine: [19216]: info: determine_online_status:
> >>>>>>> Node vm36 is online
> >>>>>>> May 4 09:59:10 vm36 crmd: [19217]: WARN: update_failcount: Updating
> >>>>>>> failcount for p_jboss_ocf on vm37 after failed start: rc=-2
> >>>>>>> (update=INFINITY, time=1304495950)
> >>>>>>> May 4 09:59:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
> >>>>>>> - p_jboss_lsb_monitor_0 failed with rc=5: Preventing p_jboss_lsb from
> >>>>>>> re-starting on vm36
> >>>>>>> May 4 09:59:10 vm36 crmd: [19217]: info: abort_transition_graph:
> >>>>>>> match_graph_event:272 - Triggered transition abort (complete=0,
> >>>>>>> tag=lrm_rsc_op, id=p_jboss_ocf_start_0,
> >>>>>>> magic=2:-2;50:1377:0:fc16910d-2fe9-4daa-834a-348a4c7645ef, cib=0.53
> >>>>>>> 5.12) : Event failed
> >>>>>>> May 4 09:59:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
> >>>>>>> p_jboss_cs_monitor_0 found resource p_jboss_cs active on vm36
> >>>>>>> May 4 09:59:10 vm36 crmd: [19217]: info: match_graph_event: Action
> >>>>>>> p_jboss_ocf_start_0 (50) confirmed on vm37 (rc=4)
> >>>>>>> May 4 09:59:10 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
> >>>>>>> (ocf::heartbeat:jboss): Stopped
> >>>>>>> May 4 09:59:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
> >>>>>>> has failed INFINITY times on vm37
> >>>>>>> May 4 09:59:10 vm36 pengine: [19216]: WARN: common_apply_stickiness:
> >>>>>>> Forcing p_jboss_ocf away from vm37 after 1000000 failures (max=1000000)
> >>>>>>> May 4 09:59:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
> >>>>>>> has failed INFINITY times on vm36
> >>>>>>> May 4 09:59:10 vm36 pengine: [19216]: info: native_color: Resource
> >>>>>>> p_jboss_ocf cannot run anywhere
> >>>>>>> May 4 09:59:10 vm36 pengine: [19216]: notice: LogActions: Leave
> >>>>>>> resource p_jboss_ocf (Stopped)
> >>>>>>> May 4 09:59:31 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
> >>>>>>> (ocf::heartbeat:jboss): Stopped
> >>>>>>> ....
> >>>>>>>
> >>>>>>> Now I don't know how I can reset the resource p_jboss_ocf to test it again.
> >>>>>>>
> >>>>>>>> crm resource cleanup p_jboss_ocf
> >>>>>>
> >>>>>> That's the right way, but whether I run this command from the shell
> >>>>>> or the crm shell, in both cases I get:
> >>>>>>
> >>>>>> Cleaning up p_jboss_ocf on vm37
> >>>>>> Cleaning up p_jboss_ocf on vm36
> >>>>>>
> >>>>>> But if I look at the monitoring with crm_mon -1, every time I get:
> >>>>>>
> >>>>>> Failed actions:
> >>>>>> p_jboss_ocf_start_0 (node=vm36, call=-1, rc=1, status=Timed Out):
> >>>>>> unknown error
> >>>>>> p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
> >>>>>> installed
> >>>>>> p_jboss_ocf_start_0 (node=vm37, call=281, rc=-2, status=Timed Out):
> >>>>>> unknown exec error
> >>>>>>
> >>>>>> p_jboss was deleted from the config yesterday.
> >>>>>
> >>>>> For demonstration:
> >>>>>
> >>>>> 15:34:22 ~ # crm_mon -1
> >>>>>
> >>>>> Failed actions:
> >>>>> p_jboss_ocf_start_0 (node=vm36, call=376, rc=-2, status=Timed Out):
> >>>>> unknown exec error
> >>>>> p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
> >>>>> installed
> >>>>> p_jboss_ocf_start_0 (node=vm37, call=283, rc=-2, status=Timed Out):
> >>>>> unknown exec error
> >>>>>
> >>>>> 15:35:02 ~ # crm resource cleanup p_jboss_ocf
> >>>>> INFO: no curses support: you won't see colors
> >>>>> Cleaning up p_jboss_ocf on vm37
> >>>>> Cleaning up p_jboss_ocf on vm36
> >>>>>
> >>>>> 15:39:12 ~ # crm resource cleanup p_jboss
> >>>>> INFO: no curses support: you won't see colors
> >>>>> Cleaning up p_jboss on vm37
> >>>>> Cleaning up p_jboss on vm36
> >>>>>
> >>>>> 15:39:19 ~ # crm_mon -1
> >>>>>
> >>>>> Failed actions:
> >>>>> p_jboss_ocf_start_0 (node=vm36, call=376, rc=-2, status=Timed Out):
> >>>>> unknown exec error
> >>>>> p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
> >>>>> installed
> >>>>> p_jboss_ocf_start_0 (node=vm37, call=283, rc=-2, status=Timed Out):
> >>>>> unknown exec error
> >>>
> >>> Strange, after I edited the config all other failed actions were
> >>> deleted; only these failed actions are still displayed:
> >>>
> >>> Failed actions:
> >>> p_jboss_ocf_start_0 (node=vm36, call=380, rc=-2, status=Timed Out):
> >>> unknown exec error
> >>> p_jboss_ocf_start_0 (node=vm37, call=287, rc=-2, status=Timed Out):
> >>> unknown exec error
> >>>
> >>>>
> >>>> Strange, perhaps you ran into a bug here. You can open a bugzilla
> >>>> entry with an hb_report attached.
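> >>>> Something like this (a sketch; set the from-time to shortly before
> >>>> the failed cleanup):
> >>>>
> >>>>   hb_report -f "2011-05-05 15:30" /tmp/jboss-cleanup-report
> >>>>
> >>>> and attach the resulting tarball to the bugzilla entry.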
> >>>>
> >>>> Anyway, you should fix the timeout issue.
> >>>
> >>> I know, but what should I do to resolve this issue?
> >>>
> >>> My config entry for jboss is:
> >>>
> >>> primitive p_jboss_ocf ocf:heartbeat:jboss \
> >>>     params java_home="/usr/lib64/jvm/java" \
> >>>         jboss_home="/usr/share/jboss" \
> >>>         jboss_pstring="java -Dprogram.name=run.sh" \
> >>>         jboss_stop_timeout="30" \
> >>>     op start interval="0" timeout="240s" \
> >>>     op stop interval="0" timeout="240s" \
> >>>     op monitor interval="20s"
> >>>
> >>> In the worst case JBoss needs at most 120s, and that's really the worst case.
> >>>
> >>> Cheers,
> >>> Benjamin
> >>>
> >>>>
> >>>> Thanks,
> >>>>
> >>>> Dejan
> >>>>
> >>>>
> >>>>>>>
> >>>>>>> And after some tests I have some no-longer-existing resources in the
> >>>>>>> Failed actions list. How can I delete them?
> >>>>>>>
> >>>>>>>> The same way.
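> >>>>>>>> E.g., the low-level equivalent (a sketch):
> >>>>>>>>
> >>>>>>>>   crm_resource --cleanup --resource <resource-id>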
> >>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>
> >>>>>>>> Dejan
> >>>>>>>
> >>>>>
> >>>>> Thx
> >>>>>
> >>>>> Benjamin
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Florian
> >>>>>>>>>>
>
> --
> Benjamin Knoth
> Max Planck Digital Library (MPDL)
> Systemadministration
> Amalienstrasse 33
> 80799 Munich, Germany
> http://www.mpdl.mpg.de
>
> Mail: knoth at mpdl.mpg.de
> Phone: +49 89 38602 202
> Fax: +49-89-38602-280
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker