[Pacemaker] Problems with jboss on pacemaker

Thu May 5 14:35:20 UTC 2011

On Thu, May 05, 2011 at 12:26:57PM +0200, Benjamin Knoth wrote:
> Hi again,
> 
> I copied the jboss OCF script and modified the variables so that the
> script uses my values when I start it. Now, whenever I start the OCF
> script, I get the following:
> 
> ./jboss-test start
> jboss-test[6165]: DEBUG: [jboss] Enter jboss start
> jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
> jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
> jboss-test[6165]: DEBUG: start_jboss[jboss]: retry monitor_jboss
> 
> Something is wrong.

Typically, the start operation includes a monitor at the end to
make sure that the resource really started. In this case it
looks like the monitor repeatedly fails, so you should check the
monitor operation. Take a look at the output of "crm ra info
jboss" for parameters that affect monitoring. BTW, you
can test your resource without the cluster using ocf-tester.
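
To illustrate the start-then-monitor pattern described above, here is a
minimal shell sketch. It is not the actual jboss resource agent code: the
function names start_service and monitor_service, the attempts counter, and
the max_tries argument are all made up for this example, and the stand-in
monitor simply succeeds on its third probe. If the monitor never succeeded,
a loop like this would keep retrying until it gave up (or, in a cluster,
until the start timeout expired), which would be consistent with the
repeated "retry monitor_jboss" messages.

```shell
#!/bin/sh
# Hypothetical sketch of "start, then poll monitor"; NOT the real jboss agent.

OCF_SUCCESS=0
OCF_ERR_GENERIC=1

attempts=0   # number of monitor probes made so far (illustration only)

# Stand-in monitor: pretends the service only answers on the third probe.
monitor_service() {
    attempts=$((attempts + 1))
    [ "$attempts" -ge 3 ]
}

# Start the service, then poll the monitor until it succeeds
# or until max_tries probes have been used up.
start_service() {
    max_tries=$1
    # (launching the real service would happen here)
    while [ "$attempts" -lt "$max_tries" ]; do
        if monitor_service; then
            return "$OCF_SUCCESS"
        fi
        echo "DEBUG: retry monitor_service" >&2
        # a real agent would sleep between probes
    done
    return "$OCF_ERR_GENERIC"
}
```

With the stand-in monitor above, "start_service 5" returns 0 after three
probes, while "start_service 2" (with attempts reset to 0) gives up and
returns 1.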

Thanks,

Dejan

> Cheers
> Benjamin
> 
> On 05.05.2011 12:03, Benjamin Knoth wrote:
> > Hi,
> > 
> > On 05.05.2011 11:46, Dejan Muhamedagic wrote:
> >> On Wed, May 04, 2011 at 03:44:02PM +0200, Benjamin Knoth wrote:
> >>>
> >>>
> >>> On 04.05.2011 13:18, Benjamin Knoth wrote:
> >>>> Hi,
> >>>>
> >>>> On 04.05.2011 12:18, Dejan Muhamedagic wrote:
> >>>>> Hi,
> >>>>>
> >>>>> On Wed, May 04, 2011 at 10:37:40AM +0200, Benjamin Knoth wrote:
> >>>>>
> >>>>>
> >>>>> On 04.05.2011 09:42, Florian Haas wrote:
> >>>>>>>> On 05/04/2011 09:31 AM, Benjamin Knoth wrote:
> >>>>>>>>> Hi Florian,
> >>>>>>>>> I tested it with ocf, but I couldn't get it to run.
> >>>>>>>>
> >>>>>>>> Well that's really helpful information. Logs? Error messages? Anything?
> >>>>>
> >>>>> Logs
> >>>>>
> >>>>> May  4 09:55:10 vm36 lrmd: [19214]: WARN: p_jboss_ocf:start process (PID
> >>>>> 27702) timed out (try 1).  Killing with signal SIGTERM (15).
> >>>>>
> >>>>>> You need to set/increase the timeout for the start operation to
> >>>>>> match the maximum expected start time. Take a look at "crm ra
> >>>>>> info jboss" for minimum values.
> >>>>>
> >>>>> May  4 09:55:10 vm36 attrd: [19215]: info: find_hash_entry: Creating
> >>>>> hash entry for fail-count-p_jboss_ocf
> >>>>> May  4 09:55:10 vm36 lrmd: [19214]: WARN: operation start[342] on
> >>>>> ocf::jboss::p_jboss_ocf for client 19217, its parameters:
> >>>>> CRM_meta_name=[start] crm_feature_set=[3.0.1]
> >>>>> java_home=[/usr/lib64/jvm/java] CRM_meta_timeout=[240000]
> >>>>> jboss_stop_timeout=[30] jboss_home=[/usr/share/jboss] jboss_pstring=[java
> >>>>> -Dprogram.name=run.sh] : pid [27702] timed out
> >>>>> May  4 09:55:10 vm36 attrd: [19215]: info: attrd_trigger_update: Sending
> >>>>> flush op to all hosts for: fail-count-p_jboss_ocf (INFINITY)
> >>>>> May  4 09:55:10 vm36 crmd: [19217]: WARN: status_from_rc: Action 64
> >>>>> (p_jboss_ocf_start_0) on vm36 failed (target: 0 vs. rc: -2): Error
> >>>>> May  4 09:55:10 vm36 lrmd: [19214]: info: rsc:p_jboss_ocf:346: stop
> >>>>> May  4 09:55:10 vm36 attrd: [19215]: info: attrd_perform_update: Sent
> >>>>> update 2294: fail-count-p_jboss_ocf=INFINITY
> >>>>> May  4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
> >>>>> - p_jboss_lsb_monitor_0 failed with rc=5: Preventing p_jboss_lsb from
> >>>>> re-starting on vm36
> >>>>> May  4 09:55:10 vm36 crmd: [19217]: WARN: update_failcount: Updating
> >>>>> failcount for p_jboss_ocf on vm36 after failed start: rc=-2
> >>>>> (update=INFINITY, time=1304495710)
> >>>>> May  4 09:55:10 vm36 attrd: [19215]: info: find_hash_entry: Creating
> >>>>> hash entry for last-failure-p_jboss_ocf
> >>>>> May  4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
> >>>>> p_jboss_cs_monitor_0 found resource p_jboss_cs active on vm36
> >>>>> May  4 09:55:10 vm36 crmd: [19217]: info: abort_transition_graph:
> >>>>> match_graph_event:272 - Triggered transition abort (complete=0,
> >>>>> tag=lrm_rsc_op, id=p_jboss_ocf_start_0,
> >>>>> magic=2:-2;64:1375:0:fc16910d-2fe9-4daa-834a-348a4c7645ef,
> >>>>> cib=0.535.2) : Event failed
> >>>>> May  4 09:55:10 vm36 attrd: [19215]: info: attrd_trigger_update: Sending
> >>>>> flush op to all hosts for: last-failure-p_jboss_ocf (1304495710)
> >>>>> May  4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
> >>>>> - p_jboss_init_monitor_0 failed with rc=5: Preventing p_jboss_init from
> >>>>> re-starting on vm36
> >>>>> May  4 09:55:10 vm36 crmd: [19217]: info: match_graph_event: Action
> >>>>> p_jboss_ocf_start_0 (64) confirmed on vm36 (rc=4)
> >>>>> May  4 09:55:10 vm36 attrd: [19215]: info: attrd_perform_update: Sent
> >>>>> update 2297: last-failure-p_jboss_ocf=1304495710
> >>>>> May  4 09:55:10 vm36 pengine: [19216]: WARN: unpack_rsc_op: Processing
> >>>>> failed op p_jboss_ocf_start_0 on vm36: unknown exec error (-2)
> >>>>> May  4 09:55:10 vm36 crmd: [19217]: info: te_rsc_command: Initiating
> >>>>> action 1: stop p_jboss_ocf_stop_0 on vm36 (local)
> >>>>> May  4 09:55:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
> >>>>> p_jboss_ocf_monitor_0 found resource p_jboss_ocf active on vm37
> >>>>> May  4 09:55:10 vm36 crmd: [19217]: info: do_lrm_rsc_op: Performing
> >>>>> key=1:1376:0:fc16910d-2fe9-4daa-834a-348a4c7645ef op=p_jboss_ocf_stop_0 )
> >>>>> May  4 09:55:10 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
> >>>>>        (ocf::heartbeat:jboss): Stopped
> >>>>> May  4 09:55:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
> >>>>> has failed INFINITY times on vm36
> >>>>> May  4 09:55:10 vm36 pengine: [19216]: WARN: common_apply_stickiness:
> >>>>> Forcing p_jboss_ocf away from vm36 after 1000000 failures (max=1000000)
> >>>>> May  4 09:59:10 vm36 pengine: [19216]: info: unpack_config: Node scores:
> >>>>> 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
> >>>>> May  4 09:59:10 vm36 crmd: [19217]: WARN: status_from_rc: Action 50
> >>>>> (p_jboss_ocf_start_0) on vm37 failed (target: 0 vs. rc: -2): Error
> >>>>> May  4 09:59:10 vm36 pengine: [19216]: info: determine_online_status:
> >>>>> Node vm36 is online
> >>>>> May  4 09:59:10 vm36 crmd: [19217]: WARN: update_failcount: Updating
> >>>>> failcount for p_jboss_ocf on vm37 after failed start: rc=-2
> >>>>> (update=INFINITY, time=1304495950)
> >>>>> May  4 09:59:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Hard error
> >>>>> - p_jboss_lsb_monitor_0 failed with rc=5: Preventing p_jboss_lsb from
> >>>>> re-starting on vm36
> >>>>> May  4 09:59:10 vm36 crmd: [19217]: info: abort_transition_graph:
> >>>>> match_graph_event:272 - Triggered transition abort (complete=0,
> >>>>> tag=lrm_rsc_op, id=p_jboss_ocf_start_0,
> >>>>> magic=2:-2;50:1377:0:fc16910d-2fe9-4daa-834a-348a4c7645ef,
> >>>>> cib=0.535.12) : Event failed
> >>>>> May  4 09:59:10 vm36 pengine: [19216]: notice: unpack_rsc_op: Operation
> >>>>> p_jboss_cs_monitor_0 found resource p_jboss_cs active on vm36
> >>>>> May  4 09:59:10 vm36 crmd: [19217]: info: match_graph_event: Action
> >>>>> p_jboss_ocf_start_0 (50) confirmed on vm37 (rc=4)
> >>>>> May  4 09:59:10 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
> >>>>>        (ocf::heartbeat:jboss): Stopped
> >>>>> May  4 09:59:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
> >>>>> has failed INFINITY times on vm37
> >>>>> May  4 09:59:10 vm36 pengine: [19216]: WARN: common_apply_stickiness:
> >>>>> Forcing p_jboss_ocf away from vm37 after 1000000 failures (max=1000000)
> >>>>> May  4 09:59:10 vm36 pengine: [19216]: info: get_failcount: p_jboss_ocf
> >>>>> has failed INFINITY times on vm36
> >>>>> May  4 09:59:10 vm36 pengine: [19216]: info: native_color: Resource
> >>>>> p_jboss_ocf cannot run anywhere
> >>>>> May  4 09:59:10 vm36 pengine: [19216]: notice: LogActions: Leave
> >>>>> resource p_jboss_ocf   (Stopped)
> >>>>> May  4 09:59:31 vm36 pengine: [19216]: notice: native_print: p_jboss_ocf
> >>>>>        (ocf::heartbeat:jboss): Stopped
> >>>>> ....
> >>>>>
> >>>>> Now I don't know how to reset the resource p_jboss_ocf to test it again.
> >>>>>
> >>>>>> crm resource cleanup p_jboss_ocf
> >>>>
> >>>> That's the known way, but whether I run this command from the shell or
> >>>> from the crm shell, in both cases I get:
> >>>> Cleaning up p_jboss_ocf on vm37
> >>>> Cleaning up p_jboss_ocf on vm36
> >>>>
> >>>> But when I check the status with crm_mon -1, I get this every time:
> >>>>
> >>>> Failed actions:
> >>>> p_jboss_ocf_start_0 (node=vm36, call=-1, rc=1, status=Timed Out):
> >>>> unknown error
> >>>>     p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
> >>>> installed
> >>>>     p_jboss_ocf_start_0 (node=vm37, call=281, rc=-2, status=Timed Out):
> >>>> unknown exec error
> >>>>
> >>>> p_jboss was deleted in the config yesterday.
> >>>
> >>> For demonstration:
> >>>
> >>> 15:34:22 ~ # crm_mon -1
> >>>
> >>> Failed actions:
> >>>     p_jboss_ocf_start_0 (node=vm36, call=376, rc=-2, status=Timed Out):
> >>> unknown exec error
> >>>     p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
> >>> installed
> >>>     p_jboss_ocf_start_0 (node=vm37, call=283, rc=-2, status=Timed Out):
> >>> unknown exec error
> >>>
> >>> 15:35:02 ~ # crm resource cleanup p_jboss_ocf
> >>> INFO: no curses support: you won't see colors
> >>> Cleaning up p_jboss_ocf on vm37
> >>> Cleaning up p_jboss_ocf on vm36
> >>>
> >>> 15:39:12 ~ # crm resource cleanup p_jboss
> >>> INFO: no curses support: you won't see colors
> >>> Cleaning up p_jboss on vm37
> >>> Cleaning up p_jboss on vm36
> >>>
> >>> 15:39:19 ~ # crm_mon -1
> >>>
> >>> Failed actions:
> >>>     p_jboss_ocf_start_0 (node=vm36, call=376, rc=-2, status=Timed Out):
> >>> unknown exec error
> >>>     p_jboss_monitor_0 (node=vm37, call=205, rc=5, status=complete): not
> >>> installed
> >>>     p_jboss_ocf_start_0 (node=vm37, call=283, rc=-2, status=Timed Out):
> >>> unknown exec error
> > 
> > Strange: after I edited the config, all the other failed actions were
> > deleted, and only these failed actions are still displayed.
> > 
> > Failed actions:
> >     p_jboss_ocf_start_0 (node=vm36, call=380, rc=-2, status=Timed Out):
> > unknown exec error
> >     p_jboss_ocf_start_0 (node=vm37, call=287, rc=-2, status=Timed Out):
> > unknown exec error
> > 
> >>
> >> Strange, perhaps you ran into a bug here. You can open a bugzilla
> >> with hb_report.
> >>
> >> Anyway, you should fix the timeout issue.
> > 
> > I know, but what should I do to resolve this issue?
> > 
> > my config entry for jboss is:
> > 
> > primitive p_jboss_ocf ocf:heartbeat:jboss \
> >         params java_home="/usr/lib64/jvm/java" \
> >                jboss_home="/usr/share/jboss" \
> >                jboss_pstring="java -Dprogram.name=run.sh" \
> >                jboss_stop_timeout="30" \
> >         op start interval="0" timeout="240s" \
> >         op stop interval="0" timeout="240s" \
> >         op monitor interval="20s"
> > 
> > In the worst case jboss needs at most 120s, and that really is the worst case.
> > 
> > Cheers,
> > Benjamin
> > 
> >>
> >> Thanks,
> >>
> >> Dejan
> >>
> >>
> >>>>>
> >>>>> And after some tests I have some resources that no longer exist in
> >>>>> the Failed actions list. How can I delete them?
> >>>>>
> >>>>>> The same way.
> >>>>>
> >>>>>> Thanks,
> >>>>>
> >>>>>> Dejan
> >>>>>
> >>>
> >>> Thx
> >>>
> >>> Benjamin
> >>>>>
> >>>>>
> >>>>>
> >>>>>>>>
> >>>>>>>> Florian
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> _______________________________________________
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> >>>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >>>>
> >>>> Project Home: http://www.clusterlabs.org
> >>>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> >>>> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
> >>>
> > 
> 
> -- 
> Benjamin Knoth
> Max Planck Digital Library (MPDL)
> Systemadministration
> Amalienstrasse 33
> 80799 Munich, Germany
> http://www.mpdl.mpg.de
> 
> Mail: knoth at mpdl.mpg.de
> Phone:  +49 89 38602 202
> Fax:    +49-89-38602-280
> 