[Pacemaker] long time to start

Wed Apr 21 11:07:40 EDT 2010

>> Hi,
>>
>>   I have a resource that can sometimes take 10 minutes to start after
>>   a failure because log records need to be sync'd. (my own OCF)
>>
>>   I noticed that while the start action is being performed, if other
>>   resources in my cluster report "not running", no restart is
>>   attempted until my long-running start returns.
>>
>>   Meanwhile, crm_mon reports the resources as "Started"
>>   even though they are not running, and may not be for many minutes.
>>   Is the lrm process single-threaded?

>You are saying that while your RA starts (with a long start timeout),
>and the start action is not yet complete,
>other _independent_ resources are not yet started,
>but crm_mon thinks they are running already,
>even though "something" (what?) reports "not running" for those?

Yes. If one resource (R1) is taking a long time to start and the monitor action of another resource (R2) returns "not running", R2 is not restarted until R1's start returns or, in my case, times out. Since the stop action has not been run on R2, crm_mon still shows it as "Started".

>I think you lost me ;)

>please show a "crm configure show"

primitive dummy-1 ocf:heartbeat:Dummy \
        op monitor interval="30s" \
        op start interval="0" timeout="90s" migration-threshold="0"
primitive dummy-main ocf:heartbeat:Dummy \
        op monitor interval="30s" \
        op start interval="0" timeout="30s" \
        meta migration-threshold="0" target-role="Started"
primitive dummy-sleep ocf:heartbeat:DummySleep \
        op monitor interval="60s" \
        op start interval="0" timeout="2m" \
        meta migration-threshold="0" target-role="Started"
colocation d inf: dummy-sleep dummy-main
colocation d1 inf: dummy-1 dummy-main
property $id="cib-bootstrap-options" \
        dc-version="1.0.6-cebe2b6ff49b36b29a3bd7ada1c4701c7470febe" \
        cluster-infrastructure="Heartbeat" \
        stonith-enabled="false" \
        last-lrm-refresh="1271853339"

>Can you reproduce this easily?

Not easily, but I finally have the correct combination.  In my case I have dependent resources, but I was able to reproduce part of the issue using the Dummy resource.

>Can you reproduce this with just a few "Dummy" resources?

I added an ocf_log to the monitor action so I could tail the messages file to see what was happening.  I created another resource "DummySleep" where I inserted a sleep as follows:

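dummy_start() {
    ocf_log info "OCF_RESKEY_state is ${OCF_RESKEY_state}"
    dummy_monitor
    ret=$?
    ocf_log info "dummy start sleep..."
    # Test hack: block far longer than the 2m start timeout, then fail.
    sleep 3000
    return $OCF_ERR_GENERIC
    # Everything below is unreachable while the two lines above are active;
    # it is the normal start path used when they are commented out.
    ocf_log info "dummy start sleep return..."
    if [ "$ret" -eq "$OCF_SUCCESS" ]; then
        return $OCF_SUCCESS
    fi
    touch "${OCF_RESKEY_state}"
}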
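
For reference, the logging added to the monitor action might look roughly like the sketch below. It assumes the monitor simply checks for the state file, as the stock Dummy agent does, and the message texts are taken from the log output further down. The stubs at the top are assumptions so the function can be run outside a cluster; a real agent gets `ocf_log` and the return codes from the OCF shell function library instead.

```shell
# Stubs for running outside a cluster (assumption, not part of the real agent).
: "${OCF_SUCCESS:=0}"
: "${OCF_NOT_RUNNING:=7}"
command -v ocf_log >/dev/null 2>&1 || ocf_log() {
    level=$1; shift
    echo "$level: $*" >&2
}

# Monitor with extra logging: report every probe, and say so when the
# state file is missing.
dummy_monitor() {
    ocf_log info "dummy monitor"
    if [ -f "${OCF_RESKEY_state}" ]; then
        return $OCF_SUCCESS
    fi
    ocf_log info "Not Running"
    return $OCF_NOT_RUNNING
}
```

With this in place, each probe shows up in the messages file, and a missing state file produces the rc=7 ("not running") result seen in the crmd log line.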

I ran the test as follows:

1)  Commented out the sleep and return to get the DummySleep resource running alongside the others
2)  Replaced the DummySleep OCF script with the version that has the sleep enabled
3)  Ran "crm resource stop dummy-sleep"
4)  Ran "crm resource start dummy-sleep" so that its start action sleeps
5)  Removed ("rm") the state file for dummy-main to cause the failure
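
Steps 3 to 5 could be scripted along these lines. This is only a sketch: the state file path is an assumption (it depends on where the Dummy agent keeps its state on a given install), and the `run` and `reproduce` helpers are hypothetical names. By default it only prints what it would do; set DRY_RUN=0 to actually run the commands.

```shell
# Assumed location of dummy-main's state file; adjust for the local install.
STATE_FILE="${STATE_FILE:-/var/run/Dummy-dummy-main.state}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, execute it otherwise.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

reproduce() {
    run crm resource stop dummy-sleep     # step 3
    run crm resource start dummy-sleep    # step 4: start now blocks in sleep
    run rm -f "$STATE_FILE"               # step 5: make dummy-main's monitor fail
}
```

Calling `reproduce` in a real run assumes dummy-sleep has finished stopping before the start is issued; a pause between steps 3 and 4 may be needed.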

You'll notice that the monitor for dummy-main keeps running and reporting "Not Running", but no restart is attempted until the dummy-sleep start returns.

Apr 21 10:06:56 qpr1 lrmd: [30826]: info: RA output: (dummy-sleep:start:stderr) 2010/04/21_10:06:56 INFO: dummy start sleep...

Apr 21 10:07:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:07:01 INFO: dummy monitor

Apr 21 10:07:26 qpr1 lrmd: [30826]: info: RA output: (dummy-1:monitor:stderr) 2010/04/21_10:07:26 INFO: dummy monitor

Apr 21 10:07:31 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:07:31 INFO: dummy monitor

Apr 21 10:07:31 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:07:31 INFO: Not Running

Apr 21 10:07:31 qpr1 crmd: [30829]: info: process_lrm_event: LRM operation dummy-main_monitor_30000 (call=135, rc=7, cib-update=204, confirmed=false) not running

Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_ha_callback: Update relayed from qpr2

Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_local_callback: Expanded fail-count-dummy-main=value++ to 7

Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-dummy-main (7)

Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_perform_update: Sent update 107: fail-count-dummy-main=7

Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_ha_callback: Update relayed from qpr2

Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-dummy-main (1271858866)

Apr 21 10:07:32 qpr1 attrd: [30828]: info: attrd_perform_update: Sent update 109: last-failure-dummy-main=1271858866

Apr 21 10:07:56 qpr1 lrmd: [30826]: info: RA output: (dummy-1:monitor:stderr) 2010/04/21_10:07:56 INFO: dummy monitor

Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: dummy monitor

Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: Not Running

Apr 21 10:08:26 qpr1 lrmd: [30826]: info: RA output: (dummy-1:monitor:stderr) 2010/04/21_10:08:26 INFO: dummy monitor

Apr 21 10:08:31 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:31 INFO: dummy monitor

Apr 21 10:08:01 qpr1 lrmd: [30826]: info: RA output: (dummy-main:monitor:stderr) 2010/04/21_10:08:01 INFO: Not Running
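
To see at a glance how often a resource's monitor reported a failure without being recovered, the lrmd lines above can be counted with a small filter. This is a sketch that assumes the log format shown here; `count_not_running` is a hypothetical helper name, and the log file path is whatever the local syslog setup uses.

```shell
# Count "Not Running" monitor reports for one resource in a log file.
# $1 = resource name, $2 = path to the log file
count_not_running() {
    grep -c "($1:monitor:stderr).*INFO: Not Running" "$2"
}
```

For example, `count_not_running dummy-main /var/log/messages` prints the number of failed probes logged for dummy-main.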
