[Pacemaker] [Problem] lrmd detects a monitor time-out when the system time is changed.
renayama19661014 at ybb.ne.jp
Fri Sep 5 04:22:18 UTC 2014
Hi All,
We have confirmed that lrmd reports a time-out of the monitor operation when the system time is changed.
Since the system clock may be stepped when ntpd corrects the time, this is a serious problem.
The problem can be reproduced with the following procedure.
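(As background for the procedure below: a forward step of the system clock is visible only on the wall clock; the monotonic clock is not affected. The small GLib program below is only my sketch for observation, not Pacemaker code, and the file name clock_check.c is made up. Running it before and after the clock step done in Step3 shows that only the wall-clock value jumps, which is the distinction that matters for time-out handling.)
--------clock_check.c-------------
/* Minimal sketch (not Pacemaker code): print the wall clock and the
 * monotonic clock, so that a clock step such as "date -s +40sec" can be
 * seen on the former but not on the latter.
 *
 * Build (assuming the GLib development files are installed):
 *   gcc clock_check.c -o clock_check $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>

int main(void)
{
    /* g_get_real_time(): microseconds since the Epoch; follows clock steps. */
    gint64 wall = g_get_real_time();
    /* g_get_monotonic_time(): microseconds on a clock that never jumps. */
    gint64 mono = g_get_monotonic_time();

    printf("wall clock     : %" G_GINT64_FORMAT " us\n", wall);
    printf("monotonic clock: %" G_GINT64_FORMAT " us\n", mono);
    return 0;
}
----------------------------------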
Step1) Start Pacemaker on a single node.
[root@snmp1 ~]# start pacemaker.combined
pacemaker.combined start/running, process 11382
Step2) Load a simple crm configuration.
--------trac2915-3.crm------------
primitive prmDummyA ocf:pacemaker:Dummy1 \
        op start interval="0s" timeout="60s" on-fail="restart" \
        op monitor interval="10s" timeout="30s" on-fail="restart" \
        op stop interval="0s" timeout="60s" on-fail="block"
group grpA prmDummyA
location rsc_location-grpA-1 grpA \
        rule $id="rsc_location-grpA-1-rule" 200: #uname eq snmp1 \
        rule $id="rsc_location-grpA-1-rule-0" 100: #uname eq snmp2
property $id="cib-bootstrap-options" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        crmd-transition-delay="2s"
rsc_defaults $id="rsc-options" \
        resource-stickiness="INFINITY" \
        migration-threshold="1"
----------------------------------
[root@snmp1 ~]# crm configure load update trac2915-3.crm
WARNING: rsc_location-grpA-1: referenced node snmp2 does not exist
[root@snmp1 ~]# crm_mon -1 -Af
Last updated: Fri Sep 5 13:09:45 2014
Last change: Fri Sep 5 13:09:13 2014
Stack: corosync
Current DC: snmp1 (3232238180) - partition WITHOUT quorum
Version: 1.1.12-561c4cf
1 Nodes configured
1 Resources configured
Online: [ snmp1 ]
Resource Group: grpA
prmDummyA (ocf::pacemaker:Dummy1): Started snmp1
Node Attributes:
* Node snmp1:
Migration summary:
* Node snmp1:
Step3) Just after the monitor of the resource has started, advance the system time by more than the monitor timeout (timeout="30s").
[root@snmp1 ~]# date -s +40sec
Fri Sep 5 13:11:04 JST 2014
Step4) The time-out of the monitor occurs.
[root@snmp1 ~]# crm_mon -1 -Af
Last updated: Fri Sep 5 13:11:24 2014
Last change: Fri Sep 5 13:09:13 2014
Stack: corosync
Current DC: snmp1 (3232238180) - partition WITHOUT quorum
Version: 1.1.12-561c4cf
1 Nodes configured
1 Resources configured
Online: [ snmp1 ]
Node Attributes:
* Node snmp1:
Migration summary:
* Node snmp1:
prmDummyA: migration-threshold=1 fail-count=1 last-failure='Fri Sep 5 13:11:04 2014'
Failed actions:
prmDummyA_monitor_10000 on snmp1 'unknown error' (1): call=7, status=Timed Out, last-rc-change='Fri Sep 5 13:11:04 2014', queued=0ms, exec=0ms
I looked into this, and the problem seems to be caused by how lrmd handles events in its g_main_loop: a time-out event fires even though the time actually elapsed is shorter than the monitor timeout.
This problem does not seem to occur with the lrmd of Pacemaker 1.0.
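To illustrate the kind of behaviour I suspect (only a sketch under that assumption, not the actual lrmd code; the file and function names are made up): if a monitor deadline is computed as "wall-clock now + 30s" and checked from a callback in the g_main_loop, then stepping the clock forward by 40 seconds makes the 30-second timeout appear expired within the next check, while a timer armed with g_timeout_add_seconds() is driven by GLib's monotonic clock and fires only after 30 real seconds.
--------timeout_sketch.c----------
/* Hypothetical illustration, NOT the real lrmd logic: arm a 30s
 * "monitor timeout" in two ways and see which one is fooled by a
 * forward step of the system clock (e.g. "date -s +40sec").
 *
 * Build: gcc timeout_sketch.c -o timeout_sketch $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>

#define TIMEOUT_US (30 * G_USEC_PER_SEC)   /* 30s, like the monitor timeout */

static gint64 wall_deadline;               /* deadline taken from the wall clock */

/* Runs every second from the main loop, mimicking a periodic check. */
static gboolean check_wall_deadline(gpointer data)
{
    GMainLoop *loop = data;

    if (g_get_real_time() >= wall_deadline) {
        /* Normally reached after ~30s, but right away after "date -s +40sec". */
        printf("wall-clock deadline reached -> would be reported as Timed Out\n");
        g_main_loop_quit(loop);
        return G_SOURCE_REMOVE;
    }
    return G_SOURCE_CONTINUE;
}

/* GLib schedules timeout sources on the monotonic clock (GLib >= 2.28),
 * so this fires only after 30 real seconds, regardless of clock steps. */
static gboolean monotonic_timeout(gpointer data)
{
    printf("monotonic 30s timer fired after 30 real seconds\n");
    g_main_loop_quit(data);
    return G_SOURCE_REMOVE;
}

int main(void)
{
    GMainLoop *loop = g_main_loop_new(NULL, FALSE);

    wall_deadline = g_get_real_time() + TIMEOUT_US;
    g_timeout_add_seconds(1, check_wall_deadline, loop);
    g_timeout_add_seconds(30, monotonic_timeout, loop);

    g_main_loop_run(loop);
    g_main_loop_unref(loop);
    return 0;
}
----------------------------------
If that suspicion is right, basing the time-out handling on the monotonic clock (like the second timer in the sketch) rather than on the wall clock would avoid the false "Timed Out" after a clock step.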
Best Regards,
Hideo Yamauchi.