[Pacemaker] why pacemaker does not control the resources

Tue Nov 12 18:18:23 EST 2013

On 12 Nov 2013, at 4:42 pm, Andrey Groshev <greenx at yandex.ru> wrote:

> 
> 
> 11.11.2013, 03:44, "Andrew Beekhof" <andrew at beekhof.net>:
>> On 8 Nov 2013, at 7:49 am, Andrey Groshev <greenx at yandex.ru> wrote:
>> 
>>>  Hi, all!
>>>  I need help. I do not understand why this has stopped working.
>>>  This configuration works on another cluster, but that one runs corosync 1.
>>> 
>>>  So... a PostgreSQL cluster with master/slave.
>>>  Classic config, as in the wiki.
>>>  I build the cluster and start it, and it works.
>>>  Next I kill postgres on the master with signal 6, as if there were "no disk space left".
>>> 
>>>  # pkill -6 postgres
>>>  # ps axuww|grep postgres
>>>  root      9032  0.0  0.1 103236   860 pts/0    S+   00:37   0:00 grep postgres
>>> 
>>>  PostgreSQL dies, but crm_mon shows that the master is still running.
>>> 
>>>  Last updated: Fri Nov  8 00:42:08 2013
>>>  Last change: Fri Nov  8 00:37:05 2013 via crm_attribute on dev-cluster2-node4
>>>  Stack: corosync
>>>  Current DC: dev-cluster2-node4 (172793107) - partition with quorum
>>>  Version: 1.1.10-1.el6-368c726
>>>  3 Nodes configured
>>>  7 Resources configured
>>> 
>>>  Node dev-cluster2-node2 (172793105): online
>>>         pingCheck       (ocf::pacemaker:ping):  Started
>>>         pgsql   (ocf::heartbeat:pgsql): Started
>>>  Node dev-cluster2-node3 (172793106): online
>>>         pingCheck       (ocf::pacemaker:ping):  Started
>>>         pgsql   (ocf::heartbeat:pgsql): Started
>>>  Node dev-cluster2-node4 (172793107): online
>>>         pgsql   (ocf::heartbeat:pgsql): Master
>>>         pingCheck       (ocf::pacemaker:ping):  Started
>>>         VirtualIP       (ocf::heartbeat:IPaddr2):       Started
>>> 
>>>  Node Attributes:
>>>  * Node dev-cluster2-node2:
>>>     + default_ping_set                  : 100
>>>     + master-pgsql                      : -INFINITY
>>>     + pgsql-data-status                 : STREAMING|ASYNC
>>>     + pgsql-status                      : HS:async
>>>  * Node dev-cluster2-node3:
>>>     + default_ping_set                  : 100
>>>     + master-pgsql                      : -INFINITY
>>>     + pgsql-data-status                 : STREAMING|ASYNC
>>>     + pgsql-status                      : HS:async
>>>  * Node dev-cluster2-node4:
>>>     + default_ping_set                  : 100
>>>     + master-pgsql                      : 1000
>>>     + pgsql-data-status                 : LATEST
>>>     + pgsql-master-baseline             : 0000000002000078
>>>     + pgsql-status                      : PRI
>>> 
>>>  Migration summary:
>>>  * Node dev-cluster2-node4:
>>>  * Node dev-cluster2-node2:
>>>  * Node dev-cluster2-node3:
>>> 
>>>  Tickets:
>>> 
>>>  CONFIG:
>>>  node $id="172793105" dev-cluster2-node2. \
>>>         attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>  node $id="172793106" dev-cluster2-node3. \
>>>         attributes pgsql-data-status="STREAMING|ASYNC" standby="false"
>>>  node $id="172793107" dev-cluster2-node4. \
>>>         attributes pgsql-data-status="LATEST"
>>>  primitive VirtualIP ocf:heartbeat:IPaddr2 \
>>>         params ip="10.76.157.194" \
>>>         op start interval="0" timeout="60s" on-fail="stop" \
>>>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>         op stop interval="0" timeout="60s" on-fail="block"
>>>  primitive pgsql ocf:heartbeat:pgsql \
>>>         params pgctl="/usr/pgsql-9.1/bin/pg_ctl" psql="/usr/pgsql-9.1/bin/psql" pgdata="/var/lib/pgsql/9.1/data" tmpdir="/tmp/pg" start_opt="-p 5432" logfile="/var/lib/pgsql/9.1//pgstartup.log" rep_mode="async" node_list=" dev-cluster2-node2. dev-cluster2-node3. dev-cluster2-node4. " restore_command="gzip -cd /var/backup/pitr/dev-cluster2-master#5432/xlog/%f.gz > %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip="10.76.157.194" \
>>>         op start interval="0" timeout="60s" on-fail="restart" \
>>>         op monitor interval="5s" timeout="61s" on-fail="restart" \
>>>         op monitor interval="1s" role="Master" timeout="62s" on-fail="restart" \
>>>         op promote interval="0" timeout="63s" on-fail="restart" \
>>>         op demote interval="0" timeout="64s" on-fail="stop" \
>>>         op stop interval="0" timeout="65s" on-fail="block" \
>>>         op notify interval="0" timeout="66s"
>>>  primitive pingCheck ocf:pacemaker:ping \
>>>         params name="default_ping_set" host_list="10.76.156.1" multiplier="100" \
>>>         op start interval="0" timeout="60s" on-fail="restart" \
>>>         op monitor interval="10s" timeout="60s" on-fail="restart" \
>>>         op stop interval="0" timeout="60s" on-fail="ignore"
>>>  ms msPostgresql pgsql \
>>>         meta master-max="1" master-node-max="1" clone-node-max="1" notify="true" target-role="Master" clone-max="3"
>>>  clone clnPingCheck pingCheck \
>>>         meta clone-max="3"
>>>  location l0_DontRunPgIfNotPingGW msPostgresql \
>>>         rule $id="l0_DontRunPgIfNotPingGW-rule" -inf: not_defined default_ping_set or default_ping_set lt 100
>>>  colocation r0_StartPgIfPingGW inf: msPostgresql clnPingCheck
>>>  colocation r1_MastersGroup inf: VirtualIP msPostgresql:Master
>>>  order rsc_order-1 0: clnPingCheck msPostgresql
>>>  order rsc_order-2 0: msPostgresql:promote VirtualIP:start symmetrical=false
>>>  order rsc_order-3 0: msPostgresql:demote VirtualIP:stop symmetrical=false
>>>  property $id="cib-bootstrap-options" \
>>>         dc-version="1.1.10-1.el6-368c726" \
>>>         cluster-infrastructure="corosync" \
>>>         stonith-enabled="false" \
>>>         no-quorum-policy="stop"
>>>  rsc_defaults $id="rsc-options" \
>>>         resource-stickiness="INFINITY" \
>>>         migration-threshold="1"
>>> 
>>>  Tell me where to look - why does pacemaker not work?
>> 
>> You might want to follow some of the steps at:
>> 
>>    http://blog.clusterlabs.org/blog/2013/debugging-pacemaker/
>> 
>> under the heading "Resource-level failures".
> 
> Yes. Thank you.
> I've seen this article and am now studying it in more detail.
> There is a lot of information in the logs, so it is difficult to tell which entries are the error and which are its consequences.
> I'm trying to figure it out now.
> 
> BUT...
> For now I can say with certainty that the RA's monitor for the MS (pgsql) is called ONLY on the node where Pacemaker was started last.

It looks like you're hitting https://github.com/beekhof/pacemaker/commit/58962338
Since you appear to be on rhel6 (or a clone of rhel6), can I suggest you use the 1.1.10 packages that come with 6.4?
They include the above patch.

Also, just to be sure. Are you expecting monitor operations to detect when you started a resource manually?
If so, you'll need a monitor operation with role=Stopped. We don't do that by default.
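
For illustration, a minimal sketch of such an operation in crm shell syntax, as one extra op line on the pgsql primitive from the quoted configuration. The 30s interval and 60s timeout are assumptions; the interval only needs to differ from the other monitor intervals defined on the same resource.

```
# Extra op line for the existing "primitive pgsql" definition.
# interval/timeout are illustrative assumptions, not recommended values.
        op monitor interval="30s" role="Stopped" timeout="60s" \
```

This covers only the case of a resource started outside the cluster's control. It is unrelated to detecting a crashed master, which the regular role="Master" monitor is meant to catch.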

>> 
>> 'crm_mon -o' might be a good source of information too.
> As a result, I see my resources reported as functioning normally.
> 
> # crm_mon -o1 
> Last updated: Tue Nov 12 09:27:16 2013
> Last change: Tue Nov 12 00:08:35 2013 via crm_attribute on dev-cluster2-node2
> Stack: corosync
> Current DC: dev-cluster2-node2 (172793105) - partition with quorum
> Version: 1.1.10-1.el6-368c726
> 3 Nodes configured
> 337 Resources configured
> 
> 
> Online: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
> 
> Clone Set: clonePing [pingCheck]
>     Started: [ dev-cluster2-node2 dev-cluster2-node3 dev-cluster2-node4 ]
> Master/Slave Set: msPgsql [pgsql]
>     Masters: [ dev-cluster2-node2 ]
>     Slaves: [ dev-cluster2-node3 dev-cluster2-node4 ]
> VirtualIP      (ocf::heartbeat:IPaddr2):       Started dev-cluster2-node2
> 
> Operations:
> * Node dev-cluster2-node2:
>   pingCheck: migration-threshold=1
>    + (20) start: rc=0 (ok)
>    + (23) monitor: interval=10000ms rc=0 (ok)
>   pgsql: migration-threshold=1
>    + (41) promote: rc=0 (ok)
>    + (87) monitor: interval=1000ms rc=8 (master)
>   VirtualIP: migration-threshold=1
>    + (49) start: rc=0 (ok)
>    + (52) monitor: interval=10000ms rc=0 (ok)
> * Node dev-cluster2-node3:
>   pingCheck: migration-threshold=1
>    + (20) start: rc=0 (ok)
>    + (23) monitor: interval=10000ms rc=0 (ok)
>   pgsql: migration-threshold=1
>    + (26) start: rc=0 (ok)
>    + (32) monitor: interval=10000ms rc=0 (ok)
> * Node dev-cluster2-node4:
>   pingCheck: migration-threshold=1
>    + (20) start: rc=0 (ok)
>    + (23) monitor: interval=10000ms rc=0 (ok)
>   pgsql: migration-threshold=1
>    + (26) start: rc=0 (ok)
>    + (32) monitor: interval=10000ms rc=0 (ok)
> 
> In reality, I have now killed (signal 4 or 6) the PG master and the penultimate PG slave.
> IMHO, even if I have configured something incorrectly, the inability to monitor a resource should cause a fatal error.
> Or is there a reason not to do so?
> 
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> 
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
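
To compare what is recorded per node, the "Operations:" section of the quoted `crm_mon -o1` output can be reduced to just the recurring monitor entries of one resource. The following is a rough sketch: the function name `list_monitors` is hypothetical, and it assumes the line layout shown in the output above (`* Node <name>:`, `<resource>: migration-threshold=...`, `+ (<id>) monitor: ...`).

```shell
#!/bin/sh
# list_monitors RESOURCE
# Reads `crm_mon -o1` text on stdin and prints, per node, the recurring
# monitor operations recorded for RESOURCE.
# (hypothetical helper; assumes the output layout quoted in this thread)
list_monitors() {
    awk -v rsc="$1" '
        # A new node block starts: remember the node name, reset the resource flag.
        /^\* Node / { node = $3; sub(/:$/, "", node); in_rsc = 0; next }
        # A resource header inside the node block: check whether it is ours.
        /^ +[^ +][^ ]*: migration-threshold/ {
            name = $1; sub(/:$/, "", name); in_rsc = (name == rsc); next
        }
        # A monitor entry of the selected resource: print interval, rc and status.
        in_rsc && /^ +\+ \([0-9]+\) monitor:/ { print node, $4, $5, $6 }
    '
}
```

On a cluster node it could be used as `crm_mon -o1 | list_monitors pgsql`. It only reports the operation history Pacemaker has recorded; it does not prove that the resource agent is actually being invoked on that node.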
