[Pacemaker] fail-count is not updated
David Vossel
dvossel at redhat.com
Tue Apr 3 20:45:14 UTC 2012
----- Original Message -----
> From: "Kazunori INOUE" <inouekazu at intellilink.co.jp>
> To: "pacemaker at oss" <pacemaker at oss.clusterlabs.org>
> Cc: tanakakza at intellilink.co.jp
> Sent: Monday, April 2, 2012 12:40:20 AM
> Subject: [Pacemaker] fail-count is not updated
>
> Hi, Andrew
>
> When pacemaker-1.1.7 is combined with corosync-1.99.9, the
> fail-count is not updated when a monitor failure occurs.
>
> I am using the newest devel.
> - pacemaker : 7172b7323bb72c51999ce11c6fa5d3ff0a0a4b4f
> - corosync : 4b2cfc3f6beabe517b28ea31c5340bf3b0a6b455
> - glue : 041b464f74c8
> - libqb : 7b13d09afbb684f9ee59def23b155b38a21987df
>
> # crm_mon -f1
> ============
> Last updated: Mon Apr 2 14:03:03 2012
> Last change: Mon Apr 2 14:02:33 2012 via cibadmin on vm1
> Stack: corosync
> Current DC: vm1 (224766144) - partition with quorum
> Version: 1.1.7-7172b73
> 2 Nodes configured, unknown expected votes
> 1 Resources configured.
> ============
>
> Online: [ vm1 vm2 ]
>
> prmDummy1 (ocf::pacemaker:Dummy): Started vm1
>
> Migration summary:
> * Node vm1:
> * Node vm2:
>
> Failed actions:
> prmDummy1_monitor_10000 (node=vm1, call=4, rc=7, status=complete): not running
> #
>
> I think this is because corosync's node ID and the hostname are
> intermingled in the value that identifies a cluster node.
> I added the following debugging code (line 769):
>
> # vi pacemaker/tools/attrd.c
> <snip>
> 752 void
> 753 attrd_local_callback(xmlNode * msg)
> 754 {
> <snip>
> 768
> 769 crm_info("DEBUG: [%s,%s,%s,%s,%s],[%s]\n", from, op, attr, value, host, attrd_uname);
> 770 if (host != NULL && safe_str_neq(host, attrd_uname)) {
> 771 send_cluster_message(host, crm_msg_attrd, msg, FALSE);
> 772 return;
> 773 }
> 774
> 775 crm_debug("%s message from %s: %s=%s", op, from, attr, crm_str(value));
>
> [root at vm1 ~]# grep DEBUG /var/log/ha-debug
> <snip>
> Apr 2 14:02:34 vm1 Dummy(prmDummy1)[21140]: DEBUG: prmDummy1 monitor : 7
> Apr 2 14:02:34 vm1 attrd[21077]: info: attrd_local_callback: DEBUG: [crmd,update,probe_complete,true,(null)],[vm1]
> Apr 2 14:02:34 vm1 Dummy(prmDummy1)[21151]: DEBUG: prmDummy1 start : 0
> Apr 2 14:02:34 vm1 Dummy(prmDummy1)[21159]: DEBUG: prmDummy1 monitor : 0
> Apr 2 14:02:44 vm1 Dummy(prmDummy1)[21166]: DEBUG: prmDummy1 monitor : 0
> Apr 2 14:02:54 vm1 Dummy(prmDummy1)[21175]: DEBUG: prmDummy1 monitor : 7
> Apr 2 14:02:54 vm1 attrd[21077]: info: attrd_local_callback: DEBUG: [crmd,update,fail-count-prmDummy1,value++,224766144],[vm1]
> Apr 2 14:02:54 vm1 attrd[21077]: info: attrd_local_callback: DEBUG: [crmd,update,last-failure-prmDummy1,1333342974,224766144],[vm1]
> Apr 2 14:02:54 vm1 Dummy(prmDummy1)[21182]: DEBUG: prmDummy1 stop : 0
> Apr 2 14:02:54 vm1 Dummy(prmDummy1)[21189]: DEBUG: prmDummy1 start : 0
> Apr 2 14:02:54 vm1 Dummy(prmDummy1)[21201]: DEBUG: prmDummy1 monitor : 0
>
> Corosync's node ID was stored in the variable 'host', while the
> hostname was stored in the variable 'attrd_uname'.
>
> [root at vm1 ~]# corosync-cfgtool -s | grep node
> Local node ID 224766144
> [root at vm1 ~]#
>
> Regards,
> Kazunori INOUE
Yep, I am seeing this as well. I created a bug report for the issue here, http://bugs.clusterlabs.org/show_bug.cgi?id=5053. I'll take a shot at fixing this soon.
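For reference, the log above shows crmd handing attrd the corosync node ID ("224766144") as 'host' while attrd compares it against the uname ("vm1"), so the safe_str_neq(host, attrd_uname) check at line 770 is always true and the update is forwarded to the cluster instead of being applied locally. Below is a minimal, standalone sketch of one possible direction, normalizing a purely numeric 'host' to a uname before the comparison. It is not attrd's actual code: lookup_uname_by_nodeid() is a hypothetical stand-in for whatever membership lookup the real fix would use (e.g. something along the lines of crm_get_peer()).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the real cluster membership lookup. */
static const char *lookup_uname_by_nodeid(unsigned long id)
{
    return (id == 224766144UL) ? "vm1" : NULL;  /* stub for illustration only */
}

/* Return the uname for 'host', resolving it first if it looks like a node ID. */
static const char *normalize_host(const char *host)
{
    char *end = NULL;
    unsigned long id;

    if (host == NULL || *host == '\0') {
        return host;
    }
    id = strtoul(host, &end, 10);
    if (end != NULL && *end == '\0') {
        /* Purely numeric: treat it as a corosync node ID and map it to a uname. */
        const char *uname = lookup_uname_by_nodeid(id);
        return (uname != NULL) ? uname : host;
    }
    return host;  /* already a uname */
}

int main(void)
{
    const char *attrd_uname = "vm1";
    const char *host = "224766144";  /* what crmd sent, per the log above */

    printf("raw compare:        %s\n",
           strcmp(host, attrd_uname) != 0 ? "forward to cluster" : "apply locally");
    printf("normalized compare: %s\n",
           strcmp(normalize_host(host), attrd_uname) != 0 ? "forward to cluster" : "apply locally");
    return 0;
}

Compiled and run, this prints "forward to cluster" for the raw comparison and "apply locally" once the node ID has been normalized, which is the behaviour the fail-count update needs. However the real fix ends up looking, the point is that attrd and crmd have to agree on how a node is identified in these updates.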
-- Vossel