[Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg
Bob Schatz
bschatz at yahoo.com
Fri Mar 25 23:32:42 UTC 2011
A few more thoughts that occurred after I hit <return>
1. This problem seems to occur only when "/etc/init.d/heartbeat start" is
executed on two nodes at the same time. If I start one node at a time, it does
not seem to occur. (This may be related to the creation of master/slave
resources in /etc/ha.d/resource.d/startstop when heartbeat starts.)
2. This problem seemed to occur most frequently when I went from 4 master/slave
resources to 6 master/slave resources.
Thanks,
Bob
----- Original Message ----
From: Bob Schatz <bschatz at yahoo.com>
To: The Pacemaker cluster resource manager <pacemaker at oss.clusterlabs.org>
Sent: Fri, March 25, 2011 4:22:39 PM
Subject: Re: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field
lrm_opstatus from a ha_msg
After reading more threads, I noticed that I needed to include the PE outputs.
Therefore, I have rerun the tests and included the PE outputs, the configuration
file and the logs for both nodes.
The test was rerun with max-children of 20.
Thanks,
Bob
----- Original Message ----
From: Bob Schatz <bschatz at yahoo.com>
To: pacemaker at oss.clusterlabs.org
Sent: Thu, March 24, 2011 7:35:54 PM
Subject: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field
lrm_opstatus from a ha_msg
I am getting these messages in the log:
2011-03-24 18:53:12| warning |crmd: [27913]: WARN: msg_to_op(1324): failed to
get the value of field lrm_opstatus from a ha_msg
2011-03-24 18:53:12| info |crmd: [27913]: info: msg_to_op: Message follows:
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 16
fields
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [lrm_t=op]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] :
[lrm_rid=SSJ0000E02A2:0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [lrm_op=start]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [lrm_timeout=300000]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [lrm_interval=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [lrm_delay=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [lrm_copyparams=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [lrm_t_run=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [lrm_t_rcchange=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [lrm_exec_time=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [lrm_queue_time=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [lrm_targetrc=-1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [lrm_app=crmd]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] :
[lrm_userdata=91:3:0:dc9ad1c7-1d74-4418-a002-34426b34b576]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] :
[(2)lrm_param=0x64c230(938 1098)]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 27
fields
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [CRM_meta_clone=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] :
[CRM_meta_notify_slave_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] :
[CRM_meta_notify_active_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] :
[CRM_meta_notify_demote_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] :
[CRM_meta_notify_inactive_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] :
[ssconf=/var/omneon/config/config.J0000E02A2]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] :
[CRM_meta_master_node_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] :
[CRM_meta_notify_stop_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] :
[CRM_meta_notify_master_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] :
[CRM_meta_clone_node_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] :
[CRM_meta_clone_max=2]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] :
[CRM_meta_notify=true]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] :
[CRM_meta_notify_start_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] :
[CRM_meta_notify_stop_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] :
[crm_feature_set=3.0.1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] :
[CRM_meta_notify_master_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[16] :
[CRM_meta_master_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[17] :
[CRM_meta_globally_unique=false]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[18] :
[CRM_meta_notify_promote_resource=SSJ0000E02A2:0 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[19] :
[CRM_meta_notify_promote_uname=mgraid-s0000e02a1-0 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[20] :
[CRM_meta_notify_active_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[21] :
[CRM_meta_notify_start_uname=mgraid-s0000e02a1-0 mgraid-s0000e02a1-1 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[22] :
[CRM_meta_notify_slave_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[23] :
[CRM_meta_name=start]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[24] :
[ss_resource=SSJ0000E02A2]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[25] :
[CRM_meta_notify_demote_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[26] :
[CRM_meta_timeout=300000]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [lrm_callid=15]
This results in the resources being stopped even though I can see from the
logging that the agent START function returned $OCF_SUCCESS. (The agent start
function prints "ss_start() START" and "ss_start() END" in the logging).
The START function can take anywhere from 30 to 60 seconds to complete due to
our application.
I am running Pacemaker 1.0.9 and heartbeat 3.0.3.
I have attached the configuration as a file since including it inline would
have made this email unreadable. (Summary: 6 master/slave resources.)
I have also attached the logs. The messages above are from the file
n0-short.txt but also occur in n1-short.txt.
I thought that maybe I was running into a limit on the number of child
processes that lrmd had configured. I increased it to 40 and verified that it
was in effect with:
# /sbin/lrmadmin -g max-children
max-children: 40
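For reference, a sketch of how the limit can be checked and raised at runtime, assuming the -p (set-parameter) option of the lrmadmin build shipped with heartbeat 3.0.x; how to make the change persist across restarts varies by distribution, so treat this as illustrative rather than definitive:

```shell
# Query the current limit on concurrent lrmd child processes
/sbin/lrmadmin -g max-children

# Raise the limit for the running lrmd (not persistent across restarts)
/sbin/lrmadmin -p max-children 40
```

Note that raising max-children only helps if operations are actually queuing behind the limit; it does not change per-operation timeouts.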
This problem is reproducible every time.
Thanks in advance,
Bob