[Pacemaker] WARN: msg_to_op(1324): failed to get the value of field lrm_opstatus from a ha_msg
Bob Schatz
bschatz at yahoo.com
Fri Mar 25 23:32:42 UTC 2011
A few more thoughts that occurred after I hit <return>
1. This problem seems to occur only when "/etc/init.d/heartbeat start" is
executed on two nodes at the same time. If I start one node at a time, it does
not seem to occur. (This may be related to the creation of master/slave
resources in /etc/ha.d/resource.d/startstop when heartbeat starts.)
2. This problem seemed to occur most frequently when I went from 4 master/slave
resources to 6 master/slave resources.
Thanks,
Bob
----- Original Message ----
From: Bob Schatz <bschatz at yahoo.com>
To: The Pacemaker cluster resource manager <pacemaker at oss.clusterlabs.org>
Sent: Fri, March 25, 2011 4:22:39 PM
Subject: Re: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field
lrm_opstatus from a ha_msg
After reading more threads, I noticed that I needed to include the PE outputs.
Therefore, I have rerun the tests and included the PE outputs, the configuration
file and the logs for both nodes.
The test was rerun with max-children of 20.
Thanks,
Bob
----- Original Message ----
From: Bob Schatz <bschatz at yahoo.com>
To: pacemaker at oss.clusterlabs.org
Sent: Thu, March 24, 2011 7:35:54 PM
Subject: [Pacemaker] WARN: msg_to_op(1324): failed to get the value of field
lrm_opstatus from a ha_msg
I am getting these messages in the log:
2011-03-24 18:53:12| warning |crmd: [27913]: WARN: msg_to_op(1324): failed to
get the value of field lrm_opstatus from a ha_msg
2011-03-24 18:53:12| info |crmd: [27913]: info: msg_to_op: Message follows:
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 16
fields
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [lrm_t=op]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] :
[lrm_rid=SSJ0000E02A2:0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] : [lrm_op=start]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] : [lrm_timeout=300000]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] : [lrm_interval=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] : [lrm_delay=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] : [lrm_copyparams=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] : [lrm_t_run=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] : [lrm_t_rcchange=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] : [lrm_exec_time=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] : [lrm_queue_time=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] : [lrm_targetrc=-1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] : [lrm_app=crmd]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] :
[lrm_userdata=91:3:0:dc9ad1c7-1d74-4418-a002-34426b34b576]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] :
[(2)lrm_param=0x64c230(938 1098)]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG: Dumping message with 27
fields
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[0] : [CRM_meta_clone=0]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[1] :
[CRM_meta_notify_slave_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[2] :
[CRM_meta_notify_active_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[3] :
[CRM_meta_notify_demote_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[4] :
[CRM_meta_notify_inactive_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[5] :
[ssconf=/var/omneon/config/config.J0000E02A2]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[6] :
[CRM_meta_master_node_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[7] :
[CRM_meta_notify_stop_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[8] :
[CRM_meta_notify_master_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[9] :
[CRM_meta_clone_node_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[10] :
[CRM_meta_clone_max=2]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[11] :
[CRM_meta_notify=true]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[12] :
[CRM_meta_notify_start_resource=SSJ0000E02A2:0 SSJ0000E02A2:1 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[13] :
[CRM_meta_notify_stop_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[14] :
[crm_feature_set=3.0.1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] :
[CRM_meta_notify_master_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[16] :
[CRM_meta_master_max=1]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[17] :
[CRM_meta_globally_unique=false]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[18] :
[CRM_meta_notify_promote_resource=SSJ0000E02A2:0 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[19] :
[CRM_meta_notify_promote_uname=mgraid-s0000e02a1-0 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[20] :
[CRM_meta_notify_active_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[21] :
[CRM_meta_notify_start_uname=mgraid-s0000e02a1-0 mgraid-s0000e02a1-1 ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[22] :
[CRM_meta_notify_slave_uname= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[23] :
[CRM_meta_name=start]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[24] :
[ss_resource=SSJ0000E02A2]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[25] :
[CRM_meta_notify_demote_resource= ]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[26] :
[CRM_meta_timeout=300000]
2011-03-24 18:53:12| info |crmd: [27913]: info: MSG[15] : [lrm_callid=15]
This results in the resources being stopped even though I can see from the
logging that the agent START function returned $OCF_SUCCESS. (The agent start
function prints "ss_start() START" and "ss_start() END" in the logging).
The START function can take anywhere from 30 to 60 seconds to complete due to
our application.
I am running Pacemaker 1.0.9 and heartbeat 3.0.3.
I have attached the configuration as a file since including it inline would
have made this email unreadable. (Summary: 6 master/slave resources.)
I have also attached the logs. The messages above are from the file
n0-short.txt but also occur in n1-short.txt.
I thought that maybe I was running into a limit on the number of child
processes that lrmd had configured. I increased it to 40 and verified that it
was in effect with:
# /sbin/lrmadmin -g max-children
max-children: 40
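For reference, a sketch of how the limit can be checked and raised at runtime, assuming the -p (set-parameter) option of the lrmadmin build shipped with heartbeat 3.0.x; how to make the change persist across restarts varies by distribution, so treat this as illustrative rather than definitive:

```shell
# Query the current limit on concurrent lrmd child processes
/sbin/lrmadmin -g max-children

# Raise the limit for the running lrmd (not persistent across restarts)
/sbin/lrmadmin -p max-children 40
```

Note that raising max-children only helps if operations are actually queuing behind the limit; it does not change per-operation timeouts.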
This problem is reproducible every time.
Thanks in advance,
Bob