[Pacemaker] standby attribute and same resources running at the same time

Wed Mar 6 07:17:13 EST 2013

On 06.03.2013 at 05:14, Andrew Beekhof <andrew at beekhof.net> wrote:
> On Tue, Mar 5, 2013 at 4:20 AM, Leon Fauster <leonfauster at googlemail.com> wrote:
>> 
>> So far, all good. I am doing some stress tests now and noticed that when I reboot
>> one node (n2), that node (n2) is marked as standby in the CIB (as shown on the
>> other node (n1)).
>> 
>> After the reboot, crm_mon on n2 shows that the other node (n1) is offline,
>> and n2 begins to start the resources, while the node that wasn't rebooted (n1)
>> still shows n2 as standby. At that point both nodes are running the "same"
>> resources. After a couple of minutes the situation is noticed and both nodes
>> renegotiate the current state. Then one node takes over responsibility for providing
>> the resources. On both nodes the previously rebooted node is still listed as standby.
>> 
>> 
>>  cat /var/log/messages |grep error
>>  Mar  4 17:32:33 cn1 pengine[1378]:    error: native_create_actions: Resource resIP (ocf::IPaddr2) is active on 2 nodes attempting recovery
>>  Mar  4 17:32:33 cn1 pengine[1378]:    error: native_create_actions: Resource resApache (ocf::apache) is active on 2 nodes attempting recovery
>>  Mar  4 17:32:33 cn1 pengine[1378]:    error: process_pe_message: Calculated Transition 1: /var/lib/pacemaker/pengine/pe-error-6.bz2
>>  Mar  4 17:32:48 cn1 crmd[1379]:   notice: run_graph: Transition 1 (Complete=9, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-error-6.bz2): Complete
>> 
>> 
>>  crm_mon -1
>>  Last updated: Mon Mar  4 17:49:08 2013
>>  Last change: Mon Mar  4 10:22:53 2013 via crm_resource on cn1.localdomain
>>  Stack: cman
>>  Current DC: cn1.localdomain - partition with quorum
>>  Version: 1.1.8-7.el6-394e906
>>  2 Nodes configured, 2 expected votes
>>  2 Resources configured.
>> 
>>  Node cn2.localdomain: standby
>>  Online: [ cn1.localdomain ]
>> 
>>  resIP (ocf::heartbeat:IPaddr2):       Started cn1.localdomain
>>  resApache     (ocf::heartbeat:apache):        Started cn1.localdomain
>> 
>> 
>> I checked the init scripts and found that the standby "behavior" comes
>> from a function that is called on "service pacemaker stop" (added in RHEL 6.4).
>> 
>> cman_pre_stop()
>> {
>>    cname=`crm_node --name`
>>    crm_attribute -N $cname -n standby -v true -l reboot
>>    echo -n "Waiting for shutdown of managed resources"
>> ...
> 
> That will only last until the node comes back (the cluster will remove
> it automatically); the core problem is that it appears not to have been removed.
> Can you file a bug and attach a crm_report for the period covered by
> the restart?

I filed it in Red Hat's Bugzilla:

https://bugzilla.redhat.com/show_bug.cgi?id=918502

since you are also the maintainer of the corresponding RPM.
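
For reference, a minimal sketch of how the transient standby attribute set by
cman_pre_stop could be inspected or cleared by hand. The helper name
standby_attr and the DRY_RUN switch are made up for illustration; the
crm_attribute options are the ones the init script uses, plus -G (query) and
-D (delete). By default it only prints the command it would run, so nothing
in the cluster is touched unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Hypothetical helper, not part of the pacemaker init script.
# Prints (DRY_RUN=1, the default) or runs the crm_attribute call that
# queries or deletes the reboot-lifetime standby attribute of a node.
standby_attr() {
    action=$1   # "query" or "delete"
    node=$2     # node name, e.g. the output of `crm_node --name`
    case "$action" in
        query)  flag=-G ;;   # read the current value
        delete) flag=-D ;;   # remove the attribute
        *) echo "usage: standby_attr query|delete NODE" >&2; return 2 ;;
    esac
    # Same node/name/lifetime arguments as in cman_pre_stop.
    set -- crm_attribute -N "$node" -n standby -l reboot "$flag"
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}
```

For example, "standby_attr query cn2.localdomain" shows the command that checks
whether the attribute is still set after the reboot, and
"DRY_RUN=0 standby_attr delete cn2.localdomain" would actually remove it.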

--
Thanks
Leon