[Pacemaker] Issue with an isolated node overriding CIB after rejoining main cluster

Andrew Beekhof andrew at beekhof.net
Mon Jul 15 19:56:40 EDT 2013


On 16/07/2013, at 2:05 AM, "Howley, Tom" <tom.howley at hp.com> wrote:

> Hi Andrew,
> 
> Thanks for the reply. I have a couple more questions below. I seem to have two main problems: the isolated node updating the CIB, and corosync's behaviour in response to ifdown.
> 
>> Why isn't your normal fencing device working?
> My normal fencing is working and was in place for nearly all of my testing. I just tried the "suicide" option to see if it would prevent the isolated node from carrying out any CIB updates.
> 
> 
>> epoch is bumped after an election and a configuration change but NOT a status change. 
>> so it shouldn't be making it to 102
> My log below shows that the cib-bootstrap-options property is being updated. Is this not a configuration change?

Yes, but who changed it?
I wouldn't expect that to happen automatically.
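
For anyone following along, the reason the isolated node's CIB wins on rejoin is the version comparison: Pacemaker orders CIB copies lexicographically by (admin_epoch, epoch, num_updates). A rough Python sketch (illustrative only, not Pacemaker's actual code; the quorate partition's 0.77.x version is a hypothetical value):

```python
# Illustrative sketch of CIB version precedence (NOT Pacemaker source).
# Pacemaker compares admin_epoch, then epoch, then num_updates; when
# partitions merge, the copy with the higher tuple is kept.

def cib_version(admin_epoch, epoch, num_updates):
    """Return the comparable version tuple for a CIB copy."""
    return (admin_epoch, epoch, num_updates)

# Alice (isolated): her election and expected-votes changes bumped
# the epoch twice, matching the 0.78.1 seen in the log below.
alice = cib_version(0, 78, 1)
# Bob/Jim (quorate partition): one config change, the fencing
# constraint -- hypothetical 0.77.x.
quorate = cib_version(0, 77, 5)

# On rejoin the DC keeps the numerically newer CIB: Alice's wins,
# silently discarding the fencing rule added while she was down.
winner = "alice" if alice > quorate else "quorate"
print(winner)  # -> alice
```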

> 
> 
> >>> 1.       My initial feeling was that the isolated node, Alice (which has no quorum), should not be updating a CIB that could potentially override the sane part of the cluster. Is that a fair comment?
> 
>> Not as currently designed.  Although there may be some improvements we can make in that area.
> Would you consider this a bug, or is there a case where this behaviour is desired?

It's probably a bug in the sense that we can do better.
The fix will have to wait for 1.1.11 though. It's a simple change, but it needs a lot of testing to make sure any side-effects are accounted for.
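
One way to picture the kind of improvement being discussed (purely speculative, not the actual 1.1.11 change) is to break the comparison in favour of the partition that held quorum, so a quorumless node's epoch inflation cannot win:

```python
# Speculative sketch of the discussed improvement (NOT the real fix):
# when merging partitions, a CIB written with quorum outranks a
# numerically newer CIB from an isolated, quorumless node.

def choose_cib(a, b):
    """Pick the winning CIB copy.

    Each copy is a dict with a 'version' tuple (admin_epoch, epoch,
    num_updates) and a 'had_quorum' flag recorded at write time.
    """
    # Quorum first: a quorumless node's edits never override the
    # sane side of the cluster, regardless of epoch.
    if a["had_quorum"] != b["had_quorum"]:
        return a if a["had_quorum"] else b
    # Otherwise fall back to the usual version comparison.
    return a if a["version"] >= b["version"] else b

alice = {"version": (0, 78, 1), "had_quorum": False}
bob = {"version": (0, 77, 5), "had_quorum": True}
print(choose_cib(alice, bob) is bob)  # -> True
```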

> 
> 
> In the meantime, I ran a script over the weekend that brings down the network on the current DRBD master, randomly using one of two options: ifdown ethX, or adding an iptables rule to block all incoming and outgoing packets. All of the roughly 350 blocked-ports scenarios recovered successfully (i.e. no split-brain), whereas 130 out of 350 ifdown scenarios resulted in split-brain (the script automatically repaired split-brain between test iterations). (Note that in order to aggravate the problem, these tests use stonith with an artificial delay before reset, while ensuring that the crm-fence-peer timeout is still greater than this delay -- I also intend to redo the tests under normal conditions.)
> 
> Is this a known/expected issue, which effectively means I shouldn't test using "ifdown ethX"?

The general consensus over the years is that ifdown is not considered a valid test - even at the corosync level without pacemaker involved.
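
For a harness that avoids ifdown, a packet-filter isolation (as in Tom's iptables variant) is the usual substitute, since the interface and the address corosync binds to stay up. A hypothetical harness step might generate rules like this (command strings only; the interface name is a placeholder):

```python
# Hypothetical helper for a split-brain test harness: isolate a node
# with iptables instead of ifdown, so the interface (and its address,
# which corosync binds to) stays up.  Generates command strings only;
# a real harness would run them via ssh on the target node.

def isolation_commands(iface="eth0"):
    """Rules that drop all traffic on iface without downing it."""
    return [
        f"iptables -I INPUT -i {iface} -j DROP",
        f"iptables -I OUTPUT -o {iface} -j DROP",
    ]

def restore_commands(iface="eth0"):
    """Undo the isolation rules after the iteration."""
    return [
        f"iptables -D INPUT -i {iface} -j DROP",
        f"iptables -D OUTPUT -o {iface} -j DROP",
    ]

for cmd in isolation_commands("eth0"):
    print(cmd)
# -> iptables -I INPUT -i eth0 -j DROP
# -> iptables -I OUTPUT -o eth0 -j DROP
```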

> If so, is there some configuration I can apply to change the behaviour in response to ifdown? My major fear is that some network failure could trigger the code path that leads to the isolated node updating the CIB, etc.
> 
> 
> Thanks again,
> 
> Tom
> 
> -----Original Message-----
> From: Andrew Beekhof [mailto:andrew at beekhof.net] 
> Sent: 15 July 2013 01:52
> To: The Pacemaker cluster resource manager
> Subject: Re: [Pacemaker] Issue with an isolated node overriding CIB after rejoining main cluster
> 
> 
> On 12/07/2013, at 10:49 PM, "Howley, Tom" <tom.howley at hp.com> wrote:
> 
>> Hi,
>> 
>> pacemaker:1.1.6-2ubuntu3,
> 
> ouch
> 
>> corosync:1.4.2-2, drbd8-utils 2:8.3.11-0ubuntu1
>> 
>> I have a three node setup, with two nodes running DRBD, resource-level fencing enabled ('resource-and-stonith') and, obviously, stonith configured for each node. In my current test case, I bring down the network interface on the DRBD primary/master node (using ifdown eth0, for example), which sometimes leads to split-brain when the isolated node rejoins the cluster. The serious problem is that upon rejoining, the isolated node is promoted to DRBD primary (despite the original fencing constraint), which opens us up to data loss for updates that occurred while that node was down.
>> 
>> The exact problem scenario is as follows:
>> -          Alice: DRBD Primary/Master, Bob: Secondary/Slave, Jim: Quorum node, Epoch=100
>> -          ifdown eth0 on Alice
>> -          Alice detects loss of the network interface, sets itself up as DC, and carries out some CIB updates (see log snippet below) that raise the epoch, say to Epoch=102
> 
> epoch is bumped after an election and a configuration change but NOT a status change.
> so it shouldn't be making it to 102
> 
>> -          Alice is shot via stonith.
>> -          Bob adds fencing rule to CIB to prevent promotion of DRBD on any other node, Epoch=101
>> -          When Alice comes back and rejoins the cluster, the DC decides to sync to Alice CIB, thereby removing the fencing rule prematurely (i.e. before the drbd devices have resynched).
>> -          In some cases: Alice is promoted to Primary/Master and fences resource to prevent promotion on any other node.
>> -          We now have split-brain and potential loss of data.
>> 
>> So some questions on the above:
>> 1.       My initial feeling was that the isolated node, Alice (which has no quorum), should not be updating a CIB that could potentially override the sane part of the cluster. Is that a fair comment?
> 
> Not as currently designed.  Although there may be some improvements we can make in that area.
> 
>> 2.       Is this issue just particular to my use of 'ifdown ethX' to disable the network? This is hinted at here: https://github.com/corosync/corosync/wiki/Corosync-and-ifdown-on-active-network-interface Has this issue been addressed, or will it be in the future?
>> 3.       If 'ifdown ethX' is not valid, what is the best alternative that mimics what might happen in the real world? I have tried blocking connections using iptables rules, dropping all incoming and outgoing packets; initial testing appears to show different corosync behaviour that would hopefully not lead to my problem scenario, but I'm still in the process of confirming. I have also carried out some cable pulls and not run into issues yet, but this problem can be intermittent, so it really needs an automated way to test many times.
>> 4.       The log snippet below from the isolated node shows that it updates the CIB twice sometime after detecting loss of network interface. Why does this happen? I believe that ultimately it is these CIB updates that increment the epoch, which leads to this CIB overriding the cluster later.
>> 
>> I have also tried a no-quorum-policy of 'suicide' in an attempt to prevent CIB updates by Alice, but it didn't make a difference.
> 
> Why isn't your normal fencing device working?
> 
>> Note that to facilitate log collection and analysis, I have added a delay to the stonith reset operation, but I have also set the timeout on the crm-fence-peer script to ensure that it is greater than this 'deadtime'.
>> 
>> Any advice on this would be greatly appreciated.
>> 
>> Thanks,
>> 
>> Tom
>> 
>> Log snippet showing isolated node updating the CIB, which results in epoch being incremented two times:
>> 
>> Jul 10 13:42:54 stratus18 corosync[1268]:   [TOTEM ] A processor failed, forming new configuration.
>> Jul 10 13:42:54 stratus18 corosync[1268]:   [TOTEM ] The network interface is down.
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20758]: TOMTEST-DEBUG: modified version
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20758]: invoked for tomtest
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20761]: TOMTEST-DEBUG: modified version
>> Jul 10 13:42:54 stratus18 crm-fence-peer.sh[20761]: invoked for tomtest
>> Jul 10 13:42:55 stratus18 stonith-ng: [1276]: info: stonith_command: Processed st_execute from lrmd: rc=-1
>> Jul 10 13:42:55 stratus18 external/ipmi[20806]: [20816]: ERROR: error executing ipmitool: Connect failed: Network is unreachable#015 Unable to get Chassis Power Status#015
>> Jul 10 13:42:55 stratus18 crm-fence-peer.sh[20758]: Call cib_query failed (-41): Remote node did not respond
>> Jul 10 13:42:55 stratus18 crm-fence-peer.sh[20761]: Call cib_query failed (-41): Remote node did not respond
>> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #7 eth0, 192.168.185.150#123, interface stats: received=0, sent=0, dropped=0, active_time=912 secs
>> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #4 eth0, fe80::7ae7:d1ff:fe22:5270#123, interface stats: received=0, sent=0, dropped=0, active_time=6080 secs
>> Jul 10 13:42:55 stratus18 ntpd[1062]: Deleting interface #3 eth0, 192.168.185.118#123, interface stats: received=52, sent=53, dropped=0, active_time=6080 secs
>> Jul 10 13:42:55 stratus18 ntpd[1062]: 192.168.8.97 interface 192.168.185.118 -> (none)
>> Jul 10 13:42:55 stratus18 ntpd[1062]: peers refreshed
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] notice: pcmk_peer_update: Transitional membership event on ring 2728: memb=1, new=0, lost=2
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: pcmk_peer_update: memb: .unknown. 16777343
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: pcmk_peer_update: lost: stratus18 1991878848
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: pcmk_peer_update: lost: stratus20 2025433280
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] notice: pcmk_peer_update: Stable membership event on ring 2728: memb=1, new=0, lost=0
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: Creating entry for node 16777343 born on 2728
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: Node 16777343/unknown is now: member
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: pcmk_peer_update: MEMB: .pending. 16777343
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] ERROR: pcmk_peer_update: Something strange happened: 1
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node stratus17 was not seen in the previous transition
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: Node 1975101632/stratus17 is now: lost
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node stratus18 was not seen in the previous transition
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: Node 1991878848/stratus18 is now: lost
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: ais_mark_unseen_peer_dead: Node stratus20 was not seen in the previous transition
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: Node 2025433280/stratus20 is now: lost
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] WARN: pcmk_update_nodeid: Detected local node id change: 1991878848 -> 16777343
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: destroy_ais_node: Destroying entry for node 1991878848
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] notice: ais_remove_peer: Removed dead peer 1991878848 from the membership list
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: ais_remove_peer: Sending removal of 1991878848 to 2 children
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: 0x13d9520 Node 16777343 now known as stratus18 (was: (null))
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: Node stratus18 now has 1 quorum votes (was 0)
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: Node stratus18 now has process list: 00000000000000000000000000111312 (1118994)
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: send_member_notification: Sending membership update 2728 to 2 children
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_member: 0x13d9520 Node 16777343 ((null)) born on: 2708
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now has id: 16777343
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: ais_dispatch_message: Membership 2728: quorum retained
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: ais_dispatch_message: Removing peer 1991878848/1991878848
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: reap_crm_member: Peer 1991878848 is unknown
>> Jul 10 13:42:55 stratus18 cib: [1277]: notice: ais_dispatch_message: Membership 2728: quorum lost
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_update_peer: Node stratus17: id=1975101632 state=lost (new) addr=r(0) ip(192.168.185.117)  votes=1 born=2724 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_update_peer: Node stratus20: id=2025433280 state=lost (new) addr=r(0) ip(192.168.185.120)  votes=1 born=4 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now has id: 1991878848
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [CPG   ] chosen downlist: sender r(0) ip(127.0.0.1) ; members(old:3 left:3)
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [MAIN  ] Completed service synchronization, ready to provide service.
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_get_peer: Node stratus18 now has id: 16777343
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Membership 2728: quorum retained
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Removing peer 1991878848/1991878848
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: reap_crm_member: Peer 1991878848 is unknown
>> Jul 10 13:42:55 stratus18 crmd: [1281]: notice: ais_dispatch_message: Membership 2728: quorum lost
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_status_callback: status: stratus17 is now lost (was member)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_peer: Node stratus17: id=1975101632 state=lost (new) addr=r(0) ip(192.168.185.117)  votes=1 born=2724 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_status_callback: status: stratus20 is now lost (was member)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_peer: Node stratus20: id=2025433280 state=lost (new) addr=r(0) ip(192.168.185.120)  votes=1 born=4 seen=2724 proc=00000000000000000000000000111312
>> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: check_dead_member: Our DC node (stratus20) left the cluster
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_NOT_DC -> S_ELECTION [ input=I_ELECTION cause=C_FSA_INTERNAL origin=check_dead_member ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: update_dc: Unset DC stratus20
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_ELECTION -> S_INTEGRATION [ input=I_ELECTION_DC cause=C_FSA_INTERNAL origin=do_election_check ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_te_control: Registering TE UUID: 6e335eff-5e48-4fc1-9003-0537ae948dfd
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: set_graph_functions: Setting custom graph functions
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: unpack_graph: Unpacked transition -1: 0 actions in 0 synapses
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_takeover: Taking over DC status for this partition
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_readwrite: We are now in R/W mode
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_master for section 'all' (origin=local/crmd/57, version=0.76.46): ok (rc=0)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/58, version=0.76.47): ok (rc=0)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: crm_get_peer: Node stratus18 now has id: 16777343
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/60, version=0.76.48): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: join_make_offer: Making join offers based on membership 2728
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_offer_all: join-1: Waiting on 1 outstanding join acks
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Membership 2728: quorum still lost
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/62, version=0.76.49): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crmd_ais_dispatch: Setting expected votes to 2
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: update_dc: Set DC to stratus18 (3.0.5)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Shutdown escalation occurs after: 1200000ms
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Checking for expired actions every 900000ms
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: config_query_callback: Sending expected-votes=3 to corosync
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: ais_dispatch_message: Membership 2728: quorum still lost
>> Jul 10 13:42:55 stratus18 corosync[1268]:   [pcmk  ] info: update_expected_votes: Expected quorum votes 2 -> 3
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cib admin_epoch="0" epoch="76" num_updates="49" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -   <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -     <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -       <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -         <nvpair value="3" id="cib-bootstrap-options-expected-quorum-votes" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -       </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -     </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -   </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cib admin_epoch="0" cib-last-written="Wed Jul 10 13:25:58 2013" crm_feature_set="3.0.5" epoch="77" have-quorum="1" num_updates="1" update-client="crmd" update-origin="stratus17" validate-with="pacemaker-1.2" dc-uuid="stratus20" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +   <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +     <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +       <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +         <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="2" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +       </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +     </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +   </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/65, version=0.77.1): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crmd_ais_dispatch: Setting expected votes to 3
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: All 1 cluster nodes responded to the join offer.
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_finalize: join-1: Syncing the CIB from stratus18 to the rest of the cluster
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - <cib admin_epoch="0" epoch="77" num_updates="1" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -   <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -     <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -       <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -         <nvpair value="2" id="cib-bootstrap-options-expected-quorum-votes" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -       </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -     </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: -   </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: - </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + <cib admin_epoch="0" cib-last-written="Wed Jul 10 13:42:55 2013" crm_feature_set="3.0.5" epoch="78" have-quorum="1" num_updates="1" update-client="crmd" update-origin="stratus18" validate-with="pacemaker-1.2" dc-uuid="stratus20" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +   <configuration >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +     <crm_config >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +       <cluster_property_set id="cib-bootstrap-options" >
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +         <nvpair id="cib-bootstrap-options-expected-quorum-votes" name="expected-quorum-votes" value="3" />
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +       </cluster_property_set>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +     </crm_config>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: +   </configuration>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib:diff: + </cib>
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section crm_config (origin=local/crmd/68, version=0.78.1): ok (rc=0)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_sync for section 'all' (origin=local/crmd/69, version=0.78.1): ok (rc=0)
>> Jul 10 13:42:55 stratus18 lrmd: [1278]: info: stonith_api_device_metadata: looking up external/ipmi/heartbeat metadata
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/70, version=0.78.2): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_ack: join-1: Updating node state to member for stratus18
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_delete for section //node_state[@uname='stratus18']/lrm (origin=local/crmd/71, version=0.78.3): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: erase_xpath_callback: Deletion of "//node_state[@uname='stratus18']/lrm": ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_state_transition: All 1 cluster nodes are eligible to run resources.
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_dc_join_final: Ensuring DC, quorum and node attributes are up-to-date
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: crm_update_quorum: Updating quorum status to false (call=75)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: do_te_invoke:167 - Triggered transition abort (complete=1) : Peer Cancelled
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 76: Requesting the current CIB: S_POLICY_ENGINE
>> Jul 10 13:42:55 stratus18 attrd: [1279]: notice: attrd_local_callback: Sending full refresh (origin=crmd)
>> Jul 10 13:42:55 stratus18 attrd: [1279]: notice: attrd_trigger_update: Sending flush op to all hosts for: probe_complete (true)
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section nodes (origin=local/crmd/73, version=0.78.5): ok (rc=0)
>> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: match_down_event: No match for shutdown action on stratus17
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: te_update_diff: Stonith/shutdown of stratus17 not matched
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, id=stratus17, magic=NA, cib=0.78.6) : Node failure
>> Jul 10 13:42:55 stratus18 crmd: [1281]: WARN: match_down_event: No match for shutdown action on stratus20
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: te_update_diff: Stonith/shutdown of stratus20 not matched
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: abort_transition_graph: te_update_diff:215 - Triggered transition abort (complete=1, tag=node_state, id=stratus20, magic=NA, cib=0.78.6) : Node failure
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 77: Requesting the current CIB: S_POLICY_ENGINE
>> Jul 10 13:42:55 stratus18 crmd: [1281]: info: do_pe_invoke: Query 78: Requesting the current CIB: S_POLICY_ENGINE
>> Jul 10 13:42:55 stratus18 cib: [1277]: info: cib_process_request: Operation complete: op cib_modify for section cib (origin=local/crmd/75, version=0.78.7): ok (rc=0)
>> Jul 10 13:42:56 stratus18 crmd: [1281]: info: do_pe_invoke_callback: Invoking the PE: query=78, ref=pe_calc-dc-1373460176-49, seq=2728, quorate=0
>> Jul 10 13:42:56 stratus18 attrd: [1279]: notice: attrd_trigger_update: Sending flush op to all hosts for: master-drbd_tomtest:0 (10000)
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: cluster_status: We do not have quorum - fencing and resource management disabled
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: pe_fence_node: Node stratus17 will be fenced because it is un-expectedly down
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: determine_online_status: Node stratus17 is unclean
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: pe_fence_node: Node stratus20 will be fenced because it is un-expectedly down
>> Jul 10 13:42:56 stratus18 pengine: [1280]: WARN: determine_online_status: Node stratus20 is unclean
>> Jul 10 13:42:56 stratus18 pengine: [1280]: notice: unpack_rsc_op: Hard error - drbd_tomtest:0_last_failure_0 failed with rc=5: Preventing ms_drbd_tomtest from re-starting on stratus20
>> Jul 10 13:42:56 stratus18 pengine: [1280]: notice: unpack_rsc_op: Hard error - tomtest_mysql_SERVICE_last_failure_0 failed with rc=5: Preventing tomtest_mysql_SERVICE from re-starting on stratus20
>> 
>> _______________________________________________
>> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
>> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>> 
>> Project Home: http://www.clusterlabs.org
>> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
>> Bugs: http://bugs.clusterlabs.org
> 
> 
