[Pacemaker] problem configuring DRBD resource in "Floating Peers" mode

Raoul Bhatia [IPAX] r.bhatia at ipax.at
Wed May 27 08:33:11 UTC 2009


Димитър Бойн wrote:
> Hi,
> 
> My ultimate goal is to run a bunch of servers/nodes that shall be able
> to handle a bunch of floating drbd peers.

the last time i tried to use floating peers, it did not work out
properly - especially when using outdating/.. mechanisms.

i recall a discussion last year [1] but cannot remember the
conclusion.

i saw your colocation request. did you check/resolve that?

i'll have a look anyways ...


> c001mlb_node01a:root > cat /etc/drbd.conf
> 
> # /usr/share/doc/drbd82/drbd.conf
looks fine.

> c001mlb_node01a:root > cibadmin -Q
*snip*
> 
>       <master id="ms-drbd0">
> 
>         <meta_attributes id="ma-ms-drbd0">
> 
>           <nvpair id="ma-ms-drbd0-1" name="clone_max" value="2"/>
in pacemaker-1.0 and above, all "_" in attribute names have been replaced
by "-", so clone_max is now clone-max. please recheck your configuration
accordingly.

imho, that is why you end up with 26 clone instances of each drbd
device (drbd0:0 to drbd0:25): since clone_max is not recognized,
pacemaker falls back to its default of one instance per node.

moreover, without even looking further down, i would say that this
alone might fix some of your issues.
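for reference, with pacemaker-1.0 naming the meta attributes would look
something like this (ids and values kept from your snippet; untested
sketch, please verify against your setup):

```xml
<meta_attributes id="ma-ms-drbd0">
  <!-- attribute names use "-" instead of "_" in pacemaker-1.0 -->
  <nvpair id="ma-ms-drbd0-1" name="clone-max" value="2"/>
  <nvpair id="ma-ms-drbd0-2" name="clone-node-max" value="1"/>
  <nvpair id="ma-ms-drbd0-3" name="master-max" value="1"/>
  <nvpair id="ma-ms-drbd0-4" name="master-node-max" value="1"/>
  <nvpair id="ma-ms-drbd0-5" name="notify" value="yes"/>
  <nvpair id="ma-ms-drbd0-6" name="globally-unique" value="true"/>
  <nvpair id="ma-ms-drbd0-7" name="target-role" value="started"/>
</meta_attributes>
```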

> 
>           <nvpair id="ma-ms-drbd0-2" name="clone_node_max" value="1"/>
> 
>           <nvpair id="ma-ms-drbd0-3" name="master_max" value="1"/>
> 
>           <nvpair id="ma-ms-drbd0-4" name="master_node_max" value="1"/>
> 
>           <nvpair id="ma-ms-drbd0-5" name="notify" value="yes"/>
> 
>           <nvpair id="ma-ms-drbd0-6" name="globally_unique" value="true"/>
> 
>           <nvpair id="ma-ms-drbd0-7" name="target_role" value="started"/>
> 
>         </meta_attributes>
> 
>         <primitive class="ocf" provider="heartbeat" type="drbd" id="drbd0">
> 
>           <instance_attributes id="ia-drbd0">
> 
>             <nvpair id="ia-drbd0-1" name="drbd_resource" value="drbd0"/>
> 
>             <nvpair id="ia-drbd0-2" name="clone_overrides_hostname"
> value="yes"/>
> 
>           </instance_attributes>
> 

>     <constraints>
> 
>       <rsc_location id="location-ip-c001drbd01a" rsc="ip-c001drbd01a">
> 
>         <rule id="ip-c001drbd01a-rule" score="-INFINITY">
> 
>           <expression id="exp-ip-c001drbd01a-rule" value="b"
> attribute="site" operation="eq"/>
> 
>         </rule>
> 
>       </rsc_location>
> 
>       <rsc_location id="location-ip-c001drbd01b" rsc="ip-c001drbd01b">
> 
>         <rule id="ip-c001drbd01b-rule" score="-INFINITY">
> 
>           <expression id="exp-ip-c001drbd01b-rule" value="a"
> attribute="site" operation="eq"/>
> 
>         </rule>
> 
>       </rsc_location>
> 
>       <rsc_location id="drbd0-master-1" rsc="ms-drbd0">
> 
>         <rule id="drbd0-master-on-c001mlb_node01a" role="master"
> score="100">
> 
>           <expression id="expression-1" attribute="#uname"
> operation="eq" value="c001mlb_node01a"/>
> 
>         </rule>
> 
>       </rsc_location>
> 
>       <rsc_order id="order-drbd0-after-ip-c001drbd01a"
> first="ip-c001drbd01a" then="ms-drbd0" score="1"/>
> 
>       <rsc_order id="order-drbd0-after-ip-c001drbd01b"
> first="ip-c001drbd01b" then="ms-drbd0" score="1"/>
> 
>       <rsc_colocation rsc="ip-c001drbd01a" score="INFINITY"
> id="colocate-drbd0-ip-c001drbd01a" with-rsc="ms-drbd0"/>
> 
>       <rsc_colocation rsc="ip-c001drbd01b" score="INFINITY"
> id="colocate-drbd0-ip-c001drbd01b" with-rsc="ms-drbd0"/>
> 
>     </constraints>
> 
>   </configuration>
*snip*

> ip-c001drbd01a  (ocf::heartbeat:IPaddr2):       Started c001mlb_node01a
> 
> Master/Slave Set: ms-drbd0
> 
>         Stopped: [ drbd0:0 drbd0:1 drbd0:2 drbd0:3 drbd0:4 drbd0:5
> drbd0:6 drbd0:7 drbd0:8 drbd0:9 drbd0:10 drbd0:11 drbd0:12 drbd0:13
> drbd0:14 drbd0:15 drbd0:16 drbd0:17 drbd0:18 drbd0:19 drbd0:20
> drbd0:21 drbd0:22 drbd0:23 drbd0:24 drbd0:25 ]
too many clones, see above.


> May 26 19:14:48 c001mlb_node01a cib: [31521]: info: retrieveCib: Reading
> cluster configuration from: /var/lib/heartbeat/crm/cib.2Ij02L (digest:
> /var/lib/heartbeat/crm/cib.G64hBN)
> 
> May 26 19:14:48 c001mlb_node01a pengine: [25734]: notice: print_list:  
> Stopped: [ drbd0:0 drbd0:1 drbd0:2 drbd0:3 drbd0:4 drbd0:5 drbd0:6
> drbd0:7 drbd0:8 drbd0:9 drbd0:10 drbd0:11 drbd0:12 drbd0:13 drbd0:14
> drbd0:15 drbd0:16 drbd0:17 drbd0:18 drbd0:19 drbd0:20 drbd0:21 drbd0:22
> drbd0:23 drbd0:24 drbd0:25 ]
all resources stopped. seems ok.

*snip*

> May 26 19:14:48 c001mlb_node01a pengine: [25734]: info: master_color:
> Promoting drbd0:0 (Stopped c001mlb_node01a)
> 
> May 26 19:14:48 c001mlb_node01a pengine: [25734]: info: master_color:
> ms-drbd0: Promoted 1 instances of a possible 1 to master
> 
> May 26 19:14:48 c001mlb_node01a pengine: [25734]: info: master_color:
> ms-drbd0: Promoted 1 instances of a possible 1 to master

1 instance has been promoted out of a possible 1 - that is correct
(the message is simply logged twice).

> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info:
> do_state_transition: State transition S_POLICY_ENGINE ->
> S_TRANSITION_ENGINE [ input=I_PE_SUCCESS cause=C_IPC_MESSAGE
> origin=handle_response ]
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: ERROR: do_fsa_action:
> Action A_DC_TIMER_STOP took 812418818s to complete
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: ERROR: do_fsa_action:
> Action A_INTEGRATE_TIMER_STOP took 812418818s to complete
> 
> May 26 19:14:48 c001mlb_node01a pengine: [25734]: WARN:
> process_pe_message: Transition 7: WARNINGs found during PE processing.
> PEngine Input stored in: /var/lib/pengine/pe-warn-771.bz2
>
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: ERROR: do_fsa_action:
> Action A_FINALIZE_TIMER_STOP took 812418818s to complete
> 
> May 26 19:14:48 c001mlb_node01a pengine: [25734]: info:
> process_pe_message: Configuration WARNINGs found during PE processing. 
> Please run "crm_verify -L" to identify issues.

warnings were found - please inspect the referenced pe-warn file and/or
run "crm_verify -L".
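a quick way to dig into those warnings (sketch, assuming the standard
pacemaker-1.0 tools; the pe-warn path is taken from your log above):

```shell
# show configuration warnings against the live cib, verbosely
crm_verify -L -V

# replay the stored policy engine input to see what decisions were made
ptest -x /var/lib/pengine/pe-warn-771.bz2 -VV
```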

> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: unpack_graph:
> Unpacked transition 7: 17 actions in 17 synapses
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: do_te_invoke:
> Processing graph 7 (ref=pe_calc-dc-1243365288-55) derived from
> /var/lib/pengine/pe-warn-771.bz2
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: ERROR: do_fsa_action:
> Action A_TE_INVOKE took 812418817s to complete
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: te_rsc_command:
> Initiating action 5: start ip-c001drbd01a_start_0 on c001mlb_node01a (local)
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: do_lrm_rsc_op:
> Performing key=5:7:0:c58a32ec-ae57-4bc8-8a1e-5d7069c2f2bd
> op=ip-c001drbd01a_start_0 )

drbd seems to be up, starting ip below

> May 26 19:14:48 c001mlb_node01a lrmd: [25732]: info: rsc:ip-c001drbd01a:
> start
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: te_pseudo_action:
> Pseudo action 14 fired and confirmed
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: te_pseudo_action:
> Pseudo action 15 fired and confirmed
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: te_pseudo_action:
> Pseudo action 16 fired and confirmed
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: te_pseudo_action:
> Pseudo action 25 fired and confirmed
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: te_pseudo_action:
> Pseudo action 28 fired and confirmed
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: te_rsc_command:
> Initiating action 141: notify drbd0:0_post_notify_start_0 on
> c001mlb_node01a (local)
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: do_lrm_rsc_op:
> Performing key=141:7:0:c58a32ec-ae57-4bc8-8a1e-5d7069c2f2bd
> op=drbd0:0_notify_0 )
> 
> May 26 19:14:48 c001mlb_node01a lrmd: [25732]: info: rsc:drbd0:0: notify
> 
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: te_rsc_command:
> Initiating action 143: notify drbd0:0_post_notify_promote_0 on
> c001mlb_node01a (local)
> 
> May 26 19:14:48 c001mlb_node01a lrmd: [25732]: info: RA output:
> (ip-c001drbd01a:start:stderr) eth0:0: warning: name may be invalid
>
> May 26 19:14:48 c001mlb_node01a crmd: [25735]: info: do_lrm_rsc_op:
> Performing key=143:7:0:c58a32ec-ae57-4bc8-8a1e-5d7069c2f2bd
> op=drbd0:0_notify_0 )
> 
> May 26 19:14:48 c001mlb_node01a lrmd: [25732]: info: RA output:
> (drbd0:0:notify:stderr) 2009/05/26_19:14:48 INFO: drbd0: Using hostname
> node_0
> 
> May 26 19:14:48 c001mlb_node01a lrmd: [25732]: info: RA output:
> (ip-c001drbd01a:start:stderr) 2009/05/26_19:14:48 INFO: ip -f inet addr
> add 192.168.80.213/32 brd 192.168.80.213 dev eth0 label eth0:0
> 
> May 26 19:14:48 c001mlb_node01a lrmd: [25732]: info: RA output:
> (ip-c001drbd01a:start:stderr) 2009/05/26_19:14:48 INFO: ip link set eth0
> up 2009/05/26_19:14:48 INFO: /usr/lib64/heartbeat/send_arp -i 200 -r 5
> -p /var/run/heartbeat/rsctmp/send_arp/send_arp-192.168.80.213 eth0
> 192.168.80.213 auto not_used not_used

the ip start does not look clean - note the "eth0:0: warning: name may
be invalid" output above.

as far as i can see, this now repeats for the other nodes as pacemaker
tries to recover.

please apply the above suggestions and reply with your new findings.

i usually log to syslog so that i also see the drbd/daemon/... messages,
which helps with debugging.
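for example (sketch, assuming heartbeat as the cluster stack; the
facility and file name here are arbitrary choices):

```
# /etc/ha.d/ha.cf - send heartbeat/pacemaker output to syslog
logfacility local0

# /etc/syslog.conf - collect everything from that facility in one file
local0.*        /var/log/cluster.log
```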

moreover, from now on please attach the files instead of copy/pasting
them into one big email - it is a lot of work to read otherwise.

cheers,
raoul
[1] http://www.gossamer-threads.com/lists/linuxha/dev/50929
[2] http://www.clusterlabs.org/wiki/DRBD_HowTo_1.0
-- 
____________________________________________________________________
DI (FH) Raoul Bhatia M.Sc.          email.          r.bhatia at ipax.at
Technischer Leiter

IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
Barawitzkagasse 10/2/2/11           email.            office at ipax.at
1190 Wien                           tel.               +43 1 3670030
FN 277995t HG Wien                  fax.            +43 1 3670030 15
____________________________________________________________________



