[Pacemaker] Problems shutting down server/services

Andreas Kurz andreas.kurz at linbit.com
Thu Jun 17 02:45:23 EDT 2010


Hello,

On 2010-06-16 20:09, Erich Weiler wrote:
> Hi All,
>
> I'm having a weird issue that I was hoping someone could shed some light
> on...
>
> I've got Pacemaker 1.0.8 and Corosync 1.2.2 on a two node cluster. I'm
> managing some VMs via VirtualDomain, with DRBD under that. It mostly
> works, but I'm having a real hard time shutting down services, or even
> shutting down one of the nodes. As I understand it, if I issue the
> shutdown command to one node, pacemaker should try and stop services on
> that node and move them to the other node (if configured), then power down.
>
> Normally crm_mon shows this:
>
> ============
> Last updated: Wed Jun 16 10:23:14 2010
> Stack: openais
> Current DC: hgvmcore2 - partition with quorum
> Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ hgvmcore1 hgvmcore2 ]
>
> Master/Slave Set: ms-drbd-caweb
> Masters: [ hgvmcore1 ]
> Slaves: [ hgvmcore2 ]
> caweb-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore1
> Master/Slave Set: ms-drbd-genome-nt
> Masters: [ hgvmcore2 ]
> Slaves: [ hgvmcore1 ]
> genome-nt-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore2
>
> Looks good. When I try to reboot hgvmcore1, it does not power off, and I
> see this in the logs:
>
> Jun 16 10:30:16 hgvmcore1 shutdown[20339]: shutting down for system reboot
> Jun 16 10:30:16 hgvmcore1 init: Switching to runlevel: 6
> Jun 16 10:30:17 hgvmcore1 crm_attribute: [20398]: info: Invoked:
> crm_attribute -N hgvmcore1 -n master-drbd-genome-nt:1 -l reboot -v 100
> Jun 16 10:30:17 hgvmcore1 smartd[16139]: smartd received signal 15:
> Terminated
> Jun 16 10:30:17 hgvmcore1 smartd[16139]: smartd is exiting (exit status 0)
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Got SIGTERM, quitting.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface vnet1.IPv6 with address fe80::7817:6eff:fe1c:506f.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface vnet0.IPv6 with address fe80::d465:2ff:fefa:73ff.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface virbr0.IPv6 with address fe80::200:ff:fe00:0.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface virbr0.IPv4 with address 192.168.122.1.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface br3.IPv6 with address fe80::225:64ff:fefc:1e3b.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface br3.IPv4 with address 128.114.48.135.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface br2.IPv6 with address fe80::225:64ff:fefc:1e39.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface br2.IPv4 with address 10.1.1.89.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface eth3.IPv6 with address fe80::225:64ff:fefc:1e3b.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface eth2.IPv6 with address fe80::225:64ff:fefc:1e39.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface eth1.IPv6 with address fe80::210:18ff:fe67:db0a.
> Jun 16 10:30:18 hgvmcore1 avahi-daemon[16031]: Leaving mDNS multicast
> group on interface eth1.IPv4 with address 172.16.0.1.
> Jun 16 10:30:18 hgvmcore1 libvirtd: 10:30:18.192: warning : Shutting
> down on signal 15

This is a problem ... without libvirtd running, all management attempts 
via "virsh" will fail (including those made by the VirtualDomain RA) ... 
why is libvirtd stopping before corosync?
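
If you are on a SysV-style init, a quick way to check is to compare the 
kill-script ordering for the reboot runlevel, for example (the path and 
script names below are assumptions, adjust them to your distribution):

  # assumes a Red Hat style SysV init layout with rc6.d kill scripts
  ls /etc/rc6.d/ | grep -Ei 'corosync|libvirtd'

corosync's K number should be lower than libvirtd's, so that corosync 
(and with it Pacemaker) is stopped first and the VirtualDomain stop 
operations can still reach the hypervisor.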

There seems to be a problem with the shutdown sequence of your system 
services: because libvirtd is already gone, the stop of caweb-vd fails, 
its fail-count goes to INFINITY, the resource is left unmanaged/FAILED, 
and the crmd (and with it corosync) cannot complete its shutdown, which 
is why the node takes so long to go down. Of course you can also 
include libvirtd in your CRM config and disable the init script, for 
example:
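
A minimal sketch of that approach (the lsb script name "libvirtd" and 
the constraint names are assumptions, adapt them to your setup):

  # hypothetical: let the cluster manage libvirtd on both nodes
  primitive libvirtd lsb:libvirtd \
      op monitor interval="30s"
  clone cl-libvirtd libvirtd
  # start libvirtd before each VM; on stop the order is reversed, so
  # libvirtd only goes down after the VirtualDomain resources are stopped
  order caweb-after-libvirtd inf: cl-libvirtd caweb-vd
  order genome-nt-after-libvirtd inf: cl-libvirtd genome-nt-vd

Then disable the init script (e.g. "chkconfig libvirtd off") so it is 
no longer stopped behind the cluster's back during system shutdown.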

Regards,
Andreas

> Jun 16 10:30:18 hgvmcore1 corosync[15541]: [SERV ] Unloading all
> Corosync service engines.
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: crm_signal_dispatch:
> Invoking handler for signal 15: Terminated
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: crm_shutdown: Requesting
> shutdown
> Jun 16 10:30:18 hgvmcore1 corosync[15541]: [pcmk ] notice:
> pcmk_shutdown: Shuting down Pacemaker
> Jun 16 10:30:18 hgvmcore1 corosync[15541]: [pcmk ] notice: stop_child:
> Sent -15 to crmd: [15563]
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_shutdown_req: Sending
> shutdown request to DC: hgvmcore2
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_ais_dispatch:
> Update relayed from hgvmcore2
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
> Sending flush op to all hosts for: shutdown (1276709418)
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
> Sent update 48: shutdown=1276709418
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: cancel_op: operation
> monitor[21] on ocf::VirtualDomain::caweb-vd for client 15563, its
> parameters: CRM_meta_depth=[0] CRM_meta_interval=[10000] depth=[0]
> hypervisor=[qemu:///system] CRM_meta_timeout=[30000]
> crm_feature_set=[3.0.1] CRM_meta_name=[monitor]
> config=[/etc/libvirt/qemu/caweb.xml] cancelled
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
> key=39:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29 op=caweb-vd_stop_0 )
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:caweb-vd:22: stop
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
> key=88:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29
> op=drbd-caweb:0_notify_0 )
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:drbd-caweb:0:23: notify
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
> key=96:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29
> op=drbd-genome-nt:1_notify_0 )
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:drbd-genome-nt:1:24:
> notify
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
> operation caweb-vd_monitor_10000 (call=21, status=1, cib-update=0,
> confirmed=true) Cancelled
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
> (caweb-vd:stop:stderr) error: unable to connect to
> '/var/run/libvirt/libvirt-sock': Connection refused error: failed to
> connect to the hypervisor
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
> (caweb-vd:stop:stderr) error:
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
> (caweb-vd:stop:stderr) unable to connect to
> '/var/run/libvirt/libvirt-sock': Connection refused error: failed to
> connect to the hypervisor
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
> operation drbd-caweb:0_notify_0 (call=23, rc=0, cib-update=32,
> confirmed=true) ok
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
> operation drbd-genome-nt:1_notify_0 (call=24, rc=0, cib-update=33,
> confirmed=true) ok
> Jun 16 10:30:18 hgvmcore1 VirtualDomain[20485]: ERROR: Virtual domain
> caweb has no state during stop operation, bailing out.
> Jun 16 10:30:18 hgvmcore1 VirtualDomain[20485]: INFO: Issuing forced
> shutdown (destroy) request for domain caweb.
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
> (caweb-vd:stop:stderr) error: unable to connect to
> '/var/run/libvirt/libvirt-sock': Connection refused error:
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
> (caweb-vd:stop:stderr) failed to connect to the hypervisor
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
> operation caweb-vd_stop_0 (call=22, rc=1, cib-update=34, confirmed=true)
> unknown error
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: cancel_op: operation
> monitor[11] on ocf::drbd::drbd-genome-nt:1 for client 15563, its
> parameters: CRM_meta_interval=[15000] CRM_meta_notify_stop_resource=[ ]
> CRM_meta_notify_active_resource=[ ] CRM_meta_notify_slave_resource=[ ]
> CRM_meta_notify_start_resource=[drbd-genome-nt:1 ]
> CRM_meta_notify_active_uname=[ ] CRM_meta_notify_promote_resource=[ ]
> CRM_meta_notify_stop_uname=[ ] drbd_resource=[genome-nt]
> CRM_meta_notify_master_uname=[hgvmcore2 ] CRM_meta_master_max=[1]
> CRM_meta_notify_demote_uname=[ ] CRM_meta_notify_ cancelled
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
> key=46:448:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29
> op=drbd-genome-nt:1_stop_0 )
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: rsc:drbd-genome-nt:1:25:
> stop
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
> operation drbd-genome-nt:1_monitor_15000 (call=11, status=1,
> cib-update=0, confirmed=true) Cancelled
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_ais_dispatch:
> Update relayed from hgvmcore2
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
> Sending flush op to all hosts for: fail-count-caweb-vd (INFINITY)
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: peer( Primary -> Unknown
> ) conn( Connected -> Disconnecting ) pdsk( UpToDate -> DUnknown )
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: meta connection shut down
> by peer.
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: asender terminated
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Terminating asender thread
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Connection closed
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: conn( Disconnecting ->
> StandAlone )
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: receiver terminated
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Terminating receiver thread
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: disk( UpToDate -> Diskless )
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: drbd_bm_resize called
> with capacity == 0
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: worker terminated
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: Terminating worker thread
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
> (drbd-genome-nt:1:stop:stdout)
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: State change failed: Disk
> state is lower than outdated
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: state = { cs:StandAlone
> ro:Secondary/Unknown ds:Diskless/DUnknown r--- }
> Jun 16 10:30:18 hgvmcore1 kernel: block drbd1: wanted = { cs:StandAlone
> ro:Secondary/Unknown ds:Outdated/DUnknown r--- }
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
> Sent update 51: fail-count-caweb-vd=INFINITY
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_ais_dispatch:
> Update relayed from hgvmcore2
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
> Sending flush op to all hosts for: last-failure-caweb-vd (1276709418)
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
> Sent update 54: last-failure-caweb-vd=1276709418
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
> (drbd-genome-nt:1:stop:stdout)
> Jun 16 10:30:18 hgvmcore1 crm_attribute: [20593]: info: Invoked:
> crm_attribute -N hgvmcore1 -n master-drbd-genome-nt:1 -l reboot -v
> -INFINITY
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
> Sending flush op to all hosts for: master-drbd-genome-nt:1 (-INFINITY)
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
> Sent update 56: master-drbd-genome-nt:1=-INFINITY
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
> (drbd-genome-nt:1:stop:stdout)
> Jun 16 10:30:18 hgvmcore1 crm_attribute: [20597]: info: Invoked:
> crm_attribute -N hgvmcore1 -n master-drbd-genome-nt:1 -l reboot -D
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_trigger_update:
> Sending flush op to all hosts for: master-drbd-genome-nt:1 (<null>)
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
> Sent delete 58: node=hgvmcore1, attr=master-drbd-genome-nt:1, id=<n/a>,
> set=(null), section=status
> Jun 16 10:30:18 hgvmcore1 lrmd: [15560]: info: RA output:
> (drbd-genome-nt:1:stop:stdout)
> Jun 16 10:30:18 hgvmcore1 attrd: [15561]: info: attrd_perform_update:
> Sent delete 60: node=hgvmcore1, attr=master-drbd-genome-nt:1, id=<n/a>,
> set=(null), section=status
> Jun 16 10:30:18 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
> operation drbd-genome-nt:1_stop_0 (call=25, rc=0, cib-update=35,
> confirmed=true) ok
> Jun 16 10:30:19 hgvmcore1 crmd: [15563]: info: do_lrm_rsc_op: Performing
> key=85:449:0:13ca092e-78e7-4345-8728-7f8e8c2ddb29
> op=drbd-caweb:0_notify_0 )
> Jun 16 10:30:19 hgvmcore1 lrmd: [15560]: info: rsc:drbd-caweb:0:26: notify
> Jun 16 10:30:19 hgvmcore1 crmd: [15563]: info: process_lrm_event: LRM
> operation drbd-caweb:0_notify_0 (call=26, rc=0, cib-update=36,
> confirmed=true) ok
> Jun 16 10:30:48 hgvmcore1 corosync[15541]: [pcmk ] notice:
> pcmk_shutdown: Still waiting for crmd (pid=15563, seq=6) to terminate...
>
> and crm_mon shows this:
>
> ============
> Last updated: Wed Jun 16 10:30:18 2010
> Stack: openais
> Current DC: hgvmcore2 - partition with quorum
> Version: 1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7
> 2 Nodes configured, 2 expected votes
> 4 Resources configured.
> ============
>
> Online: [ hgvmcore1 hgvmcore2 ]
>
> Master/Slave Set: ms-drbd-caweb
> Masters: [ hgvmcore1 ]
> Slaves: [ hgvmcore2 ]
> caweb-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore1 (unmanaged)
> FAILED
> Master/Slave Set: ms-drbd-genome-nt
> Masters: [ hgvmcore2 ]
> Stopped: [ drbd-genome-nt:1 ]
> genome-nt-vd (ocf::heartbeat:VirtualDomain): Started hgvmcore2
>
> Failed actions:
> caweb-vd_stop_0 (node=hgvmcore1, call=22, rc=1, status=complete):
> unknown error
>
> So it is having trouble shutting down...? After like 15-20 minutes the
> computer actually does shut down/reboot, but the VMs shut down quickly,
> not sure about DRBD. I don't know if it would help, but here is my crm
> config:
>
> node hgvmcore1
> node hgvmcore2
> primitive caweb-vd ocf:heartbeat:VirtualDomain \
> params config="/etc/libvirt/qemu/caweb.xml" hypervisor="qemu:///system" \
> meta allow-migrate="false" is-managed="true" target-role="Started" \
> op start interval="0" timeout="120s" \
> op stop interval="0" timeout="120s" \
> op monitor interval="10" timeout="30" depth="0"
> primitive drbd-caweb ocf:linbit:drbd \
> params drbd_resource="caweb" \
> op monitor interval="15s" \
> op start interval="0" timeout="240s" \
> op stop interval="0" timeout="100s"
> primitive drbd-genome-nt ocf:linbit:drbd \
> params drbd_resource="genome-nt" \
> op monitor interval="15s" \
> op start interval="0" timeout="240s" \
> op stop interval="0" timeout="100s"
> primitive genome-nt-vd ocf:heartbeat:VirtualDomain \
> params config="/etc/libvirt/qemu/genome-nt.xml"
> hypervisor="qemu:///system" \
> meta allow-migrate="false" is-managed="true" target-role="Started" \
> op start interval="0" timeout="120s" \
> op stop interval="0" timeout="120s" \
> op monitor interval="10" timeout="30" depth="0"
> ms ms-drbd-caweb drbd-caweb \
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
> notify="true" target-role="Started" is-managed="true"
> ms ms-drbd-genome-nt drbd-genome-nt \
> meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1"
> notify="true" target-role="Started"
> location caweb-prefers-hgvmcore1 caweb-vd 50: hgvmcore1
> location genome-nt-prefers-hgvmcore2 genome-nt-vd 50: hgvmcore2
> colocation caweb-vd-on-drbd inf: caweb-vd ms-drbd-caweb:Master
> colocation genome-nt-vd-on-drbd inf: genome-nt-vd ms-drbd-genome-nt:Master
> order caweb-after-drbd inf: ms-drbd-caweb:promote caweb-vd:start
> order genome-nt-after-drbd inf: ms-drbd-genome-nt:promote
> genome-nt-vd:start
> property $id="cib-bootstrap-options" \
> dc-version="1.0.8-9881a7350d6182bae9e8e557cf20a3cc5dac3ee7" \
> cluster-infrastructure="openais" \
> expected-quorum-votes="2" \
> stonith-enabled="false" \
> no-quorum-policy="ignore" \
> last-lrm-refresh="1276708976"
> rsc_defaults $id="rsc-options" \
> resource-stickiness="100"
>
> And here is the corosync config:
>
> # cat /etc/corosync/corosync.conf
> # Please read the corosync.conf.5 manual page
> compatibility: whitetank
>
> aisexec {
> user: root
> group: root
> }
>
> totem {
> version: 2
>
> # How long before declaring a token lost (ms)
> token: 5000
>
> # How many token retransmits before forming a new configuration
> token_retransmits_before_loss_const: 20
>
> # How long to wait for join messages in the membership protocol (ms)
> join: 1000
>
> # How long to wait for consensus to be achieved before starting a new
> round of membership configuration (ms)
> consensus: 7500
>
> # Turn off the virtual synchrony filter
> vsftype: none
>
> # Number of messages that may be sent by one processor on receipt of the
> token
> max_messages: 20
>
> # Disable encryption
> secauth: off
>
> # How many threads to use for encryption/decryption
> threads: 0
>
> # Limit generated nodeids to 31-bits (positive signed integers)
> clear_node_high_bit: yes
>
> # Optionally assign a fixed node id (integer)
> # nodeid: 1234
> interface {
> ringnumber: 0
> bindnetaddr: 10.1.0.255
> mcastaddr: 226.94.1.55
> mcastport: 4001
> }
> }
>
> logging {
> fileline: off
> to_stderr: yes
> to_logfile: yes
> to_syslog: yes
> logfile: /var/log/corosync.log
> debug: off
> timestamp: on
> logger_subsys {
> subsys: AMF
> debug: off
> }
> }
>
> amf {
> mode: disabled
> }
>
> service {
> # Load the Pacemaker Cluster Resource Manager
> name: pacemaker
> ver: 0
> }
>
> Is this normal, or maybe a known issue? Any help much appreciated!!!
>
> BTW - if I try to shut down a VM via:
>
> # crm resource stop my_vm_resource
>
> crm_mon shows the same thing, as in "caweb-vd
> (ocf::heartbeat:VirtualDomain): Started hgvmcore1 (unmanaged) FAILED"
> and doesn't seem to cleanly exit, at least through pacemaker, even
> though the VM does shut down cleanly. I never put it in 'unmanaged'
> mode, it seems to go into that state by itself.
>
> Cheers,
> -erich
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs:
> http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>


-- 
: Andreas Kurz					
: LINBIT | Your Way to High Availability
: Tel +43-1-8178292-64, Fax +43-1-8178292-82
:
: http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.




