[Pacemaker] clvmd hangs on node1 if node2 is fenced
Tim Serong
tserong at novell.com
Fri Aug 27 02:50:59 UTC 2010
On 8/27/2010 at 08:50 AM, Michael Smith <msmith at cbnco.com> wrote:
> Xinwei Hu <hxinwei at ...> writes:
>
>> That sounds worrying actually.
>> I think this is logged as bug 585419 on SLES' bugzilla.
>> If you can reproduce this issue, it's worth reopening, I think.
>
> I've got a pair of fully patched SLES11 SP1 nodes and they're showing
> what I guess is the same behaviour: if I hard-poweroff node2, operations
> like "vgdisplay -v" hang on node1 for quite some time. Sometimes a
> minute, sometimes two, sometimes forever. They get stuck here:
>
> Aug 26 18:31:42 xen-test1 clvmd[8906]: doing PRE command LOCK_VG 'V_vm_store' at 1 (client=0x7f2714000b40)
> Aug 26 18:31:42 xen-test1 clvmd[8906]: lock_resource 'V_vm_store', flags=0, mode=3
>
>
> After a few seconds, corosync & dlm notice the node is gone, but
> vgdisplay and friends still hang while trying to lock the VG.
>
> Aug 26 18:31:44 xen-test1 corosync[8476]: [TOTEM ] A processor failed, forming new configuration.
> Aug 26 18:31:50 xen-test1 cluster-dlm[8870]: update_cluster: Processing membership 1260
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: dlm_process_node: Skipped active node 219878572: born-on=1256, last-seen=1260, this-event=1260, last-event=1256
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: del_configfs_node: del_configfs_node rmdir "/sys/kernel/config/dlm/cluster/comms/236655788"
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: dlm_process_node: Removed inactive node 236655788: born-on=1252, last-seen=1256, this-event=1260, last-event=1256
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: log_config: dlm:controld conf 1 0 1 memb 219878572 join left 236655788
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: log_config: dlm:ls:clvmd conf 1 0 1 memb 219878572 join left 236655788
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: add_change: clvmd add_change cg 3 remove nodeid 236655788 reason 3
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: add_change: clvmd add_change cg 3 counts member 1 joined 0 remove 1 failed 1
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: stop_kernel: clvmd stop_kernel cg 3
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: do_sysfs: write "0" to "/sys/kernel/dlm/clvmd/control"
> Aug 26 18:31:51 xen-test1 kernel: [ 365.267802] dlm: closing connection to node 236655788
> Aug 26 18:31:51 xen-test1 clvmd[8906]: confchg callback. 0 joined, 1 left, 1 members
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: fence_node_time: Node 236655788/xen-test2 has not been shot yet
> Aug 26 18:31:51 xen-test1 cluster-dlm[8870]: check_fencing_done: clvmd check_fencing 23665578 not fenced add 1282861615 fence 0
> Aug 26 18:31:51 xen-test1 crmd: [8489]: info: ais_dispatch: Membership 1260: quorum still lost
> Aug 26 18:31:51 xen-test1 cluster-dlm: [8870]: info: ais_dispatch: Membership 1260: quorum still lost
Do you have STONITH configured? Note that it says "xen-test2 has not
been shot yet" and "clvmd ... not fenced". It's just going to sit there
until the down node is successfully fenced - this is intentional, as it's
not safe to keep running until you *know* the dead node is dead.
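While it's blocked you can confirm that from the DLM side. Assuming
the dlm userland tools are installed on your stack, something like:

  # show DLM lockspaces and any pending recovery/change
  dlm_tool ls

should list the clvmd lockspace with its recovery stuck until
fencing completes (the exact output varies by version).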
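If you have no STONITH device configured (or stonith-enabled is
false), nothing will ever shoot xen-test2 and the lockspace stays
blocked forever. As a rough sketch only - assuming the external/ipmi
plugin from cluster-glue suits your hardware, and with placeholder
address and credentials you'd need to replace:

  # sketch: hostname/ipaddr/userid/passwd below are placeholders
  crm configure primitive st-xen-test2 stonith:external/ipmi \
        params hostname="xen-test2" ipaddr="192.168.1.102" \
               userid="admin" passwd="secret" interface="lan" \
        op monitor interval="60s"
  # keep the device off the node it is meant to shoot
  crm configure location l-st-xen-test2 st-xen-test2 -inf: xen-test2
  crm configure property stonith-enabled="true"

Once fencing actually succeeds, cluster-dlm will log the node as
fenced and the stuck LOCK_VG request should complete.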
Regards,
Tim
--
Tim Serong <tserong at novell.com>
Senior Clustering Engineer, OPS Engineering, Novell Inc.