[Pacemaker] cluster-dlm: set_fs_notified: set_fs_notified no nodeid 1812048064#012
Roberto Giordani
r.giordani at tiscali.it
Sat Aug 28 07:41:41 UTC 2010
Thanks. Who should I contact? Which mailing list?
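In case it helps whoever picks this up: if I read the vanilla 2.6.31
source correctly, the BUG at fs/inode.c:1323 in the trace below seems to
be the sanity check at the top of iput() (my interpretation only; the
SUSE kernel tree may differ slightly):

    void iput(struct inode *inode)
    {
            if (inode) {
                    /* fires if the inode was already cleared,
                     * i.e. a double release */
                    BUG_ON(inode->i_state == I_CLEAR);

                    if (atomic_dec_and_lock(&inode->i_count, &inode_lock))
                            iput_final(inode);
            }
    }

If that is right, the socket inode is being released twice somewhere in
the DLM reconnect path (tcp_connect_to_sock -> sock_release -> iput in
the trace).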
I've discovered that this problem occurs when the switch port where the
cluster ring is connected becomes "blocked" by spanning tree.
I've worked around it by putting the ring on a separate switch with
spanning tree disabled and a different subnet.
Is there a configuration that keeps the cluster nodes from hanging while
spanning tree recalculates its topology after a failure?
The hang occurs on SLES11sp1 too: the servers are up and running and the
cluster status is OK, but an ssh session to a server hangs right after
login.
The recalculation usually takes 50 seconds.
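Would raising the totem token timeout above the worst-case spanning tree
delay be the right approach? Something like this untested sketch (the
file is /etc/corosync/corosync.conf on SLES11sp1, /etc/ais/openais.conf
on the older openSUSE install):

    totem {
            version: 2
            # the default token timeout is a few seconds, so a 30-50 s
            # STP reconvergence looks like a dead node to the cluster
            token: 60000
            token_retransmits_before_loss_const: 10
            # corosync wants consensus >= 1.2 * token
            consensus: 72000
    }

Or is the switch-side fix the only real one, e.g. a PortFast/edge-port
setting on the ports facing the cluster NICs so they skip the
listening/learning states? The 50 seconds I measure matches the classic
802.1D worst case (max_age 20 s + 2 x forward_delay 15 s = 50 s).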
Regards,
Roberto.
On 08/26/2010 10:24 AM, Dejan Muhamedagic wrote:
> Hi,
>
> On Thu, Aug 26, 2010 at 09:36:10AM +0200, Andrew Beekhof wrote:
>
>> On Wed, Aug 18, 2010 at 6:24 PM, Roberto Giordani <r.giordani at libero.it> wrote:
>>
>>> Hello,
>>> I'll explain what happened after a network blackout.
>>> I have a cluster with Pacemaker on openSUSE 11.2 64-bit.
>>> ============
>>> Last updated: Wed Aug 18 18:13:33 2010
>>> Current DC: nodo1 (nodo1)
>>> Version: 1.0.2-ec6b0bbee1f3aa72c4c2559997e675db6ab39160
>>> 3 Nodes configured.
>>> 11 Resources configured.
>>> ============
>>>
>>> Node: nodo1 (nodo1): online
>>> Node: nodo3 (nodo3): online
>>> Node: nodo4 (nodo4): online
>>>
>>> Clone Set: dlm-clone
>>> dlm:0 (ocf::pacemaker:controld): Started nodo3
>>> dlm:1 (ocf::pacemaker:controld): Started nodo1
>>> dlm:2 (ocf::pacemaker:controld): Started nodo4
>>> Clone Set: o2cb-clone
>>> o2cb:0 (ocf::ocfs2:o2cb): Started nodo3
>>> o2cb:1 (ocf::ocfs2:o2cb): Started nodo1
>>> o2cb:2 (ocf::ocfs2:o2cb): Started nodo4
>>> Clone Set: XencfgFS-Clone
>>> XencfgFS:0 (ocf::heartbeat:Filesystem): Started nodo3
>>> XencfgFS:1 (ocf::heartbeat:Filesystem): Started nodo1
>>> XencfgFS:2 (ocf::heartbeat:Filesystem): Started nodo4
>>> Clone Set: XenimageFS-Clone
>>> XenimageFS:0 (ocf::heartbeat:Filesystem): Started nodo3
>>> XenimageFS:1 (ocf::heartbeat:Filesystem): Started nodo1
>>> XenimageFS:2 (ocf::heartbeat:Filesystem): Started nodo4
>>> rsa1-fencing (stonith:external/ibmrsa-telnet): Started nodo4
>>> rsa2-fencing (stonith:external/ibmrsa-telnet): Started nodo3
>>> rsa3-fencing (stonith:external/ibmrsa-telnet): Started nodo4
>>> rsa4-fencing (stonith:external/ibmrsa-telnet): Started nodo3
>>> mailsrv-rm (ocf::heartbeat:Xen): Started nodo3
>>> dbsrv-rm (ocf::heartbeat:Xen): Started nodo4
>>> websrv-rm (ocf::heartbeat:Xen): Started nodo4
>>>
>>> After the switch failure, all the nodes and the RSA stonith devices
>>> were unreachable.
>>>
>>> The following error occurred on one node of the cluster:
>>>
>>> Aug 18 13:11:38 nodo1 cluster-dlm: receive_plocks_stored:
>>> receive_plocks_stored 1778493632:2 need_plocks 0#012
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272025] ------------[ cut here
>>> ]------------
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at
>>> /usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323!
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272042] invalid opcode: 0000 [#1] SMP
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272046] last sysfs file:
>>> /sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272050] CPU 1
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272053] Modules linked in:
>>> nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_physdev
>>> iptable_filter ip_tables x_tables ocfs2 ocfs2_nodemanager quota_tree
>>> ocfs2_stack_user ocfs2_stackglue dlm configfs netbk coretemp blkbk
>>> blkback_pagemap blktap xenbus_be ipmi_si edd dm_round_robin scsi_dh_rdac
>>> dm_multipath scsi_dh bridge stp llc bonding ipv6 fuse ext4 jbd2 crc16 loop
>>> dm_mod sr_mod ide_pci_generic ide_core iTCO_wdt ata_generic ibmpex i5k_amb
>>> ibmaem iTCO_vendor_support ipmi_msghandler bnx2 i5000_edac 8250_pnp shpchp
>>> ata_piix pcspkr ics932s401 joydev edac_core i2c_i801 ses pci_hotplug 8250
>>> i2c_core serio_raw enclosure serial_core button sg reiserfs usbhid hid
>>> uhci_hcd ehci_hcd xenblk cdrom xennet fan processor pata_acpi lpfc thermal
>>> thermal_sys hwmon aacraid [last unloaded: ocfs2_stackglue]
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272111] Pid: 8889, comm: dlm_send Not
>>> tainted 2.6.31.12-0.2-xen #1 IBM System x3650 -[7979AC1]-
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272113] RIP: e030:[<ffffffff801331c2>]
>>> [<ffffffff801331c2>] iput+0x82/0x90
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272121] RSP: e02b:ffff88014ec03c30
>>> EFLAGS: 00010246
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272122] RAX: 0000000000000000 RBX:
>>> ffff880148a703c8 RCX: 0000000000000000
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272123] RDX: ffffc90000010000 RSI:
>>> ffff880148a70380 RDI: ffff880148a703c8
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272125] RBP: ffff88014ec03c50 R08:
>>> b038000000000000 R09: fe99594c51a57607
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272126] R10: ffff880040410270 R11:
>>> 0000000000000000 R12: ffff8801713e6e08
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272128] R13: ffff88014ec03d20 R14:
>>> 0000000000000000 R15: ffffc9000331d108
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272133] FS: 00007ff4cb11a730(0000)
>>> GS:ffffc90000010000(0000) knlGS:0000000000000000
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272135] CS: e033 DS: 0000 ES: 0000 CR0:
>>> 000000008005003b
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272136] CR2: 00007ff4c5c45000 CR3:
>>> 0000000135b2a000 CR4: 0000000000002660
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272138] DR0: 0000000000000000 DR1:
>>> 0000000000000000 DR2: 0000000000000000
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272140] DR3: 0000000000000000 DR6:
>>> 00000000ffff0ff0 DR7: 0000000000000400
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272142] Process dlm_send (pid: 8889,
>>> threadinfo ffff88014ec02000, task ffff8801381e45c0)
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272143] Stack:
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272144] 0000000000000000
>>> 00000000072f0874 ffff880148a70380 ffff880148a70380
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272146] <0> ffff88014ec03c80
>>> ffffffff803add09 ffff88014ec03c80 00000000072f0874
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272147] <0> ffff8801713e6df8
>>> ffff8801713e6e08 ffff88014ec03de0 ffffffffa05661e1
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272150] Call Trace:
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272164] [<ffffffff803add09>]
>>> sock_release+0x89/0xa0
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272177] [<ffffffffa05661e1>]
>>> tcp_connect_to_sock+0x161/0x2b0 [dlm]
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272206] [<ffffffffa0568764>]
>>> process_send_sockets+0x34/0x60 [dlm]
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272222] [<ffffffff800693f3>]
>>> run_workqueue+0x83/0x230
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272227] [<ffffffff80069654>]
>>> worker_thread+0xb4/0x140
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272231] [<ffffffff8006fac6>]
>>> kthread+0xb6/0xc0
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272236] [<ffffffff8000d38a>]
>>> child_rip+0xa/0x20
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272240] Code: 42 20 48 c7 c2 b0 4c 13
>>> 80 48 85 c0 48 0f 44 c2 48 89 df ff d0 48 8b 45 e8 65 48 33 04 25 28 00 00
>>> 00 75 0b 48 83 c4 18 5b c9 c3 <0f> 0b eb fe e8 35 c6 f1 ff 0f 1f 44 00 00 55
>>> 48 8d 97 10 02 00
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272256] RIP [<ffffffff801331c2>]
>>> iput+0x82/0x90
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272259] RSP <ffff88014ec03c30>
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272264] ---[ end trace 7707d0d92a7f5415
>>> ]---
>>>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272495] dlm: connect from non cluster
>>> node
>>>
>>> and a few log lines later the following message repeated until I
>>> killed the node:
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: start_kernel: start_kernel cg 3
>>> member_count 1#012
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: update_dir_members: dir_member
>>> 1812048064#012
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: update_dir_members: dir_member
>>> 1778493632#012
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: set_configfs_members: set_members rmdir
>>> "/sys/kernel/config/dlm/cluster/spaces/0BB443F896254AD3BA8FB960C425B666/nodes/1812048064"#012
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: do_sysfs: write "1" to
>>> "/sys/kernel/dlm/0BB443F896254AD3BA8FB960C425B666/control"#012
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
>>> nodeid 1812048064#012
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
>>> nodeid 1812048064#012
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
>>> nodeid 1812048064#012
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
>>> nodeid 1812048064#012
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
>>> nodeid 1812048064#012
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
>>> nodeid 1812048064#012
>>>
>>> Aug 18 13:12:31 nodo1 cluster-dlm: set_fs_notified: set_fs_notified no
>>> nodeid 1812048064#012
>>>
>>> The log file is attached.
>>>
>>> Can someone explain the reason?
>>>
>> Perhaps the membership got out of sync...
>>
>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272495] dlm: connect from non cluster node
>>
>> Maybe lmb or dejan can suggest something... I don't have much to do
>> with ocfs2 anymore.
>>
> Me neither. But this looks like a kernel bug:
>
>
>>> Aug 18 13:11:38 nodo1 kernel: [ 4154.272036] kernel BUG at
>>> /usr/src/packages/BUILD/kernel-xen-2.6.31.12/linux-2.6.31/fs/inode.c:1323!
>>>
> Perhaps ask on the kernel ML?
>
> Thanks,
>
> Dejan
>
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker at oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
>