[Pacemaker] node1 fencing itself after node2 being fenced

Vladislav Bogdanov bubble at hoster-ok.com
Mon Feb 10 08:26:48 EST 2014


10.02.2014 14:46, Asgaroth wrote:
> Hi All,
> 
>  
> 
> OK, here is my testing using cman/clvmd enabled on system startup and
> clvmd outside of pacemaker control. I still seem to be getting the clvmd
> hang/fail situation even when running outside of pacemaker control, I
> cannot see off-hand where the issue is occurring, but maybe it is
> related to what Vladislav was saying where clvmd hangs if it is not
> running on a cluster node that has cman running, however, I have both
> cman/clvmd enable to start at boot. Here is a little synopsis of what
> appears to be happening here:
> 
>  
> 
> [1] Everything is fine here, both nodes up and running:
> 
>  
> 
> # cman_tool nodes
> 
> Node  Sts   Inc   Joined               Name
> 
>    1   M    444   2014-02-07 10:25:00  test01
> 
>    2   M    440   2014-02-07 10:25:00  test02
> 
>  
> 
> # dlm_tool ls
> 
> dlm lockspaces
> 
> name          clvmd
> 
> id            0x4104eefa
> 
> flags         0x00000000
> 
> change        member 2 joined 1 remove 0 failed 0 seq 1,1
> 
> members       1 2
> 
>  
> 
> [2] Here I "echo c > /proc/sysrq-trigger" on node2 (test02); I can see
> crm_mon reporting node 2 in an unclean state, and fencing kicks in
> (node 2 is rebooted)
> 
>  
> 
> # cman_tool nodes
> 
> Node  Sts   Inc   Joined               Name
> 
>    1   M    440   2014-02-07 10:27:58  test01
> 
>    2   X    444                        test02
> 
>  
> 
> # dlm_tool ls
> 
> dlm lockspaces
> 
> name          clvmd
> 
> id            0x4104eefa
> 
> flags         0x00000004 kern_stop
> 
> change        member 2 joined 1 remove 0 failed 0 seq 2,2
> 
> members       1 2
> 
> new change    member 1 joined 0 remove 1 failed 1 seq 3,3
> 
> new status    wait_messages 0 wait_condition 1 fencing
> 
> new members   1
> 
>  
> 
> [3] So the above looks fine so far, to my untrained eye: dlm is in the
> kern_stop state while waiting on a successful fence, then the node
> reboots and we have the following state:
> 
>  
> 
> # cman_tool nodes
> 
> Node  Sts   Inc   Joined               Name
> 
>    1   M    440   2014-02-07 10:27:58  test01
> 
>    2   M    456   2014-02-07 10:35:42  test02
> 
>  
> 
> # dlm_tool ls
> 
> dlm lockspaces
> 
> name          clvmd
> 
> id            0x4104eefa
> 
> flags         0x00000000
> 
> change        member 2 joined 1 remove 0 failed 0 seq 4,4
> 
> members       1 2
> 
>  
> 
> So it looks like dlm and cman are working properly (again, I could be
> wrong, my untrained eye and all :-))
> 

Yep, all of the above is correct. And yes, at the dlm layer everything
seems to be perfect (I didn't look at the dump, though; it is not needed
given the 'ls' outputs you provided).
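As a side note, the 'kern_stop' flag in the 'dlm_tool ls' output above is
the key thing to watch while fencing is pending. A minimal sketch of a
check for it over captured output (the sample text is taken verbatim from
the listing in this thread; the helper name is my own):

```shell
#!/bin/sh
# Check captured `dlm_tool ls` output for a lockspace in kern_stop --
# the state dlm enters while it waits for a successful fence.
check_kern_stop() {
    if grep -q 'kern_stop' "$1"; then
        echo "dlm is stopped, waiting on fencing"
    else
        echo "dlm lockspaces are running normally"
    fi
}

# Sample taken from the listing earlier in this thread:
cat > /tmp/dlm_ls.txt <<'EOF'
dlm lockspaces
name          clvmd
id            0x4104eefa
flags         0x00000004 kern_stop
change        member 2 joined 1 remove 0 failed 0 seq 2,2
EOF

check_kern_stop /tmp/dlm_ls.txt
```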

>  
> 
> However, if I try to run any lvm/clvmd status commands then they still
> just hang. Could this be related to clvmd doing a check when cman is up
> and running but clvmd has not started yet (as I understand from
> Vladislav's previous email)? Or do I have something fundamentally wrong
> with my fencing configuration?

I cannot really recall whether it hangs or returns an error in that case
(I moved to corosync2 long ago).

Anyway, you probably want to run clvmd with debugging enabled. IIRC you
have two choices here: either stop the running instance first and then
run it in the console with '-f -d1', or run 'clvmd -C -d2' to ask all
running instances to start debug logging to syslog. I prefer the first
one, because modern syslogs do rate limiting. And you'd need to run the
lvm commands with debugging enabled too.
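In commands, the two options look roughly like this (a sketch; the init
script name and the '-vvvv' example command are assumptions, adjust for
your setup):

```shell
# Option 1 (preferred): stop the running clvmd, then restart it in the
# foreground with debug output going to the console instead of syslog.
service clvmd stop
clvmd -f -d1

# Option 2: ask all already-running clvmd instances cluster-wide to
# start logging debug output to syslog (beware of syslog rate limiting).
clvmd -C -d2

# Then run the hanging lvm command with verbose/debug output as well,
# e.g.:
vgdisplay -vvvv
```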

Alternatively (or in addition to the above) you may want to run the
hang suspect under gdb (make sure you have the relevant -debuginfo
packages installed; the one for lvm2 should be enough). This way you can
obtain the backtrace of the function calls that led to the hang, and more.

You may need to tell gdb that it shouldn't stop on some signals sent to
the binary being debugged if it stops immediately after you type 'run'
or 'cont' (e.g. 'handle SIGPIPE nostop'). Once you have the daemon
running under the debugger (do not forget to type 'set args -f -d1' at
the gdb prompt before starting) and you notice the hang, press Ctrl-C
and then type 'bt full' (you may need to do that for some or all
threads; use the 'info threads' and 'thread <n>' commands to switch
between them).
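Put together, the session might look like this (a sketch; the debuginfo
helper, the clvmd path, and the SIGPIPE handling are assumptions for an
EL6-style system, adjust as needed):

```shell
# Install debug symbols first (package name is distro-dependent; this is
# the yum-utils helper on EL6):
debuginfo-install -y lvm2-cluster

# Pre-load the gdb settings described above from a command file:
cat > clvmd.gdb <<'EOF'
handle SIGPIPE nostop noprint
set args -f -d1
run
EOF

gdb -x clvmd.gdb /usr/sbin/clvmd
# ... reproduce the hang, press Ctrl-C, then at the (gdb) prompt:
#   (gdb) info threads               # list all threads
#   (gdb) thread 2                   # switch to a given thread
#   (gdb) bt full                    # backtrace with local variables
#   (gdb) thread apply all bt full   # or all threads at once
```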

With all that you can find what exactly hangs, where, and probably even
why.

Best,
Vladislav


