[Pacemaker] can any1 tell me why wc01 kind of fenced itself?

Andreas Kurz andreas.kurz at linbit.com
Sun Feb 15 16:20:08 EST 2009


hi,

On Sun, February 15, 2009 7:37 pm, Raoul Bhatia [IPAX] wrote:
> hi,
>
> i had the cluster up and running (2 nodes, wc01 and wc02).
>
> 1) following recent changes to the globally-unique handling,
> i reviewed all my settings and explicitly set globally-unique=false on
> several resources.
>
> 2) i tried to switch logging to the local7 facility by changing my
> logd.cf and ha.cf configuration files on the non-dc node.
>
> 3) i configured rsyslog to redirect local7 to /var/log/ha.log and
> /var/log/ha-debug.log.
>
>
> 4) i issued /etc/init.d/heartbeat reload

So you wanted all resources to be migrated to wc02? A reload kills and
restarts all heartbeat/pacemaker processes. Why not put the cluster into
unmanaged mode instead?
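
Something along these lines (a minimal sketch, assuming a crm shell from
the Pacemaker 1.0 series and the old-style is-managed-default cluster
option) lets you do such maintenance without the cluster reacting to it:

# tell the cluster to stop managing resources for the moment ...
crm configure property is-managed-default=false

# ... do the maintenance, e.g. the reload ...
/etc/init.d/heartbeat reload

# ... and hand control back to the cluster afterwards
crm configure property is-managed-default=true

A single resource can also be taken out of cluster control with
"crm resource unmanage <resource>" and put back with
"crm resource manage <resource>".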

>
> i observed the following thing:
>
> a) the node kind of fenced itself. most programs (including ssh) got
> killed. wc02 took over all resources after some time.

Looks like the "group_webservice" also needs a "globally-unique=false"
meta attribute:

Feb 15 18:10:15 wc01 crmd: [17757]: info: do_lrm_rsc_op: Performing
key=134:44:0:9949f46a-b1a8-454e-a180-b7d746507937 op=fs_www:0_stop_0 )
Feb 15 18:10:15 wc01 lrmd: [17754]: info: rsc:fs_www:0: stop
Feb 15 18:10:15 wc01 crmd: [17757]: info: do_lrm_rsc_op: Performing
key=146:44:0:9949f46a-b1a8-454e-a180-b7d746507937 op=fs_www:1_stop_0 )
Feb 15 18:10:15 wc01 lrmd: [17754]: info: rsc:fs_www:1: stop

... the clone instances cannot be told apart; both appear to be running
on this node, so both get stopped ....
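
A minimal sketch of adding it with the crm shell (the clone id
"cl_webservice" is just a placeholder here, use whatever clone wraps
group_webservice in your CIB):

# mark the clone instances as interchangeable, anonymous copies
crm resource meta cl_webservice set globally-unique false

# verify the change
crm configure show cl_webservice

Back to your log: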

Feb 15 18:10:15 wc01 Filesystem[22969]: INFO: unmounted /data/www
successfully

... the first umount works as expected ....

Feb 15 18:10:15 wc01 lrmd: [17754]: info: RA output:
(fs_www:1:stop:stderr) umount2: Invalid argument
Feb 15 18:10:15 wc01 lrmd: [17754]: info: RA output:
(fs_www:1:stop:stderr) umount: /data/www: not mounted

... some sort of race: the second umount fails because the first
instance has already unmounted /data/www, and the Filesystem RA then
tries to kill every process accessing the "filesystem". With nothing
mounted there anymore, that is the root filesystem on which the
mountpoint /data/www lives :-(

Feb 15 18:10:15 wc01 lrmd: [17754]: info: RA output: (fs_www:1:stop:stderr)
Feb 15 18:10:15 wc01 Filesystem[22970]: ERROR: Couldn't unmount /data/www;
trying cleanup with SIGTERM

... and _all_ processes accessing your root filesystem receive a SIGTERM.
So the node effectively committed suicide.
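
Just to illustrate what that "cleanup" boils down to here (not the RA's
literal code, only the effect): with /data/www already unmounted, asking
fuser for users of the mount point by filesystem matches the filesystem
the directory itself lives on, i.e. the root filesystem:

# illustration only: /data/www is no longer a mount point, so -m resolves
# to the filesystem containing that directory (/) and every process with
# an open file there gets the signal
fuser -k -TERM -m /data/www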

Regards,
Andreas

>
> can any1 explain why this happened?
>
> cheers, raoul
> --
> ____________________________________________________________________
> DI (FH) Raoul Bhatia M.Sc.          email.          r.bhatia at ipax.at
> Technischer Leiter
>
>
> IPAX - Aloy Bhatia Hava OEG         web.          http://www.ipax.at
> Barawitzkagasse 10/2/2/11           email.            office at ipax.at
> 1190 Wien                           tel.               +43 1 3670030
> FN 277995t HG Wien                  fax.            +43 1 3670030 15
> ____________________________________________________________________


-- 
: Andreas Kurz
: LINBIT | Your Way to High Availability
: Tel +43-1-8178292-64, Fax +43-1-8178292-82
:
: http://www.linbit.com
