[Pacemaker] unmanaged resource - cluster influence - ocf:heartbeat:Filesystem

Wed Jun 18 15:27:22 EDT 2014

Hi,

On Wed, Jun 18, 2014 at 06:11:22AM +0000, Bauer, Stefan (IZLBW Extern) wrote:
> Hello,
> 
> I'm using ocf:heartbeat:Filesystem to mount a cifs share. Additionally, I enabled OCF_CHECK_LEVEL 20 so that the monitor operation reads from and writes to the cifs share.
> 
> If I block the connection to the cifs server with iptables, the monitor operation times out. After several tries, a restart of the resource is initiated. The resource fails to stop (another timeout), so its failcount is set to INFINITY and the resource ends up unmanaged:
> 
> Jun 17 13:49:21 node1 lrmd: [15029]: WARN: p_cifs_pictures:monitor process (PID 18444) timed out (try 1).  Killing with signal SIGTERM (15).
> Jun 17 13:49:21 node1 lrmd: [15029]: WARN: operation monitor[43] on p_cifs_pictures for client 15032: pid 18444 timed out
> 
> Jun 17 13:49:21 node1 Filesystem2[18750]: INFO: Running stop for //cifs/share/pictures on /srv/cifs/pictures
> Jun 17 13:49:21 node1 Filesystem2[18750]: INFO: Trying to unmount /srv/cifs/pictures
> Jun 17 13:49:41 node1 lrmd: [15029]: WARN: p_cifs_pictures:stop process (PID 18750) timed out (try 1).  Killing with signal SIGTERM (15).
> 
> Jun 17 13:49:41 node1 crmd: [15032]: WARN: status_from_rc: Action 5 (p_cifs_pictures_stop_0) on node1 failed (target: 0 vs. rc: -2): Error
> Jun 17 13:49:41 node1 crmd: [15032]: WARN: update_failcount: Updating failcount for p_cifs_pictures on node1 after failed stop: rc=-2 (update=INFINITY, time=1403005781)
> Jun 17 13:49:41 node1 pengine: [15031]: WARN: common_apply_stickiness: Forcing p_cifs_pictures away from node1 after 1000000 failures (max=1000000)
> 
> So far so bad. How can I avoid a timeout during the recovery? I mean, what is the point of the read/write check if it leaves the resource unmanaged in the end?

Why do you think that the one affects the other? Or is it that
stop doesn't time out when you turn the level 20 monitor off?
The two should be unrelated.

> I fully understand that if the resource is not shut down cleanly and stonith is not active, it should be unmanaged.

Right. Stop timeouts should really be avoided whenever possible.
However, what is the meaning of the timeout in this case? Is
there a possibility of corruption if the filesystem cannot be
unmounted in a regular way? I'm not an expert on cifs, that's
why I ask.
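
For illustration, here is a rough sketch of how such a resource
might be defined in the crm shell with a more generous stop
timeout. The resource name, device and directory are taken from
your logs, where the stop was killed after 20 seconds. The
fstype, the monitor interval and all timeout values are
assumptions on my part, not recommendations, so adjust them to
your setup.

```
# Hypothetical values: fstype, interval and timeouts are assumptions.
primitive p_cifs_pictures ocf:heartbeat:Filesystem \
    params device="//cifs/share/pictures" \
        directory="/srv/cifs/pictures" fstype="cifs" \
    op monitor interval="20s" timeout="40s" OCF_CHECK_LEVEL="20" \
    op stop interval="0" timeout="120s"
```

A longer stop timeout only helps if the umount eventually
returns while the server is unreachable, which I cannot promise
for cifs. With stonith enabled, a failed stop would get the node
fenced instead of leaving the resource unmanaged.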

Thanks,

Dejan

> Thank you.
> 
> Stefan
