[Pacemaker] Please Help - frequent cleanup is required for the resources on failover condition

Mon Aug 17 02:08:39 UTC 2009

Hello,

node1 was choked by a very large "messages" file. We fixed that
problem on node1 and then ran an update for SLES11; all required
patches were installed properly (service openais stop was done before
patching). We purposely kept node2 switched off during this
exercise to avoid any complications.

After the update, we brought the system back online (node2 was
still kept off) and saw that the machine was refusing to function.
Analysis showed that the system update had changed the contents
of /etc/hosts: the node1 entry had been removed, for some reason,
from node1's hosts file. Pacemaker showed all services as down even
after we fixed that. A couple of reboots did not help. I have attached
a forensic report (cib and messages) of node1 in node1.tar

Meanwhile, once our enthusiasm had run out, we switched off node1 and
brought node2 online. We were happy to see things work well on it (we
had adjusted the timings, and no cleanup was required, though we tested
this for only one reboot), except for the message that node1 is
offline. We copied the cib of node2 using cibadmin -Q, switched node2
off and switched on node1 for cib injection.

On node1 we cleared the pacemaker config using cibadmin -E --force,
then injected the cib (after increasing the epoch values) using
cibadmin -U -x cib.xml. service openais restart revealed that
node1 was still behaving the same way: nothing green except
node1 as DC.
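
The epoch increase mentioned above can be sketched in Python as follows.
This is an illustration only, not the exact procedure used here. It
assumes the file is a full cibadmin -Q dump whose root <cib> element
carries an "epoch" attribute; the function name bump_epoch is
hypothetical.

```python
import xml.etree.ElementTree as ET


def bump_epoch(cib_xml: str, increment: int = 1) -> str:
    """Return the CIB XML with the root element's epoch raised by `increment`."""
    root = ET.fromstring(cib_xml)
    if root.tag != "cib":
        raise ValueError("expected a <cib> root element, got <%s>" % root.tag)
    # Treat a missing epoch attribute as 0 (an assumption of this sketch).
    current = int(root.get("epoch", "0"))
    root.set("epoch", str(current + increment))
    # All other attributes and child elements are left untouched.
    return ET.tostring(root, encoding="unicode")
```

The returned string could be written to cib.xml before running
cibadmin -U -x cib.xml.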

Heartbroken, we collected forensic evidence, switched off node1
and brought node2 online for further study of its remaining files.
Voila: node2 came up showing all red, with no services running. The
only green I could see was node2 as DC. Anyway, the forensic collection
was done. The files are attached herewith for your kind perusal.

We were severely demoralised; there was no energy left in us after this
5 week effort to bring up an HA cluster which will run postgres and
apache on a virtual ip. We decided to switch off the MSA array, after
switching off node2 (we had lost hope in node1 earlier).

10 minutes later the MSA was brought back online, then node2, then
node1. node2 became DC and all is green. I really do not understand
what went wrong, when and where. I tried to look in the log but was not
able to understand anything (lack of confidence after multiple
failures).
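
One way to reduce a very large /var/log/messages to the relevant
entries is to count the LRM operation timeouts per node and operation.
The Python sketch below assumes each entry sits on a single line in the
format quoted further down in this thread ("crmd: ... LRM operation
<op> (<id>) Timed Out (timeout=<N>ms)"). The names TIMEOUT_RE and
summarize_timeouts are hypothetical, and the pattern may need adjusting
for other log layouts.

```python
import re
from collections import Counter

# Matches syslog lines such as:
# Aug  8 13:39:42 node2 crmd: [3803]: ERROR: process_lrm_event: LRM
# operation fs:1_stop_0 (18) Timed Out (timeout=20000ms)   (one line in the log)
TIMEOUT_RE = re.compile(
    r"^\w{3}\s+\d+\s[\d:]{8}\s(?P<node>\S+)\scrmd:.*"
    r"LRM operation (?P<op>\S+) \(\d+\) Timed Out \(timeout=(?P<ms>\d+)ms\)"
)


def summarize_timeouts(lines):
    """Count timed-out LRM operations, keyed by (node, operation, timeout in ms)."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            key = (match.group("node"), match.group("op"), int(match.group("ms")))
            counts[key] += 1
    return counts
```

Passing an open file object for the log as `lines` processes it line by
line, so the whole file does not have to fit in memory.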

One observation, which could be right or wrong: node1 fails to
function properly if node2 is not available, and vice versa. node1
now has the latest patches, but node2 is still unpatched. We didn't
have the heart to run the update on node2 after our experience with
node1.

Please throw some light on our mystery HA project.

Thank you in advance.

Take care, 

Abhin 

On Thu, 2009-08-13 at 14:16 +0200, Andrew Beekhof wrote: 
> First thing I'd do is fix this:
> 
> Aug  8 13:47:13 node1 cib: [3894]: ERROR: write_xml_file: Cannot write
> output to /var/lib/heartbeat/crm/cib.XLiyUG: No space left on device
> (28)
> 
> then i'd increase the timeouts:
> 
> Aug  8 13:39:42 node2 crmd: [3803]: ERROR: process_lrm_event: LRM
> operation fs:1_stop_0 (18) Timed Out (timeout=20000ms)
> Aug  8 13:45:16 node2 crmd: [3692]: ERROR: process_lrm_event: LRM
> operation postgres_start_0 (15) Timed Out (timeout=20000ms)
> Aug  8 13:48:57 node2 crmd: [3692]: ERROR: process_lrm_event: LRM
> operation fs:0_stop_0 (23) Timed Out (timeout=20000ms)
> Aug  8 13:53:06 node2 crmd: [3765]: ERROR: process_lrm_event: LRM
> operation postgres_start_0 (14) Timed Out (timeout=20000ms)
> 
> Try setting default-action-timeout to something higher than 20s
> 
> On Wed, Aug 12, 2009 at 11:54 AM, Abhin.G.S - DEUCN<deucn at inmail.sk> wrote:
> >
> > Hello Andrew,
> >
> > On behalf of Ajith, i'm sending you the details.
> >
> > /var/log/messages of node2 (truncated) = http://deucn.com/messages_new
> >
> > Attachments :
> >
> > 1> CIB.xml
> >
> > 2> extract of /var/log/messages of node1
> >
> > 3> complete /var/log/messages of node2 in zip format
> >
> > Please help us.
> >
> > Thank you,
> >
> > Warm Regards
> >
> > Abhin.G.S
> > ---- Original message ----
> > From: Andrew Beekhof <andrew at beekhof.net>
> > To: pacemaker at oss.clusterlabs.org
> > Date: 8/12/2009 12:49:00 PM
> > Subject: Re: [Pacemaker] Please Help - frequent cleanup is required for the
> > resources on failover condition
> >
> > On Sun, Aug 9, 2009 at 4:41 PM, Ajith Kumar<ajith.kgs.hk at gmail.com> wrote:
> >> Hello Everyone,
> >>
> >> I was behind a project to create a test cluster using Pacemaker on
> >> suse11. With kind help of lmb and beekhof @ #linux-cluster i was
> >> finally able to put up a two node cluster using HP ML350g5 each with
> >> two HBA connected to a MSA2012fcdc.
> >>
> >> The cluster resources apache2 and postgresql both require cleanup
> >> every time i boot the cluster (this was a test cluster, which was
> >> switched off at the end of the day, or when i saw the level of madness
> >> in me cross the barrier), when i simulate a failover (by putting the
> >> other node in standby), or when i pull the nic cable of one node.
> >> Ipaddress and stonith were working fine as planned, but the big boys,
> >> apache2 and postgresql, are having trouble and i have to clean up
> >> always.
> >>
> >> I would like to give the log file as attachment (/var/log/messages) -
> >> but it is 3.2GB in size
> >
> > limit the contents to just one instance of the problem and use bzip
> >
> >> and has a lot of repeated entries - which i did not find
> >> relevant.
> >
> > actually it's the only thing that is relevant
> >
> > _______________________________________________
> > Pacemaker mailing list
> > Pacemaker at oss.clusterlabs.org
> > http://oss.clusterlabs.org/mailman/listinfo/pacemaker
> >
> > ------------------------------------
> > Abhin.G.S
> > =========
> > +91-9895-525880 | +91-471-2437189
> > D E U C N ® | http://www.deucn.com
> > ------------------------------------
> >

-------------- next part --------------
A non-text attachment was scrubbed...
Name: deucn.tar.gz
Type: application/x-compressed-tar
Size: 127877 bytes
Desc: not available
URL: <https://lists.clusterlabs.org/pipermail/pacemaker/attachments/20090817/b9a0da2a/attachment-0002.bin>