[gmx-users] GROMACS on glacier.westgrid.ca

Thu May 28 22:17:20 CEST 2009

Hi Payman,

You've had a few suggestions along the same lines here. I'm only posting to suggest that you investigate these suggestions and, if you are correct,
give us some evidence that this is really not an MPI or cluster settings problem. Simply disagreeing with advice is a great way to ensure that you
don't get any more.

Chris.

--- original message ---

I am running other software in parallel on glacier. Only GROMACS has shown
such a problem up to now!

On Thu, 2009-05-28 at 21:52 +0200, p.yamin at fz-juelich.de wrote:
> It seems like your nodes cannot communicate.
> You might want to check your job submission script: does it call mdrun correctly? Is the CPU/time allocation right? Try to see whether MPI is installed correctly by submitting any other MPI-enabled code.
> 
> 
> Peyman Yamin
> Institut fuer Strukturbiologie und Biophysik (ISB)
> ISB-3: Strukturbiochemie
> Forschungszentrum Juelich
> D-52425 Juelich 
> Tel:	(49)-2461-61-2875  
> Fax:	(49)-2461-61-2023
> mailto: p.yamin[at]fz-juelich.de 
> 
> ----- Original Message -----
> From: Paymon Pirzadeh <ppirzade at ucalgary.ca>
> Date: Thursday, May 28, 2009 6:35 pm
> Subject: [gmx-users] GROMACS on glacier.westgrid.ca
> 
> > Hello all,
> > A follow-up on my previous problems running GROMACS on glacier: it
> > turned out that I had problems with the installation. The guide in
> > the manual was not clear enough. I installed the code fine this time,
> > so the errors I get now are from version 4.0.4.
> > I ran the code interactively on the master node and everything went
> > smoothly, without any problems. But as soon as I started running it
> > in parallel through the queue, the job got killed. Following are the
> > messages I received:
> > 
> > PBS Job Id: 4686006.teva.westgrid.ubc
> > Job Name:   sixsite-test
> > Exec host:  ice1_2/1+ice1_2/0+ice1_1/1+ice1_1/0
> > Aborted by PBS Server
> > Job cannot be executed
> > 
> > The error messages I receive in the error file:
> > Killed by signal 2.
> > Killed by signal 2.
> > Killed by signal 2.
> > 
> > Running on 4 processors.
> > Starting run at: Mon May 25 20:29:34 PDT 2009
> > p2_11876:  p4_error: Timeout in establishing connection to remote process: 0
> > p0_4273:  p4_error: net_recv read:  probable EOF on socket: 1
> > Job finished at: Mon May 25 20:34:40 PDT 2009
> > 
> > All the clues show that the program had started working and the
> > initial conditions were read by the code, but then everything
> > suddenly crashed. I really do not know where the error originates!
> > Meanwhile, a friend of mine has compiled the code in the same way as
> > I did, and she is running fine on another cluster.
> > Regards,
> > 
> > Payman
> >
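
The "Exec host" line in the PBS message quoted above lists the slots the job was given. As a minimal sketch, assuming the string is a "+"-separated list of node/slot entries as it appears in that message, the following counts the slots per node. The helper name slots_per_node is hypothetical and is not part of PBS or GROMACS.

```python
from collections import Counter


def slots_per_node(exec_host: str) -> dict:
    """Count how many slots an exec-host string assigns to each node.

    Assumes entries look like "node/slot" and are joined by "+",
    as in the PBS message quoted in this thread.
    """
    # Keep only the node name (the part before "/") of each entry.
    nodes = [entry.split("/")[0] for entry in exec_host.split("+") if entry]
    return dict(Counter(nodes))


# Exec host string copied from the PBS message above.
exec_host = "ice1_2/1+ice1_2/0+ice1_1/1+ice1_1/0"
print(slots_per_node(exec_host))
```

For this job the four slots are split across two nodes, ice1_2 and ice1_1, so the parallel run depends on communication between nodes. That fits the suggestion in the thread to check whether the nodes can communicate, though it does not by itself show where the fault lies.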