[gmx-users] Possible bug in parallelization, PME or load-balancing on Gromacs 4.0_rc1 ??

Bjørn Steen Sæthre st01397 at student.uib.no
Mon Sep 29 19:50:57 CEST 2008


The only error message I can find is the rather cryptic:

NOTE: Turning on dynamic load balancing

_pmii_daemon(SIGCHLD): PE 4 exit signal Killed
[NID 1412]Apid 159787: initiated application termination

There are no errors apart from that.
This may not be very helpful, but I googled this particular error and
came up with another massively parallel code: Gadget2, also doing domain
decomposition, and this link:

http://www.mpa-garching.mpg.de/gadget/gadget-list/0213.html
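
For reference, the exit code 137 in the quoted PBS output further down
appears to follow the usual shell convention of 128 plus the signal number,
i.e. signal 9 (SIGKILL), which matches the "Killed" message. A minimal
sketch of that decoding follows; the function name decode_exit is made up
for illustration, and whether aprun always reports status this way is an
assumption:

```shell
#!/bin/bash
# Decode a shell-style exit status: values above 128 conventionally mean
# "terminated by signal (status - 128)".
decode_exit() {
    local status=$1
    if [ "$status" -gt 128 ]; then
        local sig=$((status - 128))
        # kill -l <number> prints the signal name without the SIG prefix
        echo "signal $sig ($(kill -l "$sig"))"
    else
        echo "exit code $status"
    fi
}

decode_exit 137
```

A SIGKILL is sent from outside the process, so this points at something
(batch system, kernel, launcher) killing the rank rather than mdrun
aborting by itself.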

Furthermore, I can now report that this error occurs in all my simulations
using harmonic position restraints, in both GROMACS 4.0_beta1 and GROMACS
4.0_rc1.
(I have yet to check whether it remains an issue without restraints, but I
strongly suspect it does: some earlier simulations with a GROMACS version
downloaded from CVS on 20/08/08, when the DD scheme was just barely
implemented, showed similar unexplained crashes.)

I thus have some reasons to think it has to do with the new
domain-decomposition implementation.

About core dumps: I will talk to our HPC staff and hopefully get back to
you with something more substantial.
I could also recompile GROMACS for the TotalView debugger and give you
some debugging information from that. Would this be helpful?
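
A minimal sketch of how core files might be requested from inside a job
script such as the quoted run.pbs. Whether the limit set here propagates
through aprun to the compute nodes is site-specific and is an assumption,
so this would need confirming with the HPC staff:

```shell
#!/bin/bash
# Hypothetical addition to a job script: lift the core-file size limit
# before launching, then report what the shell will actually allow.
ulimit -c unlimited 2>/dev/null || echo "could not raise core limit (hard limit is lower)"
echo "core file size limit: $(ulimit -c)"

# The launch line would follow here, unchanged from run.pbs:
# aprun -n 40 parmdrun -s topol.tpr -maxh 5 -npme 20
```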

Would it be helpful to give you diagnostics from running mdrun
verbosely or with the -debug flag?
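
Since the last line logged before the kill is the note about dynamic load
balancing, one possible diagnostic is to compare a verbose run with and
without it. The sketch below only prints the candidate launch lines (a dry
run). The base command is copied from the quoted run.pbs; the helper name
launch_line is made up, and it is an assumption that this mdrun build
accepts the -dlb option:

```shell
#!/bin/bash
# Dry run: print candidate launch lines rather than executing them.
# The base command is taken from run.pbs; the arguments are the variants.
launch_line() {
    echo "aprun -n 40 parmdrun -s topol.tpr -maxh 5 -npme 20 $*"
}

launch_line -v            # verbose run
launch_line -v -dlb no    # assumption: -dlb is available in this build
```

If the run with load balancing disabled survives past the point where the
other one is killed, that would narrow the problem down considerably.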

If you think it beneficial I can also provide the config.log.

My configure script is really quite minimal:
------------------------
#!/bin/bash
# Link against Cray LibSci for BLAS/LAPACK
export LDFLAGS="-lsci"
# Optimize for the quad-core (Barcelona) Opterons
export CFLAGS="-march=barcelona -O3"

./configure --prefix=$HOME/gmx_latest_290908 --disable-fortran \
  --enable-mpi --without-x --without-xml --with-external-lapack \
  --with-external-blas --program-prefix=par CC=cc MPICC=cc
---------------------------------

I am using fftw-3.1.1, the gcc-4.2.0 quadcore-edition compiler,
Cray's optimized XT LibSci 10.3.0 BLAS/LAPACK routines,
and Cray's optimized MPI library (based on MPICH2, I believe).

I will get back to you with more soon.

Regards and thanks
Bjørn

> 
> 
> Can you produce core dump files?
> 
> Berk
> 

> > PBS .o: 
> > Application 159316 exit codes: 137
> > Application 159316 exit signals: Killed
> > Application 159316 resources: utime 0, stime 0
> > --------------------------------------------------
> > Begin PBS Epilogue hexagon.bccs.uib.no
> > Date: Mon Sep 29 12:32:54 CEST 2008
> > Job ID: 65643.nid00003
> > Username: bjornss
> > Group: bjornss
> > Job Name: pmf_hydanneal_heatup_400K
> > Session: 10156
> > Limits: walltime=05:00:00
> > Resources:
> > cput=00:00:00,mem=4940kb,vmem=22144kb,walltime=00:20:31
> > Queue: batch
> > Account: fysisk
> > Base login-node: login5
> > End PBS Epilogue Mon Sep 29 12:32:54 CEST 2008
> > 
> > PBS .err:
> > _pmii_daemon(SIGCHLD): PE 0 exit signal Killed
> > [NID 702]Apid 159316: initiated application termination.
> > 
> > As proper electrostatics is crucial to my modeling, I am using PME, which
> > comprises a large part of my calculation cost: 35-50%.
> > In the most extreme case, I use the following startup script:
> > 
> > run.pbs:
> > 
> > #!/bin/bash
> > #PBS -A fysisk
> > #PBS -N pmf_hydanneal_heatup_400K
> > #PBS -o pmf_hydanneal.o
> > #PBS -e pmf.hydanneal.err
> > #PBS -l walltime=5:00:00,mppwidth=40,mppnppn=4
> > 
> > cd /work/bjornss/pmf/structII/hydrate_annealing/heatup_400K
> > source $HOME/gmx_latest_290908/bin/GMXRC
> > 
> > aprun -n 40 parmdrun -s topol.tpr -maxh 5 -npme 20
> > exit $?
> > 
> > 
> > Now, apart from a significant reduction in the system dipole moment,
> > there are no large changes in the system, nor significant translations
> > of the molecules in the box.
> > 
> > I enclose the md.log and my parameter file. The run-topology (topol.tpr)
> > can be found at:
> > can be found at:
> > 
> > http://drop.io/mdanneal
> > 
> > If anyone wants to try to replicate the crash on their local cluster,
> > they are welcome.
> > If the error persists after such trials, I am willing to
> > post a bug on Bugzilla.
> > 
> > 
> > If more information is needed, I will try to provide it upon request.
> > 
> > 
> > Regards and thanks for bothering
> > 
> > -- 
> > ---------------------
> > Bjørn Steen Saethre 
> > PhD-student
> > Theoretical and Energy Physics Unit
> > Institute of Physics and Technology
> > Allegt, 41
> > N-5020 Bergen
> > Norway
> > 
> > Tel(office) +47 55582869 
> > 
> > 
> 
> 