[gmx-users] replica exchange: >4 processors

Paul Whitford pwhitfor at ctbp.ucsd.edu
Fri Dec 7 06:18:09 CET 2007


I am using GROMACS 3.3.1 and 3.3.2, and I get the following problem with
both versions.

If I run replica exchange on more than 4 processors (2 and 4 are fine), the
simulations finish, but MPI then reports the errors below and the job never
terminates.


This is the end of my log file:

-----------------------------------------------------------------------

               NODE (s)   Real (s)      (%)
       Time: 158483.430 159636.000     99.3
                       1d20h01:23
               (Mnbf/s)   (MFlops)   (ns/day)  (hour/ns)
Performance:     18.919    818.029      2.726      8.805
p13_15442:  p4_error: Timeout in establishing connection to remote process: 0
p12_15407:  p4_error: Timeout in establishing connection to remote process: 0
Broken pipe
p11_2364:  p4_error: Timeout in establishing connection to remote process: 0
p9_20588:  p4_error: Timeout in establishing connection to remote process: 0
p10_2329:  p4_error: Timeout in establishing connection to remote process: 0
Broken pipe
Broken pipe
Broken pipe
Broken pipe
p6_24137:  p4_error: Timeout in establishing connection to remote process: 0
p7_24172:  p4_error: Timeout in establishing connection to remote process: 0
Broken pipe
Broken pipe
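
To see at a glance which processes reported the timeout, the error lines can be reduced to a list of rank numbers. This is a minimal sketch, assuming the "pN_PID:" prefix on each p4_error line encodes the rank N followed by the process id; the function name failed_ranks and the output file name in the usage comment are hypothetical.

```shell
#!/bin/bash
# Print the rank numbers that reported a p4_error, one per line,
# sorted numerically with duplicates removed.
# Assumption: each error line starts with "p<rank>_<pid>:".
failed_ranks() {
    grep 'p4_error' "$1" \
        | sed -n 's/^p\([0-9][0-9]*\)_[0-9][0-9]*:.*/\1/p' \
        | sort -n -u
}

# Example call (hypothetical file name):
#   failed_ranks job_output.txt
```

Comparing the listed ranks against the machines file may show whether the failing processes share a host.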


I have tried installing on three different clusters, using different
versions of MPICH, and they all do this.  However, I do not get the error
when running a single simulation on 8 processors; the problem appears only
when I run replica exchange.  Any ideas what is going on?  I'm also including
my submission script in case I am missing something, but I'm just not seeing
it.

#!/bin/bash
#
#$ -N switch_less
#$ -pe mpich 8
#$ -cwd
#$ -j y
#$ -S /bin/bash
#
#$ -l h_rt=00:05:00

MPIDIR=/opt/mpich/intel/bin/
MDDIR=/soft/linux/pkg/gromacs-3.3.1/bin
SYSTEM=free


# Build one run input file per replica temperature
INDEX=0
for T in 80 82 84 86 87 88 89 90
do
        # Fill the temperature into the MDRUN template
        sed "s/TTTT/$T/g" MDRUN > mdrun.$INDEX.mdp

        $MDDIR/grompp \
                -f mdrun.$INDEX \
                -c $SYSTEM.gro \
                -p $SYSTEM.top \
                -po mdout.$INDEX \
                -o $SYSTEM$INDEX.tpr
        let "INDEX += 1"
done

# Launch only if the slot count matches the number of replicas
if test $NSLOTS -eq $INDEX
then
        $MPIDIR/mpirun -v -np $NSLOTS -machinefile $TMPDIR/machines \
                -nolocal $MDDIR/mdrun-mpi -v \
                -np $NSLOTS \
                -multi $NSLOTS \
                -replex 50 \
                -s $SYSTEM.tpr \
                -o $SYSTEM \
                -c $SYSTEM.out \
                -g $SYSTEM \
                -e $SYSTEM \
                -x $SYSTEM
else
        echo 'wrong number of nodes for the number of replicas'
fi
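
The script above writes one run input file per temperature, named $SYSTEM$INDEX.tpr. A small check that all of them exist before mpirun is launched can rule out a failed grompp step as a factor. This is only a sketch: the helper name check_replica_inputs is hypothetical, and it does not address the MPI timeout itself.

```shell
#!/bin/bash
# Return 0 only if <system>0.tpr ... <system>(n-1).tpr all exist.
# Arguments: $1 = file name prefix (e.g. "free"), $2 = number of replicas.
check_replica_inputs() {
    local system=$1 n=$2 i
    for ((i = 0; i < n; i++)); do
        if [ ! -f "$system$i.tpr" ]; then
            echo "missing input: $system$i.tpr" >&2
            return 1
        fi
    done
    return 0
}

# Example call, using the variables from the submission script:
#   check_replica_inputs "$SYSTEM" "$INDEX" || exit 1
```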


I have tried using the -debug option when running GROMACS, but I can't tell
what is going on from its output.  Is there something I should look for in
the debug logfile?

Thanks

-Paul